IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
A PCT APPLICATION FOR AN
ARCHITECTURE FOR ENTERPRISE DATA INTEGRATION SYSTEMS
RELATED APPLICATIONS
This application claims priority to U.S. Provisional Application Number 60/606,237, filed August 31, 2004 and entitled "Architecture for Enterprise Data Integration Systems".
BACKGROUND
1. Field.
This invention relates to the field of information technology, and more particularly to the field of data integration systems.
2. Description of the Related Art.
Computer applications make many business processes faster and more efficient. However, the proliferation of different computer applications that use different data structures, communication protocols, languages and platforms has led to great complexity in the information technology infrastructure of the typical business enterprise. Different business processes within the typical enterprise may use completely different computer applications, each computer application being developed and optimized for the particular business process, rather than for the enterprise as a whole. For example, a business may have one computer application for tracking accounts payable and another for keeping track of customer contacts. Even the same business process may use more than one computer application, such as when an enterprise keeps a centralized customer contact database, but employees keep their own contact information, such as in a personal information manager.
The advantages of specialized computer applications are offset by the inefficiencies they introduce, such as repetitive entry, redundant data processing, or the failure of the enterprise to recognize the interconnectedness of its enterprise datasets and capitalize on data that is associated with one process when the enterprise executes another process that could benefit from that data. For example, if the accounts payable process is separated from the supply chain and ordering process, the enterprise may accept and fill orders from a customer whose credit history would have caused the enterprise to decline the order. Many other examples can be provided where an enterprise would benefit from consistent access to all of its data across varied computer applications.
A number of companies have recognized and addressed the need for sharing of data across different applications in the business enterprise. Thus, enterprise application integration, or EAI, has emerged as a message-based strategy for addressing data from disparate sources. As computer applications increase in complexity and number, EAI efforts encounter many challenges, ranging from the need to handle different protocols, to the need to address ever-increasing volumes of data and numbers of transactions, to an ever-increasing appetite for faster integration of data. Various approaches to EAI have been taken, including least-common-denominator approaches, atomic approaches, and bridge-type approaches. However, EAI is based upon communication between individual applications. As a significant disadvantage, the complexity of EAI solutions grows geometrically in response to linear additions of platforms and applications.
While data integration systems provide useful tools for addressing the needs of an enterprise, such systems are typically deployed as custom solutions. They have a lengthy development cycle, and may require sophisticated technical training to accommodate changes in business structure and information requirements. There remains a need for data integration system tools that permit use, reuse, and modification of functionality in a
changing business environment, and for an improved architecture supporting the design and use of data integration systems.
SUMMARY
Disclosed herein is an architecture for building and managing data integration processes. The architecture may provide modularity and extensibility to many aspects of the integration design process, including user interfaces, programmatic interfaces, services, components, runtime engines, and external connectors. The architecture may also employ a common integrated metadata sharing approach throughout the design process to enable seamless transitions between various phases of the design and implementation of a data integration process.
In one aspect, a data integration system according to the disclosure includes: a user interface for performing a plurality of data integration tasks, the user interface including a plurality of menus corresponding to a plurality of phases of a workflow for a project; a task matrix associating selected ones of the data integration tasks with selected ones of the menus; a service oriented architecture including a registry of services, one or more of the services associated with each of the data integration tasks of the user interface; a repository associated with the user interface and one or more services of the services of the service oriented architecture, the repository storing a common metadata model for more than one phase of the workflow for the project; one or more runtime engines for executing the project; and one or more connectors for connecting the one or more runtime engines to one or more external resources, the connectors using modular components and compositions for data throughput.
In another aspect, a method for managing a data integration process may include: providing a user interface for performing a plurality of data integration tasks, the user interface including a plurality of menus corresponding to a plurality of phases of a workflow for a project; associating selected ones of the data integration tasks with selected ones of the menus; providing a service oriented architecture including a registry of services; associating one or more of the services with each of the data integration tasks of the user interface; associating a repository with the user interface and one or more services of the services of the service oriented architecture; storing a common metadata model for more than one phase of the workflow for the project in the repository; executing the project using one or more runtime engines, the runtime engines using modular components and compositions for data processing of the project; and connecting the one or more runtime engines to one or more external resources.
The data integration system or method may further include an intelligent automation layer between the user interface and the service oriented architecture, the intelligent automation layer providing context-specific content to a user of the user interface. The data integration system or method may further include a runtime optimization layer for the runtime engines, the runtime optimization layer autonomously allocating a process between the runtime engines.
In another aspect, a method for providing a user interface to a data integration system may include: providing a plurality of data integration services in a services oriented architecture; providing a strongly separated user interface with modular controls; associating the controls with one or more services of the services oriented architecture; and presenting the one or more services as tasks in the controls of the user interface.
A user interface system for data integration processes may include: a plurality of data integration services in a services oriented architecture; a strongly separated user interface with modular controls, the user interface configured to present one or more of the data integration services as tasks in the controls of the user interface; and a task matrix associating the controls of the user interface with one or more services of the services oriented architecture.
The services may include live relationships to shared metadata. Changes to the shared metadata may be immediately visible to other users of the shared metadata. The user interface may include context-sensitive help. The services may include connectors to external source-code control systems. The services may include at least one of mapping, data cleansing, and data enrichment. The user interface may be task oriented. The user interface may be skill-level sensitive or role sensitive. The role may include monitoring, deployment, and/or operational control. The user interface may include an integrated session history spanning all services accessible through the user interface. The user interface may be user-definable. The user interface may be adapted for use on a mobile device. The modular controls may be provided in a library of low-level controls. The method or system may include providing a library of reusable dialogs.
A service oriented architecture for providing services in a data integration system may include: a service directory for dynamically locating and instantiating services, the services including a plurality of discrete reusable and sharable services composed into tools, a component service exposing external resource connectors as services, and one or more repository services for accessing metadata models stored in a repository; and a separated user interface for user design of a data integration process using the services.
A method for providing service oriented services in a data integration system may include: registering a plurality of services in a service directory, the services including a plurality of discrete reusable and sharable services composed into tools, a component service exposing external resource connectors as services, and one or more repository services for accessing metadata models stored in a repository; dynamically locating one of the plurality of services within the registry; instantiating the located one of the plurality of services; and providing a separated user interface for user design of a data integration process using the located one of the plurality of services.
One or more of the services may be internal services and/or external services. The services may use different technologies. One or more of the services may be synchronous and/or asynchronous. The services may include a plurality of functions controlling one or more of auditing, logging, monitoring, management and administration, security, user profiles, session management, and reporting, transactions, orchestrations, user directory, component services, configuration, system management, source control, import/export of source and target metadata, metadata query, access, analysis, reporting, and subscription, sample data, data profiling, process editing, process compiling, data validation, design deployment, design execution, installation and upgrades, demo setup, deployment, release management, synchronization, data profiling, data cleansing, investigation and tuning, application of templates, testing, and problem reporting.
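By way of illustration only, the following minimal Java sketch shows one way a service directory of the kind described above might register, dynamically locate, and instantiate services. The names ServiceDirectory and DataIntegrationService are hypothetical and are not part of any product interface described herein.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

interface DataIntegrationService {
    Object invoke(Map<String, Object> parameters);
}

final class ServiceDirectory {
    // Factories rather than instances, so each lookup can instantiate a fresh service.
    private final Map<String, Supplier<DataIntegrationService>> registry =
            new ConcurrentHashMap<>();

    void register(String name, Supplier<DataIntegrationService> factory) {
        registry.put(name, factory);
    }

    // Dynamically locate and instantiate a registered service by name.
    DataIntegrationService locate(String name) {
        Supplier<DataIntegrationService> factory = registry.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No service registered as: " + name);
        }
        return factory.get();
    }
}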
In another aspect, a method for sharing a metadata model in a repository may include: storing a metadata model in a repository; registering one or more repository services in a service directory configured for dynamic location and instantiation of the repository services; deploying the service directory and the one or more repository services in a service oriented architecture; and accessing the metadata model through at least one of the one or more repository services.
A system for sharing a metadata model repository may include: a repository storing at least one metadata model; a service directory configured for dynamic location and instantiation of one or more repository services in a service oriented architecture; and an interface for shared access to the at least one metadata model through at least one of the one or more repository services.
The one or more repository services may include one or more of model registration, import/export, version management, navigation access, search/query access, persistence, and check-in/check-out services. The repository
may include one or more of a personal repository, a team repository, a project repository, a division based repository, an enterprise repository, a centralized repository, and a decentralized repository.
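The repository services enumerated above might be sketched, under hypothetical names and purely as an illustration, as a Java interface with version management and check-in/check-out reduced to simple method signatures.
import java.util.Optional;

interface MetadataRepositoryService {
    // Model registration and import: place a serialized metadata model under management.
    void registerModel(String modelName, byte[] serializedModel);

    // Check-out locks the model for editing by one user; empty if already locked.
    Optional<byte[]> checkOut(String modelName, String user);

    // Check-in stores a revised model as a new version and releases the lock.
    void checkIn(String modelName, String user, byte[] revisedModel);

    // Version management: report the current version number of a model.
    int currentVersion(String modelName);
}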
In another aspect, a method for connecting a data integration system to an external resource may include: deploying the data integration system as a plurality of services in a service oriented architecture; registering one or more connector components as services in a services directory of the data integration system, each connector component configured to connect to an external resource; and using one of the connector components as a service from within the data integration system to establish a connection to an external resource.
A system for connecting a data integration system to an external resource may include: a service oriented architecture including a services directory and a plurality of services; one or more connector components registered as services in the services directory, each connector component configured to connect to an external resource; and at least one runtime engine using one of the connector components as a service to establish a connection to the external resource from within the data integration system.
The connector components may provide all access to external resource files. At least one of the connector components may include a translation process. The connector components may be available during design of a data integration process and/or during runtime. The connector components may include interfaces to specific external resources. The connection to the external resource may include event listening and resource access. The connection to the external resource may be persistent.
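The following is an illustrative, non-limiting Java sketch of a connector component interface suggested by the description above; the method names and the Runnable-based event listener are assumptions made for the example.
interface ConnectorComponent {
    // Establish a connection to an external resource, e.g., a JDBC URL or an
    // FTP endpoint; the connection may be held open persistently.
    void connect(String resourceUri);

    // Resource access: read raw records from the external resource.
    Iterable<byte[]> read();

    // Event listening: invoke the callback when the resource signals a change.
    void addEventListener(Runnable onEvent);

    // Release the connection.
    void close();
}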
In another aspect, a method for connecting a data integration system to an external resource described herein includes deploying the data integration system as a plurality of services in a service oriented architecture; registering one or more connector components as services in a services directory of the data integration system, each connector component configured to connect to an external resource; and using one of the connector components as a service from within the data integration system to establish a connection to an external resource.
The connector components may provide all access to external resource files. At least one of the connector components may include a translation process. The connector components may be available during design of a data integration process. The connector components may be available during runtime. The connector components may include interfaces to specific external resources. The connection to the external resource may include event listening and resource access. The connection to the external resource may be persistent. The external resource may include at least one data source or at least one data target. The connector components may provide access to metadata for external resources. The connector components may provide access to data for external resources.
In another aspect, a system disclosed herein includes a services oriented architecture including a directory of services; a plurality of functions of a suite of related computer program products, each one of the functions registered as a service in the directory of services; and an interface for invoking one or more instances of the one or more services, whereby the suite of related computer program products is deployed in the services oriented architecture.
The suite of related computer program products may include a data integration suite. The functions may include one or more of auditing, logging, monitoring, management and administration, security, user profiles, session management, and reporting, transactions, orchestrations, user directory, component services, configuration, system management, source control, import/export of source and target metadata, metadata query, access, analysis, reporting, and subscription, sample data, data profiling, process editing, process compiling, data validation, design deployment, design execution, installation and upgrades, demo setup, deployment, release management, synchronization, data profiling, data cleansing, investigation and tuning, application of templates, testing, and problem reporting. One or more parameters may be passed to the one or more instances of the one or more services. The one or more services may include at least one common service that is common to all of the related computer program products.
In other aspects, a computer program product may include a computer useable medium including computer readable program code, wherein the computer readable program code when executed on one or more computers causes the one or more computers to perform any one or more of the methods above.
"International Business Machines" or "IBM" as used herein shall refer to International Business Machines Corporation of Armonk, New York.
As used herein, "data source" or "data target" are intended to have the broadest possible meaning consistent with these terms, and shall include a database, a plurality of databases, a repository information manager, a queue, a message service, a repository, a data facility, a data storage facility, a data provider, a website, a server, a computer, a computer storage facility, a CD, a DVD, a mobile storage facility, a central storage facility, a hard disk, a multiple coordinating data storage facilities, RAM, ROM, flash memory, a memory card, a temporary memory facility, a permanent memory facility, magnetic tape, a locally connected computing facility, a remotely connected computing facility, a wireless facility, a wired facility, a mobile facility, a central facility, a web browser, a client, a laptop, a personal digital assistant ("PDA"), a telephone, a cellular phone, a mobile phone, an information platform, an analysis facility, a processing facility, a business enterprise system or other facility where data is handled or other facility provided to store data or other information, as well as any files or file types for maintaining structured or unstructured data used in any of the above systems, or any streaming, messaged, event driven, or otherwise sourced data, and any combinations of the foregoing, unless a specific meaning is otherwise indicated or the context of the phrase requires otherwise. A storage mechanism is any physical or logical device, resource, or facility capable of serving as a data source or a data target, or otherwise storing data in a retrievable form.
"Enterprise Java Bean (EJB)" shall include the server-side component architecture for the J2EE platform. EJBs support rapid and simplified development of distributed, transactional, secure and portable Java applications. EJBs support a container architecture that allows concurrent consumption of messages and provide support for distributed transactions, so that database updates, message processing, and connections to enterprise systems using the J2EE architecture can participate in the same transaction context.
"JMS" shall mean the Java Message Service, which is an enterprise message service for the Java-based J2EE enterprise architecture. "JCA" shall mean the J2EE Connector architecture of the J2EE platform described more particularly below. It should be appreciated that, while EJB, JMS, and JCA are commonly used software tools in contemporary distributed transaction environments, any platform, system, or architecture providing similar functionality may be employed with the data integration systems described herein.
"Real time" as used herein, shall include periods of time that approximate the duration of a business transaction or business and shall include processes or services that occur during a business operation or business process, as opposed to occurring off-line, such as in a nightly batch processing operation. Depending on the duration of the business process, real time might include seconds, fractions of seconds, minutes, hours, or even days.
"Business process," "business logic" and "business transaction" as used herein, shall include any methods, service, operations, processes or transactions that can be performed by a business, including, without limitation, sales, marketing, fulfillment, inventory management, pricing, product design, professional services, financial services, administration, finance, underwriting, analysis, contracting, information technology services, data storage, data mining, delivery of information, routing of goods, scheduling, communications, investments, transactions,
offerings, promotions, advertisements, offers, engineering, manufacturing, supply chain management, human resources management, data processing, data integration, work flow administration, software production, hardware production, development of new products, research, development, strategy functions, quality control and assurance, packaging, logistics, customer relationship management, handling rebates and returns, customer support, product maintenance, telemarketing, corporate communications, investor relations, and many others.
"Service oriented architecture (SOA)", as used herein, shall include services that form part of the infrastructure of a business enterprise. In the SOA, services can become building blocks for application development and deployment, allowing rapid application development and avoiding redundant code. Each service may embody a set of business logic or business rules that can be bound to the surrounding environment, such as the source of the data inputs for the service or the targets for the data outputs of the service. Various instances of SOA are provided in the following description.
"Metadata," as used herein, shall include data that brings context to the data being processed, data about the data, information pertaining to the context of related information, information pertaining to the origin of data, information pertaining to the location of data, information pertaining to the meaning of data, information pertaining to the age of data, information pertaining to the heading of data, information pertaining to the units of data, information pertaining to the field of data, information pertaining to any other information relating to the context of the data.
"WSDL" or "Web Services Description Language" as used herein, includes an XML format for describing network services (often web services) as a set of endpoints operating on messages containing either document- oriented or procedure-oriented information. The operations and messages are described abstractly, and then bound to a concrete network protocol and message format to define an endpoint. Related concrete endpoints are combined into abstract endpoints (services). WSDL is extensible to allow description of endpoints and their messages regardless of what message formats or network protocols are used to communicate.
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a schematic diagram of a business enterprise with a plurality of business processes, each of which may include a plurality of different computer applications and data sources.
Fig. 2 is a schematic diagram showing data integration across a plurality of business processes of a business enterprise.
Fig. 3 is a schematic diagram showing an architecture for providing data integration for a plurality of data sources for a business enterprise.
Fig. 4 is a schematic diagram showing details of a discovery facility for a data integration job.
Fig. 5 is a flow diagram showing steps for accomplishing a discover step for a data integration process.
Fig. 6 is a schematic diagram showing a cleansing facility for a data integration process.
Fig. 7 is a flow diagram showing steps for a cleansing process for a data integration process.
Fig. 8 is a schematic diagram showing a transformation facility for a data integration process.
Fig. 9 is a flow diagram showing steps for transforming data as part of a data integration process.
Fig. 10 depicts an example of a transformation process for mortgage data modeled using a graphical user interface.
Fig. 11A is a schematic diagram showing a plurality of connection facilities for connecting a data integration process to other processes of a business enterprise.
Fig. 11B shows a plurality of connection facilities using a bridge model.
Fig. 12 is a flow diagram showing steps for connecting a data integration process to other processes of a business enterprise.
Fig. 13 shows an enterprise computing system that includes a data integration system.
Fig. 14A illustrates management of metadata in a data integration job.
Fig. 14B illustrates an aspect oriented programming environment that may be used in a data integration job.
Fig. 15 is a flow diagram showing additional steps for using a metadata facility in connection with a data integration job.
Fig. 16 is a flow diagram showing additional steps for using a metadata facility in connection with a data integration job.
Fig. 16A is a flow diagram showing additional steps for using a metadata facility in connection with a data integration job.
Fig. 17 is a schematic diagram showing a facility for parallel execution of a plurality of processes of a data integration process.
Fig. 18 is a flow diagram showing steps for parallel execution of a plurality of processes of a data integration process.
Fig. 19 is a schematic diagram showing a data integration job, comprising inputs from a plurality of data sources and outputs to a plurality of data targets.
Fig. 20 is a schematic diagram showing a data integration job, comprising inputs from a plurality of data sources and outputs to a plurality of data targets.
Fig. 21 shows a graphical user interface whereby a data manager for a business enterprise can design a data integration job.
Fig. 22 shows another embodiment of a graphical user interface whereby a data manager can design a data integration job.
Fig. 23 is a schematic diagram of an architecture for integrating a real time data integration service facility with a data integration process.
Fig. 24 is a schematic diagram showing a services oriented architecture for a business enterprise.
Fig. 25 is a schematic diagram showing a SOAP message format.
Fig. 26 is a schematic diagram showing elements of a WSDL description for a web service.
Fig. 27 is a schematic diagram showing elements for enabling a real time data integration process for an enterprise.
Fig. 28 is an embodiment of a server for enabling a real time integration service.
Fig. 29 shows an architecture and functions of a typical J2EE server.
Fig. 30 represents an RTI console for administering an RTI service.
Fig. 31 shows further detail of an architecture for enabling an RTI service.
Fig. 32 is a schematic diagram of the internal architecture for an RTI service.
Fig. 33 illustrates an aspect of the interaction of the RTI server and an RTI agent.
Fig. 34 illustrates an RTI service used in a financial services business.
Fig. 35 shows how an enterprise may update customer records using RTI services.
Fig. 36 illustrates a data integration system including a master customer database.
Fig. 37 shows how an RTI service may embody a set of data transformation, validation and standardization routines.
Fig. 38 illustrates an application accessing real time integration services.
Fig. 39 shows an underwriting process without data integration services.
Fig. 40 shows an underwriting process employing RTI services.
Fig. 41 shows an enterprise using multiple RTI services.
Fig. 42 shows a trucking broker business using real time integration services.
Fig. 43 illustrates a set of data integration services supporting applications that a driver can access as web services, such as using a mobile device.
Fig. 44 shows a data integration system used for financial reporting.
Fig. 45 shows a data integration system used to maintain an authoritative customer database in a retail business.
Fig. 46 shows a data integration system used in the pharmaceutical industry.
Fig. 47 shows a data integration system used in a manufacturing business.
Fig. 48 shows a data integration system used to analyze clinical trial study results.
Fig. 49 shows a data integration system used for review of scientific research data.
Fig. 50 shows a data integration system used to manage customer data across multiple business systems.
Fig. 51 shows a data integration system used to provide on-demand, automated matching of inbound customer data with existing customer records.
Fig. 52 shows a high level schematic view of an architecture for data integration services.
Fig. 52A shows a task matrix for defining a user interface as a number of tasks.
Fig. 53 shows a more detailed schematic view of the GUI.
Fig. 54 shows a UML diagram of the SOA roles and relationships in the architecture.
Fig. 55 shows a schematic of an SOA environment for the architecture.
Fig. 56 shows a schematic of the models in the repository services.
Fig. 57 shows a schematic of the repository services architecture.
Fig. 58 shows a high level UML class diagram of the component and composition model.
Fig. 59 shows an ETL example of a composition model.
Fig. 60 shows a UML class diagram of the relationships of clients, services, and components.
Fig. 61 shows a schematic of a data representation as it is passed from one component to another.
Fig. 62 shows a common transformation framework for the architecture.
Fig. 63 shows connectivity for the architecture.
DETAILED DESCRIPTION
Throughout the following discussion, like element numerals are intended to refer to like elements, unless specifically indicated otherwise.
The invention(s) disclosed herein can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention(s) can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk read-only memory (CD-ROM), compact disk read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
Fig. 1 represents a platform 100 for facilitating integration of various data of a business enterprise. The platform includes a plurality of business processes, each of which may include a plurality of different computer applications and data sources. The platform may include several data sources 102, such as those described above. These data sources may include a wide variety of data types from a wide variety of physical locations. For example, the data sources may include systems from providers such as Sybase, Microsoft, Informix, Oracle, InfoMover, EMC, Trillium, First Logic, Siebel, PeopleSoft, IBM, Apache, or Netscape. The data sources 102 may include systems using database products or standards such as IMS, DB2, ADABAS, VSAM, MQ Series, UDB, XML, complex flat files, or FTP files. The data sources 102 may include files created or used by applications such as Microsoft Outlook, Microsoft Word, Microsoft Excel, Microsoft Access, as well as files in standard formats such as ASCII, CSV, GIF, TIF, PNG, PDF, and so forth. The data sources 102 may come from various locations or they may be centrally located. The data supplied from the data sources 102 may come in various forms and have different formats that may or may not be compatible with one another.
Data targets are discussed later in this description. In general, these data targets may be any of the data sources 102 noted above. This difference in nomenclature typically denotes whether a data system provides data or receives data in a data integration process. However, it should be appreciated that this distinction is not intended to convey any difference in capability between data sources and data targets (unless specifically stated otherwise), since in a conventional data integration system, data sources may receive data and data targets may provide data.
The platform illustrated in Fig. 1 also includes a data integration system 104. The data integration system may, for example, facilitate the collection of data from the data sources 102 as the result of a query or retrieval command the data integration system 104 receives. The data integration system 104 may send commands to one or
more of the data sources 102 such that the data source(s) provides data to the data integration system 104. Since the data received may be in multiple formats including varying metadata, the data integration system may reconfigure the received data such that it can be later combined for integrated processing. The functions that may be performed by the data integration system 104 are described in more detail below. The platform 100 also includes several retrieval systems 108. The retrieval systems 108 may include databases or processing platforms used to further manipulate the data communicated from the data integration system 104. For example, the data integration system 104 may cleanse, combine, transform or otherwise manipulate the data it receives from the data sources 102 such that a retrieval system 108 can use the processed data to produce reports 110 useful to the business. The reports 110 may be used to report data associations, answer complex queries, answer simple queries, or form other reports useful to the business or user, and may include raw data, tables, charts, graphs, and any other representations of data from the retrieval systems 108.
The platform 100 may also include a database or data base management system 112. The database 112 may be used to store information temporally, temporarily, or for permanent or long-term storage. For example, the data integration system 104 may collect data from one or more data sources 102 and transform the data into forms that are compatible with one another or compatible to be combined with one another. Once the data is transformed, the data integration system 104 may store the data in the database 112 in a decomposed form, combined form or other form for later retrieval.
Fig. 2 is a schematic diagram showing data integration across a plurality of entities and business processes of a business enterprise. In the illustrated embodiment, the data integration system 104 facilitates the information flowing between user interface systems 202 and data sources 102. The data integration system 104 may receive queries from the interface systems 202, where the queries necessitate the extraction and/or transformation of data residing in one or more of the data sources 102. The interface systems 202 may include any device or program for communicating with the data integration system 104, such as a web browser operating on a laptop or desktop computer, a cell phone, a personal digital assistant ("PDA"), a networked platform and devices attached thereto, or any other device or system that might interface with the data integration system 104.
For example, a user may be operating a PDA and make a request for information to the data integration system 104 over a WiFi or Wireless Access Protocol/Wireless Markup Language ("WAP/WML") interface. The data integration system 104 may receive the request and generate any required queries to access information from a website or other data source 102 such as an FTP file site. The data from the data sources 102 may be extracted and transformed into a format compatible with the requesting interface system 202 (a PDA in this example) and then communicated to the interface system 202 for user viewing and manipulation. In another embodiment, the data may have previously been extracted from the data sources and stored in a separate database 112, which may be a data warehouse or other data facility used by the data integration system 104. The data may have been stored in the database 112 in a transformed condition or in its original state. For example, the data may be stored in a transformed condition such that the data from a number of data sources 102 can be combined in another transformation process. For example, a query from the PDA may be transmitted to the data integration system 104 and the data integration system 104 may extract the information from the database 112. Following the extraction, the data integration system 104 may transform the data into a combined format compatible with the PDA before transmission to the PDA. Fig. 3 is a schematic diagram showing an architecture for providing data integration for a plurality of data sources 102 for a business enterprise. An embodiment of a data integration system 104 may include a discover data
stage 302 to perform, possibly among other processes, extraction of data from a data source and analysis of column values and table structures for source data. A discover data stage 302 may also generate recommendations about table structure, relationships, and keys for a data target. More sophisticated profiling and auditing functions may include date range validation, accuracy of computations, accuracy of if-then evaluations, and so forth. The discover data stage 302 may normalize data, such as by eliminating redundant dependencies and other anomalies in the source data. The discover data stage 302 may provide additional functions, such as drill down to exceptions within a data source 102 for further analysis, or enabling direct profiling of mainframe data. A non-limiting example of a commercial embodiment of a discover data stage 302 may be found in Ascential's ProfileStage product.
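By way of illustration only, the column-value analysis performed by a discover data stage might resemble the following simplified Java sketch, which computes distinct-value frequencies per column; the row representation is a hypothetical simplification, not the structure of any product described herein.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColumnProfiler {
    // rows: each row maps a column name to its (possibly null) value.
    static Map<String, Map<String, Integer>> profile(List<Map<String, String>> rows) {
        Map<String, Map<String, Integer>> frequencies = new HashMap<>();
        for (Map<String, String> row : rows) {
            for (Map.Entry<String, String> cell : row.entrySet()) {
                frequencies
                        .computeIfAbsent(cell.getKey(), k -> new HashMap<>())
                        .merge(String.valueOf(cell.getValue()), 1, Integer::sum);
            }
        }
        // Distinct values and their counts per column; nulls are tallied as "null",
        // which supports simple nullability and key-candidate inferences.
        return frequencies;
    }
}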
The data integration system 104 may also include a data preparation stage 304 where the data is prepared, standardized, matched, or otherwise manipulated to produce quality data to be later transformed. The data preparation stage 304 may perform generic data quality functions, such as reconciling inconsistencies or checking for correct matches (including one-to-one matches, one-to-many matches, and deduplication) within data. The data preparation stage 304 may also provide specific data enhancement functions. For example, the data preparation stage 304 may ensure that addresses conform to multinational postal references for improved international communication. The data preparation stage 304 may conform location data to multinational geocoding standards for spatial information management. The data preparation stage may modify or add to addresses to ensure that address information qualifies for U.S. Postal Service mail rate discounts under Government Certified U.S. Address Correction. Similar analysis and data revision may be provided for Canadian and Australian postal systems, which provide discount rates for properly addressed mail. A non-limiting example of a commercial embodiment of a data preparation stage 304 may be found in Ascential's QualityStage product.
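The following simplified Java sketch conveys the flavor of the standardization and duplicate matching described above; the normalization rules shown are toy assumptions and do not reflect certified postal correction logic.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AddressStandardizer {
    // Normalize an address into a canonical comparison form.
    static String standardize(String address) {
        return address.trim()
                .toUpperCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .replace("STREET", "ST")
                .replace("AVENUE", "AVE");
    }

    // Deduplicate by comparing standardized forms, keeping the first original.
    static List<String> deduplicate(List<String> addresses) {
        Map<String, String> seen = new LinkedHashMap<>();
        for (String address : addresses) {
            seen.putIfAbsent(standardize(address), address);
        }
        return new ArrayList<>(seen.values());
    }
}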
The data integration system may also include a data transformation stage 308 to transform, enrich and deliver transformed data. The data transformation stage 308 may perform transitional services such as reorganization and reformatting of data, and perform calculations based on business rules and algorithms of the system user. The data transformation stage 308 may also organize target data into subsets known as datamarts or cubes for more highly tuned processing of data in certain analytical contexts. The data transformation stage 308 may employ bridges, translators, or other interfaces (as discussed generally below) to span various software and hardware architectures of various data sources and data targets used by the data integration system 104. The data transformation stage 308 may include a graphical user interface, a command line interface, or some combination of these, to design data integration jobs across the platform 100. A non-limiting example of a commercial embodiment of a data transformation stage 308 may be found in Ascential's DataStage product.
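As an illustration of the reorganization, reformatting, and rule-based calculation a transformation stage might perform, consider the following simplified Java sketch; the field names and the business rule are assumptions made for the example.
import java.util.HashMap;
import java.util.Map;

final class RowTransformer {
    static Map<String, Object> transform(Map<String, Object> source) {
        Map<String, Object> target = new HashMap<>();
        // Reformat: split a combined "LAST, FIRST" name field into two fields.
        String[] name = ((String) source.get("FULL_NAME")).split(",\\s*", 2);
        target.put("LAST_NAME", name[0]);
        target.put("FIRST_NAME", name.length > 1 ? name[1] : "");
        // Business rule: flag balances above a threshold for manual review.
        double balance = (Double) source.get("BALANCE");
        target.put("REVIEW_FLAG", balance > 500_000.0);
        return target;
    }
}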
The stages 302, 304, 308 of the data integration system 104 may be executed using a parallel execution system 310 or in a serial or combination manner to optimize the performance of the system 104.
The data integration system 104 may also include a metadata management system 312 for managing metadata associated with data sources 102. In general, the metadata management system 312 may provide for interchange, integration, management, and analysis of metadata across all of the tools in a data integration environment. For example, a metadata management system 312 may provide common, universally accessible views of data in disparate sources, such as Ascential's ODBC MetaBroker, CA ERwin, Ascential ProfileStage, Ascential DataStage, Ascential QualityStage, IBM DB2 Cube Views, and Cognos Impromptu. The metadata management system 312 may also provide analysis tools for data lineage and impact analysis for changes to data structures. The metadata management system 312 may further be used to prepare a business data glossary of data definitions, algorithms, and business contexts for data within the data integration system 104, which glossary may
be published for use throughout an enterprise. A non-limiting example of a commercial embodiment of a metadata management system 312 may be found in Ascential's MetaStage product.
Fig. 4 is schematic diagram showing details of a facility implementing the discovery data stage 302 for a data integration job. In this embodiment, the discovery data stage 302 queries a database 402, which may be any of the data sources 102 described above, to determine the content and structure of data in the database 402. The database 402 provides the results to the discovery data stage 302 and the discovery data stage 302 facilitates the subsequent communication of extracted data to the other portions of the data integration system 104. In an embodiment, the discovery data stage 302 may query many data sources 102 so that the data integration system 104 can cleanse and consolidate the data into a central database or repository information manager. Fig. 5 is a flow diagram showing steps for accomplishing a discover step for a data integration process
500. While a specific data integration process 500 is described below, a data integration process 500 as used herein may refer to any process using the data sources 102 and data targets, databases 112, data integration systems 104, and other components described herein. In an embodiment, the process steps for an example discover step may include a first step 502 where the discovery facility, such as the discover data stage 302 described above, receives a command to extract data from one or more data sources 102. Following the receipt of an extraction command, the discovery facility may identify the appropriate data source(s) 102 where the data to be extracted resides, as shown in step 504. The data source(s) 102 may or may not be identified in the command. If the data source(s) 102 is identified, the discovery facility may query the identified data source(s) 102. In the event a data source(s) 102 is not identified in the command, the discovery facility may determine the data source 102 from the type of data requested in the data extraction command, from another piece of information in the command, or after determining the association to other data that is required. For example, the query may be for a customer address, where a first portion of the customer address data resides in a first data source 102 and a second portion resides in a second data source 102. The discovery facility may process the extraction command and direct its extraction activities to the two data sources 102 without further instructions in the command. Once the data source(s) 102 is identified, the discovery facility may execute a process to extract the data, as shown in step 508. Once the data has been extracted, the discovery facility may facilitate the communication of the data to another portion of the data integration system.
Fig. 6 is a schematic diagram showing a cleansing facility, which may be the data preparation stage 304 described above, for a data integration process 500. Generally, data coming from several data sources 102 may have inaccuracies and these inaccuracies, if left unchecked and uncorrected, could cause errors in the interpretation of the data ultimately produced by the data integration system 104. Company mergers, acquisitions, reorganizations, or other consolidation of data sources 102 can further compound the data quality issue by bringing in new data labels, acronyms, metrics, methods of calculation, and so forth. As depicted in Fig. 6, a cleansing facility may receive data 602 from a data source 102. The data 602 may have come from one or more data sources 102 and may have inconsistencies or inaccuracies. The cleansing facility may provide automated, semi-automated, or manual facilities for screening, correcting, cleaning or otherwise enhancing quality of the data 602. Once the data 602 passes through the cleansing facility it may be communicated to another portion of the data integration system 104.
Fig. 7 is a flow diagram showing steps for a cleansing process 700 in a data integration process 500. In an embodiment, the cleansing process may include a step 702 of receiving data from one or more data sources 102 (e.g., through a discovery facility). The cleansing process 700 may include one or more methods of cleaning the data. For example, the process may include a step 704 of automatically cleaning the data, a step 708 of semi-automatically cleaning the data, and a step 710 of manually cleaning the data. The step 704 of automatically correcting or cleaning the data or a portion of the data may include the application of several techniques, such as automatic spell checking and correction, comparing data, comparing timeliness of the data, assessing the condition of the data, or other techniques for enhancing data quality and consistency. The step 708 of semi-automatically cleansing data may include a facility where a user interacts with some of the process steps and the system automatically performs the assigned cleaning tasks. The semi-automated system may include a graphical user interface process step 712, in which a user interacts with the graphical user interface to facilitate the process 700 for cleansing the data. The process 700 may also include a step 710 for manually correcting the data. This step may also include use of a user interface to facilitate the manual correction, consolidation and/or cleaning of the data. The cleansed data from the cleansing processes 700 may be transmitted to another facility in the data integration system 104, such as the data transformation stage 308.
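The split between automatic correction and manual review described above might be sketched as follows; the rule abstraction and the review queue are hypothetical simplifications of steps 704 through 710.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class CleansingPass {
    // Values a rule cannot repair are held here for manual correction (step 710).
    static final List<String> reviewQueue = new ArrayList<>();

    // Apply an automatic correction rule (step 704); a null result means the
    // rule could not repair the value, so it falls through to manual review.
    static List<String> clean(List<String> values, UnaryOperator<String> rule) {
        List<String> cleaned = new ArrayList<>();
        for (String value : values) {
            String fixed = rule.apply(value);
            if (fixed != null) {
                cleaned.add(fixed);
            } else {
                reviewQueue.add(value);
            }
        }
        return cleaned;
    }
}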
Fig. 8 is a schematic diagram showing a transformation facility, which may be the data transformation stage 308 described above, for a data integration process 500. The transformation facility may receive cleansed data 802 from a cleansing facility and perform transformation processes, enrich the data and deliver the data to another process within the data integration system 104 or outside of the data integration system 104 where the integrated data may be viewed, used, further transformed or otherwise manipulated. For example, a user may investigate the data through data mining, or generate reports useful to the user or business.
Fig. 9 is a flow diagram showing steps for transforming data as part of a data integration process 500. The transformation process 900 may include receiving cleansed data (e.g., from the data preparation stage 304 described above), as shown in step 902. As shown in step 904, the process 900 may make a determination of the type of desired transformation. Following the step 904 of determining the transformation process, the transformation process may be executed, as shown in step 908. The transformed data may then be transmitted to another facility as shown in step 910.
In general, the data integration system 104 may be controlled and applied to specific enterprise data using a graphical user interface. The interface may include visual tools for modeling data sources, data targets, and stages or processes for acting upon data, as well as tools for establishing relationships among these data entities to model a desired data integration task. Graphical user interfaces are described in greater detail below. The following provides a general example to depict how a user interface might be used in this context.
Fig. 10 depicts an example of a transformation process 1000 for mortgage data modeled using a graphical user interface 1018. For this example, a business enterprise wishes to generate a report concerning certain mortgages. The mortgage balance information may reside in a mortgage database, which may be one of the data sources 102 described above, and the property information, such as the property address, may reside in a property database, which may also be one of the data sources 102 described above. A graphical user interface 1018 may be provided to organize the transformation process. For example, the user may select a graphical representation of the mortgage database 1002 and a graphical representation of the property database 1012, and manipulate these representations 1002, 1012 into position within the interface 1018 using, e.g., conventional drag and drop operations. Then the user may select a graphical representation of a row transformation process 1004 to prepare the rows for combination. The user may drag and drop process flow directions, indicated generally within Fig. 10 as arrows, such that the data from the databases flows into the row transformation process. In this model, the user may elect to remove any unmatched files and send them to a storage facility. To accomplish this, the user may place a graphical representation of a storage facility 1014 within the interface 1018. If the user wishes to
further process the remaining matched files, the user may, for example, add a graphical representation of another transformation and aggregation process 1008 that combines data from the two databases. Finally, the user may decide to send the aggregate data to a storage facility by adding a graphical representation of a data warehouse 1010. Once the user sets this process up using the graphical user interface, the user may run the transformation process.
Fig. 11A is a schematic diagram showing a plurality of connection facilities for connecting a data integration process 500 to other processes of a business enterprise. In an embodiment, the data integration system 104 may be associated with an integrated storage facility 1102, which may be one of the data sources 102 described above. The integrated storage facility 1102 may contain data that has been extracted from several other data sources 102 and processed through the data integration system 104. The integrated data may be stored in a form that permits one or more computer platforms 1108A and 1108B to retrieve data from the integrated data storage facility 1102. The computing platforms 1108A and 1108B may request data from the integrated data facility 1102 through a translation engine 1104A and 1104B. For example, each of the computing platforms 1108A and 1108B may be associated with a separate translation engine 1104A and 1104B. The translation engine 1104A and 1104B may be adapted to translate the integrated data from the storage facility 1102 into a form compatible with the associated computing platform 1108A and 1108B. In an embodiment, the translation engines 1104A and 1104B may also be associated with the data integration system 104. This association may be used to update the translation engines 1104A and 1104B with required information. This process may also involve the handling of metadata, which will be further defined below. While the hub model for data integration, as generally depicted in Fig. 11A, is one model for connecting to different computing platforms 1108A, 1108B and other data sources 102, other models may be employed, such as the bridge model described in reference to Fig. 11B. It should be appreciated that, where connections to data sources 102 are described herein, either of these models, or other models, may be used, unless specified or otherwise indicated by the context.
Fig. 11B shows a plurality of connection facilities using a bridge model. In this system, a plurality of data sources 102, such as an inventory system, a customer relations system, and an accounting system, may be connected to a data integration system 104 of an enterprise computing system 1300 through a plurality of bridges 1120 or connection facilities. Each bridge 1120 may be a vendor-specific transformation engine that provides metadata models for the external data sources 102, and enables bi-directional transfers of information between the data integration system 104 and the data sources 102. Enterprise integration vendors may have a proprietary format for their data sources 102 and therefore a different bridge 1120 may be required for each different external model. Each bridge 1120 may provide a connection facility to all or some of the data within a data source 102, and separate maps or models may be maintained for connections to and from each data source 102. Further, each bridge 1120 may provide error checking, reconciliation, or other services to maintain data integrity among the data sources 102. With the data sources 102 interconnected in this manner, data may be shared or reconciled among systems, and various data integration tasks may be performed on data within the data sources 102 as though the data sources 102 formed a single data source 102 or warehouse.
Fig. 12 is a flow diagram showing steps for connecting a data integration process 500 to other processes of a business enterprise. In an embodiment, the connection process may include a step 1202 during which the data integration system 104 stores data it has processed in a central storage facility. The data integration system 104
may also update one or more translation engines in step 1204. The illustration in Fig. 12 shows these processes occurring in series, but they may also occur in parallel, or some combination of these. The process may involve a step 1208 where a computing platform generates a data request and the data request is sent to an associated translation engine. Step 1210 may involve the translation engine extracting the data from the storage facility. The translation engine may also translate the data into a form compatible with the computing platform in step 1212 and the data may then be communicated to the computing platform in step 1214.
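By way of illustration, the translation engine's role in steps 1210 and 1212 might be captured by an interface such as the following; the type names and signatures are hypothetical.
interface TranslationEngine {
    // Step 1210: extract the requested data from the central storage facility.
    byte[] extract(String dataRequest);

    // Step 1212: translate the integrated data into a form compatible with the
    // associated computing platform.
    Object translateForPlatform(byte[] integratedData);
}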
Fig. 13 shows an enterprise computing system 1300 that includes a data integration system 104. The enterprise computing system 1300 may include any combination of computers, mainframes, portable devices, data sources, and other devices, connected locally through one or more local area networks and/or connected remotely through one or more wide area or public networks using, for example, a virtual private network over the Internet. Devices within the enterprise computing system 1300 may be interconnected into a single enterprise to share data, resources, communications, and information technology management. Typically, resources within the enterprise computing system 1300 are used by a common entity, such as a business, association, governmental body, or university. However, in certain business models, resources of the enterprise computing system 1300 may be owned (or leased) and used by a number of different entities, such as where an application service provider offers on-demand access to remotely executing applications.
The enterprise computing system 1300 may include a plurality of tools 1302, which access a common data structure, termed herein a repository information manager ("RIM") 1304 (also referred to below as a "hub") through respective translation engines 1308 (which, in a bridge-based system, may be the bridges 1120 described above). The RIM 1304 may include any of the data sources 102 described above. It will be appreciated that, while three translation engines 1308 and three tools 1302 are depicted, any number of translation engines 1308 and tools 1302 may be employed within an enterprise computing system 1300, including a number less than three and a number significantly greater than three. The tools 1302 generally comprise, for example, diverse types of database management systems and other applications programs that access shared data stored in the RIM 1304. The tools 1302, RIM 1304, and translation engines 1308 may be processed and maintained on a single computer system, or they may be processed and maintained on a number of computer systems which may be interconnected by, for example, a network (not shown), which transfers data access requests, translated data access requests, and responses between the different components 1302, 1304, 1308.
While they are executing, the tools 1302 may generate data access requests to initiate a data access operation, that is, a retrieval of data from or storage of data in the RIM 1304. Data may be stored in the RIM 1304 in an atomic data model and format that will be described below. Typically, the tools 1302 will view the data stored in the RIM 1304 in a variety of diverse characteristic data models and formats, as will be described below, and each translation engine 1308, upon receiving a data access request, will translate the data between the respective tool's characteristic model and format and the atomic model and format of the RIM 1304 as necessary. For example, during an access operation of the retrieval type, in which data items are to be retrieved from the RIM 1304, the translation engine 1308 will identify one or more atomic data items in the RIM 1304 that jointly comprise the data item to be retrieved in response to the access request, and will enable the RIM 1304 to provide the atomic data items to one of the translation engines 1308. The translation engine 1308, in turn, will aggregate the atomic data items that it receives from the RIM 1304 into one or more data items as required by the tool's characteristic model and format, or "view" of the data, and provide the aggregated data items to the tool 1302 that issued the access request. During data storage, in which data in the RIM 1304 is to be updated, the translation engine 1308 may receive the data to be
stored in a characteristic model and format for one of the tools 1302. The translation engine 1308 may translate the data into the atomic model and format for the RIM 1304, and provide the translated data to the RIM 1304 for storage. If the data storage access request enables data to be updated, the RIM 1304 may substitute the newly-supplied data from the translation engine 1308 for the current data. On the other hand, if the data storage access request represents new data, the RIM 1304 may add the data, in the atomic format as provided by the translation engine 1308, to the current data in the RIM 1304.
The enterprise computing system 1300 further includes a data integration system 104, which maintains and updates the atomic format of the RIM 1304 and the translation engines 1308 as new tools 1302 are added to the system 1300. It will be appreciated that certain operations performed by the data integration system 104 may be performed automatically or under manual control. Briefly, when the system 1300 is initially established, or when one or more tools 1302 are added to the system 1300 whose data models and formats differ from the current data models and formats, the data integration system 104 may determine any differences and modify the data model and format of the data in the RIM 1304 to accommodate the data model and format of the new tool 1302. In that operation, the data integration system 104 may determine an atomic data model which is common to the data models of any tools 1302 that are currently in the system 1300 and the new tool 1302 to be added, and enable the data model of the RIM 1304 to be updated to the new atomic data model. In addition, the data integration system 104 may update the translation engines 1308 associated with any tools 1302 currently in the system 1300 based on the updated atomic data model of the RIM 1304, and may also generate a translation engine 1308 for the new tool 1302. Accordingly, the data integration system 104 ensures that the translation engines 1308 of all tools 1302, including any tools 1302 currently in the system as well as a tool 1302 to be added, conform to the atomic data models and formats of the RIM 1304.
Before proceeding further, it may be helpful to provide a specific example illustrating characteristic data models and formats that may be useful for various tools 1302 and an atomic data model and format useful for the RIM 1304. It will be appreciated that the specific characteristic data models and formats for the tools 1302 will depend on the particular tools 1302 that are present in a specific enterprise computing system 1300. In addition, it will be appreciated that the specific atomic data models and formats for the RIM 1304 may depend on the characteristic data models and formats which are used for the tools 1302, and may represent the aggregate or union of the finest-grained elements of the data models and formats for all of the tools 1302 in the system 1300.
Fig. 14A provides an example relating to a database of designs for a cup, such as a drinking cup or other vessel for holding liquids. The database may be used for designing and manufacturing the cups. In this application, the tools 1302 may be used to add cup design elements to the RIM 1304, such as design drawings, dimensions, exterior surface treatments, color, materials, handles (or lack thereof), cost data, and so on. The tools 1302 may also be used to modify cup design elements stored in the RIM 1304, and re-use and associate particular cup design elements in the RIM 1304 with a number of different cup designs. The RIM 1304 and translation engines 1308 may provide a mechanism by which a number of different tools 1302 can share the elements stored in the RIM 1304 without having to agree on a common schema or model and format arrangement for the elements.
In this example, the RIM 1304 may store data items in an entity-relationship format, with each entity being a data item and relationships reflecting relationships among data items, as will be illustrated below. The entities are in the form of objects that may, in turn, be members or instances of classes and subclasses in an object-oriented environment. It will be appreciated that other models and formats may be used for the RIM 1304.
Fig. 14A depicts an illustrative metadata structure for a cup design database. The class structure may include a main class 1402, two subclasses 1404 for containers and handles that depend from the main class 1402, and two lower-level subclasses 1408 for sides and bases, both of which depend from the container subclass 1404. Each data item in class 1402, which is termed an "entity" in the entity-relationship format, may represent a specific cup or specific type of cup in an inventory, and will have associated attributes which define various characteristics of the cup, with each attribute being identified by a particular attribute identifier and data value for the attribute.
Each data item in the handle and container subclasses 1404, which are also "entities" in the entity-relationship format, may represent container and handle characteristics of the specific cups or types of cups in the inventory. More specifically, each data item in the container subclass 1404 may represent the container characteristics of a cup represented by a data item in the cup class 1402, such as color, sidewall characteristics, base characteristics and the like. In addition, each data item in the handle subclass 1404 may represent the handle characteristics of a cup that is represented by a data item in the cup class 1402, such as curvature, texture, color, position and the like. In addition, it will be appreciated that there may be one or more relationships between the data items in the handle subclass 1404 and the container subclass 1404 that serve to link the data items between the subclasses 1404. For example, there may be a relationship signifying whether a container has a handle. In addition, or instead, there may be a relationship signifying how many handles a container has. Further, there may be a position relationship, which specifies the position of a handle on the container. The number and position relationships may be viewed as properties of the first relationship (container has a handle), or as separate relationships. The two lower-level subclasses 1408 may be associated with the container subclass 1404 and represent various elements of the container. In the illustration depicted in Fig. 14A, the subclasses 1408 may include a sidewall type subclass
1408 and a base type subclass 1408, each characterizing an element of the cup class 1402. It will be appreciated that the cup and the properties of the cup, such as the container and the handle, may be defined in an object-oriented manner using any desired level of detail.
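For purposes of illustration only, the class structure of Fig. 14A might be rendered in Java as follows; the attribute names shown are assumptions made for this example and are not part of the described system.

    import java.util.ArrayList;
    import java.util.List;

    class Cup {                                         // main class 1402
        String designId;
        Container container;                            // container subclass 1404
        List<Handle> handles = new ArrayList<Handle>(); // handle subclass 1404; may be empty
    }

    class Container {
        String color;
        Sidewall sidewall;                              // lower-level subclass 1408
        Base base;                                      // lower-level subclass 1408
    }

    class Handle {
        String curvature;
        String position;                                // models the handle-position relationship
    }

    class Sidewall { String texture; double heightMm; }

    class Base { double diameterMm; }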
Although not explicitly depicted in Fig. 14A, it should be appreciated that one or more translation engines 1308 may coordinate communication between the tools 1302, which require one view of data, and the RIM 1304, which may store data in a different format. More generally, each one of the tools 1302 depicted in Fig. 14A may have a somewhat different or completely different characteristic data model and format to view the cup data stored in the RIM 1304. That is, where a data item is a cup, characteristics of the cup may be stored in the RIM 1304 as attributes and attribute values for the cup design associated with the data item. In a retrieval access request, the tools 1302 may provide their associated translation engines 1308 with the identification of a cup data item in cup class 1402 to be retrieved, and will expect to receive at least some of the data item's attribute data, which may be identified in the request, in response. Similarly, in response to an access request of the storage type, such tools will provide their associated translation engines 1308 with the identification of the cup data item to be updated or created and the associated attribute information to be updated or to be used in creating a new data item.
Other tools 1302 may have characteristic data models and formats that view the cups separately as the container and handle entities in the subclasses 1404, rather than as the main cup class 1402 having attributes for the container and the handle. In that view, there may be two data items, namely "container" and "handle," associated with each cup, each of which has attributes that describe the respective container and handle. In that case, each data item may be independently retrievable and updateable, and new data items may be separately created for each of the two classes. For such a view, the tools 1302 will, in an access request of the retrieval type, provide their
associated translation engines 1308 with the identification of a container or a handle to be retrieved, and will expect to receive the data item's attribute data in response. Similarly, in response to an access request of the storage type, such tools 1302 will provide their associated translation engines 1308 with the identification of the "container" or "handle" data item to be updated or created and the associated attribute data. Accordingly, these tools 1302 view the container and handle data separately, and can retrieve, update and store container and handle attribute data separately.
As another example using the same atomic data structure in the RIM 1304, tools 1302 may have characteristic formats which view the cups separately as sidewall, base and handle entities in classes 1402-1408. In such a view, there may be three data items, namely, a sidewall, a base, and a handle associated with each cup, each of which has attributes that describe the respective sidewall, base and handle of the cup. In that case, each data item may be independently created, retrieved, or updated. For such a view, the tools 1302 may provide their associated translation engines 1308 with the identification of a sidewall, base or a handle whose data item is to be operated on, and may perform operations (such as create, retrieve, store) separately for each.
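The aggregation performed by a translation engine 1308 for these differing views may be pictured with a short sketch, reusing the illustrative Cup classes above; the mapping shown is hypothetical and is only one of many possible translations.

    import java.util.HashMap;
    import java.util.Map;

    class WholeCupView {
        // Folds the atomic container and handle items into one cup-level record.
        static Map<String, Object> render(Cup c) {
            Map<String, Object> view = new HashMap<String, Object>();
            view.put("color", c.container.color);
            view.put("sidewallTexture", c.container.sidewall.texture);
            view.put("baseDiameterMm", c.container.base.diameterMm);
            view.put("handleCount", c.handles.size());
            return view;
        }
    }

    class PartsView {
        // Exposes the same atomic items separately for part-oriented tools.
        static Sidewall sidewall(Cup c) { return c.container.sidewall; }
        static Base base(Cup c) { return c.container.base; }
    }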
As described above, the RIM 1304 may store cup data in an "atomic" data model and format. That is, with the class structure as depicted in Fig. 14A, the RIM 1304 may store the data as data items corresponding to each class and subclass in a consistent data structure, such as a data structure reflecting the most detailed format for the class structure employed by the collective tools 1302.
Translation engines 1308 may translate between the views maintained by each tool 1302 and the atomic data structures maintained by the RIM 1304, based upon relationships between the atomic data structures in the RIM 1304 and the view of the data used by the tool 1302. The translation engines 1308 may perform a number of functions when translating between tool 1302 views and RIM 1304 data structures, such as combining or separating classes or subclasses, translating attribute names or identifiers, generating or removing attribute values, and so on. The required translations may arise in a number of contexts, such as creating data items, retrieving data items, deleting data items, or modifying data items. As new tools 1302 are added to the data integration system 104, the system 104 may update data structures in the RIM 1304, as well as generate translation engines 1308 that may be required for new tools 1302. Existing translation engines 1308 may also need to be updated where the underlying data structure used within the RIM 1304 has been changed to accommodate the new tools 1302, or where the data structure has been reorganized for other reasons.
More generally, as the data integration system 104 is adapted to new demands, or new thinking about existing demands, the system 104 may update and regenerate the underlying class structure for the RIM 1304 to create new atomic models for data. At the same time, translation engines 1308 may be revised to re-map tools 1302 to the new data structure of the RIM 1304. This latter function may involve only those translation engines 1308 that are specifically related to newly composed data structures, while others may continue to be used without modification. An operator, using the data integration system 104, may determine and specify the mapping relationships between the data models and formats used by the respective tools 1302 and the data model and format used by the RIM 1304, and may maintain a rules database from the mapping relationships which may be used to generate and update the respective translation engines 1308.
In order to ensure accurate propagation of updates through the RIM 1304, the data integration system 104 may associate each tool 1302 with a class whose associated data item(s) will be deemed "master physical items," and a specific relationship, if any, to other data items. For example, the data integration system 104 may select as the master physical item the particular class that appears most semantically equivalent to the object of the tool's data
model. Other data items, if any, which are related to the master physical item, may be deemed secondary physical items in a graph. For example, the cup class may contain master physical items for tools 1302 that operate on an entire cup design. The arrows designated as "RELATIONSHIPS" in Fig. 14A show possible relationships between master physical items and secondary physical items. In performing an update operation, a directed graph that is associated with the data items to be updated may be traversed from a master physical item, with the appropriate attributes and values updated along the way. In traversing the directed graph, conventional graph-traversal algorithms can be used to ensure that each data item in the graph, as a graph node, is appropriately visited and updated, thereby ensuring that updates propagate to every related data item.
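A conventional traversal of this kind may be sketched as follows; the DataItem structure and its fields are illustrative assumptions, and any standard depth-first or breadth-first algorithm could be substituted.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class DataItem {
        final List<DataItem> related = new ArrayList<DataItem>(); // secondary physical items
        void applyUpdate(Map<String, Object> attrs) { /* merge attribute values here */ }
    }

    class GraphUpdater {
        // Visits every node reachable from the master physical item exactly once.
        static void update(DataItem master, Map<String, Object> attrs) {
            Deque<DataItem> pending = new ArrayDeque<DataItem>();
            Set<DataItem> visited = new HashSet<DataItem>();
            pending.push(master);
            while (!pending.isEmpty()) {
                DataItem node = pending.pop();
                if (!visited.add(node)) continue;   // a shared sub-item is visited only once
                node.applyUpdate(attrs);
                for (DataItem next : node.related) pending.push(next);
            }
        }
    }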
The above example generally describes metadata management in an object-oriented programming environment. However, it will be appreciated that a variety of software paradigms may be usefully employed with data in an enterprise computing system 1300. For example, an aspect-oriented programming system is described with reference to Fig. 14B, and may be usefully employed with the enterprise computing system 1300 described above. An example of a tool 1302 with functions 1410 is shown in the figure. Each function 1410 may be written to interact with several external services, such as ID logging 1412 and metadata updating 1414. In a typical object-oriented environment, the external services 1412-1418 must often be "crosscut" to respond to the functions 1410 that call them, i.e., recoded to correspond to the calls of an updated function 1410 of the tool 1302.
As an example, in skeleton code, object oriented programming ("OOP") code for functions 1410 that perform login and validation may look like:
DataValidation( ... )
    // Login user code
    // Validate access code
    // Lock data objects against another function's use code
    // ===== Data Validation Code =====
    // Log out user code
    // Unlock data object code
    // Update metadata with latest access code
    // More operations follow the same pattern as above
In the above example, the code of the functions 1410 invokes actions with the outside services 1412-1418. So-called crosscutting occurs wherever the application writer must recode the outside services 1412-1418, and may be required for proper interaction of the code. This may significantly increase the complexity of a redesign, compounding the time required and the potential for error.
In Aspect-Oriented Programming (AOP), the resulting code for the functions 1410 may be similar to the OOP code (in fact, AOP may be deployed using OOP platforms, such as C++). But in an AOP environment, the application writer will code only the function-specific logic for the functions 1410, and use a set of weaver rules to define how the logic accesses the external services 1412-1418. The weaver rules describe when and how the functions 1410 should interact with the other services, thereby weaving the core code of the tools 1302 and the external services 1412-1418 together. When the code for the functions 1410 is compiled, the weaver will combine the core code with support code that calls the proper independent service, creating the final function 1410. In skeleton code, the typical AOP code for a function 1410 may look like:
DataValidation( ... )
    // Data Validation Logic
The crosscutting code is removed from the code for the function 1410. The application writer may then create weaver rules to apply to the AOP code. In skeleton code, the weaver rules for the functions 1410 may include:
1) ID log in at each operation start
2) ID log out at each operation end
3) Update metadata after the final operation
The resulting AOP skeleton code for the function 1410 may look like:
DataValidation( ... )
    -ID Logger.in
    // Data Validation Logic
    -ID Logger.out
    -Metadata.update
The simplified code created by the application writer allows full concentration to be placed on creating the tool 1302 without concern for the required crosscutting code. Similarly, a change to one of the services 1412-1418 may not require any changes to the functions 1410 of the tool 1302. Structuring code in this manner may significantly reduce the possibility of coding errors when creating or modifying a tool 1302, and may simplify service updates for the external services 1412-1418.
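By way of illustration only, weaver rules of this kind might be expressed in AspectJ, a Java-based AOP language; the IDLogger and Metadata classes below are hypothetical stand-ins for the external services 1412-1418.

    class IDLogger {
        static void in()  { /* ID log entry service */ }
        static void out() { /* ID log exit service */ }
    }

    class Metadata {
        static void update() { /* metadata update service */ }
    }

    public aspect ValidationWeaver {
        pointcut operation(): execution(* DataValidation(..));

        before(): operation() { IDLogger.in(); }   // rule 1: ID log at each operation start

        after(): operation() {
            IDLogger.out();                        // rule 2: ID log out at each operation end
            Metadata.update();                     // rule 3: update metadata after the operation
        }
    }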
It will also be appreciated that translation engines 1308 are only one possible method of handling the data and metadata in an enterprise computing system 1300. The translation engines 1308 may include, or consist of, bridges 1120, as described above, or may employ a least common factor method where the data that is passed through a translation engine 1308 is compatible with both computing systems connected by the translation engine 1308. In yet a further embodiment, the translation may be performed on a standardized facility such that all computing platforms that conform to the standards can communicate and extract data through the standardized facility. There are many other methods of handling data and its associated metadata that are contemplated, and may be usefully employed with the enterprise computing system 1300 described herein.
With this background, specific operations performed by the data integration system 104, tools 1302, and translation engines 1308 will now be described in greater detail.
Fig. 15 is a flow diagram showing a process 1500 for using a metadata management system 312, or metadata facility, in connection with a data integration system 104. Initially, a new tool 1302 may be added to the data integration system, as depicted in step 1502. As shown, the data integration system 104 may initially receive information as to the current atomic data model and format of the RIM 1304 (if any) and the data model and format of the tool 1302 to be added. As shown in step 1503, a determination may then be made whether the new tool 1302 is the first tool 1302 to be added to the data integration system 104. If the new tool 1302 is the first tool 1302, then the process 1500 may proceed to step 1504 where atomic data models are selected, using either the views required by the tool 1302, or any other finer-grained data model and format selected by a user.
If the new tool 1302 is not the first tool 1302, then the process 1500 may proceed to step 1508, where correspondences are determined between the new tool's data model and format, including the new tool's class and attribute structure, and the class and attribute structure of the RIM's current atomic data model and format. A RIM 1304 and translation engine 1308 update rules database may be generated from these correspondences. As shown in step 1510, the data integration system 104 may use the rules database to
update the RIM's atomic data model and format and the existing translation engines 1308 as described above. The data integration system 104 may also establish a translation engine 1308 for the tool 1302 that is being added.
As depicted generally in Fig. 16, once a translation engine 1308 has been generated or updated for a tool 1302, the translation engine 1308 can be used in connection with various operations of the tool 1302. As shown in step 1602, a tool 1302 may generate an access request, which may be transferred to an associated translation engine 1308. After receiving the access request, the translation engine 1308 may determine the request type, such as whether the request is a retrieval request or a storage request, as shown in step 1604. As shown in step 1608, if the request is a retrieval request, the translation engine 1308 may use its associations between the tool's data models and format and the RIM's data models and format to translate the request into one or more requests for the RIM 1304. Upon receiving responsive data items from the RIM 1304, the translation engine 1308 may convert the data items from the model and format received from the RIM 1304 to the model and format required by the tool 1302, and may provide the data items to the tool 1302 in the appropriate format.
As shown in step 1614, if the translation engine 1308 determines that the request is a storage request, including a request to update a previously-stored data item, the translation engine 1308 may, with the RIM 1304, generate a directed graph for the respective classes and subclasses from the master physical item associated with the tool 1302. If the operation is an update operation, the directed graph will comprise, as graph nodes, existing data items in the respective classes and subclasses. If the operation is to store new data, the directed graph will comprise, as graph nodes, empty data items which can be used to store the new data included in the request. After the directed graph has been established, the translation engine 1308 and RIM 1304 operate to traverse the graph and establish or update the contents of the data items as required by the request, as shown in step 1618. After the graph traversal operation is complete, the translation engine 1308 may notify the tool 1302 that the storage operation is complete, as shown in step 1620.
A data integration system 104 as described above may provide significant advantages. For example, the system 104 may provide for the efficient sharing and updating of information by a number of tools 1302 in an enterprise computing system 1300, without constraining the tools 1302 to specific data models, and without requiring information exchange programs that exchange information between different tools 1302. The data integration system 104 may provide a RIM 1304 that maintains data in an atomic data model and format that may be used for any of the tools 1302 in the system 104, and the format may be readily updated and evolved in a convenient manner when a new tool 1302 is added to the system 104. Further, by explicitly associating each tool 1302 with a master physical item class, directed graphs may be established among data items in the RIM 1304. As a result, updating of information in the RIM 1304 can be efficiently accomplished using conventional directed graph traversal procedures.
Fig. 17 is a schematic diagram showing a parallel execution facility 1700 for parallel execution of a plurality of processes of a data integration process. In an embodiment, the facility 1700 may involve a process initiation facility 1702. The process initiation facility 1702 may determine the scope of the job that needs to be run and determine that a first and second process may be run simultaneously (e.g., because they are not dependent on one another). Once the determination is made, the two processing facilities 1704 and 1708 may run the first process and the second process respectively. Following the execution of these two jobs, a third process may be undertaken on another processing facility 1712. Once the third process is complete, the corresponding processing facility 1712 may communicate information to a transformation facility 1714. In an embodiment, the transformation facility 1714 may not begin the transformation process until it has received information 1718 from one or more other parallel processes, such as the first and second processing facilities 1704, 1708. Once all of the information is presented, the transformation facility 1714 may perform the transformation. This parallel process flow minimizes run time by running several processes at one time (e.g., processes that are not dependent on one another) and then presenting the information from the two or more parallel executions to a common facility (e.g., where the common facility is dependent on the results of the two parallel facilities). In this embodiment, the several processing facilities are depicted as separate facilities for ease of explanation. However, it should be understood that two or more of these facilities may be the same physical facilities. It should also be understood that two or more of the processing facilities may be different physical facilities and may reside in various physical locations (e.g., facility 1704 may reside in one physical location and facility 1708 may reside in another physical location).
Fig. 18 is a flow diagram showing steps for parallel execution of a plurality of processes of a data integration process. In an embodiment, a parallel process flow may involve step 1802, wherein the job sequence is determined. Once the job sequence is determined, the job may be sent to two or more process facilities, as shown in step 1804. In step 1808, a first process facility may receive and execute certain routines and programs and communicate the processed information to a third process facility. In step 1810, a second process facility may receive and execute certain routines and programs and, once complete, communicate the processed information to the third process facility. The third process facility may wait to receive the processed information from the first two process facilities before running its own routines on the two sources of information. Again, it should be understood that the process facilities might be the same facilities or reside in the same location, or the process facilities may be different and/or reside in different locations.
More generally, scalable architectures using parallel processing may include SMP, clustering, and MPP platforms, and grid computing solutions. These may be deployed in a manner that does not require modification of underlying data integration processes. Current commercially available parallel databases that may be used with the systems described herein include IBM DB2 UDB, Oracle, and Teradata databases.
A concept related to parallelism is pipelining, in which records are moved directly through a series of processing functions defined by the data flow of a job. Pipelining provides numerous processing advantages, such as removing requirements for interim data storage and removing input/output management between processing steps. Pipelining may be employed within a data integration system to improve processing efficiency.
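Returning to the fork-and-join flow of Figs. 17 and 18, that pattern might be sketched with standard Java concurrency utilities as follows; the process and transform bodies are placeholders, not actual data integration logic.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFlow {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            // Independent first and second processes run concurrently (facilities 1704, 1708).
            Future<String> first = pool.submit(new Callable<String>() {
                public String call() { return runProcess("first"); }
            });
            Future<String> second = pool.submit(new Callable<String>() {
                public String call() { return runProcess("second"); }
            });
            // The transformation facility (1714) blocks until both results have been presented.
            System.out.println(transform(first.get(), second.get()));
            pool.shutdown();
        }

        static String runProcess(String name) { return name + "-output"; }     // placeholder work
        static String transform(String a, String b) { return a + " + " + b; }  // placeholder join
    }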
Fig. 19 is a schematic diagram showing a data integration job 1900, comprising inputs from a plurality of data sources and outputs to a plurality of data targets. It may be desirable to collect data from several data sources 1902A, 1902B and 1902C, which may be any of the data sources 102 described above, and use the combination of the data in a business enterprise. In an embodiment, a data integration system 104 may be used to collect, cleanse, transform or otherwise manipulate the data from the several data sources 1902A, 1902B and 1902C and to store the data in a common data warehouse or database 1908, which may be any of the databases 112 described above, such that it can be accessed from various tools, targets, or other computing systems. This may include, for example, the data integration process 500 described above. The data integration system 104 may store the collected data in the storage facility 1908 such that it can be directly accessed from the various tools 1910A and 1910B, which may be the tools 1302 described above, or the tools may access the data through data translators 1904A and 1904B, which may be the translation engines 1308 described above, whether automatically, manually or semi-automatically generated as described herein. The data translators 1904A, 1904B are illustrated as separate facilities; however, it should be understood that they may be incorporated into the data integration system 104, a tool 1302, or otherwise located to accomplish the desired tasks.
Fig. 20 is a schematic diagram showing another data integration job 1900, comprising inputs from a plurality of data sources and outputs to a plurality of data targets. It may be desirable to collect data from several data sources 1902A, 1902B and 1902C, which may be any of the data sources 102 described above, and use the combination of the data in a business enterprise. In an embodiment, a data integration system 104 may collect, cleanse, transform or otherwise manipulate the data from the several data sources 1902A, 1902B and 1902C and pass on the collected information in a combined manner to several targets 1910A and 1910B, which may also be any of the data sources 102 described above. This may be accomplished in real-time or in a batch mode for example. Rather than storing all of the collected information in a central database to be accessed at some point in the future, the data integration system 104 may collect and process the data from the data sources 1902 A, 1902B and 1902C at or near the time the request for data is made by the targets 1910A and 1910B. It should be understood that the data integration system 104 might still include memory in an embodiment such as this. In an embodiment, the memory may be used for temporarily storing data to be passed to the targets when the processing is completed. The embodiments of a data integration job 1900 described in reference to Fig. 19 and Fig. 20 are generic. It will be appreciated that such a data integration job 1900 may be applied in numerous commercial, educational, governmental, and other environments, and may involve many different types of data sources 102, data integration systems 104, data targets, and/or databases 112.
Fig. 21 shows a graphical user interface 2102 whereby a data manager for a business enterprise may design a data integration job 1900. In an embodiment, a graphical user interface 2102 may be presented to the user to facilitate setting up a data integration job. The user interface may include a palette of tools 2106 including databases, transformation tools, targets, path identifiers, and other tools to be used by a user. The user may graphically manipulate tools from the palette of tools 2106 into a workspace 2104 using, e.g., drag and drop operations, drop down menus, command lines, and any other controls, tools, toolboxes, or other user interface components. The workspace 2104 may be used to lay out the databases, path of data flow, transformation steps and the like to configure a data integration job, such as the data integration jobs 1900 described above. In an embodiment, once the job is configured, it may be run from this or another user interface. The user interface 2102 may be generated by an application or other programming environment, or as a web page that a user may access using a web browser.
Fig. 22 shows another embodiment of a graphical user interface 2102 with which a data manager can design a data integration job 1900. In an embodiment, a user may use the graphical user interface 2102 to select icons that represent data targets/sources, and to associate these icons with functions or other relationships. In this environment, the user may create associations or command structures between the several icons to create a data integration job 2202, which may be any of the data integration jobs 1900 described above.
The user interface 2102 may provide access to numerous resources and design tools within the platform 100 and the data integration system 104. For example, the user interface 2102 may include a type designer for data object modeling. The type designer may be used to create and manage type trees that define properties for data structures, define containment of data, create data validation rules, and so on. The type designer may include importers for automatically generating type trees (i.e., data object definitions) for data that is described in formats such as XML, COBOL Copybooks, and structures specific to applications such as SAP R/3, BEA Tuxedo, and PeopleSoft EnterpriseOne. The user interface 2102 may include a map designer used to formulate transformation and business rules.
The map designer may use definitions of data objects created with the type designer as inputs and outputs, and may be used to specify rules for transforming and routing data, as well as the environment for analyzing, compiling and testing the maps that are developed.
A database design interface may be provided as a modeling component to import metadata about queries, tables and stored procedures for data stored in relational databases. The database design interface may identify characteristics, such as update keys and database triggers, of various objects to meet mapping and execution requirements. An integration flow designer may be used to define and manage data integration processes. The integration flow designer may more specifically be used to define interactions among maps and systems of maps, to validate the logical consistency of workflows, and to prepare systems of maps to run. A command server component may be provided for command-driven execution within the graphical user interface. This may be employed, for example, for testing of maps within the map designer environment. A resource registry may provide a resource alias repository, used to abstract parameter settings using aliases that resolve at execution time to specific resources within an enterprise.
The user interface 2102 may also provide access to various administration and management tools. For example, an event server administration tool may be provided from which a user can specify deployment directories, configure users and user access rights, specify listening ports, and define properties for Java Remote Method Invocation ("RMI"). A management console may provide management and monitoring for the event server, from which a user can start, stop, pause, and resume the system, and view information about the status of the event server and maps being run. An event server monitor may provide dynamic detailed views of single maps as they run, and create snapshots of activity at a specific time.
Fig. 23 represents a platform 2300 for facilitating integration of various data of a business enterprise. The platform may be, for example, the platform 100 described above, and may include an integration suite that is capable of providing known enterprise application integration (EAI) services, such as extraction of data from various sources, transformation of the data into desired formats and loading of data into various targets, sometimes referred to as ETL (Extract, Transform, Load). The platform 2300 may include a real-time integration ("RTI") service 2704 that facilitates exposing a conventional data integration platform 2702 as a service that can be accessed by computer applications of the enterprise, including through web service protocols 2302 and bindings such as Enterprise Java Beans ("EJB") and the Java Messaging Service ("JMS"). While the product suite may be a data integration or enterprise application integration suite, as described in detail in the examples below, it will be appreciated that any suite of interrelated programs or functions may be deployed and managed as services in a services oriented architecture using the principles described herein. Thus, for example, the product suite may include an office automation or productivity suite that includes tools such as automated document assembly, word processing, spreadsheets, and the like. Or the product suite may be a computer automated design suite with separate software packages or functions deployed as various services for, e.g., design, visual rendering, costing, simulation, and the like. As another example, an integrated suite of financial products such as bookkeeping, accounting, tax preparation, and electronic tax filing products may be deployed as services, either at a product level or as collections of services specific to each product, using the services oriented architecture. Similarly, economic modeling tools, financial analysis tools (for, e.g., individual company financials, stock trading patterns, market trading patterns, and the like), and electronic trading tools may be deployed as services in an integrated stock trading suite.
Fig. 24 shows a schematic diagram of a service-oriented architecture ("SOA") 2400. The SOA can be part of the infrastructure of an enterprise computing system 1300 of a business enterprise. In the SOA 2400, services
become building blocks for application development and deployment, allowing rapid application development and avoiding redundant code. Each service embodies a set of business logic or business rules that can be blind to the surrounding environment, such as the source of the data inputs for the service or the targets for the data outputs of the service. As a result, services can be reused in connection with a variety of applications, provided that appropriate inputs and outputs are established between the service and the applications. The service-oriented architecture 2400 allows the service to be protected against environmental changes, so that the architecture functions even if the surrounding computing environment is changed. As a result, services may not need to be recoded as a result of infrastructure changes, which may result in savings of time and effort. Fig. 24 depicts an embodiment of an SOA 2400 for a web service. As used herein, variations such as "service-oriented architecture", "services oriented architecture", "SOA", and the like are intended to be used interchangeably to refer generally to an SOA 2400 as described with reference to Fig. 24, and more generally as described throughout this specification.
In the SOA 2400 of Fig. 24, there are three entities: a service provider 2402, a service requester 2404 and a service registry 2408. The registry 2408 may be public or private. The service requester 2404 may search a registry 2408 for an appropriate service. Once an appropriate service is discovered, the service requester 2404 may receive code, such as Web Services Description Language ("WSDL") code, that is necessary to invoke the service. WSDL is an XML-based language conventionally used to describe web services. The service requester 2404 may then interface with the service provider 2402, such as through messages in appropriate formats (such as the Simple Object Access Protocol ("SOAP") format for web service messages), to invoke the service. The SOAP protocol is a preferred protocol for transferring data in web services. The SOAP protocol defines the exchange format for messages between a web services client and a web services server. The SOAP protocol uses an Extensible Markup Language ("XML") schema, XML being a generic language specification commonly used in web services for tagging data, although other markup languages may be used.
Fig. 25 shows an example of a SOAP message. The SOAP message 2502 may include a transport envelope 2504 (such as an HTTP or JMS envelope, or the like), a SOAP envelope 2508, a SOAP header 2510 and a SOAP body 2512. A SOAP request and the corresponding SOAP response each follow this general structure.
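By way of illustration only, a SOAP request of this general form might be constructed in Java with the standard SAAJ API (javax.xml.soap); the operation name, namespace, and argument below are hypothetical.

    import javax.xml.soap.MessageFactory;
    import javax.xml.soap.SOAPBody;
    import javax.xml.soap.SOAPElement;
    import javax.xml.soap.SOAPEnvelope;
    import javax.xml.soap.SOAPMessage;

    public class SoapRequestSketch {
        public static void main(String[] args) throws Exception {
            MessageFactory factory = MessageFactory.newInstance();
            SOAPMessage request = factory.createMessage();           // transport-neutral SOAP message
            SOAPEnvelope envelope = request.getSOAPPart().getEnvelope();
            SOAPBody body = envelope.getBody();
            // Hypothetical operation in a hypothetical service namespace.
            SOAPElement op = body.addChildElement(
                    envelope.createName("getCustomerRecord", "rti", "http://example.com/rti"));
            op.addChildElement("customerId").addTextNode("12345");
            request.saveChanges();
            request.writeTo(System.out);   // serializes the SOAP envelope, header and body as XML
        }
    }

A response message follows the same envelope structure, with the body carrying the returned data rather than the request arguments.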
Web services can be modular, self-describing, self-contained applications that can be published, located and invoked across the web. For example, in the embodiment of the web service of Fig. 24, the service provider
2402 publishes the web service to the registry 2408, which may be, for example, a Universal Description, Discovery and Integration (UDDI) registry, which provides a listing of what web services are available, or a private registry or other public registry. The web service can be published, for example, in WSDL format. To discover the service, the service requester 2404 may browse the service registry and retrieve the WSDL document. The registry 2408 may include a browsing facility and a search facility. The registry 2408 may store the WSDL documents and their metadata.
To invoke the web service, the service requester 2404 sends the service provider 2402 a SOAP message 2502 as described in the WSDL, receives a SOAP message 2502 in response, and decodes the response message as described in the WSDL. Depending on their complexity, web services can provide a wide array of functions, ranging from simple operations, such as requests for data, to complicated business process operations. Once a web service is deployed, other applications (including other web services) can discover and invoke the web service.
Other web services standards are being defined by the Web Services Interoperability Organization (WS-I), an open industry organization chartered to promote interoperability of web services across platforms. Examples include WS-Coordination, WS-Security, WS-Transaction, WSIF, BPEL and the like, and the web services described herein should be understood to encompass services contemplated by any such standards. Referring to Fig. 26, a WSDL definition 2600 is an XML schema that defines the interface, location and encoding scheme for a web service. The definition 2600 defines the service 2602, identifies the port 2604 through which the service 2602 can be accessed (such as an Internet address), and defines the bindings 2608 (such as Enterprise Java Bean or SOAP bindings) that are used to invoke the web service and communicate with it. The WSDL definition 2600 may include an abstract definition 2610, which may define the port type 2612, incoming message parts 2616 and outgoing message parts 2618 for the web service, as well as the operations 2614 performed by the service.
There are a variety of web services clients from various providers that can invoke web services. Web services clients include .Net applications, Java applications (e.g., JAX-RPC), applications in the Microsoft SOAP toolkit (Microsoft Office, Microsoft SQL Server, and others), applications from SeeBeyond, WebMethods, Tibco and BizTalk, as well as Ascential's DataStage (WS PACK). It should be understood that other web services clients may also be used in the enterprise data integration methods and systems described herein. Similarly, there are various web services providers, including .Net applications, Java applications, applications from Siebel and SAP, i2 applications, DB2 and SQL Server applications, enterprise application integration (EAI) applications, business process management (BPM) applications, and Ascential Software's Real Time Integration (RTI) application, all of which may be used with web services clients as described herein.
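As a sketch only, a Java client might invoke such a service through the JAX-RPC dynamic invocation interface as follows; the endpoint address, namespace, and operation name are hypothetical.

    import javax.xml.namespace.QName;
    import javax.xml.rpc.Call;
    import javax.xml.rpc.ParameterMode;
    import javax.xml.rpc.Service;
    import javax.xml.rpc.ServiceFactory;

    public class RtiClientSketch {
        public static void main(String[] args) throws Exception {
            String ns = "http://example.com/rti";                        // hypothetical namespace
            ServiceFactory factory = ServiceFactory.newInstance();
            Service service = factory.createService(new QName(ns, "CustomerService"));
            Call call = service.createCall();
            call.setTargetEndpointAddress("http://example.com/rti/customer"); // hypothetical endpoint
            call.setOperationName(new QName(ns, "getCustomerRecord"));
            QName xsdString = new QName("http://www.w3.org/2001/XMLSchema", "string");
            call.addParameter("customerId", xsdString, ParameterMode.IN);
            call.setReturnType(xsdString);
            Object result = call.invoke(new Object[] { "12345" });       // SOAP request/response underneath
            System.out.println(result);
        }
    }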
The RTI services 2704 described herein may use an open standard specification such as WSDL to describe a data integration process service interface. When a data integration service definition is complete, it can use the WSDL web service definition language (a language that is not necessarily specific to web services) to provide an abstract definition that gives the name of the service, the operations of the service, the signature of each operation, and the bindings for the service, as described generally above. Within the WSDL definition 2600 (an XML document) there are various tags, with the structure described in connection with Fig. 26. For each service, there can be multiple ports, each of which has a binding. The abstract definition is the RTI service definition for the data integration service in question. The port type is an entry point for a set of operations, each of which has a set of input arguments and output arguments. WSDL was originally defined for web services with only one binding (SOAP over HTTP), but has since been extended through industry bodies to include WSDL extensions for various other bindings, such as EJB, JMS, and the like. An RTI service 2704 may use WSDL extensions to create bindings for various other protocols. Thus, a single RTI data integration service can support multiple bindings at the same time. As a result, a business can take a data integration process 500, expose it as a set of abstract processes (completely agnostic to protocols), and then add the bindings. A service can support any number of bindings.
A user may take a preexisting data integration job 1900, add appropriate RTI input and output phases, and expose the job as a service that can be invoked by various applications that use different native protocols.
Referring to Fig. 27, a high-level architecture is represented for a data integration platform 2700, which may be deployed, for example, across the platform 100 described above and adapted for real time data integration. A conventional data integration facility 2702, which may be, for example, the data integration system 104 described above, may provide methods and systems for processing a data integration job. The data integration facility 2702
may connect to one or more applications through a real time integration facility, or RTI service 2704, which comprises a service in a service-oriented architecture. The RTI service 2704 can invoke or be invoked by various applications 2708 of the enterprise. The data integration facility 2702 can provide matching, standardization, transformation, cleansing, discovery, metadata, parallel execution, and similar facilities that are required to perform data integration jobs. In embodiments, the RTI service 2704 exposes the data integration jobs of the data integration facility 2702 as services that can be invoked in real time by applications 2708 of the enterprise. The RTI service 2704 exposes the data integration facility 2702, so that data integration jobs can be used as services, synchronously or asynchronously. The jobs can be called, for example, from enterprise application integration platforms, application server platforms, as well as Java and .Net applications. The RTI service 2704 allows the same logic to be reused and applied across batch and real-time services. The RTI service 2704 may be invoked using various bindings 2710, such as Enterprise Java Bean (EJB), Java Message Service (JMS), or web service bindings.
Referring to Fig. 28, in embodiments, the RTI service 2704 runs on an RTI server 2802, which acts as a connection facility for various elements of the real time data integration process. For example, the RTI server 2802 can connect a plurality of enterprise application integration servers, such as DataStage servers from Ascential Software of Westborough, Massachusetts, so that the RTI server 2802 can provide pooling and load balancing among the other servers. The RTI server 2802 may comprise a separate J2EE application running on a J2EE application server. More than one RTI server 2802 may be included in a data integration process.
J2EE provides a component-based approach to design, development, assembly and deployment of enterprise applications. Among other things, J2EE offers a multi-tiered, distributed application model, the ability to reuse components, a unified security model, and transaction control mechanisms. J2EE applications are made up of components. A J2EE component is a self-contained functional software unit that is assembled into a J2EE application with its related classes and files and that communicates with other components.
The J2EE specification defines various J2EE components, including: application clients and applets, which are components that run on the client side; Java Servlet and JavaServer Pages (JSP) technology components, which are Web components that run on the server; and Enterprise JavaBean (EJB) components (enterprise beans), which are business components that run on the server. J2EE components are written in Java and are compiled in the same way as any program. The difference between J2EE components and "standard" Java classes is that J2EE components are assembled into a J2EE application, verified to be well-formed and in compliance with the J2EE specification, and deployed to production, where they are run and managed by a J2EE server. There are three kinds of EJBs: session beans, entity beans, and message-driven beans. A session bean represents a transient conversation with a client. When the client finishes executing, the session bean and its data are gone. In contrast, an entity bean represents persistent data stored in one row of a database table. If the client terminates or if the server shuts down, the underlying services ensure that the entity bean data is saved. A message-driven bean combines features of a session bean and a Java Message Service ("JMS") message listener, allowing a business component to receive JMS messages asynchronously.
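For illustration, a message-driven bean that receives such asynchronous JMS messages might be sketched in EJB 2.x style as follows; the bean name and payload handling are assumptions made for this example.

    import javax.ejb.MessageDrivenBean;
    import javax.ejb.MessageDrivenContext;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    public class IntegrationRequestBean implements MessageDrivenBean, MessageListener {
        private MessageDrivenContext context;

        public void setMessageDrivenContext(MessageDrivenContext ctx) { this.context = ctx; }
        public void ejbCreate() { }
        public void ejbRemove() { }

        // Called by the container for each JMS message, asynchronously.
        public void onMessage(Message message) {
            try {
                if (message instanceof TextMessage) {
                    String payload = ((TextMessage) message).getText();
                    // hand the payload to the business component here
                }
            } catch (Exception e) {
                context.setRollbackOnly();   // let the container redeliver on failure
            }
        }
    }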
The J2EE specification also defines containers, which are the interface between a component and the low-level platform-specific functionality that supports the component. Before a web component, enterprise bean, or application client component can be executed, it must be assembled into a J2EE application and deployed into its container. The assembly process involves specifying container settings for each component in the J2EE application and for the J2EE application itself. Container settings customize the underlying support provided by the J2EE server, which
includes services such as security, transaction management, Java Naming and Directory Interface (JNDI) lookups, and remote connectivity.
Fig. 29 depicts an architecture 2900 for a typical J2EE server 2908 and related applications. The J2EE server 2908 comprises the runtime aspect of a J2EE architecture. A J2EE server 2908 provides EJB and web containers. The EJB container 2902 manages the execution of enterprise beans 2904 for J2EE applications.
Enterprise beans 2904 and their container 2902 run on the J2EE server 2908. The web container 2910 manages the execution of JSP pages 2912 and servlet components 2914 for J2EE applications. Web components and their container 2910 also run on the J2EE server 2908. Meanwhile, an application client container 2918 manages the execution of application client components. Application clients 2920 and their containers 2918 run on the client side. The applet container manages the execution of applets. The applet container may consist of a web browser and a Java plug-in running together on the client.
J2EE components are typically packaged separately and bundled into a J2EE application for deployment. Each component, its related files such as GIF and HTML files or server-side utility classes, and a deployment descriptor are assembled into a module and added to the J2EE application. A J2EE application and each of its modules has its own deployment descriptor. A deployment descriptor is an XML document with an .xml extension that describes a component's deployment settings. A J2EE application with all of its modules is delivered in an Enterprise Archive (EAR) file. An EAR file is a standard Java Archive (JAR) file with an .ear extension. Each EJB JAR file contains a deployment descriptor, the enterprise bean files, and related files. Each application client JAR file contains a deployment descriptor, the class files for the application client, and related files. Each Web module file contains a deployment descriptor, the Web component files, and related resources.
The RTI server 2802 may act as a hosting service for a real time enterprise application integration environment. The RTI server 2802 may be a J2EE server capable of performing the functions described herein. The RTI server 2802 may provide a secure, scalable platform for enterprise application integration services. The RTI server 2802 may provide a variety of conventional server functions, including session management, logging (such as Apache Log4J logging), configuration and monitoring (such as J2EE JMX), and security (such as J2EE JAAS, with SSL encryption administered via J2EE). The RTI server 2802 may serve as a local or private web services registry, and it can be used to publish web services to a public web service registry, such as the UDDI registry used for many conventional web services. The RTI server 2802 may perform resource pooling and load balancing functions among other servers, such as those used to run data integration jobs. The RTI server 2802 can also serve as an administration console for establishing and administering RTI services. The RTI server 2802 may operate in connection with various environments, such as JBOSS 3.0, IBM WebSphere 5.0, BEA WebLogic 7.0 and BEA WebLogic 8.1.
Once established, the RTI server 2802 may allow data integration jobs (such as DataStage and QualityStage jobs performed by the Ascential Software platform) to be invoked by web services, enterprise Java beans, Java message service messages, or the like. The approach of using a service-oriented architecture with the RTI server 2802 allows binding decisions to be separated from data integration job design. Also, multiple bindings can be established for the same data integration job. Because the data integration jobs are indifferent to the environment and can work with multiple bindings, it may be easier to reuse processing logic across multiple applications and across batch and real-time modes. Figure 30 shows an RTI console 3002 that may be provided for administering an RTI service. The RTI console 3002 may enable the creation and deployment of RTI services. Among other things, the RTI console
allows the user to establish what bindings will be used to provide an interface to a given RTI service and to establish parameters for runtime usage of the RTI service. The RTI console may be provided with a graphical user interface and run in any suitable environment for supporting such an interface, such as a Microsoft Windows-based environment or a web browser interface. Further detail on uses of the RTI console is provided below. The RTI console 3002 may be used by a designer to create a service, create operations of the service, attach a job to an operation of the service, and create the bindings desired by the user for implementing the service with various protocols.
Referring again to Fig. 27, the RTI service 2704 may sit between the data integration platform 2702 and various applications 2708. The RTI service 2704 may allow the applications 2708 to access the data integration platform 2702 in real time or in batch mode, synchronously or asynchronously. Data integration rules established in the data integration platform 2702 can be shared across an enterprise computing system 1300. The data integration rules may be written in any language, without requiring knowledge of the platform 2702. The RTI service 2704 may leverage web service definitions to facilitate real time data integration. The flow of the data integration job can, in accordance with the methods and systems described herein, be connected to a batch environment or the real time environment. The methods and systems disclosed herein include the concept of a container, a piece of business logic contained between a defined entry point and a defined exit point in a process. By configuring a data integration process as the business logic in a container, the data integration can be used in batch and real time modes. Once business logic is in a container, moving between batch and real time modes may be simplified. A data integration job can be accessed as a real time service, and the same data integration job can be accessed in batch mode, such as to process a large batch of files, performing the same transformations as in the real time mode.
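A minimal sketch of the container concept, with hypothetical names, is set forth below: the business logic between the entry point and the exit point is identical whether records arrive one at a time (real time) or in bulk (batch); only the surrounding driver differs.

    interface Container<IN, OUT> {
        OUT process(IN input);   // entry point -> business logic -> exit point
    }

    class BatchDriver {
        // Batch mode: apply the contained logic to a large set of records.
        static <IN, OUT> java.util.List<OUT> run(Container<IN, OUT> logic,
                                                 java.util.List<IN> records) {
            java.util.List<OUT> results = new java.util.ArrayList<OUT>();
            for (IN record : records) {
                results.add(logic.process(record));
            }
            return results;
        }
    }

    class RealTimeDriver {
        // Real time mode: apply the same contained logic to a single request.
        static <IN, OUT> OUT run(Container<IN, OUT> logic, IN request) {
            return logic.process(request);
        }
    }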
Referring to Fig. 31, further detail is provided of an architecture 3100 for enabling an embodiment of an RTI service 2704. The RTI server 2802 may include various components, including facilities for auditing 3104, authentication 3108, authorization 3110 and logging 3112, such as those provided by a typical J2EE-compliant server. The RTI server 2802 may also include a process pooling facility 3102, which can operate to pool and allocate resources, such as resources associated with data integration jobs running on data integration platforms 2702. The process pooling facility 3102 may provide server and job selection across the various servers that are running data integration jobs. Selection may be based on balancing the load among machines, or based on which data integration jobs are capable of running (or running most effectively) on which machines. The RTI server 2802 may also include binding facilities 3114, such as a SOAP binding facility 3116, a JMS binding facility 3118, and an EJB binding facility 3120. The binding facilities 3114 allow the interface between the RTI server 2802 and various applications, such as the web service client 3122, the JMS queue 3124 or a Java application 3128.
Referring still to Fig. 31, the RTI console 3002 may be the administration console for the RTI server 2802. The RTI console 3002 may allow an administrator to create and deploy an RTI service, configure the runtime parameters of the service, and define the bindings or interfaces to the service.
The architecture 3100 may include one or more data integration platforms 2702, which may comprise servers, such as DataStage servers provided by Ascential Software of Westborough, Massachusetts. The data integration platforms 2702 may include facilities for supporting interaction with the RTI server 2802, including an RTI agent 3132, which is a process running on the data integration platform 2702 that marshals requests to and from the RTI server 2802. Thus, once the process pooling facility 3102 selects a particular machine as the data integration platform 2702 for a real time data integration job, it may hand the request to the RTI agent 3132 for that data integration platform 2702. On the data integration platform 2702, one or more data integration jobs 3134, such as the data integration jobs 1900 described above, may be running. The data integration jobs 3134 may optionally
always be on, rather than having to be initiated at the time of invocation. For example, the data integration jobs 3134 may have already-open connections with databases, web services, and the like, waiting for data to come and invoke the data integration job 3134, rather than having to open new connections at the time of processing. Thus, an instance of the already-on data integration job 3134 may be invoked by the RTI agent 3132 and can commence immediately with execution of the data integration job 3134, using the particular inputs from the RTI server 2802, which might be a file, a row of data, a batch of data, or the like.
Each data integration job 3134 may include an RTI input stage 3138 and an RTI output stage 3140. The RTI input stage 3138 is the entry point to the data integration job 3134 from the RTI agent 3132, and the RTI output stage 3140 is the output stage back to the RTI agent 3132. With the RTI input and output stages, the data integration job 3134 can be a piece of business logic that is platform independent. The RTI server 2802 knows what inputs are required for the RTI input stage 3138 of each RTI data integration job 3134. For example, if the business logic of a given data integration job 3134 takes a customer's last name and age as inputs, then the RTI server 2802 may pass inputs in the form of a string and an integer to the RTI input stage 3138 of that data integration job 3134. The RTI input stage takes the input and formats it appropriately for whatever native application code is used to execute the data integration job 3134.
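An illustrative sketch of this marshaling step, using hypothetical names, follows: the server passes typed inputs to the input stage, which formats them into whatever native row representation the running job expects.

    // Hypothetical handle to a running, always-on job instance.
    interface JobInstance {
        void pushRow(Object[] row);
    }

    class RtiInputStageSketch {
        private final JobInstance job;

        RtiInputStageSketch(JobInstance job) {
            this.job = job;
        }

        // The server knows this job's signature: a last name and an age.
        void accept(String lastName, int age) {
            Object[] row = new Object[] { lastName, Integer.valueOf(age) };
            job.pushRow(row);   // hand the formatted row to the native job code
        }
    }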
In embodiments, the methods and systems described herein may enable a designer to define automatic, customizable mapping machinery from a data integration process to an RTI service interface. In particular, the RTI console 3002 may allow the designer to create an automated service interface for the data integration process. Among other things, it may allow a user (or a set of rules, or a program) to customize the generic service interface to fit a specific purpose. When there is a data integration job, with a flow of transactions, such as transformations, and with the RTI input stage 3138 and RTI output stage 3140, metadata for the job may indicate, for example, the format of data exchanged between components or stages of the job. A table definition describes what the RTI input stage 3138 expects to receive; for example, the input stage of the data integration job might expect three columns: one string and two integers. Meanwhile, at the end of the data integration job flow, the output stage may return columns that are in the form (string, integer). When the user creates an RTI service that is going to use this job, it is desirable for the operation that is defined to reflect what data is expected at the input and what data is going to be returned at the output. By comparison to a conventional object-oriented programming method, a service corresponds to a class, and an operation to a method, where a job defines the signature of the operation based on metadata, such as an RTI input table 3414 associated with the RTI input stage 3138 and an RTI output table 3418 associated with the RTI output stage 3140.
By way of example, a user might define (string, int, int) as the input arguments for a particular RTI operation at the RTI input table 3414. One could define the outputs in the RTI output table 3418 as a struct: (string, int). In embodiments the input and output might be single strings. If there are other fields (more columns), the user can customize the input mapping. Instead of having an operation with fifteen integers, the user can create a STRUCT (a complex type with multiple fields, each field corresponding to one element of the complex operation), such as Opt(struct(string, int, int)) : struct(string, int). The user can group the input parameters so that they are grouped as one complex input type. As a result, it is possible to handle an array, so that the transaction is defined as: Opt1(array(struct(string, int, int))). For example, the input structure could be (Name, SSN, age) and the output structure could be (Name, birthday). The array can be passed through the RTI service. At the end, the service outputs the corresponding reply for the array. Arrays allow grouping of multiple rows into a single transaction. In the RTI console 3002, a checkbox 5308 allows the user to "accept multiple rows" in order to enable arrays. To define the inputs, in the RTI
console 3002, a particular row may be checked or unchecked to determine whether it will become part of the signature of the operation as an input. A user may not want to expose a particular input column to the operation (for example because it may always be the same for a particular operation), in which case the user can fix a static value for the input, so that the operation only sees the variables that are not static values. A similar process may be used to map outputs for an operation, such as using the RTI console to ignore certain columns of output, an action that can be stored as part of the signature of a particular operation.
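For illustration, the struct-and-array signature discussed above might correspond to Java types along the following lines; the class, field, and method names are hypothetical:

    class CustomerIn {           // struct(string, int, int): (Name, SSN, age)
        String name;
        int ssn;
        int age;
    }

    class CustomerOut {          // struct(string, int): (Name, birthday)
        String name;
        int birthday;
    }

    interface CustomerOperation {
        // With "accept multiple rows" enabled, the operation takes an array,
        // grouping many rows into one transaction, and returns one
        // corresponding reply per input row.
        CustomerOut[] opt1(CustomerIn[] rows);
    }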
In embodiments, RTI service requests that pass through the data integration platform 2702 from the RTI server 2802 are delivered in a pipeline of individual requests, rather than in a batch or large set of files. The pipeline approach allows individual service requests to be picked up immediately by an already-running instance of a data integration job 3134, resulting in rapid, real-time data integration, rather than requiring the enterprise to wait for completion of a batch integration job. Service requests passing through the pipeline can be thought of as waves, and each service request can be marked by a start of wave marker and an end of wave marker, so that the RTI agent 3132 recognizes the initiation of a new service request and the completion of a data integration job 3134 for a particular service request. End-of-wave markers delineate a succession of units of work, each unit being separated by end-of-wave markers. The use of an end-of-wave marker may permit the system to do both batch and real time operations with the same service. In a batch environment, a data integration user typically wants to optimize the flow of data, such as to do the maximum amount of processing at a given stage and then transmit to the next stage in bulk, to reduce the number of times data has to be moved, because data movement is resource-intensive. In contrast, in a real time process, the data integration user may want to move each transaction request as fast as possible through the flow. The end-of-wave marker sends a signal that informs the job instance to flush the particular request on through the data integration job, rather than waiting for more data to start the processing (as a system typically would do in batch mode). A benefit of end-of-wave markers is that a given job instance can process multiple transactions at the same time, each of which is separated from the others by end-of-wave markers. Whatever is between two end-of-wave markers is a transaction.
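A minimal sketch of end-of-wave handling in a stage of an always-running job, using hypothetical names, is set forth below: rows accumulate into the current transaction, and an end-of-wave marker causes an immediate flush instead of waiting for a full batch buffer.

    // Hypothetical marker object separating units of work in the pipeline.
    final class EndOfWaveMarker { }

    class WaveAwareStage {
        private final java.util.List<Object[]> currentWave =
                new java.util.ArrayList<Object[]>();

        void onEvent(Object event) {
            if (event instanceof EndOfWaveMarker) {
                flush(currentWave);     // emit results for this transaction now
                currentWave.clear();    // ready for the next unit of work
            } else {
                currentWave.add((Object[]) event);  // a row within the transaction
            }
        }

        private void flush(java.util.List<Object[]> wave) {
            // transform the rows of this wave and pass them to the next stage
        }
    }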
Pipelining allows multiple requests to be processed simultaneously by a service. The load balancing algorithm of the process pooling facility 3102 may fill a single instance to its maximum capacity (filling the pipeline) before starting a new instance of the data integration job. In a real time integration model, where each request is processed as it arrives (unlike in a batch mode, where the system typically fills a buffer before processing the batch), the end-of-wave markers may allow pipelining of the multiple transactions into the flow of the data integration job. For load balancing, it may be desirable for the balance not to be based only on whether a job is busy, because a job may be busy while still having unused throughput capacity.
On the other hand, it may be desirable to avoid starting new data integration job instances before the capacity of the pipeline has reached its maximum. This means that load balancing needs to be dynamic and based on additional properties. In the RTI agent process, the RTI agent 3132 knows about the instances running on each data integration platform 2702 accessed by the RTI server 2802. In the RTI agent 3132, the user can create a buffer for each of the job instances running on the data integration platform 2702. Various parameters can be set in the RTI console 3002 to help with dynamic load balancing. One parameter is the maximum size of the buffer (measured in number of requests) that can be waiting for handling by the job instance. It may be preferable to have only a single request in the buffer, resulting in constant throughput, but in practice there are usually variances in throughput, so it is often desirable to have a buffer for each job instance. A second parameter is
the pipeline threshold, which specifies the point at which it may be desirable to initiate a new job instance. In embodiments, the threshold may generate a warning indicator, rather than automatically starting a new instance, because the delay may be the result of an anomalous increase in traffic. A third parameter may determine that if the threshold is exceeded for more than a specified period of time, then a new instance will be started. In sum, pipelining properties, such as the buffer size, threshold, and instance start delay, are parameters that the user may control.
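A hypothetical sketch of these pipelining parameters and the dynamic decision they drive is set forth below; the values shown are illustrative only.

    class PipelineConfig {
        int maxBufferSize = 100;          // maximum requests queued per instance
        int pipelineThreshold = 80;       // occupancy at which a warning is raised
        long instanceStartDelayMs = 5000; // how long the threshold may be exceeded
                                          // before a new job instance is started
    }

    class InstanceMonitor {
        private long thresholdExceededSince = -1;

        // Called periodically with the current buffer occupancy for an instance.
        boolean shouldStartNewInstance(int buffered, long now, PipelineConfig cfg) {
            if (buffered < cfg.pipelineThreshold) {
                thresholdExceededSince = -1;   // any traffic spike has passed
                return false;
            }
            if (thresholdExceededSince < 0) {
                thresholdExceededSince = now;  // warning condition begins here
                return false;
            }
            // Start a new instance only if the overload persists past the delay.
            return (now - thresholdExceededSince) >= cfg.instanceStartDelayMs;
        }
    }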
In embodiments, all of the data integration platforms 2702 are machines using the DataStage server from Ascential Software. On each of them, there can be data integration jobs 3134, which may be DataStage jobs. The presence of the RTI input stage 3138 means that a job 3134 is always up and running and waiting for a request, unlike in a batch mode, where a job instance is initiated at the time of batch processing. In operation, the data integration job 3134 is up and running with all of its requisite connections with databases, web services, and the like, and the RTI input stage 3138 is listening, waiting for data to come. For each transaction, an end-of-wave marker may travel through the stages of the data integration job 3134. The RTI input stage 3138 and RTI output stage 3140 are the communication points between the data integration job 3134 and the rest of the RTI service environment.
For example, a computer application of the business enterprise may send a request for a transaction. The RTI server 2802 may determine that RTI data integration jobs 3134 are running on various data integration platforms 2702, which in an embodiment are DataStage servers from Ascential Software. The RTI server 2802 may map the data in the request from the computer application into what the RTI input stage 3138 needs to see for the particular data integration job 3134. The RTI agent 3132 may track what is running on each of the data integration platforms 2702. The RTI agent 3132 may operate with shared memory with the RTI input stage 3138 and the RTI output stage 3140. The RTI agent 3132 may mark a transaction with end-of-wave markers, send the transaction into the RTI input stage 3138, and then, recognizing the end-of-wave marker as the data integration job 3134 completes, take the result out of the RTI output stage 3140 and send the result back to the computer application that initiated the transaction.
The RTI methods and systems described herein may allow data integration processes to be exposed as a set of managed abstract services, accessible by late binding of multiple access protocols. Using a data integration platform 2702, such as the Ascential platform, the user may create data integration processes (typically represented by a flow in a graphical user interface). The user may then expose the processes defined by the flow as a service that can be invoked in real time, synchronously or asynchronously, by various applications. To take greatest advantage of the RTI service, it may be desirable to support various protocols, such as JMS queues (where the process can post data to a queue and an application can retrieve data from the queue), Java classes, and web services. Binding multiple access protocols allows various applications to access the RTI service. Since the bindings handle application-specific protocol requirements, the RTI service can be defined as an abstract service. The abstract service is defined by what the service is doing, rather than by a specific protocol or environment. More generally, the RTI services may be published in a directory and shared with numerous users.
An RTI service can have multiple operations, and each operation may be implemented by a job. To create the service, the user does not need to know about the particular web service, Java class, or the like. When designing the data integration job that will be exposed through the RTI service, the user does not need to know how the service is going to be called. The user may build the RTI service, and then for a given data integration request the system may execute the RTI service. At some point the user binds the RTI service to one or more protocols, which could
be a web service, Enterprise Java Bean (EJB), JMS, JMX, C++ or any of a great number of protocols that can embody the service. For a particular RTI service, there may be several bindings, so that the service can be accessed by different applications with different protocols.
Once an RTI service is defined, the user can attach a binding, or multiple bindings, so that multiple applications using different protocols can invoke the RTI service at the same time. In a conventional WSDL document, the service definition includes a port type, but does not necessarily tell how the service is called. A user can define all the binding types that can be attached to the particular WSDL-defined jobs. Examples include SOAP over HTTP, EJB, Text over JMS, and others. For example, to create an EJB binding, the RTI server 2802 generates Java source code for an Enterprise Java Bean. At service deployment, the user uses the RTI console 3002 to define properties, compile code, and create a Java archive file, and then gives that archive to the user of an enterprise application to deploy in that user's Java application server, so that each operation is one method of the Java class. As a result, there may be a one-to-one correspondence between an RTI service name and a Java class name, as well as a correspondence between an RTI operation name and a Java method name. Java application method calls will thus call the operation in the RTI service, so a web service using SOAP over HTTP and a Java application using an EJB can go to the exact same data integration job via the RTI service. The entry point and exit points do not require a specific protocol, so the same job may be working over multiple protocols.
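By way of a non-limiting illustration, the generated EJB binding code might expose a remote interface along the following lines; the service name "CustomerLookup", the operation name "findCustomer", and the record type are hypothetical, while the EJBObject and RemoteException conventions are those of standard J2EE.

    // Hypothetical record type returned by the operation.
    class CustomerRecord implements java.io.Serializable {
        public String name;
        public int age;
    }

    // The RTI service name maps one-to-one onto the Java class name, and each
    // RTI operation onto one Java method, so an ordinary method call on the
    // generated bean invokes the corresponding RTI operation.
    interface CustomerLookup extends javax.ejb.EJBObject {
        CustomerRecord findCustomer(String lastName, int age)
                throws java.rmi.RemoteException;
    }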
While SOAP and EJB bindings support synchronous processes, other bindings support asynchronous processes. For example, SOAP over JMS and Text over JMS are asynchronous. In an embodiment, a message can be posted to a queue; the RTI service can monitor asynchronous inputs to the input queue and asynchronously post the output to another queue.
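A minimal sketch of such an asynchronous client, using the standard JMS API and hypothetical JNDI queue names, follows: the request is posted to the service's input queue, and the reply is consumed from an output queue whenever it arrives.

    import javax.jms.*;
    import javax.naming.InitialContext;

    public class AsyncRtiClient {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            ConnectionFactory cf =
                    (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
            Queue in = (Queue) ctx.lookup("queue/RtiServiceIn");    // hypothetical
            Queue out = (Queue) ctx.lookup("queue/RtiServiceOut");  // hypothetical

            Connection conn = cf.createConnection();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Post the request text to the input queue and return immediately.
            session.createProducer(in).send(session.createTextMessage("Smith,42"));

            // Receive the service's reply asynchronously from the output queue.
            session.createConsumer(out).setMessageListener(new MessageListener() {
                public void onMessage(Message m) {
                    try {
                        System.out.println(((TextMessage) m).getText());
                    } catch (JMSException e) {
                        e.printStackTrace();
                    }
                }
            });
            conn.start();  // a real client would keep the connection open for replies
        }
    }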
Fig. 32 is a schematic diagram 3200 of the internal architecture for an RTI service. The architecture includes the RTI server 2802, which is a J2EE-compliant server. The RTI server 2802 interacts with the RTI agent 3132 of the data integration platform 2702. The process pool facility 3102 manages projects by selecting the appropriate data integration platform machine 2702 to which a data integration job will be passed. The RTI server 2802 includes a job pool facility 3202 for handling data integration jobs. The job pool facility 3202 includes a job list 3204, which lists jobs and the status of each, indicating whether it is available. The job pool facility may include a cache manager and operations facility for handling jobs that are passed to the RTI server 2802. The RTI server 2802 may also include a registry facility 3220 for managing interactions with an appropriate public or private registry, such as publishing WSDL descriptions to the registry for services that can be accessed through the RTI server 2802. The RTI server 2802 may also include an EJB container 3208, which includes an RTI session bean runtime facility 3210 for the RTI services, in accordance with J2EE. The EJB container 3208 may include message beans 3212, session beans 3214, and entity beans 3218 for enabling the RTI service. The EJB container 3208 may facilitate various interfaces, including a JMS interface 3222, an EJB client interface 3224 and an Axis interface 3228. Referring to Fig. 33, an aspect of the interaction of the RTI server 2802 and the RTI agent 3132 is that the RTI agent 3132 manages a pipeline of service requests, which are then passed to a job instance 3302 for the data integration job. The job instance 3302 runs on the data integration platform 2702 and has an RTI input stage 3138 and RTI output stage 3140. Depending on need, more than one job instance 3302 may be running on a particular data integration platform 2702. The RTI agent 3132 manages the opening and closing of job instances as service requests are passed to it from the RTI server 2802. In contrast to traditional batch-type data integration, each request for an RTI service travels through the RTI server 2802, RTI agent 3132, and data integration platform 2702
in a pipeline 3304 of jobs. The pipeline 3304 can be managed in the RTI agent 3132, such as by setting various parameters of the pipeline 3304. For example, the pipeline 3304 can have a buffer, the size of which can be set by the user using a maximum buffer size parameter 3308. The administrator can also set other parameters, such as the period of delay that the RTI agent 3132 will accept before starting a new job instance 3302, namely, the instance start delay 3310. The administrator can also set a threshold 3312 for the pipeline, representing the number of service requests that the pipeline can accept for a given job instance 3302.
An RTI service can be managed in a registry that can be searched. An already-written application that uses a protocol attached to the service may be associated with the RTI service. For example, a customer management operation, such as adding a customer, removing a customer, or validating a customer address, can use or be attached to a known web service protocol. The customer management applications may be attached to an RTI service, where the application is a client of the RTI service. In other words, a predefined application can be attached to the RTI service, where the application calls or uses the RTI service. The result is that the user can download a service on demand to a particular device and run it from (or on) the device.
For example, a mobile computing device such as a pocket PC may have a hosting environment. The mobile computing device may have an application, such as one for mobile data integration services, with a number of downloaded applications and available applications. The mobile device may browse the available applications. When the device downloads an application that is attached to an RTI service, the application is delivered over the air to the mobile device and invokes the attached RTI service at the same time. As a result, the user can have mobile application deployment while simultaneously having access to real time, integrated data from the enterprise. Thus, RTI services may offer a highly effective model for mobile computing applications where an enterprise benefits from the user having up-to-date data.
Having now described various aspects of a data integration system 104 for an enterprise computing system 1300 in its generic form, several examples of the data integration system 104 will now be provided encompassing various commercial and other applications. As shown in Fig. 34, a data integration system 104 with RTI services 2704 may be used in connection with the financial services industry. Real time data integration may allow a business enterprise in the financial services industry to avoid risks that would otherwise be present. For example, if one branch of a financial institution 3402 handles a consumer's 3404 loan application 3410, while another branch executes trades in equities 3408, the institution 3402 may be undertaking more risk in making the loan than it would otherwise be willing to take. Real time data integration allows the financial institution to have a more accurate profile of the consumer 3404 at the time a given transaction is executed. Thus, an RTI service 3412 may allow a computer application associated with the loan application to request up-to-the-minute data about the consumer's 3404 equity account, which can be retrieved through the RTI service 3412 from data associated with applications of the financial institution 3402 that handle equity trades 3408. Of course, not only financial institutions, but finance departments of many enterprises may make similar financial decisions that could benefit from real time data integration.
Business enterprises can benefit from real time data integration services, such as the RTI services described herein, in a wide variety of environments and for many purposes. One example is in the area of operational reporting and analysis. Among other things, RTI services may provide a consolidated view of real time transactional analysis with large volume batch data. Referring to Fig. 35, an RTI service 3502 can be constructed that calls out in real time to all of a business enterprise's important data sources 3504, such as enterprise data warehouses, data marts, databases, and the like. The RTI service 3502 can then apply consistent data-level
transforms on the data from the data sources 3504. Used in this way, the RTI service can also automate source system analysis and provide in-flight, real time data quality management. There are many operational reporting or analysis processes of business enterprises that can benefit from such an RTI service, such as fraud detection and risk analysis in the financial services area, inventory control, forecasting and market-basket analysis in the retail area, compliance activities in the financial area, and shrinkage analysis and staff scheduling in the retail area. Any analysis or reporting task that can benefit from data from more than one source can similarly benefit from an RTI service that retrieves and integrates the data on the fly in real time in accordance with a well-defined data integration job.
Another class of business processes that can benefit from RTI services such as those described herein is the set of business processes that involve creating master system-of-record databases. Referring to Fig. 36, an enterprise can have many databases that include data about a particular topic, such as customers 3604. For example, a customer's information may appear in a sales database 3608, a CRM database 3610, a support database 3612 and a finance database 3614. In fact, in a real business enterprise it is not unusual for each of these departments to have multiple databases of their own. One of the desired benefits from data integration efforts is to establish data consistency across many databases. For example, for a triggering event 3618, such as a customer's address change, only one entity of the business may initially receive the information, but it would be preferable for all different departments to have access to the change. RTI services offer the possibility of creating master systems of record, without requiring changes in the native databases. Thus, an RTI process 3602 can be defined that links disparate silos of information, including those that use different protocols. By supporting multiple bindings, the RTI process can accept inputs and provide outputs to various applications of disparate formats. Meanwhile, the business logic in the RTI service can perform data integration tasks, such as performing data standardization for all incoming data, providing meta lineage information for all data, and maintaining linkage between the disparate data sources. The result is a real-time, up-to-the-minute master record service, which can be accessed as an RTI service. There are many examples of applications that may benefit from master records. In financial services, an institution may wish to have a customer master record, as well as a security master record, across the whole enterprise. In telecommunications, insurance and other industries that deal with huge numbers of customers, master records services can support consistent billing, claims processing and the like. In retail enterprises, master records can support point of sale applications, web services, customer marketing databases, and inventory synchronization functions. In manufacturing and logistics operations, a business enterprise can establish a master record process for data about a product from different sources, such as information about design, manufacturing, inventory, sales, returns, service obligations, warranty information, and the like. In other cases, the business can use the RTI service to support ERP instance consolidation. RTI services that embody master records allow the benefits of data integration without requiring coding in the native applications to allow disparate data sources to talk to each other. The embodiment of Fig. 37 provides a master customer database 3700. The master customer database 3700 may include an integrated customer view across many different databases that include some data about the customer, including both internal and external systems. The master customer database would be a master system that would include the "best" data about the customer from all different sources. To establish the master customer database, data integration requires matching, standardization, consolidation, transformation and enrichment of data, all of which is performed by the RTI service 3702.
While some data can be handled in batch mode, new data must be handled in real time to ensure that rapidly changing data is the most accurate data available. A master customer database could be used by a business entity in almost any field, including retail, financial services, manufacturing,
logistics, professional services, medical and pharmaceutical, telecommunications, information technology, biotechnology, or many others. Similar data management may be desirable for associations, academic institutions, governmental institutions, or any other large organization or institution.
RTI services as described herein can also support many services that expose data integration tasks, such as transformation, validation and standardization routines, to transactional business processes. Thus, the RTI services may provide on-the-fly data quality, enrichment and transformation. An application may access such services via a service-oriented architecture, which promotes the reuse of standard business logic across the entire business enterprise. Referring to Fig. 38, an RTI service 3802, which may be the RTI service 2704 described above, embodies a set of data transformation, validation and standardization routines, such as those embodied by a data integration platform 3804, such as Ascential's DataStage platform. An application 3808 can trigger an event that calls the RTI service 3802 to accomplish the data integration task on the fly.
Many business processes can benefit from real-time transformation, validation and standardization routines. This may include call center up-selling and cross-selling in the telemarketing industry, reinsurance risk validation in the financial industry, point of sale account creation in retail businesses, and enhanced service quality in fields such as health care and information technology services.
Referring to Fig. 39, an example of a business process that can benefit from real time integration services is an underwriting process 3900, such as underwriting for an insurance policy, such as property insurance. The process of underwriting property may require access to a variety of different data sources of different types, such as text files 3902, spreadsheets 3904, web data 3908, and the like. Data can be inconsistent and error-prone. The lead time for obtaining supplemental data slows down underwriting decisions. The main underwriting database 3910 may contain some data, but other relevant data may be included in various other databases, such as an environmental database 3912, an occupancy database 3914, and a geographic database 3918. As a result, an underwriting decision may be made based on flawed assumptions if the data from the different sources and databases is not integrated at the time of the decision. By integrating access to the various data sources 3902, 3904, 3908, 3912, 3914, 3918 using a real time integration service, the speed and accuracy of underwriting decisions may be improved. Referring to Fig. 40, an RTI service can improve the quality of the underwriting decision. The text files, spreadsheets, and web files can each be input to the RTI service, which may be any of the RTI services 2704 described above, running on an RTI server 3904, such as through a web interface 3902. The environmental database 3912, occupancy database 3914, and geographic database 3918, as well as the underwriting database 3910, can all be called by a data integration job 4012, which can include a CASS process 4010 and a Waves process 4008, such as embodied by Ascential Software's QualityStage product. The RTI service can include bindings for the protocols for each of those databases. The result is an integrated underwriting decision process that benefits from current information from all of the schedules, as well as the disparate databases, all enabled by the RTI service. For example, an underwriting process needs current address information, and an RTI integration job such as that described above can quickly integrate thousands of addresses from disparate sources.
Enterprise data services may also benefit from data integration as described herein. In particular, an RTI integration process can provide standard, consolidated data access and transformation services. The RTI integration process can provide virtual access to disparate data sources, both internal and external. The RTI integration process can provide on-the-fly data quality enrichment and transformation. The RTI integration process can also track all metadata passing through the process. Referring to Fig. 41, one or more RTI services 4102, 4104 can operate
within the enterprise to provide data services. Each of them can support data integration jobs 4108. The data integration jobs 4108 can access databases 4110, which may be disparate data sources, with different native languages and protocols, both internal and external to the enterprise. An enterprise application can access the data integration jobs 4108 through the RTI services 4102, 4104. Referring to Fig. 42, another business enterprise that can benefit from real time integration services is a distribution enterprise, such as a trucking broker. The trucking broker may handle a plurality of trucks 4202, which carry goods from location to location. The trucks 4202 may have remote devices that run simple applications 4204, such as applications that allow the truck 4202 to log in when the truck 4202 arrives at a location. Drivers of trucks 4202 often have mobile computing devices, such as LandStar satellite system devices, which the drivers may use to enter data, such as arrival at a checkpoint. The enterprise itself may have several computer applications or databases, such as a freight bill application 4208, an agent process 4210, and a check call application 4212. However, these native applications, while handling processes that may provide useful information to drivers, are not typically coded to run on the mobile devices of the trucks 4202. For example, drivers may wish to be able to schedule trips, but the trip scheduling application may require data (such as what other trips have been completed) that is not resident on the mobile device of the truck 4202.
Referring to Fig. 43, using an RTI service model, a set of data integration services 4302 can be defined to support applications 4310 that a driver can access as web services, such as using a mobile device. For example, an application 4310 can allow the driver to update his schedule with data from the truck broker enterprise. The RTI server 4304 publishes data integration jobs from the data integration services 4302, which the applications 4310 access as web services 4308. The data integration services 4302 can integrate data from the enterprise, such as about what other jobs have already been completed, including data from the freight bill application 4208 and agent process 4210. The RTI service, which may be any of the RTI services 2704 described above, may act as a smart graphical user interface for the driver's applications, such as to provide a scheduling application. The driver can download the application to the mobile device to invoke the service. As a result, using the RTI service model, it is convenient to provide the infrastructure for applications that use RTI services on mobile devices.
As another example (not illustrated in the figures), data integration may be used to improve supply chain management, such as in inventory management and perishable goods distribution. For example, if a supply chain manager has a current picture of the inventory levels in various retail store locations, the manager can direct further deliveries or partial shipments to the stores that have low inventory levels or high demand, resulting in a more efficient distribution of goods. Similarly, if a marketing manager has current information about the inventory levels in retail stores or warehouses and current information about demand (such as in different parts of the country), the manager can structure pricing, advertisements or promotions to account for that information, such as to lower prices on items for which demand is weak or for which inventory levels are unexpectedly high. Of course, these are simple examples, but in preferred embodiments managers can have access to a wide range of data sources that enable highly complex business decisions to be made in real time.
Possible applications of such a system are literally endless. A weight loss company may use data integration to prepare a customer database for new marketing opportunities that may be used to enhance revenue to the company from existing customers. A financial services firm may use data integration to prepare a single, valid source for reporting and analysis of customer profitability for bankers, managers, and analysts. A pharmaceutical company may use data integration to create a data warehouse from diverse legacy data sources using different standards and formats, including free form data within various text data fields. A web-based marketplace provider
may employ data integration to manage millions of daily transactions between shoppers and on-line merchants. A bank may employ data integration services to learn more about current customers and improve offerings on products such as savings accounts, checking accounts, credit cards, certificates of deposit, and ATM services. A telecommunications company may employ a high-throughput, parallel processing data integration system to increase the number of calling campaigns undertaken. A transportation company may use a high-throughput, parallel processing data integration system to re-price services intra-daily, such as four times a day. An investment company may employ a high-throughput, parallel processing data integration system to comply with SEC transaction settlement time requirements, and to generally reduce the time, cost, and effort required for settling financial transactions. A health care provider may use a data integration system to meet the requirements of the U.S. Health Insurance Portability and Accountability Act. A web-based education provider may employ data integration systems to monitor the student lifecycle and improve recruiting efforts, as well as student progress and retention.
A number of additional examples of specific commercial applications of a data integration system are now provided.
Figure 44 depicts a data integration system 104 which may be used for financial reporting. In this example the system 4400 may include a sales and order processing system 4402, a general ledger 4404, a data integration system 104 and a finance and accounting financial reporting data warehouse 4408. The sales and order processing system 4402, general ledger 4404 and finance and accounting financial reporting data warehouse 4408 may each include a data source 102, such as any of the data sources 102 described above. The sales and order processing system 4402 may store data gathered during sales and order processing, such as price, quantity, date, time, order number and purchase order terms and conditions, and any other data characterizing any transaction which may be processed and/or recorded by the system 4400. The general ledger 4404 may store data that may be related to a business tracking its finances, such as balance sheet, cash flow, income statement and financial covenant data. The finance and accounting financial reporting data warehouse 4408 may store data related to the financial and accounting departments and functions of a business, such as data from the disparate financial and accounting systems.
The system 4400 may include one or more data integration systems 104, which may be any of the data integration systems 104 described above, which may extract data from the sales and order processing system 4402 and the general ledger 4404 and which may transfer, analyze, process, transform or manipulate such data, as described above. Any such data integration system 104 may load such data into the finance and accounting reporting data warehouse 4408, a data repository or other data target which may be any of the data sources 102 described above. Any of the data integration systems 104 may be configured to receive real-time updates or inputs from any data source 102 and/or be configured to generate corresponding real-time outputs to the corresponding finance and accounting reporting data warehouse 4408 or other data target. Optionally, the data integration system 104 may extract, transfer, analyze, process, transform, manipulate and/or load data on a periodic basis, such as at the close of the business day or the end of a reporting cycle, or in response to any external event, such as a user request.
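A minimal sketch of such an extract-transform-load flow, assuming JDBC-accessible sources with hypothetical connection URLs and table names, is set forth below: rows are extracted from the order processing system, normalized, and loaded into the financial reporting warehouse.

    import java.sql.*;

    public class FinancialReportingLoad {
        public static void main(String[] args) throws SQLException {
            try (Connection orders = DriverManager.getConnection("jdbc:example:orders");
                 Connection warehouse = DriverManager.getConnection("jdbc:example:warehouse");
                 Statement extract = orders.createStatement();
                 ResultSet rs = extract.executeQuery(
                         "SELECT order_no, amount, order_date FROM orders");
                 PreparedStatement load = warehouse.prepareStatement(
                         "INSERT INTO fin_reporting (order_no, amount_usd, order_date)"
                         + " VALUES (?, ?, ?)")) {
                while (rs.next()) {
                    load.setString(1, rs.getString("order_no"));
                    // Transform: normalize amounts to two decimal places.
                    load.setBigDecimal(2, rs.getBigDecimal("amount")
                            .setScale(2, java.math.RoundingMode.HALF_UP));
                    load.setDate(3, rs.getDate("order_date"));
                    load.executeUpdate();
                }
            }
        }
    }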
In this manner a data warehouse 4408 may be created and maintained which can provide the company with current financial and accounting information. This system 4400 may enable the company to compare its financial performance to its financial goals in real time, allowing it to respond rapidly to deviations. This system 4400 may
also enable the company to assess its compliance with any legal or regulatory requirements, or private debt or other covenants of its loans, thus allowing it to calculate any additional costs or penalties associated with its actions.
Figure 45 depicts a data integration system 104 used to create and maintain an authoritative, current and accurate list of customers to be used with point of sale, customer relationship management and other applications and/or databases at a retail or other store or company. In this example the system 4500 may include a point of sale application 4502, point of sale database 4504, customer relationship management application 4508, customer relationship management database 4510, data integration system 104 and customer database 4512.
The point of sale application 4502 may be a computer program, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone, barcode reader or any combination of the foregoing or any other device or combination of devices for the processing or recording of a sale, exchange, return or other transaction. The point of sale application may be linked to a point of sale database 4504 which may include any of the data sources 102 described above. The point of sale database 4504 may contain data gathered during sales, exchanges, returns and/or other transactions, such as price, quantity, date, time and order number data and any other data characterizing any transaction which may be processed or recorded by the point of sale application 4502. The customer relationship management application 4508 may be a computer program, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone, barcode reader or any combination of the foregoing or any other device or combination of devices for the input, storage, analysis, manipulation, viewing and/or retrieval of information about customers, other individuals and/or entities, such as name, address, corporate structure, birth date, order history, credit rating and any other data characterizing or related to any customer, other individual or entity. The customer relationship management application 4508 may be linked to a customer relationship management database 4510 which may include any of the data sources 102 described above, and may contain information about customers, other individuals and/or entities.
The data integration system 104, which may be any of the data integration systems 104 described above, may independently extract data from or load data to any of the point of sale application 4502 or database 4504, the customer relationship management application 4508 or database 4510 or the customer database 4512. The data integration system 104 may also analyze, process, transform or manipulate such data, as described above. For example, a customer service representative or other employee may update a customer's address using the customer relationship management application 4508 during a courtesy call following the purchase of a household durable item, such as a freezer or washing machine. The customer relationship management application 4508 may then transfer the updated address data to the customer relationship management database 4510. The data integration system 104 may then extract the updated address data from the customer relationship management database 4510, transform it to a common format and load it into the customer database 4512. The next time the customer makes a purchase, the cashier or other employee may complete the transaction using the point of sale application 4502, which may, via the data integration system 104, access the updated address data in the customer database 4512 so that the cashier or other employee need only confirm the address information as opposed to entering it in the point of sale application 4502. In addition, the point of sale application 4502 may transfer the new transaction data to the point of sale database 4504. The data integration system 104 may then extract the transaction data from the point of sale database 4504, transform it to a common format and load it into the customer database 4512. As a result the new transaction data is accessible to the point of sale and customer relationship management applications and databases as well as any other applications or databases maintained by the business enterprise.
In this manner a customer database 4512 may be created and maintained which can provide the retail or other store or company with current, accurate and complete data concerning each of its customers. With this information, the store or company may better serve its customers. For example, if customer service granted a customer a discount on his next purchase, the cashier or other employee using the point of sale application 4502 will be able to verify the discount and record a notice that the discount has been used. The system 4500 may also enable the store or company to prevent customer fraud. For example, customer service representatives or other employees receiving customer complaints over the telephone can, using the customer relationship management application 4508, access point of sale information to determine the date of a purchase of a particular product allowing them to determine if a product is still covered by the store or manufacturer's warranty. Figure 46 depicts a data integration system 104 which may be used to convert drug replenishment or other information generated or stored at retail pharmacies into industry standard XML or other languages for use with pharmacy distributors or other parties. In this example the system 4600 may include retail pharmacies 4602, drug replenishment information, a data integration system 104, and pharmacy distributors 4604.
The retail pharmacies 4602 may use applications, computer programs, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone, barcode reader or any combination of the foregoing or any other device or combination of devices for collecting, generating or storing the drug replenishment or other information. Such applications, computer programs, software or firmware may be linked to one or more databases which may include at least one data source 102, such as any of the data sources 102 described above, which contains drug replenishment information such as inventory level, days-on-hand and orders to be filled. Such applications, computer programs, software or firmware may also be linked to one or more data integration systems 104, which may be any of the data integration systems 104 described above. The pharmacy distributors 4604 may use applications, computer programs, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone, barcode reader or any combination of the foregoing or any other device or combination of devices for receiving, analyzing, processing or storing the drug replenishment information, in industry standard XML or another language or format. Such applications, computer programs, software or firmware may be linked to a database, which may include any of the data sources 102 described above, that contains the drug replenishment information.
The system 4600 may include one or more data integration systems 104, which may be any of the data integration systems 104 described above. The data integration system 104 may extract the drug replenishment information from the retail pharmacies 4602, convert the drug replenishment information to industry standard XML or otherwise analyze, process, transform or manipulate such information, and then load or transfer, automatically or upon request, such information to the pharmacy distributors 4604. For example, a customer may purchase the penultimate bottle of cold medicine X at a given retail pharmacy 4602. Immediately after the sale, that retail pharmacy's systems may determine that the pharmacy 4602 needs to increase its stock of cold medicine X by a certain number of bottles before a certain date and then send the drug replenishment information to the data integration system 104. The data integration system 104 may then convert the drug replenishment information to industry standard XML and upload it to the pharmacy distributors' system. The pharmacy distributors 4604 can then automatically ensure that the given pharmacy 4602 receives the requested number of bottles before the specified date. Thus a system 4600 may be created allowing retail pharmacies 4602 to communicate with pharmacy distributors 4604 in a manner that enables minimal supply chain interruptions and expenses. This system 4600 may
allow retail pharmacies 4602 to automatically communicate their inventory needs to pharmacy distributors 4604, reducing surplus inventory holding costs, waste due to expired products and the transaction and other costs associated with returns to the pharmacy distributors. This system 4600 may be supplemented with additional data integration systems 104 to support credit history review, payment, and other financial services to ensure good credit risks and timely payment for the pharmacy distributors.
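By way of a non-limiting sketch, and assuming a hypothetical message schema, the conversion of a single drug replenishment record into an XML message for a pharmacy distributor might proceed as follows:

    // A minimal sketch, under a hypothetical schema, of converting one drug
    // replenishment record into an XML message for a pharmacy distributor.
    public class ReplenishmentToXml {
        public static String toXml(String ndcCode, int quantity, String neededBy) {
            StringBuilder xml = new StringBuilder();
            xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            xml.append("<replenishmentOrder>\n");   // hypothetical element names
            xml.append("  <ndc>").append(ndcCode).append("</ndc>\n");
            xml.append("  <quantity>").append(quantity).append("</quantity>\n");
            xml.append("  <neededBy>").append(neededBy).append("</neededBy>\n");
            xml.append("</replenishmentOrder>");
            return xml.toString();
        }

        public static void main(String[] args) {
            // e.g., order 24 bottles of cold medicine X before a specified date
            System.out.println(toXml("12345-678-90", 24, "2004-09-15"));
        }
    }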
Figure 47 depicts a data integration system 104 which may be used to provide access to manufacturing analytical data 4702 via pre-built services 4704 that are invoked from business applications and integration technologies 4708, such as enterprise application integration, message oriented middleware and web services, to allow the data to be used in operational optimization, decision-making and other functions. In this example the system 4700 may include manufacturing analytical data 4702, such as inventory, parts, sales, payroll, human resources and other data, pre-built services 4704, business applications and integration technologies 4708, a user or users 4710, a data integration system 104, and user business applications 4712.
The user 4710 may, using business applications and integration technologies 4708 running or stored on a networked or standalone computer, computer system, handheld device, palm device, cell phone or any combination of the foregoing or any other device or combination of devices, invoke pre-built services 4704 to provide access to manufacturing analytical data. The pre-built services 4704 may be data integration systems 104 as described above or other infrastructure which may transfer, analyze, modify, process, transform or manipulate data or other information. The pre-built services 4704 may use, and the manufacturing analytical data 4702 may be stored on, a database which may include a data source 102, such as any of the data sources 102 described above. The user business applications 4712 may be a computer program, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone or any combination of the foregoing or any other device or combination of devices for the processing or analysis of manufacturing analytical data 4702 or other information. The user business applications 4712 may be linked to a database that includes a data source 102, such as any of the data sources 102 described above. The system 4700 may include one or more data integration systems 104, which may be any of the data integration systems 104 described above, which may extract, analyze, modify, process, transform or manipulate the manufacturing analytical data 4702 or other data, in response to a user input via the business applications and/or integration technologies 4708 or other user-related or external event or on a periodic basis, and make the results available to the user business applications 4712 for display, storage or further processing, analysis or manipulation of the data.
For example, a manager using existing business applications and integration technologies 4708 may access via a pre-built service 4704 certain manufacturing analytical data 4702. The manager may determine the numbers of a certain group of parts in inventory and the payroll costs associated with having enough employees on hand to assemble the parts. The data integration system 104 may extract, integrate and analyze the required data from the inventory, parts, payroll and human resources databases and upload the results to the manager's business application 4712. The business application 4712 may then display the results in several text and graphical formats and prompt the user (manager) for further analytical requests.
In this manner, a system 4700 may be created that allows managers and other decision-makers across the enterprise to access the data they require. This system 4700 may enable actors within the enterprise to make more informed decisions based on an integrated view of all the data available at a given point in time. In addition, this system 4700 may enable the enterprise to make faster decisions since it can rapidly integrate data from many
disparate data sources 102 and obtain an enterprise-wide analysis in a short period of time. Overall, this system 4700 may allow the enterprise to optimize its operations, decision-making and other functions.
Fig. 48 depicts a data integration system 104 that may be used to analytically process clinical trial study results for loading into a pharmacokinetic data warehouse 4802 on an event-driven basis. In this example the system 4800 may include a clinical trial study 4804, clinical trial study databases 4808, an event 4810, a data integration system 104 and a pharmacokinetic data warehouse 4802.
The clinical trial study 4804 may generate data which may be stored in one or more clinical trial study databases 4808, which may each include a data source 102, such as any of the data sources 102 described above. Each clinical trial study database 4808 may contain data gathered during the clinical trial study 4804, such as patient names, addresses, medical conditions, medications and dosages, absorption, distribution and elimination rates for a given drug, government approval and ethics committee approval information and any other data which may be associated with a clinical trial 4804. The pharmacokinetic data warehouse 4802 may include any of the data sources 102 described above, which may contain data related to clinical trial studies 4804, including data such as that housed in the clinical trial study databases 4808, as well as data and information relating to drug interactions and properties, biochemistry, chemistry, physics, biology, physiology, medical literature or other relevant information or data. The external event 4810 may be a user input or the achievement of a certain study or other result or any other specified event.
The system 4800 may include one or more data integration systems 104 as described above, which may extract, modify, transform, manipulate or analytically process the clinical trial study data 4804 or other data, in response to the external event 4810 or on a periodic basis, such as at the close of the business day or the end of a reporting cycle, and may make the results available to the pharmacokinetic data warehouse 4802. For example, the external event 4810 may be the requirement of certain information in connection with a research grant application. The grant review committee may require data on drug absorption responses in an on-going clinical trial before it will commit to allocating funds for a related clinical trial. The system 4800 may be used to extract the required data from the clinical trial study databases 4808, analytically process the data to determine, for example, the mean, median, maximum and minimum rates of drug absorption, and compare these results to those of other studies and of similar drugs. All of this information may then be presented to the grant review committee.
In this manner a system 4800 may be created which will allow researchers and others rapid access to complete and accurate pharmacokinetic information, including information from completed and on-going clinical trials. This system 4800 may enable researchers and others to generate preliminary results and detect adverse effects or trends before they become serious. This system 4800 may also enable researchers and others to link the on-going or final results of a given study to those of other studies, theories or established principles. In addition, the system 4800 may aid researchers and others in the design of new studies, trials and experiments.
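By way of illustration only, the following Java sketch shows one way an event 4810 might trigger extraction, analytical processing and a load into the warehouse 4802. The AbsorptionSample record, the onEvent method and the stubbed warehouse loader are hypothetical names introduced for this sketch and are not part of the system described above.

import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of event-driven analytical processing; not the
// disclosed implementation.
public class EventDrivenLoadExample {

    record AbsorptionSample(String patientId, double ratePerHour) {}

    // Invoked when the external event 4810 fires (e.g., a grant-data request).
    static void onEvent(List<AbsorptionSample> studyData,
                        Consumer<String> warehouseLoader) {
        // Analytical processing: summarize absorption rates for the warehouse.
        double mean = studyData.stream()
                .mapToDouble(AbsorptionSample::ratePerHour)
                .average().orElse(Double.NaN);
        warehouseLoader.accept("mean_absorption_rate=" + mean);
    }

    public static void main(String[] args) {
        List<AbsorptionSample> data = List.of(
                new AbsorptionSample("P-01", 0.42),
                new AbsorptionSample("P-02", 0.35));
        // The warehouse load is stubbed as a print statement; a real system
        // would write to the pharmacokinetic data warehouse 4802.
        onEvent(data, row -> System.out.println("load: " + row));
    }
}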
Figure 49 depicts a data integration system 104 which may be used to provide scientists 4902 with a list of available studies 4904 through a Java application 4908 and allow them to initiate extract, transform and load processing 4910 on selected studies. In this example the system 4900 may include a group of scientists 4902, a list of available studies 4904, a Java application 4908, a database of studies 4912, a list of selected studies 4914, extract, transform and load processing 4910 and a data integration system 104.
The studies database 4912 may include any of the data sources 102 described above, which may store the titles, abstracts, full text, data and results of the studies as well as other information associated with the studies. The Java application 4908 may consist of one or more applets, running or stored on a computer, handheld device, palm device, cell phone or any combination of the foregoing or any other device or combination of devices, which may generate a complete list of the studies in the database or a list of studies in the database responsive to certain user-defined or other characteristics. The scientists, laboratory personnel or others may select a subset of studies from this list and generate a list of selected studies 4914. The system 4900 may include one or more data integration systems as described above, which may extract, modify, transform, manipulate, process or analyze the lists of available studies 4904 or data from the studies database. For example, the scientists 4902, laboratory personnel or others may request, using the Java application 4908 through a web browser, a list of all available studies 4904 relating to a certain specified drug or medical condition. The scientists 4902, laboratory personnel or others may then select certain studies from such list or add other studies to such list to generate a list of selected studies 4914. The scientists 4902, laboratory personnel or others may then send the list of selected studies to the data integration system 104 for extract, transform and load processing 4910. The scientists 4902, laboratory personnel or others may request as an output all the metabolic rate or other specified data from the selected studies in a particular format.
In this manner a system 4900 may be created which will allow scientists 4902, laboratory personnel or others access to a directory of relevant studies with the ability to extract or manipulate data and other information from those studies. This system 4900 may enable scientists 4902, laboratory personnel or others to obtain relevant prior data or other information, to avoid unnecessary repetition of experiments, or to select certain studies that conflict with their results or predictions for the purpose of repeating the studies or reconciling the results. The system 4900 may also enable scientists 4902, laboratory personnel or others to obtain, integrate and analyze the results from prior studies in order to simulate new experiments without actually performing the experiments in the laboratory.
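The study-selection workflow above lends itself to a brief Java sketch, given that the disclosure contemplates a Java application 4908. The sketch below is a hypothetical, minimal illustration; the Study record, the StudiesDirectory and EtlService interfaces, and all literal values are assumptions made for the example.

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the studies database 4912 and the
// extract-transform-load processing 4910; names are illustrative only.
public class StudySelectionExample {

    record Study(String id, String title, String drug) {}

    interface StudiesDirectory {
        List<Study> availableStudies();          // list of available studies 4904
    }

    interface EtlService {
        void submit(List<Study> selected, String outputFormat);  // ETL processing 4910
    }

    public static void main(String[] args) {
        StudiesDirectory directory = () -> List.of(
                new Study("S-001", "Phase II absorption study", "drugA"),
                new Study("S-002", "Long-term safety study", "drugB"));
        EtlService etl = (selected, format) ->
                selected.forEach(s -> System.out.println("ETL queued: " + s.id() + " as " + format));

        // Filter the directory by a user-specified drug, as a scientist might
        // do through the Java application 4908, then submit the selection 4914.
        List<Study> selected = directory.availableStudies().stream()
                .filter(s -> s.drug().equals("drugA"))
                .collect(Collectors.toList());
        etl.submit(selected, "metabolic-rate-report");
    }
}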
Figure 50 depicts a data integration system 104 which may be used to create and maintain a cross-reference of customer data 5002 as it is entered across multiple systems, such as point of sale 5004, customer relationship management 5008 and sales force automation systems 5010, for improved customer understanding and intimacy or for other purposes. In this example the system 5000 may include point of sale 5004, customer relationship management 5008, sales force automation 5010 or other systems 5012, a data integration system 104, and a customer data cross-reference database 5002.
The point of sale 5004, customer relationship management 5008 and sales force automation systems 5010 may each consist of one or more applications and/or databases. The applications may be computer programs, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone or any combination of the foregoing or any other device or combination of devices. The databases may include any of the data sources 102 described above. The point of sale application may be used for the processing or recording of a sale, exchange, return or other transaction, and the point of sale database may contain data gathered during sales, exchanges, returns and/or other transactions, such as price, quantity, date, time and order number data and any other data characterizing any transaction which may be processed or recorded by the system 5000. The customer relationship management application may be used for the input, storage, analysis, manipulation, viewing and/or retrieval of information about customers, other individuals and/or entities, such as name, address, corporate structure, birth date, order history, credit rating and any other data characterizing or related to any customer, other individual or entity. The customer relationship management database may contain information about customers, other individuals and/or entities. The sales force automation application may be used for lead generation, contact cross-referencing, scheduling, performance tracking and other functions, and the sales force automation database
may contain information or data in connection with sales leads and contacts, schedules of individual members of the sales force, performance objectives and actual results as well as other data.
The system 5000 may include one or more data integration systems 104 as described above, which may extract, modify, transform, manipulate, process or analyze the data from the point of sale 5004, customer relationship management 5008, sales force automation 5010 and other systems 5012 and which may make the results available to the customer data cross-reference database 5002. For example, the system 5000 may, on a periodic basis, such as at the close of the business day or the end of a reporting cycle, or in response to any external event, such as a user request, extract data from any or all of the point of sale 5004, customer relationship management 5008, sales force automation 5010 or other systems 5012. The system 5000 may then convert the data to a common format or otherwise transfer, process or manipulate the data for loading into a customer data cross-reference database 5002, which is available to other applications across the enterprise. The data integration system 104 may also be configured to receive real-time updates or inputs from any data source 102 and/or be configured to generate corresponding real-time outputs to the customer data cross-reference database 5002.
In this manner a system 5000 may be created which provides users with access to cross-referenced customer data 5002 across the enterprise. The system 5000 may provide the enterprise with cleansed, consistent, duplicate-free customer data for use by all systems, leading to a deeper understanding of customers and stronger customer relationships.
Figure 51 depicts a data integration system 104 which may be used to provide on-demand automated cross-referencing and matching 5102 of inbound customer records 5104 with customer data stored across internal systems to avoid duplicates and provide a full cross-system record of data for any given customer. In this example the system 5100 may include inbound customer records 5104, a data integration system 104 and internal customer databases 5108.
The inbound customer records 5104 may include information gathered during transactions or interactions with or regarding customers, such as name, address, corporate structure, birth date, products purchased, scheduled maintenance and other information. The internal databases 5108 may include any of the data sources 102 described above, and may store data gathered during transactions or interactions with or regarding customers. The internal databases 5108 may be linked to internal applications, which may be computer programs, software or firmware running or stored on a networked or standalone computer, handheld device, palm device, cell phone or any combination of the foregoing or any other device or combination of devices. The system 5100 may include one or more data integration systems as described above, which may extract, modify, transform, manipulate, process or analyze the inbound customer records 5104 or any data from the internal customer databases 5108. In addition, the data integration system 104 may cross-reference 5102 the inbound customer records 5104 against the data in the internal customer databases 5108. For example, the internal customer databases 5108 may be a database with information related to the products purchased by customers, a database with information related to the services purchased by customers, a database providing information on the size of each customer organization and a database containing credit information for customers. The system 5100 may cross-reference inbound customer records 5104 against the products, services, size and credit information to reveal and correct inconsistencies and ensure the accuracy and uniqueness of the data record for each customer.
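One simple way to perform the cross-referencing 5102 is to reduce each record to a normalized match key before comparison. The following Java sketch is illustrative only; the CustomerRecord record and matchKey method are hypothetical names, and a production matcher would use richer matching rules.

import java.text.Normalizer;
import java.util.*;

// A minimal sketch of the cross-referencing step 5102: inbound records 5104
// are matched against internal customer keys by a normalized name+postal key.
// All class and field names here are illustrative, not part of the disclosure.
public class CustomerMatchExample {

    record CustomerRecord(String name, String postalCode) {}

    // Normalize a record into a match key so trivial formatting differences
    // ("ACME Corp." vs "acme corp") do not create duplicate customers.
    static String matchKey(CustomerRecord r) {
        String n = Normalizer.normalize(r.name(), Normalizer.Form.NFKD)
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]", "");
        return n + "|" + r.postalCode().replaceAll("\\s", "");
    }

    public static void main(String[] args) {
        Map<String, CustomerRecord> internal = new HashMap<>();
        internal.put(matchKey(new CustomerRecord("ACME Corp.", "02139")),
                new CustomerRecord("ACME Corp.", "02139"));

        CustomerRecord inbound = new CustomerRecord("acme corp", "02139");
        String key = matchKey(inbound);
        if (internal.containsKey(key)) {
            System.out.println("Matched existing customer; no duplicate created.");
        } else {
            internal.put(key, inbound);   // new customer enters the cross-reference
        }
    }
}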
In this manner a system 5100 may be created which will allow for accurate and complete customer records. This system 5100 may provide the enterprise with deeper customer knowledge, allowing for better customer service. The system 5100 may enable sales people, in reliance on the data contained in the customer databases, to suggest
products and services complementary to those already purchased by the customer and geared to the size of the customer's business.
Having described various data integration systems and business enterprises, an architecture for providing data integration services in an enterprise will now be described in greater detail. Referring to Fig. 52, a high level schematic view of an architecture depicts how a plurality of services may be combined to operate as an integrated application that unifies development, deployment, operation, and life-cycle management of a data integration solution. The unification of data integration tasks into a single platform may eliminate the need for separate software products for different phases of design and deployment.
The architecture 5200 may include a GUI/tool framework 5202, an intelligent automation layer 5203, one or more clients 5204, APIs 5208, core services 5210, product function services 5212, metadata services 5222, metadata repositories 5224, one or more runtime engines 5214 with component runtimes 5220 and connectors 5218. The architecture 5200 may be deployed on a service-oriented architecture 5201, such as any of the service-oriented architectures 2400 described above.
Metadata models stored in the metadata repository 5224 provide common internal representations of data throughout the system at every step of the process from design through deployment. Models may be registered in a directory that is accessible to other system components. The common models may provide a common representation (common to all product function services) of numerous suite-wide items including metadata (data descriptive of data, including data profile information), data integration process specifications, users, machine and software configurations, etc. These common models may enable common user views of enterprise resources and integration processes no matter what product functions the user is using, and may obviate the need for model translation among integrated product functions.
The service oriented architecture (SOA) 5201 is shown as encompassing all of the services and may provide for the coordination of all the services from the GUI 5202 through the run time engine 5214 and the connections 5218 to the computing environment. The common models, which may be stored in the metadata repository 5224, may allow the SOA 5201 to seamlessly provide interaction between a plurality of services. The SOA 5201 may, for example, expose the GUI 5202 to all aspects of data integration design and deployment by use of common core services 5210, product function services 5212, and metadata services 5222, and may operate through an intelligent automation layer 5203. The common models and services may allow for common representation of objects in the GUI 5202 for various actions during the design and deployment process. The GUI 5202 may have a plurality of clients 5204 interfacing with SOA 5201 coordinated services. The clients 5204 may allow users to interface with the data integration design with a plurality of skill levels enabling users to work as a team across organizationally appropriate levels. The SOA 5201 may provide access to common core services 5210 and product function services 5212, as well as providing back end support to APIs 5208, for functions and services in data integration designs. Services may be shared and reused by a plurality of clients 5204 and other services. The intelligent automation layer 5203 may employ metadata and services within the architecture 5200 to simplify user choices within the GUI 5202, such as by showing only relevant user choices, or automating common, frequent, and/or obvious operations. The intelligent automation layer 5203 may automatically generate certain jobs, diagnose designs and design choices, and tune performance. The intelligent automation layer 5203 may also support higher-level design paradigms, such as workflow management or modeling of business context, and may more generally apply project or other contextual awareness to assist a user in more quickly and efficiently implementing data integration solutions.
The common core services 5210 may provide common function services that may be commonly used across all aspects of the design and deployment of the data integration solution, such as directory services for one or more common registries, logging and auditing services, monitoring, event management, transaction services, security, licensing (such as creation and enforcement of licensing policies and communication with external licensing services), and provisioning and management of SOA services. The common core services may allow a common representation of functions and objects to the common GUI 5202.
Other product specific function services may be contained in the product function services 5212 and may provide services to specific appropriate clients 5204 and services. These may include, for example, importing and browsing external metadata, as well as profiling, analyzing, and generating reports. Other functions may be more design-oriented, such as services for designing, compiling, deploying, and running data integration services through the architecture. The product function services 5212 may be accessible to the GUI 5202 when an appropriate task is used and may provide a task-oriented GUI 5202. A task-oriented GUI may present to a user only those functions that are appropriate for the actions in the data integration design.
The application programming interfaces (APIs) 5208 may provide a programming interface for access to the full architecture, including any or all of the services, repositories, engines, and connectors therein. The APIs 5208 may contain a commonly used library of functions used by and/or created from various services, and may be called recursively.
The metadata and repository services 5222 may control access to the metadata repository 5224. Each function may keep metadata represented by its own function-specific models in a common repository in the metadata repository 5224. Functions may share common models, or use metadata mappings to dynamically translate semantics among their respective models. All internal metadata and data used in data integration designs may be stored in the metadata repository 5224, and access to external metadata and data may be provided by a hub (a metadata model) stored in the metadata repository 5224 and controlled by the metadata and repository services 5222. Metadata and metadata models may be stored in the metadata repository 5224, and the metadata and repository services 5222 may maintain metadata versioning, persistence, check-in and check-out of metadata and metadata models, and repository space for interim metadata created by a user before it is reconciled with other metadata. The metadata and repository services 5222 may provide access to the metadata repository 5224 to a plurality of services, the GUI 5202, internal clients 5204 and external clients using a repository hub. Access by other services and clients 5204 to the metadata repository 5224 may allow metadata to be accessed, transformed, combined, cleansed, and queried by the other services in seamless transactions coordinated by the SOA 5201.
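The check-in and check-out behavior of the metadata and repository services 5222 may be pictured with a minimal Java sketch. The MetadataModel and RepositoryService names below are assumptions for illustration, and versioning is reduced here to a single incrementing number.

import java.util.*;

// A minimal sketch of metadata check-out/check-in against the repository
// services 5222, with a version number kept per model; illustrative only.
public class RepositorySketch {

    record MetadataModel(String name, int version, String definition) {}

    static class RepositoryService {
        private final Map<String, MetadataModel> store = new HashMap<>();
        private final Set<String> checkedOut = new HashSet<>();

        MetadataModel checkOut(String name) {
            if (!checkedOut.add(name))
                throw new IllegalStateException(name + " is already checked out");
            return store.get(name);
        }

        // Check-in persists a new version rather than overwriting the old one.
        void checkIn(MetadataModel updated) {
            store.put(updated.name(),
                    new MetadataModel(updated.name(), updated.version() + 1,
                            updated.definition()));
            checkedOut.remove(updated.name());
        }
    }

    public static void main(String[] args) {
        RepositoryService repo = new RepositoryService();
        repo.checkIn(new MetadataModel("CustomerModel", 0, "initial definition"));
        MetadataModel m = repo.checkOut("CustomerModel");
        System.out.println(m.name() + " v" + m.version());
    }
}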
A runtime engine 5214, of which there may be several, may use adapters and connectors 5218 to communicate with external sources. The engines 5214 may be exposed to designs created by a user to create compiled and deployed solutions based on the computing environment. The runtime engine 5214 may provide late binding to the computing environment and may provide the user the ability to design data integration solutions independent of computing environment considerations. The runtime engine 5214 may compile the data integration solution and provide an appropriate deployed runtime for high throughput or high concurrency. Services may be deployed as J2EE structures from a registry that provides access to interface and usage specifications for various services. The services may support multiple protocols, such as HTTP, CORBA/RMI, JMS, JCA, and the like, for use with heterogeneous hardware and software environments. Bindings to these protocols may be automatically selected by the runtime engine 5214 or manually selected by the user from the GUI 5202 as part of the deployment process.
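Late protocol binding may be sketched as a simple selection rule: an explicit choice from the GUI 5202 wins, and otherwise the engine falls back to an environment-driven default. The enum and the chooseBinding method below are hypothetical illustrations, not the actual engine logic.

// A sketch of late protocol binding as described for the runtime engine 5214:
// the engine picks a transport for a deployed service unless the designer
// chose one explicitly. Protocol names mirror the ones listed above; the
// enum and chooseBinding method are illustrative assumptions.
public class BindingSketch {

    enum Protocol { HTTP, CORBA_RMI, JMS, JCA }

    // Manual selection from the GUI wins; otherwise fall back to an
    // environment-driven default (messaging if available, else HTTP).
    static Protocol chooseBinding(Protocol userSelected, boolean messagingAvailable) {
        if (userSelected != null) return userSelected;
        return messagingAvailable ? Protocol.JMS : Protocol.HTTP;
    }

    public static void main(String[] args) {
        System.out.println(chooseBinding(null, true));         // JMS
        System.out.println(chooseBinding(Protocol.JCA, true)); // JCA
    }
}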
External connectors 5218 may provide access to a network or other external resources, and provide common access points for multiple execution engines and other transformation execution environments, such as Java or stored procedures, to external resources.
It will be appreciated that an additional functional layer may be provided to assist in selecting and using the various runtime engines 5214. This is particularly useful in support of high throughput or high concurrency deployments. For example, the runtime engines 5214 may include a transaction engine adapted to parse large transactions of potentially unlimited length, as well as continuous streams of real time transactions. The runtime engines 5214 may also include a parallelism engine adapted to processing small independent transactions. The parallelism engine may be adapted to receive preprocessed input (and output) that has been divided into a pipelined or otherwise partitioned flow. As another example, a concurrency runtime engine 5214 may be optimized for quick response to interactive demands, such as a large volume of small independent transactions. A compilation and optimization layer may determine how to allocate integration processes among these various engines, such as by preprocessing output to the parallelism engine into small chunks. By centralizing connectors within the architecture, it is possible to more closely control distribution of processes between various engines, and to provide accessibility to this control at the user interface level. Also, a common intermediate representation of connectivity in a transformation process enables deployment of automation strategies and selection of different combinations of execution engines, as well as optimization based on, for example, metadata or profiling.
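The preprocessing step for the parallelism engine may be illustrated with a small Java sketch that divides input into independent chunks. The chunk method and the chunk size are assumptions made for the example.

import java.util.*;

// A sketch of the preprocessing mentioned for the parallelism engine: a
// compilation/optimization layer splits work into small chunks that can be
// handled as independent transactions. Names and sizes are illustrative.
public class PartitionSketch {

    static <T> List<List<T>> chunk(List<T> input, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < input.size(); i += chunkSize) {
            chunks.add(input.subList(i, Math.min(i + chunkSize, input.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> records = List.of(1, 2, 3, 4, 5, 6, 7);
        // Each chunk could be dispatched to the parallelism engine as an
        // independent unit of a pipelined or partitioned flow.
        chunk(records, 3).forEach(c -> System.out.println("dispatch " + c));
    }
}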
The architecture 5200 described herein provides a high degree of flexibility and customizability to the user's working environment. This may be applied, for example, to configure user environments around existing or planned workflows and design processes. Users may be able to create specific functional services by constructing components and combining them into compositions, which may also serve in turn as components, allowing recursive nesting of modularity in the design of new components. The components and compositions may be stored in the metadata repository 5224 with access provided by the metadata and repository services 5222. The metadata and repository services 5222 may provide common data definitions with a common interface to a plurality of services and may provide support for native data formats and industry standard formats. The modular nature of the architecture described herein enables packaging of any enterprise function(s) or integration process(es) into a package having components selected from the common core services 5210 and other ones of the product function services 5212, as well as other components of the overall architecture. The ability to make packages from system components may be provided as a common core service 5210. Through this packaging capability, any arbitrary function can be constructed, provided it is capable of expression as a combination of atomic services, components, and compositions already within the architecture 5200. The packaging capability of the architecture 5200 may be combined with the task orientation of the user interface to achieve a user interface specifically adapted to any workflow or design methodology that a user wishes.
Figure 52A shows a task matrix for defining a user interface as a number of tasks, which may be, for example, any of the functions, services, or packages described above. The user interface may advantageously be strongly separated from underlying functionality in order to maintain design flexibility. In such an architecture, the tasks dictated by a workflow may be organized within a task matrix 5250 that is interpreted to provide an interface to the user. The task matrix 5250 may include one or more contexts 5252, each including a number of tasks 5254 and a number of menus 5256 (the "pillars" described in the interface below). The one or more contexts 5252 may relate to, for example, a number of different presentations of the tasks within the menus. Through the use of a number of separately defined contexts, the interface may be designed with
a variety of optional presentations, such as skill-level sensitive presentations or security-level sensitive presentations. More generally, by providing a number of optional contexts, an additional dimension of design flexibility is realized within the task matrix 5250. This may be used in a number of ways. As an example, a health care provider and an insurer may both be responsible under a regulatory scheme such as HIPAA for maintaining internal records and transacting with others in a specific manner. In this case, the provider and the insurer may have a number of tasks in common, such as encrypting personal identification information or using a common format for a request for payment that is sent by the provider to the insurer. Data integration processes for the insurer and the provider may share a substantial number of tasks in common, while also requiring significant differences. Two contexts 5252 may be used for the provider and the insurer to define two different workflows based upon the same set of common tasks under a more general HIPAA-compliant interface definition.
Each task 5254 may be any number of user-defined tasks or common tasks, functions, services and the like provided by the system, or combinations of such tasks, functions, and/or services combined into a single useful task relating to a workflow. The tasks 5254 may also include, or be associated with, dialogs, wizards, help windows, and the like useful within the interface. The tasks 5254 identified in the task matrix 5250 may be predefined with respect to, e.g., their location within the interface, association with certain control regions of the interface, the controls, inputs, and outputs associated with the task, and so on. Thus it is simply necessary to indicate the presence of the task in the task matrix 5250 for the task to be accessible through the user interface. Optionally, the intersection of each task 5254 and menu 5256 may include one or more specifications relating to the occurrence of the task 5254. Thus, the task matrix 5250 may specify visual elements, controls, location, inputs, outputs, skill-level parameters, and so on.
Each menu item 5256 may correspond to a phase of a workflow. Using the task matrix 5250, the relevant tasks for a workflow may be realized at appropriate locations within the user environment.
To further assist in rapid deployment of workflow-based interface designs, the user interface itself may be defined as a collection of visual styles, controls, control panels, workspaces, tools, wizards, dialogs, and so on. At the same time, the tasks may be defined, for example, as services within a service-oriented architecture. Thus, through the use of the task matrix 5250 an arbitrary combination of menu items 5256, each with one or more supporting tasks 5254, may be conveniently arranged to express any data integration or other workflow as a navigation methodology.
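A task matrix 5250 may be encoded in many ways; the following Java sketch uses a simple map-of-maps, with contexts 5252 mapping menus 5256 (workflow phases) to lists of tasks 5254. The layout and all names are illustrative assumptions rather than the actual data structure, and the example reuses the provider/insurer scenario above.

import java.util.*;

// A sketch of the task matrix 5250: contexts 5252 map workflow menus 5256
// (phases) to the tasks 5254 shown for that phase; illustrative only.
public class TaskMatrixSketch {

    public static void main(String[] args) {
        // context -> (menu/phase -> tasks)
        Map<String, Map<String, List<String>>> matrix = new LinkedHashMap<>();

        matrix.put("provider", Map.of(
                "Prepare", List.of("encrypt-patient-id", "validate-claim"),
                "Submit",  List.of("format-payment-request")));
        matrix.put("insurer", Map.of(
                "Receive", List.of("decrypt-patient-id", "validate-claim"),
                "Settle",  List.of("post-payment")));

        // Rendering a context yields the menus and tasks for that user's workflow.
        matrix.get("provider").forEach((menu, tasks) ->
                System.out.println(menu + ": " + tasks));
    }
}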
The interface may advantageously operate against a common data set, so that changes are persisted among various phases (i.e., menus) of the workflow, and between tasks within a phase. Each task 5254 and each menu 5256 may represent a different functional perspective on the same data set. Thus, the architecture may provide a significant improvement over existing application-based environments in which a workflow passes data from application to application for different phases of the workflow. As a further advantage, the task matrix 5250 itself may be modified by a user according to personal preferences or design problems. Tools for modifying the task matrix 5250 may be exposed as a service, and accessible by users through a user interface or other tool acting as a client to that service in the service-oriented architecture.
While a specific construct has been identified for organizing tasks within a workflow, it should be appreciated that any number of data structures or other storage mechanisms may be used to store, retrieve, and modify definitions of user interfaces, and it should further be appreciated that the architecture of the user interface
itself may be defined at various levels of specificity, according to the desired trade-off between flexibility and ease of design.
Referring to Fig. 53, a more detailed schematic of the GUI 5202 is shown to provide an understanding of the plurality of user skill levels and common design library 5300 interactions. The GUI 5202 may be strongly separated from underlying functionality of the overall system, and more particularly from other aspects of the architecture and the metadata models and designs in the metadata repository 5224. In such a "separated" design, all of the functionality of the architecture may be presented to the user interface as services. The user interface may in turn provide a common set of low level tools and user controls for accessing these services. This approach permits relatively straightforward extensibility for either the interface or the underlying system, and may provide a seamless working environment for all stages of a data integration process. Where new workflows or different users arise, the interface may be adapted accordingly without affecting existing models and data integration jobs or processes. Where new services or models are developed, they may be added to the service oriented architecture without modifying the user interface (unless, of course, a different user interface is desired).
The GUI 5202 architecture may provide alternative views of the system, such as an interface for a relatively low-skill user that provides controls and components for a user to manipulate a design 5304 without detailed knowledge of the system architecture or the metadata models. Another higher skill-level user interface may, while operating on the same metadata models, provide more detailed control over the model design. The higher skill-level GUI may, for example, provide access to more functions and greater customization. Similarly, the GUI 5202 may provide alternative role sensitive views which may be defined behaviorally, such as monitoring, deployment, and operational control, or defined occupationally, such as director, officer, executive, manager, analyst, accountant, engineer, and so on. Roles may also be hierarchically defined, such as where executives are categorized as president, vice president, and so on, or where an analyst is one of a marketing analyst, a business analyst, a financial analyst, and so on. As another example, the GUI 5202 may provide platform sensitive views, such as a user interface specifically adapted for a mobile device, or more specifically adapted for use on a BlackBerry 7730 from Research In Motion Limited, or an Axim X30 from Dell Inc. Still more specifically, a task matrix may be defined for technology personnel using laptop computers, or for officers of a company using Motorola cellular phones. Each user profile may be realized as a separate task matrix, all while operating on a common, shared metadata model.
The GUI 5202 may provide an integrated user environment for all aspects of the design solution from the beginning of the design to the final deployment without the need for other external tools. The GUI 5202 may provide common functions to all user levels, such as catalog management, searching, querying, navigating, personalization, scripting, persistence, or other tasks associated with the design of a data integration solution 5304. Users may access the data integration design solution 5304 using skill-specific or role-specific user interface functionality. Different user views 5302 may allow for a plurality of levels of complexity of design capability. As an example, in many data integration design solutions 5304 users from various organizations with various skills may be teamed together to develop a single design solution 5304. The GUI 5202 may be configurable to allow users of varying skill to work with functions and services that are applicable to their fields of expertise. The architecture 5200 may allow such teams to collaborate on the same metadata, models and designs and, since the live model is common to all users, may provide global visibility as changes are made by various team members. Regardless of the user skill level, the user view 5302 may be exposed to a common function library and design wizards 5300 for use by a wide variety of users creating a design solution 5304. The library and design
wizard 5300 functions may provide access based on the user view 5302 skill level and may limit the available functions on a skill-level basis. The user views 5302 may enable team collaboration with shared access to metadata and models.
The user view 5302 may be organized with task and context oriented guidance for the data integration design solution 5304. As data integration designs become more complex it may be advantageous to provide task oriented options while the user is working on the design solution 5304. As an example, if a user is developing a query, the available query task options may be displayed to guide the user in the proper design of the query. As different aspects of the designs are added or modified, the appropriate context functions and services may become available to the user to help guide the user to a robust design solution 5304. As an example, a user may add a data source component to a design and the interface may display any appropriate metadata catalog, import, sample data and data profiling functions. In addition, the available catalog may only display data sources that are already associated with a current design. The user may also be shown other data integration designs that have used the same data source so the user may reuse other design aspects.
The user views 5302, while working on the design solution 5304, may restrict display of certain details of the design to make the design solution 5304 less visually complicated. The user may be able to hide or expose a plurality of layers of the design to allow the user to focus only on the design aspect that is being worked on at that time, such as models, aggregations and associations, business processes, control flow, data access, error handling or other aspects of the design.
The aspects of Fig. 53 may be applicable to an enterprise environment having multiple development sites using a plurality of resources, both internal and external. The user views may be able to work with the design solution 5304 from a plurality of geographical locations, with tasks such as modeling, analysis, development, testing, operation, and administration being performed at different locations. All aspects described above for Fig. 53 may be available at remote locations.
Fig. 54 shows a UML diagram that describes the roles and relationships for the SOA. An SOA client 5412 may find and instantiate services 5414 through a service directory 5418. Each service 5414 may use other services 5414 as part of its implementation, including common shared services such as the component service 5420. The component service 5420 may expose any external resource connectors, such as databases, files, applications, queues, and the like, as services that can be used at design time.
The client 5412 represents a generalization of programmatic or user interfaces that act as clients to the services 5414. The process designer 5402 may provide, for example, a graphical user interface or programmatic access for manipulating a model or process. The API 5410 may provide an application programming interface for external programs to access the services 5414, which may be any services 5414 provided by the architecture 5200 described above. The wizard 5408 may provide a pre-packaged collection of design steps that, for example, create a metadata model or perform a recurring collection of integration tasks. The compiler 5404 may provide an interface to compilation and runtime services, or other deployment services. Each of these specific interfaces, as well as any number of other interfaces or combinations of these interfaces, may act as a client 5412 to the services 5414 for performing operations within the architecture 5200 described above.
Each service 5414 may be recursive in that a first service may use a second service, or another instance of itself, which may in turn use a third service or another instance of itself, and so on. The recursive or nested structure of a service that employs other combinations of services is generally transparent to the client 5412. More generally, each product function within the architecture 5200, as well as combinations of services 5414, may be
factored into a service for use by clients 5412. The services 5414 exposed to clients 5412 may be part of the common core services 5210, product function services 5212, metadata and repository services 5222, or any other services that might be catalogued in the service directory 5418.
A client 5412 may find and instantiate services 5414 through the service directory 5418. Once a service 5414 is instantiated, it may be used in a design. As can be seen in the UML diagram, there may be a one to many relationship of the service directory 5418 to the service 5414, with the service 5414 receiving information from the service directory 5418. As an example, a client 5412 may request a service 5414 that has not been instantiated. The request may be passed to the service directory 5418, which may instantiate the service 5414 and expose the service 5414 to the client 5412. This process may also instantiate secondary services, such as services used by the service 5414 being instantiated.
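The find-and-instantiate pattern of the service directory 5418 may be sketched as a lazy factory lookup. The ServiceDirectory class in the following Java sketch is a hypothetical illustration; a real directory would also resolve bindings and secondary services.

import java.util.*;
import java.util.function.Supplier;

// A sketch of the find-and-instantiate pattern of the service directory 5418:
// clients 5412 look up services 5414 by name, and the directory lazily
// creates an instance on first request. Names are illustrative assumptions.
public class ServiceDirectorySketch {

    interface Service { String describe(); }

    static class ServiceDirectory {
        private final Map<String, Supplier<Service>> factories = new HashMap<>();
        private final Map<String, Service> instances = new HashMap<>();

        void register(String name, Supplier<Service> factory) {
            factories.put(name, factory);
        }

        // Instantiate on first lookup; later lookups reuse the instance.
        Service find(String name) {
            return instances.computeIfAbsent(name,
                    n -> factories.get(n).get());
        }
    }

    public static void main(String[] args) {
        ServiceDirectory directory = new ServiceDirectory();
        directory.register("componentService", () -> () -> "component service 5420");
        System.out.println(directory.find("componentService").describe());
    }
}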
The component service 5420 may provide access to the functionality of individual components, and may expose external resources such as databases, files, applications, queues, metadata, or other available resources as services 5414. These exposed external resources may be available as services 5414 to a user (through the process designer 5402) at design time, and available to runtime engines at run time. Also, through the component services 5420 and connectors 5218, the external resources may have access to services within the SOA architecture 5200.
Referring to Fig. 55, a schematic of the SOA 5201 environment shows how the SOA 5201 interfaces to other architecture 5200 clients and services. The core of the SOA 5201 may be the service binding 5512, SOA infrastructure 5514, and service implementation 5518. Service binding 5512 may permit binding of clients, such as GUI 5508, applications 5504, script orchestration 5502, management framework 5500, and other clients, to services that may be internal or external to the SOA 5201. The bound services may be part of the common core services
5520, and the service binding 5512 may access the service description registry 5510 to instantiate the service. The service binding 5512 may make it possible for clients to use services that may be local or external using the same or different technologies. The binding to external services may expose the external services so that they may be invoked in the same manner as internal services. Communication to the services may be synchronous or asynchronous, may use different communication paths, and may be stateful or stateless. The service binding 5512 may provide support for a plurality of protocols, such as HTTP, CORBA/RMI, JMS, or JCA. The service binding 5512 may determine the appropriate protocol for the service binding automatically according to the computing environment, or the user may select the protocol from the GUI 5508 as part of the design solution 5304.
The management framework 5500 client may provide facilities to install, expose, catalog, configure, monitor, and otherwise administer the SOA 5201 services. The management framework 5500 may provide access to clients, internal services, external services through connections, or metadata in internal or external metadata repositories.
The orchestration client 5502 may make it possible to design a plurality of complex product functions and workflows by composing a plurality of SOA 5201 services into a design solution 5304. The services may be composed from the common core services 5520, external services 5524, internal processes 5528, or user defined services 5522. The orchestration of the SOA 5201 is at the core of the capability to provide unified data integration designs in the enterprise environment. The orchestration between the clients, core services, metadata repository services, deployment engines, and external services and metadata enables designs meeting a wide range of enterprise needs. The unified approach provides an architecture to bind together the entire suite for enterprise design, and may allow for a single GUI 5508 capable of seamless presentation of the entire design process through to a deployed design solution 5304.
The application client 5504 may programmatically provide additional functionality to SOA 5201 coordinated services by allowing services to call common functions as needed. The functions of the application client 5504 may enhance the capability of the services of the SOA 5201 by allowing the services to call the functions and apply them as if they were part of the service. The GUI client 5508 may provide the user interface to the SOA 5201 services and resources by allowing these services and resources to be graphically displayed and manipulated. The GUI 5508 is more fully described in the Fig. 53 description above.
The SOA infrastructure 5514 may be J2EE based and may provide the facility to allow services to be developed independent of the deployment environment. The SOA infrastructure 5514 may provide additional functionality in support of the deployment environment such as resource pooling, interception, serializing, load balancing, event listening, and monitoring. The SOA infrastructure 5514 may have access to the computing environment and may influence services available to the GUI 5508 and may support a context-directed GUI 5508.
The SOA infrastructure 5514 may provide resource pooling using, for example, enterprise Java beans (EJB) and real time integration (RTI). The resource pooling may permit a plurality of concurrent service instances to share a small number of resources, both internal and external. The SOA infrastructure may provide a number of useful tools and features. Interception may provide for insertion of encryption, compression, tracing, monitoring, and other management tools that may be transparent to the services, and may provide reporting of these services to clients and other services. Serialization and de-serialization may provide complex service request and data transfer support across a plurality of invocation protocols and across disparate technologies. Load balancing may allow a plurality of service instances to be distributed across a plurality of servers. Load balancing may support high concurrency processing or high throughput processing accessing one or a plurality of processors on a plurality of servers. Event listening and generation may enable the invocation of a service based on observed external events. This may allow the invocation of a second service based on the function of a first service when a specified condition occurs. Event listening may also support a call back capability specifying that a service may be invoked using the same identifier as when previously invoked.
The service description registry 5510 may be a service that maintains all interface and usage specifications for all other services. The service description registry 5510 may provide query and selection services to create instances of services, bindings, and protocols to be used with a design solution 5304. As an example, instances of services may be requested by a client or other service to the SOA 5201, where the SOA 5201 will request a query or selection of the called service. The service description registry 5510 may then return the instance of the service for binding by the service binding 5512, which may then be used in the design solution 5304.
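The interception facility may be pictured as a wrapper applied around a service call without the service's knowledge. The ServiceCall interface and withTracing method in the following Java sketch are illustrative assumptions only.

// A sketch of the interception facility described for the SOA infrastructure
// 5514: cross-cutting steps (tracing, compression, encryption) wrap a service
// call without the service knowing. The names here are illustrative only.
public class InterceptionSketch {

    interface ServiceCall { String invoke(String request); }

    // Wrap a call with a tracing interceptor, transparently to the service.
    static ServiceCall withTracing(ServiceCall inner) {
        return request -> {
            System.out.println("trace: request=" + request);
            String response = inner.invoke(request);
            System.out.println("trace: response=" + response);
            return response;
        };
    }

    public static void main(String[] args) {
        ServiceCall lookup = request -> "result-for-" + request;
        ServiceCall traced = withTracing(lookup);
        traced.invoke("customer-42");
    }
}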
The common core services 5520 may contain a plurality of services that may be invoked to create design solutions 5304 and runtime deployed solutions. The common core services 5520 may contain all of the common services for design solutions 5304, thereby freeing other services from having to maintain these capabilities themselves. The services themselves may call other services within the common core services 5520 as required to complete the design solution 5304. A plurality of clients may access the common core services 5520 through the service binding 5512, SOA infrastructure 5514 and service description registry 5510. Common core services may also be accessed by external services through the metadata repository services 5222 and the SOA infrastructure 5514.
Additional external services may access any of the environment supported by the SOA infrastructure 5514 through the service implementation 5518. The service implementation 5518 may provide access to external services through use of adapters and connectors 5218. Through the service implementation 5518, services 5524 may expose
specific product functionality provided by other software products for developing design solutions 5304. These services 5524 may provide investigation, design, development, testing, deployment, operation, monitoring, tuning, or other functions. As an example, the services 5524 may perform the data integration jobs and may access the SOA 5201 for metadata, meta models, or services. The service implementation 5518 may provide access for the processes 5528 to integration processes created with other tools and exposed as services to the SOA infrastructure 5514. Users of other tools may have created these integration processes and these processes may be exposed as services to the SOA 5201 and clients.
The service implementation 5518 may also provide access to user defined services 5522 that may allow users to define or create their own custom processes and expose them as SOA services. Exposing the user-defined services 5522 as SOA services allows them to be exposed to all clients and services of the SOA 5201.
Fig. 56 depicts the interaction of tools 5604, legacy tools 5602, and external tools 5600 with the repository services 5222. Repository services 5222 may provide services for access to metadata models for the internal tools 5604, legacy tools 5602, and/or external tools 5600. Each of these services may maintain metadata or meta models in repository space maintained by the repository services 5222. The repository services 5222 may provide the capability, through model mapping (indicated by arrows) to the semantic hub 5614, for the various tools to access other metadata.
In the architecture 5200, tools 5604 use distinct models for their working metadata, which may be stored in the repository 5224 provided by the repository services 5222. The repository 5224 may store the tool specific models, metadata models, and common models that may be shared by a plurality of services of the SOA 5201. The tools 5604 may access the repository 5224 through the SOA 5201 and may provide data and metadata for a design solution 5304. There may be distinct functional areas in the repository 5224 for activities such as data profiling, data cleansing, design solutions 5304, or other data integration related actions.
The repository services 5222 may interact with legacy tools 5602 and may maintain the legacy model repository 5612. The legacy models may be used for metadata coexistence, metadata migration, metadata concurrent versions, or other metadata and modeling needs for design solutions 5304. The legacy model repository 5612 may be maintained by the repository services 5222, and using the mapping from the legacy model repository 5612 to the repository 5224, the legacy tools 5602 may be exposed to the tools 5604 as services. Using the same mapping from the repository 5224 to the legacy model repository 5612, the tools 5604 may be exposed as services to the legacy tools 5602. With the legacy model repository 5612 exposed to the repository 5224 and tools 5604, metadata and data from the legacy model repository 5612 may be maintained as concurrent metadata or reconciled with existing repository metadata.
The repository services 5222 may be the primary service for importing and exporting 5608 metadata and models for external tools 5600. External metadata and models may be imported to create external view models 5610 that may be maintained by the repository services 5222. These external view models may be exposed to the SOA 5201 for design solutions 5304 and deployment. In the repository services 5222, models may reference other models and may allow a tool to access a plurality of models concurrently. Model referencing between repositories and external repositories may also allow a plurality of tools to share the same metadata or models in the repository 5224.
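The semantic hub mapping introduced above and detailed below may be sketched as two partial mappings meeting at shared hub concepts, so that tools translate through the hub rather than pairwise. All names in the following Java sketch are hypothetical assumptions for illustration.

import java.util.*;

// A sketch of the semantic hub idea 5614: each tool's model maps its local
// element names to hub concepts, so two tools can share metadata with only
// partial mappings to the hub rather than pairwise translations.
public class SemanticHubSketch {

    // tool-specific element name -> hub concept
    static Map<String, String> legacyToHub = Map.of("CUST_NM", "customerName");
    static Map<String, String> externalToHub = Map.of("client_name", "customerName");

    // Translate a legacy element to an external tool's vocabulary via the hub.
    static Optional<String> translate(String legacyElement) {
        String hub = legacyToHub.get(legacyElement);
        if (hub == null) return Optional.empty();
        return externalToHub.entrySet().stream()
                .filter(e -> e.getValue().equals(hub))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(translate("CUST_NM").orElse("<unmapped>")); // client_name
    }
}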
The repository services 5222 may use a semantic hub model to provide access to external view models 5610, legacy models 5612, and repository models 5224 to external tools 5600, legacy tools 5602 and tools 5604. The semantic hub model may be a consistent model that maintains mapping between the external view models
5610, legacy models 5612 and repository models 5224 and may only require partial mapping to allow the tools 5604 to share metadata. By use of the semantic hub model 5614 mapping between other system components, tools 5604 may have access to legacy models 5612 and external view models 5610 for user design solutions 5304. The semantic hub model 5614 mapping may provide for live metadata analysis of external view models 5610 for SOA 5201 services. The tools 5604 may access legacy models and may support coexistence and migration of legacy metadata and models to the repository 5224. The tools 5604 may interact with legacy models 5612 and may support them as additional versions of the metadata and models. Legacy models 5612 may be migrated into repository 5224 models. Through the semantic hub model 5614, the external tools 5600 and legacy tools 5602 may be able to access SOA 5201 services by accessing the SOA connectors 5218.
Fig. 57 depicts an architecture for repository services, such as the repository services 5222 described above. The architecture may include interfaces 5700, such as to GUI clients, non-GUI clients, and services. The architecture may further include an XMI stream 5702, an application programming interface 5704, import/export functions 5608, definition and extension functions 5710, navigation functions 5712, query function 5714, internal distribution, locking, and caching services 5708, internal persistence and versioning services 5718, object management services 5720, and a database 5722. In general, the architecture provides programmatic or other access to models in the database 5722 for a wide variety of users and user types, while providing database functionality that preserves the history and persistence of the models. The repository services 5222 may maintain design models such as meta-models (i.e., models of metadata), as well as model instances for execution in jobs, services, or other functions.
The interfaces 5700 and XMI stream 5702 may provide communication in any suitable format between the repository services 5222 and external users. The interfaces 5700 may provide a programming interface, message-oriented interface, and/or other interface or interfaces for external access to the repository services 5222. For example, the XML metadata interchange (XMI) stream 5702 may provide a standard, message-oriented format for encoding and transmitting metamodel and instance file save-restore-interchange formats. The XMI stream 5702 may communicate directly with import and export functions 5608 of the repository services 5222.
The application programming interface ("API") 5704 may receive, for example, attribute values and object IDs from clients accessing the repository services 5222. The API may include a reflective interface that independently maintains information about either the structure of the repository services 5222 or the features and functionality of an interface 5700 accessing the repository services 5222, or both. A user may access metadata in the database 5722 using a direct programmatic interface (e.g., Java classes), or using reflective features of the programming interface 5704 that may, for example, require less information about the structure of the repository services 5222 and/or may require less explicit information from the interface 5700 about the user that is accessing the repository services. For example, the application programming interface 5704 may include definition and extension functions 5710 that detect and/or describe models used by known interfaces 5700 to access the repository services 5222. The application programming interface 5704 may also include navigation functions 5712 with an awareness of the structure and features of navigation provided by known GUI (or non-GUI) client interfaces 5700. Using these functions 5704, 5710, 5712, the repository services 5222 may better interpret and respond to requests from various interfaces 5700 having known configurations. Alternatively, any model-aware client or service may directly access any model-specific classes compiled from the model definitions. The query function 5714 may directly expose the repository services 5222 for appropriately formatted requests from interfaces 5700, such as GUI clients, non-GUI clients, and services. For example, the query function 5714
may directly receive query or update expressions from the clients and services 5700, and may reply directly with responsive object IDs. Other commonly used data search and update functions may be included with, or in addition to, the query function 5714. The query function 5714 may be defined by any commercially available relational database used as the database 5722, or any other standard or proprietary database useful for storing data within the repository services 5222.
The object management services 5720 may manage data within the database 5722 using, for example, the object distribution, locking, and caching service 5708 to control access to the database 5722, and using the object persistence and version service 5718 to manage versioning and other related features of objects within the database 5722. For example, the persistence and versioning services 5718 may support versioning of both meta-models and model instances, with multiple successor branches and automated or semi-automated merging. The persistence and versioning services 5718 may also support custom or user-defined models with instances that survive other system revisions or upgrades. As another example, the persistence and versioning services 5718 may support multiple concurrent model version usage. At the same time, the repository services 5222 may provide interfaces to external source control and version management systems, such as according to the Microsoft SCCI specification. Cross-model references may be persistent. More generally, metadata services may reside at any layer in the repository architecture (above the database and any core services), and may provide functionality for, e.g., reconciliation of metadata items such as imported items, analysis such as impact analysis, change propagation, and data lineage, reporting and documentation, cataloging for enterprise-wide and personal catalogs and categorization schemes, collaboration and workflow support, and scripting capability. The database 5722 may be, for example, any of the data sources 102 described above.
It will be appreciated that a logical repository, which may include an operational (e.g., runtime) repository and a design repository (also referred to as a common repository or a core repository), may be flexibly configured as multiple and/or distributed repositories, including personal, team/project, division, or enterprise repositories, as well as centralized or decentralized repositories. Further, the repository may provide automated synchronization among different repository instances. The repository services 5222 may include functions or services to manage configuration of the repositories in a manner that is transparent to external clients and services 5700.
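Versioning with multiple successor branches may be illustrated with a minimal Java sketch in which each check-in records a new version pointing at its parent. The ModelVersion and VersionStore names are assumptions for this example; merging is omitted.

import java.util.*;

// A sketch of persistence and versioning 5718 with multiple successor
// branches: each commit records a new version whose parent is the version
// it was derived from. Names are illustrative only.
public class VersioningSketch {

    record ModelVersion(int id, int parentId, String definition) {}

    static class VersionStore {
        private final List<ModelVersion> versions = new ArrayList<>();

        int commit(int parentId, String definition) {
            int id = versions.size() + 1;
            versions.add(new ModelVersion(id, parentId, definition));
            return id;
        }

        // Two successors of the same parent form divergent branches,
        // which a merge step would later reconcile.
        List<ModelVersion> successors(int parentId) {
            return versions.stream()
                    .filter(v -> v.parentId() == parentId).toList();
        }
    }

    public static void main(String[] args) {
        VersionStore store = new VersionStore();
        int v1 = store.commit(0, "customer model");
        store.commit(v1, "branch A: adds credit rating");
        store.commit(v1, "branch B: adds order history");
        System.out.println(store.successors(v1).size() + " branches from v" + v1);
    }
}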
Referring to Fig. 58, a high level UML diagram is shown describing the component and composition model. Components may be fine-grained elements of design solutions 5304, which may include external resources, such as various types of transformations and integrations. Components may be linked together into compositions that may be exposed as services and coordinated by the SOA 5201. Thus, there is described herein a component-based design, or description, of services in a data integration process. The elements of a component are described below.
A component may have a definition, componentdef 5810, with at least one associated design time object, designtime 5814, such as a stage editor. A plurality of designtime 5814 objects may be associated with the componentdef 5810 depending on the design requirements. A componentdef 5810 may also contain a plurality of property definition objects, propertydef 5802, that may specify the names and types of properties for that component.
In support of flexible solution development, the architecture 5200 supports both abstract and concrete components. Abstract components may support flexible incremental development and may represent an object with only a partial specification. The abstractcomponentdef 5820 may represent a particular type of abstract component, such as a database or process, and may become concrete at a later time in the design solution 5304. A concrete component is a fully defined object and is defined by a concretecomponentdef 5818.
A concrete component may have a runtime 5814 implementation object for each deployment engine, such as generated code, high throughput, debugging, checkpoint, or process state persistence. The run time implementation objects may support component instance life cycle operations, such as initialize, open, run, get, put, end of wave, reset, and abort, in the deployment of a design solution 5304. A connectordef 5828 object may define a concretecomponentdef 5818 that may be connected to an external resource and may provide exposure to external resources such as services or metadata.
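By way of a non-limiting illustration, the life cycle operations named above might be expressed as a run time interface along the following lines; this Java sketch uses hypothetical names and is not a definitive implementation.

    // Hypothetical sketch of the component instance life cycle operations
    // (initialize, open, run, get, put, end of wave, reset, abort).
    public interface ComponentRuntime {

        void initialize();        // acquire resources and validate properties
        void open();              // open links to the component's inputs and outputs
        void run();               // execute the component's main processing loop
        Object get();             // pull the next record from the component
        void put(Object record);  // push a record into the component
        void endOfWave();         // mark a unit-of-work boundary in the data stream
        void reset();             // return the component to its initialized state
        void abort();             // terminate processing and release resources
    }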
The fine-grained components may be interconnected to form compositions for more complex functions. In general, fine-grained components may be used or composed together for highly performant interfaces, as might be identified within a particular runtime environment. Thus, a component-based design may be useful independent of the services-oriented architectures described herein. However, compositions are defined by a compositiondef 5822, which may also be recursive and reusable with other SOA 5201 coordinated services, and which may be exposed as services to other SOA 5201 clients and services. An outermost level of a compositiondef 5822 may be a processdef 5830, which may be a specialized compositiondef 5822. A compositioninstance 5824 may be created with each newly defined compositiondef 5822, and the compositioninstance 5824 may contain property values for the instance.
Compositions may be executed only when all of their components have become concrete. During the design process, any components, such as processes or databases, that start as abstract components must be defined to become concrete components. At the same time, the design for abstract components may change, or be rearranged, either in response to the final design of a concrete component or in response to other external design factors. Thus, a progressive design may be realized, in which a design may be modified at various levels of abstraction within a single design or model. Common compositions and integration process models may be exposed through the GUI 5202 for a design solution 5304. Compositions may themselves be considered components, and compositions may be linked together to create more complex composition services without requiring significant programming skill from a user. Components and compositions may be created in the GUI 5202, where a less skilled user may access a wizard for component and composition design. The architecture 5200 composition models may contain attributes that allow them to be coordinated by the SOA 5201, such as properties, inputs and outputs, units of work, local variables, scopes, control constraints, events, stateless and state-based settings, exceptions, and recursive composition.
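By way of a non-limiting illustration, the recursive, reusable nature of compositions described above resembles a composite pattern in which a composition is itself a component; the following Java sketch uses hypothetical class names standing in for componentdef 5810, compositiondef 5822, and compositioninstance 5824.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // A composition is itself a component, so compositions may be nested
    // and linked into larger compositions.
    abstract class ComponentDef {
        final String name;
        ComponentDef(String name) { this.name = name; }
    }

    class CompositionDef extends ComponentDef {
        private final List<ComponentDef> members = new ArrayList<>();
        CompositionDef(String name) { super(name); }

        // A composition may contain simple components or other compositions.
        CompositionDef add(ComponentDef member) { members.add(member); return this; }
        List<ComponentDef> members() { return members; }
    }

    // An instance carries the property values for one use of a definition.
    class CompositionInstance {
        final CompositionDef definition;
        final Map<String, Object> propertyValues;
        CompositionInstance(CompositionDef definition, Map<String, Object> propertyValues) {
            this.definition = definition;
            this.propertyValues = propertyValues;
        }
    }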
Referring to Fig. 59, an Enterprise Integration Language ("EIL") example of a composition model is shown with an XML representation 5914. The integration process represented may receive a series of work item messages from a client or other service through the SOA 5201. In this example, each of these work messages may require a database lookup 5904, a transformation 5908, promotion of sub-records into relational rows 5910, and insertion into the database 5912. The outermost atomic scope 5900 may specify sequences for execution. In this example, the lookup 5904, transformation 5908, promotion of sub-records to relational rows 5910, and insertion into the database 5912 must be done sequentially. To the atomic scope 5900, the independent scope 5902 is exposed as a complete sequence that may be run in sequence with the insertion into the database 5912.
The inner independent scope 5902 may indicate that the enclosed sequences, namely the database lookup 5904, transformation 5908, and promotion of sub-records into relational rows 5910, may be run in a pipelined fashion, independent of each other. This may facilitate execution using pipelining or concurrent processing. In this manner, the EIL model may define how different sequences of a composition interact; the model may give permission to execute some sequences in series and others in parallel.
The XML representation 5914 may show the script interpretation of the EIL model. A person skilled in the art will recognize the association of the XML script with the EIL model.
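By way of a non-limiting illustration, the two scope semantics of the Fig. 59 example might be modeled as follows; the Java classes, the thread-pool pipelining strategy, and the omission of inter-stage queues are simplifying assumptions for illustration only, not the actual EIL execution machinery.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // An atomic scope runs its enclosed steps strictly in sequence,
    // as the scope 5900 requires in the Fig. 59 example.
    class AtomicScope {
        void execute(List<Runnable> steps) {
            for (Runnable step : steps) {
                step.run();  // strict sequential order
            }
        }
    }

    // An independent scope permits its enclosed steps to overlap, e.g., as
    // concurrent pipeline stages (inter-stage queues omitted for brevity).
    class IndependentScope {
        void execute(List<Runnable> stages) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(stages.size());
            stages.forEach(pool::submit);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }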
Referring to Fig. 60, a UML diagram of the relationship of clients, services, and components is shown, combining the UML of Fig. 54, which described the client and services relationship, and the UML of Fig. 58, which described the component and composition relationships. Refer to the previous descriptions of Fig. 54 and Fig. 58 for a detailed explanation of the full workings of those two models.
Clients 5412 may provide user interaction and APIs for data integration development with the SOA services 5414. The individual components may provide design time and run time interfaces to the SOA services 5414 and clients 5412. The design time 5808 component may interface with the client 5412 and may be coordinated by the SOA service 5414, for the design solution 5304, with available component capabilities. This coordination between the design time 5808 component and the SOA services 5414 may provide the context available to the user in the process designer 5402 and may support a context-sensitive GUI.
The run time 5814 component may interface with the SOA service 5414 to coordinate deployment of the design solution 5304 with available services. The SOA service 5414 may have connections with both the design time component 5808 and the run time component 5814, exposing the SOA service 5414 to the design requirements of the design solution 5304 and the run time requirements of the hardware environment. The SOA service 5414 may then coordinate the deployment by the run time component 5814 and the design time component 5808 for the design solution 5304. The component service 5420, which may be responsible for communication with external resources, may use the componentinstance 5812 to expose the communication service to the SOA 5414, thereby providing the SOA services 5414 with exposure to external resources for the deployment and design of the design solution 5304. The exposure of the external resources to the SOA services 5414 may in turn expose SOA 5201 clients and services to the external resources and services.
Referring to Fig. 61, a schematic representation of data being passed from a first component 6100 to a second component 6102 is shown. The architecture 5200 run time data representation may use a common data representation to allow the plurality of services and connections to work with the data in a common manner. The architecture 5200 repository services 5222 may interface with a plurality of data structures, such as XML, EDI, COBOL, binary structures, native language formats, native machine formats, and other industry data formats. In order to efficiently handle numerous data structures, the architecture 5200 may use a common internal data structure.
The repository services 5222 may pass the data into memory 6114 with minimal data translation to avoid unnecessary conversions, and may therefore increase the efficiency of the data transfer. The data may be passed into sharable memory 6114, where only the needed data may be translated and scattered/gathered for components to perform data integration jobs. Large uninterpreted data items, such as BLOBs and CLOBs, may be passed by reference from sources to targets. By passing only a reference, the large uninterpreted data need not be copied into memory; as data is requested, only the needed parts of the data are translated and then scattered/gathered into memory 6114.
An example of passing data from a first component 6100 to a second component 6102 shows the components accessing a scatter-gather list 6104. The scatter-gather list 6104 may be in shareable memory 6114, with pointers into smaller parts of the data within the source or target data representations. The data representations may be XML text 6108, database fetch buffers 6110, a COBOL record 6112, or another data format. The scatter-gather list 6104 may allow the data representations to be loaded into sharable memory 6114 once, with the pointers to the data representations then manipulated to point to the needed data. By manipulating the pointers in the scatter-gather list instead of moving actual data, most repetitive data copying and memory reallocation may be eliminated. Once the pointer is established for a component, data translations for the needed data may take place automatically, which may leave the majority of the data untranslated in the shared memory 6114. Data handling may be further enhanced by caching the converted data values with pointers to the scatter-gather list 6104 for reuse by other components.
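By way of a non-limiting illustration, a scatter-gather list in the spirit of the list 6104 might be sketched as follows; the class, the use of a byte buffer as a stand-in for shared memory, and the caching scheme are hypothetical assumptions rather than a definitive implementation.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Entries hold offsets into a shared buffer rather than copies of the
    // data; converted values are cached for reuse by other components.
    class ScatterGatherList {
        private final ByteBuffer sharedMemory;                  // data loaded once
        private final List<int[]> entries = new ArrayList<>();  // {offset, length}
        private final Map<Integer, String> convertedCache = new HashMap<>();

        ScatterGatherList(ByteBuffer sharedMemory) { this.sharedMemory = sharedMemory; }

        // Record a pointer to a region of the shared buffer; no data is copied.
        int addEntry(int offset, int length) {
            entries.add(new int[] { offset, length });
            return entries.size() - 1;
        }

        // Translate an entry on demand, caching the result so that other
        // components can reuse the converted value.
        String asText(int index) {
            return convertedCache.computeIfAbsent(index, i -> {
                int[] e = entries.get(i);
                byte[] bytes = new byte[e[1]];
                ByteBuffer view = sharedMemory.duplicate();
                view.position(e[0]);
                view.get(bytes);                                // read in place
                return new String(bytes, StandardCharsets.UTF_8);
            });
        }
    }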
Using these techniques, data may be passed between components of a composition in a highly performant manner. Fig. 62 shows a common engine for the architecture 5200. The common engine framework may include users, such as a programmer 6204, an integration designer 6202, and a business analyst 6200, who may interface with a common intermediate integration process specification 6220 using a flow designer 6210, wizards 6212, examples 6214 (such as stored models or previous projects), templates 6216, and the like. A runtime compilation and optimization layer 6230 may provide varying degrees of automated or manual control over the use of engines such as a high-throughput engine 6240, a high concurrency engine 6242, stored procedures 6244, computer program code 6246, and so on, which may in turn communicate with external resources through common components and connectivity 6250.
Many data integration services and functions require transformations, such as conversion, derivation, aggregation, routing, metadata mapping, quality assertions, and other data transformations. The architecture 5200 may provide a single transformation specification binding the GUI 5202 and the transformation logic. The transformation framework may provide a unified method of transformation for a plurality of data formats, such as relational, hierarchical, XML, arbitrary, and other industry standard formats. As discussed previously, processing such as transformations may advantageously be restricted to a subset of data, while significant parts of the data may be passed without any transformation. As discussed in connection with Fig. 52, the architecture 5200 may provide a single GUI 5202 that is skill-based, allowing users from a variety of organizational positions to work in an environment suited to their abilities. The transformation framework leverages this ability and may provide a plurality of user interfaces as part of the GUI for working with the data transformations. The transformation framework may provide access for business analysts 6200, integration designers 6202, programmers 6204, and any other users through an interface that may be tuned to their skill sets.
The flow designer 6210, wizards 6212, examples 6214, and templates 6216 represent a non-exhaustive collection of possible user interface tools for designing integration processes, which may be employed by the various users 6200, 6202, 6204 to design and modify the common intermediate integration process specification 6220, which may include any of the metadata, metadata models, services, components, compositions, and other process design elements discussed above. Additionally, each of these or additional possible design user interfaces may allow the user different levels of access to design details consistent with the user's skill level, role, technology platform, and so on.
The business analyst 6200, for example, may be more concerned with higher level logical designs such as conditional branching, event detection, and filtration rules. The business analyst 6200 may tend to use wizards 6212, examples 6214, or templates 6216 to develop integration process specifications 6220 using more simplified or integrated functionality that may, for example, allow simple design flow and transformation expression building capability in line with the programming skills of the business analyst 6200.
Integration designers 6202 may be capable of more complex programming tasks for creating quality assertions, data conversions, key expressions, routing, field level transformations, and data cleansing. Integration designers may use the flow designer 6210 to explicitly model portions of a process, in addition to employing wizards 6212, examples 6214, and templates 6216, as well as other high or low level GUI interfaces, which may allow a more in-depth capability to work with an expression language for direct, explicit design. This level of GUI may still be skill-level constrained for the integration designer 6202, who may have access to wizards 6212 for expression building or other tasks requiring assistance. Programmers 6204 may preferably use the flow designer 6210 GUI or a command line interface to allow the greatest control over custom transformation functions and custom components, although the other interfaces 6212, 6214, 6216 may be used interchangeably throughout a design process. The programmer 6204 may have access to the finest level of detail for integration process design, including the capability of record level and data set level logic. The programmer may be allowed to access finer-grained capabilities including, but not limited to, the ability to write or customize elements of the common components and connectivity 6250, to create and use user-defined services accessed via the SOA 5201, and to provide detailed advice to the compilation and optimization processing layer 6230.
Using the various interfaces 6210-6216, users such as the business analyst 6200, integration designer 6202, and programmer 6204 may design integration processes that use a common intermediate integration process specification 6220, a representation that is independent of the eventual engines, components, and connectivity used to execute the integration processes. This intermediate integration process specification 6220 may be expressed in a common intermediate integration process specification language, such as the EIL 5900, or any equivalent representation.
The compilation and optimization process layer 6230 may translate the common intermediate integration process specification 6220 into execution specifications specific to one or more execution engine environments including, but not limited to, the high throughput engine 6240, the high concurrency engine 6242, one or more database stored procedures 6244, and/or one or more segments of generated computer program code 6246 in any conventional programming language, including but not limited to C, C++, C#, COBOL, and Java, including JavaBeans and Enterprise JavaBeans, or any other execution environment capable of translating, interpreting, and/or running general purpose computing algorithms suitable for the systems described herein. The compilation and optimization process layer 6230 may use knowledge of metadata stored in the metadata repository 5214 about data sources 102 and data targets 102, as well as information about computer system environment configurations, to make decisions on how to translate the integration process specification 6220 into an appropriate and efficient combination of instructions for the various engine and execution environments. The compilation and optimization process layer 6230 may in addition use any of the plethora of services 5414 available via the SOA 5201 to derive additional metadata, such as data profiling information, in order to make an informed decision about how best to compile and optimize the integration process specification 6220.
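By way of a non-limiting illustration, the engine selection performed by such a layer might follow decision logic of the following general shape; the Java enum, the metadata fields, and the thresholds are invented for illustration and do not represent the actual selection criteria of the layer 6230.

    // Hypothetical sketch of choosing an execution engine from process metadata.
    class EngineSelector {

        enum Engine { HIGH_THROUGHPUT, HIGH_CONCURRENCY, STORED_PROCEDURE, GENERATED_CODE }

        static class ProcessMetadata {
            long estimatedRowVolume;         // e.g., derived from data profiling services
            int expectedConcurrentRequests;  // e.g., from deployment configuration
            boolean runsEntirelyInDatabase;  // e.g., all sources and targets in one DBMS
        }

        Engine select(ProcessMetadata m) {
            if (m.runsEntirelyInDatabase) {
                return Engine.STORED_PROCEDURE;   // push the work down to the database
            }
            if (m.expectedConcurrentRequests > 100) {
                return Engine.HIGH_CONCURRENCY;   // many small concurrent requests
            }
            if (m.estimatedRowVolume > 10_000_000L) {
                return Engine.HIGH_THROUGHPUT;    // bulk data movement
            }
            return Engine.GENERATED_CODE;         // general-purpose fallback
        }
    }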
The different integration process execution engines and environments 6240, 6242, 6244, 6246, and the like may use predefined and/or user-defined common components and external data source and target connectivity 6250, including component implementations 5814 and connectors 6310.
Custom designed transformation components and connectivity elements that may be created by clients such as the business analyst 6200, the integration designer 6202, and the programmer 6204 may be stored in the repository 5224 as separate sharable elements in the repository catalog, from which they can be used by the compilation and optimization layer 6230 to generate executable integration processes. Referring to Fig. 63, a connectivity architecture is shown that may permit access by the architecture 5200 services 6300, components 6302, and engines 6304 to external resources through a connector component 6310. These services have been previously defined in connection with the above figures.
The role of the connector component 6310 is to encapsulate the interfaces of external resources and standardize these external resources for use with the components 6308 and services. The role of the connector may be to pass data with little transformation to other components 6308, such as a transformation component, where the transformation may be more efficiently completed. As discussed previously, transformations may apply to only a limited amount of the transmitted external data. Because the connector 6310 translates data only minimally, the connector 6310 may efficiently move data to the components and services. Connectors may be specialized components that offer design time interfaces through the component service 6302 or run time interfaces to the deployment engines 6304.
As part of the connector 6310, the adapter 6312 may be a sub-class of a component providing component interfaces to a specific external resource. Within a connector 6310 there may be a plurality of connections 6314, each a low level component that may manage the life cycle of a single external resource. A connection 6314 may have associated event listening and resource access that may read, write, send, receive, query, or update external resources.
The connection management interface 6312 may provide pooling functionality in support of the connections 6314. Pooling may be used primarily in high concurrency execution environments and may allow for smooth transfer of concurrent data. In high throughput execution environments, the same connection management interface 6312 may be used, but may maintain a persistent connection that allows high duty cycle reuse of a consistent connection. Connectors 6310 may provide a plurality of external resource lifecycle functions, such as installation, design time interfaces, metadata interfaces, compilation, validation, debugging, monitoring, instantiation, state management, pooling, caching, transactions, exception handling, data access, event detection, persistence, leasing, BOM production, data sampling, and sample data generation. Connectors 6310 may be described by models, may be stored in the repository 5224, and may access a semantic hub model 5614 for access to external resources. Connectors 6310 may wrap interfaces to external resources such as ODBC/JDBC, JCA, EJB, and web interfaces.
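By way of a non-limiting illustration, the pooling behavior described for high concurrency environments might be sketched as follows; the generic Java pool class and its operations are hypothetical assumptions, and a high throughput environment would instead hold a single persistent connection as described above.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Supplier;

    // Connections are pre-opened, borrowed, and returned for reuse rather
    // than being opened and closed for each request.
    class ConnectionPool<C> {
        private final BlockingQueue<C> idle = new LinkedBlockingQueue<>();

        ConnectionPool(Supplier<C> factory, int size) {
            for (int i = 0; i < size; i++) {
                idle.add(factory.get());   // pre-open connections to the resource
            }
        }

        // Borrow a connection, blocking until one is free.
        C acquire() throws InterruptedException { return idle.take(); }

        // Return a connection for reuse instead of closing it.
        void release(C connection) { idle.add(connection); }
    }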
While the invention has been described in connection with certain preferred embodiments, it should be understood that other embodiments would be recognized by one of ordinary skill in the art, and are intended to fall within the scope of this disclosure.