WO2019013824A1 - System and method for improving agility of data analytics - Google Patents

System and method for improving agility of data analytics

Info

Publication number
WO2019013824A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage system
volume
data
time
program
Prior art date
Application number
PCT/US2017/042230
Other languages
French (fr)
Inventor
Hideo Saito
Yuki SAKASHITA
Keisuke Hatasaki
Original Assignee
Hitachi, Ltd.
Priority date
Filing date
Publication date
Application filed by Hitachi, Ltd.
Priority to PCT/US2017/042230
Publication of WO2019013824A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Definitions

  • The present disclosure is directed generally to database management systems, and more specifically, to management of data analytics and related functions thereof.
  • Data from mission critical or business critical systems are copied to the analytics system of the business.
  • The copying is offloaded to the storage layer to limit the performance impact on the mission critical or business critical systems.
  • An example related art implementation includes tools that schedule the copying and the transforming of data.
  • An example of such a related art implementation can include systems and methods for scheduling data flow execution based on an arbitrary graph describing the desired data flow as described, for example, in U.S. Patent No. 8,639,847.
  • The data analytics may not be as agile as a data scientist would desire.
  • The data scientist may need to wait a long time to begin performing data analytics, because copying data from the mission critical or business critical systems to the analytics system takes a long time. If the copying is operated by a storage admin, the interaction between the data scientist and the storage admin adds to the wait time. Such additional wait time could be eliminated if the data scientist directly managed the copy operation.
  • The data scientist may want to focus on data analytics without having to manage the operation of the storage systems.
  • An ETL (Extract, Transform, Load) tool can create a schedule that completes copying and transforming data in a short time.
  • The ETL tool queries the database and the storage system holding the mission critical or business critical data in order to acquire information about how long it would take to copy and transform the data.
  • The ETL tool then creates the schedule based on a data transformation flow provided by the user and the information acquired from the database and the storage system.
  • Example implementations described herein are directed to providing a system and method that improve the agility of data analytics.
  • Aspects of the present disclosure can involve a server configured to manage a first storage system with a plurality of volumes, the first storage system communicatively coupled to a second storage system.
  • The server can involve a memory, configured to manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and a processor, configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume.
  • The execution of the ETL program causes the processor to be configured to request data related to the first volume to be processed from the database management program; request information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimate a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; request from the database management program, a size of the data to be processed from the first volume; estimate a second time for performing each of the one or more transformations based on the size of the data to be processed; and determine a schedule based on the first time and the second time.
  • Aspects of the present disclosure further include a method for managing a first storage system having a plurality of volumes, the first storage system communicatively coupled to a second storage system.
  • The method can involve managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program involving requesting data related to the first volume to be processed from the database management program; requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimating a first time for replicating the data from the second storage
  • Aspects of the present disclosure further include a non-transitory computer readable medium, storing instructions for managing a first storage system having a plurality of volumes, the first storage system communicatively coupled to a second storage system.
  • The instructions can involve managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program involving requesting data related to the first volume to be processed from the database management program; requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimating
  • Aspects of the present disclosure further include a system, which can involve one or more apparatuses configured to manage a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, the one or more apparatuses involving a memory, configured to manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and a processor, configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program causing the processor to be configured to request data related to the first volume to be processed from the database management program; request information from the first storage system or the second storage system regarding the
  • Aspects of the present disclosure further include a system, which can involve means for managing a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, means for managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; means for executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, means for executing of the ETL program involving means for requesting data related to the first volume to be processed from the database management program; means for requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; means for estimating
  • FIG. 1 illustrates an example physical configuration of the system in which the example implementations may be applied.
  • FIG. 2 illustrates the physical configuration of servers, in accordance with an example implementation.
  • FIG. 3 illustrates the physical configuration of a storage system, in accordance with an example implementation.
  • FIGS. 4 to 6 illustrate example logical layouts of server memory, in accordance with an example implementation.
  • FIG. 7 illustrates an example logical layout of storage system memory, in accordance with an example implementation.
  • FIG. 8 illustrates an overview of data flow, in accordance with an example implementation.
  • FIG. 9 illustrates the logical layout of Table Management Tables, in accordance with an example implementation.
  • FIG. 10 illustrates the logical layout of Transformation Flow Table, in accordance with an example implementation.
  • FIG. 11 illustrates the logical layout of Operation Management Table, in accordance with an example implementation.
  • FIG. 12 illustrates the logical layout of Schedule Management Table, in accordance with an example implementation.
  • FIG. 13 illustrates the logical layout of Volume Management Tables, in accordance with an example implementation.
  • FIG. 14 illustrates the logical layout of Replication Management Tables, in accordance with an example implementation.
  • FIG. 15 illustrates the flow of ETL processing executed by ETL Program, in accordance with an example implementation.
  • FIG. 16 illustrates the flow of volume information determination by ETL Program, in accordance with an example implementation.
  • FIG. 17 illustrates the flow of schedule determination executed by ETL Program, in accordance with an example implementation.
  • FIG. 18 shows the logical layout of storage system memory, in accordance with a second example implementation.
  • FIG. 19 illustrates the logical layout of Volume Management Table in accordance with the second example implementation.
  • FIG. 20 illustrates the logical layout of Load Management Table, in accordance with a second example implementation.
  • FIG. 21 illustrates the flow of volume information determination executed by ETL Program in accordance with a second example implementation.
  • FIG. 22 illustrates the flow of replication time estimation executed by ETL Program in accordance with the second implementation.
  • FIG. 23 illustrates the logical layout of Table Management Table in accordance with a third example implementation.
  • FIG. 24 illustrates the logical layout of Operation Management Table in accordance with the third example implementation.
  • FIG. 25 illustrates the flow of ETL processing executed by ETL Program 110B in accordance with the third example implementation.
  • FIG. 26 illustrates the flow of schedule determination executed by ETL Program in accordance with a third example implementation.
  • An ETL tool creates a schedule that completes copying and transforming data in a short time.
  • FIG. 1 illustrates an example physical configuration of the system in which the example implementations may be applied.
  • One or more Servers 1A are connected to a Storage System 2A via a Storage Area Network (SAN) 3A.
  • One or more Servers 1B and one or more Servers 1C are connected to a Storage System 2B via a SAN 3B.
  • Storage System 2A and Storage System 2B are connected to each other via a SAN 3C.
  • Server 1B and Server 1C are connected to each other via a Local Area Network (LAN) 4.
  • Server 1A uses SAN 3A to send Input/Output (I/O) requests to Storage System 2A.
  • Server 1B and Server 1C use SAN 3B to send I/O requests to Storage System 2B.
  • Storage System 2A uses SAN 3C to replicate data to Storage System 2B.
  • Server 1B and Server 1C use LAN 4 to communicate with each other.
  • FIG. 2 illustrates the physical configuration of Servers 1A, 1B and 1C, in accordance with an example implementation.
  • Each of Servers 1A, 1B, and 1C has a corresponding SAN port 13A, 13B, or 13C; LAN port 14A, 14B, or 14C; Central Processing Unit (CPU) 10A, 10B, or 10C; memory 11A, 11B, or 11C; and storage device 12A, 12B, or 12C.
  • Any one or a combination of several apparatuses can manage a first storage system such as storage system 2B that manages a plurality of volumes, which is communicatively coupled to a second storage system such as storage system 2A through SAN 3C.
  • Any one of memories 11A, 11B and 11C can, singly or in combination, manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes as illustrated in FIG. 10.
  • Processor(s) 10A, 10B, and 10C can be configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume as illustrated in the flow diagrams of FIGS. 15 to 17, 21 to 22 and 25 to 26.
  • The execution of the ETL program can involve requesting data related to the first volume to be processed from the database management program; requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimating a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; requesting from the database management program, a size of the data to be processed from the first volume; estimating a second time for performing each of the one or more transformations based on the size of the data to be processed; and determining a schedule based on the first time and the second time through the execution of the flow diagrams of FIG. 15 and/or FIG. 25.
  • The data related to the first volume can include a list of volumes of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system can involve a size of each volume in the list of volumes, and a bandwidth of each volume in the list of volumes, as illustrated in FIGS. 9-11 and 14, and as determined from the execution of the flow diagrams of FIG. 15 and/or FIG. 25.
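  • As a minimal sketch of the first-time estimate, the replication time for a list of volumes can be approximated from the size and the bandwidth of each volume. The names VolumeInfo and estimate_replication_time, the units, and the assumption that the volumes are replicated one after another are illustrative only and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class VolumeInfo:
    """Hypothetical record for one volume to be replicated."""
    volume_id: str
    size_bytes: int                # amount of data to replicate
    bandwidth_bytes_per_s: float   # bandwidth available to this volume


def estimate_replication_time(volumes):
    """Estimate the total replication time in seconds.

    Assumption: volumes are copied sequentially, so the per-volume
    times (size / bandwidth) are simply added together.
    """
    total = 0.0
    for vol in volumes:
        if vol.bandwidth_bytes_per_s <= 0:
            raise ValueError(f"non-positive bandwidth for {vol.volume_id}")
        total += vol.size_bytes / vol.bandwidth_bytes_per_s
    return total


if __name__ == "__main__":
    vols = [VolumeInfo("V1", 100, 10.0), VolumeInfo("V2", 50, 25.0)]
    print(estimate_replication_time(vols))
```

  • If the volumes were replicated in parallel over independent links, the maximum of the per-volume times would be the more natural estimate than the sum.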
  • The information regarding the requested data to be replicated from the second storage system to the first storage system involves a central processing unit (CPU) load for each volume of a list of volumes of the second storage system to be replicated to the first volume, wherein the estimation of the first time is based on the CPU load on each volume of the list of volumes of the second storage system to be replicated to the first volume, as illustrated and described with respect to FIGS. 18 to 22.
  • The data related to the first volume comprises a list of storage regions of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system can involve a size of each storage region in the list of storage regions, and a bandwidth for each storage region in the list of storage regions, as illustrated, for example in FIGS. 10, 14, and 23.
  • The ETL program can be configured to determine the schedule by determining an order for replication of the list of storage regions from the second storage system to the first storage system based on the first time and the second time, as illustrated and described in FIGS. 23 to 26.
  • The ETL program can be configured to determine the schedule based on the first time and the second time through a process that involves generating a plurality of schedules encompassing all combinations between execution of the one or more transformations and data to be transferred from the second storage system to the first storage system; estimating the first time and the second time for each schedule of the plurality of schedules; and selecting, from the plurality of schedules, the schedule with the shortest estimated time based on the first time and the second time, as determined, for example, through the execution of FIG. 17.
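  • The exhaustive search described above can be sketched as follows. This is an illustration under stated assumptions, not the flow of FIG. 17: the function names, the tuple layout of each operation, and in particular the timing model (replication and transformation each run serially, but replication may overlap with transformation) are hypothetical.

```python
from itertools import permutations


def makespan(order, ops):
    """Estimated finish time of one candidate schedule.

    ops maps an operation ID to (kind, duration, prerequisites), where
    kind is "Replicate" or "Transform". Returns None when the order
    places an operation before one of its prerequisites.
    """
    resource_free = {"Replicate": 0.0, "Transform": 0.0}
    finish = {}
    for op_id in order:
        kind, duration, prereqs = ops[op_id]
        if any(p not in finish for p in prereqs):
            return None  # invalid: a prerequisite has not run yet
        # Start when the resource is free and all prerequisites are done.
        start = max([resource_free[kind]] + [finish[p] for p in prereqs])
        finish[op_id] = start + duration
        resource_free[kind] = finish[op_id]
    return max(finish.values(), default=0.0)


def shortest_schedule(ops):
    """Try every ordering of the operations and keep the fastest valid one."""
    best = None
    for order in permutations(ops):
        t = makespan(order, ops)
        if t is not None and (best is None or t < best[1]):
            best = (list(order), t)
    return best


if __name__ == "__main__":
    ops = {
        "R1": ("Replicate", 10.0, []),
        "R2": ("Replicate", 2.0, []),
        "T1": ("Transform", 5.0, ["R1"]),
        "T2": ("Transform", 5.0, ["R2"]),
        "T3": ("Transform", 1.0, ["T1", "T2"]),
    }
    print(shortest_schedule(ops))
```

  • Enumerating every permutation grows factorially with the number of operations, so a sketch like this is only practical for small transformation flows.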
  • FIG. 3 illustrates the physical configuration of Storage Systems 2A and 2B, in accordance with an example implementation.
  • Storage Systems 2A and 2B can include their corresponding SAN port 24A and 24B, corresponding CPU 21A and 21B, corresponding memory 22A and 22B, and corresponding storage device 23A and 23B.
  • The storage devices 23A and 23B can be used to store data across either a plurality of volumes, and/or a plurality of storage regions or storage areas depending on the desired implementation.
  • FIG. 4 illustrates an example logical layout of Memory 11A, in accordance with an example implementation.
  • Memory 11A can include Application Program 110A, Database Management Program 111A and Table Management Table 112A.
  • Database Management Program 111A provides a database to Application Program 110A.
  • Application Program 110A accesses tables provided by Database Management Program 111A.
  • Database Management Program 111A manages the tables using Table Management Table 112A and stores table data in Storage System 2A by sending read and write requests to Storage System 2A.
  • FIG. 5 illustrates the logical layout of Memory 11B, in accordance with an example implementation.
  • Memory 11B contains ETL Program 110B, Database Management Program 111B, Table Management Table 112B, Transformation Flow Table 113B, Operation Management Table 114B and Schedule Management Table 115B.
  • ETL Program 110B transforms data based on information stored in Transformation Flow Table 113B, Operation Management Table 114B and Schedule Management Table 115B.
  • Database Management Program 111B provides a database to ETL Program 110B.
  • ETL Program 110B accesses tables provided by Database Management Program 111B.
  • Database Management Program 111B manages the tables using Table Management Table 112B and stores table data in Storage System 2B by sending read and write requests to Storage System 2B.
  • FIG. 6 shows the logical layout of Memory 11C, in accordance with an example implementation.
  • Memory 11C contains Application Program 110C and File System Program 111C.
  • File System Program 111C provides a file system to ETL Program 110B and Application Program 110C.
  • ETL Program 110B and Application Program 110C access files provided by File System Program 111C.
  • File System Program 111C stores file data in Storage System 2B by sending read and write requests to Storage System 2B.
  • FIG. 7 illustrates an example logical layout of Memory 22A and 22B, in accordance with an example implementation.
  • Memory 22A contains Configuration Management Program 220A, I/O Processing Program 221A, Replication Management Program 222A, Volume Management Table 223A, Replication Management Table 224A and Cache Area 225A.
  • Configuration Management Program 220A is executed by Central Processing Unit (CPU) 21A and manages the configuration of Storage System 2A using Volume Management Table 223A.
  • I/O Processing Program 221A is executed by CPU 21A and processes I/O requests from Server 1A.
  • Cache Area 225A is used by I/O Processing Program 221A to temporarily store data being read from or written to Storage Device 23A.
  • Replication Management Program 222A is executed by CPU 21A and manages replication of data between Storage System 2A and other storage systems using Replication Management Table 224A.
  • Memory 22B contains Configuration Management Program 220B, I/O Processing Program 221B, Replication Management Program 222B, Volume Management Table 223B, Replication Management Table 224B and Cache Area 225B.
  • Configuration Management Program 220B is executed by CPU 21B and manages the configuration of Storage System 2B.
  • I/O Processing Program 221B is executed by CPU 21B and processes I/O requests from Servers 1B and 1C.
  • Replication Management Program 222B is executed by CPU 21B and manages replication of data between Storage System 2B and other storage systems.
  • FIG. 8 illustrates an overview of data flow, in accordance with an example implementation.
  • Application Program 110A writes to one or more tables provided by Database Management Program 111A.
  • Database Management Program 111A writes to one or more Table Data Volumes 200A provided by Storage System 2A.
  • Storage System 2A replicates data written to Table Data Volume 200A to Table Data Volume 200B provided by Storage System 2B.
  • ETL Program 110B reads from one or more tables provided by Database Management Program 111B.
  • Database Management Program 111B reads from Table Data Volume 200B.
  • ETL Program 110B transforms the data read from Database Management Program 111B through the use of one or more transformations associated with the read data, and writes to the file system provided by File System Program 111C.
  • File System Program 111C writes to File Data Volume 201B.
  • Application Program 110C reads from the file system provided by File System Program 111C.
  • FIG. 9 illustrates the logical layout of Table Management Tables 112A and 112B, in accordance with an example implementation.
  • Table Management Table 112A can include multiple entries, each entry corresponding to a table managed by Database Management Program 111A. Each entry can involve Table ID 1120A, Table Size 1121A and Volume Info 1122A.
  • Table ID 1120A is used by Database Management Program 111A to identify a table managed by Database Management Program 111A.
  • Table Size 1121A is used by Database Management Program 111A to manage the size of the table corresponding to Table ID 1120A.
  • Volume Info 1122A is used by Database Management Program 111A to identify one or more volumes in which the table corresponding to Table ID 1120A is stored. Volume Info 1122A may include, for example, the ID of a storage system and the ID of a volume within that storage system.
  • The logical layout of Table Management Table 112B is essentially the same as that of Table Management Table 112A and can involve Table ID 1120B, Table Size 1121B and Volume Info 1122B. The main difference is that each entry of Table Management Table 112B corresponds to a table managed by Database Management Program 111B (instead of Database Management Program 111A).
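  • A Table Management Table of this shape can be sketched as a list of records, together with the two lookups an ETL program would plausibly need: which volumes hold a set of tables, and how much table data is to be processed. The class and function names, the byte unit, and the tuple form of the volume information are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    """Hypothetical in-memory form of one Table Management Table entry."""
    table_id: str
    table_size: int      # size of the table, assumed here to be in bytes
    volume_info: list    # list of (storage_system_id, volume_id) tuples


def volumes_for_tables(table_management_table, table_ids):
    """Return the distinct volumes holding the given tables, in first-seen order."""
    volumes = []
    for entry in table_management_table:
        if entry.table_id in table_ids:
            for vol in entry.volume_info:
                if vol not in volumes:
                    volumes.append(vol)
    return volumes


def total_table_size(table_management_table, table_ids):
    """Return the combined size of the given tables."""
    return sum(e.table_size for e in table_management_table
               if e.table_id in table_ids)


if __name__ == "__main__":
    table = [
        TableEntry("T1", 100, [("2A", "V1")]),
        TableEntry("T2", 50, [("2A", "V1"), ("2A", "V2")]),
        TableEntry("T3", 70, [("2A", "V3")]),
    ]
    print(volumes_for_tables(table, {"T1", "T2"}))
    print(total_table_size(table, {"T1", "T2"}))
```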
  • FIG. 10 illustrates the logical layout of Transformation Flow Table 113B, in accordance with an example implementation.
  • Transformation Flow Table 113B can include multiple entries, each entry corresponding to a transformation to be performed by ETL Program 110B. Each entry can include Transformation ID 1130B, Transformation Type 1131B, Source Data Info 1132B, Destination Data Info 1133B, and Prerequisite Transformation ID 1134B. Transformation ID 1130B is used by ETL Program 110B to identify a transformation to be performed by ETL Program 110B. Transformation Type 1131B is used by ETL Program 110B to identify the type of the transformation corresponding to Transformation ID 1130B.
  • Example values of Transformation Type 1131B include "Convert" and "Merge". "Convert" denotes a transformation which converts a table to a file. "Merge" denotes a transformation which merges two files into one file.
  • Source Data Info 1132B is used by ETL Program 110B to locate the source data of the transformation corresponding to Transformation ID 1130B.
  • When the source data is a table, Source Data Info 1132B may include information to locate the database providing the table and information to locate the table within that database.
  • When the source data is a file, Source Data Info 1132B may include information to locate the filesystem providing the file and information to locate the file within that filesystem.
  • Destination Data Info 1133B is used by ETL Program 110B to locate the destination data of the transformation corresponding to Transformation ID 1130B.
  • When the destination data is a table, Destination Data Info 1133B may include information to locate the database providing the table and information to locate the table within that database.
  • When the destination data is a file, Destination Data Info 1133B may include information to locate the filesystem providing the file and information to locate the file within that filesystem.
  • Prerequisite Transformation ID 1134B is used by ETL Program 110B to identify one or more transformations that must be performed before the transformation corresponding to Transformation ID 1130B can be performed.
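  • The prerequisite relationship makes the Transformation Flow Table a dependency graph. The sketch below models an entry as a record and derives one valid execution order with the standard-library graphlib.TopologicalSorter (Python 3.9 or later). The class name, field names, and sample paths are hypothetical; the disclosure does not prescribe this representation or this ordering method.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class TransformationEntry:
    """Hypothetical in-memory form of one Transformation Flow Table entry."""
    transformation_id: str
    transformation_type: str       # "Convert" or "Merge"
    source_data_info: list         # locations of the source table(s) or file(s)
    destination_data_info: str     # location of the destination data
    prerequisite_ids: list = field(default_factory=list)


def execution_order(flow_table):
    """Return transformation IDs so that every prerequisite comes first.

    graphlib raises CycleError if the prerequisites form a cycle.
    """
    graph = {e.transformation_id: set(e.prerequisite_ids) for e in flow_table}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    flow = [
        TransformationEntry("T1", "Convert", ["db/table_a"], "fs/a.csv"),
        TransformationEntry("T2", "Convert", ["db/table_b"], "fs/b.csv"),
        TransformationEntry("T3", "Merge", ["fs/a.csv", "fs/b.csv"],
                            "fs/ab.csv", ["T1", "T2"]),
    ]
    print(execution_order(flow))
```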
  • FIG. 11 shows the logical layout of Operation Management Table 114B, in accordance with an example implementation.
  • Operation Management Table 114B can include multiple entries, each entry corresponding to an operation to be performed by ETL Program 110B. Each entry can include Operation ID 1140B, Transformation ID 1141B and Operation Type 1142B.
  • Operation ID 1140B is used by ETL Program 110B to identify an operation to be performed by ETL Program 110B.
  • Transformation ID 1141B is used by ETL Program 110B to identify the transformation that is related to the operation corresponding to Operation ID 1140B.
  • Operation Type 1142B is used by ETL Program 110B to identify the type of the operation corresponding to Operation ID 1140B.
  • Operation Type 1142B is either "Replicate" or "Transform".
  • "Replicate" denotes an operation which replicates Table Data Volume 200A to Table Data Volume 200B.
  • "Transform" denotes an operation which executes the transformation corresponding to Transformation ID 1141B.
  • FIG. 12 illustrates the logical layout of Schedule Management Table 115B, in accordance with an example implementation.
  • Schedule Management Table 115B stores Schedule 1150B and Estimated Execution Time 1151B.
  • Schedule 1150B is used by ETL Program 110B to store the schedule of operations that ETL Program 110B will execute.
  • Schedule 1150B may include, for example, a list of operation IDs corresponding to Operation ID 1140B.
  • Estimated Execution Time 1151B is used by ETL Program 110B to store the estimated execution time of the transformation schedule corresponding to Schedule 1150B.
  • The estimated execution time can be the sum of the time taken for each operation in the Schedule 1150B.
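  • As a small worked sketch of that sum, a schedule can be held as a list of operation IDs and the per-operation estimates as a mapping from operation ID to seconds. The function name, the mapping, and the sample values are hypothetical.

```python
def estimated_execution_time(schedule, operation_times):
    """Sum the estimated time of every operation listed in the schedule.

    schedule: list of operation IDs, in execution order.
    operation_times: dict mapping operation ID to estimated seconds.
    """
    return sum(operation_times[op_id] for op_id in schedule)


if __name__ == "__main__":
    schedule = ["O1", "O2", "O3"]
    operation_times = {"O1": 12.0, "O2": 5.0, "O3": 1.5}
    print(estimated_execution_time(schedule, operation_times))
```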
  • FIG. 13 illustrates the logical layout of Volume Management Tables 223A and 223B, in accordance with an example implementation.
  • Volume Management Table 223A can include multiple entries, each entry corresponding to a volume of Storage System 2A. Each entry can include Volume ID 2230A and Capacity 2231A.
  • Volume ID 2230A is used by Configuration Management Program 220A to identify a volume within Storage System 2A.
  • Capacity 2231A is the capacity of the volume corresponding to Volume ID 2230A.
  • Volume Management Table 223B can include multiple entries, each entry corresponding to a volume of Storage System 2B. Each entry can include Volume ID 2230B and Capacity 2231B. Volume ID 2230B is used by Configuration Management Program 220B to identify a volume within Storage System 2B. Capacity 2231B is the capacity of the volume corresponding to Volume ID 2230B.
  • FIG. 14 illustrates the logical layout of Replication Management Tables 224A and 224B, in accordance with an example implementation.
  • Replication Management Table 224A can include multiple entries, each entry corresponding to a volume of Storage System 2A that is either the source or the destination of a replicated volume pair. Each entry can include Pair ID 2240A, Source Volume Info 2241A, Destination Volume Info 2242A, Pair State 2243A, Difference Info 2244A and Link Info 2245A.
  • Pair ID 2240A is used by Replication Management Program 222A to identify a replicated volume pair within Storage System 2A.
  • Source Volume Info 2241A is used by Replication Management Program 222A to identify the source volume of the replicated volume pair corresponding to Pair ID 2240A.
  • Source Volume Info 2241A may include, for example, the ID of the storage system providing the source volume and the ID of the source volume within that storage system.
  • Destination Volume Info 2242A is used by Replication Management Program 222A to identify the destination volume of the replicated volume pair corresponding to Pair ID 2240A.
  • Destination Volume Info 2242A may include, for example, the ID of the storage system providing the destination volume and the ID of the destination volume within that storage system.
  • Pair State 2243A is used by Replication Management Program 222A to manage the state of the replicated volume pair corresponding to Pair ID 2240A.
  • Example values of Pair State 2243A include "PAIR", "SUSP" and "COPY".
  • PAIR denotes a state in which the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A are synchronized. In this state, when Storage System 2A receives a write request from Server 1A, I/O Processing Program 221A stores write data received from Server 1A in Storage Device 23A and replicates the write data to Storage System 2B before sending a completion response to Server 1A.
  • SUSP denotes a state in which replication between the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A is suspended and there is a difference between the data stored in the source and the destination volumes.
  • In this state, I/O Processing Program 221A stores write data received from Server 1A in Storage Device 23A and updates Difference Info 2244A before sending a completion response to Server 1A.
  • COPY denotes a state in which the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A are in the process of being resynchronized. In this state, Replication Management Program 222A identifies the regions of the source volume that are different from the destination volume by referencing Difference Info 2244A, and replicates those regions from the source volume to the destination volume.
  • Difference Info 2244A is used by Replication Management Program 222A to manage the difference between the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A.
  • Difference Info 2244A may include, for example, "ON" or "OFF" for each region of the source volume. "ON" denotes that for the region the data stored in the source and the destination volumes are different. "OFF" denotes that for the region the data stored in the source and the destination volumes are the same.
  • Link Info 2245A is used by Replication Management Program 222A to manage information about the link used to replicate data between the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A.
  • Link Info 2245A may include, for example, the speed of the link (e.g., the bandwidth of the link).
  • The logical layout of Replication Management Table 224B is essentially the same as that of Replication Management Table 224A, and each entry can include Pair ID 2240B, Source Volume Info 2241B, Destination Volume Info 2242B, Pair State 2243B, Difference Info 2244B and Link Info 2245B. The main difference is that each entry of Replication Management Table 224B corresponds to a replicated volume pair managed by Replication Management Program 222B (instead of Replication Management Program 222A).
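  • The pair states and the per-region difference tracking described above can be illustrated with a minimal sketch. The class and field names below (`ReplicatedPair`, `difference`, `resync`) are hypothetical and are not part of the disclosure; the sketch also assumes that writes arriving during resynchronization are only recorded as differences, which the text does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class PairState(Enum):
    PAIR = "PAIR"   # source and destination are synchronized
    SUSP = "SUSP"   # replication suspended; differences are tracked
    COPY = "COPY"   # resynchronization in progress


@dataclass
class ReplicatedPair:
    """Hypothetical in-memory stand-in for one Replication Management Table entry."""
    pair_id: int
    region_count: int
    state: PairState = PairState.PAIR
    # Difference Info: True ("ON") means the region differs between the volumes.
    difference: list = field(default_factory=list)
    source: dict = field(default_factory=dict)       # region index -> data
    destination: dict = field(default_factory=dict)  # region index -> data

    def __post_init__(self):
        if not self.difference:
            self.difference = [False] * self.region_count

    def write(self, region: int, data: bytes) -> None:
        """Handle a host write to the source volume."""
        self.source[region] = data
        if self.state is PairState.PAIR:
            # Replicate the write before the completion response is sent.
            self.destination[region] = data
        else:
            # SUSP (and, by assumption, COPY): only record the difference.
            self.difference[region] = True

    def resync(self) -> int:
        """Copy only the regions marked "ON", then return to PAIR."""
        self.state = PairState.COPY
        copied = 0
        for region, differs in enumerate(self.difference):
            if differs:
                self.destination[region] = self.source[region]
                self.difference[region] = False
                copied += 1
        self.state = PairState.PAIR
        return copied
```

  • In this sketch, the number of regions marked "ON" bounds the amount of data that a resynchronization has to move, which is why the difference can serve as an input to a replication time estimate.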
  • FIG. 15 illustrates the flow of ETL processing executed by ETL Program 110B, in accordance with an example implementation.
  • ETL Program 110B determines the database tables to be transformed by referencing Transformation Flow Table 113B.
  • ETL Program 110B determines the Table Data Volumes 200B storing the database tables determined in the flow at 500 by sending a request to Database Management Program 111B.
  • When Database Management Program 111B receives the request, it determines the volumes by referencing Table Management Table 112B and sends a list of the determined volumes to ETL Program 110B.
  • ETL Program 110B receives the list of the determined volumes from Database Management Program 111B.
  • ETL Program 110B determines information about each Table Data Volume 200B determined in the flow at 501.
  • Information determined about each volume includes, for example, the capacity of each volume and the speed at which each volume can be replicated (e.g., the bandwidth available for each volume to be replicated).
  • ETL Program 110B estimates the time required to replicate data to each Table Data Volume 200B determined in the flow at 501 from the Table Data Volume 200A corresponding to the Table Data Volume 200B using the information determined in the flow at 502.
  • For example, ETL Program 110B first determines the capacity of the Table Data Volume 200B based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the determined replication speed.
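  • The replication time estimate described above is a single division per volume. A minimal sketch follows; the function names, the volume identifiers and the units (bytes and bytes per second) are assumptions made for illustration only.

```python
from typing import Dict, Tuple


def estimate_replication_time(capacity_bytes: float, speed_bytes_per_sec: float) -> float:
    """Estimated replication time = capacity / replication speed."""
    if speed_bytes_per_sec <= 0:
        raise ValueError("replication speed must be positive")
    return capacity_bytes / speed_bytes_per_sec


def estimate_all(volumes: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map each volume ID to its estimate, given (capacity, speed) per volume."""
    return {
        volume_id: estimate_replication_time(capacity, speed)
        for volume_id, (capacity, speed) in volumes.items()
    }


# Hypothetical volumes: 100 GB over a 1 GB/s link, 50 GB over a 0.5 GB/s link.
estimates = estimate_all({
    "vol-1": (100e9, 1e9),
    "vol-2": (50e9, 0.5e9),
})
```

  • If the storage system reports the difference between the paired volumes instead of the full capacity, the same division applies with the smaller value as the numerator.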
  • ETL Program 110B determines the size of each table determined to be transformed in the flow at 500 by sending a request to Database Management Program 111B.
  • When Database Management Program 111B receives the request, it determines the size of each table by referencing Table Management Table 112B and sends the size of each table to ETL Program 110B.
  • ETL Program 110B receives the size of each table from Database Management Program 111B.
  • ETL Program 110B estimates the time required to perform each table transformation by dividing the size of each table by a predetermined transformation speed.
  • ETL Program 110B determines a schedule by executing the flow as described in FIG. 17.
  • ETL Program 110B executes the schedule determined in the flow at 506 by referencing Schedule Management Table 115B. For each operation in the schedule, ETL Program 110B looks up the operation in Operation Management Table 114B. If the Operation Type 1142B of the entry corresponding to the operation is "Replicate", ETL Program 110B instructs Storage System 2B to replicate. If the Operation Type 1142B of the entry corresponding to the operation is "Transform", ETL Program 110B performs a transformation.
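  • The schedule execution step above amounts to a lookup followed by a dispatch on the operation type. The sketch below assumes a dictionary-based stand-in for Operation Management Table 114B and caller-supplied `replicate` and `transform` callbacks; all of these names are hypothetical.

```python
from typing import Callable, Dict, Iterable


def execute_schedule(
    schedule: Iterable[int],
    operations: Dict[int, Dict[str, str]],
    replicate: Callable[[str], None],
    transform: Callable[[str], None],
) -> None:
    """Run each operation of the schedule in order, dispatching on its type."""
    for operation_id in schedule:
        entry = operations[operation_id]  # look up the operation's table entry
        if entry["type"] == "Replicate":
            replicate(entry["target"])    # e.g. instruct the storage system
        elif entry["type"] == "Transform":
            transform(entry["target"])    # e.g. run the table transformation
        else:
            raise ValueError(f"unknown operation type: {entry['type']}")
```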
  • FIG. 16 illustrates the flow of volume information determination by ETL Program 110B, in accordance with an example implementation. This flow corresponds to the flow at 502 from FIG. 15.
  • ETL Program 110B sends a request to Storage System 2B for capacity information of each Table Data Volume 200B determined in the flow at 501.
  • Configuration Management Program 220B determines the capacity of the Table Data Volume 200B by referencing Volume Management Table 223B and sends the determined capacity as capacity information to ETL Program 110B.
  • Alternatively, Configuration Management Program 220B determines the difference between Data Volume 200A and Data Volume 200B by referencing Replication Management Table 224B, and sends the determined difference as capacity information to ETL Program 110B.
  • ETL Program 110B receives the capacity information sent by Configuration Management Program 220B in the flow at 510.
  • ETL Program 110B sends a request to Storage System 2B for replication speed information of each Table Data Volume 200B determined in the flow at 501.
  • Replication Management Program 222B determines the replication speed information by referencing Replication Management Table 224B and sends the determined replication speed information to ETL Program 110B.
  • The replication speed information determined and sent by Replication Management Program 222B is, for example, Link Info 2245B corresponding to each volume.
  • ETL Program 110B receives the replication speed information sent by Replication Management Program 222B in the flow at 511.
  • FIG. 17 illustrates the flow of schedule determination executed by ETL Program 110B, in accordance with an example implementation. This flow corresponds to the flow at 506 in FIG. 15.
  • ETL Program 110B initializes the estimate for the execution time of the shortest transformation schedule by storing an initial time in Estimated Execution Time 1151B.
  • The initial time stored in Estimated Execution Time 1151B may be, for example, a time that is longer than any transformation schedule would take.
  • Alternatively, the initial time stored in Estimated Execution Time 1151B may be a value that indicates that Estimated Execution Time 1151B is not valid.
  • ETL Program 110B creates all possible schedules which include volume replication operations and transformation operations.
  • A schedule is possible if two conditions are met: (1) every volume replication operation is performed before the transformation operation to which it is related and (2) every transformation operation is performed after all of its prerequisite transformation operations.
  • ETL Program 110B starts a loop that executes the flows at 523 to 526 for each schedule created in the flow at 521.
  • ETL Program 110B estimates the execution time of the schedule using volume replication times estimated in the flow at 503 and the transformation times estimated in the flow at 505.
  • ETL Program 110B determines if the execution time of the schedule is shorter than the current estimate for the execution time of the shortest schedule. In order to make this determination, ETL Program 110B compares the execution time estimated in the flow at 523 to Estimated Execution Time 1151B. If the execution time of the schedule is shorter than the current estimate for the execution time (Yes), then the flow proceeds to 525, otherwise (No), the flow proceeds to 527 to end the loop.
  • ETL Program 110B updates Schedule 1150B with the schedule.
  • ETL Program 110B updates Estimated Execution Time 1151B with the execution time estimated in the flow at 523.
  • ETL Program 110B ends the loop started in the flow at 522.
  • By executing a loop over all possible schedules, ETL Program 110B is able to determine the shortest schedule. ETL Program 110B may, alternatively, execute a loop over only some of the possible schedules in order to determine a short but not necessarily shortest schedule.
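  • The schedule determination loop can be sketched as an exhaustive search. The sketch below is illustrative only and rests on assumptions that are not stated in the text: operations are kept in a hypothetical dictionary with `type`, `time` and `prereqs` fields, replication operations share one link, transformation operations share one processor, and replication and transformation may overlap in time, so that the execution time of a schedule is the finish time of its last operation.

```python
from itertools import permutations


def is_possible(order, ops):
    """Both conditions reduce to: every prerequisite precedes its dependent."""
    position = {op: i for i, op in enumerate(order)}
    return all(position[p] < position[op] for op in order for p in ops[op]["prereqs"])


def estimate_time(order, ops):
    """Assumed model: one link for replication, one processor for transforms."""
    resource_free = {"Replicate": 0.0, "Transform": 0.0}
    finish = {}
    for op in order:
        kind = ops[op]["type"]
        ready = max((finish[p] for p in ops[op]["prereqs"]), default=0.0)
        start = max(resource_free[kind], ready)
        finish[op] = start + ops[op]["time"]
        resource_free[kind] = finish[op]
    return max(finish.values(), default=0.0)


def shortest_schedule(ops):
    best_time = float("inf")  # longer than any schedule would take
    best = None
    for order in permutations(ops):       # create candidate schedules
        if not is_possible(order, ops):
            continue
        time = estimate_time(order, ops)  # estimate this schedule
        if time < best_time:              # keep it only if strictly shorter
            best, best_time = list(order), time
    return best, best_time


# Hypothetical flow: T1 needs R1, T2 needs R2, T3 needs T1 and T2.
ops = {
    "R1": {"type": "Replicate", "time": 10.0, "prereqs": []},
    "R2": {"type": "Replicate", "time": 4.0, "prereqs": []},
    "T1": {"type": "Transform", "time": 3.0, "prereqs": ["R1"]},
    "T2": {"type": "Transform", "time": 5.0, "prereqs": ["R2"]},
    "T3": {"type": "Transform", "time": 2.0, "prereqs": ["T1", "T2"]},
}
best, best_time = shortest_schedule(ops)
```

  • Enumerating every permutation grows factorially with the number of operations, which is one reason to loop over only some of the possible schedules when the flow is large.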
  • In a second example implementation, the ETL Program considers the load of the replication source to adjust replication speed and prevent an overload of the replication source.
  • FIG. 18 illustrates the logical layout of Storage System Memory 22A in accordance with a second example implementation.
  • The logical layout of Memory 22A is the same as in the first example implementation described above, except that Memory 22A contains an additional table, Load Management Table 226A.
  • Load Management Table 226A is used by Configuration Management Program 220A to manage information about the load of Storage System 2A.
  • FIG. 19 illustrates the logical layout of Volume Management Table 223A in accordance with the second example implementation.
  • Volume Management Table 223A is the same as Volume Management Table 223A in accordance with the first example implementation except that each entry of Volume Management Table 223A contains an additional field, CPU ID 2232A.
  • CPU ID 2232A is used by Configuration Management Program 220A to identify the CPU 21 that processes I/O requests to the volume corresponding to Volume ID 2230A.
  • FIG. 20 illustrates the logical layout of Load Management Table 226A, in accordance with a second example implementation.
  • Load Management Table 226A can include multiple entries, each entry corresponding to a CPU 21 of Storage System 2A. Each entry can include CPU ID 2260A and Load 2261A.
  • CPU ID 2260A is used by Configuration Management Program 220A to identify a CPU 21 within Storage System 2A.
  • Load 2261A is the load of the CPU 21 corresponding to CPU ID 2260A.
  • FIG. 21 illustrates the flow of volume information determination executed by ETL Program 110B in accordance with a second example implementation.
  • The flow is the same as in the first example implementation except there are two additional flows at 514 and 515.
  • ETL Program 110B sends a request to Storage System 2B for load information of the Table Data Volume 200A corresponding to each Table Data Volume 200B determined in the flow at 501.
  • Configuration Management Program 220B determines the load information of each Table Data Volume 200A by sending a request to Storage System 2A.
  • Configuration Management Program 220A determines the load information of the Table Data Volume 200A by referencing Volume Management Table 223A and Load Management Table 226A and sends the determined load information to Configuration Management Program 220B.
  • The load information determined and sent by Configuration Management Program 220A is, for example, the Load 2261A corresponding to the CPU 21A that processes I/O requests to the Table Data Volume 200A.
  • Configuration Management Program 220B sends the load information to ETL Program 110B.
  • ETL Program 110B receives the load information sent by Configuration Management Program 220B in the flow at 514.
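  • The load lookup described above joins two tables on the CPU ID. A minimal sketch, using hypothetical dictionary stand-ins for Volume Management Table 223A and Load Management Table 226A and an assumed load scale of 0.0 to 1.0:

```python
from typing import Dict


def load_of_volume(
    volume_id: str,
    volume_table: Dict[str, Dict[str, int]],
    load_table: Dict[int, float],
) -> float:
    """Return the load of the CPU that processes I/O requests to the volume."""
    cpu_id = volume_table[volume_id]["cpu_id"]  # volume entry -> CPU ID
    return load_table[cpu_id]                   # CPU ID -> load


# Hypothetical table contents.
volume_table = {
    "vol-a": {"capacity": 100, "cpu_id": 0},
    "vol-b": {"capacity": 200, "cpu_id": 1},
}
load_table = {0: 0.85, 1: 0.20}
```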
  • FIG. 22 illustrates the flow of replication time estimation executed by ETL Program 110B in accordance with the second example implementation. In the second example implementation, this flow is executed between the flows at 502 and 504 in place of the flow at 503.
  • ETL Program 110B starts a loop that executes the flow from 531 through 534 for each Table Data Volume 200B determined in the flow at 501.
  • ETL Program 110B determines if the load of the Table Data Volume 200A corresponding to the Table Data Volume 200B is high. ETL Program 110B makes this determination based on the load information received in the flow at 513. For example, ETL Program 110B compares the load of the CPU 21A that processes I/O requests to the Table Data Volume 200A to a predetermined threshold, and determines that the load is high if the load of the CPU 21A is higher than the predetermined threshold. If the load is considered to be high (Yes), then the flow proceeds to 532, otherwise (No), the flow proceeds to 534.
  • ETL Program 110B estimates the time required to replicate data to the Table Data Volume 200B from the Table Data Volume 200A corresponding to the Table Data Volume 200B based on the replication speed between Table Data Volume 200A and Table Data Volume 200B reduced by a predetermined method.
  • For example, ETL Program 110B first determines the capacity of the Table Data Volume 200B based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Then, ETL Program 110B computes a reduced replication speed by dividing the determined replication speed by a predetermined factor. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the reduced replication speed.
  • ETL Program 110B ends the loop started in the flow at 530.
  • ETL Program 110B estimates the time required to replicate data to the Table Data Volume 200B from the Table Data Volume 200A corresponding to the Table Data Volume 200B based on the replication speed between Table Data Volume 200A and Table Data Volume 200B.
  • For example, ETL Program 110B first determines the capacity of the Table Data Volume 200B based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the determined replication speed.
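  • The two branches of the load-aware estimate differ only in the speed used as the divisor. A minimal sketch follows; the default threshold of 0.8 and the reduction factor of 2.0 are placeholder values chosen for illustration, since the text only states that both are predetermined.

```python
def estimate_replication_time_with_load(
    capacity: float,
    replication_speed: float,
    source_load: float,
    load_threshold: float = 0.8,   # assumed value; "predetermined" in the text
    reduction_factor: float = 2.0, # assumed value; "predetermined" in the text
) -> float:
    """Estimate replication time, throttling when the source CPU load is high."""
    if replication_speed <= 0 or reduction_factor <= 0:
        raise ValueError("speed and reduction factor must be positive")
    if source_load > load_threshold:
        # High load: plan for a reduced replication speed to protect the source.
        replication_speed = replication_speed / reduction_factor
    return capacity / replication_speed
```

  • With these placeholder values, a volume of capacity 100 replicated at speed 10 is estimated at 10 time units under low load and 20 time units under high load.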
  • In a third example implementation, an ETL Program makes a short schedule by determining the order in which storage regions are copied within each volume.
  • FIG. 23 illustrates the logical layout of Table Management Table 112B in accordance with a third example implementation.
  • The logical layout of Table Management Table 112B is the same as in the first example implementation except that each entry of Table Management Table 112B has Storage Region ID 1123B and Storage Region Info 1124B instead of Volume Info 1122B.
  • Storage Region ID 1123B is used by Database Management Program 111B to identify a storage region storing a part or all of the table corresponding to Table ID 1120B.
  • Storage Region Info 1124B is access information for the storage region corresponding to Storage Region ID 1123B.
  • Storage Region Info 1124B may include, for example, the ID of a storage system, the ID of a volume within that storage system, and a range of blocks within that volume.
  • FIG. 24 illustrates the logical layout of Operation Management Table 114B in accordance with the third example implementation.
  • The logical layout of Operation Management Table 114B is the same as in the first example implementation except that each entry of Operation Management Table 114B has an additional field, Storage Region ID 1143B.
  • Storage Region ID 1143B is used by ETL Program 110B to identify the storage region that is to be replicated.
  • For operations in which Storage Region ID 1143B is not used by ETL Program 110B, "N/A" is stored in Storage Region ID 1143B.
  • FIG. 25 illustrates the flow of ETL processing executed by ETL Program 110B in accordance with the third example implementation.
  • The flow is the same as the flow of the first example implementation, except that the flow at 501 is replaced by the flow at 501' and the flow at 503 is replaced by the flow at 503'.
  • ETL Program 110B determines the storage regions storing the database tables determined in the flow at 500 by sending a request to Database Management Program 111B.
  • Database Management Program 111B determines the storage regions by referencing Table Management Table 112B and sends a list of the storage regions to ETL Program 110B.
  • ETL Program 110B receives the list of the determined storage regions from Database Management Program 111B.
  • ETL Program 110B estimates the time required to replicate data to each storage region determined in the flow at 501' from Storage System 2A using the information determined in the flow at 502. For example, first, ETL Program 110B determines the capacity of the storage region based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the determined replication speed.
  • FIG. 26 illustrates the flow of schedule determination executed by ETL Program 110B in accordance with a third example implementation.
  • The flow is the same as in the first example implementation, except that the flow at 521 is replaced by the flow at 521' and the flow at 523 is replaced by the flow at 523'.
  • ETL Program 110B creates all possible schedules which include storage region replication operations and transformation operations.
  • A schedule is possible if two conditions are met: (1) every storage region replication operation is performed before the transformation operation to which it is related and (2) every transformation operation is performed after all of its prerequisite transformation operations.
  • ETL Program 110B estimates the execution time of the schedule using storage region replication times estimated in the flow at 503' and the transformation times estimated in the flow at 505.
  • Example implementations may also relate to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs.
  • Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium.
  • A computer-readable storage medium may involve tangible mediums such as, but not limited to, optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information.
  • A computer readable signal medium may include mediums such as carrier waves.
  • The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus.
  • Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.
  • The operations described above can be performed by hardware, software, or some combination of software and hardware.
  • Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application.
  • Some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software.
  • The various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways.
  • The methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.

Abstract

Example implementations described herein are directed to systems and methods that can involve one or more apparatuses configured to manage a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system. The one or more apparatuses are configured to execute an Extract, Transform, Load (ETL) program as well as a program for transformations of data for a database, and to provide a schedule that estimates the completion of data replication and data transformation. Such example implementations can be utilized by data scientists to avoid waiting a long time to begin performing data analytics.

Description

SYSTEM AND METHOD FOR IMPROVING AGILITY
OF DATA ANALYTICS
BACKGROUND
Field
[0001] The present disclosure is directed generally to database management systems, and more specifically, to management of data analytics and related functions thereof.
Related Art
[0002] In related art implementations, businesses perform advanced data analytics to find ways to increase revenue or decrease costs. Performing advanced analytics can require the use and integration of various data. Examples of data that may be used include data from the mission critical or business critical systems and data provided by Social Networking Sites (SNS).
[0003] In related art implementations, data from mission critical or business critical systems are copied to the analytics system of the business. In some solutions, the copying is offloaded to the storage layer to limit the performance impact on the mission critical or business critical systems.
[0004] An example related art implementation includes tools that schedule the copying and the transforming of data. An example of such a related art implementation can include systems and methods for scheduling data flow execution based on an arbitrary graph describing the desired data flow as described, for example, in U.S. Patent No. 8,639,847.
SUMMARY
[0005] In current solutions, the data analytics may not be as agile as a data scientist would desire. For example, the data scientist may need to wait a long time to begin performing data analytics, because copying data from the mission critical or business critical systems to the analytics system takes a long time. If the copying is operated by a storage admin, the interaction between the data scientist and the storage admin adds more to the wait time. Such additional wait time could be eliminated if the data scientist directly managed the copy operation. However, the data scientist may want to focus on data analytics without having to manage the operation of the storage systems.
[0006] An ETL (Extract Transform Load) tool can create a schedule that completes copying and transforming data in a short time. The ETL tool queries the database and the storage system holding the mission critical or business critical data in order to acquire information about how long it would take to copy and transform the data. The ETL tool then creates the schedule based on a data transformation flow provided by the user and the information acquired from the database and the storage system.
[0007] However, the related art implementations do not provide an ETL tool that queries a database and a storage system in order to acquire information about how long it would take to copy and transform data.
[0008] Example implementations described herein are directed to provide a system and method that improves the agility of data analytics.
[0009] Aspects of the present disclosure can involve a server configured to manage a first storage system with a plurality of volumes, the first storage system communicatively coupled to a second storage system. The server can involve a memory, configured to manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and a processor, configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume. The execution of the ETL program causes the processor to be configured to request data related to the first volume to be processed from the database management program; request information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimate a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; request from the database management program, a size of the data to be processed from the first volume; estimate a second time for performing each of the one or more transformations based on the size of the data to be processed; and determine a schedule based on the first time and the second time.
[0010] Aspects of the present disclosure further include a method for managing a first storage system having a plurality of volumes, the first storage system communicatively coupled to a second storage system. The method can involve managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program involving requesting data related to the first volume to be processed from the database management program; requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimating a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; requesting from the database management program, a size of the data to be processed from the first volume; estimating a second time for performing each of the one or more transformations based on the size of the data to be processed; and determining a schedule based on the first time and the second time.
[0011] Aspects of the present disclosure further include a non-transitory computer readable medium, storing instructions for managing a first storage system having a plurality of volumes, the first storage system communicatively coupled to a second storage system. The instructions can involve managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program involving requesting data related to the first volume to be processed from the database management program; requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimating a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; requesting from the database management program, a size of the data to be processed from the first volume; estimating a second time for performing each of the one or more transformations based on the size of the data to be processed; and determining a schedule based on the first time and the second time.
[0012] Aspects of the present disclosure further include a system, which can involve one or more apparatuses configured to manage a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, the one or more apparatuses involving a memory, configured to manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and a processor, configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program causing the processor to be configured to request data related to the first volume to be processed from the database management program; request information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimate a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; request from the database management program, a size of the data to be processed from the first volume; estimate a second time for performing each of the one or more transformations based on the size of the data to be processed; and determine a schedule based on the first time and the second time.
[0013] Aspects of the present disclosure further include a system, which can involve means for managing a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, means for managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; means for executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, means for executing of the ETL program involving means for requesting data related to the first volume to be processed from the database management program; means for requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; means for estimating a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; means for requesting from the database management program, a size of the data to be processed from the first volume; means for estimating a second time for performing each of the one or more transformations based on the size of the data to be processed; and means for determining a schedule based on the first time and the second time.
BRIEF DESCRIPTION OF DRAWINGS
[0014] FIG. 1 illustrates an example physical configuration of the system in which the example implementations may be applied.
[0015] FIG. 2 illustrates the physical configuration of servers, in accordance with an example implementation.
[0016] FIG. 3 illustrates the physical configuration of storage system, in accordance with an example implementation.
[0017] FIGS. 4 to 6 illustrate example logical layouts of server memory, in accordance with an example implementation.
[0018] FIG. 7 illustrates an example logical layout of storage system memory, in accordance with an example implementation.
[0019] FIG. 8 illustrates an overview of data flow, in accordance with an example implementation.
[0020] FIG. 9 illustrates the logical layout of Table Management Tables, in accordance with an example implementation.
[0021] FIG. 10 illustrates the logical layout of Transformation Flow Table, in accordance with an example implementation.
[0022] FIG. 11 illustrates the logical layout of Operation Management Table, in accordance with an example implementation.
[0023] FIG. 12 illustrates the logical layout of Schedule Management Table, in accordance with an example implementation.
[0024] FIG. 13 illustrates the logical layout of Volume Management Tables, in accordance with an example implementation.
[0025] FIG. 14 illustrates the logical layout of Replication Management Tables, in accordance with an example implementation.
[0026] FIG. 15 illustrates the flow of ETL processing executed by ETL Program, in accordance with an example implementation.
[0027] FIG. 16 illustrates the flow of volume information determination by ETL Program, in accordance with an example implementation.
[0028] FIG. 17 illustrates the flow of schedule determination executed by ETL Program, in accordance with an example implementation.
[0029] FIG. 18 shows the logical layout of storage system memory, in accordance with a second example implementation.
[0030] FIG. 19 illustrates the logical layout of Volume Management Table in accordance with the second example implementation.
[0031] FIG. 20 illustrates the logical layout of Load Management Table, in accordance with a second example implementation.
[0032] FIG. 21 illustrates the flow of volume information determination executed by ETL Program in accordance with a second example implementation.
[0033] FIG. 22 illustrates the flow of replication time estimation executed by ETL Program in accordance with the second implementation.
[0034] FIG. 23 illustrates the logical layout of Table Management Table in accordance with a third example implementation.
[0035] FIG. 24 illustrates the logical layout of Operation Management Table in accordance with the third example implementation.
[0036] FIG. 25 illustrates the flow of ETL processing executed by ETL Program 110B in accordance with the third example implementation.
[0037] FIG. 26 illustrates the flow of schedule determination executed by ETL Program in accordance with a third example implementation.
DETAILED DESCRIPTION
[0038] The following detailed description provides further details of the figures and example implementations of the present application. Reference numerals and descriptions of redundant elements between figures are omitted for clarity. Terms used throughout the description are provided as examples and are not intended to be limiting. For example, the use of the term "automatic" may involve fully automatic or semi-automatic implementations involving user or administrator control over certain aspects of the implementation, depending on the desired implementation of one of ordinary skill in the art practicing implementations of the present application. Selection can be conducted by a user through a user interface or other input means, or can be implemented through a desired algorithm. Example implementations as described herein can be utilized either singularly or in combination and the functionality of the example implementations can be implemented through any means according to the desired implementations.
[0039] In a first example implementation, an ETL tool creates a schedule that completes copying and transforming data in a short time.
[0040] FIG. 1 illustrates an example physical configuration of the system in which the example implementations may be applied. One or more Servers 1A are connected to a Storage System 2A via a Storage Area Network (SAN) 3A. One or more Servers 1B and one or more Servers 1C are connected to a Storage System 2B via a SAN 3B. Storage System 2A and Storage System 2B are connected to each other via a SAN 3C. Server 1B and Server 1C are connected to each other via a Local Area Network (LAN) 4.
[0041] Server 1A uses SAN 3A to send Input/Output (I/O) requests to Storage System 2A. Server 1B and Server 1C use SAN 3B to send I/O requests to Storage System 2B. Storage System 2A uses SAN 3C to replicate data to Storage System 2B. Server 1B and Server 1C use LAN 4 to communicate with each other.
[0042] FIG. 2 illustrates the physical configuration of Servers 1A, 1B and 1C, in accordance with an example implementation. Servers 1A, 1B, and 1C each have a corresponding SAN port 13A, 13B, and 13C, corresponding LAN port 14A, 14B, and 14C, corresponding Central Processing Unit (CPU) 10A, 10B, and 10C, corresponding memory 11A, 11B, and 11C, and corresponding storage device 12A, 12B, and 12C.
[0043] As described herein, any one or a combination of several apparatuses (e.g., server 1A, 1B, 1C) can manage a first storage system such as storage system 2B that manages a plurality of volumes, which is communicatively coupled to a second storage system such as storage system 2A through SAN 3C. Any one of memory 11A, 11B and 11C can, singularly or in combination, manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes as illustrated in FIG. 10. Any one of processor(s) 10A, 10B, and 10C can be configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume as illustrated in the flow diagrams of FIGS. 15 to 17, 21 to 22 and 25 to 26.
The execution of the ETL program can involve requesting data related to the first volume to be processed from the database management program; requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system; estimating a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed; requesting from the database management program, a size of the data to be processed from the first volume; estimating a second time for performing each of the one or more transformations based on the size of the data to be processed; and determining a schedule based on the first time and the second time through the execution of the flow diagrams of FIG. 15 and/or FIG. 25.
[0044] In an example implementation, the data related to the first volume can include a list of volumes of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system can involve a size of each volume in the list of volumes, and a bandwidth of each volume in the list of volumes, as illustrated in FIGS. 9-11 and 14, and as determined from the execution of the flow diagrams of FIG. 15 and/or FIG. 25. In an example implementation, the information regarding the requested data to be replicated from the second storage system to the first storage system involves a central processing unit (CPU) load for each volume of a list of volumes of the second storage system to be replicated to the first volume, wherein the estimation of the first time is based on the CPU load on each volume of the list of volumes of the second storage system to be replicated to the first volume as illustrated and described with respect to FIGS. 18 to 22.
[0045] In an example implementation, the data related to the first volume comprises a list of storage regions of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system can involve a size of each storage region in the list of storage regions, and a bandwidth for each storage region in the list of storage regions, as illustrated, for example, in FIGS. 10, 14, and 23. The ETL program can be configured to determine a schedule based on a determination of an order for replication of the list of storage regions from the second storage system to the first storage system based on the first time and the second time as illustrated and described in FIGS. 23 to 26.
[0046] The ETL program can be configured to determine the schedule based on the first time and the second time through a process that involves generating a plurality of schedules encompassing all combinations between execution of the one or more transformations and data to be transferred from the second storage system to the first storage system; estimating the first time and the second time for each schedule of the plurality of schedules; and determining the schedule from the plurality of schedules having a shortest schedule based on the first time and the second time as determined, for example, through the execution of FIG. 17.
[0047] FIG. 3 illustrates the physical configuration of Storage Systems 2A and 2B, in accordance with an example implementation. Storage Systems 2A and 2B can include their corresponding SAN port 24A and 24B, corresponding CPU 21A and 21B, corresponding memory 22A and 22B, and corresponding storage device 23A and 23B. The storage devices 23A and 23B can be used to store data across either a plurality of volumes, and/or a plurality of storage regions or storage areas depending on the desired implementation.
[0048] FIG. 4 illustrates an example logical layout of Memory 11A, in accordance with an example implementation. Memory 11A can include Application Program 110A, Database Management Program 111A and Table Management Table 112A. Database Management Program 111A provides a database to Application Program 110A. Application Program 110A accesses tables provided by Database Management Program 111A. Database Management Program 111A manages the tables using Table Management Table 112A and stores table data in Storage System 2A by sending read and write requests to Storage System 2A.
[0049] FIG. 5 illustrates the logical layout of Memory 11B, in accordance with an example implementation. Memory 11B contains ETL Program 110B, Database Management Program 111B, Table Management Table 112B, Transformation Flow Table 113B, Operation Management Table 114B and Schedule Management Table 115B. ETL Program 110B transforms data based on information stored in Transformation Flow Table 113B, Operation Management Table 114B and Schedule Management Table 115B. Database Management Program 111B provides a database to ETL Program 110B. ETL Program 110B accesses tables provided by Database Management Program 111B. Database Management Program 111B manages the tables using Table Management Table 112B and stores table data in Storage System 2B by sending read and write requests to Storage System 2B.
[0050] FIG. 6 shows the logical layout of Memory 11C, in accordance with an example implementation. Memory 11C contains Application Program 110C and File System Program 111C. File System Program 111C provides a file system to ETL Program 110B and Application Program 110C. ETL Program 110B and Application Program 110C access files provided by File System Program 111C. File System Program 111C stores file data in Storage System 2B by sending read and write requests to Storage System 2B.
[0051] FIG. 7 illustrates an example logical layout of Memory 22A and 22B, in accordance with an example implementation. Memory 22A contains Configuration Management Program 220A, I/O Processing Program 221A, Replication Management Program 222A, Volume Management Table 223A, Replication Management Table 224A and Cache Area 225A. Configuration Management Program 220A is executed by Central Processing Unit (CPU) 21A and manages the configuration of Storage System 2A using Volume Management Table 223A.
[0052] I/O Processing Program 221A is executed by CPU 21A and processes I/O requests from Server 1A. Cache Area 225A is used by I/O Processing Program 221A to temporarily store data being read from or written to Storage Device 23A. Replication Management Program 222A is executed by CPU 21A and manages replication of data between Storage System 2A and other storage systems using Replication Management Table 224A.
[0053] The logical layout of Memory 22B is essentially the same as that of Memory 22A. Memory 22B contains Configuration Management Program 220B, I/O Processing Program 221B, Replication Management Program 222B, Volume Management Table 223B, Replication Management Table 224B and Cache Area 225B. The main differences are that Configuration Management Program 220B is executed by CPU 21B and manages the configuration of Storage System 2B, I/O Processing Program 221B is executed by CPU 21B and processes I/O requests from Servers 1B and 1C, and Replication Management Program 222B is executed by CPU 21B and manages replication of data between Storage System 2B and other storage systems.
[0054] FIG. 8 illustrates an overview of data flow, in accordance with an example implementation. Application Program 110A writes to one or more tables provided by Database Management Program 111A. Database Management Program 111A writes to one or more Table Data Volumes 200A provided by Storage System 2A. Storage System 2A replicates data written to Table Data Volume 200A to Table Data Volume 200B provided by Storage System 2B.
[0055] ETL Program 110B reads from one or more tables provided by Database Management Program 111B. Database Management Program 111B reads from Table Data Volume 200B. ETL Program 110B transforms the data read from Database Management Program 111B through the use of one or more transformations associated with the read data, and writes to the file system provided by File System Program 111C. File System Program 111C writes to File Data Volume 201B. Application Program 110C reads from the file system provided by File System Program 111C.
[0056] FIG. 9 illustrates the logical layout of Table Management Tables 112A and 112B, in accordance with an example implementation. Table Management Table 112A can include multiple entries, each entry corresponding to a table managed by Database Management Program 111A. Each entry can involve Table ID 1120A, Table Size 1121A and Volume Info 1122A. Table ID 1120A is used by Database Management Program 111A to identify a table managed by Database Management Program 111A. Table Size 1121A is used by Database Management Program 111A to manage the size of the table corresponding to Table ID 1120A. Volume Info 1122A is used by Database Management Program 111A to identify one or more volumes in which the table corresponding to Table ID 1120A is stored. Volume Info 1122A may include, for example, the ID of a storage system and the ID of a volume within that storage system.
[0057] The logical layout of Table Management Table 112B is essentially the same as that of Table Management Table 112A and can involve Table ID 1120B, Table Size 1121B and Volume Info 1122B. The main difference is that each entry of Table Management Table 112B corresponds to a table managed by Database Management Program 111B (instead of Database Management Program 111A).
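As an illustration only, a table of this shape and the lookup that resolves tables to volumes could be sketched as follows. The class name, field names, units, and example values are hypothetical and are not specified by the present disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableEntry:
    """One entry modeled on Table Management Table 112B (names are assumptions)."""
    table_id: str                       # Table ID 1120B
    table_size: int                     # Table Size 1121B, in bytes (unit assumed)
    volume_info: List[Tuple[str, str]]  # Volume Info 1122B: (storage system ID, volume ID)


# Hypothetical contents keyed by table ID.
table_management_table = {
    "orders": TableEntry("orders", 40 * 2**30, [("2B", "vol-01")]),
    "customers": TableEntry("customers", 5 * 2**30, [("2B", "vol-01"), ("2B", "vol-02")]),
}


def volumes_for(table_ids, table):
    """Return the distinct volumes storing the given tables, in first-seen order."""
    seen = []
    for table_id in table_ids:
        for volume in table[table_id].volume_info:
            if volume not in seen:
                seen.append(volume)
    return seen
```

A call such as `volumes_for(["orders", "customers"], table_management_table)` resolves a set of tables to the volumes that hold them, which is the kind of lookup a database management program would perform when asked which volumes store a given table.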
[0058] FIG. 10 illustrates the logical layout of Transformation Flow Table 113B, in accordance with an example implementation. Transformation Flow Table 113B can include multiple entries, each entry corresponding to a transformation to be performed by ETL Program 110B. Each entry can include Transformation ID 1130B, Transformation Type 1131B, Source Data Info 1132B, Destination Data Info 1133B, and Prerequisite Transformation ID 1134B. Transformation ID 1130B is used by ETL Program 110B to identify a transformation to be performed by ETL Program 110B. Transformation Type 1131B is used by ETL Program 110B to identify the type of the transformation corresponding to Transformation ID 1130B. Example values of Transformation Type 1131B include "Convert" and "Merge". "Convert" denotes a transformation which converts a table to a file. "Merge" denotes a transformation which merges two files into one file.
[0059] Source Data Info 1132B is used by ETL Program 110B to locate the source data of the transformation corresponding to Transformation ID 1130B. For example, if the source data is a table, Source Data Info 1132B may include information to locate the database providing the table and information to locate the table within that database. Alternatively, if the source data is a file, Source Data Info 1132B may include information to locate the filesystem providing the file and information to locate the file within that filesystem.
[0060] Destination Data Info 1133B is used by ETL Program 110B to locate the destination data of the transformation corresponding to Transformation ID 1130B. For example, if the destination data is a table, Destination Data Info 1133B may include information to locate the database providing the table and information to locate the table within that database. Alternatively, if the destination data is a file, Destination Data Info 1133B may include information to locate the filesystem providing the file and information to locate the file within that filesystem.
[0061] Prerequisite Transformation ID 1134B is used by ETL Program 110B to identify one or more transformations that must be performed before the transformation corresponding to Transformation ID 1130B can be performed.
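For illustration, the entries of Transformation Flow Table 113B and the prerequisite check could be modeled as in the following minimal sketch. The class name, field names, paths, and the example flow are hypothetical; the disclosure does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TransformationEntry:
    """One entry modeled on Transformation Flow Table 113B (names are assumptions)."""
    transformation_id: int       # Transformation ID 1130B
    transformation_type: str     # Transformation Type 1131B: "Convert" or "Merge"
    source_data_info: List[str]  # Source Data Info 1132B
    destination_data_info: str   # Destination Data Info 1133B
    prerequisite_ids: List[int] = field(default_factory=list)  # Prerequisite Transformation ID 1134B


# Hypothetical flow: convert two tables to files, then merge the two files.
flow = [
    TransformationEntry(1, "Convert", ["db/table_a"], "fs/a.csv"),
    TransformationEntry(2, "Convert", ["db/table_b"], "fs/b.csv"),
    TransformationEntry(3, "Merge", ["fs/a.csv", "fs/b.csv"], "fs/merged.csv", [1, 2]),
]


def ready(entry, completed_ids):
    """A transformation may run only after all of its prerequisites have completed."""
    return all(p in completed_ids for p in entry.prerequisite_ids)
```

In this example the "Merge" entry becomes ready only once both "Convert" entries are in the completed set.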
[0062] FIG. 11 shows the logical layout of Operation Management Table 114B, in accordance with an example implementation. Operation Management Table 114B can include multiple entries, each entry corresponding to an operation to be performed by ETL Program 110B. Each entry can include Operation ID 1140B, Transformation ID 1141B and Operation Type 1142B.
[0063] Operation ID 1140B is used by ETL Program 110B to identify an operation to be performed by ETL Program 110B.
[0064] Transformation ID 1141B is used by ETL Program 110B to identify the transformation that is related to the operation corresponding to Operation ID 1140B.
[0065] Operation Type 1142B is used by ETL Program 110B to identify the type of the operation corresponding to Operation ID 1140B. Operation Type 1142B is either "Replicate" or "Transform". "Replicate" denotes an operation which replicates Table Data Volume 200A to Table Data Volume 200B. "Transform" denotes an operation which executes the transformation corresponding to Transformation ID 1141B.
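A minimal sketch of such an operation table is shown below, assuming for illustration that each table-reading transformation has one related "Replicate" operation and one "Transform" operation. The class name, field names, and example rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One entry modeled on Operation Management Table 114B (names are assumptions)."""
    operation_id: int       # Operation ID 1140B
    transformation_id: int  # Transformation ID 1141B
    operation_type: str     # Operation Type 1142B: "Replicate" or "Transform"


# Hypothetical rows: transformations 1 and 2 each need a replication first,
# transformation 3 works on files only and has no related replication.
operations = [
    Operation(1, 1, "Replicate"),
    Operation(2, 1, "Transform"),
    Operation(3, 2, "Replicate"),
    Operation(4, 2, "Transform"),
    Operation(5, 3, "Transform"),
]


def operations_for(transformation_id, table):
    """Return the operations related to one transformation, in table order."""
    return [op for op in table if op.transformation_id == transformation_id]
```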
[0066] FIG. 12 illustrates the logical layout of Schedule Management Table 115B, in accordance with an example implementation. Schedule Management Table 115B stores Schedule 1150B and Estimated Execution Time 1151B. Schedule 1150B is used by ETL Program 110B to store the schedule of operations that ETL Program 110B will execute. Schedule 1150B may include, for example, a list of operation IDs corresponding to Operation ID 1140B.
[0067] Estimated Execution Time 1151B is used by ETL Program 110B to store the estimated execution time of the transformation schedule corresponding to Schedule 1150B. In an example implementation, the estimated execution time can be the sum of the time taken for each operation in the Schedule 1150B.
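The summation described above can be sketched as follows. The function name, the dictionary of per-operation times, and the example values (in seconds) are assumptions made for illustration.

```python
def estimate_schedule_time(schedule, operation_times):
    """Sum the estimated time of every operation ID listed in the schedule."""
    return sum(operation_times[operation_id] for operation_id in schedule)


# Hypothetical per-operation estimates, in seconds, keyed by operation ID.
operation_times = {1: 120.0, 2: 30.0, 3: 60.0}
schedule = [1, 3, 2]  # a list of operation IDs, as Schedule 1150B may hold
total = estimate_schedule_time(schedule, operation_times)
```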
[0068] FIG. 13 illustrates the logical layout of Volume Management Tables 223A and 223B, in accordance with an example implementation. Volume Management Table 223A can include multiple entries, each entry corresponding to a volume of Storage System 2A. Each entry can include Volume ID 2230A and Capacity 2231A. Volume ID 2230A is used by Configuration Management Program 220A to identify a volume within Storage System 2A. Capacity 2231A is the capacity of the volume corresponding to Volume ID 2230A.
[0069] Volume Management Table 223B can include multiple entries, each entry corresponding to a volume of Storage System 2B. Each entry can include Volume ID 2230B and Capacity 2231B. Volume ID 2230B is used by Configuration Management Program 220B to identify a volume within Storage System 2B. Capacity 2231B is the capacity of the volume corresponding to Volume ID 2230B.
[0070] FIG. 14 illustrates the logical layout of Replication Management Tables 224A and 224B, in accordance with an example implementation. Replication Management Table 224A can include multiple entries, each entry corresponding to a volume of Storage System 2A that is either the source or the destination of a replicated volume pair. Each entry can include Pair ID 2240A, Source Volume Info 2241A, Destination Volume Info 2242A, Pair State 2243A, Difference Info 2244A and Link Info 2245A.
[0071] Pair ID 2240A is used by Replication Management Program 222A to identify a replicated volume pair within Storage System 2A. Source Volume Info 2241A is used by Replication Management Program 222A to identify the source volume of the replicated volume pair corresponding to Pair ID 2240A. Source Volume Info 2241A may include, for example, the ID of the storage system providing the source volume and the ID of the source volume within that storage system. Destination Volume Info 2242A is used by Replication Management Program 222A to identify the destination volume of the replicated volume pair corresponding to Pair ID 2240A. Destination Volume Info 2242A may include, for example, the ID of the storage system providing the destination volume and the ID of the destination volume within that storage system. Pair State 2243A is used by Replication Management Program 222A to manage the state of the replicated volume pair corresponding to Pair ID 2240A. Example values of Pair State 2243A include "PAIR", "SUSP" and "COPY".
[0072] "PAIR" denotes a state in which the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A are synchronized. In this state, when Storage System 2A receives a write request from Server 1A, I/O Processing Program 221A stores write data received from Server 1A in Storage Device 23A and replicates the write data to Storage System 2B before sending a completion response to Server 1A.
[0073] "SUSP" denotes a state in which replication between the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A is suspended and there is a difference between the data stored in the source and the destination volumes. In this state, when Storage System 2A receives a write request from Server 1A, I/O Processing Program 221A stores write data received from Server 1A in Storage Device 23A and updates Difference Info 2244A before sending a completion response to Server 1A.
[0074] "COPY" denotes a state in which the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A are in the process of being resynchronized. In this state, Replication Management Program 222A identifies the regions of the source volume that are different from the destination volume by referencing Difference Info 2244A, and replicates those regions from the source volume to the destination volume.
[0075] Difference Info 2244A is used by Replication Management Program 222A to manage the difference between the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A. Difference Info 2244A may include, for example, "ON" or "OFF" for each region of the source volume. "ON" denotes that for the region the data stored in the source and the destination volumes are different. "OFF" denotes that for the region the data stored in the source and the destination volumes are the same.
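One way to picture Difference Info 2244A is as a per-region flag list that is updated on writes while the pair is in the "SUSP" state and consulted during "COPY". The sketch below is illustrative only: the region size, function names, and the use of Python booleans for "ON"/"OFF" are assumptions.

```python
def record_write(difference_info, offset, length, region_size):
    """Mark every region touched by a write as differing ("ON")."""
    first = offset // region_size
    last = (offset + length - 1) // region_size
    for region in range(first, last + 1):
        difference_info[region] = True


def regions_to_resync(difference_info):
    """Return the indexes of regions marked "ON", i.e. those to replicate during COPY."""
    return [index for index, differs in enumerate(difference_info) if differs]


REGION_SIZE = 1024            # bytes per region (assumed value)
diff = [False] * 8            # all regions start "OFF" (source and destination match)
record_write(diff, offset=1500, length=1000, region_size=REGION_SIZE)
pending = regions_to_resync(diff)  # the write spans regions 1 and 2
```

Only the flagged regions need to be copied during resynchronization, which is why the amount of differing data, rather than the full volume capacity, can also serve as the size input when estimating replication time.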
[0076] Link Info 2245A is used by Replication Management Program 222A to manage information about the link used to replicate data between the source and the destination volumes of the replicated volume pair corresponding to Pair ID 2240A. Link Info 2245A may include, for example, the speed of the link (e.g., the bandwidth of the link).
[0077] The logical layout of Replication Management Table 224B is essentially the same as that of Replication Management Table 224A, and each entry can include Pair ID 2240B, Source Volume Info 2241B, Destination Volume Info 2242B, Pair State 2243B, Difference Info 2244B and Link Info 2245B. The main difference is that each entry of Replication Management Table 224B corresponds to a replicated volume pair managed by Replication Management Program 222B (instead of Replication Management Program 222A).
[0078] FIG. 15 illustrates the flow of ETL processing executed by ETL Program 110B, in accordance with an example implementation. At 500, ETL Program 110B determines the database tables to be transformed by referencing Transformation Flow Table 113B. At 501, ETL Program 110B determines the Table Data Volumes 200B storing the database tables determined in the flow at 500 by sending a request to Database Management Program 111B. When Database Management Program 111B receives the request, Database Management Program 111B determines the volumes by referencing Table Management Table 112B and sends a list of the determined volumes to ETL Program 110B. ETL Program 110B receives the list of the determined volumes from Database Management Program 111B.
[0079] At 502, ETL Program 110B determines information about each Table Data Volume 200B determined in the flow at 501. Information determined about each volume includes, for example, the capacity of each volume and the speed at which each volume can be replicated (e.g., the bandwidth available for each volume to be replicated).
[0080] At 503, ETL Program 110B estimates the time required to replicate data to each Table Data Volume 200B determined in the flow at 501 from the Table Data Volume 200A corresponding to the Table Data Volume 200B using the information determined in the flow at 502.
[0081] For example, first, ETL Program 110B determines the capacity of the Table Data Volume 200B based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the determined replication speed.
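The division described above can be sketched as follows. The function name, parameter names, units, and the guard against a non-positive speed are assumptions added for illustration.

```python
def estimate_replication_time(capacity_bytes, replication_bytes_per_second):
    """Estimate replication time as capacity divided by replication speed."""
    if replication_bytes_per_second <= 0:
        raise ValueError("replication speed must be positive")
    return capacity_bytes / replication_bytes_per_second


# Hypothetical example: a 100 GB volume over a link that moves 1 GB per second.
seconds = estimate_replication_time(100 * 10**9, 10**9)
```

If the capacity information is the difference between the source and destination volumes rather than the full volume capacity, the same function applies with the smaller byte count.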
[0082] At 504, ETL Program 110B determines the size of each table determined to be transformed in the flow at 500 by sending a request to Database Management Program 111B. When Database Management Program 111B receives the request, Database Management Program 111B determines the size of each table by referencing Table Management Table 112B and sends the size of each table to ETL Program 110B. ETL Program 110B receives the size of each table from Database Management Program 111B.
[0083] At 505, ETL Program 110B estimates the time required to perform each table transformation by dividing the size of each table by a predetermined transformation speed. At 506, ETL Program 110B determines a schedule by executing the flow as described in FIG. 17. At 507, ETL Program 110B executes the schedule determined in the flow at 506 by referencing Schedule Management Table 115B. For each operation in the schedule, ETL Program 110B looks up the operation in Operation Management Table 114B. If the Operation Type 1142B of the entry corresponding to the operation is "Replicate", ETL Program 110B instructs Storage System 2B to replicate. If the Operation Type 1142B of the entry corresponding to the operation is "Transform", ETL Program 110B performs a transformation.
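The transformation time estimate at 505 and the dispatch loop at 507 can be sketched as below. This is a simplified illustration: the function names, the tuple layout of the operation table, and the callback parameters standing in for the storage system instruction and the transformation itself are all hypothetical.

```python
def estimate_transformation_time(table_size_bytes, transform_bytes_per_second):
    """Estimate transformation time as table size divided by a predetermined speed."""
    return table_size_bytes / transform_bytes_per_second


def execute_schedule(schedule, operation_table, replicate, transform):
    """Run each operation of the schedule in order, dispatching on its type."""
    for operation_id in schedule:
        operation_type, transformation_id = operation_table[operation_id]
        if operation_type == "Replicate":
            replicate(transformation_id)   # stands in for instructing the storage system
        elif operation_type == "Transform":
            transform(transformation_id)   # stands in for performing the transformation
        else:
            raise ValueError(f"unknown operation type: {operation_type!r}")


# Hypothetical operation table: operation ID -> (operation type, transformation ID).
operation_table = {1: ("Replicate", 1), 2: ("Transform", 1)}
log = []
execute_schedule(
    [1, 2],
    operation_table,
    replicate=lambda t: log.append(("replicate", t)),
    transform=lambda t: log.append(("transform", t)),
)
```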
[0084] FIG. 16 illustrates the flow of volume information determination by ETL Program 110B, in accordance with an example implementation. This flow corresponds to the flow at 502 from FIG. 15.
[0085] At 510, ETL Program 110B sends a request to Storage System 2B for capacity information of each Table Data Volume 200B determined in the flow at 501. When Storage System 2B receives the request, Configuration Management Program 220B determines the capacity of the Table Data Volume 200B by referencing Volume Management Table 223B and sends the determined capacity as capacity information to ETL Program 110B. Alternatively, Configuration Management Program 220B determines the difference between Table Data Volume 200A and Table Data Volume 200B by referencing Replication Management Table 224B, and sends the determined difference as capacity information to ETL Program 110B.
[0086] At 511, ETL Program 110B receives the capacity information sent by Configuration Management Program 220B in the flow at 510. At 512, ETL Program 110B sends a request to Storage System 2B for replication speed information of each Table Data Volume 200B determined in the flow at 501. When Storage System 2B receives the request, Replication Management Program 222B determines the replication speed information by referencing Replication Management Table 224B and sends the determined replication speed information to ETL Program 110B. The replication speed information determined and sent by Replication Management Program 222B is, for example, Link Info 2245B corresponding to each volume.
[0087] At 513, ETL Program HOB receives the replication speed information sent by Replication Management Program 222B in the flow at 511.
[0088] FIG. 17 illustrates the flow of schedule determination executed by ETL Program 110B, in accordance with an example implementation. This flow corresponds to the flow at 506 in FIG. 15. At 520, ETL Program 110B initializes the estimate for the execution time of the shortest transformation schedule by storing an initial time in Estimated Execution Time 1151B. The initial time stored in Estimated Execution Time 1151B may be, for example, a time that is longer than any transformation schedule would take. Alternatively, the initial time stored in Estimated Execution Time 1151B may be a value that indicates that Estimated Execution Time 1151B is not valid.
[0089] At 521, ETL Program 110B creates all possible schedules which include volume replication operations and transformation operations. A schedule is possible if two conditions are met: (1) every volume replication operation is performed before the transformation operation to which it is related and (2) every transformation operation is performed after all of its prerequisite transformation operations.
[0090] At 522, ETL Program 110B starts a loop that executes the flows at 523 to 526 for each schedule created in the flow at 521.
[0091] At 523, ETL Program 110B estimates the execution time of the schedule using volume replication times estimated in the flow at 503 and the transformation times estimated in the flow at 505.
[0092] At 524, ETL Program 110B determines if the execution time of the schedule is shorter than the current estimate for the execution time of the shortest schedule. In order to make this determination, ETL Program 110B compares the execution time estimated in the flow at 523 to Estimated Execution Time 1151B. If the execution time of the schedule is shorter than the current estimate for the execution time (Yes), then the flow proceeds to 525, otherwise (No), the flow proceeds to 527 to end the loop.
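As a non-limiting illustration, the two conditions for a possible schedule amount to requiring that every operation appear after all of its prerequisites. The sketch below enumerates such orderings by brute force; the names `is_possible` and `possible_schedules` and the operation labels are hypothetical, and the prerequisite map is assumed to hold both the replication-before-transformation and the transformation-before-transformation relations.

```python
from itertools import permutations
from typing import Dict, Iterator, List, Sequence, Tuple


def is_possible(order: Sequence[str], prerequisites: Dict[str, List[str]]) -> bool:
    """True if every operation comes after all of its prerequisite operations."""
    position = {op: i for i, op in enumerate(order)}
    return all(
        position[pre] < position[op]
        for op, pres in prerequisites.items()
        for pre in pres
    )


def possible_schedules(
    operations: List[str], prerequisites: Dict[str, List[str]]
) -> Iterator[Tuple[str, ...]]:
    """Yield every ordering of the operations that satisfies both conditions."""
    for order in permutations(operations):
        if is_possible(order, prerequisites):
            yield order


# Hypothetical example: T1 needs replication R1; T2 needs replication R2 and T1.
example_ops = ["R1", "R2", "T1", "T2"]
example_prereqs = {"T1": ["R1"], "T2": ["R2", "T1"]}
example_schedules = list(possible_schedules(example_ops, example_prereqs))
```

Enumerating permutations grows factorially with the number of operations, so a practical implementation would likely generate orderings directly from the prerequisite graph or examine only a subset.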
[0093] At 525, ETL Program 110B updates Schedule 1150B with the schedule. At 526, ETL Program 110B updates Estimated Execution Time 1151B with the execution time estimated in the flow at 523. At 527, ETL Program 110B ends the loop started in the flow at 522.
[0094] By executing a loop over all possible schedules, ETL Program 110B is able to determine the shortest schedule. ETL Program 110B may, alternatively, execute a loop over only some of the possible schedules in order to determine a short but not necessarily shortest schedule.
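As a non-limiting illustration, the loop at 520 through 527 might be sketched as follows. The description does not state how the execution time of a schedule is computed from the individual estimates, so the model here is an assumption: replications run one at a time on the storage link, transformations run one at a time on the server, the two may overlap, and an operation starts once its resource is free and its prerequisites have finished. All names are hypothetical.

```python
import math
from itertools import permutations
from typing import Dict, List, Optional, Sequence, Tuple


def estimate_execution_time(
    order: Sequence[str],
    durations: Dict[str, float],
    kinds: Dict[str, str],
    prerequisites: Dict[str, List[str]],
) -> float:
    """Flow at 523 under the assumed two-resource model."""
    resource_free = {"Replicate": 0.0, "Transform": 0.0}
    finish: Dict[str, float] = {}
    for op in order:
        ready = max((finish[p] for p in prerequisites.get(op, [])), default=0.0)
        start = max(ready, resource_free[kinds[op]])
        finish[op] = start + durations[op]
        resource_free[kinds[op]] = finish[op]
    return max(finish.values(), default=0.0)


def shortest_schedule(
    operations: List[str],
    durations: Dict[str, float],
    kinds: Dict[str, str],
    prerequisites: Dict[str, List[str]],
) -> Tuple[Optional[Tuple[str, ...]], float]:
    best_schedule: Optional[Tuple[str, ...]] = None
    best_time = math.inf  # 520: longer than any schedule would take
    for order in permutations(operations):  # 521 and 522
        position = {op: i for i, op in enumerate(order)}
        if any(
            position[pre] > position[op]
            for op, pres in prerequisites.items()
            for pre in pres
        ):
            continue  # not a possible schedule
        time = estimate_execution_time(order, durations, kinds, prerequisites)
        if time < best_time:  # 524
            best_schedule, best_time = order, time  # 525 and 526
    return best_schedule, best_time
```

Under this model, replicating a small volume first can let its transformation overlap with the replication of a larger volume, which is the kind of saving the schedule search is meant to find.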
[0095] In a second example implementation, the ETL Program considers the load of the replication source to adjust replication speed and prevent an overload of the replication source.
[0096] FIG. 18 illustrates the logical layout of Storage System Memory 22A in accordance with a second example implementation. The logical layout of Memory 22A is the same as in the first example implementation described above, except that Memory 22A contains an additional table, Load Management Table 226A. Load Management Table 226A is used by Configuration Management Program 220A to manage information about the load of Storage System 2A.
[0097] FIG. 19 illustrates the logical layout of Volume Management Table 223A in accordance with the second example implementation. Volume Management Table 223A is the same as Volume Management Table 223A in accordance with the first example implementation except that each entry of Volume Management Table 223A contains an additional field, CPU ID 2232A. CPU ID 2232A is used by Configuration Management Program 220A to identify the CPU 21 that processes I/O requests to the volume corresponding to Volume ID 2230A.
[0098] FIG. 20 illustrates the logical layout of Load Management Table 226A, in accordance with a second example implementation. Load Management Table 226A can include multiple entries, each entry corresponding to a CPU 21 of Storage System 2A. Each entry can include CPU ID 2260A and Load 2261A. CPU ID 2260A is used by Configuration Management Program 220A to identify a CPU 21 within Storage System 2A. Load 2261A is the load of the CPU 21 corresponding to CPU ID 2260A.
[0099] FIG. 21 illustrates the flow of volume information determination executed by ETL Program 110B in accordance with a second example implementation. The flow is the same as in the first example implementation except there are two additional flows at 514 and 515. At 514, ETL Program 110B sends a request to Storage System 2B for load information of the Table Data Volume 200A corresponding to each Table Data Volume 200B determined in the flow at 501. When Storage System 2B receives the request, Configuration Management Program 220B determines the load information of each Table Data Volume 200A by sending a request to Storage System 2A. When Storage System 2A receives the request, Configuration Management Program 220A determines the load information of the Table Data Volume 200A by referencing Volume Management Table 223A and Load Management Table 226A and sends the determined load information to Configuration Management Program 220B. The load information determined and sent by Configuration Management Program 220A is, for example, the Load 2261A corresponding to the CPU 21A that processes I/O requests to the Table Data Volume 200A. When Configuration Management Program 220B receives the load information, Configuration Management Program 220B sends the load information to ETL Program 110B.
[0100] At 515, ETL Program 110B receives the load information sent by Configuration Management Program 220B in the flow at 514.
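As a non-limiting illustration, the lookup performed at the replication source might be sketched as a two-step table reference: the volume's entry gives a CPU ID, and the CPU's entry gives its load. The dictionaries, identifiers, and the representation of load as a fraction between 0 and 1 are assumptions made for the sketch.

```python
from typing import Dict

# Hypothetical stand-in for the volume table: Volume ID -> CPU ID.
volume_management_table: Dict[str, str] = {"vol-1": "cpu-0", "vol-2": "cpu-1"}

# Hypothetical stand-in for the load table: CPU ID -> Load (assumed 0.0 to 1.0).
load_management_table: Dict[str, float] = {"cpu-0": 0.35, "cpu-1": 0.90}


def load_of_volume(
    volume_id: str, volumes: Dict[str, str], loads: Dict[str, float]
) -> float:
    """Return the load of the CPU that processes I/O requests to the volume."""
    cpu_id = volumes[volume_id]  # which CPU serves this volume
    return loads[cpu_id]         # that CPU's current load
```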
[0101] FIG. 22 illustrates the flow of replication time estimation executed by ETL Program 110B in accordance with the second example implementation. In the second example implementation, this flow is executed between the flows at 502 and 504 in place of the flow at 503.
[0102] At 530, ETL Program 110B starts a loop that executes the flow from 531 through 534 for each Table Data Volume 200B determined in the flow at 501.
[0103] At 531, ETL Program 110B determines if the load of the Table Data Volume 200A corresponding to the Table Data Volume 200B is high. ETL Program 110B makes this determination based on the load information received in the flow at 515. For example, ETL Program 110B compares the load of the CPU 21A that processes I/O requests to the Table Data Volume 200A to a predetermined threshold, and determines that the load is high if the load of the CPU 21A is higher than the predetermined threshold. If the load is considered to be high (Yes), then the flow proceeds to 532, otherwise (No), the flow proceeds to 534.
[0104] At 532, ETL Program 110B estimates the time required to replicate data to the Table Data Volume 200B from the Table Data Volume 200A corresponding to the Table Data Volume 200B based on the replication speed between Table Data Volume 200A and Table Data Volume 200B reduced by a predetermined method.
[0105] For example, first, ETL Program 110B determines the capacity of the Table Data Volume 200B based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Then, ETL Program 110B computes a reduced replication speed by dividing the determined replication speed by a predetermined factor. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the reduced replication speed.
[0106] At 533, ETL Program 110B ends the loop started in the flow at 530. At 534, ETL Program 110B estimates the time required to replicate data to the Table Data Volume 200B from the Table Data Volume 200A corresponding to the Table Data Volume 200B based on the replication speed between Table Data Volume 200A and Table Data Volume 200B.
[0107] For example, first, ETL Program 110B determines the capacity of the Table Data Volume 200B based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the determined replication speed.
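As a non-limiting illustration, the branch at 531 and the two estimates at 532 and 534 might be sketched as one function. The function name, the default threshold of 0.8, and the default reduction factor of 2.0 are assumptions; the description only states that the threshold and the factor are predetermined.

```python
def estimate_replication_time(
    capacity_bytes: float,
    replication_speed_bytes_per_s: float,
    source_cpu_load: float,
    load_threshold: float = 0.8,    # assumed value of the predetermined threshold
    reduction_factor: float = 2.0,  # assumed value of the predetermined factor
) -> float:
    """Estimate the time to replicate one volume, slowing down for a busy source."""
    if replication_speed_bytes_per_s <= 0:
        raise ValueError("replication speed must be positive")
    if reduction_factor < 1:
        raise ValueError("reduction factor must be at least 1")

    speed = replication_speed_bytes_per_s
    if source_cpu_load > load_threshold:
        # 532: the load is high, so plan for a reduced replication speed
        speed = speed / reduction_factor
    # 534 (or 532 with the reduced speed): capacity divided by speed
    return capacity_bytes / speed
```

With these assumed defaults, a volume of 1000 bytes on a 100 bytes-per-second link is estimated at 10 seconds when the source load is 0.5 and at 20 seconds when the source load is 0.9.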
[0108] In a third example implementation, an ETL Program makes a short schedule by determining the order in which storage regions are copied within each volume.
[0109] FIG. 23 illustrates the logical layout of Table Management Table 112B in accordance with a third example implementation. The logical layout of Table Management Table 112B is the same as in the first example implementation except that each entry of Table Management Table 112B has Storage Region ID 1123B and Storage Region Info 1124B instead of Volume Info 1122B.
[0110] Storage Region ID 1123B is used by Database Management Program 111B to identify a storage region storing a part or all of the table corresponding to Table ID 1120B.
[0111] Storage Region Info 1124B is access information for the storage region corresponding to Storage Region ID 1123B. Storage Region Info 1124B may include, for example, the ID of a storage system, the ID of a volume within that storage system, and a range of blocks within that volume.
[0112] FIG. 24 illustrates the logical layout of Operation Management Table 114B in accordance with the third example implementation. The logical layout of Operation Management Table 114B is the same as in the first example implementation except that each entry of Operation Management Table 114B has an additional field, Storage Region ID 1143B. When the Operation Type 1142B of an entry is "Replicate", Storage Region ID 1143B is used by ETL Program 110B to identify the storage region that is to be replicated. When the Operation Type 1142B of an entry is "Transform", Storage Region ID 1143B is not used by ETL Program 110B. In this case, "N/A" is stored in Storage Region ID 1143B.
[0113] FIG. 25 illustrates the flow of ETL processing executed by ETL Program 110B in accordance with the third example implementation. The flow is the same as the flow of the first example implementation, except that the flow at 501 is replaced by the flow at 501' and the flow at 503 is replaced by the flow at 503'.
[0114] At 501', ETL Program 110B determines the storage regions storing the database tables determined in the flow at 500 by sending a request to Database Management Program 111B. When Database Management Program 111B receives the request, Database Management Program 111B determines the storage regions by referencing Table Management Table 112B and sends a list of the storage regions to ETL Program 110B. ETL Program 110B receives the list of the determined storage regions from Database Management Program 111B.
[0115] At 503', ETL Program 110B estimates the time required to replicate data to each storage region determined in the flow at 501' from Storage System 2A using the information determined in the flow at 502. For example, first, ETL Program 110B determines the capacity of the storage region based on the capacity information received in the flow at 511. Then, ETL Program 110B determines the replication speed between the Table Data Volume 200A and Table Data Volume 200B based on the replication speed information received in the flow at 513. Finally, ETL Program 110B estimates the required time to be the determined capacity divided by the determined replication speed.
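As a non-limiting illustration, a storage region record and its replication time estimate might be sketched as follows. Deriving the capacity from the block range, the inclusive block range, the 512-byte block size, and all names are assumptions made for the sketch; the description obtains the capacity from the capacity information received from the storage system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRegionInfo:
    """Access information for one storage region (hypothetical field names)."""
    storage_system_id: str
    volume_id: str
    first_block: int
    last_block: int  # assumed to be inclusive

    def size_bytes(self, block_size: int = 512) -> int:
        """Capacity of the region, assuming fixed-size blocks."""
        return (self.last_block - self.first_block + 1) * block_size


def estimate_region_replication_time(
    region: StorageRegionInfo,
    replication_speed_bytes_per_s: float,
    block_size: int = 512,
) -> float:
    """Region capacity divided by the replication speed between the volumes."""
    if replication_speed_bytes_per_s <= 0:
        raise ValueError("replication speed must be positive")
    return region.size_bytes(block_size) / replication_speed_bytes_per_s
```

Because each region is estimated separately, the schedule search can order regions within a volume rather than treating the volume as a single unit.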
[0116] FIG. 26 illustrates the flow of schedule determination executed by ETL Program 110B in accordance with a third example implementation. The flow is the same as in the first example implementation, except that the flow at 521 is replaced by the flow at 521' and the flow at 523 is replaced by the flow at 523'.
[0117] At 521', ETL Program 110B creates all possible schedules which include storage region replication operations and transformation operations. A schedule is possible if two conditions are met: (1) every storage region replication operation is performed before the transformation operation to which it is related and (2) every transformation operation is performed after all of its prerequisite transformation operations. At 523', ETL Program 110B estimates the execution time of the schedule using storage region replication times estimated in the flow at 503' and the transformation times estimated in the flow at 505.
[0118] Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In example implementations, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result.
[0119] Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," "displaying," or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system's memories or registers or other information storage, transmission or display devices.
[0120] Example implementations may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may involve tangible mediums such as, but not limited to optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information. A computer readable signal medium may include mediums such as carrier waves. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.
[0121] Various general-purpose systems may be used with programs and modules in accordance with the examples herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the example implementations are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the example implementations as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.
[0122] As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application. Further, some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.
[0123] Moreover, other implementations of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the teachings of the present application. Various aspects and/or components of the described example implementations may be used singly or in any combination. It is intended that the specification and example implementations be considered as examples only, with the true scope and spirit of the present application being indicated by the following claims.

Claims

What is claimed is:
1. A server configured to manage a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, the server comprising:
a memory, configured to manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and
a processor, configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program causing the processor to be configured to:
request data related to the first volume to be processed from the database management program;
request information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system;
estimate a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed;
request from the database management program, a size of the data to be processed from the first volume;
estimate a second time for performing each of the one or more transformations based on the size of the data to be processed; and
determine a schedule based on the first time and the second time.
2. The server of claim 1, wherein the data related to the first volume comprises a list of volumes of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a size of each volume in the list of volumes, and a bandwidth of each volume in the list of volumes.
3. The server of claim 2, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a central processing unit (CPU) for each volume of a list of volumes of the second storage system to be replicated to the first volume, wherein the estimation of the first time is based on the CPU load on each volume of the list of volumes of the second storage system to be replicated to the first volume.
4. The server of claim 1, wherein the data related to the first volume comprises a list of storage regions of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a size of each storage region in the list of storage regions, and a bandwidth for each storage region in the list of storage regions.
5. The server of claim 4, wherein the ETL program is configured to determine the schedule based on a determination of an order for replication of the list of storage regions from the second storage system to the first storage system based on the first time and the second time.
6. The server of claim 1, wherein the ETL program is configured to determine the schedule based on the first time and the second time through a process comprising: generating a plurality of schedules encompassing all combinations between execution of the one or more transformations and data to be transferred from the second storage system to the first storage system; estimating the first time and the second time for each schedule of the plurality of schedules; and determining the schedule from the plurality of schedules having a shortest schedule based on the first time and the second time.
7. A method for managing a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, the method comprising:
managing transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and
executing an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program comprising:
requesting data related to the first volume to be processed from the database management program;
requesting information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system;
estimating a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed;
requesting from the database management program, a size of the data to be processed from the first volume;
estimating a second time for performing each of the one or more transformations based on the size of the data to be processed; and
determining a schedule based on the first time and the second time.
8. The method of claim 7, wherein the data related to the first volume comprises a list of volumes of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a size of each volume in the list of volumes, and a bandwidth of each volume in the list of volumes.
9. The method of claim 8, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a central processing unit (CPU) for each volume of a list of volumes of the second storage system to be replicated to the first volume, wherein the estimation of the first time is based on the CPU load on each volume of the list of volumes of the second storage system to be replicated to the first volume.
10. The method of claim 7, wherein the data related to the first volume comprises a list of storage regions of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a size of each storage region in the list of storage regions, and a bandwidth for each storage region in the list of storage regions.
11. The method of claim 10, wherein the ETL program is configured to determine the schedule based on a determination of an order for replication of the list of storage regions from the second storage system to the first storage system based on the first time and the second time.
12. The method of claim 7, wherein the ETL program is configured to determine the schedule based on the first time and the second time through a process comprising: generating a plurality of schedules encompassing all combinations between execution of the one or more transformations and data to be transferred from the second storage system to the first storage system; estimating the first time and the second time for each schedule of the plurality of schedules; and determining the schedule from the plurality of schedules having a shortest schedule of the first time and the second time.
13. A system, comprising:
one or more apparatuses configured to manage a first storage system comprising a plurality of volumes, the first storage system communicatively coupled to a second storage system, the one or more apparatuses comprising:
a memory, configured to manage transformation flow information, the transformation flow information indicative of one or more transformations to be performed on data from the plurality of volumes; and
a processor, configured to execute an Extract, Transform, Load (ETL) program from the first storage system or the second storage system, the ETL program configured to extract data from a first volume of the plurality of volumes, transform the extracted data based on the transformation flow information, and store the transformed extracted data to a second volume of the plurality of volumes, and to execute a database management program configured to manage input/output (I/O) to the first volume, the execution of the ETL program causing the processor to be configured to:
request data related to the first volume to be processed from the database management program;
request information from the first storage system or the second storage system regarding the requested data to be replicated from the second storage system to the first storage system;
estimate a first time for replicating the data from the second storage system to the first storage system based on the information regarding the requested data to be replicated from the second storage system to the first storage system, and the data to be processed;
request from the database management program, a size of the data to be processed from the first volume;
estimate a second time for performing each of the one or more transformations based on the size of the data to be processed; and
determine a schedule based on the first time and the second time.
14. The system of claim 13, wherein the data related to the first volume comprises a list of volumes of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a size of each volume in the list of volumes, and a bandwidth of each volume in the list of volumes.
15. The system of claim 13, wherein the data related to the first volume comprises a list of storage regions of the second storage system to be replicated to the first volume, wherein the information regarding the requested data to be replicated from the second storage system to the first storage system comprises a size of each storage region in the list of storage regions, and a bandwidth for each storage region in the list of storage regions.
PCT/US2017/042230 2017-07-14 2017-07-14 System and method for improving agility of data analytics WO2019013824A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2017/042230 WO2019013824A1 (en) 2017-07-14 2017-07-14 System and method for improving agility of data analytics

Publications (1)

Publication Number Publication Date
WO2019013824A1 true WO2019013824A1 (en) 2019-01-17

Family

ID=65002231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/042230 WO2019013824A1 (en) 2017-07-14 2017-07-14 System and method for improving agility of data analytics

Country Status (1)

Country Link
WO (1) WO2019013824A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050193070A1 (en) * 2004-02-26 2005-09-01 International Business Machines Corporation Providing a portion of an electronic mail message based upon a transfer rate, a message size, and a file format
US20070276712A1 (en) * 2006-05-24 2007-11-29 Kolanchery Renjeev V Project size estimation tool
US20090177671A1 (en) * 2008-01-03 2009-07-09 Accenture Global Services Gmbh System and method for automating etl application
US20100180088A1 (en) * 2009-01-09 2010-07-15 Linkage Technology Group Co., Ltd. Memory Dispatching Method Applied to Real-time Data ETL System
US20120278512A1 (en) * 2011-04-29 2012-11-01 International Business Machines Corporation System, Method and Program Product to Schedule Transfer of Data
US20170024446A1 (en) * 2015-07-21 2017-01-26 Accenture Global Services Limited Data storage extract, transform and load operations for entity and time-based record generation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17917727

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17917727

Country of ref document: EP

Kind code of ref document: A1