US20170109420A1 - Parallel Transfer of SQL Data to Software Framework - Google Patents

Parallel Transfer of SQL Data to Software Framework Download PDF

Info

Publication number
US20170109420A1
US20170109420A1 US14/918,447 US201514918447A US2017109420A1 US 20170109420 A1 US20170109420 A1 US 20170109420A1 US 201514918447 A US201514918447 A US 201514918447A US 2017109420 A1 US2017109420 A1 US 2017109420A1
Authority
US
United States
Prior art keywords
query
database
relational database
software framework
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US14/918,447
Other versions
US10120921B2 (en
Inventor
Joshua A. Allbright
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mastercard International Inc
Original Assignee
Mastercard International Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mastercard International Inc filed Critical Mastercard International Inc
Priority to US14/918,447 priority Critical patent/US10120921B2/en
Assigned to MASTERCARD INTERNATIONAL INCORPORATED reassignment MASTERCARD INTERNATIONAL INCORPORATED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ALLBRIGHT, JOSHUA A.
Priority to PCT/US2016/057577 priority patent/WO2017070134A1/en
Publication of US20170109420A1 publication Critical patent/US20170109420A1/en
Application granted granted Critical
Publication of US10120921B2 publication Critical patent/US10120921B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • G06F17/30569
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1858Parallel file systems, i.e. file systems supporting multiple processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24561Intermediate data storage techniques for performance improvement
    • G06F17/30224
    • G06F17/3033
    • G06F17/30501
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/70Admission control; Resource allocation
    • H04L47/72Admission control; Resource allocation using reservation actions during connection setup
    • H04L47/726Reserving resources in multiple paths to be used simultaneously

Definitions

  • aspects of the disclosure relate in general to machine-to-machine communication of data. Aspects include a method and system to facilitate the parallel transfer of Structured Query Language (SQL) data to a software framework.
  • SQL Structured Query Language
  • a Structured Query Language is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS).
  • RDBMS relational database management system
  • RDSMS relational data stream management system
  • SQL Based upon relational algebra and Tuple relational calculus, SQL consists of a data definition language and a data manipulation language.
  • the scope of SQL includes data insert, query, update and delete, schema creation and modification, and data access control.
  • SQL is often described as, and to a great extent is, a declarative language (4GL), it also includes procedural elements.
  • SQL is an American National Standards Institute (ANSI) and an International Organization for Standardization (ISO) standard.
  • the standard includes a larger set of features. Despite the existence of such standards, most SQL code is not completely portable among different database systems without adjustments.
  • a software framework is a universal reusable software environment that provides particular functionality as part of a larger software platform to facilitate development of software applications, products and solutions.
  • Software frameworks may include support programs, compilers, code libraries, tool sets, and application programming interfaces (APIs) that bring together all the different components to enable development of a project or solution.
  • APIs application programming interfaces
  • a software framework can be thought of as an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thus providing application-specific software.
  • Software frameworks facilitate software development by allowing designers and programmers to devote their time to meeting software requirements rather than dealing with the more standard low-level details of providing a working system, reducing overall development time. For example, a team using a web application framework to develop a banking web site can focus on writing code particular to banking rather than the mechanics of request handling and state management.
  • Embodiments include a system, device, method and computer-readable medium to facilitate the parallel transfer of Structured Query Language data to a software framework.
  • a device embodiment is configured to transferring data from at least one relational database to a software framework.
  • the device includes a processor and a non-transitory computer-readable storage medium.
  • the processor determines a maximum number of parallel database threads for the at least one relational database.
  • the processor load balances the maximum number of parallel database threads, resulting in a load-balanced number of parallel database threads.
  • the processor determines a number of threads per query, and a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets.
  • the processor formulates a query to the at least one relational database with the processor resulting in a formulated query.
  • the formulated query including the load balanced number of parallel database and the number of hash buckets.
  • the formulated query assigns the row of the at least one relational database into a hash bucket.
  • the processor receives a query file from the at least one relational database using the formulated query.
  • the query file being stored on the non-transitory computer-readable storage medium.
  • the processor imports the query file into the software framework.
  • the software framework being stored on the non-transitory computer-readable storage medium.
  • a method embodiment is configured to transferring data from at least one relational database to a software framework.
  • a processor determines a maximum number of parallel database threads for the at least one relational database. The processor load balances the maximum number of parallel database threads, resulting in a load-balanced number of parallel database threads.
  • the processor determines a number of threads per query, and a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets.
  • the processor formulates a query to the at least one relational database with the processor resulting in a formulated query.
  • the formulated query including the load balanced number of parallel database and the number of hash buckets.
  • the formulated query assigns the row of the at least one relational database into a hash bucket.
  • the processor receives a query file from the at least one relational database using the formulated query.
  • the query file being stored on a non-transitory computer-readable storage medium.
  • the processor imports the query file into the software framework.
  • the software framework being stored on the non-transitory computer-readable storage medium.
  • FIG. 1 is flowchart depicting a method to facilitate the parallel transfer of Structured Query Language data to a software framework.
  • FIG. 2 is a block diagram of an architecture embodiment to facilitate the parallel transfer of Structured Query Language data to a software framework.
  • One aspect of the disclosure includes the realization that purchasing and using an SQL database is an expensive endeavor.
  • This disclosure refers to a relational database and an SQL database synonymously.
  • Another aspect of the disclosure is the realization that an open-source software framework is a cost-effective way of processing very large data sets on computer clusters built from commodity hardware.
  • An aspect of the disclosure is the realization that an open-source software framework may be able to replace an SQL database for very large data sets if there was an efficient way of exporting the data set from the SQL database.
  • an open-source software framework may be able to replace an SQL database for very large data sets if there was an efficient way of exporting the data set from the SQL database.
  • an open-source software framework includes the understanding that while it is possible to move entire SQL tables or partitions of tables, it is much more useful to move the contents of an SQL query. While some open-source software frameworks allow moving contents of an SQL query, they do not do it in parallel effectively.
  • Embodiments include a system and method of facilitating the parallel transfer of SQL query data to an open-source software framework.
  • FIG. 1 is a flow chart of a process 1000 of facilitating the parallel transfer of SQL query data to an open-source software framework.
  • Process 1000 may be executed by one or more database server 2000 , which is depicted in FIG. 2 . Both process 1000 and database server 2000 are constructed and operative in accordance with embodiments of the present disclosure. It is understood that a system containing a plurality of database servers 2000 may implement process 1000 .
  • Database server 2000 may run a multi-tasking operating system (OS) and include at least one processor or central processing unit (CPU) 2100 , a non-transitory computer-readable storage medium 2200 , and a network interface 2300 .
  • OS operating system
  • CPU central processing unit
  • Processor 2100 may be a central processing unit, microprocessor, micro-controller, computational device or circuit known in the art. It is understood that processor 2100 may temporarily store data and instructions in a Random Access Memory (RAM) (not shown), as is known in the art.
  • RAM Random Access Memory
  • processor 2100 is functionally comprised of a SQL database server 2110 , a software framework 2130 , a data processor 2120 , and a data transfer interface 2140 .
  • Data processor 2120 interfaces with storage medium 2200 and network interface 2300 .
  • the data processor 2120 enables processor 2100 to locate data on, read data from, and writes data to, these components.
  • SQL database server 2110 is the structure that provides database services to other programs or computers using a client-server model, and may store its information in a SQL database 2210 .
  • An example SQL database server is Oracle Database, sold by Oracle Corporation of Redwood City, Calif.
  • Software framework 2130 is a software framework for distributed storage and distributed processing of very large data sets on computer clusters.
  • software framework 2130 is written in Java programming language.
  • Software framework 2130 may store its data in a software framework distributed file system 2220 .
  • software framework 2130 comprises the modules designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the software framework 2130 .
  • An example software framework 2130 is Apache Hadoop, developed by Apache Software Foundation (ASF) of Forest Hill, Md.
  • Data transfer interface 2140 is the structure that facilitates the parallel transfer of SQL data from a SQL database 2210 to a software framework distributed file system 2220 .
  • data transfer interface 2140 uses Apache Sqoop, developed by Apache Software Foundation to help format the SQL data stream.
  • Computer-readable storage medium 2200 may be a read/write memory such as a magnetic disk drive, floppy disk drive, optical drive, compact-disk read-only-memory (CD-ROM) drive, digital versatile disk (DVD) drive, high definition digital versatile disk (HD-DVD) drive, Blu-ray disc drive, magneto-optical drive, optical drive, flash memory, memory stick, transistor-based memory, magnetic tape or other computer-readable memory device as is known in the art for storing and retrieving data.
  • computer-readable storage medium 2200 may be remotely located from processor 2100 , and be connected to processor 2100 via a network such as a local area network (LAN), a wide area network (WAN), or the Internet.
  • LAN local area network
  • WAN wide area network
  • storage medium 2200 may also contain a SQL database 2210 and a software framework distributed file system 2220 .
  • SQL database 2210 contains a relational database.
  • Software framework distributed file system 2220 splits files into large blocks and distributes them amongst nodes in a database cluster.
  • Software framework transfers packaged code for nodes to process in parallel with microprocessors, based on the data each node processes, resulting in data locality. The data locality allows data to be processed faster and more efficiently that a conventional supercomputer architecture that relies on a parallel file system.
  • Network interface 2300 may be any data port as is known in the art for interfacing, communicating or transferring data across a computer network, examples of such networks include Transmission Control Protocol/Internet Protocol (TCP/IP), Ethernet, Fiber Distributed Data Interface (FDDI), token bus, or token ring networks.
  • TCP/IP Transmission Control Protocol/Internet Protocol
  • FDDI Fiber Distributed Data Interface
  • Network interface 2300 allows database server 2000 to with other database servers.
  • the method 1000 extends capability in data movement between the platforms and increases scale for certain kinds of movement that is otherwise not possible. While data transfer interface 2140 may have built in methods for moving data in parallel from SQL databases 2210 to software frameworks 2130 , the fastest methods are limited to moving entire tables or partitions of tables. Moreover, it is much more useful to move the contents of a query. Method 1000 allows moving the contents of any query against an SQL database 2210 to a software framework 2130 at very high levels of parallelism.
  • Method 1000 employs a hashing function on the rowid of the primary table to evenly bucket rows in the query and have data transfer interface 2140 transfer a bucket of rows per mapper.
  • each mapper pulls an even number of rows given that the hashing function will spread all the rows into buckets evenly.
  • Method 1000 relies on a number of mappers to be less than or equal to the number of hash buckets.
  • This method only works on an SQL database 2210 , such as Oracle Exadata, because the hash function is pushed down to storage.
  • This method works on any SQL database 2210 with built in hash functions or having the ability to specify a user defined function but works particularly well on Oracle Exadata because the hash function is pushed down to storage.
  • data transfer interface 2140 determines the maximum number of a parallel database threads are available.
  • the maximum threads per SQL database server 2110 is a database parameter.
  • the number of parallel threads is load balanced, block 1020 .
  • the maximum number of parallel database threads would be 3920, the load-balanced amount would be 784 threads (20% of 3920).
  • the load balancing may differ depending upon the type of processor 2100 used in the system.
  • Data transfer interface 2140 determines the number of threads per query at block 1030 . From empirical data, approximately four threads per query were found to be optimal by testing 4, 8 and 16 threads per query and measuring runtime. The greater the thread per query, the less number of streams are possible given needing to stay under a set number of parallel threads in total, in this case approximately 784.
  • data transfer interface 2140 determines the mappers, hash buckets, and database connections per job over an SQL row.
  • the number of mappers defines the number of hash buckets to be used and database connections. From empirical data, approximately thirty-two queries per job (with each query pulling from a hash bucket of rows) were found to be optimal. Thirty-two is a function of how many total parallel threads used based on the output from the number of parallel threads and the overall concurrency achieved. A maximum of 784 threads, 32 queries per job, and using 4 threads per query, results in 128 threads for the job. This results in about six jobs concurrently. Thus, the selection of thirty-two queries per job is really a function of the overall concurrency achieved. If four threads per query and thirty-two queries per job, there will be a total of 128 parallel threads on the database per job.
  • Data transfer interface 2140 formulates a query of the desired information, defining the number of threads and hash buckets, assigning SQL table rows into a hash bucket, block 1050 . It is understood by those familiar with the art that the query may include a database administrator user name and password, as well as the names of database servers 2000 . Using the above example, the following query would be generated:
  • the software framework 2130 receives the exported SQL query file, block 1060 , and imports the SQL query file into the software framework distributed file system 2220 , block 1070 .

Abstract

A system, method, and computer-readable storage medium configured to facilitate the parallel transfer of Structured Query Language (SQL) data to a software framework.

Description

    BACKGROUND
  • Field of the Disclosure
  • Aspects of the disclosure relate in general to machine-to-machine communication of data. Aspects include a method and system to facilitate the parallel transfer of Structured Query Language (SQL) data to a software framework.
  • Description of the Related Art
  • A Structured Query Language is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS).
  • Based upon relational algebra and Tuple relational calculus, SQL consists of a data definition language and a data manipulation language. The scope of SQL includes data insert, query, update and delete, schema creation and modification, and data access control. Although SQL is often described as, and to a great extent is, a declarative language (4GL), it also includes procedural elements.
  • SQL is an American National Standards Institute (ANSI) and an International Organization for Standardization (ISO) standard. The standard includes a larger set of features. Despite the existence of such standards, most SQL code is not completely portable among different database systems without adjustments.
  • A software framework is a universal reusable software environment that provides particular functionality as part of a larger software platform to facilitate development of software applications, products and solutions. Software frameworks may include support programs, compilers, code libraries, tool sets, and application programming interfaces (APIs) that bring together all the different components to enable development of a project or solution.
  • In computer programming, a software framework can be thought of as an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thus providing application-specific software. Software frameworks facilitate software development by allowing designers and programmers to devote their time to meeting software requirements rather than dealing with the more standard low-level details of providing a working system, reducing overall development time. For example, a team using a web application framework to develop a banking web site can focus on writing code particular to banking rather than the mechanics of request handling and state management.
  • SUMMARY
  • Embodiments include a system, device, method and computer-readable medium to facilitate the parallel transfer of Structured Query Language data to a software framework.
  • A device embodiment is configured to transferring data from at least one relational database to a software framework. The device includes a processor and a non-transitory computer-readable storage medium. The processor determines a maximum number of parallel database threads for the at least one relational database. The processor load balances the maximum number of parallel database threads, resulting in a load-balanced number of parallel database threads. The processor determines a number of threads per query, and a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets. The processor formulates a query to the at least one relational database with the processor resulting in a formulated query. The formulated query including the load balanced number of parallel database and the number of hash buckets. The formulated query assigns the row of the at least one relational database into a hash bucket. The processor receives a query file from the at least one relational database using the formulated query. The query file being stored on the non-transitory computer-readable storage medium. The processor imports the query file into the software framework. The software framework being stored on the non-transitory computer-readable storage medium.
  • A method embodiment is configured to transferring data from at least one relational database to a software framework. A processor determines a maximum number of parallel database threads for the at least one relational database. The processor load balances the maximum number of parallel database threads, resulting in a load-balanced number of parallel database threads. The processor determines a number of threads per query, and a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets. The processor formulates a query to the at least one relational database with the processor resulting in a formulated query. The formulated query including the load balanced number of parallel database and the number of hash buckets. The formulated query assigns the row of the at least one relational database into a hash bucket. The processor receives a query file from the at least one relational database using the formulated query. The query file being stored on a non-transitory computer-readable storage medium. The processor imports the query file into the software framework. The software framework being stored on the non-transitory computer-readable storage medium.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is flowchart depicting a method to facilitate the parallel transfer of Structured Query Language data to a software framework.
  • FIG. 2 is a block diagram of an architecture embodiment to facilitate the parallel transfer of Structured Query Language data to a software framework.
  • DETAILED DESCRIPTION
  • One aspect of the disclosure includes the realization that purchasing and using an SQL database is an expensive endeavor. This disclosure refers to a relational database and an SQL database synonymously.
  • Another aspect of the disclosure is the realization that an open-source software framework is a cost-effective way of processing very large data sets on computer clusters built from commodity hardware.
  • An aspect of the disclosure is the realization that an open-source software framework may be able to replace an SQL database for very large data sets if there was an efficient way of exporting the data set from the SQL database. In further aspect includes the understanding that while it is possible to move entire SQL tables or partitions of tables, it is much more useful to move the contents of an SQL query. While some open-source software frameworks allow moving contents of an SQL query, they do not do it in parallel effectively.
  • Embodiments include a system and method of facilitating the parallel transfer of SQL query data to an open-source software framework.
  • The systems and processes are not limited to the specific embodiments described herein. In addition, components of each system and each process can be practiced independently and separately from other components and processes described herein. Each component and process also can be used in combination with other assembly packages and processes.
  • FIG. 1 is a flow chart of a process 1000 of facilitating the parallel transfer of SQL query data to an open-source software framework. Process 1000 may be executed by one or more database server 2000, which is depicted in FIG. 2. Both process 1000 and database server 2000 are constructed and operative in accordance with embodiments of the present disclosure. It is understood that a system containing a plurality of database servers 2000 may implement process 1000.
  • Database server 2000 may run a multi-tasking operating system (OS) and include at least one processor or central processing unit (CPU) 2100, a non-transitory computer-readable storage medium 2200, and a network interface 2300.
  • Processor 2100 may be a central processing unit, microprocessor, micro-controller, computational device or circuit known in the art. It is understood that processor 2100 may temporarily store data and instructions in a Random Access Memory (RAM) (not shown), as is known in the art.
  • As shown in FIG. 2, processor 2100 is functionally comprised of a SQL database server 2110, a software framework 2130, a data processor 2120, and a data transfer interface 2140.
  • Data processor 2120 interfaces with storage medium 2200 and network interface 2300. The data processor 2120 enables processor 2100 to locate data on, read data from, and writes data to, these components.
  • SQL database server 2110 is the structure that provides database services to other programs or computers using a client-server model, and may store its information in a SQL database 2210. An example SQL database server is Oracle Database, sold by Oracle Corporation of Redwood City, Calif.
  • Software framework 2130 is a software framework for distributed storage and distributed processing of very large data sets on computer clusters. In some embodiments, software framework 2130 is written in Java programming language. Software framework 2130 may store its data in a software framework distributed file system 2220. In some embodiments, software framework 2130 comprises the modules designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the software framework 2130. An example software framework 2130 is Apache Hadoop, developed by Apache Software Foundation (ASF) of Forest Hill, Md.
  • Data transfer interface 2140 is the structure that facilitates the parallel transfer of SQL data from a SQL database 2210 to a software framework distributed file system 2220. In one embodiment, data transfer interface 2140 uses Apache Sqoop, developed by Apache Software Foundation to help format the SQL data stream.
  • These structures may be implemented as hardware, firmware, or software encoded on a computer readable medium, such as storage medium 2200. Further details of these components are described with their relation to method embodiments below.
  • Computer-readable storage medium 2200 may be a read/write memory such as a magnetic disk drive, floppy disk drive, optical drive, compact-disk read-only-memory (CD-ROM) drive, digital versatile disk (DVD) drive, high definition digital versatile disk (HD-DVD) drive, Blu-ray disc drive, magneto-optical drive, optical drive, flash memory, memory stick, transistor-based memory, magnetic tape or other computer-readable memory device as is known in the art for storing and retrieving data. In some embodiments, computer-readable storage medium 2200 may be remotely located from processor 2100, and be connected to processor 2100 via a network such as a local area network (LAN), a wide area network (WAN), or the Internet.
  • In addition, as shown in FIG. 2, storage medium 2200 may also contain a SQL database 2210 and a software framework distributed file system 2220. SQL database 2210 contains a relational database. Software framework distributed file system 2220 splits files into large blocks and distributes them amongst nodes in a database cluster. Software framework transfers packaged code for nodes to process in parallel with microprocessors, based on the data each node processes, resulting in data locality. The data locality allows data to be processed faster and more efficiently that a conventional supercomputer architecture that relies on a parallel file system.
  • Network interface 2300 may be any data port as is known in the art for interfacing, communicating or transferring data across a computer network, examples of such networks include Transmission Control Protocol/Internet Protocol (TCP/IP), Ethernet, Fiber Distributed Data Interface (FDDI), token bus, or token ring networks. Network interface 2300 allows database server 2000 to with other database servers.
  • We now return our attention to a method or process embodiment of the present disclosure, as shown in FIG. 1. It is understood by those known in the art other equivalent implementations can exist without departing from the spirit or claims of the disclosure.
  • The method 1000 extends capability in data movement between the platforms and increases scale for certain kinds of movement that is otherwise not possible. While data transfer interface 2140 may have built in methods for moving data in parallel from SQL databases 2210 to software frameworks 2130, the fastest methods are limited to moving entire tables or partitions of tables. Moreover, it is much more useful to move the contents of a query. Method 1000 allows moving the contents of any query against an SQL database 2210 to a software framework 2130 at very high levels of parallelism.
  • Method 1000 employs a hashing function on the rowid of the primary table to evenly bucket rows in the query and have data transfer interface 2140 transfer a bucket of rows per mapper. When setup according the method 1000, each mapper pulls an even number of rows given that the hashing function will spread all the rows into buckets evenly. Method 1000 relies on a number of mappers to be less than or equal to the number of hash buckets. This method only works on an SQL database 2210, such as Oracle Exadata, because the hash function is pushed down to storage. This method works on any SQL database 2210 with built in hash functions or having the ability to specify a user defined function but works particularly well on Oracle Exadata because the hash function is pushed down to storage.
  • At block 1010, data transfer interface 2140 determines the maximum number of a parallel database threads are available. The maximum threads per SQL database server 2110 is a database parameter. The maximum number of parallel database threads is simply the number of SQL database servers 2110 times the maximum threads per SQL database server. For example, suppose the maximum threads per SQL database server is 280, and there are fourteen SQL database servers 2110 in the system. The maximum number of parallel database threads would be 3920, as 280×14=3920.
  • Once the maximum number of parallel database threads is calculated, using this number of threads would take up a large processor load. Consequently, the number of parallel threads is load balanced, block 1020. In one example, to avoid overloading the processor 2100, only twenty percent of the maximum number of parallel database threads would be used. Applying the above example, if the maximum number of parallel database threads would be 3920, the load-balanced amount would be 784 threads (20% of 3920). The load balancing may differ depending upon the type of processor 2100 used in the system.
  • Data transfer interface 2140 determines the number of threads per query at block 1030. From empirical data, approximately four threads per query were found to be optimal by testing 4, 8 and 16 threads per query and measuring runtime. The greater the thread per query, the less number of streams are possible given needing to stay under a set number of parallel threads in total, in this case approximately 784.
  • At block 1040, data transfer interface 2140 determines the mappers, hash buckets, and database connections per job over an SQL row. The number of mappers defines the number of hash buckets to be used and database connections. From empirical data, approximately thirty-two queries per job (with each query pulling from a hash bucket of rows) were found to be optimal. Thirty-two is a function of how many total parallel threads used based on the output from the number of parallel threads and the overall concurrency achieved. A maximum of 784 threads, 32 queries per job, and using 4 threads per query, results in 128 threads for the job. This results in about six jobs concurrently. Thus, the selection of thirty-two queries per job is really a function of the overall concurrency achieved. If four threads per query and thirty-two queries per job, there will be a total of 128 parallel threads on the database per job.
  • Data transfer interface 2140 formulates a query of the desired information, defining the number of threads and hash buckets, assigning SQL table rows into a hash bucket, block 1050. It is understood by those familiar with the art that the query may include a database administrator user name and password, as well as the names of database servers 2000. Using the above example, the following query would be generated:
  • sqoop import -D oraoop.jdbc.url.verbatim=true -D
    mapred.child.java.opts = “-Xmx4096m”
    --connect
    jdbc:oracle:thin:@//database:port/service_name
    --username user
    --password $PASS
    --query “select /*+ parallel(4)*/ to_char(col1_date,
    ‘YYYY-MM-DD’) as col1_str, table1.* from table1 where col1
    _date between ‘01-JAN-2015’ and ‘31-JAN-2015’ and
    \$CONDITIONS” #Query to be moved
    --as-avrodatafile
    --compression-codec
    org.apache.hadoop.io.compress.SnappyCodec
    -m 32  #This defines how many parallel
    connections (mappers), in this case we would like 32 to
    match the number of hash buckets we have
    --target-dir
    hdfs://nameservice1/user/admin/table1/2015/1
    --split-by “ora_hash(rowid, 31)” #This
    assigns every row to a bucket, in this case 32
    --boundary-query “select 0 min_bucket, 31 max_bucket
    from dual” #This describes the minimum and
    maximum bucket values, in this case the smallest bucket is
    0 and largest bucket is 31
    --fetch-size 100000
  • Using this query, the software framework 2130 receives the exported SQL query file, block 1060, and imports the SQL query file into the software framework distributed file system 2220, block 1070.
  • It is understood by those familiar with the art that the system described herein may be implemented in hardware, firmware, or software encoded on a non-transitory computer-readable storage medium.
  • The previous description of the embodiments is provided to enable any person skilled in the art to practice the disclosure. The various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (18)

What is claimed is:
1. A method of transferring data from at least one relational database to a software framework comprising:
determining, with a processor, a maximum number of parallel database threads for the at least one relational database;
load balancing the maximum number of parallel database threads with the processor, resulting in a load-balanced number of parallel database threads;
determining, with the processor, a number of threads per query;
determining, with the processor, a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets;
formulating a query to the at least one relational database with the processor resulting in a formulated query, the formulated query including the load balanced number of parallel database and the number of hash buckets, the formulated query assigning the row of the at least one relational database into a hash bucket;
receiving a query file from the at least one relational database using the formulated query, the query file being stored on a non-transitory computer-readable storage medium;
importing the query file into the software framework, the software framework being stored on the non-transitory computer-readable storage medium.
2. The method of claim 1, wherein the formulated query hashes the row of the at least one relational database into the hash bucket evenly.
3. The method of claim 2, wherein the at least one relational database is a Structured Query Language database.
4. The method of claim 3, wherein the software framework is configured for distributed storage and processing on computer clusters.
5. The method of claim 4, wherein the software framework is configured for distributed storage and processing on a plurality of computer clusters.
6. The method of claim 5, wherein each of the plurality of computer clusters comprises:
a microprocessor configured to process in parallel with the plurality of the computer clusters.
7. A computing device to transfer data from at least one relational database to a software framework comprising:
means for determining a maximum number of parallel database threads for the at least one relational database;
means for load balancing the maximum number of parallel database threads, resulting in a load-balanced number of parallel database threads;
means for determining a number of threads per query;
means for determining a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets;
means for formulating a query to the at least one relational database resulting in a formulated query, the formulated query including the load balanced number of parallel database and the number of hash buckets, the formulated query assigning the row of the at least one relational database into a hash bucket;
means for receiving a query file from the at least one relational database using the formulated query, the query file being stored on a non-transitory computer-readable storage medium;
means for importing the query file into the software framework, the software framework being stored on the non-transitory computer-readable storage medium.
8. The device of claim 7, wherein the formulated query hashes the row of the at least one relational database into the hash bucket evenly.
9. The device of claim 8, wherein the at least one relational database is a Structured Query Language database.
10. The device of claim 9, wherein the software framework is configured for distributed storage and processing on computer clusters.
11. The device of claim 10, wherein the software framework is configured for distributed storage and processing on a plurality of computer clusters.
12. The device of claim 11, wherein each of the plurality of computer clusters comprises:
a microprocessor configured to process in parallel with the plurality of the computer clusters.
13. A device to transfer data from at least one relational database to a software framework comprising:
a processor configured to determining a maximum number of parallel database threads for the at least one relational database, to load balance the maximum number of parallel database threads with the processor, resulting in a load balanced number of parallel database threads, to determine a number of threads per query, to determine a number of hash buckets to be used by mapping a row of the at least one relational database to the hash buckets, to formulate a query to the at least one relational database with the processor resulting in a formulated query, the formulated query including the load balanced number of parallel database and the number of hash buckets, the formulated query assigning the row of the at least one relational database into a hash bucket, to receive a query file from the at least one relational database using the formulated query, to import the query file into a software framework;
a non-transitory computer-readable storage medium configured to store the query file and the software framework.
14. The device of claim 13, wherein the formulated query hashes the row of the at least one relational database into the hash bucket evenly.
15. The device of claim 14, wherein the at least one relational database is a Structured Query Language database.
16. The device of claim 15, wherein the software framework is configured for distributed storage and processing on computer clusters.
17. The device of claim 16, wherein the software framework is configured for distributed storage and processing on a plurality of computer clusters.
18. The device of claim 17, wherein each of the plurality of computer clusters comprises:
a microprocessor configured to process in parallel with the plurality of the computer clusters.
US14/918,447 2015-10-20 2015-10-20 Parallel transfer of SQL data to software framework Active 2036-11-14 US10120921B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/918,447 US10120921B2 (en) 2015-10-20 2015-10-20 Parallel transfer of SQL data to software framework
PCT/US2016/057577 WO2017070134A1 (en) 2015-10-20 2016-10-19 Parallel transfer of sql data to software framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US14/918,447 US10120921B2 (en) 2015-10-20 2015-10-20 Parallel transfer of SQL data to software framework

Publications (2)

Publication Number Publication Date
US20170109420A1 true US20170109420A1 (en) 2017-04-20
US10120921B2 US10120921B2 (en) 2018-11-06

Family

ID=57206451

Family Applications (1)

Application Number Title Priority Date Filing Date
US14/918,447 Active 2036-11-14 US10120921B2 (en) 2015-10-20 2015-10-20 Parallel transfer of SQL data to software framework

Country Status (2)

Country Link
US (1) US10120921B2 (en)
WO (1) WO2017070134A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236403A1 (en) * 2018-01-31 2019-08-01 Analytical Graphics, Inc. Systems and Methods for Converting Massive Point Cloud Datasets to a Hierarchical Storage Format
CN111414358A (en) * 2019-12-30 2020-07-14 杭州美创科技有限公司 Method for loading relational database data
US11258585B2 (en) * 2019-03-25 2022-02-22 Woven Planet North America, Inc. Systems and methods for implementing robotics frameworks
CN114416737A (en) * 2022-01-04 2022-04-29 北京中电兴发科技有限公司 Time sequence data storage method based on dynamic weight balance time sequence database cluster

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020174267A1 (en) * 2001-03-01 2002-11-21 International Business Machines Corporation Performance optimizer for the transfer of bulk data between computer systems
US20060265696A1 (en) * 2005-05-23 2006-11-23 International Business Machines Corporation Method and system for data migration between versions of software
US7394288B1 (en) * 2004-12-13 2008-07-01 Massachusetts Institute Of Technology Transferring data in a parallel processing environment
US7461236B1 (en) * 2005-03-25 2008-12-02 Tilera Corporation Transferring data in a parallel processing environment
US20120054362A1 (en) * 2010-08-31 2012-03-01 Canon Kabushiki Kaisha Mechanism for autotuning mass data transfer from a sender to a receiver over parallel connections
US20120110309A1 (en) * 2010-10-29 2012-05-03 Advanced Micro Devices, Inc. Data Output Transfer To Memory
US20140324928A1 (en) * 2013-04-29 2014-10-30 Hewlett-Packard Development Company, L.P. Large-scale data transfer
US20160036903A1 (en) * 2014-07-31 2016-02-04 Splunk Inc. Asynchronous processing of messages from multiple servers
US20160036716A1 (en) * 2014-07-31 2016-02-04 Splunk Inc. Priority-based processing of messages from multiple servers
US20160246538A1 (en) * 2015-02-19 2016-08-25 Simulatio Corporation Data transfer method and data transfer program
US20160294549A1 (en) * 2015-03-31 2016-10-06 Here Global B.V. Method and apparatus for migrating encrypted data

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020174267A1 (en) * 2001-03-01 2002-11-21 International Business Machines Corporation Performance optimizer for the transfer of bulk data between computer systems
US7394288B1 (en) * 2004-12-13 2008-07-01 Massachusetts Institute Of Technology Transferring data in a parallel processing environment
US7461236B1 (en) * 2005-03-25 2008-12-02 Tilera Corporation Transferring data in a parallel processing environment
US20060265696A1 (en) * 2005-05-23 2006-11-23 International Business Machines Corporation Method and system for data migration between versions of software
US20120054362A1 (en) * 2010-08-31 2012-03-01 Canon Kabushiki Kaisha Mechanism for autotuning mass data transfer from a sender to a receiver over parallel connections
US20120110309A1 (en) * 2010-10-29 2012-05-03 Advanced Micro Devices, Inc. Data Output Transfer To Memory
US20140324928A1 (en) * 2013-04-29 2014-10-30 Hewlett-Packard Development Company, L.P. Large-scale data transfer
US20160036903A1 (en) * 2014-07-31 2016-02-04 Splunk Inc. Asynchronous processing of messages from multiple servers
US20160036716A1 (en) * 2014-07-31 2016-02-04 Splunk Inc. Priority-based processing of messages from multiple servers
US20160246538A1 (en) * 2015-02-19 2016-08-25 Simulatio Corporation Data transfer method and data transfer program
US20160294549A1 (en) * 2015-03-31 2016-10-06 Here Global B.V. Method and apparatus for migrating encrypted data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236403A1 (en) * 2018-01-31 2019-08-01 Analytical Graphics, Inc. Systems and Methods for Converting Massive Point Cloud Datasets to a Hierarchical Storage Format
US10438092B2 (en) * 2018-01-31 2019-10-08 Analytical Graphics, Inc. Systems and methods for converting massive point cloud datasets to a hierarchical storage format
US11258585B2 (en) * 2019-03-25 2022-02-22 Woven Planet North America, Inc. Systems and methods for implementing robotics frameworks
CN111414358A (en) * 2019-12-30 2020-07-14 杭州美创科技有限公司 Method for loading relational database data
CN114416737A (en) * 2022-01-04 2022-04-29 北京中电兴发科技有限公司 Time sequence data storage method based on dynamic weight balance time sequence database cluster

Also Published As

Publication number Publication date
US10120921B2 (en) 2018-11-06
WO2017070134A1 (en) 2017-04-27

Similar Documents

Publication Publication Date Title
US11113279B2 (en) Materialized views based on access rights
Toshniwal et al. Storm@ twitter
Shamsi et al. Data-intensive cloud computing: requirements, expectations, challenges, and solutions
US8478790B2 (en) Mechanism for co-located data placement in a parallel elastic database management system
US9576000B2 (en) Adaptive fragment assignment for processing file data in a database
US11775523B2 (en) Hash table structure for optimizing hash join operations in a relational database system
US11271995B2 (en) Partition balancing in an on-demand services environment
US20120151479A1 (en) Horizontal splitting of tasks within a homogenous pool of virtual machines
US9460152B2 (en) Dynamically indentifying and preventing skewed partitions in a shared-nothing database
US10120921B2 (en) Parallel transfer of SQL data to software framework
JP2018506775A (en) Identifying join relationships based on transaction access patterns
GB2519761A (en) A method and a system for distributed processing of data records
US11163792B2 (en) Work assignment in parallelized database synchronization
CN104239529A (en) Method and device for preventing Hive data from being inclined
Dehne et al. A distributed tree data structure for real-time OLAP on cloud architectures
US11157496B2 (en) Predictive data distribution for parallel databases to optimize storage and query performance
CN105608077A (en) Big data distributed storage method and system
Hameurlain et al. Big data management in the cloud: evolution or crossroad?
KR20140049202A (en) Distributed database management system and method
Bodepudi Data Transfer Between RDBMS and HDFS By Using The Spark Framework In Sqoop For Better Performance
Zhao et al. A multidimensional OLAP engine implementation in key-value database systems
Nanda Review of Query Processing Techniques of Cloud Databases
Gao et al. Batch insertion strategy in a distribution database
Papailiou et al. Scalable Indexing and Adaptive Querying of RDF Data in the cloud
Sahu et al. Graph Based Workload Driven Partitioning System by Using MongoDB

Legal Events

Date Code Title Description
AS Assignment

Owner name: MASTERCARD INTERNATIONAL INCORPORATED, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ALLBRIGHT, JOSHUA A.;REEL/FRAME:036838/0340

Effective date: 20151020

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4