WO2012102707A1 - Analytical data processing - Google Patents

Analytical data processing Download PDF

Info

Publication number
WO2012102707A1
WO2012102707A1 PCT/US2011/022437 US2011022437W WO2012102707A1 WO 2012102707 A1 WO2012102707 A1 WO 2012102707A1 US 2011022437 W US2011022437 W US 2011022437W WO 2012102707 A1 WO2012102707 A1 WO 2012102707A1
Authority
WO
WIPO (PCT)
Prior art keywords
user defined
query
function
olap
functions
Prior art date
Application number
PCT/US2011/022437
Other languages
French (fr)
Inventor
Muthian GEORGE
Original Assignee
Hewlett-Packard Development Company L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company L.P. filed Critical Hewlett-Packard Development Company L.P.
Priority to PCT/US2011/022437 priority Critical patent/WO2012102707A1/en
Priority to CN2011800601915A priority patent/CN103262076A/en
Priority to US13/885,032 priority patent/US20130238548A1/en
Priority to EP11856881.5A priority patent/EP2668598A4/en
Priority to US13/240,582 priority patent/US9355145B2/en
Priority to US13/249,482 priority patent/US8856151B2/en
Priority to US13/272,598 priority patent/US9229984B2/en
Priority to US13/364,155 priority patent/US8538954B2/en
Priority to US13/485,246 priority patent/US20120239612A1/en
Publication of WO2012102707A1 publication Critical patent/WO2012102707A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • This invention relates to information processing, and more particularly, to integration of user defined functions into an analytical data processing framework.
  • Analytical processing systems consume tables of data which are typically linked together by relationships that simplify the storage of data and make queries of the data more efficient.
  • a standardized query language such as
  • Structured Query Language ⁇ SQL ⁇ can be used for creating and operating on relational databases.
  • Analytics is the application of computer technology and statistics to solve problems in business and industry.
  • the science of analytics is concerned with extracting useful properties of data using computable functions and, generally speaking, involves the extraction of desired properties of data sets from large databases. Analytics therefore bridges the disciplines of computer science, statistics, and mathematics.
  • FIG. 1 illustrates an example of an analytic data processing system in which user defined functions (UDF) have been integrated into an online analytics processing (OLAP) framework.
  • UDF user defined functions
  • OLAP online analytics processing
  • FIG. 2 illustrates one example of the integration of an analytical processing system into a database system, such that user defined functions (UDF) have been integrated into an online analytics processing (OLAP) and GROUP BY processing framework.
  • UDF user defined functions
  • OLAP online analytics processing
  • GROUP BY processing framework GROUP BY processing framework
  • FIG. 3 illustrates one method for executing a query.
  • FIG. 4 is a schematic block diagram illustrating an exemplary system of hardware components capable of implementing examples of the systems and methods for user defined function integration disclosed in FIGS. 1 -3.
  • FIG. 1 illustrates an example of an analytical processing system 10 in which user defined functions (UDF) have been integrated into an online analytics processing (OLAP) framework.
  • the system 10 of FIG. 1 can represent a database system, a data warehouse, a data mart, an in-memory database, a standalone OLAP engine, a business intelligence report generation system, a data mining system, or a federated query processing system.
  • the analytics system can be represented as a plurality of functional components 22, 24, 26, and 28, each of which can be implemented as any appropriate combination of hardware and programming configured to perform their associated function.
  • each of the query compiler 22, the processing engine 24, the data tables 26, and the user defined function 28 is stored on a non-transitory medium and executed by an associated processor, but it will be appreciated that other
  • the example system 10 includes a processor 12 and a memory 14 connected to a communications interface 16.
  • the memory 14 can be a removable memory, connected to the processor 12 and the communications interface 16 through an appropriate port or drive, such as an optical drive, a USB port, or other appropriate interface.
  • the memory 14 can be remote from the processor 12, with machine readable instructions stored on the memory provided to the processor via a communications link.
  • the communication interface 16 can comprise any appropriate hardware and machine readable instructions for receiving database queries from an associated query source (not shown) and returning the results of the queries to the query source.
  • the communications interface 1 6 can include any or all of a bus or similar data connection within a computer system or a wired or wireless network adapter.
  • the memory 14 can include any appropriate standard storage devices associated with computer systems, such as magnetic and optical storage media.
  • the device memory 14 can include a query compiler 22, a processing engine 24 to compile and execute queries on data tables 26, and a user defined function (UDF) 28.
  • the query compiler 22 can utilize any appropriate query language, for example, structured query language (SQL) or multidimensional expression (MDX) language.
  • the user defined function 28 is stored as one of a shared object or a dynamic link library.
  • the query compiler 22 identifies a call to the user defined function 28 and a defined data partition. For example, the query compiler 22 can identify the defined data partition from one of a GROUP BY and an OLAP window expression associated with the call to the user defined function 28.
  • the processing engine 24 executes the user defined function as a windowed function, such that the function operates on the defined partition set to produce a function output.
  • the user defined function 28 has metadata exposed to the query compiler 22 and the processing engine 24 indicating that the user defined function is executable within an OLAP window or GROUP BY expression.
  • the metadata can further include defined inputs, outputs, and parameters for the user defined function.
  • the exposed metadata can further include an application programming interface (API) comprising rules for validating and resolving arguments within the call to the user defined function.
  • API application programming interface
  • the user defined function metadata further comprises a second function pointer to construct an object for processing during execution of the query.
  • the user defined function 28 can be one of a plurality of user defined functions identified by the query compiler 22 within a query, and the processing engine 24 can execute the plurality of user defined functions as part of the query such that one user defined function consumes, as an input, field(s) output by another of the plurality of user defined functions.
  • Each of the user defined functions can have an associated defined OLAP window, and the defined OLAP window for each function can differ among the defined windows.
  • FIG. 2 ilSustrates one example of the integration of an analytical processing system into a database system 50, such that user defined functions (UDF) have been integrated into an online analytics processing (OLAP) and GROUP BY processing framework.
  • the system 50 includes a processor 52 and a memory 54 connected to a communications interface 56.
  • the communication interface 56 can comprise any appropriate hardware and machine readable instructions for receiving database queries from an associated query source (not shown) and returning the results of the queries to the query source.
  • the communications interface 56 can include any or all of a bus or similar data connection within a computer system or a wired or wireless network adapter.
  • the memory 54 can include any appropriate standard storage devices associated with computer systems, such as magnetic and optica! storage media.
  • the device memory 54 can include a query compiler 62 and a database engine 64 to compile and execute queries on a database table 66.
  • the query compiler 62 identifies a call to a user defined function and a defined data partition, expressed, for example, as an online analytics processing (OLAP) window or a GROUP BY expression, for the user defined function, if the function is of an appropriate class type for integrated OLAP processing. If the defined OLAP data partition is provided, the database engine 64 executes the user defined function to provide an output having a set of one or more output fields and projects the returned output fields back into the select list.
  • OLAP online analytics processing
  • the query compiler 62 processes a call to a user defined function in a database query using a standardized user defined function syntax to distinct!y map input, output, and parameter expressions to appropriate objects.
  • List of fields in the input and output expressions define rows of data that can be composed into tables.
  • inputs are presented as a list of fields within parentheses following the user defined function name.
  • Inputs are extracted from query fields, and composed as input table.
  • Outputs are denoted with a keyword OUTPUT and a list of fields in parentheses.
  • the query compiler 62 structures these input and output fields into self-describing table objects with field names, data types and data size to standardize processing of all user defined function class types capable of handling multiple records simultaneously.
  • Each table object has metadata to describe each field of the table object in terms of its name, data type, and data length. Tables also have the ability to store and retrieve rows of data based on row number. They are capable of handling large row sets with the ability to manage memory overflow into disk files.
  • the query compiler 62 includes a query parser 74, which extracts each of the input table, the output table, and any parameter objects from each user defined function. It will be appreciated that the basic syntax for all user defined functions, regardless of their associated class, can be standardized, such that they are parsed in a similar manner at the query parser 74.
  • One example of such a standardized syntax can include a statement such as:
  • Output fields are field names from the user defined function output metadata or field position identifiers using "$#" syntax, where $ represents a special character and # represents an ordinal number of the output field left to right starting from one.
  • $ represents a special character
  • # represents an ordinal number of the output field left to right starting from one.
  • the "key” is the field name in the user defined function parameter metadata.
  • the "valueExpression” is a constant or an expression that evaluates to a constant.
  • the parameters defined in the expression can be dates, time, time- stamps, integers, decimal values ⁇ e.g., float, double, or long double values), character strings, or comma separated array constants formed from one of these data types.
  • the user defined functions are built in a UDF library 68, for example as shared objects or dynamic link Sibraries, and registered with the database engine.
  • Each shared object exposes the user defined functions in the form of self-describing UDF metadata 69 that can be retrieved by the query complier 62.
  • the UDF metadata 69 can include a name of the user defined function, a description, an associated class type, a factory constructor function pointer to create a runtime processing object instance, a function pointer for the va!idation and resolution of input, output and parameters, and other runtime optimization
  • the query complier 64 accesses these fields for query resolution and validation.
  • the input and output fields can be either fixed or variable fields. Each fixed field represents only one field during execution of the query, while each variable field can represent zero, one, or multiple fields during execution. When input and-output fields are marked as variable, they can be repeated zero, one, or multiple times at the time of query processing. It will be appreciated that variable fields can occur at any input or output position, and there can be multiple variable fields in input and output.
  • the size of a variable field for a given instance of the function can be provided, for example, as a parameter expression.
  • the parameter expression can also be used to define scalar or array values and character parameters used for processing in the user defined function itself.
  • Input and output fields are marked as fixed or variable types in the input and output metadata of each user defined function. It will be appreciated that a given user defined function requires a standard order and number of inputs, as defined in the input metadata for the function. Each field is referenced by a corresponding input argument field or fields in the query. Specifically, fixed fields will have only one corresponding input argument field and variable fields can have zero, one or multiple consecutive input argument fields in the query, input and output fields can have fixed or undefined data types within the metadata. When the data type of an input field is undefined, the field obtains its data types from the data types of input field arguments in the query. Query input field defines if the field has null value or not.
  • data from the input query field is converted to the required input field type of the user defined function at runtime when the data types do not match and the data type promotion is valid.
  • data type is undefined in an output field, it is set to depend upon an input field for resolution.
  • undefined output fields are resolved from the input fields.
  • the metadata can also include default output fields, which are returned when a query does not explicitly assign output fields.
  • a repetition number for the variable field can be determined as the difference between a total number of input arguments and the number of fixed input fields defined in the user defined function metadata.
  • parameter fields are used to provide the repetition number, with each variable field having a corresponding parameter field.
  • the parameter input is used for validating and resolving the repetition number for the variable fields.
  • the names for the variable input and output fields can be generated serially by extending the base name of the variable field with a number incrementally starting from one.
  • the fixed data type of all the variable input and output fields is same as the field marked as variable.
  • the undefined data type of the variable input field is resolved by the data type of the input fields in the query.
  • a general purpose function supplied with the system can be used to validate input and output fields and resolve their data types and lengths at the query compiler 62 when explicit function is not supplied by the user defined function as a function pointer in the metadata.
  • the listing of fields in the output expression allows the output of a user defined function to be mapped to a list of fields provided within the output metadata.
  • the user defined function returns only the fields that are mapped in the query.
  • Output fields can be mapped in any order in queries, using either the field name or field position identifier. If output variable fields are mapped using position identifiers, the validation and resolution function substitutes position identifiers with output field names and sets data type and size in the output table object at query compile time.
  • Changing the order and mapping is particularly useful for nested user defined function processing, to allow an inner user defined function to return the fields in the order the outer user defined function requires as input.
  • output mappings from user defined functions are merged with the rest of the fields in the projection list in the order in which they are mapped.
  • projection fields are mapped similarly to queries with OLAP multidimensional and multiple windows functions.
  • the metadata 69 for each user defined function can include an associated class for each function out of a plurality of function classes 72 to assist in the optimization of the query.
  • the user defined function class types implicitly set the rules for data processing in the database engine along with the cardinality of their output results. For example, user defined functions belonging to some classes will be processed in OLAP windows, whereas such processing is impractical or unnecessary for other classes of functions. Unlike inbuilt functions that return only one output field, all the user defined function class types may return one or multiple output fields.
  • user defined functions can be associated with any of ten classes.
  • Aggregate user defined functions process a set of rows partitioned by a GROUP BY expression or an OLAP window expression and return only one output row per input row set. Without a GROUP BY expression or an OLAP window, the whole table data from the query is considered as the input set.
  • Rank-type user defined functions process a set of rows partitioned by an OLAP window and sorted in ORDER BY clause and return one row per input row.
  • Series user defined functions are processed by OLAP PARTITION BY in a subset window frame of ROWS or RANGE clause. It will be appreciated that aggregate and series classes are not mutually exclusive, and that a given user defined function can belong to either or both the classes.
  • SIWF Inbuilt Window Frame
  • a SIWF user defined function receives the window frame size through a parameter specification. The function receives one input row at a time and returns one output aggregate return row for the result generated using the window frame size given.
  • the SIWF class supports fast computations for moving window aggregates away from the OLAP window frames.
  • the SiWF class of functions can use the OLAP PARTITION BY expression or the OLAP ORDER BY clause if the input is required to be in a sorted order.
  • a Multiple Rows Returning Aggregate user defined function is similar to the aggregate functions except that MRRA functions return more than one row per set of rows processed.
  • MRRA functions can process a set of rows partitioned by GROUP BY or OLAP PARTITION BY clause, or, in the absence of such a clause, the whole table data from the query can be considered as the input set.
  • NTM N To M user defined function
  • NTM functions are a special type of user defined function that processes a set of N input rows and generates M output rows. NTM functions can use OLAP PARTITION BY and ORDER BY processing.
  • the number of output rows can be more than, less than, or equal to the number of input rows.
  • a One To M user defined function is a class of user defined function that processes one input row at a time and produces multiple rows of output. OTM functions do not require OLAP windows support. 8) Value Generating user defined functions generate one row of output per one row of input, generally performing simple mathematical computations, such as !ogio- 9)
  • a row-sending UDF is a UDF that does not return any output, but instead supplies data to an external process.
  • a row-sending UDF can be used in a SELECT query statement to send a set of rows to an external process. In order to work with the SELECT query statement, a row-sending UDF can return the number of rows sent. In one instance, the database can be configured to automatically send any modified rows via a row- sending function whenever the database table is modified.
  • TVUDF Table Valued user defined functions
  • Table Valued user defined functions occur in the FROM clause of SQL queries, where the table they return can participate in relational set processing similar to regular database tables.
  • TVUDFs mostly connect to external structured or unstructured data sources and return structured tabular data for query processing.
  • TVUDFs are without input arguments and are used for generating tables from external data sources. In such a case, inputs for the table extraction are given as parameters mostly in the form of query statement and the function receives an empty input table object to populate an output table from the external source.
  • TVUDFs When TVUDFs have parameterized external queries for processing in the external databases, they will have input fields from the local query corresponding to the external query parameters, input argument fields supply parameter values for external query processing in the external database. In such cases, TVUDFs behave exactly as OTM functions and are marked as members of the OTM class in the metadata. There can be value generating user defined functions without input arguments similar to CURRENT_DATE() or Pl().
  • the query parser 74 can review the input query to ensure that the call to each user defined function is valid. To this end, the query parser 74 can retrieve an associated class of each user defined function from the metadata 69 and apply a set of logical rules to the query to evaluate the validity of each function call.
  • User defined functions belonging to any class beside the value generating class and table- valued class occur in the projection list of a SQL query similar to GROUP BY or OLAP window functions.
  • User defined functions belonging to the value generating user defined function class can occur anywhere an expression can occur in a SQL query.
  • Table-valued user defined functions occur in the FROM clause of a SQL query or in places where a sub-query is allowed.
  • user defined functions belonging to the aggregate, rank, series, and SIWF classes can occur in a query with multiple user defined functions in any combination, in a manner similar to inbuilt GROUP BY or OLAP window functions.
  • User defined functions belonging to the MRRA, NTM and OTM classes referred to herein as singleton user defined functions, must be the only user defined function in its associated query or sub-query.
  • Singleton user defined functions have an indeterminate output number of rows. There is no restriction the number of singleton user defined function sub-queries in a query, however. Row-sending user defined functions also belong to the singleton user defined function type, though its output is determinate in a query. In the case of rank-type analytic OLAP functions, there are no input arguments for the functions.
  • a query optimizer 76 establishes a plan for executing the received query, including any user defined functions in the query.
  • the query optimizer 76 retrieves metadata describing the associated classes of the user defined functions, and uses the associated class of each user defined function to determine if the number of rows in the output of the function is known or if the output to the function is indeterminate. For classes having known outputs, the query optimizer 76 can proceed normally. Functions having indeterminate outputs are handled as part of an alternative optimization process, in which the potential impact of a function having a large number of output rows is mitigated.
  • Each of the data objects extracted by the query parser 72 and the query plan developed at the query optimizer 76 are provided to the database engine 64 to execute the query, including the user defined function.
  • the database engine 64 uses the extracted data objects to execute the called function and return the results of the function as the mapped output fields.
  • the database engine 64 is provided with the input fields in the order the input fields are defined in the user defined function input field metadata. There is no need for the input field names from the query to match the input field names in the UDF metadata as the input field metadata names are just placeholders.
  • Variable input or output fields are expanded with the right repetition numbers in consecutive positions. The field name of the expanded variable field is appended with the ordinal number of the field starting from one.
  • Output fields from user defined function are mapped in queries using field names from the UDF metadata or field position identifier in a $# syntax.
  • Output fields can be renamed using an appropriate syntax, for example: ⁇ $# / field name> [AS] ⁇ user defined name>.
  • the query compiler 62 replaces them with output field names obtained from the user defined function metadata.
  • User queries can map a subset of output fields from user defined function in any order irrespective of their order in the metadata.
  • the results of the output fields marked as default output fields in the metadata are returned.
  • the database engine 64 executes the query plan to provide a query result, and returns the query result to the query source via the communications interface 56.
  • data are loaded into the input table object according to the user defined function class type.
  • the data processing API of the user defined function is called to process data.
  • the user defined function emits output into the output table object which is retrieved each time the processing function is called. In case there is no input, as in the case of some table-valued user defined functions, the processing function is called with an empty input table object.
  • an analytic function that can be implemented in the database system is a multiple regression user defined function. It will be appreciated that in a multiple regression analysis, the number of independent variables varies according to the regression model fitted which is illustrated in the example below. Accordingly, a user defined function to perform this function will have variable output fields for beta coefficients and averages for independent variables, with the size of these variable fields being defined by a parameter value representing the number of independent variables in the desired model.
  • the user defined function for example, can be implemented as aggregate class user defined function because it returns one output row for N input rows processed.
  • beta3 AS betaCouponPromotionCost, avglndependentl AS avgSalespersonTrainingCost,
  • the first input field is the dependent variable which is a fixed field.
  • the second input field is the starting point for the independent variable which is marked as a variable field, giving the user defined function two input fields.
  • a first output field is the alpha intercept, 'alphalntercept' and a second output is the average of the dependent variable, 'avgDependent,' each of which are fixed fields.
  • a third output field, 'countObservations,' is a fixed field representing the number of observations.
  • a fourth field, marked as a variable field is named 'beta' and represents the beta coefficients produced by the regression analysis.
  • a fifth field, also marked as a variable field is named 'avglndependent' and represents the average values of the independent variables.
  • the fourth field can be repeated until all the beta co-efficient fields are mapped.
  • the names of the fields are generated by appending an integer starting from one.
  • the fifth field, avglndependent starts only after all the beta fields are mapped and repeats similarly with number appended to it starting from one.
  • the user defined function requires one parameter and looks for COUNTJNDEPENDENT_VAR keyword or parameter field name in the parameter object to decide which multiple regression model to fit.
  • the validation and resolution API function computes the repetition number.
  • the model specifies three independent variables and, therefore, three output fields each for 'beta' and 'avglndependent' are generated. Note that, in the metadata, the variable output fields, beta and avglndependent, are marked for dependency resolution from the second input variable field. If the query does not map 'avglndependent' from independent-variables or any other field, the user defined function understands it from the fields requested by the user query in the output table object and does not project results for them.
  • the illustrated database system makes it possible to model complex analytic applications, including statistical applications, in SQL queries. Since the user defined function developer knows its class type, input, output and parameter, the UDF metadata is built as part of the user defined function for dynamic detection and validation of syntax and semantics of its usage at query compile time. There could be additional metadata captured in the database for managing user defined functions, such as user defined function usage and installation privileges.
  • the use of the parameter field allows for user defined functions general enough to allow processing of many analytical application models by varying only parameters in queries.
  • Statistical applications such as multiple regression, least-squares, multivariate analytics and frequency distribution can be implemented with different statistical models depending upon the requirements of a given analysts simply by changing the parameter field of the query. Accordingly, parameterization makes it easier to represent complex models in simple key/value pair of parameters.
  • the system 50 further provides the ability to map only the required output fields in any order from user defined functions makes it easier to have nested user defined function queries where inner user defined functions return outputs required for the input of outer user defined functions. This helps in building complex applications in parts and composed by nesting in queries according to application requirement.
  • the tight integration of user defined functions into SQL brings analytic application processing closer to the data, resulting in data security and less mobility of data across systems. It also makes it more efficient to build user defined functions for serving vertical applications through SQL in a programming language. With table-valued user defined functions available for use in FROM clause of queries for collecting data from external sources in conjunction with the analytic user defined functions, users will find many analytic applications easier to model and process using data from external sources in SQL.
  • FIG. 3 illustrates one method 100 for executing a query.
  • a call to a user defined function is identified in a structured query language (SQL) query.
  • SQL structured query language
  • a data partition associated with the user defined function is identified within the SQL query.
  • the data partitioning specification can be extracted from one of a GROUP BY and or an OLAP window expression (e.g., PARTITION BY, ORDER BY and data frame in ROWS or RANGE specification) associated with the call to the user defined function.
  • the user defined function is executed on the identified data partition within the OLAP framework to produce an output.
  • FIG. 4 is a schematic block diagram illustrating an exemplary system 200 of hardware components capable of implementing the example systems and methods for user defined function integration disclosed in FIGS. 1 -3.
  • the system 200 can include various systems and subsystems.
  • the system 200 can be a personal computer, a laptop computer, a workstation, a computer system, an appliance, an application-specific integrated circuit (ASIC), a server, a server blade center, a server farm, or any other appropriate processing component.
  • ASIC application-specific integrated circuit
  • the system 200 can include a system bus 202, a processing unit 204, a system memory 206, memory devices 208 and 210, a communication interface 212 (e.g., a network interface), a communication link 214, a display 216 ⁇ e.g., a video screen), and an input device 218 (e.g., a keyboard and/or a mouse).
  • the system bus 202 can be in communication with the processing unit 204 and the system memory 206.
  • the additional memory devices 208 and 210 such as a hard disk drive, server, stand alone database, or other non-voiatiie memory, can also be in communication with the system bus 202.
  • the system bus 202 operably
  • system bus 202 also operably interconnects an additional port (not shown), such as a universal serial bus (USB) port.
  • USB universal serial bus
  • the processing unit 204 can be a computing device and can include an application-specific integrated circuit (ASIC).
  • the processing unit 204 executes a set of instructions to implement the operations of examples disclosed herein.
  • the processing unit can include a processing core.
  • the additional memory devices 206, 208 and 210 can store data, programs, instructions, database queries in text or compiled form, and any other information that can be needed to operate a computer.
  • the memories 206, 208 and 210 can be implemented as computer-readable media (integrated or removable) such as a memory card, disk drive, compact disk (CD), or server accessible over a network.
  • the memories 206, 208 and 210 can comprise text, images, video, and/or audio.
  • the memory devices 208 and 210 can serve as databases or data storage. Additionally or alternatively, the system 200 can access an external data source or query source through the communication interface 212, which can communicate with the system bus 202 and the communication link 214.
  • the system 200 can be used to implement a database system that executes user defined functions within an online analytics processing (OLAP) framework in response to an appropriate query.
  • the queries can be formatted in accordance with various query database protocols, including SQL.
  • Computer executable logic for implementing the real-time analytics system resides on one or more of the system memory 206, and the memory devices 208, 210 in accordance with certain examples.
  • the processing unit 204 executes one or more computer executable instructions originating from the system memory 206 and the memory devices 208 and 210.
  • the term "computer readable medium” as used herein refers to a medium that participates in providing instructions to the processing unit 204 for execution.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Systems and apparatuses are provided for integrating user defined functions into an analytical data processing framework. A query compiler (22) identifies a call to a user defined function (28) and one of a GROUP BY expression and a defined online analytics processing (OLAP) window associated with the user defined function. A database engine (24) executes the user defined function (28) with a partition set defined by the one of the GROUP BY expression and the defined OLAP window, such that the user defined function operates on the partition set to produce a function output.

Description

ANALYTICAL DATA PROCESSING
TECHNICAL FIELD
[0001] This invention relates to information processing, and more particularly, to integration of user defined functions into an analytical data processing framework.
BACKGROUND
[0002] Analytical processing systems consume tables of data which are typically linked together by relationships that simplify the storage of data and make queries of the data more efficient. A standardized query language, such as
Structured Query Language {SQL}, can be used for creating and operating on relational databases. Analytics is the application of computer technology and statistics to solve problems in business and industry. The science of analytics is concerned with extracting useful properties of data using computable functions and, generally speaking, involves the extraction of desired properties of data sets from large databases. Analytics therefore bridges the disciplines of computer science, statistics, and mathematics.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] FIG. 1 illustrates an example of an analytic data processing system in which user defined functions (UDF) have been integrated into an online analytics processing (OLAP) framework.
[0004] FIG. 2 illustrates one example of the integration of an analytical processing system into a database system, such that user defined functions (UDF) have been integrated into an online analytics processing (OLAP) and GROUP BY processing framework.
[0005] FIG. 3 illustrates one method for executing a query.
[0006] FIG. 4 is a schematic block diagram illustrating an exemplary system of hardware components capable of implementing examples of the systems and methods for user defined function integration disclosed in FIGS. 1 -3.
DETAILED DESCRIPTION
[0007] Many databases have the framework for processing a limited number of in-built analytic functions in queries using GROUP BY and OLAP window clauses. Unfortunately, real-world analytical problems in databases go beyond the analytic functions natively supported, and datasets in databases that require intensive analytical processing often end up using third party analytical packages for analytical processing. In situations where the datasets are large, they are often retained outside the databases due to the limited analytical capabilities available in
databases. These packages require data to be retrieved from databases, persisted outside the database, and reformatted before processing. Since these approaches move data away from the database tables to a location associated with the analytical processes, the result is a replication of same large raw data sets in different formats in multiple locations, with the accompanying security and data governance risk for confidential data. Further, this approach results in excessive manpower and hardware cost for data retrieval, management, and storage as well as significant latency in processing, making it impractical for real-time analytics.
[0008] FIG. 1 illustrates an example of an analytical processing system 10 in which user defined functions (UDF) have been integrated into an online analytics processing (OLAP) framework. For example, the system 10 of FIG. 1 can represent a database system, a data warehouse, a data mart, an in-memory database, a standalone OLAP engine, a business intelligence report generation system, a data mining system, or a federated query processing system. In general, the analytics system can be represented as a plurality of functional components 22, 24, 26, and 28, each of which can be implemented as any appropriate combination of hardware and programming configured to perform their associated function. In the illustrated example, each of the query compiler 22, the processing engine 24, the data tables 26, and the user defined function 28 is stored on a non-transitory medium and executed by an associated processor, but it will be appreciated that other
implementations of the functional components, for example, as dedicated hardware or as a combination of hardware and machine readable instructions, could be used.
[0009] The example system 10 includes a processor 12 and a memory 14 connected to a communications interface 16. It will further be appreciated that the memory 14 can be a removable memory, connected to the processor 12 and the communications interface 16 through an appropriate port or drive, such as an optical drive, a USB port, or other appropriate interface. Further, the memory 14 can be remote from the processor 12, with machine readable instructions stored on the memory provided to the processor via a communications link. It will be appreciated that the communication interface 16 can comprise any appropriate hardware and machine readable instructions for receiving database queries from an associated query source (not shown) and returning the results of the queries to the query source. Accordingly, the communications interface 1 6 can include any or all of a bus or similar data connection within a computer system or a wired or wireless network adapter. The memory 14 can include any appropriate standard storage devices associated with computer systems, such as magnetic and optical storage media.
[0010] The device memory 14 can include a query compiler 22, a processing engine 24 to compile and execute queries on data tables 26, and a user defined function (UDF) 28. The query compiler 22 can utilize any appropriate query language, for example, structured query language (SQL) or multidimensional expression (MDX) language. In one implementation, the user defined function 28 is stored as one of a shared object or a dynamic link library. The query compiler 22 identifies a call to the user defined function 28 and a defined data partition. For example, the query compiler 22 can identify the defined data partition from one of a GROUP BY and an OLAP window expression associated with the call to the user defined function 28. The processing engine 24 executes the user defined function as a windowed function, such that the function operates on the defined partition set to produce a function output.
[0011] It will be appreciated that OLAP processing is not appropriate for all user defined functions. Accordingly, the user defined function 28 has metadata exposed to the query compiler 22 and the processing engine 24 indicating that the user defined function is executable within an OLAP window or GROUP BY expression. The metadata can further include defined inputs, outputs, and parameters for the user defined function. In one instance, the exposed metadata can further include an application programming interface (API) comprising rules for validating and resolving arguments within the call to the user defined function. It will be appreciated, however, that for the majority of user defined functions, a
generalized API function pointer in the metadata will validate and resolve arguments unless the user defined function requires special handling for resolving input, output, and/or parameter arguments. The user defined function metadata further comprises a second function pointer to construct an object for processing during execution of the query. [0012] One advantage of integrating the user defined functions into the OLAP framework is that the output of each user defined function is projected back into the SELECT list of the query, and is thus available to other user defined functions and inbuilt functions within a given query. Accordingly, the user defined function 28 can be one of a plurality of user defined functions identified by the query compiler 22 within a query, and the processing engine 24 can execute the plurality of user defined functions as part of the query such that one user defined function consumes, as an input, field(s) output by another of the plurality of user defined functions. Each of the user defined functions can have an associated defined OLAP window, and the defined OLAP window for each function can differ among the defined windows.
[0013] FIG. 2 ilSustrates one example of the integration of an analytical processing system into a database system 50, such that user defined functions (UDF) have been integrated into an online analytics processing (OLAP) and GROUP BY processing framework. The system 50 includes a processor 52 and a memory 54 connected to a communications interface 56. It will be appreciated that the communication interface 56 can comprise any appropriate hardware and machine readable instructions for receiving database queries from an associated query source (not shown) and returning the results of the queries to the query source. Accordingly, the communications interface 56 can include any or all of a bus or similar data connection within a computer system or a wired or wireless network adapter. The memory 54 can include any appropriate standard storage devices associated with computer systems, such as magnetic and optica! storage media.
[0014] The device memory 54 can include a query compiler 62 and a database engine 64 to compile and execute queries on a database table 66. The query compiler 62 identifies a call to a user defined function and a defined data partition, expressed, for example, as an online analytics processing (OLAP) window or a GROUP BY expression, for the user defined function, if the function is of an appropriate class type for integrated OLAP processing. If the defined OLAP data partition is provided, the database engine 64 executes the user defined function to provide an output having a set of one or more output fields and projects the returned output fields back into the select list. One benefit of performing the processing within the OLAP framework is that multiple user defined functions can be executed as part of a single query. The functions can also be nested, with one user defined function consuming the output of another user defined function. [0015] During operation, the query compiler 62 processes a call to a user defined function in a database query using a standardized user defined function syntax to distinct!y map input, output, and parameter expressions to appropriate objects. List of fields in the input and output expressions define rows of data that can be composed into tables. In the modified syntax, inputs are presented as a list of fields within parentheses following the user defined function name. Inputs are extracted from query fields, and composed as input table. Outputs are denoted with a keyword OUTPUT and a list of fields in parentheses. The query compiler 62 structures these input and output fields into self-describing table objects with field names, data types and data size to standardize processing of all user defined function class types capable of handling multiple records simultaneously. Each table object has metadata to describe each field of the table object in terms of its name, data type, and data length. Tables also have the ability to store and retrieve rows of data based on row number. They are capable of handling large row sets with the ability to manage memory overflow into disk files.
[0016] The query compiler 62 includes a query parser 74, which extracts each of the input table, the output table, and any parameter objects from each user defined function. It will be appreciated that the basic syntax for all user defined functions, regardless of their associated class, can be standardized, such that they are parsed in a similar manner at the query parser 74. One example of such a standardized syntax can include a statement such as:
[0017] <UDF name> ([<lnput Expression List>]) [OUTPUT(<Output
Expression List>)] [[WITH] PARAMETER (<key=valueExpression>[:...])]
[0018] In the above statement, items within brackets are optional, items within parentheses are mandatory, and items given within chevrons (<>) are replaced with appropriate expressions. The names of the user defined functions are unique and case-insensitive. The user defined functions support variable input and output fields composed as table objects. The various expression lists can comprise a series of comma separated items. The input expression list, if present, can include columns or expressions composed using columns from query table. A mapping for the output of the user defined function is provided using the keyword OUTPUT, with the output expression list comprising of one or more output fields or expressions composed from output fields. Output fields are field names from the user defined function output metadata or field position identifiers using "$#" syntax, where $ represents a special character and # represents an ordinal number of the output field left to right starting from one. When the output is not explicitly mapped in a query, default output fields defined within the user defined function can be returned. Parameters are given as "key=valueExpression" separated by colon using WITH PARAMETER syntax when required. The "key" is the field name in the user defined function parameter metadata. The "valueExpression" is a constant or an expression that evaluates to a constant. The parameters defined in the expression can be dates, time, time- stamps, integers, decimal values {e.g., float, double, or long double values), character strings, or comma separated array constants formed from one of these data types.
[0019] In one implementation, the user defined functions are built in a UDF library 68, for example as shared objects or dynamic link Sibraries, and registered with the database engine. Each shared object exposes the user defined functions in the form of self-describing UDF metadata 69 that can be retrieved by the query complier 62. The UDF metadata 69 can include a name of the user defined function, a description, an associated class type, a factory constructor function pointer to create a runtime processing object instance, a function pointer for the va!idation and resolution of input, output and parameters, and other runtime optimization
parameters, as well as defined input, output and parameter fields for the function. The query complier 64 accesses these fields for query resolution and validation. The input and output fields can be either fixed or variable fields. Each fixed field represents only one field during execution of the query, while each variable field can represent zero, one, or multiple fields during execution. When input and-output fields are marked as variable, they can be repeated zero, one, or multiple times at the time of query processing. It will be appreciated that variable fields can occur at any input or output position, and there can be multiple variable fields in input and output. The size of a variable field for a given instance of the function can be provided, for example, as a parameter expression. The parameter expression can also be used to define scalar or array values and character parameters used for processing in the user defined function itself.
[0020] Input and output fields are marked as fixed or variable types in the input and output metadata of each user defined function. It will be appreciated that a given user defined function requires a standard order and number of inputs, as defined in the input metadata for the function. Each field is referenced by a corresponding input argument field or fields in the query. Specifically, fixed fields will have only one corresponding input argument field and variable fields can have zero, one or multiple consecutive input argument fields in the query, input and output fields can have fixed or undefined data types within the metadata. When the data type of an input field is undefined, the field obtains its data types from the data types of input field arguments in the query. Query input field defines if the field has null value or not. For fields having a fixed data type, data from the input query field is converted to the required input field type of the user defined function at runtime when the data types do not match and the data type promotion is valid. When the data type is undefined in an output field, it is set to depend upon an input field for resolution. At the query compiler 62, undefined output fields are resolved from the input fields. The metadata can also include default output fields, which are returned when a query does not explicitly assign output fields.
[0021] When there is only a single input variable field, it is resolved at the query compiler 62 from the input arguments in the query. Specifically, a repetition number for the variable field can be determined as the difference between a total number of input arguments and the number of fixed input fields defined in the user defined function metadata. When there are multiple input or output variable fields, parameter fields are used to provide the repetition number, with each variable field having a corresponding parameter field. When the query is processed, the parameter input is used for validating and resolving the repetition number for the variable fields. The names for the variable input and output fields can be generated serially by extending the base name of the variable field with a number incrementally starting from one. The fixed data type of all the variable input and output fields is same as the field marked as variable. The undefined data type of the variable input field is resolved by the data type of the input fields in the query. A general purpose function supplied with the system can be used to validate input and output fields and resolve their data types and lengths at the query compiler 62 when explicit function is not supplied by the user defined function as a function pointer in the metadata.
[0022] The listing of fields in the output expression allows the output of a user defined function to be mapped to a list of fields provided within the output metadata. When the number of output fields in the query is less than the number of output fields from the user defined function, the user defined function returns only the fields that are mapped in the query. Output fields can be mapped in any order in queries, using either the field name or field position identifier. If output variable fields are mapped using position identifiers, the validation and resolution function substitutes position identifiers with output field names and sets data type and size in the output table object at query compile time.
[0023] Changing the order and mapping is particularly useful for nested user defined function processing, to allow an inner user defined function to return the fields in the order the outer user defined function requires as input. In the projection list, output mappings from user defined functions are merged with the rest of the fields in the projection list in the order in which they are mapped. When multiple user defined functions occur in a query, projection fields are mapped similarly to queries with OLAP multidimensional and multiple windows functions.
[0024] The metadata 69 for each user defined function can include an associated class for each function out of a plurality of function classes 72 to assist in the optimization of the query. The user defined function class types implicitly set the rules for data processing in the database engine along with the cardinality of their output results. For example, user defined functions belonging to some classes will be processed in OLAP windows, whereas such processing is impractical or unnecessary for other classes of functions. Unlike inbuilt functions that return only one output field, all the user defined function class types may return one or multiple output fields.
[0025] In one instance, user defined functions can be associated with any of ten classes. 1 ) Aggregate user defined functions process a set of rows partitioned by a GROUP BY expression or an OLAP window expression and return only one output row per input row set. Without a GROUP BY expression or an OLAP window, the whole table data from the query is considered as the input set. 2) Rank-type user defined functions process a set of rows partitioned by an OLAP window and sorted in ORDER BY clause and return one row per input row. 3) Series user defined functions are processed by OLAP PARTITION BY in a subset window frame of ROWS or RANGE clause. It will be appreciated that aggregate and series classes are not mutually exclusive, and that a given user defined function can belong to either or both the classes.
[0026] 4) Series user defined functions with Inbuilt Window Frame (SIWF) are similar to series user defined functions except that they use an inbuilt window frame for moving window aggregate processing. A SIWF user defined function receives the window frame size through a parameter specification. The function receives one input row at a time and returns one output aggregate return row for the result generated using the window frame size given. The SIWF class supports fast computations for moving window aggregates away from the OLAP window frames. The SiWF class of functions can use the OLAP PARTITION BY expression or the OLAP ORDER BY clause if the input is required to be in a sorted order.
[0027] 5) A Multiple Rows Returning Aggregate user defined function (MRRA) is similar to the aggregate functions except that MRRA functions return more than one row per set of rows processed. MRRA functions can process a set of rows partitioned by GROUP BY or OLAP PARTITION BY clause, or, in the absence of such a clause, the whole table data from the query can be considered as the input set. 6) N To M user defined function (NTM) functions are a special type of user defined function that processes a set of N input rows and generates M output rows. NTM functions can use OLAP PARTITION BY and ORDER BY processing. The number of output rows can be more than, less than, or equal to the number of input rows.
[0028] 7) A One To M user defined function (OTM) is a class of user defined function that processes one input row at a time and produces multiple rows of output. OTM functions do not require OLAP windows support. 8) Value Generating user defined functions generate one row of output per one row of input, generally performing simple mathematical computations, such as !ogio- 9) A row-sending UDF is a UDF that does not return any output, but instead supplies data to an external process. A row-sending UDF can be used in a SELECT query statement to send a set of rows to an external process. In order to work with the SELECT query statement, a row-sending UDF can return the number of rows sent. In one instance, the database can be configured to automatically send any modified rows via a row- sending function whenever the database table is modified.
[0029] 10) Table Valued user defined functions (TVUDF) occur in the FROM clause of SQL queries, where the table they return can participate in relational set processing similar to regular database tables. TVUDFs mostly connect to external structured or unstructured data sources and return structured tabular data for query processing. TVUDFs are without input arguments and are used for generating tables from external data sources. In such a case, inputs for the table extraction are given as parameters mostly in the form of query statement and the function receives an empty input table object to populate an output table from the external source. When TVUDFs have parameterized external queries for processing in the external databases, they will have input fields from the local query corresponding to the external query parameters, input argument fields supply parameter values for external query processing in the external database. In such cases, TVUDFs behave exactly as OTM functions and are marked as members of the OTM class in the metadata. There can be value generating user defined functions without input arguments similar to CURRENT_DATE() or Pl().
[0030] The query parser 74 can review the input query to ensure that the call to each user defined function is valid. To this end, the query parser 74 can retrieve an associated class of each user defined function from the metadata 69 and apply a set of logical rules to the query to evaluate the validity of each function call. User defined functions belonging to any class beside the value generating class and table- valued class occur in the projection list of a SQL query similar to GROUP BY or OLAP window functions. User defined functions belonging to the value generating user defined function class can occur anywhere an expression can occur in a SQL query. Table-valued user defined functions occur in the FROM clause of a SQL query or in places where a sub-query is allowed.
[0031] Similarly, user defined functions belonging to the aggregate, rank, series, and SIWF classes can occur in a query with multiple user defined functions in any combination, in a manner similar to inbuilt GROUP BY or OLAP window functions. User defined functions belonging to the MRRA, NTM and OTM classes, referred to herein as singleton user defined functions, must be the only user defined function in its associated query or sub-query. Singleton user defined functions have an indeterminate output number of rows. There is no restriction the number of singleton user defined function sub-queries in a query, however. Row-sending user defined functions also belong to the singleton user defined function type, though its output is determinate in a query. In the case of rank-type analytic OLAP functions, there are no input arguments for the functions.
[0032] However, the lists of fields in the ORDER BY clause are implicitly considered as the input arguments for the inbuilt rank-type analytic functions. It will be appreciated, however, that there can be input arguments in the UDF when there is an ORDER BY expression for the function. In such cases, the input arguments in the function are considered as the only input arguments and the list of fields in the ORDER BY clause are considered as fields required for data sorting purposes only. However, there is no restriction in using the fields from the ORDER BY clause as input arguments in the UDF.
[0033J A query optimizer 76 establishes a plan for executing the received query, including any user defined functions in the query. The query optimizer 76 retrieves metadata describing the associated classes of the user defined functions, and uses the associated class of each user defined function to determine if the number of rows in the output of the function is known or if the output to the function is indeterminate. For classes having known outputs, the query optimizer 76 can proceed normally. Functions having indeterminate outputs are handled as part of an alternative optimization process, in which the potential impact of a function having a large number of output rows is mitigated. Each of the data objects extracted by the query parser 72 and the query plan developed at the query optimizer 76 are provided to the database engine 64 to execute the query, including the user defined function.
[0034] The database engine 64 uses the extracted data objects to execute the called function and return the results of the function as the mapped output fields. The database engine 64 is provided with the input fields in the order the input fields are defined in the user defined function input field metadata. There is no need for the input field names from the query to match the input field names in the UDF metadata as the input field metadata names are just placeholders. Variable input or output fields are expanded with the right repetition numbers in consecutive positions. The field name of the expanded variable field is appended with the ordinal number of the field starting from one. Output fields from user defined function are mapped in queries using field names from the UDF metadata or field position identifier in a $# syntax. Output fields can be renamed using an appropriate syntax, for example: <$# / field name> [AS] <user defined name>. When the field position identifiers are not renamed in the output expression, the query compiler 62 replaces them with output field names obtained from the user defined function metadata. User queries can map a subset of output fields from user defined function in any order irrespective of their order in the metadata. When output is not explicitly mapped in a query, the results of the output fields marked as default output fields in the metadata are returned. These approaches simplify the application programming interface (API) for user defined function development with an object class having one member function for registering parameter input, a second member function with input and output table objects to initiate the processing, and a third member function to iterative!;/ process input table object with one or more records and emit output rows into the output table object according to the class type of the user defined function.
[0035] The database engine 64 executes the query plan to provide a query result, and returns the query result to the query source via the communications interface 56. During runtime, data are loaded into the input table object according to the user defined function class type. With the input and output table objects as arguments, the data processing API of the user defined function is called to process data. The user defined function emits output into the output table object which is retrieved each time the processing function is called. In case there is no input, as in the case of some table-valued user defined functions, the processing function is called with an empty input table object.
[0036] One example of an analytic function that can be implemented in the database system is a multiple regression user defined function. It will be appreciated that in a multiple regression analysis, the number of independent variables varies according to the regression model fitted which is illustrated in the example below. Accordingly, a user defined function to perform this function will have variable output fields for beta coefficients and averages for independent variables, with the size of these variable fields being defined by a parameter value representing the number of independent variables in the desired model. The user defined function, for example, can be implemented as aggregate class user defined function because it returns one output row for N input rows processed.
[0037] An example query using a specific instantiation of the general multiple regression function is presented below. For example, the query could be used for studying how three independent expenses are affecting the dependent sales figures using a sales and cost database table that has data, for example, for ten years, with four quarters each, at each region and city in each country:
SELECT country,
ultipleRegression(salesDo!larValue, salespersonTratningCost,
advertizementCampaignCost, couponPromotionCost)
OUTPUT(countObservations, alpha!ntercept,
betal AS betaSalespersonTrainingCost, beta2 AS betaAdvertizementCampaignCosi,
beta3 AS betaCouponPromotionCost, avglndependentl AS avgSalespersonTrainingCost,
avglndependent2 AS avgAdvertizementCampaignCost, avg Independents AS avgCouponPromotionCost,
avgDependent AS avgSaiesDollarValue)
WITH PARAMETER(COUNT_lNDEPENDENT_VAR=3) FROM salesAndCost
GROUP BY country;
[0038] tn the above example, the first input field is the dependent variable which is a fixed field. The second input field is the starting point for the independent variable which is marked as a variable field, giving the user defined function two input fields. A first output field is the alpha intercept, 'alphalntercept' and a second output is the average of the dependent variable, 'avgDependent,' each of which are fixed fields. A third output field, 'countObservations,' is a fixed field representing the number of observations. A fourth field, marked as a variable field, is named 'beta' and represents the beta coefficients produced by the regression analysis. A fifth field, also marked as a variable field, is named 'avglndependent' and represents the average values of the independent variables.
[0039] Note that the fourth field can be repeated until all the beta co-efficient fields are mapped. As described previously, the names of the fields are generated by appending an integer starting from one. Thus, in the following example, there are three beta co-efficient fields with field names betal , beta2 and betaS. The fifth field, avglndependent starts only after all the beta fields are mapped and repeats similarly with number appended to it starting from one.
[0040] The user defined function requires one parameter and looks for COUNTJNDEPENDENT_VAR keyword or parameter field name in the parameter object to decide which multiple regression model to fit. In case the parameter is not given, since there is only one input variable field, the validation and resolution API function computes the repetition number. In this example given, the model specifies three independent variables and, therefore, three output fields each for 'beta' and 'avglndependent' are generated. Note that, in the metadata, the variable output fields, beta and avglndependent, are marked for dependency resolution from the second input variable field. If the query does not map 'avglndependent' from independent-variables or any other field, the user defined function understands it from the fields requested by the user query in the output table object and does not project results for them.
[0041] The illustrated database system makes it possible to model complex analytic applications, including statistical applications, in SQL queries. Since the user defined function developer knows its class type, input, output and parameter, the UDF metadata is built as part of the user defined function for dynamic detection and validation of syntax and semantics of its usage at query compile time. There could be additional metadata captured in the database for managing user defined functions, such as user defined function usage and installation privileges. The use of the parameter field allows for user defined functions general enough to allow processing of many analytical application models by varying only parameters in queries. Statistical applications such as multiple regression, least-squares, multivariate analytics and frequency distribution can be implemented with different statistical models depending upon the requirements of a given analysts simply by changing the parameter field of the query. Accordingly, parameterization makes it easier to represent complex models in simple key/value pair of parameters.
[0042] The system 50 further provides the ability to map only the required output fields in any order from user defined functions makes it easier to have nested user defined function queries where inner user defined functions return outputs required for the input of outer user defined functions. This helps in building complex applications in parts and composed by nesting in queries according to application requirement. The tight integration of user defined functions into SQL brings analytic application processing closer to the data, resulting in data security and less mobility of data across systems. It also makes it more efficient to build user defined functions for serving vertical applications through SQL in a programming language. With table-valued user defined functions available for use in FROM clause of queries for collecting data from external sources in conjunction with the analytic user defined functions, users will find many analytic applications easier to model and process using data from external sources in SQL.
[0043] FIG. 3 illustrates one method 100 for executing a query. At 102, a call to a user defined function is identified in a structured query language (SQL) query. At 104, a data partition associated with the user defined function is identified within the SQL query. For example, the data partitioning specification can be extracted from one of a GROUP BY and or an OLAP window expression (e.g., PARTITION BY, ORDER BY and data frame in ROWS or RANGE specification) associated with the call to the user defined function, At 106, the user defined function is executed on the identified data partition within the OLAP framework to produce an output.
[0044] FIG. 4 is a schematic block diagram illustrating an exemplary system 200 of hardware components capable of implementing the example systems and methods for user defined function integration disclosed in FIGS. 1 -3. The system 200 can include various systems and subsystems. The system 200 can be a personal computer, a laptop computer, a workstation, a computer system, an appliance, an application-specific integrated circuit (ASIC), a server, a server blade center, a server farm, or any other appropriate processing component.
[0045] The system 200 can include a system bus 202, a processing unit 204, a system memory 206, memory devices 208 and 210, a communication interface 212 (e.g., a network interface), a communication link 214, a display 216 {e.g., a video screen), and an input device 218 (e.g., a keyboard and/or a mouse). The system bus 202 can be in communication with the processing unit 204 and the system memory 206. The additional memory devices 208 and 210, such as a hard disk drive, server, stand alone database, or other non-voiatiie memory, can also be in communication with the system bus 202. The system bus 202 operably
interconnects the processing unit 204, the memory devices 206-210, the
communication interface 212, the display 216, and the input device 218. In some examples, the system bus 202 also operably interconnects an additional port (not shown), such as a universal serial bus (USB) port.
[0046] The processing unit 204 can be a computing device and can include an application-specific integrated circuit (ASIC). The processing unit 204 executes a set of instructions to implement the operations of examples disclosed herein. The processing unit can include a processing core.
[0047] The additional memory devices 206, 208 and 210 can store data, programs, instructions, database queries in text or compiled form, and any other information that can be needed to operate a computer. The memories 206, 208 and 210 can be implemented as computer-readable media (integrated or removable) such as a memory card, disk drive, compact disk (CD), or server accessible over a network. In certain examples, the memories 206, 208 and 210 can comprise text, images, video, and/or audio. [0048] Additionally, the memory devices 208 and 210 can serve as databases or data storage. Additionally or alternatively, the system 200 can access an external data source or query source through the communication interface 212, which can communicate with the system bus 202 and the communication link 214.
[0049] !n operation, the system 200 can be used to implement a database system that executes user defined functions within an online analytics processing (OLAP) framework in response to an appropriate query. The queries can be formatted in accordance with various query database protocols, including SQL. Computer executable logic for implementing the real-time analytics system resides on one or more of the system memory 206, and the memory devices 208, 210 in accordance with certain examples. The processing unit 204 executes one or more computer executable instructions originating from the system memory 206 and the memory devices 208 and 210. The term "computer readable medium" as used herein refers to a medium that participates in providing instructions to the processing unit 204 for execution.
[0050] What have been described above are examples of the present invention, it is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art will recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications, and variations that fall within the scope of the appended claims.

Claims

CLAIMS What is claimed is:
1 . An analytical data processing system (10) comprising:
a processor (12); and
a non-transitory computer readable medium (14) storing machine readable instructions, the machine readable instructions comprising:
a query compiler (22) to identify a call to a user defined function (28) and one of a GROUP BY expression and a defined online analytics processing (OLAP) window associated with the user defined function; and
a processing engine (24) to execute the user defined function (28) with a partition set defined by the one of the GROUP BY expression and the defined OLAP window, such that the user defined function operates on the partition set to produce a function output.
2. The analytical data processing system of claim 1 , the user defined function (28) having metadata (69) exposed to the query compiler and the processing engine (24), the metadata (69) indicating that the user defined function (28) is executable within the one of the GROUP BY expression and the defined OLAP window.
3. The analytical data processing system of claim 2, the exposed metadata (69) comprising an application programming interface, the application programming interface comprising rules for validating and resolving arguments within the call to the user defined function (28) .
4. The analytical data processing system of claim 1 , the call to the user defined function (28) being one of a plurality of calls to respective user defined functions identified by the query compiler (22) within a query, and the processing engine (24) executing the plurality of user defined functions as part of the query.
5. The analytical data processing system of claim 4, the user defined function (28) consuming a field output by another of the plurality of user defined functions as an input.
6. The analytical data processing system of claim 4, the partition set being a first partition set in a query, and the query compiler (22) further identifying a second partition set associated with another of the plurality of user defined functions, the first partition set differing from the second partition set.
7. The analytical data processing system of claim 4, another of the plurality of user defined functions having metadata (69) exposed to the query compiler (22) indicating that the user defined function should be executed without reference to either of a GROUP BY expression and an OLAP window.
8. The analytical data processing system of claim 1 , the query compiler (22) identifying the one of the GROUP BY expression and the defined online analytics processing (OLAP) window from an OLAP window expression associated with the call to the user defined function.
9. An analytical processing system comprising:
a processor (12); and
a non-transitory computer readable medium (14) storing machine readable instructions, the machine readable instructions comprising:
a query compiler (22) to identify respective calls to a plurality of user defined functions (68) within a structured query language (SQL) query, the query compiler (22) identifying a plurality of data partitions, each associated with one of the plurality of user defined functions and being identified from one of a GROUP BY expression and an OLAP window expression, such that one of the plurality of data partitions differs from another of the plurality of data partitions; and
a processing engine (24) to execute the plurality of user defined functions as part of the SQL query such that one of the plurality of user defined functions consumes, as an input, a field output by another of the plurality of user defined functions;
wherein each user defined function (68) has metadata (69) exposed to the query compiler (22) and the processing engine (24) indicating that the user defined function is executable within the one of the GROUP BY expression and the OLAP expression.
10. A non-transitory computer readable medium storing machine executable instructions, the machine executable instructions comprising:
a query compiler (62) to identify a call to a plurality of user defined functions (68) and one of a GROUP BY expression and a defined online analytics processing (OLAP) window associated with the user defined function within an SQL query; and a database engine (64) to execute the plurality of user defined functions (68) with a data partition defined by the one of the GROUP BY expression and the defined OLAP window, such that a user defined function of the plurality of user defined function operates on the data partition to produce a function output.
1 1 . The non-transitory computer readable medium of claim 10, one of the plurality of user defined functions (68) consuming, as an input, a field output by another of the plurality of user defined functions.
12. The non-transitory computer readable medium of claim 10, the query compiler (62) identifying the one of a GROUP BY expression and a defined online analytics processing (OLAP) window from an OLAP window expression associated with the call to one of the plurality of user defined functions.
13. The non-transitory computer readable medium of claim 10, each of the plurality of user defined functions (68) having metadata (69) exposed to the query compiler (62) and the database engine (64), the metadata (69) for each function indicating that the user defined functions are executable within the one of the GROUP BY expression and the defined OLAP window.
14. The non-transitory computer readable medium of claim 13, the metadata (69) comprising a generalized application programming interface function pointer for validating and resolving arguments within the call to the one of the plurality of user defined functions.
15. The non-transitory computer readable medium of claim 10, the defined data partition being a first data partition associated with one of the plurality of user defined functions (68) in the SQL query, and the query compiler (62) further identifying a second data partition defined by another OLAP window associated with another of the plurality of user defined functions, the first data partition differing from the second data partition.
PCT/US2011/022437 2011-01-25 2011-01-25 Analytical data processing WO2012102707A1 (en)

Priority Applications (9)

Application Number Priority Date Filing Date Title
PCT/US2011/022437 WO2012102707A1 (en) 2011-01-25 2011-01-25 Analytical data processing
CN2011800601915A CN103262076A (en) 2011-01-25 2011-01-25 Analytical data processing
US13/885,032 US20130238548A1 (en) 2011-01-25 2011-01-25 Analytical data processing
EP11856881.5A EP2668598A4 (en) 2011-01-25 2011-01-25 Analytical data processing
US13/240,582 US9355145B2 (en) 2011-01-25 2011-09-22 User defined function classification in analytical data processing systems
US13/249,482 US8856151B2 (en) 2011-01-25 2011-09-30 Output field mapping of user defined functions in databases
US13/272,598 US9229984B2 (en) 2011-01-25 2011-10-13 Parameter expressions for modeling user defined function execution in analytical data processing systems
US13/364,155 US8538954B2 (en) 2011-01-25 2012-02-01 Aggregate function partitions for distributed processing
US13/485,246 US20120239612A1 (en) 2011-01-25 2012-05-31 User defined functions for data loading

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2011/022437 WO2012102707A1 (en) 2011-01-25 2011-01-25 Analytical data processing

Related Child Applications (4)

Application Number Title Priority Date Filing Date
US13/240,582 Continuation-In-Part US9355145B2 (en) 2011-01-25 2011-09-22 User defined function classification in analytical data processing systems
US13/249,482 Continuation-In-Part US8856151B2 (en) 2011-01-25 2011-09-30 Output field mapping of user defined functions in databases
US13/272,598 Continuation-In-Part US9229984B2 (en) 2011-01-25 2011-10-13 Parameter expressions for modeling user defined function execution in analytical data processing systems
US13/364,155 Continuation-In-Part US8538954B2 (en) 2011-01-25 2012-02-01 Aggregate function partitions for distributed processing

Publications (1)

Publication Number Publication Date
WO2012102707A1 true WO2012102707A1 (en) 2012-08-02

Family

ID=46581072

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2011/022437 WO2012102707A1 (en) 2011-01-25 2011-01-25 Analytical data processing

Country Status (4)

Country Link
US (1) US20130238548A1 (en)
EP (1) EP2668598A4 (en)
CN (1) CN103262076A (en)
WO (1) WO2012102707A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9355145B2 (en) 2011-01-25 2016-05-31 Hewlett Packard Enterprise Development Lp User defined function classification in analytical data processing systems
WO2012125166A1 (en) * 2011-03-17 2012-09-20 Hewlett-Packard Development Company L.P. Data source analytics
US9472015B2 (en) * 2012-05-15 2016-10-18 Sap Se Real-time visualization of transactional data objects
US9396240B2 (en) * 2013-12-03 2016-07-19 Business Objects Software Ltd. Extreme visualization enabling extension for large data sets
US9760599B2 (en) 2014-04-09 2017-09-12 International Business Machines Corporation Group-by processing for data containing singleton groups
US9454571B2 (en) 2014-06-26 2016-09-27 Sap Se Optimization of parallelization of user-defined functions with flexible partitioning
US10248688B2 (en) * 2014-06-26 2019-04-02 Sap Se Annotations for parallelization of user-defined functions with flexible partitioning
EP2977918A1 (en) * 2014-07-24 2016-01-27 Tata Consultancy Services Limited A system and method for executing a sparql query
US10380266B2 (en) * 2014-08-11 2019-08-13 InMobi Pte Ltd. Method and system for analyzing data in a database
CN104635910A (en) * 2014-12-30 2015-05-20 宁波江东波莫纳电子科技有限公司 Power management control system for computer
US9830373B2 (en) * 2015-01-06 2017-11-28 Entit Software Llc Data transfer requests with data transfer policies
DK3405902T3 (en) * 2016-01-22 2022-03-21 Virsec Systems Inc PATTERN MATCHING-BASED DATA SET EXTRACTION
CN106547883B (en) * 2016-11-03 2021-02-19 北京集奥聚合科技有限公司 Method and system for processing User Defined Function (UDF) running condition
US20180316641A1 (en) * 2017-04-28 2018-11-01 Ca, Inc. Publish/subscribe system incorporating derivation actions
US11385874B2 (en) * 2020-02-03 2022-07-12 Sap Se Automatic type determination for database programming
US11797521B1 (en) * 2020-06-30 2023-10-24 Amazon Technologies, Inc. Associating a function with a table in a database system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6385604B1 (en) * 1999-08-04 2002-05-07 Hyperroll, Israel Limited Relational database management system having integrated non-relational multi-dimensional data store of aggregated data elements
US6473750B1 (en) * 1999-10-15 2002-10-29 Microsoft Corporation Adaptive query execution in a distributed database system
US20060010159A1 (en) * 2004-07-09 2006-01-12 Microsoft Corporation Custom grouping for dimension members
US20060136415A1 (en) * 2004-12-21 2006-06-22 International Business Machines Corporation Method, system, and program product for executing a scalar function on a varying number of records within a RDBMS using SQL

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6324533B1 (en) * 1998-05-29 2001-11-27 International Business Machines Corporation Integrated database and data-mining system
US7024414B2 (en) * 2001-08-06 2006-04-04 Sensage, Inc. Storage of row-column data
US20060074883A1 (en) * 2004-10-05 2006-04-06 Microsoft Corporation Systems, methods, and interfaces for providing personalized search and information access
US20100036801A1 (en) * 2008-08-08 2010-02-11 Behzad Pirvali Structured query language function in-lining
US20100241893A1 (en) * 2009-03-18 2010-09-23 Eric Friedman Interpretation and execution of a customizable database request using an extensible computer process and an available computing environment
CN101582079B (en) * 2009-06-23 2011-01-05 用友软件股份有限公司 Object query method and device
US9305057B2 (en) * 2009-12-28 2016-04-05 Oracle International Corporation Extensible indexing framework using data cartridges
US9183253B2 (en) * 2012-06-27 2015-11-10 Nec Laboratories America, Inc. System for evolutionary analytics

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6385604B1 (en) * 1999-08-04 2002-05-07 Hyperroll, Israel Limited Relational database management system having integrated non-relational multi-dimensional data store of aggregated data elements
US6473750B1 (en) * 1999-10-15 2002-10-29 Microsoft Corporation Adaptive query execution in a distributed database system
US20060010159A1 (en) * 2004-07-09 2006-01-12 Microsoft Corporation Custom grouping for dimension members
US20060136415A1 (en) * 2004-12-21 2006-06-22 International Business Machines Corporation Method, system, and program product for executing a scalar function on a varying number of records within a RDBMS using SQL

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2668598A4 *

Also Published As

Publication number Publication date
EP2668598A4 (en) 2015-11-04
EP2668598A1 (en) 2013-12-04
CN103262076A (en) 2013-08-21
US20130238548A1 (en) 2013-09-12

Similar Documents

Publication Publication Date Title
US20130238548A1 (en) Analytical data processing
US8856151B2 (en) Output field mapping of user defined functions in databases
US9355145B2 (en) User defined function classification in analytical data processing systems
US8538954B2 (en) Aggregate function partitions for distributed processing
US9229984B2 (en) Parameter expressions for modeling user defined function execution in analytical data processing systems
US11907247B2 (en) Metadata hub for metadata models of database objects
US9665619B1 (en) Optimizing database queries using subquery composition
US9053210B2 (en) Graph query processing using plurality of engines
CN105518676B (en) Universal SQL enhancement to query arbitrary semi-structured data and techniques to efficiently support such enhancements
US7797304B2 (en) Extensible query language with support for rich data types
US7630993B2 (en) Generating database schemas for relational and markup language data from a conceptual model
US20170083573A1 (en) Multi-query optimization
EP2184689B1 (en) Normalizing a filter condition of a database query
EP2439656A1 (en) Hybrid query execution plan
US20100017395A1 (en) Apparatus and methods for transforming relational queries into multi-dimensional queries
US20140279839A1 (en) Integration of transactional and analytical capabilities of a database management system
US20060235837A1 (en) Rewriting table functions as SQL strings
US20220083552A1 (en) Query processing in a polystore
Pokorný Integration of relational and graph databases functionally
EP3293645A1 (en) Iterative evaluation of data through simd processor registers
Rompf et al. A SQL to C compiler in 500 lines of code
Sattler et al. Interactive example-driven integration and reconciliation for accessing database federations
US10713244B2 (en) Calculation engine optimizations for join operations utilizing automatic detection of forced constraints
US20170031909A1 (en) Locality-sensitive hashing for algebraic expressions
Unbehauen et al. SPARQL update queries over R2RML mapped data sources

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11856881

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 13885032

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2011856881

Country of ref document: EP