WO2014101520A1 - Method and system for implementing analysis functions based on MapReduce - Google Patents

Method and system for implementing analysis functions based on MapReduce

Info

Publication number
WO2014101520A1
Authority
WO
WIPO (PCT)
Prior art keywords
analysis
operator
buffer
data
row
Prior art date
Application number
PCT/CN2013/084860
Other languages
English (en)
French (fr)
Inventor
张书彬
田万鹏
肖品
鲍春健
郭玮
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2014101520A1
Priority to US14/750,887 (US20150356162A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1858 Parallel file systems, i.e. file systems supporting multiple processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/22 Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F7/24 Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general

Definitions

  • The present disclosure relates to the field of data warehousing, and in particular to a method and system for implementing analysis functions based on MapReduce.
  • A data warehouse is a repository that organizes, stores, and manages data according to data structures. With the spread of computers, data warehouses have come into wide use in work and daily life. With the rapid development of the Internet and information technology, data warehouses no longer merely store and manage data; they also offer strong data analysis capabilities.
  • Commonly used databases, such as ORACLE and PostgreSQL, provide multiple analysis functions that analyze data according to user needs and return the analysis results to the user.
  • An analysis function computes an aggregate value over a group of data. Unlike an aggregate function, which returns a single row after processing a data group, an analysis function returns multiple rows.
  • MapReduce is a programming model for parallel computing of large data sets.
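  • At present, distributed data warehouses based on the MapReduce framework (such as the Hive data warehouse) cannot implement analysis functions for data processing, which causes considerable inconvenience in use.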
  • Embodiments of the present disclosure provide a method and system for implementing an analysis function based on MapReduce, which can solve the problem that a distributed database based on the MapReduce framework cannot implement an analysis function for data processing.
  • In a first aspect, an embodiment of the present disclosure provides a method for implementing an analysis function based on MapReduce, the method comprising: a table scan operator acquires a data row from a file block and sends the data row to a mapping operator; the mapping operator receives the data row, determines a reduce key, a partition key, and a sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework, where the analysis operator belongs to the Reduce side of the MapReduce framework; the analysis operator receives the data row, analyzes the data row to obtain an analysis result, and forwards the data row and the analysis result to a successor operator.
  • In a second aspect, an embodiment of the present disclosure further provides a system for implementing an analysis function based on MapReduce, where the system includes a scan operator module, a mapping operator module, and an analysis operator module: the scan operator module is configured to acquire a data row from a file block and send the data row to the mapping operator module; the mapping operator module is configured to receive the data row, determine a reduce key, a partition key, and a sort key of the analysis function, and send the data row to the analysis operator module through the MapReduce framework, where the analysis operator module belongs to the Reduce side of the MapReduce framework; the analysis operator module is configured to receive the data row, analyze the data row to obtain an analysis result, and forward the data row and the analysis result to a successor operator module.
  • The method and system for implementing the analysis function based on MapReduce provided by the embodiments of the present disclosure can be applied to a distributed database based on the MapReduce framework (such as a Tencent distributed data warehouse, a Hive database, etc.) to implement data analysis, extending the functionality of such databases so that users can perform data analysis in a distributed database based on the MapReduce framework.
  • FIG. 1 is a flow chart of a method for implementing an analysis function based on MapReduce according to a first embodiment of the present disclosure
  • FIG. 2 is a flow chart of a method for implementing an analysis function based on MapReduce according to Embodiment 2 of the present disclosure
  • FIG. 3 is a schematic structural diagram of an analysis operator buffer according to Embodiment 2 of the present disclosure
  • FIG. 4 is a schematic structural diagram of an analyzer buffer according to Embodiment 2 of the present disclosure
  • FIGS. 5(a)-(d) and FIGS. 6(a)-(d) are schematic diagrams of window modes according to Embodiment 2 of the present disclosure;
  • FIG. 7 is a schematic structural diagram of a system for implementing an analysis function based on MapReduce according to Embodiment 3 of the present disclosure
  • FIG. 8 is a schematic structural diagram of the analysis operator module 53 shown in FIG. 7.
  • the embodiment of the present disclosure provides a method for implementing an analysis function based on MapReduce, which is applicable to data analysis of a distributed data warehouse based on a MapReduce framework. As shown in FIG. 1, the method includes the following steps.
  • Step 101: A table scan operator (TableScanOperator) acquires a data row from a file block and sends the data row to a mapping operator.
  • Step 102: The mapping operator (ReduceSinkOperator) receives the data row, determines a reduce key, a partition key, and a sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework, where the analysis operator belongs to the Reduce side of the MapReduce framework.
  • Step 103: The analysis operator (AnalysisOperator) receives the data row, analyzes the data row to obtain an analysis result, and forwards the data row and the analysis result to a successor operator.
  • the successor operator can be determined according to the operation required by the specific situation, for example, an aggregation operator, a filter operator, or a write file operator, but is not limited thereto.
  • The method for implementing the analysis function based on MapReduce provided by the embodiment of the present disclosure can be applied to a distributed data warehouse based on the MapReduce framework (for example, a Tencent distributed data warehouse, a Hive data warehouse, etc.) to analyze data with analysis functions, extending the functionality of distributed databases based on the MapReduce framework so that analysis functions can be used for data analysis in them.
  • the embodiment of the present disclosure provides a method for implementing an analysis function based on MapReduce, which is applicable to data analysis based on a distributed database of the MapReduce framework. As shown in FIG. 2, the method includes the following steps.
  • Step 201 The table scan operator obtains a data row from the file block, and sends the data row to the mapping operator.
  • In the method of this embodiment, a plurality of different analysis functions may be preset to analyze the data. Commonly used analysis functions may include, for example, LAG, LEAD, RANK, DENSE_RANK, ROW_NUMBER, SUM, COUNT, AVG, MAX, MIN, and RATIO_TO_REPORT.
  • Optionally, new analysis functions may be added according to user requirements.
  • Step 202: The mapping operator receives the data row, determines a reduce key, a partition key, and a sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework, where the analysis operator belongs to the Reduce side of the MapReduce framework.
  • For example, the mapping operator may determine the reduce key, partition key, and sort key of the analysis function as follows:
  • (1) when the analysis function contains a partition clause and/or a sort clause, the columns in the partition clause and/or the columns in the sort clause may be used as the reduce key, or
  • when the analysis function has no sort clause but has the distinct keyword, the distinct column may be used as the reduce key, or
  • when the analysis function contains no partition clause, no sort clause, and no distinct keyword, any constant may be specified as the reduce key;
  • (2) when the analysis function contains a partition clause, the columns in the partition clause may be used as the partition key, or
  • when the analysis function contains no partition clause, the same constant as the reduce key may be used as the partition key;
  • (3) when the analysis function contains a sort clause, the columns in the sort clause may be used as the sort key.
  • Step 203 The analysis operator receives the data row, and stores the data row in an analysis operator buffer for use by all analyzers.
  • To enable data sharing, an analysis operator buffer (AnalysisBuffer) can be provided in the analysis operator (specifically, in the analysis operator module formed by the analysis operator). The buffer has the following characteristics: a. it allows data of a specified length to be kept in memory; b. when the length exceeds the limit, half of the contents of the memory buffer are spilled to the hard disk; c. it allows the user to access elements by index; d. it allows the user to delete, from the head, elements that have already been forwarded.
  • As shown in FIG. 3, the analysis operator buffer may include a memory buffer and a disk buffer (which may be located on the disk shown in FIG. 4).
  • In the analysis operator buffer, a newly received data row is preferentially placed in the memory buffer; if the memory buffer is full, the older data rows in the memory buffer can be moved to the disk buffer to free memory, after which the newly received data row can be stored in the memory buffer.
  • Step 204: The analysis operator parses out the partition field and sort field of the data row and determines whether the data row belongs to the current partition, where the current partition is the partition to which the previous data row received by the analysis operator belongs; if yes, go to step 205; if no, go to step 206.
  • Step 205: The analysis operator calls the analyzer corresponding to the analysis function to analyze the data row, obtains an analysis result, and stores the analysis result in an analyzer buffer.
  • an analysis function can correspond to an analyzer, and each analyzer can correspond to an analyzer buffer for storing analysis results, intermediate results, or total aggregation results associated with each data row.
  • As shown in FIG. 4, the analyzer buffer may include a memory buffer and a disk buffer (which may be located on the disk shown in FIG. 4), and the memory buffer may include an output buffer and an input buffer.
  • The analyzer buffer is used to buffer and update analysis results. When the analyzer buffer buffers an analysis result: the analysis result may be stored in the output buffer; if the output buffer is full, the contents of the output buffer may be written to the disk buffer to free the storage space of the output buffer.
  • When the analyzer buffer updates an analysis result: if the row to be updated is stored in the output buffer, the analysis result can be updated directly from the row to be updated in the output buffer and the newly received data row; if the row to be updated is stored in the input buffer, the analysis result can be updated directly from the row to be updated in the input buffer and the newly received data row; if the row to be updated is stored on disk (that is, in the disk buffer), the contents of the input buffer can be written to the disk and the buffer block containing the row to be updated can be read from the disk into the input buffer, so that the analysis result is updated from the row to be updated in the input buffer and the newly received data row.
  • Step 206: The analysis operator ends the analysis of the current partition, combines all data rows of the current partition stored in the analysis operator buffer with all analysis results of the current partition stored in the analyzer buffer into new data rows, and forwards them to the successor operator.
  • If the analysis function does not require accumulation, then after the analyzer corresponding to the analysis function has analyzed the data row and obtained the analysis result, the data row and the analysis result may be combined and forwarded directly to the successor operator, with no need to cache them.
  • this embodiment provides an exemplary algorithm for 11 common analysis functions. The details are as follows.
  • Algorithm 1: Overview of the LAG algorithm:
  • Algorithm 2: Overview of the LEAD algorithm:
  • the pointer p1 points to the lowest-numbered row that has not yet been processed
  • the pointer p2 points to the current row.
  • When a new row is analyzed, p2 is incremented by 1; if p2 - p1 >= offset, the result of the row pointed to by p1 is set to the content of column col of the row pointed to by p2, p1 is incremented, and all rows whose row number is less than or equal to p1 can be forwarded.
  • Algorithm 3: Overview of the RANK algorithm:
  • the RANK analyzer buffer holds the current rank (rank), the value corresponding to the current rank (value), and the number of rows having the current rank (number).
  • When a new row is analyzed, if its value equals value, the rank column of the row is set to rank and number in the analyzer buffer is incremented; otherwise, the rank column is set to rank+number, and in the analyzer buffer rank is set to rank+number, value is set to the specified value of the new row, and number is set to 1. All rows processed so far can be forwarded.
  • Algorithm 4: Overview of the DENSE_RANK algorithm:
  • the DENSE_RANK analyzer buffer holds the current rank (rank), the value corresponding to the current rank (value), and the number of rows having the current rank (number).
  • When a new row is analyzed, if its value equals value, the rank column of the row is set to rank and number in the analyzer buffer is incremented; otherwise, the rank column is set to rank+1, and in the analyzer buffer rank is set to rank+1, value is set to the specified value of the new row, and number is set to 1.
  • The rows processed so far can be forwarded.
  • Algorithm 5: Overview of the ROW_NUMBER algorithm:
  • Algorithm 8: Overview of the AVG algorithm:
  • Algorithm 10: Overview of the MIN algorithm:
  • Algorithm 11: Overview of the RATIO_TO_REPORT algorithm:
  • An analysis function computes an aggregate value for each data row based on a set of records (for example, multiple data rows) to obtain the analysis result.
  • The set of records on which the computation is based is called a "window".
  • Each row of records has a window, which specifies the record set over which the analysis function performs its aggregation.
  • For the case with a window clause, the present embodiment provides the following eight modes (that is, window modes; specifically, modes of setting the window position) for reference:
  • Rows between window.lag preceding and window.lead following // within the range from window.lag rows before the current row to window.lead rows after it;
  • Range between window.lag preceding and window.lead following // within the range from window.lag less (or greater) than the current value to window.lead greater (or less) than the current value.
  • Rows between window.lag preceding and window.lead preceding // within the range between window.lag rows and window.lead rows before the current row;
  • Range between window.lag preceding and window.lead preceding // within the range between window.lag and window.lead less (or greater) than the current value.
  • The method for implementing the analysis function based on MapReduce provided by the embodiment of the present disclosure can be applied to a distributed database based on the MapReduce framework (such as a Tencent distributed data warehouse, a Hive data warehouse, etc.) to implement data analysis, extending the functionality of distributed databases based on the MapReduce framework.
  • Embodiments of the present disclosure provide a system for implementing an analysis function based on MapReduce, which can implement the foregoing method embodiments.
  • the system may include a scan operator 51, a mapping operator 52, and an analysis operator 53.
  • the scan operator 51 may form a scan operator module or be included in a scan operator module.
  • the terms “scan operator” and “scan operator module” are used interchangeably.
  • the mapping operator 52 may form a mapping operator module or be included in a mapping operator module.
  • the terms “mapping operator” and “mapping operator module” are used interchangeably.
  • the analysis operator 53 may form an analysis operator module or be included in an analysis operator module.
  • the terms “analysis operator” and “analysis operator module” are used interchangeably.
  • the system may also include analysis operator buffers (not shown) which are the same as the analysis operator buffers described above, and thus a detailed description thereof is omitted herein.
  • the scan operator 51 is configured to acquire a data row from a file block, and send the data row to the mapping operator 52;
  • the mapping operator 52 is configured to receive the data row, determine a reduce key, a partition key, and a sort key of the analysis function, and send the data row to the analysis operator 53 through the MapReduce framework, where the analysis operator 53 belongs to the Reduce side of the MapReduce framework;
  • the analysis operator 53 is configured to receive the data row, analyze the data row to obtain an analysis result, and forward the data row and the analysis result to a successor operator.
  • The mapping operator 52 may specifically be used to: use the columns in the partition clause and/or the columns in the sort clause as the reduce key when the analysis function contains a partition clause and/or a sort clause; or use the distinct column as the reduce key when the analysis function has no sort clause but has the distinct keyword; or specify an arbitrary constant as the reduce key when the analysis function contains no partition clause, no sort clause, and no distinct keyword.
  • The mapping operator 52 can also be used to use the columns in the partition clause of the analysis function as the partition key when the analysis function contains a partition clause, or to use the same constant as the reduce key as the partition key when the analysis function has no partition clause.
  • The mapping operator 52 can also be used to use the columns in the sort clause as the sort key when the analysis function contains a sort clause.
  • the analysis operator 53 may include:
  • a storage module 531 which can be configured to receive the data row, and store the data row in an analysis operator buffer for use by all analyzers;
  • a determining module 532, configured to parse out the partition field and sort field of the data row and determine whether the data row belongs to the current partition, where the current partition is the partition to which the previous data row received by the analysis operator belongs. If yes, the analysis operator 53 can call the analyzer corresponding to the analysis function to analyze the data row, obtain an analysis result, and store the analysis result in the analyzer buffer. If not, the analysis operator 53 may end the analysis of the current partition, combine all data rows of the current partition stored in the analysis operator buffer with all analysis results of the current partition stored in the analyzer buffer into new data rows, and forward them to the successor operator (that is, the operator module).
  • The analyzer and the analyzer buffer are the same as described above. They may be located in the system according to Embodiment 3 of the present disclosure, or may be external to the system and operatively coupled to it.
  • If the analysis function does not require accumulation, the analysis operator 53 may forward the data row and the analysis result directly to the successor operator (that is, the operator module) after obtaining the analysis result, with no need to cache the data row and the analysis result.
  • The system for implementing the analysis function based on MapReduce provided by the embodiment of the present disclosure can be applied to a distributed database based on the MapReduce framework (such as a Tencent distributed data warehouse, a Hive database, etc.) to implement data analysis, extending the functionality of such databases so that analysis functions can be used for data analysis in a distributed database based on the MapReduce framework.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

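The present disclosure provides a method and system for implementing analysis functions based on MapReduce. The method includes: a table scan operator acquires a data row from a file block and sends the data row to a mapping operator; the mapping operator receives the data row, determines the reduce key, partition key, and sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework; the analysis operator receives the data row, analyzes it to obtain an analysis result, and forwards the data row and the analysis result to a successor operator. The present disclosure makes it possible to implement analysis functions in a distributed data warehouse based on the MapReduce framework, thereby solving the problem that analysis functions cannot be used for data analysis and processing in such data warehouses.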

Description

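Method and System for Implementing Analysis Functions Based on MapReduce
Technical Field
The present disclosure relates to the field of data warehousing, and in particular to a method and system for implementing analysis functions based on MapReduce.
Background
A data warehouse is a repository that organizes, stores, and manages data according to data structures. With the spread of computers, data warehouses have come into wide use in work and daily life. With the rapid development of the Internet and information technology, data warehouses no longer merely store and manage data; they also offer strong data analysis capabilities. Commonly used databases, such as ORACLE and PostgreSQL, provide multiple analysis functions that analyze data according to user needs and return the analysis results to the user. An analysis function computes an aggregate value over a group of data. Unlike an aggregate function, which returns a single row after processing a data group, an analysis function returns multiple rows.
MapReduce is a programming model for parallel computation over large data sets. At present, distributed data warehouses based on the MapReduce framework (for example, the Hive data warehouse) cannot implement analysis functions for data processing, which causes considerable inconvenience when using the database.
Summary
Embodiments of the present disclosure provide a method and system for implementing analysis functions based on MapReduce, which can solve the problem that a distributed database based on the MapReduce framework cannot implement analysis functions for data processing.
To achieve the above objective, the embodiments of the present disclosure adopt the following technical solutions.
In a first aspect, an embodiment of the present disclosure provides a method for implementing an analysis function based on MapReduce, the method comprising: a table scan operator acquires a data row from a file block and sends the data row to a mapping operator; the mapping operator receives the data row, determines the reduce key, partition key, and sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework, where the analysis operator belongs to the Reduce side of the MapReduce framework; the analysis operator receives the data row, analyzes the data row to obtain an analysis result, and forwards the data row and the analysis result to a successor operator.
In a second aspect, an embodiment of the present disclosure further provides a system for implementing an analysis function based on MapReduce, the system comprising a scan operator module, a mapping operator module, and an analysis operator module, where: the scan operator module is configured to acquire a data row from a file block and send the data row to the mapping operator module; the mapping operator module is configured to receive the data row, determine the reduce key, partition key, and sort key of the analysis function, and send the data row to the analysis operator module through the MapReduce framework, where the analysis operator module belongs to the Reduce side of the MapReduce framework; the analysis operator module is configured to receive the data row, analyze the data row to obtain an analysis result, and forward the data row and the analysis result to a successor operator module.
The method and system for implementing analysis functions based on MapReduce provided by the embodiments of the present disclosure can be applied to a distributed database based on the MapReduce framework (for example, a Tencent distributed data warehouse, a Hive database, etc.) to implement data analysis, extending the functionality of such databases so that users can perform data analysis in a distributed database based on the MapReduce framework.
Brief Description of the Drawings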
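To explain the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for implementing an analysis function based on MapReduce according to Embodiment 1 of the present disclosure;
FIG. 2 is a schematic flowchart of a method for implementing an analysis function based on MapReduce according to Embodiment 2 of the present disclosure;
FIG. 3 is a schematic structural diagram of an analysis operator buffer according to Embodiment 2 of the present disclosure;
FIG. 4 is a schematic structural diagram of an analyzer buffer according to Embodiment 2 of the present disclosure;
FIGS. 5(a)-(d) and FIGS. 6(a)-(d) are schematic diagrams of window modes according to Embodiment 2 of the present disclosure;
FIG. 7 is a schematic structural diagram of a system for implementing an analysis function based on MapReduce according to Embodiment 3 of the present disclosure;
FIG. 8 is a schematic structural diagram of the analysis operator module 53 shown in FIG. 7.
Detailed Description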
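The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
Embodiment 1
An embodiment of the present disclosure provides a method for implementing an analysis function based on MapReduce, suitable for data analysis in a distributed data warehouse based on the MapReduce framework. As shown in FIG. 1, the method includes the following steps.
Step 101: A table scan operator (TableScanOperator) acquires a data row from a file block and sends the data row to a mapping operator.
Step 102: The mapping operator (ReduceSinkOperator) receives the data row, determines the reduce key, partition key, and sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework, where the analysis operator belongs to the Reduce side of the MapReduce framework.
Step 103: The analysis operator (AnalysisOperator) receives the data row, analyzes the data row to obtain an analysis result, and forwards the data row and the analysis result to a successor operator.
The successor operator can be determined by the operation required in the specific case, for example an aggregation operator, a filter operator, or a file-writing operator, but is not limited to these.
The method for implementing an analysis function based on MapReduce provided by this embodiment can be applied to a distributed data warehouse based on the MapReduce framework (for example, a Tencent distributed data warehouse, a Hive data warehouse, etc.) to analyze data with analysis functions, extending the functionality of distributed databases based on the MapReduce framework so that analysis functions can be used for data analysis in them.
Embodiment 2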
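An embodiment of the present disclosure provides a method for implementing an analysis function based on MapReduce, suitable for data analysis in a distributed database based on the MapReduce framework. As shown in FIG. 2, the method includes the following steps.
Step 201: A table scan operator acquires a data row from a file block and sends the data row to a mapping operator.
It is worth noting that, in the method of this embodiment, a plurality of different analysis functions may be preset to analyze the data. Commonly used analysis functions may include, for example, LAG, LEAD, RANK, DENSE_RANK, ROW_NUMBER, SUM, COUNT, AVG, MAX, MIN, and RATIO_TO_REPORT. Optionally, new analysis functions may be added according to user needs.
Step 202: The mapping operator receives the data row, determines the reduce key, partition key, and sort key of the analysis function, and sends the data row to an analysis operator through the MapReduce framework, where the analysis operator belongs to the Reduce side of the MapReduce framework.
For example, the mapping operator may determine the reduce key, partition key, and sort key of the analysis function as follows:
(1) When the analysis function contains a partition clause and/or a sort clause, the columns in the partition clause and/or the columns in the sort clause may be used as the reduce key, or
when the analysis function has no sort clause but has the distinct keyword, the distinct column may be used as the reduce key, or
when the analysis function contains no partition clause, no sort clause, and no distinct keyword, any constant may be specified as the reduce key;
(2) When the analysis function contains a partition clause, the columns in the partition clause may be used as the partition key, or
when the analysis function contains no partition clause, the same constant as the reduce key may be used as the partition key.
(3) When the analysis function contains a sort clause, the columns in the sort clause may be used as the sort key.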
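The three rules above can be sketched as a small function. This is only an illustration under stated assumptions: `AnalysisCall`, `determine_keys`, and `CONSTANT_KEY` are hypothetical names, not identifiers from the disclosure, and the sketch assumes that a function with a sort clause but no partition clause falls back to the constant for its partition key.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CONSTANT_KEY = "__const__"  # hypothetical stand-in for "any constant"


@dataclass
class AnalysisCall:
    """Hypothetical description of one analysis-function call."""
    partition_cols: List[str] = field(default_factory=list)  # partition clause
    sort_cols: List[str] = field(default_factory=list)       # sort clause
    distinct_col: Optional[str] = None                       # distinct column


def determine_keys(call: AnalysisCall) -> Tuple[List[str], List[str], List[str]]:
    """Return (reduce_key, partition_key, sort_key) following rules (1)-(3)."""
    # Rule (1): reduce key.
    if call.partition_cols or call.sort_cols:
        reduce_key = call.partition_cols + call.sort_cols
    elif call.distinct_col is not None:
        reduce_key = [call.distinct_col]
    else:
        reduce_key = [CONSTANT_KEY]
    # Rule (2): partition key (constant fallback is an assumption, see lead-in).
    partition_key = list(call.partition_cols) if call.partition_cols else [CONSTANT_KEY]
    # Rule (3): sort key.
    sort_key = list(call.sort_cols)
    return reduce_key, partition_key, sort_key
```

For a call such as `rank() over(partition by dept order by salary)`, the sketch yields `dept, salary` as the reduce key, `dept` as the partition key, and `salary` as the sort key.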
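Step 203: The analysis operator receives the data row and stores it in an analysis operator buffer for use by all analyzers.
To enable data sharing, an analysis operator buffer (AnalysisBuffer) can be provided in the analysis operator (specifically, in the analysis operator module formed by the analysis operator). The buffer has the following characteristics: a. it allows data of a specified length to be kept in memory; b. when the length exceeds the limit, half of the contents of the memory buffer are spilled to the hard disk; c. it allows the user to access elements by index; d. it allows the user to delete, from the head, elements that have already been forwarded.
Specifically, as shown in FIG. 3, the analysis operator buffer may include a memory buffer and a disk buffer (which may be located on the disk shown in FIG. 4). In the analysis operator buffer, a newly received data row is preferentially placed in the memory buffer; if the memory buffer is full, the older data rows in the memory buffer can be moved to the disk buffer to free the storage space of the memory buffer, after which the newly received data row can be stored in the memory buffer.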
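The four buffer characteristics (a to d) can be illustrated with a minimal sketch. The class name `SpillingRowBuffer`, the use of `pickle` and a temporary file as the disk buffer, and the stable global row indices are all assumptions made for illustration; the disclosure does not specify a storage format.

```python
import pickle
import tempfile


class SpillingRowBuffer:
    """Hypothetical row buffer: bounded memory, spill-half-to-disk, indexed."""

    def __init__(self, max_in_memory=4):
        self.max_in_memory = max_in_memory  # (a) rows allowed in memory
        self.base = 0                       # global index of the oldest kept row
        self.spilled = []                   # file offsets of rows on disk, oldest first
        self.memory = []                    # newest rows
        self.disk = tempfile.TemporaryFile()

    def append(self, row):
        if len(self.memory) >= self.max_in_memory:
            self._spill_half()
        self.memory.append(row)

    def _spill_half(self):
        # (b) move the older half of the memory buffer to disk.
        half = max(1, len(self.memory) // 2)
        self.disk.seek(0, 2)  # append at end of file
        for row in self.memory[:half]:
            self.spilled.append(self.disk.tell())
            pickle.dump(row, self.disk)
        del self.memory[:half]

    def __len__(self):
        return len(self.spilled) + len(self.memory)

    def __getitem__(self, index):
        # (c) access by global row index, whether in memory or on disk.
        pos = index - self.base
        if pos < 0 or pos >= len(self):
            raise IndexError(index)
        if pos < len(self.spilled):
            self.disk.seek(self.spilled[pos])
            return pickle.load(self.disk)
        return self.memory[pos - len(self.spilled)]

    def drop_forwarded(self, count):
        # (d) delete already-forwarded rows from the head.
        count = min(count, len(self))
        from_disk = min(count, len(self.spilled))
        del self.spilled[:from_disk]
        del self.memory[:count - from_disk]
        self.base += count
```

Keeping only file offsets in memory for spilled rows is one simple way to preserve index access; a production buffer would more likely read and write whole blocks.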
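Step 204: The analysis operator parses out the partition field and sort field of the data row and determines whether the data row belongs to the current partition, where the current partition is the partition to which the previous data row received by the analysis operator belongs; if yes, go to step 205; if no, go to step 206.
Step 205: The analysis operator calls the analyzer corresponding to the analysis function to analyze the data row, obtains an analysis result, and stores the analysis result in an analyzer buffer.
It is worth noting that one analysis function can correspond to one analyzer, and each analyzer can correspond to one analyzer buffer, which stores the analysis result, intermediate result, or overall aggregation result associated with each data row. As shown in FIG. 4, the analyzer buffer may include a memory buffer and a disk buffer (which may be located on the disk shown in FIG. 4), and the memory buffer may include an output buffer and an input buffer.
The analyzer buffer is used to buffer and update analysis results. Specifically, when the analyzer buffer buffers an analysis result: the analysis result may be stored in the output buffer; if the output buffer is full, the contents of the output buffer may be written to the disk buffer to free the storage space of the output buffer. When the analyzer buffer updates an analysis result: if the row to be updated is stored in the output buffer, the analysis result can be updated directly from the row to be updated in the output buffer and the newly received data row; if the row to be updated is stored in the input buffer, the analysis result can be updated directly from the row to be updated in the input buffer and the newly received data row; if the row to be updated is stored on disk (that is, in the disk buffer), the contents of the input buffer can be written to the disk and the buffer block containing the row to be updated can be read from the disk into the input buffer, so that the analysis result is updated from the row to be updated in the input buffer and the newly received data row.
Step 206: The analysis operator ends the analysis of the current partition, combines all data rows of the current partition stored in the analysis operator buffer with all analysis results of the current partition stored in the analyzer buffer into new data rows, and forwards them to the successor operator.
It is worth noting that, if the analysis function does not require accumulation, then after the analyzer corresponding to the analysis function has analyzed the data row and obtained the analysis result, the data row and the analysis result may be combined and forwarded directly to the successor operator, with no need to cache them.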
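The control flow of steps 203 to 206 can be sketched as a generator that detects partition boundaries in a stream of rows that already arrives grouped by partition. The names `analysis_operator` and `SumAnalyzer` are hypothetical, plain lists stand in for the two buffers, and flushing the last partition at end of input is an assumption, since the text only describes the flush triggered by a row from a new partition.

```python
class SumAnalyzer:
    """Hypothetical analyzer: partition-wide SUM over one column."""

    def __init__(self, column):
        self.column = column
        self.total = 0
        self.rows_seen = 0

    def add(self, row):
        # Accumulate non-null values; every row still receives a result.
        if row[self.column] is not None:
            self.total += row[self.column]
        self.rows_seen += 1

    def finish(self):
        # One result per buffered row, in arrival order.
        return [self.total] * self.rows_seen


def analysis_operator(rows, partition_of, make_analyzer):
    """Yield (row, result) pairs, flushing whenever the partition changes."""
    current_partition = None
    row_buffer = []   # stands in for the analysis operator buffer
    analyzer = None   # owns the analyzer buffer for the current partition
    for row in rows:
        partition = partition_of(row)
        if analyzer is not None and partition != current_partition:
            # Step 206: close the current partition and forward its rows.
            yield from zip(row_buffer, analyzer.finish())
            analyzer = None
        if analyzer is None:
            current_partition, row_buffer, analyzer = partition, [], make_analyzer()
        # Steps 203 and 205: buffer the row and let the analyzer process it.
        row_buffer.append(row)
        analyzer.add(row)
    if analyzer is not None:
        # Assumption: the last partition is flushed at end of input.
        yield from zip(row_buffer, analyzer.finish())
```

A function that needs no accumulation could instead yield each pair immediately, which corresponds to the direct-forwarding case described above.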
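For ease of understanding, this embodiment provides overviews of exemplary algorithms for 11 common analysis functions, as follows.
Algorithm 1: Overview of the LAG algorithm:
Assume that the analysis function called is lag(col, offset) over(...).
The LAG analyzer buffer holds only a row-number counter p (initial value -1). When a new row is analyzed, p is incremented by 1. If p >= offset, the result column of the row pointed to by p is set to the content of column col of row p-offset, and row p-offset and all earlier rows are marked as forwardable; otherwise, the result of the current row is set to null and no row may be forwarded.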
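The LAG overview can be sketched as follows. `LagAnalyzer` and `lag_column` are hypothetical names, rows are assumed to be dictionaries, and the caller is assumed to keep the rows of the current partition in a list that already contains the newest row.

```python
class LagAnalyzer:
    """Sketch of lag(col, offset) over(...) for a single partition."""

    def __init__(self, col, offset):
        self.col = col
        self.offset = offset
        self.p = -1  # row counter, initial value -1 as in the text

    def analyze(self, partition_rows):
        """Process the newest row, which must already sit at index p + 1.

        Returns (result, forwardable_up_to); forwardable_up_to is -1 when
        no row may be forwarded yet.
        """
        self.p += 1
        if self.p >= self.offset:
            source_index = self.p - self.offset
            return partition_rows[source_index][self.col], source_index
        return None, -1


def lag_column(rows, col, offset):
    """Convenience wrapper: compute the LAG result for every row."""
    analyzer = LagAnalyzer(col, offset)
    seen, results = [], []
    for row in rows:
        seen.append(row)
        result, _ = analyzer.analyze(seen)
        results.append(result)
    return results
```

With values 10, 20, 30, 40 and an offset of 2, the first two rows receive null and the remaining rows receive 10 and 20.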
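Algorithm 2: Overview of the LEAD algorithm:
Assume that the analysis function called is lead(col, offset) over(...).
The LEAD analyzer buffer holds two pointers: pointer p1 points to the lowest-numbered row that has not yet been processed, and pointer p2 points to the current row. When a new row is analyzed, p2 is incremented by 1. If p2-p1 >= offset at this point, the result of the row pointed to by p1 is set to the content of column col of the row pointed to by p2, p1 is incremented, and all rows whose row number is less than or equal to p1 can be forwarded.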
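A minimal sketch of the two-pointer LEAD scheme follows. The initial pointer values (p1 = 0, p2 = -1) and the null result for trailing rows that never find a lead row are assumptions; the text does not state either. The function name `lead_values` is hypothetical.

```python
def lead_values(values, offset):
    """Compute lead(col, offset) for one partition given its column values."""
    results = [None] * len(values)  # assumption: unfilled trailing rows stay null
    p1, p2 = 0, -1                  # assumption: initial pointer positions
    for _ in values:
        p2 += 1                     # p2 now points at the newly analyzed row
        if p2 - p1 >= offset:
            results[p1] = values[p2]  # row p1 takes column col of row p2
            p1 += 1                   # row p1 is done and may be forwarded
    return results
```

Each row's result becomes known only once the row `offset` positions later arrives, which is why rows must wait in the analysis operator buffer before they are forwarded.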
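Algorithm 3: Overview of the RANK algorithm:
The RANK analyzer buffer holds the current rank (rank), the value corresponding to the current rank (value), and the number of rows having the current rank (number). When a new row is analyzed, if its value equals value, the rank column of the row is set to rank and number in the analyzer buffer is incremented; otherwise, the rank column is set to rank+number, and in the analyzer buffer rank is set to rank+number, value is set to the specified value of the new row, and number is set to 1. All rows processed so far can be forwarded.
Algorithm 4: Overview of the DENSE_RANK algorithm:
The DENSE_RANK analyzer buffer holds the current rank (rank), the value corresponding to the current rank (value), and the number of rows having the current rank (number). When a new row is analyzed, if its value equals value, the rank column of the row is set to rank and number in the analyzer buffer is incremented; otherwise, the rank column is set to rank+1, and in the analyzer buffer rank is set to rank+1, value is set to the specified value of the new row, and number is set to 1. The rows processed so far can be forwarded.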
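The RANK and DENSE_RANK overviews differ only in how the rank advances when the value changes, so one sketch can cover both. The function name `rank_values` and the initial state (rank 0, number 1, no value yet) are assumptions chosen so that the first row receives rank 1; the input is assumed to be already sorted by the sort key.

```python
_UNSET = object()  # marks "no value seen yet" in the analyzer state


def rank_values(sorted_values, dense=False):
    """Return the RANK (or DENSE_RANK when dense=True) of each value."""
    rank, value, number = 0, _UNSET, 1  # assumed initial analyzer state
    ranks = []
    for v in sorted_values:
        if value is not _UNSET and v == value:
            number += 1                 # one more row shares the current rank
        else:
            # RANK skips over ties (rank + number); DENSE_RANK advances by one.
            rank = rank + 1 if dense else rank + number
            value, number = v, 1
        ranks.append(rank)              # the row can be forwarded right away
    return ranks
```

Because each result depends only on rows already seen, no row has to wait for the end of the partition.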
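Algorithm 5: Overview of the ROW_NUMBER algorithm:
The analyzer buffer of ROW_NUMBER holds only a rownumber value (initial value -1). When a new row is analyzed, the rownumber column of the new row is set to rownumber+1, and rownumber in the analyzer buffer is set to rownumber+1. All rows processed so far may be forwarded.
Algorithm 6: Overview of the SUM algorithm: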
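The analyzer buffer of SUM keeps one variable, the running total sum. When a new row is analyzed, the value of the specified expression of the new row (which must be non-null) is added to sum and the result is stored back in sum.
No row may be forwarded until analysis of the whole partition is complete. Once the partition has been analyzed, the value of sum is used as the result for every row.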
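Algorithm 7: Overview of the COUNT algorithm:
The analyzer buffer of COUNT holds only a count counter. Each time a new row is analyzed, the counter is incremented by one if the value of the column being analyzed is non-null.
No row may be forwarded until analysis of the whole partition is complete. Once the partition has been analyzed, the value of count is used as the result for every row.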
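Algorithm 8: Overview of the AVG algorithm:
The analyzer buffer of AVG holds two counter values, sum (initial value 0) and count (initial value 0). When a new row is analyzed, if the expression is non-null, count is incremented (count++) and sum is set to sum plus the expression value of the new row.
No row may be forwarded until analysis of the whole partition is complete. Once the partition has been analyzed, if count != 0, the value of sum / count is used as the result for every row; otherwise, null is used as the analysis result for every row.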
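The AVG overview, including its handling of null inputs and of an all-null partition, can be sketched in Python as follows. The function name `avg_over_partition` is illustrative, and Python's `None` stands in for null.

```python
def avg_over_partition(values):
    """Return the AVG result for every row of one partition."""
    total, count = 0, 0
    for v in values:
        if v is not None:   # null expression values are skipped
            count += 1
            total += v
    # The result is only known once the whole partition has been seen.
    result = total / count if count != 0 else None
    return [result] * len(values)
```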
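Algorithm 9: Overview of the MAX algorithm:
The analyzer buffer of MAX holds only a max value. When a new row is analyzed, the (non-null) expression of the new row is compared with max, and max is updated if the expression is larger. When analysis of the partition is complete, the specified column of every row is set to max.
No row may be forwarded until analysis of the whole partition is complete.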
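Algorithm 10: Overview of the MIN algorithm:
The analyzer buffer of MIN holds only a min value. When a new row is analyzed, the (non-null) expression of the new row is compared with min, and min is updated if the expression is smaller. When analysis of the partition is complete, the specified column of every row is set to min.
No row may be forwarded until analysis of the whole partition is complete.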
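MAX and MIN follow the same accumulate-then-broadcast pattern, which the Python sketch below combines in one pass. The function name `max_min_over_partition` is illustrative, and starting both values as `None` until the first non-null expression arrives is an assumption, since the overviews do not state the initial values.

```python
def max_min_over_partition(values):
    """Return (max, min) for every row of one partition, ignoring null values."""
    max_value = None  # assumed unset until the first non-null value arrives
    min_value = None
    for v in values:
        if v is None:
            continue
        if max_value is None or v > max_value:
            max_value = v
        if min_value is None or v < min_value:
            min_value = v
    # Every row of the partition receives the same final values.
    return [(max_value, min_value)] * len(values)
```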
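Algorithm 11: Overview of the RATIO_TO_REPORT algorithm:
The analyzer buffer of the RATIO_TO_REPORT class holds only a sum value. When a new row is analyzed, the (non-null) expression of the new row is added to sum. When analysis of the partition is complete, the specified column of each row is divided by sum and the quotient becomes the value of that column; if sum is 0, all results are set to null.
No row may be forwarded until analysis of the whole partition is complete.
It is worth noting that an analytic function computes an aggregate value for each data row based on a group of records (for example, multiple data rows) to obtain the analysis result; this group of records is called the "window". Every row has a window, which specifies the set of records over which the analytic function performs its aggregate computation. For the case with a window clause, this embodiment provides the following 8 modes (that is, window modes, specifically modes of setting the window position) for reference: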
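Mode 1, shown in FIG. 5(a):
The representative statements of this mode are:
Rows between window.lag preceding and window.lead following // the range from window.lag rows before the current row to window.lead rows after it;
Range between window.lag preceding and window.lead following // the range from window.lag less (or greater) than the current value to window.lead greater (or less) than the current value.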
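Mode 2, shown in FIG. 5(b):
The representative statements of this mode are:
Rows between window.lag preceding and window.lead preceding // the range between window.lag rows and window.lead rows before the current row;
Range between window.lag preceding and window.lead preceding // the range between window.lag and window.lead less (or greater) than the current value.
Mode 3, shown in FIG. 5(c):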
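The representative statements of this mode are:
Rows between window.lag following and window.lead following // the range between window.lag rows and window.lead rows after the current row;
Range between window.lag following and window.lead following // the range between window.lag and window.lead greater (or less) than the current value.
Mode 4, shown in FIG. 5(d):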
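The representative statements of this mode are:
Rows between unbounded preceding and window.lead following // the range from the very beginning to window.lead rows after the current row;
Range between unbounded preceding and window.lead following // the range from the very beginning to window.lead greater (or less) than the current value.
Mode 5, shown in FIG. 6(a):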
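The representative statements of this mode are:
Rows between window.lag preceding and unbounded following // the range from window.lag rows before the current row to the end;
Range between window.lag preceding and unbounded following // the range from window.lag less (or greater) than the current value to the end.
Mode 6, shown in FIG. 6(b):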
The representative statements of this mode are:
Rows between unbounded preceding and unbounded following // from the beginning to the end;
Range between unbounded preceding and unbounded following // from the beginning to the end.
Mode 7, shown in FIG. 6(c):
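The representative statements of this mode are:
Rows between unbounded preceding and window.lead preceding // the range from the beginning to window.lead rows before the current row;
Range between unbounded preceding and window.lead preceding // the range from the beginning to window.lead less (or greater) than the current value.
Mode 8, shown in FIG. 6(d):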
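The representative statements of this mode are:
Rows between window.lag following and unbounded following // the range from window.lag rows after the current row to the end;
Range between window.lag following and unbounded following // the range from window.lag greater (or less) than the current value to the end.
Based on the 8 modes above, the corresponding analytic function processing algorithms can be readily implemented.
The method for implementing analytic functions based on MapReduce provided by this embodiment of the present disclosure can be applied to distributed databases built on the MapReduce framework (for example, the Tencent distributed data warehouse or the Hive data warehouse) to perform data analysis. It extends the functionality of such databases so that data analysis can be carried out within them.
Embodiment 3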
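An embodiment of the present disclosure provides a system for implementing analytic functions based on MapReduce, which can implement the method embodiments above. As shown in FIG. 6, the system may include a scan operator 51, a map operator 52, and an analysis operator 53. The scan operator 51 may form a scan operator module or be included in a scan operator module; in this embodiment, the terms "scan operator" and "scan operator module" are used interchangeably. The map operator 52 may form a map operator module or be included in a map operator module; in this embodiment, the terms "map operator" and "map operator module" are used interchangeably. The analysis operator 53 may form an analysis operator module or be included in an analysis operator module; in this embodiment, the terms "analysis operator" and "analysis operator module" are used interchangeably. The system may also include an analysis operator buffer (not shown in the figure), which is the same as the analysis operator buffer described above, so its detailed description is omitted here.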
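The scan operator 51 is configured to obtain data rows from a file block and send the data rows to the map operator 52;
The map operator 52 is configured to receive the data rows, determine the reduce key, partition key, and sort key of the analytic function, and send the data rows to the analysis operator 53 through the MapReduce framework, where the analysis operator 53 belongs to the Reduce side of the MapReduce framework;
The analysis operator 53 receives the data rows, analyzes them to obtain analysis results, and forwards the data rows and the analysis results to the successor operator.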
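Optionally, the map operator 52 may specifically be configured to, when the analytic function contains a partition clause and/or an order clause, use the columns in the partition clause and/or the columns in the order clause of the analytic function as the reduce key; or the map operator 52 may also be configured to, when the analytic function has no order clause but has the distinct keyword, use the distinct column as the reduce key; or the map operator 52 may also be configured to, when the analytic function contains no partition clause, no order clause, and no distinct keyword, designate an arbitrary constant as the reduce key.
The map operator 52 may also be configured to, when the analytic function contains a partition clause, use the columns in the partition clause of the analytic function as the partition key; or the map operator 52 may also be configured to, when the analytic function contains no partition clause, use the same constant as the reduce key as the partition key.
The map operator 52 may also be configured to, when the analytic function contains an order clause, use the columns in the order clause as the sort key.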
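The key-selection rules of the map operator can be summarised in a small Python sketch. The `AnalyticCall` dataclass, the function `determine_keys`, and the constant `"0"` are hypothetical names for illustration. The sketch also assumes that the rules are checked in the order listed and that a single shared constant is used whenever no partition clause is present.

```python
from dataclasses import dataclass, field
from typing import List

CONSTANT_KEY = ["0"]  # arbitrary constant; the value itself is an assumption


@dataclass
class AnalyticCall:
    """Clauses of one analytic function call (hypothetical representation)."""
    partition_by: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    distinct_cols: List[str] = field(default_factory=list)


def determine_keys(call):
    """Return (reduce_key, partition_key, sort_key) for an analytic function call."""
    if call.partition_by or call.order_by:
        # Partition columns and/or order columns form the reduce key.
        reduce_key = call.partition_by + call.order_by
    elif call.distinct_cols:
        # No order clause but a distinct keyword: the distinct column is the key.
        reduce_key = list(call.distinct_cols)
    else:
        reduce_key = list(CONSTANT_KEY)

    if call.partition_by:
        partition_key = list(call.partition_by)
    else:
        partition_key = list(CONSTANT_KEY)

    sort_key = list(call.order_by)  # empty when there is no order clause
    return reduce_key, partition_key, sort_key
```

For a call such as sum(x) over(partition by dept order by salary), the sketch returns dept and salary as the reduce key, dept as the partition key, and salary as the sort key, so that all rows of one department reach the same reducer in salary order.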
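Further, as shown in FIG. 7, the analysis operator 53 may include:
A storage module 531, which may be configured to receive the data row and store it in the analysis operator buffer for use by all analyzers;
A judgment module 532, which may be configured to parse the partition field and the sort field of the data row and determine whether the data row belongs to the current partition, where the current partition is the partition to which the previous data row received by the analysis operator belongs. If so, the analysis operator 53 may call the analyzer corresponding to the analytic function to analyze the data row, obtain an analysis result, and store the analysis result in the analyzer buffer. If not, the analysis operator 53 may end the analysis of the current partition, merge all data rows of the current partition stored in the analysis operator buffer with all analysis results of the current partition stored in the analyzer buffer into new data rows, and forward them to the successor operator (that is, a successor operator module). The analyzer and the analyzer buffer are the same as described above; they may be located in the system according to Embodiment 3 of the present invention, or they may be located outside the system and operatively coupled to it.
Optionally, if the analytic function does not require accumulation, the analysis operator 53 may, after obtaining the analysis result, directly merge the data row and the analysis result and forward them to the successor operator (that is, a successor operator module), without caching the data row or the analysis result.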
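The system for implementing analytic functions based on MapReduce provided by this embodiment of the present disclosure can be applied to distributed databases built on the MapReduce framework (for example, the Tencent distributed data warehouse or the Hive data warehouse) to perform data analysis. It extends the functionality of such databases so that analytic functions can be used for data analysis within them.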
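From the description of the embodiments above, those skilled in the art can clearly understand that the present disclosure may be implemented by software together with the necessary general-purpose hardware, or by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, hard disk, or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present disclosure.
The foregoing describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.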

Claims

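1. A method for implementing an analytic function based on MapReduce, comprising:
a table scan operator obtaining data rows from a file block and sending the data rows to a map operator;
the map operator receiving the data rows, determining a reduce key, a partition key, and a sort key of the analytic function, and sending the data rows to an analysis operator through the MapReduce framework, the analysis operator belonging to the Reduce side of the MapReduce framework;
the analysis operator receiving the data rows, analyzing the data rows to obtain analysis results, and forwarding the data rows and the analysis results to a successor operator.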
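2. The method according to claim 1, wherein determining the reduce key, the partition key, and the sort key of the analytic function comprises:
when the analytic function contains a partition clause and/or an order clause, using the columns in the partition clause and/or the columns in the order clause of the analytic function as the reduce key, or
when the analytic function has no order clause but has the distinct keyword, using the distinct column as the reduce key, or
when the analytic function contains no partition clause, no order clause, and no distinct keyword, designating an arbitrary constant as the reduce key;
when the analytic function contains a partition clause, using the columns in the partition clause of the analytic function as the partition key, or
when the analytic function contains no partition clause, using the same constant as the reduce key as the partition key;
when the analytic function contains an order clause, using the columns in the order clause as the sort key.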
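3. The method according to claim 1 or 2, wherein the analysis operator receiving the data rows, analyzing the data rows to obtain analysis results, and forwarding the data rows and the analysis results to the successor operator comprises:
the analysis operator receiving the data row and storing the data row in an analysis operator buffer for use by all analyzers;
the analysis operator parsing the partition field and the sort field of the data row and determining whether the data row belongs to the current partition, the current partition being the partition to which the previous data row received by the analysis operator belongs,
if so, calling the analyzer corresponding to the analytic function to analyze the data row, obtaining an analysis result, and storing the analysis result in an analyzer buffer,
if not, ending the analysis of the current partition, merging all data rows of the current partition stored in the analysis operator buffer with all analysis results of the current partition stored in the analyzer buffer into new data rows, and forwarding them to the successor operator.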
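4. The method according to claim 3, wherein, if the analytic function does not require accumulation, then after the analyzer corresponding to the analytic function is called to analyze the data row and the analysis result is obtained, the data row and the analysis result are directly merged and forwarded to the successor operator, without caching the data row or the analysis result.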
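5. The method according to claim 3, wherein the analysis operator buffer comprises a memory buffer and a disk buffer; the analysis operator buffer preferentially places a newly received data row in the memory buffer, and if the memory buffer is full, writes the older data rows in the memory buffer to the disk buffer to free memory buffer space.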
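6. The method according to claim 3, wherein the analyzer buffer comprises a memory buffer and a disk buffer, the memory buffer comprises an output buffer and an input buffer, and the analyzer buffer is used to buffer and update analysis results;
when the analyzer buffer buffers an analysis result, the analysis result is stored in the output buffer, and if the output buffer is full, the contents of the output buffer are written to the disk buffer to free output buffer space;
when the analyzer buffer updates an analysis result:
if the row to be updated is stored in the output buffer, the analysis result is updated directly from the row to be updated in the output buffer and the newly received data row,
if the row to be updated is stored in the input buffer, the analysis result is updated directly from the row to be updated in the input buffer and the newly received data row,
if the row to be updated is stored in the disk buffer, the contents of the input buffer are written to the disk buffer, and the buffer block containing the row to be updated is read from the disk buffer into the input buffer, so that the analysis result is updated from the row to be updated in the input buffer and the newly received data row.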
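7. A system for implementing an analytic function based on MapReduce, comprising a scan operator module, a map operator module, and an analysis operator module, wherein:
the scan operator is configured to obtain data rows from a file block and send the data rows to the map operator;
the map operator is configured to receive the data rows, determine a reduce key, a partition key, and a sort key of the analytic function, and send the data rows to the analysis operator through the MapReduce framework, the analysis operator belonging to the Reduce side of the MapReduce framework;
the analysis operator is configured to receive the data rows, analyze the data rows to obtain analysis results, and forward the data rows and the analysis results to a successor operator module.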
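8. The system according to claim 7, wherein the map operator module is configured to:
when the analytic function contains a partition clause and/or an order clause, use the columns in the partition clause and/or the columns in the order clause of the analytic function as the reduce key, or
when the analytic function has no order clause but has the distinct keyword, use the distinct column as the reduce key, or
when the analytic function contains no partition clause, no order clause, and no distinct keyword, designate an arbitrary constant as the reduce key;
the map operator module is further configured to:
when the analytic function contains a partition clause, use the columns in the partition clause of the analytic function as the partition key, or
when the analytic function contains no partition clause, use the same constant as the reduce key as the partition key;
the map operator is further configured to, when the analytic function contains an order clause, use the columns in the order clause as the sort key.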
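9. The system according to claim 7 or 8, wherein the analysis operator module comprises:
a storage module, configured to receive the data row and store the data row in an analysis operator buffer for use by all analyzers;
a judgment module, configured to parse the partition field and the sort field of the data row and determine whether the data row belongs to the current partition, the current partition being the partition to which the previous data row received by the analysis operator belongs,
if so, the analysis operator calls the analyzer corresponding to the analytic function to analyze the data row, obtains an analysis result, and stores the analysis result in an analyzer buffer;
if not, the analysis operator ends the analysis of the current partition, merges all data rows of the current partition stored in the analysis operator buffer with all analysis results of the current partition stored in the analyzer buffer into new data rows, and forwards them to the successor operator module.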
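10. The system according to claim 9, wherein, if the analytic function does not require accumulation, the analysis operator, after obtaining the analysis result, directly merges the data row and the analysis result and forwards them to the successor operator, without caching the data row or the analysis result.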
PCT/CN2013/084860 2012-12-27 2013-10-09 Method and system for implementing analytic function based on MapReduce WO2014101520A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/750,887 US20150356162A1 (en) 2012-12-27 2015-06-25 Method and system for implementing analytic function based on mapreduce

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210580817.1A CN103902592B (zh) 2012-12-27 2012-12-27 Method and system for implementing analytic function based on MapReduce
CN201210580817.1 2012-12-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/750,887 Continuation US20150356162A1 (en) 2012-12-27 2015-06-25 Method and system for implementing analytic function based on mapreduce

Publications (1)

Publication Number Publication Date
WO2014101520A1 true WO2014101520A1 (zh) 2014-07-03

Family

ID=50993920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/084860 WO2014101520A1 (zh) 2012-12-27 2013-10-09 Method and system for implementing analytic function based on MapReduce

Country Status (3)

Country Link
US (1) US20150356162A1 (zh)
CN (1) CN103902592B (zh)
WO (1) WO2014101520A1 (zh)

Also Published As

Publication number Publication date
US20150356162A1 (en) 2015-12-10
CN103902592B (zh) 2018-02-27
CN103902592A (zh) 2014-07-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13869767

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 02/09/2015)

122 Ep: pct application non-entry in european phase

Ref document number: 13869767

Country of ref document: EP

Kind code of ref document: A1