WO2013078583A1 - Method and apparatus for optimizing data access, method and apparatus for optimizing data storage - Google Patents

Info

Publication number
WO2013078583A1
WO2013078583A1 PCT/CN2011/083021 CN2011083021W WO2013078583A1 WO 2013078583 A1 WO2013078583 A1 WO 2013078583A1 CN 2011083021 W CN2011083021 W CN 2011083021W WO 2013078583 A1 WO2013078583 A1 WO 2013078583A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
range
input
key value
Application number
PCT/CN2011/083021
Other languages
French (fr)
Chinese (zh)
Inventor
智伟
赵智峰
周帅锋
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to CN201180002537.6A (CN102725753B)
Priority to PCT/CN2011/083021 (WO2013078583A1)
Publication of WO2013078583A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method and an apparatus for optimizing data access, and a method and an apparatus for optimizing data storage. The method for optimizing data access comprises: a master controller receiving a user's request to access a data table in HBASE, the request carrying data input range information that comprises a plurality of data input ranges; determining input block information according to region information of the data table and the data input range information; determining the number of Map tasks according to the input block information; assigning slave processors, according to the number of Map tasks, to read data in the data table; and returning the data read by the slave processors to the user.

Description

Method and apparatus for optimizing data access, method and apparatus for optimizing data storage
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a method and apparatus for optimizing data access and a method and apparatus for optimizing data storage.
Background
MapReduce, a method for processing large-scale data in parallel, is widely used in large-scale data analysis. HBASE (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. HBASE can serve as both a data source and a data destination for MapReduce, so that MapReduce can process data stored in HBASE or save its output data in HBASE.
When HBASE serves as the data source of MapReduce, the table that MapReduce accesses and the data range it accesses are defined by a table name and a range query object, where the range query object defines the data query range by specifying a start key value and an end key value.
When a user program calls the MapReduce function, the following operations are triggered:
(1) MapReduce obtains all region information of the specified table from HBASE according to the table name, and obtains the block information and the number of blocks M by comparing the data range defined by the range query object with all the region information.
(2) The master controller creates Map tasks according to the block information and the number of blocks; each Map task processes the data in one block.
(3) The remaining steps are the same as in the basic MapReduce process.
When HBASE serves as the data destination of MapReduce, the table in which the MapReduce data will be saved is defined by a table name, and the main operations are as follows:
(1) The MapReduce function library in the user program first divides the input file into M blocks, accesses the HBASE metadata table according to the table in which the data is to be saved to obtain the range information corresponding to that table, and specifies the number R of Reduce tasks according to the range information.
(2) The master controller obtains the input block information and creates one Map task for each block. It creates Reduce tasks according to the configured number of Reduce tasks. In total, M Map tasks and R Reduce tasks need to be dispatched, and the master controller dispatches them to the slave processors.
(3) The remaining steps are the same as in the basic MapReduce process.
As the above processes show, the prior art has at least the following problems:
When HBASE serves as the data source of MapReduce, the data range that MapReduce accesses is defined by the data query range of a single range query object. To avoid missing records that meet the requirements, the data range specified by the range query object has to be widened, so it covers too many regions and the covered regions contain a large amount of invalid data. Besides reading the qualifying records in those regions, the MapReduce program must also read a large number of invalid records, compare them, and discard them. This causes many wasted operations and severely reduces data processing speed.
When HBASE serves as the data destination of MapReduce, if the data is saved to only a limited range of regions, multiple useless Reduce processes are created, which wastes scheduling time and system resources and reduces data processing speed.
Summary of the Invention
In one aspect, embodiments of the present invention provide a method and apparatus for optimizing data access, so as to reduce the reading of invalid data and improve data processing efficiency.
In another aspect, embodiments of the present invention provide a method and apparatus for optimizing data storage, so as to reduce the execution of useless MapReduce tasks and improve data processing efficiency.
To solve the above technical problems, the embodiments of the present invention adopt the following technical solutions:
A method for optimizing data access, comprising:
receiving, by a master controller, a user's request to access a data table in HBASE, where the request carries data input range information and the data input range information includes multiple data input ranges;
determining input block information according to region information of the data table and the data input range information;
determining the number of Map tasks according to the input block information;
assigning slave processors, according to the number of Map tasks, to read data in the data table; and
returning the data read by the slave processors to the user.
A method for optimizing data storage, comprising:
receiving, by a master controller, a user's request to store data in a data table in HBASE, where the request carries one or more data output ranges;
determining output block information according to region information of the data table and the data output ranges;
determining the number of Reduce tasks according to the output block information; and
assigning slave processors, according to the number of Reduce tasks, to write data to the data table.
An apparatus for optimizing data access, comprising:
a receiving unit, configured to receive a user's request to access a data table in HBASE, where the request carries data input range information and the data input range information includes multiple data input ranges;
an input blocking unit, configured to determine input block information according to region information of the data table and the data input range information;
a task determining unit, configured to determine the number of Map tasks according to the input block information;
an assigning unit, configured to assign slave processors, according to the number of Map tasks, to read data in the data table; and
a sending unit, configured to return the data read by the slave processors to the user.
An apparatus for optimizing data storage, comprising:
a receiving unit, configured to receive a user's request to store data in a data table in HBASE, where the request carries one or more data output ranges;
an output blocking unit, configured to determine output block information according to region information of the data table and the data output ranges;
a task determining unit, configured to determine the number of Reduce tasks according to the output block information; and
an assigning unit, configured to assign slave processors, according to the number of Reduce tasks, to write data to the data table.
With the method and apparatus for optimizing data access in the embodiments of the present invention, when HBASE serves as the data source of MapReduce, specifying multiple data input ranges reduces the reading of invalid data and improves data processing efficiency. Correspondingly, with the method and apparatus for optimizing data storage in the embodiments of the present invention, when HBASE serves as the data destination of MapReduce, specifying multiple data output ranges reduces the execution of useless MapReduce tasks and improves data processing efficiency.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of the operation flow of existing MapReduce;
Figure 2 is a schematic diagram of the data model of a table in an existing HBASE database;
Figure 3 is a schematic diagram of the operation flow of MapReduce in the prior art when HBASE serves as the data source;
Figure 4 is a schematic diagram of the operation flow of MapReduce in the prior art when HBASE serves as the data destination;
Figure 5 is a flowchart of a method for optimizing data access according to an embodiment of the present invention;
Figure 6 is a schematic diagram of a data table in HBASE according to an embodiment of the present invention;
Figure 7 is a schematic diagram of querying the data table shown in Figure 6 according to the prior art;
Figure 8 is a schematic diagram of querying the data table shown in Figure 6 using the method of an embodiment of the present invention;
Figure 9 is a flowchart of a method for optimizing data storage according to an embodiment of the present invention;
Figure 10 is a schematic diagram of storing data to the data table shown in Figure 4 according to the prior art;
Figure 11 is a schematic diagram of storing data to the data table shown in Figure 4 using the method of an embodiment of the present invention;
Figure 12 is a schematic structural diagram of an apparatus for optimizing data access according to an embodiment of the present invention;
Figure 13 is a schematic structural diagram of an input blocking unit in the apparatus for optimizing data access according to an embodiment of the present invention;
Figure 14 is a schematic structural diagram of an apparatus for optimizing data storage according to an embodiment of the present invention;
Figure 15 is a schematic structural diagram of an output blocking unit in the apparatus for optimizing data storage according to an embodiment of the present invention.
Detailed Description
To help a person skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.
The operation flow of existing MapReduce is briefly described first.
As shown in Figure 1, MapReduce involves three independent entities: the user program, the master controller, and the slave processors. The master controller coordinates the running of a job and dispatches tasks to the slave processors; the slave processors handle the Map tasks and Reduce tasks once the job is running. When the user program calls the MapReduce function, the following operations are triggered:
1) The MapReduce function library in the user program first divides the input file into M blocks.
2) The master controller obtains the input block information and creates one Map task for each block. It creates Reduce tasks according to the configured number of Reduce tasks. In total, M Map tasks and R Reduce tasks need to be dispatched. The master controller dispatches the tasks to the slave processors.
3) A slave processor that has been assigned a Map task reads and processes the corresponding input block.
4) The intermediate results that the Map task buffers in memory are periodically written to the local disk, and a partition function divides the data into R partitions. The locations of the intermediate results on the local disk are sent back to the master controller, which is responsible for forwarding these locations to the slave processors running Reduce tasks.
5) When the master controller notifies a slave processor running a Reduce task of the locations of the intermediate data, that slave processor reads the buffered intermediate data from the local disks of the slave processors that ran the Map tasks.
6) The slave processor running the Reduce task processes the intermediate data and passes the set of intermediate values to the user-defined Reduce function. The output of the Reduce function for this Reduce partition is written to a final output file.
7) When all Map tasks and Reduce tasks have completed, the master controller wakes up the user program, and the MapReduce call returns to the calling point in the user program.
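The flow above can be sketched as a toy, single-process job. This is only an illustration under stated assumptions: the function names (`map_reduce`, `word_map`, `count_reduce`) are hypothetical, the input is split round-robin, and the partition function is a plain hash; none of this is taken from the patent or from Hadoop.

```python
from collections import defaultdict

def map_reduce(records, m, r, map_fn, reduce_fn):
    """Run a toy MapReduce job in one process (illustrative only)."""
    # Step 1: divide the input into M blocks (round-robin slicing here).
    blocks = [records[i::m] for i in range(m)]
    # Steps 3-4: each Map task emits (key, value) pairs, and a partition
    # function (here: hash modulo R) assigns every key to one of R partitions.
    partitions = [defaultdict(list) for _ in range(r)]
    for block in blocks:
        for record in block:
            for key, value in map_fn(record):
                partitions[hash(key) % r][key].append(value)
    # Steps 5-6: each Reduce task processes the values of one partition.
    output = {}
    for part in partitions:
        for key, values in part.items():
            output[key] = reduce_fn(key, values)
    return output

def word_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(_key, values):
    return sum(values)

result = map_reduce(["a b a", "b c", "a"], m=2, r=3,
                    map_fn=word_map, reduce_fn=count_reduce)
```

Here M and R are fixed up front, which mirrors the point made in the text: the number of tasks is decided before any data is read.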
HBASE is a distributed column-oriented storage database. The data model of its tables is shown in Figure 2, which shows the data in one HBASE table and includes the following information:
Row key (RowKey): the identifier of each row of data, similar to the primary key of a table in a relational database.
Column family (Column Family): an HBASE table is a collection of column families. A column family is similar to a column of a table in a relational database and must be defined in advance. Unlike a column in a relational database, a column family can contain multiple columns.
Column (Column): a label under a column family that can be added freely when data is written.
Column value (Value): similar to the value of a column in a relational database.
Region (Region): the data in a table is organized into regions of a certain size. When the data under a column family in the table reaches a threshold, the region splits: a new region is created, and part of the data under the original column family is moved to the new region in row key order.
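As a rough mental model, the table can be pictured as a nested mapping from row key to column family to column to value, with a region holding a contiguous, sorted run of row keys. The sketch below is an assumption-laden illustration: the table contents, the row-count threshold, and the midpoint split rule in `maybe_split` are hypothetical and are not how HBASE itself decides when or where to split.

```python
# Hypothetical table: row key -> column family -> column -> value.
table = {
    "Movie1#20110801": {"access": {"phone": 12, "tv": 3}},
    "Movie1#20110802": {"access": {"phone": 7}},
    "Movie2#20110801": {"access": {"tv": 5, "pc": 9}},
    "Movie2#20110802": {"access": {"pc": 4}},
}

def maybe_split(row_keys, threshold):
    """Split one region into two, in row key order, once it exceeds threshold rows."""
    keys = sorted(row_keys)
    if len(keys) <= threshold:
        return [keys]                      # below the threshold: keep one region
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]        # upper half moves to the new region

regions = maybe_split(table.keys(), threshold=3)
```

Because the split follows row key order, every region ends up covering one contiguous key range, which is what lets a start key and an end key describe it.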
HBASE can serve as a data source for MapReduce; that is, MapReduce can process data stored in HBASE.
As shown in Figure 3, when the user program calls the MapReduce function, the following operations are triggered:
1) MapReduce obtains all Region information of the specified table from HBASE according to the table name, and obtains the block information and the number of blocks by comparing the data range defined by the range query object with the ranges of all Regions.
2) The master controller creates Map tasks according to the block information and the number of blocks; each Map task processes the data in one block.
3) The master controller dispatches the Map tasks to the slave processors.
4) The remaining steps are the same as in the basic MapReduce process.
HBASE can also serve as a data destination for MapReduce; that is, MapReduce can store its output data in HBASE, with the table that will store the MapReduce data defined by a table name.
As shown in Figure 4, when the user program calls the MapReduce function, the following operations are triggered:
1) The MapReduce function in the user program first divides the input file into multiple blocks, accesses the HBASE data table according to the table in which the data is to be saved to obtain the Region information of that table, and specifies the number of Reduce tasks according to the Region information.
2) The master controller obtains the block information of the input file and creates one Map task for each block. It creates Reduce tasks according to the configured number of Reduce tasks.
3) The master controller dispatches the Map tasks and Reduce tasks to the slave processors.
4) The remaining steps are the same as in the basic MapReduce process.
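The cost of deriving the Reduce count from the Region information alone can be illustrated with a small sketch. It assumes, hypothetically, one Reduce task per Region and routes each output row key to the Region whose start key precedes it; the names `reducer_for` and `region_starts` and the sample keys are illustrative only.

```python
import bisect

def reducer_for(row_key, region_starts):
    """Index of the Region (and so the Reduce task) whose key range holds row_key."""
    # region_starts is sorted; the first Region starts at "" (unbounded below).
    return bisect.bisect_right(region_starts, row_key) - 1

region_starts = ["", "h", "p"]        # three hypothetical Regions
reduce_count = len(region_starts)      # assumption: one Reduce task per Region
output_keys = ["a", "b", "c"]          # all output happens to fall in the first Region

used = {reducer_for(key, region_starts) for key in output_keys}
idle = reduce_count - len(used)        # Reduce tasks that are scheduled but receive no data
```

In this sketch two of the three Reduce tasks receive nothing, which is the kind of useless task the background section describes.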
To address the above problems in the prior art, the method and apparatus for optimizing data access in the embodiments of the present invention reduce the reading of invalid data and improve data processing efficiency by specifying multiple data input ranges when HBASE serves as the data source of MapReduce.
Correspondingly, the method and apparatus for optimizing data storage in the embodiments of the present invention reduce the execution of useless MapReduce tasks and improve data processing efficiency by specifying multiple data output ranges when HBASE serves as the data destination of MapReduce.
The method and apparatus for optimizing data access and the method and apparatus for optimizing data storage according to the embodiments of the present invention are described in detail below.
Figure 5 is a flowchart of the method for optimizing data access according to an embodiment of the present invention. The method includes the following steps:
Step 501: The master controller receives a user's request to access a data table in HBASE. The request carries data input range information, which includes multiple data input ranges.
The multiple data input ranges may be delimited by a separator; accordingly, the master controller may split the input string on the separator to obtain the multiple data input ranges.
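Splitting the input string on a separator might look like the sketch below. The patent does not name the separator, so the choice of `;` between ranges and `,` between the start and end key, as well as the function name `parse_ranges`, are assumptions made for illustration.

```python
def parse_ranges(text, range_sep=";", key_sep=","):
    """Split 'start,end;start,end;...' into a list of (start, end) pairs."""
    ranges = []
    for part in text.split(range_sep):
        part = part.strip()
        if not part:
            continue                       # tolerate a trailing separator
        start, end = (key.strip() for key in part.split(key_sep))
        ranges.append((start, end))
    return ranges

ranges = parse_ranges("20010101,20010131; 20010201,20010228;20010301,20010331")
```

The keys are kept as strings so that they compare the same way row keys do.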
The data input ranges may take various forms, for example any of the following:
a) A list containing multiple range query objects, where each range query object is one data input range, for example:
SCAN1 (20010101, 20010131),
SCAN2 (20010201, 20010228),
SCAN3 (20010301, 20010331).
b) A list containing multiple pairs of start and end values, where each pair represents one data range, for example:
(20010101, 20010131),
(20010201, 20010228),
(20010301, 20010331).
c) A file: the multiple data input ranges are saved in a file, and the master controller obtains them by reading the file, for example:
(20010101, 20010131), (20010201, 20010228), (20010301, 20010331).
Step 502: Determine input block information according to the region information of the data table and the data input range information.
In this step, the start key value and end key value of every region in the data table may first be obtained. Each data input range in the data input range information is then compared with the start and end key values of each region to obtain the coverage of each data input range within each region, and the input block information can be determined from that coverage.
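The comparison described here amounts to intersecting every input range with every region. The following is a minimal sketch, assuming half-open key intervals `[start, end)` and string keys compared lexicographically; the patent does not state whether the end key is inclusive, and the name `coverage` and the sample regions are hypothetical.

```python
def coverage(regions, ranges):
    """Intersect each input range with each region.

    regions: list of (name, region_start, region_end), half-open intervals
    ranges:  list of (start, end), half-open intervals
    Returns a list of (region_name, start, end) pieces, one per non-empty overlap.
    """
    pieces = []
    for start, end in ranges:
        for name, region_start, region_end in regions:
            lo = max(start, region_start)
            hi = min(end, region_end)
            if lo < hi:                    # the range actually overlaps this region
                pieces.append((name, lo, hi))
    return pieces

regions = [("R0", "a", "h"), ("R1", "h", "p")]
pieces = coverage(regions, [("b", "d"), ("f", "j")])
```

The second range straddles the boundary at `h`, so it yields one piece in each region; each piece is a candidate input block.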
Alternatively, after the coverage of each data input range within each region is obtained, coverage ranges that belong to the same region and are contiguous may first be merged, and the input block information is then determined from the merged coverage.
To further improve processing efficiency, the data input ranges in the data input range information may be sorted before they are compared with the start and end key values of each region, and the sorted data input ranges are then compared in order with the start and end key values of each region.
Of course, this sorting may be skipped. Instead, after the coverage of each data input range within each region is obtained, all the coverage ranges are sorted, and those that belong to the same region and are contiguous are then merged.
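The sort-then-merge variant can be sketched as follows. It again assumes half-open intervals, so two pieces are contiguous when one ends exactly where the next begins; the sketch also merges overlapping pieces, which goes slightly beyond what the text requires. The name `merge_coverage` is hypothetical.

```python
def merge_coverage(pieces):
    """Sort coverage pieces and merge those in the same region that touch or overlap.

    pieces: list of (region_name, start, end), half-open intervals
    """
    merged = []
    for name, lo, hi in sorted(pieces):
        if merged and merged[-1][0] == name and merged[-1][2] >= lo:
            prev_name, prev_lo, prev_hi = merged[-1]
            merged[-1] = (prev_name, prev_lo, max(prev_hi, hi))   # extend the last piece
        else:
            merged.append((name, lo, hi))
    return merged

blocks = merge_coverage([("R0", "f", "h"), ("R0", "b", "d"),
                         ("R0", "d", "f"), ("R1", "h", "j")])
```

Merging reduces the number of input blocks, and therefore the number of Map tasks, without changing which rows are read.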
The input block information may include the number of input blocks and the start key value and end key value of each input block.
Step 503: Determine the number of Map tasks according to the input block information.
Specifically, the number of Map tasks may be determined from the number of input blocks, with each Map task corresponding to one input block.
Step 504: Assign slave processors, according to the number of Map tasks, to read the data in the data table.
Specifically, the master controller may assign one slave processor to each Map task and send the start key value and end key value of the input block corresponding to that Map task to the slave processor. The slave processor then reads the data in the data table according to the start and end key values of the input block.
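Steps 503 and 504 can be pictured as building one task record per input block and handing it to a slave processor. In this sketch the round-robin choice of slave processor and the names `assign_map_tasks`, `w1`, and `w2` are assumptions; the text only requires that each Map task gets a slave processor and the key bounds of its block.

```python
from itertools import cycle

def assign_map_tasks(blocks, workers):
    """Create one Map task per input block and assign slave processors round-robin."""
    tasks = []
    for task_id, ((start, end), worker) in enumerate(zip(blocks, cycle(workers))):
        # The slave processor receives the start and end key of its block.
        tasks.append({"task": task_id, "worker": worker, "start": start, "end": end})
    return tasks

input_blocks = [("b", "h"), ("h", "j"), ("p", "q")]
tasks = assign_map_tasks(input_blocks, ["w1", "w2"])
```

The number of Map tasks equals the number of input blocks, here three.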
Step 505: Return the data read by the slave processors to the user.
It should be noted that the user's request to access a data table in HBASE also includes the table name of the data table. In practice, after receiving the request, the master controller may first check the table name and the data input ranges carried in the request to determine whether this information is correct. That is, according to the table name in the request, it checks whether HBASE contains a corresponding data table and whether the data input ranges fall within the range of data stored in that table. If the information is correct, step 502 is performed.
Specifically, the HBASE system table may be queried. If the system table does not contain the table name, the table name in the request is wrong; if it does, the table name is correct. The Region information of the data table corresponding to the table name is then obtained and compared with the data input ranges: if a data input range is smaller than the start key value of the table or larger than the end key value of the table, the data input range is incorrect; otherwise it is correct.
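The two checks can be sketched as below. The `system_tables` dictionary stands in for the HBASE system table and maps a table name to the table's first and last key; that structure, the function name, and the returned strings are hypothetical and chosen only for illustration.

```python
def validate_request(system_tables, table_name, ranges):
    """Check the table name, then check every input range against the table bounds."""
    if table_name not in system_tables:
        return "unknown table"
    table_start, table_end = system_tables[table_name]
    for start, end in ranges:
        # A range is rejected if it starts before the table or ends after it.
        if start < table_start or end > table_end:
            return "range out of bounds"
    return "ok"

system_tables = {"movie_access": ("Movie1#20110101", "Movie3#20111231")}
status = validate_request(system_tables, "movie_access",
                          [("Movie1#20110801", "Movie1#20110807")])
```

Only a request that passes both checks would continue to step 502.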
In addition, the method for optimizing data access in the embodiments of the present invention can be made compatible with the existing method in practice. For example, after receiving a user's request to access a data table in HBASE, the master controller determines whether the request carries multiple data input ranges. If so, it accesses the data table in HBASE according to steps 502 to 504 above; otherwise it accesses the data table in the prior-art manner.
The following example further illustrates how MapReduce processing in the method for optimizing data access in the embodiments of the present invention differs from the prior art.
Consider the data table in HBASE shown in Figure 6. It is a movie access information table that records the number of times each movie is accessed by each type of terminal on each day.
The Region information of the data table shown in Figure 6 is listed in Table 1 below.
Table 1: Region information of the data table shown in Figure 6 (provided as image imgf000011_0001 in the original publication)
If MapReduce processing is performed on the data table shown in Figure 6 according to the prior art to total the accesses of each movie from the different terminals during the first week of August, the processing is as shown in Figure 7 and proceeds as follows:
The start key value of the range query object is set to Movie 1#20110801 and the end key value to Movie 3#20110807.
Comparing the start and end key values of the range query object with the Region information yields the following valid input data ranges:
Region0 (Movie 1#20110801, Movie 1#20110808)
Region1 (Movie 2#20110102, Movie 2#20110808)
Region2 (Movie 3#20110101, Movie 3#20110807)
这些有效输入数据范围属于三个不同的 Region, 实际上其中每个有效数 据范围中可能会包括无效数据。  These valid input data ranges belong to three different Regions, and virtually every valid data range may include invalid data.
将该有效输入数据范围划分为三个不同的分块信息, 分别为:  The valid input data range is divided into three different block information, which are:
第一个分块信息: 影片 1#20110801 影片 1#20110808 , 其中影片 !#20110808为无效数据。 第二个分块信息: 影片 2#20110102 影片 2#20110808 , 其中影片 2#20110101—影片 2#20110731 , 影片 2#20110808为无效数据。 The first block information: Movie 1#20110801 Movie 1#20110808, where the movie! #20110808 is invalid data. The second block information: film 2#20110102 film 2#20110808, where the film 2#20110101—movie 2#20110731, the film 2#20110808 is invalid data.
第三个分块信息: 影片 3#20110101 影片 1#20110807 , 其中影片 3#20110101—影片 3#20110731为无效数据。  The third block information: film 3#20110101 film 1#20110807, where the film 3#20110101-film 3#20110731 is invalid data.
因此, 主控制器根据这三个分块信息启动三个 Map任务, 分派给从处理 器进行处理。 三个 Map任务分别处理不同分块中的数据, 而上述无效数据也 被包含在上述分块中。 从处理器在处理数据时会从 HBASE表中读取各分块 中的每条记录, 判断是否是 8月第一周数据, 如果是则进行相加, 如果不是 则直接丢弃不进行处理。 如图 7所示, 粗线框内的数据为有效数据, 其它为 无效数据。  Therefore, the host controller initiates three Map tasks based on these three block information and dispatches them to the slave processor for processing. The three Map tasks process the data in the different blocks, and the invalid data is also included in the above blocks. When the processor processes the data, it reads each record in each block from the HBASE table to determine whether it is the first week of August data, and if so, adds it. If not, it discards it and does not process it. As shown in Figure 7, the data in the thick line frame is valid data, and the others are invalid data.
由此可见, 按照 Map任务会读取大量无效数据进行判断, 造成大量无效 操作, 降低了数据处理执行速度, 使得效率较低。  It can be seen that according to the Map task, a large amount of invalid data is read to judge, which causes a large number of invalid operations, which reduces the data processing execution speed and makes the efficiency low.
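The cost of the single-range scan can be made concrete with a toy simulation. The table contents below are invented for illustration (Figure 6 is not reproduced here): three movies with a handful of daily rows each, ASCII row keys such as `movie1#20110801` standing in for the keys in the text, and an inclusive end key.

```python
# Hypothetical table: 11 daily rows per movie (2 in January, 1 in July, 8 in August).
dates = ["20110101", "20110102", "20110731"] + [f"201108{d:02d}" for d in range(1, 9)]
rows = sorted(f"movie{m}#{d}" for m in (1, 2, 3) for d in dates)


def scan(rows, start, end):
    """Return every row key in [start, end], compared lexicographically."""
    return [key for key in rows if start <= key <= end]


def in_first_week_of_august(key):
    """Filter applied by the Map task after the row has already been read."""
    date = key.split("#")[1]
    return "20110801" <= date <= "20110807"


# Prior-art style: one range query object spanning all three movies.
read = scan(rows, "movie1#20110801", "movie3#20110807")
valid = [key for key in read if in_first_week_of_august(key)]
print(len(read), len(valid), len(read) - len(valid))  # 29 21 8
```

With this made-up data, 8 of the 29 rows read are discarded after being fetched; the share of wasted reads grows with the amount of data stored outside the queried week.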
If MapReduce processing is instead performed on the data table shown in Figure 6 using the method for optimizing data access of this embodiment, to aggregate the same totals for the first week of August, the processing is as shown in Figure 8 and proceeds as follows:
1. The main controller receives the user's data access request, which carries information on multiple data input ranges. In this example there are three data input ranges:
SCAN1 (start key: Movie 1#20110801, end key: Movie 1#20110807);
SCAN2 (start key: Movie 2#20110801, end key: Movie 2#20110807);
SCAN3 (start key: Movie 3#20110801, end key: Movie 3#20110807).
2. MapReduce is invoked to divide each data input range into blocks:
Comparing the first data input range with the Region information yields the effective input data range:
Region0 (Movie 1#20110801, Movie 1#20110807);
Comparing the second data input range with the Region information yields the effective input data range:
Region1 (Movie 2#20110801, Movie 2#20110807);
Comparing the third data input range with the Region information yields the effective input data range:
Region2 (Movie 3#20110801, Movie 3#20110807).
These effective input data ranges belong to three different Regions, so the three effective input data ranges yield three different pieces of input block information:
SPLIT0 (Movie 1#20110801 to Movie 1#20110807)
SPLIT1 (Movie 2#20110801 to Movie 2#20110807)
SPLIT2 (Movie 3#20110801 to Movie 3#20110807).
3. Based on this input block information and the number of blocks, the main controller starts three Map tasks and dispatches them to slave processors. Each Map task processes the data of a different input block. As Figure 6 shows, all the data in these three input blocks is valid data; none of it is invalid.
As this shows, the method for optimizing data access in this embodiment subdivides the query into multiple data input ranges (that is, query objects) and bounds the effective input data ranges with a few simple comparison operations. Far less invalid data is read during processing, which greatly improves processing efficiency.
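The block division in step 2 amounts to intersecting each data input range with each Region. The sketch below is one possible reading, not the patent's implementation: `compute_input_blocks` is a hypothetical name, the Region bounds are illustrative stand-ins for Table 1, keys are ASCII stand-ins compared as strings, and both ends of every range are treated as inclusive.

```python
def compute_input_blocks(regions, input_ranges):
    """Intersect every (start, end) input range with every (name, start, end) Region."""
    blocks = []
    for range_start, range_end in sorted(input_ranges):
        for name, region_start, region_end in regions:
            start = max(range_start, region_start)
            end = min(range_end, region_end)
            if start <= end:  # non-empty overlap: this Region is covered
                blocks.append((name, start, end))
    return blocks


# Illustrative Region bounds (assumed, since Table 1 is only available as an image).
regions = [
    ("Region0", "movie1#20110101", "movie1#20110808"),
    ("Region1", "movie2#20110101", "movie2#20110808"),
    ("Region2", "movie3#20110101", "movie3#20110808"),
]
scans = [
    ("movie1#20110801", "movie1#20110807"),  # SCAN1
    ("movie2#20110801", "movie2#20110807"),  # SCAN2
    ("movie3#20110801", "movie3#20110807"),  # SCAN3
]

blocks = compute_input_blocks(regions, scans)
for block in blocks:
    print(block)
print("Map tasks:", len(blocks))  # Map tasks: 3
```

Each resulting block covers exactly the first week of August for one movie, so one Map task per block reads no rows outside the requested ranges.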
Figure 9 is a flowchart of the method for optimizing data storage according to an embodiment of the present invention. The method includes the following steps:
Step 901: The main controller receives a user's request to store data in a data table in HBASE. The request carries one or more data output ranges.
Multiple data output ranges may be separated by a delimiter, in which case the main controller splits the output string at the delimiter to obtain the individual data output ranges.
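Splitting a delimited string into data output ranges might look like the sketch below. The patent does not name the delimiter, so the `;` between ranges and the `,` between the start and end of a range are assumptions, as is the function name `parse_output_ranges`.

```python
def parse_output_ranges(text, range_sep=";", key_sep=","):
    """Split 'start,end;start,end;...' into a list of (start, end) tuples."""
    ranges = []
    for part in text.split(range_sep):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing delimiter
        start, end = (key.strip() for key in part.split(key_sep))
        ranges.append((start, end))
    return ranges


print(parse_output_ranges("20010101,20010131; 20010201,20010228; 20010301,20010331"))
```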
A data output range may take various forms, for example any of the following:
a) List form, where the list contains a single pair of start and end values, for example:
(20010101, 20010331).
c) List form, where the list contains multiple start/end pairs and each pair represents one data output range, for example:
(20010101, 20010131),
(20010201, 20010228),
(20010301, 20010331).
d) File form, in which one or more data output ranges are saved in a file and the main controller obtains them by reading the file, for example: (20010101, 20010131), (20010201, 20010228), (20010301, 20010331).
Step 902: Determine output block information according to the partition information of the data table and the data output ranges.
In this step, the start key and end key of every partition in the data table may first be obtained. Each data output range is then compared with the start key and end key of each partition to find the partitions covered by the data output range, and the output block information is determined from that partition information.
The output block information includes the number of output blocks.
Step 903: Determine the number of Reduce tasks according to the output block information.
Specifically, the number of Reduce tasks may be determined from the number of output blocks, with each Reduce task corresponding to one output block.
Step 904: Allocate slave processors, according to the number of Reduce tasks, to write data to the data table.
Specifically, the main controller may allocate one slave processor to each Reduce task and send the data to be stored for that Reduce task to the slave processor. The slave processor then writes that data into the partition corresponding to the output block.
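Steps 902 to 904 can be sketched as follows, under stated assumptions: the names `covered_regions` and `plan_reduce_tasks` are hypothetical, the Region bounds are invented so that the new keys fall inside Region2, range ends are inclusive, and slave processors are assigned round-robin, which the patent does not prescribe.

```python
from itertools import cycle


def covered_regions(regions, output_ranges):
    """Step 902: names of Regions whose key range overlaps any output range."""
    covered = []
    for name, region_start, region_end in regions:
        if any(out_start <= region_end and region_start <= out_end
               for out_start, out_end in output_ranges):
            covered.append(name)
    return covered


def plan_reduce_tasks(regions, output_ranges, slaves):
    """Steps 903 and 904: one Reduce task per output block, each given a slave."""
    blocks = covered_regions(regions, output_ranges)
    return [{"region": region, "slave": slave}
            for region, slave in zip(blocks, cycle(slaves))]


# Hypothetical Region bounds; keys are ASCII stand-ins compared as strings.
regions = [
    ("Region0", "movie1#20110101", "movie1#20111231"),
    ("Region1", "movie2#20110101", "movie2#20111231"),
    ("Region2", "movie3#20110101", "movie3#20111231"),
]
plan = plan_reduce_tasks(
    regions,
    [("movie3#20110809", "movie3#20110812")],
    ["slave-a", "slave-b"],
)
print(len(plan), plan[0]["region"])  # 1 Region2
```

Only the Region that actually receives data gets a Reduce task; the other two Regions produce no task at all.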
The following example further illustrates how MapReduce processing in the method for optimizing data storage of this embodiment differs from the prior art.
Suppose Table 2 below contains the latest access information for Movie 3, which needs to be saved to a data table in HBASE; the output table is the movie access information table shown in Figure 6.
Table 2:
[Table 2 is an image in the original publication: Figure imgf000014_0001]
If MapReduce processing is performed on the information in Table 2 according to the prior art, the processing is as follows. The output data ranges are obtained from the Region information of the data table shown in Figure 6:
Region0 (Movie 1#20110104, Movie 1#20110808);
Region1 (Movie 2#20110101, Movie 2#20110808);
Region2 (Movie 3#20110101, Movie 3#20110808).
These output data ranges belong to three different Regions, so the main controller sets up three different Reduce tasks and dispatches them to slave processors. Each Reduce task handles one output data range and stores in the corresponding Region the data whose key lies between the start key and the end key of that range.
However, the actual data in Table 2 that must be saved to the HBASE data table spans Movie 3#20110809 to Movie 3#20110812, which falls within the data range of the Reduce task for Region2, so the Reduce tasks for Region0 and Region1 process no data at all. In other words, because the actual input data matches the data range of Region2, the data in Table 2 is stored in Region2, while the other two Reduce tasks produce no output and are invalid Reduce tasks, as shown in Figure 10.
As this shows, the prior art launches several invalid Reduce processes, which wastes scheduling time and resources and slows down data processing.
If MapReduce processing is instead performed on the information in Table 2 using the method for optimizing data storage of this embodiment, the processing is as shown in Figure 11 and proceeds as follows:
1. The main controller receives the user's data storage request, which carries data output range information. In this example the data output range information is: <Movie 3#20110809, Movie 3#20110812>.
2. MapReduce compares the data output range with the Region information of the data table shown in Figure 6 and obtains the effective output data range, namely Region2 (Movie 3#20110101, Movie 3#20110808). This effective output data range belongs to a single Region.
This single effective output data range therefore yields one piece of output block information:
Region2 (Movie 3#20110101, Movie 3#20110808)
3. Based on this output block information, the main controller sets up one Reduce task and dispatches it to a slave processor.
As this shows, the data storage method of this embodiment narrows down the output data range and bounds the effective output data range with simple comparison operations. This avoids starting and tearing down a large number of invalid Reduce tasks during data processing and greatly improves processing efficiency.
Correspondingly, an embodiment of the present invention further provides an apparatus for optimizing data access. Figure 12 is a schematic structural diagram of the apparatus.
In this embodiment, the apparatus for optimizing data access includes:
a receiving unit 121, configured to receive a user's request to access a data table in HBASE, where the request carries data input range information that includes multiple data input ranges;
an input blocking unit 122, configured to determine input block information according to the partition information of the data table and the data input range information;
a task determining unit 123, configured to determine the number of Map tasks according to the input block information;
an allocating unit 124, configured to allocate slave processors, according to the number of Map tasks, to read data in the data table;
a sending unit 125, configured to return the data read by the slave processors to the user.
The input blocking unit 122 can be implemented in various ways. For example, as shown in Figure 13, it may include a partition information acquiring subunit 1221, a comparing subunit 1222, a merging subunit 1223, and a block information determining subunit 1224, where:
the partition information acquiring subunit 1221 is configured to obtain the start key and end key of every partition in the data table;
the comparing subunit 1222 is configured to compare each data input range in the data input range information with the start key and end key of each partition, to obtain the coverage of the data input range within each partition;
the merging subunit 1223 is configured to merge those coverage ranges, among all the coverage ranges obtained by the comparing subunit, that belong to the same partition and are contiguous;
the block information determining subunit 1224 is configured to determine the input block information according to the merged coverage ranges.
Note that the merging subunit 1223 is optional; that is, the block information determining subunit 1224 may also determine the input block information directly from the coverage ranges obtained by the comparing subunit 1222.
To further improve processing efficiency, the input blocking unit 122 may also include a sorting subunit 1225, configured to sort the data input ranges in the data input range information before the comparing subunit 1222 compares them with the start key and end key of each partition.
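The sorting and merging subunits can be illustrated together, because sorting the coverage ranges first makes merging a single pass. This is a sketch under assumptions: `merge_coverage` is a hypothetical name, and two ranges are treated as contiguous when the second starts at or before the end of the first, which is one possible interpretation of "contiguous" for string keys.

```python
def merge_coverage(coverages):
    """Merge (partition, start, end) ranges that share a partition and touch or overlap."""
    merged = []
    for partition, start, end in sorted(coverages):
        if merged and merged[-1][0] == partition and start <= merged[-1][2]:
            # Same partition and contiguous with the previous range: extend it.
            previous = merged[-1]
            merged[-1] = (partition, previous[1], max(previous[2], end))
        else:
            merged.append((partition, start, end))
    return merged


coverages = [
    ("Region0", "h", "k"),
    ("Region0", "a", "c"),
    ("Region1", "a", "b"),
    ("Region0", "c", "f"),
]
print(merge_coverage(coverages))
# [('Region0', 'a', 'f'), ('Region0', 'h', 'k'), ('Region1', 'a', 'b')]
```

Merging reduces the number of input blocks, and therefore the number of Map tasks, without changing which rows are read.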
In this embodiment of the present invention, the input block information may include the number of input blocks and the start key and end key of each input block.
Correspondingly, the task determining unit 123 may determine the number of Map tasks from the number of input blocks, with each Map task corresponding to one input block.
Correspondingly, the allocating unit 124 may allocate one slave processor to each Map task and send the start key and end key of the input block corresponding to that Map task to the slave processor, so that the slave processor reads the data in the data table according to the start key and end key of the input block.
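The allocation described above, one Map task per input block with the block's key bounds handed to a slave processor, could be sketched like this. The round-robin choice of slave processor and the names `assign_map_tasks`, `slave-a`, and `slave-b` are assumptions made for illustration; the patent only states that each Map task is given a slave processor.

```python
from itertools import cycle


def assign_map_tasks(input_blocks, slaves):
    """Create one Map task per (start, end) input block and pick a slave for each."""
    tasks = []
    for (start_key, end_key), slave in zip(input_blocks, cycle(slaves)):
        tasks.append({"slave": slave, "start_key": start_key, "end_key": end_key})
    return tasks


tasks = assign_map_tasks(
    [("movie1#20110801", "movie1#20110807"),
     ("movie2#20110801", "movie2#20110807"),
     ("movie3#20110801", "movie3#20110807")],
    ["slave-a", "slave-b"],
)
for task in tasks:
    print(task)
```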
The apparatus for optimizing data access in this embodiment of the present invention can serve as the main controller in MapReduce. With this apparatus, the effective input data ranges can be bounded by a few simple comparison operations, so that far less invalid data is read during processing and processing efficiency is greatly improved. For the detailed processing, refer to the description of the method for optimizing data access in the preceding embodiments; it is not repeated here.
Correspondingly, an embodiment of the present invention further provides an apparatus for optimizing data storage. Figure 14 is a schematic structural diagram of the apparatus.
In this embodiment, the apparatus for optimizing data storage includes:
a receiving unit 131, configured to receive a user's request to store data in a data table in HBASE, where the request carries one or more data output ranges;
an output blocking unit 132, configured to determine output block information according to the partition information of the data table and the data output ranges;
a task determining unit 133, configured to determine the number of Reduce tasks according to the output block information;
an allocating unit 134, configured to allocate slave processors, according to the number of Reduce tasks, to write data to the data table.
The output blocking unit 132 can be implemented in various ways. For example, as shown in Figure 15, it may include a partition information acquiring subunit 1321, a comparing subunit 1322, and a block information determining subunit 1323, where:
the partition information acquiring subunit 1321 is configured to obtain the start key and end key of every partition in the data table;
the comparing subunit 1322 is configured to compare each data output range with the start key and end key of each partition, to obtain the partition information covered by the data output range;
the block information determining subunit 1323 is configured to determine the output block information according to the partition information.
The output blocking unit 132 may of course be implemented in other ways, which this embodiment of the present invention does not limit.
In this embodiment of the present invention, the output block information may include the number of output blocks.
Correspondingly, the task determining unit 133 may determine the number of Reduce tasks from the number of output blocks, with each Reduce task corresponding to one output block.
Correspondingly, the allocating unit 134 may allocate one slave processor to each Reduce task and send the data to be stored for that Reduce task to the slave processor, so that the slave processor writes that data into the partition corresponding to the output block.
The apparatus for optimizing data storage in this embodiment of the present invention can serve as the main controller in MapReduce. With this apparatus, the effective output data range can be bounded by simple comparison operations, which avoids starting and tearing down a large number of invalid Reduce tasks during data processing and greatly improves processing efficiency. For the detailed processing, refer to the description of the method for optimizing data access in the preceding embodiments; it is not repeated here.
From the description of the above embodiments, those skilled in the art will clearly understand that all or some of the steps of the methods in the above embodiments can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of those embodiments.
Note that the embodiments in this specification are described progressively: identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on what distinguishes it from the others. In particular, because the apparatus embodiments are essentially similar to the method embodiments, they are described only briefly; for related details, refer to the corresponding parts of the method embodiments. The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected as actually needed to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A method for optimizing data access, characterized by comprising:
receiving, by a main controller, a request from a user to access a data table in HBASE, where the request carries data input range information and the data input range information includes multiple data input ranges;
determining input block information according to partition information of the data table and the data input range information;
determining a number of Map tasks according to the input block information;
allocating slave processors, according to the number of Map tasks, to read data in the data table; and
returning the data read by the slave processors to the user.
2. The method according to claim 1, characterized in that the multiple data input ranges take any one of the following forms:
a list form, where the list contains multiple range query objects and each range query object is one data input range;
a list form, where the list contains multiple start/end data range pairs and each start/end data range pair represents one data input range;
a file form.
3. The method according to claim 1 or 2, characterized in that determining the input block information according to the partition information of the data table and the data input range information comprises:
obtaining a start key and an end key of every partition in the data table;
comparing each data input range in the data input range information with the start key and end key of each partition, to obtain the coverage of the data input range within each partition; and
determining the input block information according to the coverage.
4. The method according to claim 3, characterized in that the method further comprises:
before determining the input block information according to the coverage, merging those coverage ranges, among all the coverage ranges obtained, that belong to the same partition and are contiguous;
and in that determining the input block information according to the coverage comprises:
determining the input block information according to the merged coverage ranges.
5. The method according to claim 3 or 4, characterized in that the method further comprises: before comparing each data input range in the data input range information with the start key and end key of each partition, sorting the data input ranges in the data input range information.
6. The method according to claim 3, 4, or 5, characterized in that the input block information comprises: a number of input blocks and a start key and an end key of each input block;
determining the number of Map tasks according to the input block information comprises: determining the number of Map tasks according to the number of input blocks, with each Map task corresponding to one input block;
allocating slave processors, according to the number of Map tasks, to read data in the data table comprises:
allocating one slave processor to each Map task and sending the start key and end key of the input block corresponding to the Map task to the slave processor, so that the slave processor reads the data in the data table according to the start key and end key of the input block.
7. A method for optimizing data storage, characterized by comprising:
receiving, by a main controller, a request from a user to store data in a data table in HBASE, where the request carries one or more data output ranges;
determining output block information according to partition information of the data table and the data output ranges;
determining a number of Reduce tasks according to the output block information; and
allocating slave processors, according to the number of Reduce tasks, to write data to the data table.
8. The method according to claim 7, characterized in that the data output ranges take any one of the following forms:
a list form, where the list contains one or more start/end data range pairs and each start/end data range pair represents one data output range;
a file form.
9. The method according to claim 7 or 8, characterized in that determining the output block information according to the partition information of the data table and the data output ranges comprises:
obtaining a start key and an end key of every partition in the data table;
comparing each data output range with the start key and end key of each partition, to obtain the partition information covered by the data output range; and
determining the output block information according to the partition information.
10. The method according to claim 9, wherein the output block information comprises the number of output blocks;
determining the number of Reduce tasks according to the output block information comprises: determining the number of Reduce tasks according to the number of output blocks, each Reduce task corresponding to one output block; and
allocating slave processors, according to the number of Reduce tasks, to write data into the data table comprises:
allocating one slave processor to each Reduce task and transmitting the data to be stored for that Reduce task to the slave processor, so that the slave processor writes that data into the partition corresponding to the output block.
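As a rough illustration of the one-task-per-block rule in claim 10, the following Python sketch creates one Reduce task per output block and picks a slave processor for each. The function name `plan_reduce_tasks`, the slave names, and the round-robin choice are hypothetical; the claim only requires that each Reduce task be given a slave processor.

```python
from typing import Dict, List


def plan_reduce_tasks(output_blocks: List[int],
                      slaves: List[str]) -> Dict[int, str]:
    """Map each output block (one Reduce task each) to a slave processor."""
    if not slaves:
        raise ValueError("at least one slave processor is required")
    plan = {}
    for task_no, block in enumerate(output_blocks):
        # Round-robin selection is an assumption, not something the claim specifies.
        plan[block] = slaves[task_no % len(slaves)]
    return plan


plan = plan_reduce_tasks([0, 1, 4], ["slave-1", "slave-2"])
print(len(plan), plan)  # the number of Reduce tasks equals the number of output blocks
```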
11. An apparatus for optimizing data access, comprising:
a receiving unit, configured to receive a request from a user to access a data table in HBASE, the request carrying data input range information that comprises a plurality of data input ranges;
an input blocking unit, configured to determine input block information according to partition information of the data table and the data input range information;
a task determining unit, configured to determine a number of Map tasks according to the input block information;
an allocating unit, configured to allocate slave processors, according to the number of Map tasks, to read data from the data table; and
a sending unit, configured to return the data read by the slave processors to the user.
12. The apparatus according to claim 11, wherein the input blocking unit comprises:
a partition information obtaining subunit, configured to obtain the start key value and the end key value of every partition in the data table;
a comparing subunit, configured to compare each data input range in the data input range information with the start key value and the end key value of each partition to obtain the coverage of the data input ranges in each partition; and
a block information determining subunit, configured to determine the input block information according to the coverage.
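One way to picture the comparing subunit of claim 12 is to clip every input range against every partition. The Python sketch below does this under stated assumptions: keys are strings, intervals are half-open `[start, end)`, and the name `coverage` is hypothetical.

```python
from typing import List, Tuple

Range = Tuple[str, str]          # (start key, end key), half-open by assumption
Coverage = Tuple[int, str, str]  # (partition index, start key, end key)


def coverage(partitions: List[Range], ranges: List[Range]) -> List[Coverage]:
    """Clip each input range to each partition and keep the non-empty pieces."""
    pieces = []
    for r_start, r_end in ranges:
        for idx, (p_start, p_end) in enumerate(partitions):
            start = max(r_start, p_start)
            end = min(r_end, p_end)
            if start < end:  # the range actually reaches into this partition
                pieces.append((idx, start, end))
    return pieces


partitions = [("a", "h"), ("h", "p"), ("p", "z")]
print(coverage(partitions, [("c", "e"), ("f", "k")]))
```

Each resulting piece lies inside a single partition, so it can serve as a candidate input block.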
13. The apparatus according to claim 12, wherein the input blocking unit further comprises:
a merging subunit, configured to merge, among all the coverage ranges obtained by the comparing subunit, those that belong to the same partition and are contiguous;
wherein the block information determining subunit is specifically configured to determine the input block information according to the coverage ranges merged by the merging subunit.
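The merging described in claim 13 might look like the following Python sketch. The tuple layout `(partition index, start key, end key)` and the name `merge_coverage` are assumptions; the sketch also merges overlapping pieces, which goes slightly beyond the contiguity condition stated in the claim.

```python
from typing import List, Tuple

Coverage = Tuple[int, str, str]  # (partition index, start key, end key)


def merge_coverage(pieces: List[Coverage]) -> List[Coverage]:
    """Merge pieces that share a partition and touch or overlap each other."""
    merged: List[Coverage] = []
    for part, start, end in sorted(pieces):
        if merged and merged[-1][0] == part and start <= merged[-1][2]:
            # Same partition and no gap: extend the previous piece.
            prev_part, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_part, prev_start, max(prev_end, end))
        else:
            merged.append((part, start, end))
    return merged


pieces = [(0, "c", "e"), (0, "e", "h"), (1, "h", "k"), (0, "a", "b")]
print(merge_coverage(pieces))
```

Pieces in different partitions are never merged, even when their keys are adjacent, so every resulting input block still maps to exactly one partition.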
14. The apparatus according to claim 12 or 13, wherein the input blocking unit further comprises:
a sorting subunit, configured to sort the data input ranges in the data input range information before the comparing subunit compares them with the start key value and the end key value of each partition.
15. The apparatus according to claim 12, 13 or 14, wherein the input block information comprises: the number of input blocks, and the start key value and the end key value of each input block;
the task determining unit is specifically configured to determine the number of Map tasks according to the number of input blocks, each Map task corresponding to one input block; and
the allocating unit is specifically configured to allocate one slave processor to each Map task and to transmit the start key value and the end key value of the input block corresponding to that Map task to the slave processor, so that the slave processor reads data from the data table according to the start key value and the end key value of the input block.
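The read path of claim 15 can be sketched in Python with an in-memory sorted list standing in for the data table. Everything here is illustrative: `read_split`, `run_map_tasks`, the slave names, the round-robin selection, and the half-open key ranges are assumptions rather than details from the application.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Row = Tuple[str, int]    # (row key, value); the table is kept sorted by row key
Split = Tuple[str, str]  # (start key, end key), half-open by assumption


def read_split(table: List[Row], split: Split) -> List[Row]:
    """Return the rows whose key falls in [start, end)."""
    keys = [key for key, _ in table]
    start, end = split
    return table[bisect_left(keys, start):bisect_left(keys, end)]


def run_map_tasks(table: List[Row], splits: List[Split],
                  slaves: List[str]) -> Dict[Split, Tuple[str, List[Row]]]:
    """Create one Map task per input block and let its slave read that block."""
    results = {}
    for task_no, split in enumerate(splits):
        slave = slaves[task_no % len(slaves)]  # round-robin is an assumption
        results[split] = (slave, read_split(table, split))
    return results


table = [("a", 1), ("c", 2), ("f", 3), ("h", 4), ("k", 5)]
print(run_map_tasks(table, [("c", "h"), ("h", "k")], ["slave-1", "slave-2"]))
```

Because each slave receives only a start key and an end key, it reads just the rows of its own input block instead of scanning the whole table.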
16. An apparatus for optimizing data storage, comprising:
a receiving unit, configured to receive a request from a user to store data in a data table in HBASE, the request carrying one or more data output ranges;
an output blocking unit, configured to determine output block information according to partition information of the data table and the data output range;
a task determining unit, configured to determine a number of Reduce tasks according to the output block information; and
an allocating unit, configured to allocate slave processors, according to the number of Reduce tasks, to write data into the data table.
17. The apparatus according to claim 16, wherein the output blocking unit comprises:
a partition information obtaining subunit, configured to obtain the start key value and the end key value of every partition in the data table;
a comparing subunit, configured to compare the data output range with the start key value and the end key value of each partition to obtain information about the partitions covered by the data output range; and
a block information determining subunit, configured to determine the output block information according to that partition information.
18. The apparatus according to claim 17, wherein the output block information comprises the number of output blocks;
the task determining unit is specifically configured to determine the number of Reduce tasks according to the number of output blocks, each Reduce task corresponding to one output block; and
the allocating unit is specifically configured to allocate one slave processor to each Reduce task and to transmit the data to be stored for that Reduce task to the slave processor, so that the slave processor writes that data into the partition corresponding to the output block.
PCT/CN2011/083021 2011-11-28 2011-11-28 Method and apparatus for optimizing data access, method and apparatus for optimizing data storage WO2013078583A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201180002537.6A CN102725753B (en) 2011-11-28 2011-11-28 Method and apparatus for optimizing data access, method and apparatus for optimizing data storage
PCT/CN2011/083021 WO2013078583A1 (en) 2011-11-28 2011-11-28 Method and apparatus for optimizing data access, method and apparatus for optimizing data storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083021 WO2013078583A1 (en) 2011-11-28 2011-11-28 Method and apparatus for optimizing data access, method and apparatus for optimizing data storage

Publications (1)

Publication Number Publication Date
WO2013078583A1 (en) 2013-06-06

Family

ID=46950464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083021 WO2013078583A1 (en) 2011-11-28 2011-11-28 Method and apparatus for optimizing data access, method and apparatus for optimizing data storage

Country Status (2)

Country Link
CN (1) CN102725753B (en)
WO (1) WO2013078583A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109195175A (en) * 2018-09-03 2019-01-11 郑州云海信息技术有限公司 A kind of mobile wireless network optimization method based on cloud computing
WO2020034194A1 (en) * 2018-08-17 2020-02-20 西门子股份公司 Method, device, and system for processing distributed data, and machine readable medium

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838632B (en) * 2012-11-21 2017-04-12 阿里巴巴集团控股有限公司 Data querying method and device
CN103150403A (en) * 2013-03-28 2013-06-12 北京圆通慧达管理软件开发有限公司 Data processing system and method thereof
CN103198109A (en) * 2013-03-28 2013-07-10 北京圆通慧达管理软件开发有限公司 Data processing system and method
CN103226532A (en) * 2013-03-28 2013-07-31 北京圆通慧达管理软件开发有限公司 Data processing system and method
CN104679590B (en) * 2013-11-27 2018-12-07 阿里巴巴集团控股有限公司 Map optimization method and device in distributed computing system
CN103646073A (en) * 2013-12-11 2014-03-19 浪潮电子信息产业股份有限公司 Condition query optimizing method based on HBase table
CN104112011B (en) * 2014-07-16 2017-09-15 深圳国泰安教育技术股份有限公司 The method and device that a kind of mass data is extracted
CN104252536B (en) * 2014-09-16 2017-12-08 福建新大陆软件工程有限公司 A kind of internet log data query method and device based on hbase
CN106326309B (en) * 2015-07-03 2020-02-21 阿里巴巴集团控股有限公司 Data query method and device
CN106383826A (en) * 2015-07-29 2017-02-08 阿里巴巴集团控股有限公司 Database checking method and apparatus
CN106484689B (en) * 2015-08-24 2019-09-03 杭州华为数字技术有限公司 Data processing method and device
CN105183901A (en) * 2015-09-30 2015-12-23 北京京东尚科信息技术有限公司 Method and device for reading database table through data query engine
CN105956043A (en) * 2016-04-26 2016-09-21 海尔优家智能科技(北京)有限公司 Method and device for allocating Map task for MapReduce running on Hbase database
CN106294886A (en) * 2016-10-17 2017-01-04 北京集奥聚合科技有限公司 A kind of method and system of full dose extracted data from HBase
CN108427747B (en) * 2018-03-09 2021-10-15 广西师范大学 Dynamic planning data fragmentation optimization method based on range query boundary set
CN109657009B (en) * 2018-12-21 2021-03-12 北京锐安科技有限公司 Method, device, equipment and storage medium for creating data pre-partition storage periodic table
CN110083658B (en) * 2019-03-11 2021-05-25 北京达佳互联信息技术有限公司 Data synchronization method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957863A (en) * 2010-10-14 2011-01-26 广州从兴电子开发有限公司 Data parallel processing method, device and system
KR20110069338A (en) * 2009-12-17 2011-06-23 한국전자통신연구원 Distributed parallel processing system and method based on incremental mapreduce on data stream

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIN QIANG: "Research and Design of RDF Storage System Based on HBase", CHINESE MASTER'S THESES FULL-TEXT DATABASE: INFORMATION SCIENCE AND TECHNOLOGY, 15 July 2011 (2011-07-15), pages 1137 - 28 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020034194A1 (en) * 2018-08-17 2020-02-20 西门子股份公司 Method, device, and system for processing distributed data, and machine readable medium
CN109195175A (en) * 2018-09-03 2019-01-11 郑州云海信息技术有限公司 A kind of mobile wireless network optimization method based on cloud computing
CN109195175B (en) * 2018-09-03 2021-12-21 郑州云海信息技术有限公司 Mobile wireless network optimization method based on cloud computing

Also Published As

Publication number Publication date
CN102725753B (en) 2014-01-01
CN102725753A (en) 2012-10-10

Similar Documents

Publication Publication Date Title
WO2013078583A1 (en) Method and apparatus for optimizing data access, method and apparatus for optimizing data storage
US11960726B2 (en) Method and apparatus for SSD storage access
CN107533551B (en) Big data statistics at data Block level
US20200050607A1 (en) Reassigning processing tasks to an external storage system
TWI549060B (en) Access methods and devices for virtual machine data
JP2020038623A (en) Method, device, and system for storing data
CN102307206B (en) Caching system and caching method for rapidly accessing virtual machine images based on cloud storage
EP2863310B1 (en) Data processing method and apparatus, and shared storage device
WO2017161540A1 (en) Data query method, data object storage method and data system
US10073648B2 (en) Repartitioning data in a distributed computing system
WO2017028394A1 (en) Example-based distributed data recovery method and apparatus
CN111258978A (en) Data storage method
US11157456B2 (en) Replication of data in a distributed file system using an arbiter
CN110162395B (en) Memory allocation method and device
US7509461B1 (en) Method and apparatus for intelligent buffer cache pre-emption
US11625192B2 (en) Peer storage compute sharing using memory buffer
WO2016206100A1 (en) Partitioned management method and apparatus for data table
CN111061557B (en) Method and device for balancing distributed memory database load
WO2023040348A1 (en) Data processing method in distributed system, and related system
WO2012171363A1 (en) Method and equipment for data operation in distributed cache system
CN113835613B (en) File reading method and device, electronic equipment and storage medium
US10824640B1 (en) Framework for scheduling concurrent replication cycles
US10185729B2 (en) Index creation method and system
CN108287853B (en) Data relation analysis method and system
US11550793B1 (en) Systems and methods for spilling data for hash joins

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201180002537.6

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11876546

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11876546

Country of ref document: EP

Kind code of ref document: A1