CN102467570A - Connection query system and method for distributed data warehouse - Google Patents

Connection query system and method for distributed data warehouse

Info

Publication number: CN102467570A
Authority: CN (China)
Prior art keywords: data, node, mapping, query, according
Application number: CN2010105564905A
Other languages: Chinese (zh)
Other versions: CN102467570B (en)
Inventors: 伍涛, 刘晓炜, 胡卫松, 齐红威
Original Assignee: 日电(中国)有限公司
Application filed by: 日电(中国)有限公司
Priority to: CN201010556490.5A
Publications: CN102467570A (application), CN102467570B (granted)

Abstract

The invention discloses a join query system and method for a distributed data warehouse. The system comprises a master node, map worker nodes, and reduce worker nodes. The master node calculates a fragment size according to the size of the data tables and the system performance, allocates data blocks to the map worker nodes based on the calculated fragment size, and specifies the fragment mapping rule and aggregation rule used in the map worker nodes. Each map worker node maps the query key in its data block to a corresponding fragment number according to the fragment mapping rule and transmits data with the same fragment number to a designated reduce worker node according to the aggregation rule. Each reduce worker node receives the data from the map worker nodes, merges data with the same fragment number, and joins it on the query key to obtain the final join query result. The system and method reduce data transmission within the distributed system, lower the data volume and program complexity at the reduce worker nodes, and improve the performance of the distributed data warehouse.

Description

Join query system and method for a distributed data warehouse

Technical Field

[0001] The present invention relates to database technology, and in particular to a join query system and method for a distributed data warehouse.

Background Art

[0002] With the rapid development of information technology, the storage, retrieval, and analysis of massive data have become critical. The data warehouse emerged in response. It is usually defined as a subject-oriented, integrated, stable, time-variant collection of data used to support management decisions. The term has two levels of meaning: first, a data warehouse supports decision making and is oriented toward analytical data processing; second, it consists of heterogeneous data from multiple sources that is integrated, reorganized by subject, and includes historical data. Large capacity, high performance, high availability, scalability, manageability, and on-demand service have become the key indicators for evaluating today's data warehouses and distributed file systems.

[0003] Traditional data warehouses are built on heavyweight servers and data warehouse systems, which are expensive and scale poorly. A cluster of a few dozen standalone nodes already reaches the bottleneck of parallel processing. With the explosive growth of Internet services, data and information are growing exponentially, and traditional data warehouses can no longer meet the needs of applications such as search engines over Internet data, user data mining, and business intelligence. Large-scale data processing methods based on distributed file systems and the Map/Reduce distributed computing framework can be built on ordinary personal computers. They are inexpensive, highly scalable, and support heterogeneous data formats, and they are gradually being adopted by the industry; examples include Google's GFS distributed file system and Facebook's Hive data warehouse.

[0004] Nevertheless, these data warehouses are currently used mostly for offline, scheduled batch processing, and their efficiency is still far from meeting real-time requirements. In particular, the join query is a basic and frequently used function, so its efficiency matters greatly to overall system performance. In a distributed data environment, a join query looks up data in the cluster and joins the associated fields. In essence, it requires establishing a reasonable data structure and a distributed data storage mechanism over massive data storage in order to support efficient join queries. Because multiple data tables may be stored on different data nodes, the key to improving join query efficiency is how to locate these distributed tables quickly and how to improve query and sorting performance. In traditional data warehouse technology, data clustering, parallel query, and data partitioning are common performance-improving techniques; each is introduced below.

[0005] Many queries need to access large amounts of data sequentially, and data clustering addresses this sequential access problem by physically placing tables together so that the data is clustered in sequential order. Data clustering is a function of the database management system and depends on the clustering technology of the database itself. Clearly, this technique cannot be used directly in a distributed data system.

[0006] Parallel processing divides a query over a large amount of data into small parts and executes them in parallel to improve performance. It can be used for data loading and data reorganization. Parallel processing is closely tied to data partitioning, and the parallel architecture of the server hardware also affects how parallel processing is done. Some physical options are important for efficient parallel processing. Together, parallel processing and partitioning offer great potential for improving performance.

[0007] Data partitioning addresses large data tables (more than a million rows), which load inefficiently, take a long time to index, are time-consuming to back up and restore, and are slow to traverse and update. Partitioning both the table and its indexes makes them easier to maintain and operate. For a data warehouse, data partitioning is a key decision that must be planned before implementation, because later changes are very costly. Data can be partitioned vertically or horizontally. In vertical partitioning, selected groups of columns are split into partitions, and each partition has the same number of rows as the original table. In horizontal partitioning, selected groups of rows are split into partitions, and each partition has the same number of columns as the original table. Data partitioning has many key advantages: a query only needs to search the necessary partitions, partitions can be maintained offline, indexes are built faster, data corruption does not spread, and the mapping between partitions and disks balances input and output. However, traditional data partitioning techniques, in data warehouses built on databases, generally set the partitioning criteria from the business logic and do not take distributed processing capability into account. For this reason, merging data from different partitions is inefficient. Most importantly, this technique cannot support huge volumes of distributed data processing requests. Moreover, once data is updated, the whole process must be executed again.
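The difference between vertical and horizontal partitioning can be illustrated with a minimal Python sketch. The toy table, its column names, and the helper functions `vertical_partition` and `horizontal_partition` are assumptions introduced only for illustration; they are not part of the patent.

```python
# Toy table; the column names and values are illustrative assumptions.
TABLE = [
    {"id": 1, "name": "Ann", "city": "Beijing", "amount": 30},
    {"id": 2, "name": "Bo", "city": "Shanghai", "amount": 12},
    {"id": 3, "name": "Cy", "city": "Beijing", "amount": 7},
    {"id": 4, "name": "Di", "city": "Shenzhen", "amount": 55},
]


def vertical_partition(table, column_groups):
    """Split by column groups; every partition keeps all rows of the table."""
    return [[{col: row[col] for col in cols} for row in table]
            for cols in column_groups]


def horizontal_partition(table, rows_per_partition):
    """Split by row groups; every partition keeps all columns of the table."""
    return [table[i:i + rows_per_partition]
            for i in range(0, len(table), rows_per_partition)]


# The key column "id" is repeated in each vertical partition so rows can be re-joined.
vertical = vertical_partition(TABLE, [("id", "name"), ("id", "city", "amount")])
horizontal = horizontal_partition(TABLE, 2)
```

Each vertical partition has as many rows as the original table, and each horizontal partition has as many columns, which matches the description above.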

[0008] Because traditional data warehouse technology provides insufficient support for distributed computing frameworks, its scalability and real-time processing are bottlenecked. In join queries, the large volume of data transferred between data nodes and the merge-sort operations lead to low resource utilization and poor performance.

[0009] Among existing research results, the Hive data warehouse contributed by Facebook proposes a dynamic join query algorithm for large data tables (hereinafter referred to as the first reference scheme). The algorithm requires a complete Map/Reduce process, and both tables enter distributed processing as inputs to the map process. FIG. 1 shows the Map/Reduce computing model commonly used in distributed computing systems. As shown in FIG. 1, a distributed computing system usually consists of a number of nodes, which are divided into a master node and worker nodes; the master node schedules tasks for the worker nodes, and each worker node executes its assigned tasks. First, as shown at 101, the user program prepares the input according to the distributed file system specification, following the Map/Reduce programming model, and prepares the executable code of the map and reduce tasks. Then, as shown at 102, the master node assigns the map and reduce tasks to the corresponding compute nodes. Once activated, a map worker node starts reading its assigned data block, which resides in the distributed file system (see 103). Next, as shown at 104, the map task processes the input data block according to the user program and writes the result to local disk when done. After that, the reduce task is activated, starts reading the remote data files generated by the map tasks (see 105), and reduces the data according to the user program. Finally, as shown at 106, the processing result is output and written to the distributed file system.

[0010] FIG. 2 shows a system block diagram of the join query process according to the first reference scheme, with one master node 20, three map worker nodes 30, and two reduce worker nodes 40. Like reference numerals denote like elements throughout the drawings. In general, there are more map worker nodes than reduce worker nodes. The system is a distributed file system that uses SQL-like query statements to query and analyze massive data tables. In the map phase, the join field serves as the output key (Output Key) and the other attributes serve as the value (Value). In the reduce phase, records with the same key are brought together, which completes one join query.

[0011] Specifically, the input to the system shown in FIG. 2 is the data tables in the parallel system and a specific key (the join field). The join job scheduler 201 in the master node 20 receives the data tables and the key and assigns map tasks and reduce tasks to the worker nodes in the system. In this example, three worker nodes are assigned map tasks (the map worker nodes 30) and two worker nodes are assigned reduce tasks (the reduce worker nodes 40). The storage unit 301 in each map worker node 30 receives its assigned data block of a particular data table from the join job scheduler 201 of the master node 20. In this example, the join job scheduler 201 assigns each map worker node 30 a data block with a fixed size of 64 MB. The map-and-sort unit 302 in the map worker node 30 generates <key, value> pairs and sorts them by key. The distribution unit 303 in the map worker node 30 distributes the output of the map-and-sort unit 302 to every reduce worker node 40. In each reduce worker node 40, the merge unit 401 merges the data from every map worker node 30, the reduce unit 402 executes the query on the merged data, and the join-and-sort unit 403 joins the query results to obtain the final join result.
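The map and reduce phases of the first reference scheme can be sketched in a few lines of single-process Python. This is only an illustration of the general reduce-side join idea under stated assumptions, not Hive's actual implementation: the function names, the table tags, and the sample `users` and `orders` tables are hypothetical, and the network distribution step is replaced by an in-memory grouping.

```python
from collections import defaultdict
from itertools import product


def map_phase(table_name, rows, join_field):
    """Emit <key, value> pairs: the join field is the key, and the remaining
    attributes (tagged with the source table) are the value."""
    for row in rows:
        value = {k: v for k, v in row.items() if k != join_field}
        yield row[join_field], (table_name, value)


def group_by_key(pairs):
    """Stand-in for distribution and merging: collect all values per key."""
    grouped = defaultdict(list)
    for key, tagged_value in pairs:
        grouped[key].append(tagged_value)
    return grouped


def reduce_phase(grouped, join_field, left_name, right_name):
    """For each key, combine every left-table record with every right-table record."""
    for key in sorted(grouped):
        left = [v for tag, v in grouped[key] if tag == left_name]
        right = [v for tag, v in grouped[key] if tag == right_name]
        for l, r in product(left, right):
            yield {join_field: key, **l, **r}


def reduce_side_join(left_name, left_rows, right_name, right_rows, join_field):
    pairs = list(map_phase(left_name, left_rows, join_field))
    pairs += list(map_phase(right_name, right_rows, join_field))
    grouped = group_by_key(pairs)
    return list(reduce_phase(grouped, join_field, left_name, right_name))


# Hypothetical sample tables joined on "uid".
users = [{"uid": 1, "name": "Ann"}, {"uid": 2, "name": "Bo"}]
orders = [{"uid": 1, "item": "pen"}, {"uid": 1, "item": "ink"},
          {"uid": 3, "item": "cap"}]
joined = reduce_side_join("users", users, "orders", orders, "uid")
```

Note that every record of both tables passes through the map output and the grouping step, including records (such as `uid` 2 and 3 here) that never find a join partner.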

[0012] Although the first reference scheme supports parallel processing well, it leaves a great deal of room for improvement in efficiency. Between a basic map process and reduce process there is a step in which data is distributed within the cluster. The scheme does no processing on the <key, value> data generated in the map process; all of this data is distributed to every reduce node. The cost is therefore very high, and join queries are inefficient.

[0013] The paper "A Switch Criteria for Datasets Merging on Top of MapReduce" (The 8th International Conference on Grid and Cloud Computing, GCC 2009) describes a data merging criterion. It considers the processing capacity of a single node and the parallel processing capacity of the cluster, computes the optimal block size for each map task, and distinguishes input tables by size. If the small table is smaller than the optimal block size, it is shared directly in a global file, so that only the large table needs to be mapped during the join query. The join is completed within the map process, and no reduce processing is needed. This reduces data distribution between the map and reduce processes and significantly improves efficiency. The scheme matters in practice because small tables are common in distributed data processing, so handling them more efficiently improves data processing in many cases. However, when both tables are large (the smaller table also exceeds the memory capacity of a single task), this scheme is limited.
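A minimal Python sketch of this small-table case follows. It is not the algorithm from the cited paper; it only illustrates the general idea of sharing the small table with every map task and completing the join in the map step. The function names, the size threshold check, and the sample data are assumptions.

```python
from collections import defaultdict


def choose_join_strategy(small_table_bytes, optimal_block_bytes):
    """Assumed switch rule: join in the map step only when the small table
    is smaller than the optimal block size of a map task."""
    if small_table_bytes < optimal_block_bytes:
        return "map_side"
    return "reduce_side"


def map_side_join(large_block, small_table, join_field):
    """Join one block of the large table against the shared small table.
    No reduce step is needed because each map task sees the whole small table."""
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[join_field]].append(row)
    joined = []
    for row in large_block:
        for match in lookup.get(row[join_field], ()):
            joined.append({**match, **row})
    return joined


# Hypothetical data: a small dimension table and one block of a large table.
countries = [{"cc": "CN", "country": "China"}, {"cc": "JP", "country": "Japan"}]
visits_block = [{"cc": "CN", "page": "/a"}, {"cc": "US", "page": "/b"},
                {"cc": "JP", "page": "/c"}]
result = map_side_join(visits_block, countries, "cc")
```

When `choose_join_strategy` returns `"reduce_side"`, the lookup table no longer fits in a single task, which is the limitation noted at the end of the paragraph above.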

[0014] Patent application WO2005076160A1 (entitled "The Data Warehouse Distributed System and Architecture to Support Distributed Query Execution") proposes a scalable distributed data warehouse architecture for fast queries. That application considers the performance and load balancing of distributed computing, as well as the important role of fragmentation in improving performance, but its support for scaling distributed computing is insufficient. That is, in the distributed computing case it does not consider assigning optimal tasks and optimally sized data fragments according to the performance of individual nodes. In a large-scale distributed environment, resource utilization will therefore be relatively low.

Summary of the Invention

[0015] The present invention aims to provide a dynamic join query scheme that supports distributed storage and distributed computing frameworks and is centered on optimal data fragmentation. In this document, "fragmentation" means appropriately dividing the input files of the distributed computing system, according to the computing resources of the reduce nodes and the Map/Reduce computing framework, so that every compute node achieves optimal resource utilization during the map and reduce processes. Specifically, the invention first evaluates the standalone processing capacity and the parallel processing capacity of the distributed cluster and dynamically fragments the data tables based on the evaluation. It then applies a globally uniform mapping algorithm to every record processed on every worker node, finds the processing node under the established rules, and distributes the data in a directed manner to the specific worker node. As a result, data with the same fragment number from the tables to be joined is assigned to the same processing node, which both reduces file transfer within the distributed system and improves the efficiency of merging and sorting. In addition, the invention provides a caching mechanism based on the dynamic join query algorithm, which can greatly improve query efficiency.

[0016] According to one aspect of the present invention, a join query system for a distributed data warehouse is provided, comprising a master node, map worker nodes, and reduce worker nodes, wherein: the master node calculates a fragment size according to the size of the data tables and the system performance, allocates data blocks to the map worker nodes based on the calculated fragment size, and specifies the fragment mapping rule and aggregation rule used in the map worker nodes; each map worker node maps the query key in its data block to a corresponding fragment number according to the fragment mapping rule and transmits data with the same fragment number to a designated reduce worker node according to the aggregation rule; and each reduce worker node receives the data transmitted by the map worker nodes, merges data with the same fragment number, and joins it on the query key to obtain the final join query result.

[0017] Preferably, the master node comprises: a fragment size and number calculation unit, which obtains the physical resource configuration information of each map worker node in the system and, from the size of the data tables and the obtained physical resource configuration information, calculates the fragment size for each map worker node and the number of fragments for each data table; and a fragment processing scheduler, which divides each data table according to the corresponding fragment size for transmission to each map worker node and specifies the fragment mapping rule and aggregation rule in each map worker node.

[0018] Preferably, the map worker node comprises: a storage unit, which receives the data block transmitted by the master node; a mapping and fragment processing unit, which maps the query key in the data block to a specific fragment number according to the fragment mapping rule and stores data with the same fragment number in the same data set; and a directed distribution unit, which transmits the data stored in each data set to the designated reduce worker node according to the aggregation rule.

[0019] Preferably, the reduce worker node comprises: a reduce unit, which receives the data transmitted from the map worker nodes and merges data with the same fragment number to form a fragment data file; a table fragment storage unit, which stores the fragment data files; and a join-and-sort unit, which joins and sorts the data in the fragment data files by the query key to output the final join query result.
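The cooperation of the three node types described above can be sketched as a single-process Python simulation. This is a sketch under stated assumptions rather than the patented implementation: the hash-based mapping rule, the aggregation rule "fragment f goes to reducer f modulo the number of reducers", all function names, and the sample tables are hypothetical, and in-memory lists stand in for network transfer and fragment data files.

```python
import zlib
from collections import defaultdict


def fragment_of(key, num_fragments):
    """Assumed fragment mapping rule: stable hash of the query key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_fragments


def map_worker(table_name, block, key_field, num_fragments):
    """Map each record to a fragment number; one data set per fragment."""
    datasets = defaultdict(list)
    for row in block:
        frag = fragment_of(row[key_field], num_fragments)
        datasets[frag].append((table_name, row))
    return datasets


def directed_distribute(datasets, aggregation_rule, inboxes):
    """Send each data set only to the reducer designated for its fragment."""
    for frag, records in datasets.items():
        inboxes[aggregation_rule[frag]][frag].extend(records)


def reduce_worker(inbox, key_field, left_name):
    """Merge records per fragment number, then join them on the query key."""
    results = []
    for frag in sorted(inbox):
        left, right = defaultdict(list), defaultdict(list)
        for table_name, row in inbox[frag]:
            side = left if table_name == left_name else right
            side[row[key_field]].append(row)
        for key in sorted(left.keys() & right.keys()):
            results.extend({**l, **r} for l in left[key] for r in right[key])
    return results


def fragment_join(left_name, left_blocks, right_name, right_blocks,
                  key_field, num_fragments, num_reducers):
    # Master node role: fix the aggregation rule (an assumption, see lead-in).
    aggregation_rule = {f: f % num_reducers for f in range(num_fragments)}
    inboxes = [defaultdict(list) for _ in range(num_reducers)]
    for name, blocks in ((left_name, left_blocks), (right_name, right_blocks)):
        for block in blocks:  # one block per map worker
            datasets = map_worker(name, block, key_field, num_fragments)
            directed_distribute(datasets, aggregation_rule, inboxes)
    results = []
    for inbox in inboxes:
        results.extend(reduce_worker(inbox, key_field, left_name))
    return results


# Hypothetical input: each inner list is the data block of one map worker.
USERS = [[{"uid": 1, "name": "Ann"}, {"uid": 2, "name": "Bo"}],
         [{"uid": 3, "name": "Cy"}]]
ORDERS = [[{"uid": 1, "item": "pen"}, {"uid": 3, "item": "cap"}],
          [{"uid": 1, "item": "ink"}, {"uid": 4, "item": "pad"}]]
result = fragment_join("users", USERS, "orders", ORDERS, "uid",
                       num_fragments=4, num_reducers=2)
```

Because both tables use the same mapping rule, records with equal keys always land in the same fragment and therefore on the same reducer, so each reducer can join its fragments without seeing the rest of the data.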

[0020] Preferably, the fragment mapping rule comprises: fragment mapping by value range of the query key, or fragment mapping based on a hash function value of the query key.
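The two kinds of fragment mapping rule can be sketched as follows. The function names, the boundary list, and the choice of CRC32 as the hash function are assumptions made for illustration; the text does not prescribe a particular hash function or range layout.

```python
import bisect
import zlib


def range_fragment(key, boundaries):
    """Range-based rule: `boundaries` is a sorted list of split points.
    Keys below boundaries[0] map to fragment 0, keys in
    [boundaries[i-1], boundaries[i]) map to fragment i, and so on."""
    return bisect.bisect_right(boundaries, key)


def hash_fragment(key, num_fragments):
    """Hash-based rule: a stable hash of the key modulo the fragment count.
    CRC32 is used instead of Python's built-in hash() because the built-in
    string hash is randomized per process and would differ across nodes."""
    return zlib.crc32(str(key).encode("utf-8")) % num_fragments
```

For example, with `boundaries = [10, 20]` the range rule yields three fragments: keys below 10, keys from 10 up to but not including 20, and keys of 20 or more. A range rule keeps neighboring keys together, while a hash rule tends to spread keys more evenly when their distribution is skewed.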

[0021] Preferably, the physical resource configuration information of a map worker node includes the free memory capacity of the map worker node.

[0022] Preferably, the fragment size and number calculation unit calculates the quotient of the free memory capacity of each map worker node and the number of map tasks, compares the quotient with the upper limit of the virtual machine memory of that map worker node, and takes the smaller of the two as the fragment size for that map worker node.
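The fragment size rule above amounts to a single `min` expression. The sketch below assumes sizes are given in bytes and uses hypothetical function and parameter names; the `fragment_count` helper, which rounds the table size up to a whole number of fragments, is an additional assumption, since this paragraph does not give a formula for the number of fragments.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def fragment_size(free_memory_bytes, num_map_tasks, vm_memory_limit_bytes):
    """Smaller of (free memory / number of map tasks) and the VM memory limit."""
    if num_map_tasks <= 0:
        raise ValueError("num_map_tasks must be positive")
    per_task = free_memory_bytes // num_map_tasks
    return min(per_task, vm_memory_limit_bytes)


def fragment_count(table_bytes, fragment_bytes):
    """Assumed rule: number of fragments needed to cover the table (ceiling)."""
    return -(-table_bytes // fragment_bytes)


# A node with 4 GiB free memory, 8 map tasks, and a 1 GiB VM memory limit:
size_a = fragment_size(4 * GIB, 8, 1 * GIB)   # quotient 512 MiB is the smaller value
# The same node running only 2 map tasks is capped by the VM limit instead:
size_b = fragment_size(4 * GIB, 2, 1 * GIB)   # quotient 2 GiB exceeds the 1 GiB limit
```

The cap by the virtual machine memory limit keeps a single map task from being handed a fragment that it could not hold in memory even when the node as a whole has spare capacity.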

[0023] Preferably, the reduce worker node further comprises a global fragment index unit, which builds a global index table for a specific query key; the global index table includes the fragment number, the corresponding fragment storage node, the data table name, and the path of the fragment data file. When a query key for which a global index table has already been built is queried again, the global index table is accessed and the corresponding fragment data files are loaded directly.
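One possible in-memory shape for such a global index table is sketched below. Only the four fields (fragment number, storage node, data table name, file path) come from the text; the class names, method names, node names, and paths are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    fragment_number: int
    storage_node: str   # reduce worker node that stores the fragment
    table_name: str
    path: str           # path of the fragment data file


class GlobalFragmentIndex:
    """Maps a query key (join field name) to its list of index entries."""

    def __init__(self):
        self._tables = {}

    def register(self, query_key, entry):
        self._tables.setdefault(query_key, []).append(entry)

    def has(self, query_key):
        """True if a global index table was already built for this query key."""
        return query_key in self._tables

    def lookup(self, query_key, fragment_number=None):
        """Return entries for a query key, optionally for one fragment only."""
        entries = self._tables.get(query_key, [])
        if fragment_number is None:
            return list(entries)
        return [e for e in entries if e.fragment_number == fragment_number]


# Hypothetical usage: node names and paths are placeholders.
index = GlobalFragmentIndex()
index.register("uid", IndexEntry(0, "reducer-1", "users", "/frag/users/0"))
index.register("uid", IndexEntry(0, "reducer-1", "orders", "/frag/orders/0"))
index.register("uid", IndexEntry(1, "reducer-2", "users", "/frag/users/1"))
```

On a repeated query, a caller would first check `index.has("uid")` and, if it is true, load the listed fragment data files directly instead of running the map and distribution steps again.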

[0024] According to another aspect of the present invention, a join query method for a distributed data warehouse is provided, comprising: at the master node, calculating a fragment size according to the size of the data tables and the system performance, allocating data blocks to the map worker nodes based on the calculated fragment size, and specifying the fragment mapping rule and aggregation rule used in the map worker nodes; at each map worker node, mapping the query key in the data block to a corresponding fragment number according to the fragment mapping rule, and transmitting data with the same fragment number to a designated reduce worker node according to the aggregation rule; and at each reduce worker node, receiving the data transmitted by the map worker nodes, merging data with the same fragment number, and joining it on the query key to obtain the final join query result.

[0025] Preferably, the steps performed at the master node comprise: obtaining the physical resource configuration information of each map worker node in the system and, from the size of the data tables and the obtained physical resource configuration information, calculating the fragment size for each map worker node and the number of fragments for each data table; and dividing each data table according to the corresponding fragment size for transmission to each map worker node, and specifying the fragment mapping rule and aggregation rule in each map worker node.

[0026] Preferably, the steps performed at the map worker node comprise: receiving the data block transmitted by the master node; mapping the query key in the data block to a specific fragment number according to the fragment mapping rule, and storing data with the same fragment number in the same data set; and transmitting the data stored in each data set to the designated reduce worker node according to the aggregation rule.

[0027] Preferably, the steps performed at the reduce worker node comprise: receiving the data transmitted from the map worker nodes and merging data with the same fragment number to form a fragment data file; storing the fragment data file; and joining and sorting the data in the fragment data file by the query key to output the final join query result.

[0028] Preferably, the fragment mapping rule comprises: fragment mapping by value range of the query key, or fragment mapping based on a hash function value of the query key.

[0029] Preferably, the physical resource configuration information of a map worker node includes the free memory capacity of the map worker node.

[0030] Preferably, at the master node, the quotient of the free memory capacity of each map worker node and the number of map tasks is calculated, the quotient is compared with the upper limit of the virtual machine memory of that map worker node, and the smaller of the two is taken as the fragment size for that map worker node.

[0031] Preferably, the following step is also performed at the reduce worker node: building a global index table for a specific query key, the global index table including the fragment number, the corresponding fragment storage node, the data table name, and the path of the fragment data file. When a query key for which a global index table has already been built is queried again, the corresponding fragment data files are loaded directly by accessing the global index table.

[0032] The present invention is based on a distributed file system and takes the computing capacity of the compute nodes in the cluster into account. It processes join queries using data fragmentation and directed distribution, which reduces data transfer within the distributed system, lowers the sorting and join costs caused by redundant data, and raises the resource utilization of the cluster, thereby effectively improving the performance of the distributed data warehouse. In addition, when performing directed distribution and fragment processing, the invention can also build a global index, which improves the efficiency of subsequent join queries.

Brief Description of the Drawings

[0033] The above and other features of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:

[0034] FIG. 1 is a schematic diagram of the Map/Reduce computing model used in a distributed computing system according to the prior art;

[0035] FIG. 2 is a block diagram of a system for the join query process in a distributed system according to the prior art;

[0036] FIG. 3 is a block diagram of a fragment join query system for a distributed data warehouse according to one embodiment of the present invention;

[0037] FIG. 4 is a block diagram of a fragment join query system for a distributed data warehouse according to another embodiment of the present invention;

[0038] FIG. 5 is a schematic diagram showing an example of a global index table according to one embodiment of the present invention;

[0039] FIG. 6 is a flowchart of a fragment join query method for a distributed data warehouse according to one embodiment of the present invention; and

[0040] FIG. 7 is a flowchart of a fragment join query method for a distributed data warehouse according to another embodiment of the present invention.

Detailed Description

[0041] The principles and implementation of the present invention will become apparent from the following description of specific embodiments taken in conjunction with the drawings. It should be noted that the invention is not limited to the specific embodiments described below. In addition, detailed descriptions of well-known elements are omitted for brevity.

[0042] FIG. 3 shows a block diagram of a fragment join query system for a distributed data warehouse according to one embodiment of the present invention. As an example, FIG. 3 shows one master node 50, three map worker nodes 60, and two reduce worker nodes 70, and like reference numerals denote like elements in the drawings. It should be understood, however, that the invention can be applied to a distributed system with any number of map worker nodes and any number of reduce worker nodes. In general, there are more map worker nodes than reduce worker nodes.

[0043] Specifically, the input to the system shown in FIG. 3 consists of the data tables in the parallel system and a specific keyword (the connection field). The fragment size and number calculation unit 501 in the master node 50 receives these data tables and the keyword, and calculates the fragment size and the number of fragments for each data table from the size of the data table and the system performance. In this example, the system performance may include the free memory capacity of each work node. The fragmentation processing of the present invention is described in detail below.

[0044] In a conventional distributed computing system, the input data is split into data blocks at the master node, the data blocks are mapped on different mapping work nodes, and the results are distributed to all reduction work nodes for processing. This leads to a very large volume of network traffic inside the distributed system and to redundant data on the reduction work nodes. In the present invention, therefore, the fragment size and number are calculated from the following parameters:

[0045] $N_{MapNum} = \min(S_{file} / S_{split},\ N_{mapcap})$

[0046] where $N_{MapNum}$ denotes the number of parallel mapping tasks, $S_{file}$ the size of the data table, $S_{split}$ the size of one block, and $N_{mapcap}$ the maximum number of mapping tasks configured in the map/reduce framework. Once the data size of a large file exceeds a certain value, multiple mapping tasks execute concurrently on a single compute node; the map/reduce framework requires that the average number of mapping tasks equal the data size divided by the block size.
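As a rough illustration of the formula in paragraph [0045], the following Python sketch computes the number of parallel mapping tasks. The function name, the byte units and the rounding-up of a partial trailing block are assumptions made for this example; the source only gives the min expression.

```python
import math


def parallel_map_tasks(table_size: int, split_size: int, max_map_tasks: int) -> int:
    """N_MapNum = min(S_file / S_split, N_mapcap), with sizes in bytes."""
    # Assumption: round up so that a partial trailing block still gets a task.
    blocks = math.ceil(table_size / split_size)
    return min(blocks, max_map_tasks)


MIB = 1024 ** 2
GIB = 1024 ** 3

# A 10 GiB table in 64 MiB blocks gives 160 blocks, capped at 100 tasks.
print(parallel_map_tasks(10 * GIB, 64 * MIB, 100))  # 100
# A 1 GiB table gives 16 blocks, which is below the cap.
print(parallel_map_tasks(1 * GIB, 64 * MIB, 100))   # 16
```

The cap only takes effect for large tables; below it, the task count simply follows the number of blocks.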

[0047] The number of tasks on a single compute node is also limited by the hardware of that machine. For a single work node to hold its fragments in memory concurrently, the product of the fragment size and the number of fragments must not exceed the physical memory capacity of the machine, that is:

[0048] $N_{MapNumPerNode} \times S_{partition} \le S_{memory}$

[0049] In addition, the process that runs each mapping task has an upper limit on its heap memory. The fragment size therefore cannot exceed the memory limit of the mapping process:

[0050] $S_{partition} \le M_{cap}$

[0051] where $M_{cap}$ denotes the upper memory limit of each mapping task, which depends on the computing container in use; one example is the memory ceiling of the Java virtual machine in a Java runtime environment.

[0052] The fragment size can therefore be obtained from the following equation:

[0053] $S_{partition} = \min(S_{memory} / N_{MapNumPerNode},\ M_{cap})$

[0054] Accordingly, the number of fragments for each data table is the size of that data table divided by the fragment size.
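The two results above, the fragment size of paragraph [0053] and the fragment count of paragraph [0054], can be sketched in Python as follows. The function names, the byte units, the integer division and the rounding-up of the fragment count are assumptions for illustration only.

```python
import math


def fragment_size(free_memory: int, map_tasks_per_node: int, heap_cap: int) -> int:
    """S_partition = min(S_memory / N_MapNumPerNode, M_cap), in bytes."""
    return min(free_memory // map_tasks_per_node, heap_cap)


def fragment_count(table_size: int, frag_size: int) -> int:
    """Table size divided by fragment size (rounded up, by assumption)."""
    return math.ceil(table_size / frag_size)


GIB = 1024 ** 3

# 8 GiB free, 4 mapping tasks per node, 1 GiB heap cap: the heap cap wins.
size_a = fragment_size(8 * GIB, 4, 1 * GIB)
# 2 GiB free, 4 mapping tasks per node: the memory share (512 MiB) wins.
size_b = fragment_size(2 * GIB, 4, 1 * GIB)

print(size_a // 1024 ** 2, fragment_count(10 * GIB, size_a))  # 1024 10
print(size_b // 1024 ** 2, fragment_count(10 * GIB, size_b))  # 512 20
```

Whichever of the two limits is tighter on a given node determines its fragment size, so nodes with less free memory receive smaller fragments and a table is cut into more of them.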

[0055] Taking the above constraints into account, the fragment size and number calculation unit 501 obtains the physical resource configuration information of each node and calculates $S_{partition} = \min(S_{memory} / N_{MapNumPerNode},\ M_{cap})$ to form an information table of <node ID, $S_{partition}$> pairs. The information table is saved in the distributed system so that the master node and each work node can access it when needed.

[0056] The fragmentation processing scheduler 502 splits each data table according to the specified fragment size and distributes the resulting data blocks to the mapping work nodes 60. Specifically, the scheduler 502 accesses the information table of <node ID, $S_{partition}$> pairs, obtains the fragment size corresponding to a particular work node, and splits off fragments of data of that size for that work node. The scheduler 502 also specifies, for the data blocks in each mapping work node 60, the fragmentation mapping rules and the summarization rules (that is, the reduction work node to which each fragment of data is sent when the data is summarized). The fragmentation mapping rule may be interval fragmentation (fragments are mapped according to the value interval of the query keyword) or hash-based fragmentation (fragments are mapped according to the hash function value of the query keyword).
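The two kinds of fragmentation mapping rule named in paragraph [0056] can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the boundary list, the use of `bisect` for interval lookup and the choice of CRC32 as the hash function are all assumptions.

```python
import bisect
import zlib
from typing import Sequence


def interval_fragment(key: int, boundaries: Sequence[int]) -> int:
    """Interval rule: sorted boundaries split the key range into fragments.

    Fragment 0 holds keys below boundaries[0]; fragment i holds keys in
    [boundaries[i-1], boundaries[i]); the last fragment holds the rest.
    """
    return bisect.bisect_right(boundaries, key)


def hash_fragment(key: str, num_fragments: int) -> int:
    """Hash rule: a stable hash of the key, reduced modulo the fragment count."""
    # zlib.crc32 is an illustrative choice; it is stable across processes,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_fragments


boundaries = [100, 200, 300]  # hypothetical interval boundaries
print([interval_fragment(k, boundaries) for k in (50, 100, 250, 300)])  # [0, 1, 2, 3]
print(hash_fragment("user42", 8) == hash_fragment("user42", 8))  # True
```

Either rule has the property the scheme relies on: the same keyword value always yields the same fragment number, regardless of which mapping work node evaluates it.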

[0057] The storage unit 601 in the mapping work node 60 receives the specific data blocks allocated by the fragmentation processing scheduler 502 of the master node 50. The mapping and fragment processing unit 602 then maps the query keyword. Specifically, when performing the mapping, the unit 602 maps each keyword value to a fragment number according to the fragmentation mapping rules specified by the scheduler 502, and stores data with the same fragment number in the same data set.

[0058] The directed distribution unit 603 in the mapping work node 60 distributes the data sets corresponding to the different fragment numbers to the designated reduction work nodes. As specified by the fragmentation processing scheduler 502, the data for each fragment number is transmitted only to its designated reduction work node. This greatly reduces the volume of network traffic inside the distributed system and reduces data redundancy on the reduction work nodes.
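The grouping of paragraph [0057] and the directed distribution of paragraph [0058] can be sketched together in Python. The summarization rule is modelled here as a plain dictionary from fragment number to reduction node ID; that representation, the node names and the modulo-based fragment function are assumptions for the example.

```python
from collections import defaultdict


def group_by_fragment(rows, key_of, fragment_of):
    """Map side: put rows with the same fragment number into one data set."""
    data_sets = defaultdict(list)
    for row in rows:
        data_sets[fragment_of(key_of(row))].append(row)
    return dict(data_sets)


def directed_distribute(data_sets, summarization_rule):
    """Route each fragment's data set only to its designated reduction node."""
    outbox = defaultdict(dict)  # reduction node id -> {fragment number: rows}
    for fragment_no, rows in data_sets.items():
        outbox[summarization_rule[fragment_no]][fragment_no] = rows
    return dict(outbox)


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
summarization_rule = {0: "reduce-A", 1: "reduce-B"}  # hypothetical node ids

data_sets = group_by_fragment(rows, key_of=lambda r: r[0], fragment_of=lambda k: k % 2)
outbox = directed_distribute(data_sets, summarization_rule)
print(outbox["reduce-A"])  # {0: [(2, 'b'), (4, 'd')]}
print(outbox["reduce-B"])  # {1: [(1, 'a'), (3, 'c')]}
```

Each row leaves the mapping work node exactly once, towards one reduction work node, which is the contrast with the conventional scheme in which results are distributed to all reduction work nodes.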

[0059] The reduction unit 701 in the reduction work node 70 receives the data transmitted by the corresponding mapping work nodes and merges data with the same fragment number to form fragment data files. The table fragment storage unit 702 in the reduction work node 70 then stores the data with the same fragment number, and the connection and sorting unit 703 connects and sorts the data in each fragment data file by the specified keyword and outputs the final query result.
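A minimal Python sketch of the reduction side described in paragraph [0059] follows: data from several mapping work nodes is merged per fragment number, then connected (joined) and sorted by keyword. The row layout (keyword first), the table names and the use of an in-memory hash join are assumptions; the source does not prescribe a particular join algorithm.

```python
from collections import defaultdict


def merge_fragments(received):
    """Merge rows that share a fragment number, kept separate per table."""
    merged = defaultdict(lambda: defaultdict(list))
    for fragment_no, table_name, rows in received:
        merged[fragment_no][table_name].extend(rows)
    return merged


def join_and_sort(left_rows, right_rows):
    """Equi-join two row lists on their first field, sorted by that field."""
    right_by_key = defaultdict(list)
    for key, value in right_rows:
        right_by_key[key].append(value)
    joined = [
        (key, left_value, right_value)
        for key, left_value in left_rows
        for right_value in right_by_key.get(key, [])
    ]
    return sorted(joined)


# Hypothetical data arriving from two mapping work nodes for fragment 0.
received = [
    (0, "users", [(2, "bob"), (4, "eve")]),
    (0, "orders", [(2, "pen")]),
    (0, "orders", [(4, "ink"), (2, "cap")]),
]
fragment = merge_fragments(received)[0]
print(join_and_sort(fragment["users"], fragment["orders"]))
# [(2, 'bob', 'cap'), (2, 'bob', 'pen'), (4, 'eve', 'ink')]
```

Because all rows with a given keyword value land in the same fragment, the join for one fragment never needs rows held by another reduction work node.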

[0060] In practice, some keywords may be queried frequently. To reduce the query time for such keywords, the results obtained for them can be saved for fast lookup next time. Whether a keyword counts as frequently queried can be decided from the query count statistics of the system.
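One simple way to decide from query counts whether a keyword is frequently queried is a counter with a threshold, sketched below in Python. The class name and the fixed threshold are hypothetical; the source only states that query count statistics are used.

```python
from collections import Counter


class QueryStats:
    """Counts queries per keyword and flags the frequently queried ones."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical cut-off, not from the source
        self.counts = Counter()

    def record(self, keyword: str) -> None:
        self.counts[keyword] += 1

    def is_frequent(self, keyword: str) -> bool:
        return self.counts[keyword] >= self.threshold


stats = QueryStats(threshold=3)
for _ in range(3):
    stats.record("user_id")
stats.record("order_id")

print(stats.is_frequent("user_id"))   # True
print(stats.is_frequent("order_id"))  # False
```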

[0061] FIG. 4 is a block diagram of a fragmented connection query system for a distributed data warehouse according to another embodiment of the present invention. This system differs from the system of FIG. 3 in that the reduction work node 70 further includes a global fragment index unit 704.

[0062] After a keyword has been queried for the first time, the global fragment index unit 704 can build an index over the resulting fragment numbers, the corresponding fragment storage nodes, the data table names and the fragment file paths, and store the resulting global index table in the distributed system. For example, FIG. 5 shows one example of a global index table built by the global fragment index unit.

[0063] When a query is then run again on a keyword that has already been queried, the master node 50 no longer distributes data blocks to the mapping work nodes 60 when allocating mapping tasks; instead it uses the fragment indexes in the index table, so that the corresponding data files can be loaded directly from the distributed system. The query system of this embodiment can therefore reduce the query time for frequently queried keywords, which improves system performance.
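The global index table of paragraphs [0062] and [0063] can be pictured as a mapping from a query keyword to entries with the four fields named in the text. The Python sketch below is only an illustration: the class names, the in-memory dictionary and the example paths and node IDs are assumptions, and FIG. 5 itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class IndexEntry:
    fragment_no: int    # fragment number
    storage_node: str   # node that stores the fragment
    table_name: str     # data table the fragment belongs to
    file_path: str      # path of the fragment data file


class GlobalIndexTable:
    """Keyword -> index entries; a stand-in for a table kept in the distributed system."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[IndexEntry]] = {}

    def build(self, keyword: str, entries: List[IndexEntry]) -> None:
        self._entries[keyword] = list(entries)

    def lookup(self, keyword: str) -> Optional[List[IndexEntry]]:
        return self._entries.get(keyword)


index = GlobalIndexTable()
index.build("user_id", [
    IndexEntry(0, "reduce-A", "users", "/warehouse/users/frag-0"),    # hypothetical path
    IndexEntry(0, "reduce-A", "orders", "/warehouse/orders/frag-0"),  # hypothetical path
])

hit = index.lookup("user_id")
print([entry.file_path for entry in hit])
print(index.lookup("order_id"))  # None
```

On a hit, the returned paths say which fragment data files to load and from which node, so no data blocks need to be redistributed for that keyword.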

[0064] FIG. 6 is a flowchart of a fragmented connection query method for a distributed data warehouse according to one embodiment of the present invention. The method starts at step S1000.

[0065] In step S1100, the master node of the distributed system calculates the fragment size and the number of fragments from the system performance (for example, the free memory capacity of the work nodes in the system) and the size of the data tables, and allocates mapping tasks to the work nodes according to the calculated fragment size and number. The master node also specifies the fragmentation mapping rules and summarization rules for the data blocks in the mapping work nodes.

[0066] In step S1200, each mapping work node maps keyword values to fragment numbers according to the fragmentation mapping rules specified by the master node (for example, interval fragmentation or hash-based fragmentation), and stores data with the same fragment number in the same data set.

[0067] In step S1300, each mapping work node distributes the data sets corresponding to the different fragment numbers to the designated reduction work nodes. The data for each fragment number is transmitted only to its designated reduction work node.

[0068] In step S1400, each reduction work node receives the data transmitted by the corresponding mapping work nodes and merges data with the same fragment number to form fragment data files. The reduction work node then connects and sorts the data in each fragment data file by the specified keyword and outputs the final connection query result. Finally, the method ends at step S1500.

[0069] FIG. 7 is a flowchart of a fragmented connection query method for a distributed data warehouse according to another embodiment of the present invention. Steps S2100-S2400 in FIG. 7 are the same as steps S1100-S1400 in FIG. 6, respectively, and are not described again here. FIG. 7 additionally includes steps S2010, S2020 and S2410. In practice, some keywords may be queried frequently; to reduce the query time for them, the results obtained for these keywords can be saved for fast lookup next time. To this end, in step S2410, which follows step S2400, the reduction work node can build an index over the fragment numbers generated for a frequently queried keyword, the corresponding fragment storage nodes, the data table names and the fragment file paths, and store the resulting global index table in the distributed system. Then, each time the method of FIG. 7 is executed, step S2010 is performed after the start step S2000 to determine whether the queried keyword is a frequently queried keyword. If it is not, the method proceeds to step S2100 and performs the connection query as described above. Otherwise, the method proceeds directly to step S2020, in which the previously generated global index table is accessed and the query data files already generated for the keyword are loaded directly. This reduces the query time for frequently queried keywords.
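The branching of FIG. 7 described above can be sketched as a small dispatch function in Python. All names are hypothetical, the index is a plain dictionary, and the extra check that an index entry already exists before taking the S2020 branch is an assumption added so the first query for a frequent keyword still runs the full connection query.

```python
def run_query(keyword, is_frequent, index, full_join, load_from_index):
    """Dispatch one query along the branches of FIG. 7."""
    entries = index.get(keyword)
    # S2010: is this a frequently queried keyword?
    # (Also requiring an existing index entry is an assumption.)
    if is_frequent(keyword) and entries is not None:
        return load_from_index(entries)       # S2020: load saved fragment files
    result, new_entries = full_join(keyword)  # S2100-S2400: full connection query
    if is_frequent(keyword):
        index[keyword] = new_entries          # S2410: build the global index
    return result


calls = []


def full_join(keyword):
    # Stand-in for the full connection query; returns a result and index entries.
    calls.append("join")
    return f"result({keyword})", [f"/frag/{keyword}/0"]  # hypothetical path


def load_from_index(entries):
    # Stand-in for loading fragment data files listed in the index.
    calls.append("load")
    return f"loaded({entries[0]})"


index = {}
frequent = {"user_id"}
print(run_query("user_id", frequent.__contains__, index, full_join, load_from_index))
print(run_query("user_id", frequent.__contains__, index, full_join, load_from_index))
print(calls)  # ['join', 'load']
```

The second call skips the connection query entirely, which is where the time saving for frequently queried keywords comes from.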

[0070] The method and system provided by the present invention are based on a distributed file system and take full account of the distributed storage characteristics of massive data. By using optimal fragmentation and directed transmission, fragment data with the same key-value interval (or the same hash function result) is connected on the same reduction work node, which reduces data transmission inside the distributed system and thus improves query efficiency. The present invention can also keep the records of used fragments in the system cache, so that subsequent queries are highly efficient.

[0071] The method and system of the present invention support a variety of data formats, as well as data files and data tables of different orders of magnitude, and are widely applicable in distributed data warehouses.

[0072] In addition, the fragmented connection query scheme for distributed environments proposed by the present invention is particularly advantageous when applied to a parallel relational database. This is because a parallel relational database is generally deployed within a high-speed local area network and its database server nodes have comparable performance. In such a physical environment, the fragmented connection query scheme of the present invention can be executed more efficiently.

[0073] Although the present invention has been illustrated above with reference to its preferred embodiments, those skilled in the art will understand that various modifications, substitutions and changes can be made to it without departing from its spirit and scope. The present invention should therefore be defined not by the above embodiments but by the appended claims and their equivalents.

Claims (22)

1. A connection query system for a distributed data warehouse, comprising a master node, mapping work nodes and reduction work nodes, wherein: the master node calculates a fragment size according to the size of a data table and system performance, allocates data blocks to the mapping work nodes based on the calculated fragment size, and specifies fragmentation mapping rules and summarization rules for the mapping work nodes; each mapping work node maps the query keywords in its data block to corresponding fragment numbers according to the fragmentation mapping rules, and transmits data with the same fragment number to a designated reduction work node according to the summarization rules; and each reduction work node receives the data transmitted by the mapping work nodes, merges data with the same fragment number and performs the connection according to the query keyword, to obtain a final connection query result.
2. The connection query system for a distributed data warehouse according to claim 1, wherein the master node comprises: a fragment size and number calculation unit, which obtains physical resource configuration information of each mapping work node in the system, and calculates the fragment size corresponding to each mapping work node and the number of fragments corresponding to each data table according to the size of the data table and the obtained physical resource configuration information; and a fragmentation processing scheduler, which divides each data table according to the corresponding fragment size for transmission to each mapping work node, and specifies the fragmentation mapping rules and summarization rules for each mapping work node.
3. The connection query system for a distributed data warehouse according to claim 1, wherein the mapping work node comprises: a storage unit, which receives the data blocks transmitted by the master node; a mapping and fragment processing unit, which maps the query keywords in the data blocks to specific fragment numbers according to the fragmentation mapping rules and stores data with the same fragment number in the same data set; and a directed distribution unit, which transmits the data stored in each data set to the designated reduction work node according to the summarization rules.
4. The connection query system for a distributed data warehouse according to claim 1, wherein the reduction work node comprises: a reduction unit, which receives the data transmitted from the mapping work nodes and merges data with the same fragment number to form fragment data files; a table fragment storage unit, which stores the fragment data files; and a connection and sorting unit, which connects and sorts the data in the fragment data files according to the query keyword, to output the final connection query result.
5. The connection query system for a distributed data warehouse according to claim 1, wherein the fragmentation mapping rules include fragment mapping according to the value interval of the query keyword.
6. The connection query system for a distributed data warehouse according to claim 1, wherein the fragmentation mapping rules include fragment mapping based on the hash function value of the query keyword.
7. The connection query system for a distributed data warehouse according to claim 2, wherein the physical resource configuration information of a mapping work node includes the free memory capacity of the mapping work node.
8. The connection query system for a distributed data warehouse according to claim 7, wherein the fragment size and number calculation unit calculates the quotient of the free memory capacity of each mapping work node and its number of mapping tasks, compares the calculated quotient with the upper limit of the virtual machine memory of that mapping work node, and takes the smaller of the two as the fragment size corresponding to that mapping work node.
9. The connection query system for a distributed data warehouse according to claim 4, wherein the reduction work node further comprises: a global fragment index unit, which builds a global index table for a specific query keyword, the global index table including fragment numbers, the corresponding fragment storage nodes, data table names and paths of the fragment data files.
10. The connection query system for a distributed data warehouse according to claim 9, wherein, when a query keyword for which a global index table has already been built is queried again, the global index table is accessed and the corresponding fragment data files are loaded directly.
11. The connection query system for a distributed data warehouse according to any one of claims 1-10, wherein the distributed data warehouse comprises a parallel relational database.
12. A connection query method for a distributed data warehouse, comprising: at a master node, calculating a fragment size according to the size of a data table and system performance, allocating data blocks to mapping work nodes based on the calculated fragment size, and specifying fragmentation mapping rules and summarization rules for the mapping work nodes; at each mapping work node, mapping the query keywords in its data block to corresponding fragment numbers according to the fragmentation mapping rules, and transmitting data with the same fragment number to a designated reduction work node according to the summarization rules; and at each reduction work node, receiving the data transmitted by the mapping work nodes, merging data with the same fragment number and performing the connection according to the query keyword, to obtain a final connection query result.
13. The connection query method for a distributed data warehouse according to claim 12, wherein the steps performed at the master node comprise: obtaining physical resource configuration information of each mapping work node in the system, and calculating the fragment size corresponding to each mapping work node and the number of fragments corresponding to each data table according to the size of the data table and the obtained physical resource configuration information; and dividing each data table according to the corresponding fragment size for transmission to each mapping work node, and specifying the fragmentation mapping rules and summarization rules for each mapping work node.
14. The connection query method for a distributed data warehouse according to claim 12, wherein the steps performed at the mapping work node comprise: receiving the data blocks transmitted by the master node; mapping the query keywords in the data blocks to specific fragment numbers according to the fragmentation mapping rules, and storing data with the same fragment number in the same data set; and transmitting the data stored in each data set to the designated reduction work node according to the summarization rules.
15. The connection query method for a distributed data warehouse according to claim 12, wherein the steps performed at the reduction work node comprise: receiving the data transmitted from the mapping work nodes, and merging data with the same fragment number to form fragment data files; storing the fragment data files; and connecting and sorting the data in the fragment data files according to the query keyword, to output the final connection query result.
16. The connection query method for a distributed data warehouse according to claim 12, wherein the fragmentation mapping rules include fragment mapping according to the value interval of the query keyword.
17. The connection query method for a distributed data warehouse according to claim 12, wherein the fragmentation mapping rules include fragment mapping based on the hash function value of the query keyword.
18. The connection query method for a distributed data warehouse according to claim 13, wherein the physical resource configuration information of a mapping work node includes the free memory capacity of the mapping work node.
19. The connection query method for a distributed data warehouse according to claim 18, wherein, at the master node, the quotient of the free memory capacity of each mapping work node and its number of mapping tasks is calculated, the calculated quotient is compared with the upper limit of the virtual machine memory of that mapping work node, and the smaller of the two is taken as the fragment size corresponding to that mapping work node.
20. The connection query method for a distributed data warehouse according to claim 15, wherein the following step is further performed at the reduction work node: building a global index table for a specific query keyword, the global index table including fragment numbers, the corresponding fragment storage nodes, data table names and paths of the fragment data files.
21. The connection query method for a distributed data warehouse according to claim 20, wherein, when a query keyword for which a global index table has already been built is queried again, the corresponding fragment data files are loaded directly by accessing the global index table.
22. The connection query method for a distributed data warehouse according to any one of claims 12-21, wherein the distributed data warehouse comprises a parallel relational database.
CN201010556490.5A 2010-11-17 2010-11-17 Connection query system and method for distributed data warehouse CN102467570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010556490.5A CN102467570B (en) 2010-11-17 2010-11-17 Connection query system and method for distributed data warehouse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010556490.5A CN102467570B (en) 2010-11-17 2010-11-17 Connection query system and method for distributed data warehouse

Publications (2)

Publication Number Publication Date
CN102467570A true CN102467570A (en) 2012-05-23
CN102467570B CN102467570B (en) 2014-03-12

Family

ID=46071213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010556490.5A CN102467570B (en) 2010-11-17 2010-11-17 Connection query system and method for distributed data warehouse

Country Status (1)

Country Link
CN (1) CN102467570B (en)

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309747A (en) * 2013-05-20 2013-09-18 青岛海信传媒网络技术有限公司 Method and device for allocation of code file statistics task
CN103345514A (en) * 2013-07-09 2013-10-09 焦点科技股份有限公司 Streamed data processing method in big data environment
CN103399943A (en) * 2013-08-14 2013-11-20 曙光信息产业(北京)有限公司 Communication method and communication device for parallel query of clustered databases
CN103412897A (en) * 2013-07-25 2013-11-27 中国科学院软件研究所 Parallel data processing method based on distributed structure
CN103488778A (en) * 2013-09-27 2014-01-01 华为技术有限公司 Data searching method and device
CN103595651A (en) * 2013-10-15 2014-02-19 北京航空航天大学 Distributed data stream processing method and system
CN103970520A (en) * 2013-01-31 2014-08-06 国际商业机器公司 Resource management method and device in MapReduce framework and framework system with device
CN104008199A (en) * 2014-06-16 2014-08-27 北京京东尚科信息技术有限公司 Data inquiring method
CN104036039A (en) * 2014-06-30 2014-09-10 浪潮(北京)电子信息产业有限公司 Parallel processing method and system of data
CN104050202A (en) * 2013-03-15 2014-09-17 伊姆西公司 Method and device for searching in database
CN104050291A (en) * 2014-06-30 2014-09-17 浪潮(北京)电子信息产业有限公司 Parallel processing method and system for account balance data
CN104063376A (en) * 2013-03-18 2014-09-24 阿里巴巴集团控股有限公司 Multi-dimensional grouping operation method and system
WO2014180411A1 (en) * 2013-12-17 2014-11-13 中兴通讯股份有限公司 Distributed index generation method and device
CN104166661A (en) * 2013-05-20 2014-11-26 方正宽带网络服务股份有限公司 Data storage system and method
CN104376055A (en) * 2014-11-04 2015-02-25 国电南瑞科技股份有限公司 Large-scale data comparison method based on fragmentation technology
CN104408159A (en) * 2014-12-04 2015-03-11 曙光信息产业(北京)有限公司 Data correlating, loading and querying method and device
CN104615487A (en) * 2015-01-12 2015-05-13 中国科学院计算机网络信息中心 System and method for optimizing parallel tasks
CN104901783A (en) * 2014-03-06 2015-09-09 携程计算机技术(上海)有限公司 Data transmitting method and server system
CN105045871A (en) * 2015-07-15 2015-11-11 国家超级计算深圳中心(深圳云计算中心) Data aggregation query method and apparatus
CN105183901A (en) * 2015-09-30 2015-12-23 北京京东尚科信息技术有限公司 Method and device for reading database table through data query engine
CN105404638A (en) * 2015-09-28 2016-03-16 高新兴科技集团股份有限公司 Method for solving correlated query of distributed cross-database fragment table
CN105808560A (en) * 2014-12-29 2016-07-27 腾讯科技(深圳)有限公司 Same-machine multi-service retrieval method and system
CN105893497A (en) * 2016-03-29 2016-08-24 杭州数梦工场科技有限公司 Task processing method and device
CN103488778B (en) * 2013-09-27 2016-11-30 华为技术有限公司 A kind of data query method and device
CN106168963A (en) * 2016-06-30 2016-11-30 北京金山安全软件有限公司 The processing method of real-time streaming data, device and server
CN106202261A (en) * 2016-06-29 2016-12-07 浪潮(北京)电子信息产业有限公司 The distributed approach of a kind of data access request and engine
CN106446039A (en) * 2016-08-30 2017-02-22 北京航空航天大学 Aggregation type big data search method and device
CN106446168A (en) * 2016-09-26 2017-02-22 北京赛思信安技术股份有限公司 Oriented distribution data warehouse high efficiency load client end realization method
CN106599052A (en) * 2016-11-15 2017-04-26 上海跬智信息技术有限公司 Data query system based on ApacheKylin, and method thereof
CN106649418A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 High-performance method for importing data into distributed database through direct connection of fragments in driver
WO2017084020A1 (en) * 2015-11-16 2017-05-26 华为技术有限公司 Method and apparatus for model parameter fusion
CN106844405A (en) * 2015-12-07 2017-06-13 杭州海康威视数字技术股份有限公司 Data query method and apparatus
CN106874272A (en) * 2015-12-10 2017-06-20 华为技术有限公司 A kind of distributed connection method and system
CN106933934A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 The connection method of tables of data and device
CN106933933A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 The processing method and processing device of data table information
CN107450855A (en) * 2017-08-08 2017-12-08 山东浪潮云服务信息科技有限公司 A kind of model for distributed storage variable data distribution method and system
CN107506394A (en) * 2017-07-31 2017-12-22 武汉工程大学 A kind of optimization method for eliminating big data normative connection connection redundancy
CN107818100A (en) * 2016-09-12 2018-03-20 杭州海康威视数字技术股份有限公司 SQL statement execution method and device
CN108038239A (en) * 2017-12-27 2018-05-15 中科鼎富(北京)科技发展有限公司 Standardized management method, device and server for heterogeneous data sources

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1178407A2 (en) * 2000-06-02 2002-02-06 Compaq Computer Corporation Architecture for parallel distributed table driven I/O mapping
CN101183368A (en) * 2007-12-06 2008-05-21 华南理工大学 Method and system for distributed computation and querying of massive data in online analytical processing
CN101535944A (en) * 2005-08-15 2009-09-16 谷歌公司 Scalable user clustering based on set similarity
US20090303880A1 (en) * 2008-06-09 2009-12-10 Microsoft Corporation Data center interconnect and traffic engineering
CN101799748A (en) * 2009-02-06 2010-08-11 中国移动通信集团公司 Method for determining data sample class and system thereof

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970520A (en) * 2013-01-31 2014-08-06 国际商业机器公司 Resource management method and device in MapReduce framework and framework system with device
CN103970520B (en) * 2013-01-31 2017-06-16 国际商业机器公司 Resource management method and device in MapReduce framework, and framework system with the device
US9582334B2 (en) 2013-01-31 2017-02-28 International Business Machines Corporation Resource management in MapReduce architecture and architectural system
US9720740B2 (en) 2013-01-31 2017-08-01 International Business Machines Corporation Resource management in MapReduce architecture and architectural system
CN104050202A (en) * 2013-03-15 2014-09-17 伊姆西公司 Method and device for searching in database
CN104063376A (en) * 2013-03-18 2014-09-24 阿里巴巴集团控股有限公司 Multi-dimensional grouping operation method and system
CN104166661B (en) * 2013-05-20 2018-05-22 方正宽带网络服务股份有限公司 Data storage system and data storage method
CN104166661A (en) * 2013-05-20 2014-11-26 方正宽带网络服务股份有限公司 Data storage system and method
CN103309747A (en) * 2013-05-20 2013-09-18 青岛海信传媒网络技术有限公司 Method and device for allocation of code file statistics task
CN103309747B (en) * 2013-05-20 2016-09-28 青岛海信传媒网络技术有限公司 Method and device for allocating code file statistics tasks
CN103345514B (en) * 2013-07-09 2016-06-08 焦点科技股份有限公司 Streaming data processing method under big data environment
CN103345514A (en) * 2013-07-09 2013-10-09 焦点科技股份有限公司 Streamed data processing method in big data environment
CN103412897B (en) * 2013-07-25 2017-03-01 中国科学院软件研究所 Parallel data processing method based on distributed structure
CN103412897A (en) * 2013-07-25 2013-11-27 中国科学院软件研究所 Parallel data processing method based on distributed structure
CN103399943A (en) * 2013-08-14 2013-11-20 曙光信息产业(北京)有限公司 Communication method and communication device for parallel query of clustered databases
CN103488778A (en) * 2013-09-27 2014-01-01 华为技术有限公司 Data searching method and device
CN103488778B (en) * 2013-09-27 2016-11-30 华为技术有限公司 Data query method and device
CN103595651A (en) * 2013-10-15 2014-02-19 北京航空航天大学 Distributed data stream processing method and system
CN103595651B (en) * 2013-10-15 2017-02-15 北京航空航天大学 Distributed data stream processing method and system
WO2014180411A1 (en) * 2013-12-17 2014-11-13 中兴通讯股份有限公司 Distributed index generation method and device
CN104901783A (en) * 2014-03-06 2015-09-09 携程计算机技术(上海)有限公司 Data transmitting method and server system
CN104901783B (en) * 2014-03-06 2019-06-18 携程计算机技术(上海)有限公司 Data transmission method and server system
CN104008199A (en) * 2014-06-16 2014-08-27 北京京东尚科信息技术有限公司 Data inquiring method
CN104008199B (en) * 2014-06-16 2017-05-31 北京京东尚科信息技术有限公司 Data query method
CN104050291B (en) * 2014-06-30 2017-11-10 浪潮(北京)电子信息产业有限公司 Parallel processing method and system for account balance data
CN104050291A (en) * 2014-06-30 2014-09-17 浪潮(北京)电子信息产业有限公司 Parallel processing method and system for account balance data
CN104036039A (en) * 2014-06-30 2014-09-10 浪潮(北京)电子信息产业有限公司 Parallel processing method and system of data
CN104036039B (en) * 2014-06-30 2017-09-29 浪潮(北京)电子信息产业有限公司 Parallel processing method and system for data
CN104376055A (en) * 2014-11-04 2015-02-25 国电南瑞科技股份有限公司 Large-scale data comparison method based on fragmentation technology
CN104376055B (en) * 2014-11-04 2017-08-29 国电南瑞科技股份有限公司 Large-scale model data comparison method based on fragmentation technology
CN104408159A (en) * 2014-12-04 2015-03-11 曙光信息产业(北京)有限公司 Data correlating, loading and querying method and device
CN104408159B (en) * 2014-12-04 2018-01-16 曙光信息产业(北京)有限公司 Data correlating, loading and querying method and device
CN105808560A (en) * 2014-12-29 2016-07-27 腾讯科技(深圳)有限公司 Same-machine multi-service retrieval method and system
CN104615487B (en) * 2015-01-12 2019-03-08 中国科学院计算机网络信息中心 Parallel task optimization system and method
CN104615487A (en) * 2015-01-12 2015-05-13 中国科学院计算机网络信息中心 System and method for optimizing parallel tasks
CN105045871A (en) * 2015-07-15 2015-11-11 国家超级计算深圳中心(深圳云计算中心) Data aggregation query method and apparatus
CN105045871B (en) * 2015-07-15 2018-09-28 国家超级计算深圳中心(深圳云计算中心) Data aggregate querying method and device
CN105404638A (en) * 2015-09-28 2016-03-16 高新兴科技集团股份有限公司 Method for solving correlated query of distributed cross-database fragment table
CN105183901A (en) * 2015-09-30 2015-12-23 北京京东尚科信息技术有限公司 Method and device for reading database table through data query engine
CN106649418A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 High-performance method for importing data into distributed database through direct connection of fragments in driver
WO2017084020A1 (en) * 2015-11-16 2017-05-26 华为技术有限公司 Method and apparatus for model parameter fusion
CN106844405A (en) * 2015-12-07 2017-06-13 杭州海康威视数字技术股份有限公司 Data query method and apparatus
CN106844405B (en) * 2015-12-07 2019-10-22 杭州海康威视数字技术股份有限公司 Data query method and apparatus
CN106874272A (en) * 2015-12-10 2017-06-20 华为技术有限公司 Distributed connection method and system
CN106933933B (en) * 2015-12-31 2019-12-10 北京国双科技有限公司 Data table information processing method and device
CN106933934A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 Data table connection method and device
CN106933933A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 Data table information processing method and device
CN105893497A (en) * 2016-03-29 2016-08-24 杭州数梦工场科技有限公司 Task processing method and device
CN106202261A (en) * 2016-06-29 2016-12-07 浪潮(北京)电子信息产业有限公司 Distributed approach and engine for data access requests
CN106168963A (en) * 2016-06-30 2016-11-30 北京金山安全软件有限公司 Method, device and server for processing real-time streaming data
CN106168963B (en) * 2016-06-30 2019-06-11 北京金山安全软件有限公司 Method, device and server for processing real-time streaming data
CN106446039A (en) * 2016-08-30 2017-02-22 北京航空航天大学 Aggregation type big data search method and device
CN107818100A (en) * 2016-09-12 2018-03-20 杭州海康威视数字技术股份有限公司 SQL statement execution method and device
CN107818100B (en) * 2016-09-12 2019-12-20 杭州海康威视数字技术股份有限公司 SQL statement execution method and device
CN106446168A (en) * 2016-09-26 2017-02-22 北京赛思信安技术股份有限公司 Method for implementing a high-efficiency loading client for a distributed data warehouse
CN106446168B (en) * 2016-09-26 2019-11-01 北京赛思信安技术股份有限公司 Method for implementing a loading client for a distributed data warehouse
CN106599052A (en) * 2016-11-15 2017-04-26 上海跬智信息技术有限公司 Data query system based on Apache Kylin, and method thereof
CN107506394A (en) * 2017-07-31 2017-12-22 武汉工程大学 A kind of optimization method for eliminating big data normative connection connection redundancy
CN107450855A (en) * 2017-08-08 2017-12-08 山东浪潮云服务信息科技有限公司 A kind of model for distributed storage variable data distribution method and system
CN108038239A (en) * 2017-12-27 2018-05-15 中科鼎富(北京)科技发展有限公司 Standardized management method, device and server for heterogeneous data sources

Also Published As

Publication number Publication date
CN102467570B (en) 2014-03-12

Similar Documents

Publication Publication Date Title
US9613104B2 (en) System and method for building a point-in-time snapshot of an eventually-consistent data store
US8261020B2 (en) Cache enumeration and indexing
US8838595B2 (en) Operating on objects stored in a distributed database
EP2577512B1 (en) Data loading systems and methods
US9460185B2 (en) Storage device selection for database partition replicas
US8386532B2 (en) Mechanism for co-located data placement in a parallel elastic database management system
Yang et al. Druid: A real-time analytical data store
Karun et al. A review on hadoop—HDFS infrastructure extensions
US9262458B2 (en) Method and system for dynamically partitioning very large database indices on write-once tables
US10019459B1 (en) Distributed deduplication in a distributed system of hybrid storage and compute nodes
Jindal et al. Trojan data layouts: right shoes for a running elephant
US20120109892A1 (en) Partitioning online databases
Escriva et al. HyperDex: A distributed, searchable key-value store
US9590915B2 (en) Transmission of Map/Reduce data in a data center
US9317536B2 (en) System and methods for mapping and searching objects in multidimensional space
US8041685B2 (en) Method of changing system configuration in shared-nothing database management system
JP5765416B2 (en) Distributed storage system and method
US8965921B2 (en) Data management and indexing across a distributed database
EP1148430A2 (en) Optimization of a star join operation using a bitmap index structure
US9477741B2 (en) Systems and methods for redistributing data in a relational database
Zhang et al. Adapting skyline computation to the mapreduce framework: Algorithms and experiments
US20120303633A1 (en) Systems and methods for querying column oriented databases
CN102663117B (en) OLAP (On Line Analytical Processing) inquiry processing method facing database and Hadoop mixing platform
US8762407B2 (en) Concurrent OLAP-oriented database query processing method
CA2512312A1 (en) Metadata based file switch and switched file system

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
CF01 Termination of patent right due to non-payment of annual fee