CN102375853A - Distributed database system, method for building index therein and query method - Google Patents


Info

Publication number
CN102375853A
Authority
CN
China
Prior art keywords
index
data
query
plurality
data block
Prior art date
Application number
CN2010102611675A
Other languages
Chinese (zh)
Inventor
周大
孙少陵
张卫平
张松波
罗治国
郭磊涛
钱岭
齐骥
Original Assignee
中国移动通信集团公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国移动通信集团公司
Priority to CN2010102611675A priority Critical patent/CN102375853A/en
Publication of CN102375853A publication Critical patent/CN102375853A/en


Abstract

The invention discloses a distributed database system and a method for building an index in the distributed database system. The distributed database system comprises a plurality of distributed storage units, an index memory, a parser, an index query module, and a parallel processing engine. The distributed storage units store a plurality of data block files in partitions; the index memory stores the indexes of the data block files; the parser parses a query statement initiated by a user and selects a corresponding query index; the index query module searches the indexes of the data block files according to the selected query index to obtain at least one query data block set, where each query data block set comprises an index key value and records the position information of the data block files that correspond to that index key value; and the parallel processing engine splits the at least one query data block set and initiates parallel scan tasks.

Description

Distributed database system, method for building an index therein, and query method

Technical Field

[0001] The present application relates to a distributed database system, a method for building an index therein, and a query method.

Background

[0002] Storing large volumes of structured data in a database, in particular a relational database, is a common data management approach. The simple and intuitive practice is to deploy a mature database management system, define the tables and data structures through a standard interface such as SQL, and import or insert the collected data into the corresponding tables. The database system builds indexes on the data as needed for fast queries. At query time, a suitable index can be chosen according to the query conditions to optimize query performance.

[0003] In managing large-scale data, the key factors affecting query performance are the amount of data accessed during a query and disk I/O. Indexing is an important technique for improving query performance in database practice. An index is generally much smaller than the actual data and can be organized into a structure that is easy to search, such as a tree or a hash table. Searching the index first to filter out most of the data that need not be accessed, rather than scanning the actual data directly, effectively reduces the amount of data to be accessed and the disk I/O. The way data is organized and stored is also very important for building an effective index, and different indexing techniques place different requirements on data organization and storage. The index types commonly used in database systems, such as B-TREE, HASH, and BITMAP indexes, each suit different situations, but they all work in essentially the same way: the queried key value is used to quickly locate the storage position of the data records.

[0004] In many industries today, the volume of data generated and accumulated is enormous, reaching hundreds of terabytes or even petabytes. This data keeps growing over time, and the rate at which it is generated also keeps rising as business develops. Examples include telecom service CDR (Call Detail Record) data, Internet of Things sensor data, financial transaction data, and Internet log data.

[0005] Massive data has at least one of the following characteristics:

[0006] (1) The data is mostly time-series data: it carries time stamps and is generated and stored in, or roughly in, chronological order.

[0007] (2) The data is structured or semi-structured, and its structure may change.

[0008] (3) The data is generated very quickly (for example, one system generates 2 TB, or 5 billion records, per day), and the data volume keeps growing.

[0009] (4) Values in many attribute fields have a high repetition rate.

[0010] The management and use of massive data also has the following characteristics:

[0011] (1) The data must be retained for a fairly long period (for example, half a year); older data is discarded or backed up to other media.

[0012] (2) Old historical data must remain accessible but is seldom accessed; for cost reasons, apart from storage it should not consume much runtime resource (such as CPU, memory, or bandwidth).

[0013] (3) Historical data generally does not need to be modified; once stored, it is only read.

[0014] (4) Queries on the data usually specify a time-range condition.

[0015] (5) For the same data set, batch data analysis and mining operations often need to be supported in addition to fast queries. The same analysis or mining operation is generally not repeated many times on the same batch of data.

[0016] It has become very difficult for users to retrieve the data they want from massive data sets with existing databases and their indexing methods. Databases often cannot store such a huge volume of data, and they are poorly suited to semi-structured data or to changes in data structure. For massive data, a dense and complete index is not only expensive and slow to build and maintain; the index itself is also very large, so the data write speed can hardly keep up with the rate at which data is generated.

Summary of the Invention

[0017] In one aspect, the present application discloses a distributed database system, comprising:

[0018] a plurality of distributed storage units that store a plurality of data block files in partitions;

[0019] an index memory that stores indexes of the plurality of data block files;

[0020] a parser that parses a query statement initiated by a user and selects a corresponding query index;

[0021] an index query module that, according to the selected query index, searches the indexes of the plurality of data block files to obtain at least one query data block set, the query data block set comprising an index key value and recording position information of those data block files, among the plurality of data block files, that correspond to the index key value; and

[0022] a parallel processing engine that splits the at least one query data block set and initiates parallel scan tasks.

[0023] In one embodiment of the present application, the distributed database system defines a basic structure for organizing and storing data, into which data records collected as a stream or obtained in batches are written sequentially. The basic structure comprises data files and corresponding data block index files. Each data file can hold many compressed data blocks in sequence, and each data block can hold many data records in sequence. The data block size can be chosen to suit the average record length, for example 1 MB; the data file size can likewise be defined flexibly, for example 1 GB. Data blocks are compressed with a common compression algorithm to save space. Each data file is accompanied by a very lightweight data block index file used to quickly locate a specified data block. The data block index is generally generated while the data file is being written, and it can also be rebuilt from an existing data file. The present application does not require data blocks and their index to be stored in separate files; they may also be stored in the same file.
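The block-oriented storage structure described above can be illustrated with a minimal sketch. It is not the implementation of the present application: the function names `write_blocks` and `read_block`, the newline record delimiter, and the use of zlib as the "common compression algorithm" are all assumptions made for illustration.

```python
import zlib


def write_blocks(records, block_size=1024 * 1024):
    """Pack records sequentially into compressed blocks.

    Returns (data_file_bytes, block_index). block_index[i] describes
    block i as (offset, raw_bytes, compressed_bytes, record_count).
    """
    data = bytearray()          # stands in for the data file
    index = []                  # stands in for the data block index file
    buf, count = bytearray(), 0

    def flush():
        nonlocal buf, count
        if count == 0:
            return
        compressed = zlib.compress(bytes(buf))
        # The index entry is produced while the data file is written.
        index.append((len(data), len(buf), len(compressed), count))
        data.extend(compressed)
        buf, count = bytearray(), 0

    for rec in records:
        buf.extend(rec + b"\n")  # newline-delimited records (assumption)
        count += 1
        if len(buf) >= block_size:
            flush()
    flush()                      # write the final, possibly short, block
    return bytes(data), index


def read_block(data, entry):
    """Decompress one block and return its records."""
    offset, _raw, compressed, _count = entry
    raw = zlib.decompress(data[offset:offset + compressed])
    return raw.split(b"\n")[:-1]
```

Because a block is flushed only once its buffer reaches `block_size`, the uncompressed size of a block is equal to or slightly larger than the configured block size.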

[0024] The index provided by the present application is built on top of the data block index described above. It is an approximate, sparse index structure: an index key value does not locate the storage position of each individual record but only points, approximately, to all the data blocks in which the key value has appeared, and for each such data block the index records only the position at which the key value first appears. Because each data block contains many records and the same key value may repeat many times within a data block, an index built this way can be smaller by orders of magnitude and can be built much faster. It also avoids the unevenness in the index that a severely skewed key distribution would otherwise cause. For example, if a data block holds 10,000 records but only 100 unique key values, only 100 index entries are produced.

[0025] For attribute fields whose value range is large but finite and discrete, such as the telephone number in telecom service CDRs or the user ID in other data sets, building an index is very effective for queries on that field. Within a data block, no matter how many times a particular value of the field appears, only the position of its first occurrence is recorded. The index structure has the form <Key, BlockLocation>. Because values of such fields repeat at a high rate, the index is very small and sparse. Such attribute fields are quite common and often need to be indexed. A composite index can also be built on multiple attribute fields.
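The sparse, block-level index can be sketched as follows. This is a minimal illustration only: the function name `build_sparse_index` and the representation of a BlockLocation as a `(block_id, first_position)` pair are assumptions, not details taken from the present application.

```python
def build_sparse_index(block_id, keys):
    """Index one data block: each distinct key maps to the block and to the
    position of the key's first occurrence inside that block."""
    index = {}
    for position, key in enumerate(keys):
        # setdefault keeps the first occurrence and ignores later repeats.
        index.setdefault(key, (block_id, position))
    return index


# A block of 10,000 records that contains only 100 distinct key values.
keys_in_block = [i % 100 for i in range(10_000)]
block_index = build_sparse_index(0, keys_in_block)
print(len(block_index))  # 100
```

The number of index entries depends on the number of distinct keys in the block, not on the number of records, which is why the index stays small when values repeat heavily.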

[0026] Although this indexing strategy is built only on data blocks and so greatly reduces the size of the index, it adds at query time the cost of a sequential scan within a limited number of data blocks. When handling massive data, the benefit of this trade-off far outweighs that of building a heavyweight index. When parallel processing is used in a distributed system, this cost drops to an acceptably low level.

[0027] In addition, the present application discloses a method for building an index in a distributed database system, comprising:

[0028] collecting data to be stored;

[0029] compressing the data into a plurality of data blocks and determining the corresponding data block indexes;

[0030] storing the compressed data blocks, in the form of files and in partitions, in a plurality of distributed storage units of the distributed database system; and

[0031] building an index file for the stored data blocks, wherein each index in the index file comprises an index key value and position information of the data block.

[0032] Because the index data built in this way is small, it can be stored in a relational database with a B-TREE index built on its key, which supports both range queries and point queries on the attribute field. The index data can also be stored in a distributed key-value storage system for better scalability and stability.

[0033] As an optional supplement to this indexing strategy, the data files can be partitioned and stored in separate directories, for example by date. This reduces the amount of data accessed by queries that span a wide range (such as several days) and by batch statistical analysis and data mining operations over a specified range (such as several days). After partitioning, the data-block-based index described above can be built per partition. Partitioning can be regarded as a coarse-grained, directory-based index.

[0034] The present application also discloses a query method for use in a distributed database system that includes an index formed by the above method, the query method comprising:

[0035] parsing a query statement and determining the corresponding query index;

[0036] searching the index file according to the selected query index to obtain at least one query data block set; and

[0037] splitting the at least one query data block set and initiating parallel scan tasks according to the position information included in the query data block set.

[0038] In one embodiment, if the query conditions include a partition condition, the list of partitions involved (for example, date partitions) is determined first, which narrows the partitions to be queried. If the query conditions include an indexed attribute field, the index on that field is queried in each relevant partition to obtain a data block set, which further narrows the range of data blocks. If the query conditions include several indexed attribute fields, the corresponding indexes are queried separately to obtain several data block sets, and the intersection or union of these sets is then taken according to the logical relationship between the conditions, such as AND or OR. Finally, parallel scan-and-match operations are launched on the resulting data block set, the results of the match operations are merged, and the merged result is returned as the result of the query.
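The query flow of this embodiment can be sketched as below. The sketch makes several assumptions: data block sets are plain Python sets of block IDs, blocks are held in an in-memory dictionary, and a thread pool stands in for the parallel processing framework. The function names `combine_block_sets`, `scan_block`, and `parallel_scan` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def combine_block_sets(block_sets, op):
    """Combine per-index data block sets with AND (intersection) or OR (union)."""
    sets = [set(s) for s in block_sets]
    if not sets:
        return set()
    if op == "AND":
        return set.intersection(*sets)
    if op == "OR":
        return set.union(*sets)
    raise ValueError(f"unsupported operator: {op}")


def scan_block(records, predicate):
    """Sequentially scan one block and keep the matching records."""
    return [r for r in records if predicate(r)]


def parallel_scan(blocks, block_ids, predicate, workers=4):
    """Scan the selected blocks in parallel and merge the partial results."""
    ordered = sorted(block_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: scan_block(blocks[b], predicate), ordered)
        return [record for part in parts for record in part]


# Toy data: block ID -> list of (user, value) records.
blocks = {0: [("a", 1), ("b", 2)], 1: [("a", 3)], 2: [("c", 4), ("a", 5)]}
candidates = combine_block_sets([{0, 1, 2}, {1, 2}], "AND")
print(parallel_scan(blocks, candidates, lambda r: r[0] == "a"))
```

Because the index is sparse, the block sets only say where a key may be found; the final predicate check during the scan is what removes non-matching records.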

Brief Description of the Drawings

[0039] FIG. 1 shows a basic data storage structure according to one embodiment of the present application.

[0040] FIG. 2 shows a method for building an index in a distributed database system according to one embodiment of the present application.

[0041] FIG. 3 is a schematic diagram of the logical structure of a user number index according to one embodiment of the present application.

[0042] FIG. 4 is a block diagram of a distributed database system according to one embodiment of the present application.

[0043] FIG. 5 shows query processing according to another embodiment of the present application.

Detailed Description

[0044] Exemplary embodiments of the present application are described in detail below with reference to the accompanying drawings.

[0045] The embodiments of the present application are based on a distributed file system. The distributed file system consists of multiple storage and compute nodes, which can be networked PC servers; the number of nodes may even reach several thousand. Data nodes can be added or removed smoothly as capacity requires without interrupting service, and the failure of a few data nodes does not interrupt system service. As described below, file data is split into blocks that are distributed as evenly as possible across the data nodes, and multiple replicas are kept to ensure data reliability. Any file in the file system, and its data distributed across the data nodes, can be accessed through the client API of the distributed file system; reads and writes of file data communicate directly with the relevant data nodes. Such a file system addresses the distributed storage, load balancing, stability, data reliability, scalability, and high throughput required for processing massive data.

[0046] FIG. 1 shows a basic data storage structure 100 according to one embodiment of the present application. The storage structure 100 comprises a data file 111 and a corresponding data block index file 112. Data records are written into the storage structure as a sequential record stream and compressed in units of the user-defined data block size (for example, 1 MB) using a compression algorithm such as GZIP or LZO; the compressed data blocks are written sequentially into the data file 111. In one embodiment, the corresponding index is generated and written into the data block index file 112 while the data file is being written. The user can define the maximum size of a data file (for example, 1 GB).

[0047] Data in the storage structure 110 can be read in two ways. In the first, the ID of the specified data block is used to determine the position of its block index entry in the data block index file 112, and that entry is then used to find the position of the data block in the data file 111. In the second, the data block is read directly from its position in the data file 111, which avoids the cost of reading the data block index file. To reach a specific record ID within a specified data block, the reader locates the data block and then skips sequentially to that record.

[0048] Table 1 shows the data structure of the data block index file 112. The "data block ID" is an implicit parameter and does not appear in the stored index structure. The "block offset" gives the position of the data block in the data file. The "original data bytes" gives the size of the data block before compression, which is generally slightly larger than or equal to the user-defined data block size. The "compressed bytes" gives the storage space the data block actually occupies after compression. The "record count" is a statistic giving the total number of records in the data block. All entries in the data block index file have the same length, so the position of an entry in the file is easily computed from the data block ID. If data blocks must be located by ID even faster, the data block index can be cached in memory.

[0049]

Data block ID | Block offset | Original data bytes | Compressed bytes | Record count

[0050] Table 1
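Because every entry in the data block index file has the same length, the entry for a given data block ID can be found by simple multiplication. The sketch below illustrates this under an assumed binary layout of four unsigned 64-bit little-endian integers per entry; the layout and the names `ENTRY`, `pack_index`, and `lookup` are hypothetical and are not specified by the present application.

```python
import struct

# Assumed entry layout: block offset, original data bytes,
# compressed bytes, record count. The block ID is implicit: it is the
# ordinal position of the entry in the index file.
ENTRY = struct.Struct("<QQQQ")


def pack_index(entries):
    """Serialize index entries into the bytes of a block index file."""
    return b"".join(ENTRY.pack(*entry) for entry in entries)


def lookup(index_bytes, block_id):
    """Return the entry for block_id by computing its fixed position."""
    start = block_id * ENTRY.size
    if block_id < 0 or start + ENTRY.size > len(index_bytes):
        raise IndexError(f"no index entry for block {block_id}")
    return ENTRY.unpack_from(index_bytes, start)


index_file = pack_index([(0, 1_048_600, 310_000, 2_600),
                         (310_000, 1_048_700, 305_000, 2_610)])
offset, raw_bytes, compressed_bytes, record_count = lookup(index_file, 1)
```

The numbers in the usage example are made up for illustration. Caching `index_file` in memory corresponds to the faster lookup option mentioned in the text.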

[0051] A method 200 for building an index in a distributed database system is described below with reference to FIG. 2. For clarity, the process 200 is described using massive telecom service CDR data as an example, but the invention is not limited to this. A telecom service CDR is data generated in a telecom network that records a user's call events. A typical CDR contains the user number, time stamp, service type, failure cause, and much other information, and is about 400 bytes long. For example, about 5 billion records, roughly 2 TB, are generated per day and must be retained for 3 months, that is, 2 TB × 90 = 180 TB of data. Querying the CDR records of a specified user number within a particular time period is a common query requirement, and operators also need to analyze and mine these CDRs in batches.

[0052] In step S201, the CDR data is first collected. This can be done with an existing centralized CDR collection method, or by using parallel processing (MapReduce) to batch-process the raw CDR collection files.

[0053] In step S202, the collected data is compressed into a plurality of data blocks. Each CDR record can, for example, be encoded in a compact encoding format (for example, using a compression algorithm such as GZIP or LZO). The index column of each data block can be determined while the data is compressed.

[0054] The building of the index file is described in step S204; the unclear description above has therefore been deleted.

[0055] Next, in step S203, the compressed data blocks are stored, in the form of files and in partitions, in a plurality of distributed storage units of the distributed database system. For example, the CDR data can be partitioned by date according to its time stamp, so that in the distributed file system data for different dates is stored under different directories. For instance, data for January 3, 2010 is stored in files under the directory /CDR/20100103.
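Date partitioning of this kind amounts to deriving a directory name from each record's time stamp. A minimal sketch follows, assuming time stamps are available as Python `datetime` objects; the function names `partition_dir` and `group_by_partition` are hypothetical, and only the `/CDR/YYYYMMDD` directory pattern comes from the example above.

```python
from collections import defaultdict
from datetime import datetime


def partition_dir(timestamp, root="/CDR"):
    """Return the date-partition directory for a record time stamp."""
    return f"{root}/{timestamp:%Y%m%d}"


def group_by_partition(records):
    """Group (timestamp, payload) records by their partition directory."""
    groups = defaultdict(list)
    for timestamp, payload in records:
        groups[partition_dir(timestamp)].append(payload)
    return dict(groups)


print(partition_dir(datetime(2010, 1, 3, 14, 30)))  # /CDR/20100103
```

A query restricted to certain dates then only needs to open the matching directories, which is why the text treats partitioning as a coarse-grained index.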

[0056] Then, in step S204, an index file is built for the stored data blocks, wherein each index in the index file comprises an index key value and position information of the data block. In this embodiment, the user number is a finite, discrete attribute field with a high repetition rate. The total number of user numbers in the whole data set over a given period is fixed, and the CDR records of one user number appear in only a small, limited number of data blocks. Within a data block, no matter how many times a user number appears, only the position of its first occurrence is recorded. The index data structure has the form <user number, BlockLocations>, where a BlockLocation directly records the position of the data block in a particular file and may also record information such as the size of the data block. Alternatively, the index data can simply record the data block ID within a particular file; at query time the data block index file of that file must then be read first, which adds one disk seek and one I/O.

[0057] Step S204 can be performed by using parallel processing (MapReduce) to batch-scan the files newly added to each partition. It can also be performed together with step S203 to avoid an extra disk scan. The index data produced in this step is stored by partition in a distributed database storage system. In one embodiment, a storage system similar to Google Bigtable can be used to store the index data. Indexes for different partitions are stored in different column groups; for example, the index for partition 20100103 is stored in column group 20100103.

[0058] FIG. 3 is a schematic diagram of the logical structure of the user number index built by the indexing method 200. The user number serves as the index key 301, whose values cover every user number that has appeared in the whole data set; for example, if 10 million user numbers have appeared in total, the index has 10 million rows. Each date partition 302 contains a number of files 303. For a given user number, the index records only the BlockLocations 304 of the data blocks in which it has appeared in particular files. Because the CDR records produced by a given user are highly scattered, and a user may have no records at all during some period, the logical structure of this index is very sparse. In the storage structure of the index, empty cells 305 occupy no storage space, so the total index size stays small.
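The sparse row, column-group, and cell layout described above can be imitated with nested dictionaries, in which a cell exists only once something has been stored in it. This is an illustrative sketch, not the storage system of the embodiment: the class name `SparseIndex`, its methods, and the file names used in the example are hypothetical.

```python
from collections import defaultdict


class SparseIndex:
    """user number -> date partition -> file -> list of block locations."""

    def __init__(self):
        self._rows = defaultdict(
            lambda: defaultdict(lambda: defaultdict(list)))

    def add(self, key, partition, file_name, block_location):
        """Record that key occurs in the given block of the given file."""
        locations = self._rows[key][partition][file_name]
        if block_location not in locations:
            locations.append(block_location)

    def lookup(self, key, partitions):
        """Return {(partition, file, block_location)} for the key,
        restricted to the requested partitions."""
        row = self._rows.get(key, {})
        return {
            (partition, file_name, location)
            for partition in partitions
            for file_name, locations in row.get(partition, {}).items()
            for location in locations
        }


index = SparseIndex()
index.add("13500000002", "20100103", "file1", 0)
index.add("13500000002", "20100104", "file2", 7)
hits = index.lookup("13500000002", ["20100103", "20100104"])
```

`lookup` uses `.get` rather than subscripting so that querying an absent key or partition does not create empty cells, which mirrors the property that empty cells occupy no storage.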

[0059] FIG. 4 is a block diagram of a distributed database system 400 according to one embodiment of the present application. In this system framework, data files are stored in a distributed file system 410, which consists of multiple storage unit nodes made up of networked PC servers. Structurally, the distributed file system 410 comprises one master unit (not shown) and multiple data storage units. The file system 410 splits large files into blocks (for example, 64 MB each), distributes the blocks evenly across different storage unit nodes, and stores multiple replicas of each block (for example, 3 replicas). On a storage unit node, a data block can be stored on the local disk, for example as a local Linux file. The master unit provides the unified file system namespace metadata and coordinates the whole cluster, while the data storage units store the data blocks in a distributed manner. Storing data through a master unit in a distributed system is prior art and is not described further here.

[0060] A parallel processing platform (MapReduce framework) 420 can be deployed in the same cluster as the distributed file system 410 and is responsible for parallel processing during index building, data query, and data analysis and mining.

[0061] Index data files are stored in an index memory 430. This embodiment stores the index in a distributed storage system similar to the Google Bigtable model, with a B-TREE index built on the index key to support fast lookup. The index memory 430 can also be deployed in the same cluster as the distributed file system 410 and the parallel processing platform 420. The index data files can, for example, take the forms shown in Table 1 and FIG. 3 above.

[0062] An execution engine 440 is mainly responsible for executing query operations and may comprise a parser (for example, an SQL parser) 440-1, an index query module 440-2, and a parallel processing engine 440-3. The parser 440-1 parses operation statements, such as query statements, from the user interface 450 and selects the corresponding query index. The index query module 440-2 queries the index to obtain a narrowed data scan range, such as an indexed data block set; specifically, it can search the indexes of the plurality of data block files in the index memory 430 according to the selected query index to obtain at least one query data block set. The parallel processing engine 440-3 logically splits the data range to be scanned and initiates parallel processing tasks.

[0063] After processing the parallel tasks, the parallel processing platform 420 merges the results and returns them to the query client.

[0064] Query processing 500 according to one embodiment of the present application is described below with reference to FIG. 5, using as an example a query for the CDR records of a user number (such as 13500000002) on two particular days (such as 20100103 and 20100104). For purposes of illustration, the process 500 is described using the system 400 shown in FIG. 4; however, query processing 500 is not limited to the system shown in FIG. 4.

[0065] First, in step S501, a user initiates a query statement (e.g., an SQL query statement) through the user interface 450. Next, in step S502, the parser 440-1 parses the query statement and determines the index to use. For example, the query condition in the query statement may refer to a partition list (e.g., date partitions) to narrow the range of partitions queried. If the query condition includes an indexed attribute field, the index on that attribute field is selected for each relevant partition, yielding a data block set and further narrowing the range of data blocks. If the query condition includes multiple indexed attribute fields, the corresponding index is selected for each of them.
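The index selection in step S502 might be sketched as follows in Python. This is a minimal illustration under stated assumptions: the function name, the dictionary layout of the parsed query, and the field name `msisdn` are hypothetical and do not come from the application.

```python
def select_indexes(partitions, conditions, indexed_fields):
    """Pick the per-partition indexes that can serve a parsed query.

    partitions:     partition names from the query's partition list
    conditions:     dict mapping attribute field -> requested value
    indexed_fields: set of attribute fields that have an index built
    Returns a list of (partition, field) pairs, one per index to query.
    An empty list means no usable index exists for this query.
    """
    usable = [field for field in conditions if field in indexed_fields]
    return [(partition, field) for partition in partitions for field in usable]


# Parsed form of the example query: one subscriber number over two date partitions.
parsed_query = {
    "partitions": ["20100103", "20100104"],
    "conditions": {"msisdn": "13500000002"},  # "msisdn" is a hypothetical field name
}
selected = select_indexes(
    parsed_query["partitions"], parsed_query["conditions"], {"msisdn"}
)
```

When the returned list is empty, a caller could skip the index query and hand the operation straight to the parallel scan.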

[0066] If no usable index has been built, or if a data analysis application needs to perform batch analysis over large volumes of data, the operation may be submitted directly to the parallel processing engine 440-3 for execution (step S504).

[0067] In step S503, the index query module 440-2 queries the index files stored in the index memory 430 according to the parsing result to obtain at least one query data block set. When the analysis in step S501 finds that the query condition contains multiple indexed attribute fields, and the corresponding indexes have been selected in step S502, each of those indexes is queried in this step to obtain multiple data block sets, and the intersection or union of those sets is then taken according to the logical relationship among the conditions (e.g., AND or OR). Taking the index shown in FIG. 4 as an example, the following data block set is obtained:

[0068] 20100103/file-2/BlockLocation-3

[0069] 20100104/file-4/BlockLocation-6

[0070] 20100104/file-4/BlockLocation-7

[0071] 20100104/file-5/BlockLocation-8
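The intersection and union described for step S503 map directly onto set operations. The Python sketch below is an illustration only: the function name is an assumption, the first set reuses the four block locations listed above, and the second set is hypothetical data standing in for the result of a second indexed condition.

```python
def combine_block_sets(block_sets, operator):
    """Combine per-index data block sets according to the query's logic.

    block_sets: list of sets of block locations, one set per queried index
    operator:   "AND" (intersection) or "OR" (union)
    """
    if not block_sets:
        return set()
    if operator == "AND":
        return set.intersection(*block_sets)
    if operator == "OR":
        return set.union(*block_sets)
    raise ValueError(f"unsupported operator: {operator!r}")


# Block locations obtained for the subscriber-number condition (from the text).
by_number = {
    "20100103/file-2/BlockLocation-3",
    "20100104/file-4/BlockLocation-6",
    "20100104/file-4/BlockLocation-7",
    "20100104/file-5/BlockLocation-8",
}
# Hypothetical result of a second indexed condition, for illustration only.
by_other = {
    "20100104/file-4/BlockLocation-6",
    "20100104/file-9/BlockLocation-1",
}

blocks_and = combine_block_sets([by_number, by_other], "AND")
blocks_or = combine_block_sets([by_number, by_other], "OR")
```

With AND, only blocks that every index reports need to be scanned; with OR, any block reported by at least one index must be scanned.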

[0072] Next, the above data block set is handed to the parallel processing engine 440-3, which splits it and issues parallel scan tasks to the parallel processing platform 420. For example, the four data blocks in the set are each assigned to one of four parallel processing nodes and scanned simultaneously. Specifically, in step S504, the parallel processing platform 420 processes the query command according to the data block set, and the parallel processing engine 440-3 merges the results produced by the parallel processing platform 420 and returns them to the query client.
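The split, scan, and merge flow of this step can be approximated on a single machine with a thread pool, as in the Python sketch below. It is not the application's parallel processing platform: worker threads stand in for processing nodes, and the function names, the in-memory `blocks` mapping, and the record layout are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_block(location, blocks, number):
    """Scan one data block and return the records matching the subscriber number."""
    return [rec for rec in blocks.get(location, []) if rec["number"] == number]


def parallel_scan(locations, blocks, number, workers=4):
    """Assign each block location to a worker, scan concurrently, merge results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields per-block results in the same order as `locations`.
        parts = pool.map(lambda loc: scan_block(loc, blocks, number), locations)
        return [rec for part in parts for rec in part]


# Hypothetical block contents keyed by block location.
blocks = {
    "20100103/file-2/BlockLocation-3": [
        {"number": "13500000002", "seq": 1},
        {"number": "13500000009", "seq": 2},
    ],
    "20100104/file-4/BlockLocation-6": [
        {"number": "13500000002", "seq": 3},
    ],
}
records = parallel_scan(list(blocks), blocks, "13500000002")
```

Because only the blocks named in the query data block set are scanned, the work per query scales with the size of that set rather than with the whole partition.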

[0073] The above are merely exemplary embodiments of the present application. Based on these embodiments, those skilled in the art may modify each of them within the scope defined by the claims of the present application.

Claims (16)

1. A distributed database system, comprising: a plurality of distributed storage units storing a plurality of data block files in partitions; an index memory storing indexes of the plurality of data block files; a parser that parses a query statement initiated by a user and selects a corresponding query index; an index query module that searches the indexes of the plurality of data block files according to the selected query index to obtain at least one query data block set, the query data block set including an index key value and location information recording, among the plurality of data block files, the data block files corresponding to the index key value; and a parallel processing engine that splits the at least one query data block set and initiates parallel scan tasks.
2. The system of claim 1, wherein the query statement includes a query condition, the query condition includes a plurality of attribute fields of the index, and the parser, after analyzing the query statement, selects the indexes corresponding to the plurality of attribute fields respectively.
3. The system of claim 2, wherein the index query module queries the indexes corresponding to the plurality of attribute fields respectively to obtain a plurality of index data block sets, and determines the intersection or union of the plurality of index data block sets by a logical operation.
4. The system of claim 1, wherein the plurality of data block files are stored under different file directories in the plurality of distributed storage units according to different data attributes.
5. The system of claim 1, wherein the data block files stored by the index memory are encoded and compressed in a compact encoding format.
6. A method for building an index in a distributed database system, comprising: collecting data to be stored; dividing the data into a plurality of data blocks and determining a corresponding data block index; storing the divided data blocks, in the form of files and in partitions, in a plurality of distributed storage units of the distributed database system; and building an index file for the stored data blocks, wherein each index in the index file includes an index key value and location information of the data block.
7. The method of claim 6, wherein the step of storing the compressed data blocks, in the form of files and in partitions, in the plurality of distributed storage units of the distributed database system comprises: storing the compressed data blocks under different file directories in the plurality of distributed storage units of the distributed database system according to different data block attributes.
8. The method of claim 6, wherein the location information records the position of the data block in the file directory.
9. The method of claim 7, wherein the data block attribute is the time at which the data block was generated.
10. The method of claim 6, wherein the step of compressing the data into a plurality of data blocks and determining the corresponding data block index comprises: encoding and compressing the data block by block in a compact encoding format and determining the corresponding data block index.
11. The method of any one of claims 6-10, wherein the step of dividing the data into a plurality of data blocks and determining the corresponding data block index comprises: dividing the data into a plurality of data blocks; compressing the divided data blocks; and determining a data block index for each compressed data block.
12. The method of claim 11, wherein the index key value points to all data blocks in which that index key value occurs, and the index file records only the position at which the index key value first occurs in each such data block.
13. A query method for use in a distributed database system, the distributed database system including an index formed by the method of claim 12, the query method comprising: parsing a query statement and determining a corresponding query index; searching the index file according to the selected query index to obtain at least one query data block set; and splitting the at least one query data block set and initiating parallel scan tasks according to the location information included in the query data block set.
14. The query method of claim 13, wherein the query statement includes a query condition, and the query condition includes a partition list for narrowing the range of partitions queried.
15. The query method of claim 13, wherein the step of parsing the query statement and determining the corresponding query index comprises: determining by parsing that the query condition includes a plurality of indexed attribute fields, and selecting the indexes corresponding to the plurality of attribute fields respectively.
16. The query method of claim 15, wherein the step of searching the index file according to the selected query index to obtain at least one query data block set comprises: querying the corresponding indexes respectively to obtain a plurality of index data block sets; and determining the intersection or union of the plurality of index data block sets by a logical operation.
CN2010102611675A 2010-08-24 2010-08-24 Distributed database system, method for building index therein and query method CN102375853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102611675A CN102375853A (en) 2010-08-24 2010-08-24 Distributed database system, method for building index therein and query method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102611675A CN102375853A (en) 2010-08-24 2010-08-24 Distributed database system, method for building index therein and query method

Publications (1)

Publication Number Publication Date
CN102375853A true CN102375853A (en) 2012-03-14

Family

ID=45794475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102611675A CN102375853A (en) 2010-08-24 2010-08-24 Distributed database system, method for building index therein and query method

Country Status (1)

Country Link
CN (1) CN102375853A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246500A (en) * 2008-03-27 2008-08-20 腾讯科技(深圳)有限公司 Retrieval system and method for implementing data fast indexing
CN101727465A (en) * 2008-11-03 2010-06-09 中国移动通信集团公司 Methods for establishing and inquiring index of distributed column storage database, device and system thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246500A (en) * 2008-03-27 2008-08-20 腾讯科技(深圳)有限公司 Retrieval system and method for implementing data fast indexing
CN101727465A (en) * 2008-11-03 2010-06-09 中国移动通信集团公司 Methods for establishing and inquiring index of distributed column storage database, device and system thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李晔锋: "数据仓库的存储研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, 15 October 2009 (2009-10-15), pages 1 - 63 *
董继润: "关系数据库和顺序相关性", 《山东大学学报》, no. 4, 31 December 1983 (1983-12-31), pages 31 - 39 *
谢力军等: "几种索引技术的比较", 《怀化学院学报》, vol. 28, no. 8, 31 August 2009 (2009-08-31), pages 115 - 118 *

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309902A (en) * 2012-03-16 2013-09-18 多玩娱乐信息技术(北京)有限公司 Method and device for storing and searching user information in social network
CN102779160B (en) * 2012-06-14 2016-02-03 中金数据系统有限公司 Mass index data indexing system and method for constructing
CN102779160A (en) * 2012-06-14 2012-11-14 中金数据系统有限公司 Mass data information indexing system and indexing construction method
CN103748578A (en) * 2012-07-26 2014-04-23 华为技术有限公司 Data distribution method, device, and system
CN102915324B (en) * 2012-08-09 2016-08-03 深圳中兴网信科技有限公司 Data storage and retrieval means and the data storage and retrieval method
CN102915324A (en) * 2012-08-09 2013-02-06 深圳中兴网信科技有限公司 Data storing and retrieving device and data storing and retrieving method
CN102841944A (en) * 2012-08-27 2012-12-26 南京云创存储科技有限公司 Method achieving real-time processing of big data
CN102833352A (en) * 2012-09-17 2012-12-19 深圳中兴网信科技有限公司 Distributed cache management system and method for implementing distributed cache management
CN103002027A (en) * 2012-11-26 2013-03-27 中国科学院高能物理研究所 System and method for data storage on basis of key-value pair system tree-shaped directory achieving structure
CN103002027B (en) * 2012-11-26 2015-09-02 中国科学院高能物理研究所 The data storage system and method for tree-based directory structures implemented on the system key
CN102968309B (en) * 2012-11-30 2016-01-20 亚信科技(中国)有限公司 Rule matching method and apparatus for implementing a rule-based engine
CN102968309A (en) * 2012-11-30 2013-03-13 亚信联创科技(中国)有限公司 Method and device for realizing rule matching based on rule engine
CN103036891A (en) * 2012-12-19 2013-04-10 北京时代凌宇科技有限公司 Method and device based on wireless fidelity (Wi-Fi) for accessing to Internet of Things
CN103064933A (en) * 2012-12-24 2013-04-24 华为技术有限公司 Data query method and system
CN103064933B (en) * 2012-12-24 2016-06-29 华为技术有限公司 Method and system for data query
WO2014101445A1 (en) * 2012-12-24 2014-07-03 华为技术有限公司 Data query method and system
CN103034734A (en) * 2012-12-27 2013-04-10 上海顶竹通讯技术有限公司 File storage and inquiry agency and information searching method and system
WO2014106418A1 (en) * 2013-01-07 2014-07-10 Tencent Technology (Shenzhen) Company Limited Method and apparatus for storing and reading files
CN103914483B (en) * 2013-01-07 2018-09-25 深圳市腾讯计算机系统有限公司 File storage method, device and a document reading method, apparatus
CN103914483A (en) * 2013-01-07 2014-07-09 深圳市腾讯计算机系统有限公司 File storage method and device and file reading method and device
CN104063376A (en) * 2013-03-18 2014-09-24 阿里巴巴集团控股有限公司 Multi-dimensional grouping operation method and system
CN103399945A (en) * 2013-08-15 2013-11-20 成都博云科技有限公司 Data structure based on cloud computing database system
CN103473276A (en) * 2013-08-26 2013-12-25 广东电网公司电力调度控制中心 Storage method of very large data and distributed database system and retrieval method thereof
CN103473276B (en) * 2013-08-26 2017-08-25 广东电网公司电力调度控制中心 Large data storage method, a distributed database system and searching method
CN103488709B (en) * 2013-09-09 2017-06-16 东软集团股份有限公司 An index that establish methods and systems, and retrieval system
CN103488709A (en) * 2013-09-09 2014-01-01 东软集团股份有限公司 Method and system for building indexes and method and system for retrieving indexes
CN103631910A (en) * 2013-11-26 2014-03-12 烽火通信科技股份有限公司 Distributed database multi-column composite query system and method
CN103678520B (en) * 2013-11-29 2017-03-29 中国科学院计算技术研究所 One kind of cloud computing multi-dimensional range query method and system based on
CN103631539B (en) * 2013-12-13 2016-08-24 百度在线网络技术(北京)有限公司 Based distributed storage system and how to store the erasure coding scheme
CN103631539A (en) * 2013-12-13 2014-03-12 百度在线网络技术(北京)有限公司 Distributed storage system and distributed storage method based on erasure coding mechanism
CN104750690A (en) * 2013-12-25 2015-07-01 中国移动通信集团公司 Query processing method, device and system
CN104951464B (en) * 2014-03-27 2018-09-11 华为技术有限公司 Method and system for data storage
CN104951464A (en) * 2014-03-27 2015-09-30 华为技术有限公司 Data storage method and system
CN103902698A (en) * 2014-03-31 2014-07-02 北京车商汇软件有限公司 Data storage system and data storage method
CN103902702A (en) * 2014-03-31 2014-07-02 北京车商汇软件有限公司 Data storage system and data storage method
CN103902702B (en) * 2014-03-31 2017-11-28 北京皮尔布莱尼软件有限公司 A data storage system and a storage method
CN103902698B (en) * 2014-03-31 2018-04-13 北京皮尔布莱尼软件有限公司 A data storage system and a storage method
CN104133867A (en) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 DOT in-fragment secondary index method and DOT in-fragment secondary index system
CN104239525A (en) * 2014-09-18 2014-12-24 浪潮软件集团有限公司 Internet-based distributed storage method
CN104331453B (en) * 2014-10-30 2017-10-17 北京思特奇信息技术股份有限公司 A distributed file systems and distributed file system construction method
CN104331453A (en) * 2014-10-30 2015-02-04 北京思特奇信息技术股份有限公司 Distributed file system and constructing method thereof
CN104536962A (en) * 2014-11-11 2015-04-22 珠海天琴信息科技有限公司 Data query method and data query device used in embedded system
CN105488085A (en) * 2014-12-27 2016-04-13 北京安天电子设备有限公司 File positioning method and system through log
WO2016119275A1 (en) * 2015-01-30 2016-08-04 深圳市华傲数据技术有限公司 Network account identifying and matching method
WO2016141584A1 (en) * 2015-03-12 2016-09-15 Intel Corporation Method and apparatus for compaction of data received over a network
US10015272B2 (en) 2015-03-12 2018-07-03 Intel Corporation Method and apparatus for compaction of data received over a network
CN104699815A (en) * 2015-03-24 2015-06-10 北京嘀嘀无限科技发展有限公司 Data processing method and system
WO2016165509A1 (en) * 2015-04-15 2016-10-20 Huawei Technologies Co., Ltd. Big data statistics at data-block level
CN105117171A (en) * 2015-08-28 2015-12-02 南京国电南自美卓控制系统有限公司 Energy SCADA massive data distributed processing system and method thereof
CN105512200A (en) * 2015-11-26 2016-04-20 华为技术有限公司 Distributed database processing method and device
WO2017088358A1 (en) * 2015-11-26 2017-06-01 华为技术有限公司 Distributed database processing method and device
CN105868253A (en) * 2015-12-23 2016-08-17 乐视网信息技术(北京)股份有限公司 Data importing and query methods and apparatuses
CN105843933B (en) * 2016-03-30 2019-01-29 电子科技大学 The index establishing method of distributed memory columnar database
CN105843933A (en) * 2016-03-30 2016-08-10 电子科技大学 Index building method for distributed memory columnar database
CN105912687A (en) * 2016-04-19 2016-08-31 江苏物联网研究发展中心 Mass distributed database memory cell
CN105912687B (en) * 2016-04-19 2019-05-24 江苏物联网研究发展中心 Magnanimity distributed data base storage unit
CN106126545A (en) * 2016-06-15 2016-11-16 北京智能管家科技有限公司 Distributed fission query method and device
CN106250409A (en) * 2016-07-21 2016-12-21 中国农业银行股份有限公司 Data query method and device
CN107463632A (en) * 2016-09-21 2017-12-12 广州特道信息科技有限公司 Distributed NewSQL database system and data query method
CN106503128A (en) * 2016-10-19 2017-03-15 许继集团有限公司 Method and system for searching data of intelligent ammeter
WO2019080790A1 (en) * 2017-10-26 2019-05-02 Huawei Technologies Co., Ltd. Method and apparatus for storing and retrieving information in a distributed database

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C12 Rejection of a patent application after its publication