CN102375853A

CN102375853A - Distributed database system, method for building index therein and query method

Info

Publication number: CN102375853A
Application number: CN2010102611675A
Authority: CN
Inventors: 齐骥; 钱岭; 郭磊涛; 周大; 罗治国; 孙少陵; 张松波; 张卫平
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2010-08-24
Filing date: 2010-08-24
Publication date: 2012-03-14

Abstract

The invention discloses a distributed database system and a method for building an index in the distributed database system. The distributed database system comprises a plurality of distributed storage units, an index store, a resolver, an index query module and a parallel processing engine, wherein the distributed storage units store a plurality of data block files in partitions; the index store stores the indexes of the data block files; the resolver parses a query statement initiated by a user and selects a corresponding query index; the index query module searches the indexes of the data block files according to the selected query index to obtain at least one query data block set; the query data block set comprises an index key value and records the position information, within the data block files, of the data blocks corresponding to the index key value; and the parallel processing engine splits the at least one query data block set and initiates parallel scan tasks.

Description

Distributed database system, method for building an index therein, and query method

Technical field

The present application relates to a distributed database system, a method for building an index therein, and a query method.

Background technology

Storing large volumes of structured data in a database, particularly a relational database, is a common data management method. The straightforward practice is to deploy a mature database management system (DBMS), define the data tables and data structures through a standard interface (such as SQL), and import or insert the collected data into the corresponding tables. Where needed, the database system builds indexes on the data for use in fast queries. When data is queried, a suitable index can be selected according to the query conditions to optimize query performance.

In managing large-scale data, the key factors that affect query performance are the amount of data accessed and the disk I/O incurred during a query. Indexing is an important technique for improving query performance in database practice. An index is generally much smaller than the actual data and can be organized into a data structure that is easy to search, such as a tree or a hash table. By searching the index first to filter out most of the data that need not be accessed, instead of scanning the actual data directly, the amount of data that must be accessed and the disk I/O can be reduced effectively. The organization and storage layout of the data are also very important for building an effective index, and different indexing techniques place different requirements on that organization and layout. The index types commonly used in database systems, such as the B-TREE index, HASH index, and BITMAP index, suit different scenarios, but their principle is essentially the same: to locate the storage position of a record quickly by the key value being queried.

In many industries today, the volume of data generated and accumulated is enormous, reaching hundreds of TB or even the PB level. The data keeps growing over time, and the rate at which it is generated also keeps rising as the business develops. Examples include telecommunication CDR (Call Detail Record) data, Internet of Things sensor data, financial transaction data, and Internet log data.

Mass data has at least one of the following characteristics:

(1) most of the data is time-series data that carries a time tag and is generated and stored in, or roughly in, chronological order;

(2) the data is structured or semi-structured, and its structure is prone to change;

(3) the data is generated very quickly (for example, one system produces 2 TB, or 5 billion records, per day), and the data volume keeps growing;

(4) the values of many attribute domains have a very high repetition rate.

The management and use of mass data also have the following characteristics:

(1) the data must be kept for a long period (such as half a year), and older data is discarded or backed up to other media;

(2) old historical data must remain accessible, although it is accessed infrequently; for cost reasons, it should not consume many runtime resources (such as CPU, memory, and bandwidth) beyond storage;

(3) historical data generally does not need to be modified: once stored, it only needs to be read;

(4) queries on the data generally specify a time range as a condition;

(5) in addition to fast queries, batch data analysis and mining operations on the same data set often need to be supported. The same analysis or mining operation is generally not repeated many times on the same batch of data.

It is very difficult for a user to query mass data for the data of interest using existing databases and their indexing methods. Databases often cannot store such a huge volume of data and are not well suited to semi-structured data or to changes in data structure. For mass data, a dense, complete index not only makes building and maintaining the index expensive and slow, but the index itself also becomes very large, so that the write speed of the data can hardly keep up with the speed at which the data is generated.

Summary of the invention

In one aspect, the present application discloses a distributed database system, comprising:

a plurality of distributed storage units that store a plurality of data block files in partitions;

an index store that stores the indexes of said plurality of data block files;

a resolver that parses a user-initiated query statement and selects a corresponding query index;

an index query module that searches the indexes of said plurality of data block files according to the selected query index to obtain at least one query data block set, wherein said query data block set comprises an index key value and records the position information of the data block corresponding to said index key value in said plurality of data block files; and

a parallel processing engine that splits said at least one query data block set and initiates parallel scan tasks.

In one embodiment of the present application, the distributed database system defines a basic structure for organizing and storing data, into which data records that are collected as a stream or obtained in batches are written sequentially. This basic structure comprises data files and corresponding data block index files. Each data file can hold many compressed data blocks stored sequentially, and each data block can hold many data records stored sequentially. The data block size can be defined according to the average record length, for example as 1 MB; the data file size can also be defined flexibly, for example as 1 GB. Data blocks are compressed with a common compression algorithm to save space. Each data file is accompanied by a very lightweight data block index file, which is used to locate a specified data block quickly. The data block index is generally generated while the data file is being written, and it can also be rebuilt from an existing data file. The present application does not require that data blocks and their index be stored in separate files; they can also be stored in the same file.

The index provided by the present application is built on the data block index described above. It is an approximate, sparse index structure: a key value in the index does not point to the storage position of every record, but points approximately to all data blocks in which the key value has appeared, and the index records only the position at which the key value first appears in each such data block. Because each data block contains many records, and the same key value may repeat many times within a data block, an index built in this way can be smaller by orders of magnitude, and building it is much faster. It also avoids the index skew that a severely uneven key distribution would otherwise cause. For example, if a data block holds 10,000 records but only 100 unique key values, only 100 index entries are produced.
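
The 10,000-record example can be reproduced with a minimal sketch. The function name `build_sparse_index` and the in-memory representation of a block as a list of (key, payload) tuples are assumptions made for illustration only; the application does not prescribe them.

```python
from collections import defaultdict


def build_sparse_index(blocks):
    """Map each key to the blocks it appears in, one entry per (key, block).

    `blocks` is a list of blocks; each block is a list of (key, payload) records.
    """
    index = defaultdict(list)
    for block_id, records in enumerate(blocks):
        seen = set()
        for key, _payload in records:
            if key not in seen:  # record only the first occurrence in this block
                seen.add(key)
                index[key].append(block_id)
    return dict(index)


# One block of 10,000 records drawn from only 100 distinct keys.
block = [(f"key-{i % 100}", i) for i in range(10_000)]
index = build_sparse_index([block])
entry_count = sum(len(locations) for locations in index.values())
print(entry_count)  # 100 entries instead of 10,000
```

The index size depends on the number of distinct keys per block rather than on the number of records, which is why highly repetitive attributes benefit most.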

For an attribute domain whose range of values is large but finite and discrete, such as the telephone number in telecommunication CDRs or an ID in other data sets, building an index is very effective for queries on that attribute domain. Within a data block, no matter how many times a particular value of the attribute domain appears, only the position of its first occurrence is recorded. The index has a structure such as <Key, BlockLocation>. Because values of such attribute domains repeat very often, the index is small and sparse. Attribute domains of this kind are common and often need to be indexed. A joint index can also be built over several attribute domains.

Although this index strategy is built only at the data block level and greatly reduces the size of the index, a query incurs the additional cost of sequentially scanning a limited number of data blocks. In mass data processing, the benefit of this trade-off is much greater than that of building a heavyweight index. When parallel processing is used in a distributed system, this cost falls to an acceptably low level.

In addition, the present application discloses a method for building an index in a distributed database system, comprising:

collecting the data to be stored;

compressing said data into a plurality of data blocks and determining the corresponding data block indexes;

storing the compressed data blocks, in the form of files and in partitions, in a plurality of distributed storage units of said distributed database system; and

building an index file for the stored data blocks, wherein each index in said index file comprises an index key value and the position information of said data block.

Because the index data built in this way is small, it can itself be stored in a relational database with a B-TREE index built on its key, which supports both range queries and point queries on the attribute domain. The index data can also be stored in a distributed key-value storage system, which provides better scalability and stability.
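
Why an ordered index supports both query types can be shown with a small sketch. It does not implement a B-TREE: a sorted Python list searched with the standard `bisect` module stands in for the ordered key structure, and the key values and function names are hypothetical.

```python
import bisect

# Sorted index keys; a sorted list stands in for the ordered B-TREE key space.
keys = sorted(["13500000001", "13500000002", "13500000005", "13600000000"])


def point_query(key):
    """Return True if `key` is present in the index."""
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key


def range_query(low, high):
    """Return all keys k with low <= k <= high."""
    return keys[bisect.bisect_left(keys, low):bisect.bisect_right(keys, high)]
```

A hash-based structure would answer the point query equally well but not the range query, which is the reason an ordered structure is named here.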

As an optional complement to the index strategy, the data files can be stored in partitioned directories, for example partitioned by date. This reduces the amount of data that must be accessed by queries spanning a relatively wide range (such as several days), and likewise by batch statistical analysis and data mining operations over a specified range (such as several days). After partitioning, the data-block-based index described above can be built per partition. A partition can be regarded as a coarse-grained, directory-based index.

The present application also discloses a query method applied in a distributed database system, said distributed database system comprising an index formed by the above method, said query method comprising:

parsing a query statement and determining a corresponding query index;

searching said index file according to the selected query index to obtain at least one query data block set; and

splitting said at least one query data block set and initiating parallel scan tasks according to the position information contained in said query data block set.

In one embodiment, if the query conditions include a partition condition, the partitions involved (such as date partitions) are determined first, which narrows the range of partitions to be queried. If the query conditions include an attribute domain on which an index has been built, the index of that attribute domain is queried first in each relevant partition to obtain a set of data blocks, which further narrows the range of data blocks. If the query conditions include several indexed attribute domains, the corresponding indexes are queried separately to obtain several sets of data blocks, and the intersection or union of these sets is then taken according to the logical relationship between the conditions, for example AND or OR. Finally, a parallel scan-and-match operation is initiated on the resulting set of data blocks, the results of the match operation are merged, and the merged result is returned as the result of the query.
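
The AND/OR combination of data block sets can be sketched as follows. The function `combine_block_sets`, the tuple representation (partition, file, block location), and the second indexed attribute (service type) are assumptions for illustration.

```python
def combine_block_sets(block_sets, operator):
    """Combine per-attribute block sets with AND (intersection) or OR (union)."""
    if not block_sets:
        return set()
    if operator == "AND":
        return set.intersection(*block_sets)
    if operator == "OR":
        return set.union(*block_sets)
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical results of two index lookups, one per indexed attribute.
blocks_for_number = {("20100103", "file-2", 3), ("20100104", "file-4", 6)}
blocks_for_service = {("20100104", "file-4", 6), ("20100104", "file-5", 8)}

both = combine_block_sets([blocks_for_number, blocks_for_service], "AND")
either = combine_block_sets([blocks_for_number, blocks_for_service], "OR")
```

Because the index is only block-level, the intersection is a candidate set: a block in `both` contains each value somewhere, not necessarily in the same record, so the final scan still has to check every condition on each record.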

Brief description of the drawings

Fig. 1 shows the basic data storage structure according to an embodiment of the present application.

Fig. 2 shows a method for building an index in a distributed database system according to an embodiment of the present application.

Fig. 3 is a schematic of the logical structure of a subscriber number index according to an embodiment of the present application.

Fig. 4 is a block diagram of a distributed database system according to an embodiment of the present application.

Fig. 5 shows query processing according to another embodiment of the present application.

Embodiment

Exemplary embodiments of the present application are described in detail below with reference to the accompanying drawings.

The embodiments of the present application are based on a distributed file system. The distributed file system consists of a plurality of storage and computing nodes, which can be networked PC servers; the number of nodes can reach several thousand. Data nodes can be added or removed smoothly according to capacity needs without interrupting service, and the failure of a small number of data nodes does not interrupt system service either. As described below, file data is divided into blocks that are distributed as evenly as possible across the data nodes, and block replicas are kept to guarantee data reliability. Any file in the file system, and its data distributed across the data nodes, can be accessed by calling the distributed file system client API; when data is written to a file, the client communicates directly with the relevant data nodes. This file system addresses the problems that arise in handling mass data, such as distributed storage, load balancing, stability, data reliability, scalability, and high throughput.

Fig. 1 shows the basic data storage structure 100 according to an embodiment of the present application. The storage structure 100 comprises a data file 111 and a corresponding data block index file 112. Data records are written into the storage structure as a log stream and compressed (for example with a compression algorithm such as GZIP or LZO) in units of a user-defined data block size (such as 1 MB), and the compressed data blocks are written sequentially into the data file 111. In one embodiment, the corresponding index entry is generated and written into the data block index file 112 while the data file is being written. The user can define the maximum size of a data file (such as 1 GB).
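
The write path can be sketched roughly as follows, using GZIP from the Python standard library. The function `write_blocks`, the newline-delimited record format, and the in-memory `io.BytesIO` stand-in for the data file 111 are assumptions for illustration; each returned tuple plays the role of one entry in the data block index file 112.

```python
import gzip
import io


def write_blocks(records, out, block_size=1024 * 1024):
    """Append records to `out` as gzip-compressed blocks; return the block index.

    Each index entry is (offset, raw_bytes, compressed_bytes, record_count).
    """
    index = []
    buffer, count = bytearray(), 0

    def flush():
        nonlocal buffer, count
        if not count:
            return
        compressed = gzip.compress(bytes(buffer))
        index.append((out.tell(), len(buffer), len(compressed), count))
        out.write(compressed)
        buffer, count = bytearray(), 0

    for record in records:
        buffer += record + b"\n"
        count += 1
        if len(buffer) >= block_size:  # a block may slightly exceed the target size
            flush()
    flush()  # write the final, possibly short, block
    return index


data_file = io.BytesIO()
records = [f"13500000002,2010-01-03T00:00:{i % 60:02d},voice".encode() for i in range(1000)]
block_index = write_blocks(records, data_file, block_size=4096)
```

A block is closed only after the record that crosses the size threshold, so the raw size of a block is equal to or slightly larger than the configured block size.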

The data in the storage structure 100 can be read in two ways. In the first, the ID of the specified data block is used to determine the position of its block index entry in the data block index file 112, and the position of the data block in the data file 111 is then found from that index entry. In the second, the data block is read directly from its known position in the data file 111, which saves the cost of reading the data block index file. To reach a specific record ID within a specified data block, the reader must first locate the data block and then skip sequentially to the specified record ID.

Table 1 shows the data structure of the data block index file 112. The "data block ID" is an implicit parameter and does not appear in the data block index structure. The "block offset" is the position of the data block in the data file. The "raw data byte count" is the size of the data block before compression, which is usually equal to or slightly larger than the user-defined data block size. The "compressed byte count" is the storage actually occupied by the data block after compression. The "record count" is a statistic giving the total number of records in the data block. Every entry in the data block index file has the same length, so the position of an entry in the file is easily computed from the data block ID. If data blocks must be located by data block ID even faster, the data block index can be cached in memory.

Data block ID | Block offset | Raw data byte count | Compressed byte count | Record count

......

Table 1
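
Because every index entry has the same length, the entry for a given data block ID can be found by simple multiplication. The sketch below assumes a hypothetical binary layout (an 8-byte offset followed by three 4-byte counters, little-endian); the application does not specify field widths, and the sample values are made up.

```python
import struct

# Assumed layout: block offset (8 bytes), then raw byte count, compressed byte
# count, and record count (4 bytes each). The data block ID is implicit: it is
# the position of the entry in the index file.
ENTRY = struct.Struct("<QIII")


def pack_index(entries):
    """Serialize index entries into fixed-length records."""
    return b"".join(ENTRY.pack(*entry) for entry in entries)


def lookup(index_bytes, block_id):
    """Return (offset, raw_bytes, compressed_bytes, record_count) for a block ID."""
    return ENTRY.unpack_from(index_bytes, block_id * ENTRY.size)


index_bytes = pack_index([(0, 1048600, 210000, 2600), (210000, 1048700, 215000, 2610)])
print(lookup(index_bytes, 1))  # (210000, 1048700, 215000, 2610)
```

With a fixed entry size, no search is needed: entry `block_id` starts at byte `block_id * ENTRY.size` of the index file.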

A method 200 for building an index in a distributed database system is described below with reference to Fig. 2. For clarity, process 200 is described using mass telecommunication CDR data as an example, but the invention is not limited to this. A telecommunication CDR is data generated in the communication network that records a user's call event. A typical CDR contains many items of information, such as the subscriber number, time tag, service type, and failure cause, and is about 400 bytes long. For example, about 5 billion records (about 2 TB) are generated per day, and the data must be kept for 3 months, that is, 2 TB × 90 = 180 TB. Querying the CDR records of a specified subscriber number within a given time period is a common query requirement. Operators also need to perform batch analysis and mining on these CDRs.

In step S201, the CDR data is first collected. This can be done with an existing centralized CDR collection method, or the raw CDR collection files can be batch-processed with parallel processing (MapReduce) to gather the CDR data.

In step S202, the collected data is compressed into a plurality of data blocks. Each CDR record can be encoded, for example, in a compact encoding format (for example using a compression algorithm such as GZIP or LZO). The index column of each data block can be determined while the data is compressed.

The building of the index file is described in step S204; the unclear description that previously appeared here has therefore been deleted.

Then, in step S203, the compressed data blocks are stored, in the form of files and in partitions, in a plurality of distributed storage units of the distributed database system. For example, the CDR data can be partitioned by date according to the time tag, so that in the distributed file system the data of different dates is stored under different directories. For instance, the data of January 3, 2010 is stored in files under the directory /CDR/20100103.
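
The mapping from a record's time tag to its partition directory can be sketched in a few lines. The function name `partition_dir` and the default root `/CDR` (taken from the example path above) are illustrative assumptions.

```python
from datetime import datetime
import posixpath


def partition_dir(timestamp, root="/CDR"):
    """Return the date-partition directory for a record's time tag."""
    return posixpath.join(root, timestamp.strftime("%Y%m%d"))


print(partition_dir(datetime(2010, 1, 3, 14, 30)))  # /CDR/20100103
```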

Then, in step S204, an index file is built for the stored data blocks, wherein each index in the index file comprises an index key value and the position information of the data block. In this embodiment, the subscriber number is a finite, discrete attribute domain with a high repetition rate. The total number of subscriber numbers in the whole data set over a period of time is fixed, and the CDR records of one subscriber number appear only in a small, limited number of data blocks. Within a data block, no matter how many times a subscriber number appears, only the position of its first occurrence is recorded. The index has a data structure such as <Subscriber Number, BlockLocations>, where a BlockLocation directly records the position of the data block within a specific file. A BlockLocation can also record information such as the size of the data block. Alternatively, the index data can simply record the data block ID within the specific file; in that case, however, a query must first read the data block index file of the specified file, which adds disk seeks and I/O.

Step S204 can be carried out by batch-scanning the newly added files in each partition with parallel processing (MapReduce). It can also be carried out together with step S203 described above, to reduce disk scanning. The index data generated in this step is stored by partition in a distributed database storage system. In one embodiment, a storage system similar to Google Bigtable can be used to store the generated index data. The indexes of different partitions are stored in different row groups; for example, the index of partition 20100103 is stored in row group 20100103.

Fig. 3 is a schematic of the logical structure of the subscriber number index built by the indexing method 200 above. The subscriber number serves as the key value (Key) 301 of the index, and its values comprise all subscriber numbers that have appeared in the whole data set; for example, if 10 million subscriber numbers have appeared in total, the index has 10 million rows. Each date partition 302 contains a number of files 303. The index of a particular subscriber number records only the BlockLocations 304 of the data blocks in which the number has appeared in particular files. Because the CDR records produced by a particular user are widely scattered, and there may be no records at all for some periods, the logical structure of the index is very sparse. In the storage structure of the index, an empty cell 305 takes up no storage space, so the total index size stays small.
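
A rough in-memory analogue of this sparse logical structure is a nested dictionary in which only populated cells exist. The names `add_entry` and `locations_for`, the sample entries, and the path-like output format are assumptions for illustration; the embodiment itself stores the index in a Bigtable-like system.

```python
from collections import defaultdict

# index[subscriber][partition][file] -> list of block locations
index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def add_entry(subscriber, partition, file_name, block_location):
    """Record that `subscriber` appears in the given block, at most once per block."""
    locations = index[subscriber][partition][file_name]
    if block_location not in locations:
        locations.append(block_location)


add_entry("13500000002", "20100103", "file-2", 3)
add_entry("13500000002", "20100104", "file-4", 6)
add_entry("13500000002", "20100104", "file-4", 6)  # repeated key in the same block
add_entry("13500000002", "20100104", "file-4", 7)


def locations_for(subscriber, partitions):
    """List partition/file/block locations for one subscriber in the given partitions."""
    rows = index.get(subscriber, {})
    return [
        f"{p}/{f}/BlockLocation-{b}"
        for p in partitions
        for f, blocks in sorted(rows.get(p, {}).items())
        for b in blocks
    ]
```

Partitions and files in which a subscriber never appears have no key at all, which mirrors the empty cells that take up no storage.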

Fig. 4 is a block diagram of a distributed database system 400 according to an embodiment of the present application. In this system framework, data files are stored in a distributed file system 410, which consists of a plurality of unit nodes made up of networked PC servers. Structurally, the distributed file system 410 comprises a master control unit (not shown) and a plurality of data storage units. The file system 410 splits large files into blocks (for example, 64 MB each), distributes the blocks evenly across different unit nodes, and stores several replicas of each data block (for example, 3 replicas). On a unit node, a data block can be stored on the local disk, for example as a Linux local file. The master control unit provides the unified file system namespace metadata and coordinates and manages the whole cluster, while the data storage units store the data blocks in a distributed manner. Storing data through a master control unit in a distributed system is prior art and is not described further here.

The parallel processing platform (MapReduce framework) 420 can be deployed in the same cluster as the distributed file system 410 and is responsible for parallel processing when building indexes, querying data, and analyzing and mining data.

The index data files are stored in the index store 430. In this embodiment, the index is stored in a distributed storage system similar to the Google Bigtable model, which builds a B-TREE index on the index key and supports fast lookup. The index store 430 can also be deployed in the same cluster as the distributed file system 410 and the parallel processing platform 420. The specific index data files can be, for example, as shown in Table 1 and Fig. 3 above.

The execution engine 440 is mainly responsible for executing query operations and can comprise a resolver (for example, an SQL parser) 440-1, an index query module 440-2, and a parallel processing engine 440-3. The resolver 440-1 parses operation statements, such as query statements, from the user interface 450 and selects the corresponding query index. The index query module 440-2 queries the index to obtain a narrowed data scan range, such as an indexed data block set; specifically, according to the selected query index, the index query module 440-2 can search the indexes of the plurality of data block files in the index store 430 to obtain at least one query data block set. The parallel processing engine 440-3 logically splits the data range to be scanned and initiates parallel processing tasks.

After processing the parallel tasks, the parallel processing platform 420 merges the results and returns them to the query client.

With reference to Fig. 5, query processing 500 according to an embodiment of the present application is described below, using as an example a query for the CDR records of a subscriber number (such as 13500000002) on two particular days (such as 20100103 and 20100104). For purposes of illustration, process 500 is described in terms of the system 400 shown in Fig. 4; however, query processing 500 is not limited to that system.

First, in step S501, the user initiates a query statement (such as an SQL query statement) through the user interface 450. Then, in step S502, the resolver 440-1 parses the query statement and determines the index. For example, the query conditions in the query statement can refer to a list of partitions (such as date partitions) to narrow the range of partitions to be queried. If the query conditions include an indexed attribute domain, the index of that attribute domain is selected in each relevant partition to obtain a set of data blocks, which further narrows the range of data blocks. If the query conditions include several indexed attribute domains, the corresponding indexes are selected separately.

If no usable index has been built, or if a data analysis application needs to run a batch analysis operation over a large body of data, the operation can be submitted directly to the parallel processing engine 440-3 for execution (step S504).

In step S503, the index query module 440-2 queries the index files stored in the index store 430 according to the parsing result to obtain at least one query data block set. If the analysis in step S501 finds several indexed attribute domains in the query conditions, and the corresponding indexes have been selected in step S502, then in this step the corresponding indexes are queried separately to obtain several sets of data blocks, and the intersection or union of these sets is taken according to the logical relationship between the conditions (for example AND or OR). Taking the index shown in Fig. 4 as an example, the following set of data blocks can be obtained:

20100103/file-2/BlockLocation-3

20100104/file-4/BlockLocation-6

20100104/file-4/BlockLocation-7

20100104/file-5/BlockLocation-8

Then, the above set of data blocks is handed to the parallel processing engine 440-3, which splits it and initiates parallel scan tasks on the parallel processing platform 420. For example, the four data blocks in the set are assigned to four parallel processing nodes and scanned simultaneously. Specifically, in step S504, the parallel processing platform 420 processes the query command according to the set of data blocks, and the parallel processing engine 440-3 merges the results produced by the parallel processing platform 420 and returns them to the query client.
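
The split, scan, and merge sequence can be approximated on a single machine with a thread pool. This is only a stand-in for the MapReduce platform of the embodiment: the `BLOCKS` dictionary of already-decompressed records, the record contents, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the four decompressed data blocks.
BLOCKS = {
    "20100103/file-2/BlockLocation-3": [("13500000001", "a"), ("13500000002", "b")],
    "20100104/file-4/BlockLocation-6": [("13500000002", "c")],
    "20100104/file-4/BlockLocation-7": [("13500000003", "d"), ("13500000002", "e")],
    "20100104/file-5/BlockLocation-8": [("13500000002", "f"), ("13500000002", "g")],
}


def scan_block(location, subscriber):
    """Sequentially scan one block and keep the matching records."""
    return [record for record in BLOCKS[location] if record[0] == subscriber]


def parallel_query(locations, subscriber, workers=4):
    """Scan each block in its own task, then merge the partial results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(lambda loc: scan_block(loc, subscriber), locations)
        return [record for part in partial_results for record in part]


results = parallel_query(list(BLOCKS), "13500000002")
```

Each task still scans its block sequentially, since the index only narrows the search to candidate blocks; running the scans concurrently is what keeps that cost low.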

The above are merely exemplary embodiments of the present application; those skilled in the art can modify the above embodiments within the scope of the present application.

Claims

1. A distributed database system, comprising:

an index store that stores the indexes of said plurality of data block files;

2. The system of claim 1, wherein said query statement comprises a query condition, said query condition comprises a plurality of attribute domains having said index, and

wherein said resolver, after analyzing said query statement, selects the indexes corresponding to said plurality of attribute domains respectively.

3. The system of claim 2, wherein said index query module queries the indexes corresponding to said plurality of attribute domains respectively to obtain a plurality of indexed data block sets, and determines the intersection or union of said plurality of indexed data block sets through a logical operation.

4. The system of claim 1, wherein said plurality of data block files are stored under different file directories in said plurality of distributed storage units according to different attributes.

5. The system of claim 1, wherein the stored data block files are encoded and compressed according to a compact encoding format.

6. A method for building an index in a distributed database system, comprising:

collecting the data to be stored;

dividing said data into a plurality of data blocks and determining the corresponding data block indexes;

storing the divided data blocks, in the form of files and in partitions, in a plurality of distributed storage units of said distributed database system; and

7. The method of claim 6, wherein the step of storing the compressed data blocks, in the form of files and in partitions, in a plurality of distributed storage units of said distributed database system comprises:

storing the compressed data blocks under different file directories in the plurality of distributed storage units of said distributed database system according to different data block attributes.

8. The method of claim 6, wherein said position information records the position of said data block in said file directory.

9. The method of claim 7, wherein said data block attribute is the time at which said data block was generated.

10. The method of claim 6, wherein the step of compressing said data into a plurality of data blocks and determining the corresponding data block indexes comprises:

encoding and compressing said data block by block according to a compact encoding format and determining the corresponding data block indexes.

11. The method of any one of claims 6-10, wherein the step of dividing said data into a plurality of data blocks and determining the corresponding data block indexes comprises:

dividing said data into a plurality of data blocks;

compressing the divided data blocks; and

determining a data block index for each compressed data block.

12. The method of claim 11, wherein said index key value points to all data blocks in which the index key value has appeared, and said index file records only the position at which the index key value first appears in each such data block.

13. A query method applied in a distributed database system, said distributed database system comprising an index formed by the method of claim 12, said query method comprising:

parsing a query statement and determining a corresponding query index;

14. The query method of claim 13, wherein said query statement comprises a query condition, and said query condition comprises a list of partitions used to narrow the range of partitions to be queried.

15. The query method of claim 13, wherein the step of parsing the query statement and determining the corresponding query index comprises:

determining by parsing that said query condition includes a plurality of indexed attribute domains, and selecting the indexes corresponding to said plurality of attribute domains respectively.

16. The query method of claim 15, wherein the step of searching said index file according to the selected query index to obtain at least one query data block set comprises:

querying said corresponding indexes respectively to obtain a plurality of indexed data block sets; and

determining the intersection or union of said plurality of indexed data block sets through a logical operation.