CN103631910A

CN103631910A - Distributed database multi-column composite query system and method

Info

Publication number: CN103631910A
Application number: CN201310615977.XA
Authority: CN
Inventors: 孙杰; 阎星娥; 赵万亮; 杨昆
Original assignee: Fiberhome Telecommunication Technologies Co Ltd
Current assignee: Fiberhome Telecommunication Technologies Co Ltd
Priority date: 2013-11-26
Filing date: 2013-11-26
Publication date: 2014-03-12

Abstract

The invention discloses a system and method for multi-column composite queries in a distributed database. The system consists of a storage subsystem, an index subsystem, a linear sequence generator, a database entry module and a query module. When data is loaded into the database and indexes are built, a monotonically increasing sequence value is generated for each data record, and the value of each index field is combined with that sequence value to form the row key of an index table. When an index is scanned, results are returned in row-key order, so execution is efficient and few system resources are occupied. Index key-value queries, merging of index results, and lookups in the storage subsystem can execute concurrently, which greatly improves query response speed.

Description

A system and method for multi-column composite queries in a distributed database

Technical field

The present application belongs to the field of information technology and relates in particular to a system and method for multi-column composite queries in a distributed database.

Background art

Many industries today generate large amounts of data every day. As technology and business develop, data is produced ever faster and data volumes keep growing. Traditional databases are poorly suited to storing such massive data sets and quickly retrieving the needed data from them, which has given rise to a variety of distributed databases.

In large-scale data management, the key factors affecting query speed are the volume of data that must be accessed and the disk I/O. Indexing is an important method for improving query performance in database practice.

In the distributed database systems in common use today, a multi-column query, that is, a query whose condition involves several index key values, is usually handled in one of the following ways:

1. Run an index query for each index key value separately to obtain a series of result sets, then take the intersection or union of these result sets according to the logical relationship between the index key values, finally obtaining a single result set without duplicates. During merging, each result in each result set must be looked up in turn to see whether it is present in the other result sets. To improve merge efficiency, there are usually two concrete implementations:

a) Sort each result set, then merge the sorted result sets;

b) Store the values of each result set in a hash container to speed up lookups.

2. From the several index key values, choose one with relatively high selectivity and run an index query on it to obtain a result set. Scan all the data in this result set and filter it using the remaining index key values in the query condition, which were not used for index queries, to obtain the final query result set.

For example, consider the following query:

select * from user_info where username = 'CC' and sex = 'male'

Searching on username alone returns relatively few results, that is, the username column has relatively high selectivity. The search is therefore performed only on the condition username = 'CC'; its result set is traversed, and the results that also satisfy sex = 'male' are returned to the querying user.
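The second approach can be sketched as follows. This is only a minimal illustration, assuming an in-memory list of rows in place of the table and a hypothetical dictionary `username_index` in place of the index; the sample rows and all names other than user_info, username and sex are assumptions, not part of the application.

```python
# Hypothetical in-memory stand-ins for the user_info table and its username index.
user_info = [
    {"id": 0, "username": "AA", "sex": "male"},
    {"id": 1, "username": "CC", "sex": "female"},
    {"id": 2, "username": "CC", "sex": "male"},
    {"id": 3, "username": "BB", "sex": "male"},
]

# Index on the more selective column: username -> positions of matching rows.
username_index = {}
for pos, row in enumerate(user_info):
    username_index.setdefault(row["username"], []).append(pos)


def query_by_selective_index(username, sex):
    """Look up the selective column via its index, then filter on the other column."""
    candidates = username_index.get(username, [])
    # Every candidate row is read in full before the remaining condition is checked.
    return [user_info[pos] for pos in candidates if user_info[pos]["sex"] == sex]


result = query_by_selective_index("CC", "male")
print(result)
```

Every candidate row is fetched before the second condition can be tested, so the cost grows with the size of the first result set.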

However, the prior art suffers from problems such as low efficiency and high resource usage when performing multi-column queries.

In approach 1(a) above, every result set must be sorted. Sorting cannot complete until the queries for all index key values have finished, and merging and returning results can start only after sorting is done. With this approach, if one result set is very large, results cannot be returned quickly even if all the other result sets are small; the response speed is bounded by the slowest sub-query.

In approach 1(b) above, every result set must be stored in a hash container, which consumes considerable memory; when a result set is very large, this may exceed the system's peak load capacity.

Approach 2 above applies only when the logical relationship between the index key values is AND; if the relationship is OR, it does not apply. Moreover, because the workloads and data in a real operating environment are complex and changeable, accurately choosing a highly selective index key value is difficult and sometimes impossible. Querying on a single index key value may then return a great many results, and reading the data for all of these index results from the raw data storage module and filtering it causes heavy disk I/O. Excessive data access volume and disk I/O are precisely the common performance bottlenecks of large-scale databases.

Summary of the invention

The technical problem to be solved by the present application is to provide an optimized method for multi-column composite queries in a distributed database, addressing the low efficiency and high resource usage of multi-column queries in current distributed database systems.

To solve the above technical problem, the present application provides a system and method for multi-column composite queries in a distributed database. The system consists of a storage subsystem, an index subsystem, a linear sequence generator, a database entry module and a query module, wherein:

The storage subsystem uses a distributed file system comprising multiple data blocks stored in partitions, and stores the complete raw data;

The index subsystem uses a distributed column-oriented database and stores the indexes of the data;

The linear sequence generator generates a monotonically increasing sequence value for each data record before the data is loaded;

The database entry module writes the raw data to the storage subsystem and builds the corresponding indexes in the index subsystem;

The query module is further divided into several submodules, namely a query parsing module, an index query module and a raw data scanning module. It processes users' query requests and returns the query results.

When data is loaded and indexes are built, a monotonically increasing sequence value is generated for each data record, and the value of the index field is combined with this sequence value to form the row key of the index table. An index scan returns results sorted by row key. Consequently, a query on a given index key value returns results sorted by sequence value. Merging the query results of several index key values therefore amounts to merging several sorted queues, which is efficient and uses few resources, helping to improve query response speed and the number of concurrent queries the system can support.
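The effect of the composite row key can be illustrated with a small sketch. The application does not specify how the two parts are encoded, so the separator, the zero-padded fixed width and the names `make_row_key` and `scan` are all assumptions chosen so that plain string order matches (field value, sequence value) order. A sorted Python list stands in for the index table.

```python
import bisect

SEP = "\x00"       # assumed separator; sorts before any printable character
SEQ_WIDTH = 20     # assumed fixed width so that string order equals numeric order


def make_row_key(field_value, sequence):
    """Combine an index field value with the record's sequence value."""
    return f"{field_value}{SEP}{sequence:0{SEQ_WIDTH}d}"


def scan(index_table, field_value):
    """Yield sequence values of all entries for field_value, in row-key order."""
    prefix = field_value + SEP
    start = bisect.bisect_left(index_table, prefix)
    for row_key in index_table[start:]:
        if not row_key.startswith(prefix):
            break
        yield int(row_key[len(prefix):])


# Entries arrive in arbitrary order; the index table keeps them sorted by row key.
index_table = sorted(
    make_row_key(value, seq)
    for value, seq in [("b", 12), ("a", 7), ("b", 3), ("a", 100), ("b", 9)]
)

print(list(scan(index_table, "b")))  # sequence values come back in increasing order
```

Because every scan already yields ascending sequence values, no per-query sort or hash container is needed before merging.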

During a query, the query parsing module in the query module decomposes the query statement into several query sub-conditions, each of which is an index key value. From each index key value, the data records containing that key value, together with the storage locations of those records, are obtained as a result set. The query module merges these result sets into one. During merging, either the monotonically increasing sequence value or the storage location of a record can serve as the basis for deciding whether two entries refer to different records. Using the merged result set, the storage subsystem is looked up and the raw data record contents obtained are returned to the querying client.

The beneficial effects of the present application are:

1. Because every sub-result set is sorted according to the same monotonically increasing sequence, the merge operation of the described method executes quickly;

2. Merging can begin as soon as the query for each index key value has returned part of its results; there is no need to wait for every index key-value query to finish before merging;

3. Likewise, lookups in the storage subsystem based on the merged result set need not wait for merging to complete. The index key-value queries, the merging of index results and the storage subsystem lookups can thus run concurrently, greatly improving query response speed;

4. Because column-oriented storage is used, the I/O needed for access is confined to the required fields, greatly reducing I/O demand.
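Effects 2 and 3 describe a pipeline in which later stages start before earlier ones finish. The sketch below imitates this with Python generators in a single thread; it shows the pipelining only, not true parallel execution, and the stage names (`index_scan`, `merge_union`, `fetch_records`) and the sample sequence values are hypothetical.

```python
import heapq

events = []  # records the order in which the pipeline stages run


def index_scan(name, sequence_values):
    """Simulated index scan that hands out one sequence value at a time."""
    for seq in sequence_values:
        events.append(f"scan {name} -> {seq}")
        yield seq


def merge_union(*scans):
    """Lazily merge ascending scans, dropping repeated sequence values."""
    previous = None
    for seq in heapq.merge(*scans):   # lazy k-way merge of ascending inputs
        if seq != previous:           # skip records found through several indexes
            events.append(f"merge -> {seq}")
            yield seq
        previous = seq


def fetch_records(merged):
    """Simulated storage lookup for each merged sequence value."""
    for seq in merged:
        events.append(f"fetch {seq}")
        yield f"record-{seq}"


pipeline = fetch_records(merge_union(index_scan("A", [1, 4, 9]),
                                     index_scan("B", [2, 4, 7])))
first = next(pipeline)
print(first)
print(events)
```

After the first result has been fetched, `events` shows that each scan has produced only its first value; the rest of both scans has not yet run.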

According to estimation and simulation, with data access optimization, process optimization and early triggering of result-set computation, I/O requests are reduced by half on average and response speed can be more than doubled; if a result-return threshold is set, response speed can improve more than tenfold.

Brief description of the drawings

Figure 1 is the system architecture diagram.

Figure 2 is the data loading flowchart.

Figure 3 is a schematic diagram of the index tables of Embodiment 1.

Figure 4 is the data query flowchart.

Detailed description of the embodiments

The system for multi-column composite queries in a distributed database described in the present application consists of an index subsystem, a linear sequence generator, a database entry module and a query module. Its architecture is shown in Figure 1. The query module comprises a query parsing module, an index query module and a raw data scanning module.

The data loading flow is shown in Figure 2. Before data is loaded, a sequence value is generated for each data record. The sequence values are produced by the linear sequence generator and form a monotonically increasing sequence. Preferably, if the raw record already contains a field whose value is monotonically increasing and never empty, the linear sequence generator can use the value of that field directly as the sequence value.
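A minimal sketch of such a generator is given below. It assumes a single in-process counter, which sidesteps the coordination a real distributed deployment would need, and it treats "monotonically increasing" as strictly increasing so that sequence values can identify records. The class name, the `reuse_field` option and the `ts` field are hypothetical.

```python
import itertools


class LinearSequenceGenerator:
    """Assigns each record a monotonically increasing sequence value."""

    def __init__(self, reuse_field=None):
        self._counter = itertools.count(1)
        self._reuse_field = reuse_field  # hypothetical option naming an existing field
        self._last = None

    def next_value(self, record):
        if self._reuse_field is not None:
            value = record.get(self._reuse_field)
            # The field qualifies only if it is non-empty and strictly increasing.
            if value is None or (self._last is not None and value <= self._last):
                raise ValueError("field is empty or not monotonically increasing")
        else:
            value = next(self._counter)
        self._last = value
        return value


gen = LinearSequenceGenerator()
seqs = [gen.next_value({"UserName": "x"}) for _ in range(3)]

by_field = LinearSequenceGenerator(reuse_field="ts")
reused = [by_field.next_value({"ts": t}) for t in (10, 20, 35)]
print(seqs, reused)
```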

During loading, the raw data is first stored in the storage subsystem to obtain its storage location, and then an index is built for the data record.

Indexes can be built on several fields of a raw data table. When building indexes, the fields of the raw data that need to serve as query conditions are taken as index fields, and each index field corresponds to one index table in the index subsystem. Each index entry consists of a row key and a column value. The row key is composed of the value of the index field and the monotonically increasing sequence value of the data record. The column value is the storage location of the data record in the storage subsystem, comprising the position of the data block that holds the record and the offset of the record within that block, so the record can be located directly from this storage location.
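How a (data block, offset) pair allows a record to be located directly can be sketched with a toy store. This is an illustration under stated assumptions, not the storage subsystem of the application: blocks are in-memory byte arrays, records are length-prefixed strings, and `BlockStore`, `block_size` and the field names of `RecordLocation` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordLocation:
    block_id: int   # which data block holds the record
    offset: int     # byte offset of the record inside that block


class BlockStore:
    """Toy stand-in for the storage subsystem: append-only in-memory blocks."""

    def __init__(self, block_size=64):
        self.block_size = block_size
        self.blocks = [bytearray()]

    def append(self, record: str) -> RecordLocation:
        data = record.encode("utf-8")
        framed = len(data).to_bytes(4, "big") + data  # length prefix, then payload
        if self.blocks[-1] and len(self.blocks[-1]) + len(framed) > self.block_size:
            self.blocks.append(bytearray())  # start a new block when this one is full
        block = self.blocks[-1]
        location = RecordLocation(len(self.blocks) - 1, len(block))
        block.extend(framed)
        return location

    def read(self, location: RecordLocation) -> str:
        """Jump straight to the record; no scan of the block is needed."""
        block = self.blocks[location.block_id]
        start = location.offset + 4
        length = int.from_bytes(block[location.offset:start], "big")
        return block[start:start + length].decode("utf-8")


raw_records = ["1,Zhang San,General merchandise,100",
               "2,Li Si,Digital,1000",
               "3,Li Si,General merchandise,200"]
store = BlockStore()
locations = [store.append(r) for r in raw_records]
print(locations[2], store.read(locations[2]))
```

The location returned by `append` is what an index entry would hold as its column value.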

An index scan returns results sorted by row key, so a query on a given index key value yields results sorted by sequence value.

Embodiment 1: consider a customer transaction record table (ExchangeInfo). Each transaction record comprises a user identifier field (UserName), a merchandise category field (Category) and a transaction price field (Price); in addition, a sequence value (Sequence) is generated for every record. The detailed data of the transaction record table is shown below:

Table 1: Customer transaction information table

Sequence	UserName	Category	Price
1	Zhang San	General merchandise	100
2	Li Si	Digital	1000
3	Li Si	General merchandise	200
4	Wang Wu	General merchandise	300

With user identifier and merchandise category as the index fields, there are two corresponding index tables in the index subsystem: the user identifier index table and the merchandise category index table. As shown in Figure 3, each index table has two columns: the row key (RowKey) and the column value, which is the storage location of the data record (RecordLocation).
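Since Figure 3 is not reproduced here, the following sketch derives what the two index tables would contain for the four rows of Table 1. The row key is modelled as a (field value, sequence value) tuple and the storage locations are placeholder strings; both are assumptions, as is the function name `build_index_table`.

```python
# Rows of Table 1: (Sequence, UserName, Category, Price).
records = [
    (1, "Zhang San", "General merchandise", 100),
    (2, "Li Si", "Digital", 1000),
    (3, "Li Si", "General merchandise", 200),
    (4, "Wang Wu", "General merchandise", 300),
]


def build_index_table(records, field_pos):
    """Return (RowKey, RecordLocation) rows sorted by row key."""
    rows = []
    for position, record in enumerate(records):
        sequence = record[0]
        row_key = (record[field_pos], sequence)    # index field value + sequence value
        location = f"block-0/record-{position}"    # placeholder storage location
        rows.append((row_key, location))
    return sorted(rows)


user_index = build_index_table(records, field_pos=1)
category_index = build_index_table(records, field_pos=2)

for name, table in (("UserName", user_index), ("Category", category_index)):
    print(name)
    for row_key, location in table:
        print("  ", row_key, "->", location)
```

Within each index value, the entries appear in ascending Sequence order, which is the property the merge step relies on.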

The data query flow is shown in Figure 4. The query parsing module decomposes the query statement into several query sub-conditions, each of which is an index key value. From each index key value, the data records containing that key value, together with the storage locations of those records, are obtained as a result set. The query module merges these result sets into one.

When the logical relationship between the query sub-conditions is AND, the intersection of the sub-conditions' result sets is taken. If the query for any one sub-condition has finished and all of its results have been merged, or if the number of query results reaches the configured result-return threshold, the queries for the other sub-conditions and the result-set merging are stopped;
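The AND case can be sketched as an intersection of ascending streams of sequence values. The function below is an illustrative assumption rather than the implementation of the application: `intersect_sorted` and its `limit` parameter (standing in for the configured threshold) are hypothetical names, and each stream is assumed to contain no repeated sequence values.

```python
_DONE = object()  # sentinel marking an exhausted stream


def intersect_sorted(streams, limit=None):
    """Yield values present in every ascending stream, stopping as early as possible."""
    iterators = [iter(s) for s in streams]
    if not iterators:
        return
    heads = [next(it, _DONE) for it in iterators]
    emitted = 0
    # As soon as any stream is exhausted, no further matches are possible.
    while _DONE not in heads:
        target = max(heads)
        if min(heads) == target:
            yield target
            emitted += 1
            if limit is not None and emitted >= limit:
                return  # configured result threshold reached
            # Move every stream past the matched value.
            heads = [next(it, _DONE) for it in iterators]
        else:
            # Advance only the streams that lag behind the largest head.
            for i, it in enumerate(iterators):
                while heads[i] is not _DONE and heads[i] < target:
                    heads[i] = next(it, _DONE)


# Sequence values from Table 1: UserName = 'Li Si' and Category = 'General merchandise'.
li_si = [2, 3]
general_merchandise = [1, 3, 4]
matches = list(intersect_sorted([li_si, general_merchandise]))
print(matches)
```

Because the inputs are consumed lazily, the remaining streams are simply not read any further once one stream ends or the limit is reached.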

When the logical relationship between the query sub-conditions is OR, the union of the sub-conditions' result sets is taken. If at this point only one result stream has not yet been fully merged, its remaining results can be placed directly into the merged result set.
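For the OR case, a two-stream union is enough to show both the deduplication by sequence value and the shortcut taken once a single stream remains. The name `union_sorted` is hypothetical, and extending the sketch to more than two streams is left aside.

```python
_END = object()  # sentinel marking an exhausted stream


def union_sorted(left, right):
    """Merge two ascending streams into one ascending stream without duplicates."""
    left, right = iter(left), iter(right)
    a, b = next(left, _END), next(right, _END)
    while a is not _END and b is not _END:
        if a == b:              # same record found through both indexes
            yield a
            a, b = next(left, _END), next(right, _END)
        elif a < b:
            yield a
            a = next(left, _END)
        else:
            yield b
            b = next(right, _END)
    # Only one stream can still have data: pass it through without comparisons.
    if a is not _END:
        yield a
        yield from left
    if b is not _END:
        yield b
        yield from right


# Sequence values from Table 1: UserName = 'Li Si' or Category = 'Digital'.
either = list(union_sorted([2, 3], [2]))
print(either)
```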

The above description of the embodiments is intended only to aid understanding of the principle of the present application. For a person of ordinary skill in the art, both the specific implementation and the scope of application may vary in accordance with the embodiments of the present application, so this description should not be construed as limiting the present application.

Claims

1. A system for multi-column composite queries in a distributed database, characterized in that it consists of a storage subsystem, an index subsystem, a linear sequence generator, a database entry module and a query module.

2. The system for multi-column composite queries in a distributed database as claimed in claim 1, characterized in that the storage subsystem uses a distributed file system, the index subsystem uses a distributed column-oriented database, and the query module comprises a query parsing module, an index query module and a raw data scanning module.

3. A method for multi-column composite queries in a distributed database, characterized in that: when data is loaded and indexes are built, a monotonically increasing sequence value is generated for each data record, and the value of the index field is combined with the sequence value to form the row key of the index table; an index scan returns results sorted by row key.

4. The method for multi-column composite queries in a distributed database as claimed in claim 3, characterized in that: if the raw record contains a field whose value is monotonically increasing and never empty, the value of that field can be used directly as the sequence value.

5. The method for multi-column composite queries in a distributed database as claimed in claim 3, characterized in that: during a query, the query parsing module in the query module decomposes the query statement into several query sub-conditions, each of which is an index key value; from each index key value, the data records containing that key value, together with the storage locations of those records, are obtained as a result set; the query module merges these result sets into one, looks up the storage subsystem using the merged result set, and returns the raw data record contents obtained to the querying client.

6. The method for multi-column composite queries in a distributed database as claimed in claim 3, characterized in that:

when the logical relationship between the query sub-conditions is OR, the union of the sub-conditions' result sets is taken; if only one result stream has not yet been fully merged, its remaining results are placed directly into the merged result set.