CN103902614A - Data processing method, device and system

Info

Publication number: CN103902614A
Application number: CN201210584674.1A
Authority: CN
Inventors: 徐萌; 何鸿凌; 杜宇健; 钱岭; 孙少陵; 金骏
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2014-07-02
Anticipated expiration: 2032-12-28
Also published as: CN103902614B

Abstract

An embodiment of the invention discloses a data processing method, device and system. In the method, a sharding server receives a data query request forwarded by a master server; the request carries a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located. The sharding server queries the corresponding column data in its own stored data according to the key field and the column field, and returns the queried column data to the master server as an array. The method reduces the performance overhead of data processing in a distributed column-store database system and improves data processing efficiency.

Description

Data processing method, device and system

Technical field

The present invention relates to the field of communication technology, and in particular to a data processing method, device and system.

Background technology

A distributed column-store database is a solution well suited to fast, distributed queries: it provides mass data storage while also effectively improving the speed of data queries.

Existing distributed column-store schemes focus mainly on how to implement data queries and pay little attention to the need for data analysis. In practice, however, most database workloads beyond simple queries are analytical: for example, summing a column under a given condition, or computing over several columns, such as the ratio of local call minutes to long-distance call minutes.

At present, distributed systems can address this need with distributed computing. For example, a Hadoop-based system uses MapReduce as its computing framework; the Map input interface is dbinputformat, which reads data one row at a time. Specifically:

1) The inputformat divides the data into several shards according to key;

2) Each Map reads one shard;

3) Inside the Map, the read/write interface provided by the distributed database is called to read one row record by key.

When the analysis is implemented inside the Map, the input arrives as row records, one row at a time. The Map must first pick out the specific fields to be processed by field position and only then process them; some operations, such as summation, must additionally enter the reduce stage. Clearly, this row-by-row read-and-process approach does not exploit the advantages of column storage.

In the course of making the present invention, the inventors found that the prior art has at least the following problem:

In distributed column storage, each column family is kept in its own file. An interface that reads one row record at a time must therefore read the corresponding fields from multiple files by key and merge them into a single record before returning it. Then, in the Map stage, because the operation targets particular columns, the row record must be split back into fields before any further processing. The result is a double performance loss: one for merging and one for splitting.

Summary of the invention

The embodiments of the present invention provide a data processing method, device and system, so as to reduce the performance overhead of data processing in a distributed column-store database system and improve data processing efficiency.

To achieve the above object, an embodiment of the present invention provides a data processing method, applied in a distributed column-store database system comprising a master server and a sharding server, the method comprising:

The sharding server receives a data query request forwarded by the master server, the request carrying a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located;

The sharding server queries the corresponding column data in its own stored data according to the key field and the column field, and returns the queried column data to the master server in the form of an array.
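The request and response shapes described above can be sketched roughly as follows. This is only an illustrative Python model: the names `QueryRequest` and `ShardingServer`, the in-memory dictionaries, and the sample column names are hypothetical, since the text does not specify a wire format or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryRequest:
    """Hypothetical request shape: which rows and which columns to read."""
    keys: List[str]      # key field: rows in which the requested data is located
    columns: List[str]   # column field: columns in which the requested data is located


class ShardingServer:
    """Toy sharding server that keeps its data column by column (an assumption)."""

    def __init__(self, stored: Dict[str, Dict[str, int]]):
        # stored[column_name][row_key] -> value
        self._stored = stored

    def query(self, request: QueryRequest) -> List[List[int]]:
        # One inner list per requested column; no row record is assembled.
        return [
            [self._stored[column][key] for key in request.keys]
            for column in request.columns
        ]


shard = ShardingServer({
    "local_minutes": {"u1": 120, "u2": 45},
    "long_minutes": {"u1": 30, "u2": 15},
    "age": {"u1": 31, "u2": 27},
})
result = shard.query(
    QueryRequest(keys=["u1", "u2"], columns=["local_minutes", "long_minutes"])
)
```

Here `result` holds one array per requested column, so a caller that only needs the two call-minute columns never touches the `age` column.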

An embodiment of the present invention also provides a distributed column-store database system, comprising a master server and a sharding server,

The master server is configured to receive a data query request initiated by a client and forward the data query request to the sharding server, and to receive the data in array form returned by the sharding server;

The sharding server is configured to receive the data query request forwarded by the master server, the request carrying a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located; to query the corresponding column data in its own stored data according to the key field and the column field; and to return the queried column data to the master server in the form of an array.

An embodiment of the present invention also provides a sharding server, applied in a distributed column-store database system comprising a master server, the sharding server comprising: a data slice module Hregion, at least one column module Hstore, and at least one column storage file HstoreFile; wherein:

The Hregion is configured to receive a data query request forwarded by the master server, the request carrying a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located; to determine the corresponding Hstore according to the column field and forward the data query request to that Hstore; and to receive the data file returned by the Hstore, generate a data array from the data file, and return the data array to the master server;

The Hstore is configured to, on receiving the data query request forwarded by the Hregion, determine the corresponding HstoreFile according to the key field and forward the data query request to that HstoreFile; and to receive the data file returned by the HstoreFile and return it to the Hregion;

The HstoreFile is configured to, on receiving the data query request forwarded by the Hstore, return the whole data file to the Hstore.

In the above embodiments of the present invention, after receiving the data query request forwarded by the master server, the sharding server queries the corresponding column data in its own stored data according to the key field and the column field, and returns the queried column data to the master server in the form of an array. This reduces the performance overhead of data processing in a distributed column-store database system and improves data processing efficiency.

Brief description of the drawings

Fig. 1 is a schematic diagram of the architecture of an existing distributed column-store database system;

Fig. 2 is a schematic flowchart of data reading in an existing distributed database;

Fig. 3 is a schematic flowchart of data processing by an existing Map task;

Fig. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present invention;

Fig. 5 is a schematic flowchart of a data processing method provided by an embodiment of the present invention;

Fig. 6 is a schematic flowchart of a data processing method provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a distributed column-store database system provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a sharding server provided by an embodiment of the present invention.

Embodiment

To make the technical solution provided by the embodiments of the present invention easier to understand, the architecture of an existing distributed column-store database system, and the conventional data processing method based on that architecture, are briefly described below.

Referring to Fig. 1, an existing distributed column-store database system comprises a master server (Master) and a sharding server (Tablet Server). The sharding server comprises: a data slice module (Hregion), at least one column module (Hstore), and at least one column storage file (HstoreFile); wherein:

One Hregion can store one or more data shards. A data shard contains all the data of one or more rows of the original data table; the number of shards can be determined according to the number of devices processing data in parallel;

Within a sharding server, the data stored in an Hregion is stored in different Hstores by column or column family (one Hstore stores the data of one column or one column family); the data stored in an Hstore is in turn stored by row in HstoreFiles. In a distributed column-store database, several columns that are often accessed together are defined as a column family.
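To make the nesting of Hregion, Hstore and HstoreFile concrete, the layout can be pictured as a nested mapping. The following Python sketch is purely illustrative: the region name, column names, row keys and values are invented, and real storage files are of course not Python dictionaries.

```python
from typing import Dict, List

# Hypothetical layout of one sharding server: Hregion -> Hstore -> HstoreFile.
LAYOUT: Dict[str, Dict[str, List[Dict[str, int]]]] = {
    "Hregion-1": {
        # One Hstore per column (or column family).
        "local_minutes": [
            {"u1": 120, "u2": 45},   # HstoreFile holding rows u1 and u2
            {"u3": 300},             # HstoreFile holding row u3
        ],
        "age": [
            {"u1": 31, "u2": 27, "u3": 40},
        ],
    },
}


def file_count(region: str, store: str) -> int:
    """Number of HstoreFiles backing one Hstore."""
    return len(LAYOUT[region][store])


def row_count(region: str, store: str) -> int:
    """Number of rows stored across all HstoreFiles of one Hstore."""
    return sum(len(hstore_file) for hstore_file in LAYOUT[region][store])
```

The point of the picture is that values of one column sit together in the same files, while the values of one row are spread across several Hstores.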

Based on the above distributed column-store database system, the prior-art data processing flow is shown in Fig. 2 and Fig. 3. It mainly involves two flows: the first is the flow by which the distributed database reads data; the second is the flow by which a Map task processes data.

Referring to Fig. 2, the prior-art flow by which the distributed database reads data may comprise the following steps:

Step 201: the master server receives a data query request sent by a client and, according to the key field carried in the request, which indicates the row in which the requested data is located, forwards the request to the corresponding Hregion.

Step 202: the Hregion receives the data query request and traverses its Hstores to query the data for the key field in each column.

Step 203: each Hstore determines the corresponding HstoreFile according to the key field.

Step 204: the HstoreFile determines the offset of the requested data from the index entry corresponding to the key field, and returns this offset to the Hstore.

Step 205: the Hstore reads the corresponding data at this offset and returns the data read to the Hregion.

Step 206: the Hregion splices together the results returned by all the Hstores.

Step 207: the Hregion returns the spliced result to the master server.

After obtaining the result, the master server outputs it to a Map task.
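The prior-art read path of steps 202 to 206 can be sketched as follows. This is a simplified Python model under stated assumptions: each Hstore is given a single HstoreFile, the "offset" is just a list position, and the class and method names are hypothetical rather than taken from any real database.

```python
from typing import Dict, List, Tuple


class HstoreFile:
    """Toy column file with an index from row key to offset."""

    def __init__(self, rows: List[Tuple[str, int]]):
        self.values = [value for _, value in rows]
        self.index = {key: offset for offset, (key, _) in enumerate(rows)}

    def offset_of(self, key: str) -> int:
        # Step 204: look up the offset of the requested data in the index.
        return self.index[key]


class Hstore:
    def __init__(self, hstore_file: HstoreFile):
        self.hstore_file = hstore_file

    def read(self, key: str) -> int:
        # Steps 203 and 205: find the file, get the offset, read one value.
        offset = self.hstore_file.offset_of(key)
        return self.hstore_file.values[offset]


class Hregion:
    def __init__(self, hstores: Dict[str, Hstore]):
        self.hstores = hstores

    def read_row(self, key: str) -> Tuple[int, ...]:
        # Steps 202 and 206: visit every Hstore and splice the results
        # into one row record, even if the caller needs only one column.
        return tuple(hstore.read(key) for hstore in self.hstores.values())


region = Hregion({
    "local_minutes": Hstore(HstoreFile([("u1", 120), ("u2", 45)])),
    "long_minutes": Hstore(HstoreFile([("u1", 30), ("u2", 15)])),
})
row = region.read_row("u2")
```

Every call to `read_row` touches all Hstores, which is the merge cost the description attributes to row-oriented reads.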

Referring to Fig. 3, the prior-art flow by which a Map task processes data may comprise the following steps:

Step 301: the Map reads in one record (that is, one row of data; the Map reads data row by row).

Step 302: the Map splits the corresponding field value out of the record it has read, according to metadata information.

Because the Map reads data row by row, while the data to be analyzed is the data of one or a few columns of the data table, the Map must, after reading the data, split the corresponding field value (such as age) out of the row according to metadata information.

Step 303: the Map processes the field value obtained (for example, by summation).
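Steps 301 to 303 amount to the loop below. The sketch is a minimal Python illustration, assuming that the metadata is simply an ordered list of field names and that the processing step is a sum; `METADATA` and `map_task` are hypothetical names.

```python
from typing import Iterable, List, Tuple

# Assumed metadata: the position of each field inside a row record.
METADATA: List[str] = ["name", "age", "local_minutes"]


def map_task(records: Iterable[Tuple], field: str) -> int:
    """Row-wise Map: read a record, split out one field, accumulate it."""
    position = METADATA.index(field)
    total = 0
    for record in records:          # step 301: read one row record
        value = record[position]    # step 302: split out the field by metadata
        total += value              # step 303: process the value (summation)
    return total


rows = [("alice", 31, 120), ("bob", 27, 45)]
age_sum = map_task(rows, "age")
```

Each record carries every field of the row, yet only one of them is used; this is the split cost that follows the merge cost on the read side.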

In the above scheme, data is still read by row. Because each column or column family in a distributed column-store database system is stored in its own file, an interface that reads data by row must read the corresponding fields from multiple files by key and then merge them into one record before returning it, so data read performance is low. Furthermore, in the Map task processing stage, because the operation targets particular columns, the row record read must be split into fields before further processing, which increases the performance overhead of data processing.

To address the above problems, the embodiments of the present invention provide a technical solution for data processing in a distributed column-store database system. In this solution, the data query request that the client sends to the master server of the distributed column-store database system carries not only a key field indicating the row in which the requested data is located, but also a column field indicating the column in which the requested data is located. After receiving the data query request, the master server forwards it to the corresponding sharding server according to the key field. After receiving the data query request forwarded by the master server, the sharding server queries the corresponding column data in its own stored data according to the key field and the column field, and returns the queried column data to the master server in the form of an array. This reduces the performance overhead of data processing in a distributed column-store database system and improves data processing efficiency.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the embodiments of the present invention.

Fig. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present invention, which may comprise the following steps:

Step 401: the sharding server receives a data query request forwarded by the master server, the request carrying a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located.

Specifically, in a distributed column-store database system, when a user needs to query data, the user can enter the corresponding query parameters in the client to initiate a data query request to the master server of the system.

To make full use of the advantages of a distributed column-store database system, in the embodiments of the present invention the data query request that the client sends to the master server carries, in addition to the conventional key field indicating the row in which the requested data is located, a column field indicating the column in which the requested data is located.

After receiving the data query request sent by the client, the master server determines, according to the key field carried in the request, the sharding server on which the requested data is located, and forwards the data query request to that sharding server.
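One way the master server might map a key field to a sharding server is by key range. The text does not say how shards are assigned, so the following Python sketch is an assumption throughout: `MasterServer`, `split_keys`, and the shard names are hypothetical, and range partitioning over sorted string keys is only one possible choice.

```python
import bisect
from typing import List


class MasterServer:
    """Toy router: shard i holds keys below split_keys[i]; the last shard holds the rest."""

    def __init__(self, split_keys: List[str], shard_names: List[str]):
        if len(shard_names) != len(split_keys) + 1:
            raise ValueError("need exactly one more shard than split keys")
        self.split_keys = split_keys      # must be sorted in ascending order
        self.shard_names = shard_names

    def route(self, key: str) -> str:
        # bisect_right counts how many split keys are <= key,
        # which is the index of the shard whose range contains the key.
        return self.shard_names[bisect.bisect_right(self.split_keys, key)]


master = MasterServer(
    split_keys=["h", "p"],
    shard_names=["shard-0", "shard-1", "shard-2"],
)
target = master.route("mike")
```

Only the key field takes part in routing; the column field travels with the request unchanged and is interpreted by the sharding server.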

Step 402: the sharding server queries the corresponding column data in its own stored data according to the key field and the column field, and returns the queried column data to the master server in the form of an array.

Specifically, in the embodiments of the present invention, the sharding server's querying of the corresponding column data according to the key field and the column field, and its return of the queried column data to the master server in the form of an array, may be implemented by the following steps:

Step 4021: the Hregion determines the corresponding Hstore according to the column field carried in the data query request, and forwards the data query request to that Hstore.

Specifically, the data shards stored in the Hregion are stored in Hstores by column or column family. After receiving the data query request, the Hregion determines the column in which the requested data is located according to the column field carried in the request, then determines the Hstore that stores the data of that column, and forwards the data query request to that Hstore.

Step 4022: the Hstore determines the corresponding HstoreFile according to the key field carried in the data query request, and forwards the data query request to that HstoreFile.

Specifically, the data stored in the Hstore is stored by row in HstoreFiles. After receiving the data query request, the Hstore determines the row in which the requested data is located according to the key field carried in the request, then determines the HstoreFile that stores the data of that row, and forwards the data query request to that HstoreFile.

Step 4023: after receiving the data query request, the HstoreFile returns the whole data file to the Hstore.

Specifically, in the prior art, after receiving the data query request, the HstoreFile must determine the offset of the requested data from the index entry corresponding to the key field and return this offset to the Hstore, which then reads the full row data of the corresponding row at this offset.

To improve data processing efficiency, in the embodiments of the present invention the HstoreFile, after receiving the data query request, returns the whole data file directly to the Hstore, so that the Hstore obtains the corresponding column data directly, without reading full row data by offset.

Step 4024: the Hstore returns the received data file to the Hregion.

Step 4025: the Hregion generates a data array from the received data file and returns the data array to the master server.

In this way, column data is read from the distributed column-store database system in a manner that makes full use of the advantages of column storage, reducing the performance overhead of data reading and improving the efficiency of data processing.
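Steps 4021 to 4025 can be sketched as a chain of three small classes. This is a hedged Python illustration, not the actual implementation: the class names mirror the module names in the text, but the method names, the dictionary-backed files, and the sample data are assumptions.

```python
from typing import Dict, List


class HstoreFile:
    """Toy column storage file holding the values of one column for some rows."""

    def __init__(self, rows: Dict[str, int]):
        self.rows = rows

    def covers(self, key: str) -> bool:
        return key in self.rows

    def read_all(self) -> List[int]:
        # Step 4023: return the whole data file, with no offset lookup.
        return list(self.rows.values())


class Hstore:
    def __init__(self, files: List[HstoreFile]):
        self.files = files

    def handle(self, key: str) -> List[int]:
        # Step 4022: pick the HstoreFile by key field;
        # step 4024: pass the returned file contents back to the Hregion.
        for hstore_file in self.files:
            if hstore_file.covers(key):
                return hstore_file.read_all()
        raise KeyError(key)


class Hregion:
    def __init__(self, hstores: Dict[str, Hstore]):
        self.hstores = hstores

    def handle(self, key: str, columns: List[str]) -> List[List[int]]:
        # Step 4021: pick Hstores by column field;
        # step 4025: assemble the data array, one entry per requested column.
        return [self.hstores[column].handle(key) for column in columns]


region = Hregion({
    "local_minutes": Hstore([HstoreFile({"u1": 120, "u2": 45}), HstoreFile({"u3": 300})]),
    "long_minutes": Hstore([HstoreFile({"u1": 30, "u2": 15}), HstoreFile({"u3": 60})]),
    "age": Hstore([HstoreFile({"u1": 31, "u2": 27, "u3": 40})]),
})
data_array = region.handle("u1", ["local_minutes", "long_minutes"])
```

In this sketch the `age` Hstore is never visited, and no row record is spliced together, in contrast to the row-oriented read path of the prior art.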

After receiving the data returned by the sharding server, the master server outputs the data to a Map task for further processing.

As shown in Fig. 6, the data processing method provided by the embodiments of the present invention may further comprise the following steps:

Step 601: the Map reads in one ColRecord.

Specifically, in the embodiments of the present invention, the following structure is defined:

ColRecord(coldata[1], coldata[2], ..., coldata[n])

where n is the number of columns of column data queried by the sharding server, coldata[i] is one column of that column data, and i is a positive integer not greater than n.

After receiving the data array output by the master server, the Map reads the data according to the above data structure.

Step 602: the Map obtains each column of data from the ColRecord.

Step 603: the Map processes the obtained column data column by column.

The data that the master server outputs to the Map task is no longer full row data but a data array. After receiving the data array output by the master server, the Map can read the data according to the ColRecord structure and directly obtain each column of data to be processed, and can then analyze and process each column of data column by column, without splitting row records into fields. This further reduces the performance overhead of data processing and improves data processing efficiency.
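A column-wise Map over a ColRecord might look like the sketch below. It is a minimal Python illustration under assumptions: the text defines only the shape ColRecord(coldata[1], ..., coldata[n]), so the dataclass, the zero-based list indexing, the function name, and the choice of a call-minute ratio as the analysis are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColRecord:
    """coldata[i] holds one queried column; Python lists are zero-based,
    so coldata[0] here corresponds to coldata[1] in the text."""
    coldata: List[List[float]]


def local_to_long_ratio(record: ColRecord) -> float:
    """Column-wise Map: each column arrives whole, so no per-row splitting is needed."""
    local_minutes, long_minutes = record.coldata   # step 602: obtain each column
    return sum(local_minutes) / sum(long_minutes)  # step 603: process by column


record = ColRecord(coldata=[[120, 60], [30, 15]])
ratio = local_to_long_ratio(record)
```

With the sample data, the local total is 180 and the long-distance total is 45, so `ratio` evaluates to 4.0.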

As can be seen from the above description, in the technical solution provided by the embodiments of the present invention, the data query request that the client sends to the master server of the distributed column-store database system carries not only a key field indicating the row in which the requested data is located, but also a column field indicating the column in which the requested data is located. After receiving the data query request, the master server forwards it to the corresponding sharding server according to the key field. After receiving the data query request forwarded by the master server, the sharding server queries the corresponding column data in its own stored data according to the key field and the column field, and returns the queried column data to the master server in the form of an array. This reduces the performance overhead of data processing in a distributed column-store database system and improves data processing efficiency.

Based on the same technical concept as the above method embodiment, an embodiment of the present invention provides a distributed column-store database system.

Fig. 7 is a schematic structural diagram of a distributed column-store database system provided by an embodiment of the present invention, which may comprise a master server 71 and a sharding server 72, wherein:

The master server 71 may be configured to receive a data query request initiated by a client and forward the data query request to the sharding server 72, and to receive the data in array form returned by the sharding server 72;

The sharding server 72 is configured to receive the data query request forwarded by the master server 71, the request carrying a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located; to query the corresponding column data in its own stored data according to the key field and the column field; and to return the queried column data to the master server 71 in the form of an array.

The sharding server 72 comprises a data slice module Hregion, at least one column module Hstore, and at least one column storage file HstoreFile; wherein:

The master server 71 may also be configured to output the data array to a Map, so that the Map reads data according to the data array and analyzes and processes the obtained column data column by column.

Specifically, the master server is configured to output the data array to the Map, so that the Map reads the data array according to the ColRecord structure;

The ColRecord structure is specifically:

ColRecord(coldata[1], coldata[2], ..., coldata[n])

In the distributed column-store database system provided by the embodiments of the present invention, one master server may correspond to one or more sharding servers.

Based on the same technical concept as the above method embodiment, an embodiment of the present invention also provides a sharding server, which can be applied to the above method embodiment.

Fig. 8 is a schematic structural diagram of a sharding server provided by an embodiment of the present invention, which may comprise: a data slice module Hregion 81, at least one column module Hstore 82, and at least one column storage file HstoreFile 83; wherein:

The Hregion 81 is configured to receive a data query request forwarded by the master server, the request carrying a key field indicating the row in which the requested data is located and a column field indicating the column in which the requested data is located; to determine the corresponding Hstore 82 according to the column field and forward the data query request to that Hstore 82; and to receive the data file returned by the Hstore 82, generate a data array from the data file, and return the data array to the master server;

The Hstore 82 is configured to, on receiving the data query request forwarded by the Hregion 81, determine the corresponding HstoreFile 83 according to the key field and forward the data query request to that HstoreFile 83; and to receive the data file returned by the HstoreFile 83 and return it to the Hregion 81;

The HstoreFile 83 is configured to, on receiving the data query request forwarded by the Hstore 82, return the whole data file to the Hstore 82.

From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of the present invention may be implemented in hardware, or in software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the embodiments of the present invention may be embodied as a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard drive) and which includes a number of instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the embodiments of the present invention.

Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to implement the embodiments of the present invention.

Those skilled in the art will appreciate that the modules of a device in an implementation scenario may be distributed in that device as described for the scenario, or may be changed accordingly and located in one or more devices different from that of the implementation scenario. The modules of the above implementation scenario may be combined into one module or further split into multiple sub-modules.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the implementation scenarios.

The above discloses only several specific implementation scenarios of the embodiments of the present invention; however, the embodiments of the present invention are not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the scope of protection of the embodiments of the present invention.

Claims

1. A data processing method, applied in a distributed column-store database system comprising a master server and a sharding server, characterized in that the method comprises:

2. The method of claim 1, characterized in that the sharding server comprises a data slice module Hregion, at least one column module Hstore, and at least one column storage file HstoreFile;

The sharding server querying the corresponding column data in its own stored data according to the key field and the column field, and returning the queried column data to the master server in the form of an array, is specifically:

The Hregion determines the corresponding Hstore according to the column field, and forwards the data query request to that Hstore;

The Hstore determines the corresponding HstoreFile according to the key field, and forwards the data query request to that HstoreFile;

After receiving the data query request, the HstoreFile returns the whole data file to the Hstore;

The Hstore returns the received data file to the Hregion;

The Hregion generates a data array from the received data file, and returns the data array to the master server.

3. The method of claim 2, characterized in that the method further comprises:

The master server outputs the data array to a Map, so that the Map reads data according to the data array and analyzes and processes the obtained column data column by column.

4. The method of claim 3, characterized in that the Map reading data according to the data array is specifically:

The Map reads the data array according to the ColRecord structure;

The ColRecord structure is specifically:

ColRecord(coldata[1], coldata[2], ..., coldata[n])

5. A distributed column-store database system, comprising a master server and a sharding server, characterized in that,

6. The distributed column-store database system of claim 5, characterized in that the sharding server comprises a data slice module Hregion, at least one column module Hstore, and at least one column storage file HstoreFile; wherein:

7. The system of claim 6, characterized in that,

The master server is further configured to output the data array to a Map, so that the Map reads data according to the data array and analyzes and processes the obtained column data column by column.

8. The system of claim 7, characterized in that,

The master server is specifically configured to output the data array to the Map, so that the Map reads the data array according to the ColRecord structure;

The ColRecord structure is specifically:

ColRecord(coldata[1], coldata[2], ..., coldata[n])

9. A sharding server, applied in a distributed column-store database system comprising a master server, characterized in that the sharding server comprises: a data slice module Hregion, at least one column module Hstore, and at least one column storage file HstoreFile; wherein: