CN104699815A - Data processing method and system

Info

Publication number: CN104699815A
Application number: CN201510131621.8A
Authority: CN
Inventors: 董旭; 冯海涛
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2015-03-24
Filing date: 2015-03-24
Publication date: 2015-06-10

Abstract

The invention relates to data processing methods and systems, in particular to a data writing method and system and a data reading method and system. The data writing method includes: reading data of a predetermined number of rows to generate corresponding file blocks; creating a header file for each file block, where the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block together with its header file forms a data block; and writing the data blocks to storage nodes. The methods and systems allow data to be compressed efficiently, which lowers storage cost and saves storage space, and they increase the speed of data query and analysis.

Description

Data processing method and system

Technical field

Embodiments of the present disclosure relate generally to the field of data storage and data processing, and in particular to data writing and reading methods and their corresponding systems.

Background technology

Enterprises, organizations, agencies and similar institutions generate large amounts of data in their daily operations. These data grow day by day, and the data volume eventually becomes extremely large. Such big data provide valuable raw material for business development, statistical analysis, policy making and the like. However, as the volume of gathered or collected data keeps increasing, the load on the systems that store the data grows continuously and places ever higher demands on the storage architecture. How to improve the storage capacity and query efficiency for massive data has become one of the difficult problems of big data.

Traditional data storage is either row-oriented or column-oriented. Fig. 1(a) and Fig. 1(b) respectively show schematic diagrams of row-oriented and column-oriented storage, in which data are stored by row or by column across multiple different storage nodes (i.e. distributed storage). Each of the two storage modes has its own problems.

For row-oriented storage, mainstream data queries (or analyses) today are column-based, and querying a row-oriented database with a column-based query method is seriously inefficient. For example, suppose a database includes multiple fields such as "ID", "name", "count" and "year", and the data are queried with Structured Query Language (SQL), e.g. "SELECT name FROM order WHERE year=2014". Because a row-oriented database can only be read row by row, the query has to read out every row in the database and then extract the data that satisfy the query condition, which makes the query slow. In addition, when only a few columns are queried, the unnecessary columns cannot be skipped. Finally, because each row mixes values of different data types, row-oriented storage does not easily achieve a high compression ratio, i.e. its space utilization is low.
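The cost difference described above can be illustrated with a toy model. The following is a minimal sketch, not part of the disclosure: the table contents, the function names and the idea of counting "fields read" as a proxy for I/O are all assumptions made for illustration.

```python
# Toy table with the fields named in the example query (values are made up).
FIELDS = ("ID", "name", "count", "year")
ROWS = [
    (1, "a", 3, 2013),
    (2, "b", 5, 2014),
    (3, "c", 7, 2014),
]

def query_row_store(rows, year):
    """Row-oriented read: every field of every row is touched."""
    fields_read = 0
    names = []
    for row in rows:
        fields_read += len(row)          # the whole row must be read
        record = dict(zip(FIELDS, row))
        if record["year"] == year:
            names.append(record["name"])
    return names, fields_read

def query_column_store(columns, year):
    """Column-oriented read: only 'year' and the matching 'name' values."""
    years = columns["year"]
    fields_read = len(years)             # scan the filter column only
    hits = [i for i, y in enumerate(years) if y == year]
    names = [columns["name"][i] for i in hits]
    fields_read += len(hits)             # fetch only the matching names
    return names, fields_read

# Pivot the same data into columns for the second variant.
COLUMNS = {f: [row[i] for row in ROWS] for i, f in enumerate(FIELDS)}

print(query_row_store(ROWS, 2014))       # (['b', 'c'], 12)
print(query_column_store(COLUMNS, 2014)) # (['b', 'c'], 5)
```

Both variants return the same names, but the row-oriented variant touches all 12 stored values while the column-oriented variant touches 5.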

Column-oriented storage can overcome the low query efficiency of row-oriented storage, but different columns of the data are stored on different storage nodes. Obtaining a complete record therefore requires costly tuple reconstruction and causes excessive network overhead, so that even a distributed file system cannot meet the demands of big data queries. For example, when the different columns of the same data unit are stored on different storage nodes, multiple nodes must be read repeatedly to obtain that data unit and reconstruct the related data, which increases network overhead. When massive data are read, this network overhead seriously affects query speed.

Summary of the invention

One object of the present disclosure is to provide a data writing method and system that solve or alleviate one or more of the above problems of the prior art.

Another object of the present disclosure is to provide a data reading method and system that solve or alleviate one or more of the above problems of the prior art.

According to a first aspect of the present disclosure, a data writing method is provided, comprising: reading data of a predetermined number of rows to generate corresponding file blocks; creating a header file for each file block, wherein the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block and its corresponding header file form a data block; and writing the data block to a storage node.

According to an embodiment of the present disclosure, the data generated in the corresponding file block can be read by row or by column.

According to an embodiment of the present disclosure, writing the data block to the storage node comprises: writing the data block to the storage node by column or by row.

According to an embodiment of the present disclosure, writing the data block to the storage node comprises: compressing the data in the data block before writing it to the storage node.

According to an embodiment of the present disclosure, writing the data block to the storage node comprises: after the size of the data block reaches a predetermined threshold, writing the data blocks to multiple storage nodes in turn.

According to a second aspect of the present disclosure, a data writing system is provided, comprising: a file block generating device configured to read data of a predetermined number of rows to generate corresponding file blocks; a header file generating device configured to create a header file for each file block, wherein the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block and its corresponding header file form a data block; and a writing device configured to write the data block to a storage node.

According to an embodiment of the present disclosure, the data in the file block are stored in a manner that allows them to be read by row or by column.

According to an embodiment of the present disclosure, the writing device is configured to write the data block to the storage node by column or by row.

According to an embodiment of the present disclosure, the writing device comprises a compression unit configured to compress the data in the data block before the data block is written to the storage node.

According to an embodiment of the present disclosure, the writing device is further configured to write the data blocks to multiple storage nodes in turn after the size of the data block reaches a predetermined threshold.

According to the data writing method and system of embodiments of the present disclosure, data are read row by row to generate the corresponding file blocks, which ensures that the data of one row of the same data record unit are located on the same node. Data analysis therefore benefits from fast data loading and strong adaptivity to dynamic workloads, and the cost of tuple reconstruction is very low. Because the file blocks (or data blocks) are written to the storage nodes by column, the column data of the raw data can be compressed according to a predetermined algorithm, which effectively reduces storage space.

According to a third aspect of the present disclosure, a data reading method is provided, comprising: parsing a Structured Query Language query statement to generate a query task; reading a header file and obtaining from the header file the position data of the files relevant to the query task; and extracting data from the file block based on the position data; wherein the header file is created for each file block and comprises the index of the file block itself in a database and an index of the files stored in the file block, each file block and its corresponding header file form a data block, and the data block is stored in a storage node.

According to an embodiment of the present disclosure, extracting data from the file block based on the position data comprises: reading the data stored in the file block by row or by column according to the query task.

According to a fourth aspect of the present disclosure, a data reading system is provided, comprising: a task generating device configured to parse a Structured Query Language query statement to generate a query task; a position acquisition device configured to read a header file and obtain from the header file the position data of the files relevant to the query task; and a data extraction device configured to extract data from the file block based on the position data; wherein the header file is created for each file block and comprises the index of the file block itself in a database and an index of the files stored in the file block, each file block and its corresponding header file form a data block, and the data block is stored in a storage node.

According to an embodiment of the present disclosure, the data extraction device is configured to read the data stored in the file block by row or by column according to the query task.

According to the data reading method and system of embodiments of the present disclosure, because the data blocks are stored in the storage nodes and each file block is provided with a header file, data can be queried and analyzed quickly. In addition, because all fields of the same data record unit are on the same node, this storage structure ensures that the data of one row of the same data record unit are located on the same node; the cost of tuple reconstruction is therefore very low, which safeguards query efficiency. Furthermore, during a query only the required rows and columns are read according to the index file, which reduces network overhead and improves query efficiency.

The above features and other advantages of the present disclosure will become clear from the following description of embodiments.

Brief description of the drawings

Embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:

Fig. 1(a) and Fig. 1(b) respectively show schematic diagrams of the storage structures of row-oriented and column-oriented storage in the prior art;

Fig. 2 is a flowchart of the data writing method according to an exemplary embodiment of the present disclosure;

Fig. 3 shows the data storage structure before and after the data writing method according to an exemplary embodiment of the present disclosure is performed;

Fig. 4 is a schematic diagram of the data storage structure according to an exemplary embodiment of the present disclosure;

Fig. 5 is a schematic diagram of the data writing system according to an exemplary embodiment of the present disclosure;

Fig. 6 is a flowchart of the data reading method according to an exemplary embodiment of the present disclosure; and

Fig. 7 is a schematic diagram of the data reading system according to an exemplary embodiment of the present disclosure.

Detailed description of embodiments

Embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the same reference numerals may be used in the drawings for similar components or functional modules. The accompanying drawings are intended only to illustrate embodiments of the present disclosure. Those skilled in the art can derive alternative embodiments from the following description without departing from the spirit and scope of protection of the disclosure.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

As shown in Fig. 2, according to an embodiment of the present disclosure, a data writing method is provided. The method comprises the following steps. In step S101, data of a predetermined number of rows are read to generate corresponding file blocks. In step S102, a header file is created for each file block, wherein the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block and its corresponding header file form a data block. In step S103, the data block is written to a storage node.
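Steps S101 to S103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `write_data`, the dictionary layout of the header and the use of in-memory lists as stand-ins for storage nodes are all hypothetical.

```python
def write_data(rows, fields, rows_per_block, nodes):
    """Split rows into file blocks, attach a header, place blocks on nodes.

    rows:           list of tuples, one tuple per record
    fields:         field names, in tuple order
    rows_per_block: the 'predetermined number of rows' per file block
    nodes:          list of lists, each inner list standing in for a storage node
    """
    if rows_per_block <= 0:
        raise ValueError("rows_per_block must be positive")
    for block_no, start in enumerate(range(0, len(rows), rows_per_block)):
        # S101: read a predetermined number of rows and build the file block.
        chunk = rows[start:start + rows_per_block]
        file_block = {f: [r[i] for r in chunk] for i, f in enumerate(fields)}
        # S102: create the header (block index plus an index of the contents).
        header = {
            "block_index": block_no,
            "fields": list(fields),
            "row_count": len(chunk),
        }
        data_block = {"header": header, "file_block": file_block}
        # S103: write the data block to a storage node (here: in turn).
        nodes[block_no % len(nodes)].append(data_block)
    return nodes

nodes = write_data(
    rows=[(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")],
    fields=("A", "B"),
    rows_per_block=2,
    nodes=[[], []],
)
print([b["header"]["block_index"] for b in nodes[0]])  # [0, 2]
print([b["header"]["block_index"] for b in nodes[1]])  # [1]
```

All fields of a given record end up in the same data block, and therefore on the same node, which is the property the later paragraphs rely on for cheap tuple reconstruction.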

The scheme of embodiments of the present disclosure combines the advantages of column-oriented and row-oriented storage: data can be compressed efficiently, which reduces storage cost and saves storage space, and data query and analysis become faster. In one embodiment, the data may be structured data, semi-structured data (such as logs) or unstructured data.

In embodiments of the present disclosure, the data are read row by row, and the corresponding file blocks are generated from the data read. In one embodiment, a predetermined number of rows, for example 20, 50 or 100 rows, is read, and the corresponding file block is generated from these rows. In another embodiment, the size of the file block is preset, and as many rows are read as are needed to produce a file block of the predetermined size. This approach has the following advantage: because the data are read by row, the data of the same data record unit are usually generated in one file block. In one embodiment that uses the Hadoop Distributed File System (HDFS), all fields are stored in the same HDFS block. This gives embodiments of the present disclosure fast data loading and strong adaptivity to dynamic workloads. In addition, because all fields of the same data record unit are on the same node, this storage structure ensures that the data of one row of the same data record unit are located on the same node, so the cost of tuple reconstruction is very low.
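Reading a predetermined number of rows at a time can be sketched with a small generator. The name `generate_file_blocks` and the representation of a file block as a plain list of rows are assumptions made for illustration only.

```python
from itertools import islice

def generate_file_blocks(rows, rows_per_block):
    """Yield lists of at most rows_per_block rows, in input order.

    Works on any iterable, so the input does not have to fit in memory.
    """
    if rows_per_block <= 0:
        raise ValueError("rows_per_block must be positive")
    iterator = iter(rows)
    while True:
        block = list(islice(iterator, rows_per_block))
        if not block:
            return
        yield block

rows = [(i, f"name{i}") for i in range(5)]
blocks = list(generate_file_blocks(rows, 2))
print([len(b) for b in blocks])  # [2, 2, 1]
```

The variant with a preset block size would accumulate rows until an estimated byte size is reached instead of counting rows; the row-count variant is shown because it is the simpler of the two.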

In embodiments of the present disclosure, a header file is created for each file block, wherein the header file comprises the index of the file block itself in the database and an index of the files stored in the file block. Because a header file is created, the files (or data) stored in a file block can be effectively located or queried through the header file. Moreover, because the header file also comprises the index of each file block in the database or storage node, the position of a file block can be located quickly when data are queried across multiple storage nodes, so that the relevant data can be found quickly and conveniently.

According to embodiments of the present disclosure, data are read row by row to generate the corresponding file blocks, which ensures that the data of one row of the same data record unit are located on the same node. Data analysis therefore benefits from fast data loading and strong adaptivity to dynamic workloads, and the cost of tuple reconstruction is very low. Because the file blocks (or data blocks) are written to the storage nodes by column, the column data of the raw data can be compressed according to a predetermined algorithm, which effectively reduces storage space. The compression algorithm may be any of the commonly used data compression algorithms of the prior art.

Fig. 3 shows the data storage structure before and after the data writing method according to an exemplary embodiment of the present disclosure is performed. The left side of Fig. 3 shows the original data block, and the right side shows the data storage structure after the writing algorithm according to an embodiment of the present disclosure has been performed. For the purpose of illustration, only the data of four fields A, B, C and D and 5 rows are shown. As shown in the figure, data of a predetermined number of rows (5 rows in the figure) are read to generate the corresponding file block, and a header file is created for each file block. The header file comprises the index of the file block itself in the database (shown by way of example as a 16-byte Sync; the index may also use another format defined by preset rules) and an index of the files stored in the file block (shown by way of example as a tuple data header, through which the position of data in the file block can be located effectively). Each file block and its corresponding header file form a data block (Row group1, Row group2 in the figure; the rightmost part is the structure of a data block), and the data blocks are written to the storage node (Row group1, Row group2, ... are stored in the storage node by row). The concept of the invention can be understood more clearly from the structure of Fig. 3.
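A data block of the kind shown in Fig. 3 can be approximated as a byte string. The following is only a sketch under stated assumptions: the 16-byte Sync comes from the description, but the all-zero Sync value, the 4-byte header length, the JSON encoding of the header and of each column, and the function names are hypothetical choices made for this example.

```python
import json
import struct

SYNC = b"\x00" * 16  # placeholder value; only the 16-byte length is from the text

def build_data_block(rows, fields, sync=SYNC):
    """Layout: sync (16 B) | header length (4 B) | header | column payloads."""
    payloads = []
    column_index = {}
    offset = 0
    for i, field in enumerate(fields):
        # Store the values of one field contiguously.
        payload = json.dumps([row[i] for row in rows]).encode("utf-8")
        column_index[field] = [offset, len(payload)]  # position inside the body
        payloads.append(payload)
        offset += len(payload)
    header = json.dumps({"rows": len(rows), "columns": column_index}).encode("utf-8")
    return sync + struct.pack(">I", len(header)) + header + b"".join(payloads)

def read_column(block, field):
    """Use the header to locate one field without decoding the others."""
    (header_len,) = struct.unpack(">I", block[16:20])
    header = json.loads(block[20:20 + header_len])
    start, length = header["columns"][field]
    body_start = 20 + header_len
    return json.loads(block[body_start + start:body_start + start + length])

block = build_data_block([(1, "x"), (2, "y"), (3, "z")], ["A", "B"])
print(read_column(block, "B"))  # ['x', 'y', 'z']
```

The header plays the locating role described above: `read_column` slices directly to the requested field by offset and length.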

Fig. 4 shows a schematic diagram of the data storage structure according to an exemplary embodiment of the present disclosure and illustrates the inventive concept of the present disclosure. As shown in Fig. 4, in a distributed storage system the data are stored in a block-based columnar mode, which combines the advantages of column-oriented and row-oriented storage.

According to an embodiment of the present disclosure, the data generated in the corresponding file block can be read by row or by column. In the embodiment shown in Fig. 3, the predetermined rows read from the data are stored in the file block in column form. In another embodiment, the predetermined rows read from the data are stored in the file block in row form. In other words, because the data in a file block can be located effectively through the header file, the object of the present disclosure can be achieved whether the data are stored by row or by column.

In one embodiment of the present disclosure, the generated file blocks (or data blocks) are written to the storage nodes in turn by row. A predetermined number of file blocks (or data blocks) can be stored on one storage node.

In another embodiment of the present disclosure, the generated file blocks (or data blocks) are written to the storage nodes in turn by column. In one embodiment, one storage node stores 100, 500, 1000 or more file blocks. It should be understood that the number of file blocks can be set according to the size of the storage space of the storage node. Because the file blocks are written to the storage nodes by column, compression along the column dimension can be used. Column data in the raw data usually share the same data attributes, so storing the file blocks by column preserves this advantage and makes the data easy to compress, which saves considerable storage space. In addition, because the file blocks are stored by column, unnecessary columns can be skipped during a query, which improves read efficiency.
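Why column-wise layout compresses better can be seen with run-length encoding. Run-length encoding is used here purely as a stand-in for the unspecified "predetermined algorithm"; the sample values and the function name are assumptions made for illustration.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

rows = [
    ("beijing", 2014),
    ("beijing", 2014),
    ("beijing", 2015),
    ("shanghai", 2015),
]

# Row-wise layout interleaves city and year values.
row_major = [value for row in rows for value in row]
# Column-wise layout keeps all cities together, then all years.
column_major = [value for column in zip(*rows) for value in column]

print(len(run_length_encode(row_major)))  # 8
print(run_length_encode(column_major))
# [('beijing', 3), ('shanghai', 1), (2014, 2), (2015, 2)]
```

In the row-wise layout no two neighbouring values are equal, so nothing is saved; in the column-wise layout the same eight values collapse into four runs.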

According to an embodiment of the present disclosure, writing the data block to the storage node comprises: compressing the data in the data block before writing it to the storage node. Storing the data after compression saves space.
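Compressing before writing can be sketched with Python's standard `zlib` module. The choice of zlib, the JSON encoding of each column and the function names are assumptions; the text only states that commonly used compression algorithms may be applied.

```python
import json
import zlib

def compress_columns(file_block):
    """Compress each column of a file block separately before it is written."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in file_block.items()
    }

def decompress_column(blob):
    """Inverse of the per-column compression above."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

file_block = {"year": [2014] * 1000, "name": ["a", "b", "c"]}
compressed = compress_columns(file_block)

raw_size = len(json.dumps(file_block["year"]).encode("utf-8"))
print(len(compressed["year"]) < raw_size)              # True
print(decompress_column(compressed["name"]))           # ['a', 'b', 'c']
```

Compressing each column on its own keeps columns independently readable, so a query that needs one field does not have to decompress the others.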

According to an embodiment of the present disclosure, writing the data block to the storage node comprises: after the size of the data block reaches a predetermined threshold, writing the data blocks to multiple storage nodes in turn. The threshold can be set freely based on the size of the database and the size of the storage nodes. For example, it may be set to tens of megabytes or to hundreds of megabytes.
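Threshold-triggered writing to several nodes in turn can be sketched as a small buffer with round-robin placement. The class name `RoundRobinWriter`, the in-memory `placements` list standing in for real storage nodes and the tiny threshold in the usage lines are assumptions for illustration.

```python
from itertools import cycle

class RoundRobinWriter:
    """Buffer bytes and flush them to the next node once a threshold is reached."""

    def __init__(self, node_names, threshold_bytes):
        if not node_names:
            raise ValueError("at least one node is required")
        self._nodes = cycle(node_names)   # nodes are used in turn
        self.threshold = threshold_bytes
        self.buffer = bytearray()
        self.placements = []              # (node name, data block bytes)

    def append(self, data):
        self.buffer.extend(data)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        """Write whatever is buffered to the next node; no-op when empty."""
        if self.buffer:
            self.placements.append((next(self._nodes), bytes(self.buffer)))
            self.buffer.clear()

writer = RoundRobinWriter(["node1", "node2"], threshold_bytes=4)
for piece in (b"ab", b"cd", b"efgh", b"ijkl"):
    writer.append(piece)
print([node for node, _ in writer.placements])  # ['node1', 'node2', 'node1']
```

A real deployment would use a threshold in the megabyte range mentioned above; 4 bytes is used only to keep the example short.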

According to a second aspect of the present disclosure, a data writing system corresponding to the above writing method is also provided. Fig. 5 is a schematic diagram of the data writing system according to an exemplary embodiment of the present disclosure. As shown in Fig. 5, the data writing system 100 comprises: a file block generating device 12 configured to read data of a predetermined number of rows to generate corresponding file blocks; a header file generating device 16 configured to create a header file for each file block, wherein the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block and its corresponding header file form a data block; and a writing device 18 configured to write the data block to a storage node.

The data writing system according to the present disclosure achieves the same advantages as the above data writing method. To avoid repetition, a detailed description is omitted.

In addition, the following variants are also provided. According to an embodiment of the present disclosure, the data in the file block are stored in a manner that allows them to be read by row or by column. According to an embodiment of the present disclosure, the writing device is configured to write the data block to the storage node by column or by row. According to an embodiment of the present disclosure, the writing device comprises a compression unit configured to compress the data in the data block before the data block is written to the storage node. According to an embodiment of the present disclosure, the writing device is further configured to write the data blocks to multiple storage nodes in turn after the size of the data block reaches a predetermined threshold.

According to a third aspect of the present disclosure, a data reading method is also provided. Fig. 6 is a flowchart of the data reading method according to an exemplary embodiment of the present disclosure. The method comprises the following steps. In step S201, a Structured Query Language query statement is parsed to generate a query task. In step S202, a header file is read, and the position data of the files relevant to the query task are obtained from the header file. In step S203, data are extracted from the file block based on the position data. The header file is created for each file block and comprises the index of the file block itself in a database and an index of the files stored in the file block; each file block and its corresponding header file form a data block, and the data block is stored in a storage node.
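Steps S201 to S203 can be sketched for the single query shape used in the background section. This is an illustration under stated assumptions, not the disclosed implementation: the regular expression handles only `SELECT <field> FROM <table> WHERE <field>=<value>`, and the dictionary layout of the data blocks and the function names are hypothetical.

```python
import re

# Matches only the simple query shape used in the description's example.
_QUERY = re.compile(
    r"SELECT\s+(\w+)\s+FROM\s+(\w+)\s+WHERE\s+(\w+)\s*=\s*(\w+)",
    re.IGNORECASE,
)

def parse_query(sql):
    """S201: turn the query statement into a query task."""
    match = _QUERY.fullmatch(sql.strip())
    if match is None:
        raise ValueError(f"unsupported query: {sql!r}")
    select, table, field, value = match.groups()
    return {"select": select, "table": table, "field": field, "value": value}

def run_query(task, data_blocks):
    """S202 and S203: look up positions in each header, then extract data."""
    results = []
    for block in data_blocks:
        positions = block["header"]["columns"]       # S202: read the header
        selected = block["file_block"][positions[task["select"]]]
        filtered = block["file_block"][positions[task["field"]]]
        # S203: extract only the two columns the task needs.
        results.extend(
            s for s, f in zip(selected, filtered) if str(f) == task["value"]
        )
    return results

blocks = [
    {"header": {"columns": {"ID": 0, "name": 1, "year": 2}},
     "file_block": [[1, 2], ["a", "b"], [2013, 2014]]},
    {"header": {"columns": {"ID": 0, "name": 1, "year": 2}},
     "file_block": [[3], ["c"], [2014]]},
]
task = parse_query("SELECT name FROM order WHERE year=2014")
print(run_query(task, blocks))  # ['b', 'c']
```

The "ID" column is never touched, which mirrors the point that only the required columns are read.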

According to the data reading method of embodiments of the present disclosure, because the data blocks are stored in the storage nodes and each file block is provided with a header file, data can be queried and analyzed quickly. In addition, because all fields of the same data record unit are on the same node, this storage structure ensures that the data of one row of the same data record unit are located on the same node; the cost of tuple reconstruction is therefore very low, which safeguards query efficiency. Furthermore, during a query only the required rows and columns are read according to the index file, which reduces network overhead and improves query efficiency.

According to a fourth aspect of the present disclosure, a data reading system is also provided. Fig. 7 is a schematic diagram of the data reading system according to an exemplary embodiment of the present disclosure. As shown in Fig. 7, the data reading system 200 comprises: a task generating device 202 configured to parse a Structured Query Language query statement to generate a query task; a position acquisition device 204 configured to read a header file and obtain from the header file the position data of the files relevant to the query task; and a data extraction device 206 configured to extract data from the file block based on the position data. The header file is created for each file block and comprises the index of the file block itself in a database and an index of the files stored in the file block; each file block and its corresponding header file form a data block, and the data block is stored in a storage node.

The data reading system according to the present disclosure achieves the same advantages as the above data reading method. To avoid repetition, a detailed description is omitted.

Given the teaching presented in the above description and the associated drawings, those skilled in the art to which the present disclosure pertains will recognize many modifications and other embodiments of the disclosure set out here. It is therefore to be understood that embodiments of the present disclosure are not limited to the specific embodiments disclosed, and that modifications and other embodiments are intended to be included within the scope of the present disclosure. Moreover, although the above description and the associated drawings describe example embodiments in the context of certain example combinations of components and/or functions, it should be appreciated that alternative embodiments may provide different combinations of components and/or functions without departing from the scope of the present disclosure. In this regard, for example, combinations of components and/or functions other than those explicitly described above are also contemplated as being within the scope of the present disclosure. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A data writing method, comprising:

reading data of a predetermined number of rows to generate corresponding file blocks;

creating a header file for each file block, wherein the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block and its corresponding header file form a data block; and

writing the data block to a storage node.

2. The writing method according to claim 1, wherein the data generated in the corresponding file block can be read by row or by column.

3. The writing method according to claim 1, wherein writing the data block to the storage node comprises:

writing the data block to the storage node by column or by row.

4. The writing method according to any one of claims 1-3, wherein writing the data block to the storage node comprises:

compressing the data in the data block before writing it to the storage node.

5. The writing method according to any one of claims 1-3, wherein writing the data block to the storage node comprises:

after the size of the data block reaches a predetermined threshold, writing the data blocks to multiple storage nodes in turn.

6. A data writing system, comprising:

a file block generating device configured to read data of a predetermined number of rows to generate corresponding file blocks;

a header file generating device configured to create a header file for each file block, wherein the header file comprises the index of the file block itself in a database and an index of the files stored in the file block, and each file block and its corresponding header file form a data block; and

a writing device configured to write the data block to a storage node.

7. The writing system according to claim 6, wherein the data generated in the corresponding file block are stored in a manner that allows them to be read by row or by column.

8. The writing system according to claim 6, wherein the writing device is configured to write the data block to the storage node by column or by row.

9. The writing system according to any one of claims 6-8, wherein the writing device comprises a compression unit configured to compress the data in the data block before the data block is written to the storage node.

10. The writing system according to any one of claims 6-8, wherein the writing device is further configured to write the data blocks to multiple storage nodes in turn after the size of the data block reaches a predetermined threshold.

11. A data reading method, comprising:

parsing a Structured Query Language query statement to generate a query task;

reading a header file, and obtaining from the header file the position data of the files relevant to the query task; and

extracting data from the file block based on the position data;

wherein the header file is created for each file block and comprises the index of the file block itself in a database and an index of the files stored in the file block, each file block and its corresponding header file form a data block, and the data block is stored in a storage node.

12. The data reading method according to claim 11, wherein extracting data from the file block based on the position data comprises:

reading the data stored in the file block by row or by column according to the query task.

13. A data reading system, comprising:

a task generating device configured to parse a Structured Query Language query statement to generate a query task;

a position acquisition device configured to read a header file and obtain from the header file the position data of the files relevant to the query task; and

a data extraction device configured to extract data from the file block based on the position data;

14. The data reading system according to claim 13, wherein the data extraction device is configured to read the data stored in the file block by row or by column according to the query task.