CN105183391B

CN105183391B - Method and apparatus for data storage on a distributed data platform

Info

Publication number: CN105183391B
Application number: CN201510598396.9A
Authority: CN
Inventors: 周龙波; 王晓; 王彦明
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2015-09-18
Filing date: 2015-09-18
Publication date: 2018-12-28
Anticipated expiration: 2035-09-18
Also published as: CN105183391A

Abstract

The present invention provides a method and apparatus for data storage on a distributed data platform, which can improve the efficiency of data storage and data retrieval while effectively recording data changes. The method comprises: classifying changed data by comparing the current day's data with the data in a data state change table; collecting the classified data into different directories and storing it under the corresponding partitions according to the data storage rules of those directories; and updating the data state change table.

Description

Method and apparatus for data storage on a distributed data platform

Technical field

The present invention relates to the field of computer technology, and in particular to a method and apparatus for data storage on a distributed data platform.

Background technique

Big data is the term people use to describe the current era of information explosion. It is marked not only by a leap in data volume but also by a growing variety of stored data types: beyond traditional relational data and key-value data, there are now flat files, pictures, audio, video and more. Analyzing such numerous and complex data places higher demands on the computing performance and storage performance of a data platform.

Using a distributed Hadoop system to store and analyze big data is common practice in the industry. Because a distributed Hadoop system stores data as files, it increases storage capacity and throughput but sacrifices the update mechanism of a relational database: it supports only inserting, deleting and overwriting text files. As a result, data history can currently be accumulated only by taking data snapshots. A snapshot of the data stored in the database is saved every day to record the complete data state, and these snapshots accumulate over time into historical data storage. To restore data or retrieve the historical track of its state changes, the historical data must be fully scanned and compared across all time points in order to find the differences and restore the data state at a given point in time.

However, the existing technical solutions have the following disadvantages:

1. Storage schemes designed for relational databases cannot cope with large data volumes, while the snapshot accumulation used by existing distributed file systems consumes massive storage space and makes subsequent computation inefficient;

2. Data retrieval generally requires a full scan, which occupies a large amount of system resources;

3. They lack flexibility for complex and changeable online data scenarios.

In many application scenarios, a piece of data goes through many state changes between its creation and its retirement. Accordingly, the data platform produces multiple snapshots when recording those state changes, and data storage expands rapidly. During data analysis, the historical track of the data often has to be traced, which requires scanning a large amount of historical data to reconstruct each state and is inefficient. How to design a mechanism that lets the data platform record data state changes while remaining easy to analyze and restore is therefore an important problem that urgently needs to be solved.

Summary of the invention

In view of this, the present invention provides a method and apparatus for data storage on a distributed data platform, which can improve the efficiency of data storage and data retrieval while effectively recording data changes.

To achieve the above object, according to one aspect of the present invention, a method for data storage on a distributed data platform is provided.

A method for data storage on a distributed data platform comprises: classifying changed data by comparing the current day's data with the data in a data state change table; collecting the classified data into different directories and storing it under the corresponding partitions according to the data storage rules of those directories; and updating the data state change table.

Optionally, the classification is performed according to the stage of the data life cycle and includes three types: an online class, an expired class and an archive class.

Optionally, the step of classifying the changed data comprises: comparing the current day's data with the data in the data state change table by looking up the key name of the data; if the data is absent from the data state change table but present in the current day's data, the data is of the online class; if the data is present in both the data state change table and the current day's data but its key value differs, the data in the data state change table is of the expired class and the current day's data is of the online class; and if the data is present in the data state change table but absent from the current day's data, the data is of the archive class.

Optionally, the data storage rule comprises three directory levels: partition name, data time and data life end time.

Optionally, the partition names include an online partition, an expired partition and an archive partition.

Optionally, the step of storing under the corresponding partition according to the data storage rules of the directory comprises: for the online class data, the first-level directory (partition name) is the online partition, the second-level directory (data time) is the maximum time, and the third-level directory (data life end time) is the maximum time; for the expired class data, the first-level directory is the expired partition, the second-level directory is the change time, and the third-level directory is the change time; and for the archive class data, the first-level directory is the archive partition, the second-level directory is the change time, and the third-level directory is the maximum time.

Optionally, the step of updating the data state change table comprises: inserting the key name, key value, state change start time and state change end time of the online class data, wherein the state change start time is the change time and the state change end time is the maximum time; and setting the state change end time of the expired class data to the change time.

According to another aspect of the present invention, an apparatus for data storage on a distributed data platform is provided.

An apparatus for data storage on a distributed data platform comprises: a data classification module, configured to classify changed data by comparing the current day's data with the data in a data state change table; a data storage module, configured to collect the classified data into different directories and store it under the corresponding partitions according to the data storage rules of those directories; and a state update module, configured to update the data state change table.

Optionally, the classification is performed according to the stage of the data life cycle and includes three types: an online class, an expired class and an archive class.

Optionally, the data classification module is further configured to: compare the current day's data with the data in the data state change table by looking up the key name of the data; if the data is absent from the data state change table but present in the current day's data, determine that the data is of the online class; if the data is present in both the data state change table and the current day's data but its key value differs, determine that the data in the data state change table is of the expired class and the current day's data is of the online class; and if the data is present in the data state change table but absent from the current day's data, determine that the data is of the archive class.

Optionally, the data storage module is further configured such that: for the online class data, the first-level directory (partition name) is the online partition, the second-level directory (data time) is the maximum time, and the third-level directory (data life end time) is the maximum time; for the expired class data, the first-level directory is the expired partition, the second-level directory is the change time, and the third-level directory is the change time; and for the archive class data, the first-level directory is the archive partition, the second-level directory is the change time, and the third-level directory is the maximum time.

Optionally, the state update module is further configured to: insert the key name, key value, state change start time and state change end time of the online class data, wherein the state change start time is the change time and the state change end time is the maximum time; and set the state change end time of the expired class data to the change time.

According to the technical solution of the present invention, data needs to be classified, stored and state-updated only when its state changes; unchanged data requires no second storage or state update. Data changes are therefore recorded effectively while the efficiency of data storage and data retrieval improves, data storage space is saved, and cleaning up expired data becomes simple and convenient.

Brief description of the drawings

The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:

Fig. 1 is a schematic diagram of the main steps of a method for data storage on a distributed data platform according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of partitioned data storage according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of data cleanup according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a data state change table according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the main modules of an apparatus for data storage on a distributed data platform according to an embodiment of the present invention;

Fig. 6 is a schematic diagram comparing the storage effect of an embodiment of the present invention with that of the prior art.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

In the method for data storage on a distributed data platform of the present invention, a data item needs to be classified, stored and state-updated only when its state changes; unchanged data items require no second storage or state update, so the efficiency of data storage and data retrieval can be improved while data changes are effectively recorded.

Fig. 1 is a schematic diagram of the main steps of a method for data storage on a distributed data platform according to an embodiment of the present invention. As shown in Fig. 1, the method mainly comprises the following steps S11 to S13.

Step S11: classify the changed data by comparing the current day's data with the data in the data state change table. To suit the characteristics of the Hadoop file system, data must be stored in a uniform and orderly way to improve efficiency. According to the stage of the data life cycle, data can be divided into three classes: the online class (ACTIVE), the expired class (EXPIRED) and the archive class (HISTORY). Online class data is data whose meaning is currently valid and which may still change; expired class data is data whose meaning is no longer valid; archive class data is data that has been archived, no longer changes, and whose meaning remains valid.

When classifying data, the current day's data is compared with the data in the data state change table, according to predefined data processing rules and by looking up the key name of the data, in order to determine which data has changed. If the data is absent from the data state change table but present in the current day's data, the data is of the online class. If the data is present in both the data state change table and the current day's data but its key value differs, the data in the data state change table is of the expired class and the current day's data is of the online class. If the data is present in the data state change table but absent from the current day's data, the data is of the archive class.
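By way of illustration only, the three classification rules above can be sketched in Python as follows. The function name `classify_changes` and the use of plain dictionaries (key name mapped to key value) are assumptions made for this sketch; the state change table is simplified to the currently recorded value of each key.

```python
from typing import Dict, List, Tuple

ACTIVE, EXPIRED, HISTORY = "ACTIVE", "EXPIRED", "HISTORY"


def classify_changes(
    state_table: Dict[str, str], today: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (key name, key value, class) for every changed item.

    state_table: key name -> key value currently recorded (hypothetical shape)
    today:       key name -> key value in the current day's data
    """
    changes: List[Tuple[str, str, str]] = []
    for key, value in today.items():
        if key not in state_table:
            # Present today, absent from the table: online class.
            changes.append((key, value, ACTIVE))
        elif state_table[key] != value:
            # Present in both with a different value: the recorded
            # version expires and today's version becomes online.
            changes.append((key, state_table[key], EXPIRED))
            changes.append((key, value, ACTIVE))
        # Same key name and key value: unchanged, nothing is emitted.
    for key, value in state_table.items():
        if key not in today:
            # Present in the table, absent today: archive class.
            changes.append((key, value, HISTORY))
    return changes
```

Unchanged items produce no output at all, which is the property the method relies on to avoid storing them a second time.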

Step S12: collect the classified data into different directories and store it under the corresponding partitions according to the data storage rules of those directories. The data storage rule comprises three directory levels: partition name, data time and data life end time. In line with the data classification described in step S11, the partition names include an online partition, an expired partition and an archive partition.

Fig. 2 is a schematic diagram of partitioned data storage according to an embodiment of the present invention. For a large enterprise in stable operation, the data volume in the expired partition and the archive partition grows steadily over time, while the data volume of the online partition stays relatively stable even as new data is added. As Fig. 2 shows, data is stored along a time axis and distributed as evenly as possible among the partitions corresponding to the three top-level directories.

To facilitate classified storage and lookup, data can be stored under the corresponding partition according to the data storage rules of the directory. The three classes of data described above correspond to the following three cases:

For the online class data, the first-level directory (partition name) is the online partition, the second-level directory (data time) is the maximum time, and the third-level directory (data life end time) is the maximum time;

For the expired class data, the first-level directory (partition name) is the expired partition, the second-level directory (data time) is the change time, and the third-level directory (data life end time) is the change time; and

For the archive class data, the first-level directory (partition name) is the archive partition, the second-level directory (data time) is the change time, and the third-level directory (data life end time) is the maximum time.

Specific data storage directory hierarchies are illustrated below. For example:

For online class data, the data storage directory hierarchy is dp=ACTIVE/dt=4712-12-31/end_date=4712-12-31;

For expired class data, the data storage directory hierarchy is dp=EXPIRED/dt=2013-10-11/end_date=2013-10-11;

For archive class data, the data storage directory hierarchy is dp=HISTORY/dt=2014-06-22/end_date=4712-12-31. Here, dp denotes the data partition, dt denotes the data time, and end_date denotes the data life end time. Taking archive class data as an example, when data to be archived is stored, it is first determined that the data belongs in the partition "dp=HISTORY"; next, according to the change time of the data, it is placed under the directory "dt=2014-06-22" of that partition; finally, according to the data life end time "end_date=4712-12-31", the data is saved into the corresponding data table. Because archiving means that the data is stored and no longer changes, and the value and meaning of its attributes persist indefinitely, its end_date is the maximum time 4712-12-31. In practical applications, the data storage directory hierarchy can be configured as the case requires.
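The three directory rules can be expressed as a small path-building function. The following is a minimal Python sketch; the function name `partition_path` and the constant `MAX_DATE` are assumptions introduced here, while the `dp`, `dt` and `end_date` level names and the maximum time 4712-12-31 are taken from the examples above.

```python
from datetime import date

# "Maximum time" as used in the directory examples of this description.
MAX_DATE = date(4712, 12, 31)


def partition_path(category: str, change_date: date) -> str:
    """Build the three-level directory for one class of data.

    category:    "ACTIVE", "EXPIRED" or "HISTORY"
    change_date: the day on which the state change was detected
    """
    if category == "ACTIVE":
        dt, end_date = MAX_DATE, MAX_DATE
    elif category == "EXPIRED":
        dt, end_date = change_date, change_date
    elif category == "HISTORY":
        dt, end_date = change_date, MAX_DATE
    else:
        raise ValueError(f"unknown category: {category!r}")
    return f"dp={category}/dt={dt.isoformat()}/end_date={end_date.isoformat()}"
```

For instance, `partition_path("HISTORY", date(2014, 6, 22))` yields the archive directory given in the example above.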

Fig. 3 is a schematic diagram of data cleanup according to an embodiment of the present invention. With the partitioned storage shown in Fig. 2, historical data can be cleaned up very conveniently. As shown in Fig. 3, the attributes or measures of expired class data have changed and its current meaning is no longer valid, so cleaning it up only requires deleting the corresponding expired partition, which is simple to operate.
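As a hedged illustration of such a cleanup, the sketch below removes expired partitions from a directory tree. Two assumptions are made that the description does not state: a local file system stands in for the distributed file system, and a retention cutoff (`before`) selects which expired `dt` directories to delete. The names `purge_expired` and `demo` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Tuple


def purge_expired(table_root: Path, before: str) -> List[str]:
    """Delete dp=EXPIRED/dt=... directories whose date is earlier than `before`.

    `before` is an ISO date string; ISO dates order correctly as strings.
    Only the expired partition is touched.
    """
    removed: List[str] = []
    expired_root = table_root / "dp=EXPIRED"
    if not expired_root.is_dir():
        return removed
    for dt_dir in sorted(expired_root.iterdir()):
        if not dt_dir.name.startswith("dt="):
            continue
        if dt_dir.name.split("=", 1)[1] < before:
            shutil.rmtree(dt_dir)
            removed.append(dt_dir.name)
    return removed


def demo() -> Tuple[List[str], List[str], bool]:
    """Build a small tree in a temporary directory and purge it."""
    root = Path(tempfile.mkdtemp())
    try:
        for rel in (
            "dp=ACTIVE/dt=4712-12-31/end_date=4712-12-31",
            "dp=EXPIRED/dt=2013-10-11/end_date=2013-10-11",
            "dp=EXPIRED/dt=2014-03-05/end_date=2014-03-05",
        ):
            (root / rel).mkdir(parents=True)
        removed = purge_expired(root, "2014-01-01")
        remaining = sorted(p.name for p in (root / "dp=EXPIRED").iterdir())
        active_intact = (root / "dp=ACTIVE/dt=4712-12-31").is_dir()
        return removed, remaining, active_intact
    finally:
        shutil.rmtree(root)
```

Because expired data lives in its own top-level directory, the cleanup never has to inspect online or archive data.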

Step S13: update the data state change table. When the state of data changes, the data state must be updated. From steps S11 and S12 it follows that updating the data state change table requires inserting the key name, key value, state change start time and state change end time of the online class data, wherein the state change start time is the change time and the state change end time is the maximum time; and setting the state change end time of the expired class data to the change time. Data that has not changed requires no state update.

Fig. 4 is a schematic diagram of the data state change table according to an embodiment of the present invention. The upper-left table holds the data for 2014-01-01 and the upper-right table holds the data for 2014-01-02. The existing technical solution saves a snapshot of each day's data; looking up a piece of data or performing computation then requires a full scan of all snapshots, which both consumes a large amount of storage space and wastes system resources. In the scheme of the present invention, the data for 2014-01-02 (upper right) is compared with the data for 2014-01-01 (upper left); changed data items are added and recorded, and unchanged data items are left as they are. In addition, the structure of the data state change table introduces the audit fields start_date and end_date to mark the start and end times of a data state change, and, to distinguish data more reliably, the audit field start_date is added to the primary key of the data table.

In Fig. 4, comparing the data for 2014-01-02 (upper right) with the data for 2014-01-01 (upper left) yields the data state change table mytable shown below the arrow. In mytable, the primary key consists of the key name key and the start time start_date of the data state change, and each piece of data is distinguished by this primary key. Online data records usually undergo three kinds of operation: Insert indicates the creation of a new record; Delete indicates the end of a record's online validity; Update is equivalent to a combined Delete/Insert operation and indicates a transition of the record state, that is, the end of the previous state and the creation of a new state. For example, comparing the data for 2014-01-02 with the data for 2014-01-01 shows that the data whose key is 1 has changed (Update). In mytable, the end_date of the record whose primary key is key 1 and start_date 2014/1/1 is therefore changed to the change time, and a new record is added whose primary key is key 1 and whose start_date is the change time. Likewise, the data whose key is 4 is simply added to mytable (Insert). By comparing each day's data with the data in the data state change table, the changed data can be found. Because mytable identifies data states in this way, no daily snapshot is needed, which saves storage space, keeps the record continuous in time, and provides a basis for subsequent retrieval and analysis.
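The update of step S13 can be sketched in Python as follows. This is an illustrative sketch only: the row layout (`key`, `value`, `start_date`, `end_date`), the function name `update_state_table` and the sample values are assumptions, and the table is held as an in-memory list rather than a table on the data platform.

```python
from typing import Dict, List

MAX_TIME = "4712-12-31"  # "maximum time" used for still-open states

Row = Dict[str, str]


def update_state_table(table: List[Row], today: Dict[str, str], change_date: str) -> None:
    """Apply one day's data (key name -> key value) to the table in place."""
    # Rows whose end_date is still the maximum time describe current states.
    open_rows = {row["key"]: row for row in table if row["end_date"] == MAX_TIME}
    for key, value in today.items():
        current = open_rows.get(key)
        if current is not None and current["value"] == value:
            continue  # unchanged data: no state update
        if current is not None:
            # Expired class: close the previous state at the change time.
            current["end_date"] = change_date
        # Online class: insert the new state, open until the maximum time.
        table.append(
            {"key": key, "value": value, "start_date": change_date, "end_date": MAX_TIME}
        )
    # Keys missing from today's data (archive class) are left untouched,
    # since the text specifies only the two updates above.


def demo_table() -> List[Row]:
    """Two consecutive days with hypothetical values."""
    table: List[Row] = []
    update_state_table(table, {"1": "A", "2": "B", "3": "C"}, "2014-01-01")
    update_state_table(table, {"1": "A2", "2": "B", "3": "C", "4": "D"}, "2014-01-02")
    return table
```

After the second day the table holds five rows rather than the seven that two full snapshots would contain: keys 2 and 3 are stored once, key 1 has a closed and an open version, and key 4 is new.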

With the data storage method of steps S11 to S13, and with the data storage structure and directory division of the present invention, data can be queried directly by writing SQL statements according to the needs of data retrieval and computation. For example, to look up the state of key "1" on 2014-01-01 in the table mytable of Fig. 4, the following SQL statement can be written:

SELECT * FROM mytable WHERE start_date <= '2014-01-01' AND end_date > '2014-01-01' AND key = '1';

To look up the states of key "1" during the period from 2014-01-01 to 2014-01-02 in mytable, the following SQL statement can be written:

SELECT * FROM mytable WHERE start_date <= '2014-01-02' AND end_date >= '2014-01-01' AND key = '1';

To look up the current, latest state of key "1" in mytable, the following SQL statement can be written:

SELECT * FROM mytable WHERE (dp = 'ACTIVE' OR dp = 'HISTORY') AND key = '1';

Querying the data state directly with SQL statements in this way allows the directories to be pre-filtered, so that not all directories need to be traversed and data retrieval and computation are completed with minimal resource usage.
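The point-in-time and time-range queries can be tried out with Python's built-in `sqlite3` module, as in the sketch below. SQLite merely stands in for the SQL engine of the data platform, the row values are hypothetical, and the `dp` partition column is omitted; the function names are likewise assumptions of this sketch.

```python
import sqlite3
from typing import List

# key, value, start_date, end_date (hypothetical sample rows)
ROWS = [
    ("1", "A", "2014-01-01", "2014-01-02"),
    ("1", "A2", "2014-01-02", "4712-12-31"),
    ("2", "B", "2014-01-01", "4712-12-31"),
    ("4", "D", "2014-01-02", "4712-12-31"),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory mytable with primary key (key, start_date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE mytable ("key" TEXT, value TEXT, start_date TEXT, '
        'end_date TEXT, PRIMARY KEY ("key", start_date))'
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", ROWS)
    return conn


def state_as_of(conn: sqlite3.Connection, key: str, day: str) -> List[str]:
    """State of `key` on a single day: start_date <= day < end_date."""
    cur = conn.execute(
        "SELECT value FROM mytable "
        'WHERE start_date <= ? AND end_date > ? AND "key" = ?',
        (day, day, key),
    )
    return [row[0] for row in cur.fetchall()]


def states_between(conn: sqlite3.Connection, key: str, first: str, last: str) -> List[str]:
    """All states of `key` that overlap the period [first, last]."""
    cur = conn.execute(
        "SELECT value FROM mytable "
        'WHERE start_date <= ? AND end_date >= ? AND "key" = ? '
        "ORDER BY start_date",
        (last, first, key),
    )
    return [row[0] for row in cur.fetchall()]
```

Dates are stored as ISO strings so that plain string comparison orders them correctly; an engine with a native date type would not need this convention.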

Fig. 5 is a schematic diagram of the main modules of an apparatus for data storage on a distributed data platform according to an embodiment of the present invention. As shown in Fig. 5, the apparatus 50 for data storage on a distributed data platform of the present invention mainly comprises a data classification module 51, a data storage module 52 and a state update module 53.

The data classification module 51 is configured to classify changed data by comparing the current day's data with the data in the data state change table; the data storage module 52 is configured to collect the classified data into different directories and store it under the corresponding partitions according to the data storage rules of those directories; and the state update module 53 is configured to update the data state change table.

The data classification module 51 performs the classification according to the stage of the data life cycle, and the classification includes three types: an online class, an expired class and an archive class.

The data classification module 51 may further be configured to: compare the current day's data with the data in the data state change table by looking up the key name of the data; if the data is absent from the data state change table but present in the current day's data, determine that the data is of the online class; if the data is present in both the data state change table and the current day's data but its key value differs, determine that the data in the data state change table is of the expired class and the current day's data is of the online class; and if the data is present in the data state change table but absent from the current day's data, determine that the data is of the archive class.

When the data storage module 52 stores data, the data storage rule it follows comprises three directory levels: partition name, data time and data life end time, and the partition names include an online partition, an expired partition and an archive partition.

The data storage module 52 may further be configured such that: for the online class data, the first-level directory (partition name) is the online partition, the second-level directory (data time) is the maximum time, and the third-level directory (data life end time) is the maximum time; for the expired class data, the first-level directory is the expired partition, the second-level directory is the change time, and the third-level directory is the change time; and for the archive class data, the first-level directory is the archive partition, the second-level directory is the change time, and the third-level directory is the maximum time.

The state update module 53 may further be configured to: insert the key name, key value, state change start time and state change end time of the online class data, wherein the state change start time is the change time and the state change end time is the maximum time; and set the state change end time of the expired class data to the change time.

Fig. 6 is a schematic diagram comparing the storage effect of an embodiment of the present invention with that of the prior art. Compared with the incremental accumulation approach of the prior art, the data storage scheme of the present invention can effectively save data storage space. Take as an example a data table on the order of one hundred million rows, in which the volume of data added and changed each day is on the order of one million rows. The space saving rate is calculated from the following quantities: base, the base size of the table (order of one hundred million); N, the number of days; C, the daily increment (order of one million); and M, the daily change volume (order of one million).

As N tends to infinity, the space saving rate tends to 1; that is, the longer the time span, the more space is saved. In practical applications, the space saving rate can reach 90% or more. It can thus be seen that the technical solution of the present invention effectively saves data storage space and retains the historical track of all data with minimal storage.
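For a rough sense of the magnitudes involved, the sketch below computes a space saving rate from base, N, C and M under an assumed model; it is not presented as the formula of this description. The assumptions are that the snapshot approach stores one full copy per day (base + k·C rows on day k) and that the present scheme stores the base once plus one row per added or changed item per day. The function name `space_saving_rate` is hypothetical.

```python
def space_saving_rate(base: int, n_days: int, daily_new: int, daily_changed: int) -> float:
    """Fraction of row storage saved relative to daily full snapshots.

    Assumed model (illustrative only):
      snapshots:   sum over day k = 1..N of (base + k * C)
      incremental: base + N * (C + M)
    """
    snapshot_rows = sum(base + day * daily_new for day in range(1, n_days + 1))
    incremental_rows = base + n_days * (daily_new + daily_changed)
    return 1.0 - incremental_rows / snapshot_rows
```

Under this model, a table of one hundred million rows with one million additions and one million changes per day saves roughly 95% after 30 days, and the rate approaches 1 as the number of days grows, which is in line with the behavior described above.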

According to the technical solution of the embodiments of the present invention, data needs to be classified, stored and state-updated only when its state changes; unchanged data requires no second storage or state update. The efficiency of data storage and data retrieval can therefore be improved while data changes are effectively recorded, data storage space is saved, and cleaning up expired data becomes simple and convenient.

The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for data storage on a distributed data platform, characterized by comprising:

classifying changed data by comparing the current day's data with the data in a data state change table, wherein the classification is performed according to the stage of the data life cycle and includes three types: an online class, an expired class and an archive class, and wherein the data state change table distinguishes data by a primary key that includes the key name and the start time of the data state change;

collecting the classified data into different directories and storing it under the corresponding partitions according to the data storage rules of those directories, wherein the partition names include an online partition, an expired partition and an archive partition; and

updating the data state change table.

2. The method according to claim 1, characterized in that the step of classifying the changed data comprises:

comparing the current day's data with the data in the data state change table by looking up the key name of the data;

if the data is absent from the data state change table but present in the current day's data, the data is of the online class;

if the data is present in both the data state change table and the current day's data but its key value differs, the data in the data state change table is of the expired class and the current day's data is of the online class; and

if the data is present in the data state change table but absent from the current day's data, the data is of the archive class.

3. The method according to claim 1, characterized in that the data storage rule comprises three directory levels: partition name, data time and data life end time.

4. The method according to claim 1, characterized in that the step of storing under the corresponding partition according to the data storage rules of the directory comprises:

for the online class data, the first-level directory (partition name) is the online partition, the second-level directory (data time) is the maximum time, and the third-level directory (data life end time) is the maximum time;

for the expired class data, the first-level directory (partition name) is the expired partition, the second-level directory (data time) is the change time, and the third-level directory (data life end time) is the change time; and

for the archive class data, the first-level directory (partition name) is the archive partition, the second-level directory (data time) is the change time, and the third-level directory (data life end time) is the maximum time.

5. The method according to claim 1, characterized in that the step of updating the data state change table comprises:

inserting the key name, key value, state change start time and state change end time of the online class data, wherein the state change start time is the change time and the state change end time is the maximum time; and

setting the state change end time of the expired class data to the change time.

6. An apparatus for data storage on a distributed data platform, characterized by comprising:

a data classification module, configured to classify changed data by comparing the current day's data with the data in a data state change table, wherein the classification is performed according to the stage of the data life cycle and includes three types: an online class, an expired class and an archive class, and wherein the data state change table distinguishes data by a primary key that includes the key name and the start time of the data state change;

a data storage module, configured to collect the classified data into different directories and store it under the corresponding partitions according to the data storage rules of those directories, wherein the partition names include an online partition, an expired partition and an archive partition; and

a state update module, configured to update the data state change table.

7. The apparatus according to claim 6, characterized in that the data classification module is further configured to:

8. The apparatus according to claim 6, characterized in that the data storage rule comprises three directory levels: partition name, data time and data life end time.

9. The apparatus according to claim 6, characterized in that the data storage module is further configured to:

10. The apparatus according to claim 6, characterized in that the state update module is further configured to:

11. An electronic device for data storage on a distributed data platform, characterized by comprising:

one or more processors; and

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.

12. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 5.