CN103944964A

CN103944964A - Distributed system and method for step-by-step capacity expansion using the same

Info

Publication number: CN103944964A
Application number: CN201410116840.4A
Authority: CN
Inventors: 李晓华
Original assignee: SHANGHAI CLOUDYBI INFORMATION TECHNOLOGY Co Ltd
Current assignee: SHANGHAI CLOUDYBI INFORMATION TECHNOLOGY Co Ltd
Priority date: 2014-03-27
Filing date: 2014-03-27
Publication date: 2014-07-23

Abstract

The invention provides a distributed system and a method for expanding it step by step. The system comprises an ETL data writing module, a buffer module, a data redistribution module, a data index distribution table forming module, a scheduling module, and a plurality of databases. The databases are hosted on a plurality of servers. The buffer module is an ETL portable hard drive, and the buffer module and the databases are communicatively connected to the ETL data writing module. The scheduling module comprises a pause unit, a start unit, and a task adding unit. The invention solves the problem that old nodes cannot be used after expansion, which lowers server cost and increases profit. The system can still provide service externally during daytime working hours, and new and old nodes can run at the same time.

Description

A distributed system and a method for step-by-step capacity expansion based on the system

Technical field

The present invention relates to a system expansion method, and in particular to a distributed system and a method for expanding it step by step.

Background art

In a big data processing system, each node stores a very large volume of data; a single hard disk may hold, for example, 2.8 TB of actual data. When the system's processing capacity is insufficient and data nodes must be added to increase it, redistributing the data of the original system across all data nodes (old nodes plus new nodes) is a difficult problem. Most current distributed data systems distribute data by hashing, and when nodes are added the system redistributes the data by hashing again. Because the data volume is huge, redistribution often cannot be completed even within 24 hours, yet most systems must also provide service externally during production hours. This conflict needs to be resolved. Some products could not achieve this: for example, when EMC's Greenplum system in the Guangdong mobile signaling acquisition and analysis system was to be expanded from 10 nodes to 50 nodes, the data had to be imported into the new and old nodes in parallel, and the final new system had only 40 nodes rather than 50. The problem becomes more serious when expanding from 50 nodes to 100.

Summary of the invention

To overcome the above shortcomings of redistribution-based expansion in data processing systems, the invention provides a distributed system and a step-by-step expansion method based on it. The system can be expanded step by step, which ensures on the one hand that the system can still provide service externally during daytime working hours, and on the other hand that both new and old nodes can run in the system.

The invention provides a distributed system that is expanded step by step. It comprises an ETL data writing module, a buffer module, a data redistribution module, a data index distribution table forming module, a scheduling module, and a plurality of databases. The databases are hosted on a plurality of servers, the buffer module is an ETL portable hard drive, and the buffer module and the databases are communicatively connected to the ETL data writing module. The scheduling module comprises a pause unit, a start unit, and a task adding unit.

Specifically, a method for step-by-step expansion based on the above distributed system comprises the following steps:

S1: after a plurality of new databases of the distributed system have been created, the data index distribution table forming module generates a new data index distribution table according to the number of new databases;

S2: the pause unit suspends the data writing operation of the ETL data writing module, files are saved to the buffer module, and the task adding unit adds the ETL data writing tasks to a task queue;

S3: the start unit starts the data redistribution module, which distributes the data according to the new data index distribution table;

S4: the ETL data writing module is restarted and the queued ETL data writing tasks are processed at an accelerated rate; after ETL data writing is complete, the online query tasks are restarted.

Preferably, the data index distribution table forming module generates the new data index distribution table according to the number of new databases and according to business rules.

Preferably, the new data index distribution table distributes 30%-60% of the data in an old node to a new node.

Preferably, the new data index distribution table distributes 50% of the data in an old node to a new node.

Preferably, the task adding unit comprises a selection unit and a command unit. The selection unit can set the priority of each ETL data writing task, and the command unit issues commands that determine the order of task execution according to task priority.

Preferably, the priority of an ETL data writing task is one of highest priority, secondary priority, and normal priority.

The advantages of the invention are as follows. The invention expands the system step by step, spread over several days, and after expansion both old and new nodes can be used. This solves the problem that old nodes cannot be used after expansion, which lowers server cost and increases revenue. A big data server generally costs about 50,000 to 150,000 yuan; at 100,000 yuan per server, reusing 20 old nodes saves 2,000,000 yuan. In addition, because the system is expanded step by step, it can still provide service externally during daytime working hours, and both new and old nodes can run in the system.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the step-by-step expansion distributed system provided by the invention;

Fig. 2 is a structural schematic diagram of the scheduling module of the invention;

Fig. 3 is a schematic diagram of the expansion process in the step-by-step expansion method of the distributed system of the invention.

Detailed description of the embodiments

Embodiment

First, some terms used in the invention are explained:

A database, as the term is used here, is a subject-oriented, integrated, non-updatable, time-variant collection of data that supports decision analysis for an enterprise or organization. The database is generally used to store the historical data of the enterprise and, through the ETL process, to produce enterprise reports and the like.

ETL refers to extracting distributed data from heterogeneous data sources (such as relational data and flat data files) into a temporary intermediate layer, then cleaning, transforming, and integrating it, and finally loading it into the database, where it becomes the basis for enterprise reports, online analytical processing, and data mining. ETL tasks generally run at night; they process the enterprise's bulk data, compute key performance indicators (KPI, Key Performance Indicator), and load them into reports.

A data source is the source data required by a given ETL computing task. Sometimes it is data from the production database, and sometimes it is data produced by another ETL program.

The production database is the database used by the enterprise's daytime business operations, and it is the largest data source of the database.

The invention is further explained below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, a distributed system that is expanded step by step comprises an ETL data writing module 1, a buffer module 2, a data redistribution module 3, a data index distribution table forming module 4, a scheduling module 5, and a plurality of databases 6. The databases 6 are hosted on a plurality of servers, the buffer module 2 is an ETL portable hard drive, and the buffer module 2 and the databases 6 are communicatively connected to the ETL data writing module 1. As shown in Fig. 2, the scheduling module 5 comprises a pause unit 50, a start unit 51, and a task adding unit 52.

As shown in Fig. 3, a method for step-by-step expansion based on the above distributed system comprises the following steps:

S1: after the new databases (hereinafter also called nodes) of the distributed system have been created, the data index distribution table forming module 4 generates a new data index distribution table according to the number of new databases;

S2: the pause unit 50 suspends the data writing operation of the ETL data writing module 1, files are saved to the buffer module 2, and the task adding unit 52 adds the ETL data writing tasks to a task queue;

S3: the start unit 51 starts the data redistribution module 3, which distributes the data according to the new data index distribution table; the ETL data writing module 1 looks up the new node location and stores the data there;

S4: the start unit 51 restarts the ETL data writing module 1, and the queued ETL data writing tasks are processed at an accelerated rate;

S5: after ETL data writing is complete, the start unit 51 restarts the online query tasks.
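The ordering of steps S2 to S5 can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the patented implementation: the function name `run_expansion`, its parameters, and the representation of records as `(key, record)` pairs are all assumptions made for the example, and the new distribution table from S1 is simply passed in as a dictionary.

```python
from collections import deque


def run_expansion(new_table, stored, incoming):
    """Walk through steps S2-S5 on in-memory data (illustrative only).

    new_table: key -> node name, i.e. the table produced in S1
    stored:    (key, record) pairs already held by the old nodes
    incoming:  (key, record) pairs that arrive while ETL writing is paused
    """
    log = []

    # S2: pause ETL writing; park incoming files in the buffer and queue
    # one write task per file.
    buffer = list(incoming)
    task_queue = deque(buffer)
    log.append("S2: ETL paused, %d task(s) queued" % len(task_queue))

    # S3: redistribute existing data according to the new table.
    placement = {}
    for key, record in stored:
        placement.setdefault(new_table[key], []).append(record)
    log.append("S3: existing data redistributed")

    # S4: restart ETL writing and drain the queued tasks; each write is
    # routed through the new table.
    while task_queue:
        key, record = task_queue.popleft()
        placement.setdefault(new_table[key], []).append(record)
    log.append("S4: ETL restarted, queue drained")

    # S5: only now are online queries brought back.
    log.append("S5: online queries restarted")
    return placement, log


if __name__ == "__main__":
    table = {"00": "DB101", "02": "DB1"}
    placement, log = run_expansion(
        table, stored=[("00", "a"), ("02", "b")], incoming=[("00", "c")]
    )
    print(placement)
    print("\n".join(log))
```

The point of the sketch is the ordering: writes that arrive during redistribution are held and replayed against the new table, so no write is routed by the old table after S3 has begun.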

Preferably, the data index distribution table forming module 4 generates the new data index distribution table according to the number of new databases and according to business rules.

Preferably, the new data index distribution table distributes 30%-60% of the data in an old node to a new node. In this embodiment, the new data index distribution table distributes 50% of the data in an old node to a new node, as shown in the tables below.

In a distributed system, data is usually distributed by hash value or by business rule. Distribution by hash value is implemented mainly through a mathematical algorithm and cannot be controlled manually during expansion, so the old nodes cannot continue to be used after expansion, which increases the cost of the database. In the invention, a new data index distribution table is generated according to business rules and used for distribution, which makes expansion easy to control.
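The difference in controllability can be illustrated with a short Python sketch. It rests on stated assumptions: hash placement is simplified to "key modulo node count", which is only a stand-in for whatever hash function a real system uses, and the function names and the small lookup table are hypothetical.

```python
def hash_node(key, node_count):
    # Simplified stand-in for hash placement: node index = key mod node count.
    return int(key) % node_count


def moved_by_hash(keys, old_count, new_count):
    # Keys whose node index changes when the node count changes.
    return [k for k in keys if hash_node(k, old_count) != hash_node(k, new_count)]


def moved_by_table(old_table, new_table):
    # Keys whose node changes between two explicit distribution tables.
    return [k for k in old_table if old_table[k] != new_table[k]]


keys = ["%02d" % i for i in range(100)]

# Hash placement: the set of moved keys is dictated by the function.
hash_moved = moved_by_hash(keys, old_count=10, new_count=50)

# Table placement: the operator names exactly which keys move.
old_table = {"00": "DB1", "01": "DB1", "02": "DB1", "03": "DB1"}
new_table = dict(old_table)
new_table.update({"00": "DB101", "01": "DB101"})
table_moved = moved_by_table(old_table, new_table)

print(len(hash_moved), "of", len(keys), "keys move under modulo placement")
print(table_moved, "move under the distribution table")
```

Under this simplified modulo placement, going from 10 to 50 nodes relocates 80 of the 100 two-digit keys, and the operator has no say in which ones. With an explicit table, only the keys whose entries were edited move.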

For example, the new data index distribution table distributes 50% of the data in an old node to a new node.

Assume that data is distributed according to the last two digits of the number. Before expansion, the distribution is as follows:

Districts and cities	Record count	Node
00	50	DB1
01	50	DB1
02	50	DB1
03	40	DB1
04	20	DB2
05	80	DB2
06	40	DB2
07	60	DB2
....	?	?

During expansion, the data index distribution table can be adjusted, for example:

Districts and cities	Record count	Node
00	50	DB101
01	50	DB101
02	50	DB1
03	40	DB1
04	20	DB201
05	80	DB201
06	40	DB2
07	60	DB2
....	?	?
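One way to derive such an adjusted table is sketched below in Python. The patent does not specify the business rule, so the greedy rule used here (reassign keys in ascending order until the target share of records has moved) is purely an assumption for illustration, as are the function name `split_to_new_node` and its parameters. The record counts are those of the tables above.

```python
def split_to_new_node(table, counts, old_node, new_node, target=0.5):
    """Reassign keys of old_node to new_node, in key order, until at least
    `target` of old_node's records have moved. Returns (new_table, fraction)."""
    new_table = dict(table)
    keys = sorted(k for k, node in table.items() if node == old_node)
    total = sum(counts[k] for k in keys)
    if total == 0:
        return new_table, 0.0
    moved = 0
    for k in keys:
        if moved >= target * total:
            break
        new_table[k] = new_node
        moved += counts[k]
    return new_table, moved / total


# Record counts and node assignments before expansion, from the first table.
counts = {"00": 50, "01": 50, "02": 50, "03": 40,
          "04": 20, "05": 80, "06": 40, "07": 60}
before = {"00": "DB1", "01": "DB1", "02": "DB1", "03": "DB1",
          "04": "DB2", "05": "DB2", "06": "DB2", "07": "DB2"}

step1, share_db1 = split_to_new_node(before, counts, "DB1", "DB101")
after, share_db2 = split_to_new_node(step1, counts, "DB2", "DB201")

print(after)
print(round(share_db1, 3), round(share_db2, 3))
```

With these counts the sketch reproduces the adjusted table: keys 00 and 01 go to DB101 (100 of DB1's 190 records, about 53%) and keys 04 and 05 go to DB201 (100 of DB2's 200 records, exactly 50%), both inside the preferred 30%-60% range.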

Preferably, the task adding unit 52 comprises a selection unit 520 and a command unit 521. The selection unit 520 can set the priority of each ETL data writing task, and the command unit 521 issues commands that determine the order of task execution according to task priority.

In a preferred embodiment, the priority of an ETL data writing task is one of highest priority, secondary priority, and normal priority.
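A three-level priority queue of this kind can be sketched with Python's standard `heapq` module. The class and task names below are hypothetical, and the first-in-first-out ordering within a single priority level is an added assumption; the text only states that execution order follows priority.

```python
import heapq
from enum import IntEnum
from itertools import count


class Priority(IntEnum):
    HIGHEST = 0
    SECONDARY = 1
    NORMAL = 2


class EtlTaskQueue:
    """Illustrative stand-in for the selection unit plus the command unit."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: FIFO order within one priority

    def add(self, task_name, priority=Priority.NORMAL):
        # Selection unit: attach a priority when the task is queued.
        heapq.heappush(self._heap, (priority, next(self._seq), task_name))

    def next_task(self):
        # Command unit: hand out the most urgent task first.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = EtlTaskQueue()
queue.add("load_daily_report")
queue.add("load_kpi", Priority.HIGHEST)
queue.add("load_archive", Priority.SECONDARY)
queue.add("load_logs")

order = [queue.next_task() for _ in range(len(queue))]
print(order)
```

Here `load_kpi` runs first, then `load_archive`, then the two normal-priority tasks in the order they were added.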

Those of ordinary skill in the art should understand that, without departing from the basic principles of the invention, various modifications, refinements, combinations, additions, or substitutions of technical features may be made to the invention, and that such equivalent substitutions or obvious variations all fall within the protection scope of the invention.

Claims

1. A distributed system that is expanded step by step, characterized in that: it comprises an ETL data writing module, a buffer module, a data redistribution module, a data index distribution table forming module, a scheduling module, and a plurality of databases; the plurality of databases are hosted on a plurality of servers; the buffer module is an ETL portable hard drive; the buffer module and the plurality of databases are communicatively connected to the ETL data writing module; and the scheduling module comprises a pause unit, a start unit, and a task adding unit.

2. A method for step-by-step expansion of a distributed system, characterized in that it comprises the following steps:

S4: restarting the ETL data writing module and processing the queued ETL data writing tasks at an accelerated rate; and, after ETL data writing is complete, starting the online query tasks.

3. the method for progressively dilatation according to claim 2, is characterized in that: data directory distribution table forms module according to the quantity of new database, by business rule, generates new data directory distribution table.

4. the method for progressively dilatation according to claim 3, is characterized in that: described new data directory distribution table is distributed to the 30%-60% of the data in old node in new node.

5. the method for progressively dilatation according to claim 4, is characterized in that: described new data directory distribution table is distributed in new node 50% of the data in old node.

6. the method for progressively dilatation according to claim 2, it is characterized in that: described interpolation TU task unit comprises selected cell and command unit, described selected cell can be set the priority of the task of ETL data writing, and described command unit is selected the first post command of tasks carrying according to the priority of task.

7. the method for progressively dilatation according to claim 6, is characterized in that: the priority of the task of ETL data writing is divided into limit priority, inferior priority and normal priority.