System and method for distributed data processing
Technical field
The present invention relates to data processing technology, and in particular to a system and method for distributed data processing.
Background art
A distributed database system is a collection of data that logically belongs to a single system but is physically distributed across a plurality of nodes connected by a computer network. The nodes are linked together in a communication network, and each node is an independent database system with its own database, CPU, terminals, and local database management system (local DBMS). In a distributed database system, user data is generally distributed by user across different node databases (DBs). Each time user data is accessed or modified, the node database holding that user data must first be located, so the information identifying that node database is an important piece of user state data.
Fig. 1 is a schematic structural diagram of a user distribution device in an existing distributed database system. The user distribution device in the existing distributed database system is described below with reference to Fig. 1:
When a user registers, the user distribution control module 21 obtains the user distribution weights of the different node databases in the current system and, according to these weights, assigns the user a DBid corresponding to the user's user id, so that the users in the system are distributed evenly across the different node databases.
The user distribution information database 22 stores user distribution information. The user distribution information comprises each user id and the DBid corresponding to that user id, and may also comprise the current state information of each user's data.
When an access request is received, the user access control module 23 queries the user distribution information database 22 by user id to obtain the DBid of the node database that stores the user data for that user id, and then accesses the user data in the node database corresponding to that DBid. When the system upgrades or migrates user data, the user data state configuration unit 24 sets the current state of that user data in the user distribution information database 22 to a maintenance state; after the upgrade or migration is complete, it sets the current state of that user data in the user distribution information back to the normal state in which access is allowed.
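The registration, lookup, and state-change behaviour of the existing device can be illustrated with a minimal Python sketch. All names here (`register_user`, `locate_user`, `set_state`, the dictionary layout) are hypothetical, and the balancing rule (choose the node database with the lowest users-to-weight ratio) and the refusal of access during maintenance are assumptions made for illustration only; the source does not specify them.

```python
NORMAL, MAINTENANCE = "normal", "maintenance"

def register_user(user_id, weights, distribution):
    """Assign user_id to the node database with the lowest users/weight ratio."""
    counts = {db_id: 0 for db_id in weights}
    for entry in distribution.values():
        counts[entry["db_id"]] += 1
    # Assumed balancing rule: the least-loaded database relative to its weight wins.
    db_id = min(weights, key=lambda d: counts[d] / weights[d])
    distribution[user_id] = {"db_id": db_id, "state": NORMAL}
    return db_id

def locate_user(user_id, distribution):
    """Return the DBid for user_id; refuse access while the data is in maintenance."""
    entry = distribution[user_id]
    if entry["state"] == MAINTENANCE:
        raise RuntimeError(f"user data for {user_id} is under maintenance")
    return entry["db_id"]

def set_state(user_id, distribution, state):
    """Mark one user's data as NORMAL or MAINTENANCE; other users are unaffected."""
    distribution[user_id]["state"] = state
```

With weights of 1 and 2 for two node databases, six registrations under this rule place two users on the first database and four on the second, and putting one user into maintenance does not block lookups for the others.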
In the user distribution device of the existing distributed database system, the distribution of existing users across the node databases is fully taken into account when a user registers, so that users are evenly distributed across the node databases in the system. When user data is upgraded or migrated, only the users being upgraded or migrated are affected, and access to other users' data is not affected. However, existing distributed database systems are usually built on large databases such as Oracle, DB2, and Sybase. When processing massive amounts of data, such systems are not only costly but also time-consuming to maintain, and therefore leave room for further improvement.
Summary of the invention
In view of this, one object of the present invention is to provide a system for distributed data processing that can reduce cost and shorten the time spent on system maintenance.
Another object of the present invention is to provide a method for distributed data processing that can reduce cost and shorten the time spent on system maintenance.
To achieve the above objects, the technical solution of the present invention is implemented as follows:
A system for distributed data processing, the system comprising:
a data acquisition module, configured to extract first data from distributed data sources according to a preset extraction condition, perform data cleaning and conversion on the first data according to a preset cleaning and conversion rule, load the cleaned and converted first data into a table, and load the table into a data warehouse module;
the data warehouse module, configured to split the table according to a preset table-splitting rule, extract the first data from the split tables to obtain table data and the first data, save the table data, classify the first data according to a preset partitioning rule, and save the first data in the corresponding partitions; and
a data access module, configured to read the table data and the first data from the data warehouse module according to an input instruction, load the first data into the table corresponding to the table data, and output the table containing the first data.
In the above system, the data acquisition module comprises:
a data extraction unit, configured to extract the first data from the distributed data sources according to the preset extraction condition and output the first data to a data processing unit;
the data processing unit, configured to perform data cleaning and conversion on the first data according to the preset cleaning and conversion rule and output the first data to a data loading unit; and
the data loading unit, configured to organize the first data, load the first data into a table, and load the table containing the first data into the data warehouse module.
In the above system, the data warehouse module comprises:
a management node, configured to start or shut down the SQL nodes and data nodes according to an externally input instruction, manage a configuration file and a log file, and write key information reported by the data nodes into the log file;
at least one SQL node, wherein each SQL node establishes, according to the table data it stores, a one-to-one connection with the data node that stores the first data of the table; the SQL node splits the table according to the preset table-splitting rule, extracts the first data from the split tables to obtain the table data and the first data, outputs the first data to the data node, and saves the table data of the split tables; and
at least one data node, configured to obtain the configuration file from the management node, retrieve the configuration data to complete node configuration, classify the first data according to the preset partitioning rule, and save the first data in the corresponding partitions.
Preferably, the SQL node further establishes a table index corresponding to the table data, and further stores, for each split table, a first file for storing the table structure, a second file for storing the table data, and a third file for storing the table index.
In the above system, the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
the vertical splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 having N1 fields and a second sub-table Tab2 having (N2+1) fields, wherein the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
the horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result of the calculation.
In the above system, the preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning;
range partitioning is based on the data recorded in a field and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to each partition;
list partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition;
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to each partition.
In the above system, the data access module comprises:
a data retrieval unit, configured to look up, according to the input instruction, the table index contained in the third file on the SQL node to obtain the first file and the second file, output the first file and the second file to a table generation unit, read the first data stored in the data node connected to the SQL node, and output the first data to the table generation unit; and
the table generation unit, configured to generate a table according to the first file and the second file, insert the first data into the table, and output the table.
A method for distributed data processing, the method comprising:
A. a data acquisition module extracts first data from distributed data sources, performs data cleaning and conversion on the first data according to a preset cleaning and conversion rule, and loads the cleaned and converted first data into a table;
B. a data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, and saves the table data;
C. the data warehouse module classifies the first data according to a preset partitioning rule and saves the first data in the corresponding partitions;
D. a data access module reads the table data and the first data according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table.
Preferably, before step B, the method further comprises:
a management node included in the data warehouse module starts or shuts down nodes, and manages a configuration file and a log file, according to an externally input instruction;
wherein each such node is an SQL node or a data node included in the data warehouse module.
In the above method, step B comprises:
B1. the SQL node included in the data warehouse module establishes, according to the table data it stores, a one-to-one connection with the data node that stores the first data of the table;
B2. the SQL node included in the data warehouse module splits the table according to the preset table-splitting rule;
B3. the SQL node included in the data warehouse module extracts the first data from the split tables to obtain the table data and the first data, saves the table data of the split tables, and outputs the first data to the data node included in the data warehouse module.
In the above method, step C comprises:
C1. the data node included in the data warehouse module obtains the configuration file from the management node included in the data warehouse module, retrieves the configuration data, and completes node configuration;
C2. the data node included in the data warehouse module classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
Preferably, after step B3, the method further comprises:
B4. the SQL node included in the data warehouse module establishes a table index corresponding to the table data;
B5. the SQL node included in the data warehouse module stores, for each split table, a first file for storing the table structure, a second file for storing the table data, and a third file for storing the table index.
In the above method, the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
the vertical splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 having N1 fields and a second sub-table Tab2 having (N2+1) fields, wherein the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
the horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result of the calculation.
In the above method, the preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning;
range partitioning is based on the data recorded in a field and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to each partition;
list partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition;
KEY partitioning performs a calculation on the data recorded in a field according to a set expression and uses the result as the selection condition for the data in the partition file corresponding to each partition.
In the above method, step D comprises:
D1. the data access module looks up, according to the input instruction, the table index contained in the third file on the SQL node included in the data warehouse module, and obtains the first file and the second file;
D2. the data access module reads, according to the first file and the second file, the first data stored in the data node connected to the SQL node included in the data warehouse module;
D3. the data access module generates a table according to the first file and the second file, inserts the first data into the table, and outputs the table.
As can be seen from the above technical solution, the present invention provides a system and method for distributed data processing. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves the first data in the corresponding partitions. The data access module reads the table data and the first data from the data warehouse module according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. The system and method of the present invention can reduce cost and shorten the time spent on system maintenance.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a user distribution device in an existing distributed database system;
Fig. 2 is a schematic structural diagram of the system for distributed data processing of the present invention;
Fig. 3 is a flowchart of the method for distributed data processing of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The present invention provides a system and method for distributed data processing. The data warehouse module stores the table generated by the data acquisition module and splits the table. The table data of the split tables is saved in the SQL node, the first data recorded in the split tables is saved in the data node, and a correspondence is established between the table data and the first data recorded in the table, so that the data access module can obtain from the SQL node and the data node a table containing the data it requires. This not only saves cost but also effectively protects massive amounts of data and shortens the time spent on maintaining the data.
Fig. 2 is a schematic structural diagram of the system for distributed data processing of the present invention. The system is described below with reference to Fig. 2:
The system for distributed data processing of the present invention comprises a data acquisition module 311, a data warehouse module 312, and a data access module 313.
The data acquisition module 311 extracts first data from a plurality of distributed data sources according to a preset extraction condition, performs data cleaning and conversion on the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into the data warehouse module 312. The distributed data sources may be files that contain important data, such as business data lists, logs, and CDR files. The first data is user data or business data, for example telephone charges, wages, or call durations.
The data warehouse module 312 splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves the first data in the corresponding partitions. The data warehouse module 312 may be deployed on a single terminal or distributed across multiple terminals.
The preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule. The vertical splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 having N1 fields and a second sub-table Tab2 having (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result of the calculation.
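The vertical and horizontal splitting rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions: rows are modelled as dictionaries, the function names `vertical_split` and `horizontal_split` are hypothetical, the primary key is assumed to be one of the N1 fields kept in Tab1, and CRC32 stands in for the unspecified preset hash algorithm.

```python
import zlib

def vertical_split(rows, tab1_fields, tab2_fields, key):
    """Split rows into Tab1 (N1 fields) and Tab2 (N2 fields plus the primary key)."""
    tab1 = [{f: row[f] for f in tab1_fields} for row in rows]
    # Tab2 carries the primary key so that each row can be re-associated with Tab1.
    tab2 = [{key: row[key], **{f: row[f] for f in tab2_fields}} for row in rows]
    return tab1, tab2

def horizontal_split(rows, key, n_tables):
    """Distribute whole rows over n_tables sub-tables by a hash of the key field."""
    sub_tables = [[] for _ in range(n_tables)]
    for row in rows:
        # CRC32 is an assumed stand-in for the preset hash algorithm.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        sub_tables[h % n_tables].append(row)
    return sub_tables
```

For a table with four fields split as N1 = 2 and N2 = 2, `vertical_split` yields a Tab1 with two fields and a Tab2 with three, matching the (N2+1) count in the rule. `horizontal_split` keeps every row intact and always sends the same key to the same sub-table.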
The preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning.
Range partitioning is based on the data recorded in a field and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to each partition. For example, if the set range is a time range, three partitions may be defined: 2011-10-1 to 2011-10-31, 2011-11-1 to 2011-11-30, and 2011-12-1 to 2011-12-31, each corresponding to one partition file. The first data corresponding to fields whose recorded time falls within 2011-10-1 to 2011-10-31 is stored in partition file A, the first data corresponding to fields whose recorded time falls within 2011-11-1 to 2011-11-30 is stored in partition file B, and the first data corresponding to fields whose recorded time falls within 2011-12-1 to 2011-12-31 is stored in partition file C.
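The time-range example above can be expressed as a short Python sketch. The names `PARTITIONS` and `range_partition` are hypothetical, and rejecting a date outside all three ranges with an error is an assumption; the source does not say how out-of-range values are handled.

```python
from datetime import date

# The three partitions from the example, as (partition file, first day, last day).
PARTITIONS = [
    ("A", date(2011, 10, 1), date(2011, 10, 31)),
    ("B", date(2011, 11, 1), date(2011, 11, 30)),
    ("C", date(2011, 12, 1), date(2011, 12, 31)),
]

def range_partition(recorded):
    """Return the partition file whose date range contains the recorded time."""
    for name, first, last in PARTITIONS:
        if first <= recorded <= last:
            return name
    # Assumed behaviour: a value outside every range is rejected.
    raise ValueError(f"{recorded} falls outside every partition range")
```

A record dated 2011-11-15, for instance, is routed to partition file B.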
List partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition. For example, when partitioning by the wage amount recorded in a field with the set attribute values 1000, 2000, 3000, and 4000, four partitions are defined, each corresponding to one partition file. The first data corresponding to fields recording a wage amount of 1000 is stored in partition file A', that for 2000 in partition file B', that for 3000 in partition file C', and that for 4000 in partition file D'. Here, the first data corresponding to a field may be other data recorded with that field, such as position and hiring date.
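The wage example above can be sketched in Python as a direct lookup from attribute value to partition file. The names `LIST_PARTITIONS`, `list_partition`, and `store` are hypothetical, and raising an error for an unlisted wage is an assumption, since the source does not describe that case.

```python
# Set attribute values and their partition files, as in the example.
LIST_PARTITIONS = {1000: "A'", 2000: "B'", 3000: "C'", 4000: "D'"}

def list_partition(wage):
    """Return the partition file for a wage that matches a listed value."""
    try:
        return LIST_PARTITIONS[wage]
    except KeyError:
        # Assumed behaviour: a value that is not in the list is rejected.
        raise ValueError(f"wage {wage} matches no listed value") from None

def store(records):
    """Group records (with their other fields, such as position) by partition file."""
    files = {name: [] for name in LIST_PARTITIONS.values()}
    for rec in records:
        files[list_partition(rec["wage"])].append(rec)
    return files
```

Each stored record keeps its other fields, which corresponds to the remark that the first data may include information such as position and hiring date.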
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition. HASH partitioning is used mainly to ensure that the first data corresponding to the field is evenly distributed across a predetermined number of partitions.
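The even distribution that HASH partitioning aims for can be seen in a minimal Python sketch. It assumes non-negative integer field values and uses the remainder modulo the number of partitions as the hash; the source does not fix a particular hash function, and the names `hash_partition` and `distribution` are hypothetical.

```python
from collections import Counter

def hash_partition(value, num_partitions):
    """Map an integer field value to a partition number in [0, num_partitions)."""
    # Assumed hash: the remainder of the value modulo the number of partitions.
    return value % num_partitions

def distribution(values, num_partitions):
    """Count how many values land in each partition."""
    return Counter(hash_partition(v, num_partitions) for v in values)
```

For the consecutive values 0 to 99 and four partitions, `distribution(range(100), 4)` assigns exactly 25 values to each partition.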
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to each partition. The main difference between KEY partitioning and HASH partitioning lies in the expression used for the calculation: the expression used in HASH partitioning is a hash function, whereas in KEY partitioning the hash function expression may be provided by the MySQL server, or an expression set by the user may be used. The specific expressions are not described further here.
The data access module 313 reads the table data and the first data from the data warehouse module 312 according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. The data access module 313 provides the user with an interface for accessing the data in the data warehouse module 312.
The data acquisition module 311 comprises a data extraction unit 3111, a data processing unit 3112, and a data loading unit 3113.
The data extraction unit 3111 extracts the first data from a plurality of distributed data sources according to a preset extraction condition and outputs the first data to the data processing unit 3112. The data extraction unit 3111 is connected to the plurality of data sources and can extract the required first data from them.
The data processing unit 3112 performs data cleaning and conversion on the first data according to a preset cleaning and conversion rule and outputs the first data to the data loading unit 3113. The preset cleaning and conversion rule consists of a preset data screening condition and the attributes of the running system; the data formats that the system supports can be determined from these system attributes, so that the first data can be converted into a format the system supports.
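A cleaning and conversion rule of this kind can be sketched in Python as a screening condition followed by a format conversion. Everything specific here is an assumption made for illustration: the record field names (`user_id`, `fee`, `call_seconds`), the screening condition (drop records without a user id or fee), and the target formats are not taken from the source.

```python
def clean_and_convert(records):
    """Screen out incomplete records and convert the rest to a uniform format."""
    cleaned = []
    for rec in records:
        # Assumed screening condition: a record needs both a user id and a fee.
        if not rec.get("user_id") or rec.get("fee") in (None, ""):
            continue
        # Assumed supported format: trimmed string id, numeric fee, integer seconds.
        cleaned.append({
            "user_id": str(rec["user_id"]).strip(),
            "fee": round(float(rec["fee"]), 2),
            "call_seconds": int(rec.get("call_seconds") or 0),
        })
    return cleaned
```

Records that fail the screening condition are dropped before conversion, so only data in the supported format reaches the loading stage.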
The data loading unit 3113 organizes the first data, loads the first data into a table, and loads the table containing the first data into the data warehouse module 312. The data loading unit 3113 generates a table for holding the first data, so that data is subsequently stored and output in table form.
The data warehouse module 312 comprises a management node 3121, at least one SQL node 3122, and at least one data node 3123. The management node 3121 is located on one terminal. The SQL node 3122 may be located on the same terminal as the management node 3121 or on a different terminal; likewise, the data node 3123 may be located on the same terminal as the management node 3121 or on a different terminal.
The management node 3121 starts or shuts down the SQL node 3122 and the data node 3123 according to an externally input instruction, manages the configuration file and the log file, and writes key information reported by the data node 3123 into the log file. The configuration file records the configuration of each individual data node in the cluster and the individual communication links set up between the data nodes. The management node 3121 can be started with the command ndb_mgmd and is started before the SQL node 3122 and the data node 3123.
Each SQL node 3122 establishes, according to the table data it stores, a one-to-one connection with the data node 3123 that stores the first data of the table. The SQL node 3122 splits the table according to the preset table-splitting rule, extracts the first data from the split tables to obtain the table data and the first data, outputs the first data to the data node 3123, and saves the table data of the split tables. The SQL node 3122 is used to access the data nodes 3123 in the cluster; it can be started with the command mysqld --ndbcluster, or started with mysqld after ndbcluster has been added to my.cnf.
The data node 3123 obtains the configuration file from the management node 3121, retrieves the configuration data to complete node configuration, classifies the first data according to the preset partitioning rule, and saves the first data in the corresponding partitions. The number of data nodes 3123 depends on the number of replicas and is a multiple of the number of fragments; for example, with two replicas of two fragments each, there are 4 data nodes 3123. The data node 3123 can be started with the command ndbd.
To facilitate lookup and improve the efficiency of data retrieval, the SQL node 3122 further establishes a table index corresponding to the table data, and further stores, for each split table, a first file for storing the table structure, a second file for storing the table data, and a third file for storing the table index. When establishing an index, the SQL node 3122 may index a field whose values are unique within the table, such as the student number in a student table; a field that frequently needs to be sorted, grouped, or joined; or a field that is often used as a query condition. When establishing indexes, the SQL node 3122 needs to consider the size of the storage space and keep the number of indexes as small as possible in order to improve efficiency.
The data access module 313 comprises a data retrieval unit 3131 and a table generation unit 3132.
The data retrieval unit 3131 looks up, according to the input instruction, the table index contained in the third file on the SQL node 3122 to obtain the first file and the second file, outputs the first file and the second file to the table generation unit 3132, reads the first data stored in the data node 3123 connected to the SQL node 3122, and outputs the first data to the table generation unit 3132. The input instruction is a command for looking up the first data.
The table generation unit 3132 generates a table according to the first file and the second file, inserts the first data into the table, and outputs the table.
Fig. 3 is a flowchart of the method for distributed data processing of the present invention. The method is described below with reference to Fig. 3:
Step 41: extract first data from the distributed data sources.
In this step, the data acquisition module 311 extracts the first data from a plurality of distributed data sources according to a preset extraction condition.
Step 42: perform data cleaning and conversion on the first data according to a preset cleaning and conversion rule.
In this step, the data acquisition module 311 performs data cleaning and conversion on the first data according to the preset cleaning and conversion rule, so that the first data meets the storage requirements of the data warehouse module 312.
Step 43: load the cleaned and converted first data into a table.
In this step, the data acquisition module 311 loads the cleaned and converted first data into a table and loads the table containing the first data into the data warehouse module 312, so that the data warehouse module 312 can store and maintain the data effectively.
Step 44: split the table according to a preset table-splitting rule, extract the first data from the split tables to obtain table data and the first data, and save the table data.
This step is performed by the data warehouse module 312.
This step comprises:
Step 441: the SQL node 3122 included in the data warehouse module 312 establishes, according to the table data it stores, a one-to-one connection with the data node 3123 that stores the first data of the table;
Step 442: the SQL node 3122 included in the data warehouse module 312 splits the table according to the preset table-splitting rule;
Step 443: the SQL node 3122 included in the data warehouse module 312 extracts the first data from the split tables to obtain the table data and the first data, saves the table data of the split tables, and outputs the first data to the data node 3123 included in the data warehouse module 312.
Preferably, after step 443, the method further comprises:
Step 444: the SQL node 3122 included in the data warehouse module 312 establishes a table index corresponding to the table data;
Step 445: the SQL node 3122 included in the data warehouse module 312 stores, for each split table, a first file for storing the table structure, a second file for storing the table data, and a third file for storing the table index.
The preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule. The vertical splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 having N1 fields and a second sub-table Tab2 having (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result of the calculation.
Step 45: classify the first data according to a preset partitioning rule and save the first data in the corresponding partitions.
This step is performed by the data warehouse module 312.
This step comprises:
Step 451: the data node 3123 included in the data warehouse module 312 obtains the configuration file from the management node 3121 included in the data warehouse module 312, retrieves the configuration data, and completes node configuration;
Step 452: the data node 3123 included in the data warehouse module 312 classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
The preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning.
Range partitioning is based on the data recorded in a field and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to each partition. List partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition. HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition. KEY partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to each partition.
Step 46: according to an input instruction, read the table data and the first data, load the first data into the table corresponding to the table data, and output the table.
This step comprises:
Step 461: the data access module 313 looks up, according to the input instruction, the table index contained in the third file on the SQL node 3122 included in the data warehouse module 312, and obtains the first file and the second file;
Step 462: the data access module 313 reads, according to the first file and the second file, the first data stored in the data node connected to the SQL node 3122 included in the data warehouse module 312;
Step 463: the data access module 313 generates a table according to the first file and the second file, inserts the first data into the table, and outputs the table.
Preferably, before step 44, the method further comprises: the management node 3121 included in the data warehouse module 312 starts or shuts down nodes, and manages the configuration file and the log file, according to an externally input instruction. Each node in this step is an SQL node 3122 or a data node 3123 included in the data warehouse module 312.
In the above preferred embodiments of the present invention, massive data is no longer processed by an existing distributed data processing system built on a large database such as Oracle, DB2, or Sybase, but is instead processed on a MySQL distributed database, which reduces cost. During processing, the data is handled by partitioning and table splitting, which shortens the time spent on system maintenance.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.