Summary of the invention
In view of this, one object of the present invention is to provide a system for distributed data processing that reduces cost and shortens the time spent on system maintenance.
Another object of the present invention is to provide a method for distributed data processing that reduces cost and shortens the time spent on system maintenance.
To achieve the above objects, the technical solution of the present invention is implemented as follows:
A system for distributed data processing, the system comprising:
A data acquisition module, configured to extract first data from distributed data sources according to a preset extraction condition, cleanse and convert the first data according to a preset cleansing and conversion rule, load the cleansed and converted first data into a table, and load the table into a data warehouse module; the preset cleansing and conversion rule consists of a preset data screening condition and the attributes of the running system, where the supported data format can be determined from the attributes of the running system so that the first data is converted into a data format that the system supports;
A data warehouse module, configured to split the table according to a preset table-splitting rule, extract the first data from the split tables to obtain table data and the first data, save the table data, classify the first data according to a preset partitioning rule, and save the first data in the corresponding partitions;
A data access module, configured to read the table data and the first data from the data warehouse module according to an input instruction, load the first data into the table corresponding to the table data, and output the table containing the first data.
In the above system, the data acquisition module comprises:
A data extraction unit, configured to extract the first data from the distributed data sources according to the preset extraction condition and output it to a data processing unit;
A data processing unit, configured to cleanse and convert the first data according to the preset cleansing and conversion rule and output it to a data loading unit;
A data loading unit, configured to organize the first data, load the first data into a table, and load the table containing the first data into the data warehouse module.
In the above system, the data warehouse module comprises:
A management node, configured to start or shut down the SQL nodes and the data nodes according to an externally input instruction, manage the configuration file and the log file, and write the key information reported by the data nodes into the log file;
At least one SQL node, where each SQL node establishes, according to the table data it saves, a one-to-one connection with the data node that saves the first data of the table; the SQL node splits the table according to the preset table-splitting rule, extracts the first data from the split tables to obtain the table data and the first data, outputs the first data to the data node, and saves the table data after splitting;
At least one data node, configured to obtain the configuration file from the management node, retrieve the configuration data, complete the configuration of the node, classify the first data according to the preset partitioning rule, and save it in the corresponding partitions.
Preferably, the SQL node also creates a corresponding table index for the table data, and stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
In the above system, the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result.
In the above system, the preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning;
RANGE partitioning is based on the data recorded in a field, and uses a preset continuous range of values as the selection condition for the data in the partition file corresponding to each partition;
LIST partitioning is based on the data recorded in a field, and uses a preset attribute value as the selection condition for the data in the partition file corresponding to each partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions, and uses the hash result as the selection condition for the data in the partition file corresponding to each partition;
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression, and uses the result as the selection condition for the data in the partition file corresponding to each partition.
In the above system, the data access module comprises:
A data retrieval unit, configured to look up, according to the input instruction, the table index contained in the third file on the SQL node, obtain the first file and the second file, output the first file and the second file to a table generation unit, read the first data saved in the data node connected to the SQL node, and output it to the table generation unit;
A table generation unit, configured to generate a table according to the first file and the second file, fill the first data into the table, and output the table.
A method for distributed data processing, the method comprising:
A. A data acquisition module extracts first data from distributed data sources, cleanses and converts the first data according to a preset cleansing and conversion rule, and loads the cleansed and converted first data into a table; the preset cleansing and conversion rule consists of a preset data screening condition and the attributes of the running system, where the supported data format can be determined from the attributes of the running system so that the first data is converted into a data format that the system supports;
B. A data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, and saves the table data;
C. The data warehouse module classifies the first data according to a preset partitioning rule and saves it in the corresponding partitions;
D. A data access module reads the table data and the first data according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table.
Preferably, before step B, the method further comprises:
The management node of the data warehouse module starts or shuts down a node according to an externally input instruction, and manages the configuration file and the log file;
The node is an SQL node or a data node of the data warehouse module.
In the above method, step B comprises:
B1. The SQL node of the data warehouse module establishes, according to the table data it saves, a one-to-one connection with the data node that saves the first data of the table;
B2. The SQL node of the data warehouse module splits the table according to the preset table-splitting rule;
B3. The SQL node of the data warehouse module extracts the first data from the split tables to obtain the table data and the first data, saves the table data after splitting, and outputs the first data to the data node of the data warehouse module.
In the above method, step C comprises:
C1. The data node of the data warehouse module obtains the configuration file from the management node of the data warehouse module, retrieves the configuration data, and completes the configuration of the node;
C2. The data node of the data warehouse module classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
Preferably, after step B3, the method further comprises:
B4. The SQL node of the data warehouse module creates a corresponding table index for the table data;
B5. The SQL node of the data warehouse module stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
In the above method, the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result.
In the above method, the preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning;
RANGE partitioning is based on the data recorded in a field, and uses a preset continuous range of values as the selection condition for the data in the partition file corresponding to each partition;
LIST partitioning is based on the data recorded in a field, and uses a preset attribute value as the selection condition for the data in the partition file corresponding to each partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions, and uses the hash result as the selection condition for the data in the partition file corresponding to each partition;
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression, and uses the result as the selection condition for the data in the partition file corresponding to each partition.
In the above method, step D comprises:
D1. The data access module looks up, according to the input instruction, the table index contained in the third file on the SQL node of the data warehouse module, and obtains the first file and the second file;
D2. The data access module reads, according to the first file and the second file, the first data saved in the data node connected to the SQL node of the data warehouse module;
D3. The data access module generates a table according to the first file and the second file, fills the first data into the table, and outputs the table.
As can be seen from the above technical solution, the present invention provides a system and method for distributed data processing. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves it in the corresponding partitions. The data access module reads the table data and the first data from the data warehouse module according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. The system and method of the present invention reduce cost and shorten the time spent on system maintenance.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and embodiments.
The present invention provides a system and method for distributed data processing. The data warehouse module saves the table generated by the data acquisition module and splits it; the table data after splitting is saved in an SQL node, the first data recorded in the split tables is saved in a data node, and a correspondence is established between the table data and the first data recorded in the table, so that the data access module can obtain, from the SQL node and the data node, the table containing the data it requires. This not only saves cost, but also effectively protects large volumes of data and shortens the time spent on maintaining the data.
Fig. 2 is a schematic structural diagram of the distributed data processing system of the present invention. The system is described below with reference to Fig. 2:
The distributed data processing system of the present invention comprises: a data acquisition module 311, a data warehouse module 312, and a data access module 313.
The data acquisition module 311 extracts first data from a plurality of distributed data sources according to a preset extraction condition, cleanses and converts the first data according to a preset cleansing and conversion rule, loads the cleansed and converted first data into a table, and loads the table into the data warehouse module 312. The distributed data sources may be files containing important data, such as business data lists, logs, and call detail record (CDR) files; the first data is user data or business data, such as phone charges, wages, and call durations.
The data warehouse module 312 splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves it in the corresponding partitions. The data warehouse module 312 may be deployed on a single terminal or distributed across multiple terminals.
The preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule. The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result.
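The two splitting rules can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names, the representation of rows as dictionaries, and the use of CRC32 as the preset hash algorithm are all assumptions made for illustration.

```python
import zlib

def vertical_split(rows, key, first_fields):
    """Split each row into a Tab1 part and a Tab2 part that share the primary key.

    `first_fields` are the N1 fields kept in Tab1 (assumed to include `key`).
    Tab2 receives the remaining N2 fields plus the key, i.e. N2 + 1 fields.
    """
    tab1, tab2 = [], []
    for row in rows:
        tab1.append({f: row[f] for f in first_fields})
        rest = {f: v for f, v in row.items() if f not in first_fields}
        rest[key] = row[key]  # primary key that associates Tab2 with Tab1
        tab2.append(rest)
    return tab1, tab2

def horizontal_split(rows, key, n_tables):
    """Distribute whole rows across n_tables sub-tables by hashing the key value.

    CRC32 is only a stand-in for the preset hash algorithm.
    """
    tables = [[] for _ in range(n_tables)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        tables[digest % n_tables].append(row)
    return tables
```

With a four-field table (for example id, name, wage, position) and `first_fields=["id", "name"]`, Tab1 has N1 = 2 fields and Tab2 has N2 + 1 = 3 fields, matching the field counts given above.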
The preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning.
RANGE partitioning is based on the data recorded in a field, and uses a preset continuous range of values as the selection condition for the data in the partition file corresponding to each partition. For example, if the preset range is a time range, three partitions may be defined: 2011-10-1 to 2011-10-31, 2011-11-1 to 2011-11-30, and 2011-12-1 to 2011-12-31, each corresponding to one partition file. The first data associated with a field whose recorded time falls within 2011-10-1 to 2011-10-31 is stored in partition file A, the first data whose recorded time falls within 2011-11-1 to 2011-11-30 is stored in partition file B, and the first data whose recorded time falls within 2011-12-1 to 2011-12-31 is stored in partition file C.
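The date example above can be sketched in Python as follows. The names `RANGES` and `range_partition` are hypothetical, and treating both bounds as inclusive is an assumption; the sketch only shows how a recorded time selects a partition file.

```python
from datetime import date

# (partition file, first day, last day); both bounds are treated as inclusive.
RANGES = [
    ("A", date(2011, 10, 1), date(2011, 10, 31)),
    ("B", date(2011, 11, 1), date(2011, 11, 30)),
    ("C", date(2011, 12, 1), date(2011, 12, 31)),
]

def range_partition(day):
    """Return the partition file whose date range contains `day`."""
    for name, start, end in RANGES:
        if start <= day <= end:
            return name
    raise ValueError(f"no partition covers {day}")
```

A record dated 2011-11-15 would be placed in partition file B; a date outside all three ranges raises an error here, which is one possible policy among several.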
LIST partitioning is based on the data recorded in a field, and uses a preset attribute value as the selection condition for the data in the partition file corresponding to each partition. For example, to partition by the wage amount recorded in a field with the preset attribute values 1000, 2000, 3000, and 4000, four partitions are defined, each corresponding to one partition file. The first data associated with a field recording a wage amount of 1000 is stored in partition file A', that with a wage amount of 2000 in partition file B', that with a wage amount of 3000 in partition file C', and that with a wage amount of 4000 in partition file D'. The first data associated with the field may include the other data recorded alongside it, such as position and hiring date.
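The wage example can be sketched in the same way. The mapping `LIST_PARTITIONS`, the function name, and the default field name `wage` are hypothetical; the whole record, including fields such as position, travels with the partitioning field.

```python
# Preset attribute value -> partition file, as in the wage example.
LIST_PARTITIONS = {1000: "A'", 2000: "B'", 3000: "C'", 4000: "D'"}

def list_partition(record, field="wage"):
    """Return the partition file for a record based on one field's value."""
    value = record[field]
    if value not in LIST_PARTITIONS:
        raise ValueError(f"no partition is defined for {field}={value}")
    return LIST_PARTITIONS[value]
```

For instance, a record with a wage of 2000 and a position of "engineer" is assigned to partition file B' as a whole.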
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions, and uses the hash result as the selection condition for the data in the partition file corresponding to each partition. HASH partitioning is mainly used to ensure that the first data associated with the field is distributed evenly across a predetermined number of partitions.
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression, and uses the result as the selection condition for the data in the partition file corresponding to each partition. KEY partitioning differs from HASH partitioning mainly in the expression used for the calculation: HASH partitioning uses a hash function as its expression, whereas the hash function used for KEY partitioning may be provided by the MySQL server, or an expression set by the user may be used. The specific expressions are not described further here.
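Both rules reduce to evaluating an expression on the field value and mapping the result onto a fixed number of partitions. The Python sketch below shows that shared shape under stated assumptions: `crc32_hash` is only a stand-in and is not the function the MySQL server uses internally, and `pick_partition` is a hypothetical name.

```python
import zlib

def crc32_hash(value):
    """Stand-in hash function; NOT the function MySQL uses internally."""
    return zlib.crc32(str(value).encode("utf-8"))

def pick_partition(value, n_partitions, expression=crc32_hash):
    """Evaluate the expression on the field value and map it to a partition number."""
    return expression(value) % n_partitions
```

Passing a different `expression`, for example `lambda v: v` for an integer field, changes how values are spread while the number of partitions stays fixed. The same value always lands in the same partition, which is what allows the result to act as a selection condition.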
The data access module 313 reads the table data and the first data from the data warehouse module 312 according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. The data access module 313 provides users with an interface for accessing the data in the data warehouse module 312.
The data acquisition module 311 comprises: a data extraction unit 3111, a data processing unit 3112, and a data loading unit 3113.
The data extraction unit 3111 extracts the first data from a plurality of distributed data sources according to the preset extraction condition and outputs it to the data processing unit 3112. The data extraction unit 3111 is connected to the plurality of data sources and can extract the required first data from them.
The data processing unit 3112 cleanses and converts the first data according to the preset cleansing and conversion rule and outputs it to the data loading unit 3113. The preset cleansing and conversion rule consists of a preset data screening condition and the attributes of the running system; the supported data format can be determined from the attributes of the running system so that the first data is converted into a data format that the system supports.
The data loading unit 3113 organizes the first data, loads the first data into a table, and loads the table containing the first data into the data warehouse module 312. The data loading unit 3113 generates a table to hold the first data, so that data is subsequently stored and output in table form.
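The division of work among the three units can be sketched as a small extract, cleanse and convert, load pipeline in Python. Everything here is illustrative: the function names, the sample sources, and the rule that an empty fee is screened out and fees are converted to floating-point numbers are assumptions, not details given in the text.

```python
def extract(sources, condition):
    """Data extraction unit: pull matching records from every source."""
    return [rec for source in sources for rec in source if condition(rec)]

def cleanse_and_convert(records, keep, convert):
    """Data processing unit: drop records that fail screening, convert the rest."""
    return [convert(rec) for rec in records if keep(rec)]

def load(records, columns):
    """Data loading unit: arrange the records into a table with fixed columns."""
    return {
        "columns": list(columns),
        "rows": [[rec[col] for col in columns] for rec in records],
    }

# Hypothetical sources: a billing list and a call log.
billing = [{"user": "u1", "fee": "12.50"}, {"user": "u2", "fee": ""}]
call_log = [{"user": "u3", "fee": "7"}]

first_data = extract([billing, call_log], condition=lambda r: "fee" in r)
cleaned = cleanse_and_convert(
    first_data,
    keep=lambda r: r["fee"] != "",                     # assumed screening condition
    convert=lambda r: {**r, "fee": float(r["fee"])},   # assumed format conversion
)
table = load(cleaned, columns=["user", "fee"])
```

Keeping the screening condition and the conversion as parameters mirrors the idea that the rule is preset rather than built into the units themselves.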
The data warehouse module 312 comprises: a management node 3121, at least one SQL node 3122, and at least one data node 3123. The management node 3121 is located on one terminal; the SQL node 3122 may be located on the same terminal as the management node 3121 or on a different terminal, and likewise the data node 3123 may be located on the same terminal as the management node 3121 or on a different terminal.
The management node 3121 starts or shuts down the SQL node 3122 and the data node 3123 according to an externally input instruction, manages the configuration file and the log file, and writes the key information reported by the data node 3123 into the log file. The configuration file records the configuration of each individual data node in the cluster and the communication links between the data nodes. The management node 3121 can be started with the command ndb_mgmd, and is started before the SQL node 3122 and the data node 3123.
Each SQL node 3122 establishes, according to the table data it saves, a one-to-one connection with the data node 3123 that saves the first data of the table. The SQL node 3122 splits the table according to the preset table-splitting rule, extracts the first data from the split tables to obtain the table data and the first data, outputs the first data to the data node 3123, and saves the table data after splitting. The SQL node 3122 can be started with the command mysqld --ndbcluster, or with mysqld after ndbcluster has been added to my.cnf, so that it can access the data nodes 3123 in the cluster.
The data node 3123 obtains the configuration file from the management node 3121, retrieves the configuration data, completes the configuration of the node, classifies the first data according to the preset partitioning rule, and saves it in the corresponding partitions. The number of data nodes 3123 is related to the number of replicas and is a multiple of the number of fragments; for example, with two replicas of two fragments each, there are 4 data nodes 3123. The data node 3123 can be started with the command ndbd.
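The arithmetic behind the node-count example can be written out as a short Python sketch. The function name is hypothetical, and the formula simply restates the example in the text (replicas multiplied by fragments per replica); it is not taken from any cluster configuration reference.

```python
def data_node_count(replicas, fragments_per_replica):
    """Number of data nodes implied by the replica and fragment counts."""
    if replicas < 1 or fragments_per_replica < 1:
        raise ValueError("both counts must be at least 1")
    return replicas * fragments_per_replica

# The example from the text: two replicas with two fragments each.
print(data_node_count(2, 2))  # prints 4
```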
To make lookups easier and data retrieval more efficient, the SQL node 3122 also creates a corresponding table index for the table data, and stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index. When creating indexes, the SQL node 3122 may index a field whose values are unique in the table, such as the student number in a student table; it may also index fields that are frequently used in sorting, grouping, and union operations, or fields that are often used as query conditions. When creating indexes, the SQL node 3122 needs to take the available storage space into account and keep the number of indexes as small as possible in order to improve efficiency.
The data access module 313 comprises: a data retrieval unit 3131 and a table generation unit 3132.
The data retrieval unit 3131 looks up, according to the input instruction, the table index contained in the third file on the SQL node 3122, obtains the first file and the second file, outputs the first file and the second file to the table generation unit 3132, reads the first data saved in the data node 3123 connected to the SQL node 3122, and outputs it to the table generation unit 3132. The input instruction is a command for looking up the first data.
The table generation unit 3132 generates a table according to the first file and the second file, fills the first data into the table, and outputs the table.
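The read path through the two units can be sketched in Python as follows. The dictionary layout standing in for the SQL node and the data node, the key names `index`, `structure`, and `table_data`, and the `title` entry are all hypothetical; the sketch only mirrors the order of operations described above.

```python
def retrieve(sql_node, data_node, table_name):
    """Data retrieval unit: find the table via the index, then fetch its records."""
    entry = sql_node["index"][table_name]        # third file: table index
    structure = sql_node["structure"][entry]     # first file: table structure
    table_data = sql_node["table_data"][entry]   # second file: table data
    first_data = data_node[entry]                # records on the connected data node
    return structure, table_data, first_data

def generate_table(structure, table_data, first_data):
    """Table generation unit: rebuild the table and fill in the first data."""
    rows = [[record.get(col) for col in structure] for record in first_data]
    return {"title": table_data["title"], "columns": structure, "rows": rows}
```

A caller would chain the two, for example `generate_table(*retrieve(sql_node, data_node, "wages"))`, so that the table layout comes from the SQL node and the records come from the data node.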
Fig. 3 is a flowchart of the distributed data processing method of the present invention. The method is described below with reference to Fig. 3:
Step 41: extract first data from the distributed data sources;
In this step, the data acquisition module 311 extracts the first data from a plurality of distributed data sources according to the preset extraction condition.
Step 42: cleanse and convert the first data according to the preset cleansing and conversion rule;
In this step, the data acquisition module 311 cleanses and converts the first data according to the preset cleansing and conversion rule, so that the first data meets the storage requirements of the data warehouse module 312.
Step 43: load the cleansed and converted first data into a table;
In this step, the data acquisition module 311 loads the cleansed and converted first data into a table, and loads the table containing the first data into the data warehouse module 312, so that the data warehouse module 312 can save and maintain the data effectively.
Step 44: split the table according to the preset table-splitting rule, extract the first data from the split tables to obtain table data and the first data, and save the table data;
This step is performed by the data warehouse module 312.
This step comprises:
Step 441: the SQL node 3122 of the data warehouse module 312 establishes, according to the table data it saves, a one-to-one connection with the data node 3123 that saves the first data of the table;
Step 442: the SQL node 3122 of the data warehouse module 312 splits the table according to the preset table-splitting rule;
Step 443: the SQL node 3122 of the data warehouse module 312 extracts the first data from the split tables to obtain the table data and the first data, saves the table data after splitting, and outputs the first data to the data node 3123 of the data warehouse module 312.
Preferably, after step 443, the method further comprises:
Step 444: the SQL node 3122 of the data warehouse module 312 creates a corresponding table index for the table data;
Step 445: the SQL node 3122 of the data warehouse module 312 stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
The preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule. The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result.
Step 45: classify the first data according to the preset partitioning rule, and save it in the corresponding partitions;
This step is performed by the data warehouse module 312.
This step comprises:
Step 451: the data node 3123 of the data warehouse module 312 obtains the configuration file from the management node 3121 of the data warehouse module 312, retrieves the configuration data, and completes the configuration of the node;
Step 452: the data node 3123 of the data warehouse module 312 classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
The preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning.
RANGE partitioning is based on the data recorded in a field, and uses a preset continuous range of values as the selection condition for the data in the partition file corresponding to each partition. LIST partitioning is based on the data recorded in a field, and uses a preset attribute value as the selection condition. HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions, and uses the hash result as the selection condition. KEY partitioning performs a calculation on the data recorded in a field according to a preset expression, and uses the result as the selection condition.
Step 46: read the table data and the first data according to the input instruction, load the first data into the table corresponding to the table data, and output the table.
This step comprises:
Step 461: the data access module 313 looks up, according to the input instruction, the table index contained in the third file on the SQL node 3122 of the data warehouse module 312, and obtains the first file and the second file;
Step 462: the data access module 313 reads, according to the first file and the second file, the first data saved in the data node 3123 connected to the SQL node 3122 of the data warehouse module 312;
Step 463: the data access module 313 generates a table according to the first file and the second file, fills the first data into the table, and outputs the table.
Preferably, before step 44, the method further comprises: the management node 3121 of the data warehouse module 312 starts or shuts down a node according to an externally input instruction, and manages the configuration file and the log file; the node in this step is an SQL node 3122 or a data node 3123 of the data warehouse module 312.
In the above preferred embodiment of the present invention, the existing distributed data processing systems based on large databases such as ORACLE, DB2, and SYBASE are no longer used; instead, large volumes of data are processed on a MySQL distributed database, which reduces cost. During processing, the data is handled by partitioning and table splitting, which shortens the time spent on system maintenance.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.