CN102542071B - Distributed data processing system and method - Google Patents


Info

Publication number
CN102542071B
Authority
CN
China
Prior art keywords
data
file
default
subregion
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210013801.2A
Other languages
Chinese (zh)
Other versions
CN102542071A (en)
Inventor
李海军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Coship Electronics Co Ltd
Original Assignee
SHENZHEN LONGGUAN MEDIA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by SHENZHEN LONGGUAN MEDIA CO Ltd
Priority to CN201210013801.2A
Publication of CN102542071A
Application granted
Publication of CN102542071B
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a distributed data processing system and method. The distributed data processing system comprises a data acquisition module, a data warehouse module and a data access module, wherein the data acquisition module is used for acquiring first data through extraction from distributed data sources, carrying out data cleaning and conversion on the first data according to a preset cleaning conversion rule, and loading the cleaned and converted first data into a table; the data warehouse module is used for splitting the table according to a preset table-splitting rule, extracting the first data from the split table, acquiring table data and the first data, and storing the table data; the data warehouse module is used for classifying the first data according to a preset partitioning rule, and storing the classified first data into corresponding partitions; and the data access module is used for reading the table data and the first data according to an input instruction, loading the first data into a table corresponding to the table data and outputting the table. By the adoption of the system and the method, provided by the invention, the cost can be reduced, and the time for system maintenance is shortened.

Description

A system and method for distributed data processing
Technical field
The present invention relates to data processing technology, and in particular to a system and method for distributed data processing.
Background art
A distributed database system is a collection of data that logically belongs to one system but is physically distributed over multiple nodes connected by a computer network. The nodes are linked together in a communication network, and each node is an independent database system with its own database, central processing unit (CPU) and terminals, as well as its own local database management system (local DBMS). In a distributed database system, user data is generally distributed by user across different node databases (DBs). Each time user data is accessed or modified, the node database holding that user's data must first be located, so the information identifying that node database is important status data for the user.
Fig. 1 is a schematic structural diagram of the user distribution device in an existing distributed database system. The device is described below with reference to Fig. 1:
When a user registers, the user distribution control module 21 obtains the user distribution weights of the node databases in the current system and, according to those weights, assigns the user a DBid corresponding to the user's user id, so that the users in the system are distributed evenly across the node databases.
The user distribution information database 22 stores user distribution information, which comprises each user id and its corresponding DBid and may also include the current state of each user's data.
When an access request is received, the user access control unit 23 queries the user distribution information database 22 by user id to obtain the DBid of the node database that stores that user's data, and then accesses the user data in the node database corresponding to that DBid. When the system upgrades or migrates user data, the user data state configuration unit 24 changes the current state of that user data in the user distribution information database 22 to a maintenance state; after the upgrade or migration, it changes the current state back to the normal state in which access is allowed.
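The register, look-up and state-change flow of units 21 to 24 can be sketched as follows. This is a minimal illustration under stated assumptions, not the prior-art implementation: the class name `UserDistribution`, the policy of assigning each new user to the node with the lowest users-to-weight ratio, and the state strings are all hypothetical.

```python
class UserDistribution:
    """Toy model of units 21-24: assigns users to node databases and tracks state."""

    def __init__(self, weights):
        # weights: {db_id: distribution weight}; a higher weight attracts more users
        # (assumed policy, the source does not define how weights are applied).
        self.weights = dict(weights)
        self.counts = {db_id: 0 for db_id in weights}
        self.user_db = {}      # user id -> DBid (the user distribution information)
        self.user_state = {}   # user id -> "normal" or "maintenance"

    def register(self, user_id):
        # Pick the node whose load relative to its weight is currently lowest.
        db_id = min(self.counts, key=lambda d: (self.counts[d] / self.weights[d], d))
        self.counts[db_id] += 1
        self.user_db[user_id] = db_id
        self.user_state[user_id] = "normal"
        return db_id

    def set_state(self, user_id, state):
        # Used around an upgrade or migration of one user's data.
        self.user_state[user_id] = state

    def locate(self, user_id):
        # Refuse access while this user's data is being upgraded or migrated.
        if self.user_state[user_id] != "normal":
            raise RuntimeError(f"user {user_id} is under maintenance")
        return self.user_db[user_id]
```

Only the user whose state is set to maintenance is blocked; look-ups for every other user continue to work, which is the property the background section attributes to the existing device.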
The user distribution device in an existing distributed database system takes full account of how existing users are distributed across the node databases when a user registers, so users are distributed evenly among the node databases in the system. When user data is upgraded or migrated, only the users whose data is being upgraded or migrated are affected; access to other users' data is unaffected. However, existing distributed database systems are usually built on large databases such as ORACLE, DB2 and SYBASE. When processing massive amounts of data, they are costly and their maintenance is time-consuming, so further improvement is needed.
Summary of the invention
In view of this, an object of the present invention is to provide a system for distributed data processing that reduces cost and shortens the time spent on system maintenance.
Another object of the present invention is to provide a method for distributed data processing that reduces cost and shortens the time spent on system maintenance.
To achieve the above objects, the technical solution of the present invention is implemented as follows:
A system for distributed data processing, the system comprising:
A data acquisition module, which extracts first data from distributed data sources according to a preset extraction condition, cleans and converts the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into the data warehouse module; the preset cleaning and conversion rule consists of a preset data screening condition and the properties of the running system, the supported data formats being determined from those system properties so that the first data can be converted into a data format the system supports;
A data warehouse module, which splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves the classified first data in the corresponding partitions;
A data access module, which, according to an input instruction, reads the table data and the first data from the data warehouse module, loads the first data into the table corresponding to the table data, and outputs the table containing the first data.
In the above system, the data acquisition module comprises:
A data extraction unit, which extracts the first data from the distributed data sources according to the preset extraction condition and outputs it to the data processing unit;
A data processing unit, which cleans and converts the first data according to the preset cleaning and conversion rule and outputs the result to the data loading unit;
A data loading unit, which organizes the first data, loads it into a table, and loads the table containing the first data into the data warehouse module.
In the above system, the data warehouse module comprises:
A management node, which starts or shuts down the SQL nodes and data nodes according to externally input instructions, manages the configuration file and the log file, and writes key information reported by the data nodes to the log file;
At least one SQL node, each SQL node establishing, according to the table data it holds, a one-to-one connection with the data node that holds the first data of the table; the SQL node splits the table according to the preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, outputs the first data to the data node, and saves the table data of the split table;
At least one data node, which obtains the configuration file from the management node, retrieves the configuration data, completes its own configuration, classifies the first data according to the preset partitioning rule, and saves it in the corresponding partitions.
Preferably, the SQL node also builds a corresponding table index for the table data, and stores each split table as a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
In the above system, the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal splitting rule applies a preset hash algorithm to the first data in the table and splits the table according to the result.
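The vertical splitting rule can be illustrated with a short sketch. It assumes rows are held as Python dictionaries and that the linking primary key is one of the N1 fields kept in Tab1; the function name `vertical_split` and the sample field names are hypothetical and do not come from the patent.

```python
def vertical_split(rows, key, tab1_fields):
    """Split rows with (N1+N2) fields into Tab1 (N1 fields) and Tab2 (N2+1 fields)."""
    if key not in tab1_fields:
        raise ValueError("this sketch assumes the key is one of Tab1's fields")
    tab1, tab2 = [], []
    for row in rows:
        tab1.append({f: row[f] for f in tab1_fields})
        # Tab2 keeps the remaining N2 fields plus the key that links back to Tab1.
        rest = {f: v for f, v in row.items() if f not in tab1_fields}
        rest[key] = row[key]
        tab2.append(rest)
    return tab1, tab2


rows = [{"id": 1, "name": "a", "salary": 1000, "hire_date": "2011-10-01"}]
tab1, tab2 = vertical_split(rows, key="id", tab1_fields=["id", "name"])
```

Here Tab has 4 fields (N1 = 2, N2 = 2), so each Tab1 row has 2 fields and each Tab2 row has 3, the extra one being the `id` used to join the two sub-tables back together.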
In the above system, the preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning;
RANGE partitioning is based on the data recorded in a field and uses a configured range of consecutive values as the selection condition for the data in the partition file corresponding to a partition;
LIST partitioning is based on the data recorded in a field and uses configured attribute values as the selection condition for the data in the partition file corresponding to a partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition;
KEY partitioning evaluates a preset expression over the data recorded in a field and uses the result as the selection condition for the data in the partition file corresponding to a partition.
In the above system, the data access module comprises:
A data retrieval unit, which, according to the input instruction, looks up the table index contained in the third file on the SQL node, obtains the first file and the second file, outputs them to the table generation unit, reads the first data held in the data node connected to the SQL node, and outputs it to the table generation unit;
A table generation unit, which generates a table from the first file and the second file, inserts the first data into the table, and outputs it.
A method for distributed data processing, the method comprising:
A. The data acquisition module extracts first data from distributed data sources, cleans and converts the first data according to a preset cleaning and conversion rule, and loads the cleaned and converted first data into a table; the preset cleaning and conversion rule consists of a preset data screening condition and the properties of the running system, the supported data formats being determined from those system properties so that the first data can be converted into a data format the system supports;
B. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and saves the table data;
C. The data warehouse module classifies the first data according to a preset partitioning rule and saves it in the corresponding partitions;
D. The data access module, according to an input instruction, reads the table data and the first data, loads the first data into the table corresponding to the table data, and outputs the table.
Preferably, before step B, the method further comprises:
The management node of the data warehouse module starts or shuts down nodes according to externally input instructions and manages the configuration file and the log file;
The nodes are the SQL nodes or the data nodes of the data warehouse module.
In the above method, step B comprises:
B1. The SQL node of the data warehouse module establishes, according to the table data it holds, a one-to-one connection with the data node that holds the first data of the table;
B2. The SQL node of the data warehouse module splits the table according to the preset table-splitting rule;
B3. The SQL node of the data warehouse module extracts the first data from the split table to obtain table data and the first data, saves the table data of the split table, and outputs the first data to the data node of the data warehouse module.
In the above method, step C comprises:
C1. The data node of the data warehouse module obtains the configuration file from the management node of the data warehouse module, retrieves the configuration data, and completes its own configuration;
C2. The data node of the data warehouse module classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
Preferably, after step B3, the method further comprises:
B4. The SQL node of the data warehouse module builds a corresponding table index for the table data;
B5. The SQL node of the data warehouse module stores each split table as a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
In the above method, the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal splitting rule applies a preset hash algorithm to the first data in the table and splits the table according to the result.
In the above method, the preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning;
RANGE partitioning is based on the data recorded in a field and uses a configured range of consecutive values as the selection condition for the data in the partition file corresponding to a partition;
LIST partitioning is based on the data recorded in a field and uses configured attribute values as the selection condition for the data in the partition file corresponding to a partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition;
KEY partitioning evaluates a configured expression over the data recorded in a field and uses the result as the selection condition for the data in the partition file corresponding to a partition.
In the above method, step D comprises:
D1. The data access module, according to the input instruction, looks up the table index contained in the third file on the SQL node of the data warehouse module and obtains the first file and the second file;
D2. The data access module, according to the first file and the second file, reads the first data held in the data node connected to the SQL node of the data warehouse module;
D3. The data access module generates a table from the first file and the second file, inserts the first data into the table, and outputs it.
As can be seen from the above technical solution, the present invention provides a system and method for distributed data processing. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves it in the corresponding partitions. The data access module, according to an input instruction, reads the table data and the first data from the data warehouse module, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. With the system and method of the present invention, cost can be reduced and the time spent on system maintenance shortened.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the user distribution device in an existing distributed database system;
Fig. 2 is a schematic structural diagram of the distributed data processing system of the present invention;
Fig. 3 is a flowchart of the distributed data processing method of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The present invention provides a system and method for distributed data processing. The data warehouse module stores the table generated by the data acquisition module and splits it; the table data of the split table is kept in the SQL node, the first data recorded in the split table is kept in the data node, and a correspondence is established between the table data and the first data recorded in that table, so that the data access module can obtain from the SQL node and the data node a table containing the data it needs. This not only saves cost but also protects massive data effectively and shortens the time the system spends maintaining the data.
Fig. 2 is a schematic structural diagram of the distributed data processing system of the present invention. The system is described below with reference to Fig. 2:
The distributed data processing system of the present invention comprises a data acquisition module 311, a data warehouse module 312 and a data access module 313.
The data acquisition module 311 extracts first data from multiple distributed data sources according to a preset extraction condition, cleans and converts the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into the data warehouse module 312. The distributed data sources may be files containing important data, such as business data lists, logs and CDR files; the first data is user data or business data, for example telephone charges, wages or call durations.
The data warehouse module 312 splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves it in the corresponding partitions. The data warehouse module 312 may be deployed on a single terminal or distributed over multiple terminals.
The preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule. The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal splitting rule applies a preset hash algorithm to the first data in the table and splits the table according to the result.
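A horizontal split can be sketched as hashing one field of each row and using the result to choose a sub-table. The patent does not name the hash algorithm, so `zlib.crc32` is used here purely as a deterministic stand-in; the function name `horizontal_split` and its parameters are hypothetical.

```python
import zlib


def horizontal_split(rows, key, n_tables):
    """Distribute rows over n_tables sub-tables by hashing the value of `key`."""
    sub_tables = [[] for _ in range(n_tables)]
    for row in rows:
        # crc32 is an assumed stand-in for the "preset hash algorithm".
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        sub_tables[digest % n_tables].append(row)
    return sub_tables


rows = [{"user": f"u{i}", "fee": i * 10} for i in range(10)]
parts = horizontal_split(rows, key="user", n_tables=3)
```

Every sub-table keeps the full set of fields; only the rows are divided, and a given key value always lands in the same sub-table.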
The preset partitioning rule is RANGE partitioning, LIST partitioning, HASH partitioning, or KEY partitioning.
RANGE partitioning is based on the data recorded in a field and uses a configured range of consecutive values as the selection condition for the data in the partition file corresponding to each partition. For example, if the configured ranges are time ranges, three partitions can be defined: 2011-10-1 to 2011-10-31, 2011-11-1 to 2011-11-30, and 2011-12-1 to 2011-12-31, each with its own partition file. First data associated with a field whose recorded time falls within 2011-10-1 to 2011-10-31 is stored in partition file A; first data whose recorded time falls within 2011-11-1 to 2011-11-30 is stored in partition file B; and first data whose recorded time falls within 2011-12-1 to 2011-12-31 is stored in partition file C.
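The date-range example can be expressed as a small lookup. This is a sketch only: the dictionary `RANGES` and the function `range_partition` are hypothetical names, and the ranges are treated as inclusive at both ends, which the text implies but does not state.

```python
from datetime import date

# Partition file -> inclusive date range, taken from the example in the text.
RANGES = {
    "A": (date(2011, 10, 1), date(2011, 10, 31)),
    "B": (date(2011, 11, 1), date(2011, 11, 30)),
    "C": (date(2011, 12, 1), date(2011, 12, 31)),
}


def range_partition(record_date):
    """Return the partition file whose range contains record_date."""
    for name, (start, end) in RANGES.items():
        if start <= record_date <= end:
            return name
    raise ValueError(f"{record_date} is outside every configured range")
```

A record dated 2011-11-15 is routed to partition file B, while a date outside all three ranges raises an error instead of being silently dropped.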
LIST partitioning is based on the data recorded in a field and uses configured attribute values as the selection condition for the data in the partition file corresponding to each partition. For example, to partition by the wage amount recorded in a field with configured attribute values 1000, 2000, 3000 and 4000, four partitions can be defined, each with its own partition file. First data associated with a field recording a wage amount of 1000 is stored in partition file A'; a wage amount of 2000, in partition file B'; a wage amount of 3000, in partition file C'; and a wage amount of 4000, in partition file D'. Here, the first data associated with the field may be other data recorded alongside that field, such as position and hiring date.
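The wage example maps each configured attribute value to one partition file, which a dictionary captures directly. The names `LIST_PARTITIONS` and `list_partition`, and the field name `salary`, are assumptions made for this sketch.

```python
# Configured attribute value -> partition file, following the wage example.
LIST_PARTITIONS = {1000: "A'", 2000: "B'", 3000: "C'", 4000: "D'"}


def list_partition(record, field="salary"):
    """Return the partition file for a record based on the listed value of `field`."""
    value = record[field]
    if value not in LIST_PARTITIONS:
        raise ValueError(f"{value} is not in any configured list")
    return LIST_PARTITIONS[value]


record = {"salary": 3000, "position": "engineer", "hire_date": "2011-10-08"}
target = list_partition(record)
```

The whole record, including the position and hiring date stored alongside the wage, goes to the partition selected by the wage value.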
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition. HASH partitioning is used mainly to ensure that the first data associated with the field is distributed evenly over a predetermined number of partitions.
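The even-distribution property is easy to see with an integer field and a modulus over the number of partitions. Treating the integer value itself as the hash is an assumption made for this sketch; the patent does not specify the hash calculation, and `hash_partition` is a hypothetical name.

```python
from collections import Counter


def hash_partition(value, num_partitions):
    """Map an integer field value to a partition number in [0, num_partitions)."""
    # Assumed hash: the integer value taken modulo the preset number of partitions.
    return value % num_partitions


# Sequential ids spread evenly over 4 partitions.
spread = Counter(hash_partition(i, 4) for i in range(100))
```

With 100 sequential ids and 4 partitions, each partition receives 25 records.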
KEY partitioning evaluates a preset expression over the data recorded in a field and uses the result as the selection condition for the data in the partition file corresponding to each partition. KEY partitioning differs from HASH partitioning mainly in the expression used for the calculation: HASH partitioning uses a hash function as its expression, whereas for KEY partitioning the hash function may be supplied by the MySQL server, or an expression set by the user may be used. The specific expressions are not detailed here.
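One way to picture the choice between a server-supplied function and a user-set expression is a partition function with a pluggable expression. This is a sketch under stated assumptions: `default_key_hash` is built on MD5 only as a deterministic stand-in and is not the function MySQL uses internally, and `key_partition` is a hypothetical name.

```python
import hashlib


def default_key_hash(value):
    """Stand-in for a server-supplied hash; NOT MySQL's actual internal function."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def key_partition(value, num_partitions, expression=default_key_hash):
    """Evaluate the expression on the field value and map it to a partition number."""
    return expression(value) % num_partitions


# A user-set expression: partition by the length of the field value.
by_length = key_partition("abcd", 3, expression=len)
```

Because the default expression hashes the string form of the value, non-integer fields can be partitioned without the caller supplying any expression.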
The data access module 313, according to an input instruction, reads the table data and the first data from the data warehouse module 312, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. The data access module 313 provides the user with an interface for accessing the data in the data warehouse module 312.
The data acquisition module 311 comprises a data extraction unit 3111, a data processing unit 3112 and a data loading unit 3113.
The data extraction unit 3111 extracts the first data from multiple distributed data sources according to the preset extraction condition and outputs it to the data processing unit 3112. The data extraction unit 3111 is connected to the multiple data sources and can extract the required first data from them.
The data processing unit 3112 cleans and converts the first data according to the preset cleaning and conversion rule and outputs the result to the data loading unit 3113. The preset cleaning and conversion rule consists of a preset data screening condition and the properties of the running system; the supported data formats are determined from those system properties so that the first data can be converted into a data format the system supports.
The data loading unit 3113 organizes the first data, loads it into a table, and loads the table containing the first data into the data warehouse module 312. The data loading unit 3113 generates a table to hold the first data so that data can subsequently be stored and output in table form.
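The three units of the data acquisition module form an extract, clean-and-convert, load chain, which can be sketched with three small functions. All names here (`extract`, `clean_and_convert`, `load`, and the sample fields `user` and `duration`) are hypothetical, and the "table" is modelled as a plain dictionary of columns and rows.

```python
def extract(sources, condition):
    """Data extraction unit: pull records that meet the extraction condition."""
    return [rec for src in sources for rec in src if condition(rec)]


def clean_and_convert(records, keep, convert):
    """Data processing unit: drop records failing the screening condition, convert the rest."""
    return [convert(rec) for rec in records if keep(rec)]


def load(records, columns):
    """Data loading unit: arrange records into a table with a fixed column order."""
    return {"columns": columns, "rows": [[rec[c] for c in columns] for rec in records]}


sources = [
    [{"user": "u1", "duration": "120"}, {"user": "u2", "duration": ""}],
    [{"user": "u3", "duration": "45"}],
]
records = extract(sources, condition=lambda r: "user" in r)
cleaned = clean_and_convert(
    records,
    keep=lambda r: r["duration"].isdigit(),  # screening: discard empty durations
    convert=lambda r: {"user": r["user"], "duration": int(r["duration"])},
)
table = load(cleaned, columns=["user", "duration"])
```

The record with an empty duration is screened out, and the remaining durations are converted from text to integers before being loaded into the table.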
The data warehouse module 312 comprises a management node 3121, at least one SQL node 3122 and at least one data node 3123. The management node 3121 is located on one terminal; the SQL node 3122 may be located on the same terminal as the management node 3121 or on a different terminal, and likewise the data node 3123 may be located on the same terminal as the management node 3121 or on a different terminal.
The management node 3121 starts or shuts down the SQL node 3122 and the data node 3123 according to externally input instructions, manages the configuration file and the log file, and writes key information reported by the data node 3123 to the log file. The configuration file records the configuration of each individual data node in the cluster and the individual communication links between the data nodes. The management node 3121 can be started with the command ndb_mgmd, and it is started before the SQL node 3122 and the data node 3123.
Each SQL node 3122 establishes, according to the table data it holds, a one-to-one connection with the data node 3123 that holds the first data of the table. The SQL node 3122 splits the table according to the preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, outputs the first data to the data node 3123, and saves the table data of the split table. The SQL node 3122 can be started with the command mysqld --ndbcluster, or with mysqld after ndbcluster has been added to my.cnf, in order to access the data nodes 3123 in the cluster.
The data node 3123 obtains the configuration file from the management node 3121, retrieves the configuration data, completes its own configuration, classifies the first data according to the preset partitioning rule, and saves it in the corresponding partitions. The number of data nodes 3123 is related to the number of replicas and is a multiple of the number of fragments; for example, with two replicas each having two fragments, there are 4 data nodes 3123. The data node 3123 can be started with the command ndbd.
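The node-count example is simple arithmetic. The sketch below assumes, as the example does, that every replica holds the same number of fragments and that each fragment copy occupies its own data node; `data_node_count` is a hypothetical helper and not part of any MySQL tool.

```python
def data_node_count(replicas, fragments_per_replica):
    """Number of data nodes when each fragment copy is placed on its own node."""
    if replicas < 1 or fragments_per_replica < 1:
        raise ValueError("both counts must be at least 1")
    return replicas * fragments_per_replica


nodes = data_node_count(replicas=2, fragments_per_replica=2)  # the example in the text
```

For two replicas of two fragments each, the result is 4 data nodes, matching the figure given above.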
To make lookups easier and data retrieval more efficient, the SQL node 3122 also builds a corresponding table index for the table data, and stores each split table as a first file holding the table structure, a second file holding the table data, and a third file holding the table index. When building indexes, the SQL node 3122 may index fields whose values are unique in the table, such as the student number in a student table; fields that are frequently sorted, grouped or joined; or fields that are often used as query conditions. The SQL node 3122 also needs to consider the storage space consumed and keep the number of indexes as small as possible in order to improve efficiency.
The data access module 313 comprises a data retrieval unit 3131 and a table generation unit 3132.
The data retrieval unit 3131, according to the input instruction, looks up the table index contained in the third file on the SQL node 3122, obtains the first file and the second file, outputs them to the table generation unit 3132, reads the first data held in the data node 3123 connected to the SQL node 3122, and outputs it to the table generation unit 3132. The input instruction is a command for looking up the first data.
The table generation unit 3132 generates a table from the first file and the second file, inserts the first data into the table, and outputs it.
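The retrieval path (index lookup, then filling rows into a table with the saved structure) can be sketched in memory. This is an illustration under assumptions, not the patented file layout: the table structure is modelled as a list of column names, the index as a dictionary from key value to row position, and the data node as a list of row dictionaries. The names `build_index` and `generate_table` and the sample rows are hypothetical.

```python
def build_index(rows, key):
    """Stand-in for the third (index) file: key value -> position of the row."""
    return {row[key]: pos for pos, row in enumerate(rows)}


def generate_table(structure, index, data_node_rows, wanted_keys):
    """Look rows up through the index and fill them into a table with the saved structure."""
    table = {"columns": list(structure), "rows": []}
    for k in wanted_keys:
        row = data_node_rows[index[k]]          # rows held by the data node
        table["rows"].append([row[c] for c in structure])
    return table


data_node_rows = [
    {"id": 7, "name": "Li", "salary": 2000},
    {"id": 9, "name": "Wang", "salary": 3000},
]
structure = ["id", "name", "salary"]            # stand-in for the first (structure) file
index = build_index(data_node_rows, "id")
result = generate_table(structure, index, data_node_rows, wanted_keys=[9])
```

Indexing a unique field such as `id` lets the lookup go straight to the matching row instead of scanning every row held by the data node.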
Fig. 3 is a flowchart of the distributed data processing method of the present invention. The method is described below with reference to Fig. 3:
Step 41: extract first data from the distributed data sources;
In this step, the data acquisition module 311 extracts the first data from multiple distributed data sources according to the preset extraction condition.
Step 42: clean and convert the first data according to the preset cleaning and conversion rule;
In this step, the data acquisition module 311 cleans and converts the first data according to the preset cleaning and conversion rule so that the first data meets the storage requirements of the data warehouse module 312.
Step 43: load the cleaned and converted first data into a table;
In this step, the data acquisition module 311 loads the cleaned and converted first data into a table and loads that table into the data warehouse module 312, so that the data warehouse module 312 can store and maintain the data effectively.
Step 44: Split the table according to a preset table-splitting rule, extract the first data from the split tables to obtain table data and the first data, and save the table data.
This step is performed by the data warehouse module 312.
This step comprises:
Step 441: the SQL node 3122 of the data warehouse module 312, according to the table data it stores, establishes a one-to-one connection with the data node 3123 that stores the first data of the table;
Step 442: the SQL node 3122 of the data warehouse module 312 splits the table according to the preset table-splitting rule;
Step 443: the SQL node 3122 of the data warehouse module 312 extracts the first data from the split tables to obtain table data and the first data, saves the table data of the split tables, and outputs the first data to the data node 3123 of the data warehouse module 312.
Preferably, after step 443, the method further comprises:
Step 444: the SQL node 3122 of the data warehouse module 312 builds a corresponding table index for the table data;
Step 445: for each split table, the SQL node 3122 of the data warehouse module 312 stores a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
The preset table-splitting rule is either a vertical table-splitting rule or a horizontal table-splitting rule. The vertical table-splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal table-splitting rule applies a preset hash algorithm to the first data in the table and splits the table according to the result.
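The two table-splitting rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's algorithm: the text does not name a hash function, so `zlib.crc32` stands in for the "preset hash algorithm", and the sketch assumes the primary key is one of the first N1 columns. All function and column names are hypothetical.

```python
import zlib

def vertical_split(columns, rows, key, n1):
    """Split a table of N1+N2 columns into Tab1 (N1 columns) and Tab2 (N2+1 columns).

    Tab2 carries the primary key as its extra column so it can be joined back to Tab1.
    """
    if key not in columns[:n1]:
        raise ValueError("this sketch assumes the primary key is one of the first N1 columns")
    k = columns.index(key)
    tab1 = {"columns": columns[:n1], "rows": [row[:n1] for row in rows]}
    tab2 = {"columns": [key] + columns[n1:], "rows": [[row[k]] + row[n1:] for row in rows]}
    return tab1, tab2

def horizontal_split(rows, key_index, n_tables):
    """Route each row to one of n_tables sub-tables by hashing its key value."""
    sub_tables = [[] for _ in range(n_tables)]
    for row in rows:
        digest = zlib.crc32(str(row[key_index]).encode("utf-8"))  # assumed hash function
        sub_tables[digest % n_tables].append(row)
    return sub_tables

columns = ["id", "name", "email", "bio"]
rows = [
    [1, "Ann", "ann@example.com", "likes tea"],
    [2, "Bob", "bob@example.com", "likes coffee"],
]

tab1, tab2 = vertical_split(columns, rows, key="id", n1=2)  # N1 = 2, N2 = 2
shards = horizontal_split(rows, key_index=0, n_tables=2)
```

With N1 = 2 and N2 = 2, Tab1 has the columns `id, name` and Tab2 has the N2+1 = 3 columns `id, email, bio`. The horizontal split keeps every column and only decides which sub-table each row belongs to.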
Step 45: Classify the first data according to a preset partitioning rule and save it in the corresponding partitions.
This step is performed by the data warehouse module 312.
This step comprises:
Step 451: the data node 3123 of the data warehouse module 312 obtains a configuration file from the management node 3121 of the data warehouse module 312, retrieves the configuration data from it, and completes the configuration of the node;
Step 452: the data node 3123 of the data warehouse module 312 classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
The preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning.
Range partitioning is based on the data recorded in a field and uses a configured range of consecutive values as the selection condition for the data in the partition file corresponding to a partition. List partitioning is based on the data recorded in a field and uses a configured set of attribute values as the selection condition for the data in the partition file corresponding to a partition. Hash partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition. Key partitioning evaluates a preset expression on the data recorded in a field and uses the result as the selection condition for the data in the partition file corresponding to a partition.
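The four partitioning rules can be sketched as four small selection functions. This is a hedged Python illustration that follows the descriptions above rather than any particular database's behaviour. The function names, the use of exclusive upper bounds for ranges, the use of `zlib.crc32` as the hash, and reducing the key expression modulo the partition count are all assumptions.

```python
import bisect
import zlib

def range_partition(value, upper_bounds):
    """Return the index of the first partition whose exclusive upper bound exceeds value."""
    index = bisect.bisect_right(upper_bounds, value)
    if index == len(upper_bounds):
        raise ValueError(f"{value!r} is beyond the last configured range")
    return index

def list_partition(value, value_lists):
    """Return the index of the partition whose configured value set contains value."""
    for index, values in enumerate(value_lists):
        if value in values:
            return index
    raise ValueError(f"{value!r} is not listed in any partition")

def hash_partition(value, n_partitions):
    """Hash the field value and reduce it modulo the preset number of partitions."""
    return zlib.crc32(str(value).encode("utf-8")) % n_partitions  # assumed hash function

def key_partition(row, expression, n_partitions):
    """Evaluate a preset expression on the row and map the result to a partition."""
    return expression(row) % n_partitions

p_range = range_partition(150, [100, 200, 300])            # 100 <= 150 < 200
p_list = list_partition("JP", [{"CN", "JP"}, {"US"}])
p_hash = hash_partition("user-42", 4)
p_key = key_partition({"id": 17}, lambda r: r["id"], 4)    # 17 % 4
```

Each function returns only a partition index; a data node would then append the row to the partition file associated with that index.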
Step 46: According to an input instruction, read the table data and the first data, load the first data into the table corresponding to the table data, and output the table.
This step comprises:
Step 461: the data access module 313, according to the input instruction, searches the table index contained in the third file on the SQL node 3122 of the data warehouse module 312 and obtains the first file and the second file;
Step 462: the data access module 313, according to the first file and the second file, reads the first data stored in the data node that is connected to the SQL node 3122 of the data warehouse module 312;
Step 463: the data access module 313 generates a table from the first file and the second file, inserts the first data into the table, and outputs the table.
Preferably, before step 44, the method further comprises: the management node 3121 of the data warehouse module 312, according to an externally input instruction, starts or shuts down a node and manages the configuration file and the log file. The node in this step is the SQL node 3122 or the data node 3123 of the data warehouse module 312.
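Steps 461 to 463 can be sketched as a lookup followed by a join of metadata and rows. The Python below is only a toy model: the class names, the dictionary-based "files", and the contents of the second file (a row count here) are hypothetical, since the text does not specify file formats.

```python
from dataclasses import dataclass, field

@dataclass
class DataNode:
    """Holds the first data (the rows), keyed by split-table name."""
    rows_by_table: dict = field(default_factory=dict)

@dataclass
class SqlNode:
    """Holds the three per-table files and a one-to-one link to a data node."""
    data_node: DataNode
    structure_files: dict = field(default_factory=dict)  # first file: table structure
    data_files: dict = field(default_factory=dict)       # second file: table data
    index_file: dict = field(default_factory=dict)       # third file: table index

def access(sql_node, instruction):
    # Step 461: look up the table index, then fetch the first and second files.
    table_name = sql_node.index_file[instruction]
    columns = sql_node.structure_files[table_name]
    table_data = sql_node.data_files[table_name]
    # Step 462: read the first data from the connected data node.
    first_data = sql_node.data_node.rows_by_table[table_name]
    # Step 463: generate the table and insert the first data into it.
    table = {"name": table_name, "columns": columns, "table_data": table_data, "rows": []}
    for row in first_data:
        table["rows"].append(list(row))
    return table

data_node = DataNode({"events_0": [(1, "play"), (2, "pause")]})
sql_node = SqlNode(
    data_node=data_node,
    structure_files={"events_0": ["user_id", "event"]},
    data_files={"events_0": {"row_count": 2}},
    index_file={"events": "events_0"},
)
result = access(sql_node, "events")
```

The point of the sketch is the separation of concerns: the SQL node answers "which table, and what does it look like", while the data node supplies the rows.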
In the preferred embodiment of the present invention described above, the existing distributed data processing systems based on large databases such as ORACLE, DB2 and SYBASE are no longer used; instead, massive data is processed on a MySQL distributed database, which reduces cost. During processing, the data is handled by partitioning and table splitting, which shortens the time spent on system maintenance.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (15)

1. A distributed data processing system, characterized in that the system comprises:
A data acquisition module, which extracts first data from distributed data sources according to a preset extraction condition, cleans and converts the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into a data warehouse module; the preset cleaning and conversion rule comprises a preset data screening condition and the attributes of the running system, the supported data format being determinable from the attributes of the running system so that the first data is converted into a data format that the system supports;
A data warehouse module, which splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, saves the table data, classifies the first data according to a preset partitioning rule, and saves the classified first data in the corresponding partitions;
A data access module, which, according to an input instruction, reads the table data and the first data from the data warehouse module, loads the first data into the table corresponding to the table data, and outputs the table containing the first data.
2. The system according to claim 1, characterized in that the data acquisition module comprises:
A data extraction unit, which extracts the first data from the distributed data sources according to the preset extraction condition and outputs it to a data processing unit;
A data processing unit, which cleans and converts the first data according to the preset cleaning and conversion rule and outputs it to a data loading unit;
A data loading unit, which organizes the first data, loads the first data into a table, and loads the table containing the first data into the data warehouse module.
3. The system according to claim 1, characterized in that the data warehouse module comprises:
A management node, which, according to an externally input instruction, starts or shuts down the SQL node and the data node, manages the configuration file and the log file, and writes the key information reported by the data node into the log file;
At least one SQL node, each SQL node establishing, according to the table data it stores, a one-to-one connection with the data node that stores the first data of the table; the SQL node splits the table according to the preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, outputs the first data to the data node, and saves the table data of the split tables;
At least one data node, which obtains the configuration file from the management node, retrieves the configuration data, completes the configuration of the node, classifies the first data according to the preset partitioning rule, and saves it in the corresponding partitions.
4. The system according to claim 3, characterized in that the SQL node also builds a corresponding table index for the table data and, for each split table, stores a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
5. The system according to claim 3 or 4, characterized in that the preset table-splitting rule is a vertical table-splitting rule or a horizontal table-splitting rule;
The vertical table-splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal table-splitting rule applies a preset hash algorithm to the first data in the table and splits the table according to the result.
6. The system according to claim 3 or 4, characterized in that the preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning;
The range partitioning is based on the data recorded in a field and uses a configured range of consecutive values as the selection condition for the data in the partition file corresponding to a partition;
The list partitioning is based on the data recorded in a field and uses a configured set of attribute values as the selection condition for the data in the partition file corresponding to a partition;
The hash partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition;
The key partitioning evaluates a preset expression on the data recorded in a field and uses the result as the selection condition for the data in the partition file corresponding to a partition.
7. The system according to claim 4, characterized in that the data access module comprises:
A data retrieval unit, which, according to the input instruction, searches the table index contained in the third file on the SQL node, obtains the first file and the second file, outputs the first file and the second file to a table generation unit, reads the first data stored in the data node connected to the SQL node, and outputs the first data to the table generation unit;
A table generation unit, which generates a table from the first file and the second file, inserts the first data into the table, and outputs the table.
8. A distributed data processing method, characterized in that the method comprises:
A. a data acquisition module extracts first data from distributed data sources, cleans and converts the first data according to a preset cleaning and conversion rule, and loads the cleaned and converted first data into a table; the preset cleaning and conversion rule comprises a preset data screening condition and the attributes of the running system, the supported data format being determinable from the attributes of the running system so that the first data is converted into a data format that the system supports;
B. a data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split tables to obtain table data and the first data, and saves the table data;
C. the data warehouse module classifies the first data according to a preset partitioning rule and saves it in the corresponding partitions;
D. a data access module, according to an input instruction, reads the table data and the first data, loads the first data into the table corresponding to the table data, and outputs the table.
9. The method according to claim 8, characterized in that, before step B, the method further comprises:
The management node of the data warehouse module, according to an externally input instruction, starts or shuts down a node and manages the configuration file and the log file;
The node is the SQL node or the data node of the data warehouse module.
10. The method according to claim 8 or 9, characterized in that step B comprises:
B1. the SQL node of the data warehouse module, according to the table data it stores, establishes a one-to-one connection with the data node that stores the first data of the table;
B2. the SQL node of the data warehouse module splits the table according to the preset table-splitting rule;
B3. the SQL node of the data warehouse module extracts the first data from the split tables to obtain table data and the first data, saves the table data of the split tables, and outputs the first data to the data node of the data warehouse module.
11. The method according to claim 8 or 9, characterized in that step C comprises:
C1. the data node of the data warehouse module obtains the configuration file from the management node of the data warehouse module, retrieves the configuration data, and completes the configuration of the node;
C2. the data node of the data warehouse module classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
12. The method according to claim 10, characterized in that, after step B3, the method further comprises:
B4. the SQL node of the data warehouse module builds a corresponding table index for the table data;
B5. for each split table, the SQL node of the data warehouse module stores a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
13. The method according to claim 10, characterized in that the preset table-splitting rule is a vertical table-splitting rule or a horizontal table-splitting rule;
The vertical table-splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal table-splitting rule applies a preset hash algorithm to the first data in the table and splits the table according to the result.
14. The method according to claim 11, characterized in that the preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning;
The range partitioning is based on the data recorded in a field and uses a configured range of consecutive values as the selection condition for the data in the partition file corresponding to a partition;
The list partitioning is based on the data recorded in a field and uses a configured set of attribute values as the selection condition for the data in the partition file corresponding to a partition;
The hash partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition;
The key partitioning evaluates a preset expression on the data recorded in a field and uses the result as the selection condition for the data in the partition file corresponding to a partition.
15. The method according to claim 12, characterized in that step D comprises:
D1. the data access module, according to the input instruction, searches the table index contained in the third file on the SQL node of the data warehouse module and obtains the first file and the second file;
D2. the data access module, according to the first file and the second file, reads the first data stored in the data node that is connected to the SQL node of the data warehouse module;
D3. the data access module generates a table from the first file and the second file, inserts the first data into the table, and outputs the table.
CN201210013801.2A 2012-01-17 2012-01-17 Distributed data processing system and method Expired - Fee Related CN102542071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210013801.2A CN102542071B (en) 2012-01-17 2012-01-17 Distributed data processing system and method

Publications (2)

Publication Number Publication Date
CN102542071A CN102542071A (en) 2012-07-04
CN102542071B true CN102542071B (en) 2014-02-26

Family

ID=46348950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210013801.2A Expired - Fee Related CN102542071B (en) 2012-01-17 2012-01-17 Distributed data processing system and method

Country Status (1)

Country Link
CN (1) CN102542071B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793424B (en) * 2012-10-31 2018-04-20 阿里巴巴集团控股有限公司 database data migration method and system
CN103838770A (en) * 2012-11-26 2014-06-04 中国移动通信集团北京有限公司 Logic data partition method and system
CN103559254B (en) * 2013-10-31 2018-03-02 上海上讯信息技术股份有限公司 A kind of storage system and method based on module
CN105468651B (en) * 2014-09-12 2020-03-27 阿里巴巴集团控股有限公司 Relational database data query method and system
CN104252535A (en) * 2014-09-16 2014-12-31 福建新大陆软件工程有限公司 Hbase-based data hash processing method and device
CN104252544A (en) * 2014-09-30 2014-12-31 北京华智凯科技有限公司 Big data mining method and device
CN105573971B (en) * 2014-10-10 2018-09-25 富士通株式会社 Table reconfiguration device and method
CN104462462B (en) * 2014-12-16 2017-11-07 用友软件股份有限公司 Change the data warehouse modeling method and model building device of frequency based on business
CN106294498A (en) * 2015-06-09 2017-01-04 阿里巴巴集团控股有限公司 A kind of data processing method and equipment
CN105022791A (en) * 2015-06-19 2015-11-04 华南理工大学 Novel KV distributed data storage method
CN108153744A (en) * 2016-12-02 2018-06-12 上海中兴软件有限责任公司 A kind of data storage system maintenance method and device
CN106933992B (en) * 2017-02-24 2018-02-06 北京华安普惠高新技术有限公司 Distributed data purging system and method based on data analysis
CN107784070B (en) * 2017-09-15 2020-10-30 平安科技(深圳)有限公司 Method, device and equipment for improving data cleaning efficiency
CN107832333B (en) * 2017-09-29 2022-05-10 北京邮电大学 Method and system for constructing user network data fingerprint based on distributed processing and DPI data
CN107908610A (en) * 2017-12-04 2018-04-13 北京中燕信息技术有限公司 A kind of data processing method and device
CN108304486A (en) * 2017-12-29 2018-07-20 北京欧链科技有限公司 A kind of data processing method and device based on block chain
CN109857832A (en) * 2019-01-03 2019-06-07 中国银行股份有限公司 A kind of preprocess method and device of payment data
CN110287199B (en) * 2019-07-01 2021-11-16 联想(北京)有限公司 Database processing method and electronic equipment
CN110825739B (en) * 2019-10-30 2021-07-16 京东数字科技控股有限公司 Table building statement generation method, device, equipment and storage medium
CN112231406A (en) * 2020-10-20 2021-01-15 浪潮云信息技术股份公司 Distributed cloud data centralized processing method
CN112307721B (en) * 2020-10-30 2022-08-30 广州朗国电子科技股份有限公司 Method for quickly converting third-party interface data into customized form and storage medium
CN112597219A (en) * 2020-12-15 2021-04-02 中国建设银行股份有限公司 Method and device for importing large-data-volume text file into distributed database
CN113759884B (en) * 2021-11-08 2022-02-01 西安热工研究院有限公司 Method and system for generating input/output point product file of distributed control system
CN117633024B (en) * 2024-01-23 2024-04-23 天津南大通用数据技术股份有限公司 Database optimization method based on preprocessing optimization join

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996250A (en) * 2010-11-15 2011-03-30 中国科学院计算技术研究所 Hadoop-based mass stream data storage and query method and system
CN102281332A (en) * 2011-08-31 2011-12-14 上海西本网络科技有限公司 Distributed cache array and data updating method thereof
CN102308273A (en) * 2009-02-17 2012-01-04 日本电气株式会社 Storage system

Also Published As

Publication number Publication date
CN102542071A (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN102542071B (en) Distributed data processing system and method
CN103473239B (en) A kind of data of non relational database update method and device
CN102867071B (en) Management method for massive network management historical data
CN102567495B (en) Mass information storage system and implementation method
CN103294710B (en) A kind of data access method and device
CN103631924B (en) A kind of application process and system of distributive database platform
CN106294352B (en) A kind of document handling method, device and file system
CN107533551A (en) The other big data statistics of block level
WO2016149552A1 (en) Compaction policy
CN104317800A (en) Hybrid storage system and method for mass intelligent power utilization data
CN107247778A (en) System and method for implementing expansible data storage service
CN104657459A (en) Massive data storage method based on file granularity
CN102332030A (en) Data storing, managing and inquiring method and system for distributed key-value storage system
CN105956123A (en) Local updating software-based data processing method and apparatus
CN104239377A (en) Platform-crossing data retrieval method and device
CN109241159B (en) Partition query method and system for data cube and terminal equipment
CN110489407A (en) Data filling mining method, apparatus, computer equipment and storage medium
CN108629029A (en) A kind of data processing method and device applied to data warehouse
CN107807932B (en) Hierarchical data management method and system based on path enumeration
CN102541990A (en) Database redistribution method and system utilizing virtual partitions
CN103793493A (en) Method and system for processing car-mounted terminal mass data
CN103294413B (en) Support the distributed memory real-time storage device and method of magnanimity acquisition terminal
CN109669925A (en) The management method and device of unstructured data
CN102779138A (en) Hard disk access method of real time data
CN105589881A (en) Data processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN LONGSHI MEDIA CO., LTD.

Free format text: FORMER OWNER: SHENZHEN COSHIP VIDEO COMMUNICATION CO., LTD.

Effective date: 20130407

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20130407

Address after: 4th Floor, Rainbow Science and Technology Building, North High-tech Zone, Nanshan District, Shenzhen, Guangdong 518057

Applicant after: Shenzhen Longguan Media Co., Ltd.

Address before: 518057 B2-1 District, rainbow tech building, North Fifth Industrial Zone, north high tech Zone, Nanshan District, Guangdong, Shenzhen

Applicant before: Shenzhen Tongzhou Video Media Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN TONGZHOU ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENZHEN LONGSHI MEDIA CO., LTD.

Effective date: 20140521

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140521

Address after: 518057 rainbow science and Technology Building (North West Road), Nanshan District hi tech Zone, Shenzhen, Guangdong

Patentee after: Shenzhen Tongzhou Electronic Co., Ltd.

Address before: 4th Floor, Rainbow Science and Technology Building, North High-tech Zone, Nanshan District, Shenzhen, Guangdong 518057

Patentee before: Shenzhen Longguan Media Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140226

Termination date: 20150117

EXPY Termination of patent right or utility model