CN102542071A - Distributed data processing system and method - Google Patents

Distributed data processing system and method

Info

Publication number
CN102542071A
Authority
CN
China
Prior art keywords
data
file
subregion
preset
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100138012A
Other languages
Chinese (zh)
Other versions
CN102542071B (en
Inventor
李海军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Coship Electronics Co Ltd
Original Assignee
SHENZHEN TONGZHOU VIDEO MEDIA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN TONGZHOU VIDEO MEDIA CO Ltd
Priority to CN201210013801.2A
Publication of CN102542071A
Application granted
Publication of CN102542071B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a distributed data processing system and method. The distributed data processing system comprises a data acquisition module, a data warehouse module and a data access module. The data acquisition module extracts first data from distributed data sources, cleans and converts the first data according to a preset cleaning and conversion rule, and loads the cleaned and converted first data into a table. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and stores the table data; it also classifies the first data according to a preset partitioning rule and stores the classified first data in the corresponding partitions. The data access module reads the table data and the first data according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table. The system and method provided by the invention can reduce cost and shorten the time spent on system maintenance.

Description

A system and method for distributed data processing
Technical field
The present invention relates to data processing technology, and in particular to a system and method for distributed data processing.
Background art
A distributed database system is a collection of data that logically belongs to a single system but is physically distributed over multiple nodes connected by a computer network. The nodes are linked together in a communication network, and each node is an independent database system with its own database, CPU, terminals and local database management system (local DBMS). In a distributed database system, user data is generally distributed by user across different node databases (DB). Each time user data is accessed or modified, the node database holding that user's data must first be located, so the information identifying that node database is important status data for the user.
Fig. 1 is a structural diagram of the user distribution device in an existing distributed database system. The user distribution device in the existing distributed database system is described below with reference to Fig. 1:
When a user registers, the user distribution control module 21 obtains the user distribution weights of the different node databases in the current system and, according to these weights, assigns the user a DBid corresponding to the user's user id, so that the users in the system are distributed evenly across the different node databases.
The user distribution information database 22 stores user distribution information. The user distribution information comprises the user id and the DBid corresponding to the user id, and may also comprise the current state information of each user's data.
When an access request is received, the user access control module 23 queries the user distribution information database 22 by user id to obtain the DBid of the node database storing that user's data, and then accesses the user data in the node database corresponding to this DBid. When the system upgrades or migrates user data, the user data state configuration unit 24 changes the current state of that user data in the user distribution information database 22 to the maintenance state; after the upgrade or migration is complete, it changes the current state of that user data in the user distribution information back to the normal state in which access is allowed.
The user distribution device in the existing distributed database system takes full account of the distribution of existing users across the node databases when a user registers, so that users are evenly distributed over the node databases in the system. When user data is upgraded or migrated, only the users whose data is being upgraded or migrated are affected, and access to other users' data is not affected. However, existing distributed database systems are usually built on large databases such as ORACLE, DB2 and Sybase. When processing massive data, they are not only costly, but system maintenance also takes considerable time, so further improvement is needed.
Summary of the invention
In view of this, one object of the present invention is to provide a system for distributed data processing that can reduce cost and shorten the time spent on system maintenance.
Another object of the present invention is to provide a method for distributed data processing that can reduce cost and shorten the time spent on system maintenance.
To achieve the above objects, the technical solution of the present invention is implemented as follows:
A system for distributed data processing, the system comprising:
A data acquisition module, which extracts first data from distributed data sources according to a preset extraction condition, cleans and converts the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into the data warehouse module;
A data warehouse module, which splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and stores the table data; and which classifies the first data according to a preset partitioning rule and stores it in the corresponding partitions;
A data access module, which reads the table data and the first data from the data warehouse module according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data.
In the above system, the data acquisition module comprises:
A data extraction unit, which extracts first data from the distributed data sources according to the preset extraction condition and outputs it to the data processing unit;
A data processing unit, which cleans and converts the first data according to the preset cleaning and conversion rule and outputs it to the data loading unit;
A data loading unit, which organizes the first data, loads the first data into a table, and loads the table containing the first data into the data warehouse module.
In the above system, the data warehouse module comprises:
A management node, which starts or shuts down the SQL nodes and data nodes according to externally input instructions, manages the configuration file and the log file, and writes the key information reported by the data nodes to the log file;
At least one SQL node, each of which, according to the table data it stores, establishes a one-to-one connection with the data node that stores the first data of the table; the SQL node splits the table according to the preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, outputs the first data to the data node, and stores the table data of the split table;
At least one data node, which obtains the configuration file from the management node, retrieves the configuration data, completes its own configuration, and, according to the preset partitioning rule, classifies the first data and stores it in the corresponding partitions.
Preferably, the SQL node also builds a table index corresponding to the table data and, for each split table, stores a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
In the above system, the preset table-splitting rule is a vertical table-splitting rule or a horizontal table-splitting rule;
The vertical table-splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal table-splitting rule computes a hash of the first data in the table according to a preset hash algorithm and splits the table according to the result.
In the above system, the preset partitioning rule is range partitioning, list partitioning, HASH partitioning or KEY partitioning;
Range partitioning is based on the data recorded in a field and uses a set continuous range of values as the selection condition for the data in the partition file corresponding to each partition;
List partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition;
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to each partition.
In the above system, the data access module comprises:
A data retrieval unit, which, according to the input instruction, looks up the table index contained in the third file on the SQL node, obtains the first file and the second file, and outputs them to the table generation unit; it also reads the first data stored in the data node connected to the SQL node and outputs it to the table generation unit;
A table generation unit, which generates a table from the first file and the second file, inserts the first data into the table, and outputs it.
A method for distributed data processing, the method comprising:
A. The data acquisition module extracts first data from distributed data sources, cleans and converts the first data according to a preset cleaning and conversion rule, and loads the cleaned and converted first data into a table;
B. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and stores the table data;
C. The data warehouse module classifies the first data according to a preset partitioning rule and stores it in the corresponding partitions;
D. The data access module reads the table data and the first data according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table.
Preferably, the following is further performed before step B:
The management node of the data warehouse module starts or shuts down nodes and manages the configuration file and the log file according to externally input instructions;
The nodes are the SQL nodes or the data nodes of the data warehouse module.
In the above method, step B comprises:
B1. The SQL node of the data warehouse module, according to the table data it stores, establishes a one-to-one connection with the data node that stores the first data of the table;
B2. The SQL node of the data warehouse module splits the table according to the preset table-splitting rule;
B3. The SQL node of the data warehouse module extracts the first data from the split table to obtain table data and the first data, stores the table data of the split table, and outputs the first data to the data node of the data warehouse module.
In the above method, step C comprises:
C1. The data node of the data warehouse module obtains the configuration file from the management node of the data warehouse module, retrieves the configuration data, and completes its own configuration;
C2. The data node of the data warehouse module classifies the received first data according to the preset partitioning rule and stores it in the corresponding partitions.
Preferably, the following is further performed after step B3:
B4. The SQL node of the data warehouse module builds a table index corresponding to the table data;
B5. The SQL node of the data warehouse module stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
In the above method, the preset table-splitting rule is a vertical table-splitting rule or a horizontal table-splitting rule;
The vertical table-splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal table-splitting rule computes a hash of the first data in the table according to a preset hash algorithm and splits the table according to the result.
In the above method, the preset partitioning rule is range partitioning, list partitioning, HASH partitioning or KEY partitioning;
Range partitioning is based on the data recorded in a field and uses a set continuous range of values as the selection condition for the data in the partition file corresponding to each partition;
List partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition;
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition;
KEY partitioning performs a calculation on the data recorded in a field according to a set expression and uses the result as the selection condition for the data in the partition file corresponding to each partition.
In the above method, step D comprises:
D1. The data access module, according to the input instruction, looks up the table index contained in the third file on the SQL node of the data warehouse module and obtains the first file and the second file;
D2. The data access module, according to the first file and the second file, reads the first data stored in the data node connected to the SQL node of the data warehouse module;
D3. The data access module generates a table from the first file and the second file, inserts the first data into the table, and outputs it.
As can be seen from the above technical solution, the invention provides a system and method for distributed data processing. The data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and stores the table data; according to a preset partitioning rule, it classifies the first data and stores it in the corresponding partitions. The data access module reads the table data and the first data from the data warehouse module according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. The system and method of the present invention can reduce cost and shorten the time spent on system maintenance.
Description of the drawings
Fig. 1 is a structural diagram of the user distribution device in an existing distributed database system;
Fig. 2 is a structural diagram of the system for distributed data processing of the present invention;
Fig. 3 is a flow chart of the method for distributed data processing of the present invention.
Embodiment
To make the objects, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides a system and method for distributed data processing. The data warehouse module stores the table generated by the data acquisition module and splits the table. The table data of the split table is stored in the SQL node, and the first data recorded in the split table is stored in the data node. A correspondence is established between the table data and the first data recorded in that table, so that the data access module can obtain the table containing the data it needs from the SQL node and the data node. This not only saves cost but also effectively protects massive data and shortens the time spent on system maintenance.
Fig. 2 is a structural diagram of the system for distributed data processing of the present invention. The system is described below with reference to Fig. 2:
The system for distributed data processing of the present invention comprises: data acquisition module 311, data warehouse module 312 and data access module 313.
Data acquisition module 311 extracts first data from multiple distributed data sources according to a preset extraction condition, cleans and converts the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into data warehouse module 312. The distributed data sources may be files containing important data, such as business data tables, logs and CDR files. The first data is user data or business data, for example phone charges, wages or call durations.
Data warehouse module 312 splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, stores the table data, and, according to a preset partitioning rule, classifies the first data and stores it in the corresponding partitions. Data warehouse module 312 may be deployed on a single terminal or distributed over multiple terminals.
The preset table-splitting rule is a vertical table-splitting rule or a horizontal table-splitting rule. The vertical table-splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal table-splitting rule computes a hash of the first data in the table according to a preset hash algorithm and splits the table according to the result.
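The two splitting rules can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the sample field names (id, name, fee, minutes) and the use of SHA-256 as the "preset hash algorithm" are all assumptions made for illustration.

```python
import hashlib

def vertical_split(rows, key, first_fields):
    """Split rows with N1+N2 fields into Tab1 (the N1 fields in first_fields)
    and Tab2 (the N2 remaining fields plus the key linking back to Tab1)."""
    tab1, tab2 = [], []
    for row in rows:
        tab1.append({f: row[f] for f in first_fields})
        rest = {f: v for f, v in row.items() if f not in first_fields}
        rest[key] = row[key]  # the extra (N2+1)-th field: association with Tab1
        tab2.append(rest)
    return tab1, tab2

def horizontal_split(rows, key, n_tables):
    """Distribute whole rows over n_tables sub-tables by hashing the key.
    SHA-256 is only a stand-in for the unspecified preset hash algorithm."""
    tables = [[] for _ in range(n_tables)]
    for row in rows:
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).hexdigest()
        tables[int(digest, 16) % n_tables].append(row)
    return tables

# Hypothetical sample table with N1 = 2 and N2 = 2 fields.
rows = [
    {"id": 1, "name": "a", "fee": 10, "minutes": 3},
    {"id": 2, "name": "b", "fee": 20, "minutes": 5},
]
tab1, tab2 = vertical_split(rows, key="id", first_fields=["id", "name"])
sub_tables = horizontal_split(rows, key="id", n_tables=3)
```

In this sketch the key field is assumed to be one of the Tab1 fields, so Tab2 ends up with exactly N2+1 fields, matching the rule described above.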
The preset partitioning rule is range partitioning, list partitioning, HASH partitioning or KEY partitioning.
Range partitioning is based on the data recorded in a field and uses a set continuous range of values as the selection condition for the data in the partition file corresponding to each partition. For example, if the continuous range is a time range, three partitions can be defined: 2011-10-1~2011-10-31, 2011-11-1~2011-11-30 and 2011-12-1~2011-12-31, each with its own partition file. First data whose recorded time falls within 2011-10-1~2011-10-31 is stored in partition file A, first data whose recorded time falls within 2011-11-1~2011-11-30 is stored in partition file B, and first data whose recorded time falls within 2011-12-1~2011-12-31 is stored in partition file C.
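The date-range example above can be sketched in Python as follows. The names RANGES and range_partition are hypothetical, and raising an error for out-of-range dates is an assumption, since the text does not say how such records are handled.

```python
from datetime import date

# Partition files A, B and C with the inclusive date ranges from the example.
RANGES = [
    ("A", date(2011, 10, 1), date(2011, 10, 31)),
    ("B", date(2011, 11, 1), date(2011, 11, 30)),
    ("C", date(2011, 12, 1), date(2011, 12, 31)),
]

def range_partition(record_date):
    """Return the partition file whose range contains record_date."""
    for name, start, end in RANGES:
        if start <= record_date <= end:
            return name
    # Assumed behavior: reject records that no partition covers.
    raise ValueError(f"no partition covers {record_date}")
```

For instance, a record dated 2011-11-15 would be routed to partition file B.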
List partitioning is based on the data recorded in a field and uses set attribute values as the selection condition for the data in the partition file corresponding to each partition. For example, to partition by the wage amount recorded in a field with the set attribute values 1000, 2000, 3000 and 4000, four partitions are created, each with its own partition file. First data whose recorded wage amount is 1000 is stored in partition file A', first data whose recorded wage amount is 2000 is stored in partition file B', first data whose recorded wage amount is 3000 is stored in partition file C', and first data whose recorded wage amount is 4000 is stored in partition file D'. The first data corresponding to the field may be other data recorded together with it, such as position and hiring date.
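A minimal Python sketch of the wage example, assuming each record is a dictionary with a hypothetical "wage" field; the mapping name and the error raised for unlisted values are likewise assumptions.

```python
# Set attribute values mapped to the partition files named in the example.
LIST_PARTITIONS = {1000: "A'", 2000: "B'", 3000: "C'", 4000: "D'"}

def list_partition(record, field="wage"):
    """Return the partition file whose listed value equals record[field]."""
    value = record[field]
    if value not in LIST_PARTITIONS:
        # Assumed behavior: values outside the list are rejected.
        raise ValueError(f"value {value!r} is not in any partition list")
    return LIST_PARTITIONS[value]

# The other fields (position, hiring date) travel with the record.
employee = {"wage": 3000, "position": "engineer", "hired": "2010-05-01"}
target = list_partition(employee)
```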
HASH partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to each partition. HASH partitioning is mainly used to ensure that the first data corresponding to the field is evenly distributed over a predetermined number of partitions.
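As a rough illustration, hashing a field value and taking the result modulo the preset number of partitions spreads records across partitions. The sketch below uses zlib.crc32 purely as a stand-in hash function; the source does not name a specific one.

```python
import zlib
from collections import Counter

def hash_partition(value, n_partitions):
    """Map a field value to a partition index in [0, n_partitions).
    crc32 is an illustrative choice, not the hash used by the system."""
    return zlib.crc32(str(value).encode("utf-8")) % n_partitions

# Count how many of 10,000 hypothetical user ids land in each of 4 partitions.
counts = Counter(hash_partition(user_id, 4) for user_id in range(10_000))
```

Inspecting counts gives a quick check of how evenly a chosen hash function distributes a given key set.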
KEY partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to each partition. KEY partitioning differs from HASH partitioning mainly in the expression used for the calculation: HASH partitioning uses a hash function as its expression, whereas for KEY partitioning the hash function can be provided by the MySQL server, or an expression set by the user can be used. The specific expressions are not described further here.
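The difference can be sketched as a partition function that accepts a pluggable expression. This is only an illustration under stated assumptions: builtin_key_hash is a hypothetical stand-in for a server-supplied hash and does not reproduce the function MySQL actually uses.

```python
import zlib

def builtin_key_hash(key):
    """Hypothetical stand-in for a server-provided hash function."""
    return zlib.crc32(repr(key).encode("utf-8"))

def key_partition(record, key_fields, n_partitions, expression=builtin_key_hash):
    """Compute the partition index from the key fields of a record.
    `expression` defaults to the built-in hash but may be user-supplied."""
    key = tuple(record[f] for f in key_fields)
    return expression(key) % n_partitions

record = {"id": 7, "region": "sz"}
default_index = key_partition(record, ["id"], 4)
# A user-set expression: use the first key field directly.
custom_index = key_partition(record, ["id"], 4, expression=lambda k: k[0])
```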
Data access module 313 reads the table data and the first data from data warehouse module 312 according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table containing the first data. Data access module 313 provides the user with an interface for accessing the data in data warehouse module 312.
Data acquisition module 311 comprises: data extraction unit 3111, data processing unit 3112 and data loading unit 3113.
Data extraction unit 3111 extracts first data from multiple distributed data sources according to a preset extraction condition and outputs it to data processing unit 3112. Data extraction unit 3111 is connected to the multiple data sources and can extract the required first data from them.
Data processing unit 3112 cleans and converts the first data according to a preset cleaning and conversion rule and outputs it to data loading unit 3113. The preset cleaning and conversion rule consists of preset data screening conditions and the properties of the running system; the data formats the system supports can be determined from those properties, so that the first data can be converted into a format the system supports.
Data loading unit 3113 organizes the first data, loads the first data into a table, and loads the table containing the first data into data warehouse module 312. Data loading unit 3113 generates a table to hold the first data, so that data can subsequently be stored and output in tabular form.
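The three units form an extract, clean/convert, load pipeline, which the following Python sketch illustrates. The record layout (user, fee), the screening condition and the conversion to float are all assumptions chosen for the example; the source does not specify concrete rules.

```python
def extract(sources, condition):
    """Data extraction unit: pull records matching the extraction condition
    from every data source."""
    return [rec for src in sources for rec in src if condition(rec)]

def clean_and_convert(records):
    """Data processing unit: drop records with a missing fee and convert
    the remaining fields to the formats assumed to be supported."""
    cleaned = []
    for rec in records:
        if rec.get("fee") in (None, ""):
            continue  # screening condition: fee must be present
        cleaned.append({"user": str(rec["user"]).strip(),
                        "fee": float(rec["fee"])})
    return cleaned

def load(records, columns):
    """Data loading unit: arrange the records into a table."""
    return {"columns": columns,
            "rows": [[rec[c] for c in columns] for rec in records]}

# Two hypothetical data sources: a billing table and a log.
billing = [{"user": " u1 ", "fee": "12.5"}, {"user": "u2", "fee": ""}]
logs = [{"user": "u3", "fee": 3}]

table = load(
    clean_and_convert(extract([billing, logs], lambda rec: "user" in rec)),
    ["user", "fee"],
)
```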
Data warehouse module 312 comprises: management node 3121, at least one SQL node 3122 and at least one data node 3123. Management node 3121 is located on one terminal. SQL node 3122 may be located on the same terminal as management node 3121 or on a different terminal; likewise, data node 3123 may be located on the same terminal as management node 3121 or on a different terminal.
Management node 3121 starts or shuts down SQL node 3122 and data node 3123 according to externally input instructions, manages the configuration file and the log file, and writes the key information reported by data node 3123 to the log file. The configuration file records the configuration of each individual data node in the cluster and the individual communication links set up between the data nodes. Management node 3121 can be started with the command ndb_mgmd and is started before SQL node 3122 and data node 3123.
Each SQL node 3122, according to the table data it stores, establishes a one-to-one connection with the data node 3123 that stores the first data of the table. SQL node 3122 splits the table according to the preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, outputs the first data to data node 3123, and stores the table data of the split table. SQL node 3122 is used to access the data nodes 3123 in the cluster; it can be started with the command mysqld --ndbcluster, or with mysqld after ndbcluster has been added to my.cnf.
Data node 3123 obtains the configuration file from management node 3121, retrieves the configuration data, completes its own configuration, and, according to the preset partitioning rule, classifies the first data and stores it in the corresponding partitions. The number of data nodes 3123 is related to the number of replicas and is a multiple of the number of fragments; for example, with two replicas, each having two fragments, there are 4 data nodes 3123. Data node 3123 can be started with the command ndbd.
To make lookups easier and data retrieval more efficient, SQL node 3122 also builds a table index corresponding to the table data and, for each split table, stores a first file holding the table structure, a second file holding the table data, and a third file holding the table index. When building indexes, SQL node 3122 may index fields that are unique within the table, such as the student number in a student table; it may also index fields that are frequently used for sorting, grouping and join operations, or fields that are often used as query conditions. When building indexes, SQL node 3122 needs to take the size of the storage space into account and keep the number of indexes as small as possible to improve efficiency.
Data access module 313 comprises: data retrieval unit 3131 and table generation unit 3132.
Data retrieval unit 3131, according to the input instruction, looks up the table index contained in the third file on SQL node 3122, obtains the first file and the second file, and outputs them to table generation unit 3132; it also reads the first data stored in the data node 3123 connected to SQL node 3122 and outputs it to table generation unit 3132. The input instruction is a command for looking up the first data.
Table generation unit 3132 generates a table from the first file and the second file, inserts the first data into the table, and outputs it.
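The read path can be sketched in Python with in-memory dictionaries standing in for the SQL node and the data node. Everything here is hypothetical: the dictionary layout, the "wages" table, and in particular the assumption that the second file records which data node holds the rows are illustrative guesses, not details given in the source.

```python
# In-memory stand-in for one SQL node (all keys and values are hypothetical).
sql_node = {
    "index": {"wages": "t_wages"},                     # third file: table index
    "structure": {"t_wages": ["id", "wage"]},          # first file: table structure
    "table_data": {"t_wages": {"data_node": "dn1"}},   # second file (assumed content)
}
# In-memory stand-in for the data nodes holding the first data.
data_nodes = {
    "dn1": {"t_wages": [{"id": 1, "wage": 1000}, {"id": 2, "wage": 2000}]},
}

def read_table(name):
    """Data retrieval unit plus table generation unit, in one function."""
    table_id = sql_node["index"][name]            # look up the table index
    columns = sql_node["structure"][table_id]     # obtain the table structure
    node = sql_node["table_data"][table_id]["data_node"]
    records = data_nodes[node][table_id]          # read the first data
    # Generate the table and insert the first data into it.
    return {"columns": columns,
            "rows": [[rec[c] for c in columns] for rec in records]}

result = read_table("wages")
```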
Fig. 3 is a flow chart of the method for distributed data processing of the present invention. The method is described below with reference to Fig. 3:
Step 41: extract first data from the distributed data sources;
In this step, data acquisition module 311 extracts first data from multiple distributed data sources according to a preset extraction condition.
Step 42: clean and convert the first data according to a preset cleaning and conversion rule;
In this step, data acquisition module 311 cleans and converts the first data according to the preset cleaning and conversion rule, so that the first data meets the storage requirements of data warehouse module 312.
Step 43: load the cleaned and converted first data into a table;
In this step, data acquisition module 311 loads the cleaned and converted first data into a table and loads the table containing the first data into data warehouse module 312, so that data warehouse module 312 can store and maintain the data effectively.
Step 44: split the table according to a preset table-splitting rule, extract the first data from the split table to obtain table data and the first data, and store the table data;
This step is performed by data warehouse module 312.
This step comprises:
Step 441: SQL node 3122 of data warehouse module 312, according to the table data it stores, establishes a one-to-one connection with the data node 3123 that stores the first data of the table;
Step 442: SQL node 3122 of data warehouse module 312 splits the table according to the preset table-splitting rule;
Step 443: SQL node 3122 of data warehouse module 312 extracts the first data from the split table to obtain table data and the first data, stores the table data of the split table, and outputs the first data to data node 3123 of data warehouse module 312.
Preferably, the following is further performed after step 443:
Step 444: SQL node 3122 of data warehouse module 312 builds a table index corresponding to the table data;
Step 445: SQL node 3122 of data warehouse module 312 stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
The preset table-splitting rule is a vertical table-splitting rule or a horizontal table-splitting rule. The vertical table-splitting rule splits a table Tab having (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1. The horizontal table-splitting rule computes a hash of the first data in the table according to a preset hash algorithm and splits the table according to the result.
Step 45: classify the first data according to a preset partitioning rule and save it in the corresponding partitions;
This step is executed by the data warehouse module 312.
This step comprises:
Step 451: the data node 3123 of the data warehouse module 312 obtains the configuration file from the management node 3121 of the data warehouse module 312, retrieves the configuration data, and completes the configuration of the node;
Step 452: the data node 3123 of the data warehouse module 312 classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
The preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning.
Range partitioning takes the data recorded in a field as its basis and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to a partition. List partitioning takes the data recorded in a field as its basis and uses set attribute values as the selection condition for the data in the partition file corresponding to a partition. Hash (HASH) partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition. Key (KEY) partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to a partition.
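The four partitioning rules can be sketched as functions that map a field value to a partition index. This is only an illustrative sketch following the wording of the description: the function names, the exclusive upper bounds, the CRC32 hash, and the final modulo step in `key_partition` are assumptions, not details given in the patent.

```python
import zlib
from typing import Callable, Collection, Optional, Sequence


def range_partition(value: int, upper_bounds: Sequence[int]) -> Optional[int]:
    """Pick the first partition whose (exclusive) upper bound is above the value."""
    for index, bound in enumerate(upper_bounds):
        if value < bound:
            return index
    return None  # the value lies outside every configured range


def list_partition(value: object,
                   value_lists: Sequence[Collection[object]]) -> Optional[int]:
    """Pick the partition whose list of set attribute values contains the value."""
    for index, values in enumerate(value_lists):
        if value in values:
            return index
    return None  # the value is not listed for any partition


def hash_partition(value: object, n_partitions: int) -> int:
    """Hash the field value and reduce it to the preset number of partitions."""
    return zlib.crc32(str(value).encode("utf-8")) % n_partitions


def key_partition(value: int, expression: Callable[[int], int],
                  n_partitions: int) -> int:
    """Evaluate a preset expression on the field value and map it to a partition."""
    return expression(value) % n_partitions
```

For example, with upper bounds `[10, 20, 30]`, `range_partition(5, ...)` selects partition 0 and `range_partition(10, ...)` selects partition 1, because each bound is treated as exclusive in this sketch.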
Step 46: according to an input instruction, read the table data and the first data, load the first data into the table corresponding to the table data, and output the table.
This step comprises:
Step 461: according to the input instruction, the data access module 313 searches the table index contained in the third file on the SQL node 3122 of the data warehouse module 312 and obtains the first file and the second file;
Step 462: according to the first file and the second file, the data access module 313 reads the first data stored in the data node connected to the SQL node 3122 of the data warehouse module 312;
Step 463: the data access module 313 generates a table according to the first file and the second file, inserts the first data into the table, and outputs it.
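The read path of steps 461 to 463 can be sketched with in-memory stand-ins for the three files and for the data node. Everything here is hypothetical: the file names, the choice to model the table data as a list of row keys, and the tuple layout of the first data are assumptions made only to show the lookup order.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the three files kept on the SQL node.
structure_files: Dict[str, List[str]] = {
    "orders.structure": ["id", "customer", "amount"],   # first file: table structure
}
table_data_files: Dict[str, List[int]] = {
    "orders.data": [1, 2],                               # second file: table data (row keys here)
}
index_file: Dict[str, Tuple[str, str]] = {
    "orders": ("orders.structure", "orders.data"),       # third file: table index
}

# Hypothetical data node connected to the SQL node; it holds the first data.
data_node: Dict[int, Tuple[str, float]] = {1: ("alice", 30.0), 2: ("bob", 12.5)}


def read_table(name: str) -> Tuple[List[str], List[tuple]]:
    """Return (columns, rows) for the named table."""
    # Step 461: look up the table index to find the first and second files.
    structure_name, data_name = index_file[name]
    columns = structure_files[structure_name]
    row_keys = table_data_files[data_name]
    # Step 462: read the first data from the connected data node.
    rows = [(key, *data_node[key]) for key in row_keys]
    # Step 463: the columns plus the filled-in rows form the output table.
    return columns, rows
```

Calling `read_table("orders")` in this sketch returns the column list together with one tuple per row, assembled from the SQL-node metadata and the values held by the data node.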
Preferably, the following is further included before step 44: according to externally input instructions, the management node 3121 of the data warehouse module 312 starts or shuts down nodes and manages the configuration files and log files. The nodes in this step are the SQL node 3122 or the data node 3123 of the data warehouse module 312.
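The role of the management node can be sketched as a small class that tracks node state, hands out configuration, and keeps a log. The class name, method names, and log message format are hypothetical; the patent only states that the management node starts or shuts down nodes and manages configuration and log files.

```python
from typing import Dict, List


class ManagementNode:
    """Toy management node: tracks node state, configuration, and a log."""

    def __init__(self, config: Dict[str, Dict[str, object]]) -> None:
        self.config = config                  # per-node configuration data
        self.running: Dict[str, bool] = {}    # node name -> started or not
        self.log: List[str] = []              # stands in for the log file

    def start(self, node: str) -> None:
        """Start an SQL node or data node on an external instruction."""
        self.running[node] = True
        self.log.append(f"started {node}")

    def stop(self, node: str) -> None:
        """Shut down an SQL node or data node on an external instruction."""
        self.running[node] = False
        self.log.append(f"stopped {node}")

    def get_config(self, node: str) -> Dict[str, object]:
        """Return a copy of the configuration a node fetches (as in step 451)."""
        return dict(self.config.get(node, {}))

    def report(self, node: str, message: str) -> None:
        """Record information reported by a node in the log."""
        self.log.append(f"{node}: {message}")
```

A data node in this sketch would call `get_config` once after being started and then use `report` to pass key information back for logging.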
In the above preferred embodiment of the present invention, the system no longer relies on large databases such as Oracle, DB2, or Sybase, as existing systems for distributed data processing do. Instead, it processes massive data on the basis of a MySQL distributed database, which reduces cost. During processing, the data is handled by partitioning and table splitting, which shortens the time spent on system maintenance.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (15)

1. A system for distributed data processing, characterized in that the system comprises:
A data acquisition module, which extracts first data from distributed data sources according to a preset extraction condition, cleans and converts the first data according to a preset cleaning and conversion rule, loads the cleaned and converted first data into a table, and loads the table into the data warehouse module;
A data warehouse module, which splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and saves the table data; and which classifies the first data according to a preset partitioning rule and saves it in the corresponding partitions;
A data access module, which, according to an input instruction, reads the table data and the first data from the data warehouse module, loads the first data into the table corresponding to the table data, and outputs the table containing the first data.
2. The system according to claim 1, characterized in that the data acquisition module comprises:
A data extraction unit, which extracts the first data from the distributed data sources according to the preset extraction condition and outputs it to the data processing unit;
A data processing unit, which cleans and converts the first data according to the preset cleaning and conversion rule and outputs it to the data loading unit;
A data loading unit, which organizes the first data, loads the first data into a table, and loads the table containing the first data into the data warehouse module.
3. The system according to claim 1, characterized in that the data warehouse module comprises:
A management node, which, according to externally input instructions, starts or shuts down the SQL nodes and data nodes, manages configuration files and log files, and writes key information reported by the data nodes to the log file;
At least one SQL node, each of which establishes, according to the table data it stores, a one-to-one connection with the data node that stores the first data of the table; the SQL node splits the table according to the preset table-splitting rule, extracts the first data from the split table to obtain the table data and the first data, outputs the first data to the data node, and saves the split table data;
At least one data node, which obtains the configuration file from the management node, retrieves the configuration data, completes the configuration of the node, and, according to the preset partitioning rule, classifies the first data and saves it in the corresponding partitions.
4. The system according to claim 3, characterized in that the SQL node further creates a corresponding table index for the table data, and further stores, for each split table, a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
5. The system according to claim 3 or 4, characterized in that the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result.
6. The system according to claim 3 or 4, characterized in that the preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning;
The range partitioning takes the data recorded in a field as its basis and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to a partition;
The list partitioning takes the data recorded in a field as its basis and uses set attribute values as the selection condition for the data in the partition file corresponding to a partition;
The hash (HASH) partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition;
The key (KEY) partitioning performs a calculation on the data recorded in a field according to a preset expression and uses the result as the selection condition for the data in the partition file corresponding to a partition.
7. The system according to claim 4, characterized in that the data access module comprises:
A data retrieval unit, which, according to the input instruction, searches the table index contained in the third file on the SQL node, obtains the first file and the second file, and outputs them to the table generation unit; and which reads the first data stored in the data node connected to the SQL node and outputs it to the table generation unit;
A table generation unit, which generates a table according to the first file and the second file, inserts the first data into the table, and outputs it.
8. A method for distributed data processing, characterized in that the method comprises:
A. a data acquisition module extracts first data from distributed data sources, cleans and converts the first data according to a preset cleaning and conversion rule, and loads the cleaned and converted first data into a table;
B. a data warehouse module splits the table according to a preset table-splitting rule, extracts the first data from the split table to obtain table data and the first data, and saves the table data;
C. the data warehouse module classifies the first data according to a preset partitioning rule and saves it in the corresponding partitions;
D. a data access module reads the table data and the first data according to an input instruction, loads the first data into the table corresponding to the table data, and outputs the table.
9. The method according to claim 8, characterized in that the following is further included before step B:
The management node of the data warehouse module starts or shuts down nodes and manages configuration files and log files according to externally input instructions;
The node is an SQL node of the data warehouse module or a data node of the data warehouse module.
10. The method according to claim 8 or 9, characterized in that step B comprises:
B1. the SQL node of the data warehouse module establishes, according to the table data it stores, a one-to-one connection with the data node that stores the first data of the table;
B2. the SQL node of the data warehouse module splits the table according to the preset table-splitting rule;
B3. the SQL node of the data warehouse module extracts the first data from the split table, obtains the table data and the first data, saves the split table data, and outputs the first data to the data node of the data warehouse module.
11. The method according to claim 8 or 9, characterized in that step C comprises:
C1. the data node of the data warehouse module obtains the configuration file from the management node of the data warehouse module, retrieves the configuration data, and completes the configuration of the node;
C2. the data node of the data warehouse module classifies the received first data according to the preset partitioning rule and saves it in the corresponding partitions.
12. The method according to claim 10, characterized in that the following is further included after step B3:
B4. the SQL node of the data warehouse module creates a corresponding table index for the table data;
B5. for each split table, the SQL node of the data warehouse module stores a first file holding the table structure, a second file holding the table data, and a third file holding the table index.
13. The method according to claim 10, characterized in that the preset table-splitting rule is a vertical splitting rule or a horizontal splitting rule;
The vertical splitting rule splits a table Tab with (N1+N2) fields into a first sub-table Tab1 with N1 fields and a second sub-table Tab2 with (N2+1) fields; the second sub-table Tab2 records the primary key information that associates it with the first sub-table Tab1;
The horizontal splitting rule performs a calculation on the first data in the table according to a preset hash algorithm and splits the table according to the result.
14. The method according to claim 11, characterized in that the preset partitioning rule is range partitioning, list partitioning, hash (HASH) partitioning, or key (KEY) partitioning;
The range partitioning takes the data recorded in a field as its basis and uses a set range of consecutive values as the selection condition for the data in the partition file corresponding to a partition;
The list partitioning takes the data recorded in a field as its basis and uses set attribute values as the selection condition for the data in the partition file corresponding to a partition;
The hash (HASH) partitioning performs a hash calculation on the data recorded in a field according to a preset number of partitions and uses the hash result as the selection condition for the data in the partition file corresponding to a partition;
The key (KEY) partitioning performs a calculation on the data recorded in a field according to a set expression and uses the result as the selection condition for the data in the partition file corresponding to a partition.
15. The method according to claim 8 or 9, characterized in that step D comprises:
D1. according to the input instruction, the data access module searches the table index contained in the third file on the SQL node of the data warehouse module and obtains the first file and the second file;
D2. according to the first file and the second file, the data access module reads the first data stored in the data node connected to the SQL node of the data warehouse module;
D3. the data access module generates a table according to the first file and the second file, inserts the first data into the table, and outputs it.
CN201210013801.2A 2012-01-17 2012-01-17 Distributed data processing system and method Expired - Fee Related CN102542071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210013801.2A CN102542071B (en) 2012-01-17 2012-01-17 Distributed data processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210013801.2A CN102542071B (en) 2012-01-17 2012-01-17 Distributed data processing system and method

Publications (2)

Publication Number Publication Date
CN102542071A true CN102542071A (en) 2012-07-04
CN102542071B CN102542071B (en) 2014-02-26

Family

ID=46348950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210013801.2A Expired - Fee Related CN102542071B (en) 2012-01-17 2012-01-17 Distributed data processing system and method

Country Status (1)

Country Link
CN (1) CN102542071B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102308273A (en) * 2009-02-17 2012-01-04 日本电气株式会社 Storage system
CN101996250A (en) * 2010-11-15 2011-03-30 中国科学院计算技术研究所 Hadoop-based mass stream data storage and query method and system
CN102281332A (en) * 2011-08-31 2011-12-14 上海西本网络科技有限公司 Distributed cache array and data updating method thereof

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793424A (en) * 2012-10-31 2014-05-14 阿里巴巴集团控股有限公司 Database data migration method and database data migration system
CN103838770A (en) * 2012-11-26 2014-06-04 中国移动通信集团北京有限公司 Logic data partition method and system
CN103559254A (en) * 2013-10-31 2014-02-05 上海上讯信息技术有限公司 Storage system and method on basis of modules
CN103559254B (en) * 2013-10-31 2018-03-02 上海上讯信息技术股份有限公司 A kind of storage system and method based on module
CN105468651A (en) * 2014-09-12 2016-04-06 阿里巴巴集团控股有限公司 Data query method and system for relational database
CN104252535A (en) * 2014-09-16 2014-12-31 福建新大陆软件工程有限公司 Hbase-based data hash processing method and device
CN104252544A (en) * 2014-09-30 2014-12-31 北京华智凯科技有限公司 Big data mining method and device
CN105573971A (en) * 2014-10-10 2016-05-11 富士通株式会社 Table reconstruction apparatus and method
CN105573971B (en) * 2014-10-10 2018-09-25 富士通株式会社 Table reconfiguration device and method
CN104462462B (en) * 2014-12-16 2017-11-07 用友软件股份有限公司 Change the data warehouse modeling method and model building device of frequency based on business
CN104462462A (en) * 2014-12-16 2015-03-25 用友软件股份有限公司 Service change frequency based data warehouse modeling method and device
WO2016197852A1 (en) * 2015-06-09 2016-12-15 阿里巴巴集团控股有限公司 Data processing method and device
CN105022791A (en) * 2015-06-19 2015-11-04 华南理工大学 Novel KV distributed data storage method
CN108153744A (en) * 2016-12-02 2018-06-12 上海中兴软件有限责任公司 A kind of data storage system maintenance method and device
CN106933992A (en) * 2017-02-24 2017-07-07 北京华安普惠高新技术有限公司 Distributed data purging system and method based on data analysis
CN106933992B (en) * 2017-02-24 2018-02-06 北京华安普惠高新技术有限公司 Distributed data purging system and method based on data analysis
WO2019052162A1 (en) * 2017-09-15 2019-03-21 平安科技(深圳)有限公司 Method, apparatus and device for improving data cleaning efficiency, and readable storage medium
CN107832333A (en) * 2017-09-29 2018-03-23 北京邮电大学 Method and system based on distributed treatment and DPI data structure user network data fingerprint
CN107832333B (en) * 2017-09-29 2022-05-10 北京邮电大学 Method and system for constructing user network data fingerprint based on distributed processing and DPI data
CN107908610A (en) * 2017-12-04 2018-04-13 北京中燕信息技术有限公司 A kind of data processing method and device
CN108304486A (en) * 2017-12-29 2018-07-20 北京欧链科技有限公司 A kind of data processing method and device based on block chain
CN109857832A (en) * 2019-01-03 2019-06-07 中国银行股份有限公司 A kind of preprocess method and device of payment data
CN110287199B (en) * 2019-07-01 2021-11-16 联想(北京)有限公司 Database processing method and electronic equipment
CN110287199A (en) * 2019-07-01 2019-09-27 联想(北京)有限公司 A kind of processing method and electronic equipment of database
CN110825739A (en) * 2019-10-30 2020-02-21 京东数字科技控股有限公司 Table building statement generation method, device, equipment and storage medium
CN112231406A (en) * 2020-10-20 2021-01-15 浪潮云信息技术股份公司 Distributed cloud data centralized processing method
CN112307721A (en) * 2020-10-30 2021-02-02 广州朗国电子科技有限公司 Method for quickly converting third-party interface data into customized form and storage medium
CN112597219A (en) * 2020-12-15 2021-04-02 中国建设银行股份有限公司 Method and device for importing large-data-volume text file into distributed database
CN113759884A (en) * 2021-11-08 2021-12-07 西安热工研究院有限公司 Method and system for generating input/output point product file of distributed control system
CN117633024A (en) * 2024-01-23 2024-03-01 天津南大通用数据技术股份有限公司 Database optimization method based on preprocessing optimization join
CN117633024B (en) * 2024-01-23 2024-04-23 天津南大通用数据技术股份有限公司 Database optimization method based on preprocessing optimization join

Also Published As

Publication number Publication date
CN102542071B (en) 2014-02-26

Similar Documents

Publication Publication Date Title
CN102542071B (en) Distributed data processing system and method
CN102867071B (en) Management method for massive network management historical data
CN104298760B (en) A kind of data processing method and data processing equipment applied to data warehouse
CN104111936B (en) Data query method and system
CN108629029A (en) A kind of data processing method and device applied to data warehouse
US20140101167A1 (en) Creation of Inverted Index System, and Data Processing Method and Apparatus
CN109241159B (en) Partition query method and system for data cube and terminal equipment
CN105956123A (en) Local updating software-based data processing method and apparatus
CN107247778A (en) System and method for implementing expansible data storage service
CN103631924B (en) A kind of application process and system of distributive database platform
CN109669925B (en) Management method and device of unstructured data
CN104239377A (en) Platform-crossing data retrieval method and device
CN103399942A (en) Data engine system supporting SaaS multi-tenant function and working method of data engine system
CN110489407A (en) Data filling mining method, apparatus, computer equipment and storage medium
CN105164673A (en) Query integration across databases and file systems
CN114297173B (en) Knowledge graph construction method and system for large-scale mass data
CN103793493A (en) Method and system for processing car-mounted terminal mass data
CN106055678A (en) Hadoop-based panoramic big data distributed storage method
CN110096509A (en) Realize that historical data draws the system and method for storage of linked list modeling processing under big data environment
CN111708895B (en) Knowledge graph system construction method and device
CN111625561B (en) Data query method and device
CN102332004A (en) Data processing method and system for managing mass data
CN101093482A (en) Method for storing and retrieving mass information
CN106503008A (en) File memory method and device and file polling method and apparatus
CN102411632A (en) Chain table-based memory database page type storage method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN LONGSHI MEDIA CO., LTD.

Free format text: FORMER OWNER: SHENZHEN COSHIP VIDEO COMMUNICATION CO., LTD.

Effective date: 20130407

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20130407

Address after: 4, 518057 floor, rainbow science and technology building, north high tech Zone, Nanshan District, Guangdong, Shenzhen

Applicant after: Shenzhen Longguan Media Co., Ltd.

Address before: 518057 B2-1 District, rainbow tech building, North Fifth Industrial Zone, north high tech Zone, Nanshan District, Guangdong, Shenzhen

Applicant before: Shenzhen Tongzhou Video Media Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN TONGZHOU ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENZHEN LONGSHI MEDIA CO., LTD.

Effective date: 20140521

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140521

Address after: 518057 rainbow science and Technology Building (North West Road), Nanshan District hi tech Zone, Shenzhen, Guangdong

Patentee after: Shenzhen Tongzhou Electronic Co., Ltd.

Address before: 4, 518057 floor, rainbow science and technology building, north high tech Zone, Nanshan District, Guangdong, Shenzhen

Patentee before: Shenzhen Longguan Media Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140226

Termination date: 20150117

EXPY Termination of patent right or utility model