CN108121745A - Data loading method and device

Data loading method and device

Info

Publication number
CN108121745A
Authority
CN
China
Prior art keywords
data
tables
major key
file
key field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611085703.4A
Other languages
Chinese (zh)
Other versions
CN108121745B (en)
Inventor
陈叶超
刘云飞
齐骥
金振江
柯亮
钱岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd
Priority to CN201611085703.4A
Publication of CN108121745A
Application granted
Publication of CN108121745B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides a data loading method. The method includes: sorting the data to be loaded according to its primary key field and generating a data file; sampling the primary key field of the sorted data to generate a first primary key field; generating partition information of a data table according to the first primary key field, and partitioning the data table according to that partition information; grouping the data file according to the partition information of the data table, and generating partition files of the data table according to the grouping result; and loading the partition files of the data table into the corresponding partitions of the data table. An embodiment of the present invention also provides a data loading device.

Description

Data loading method and device
Technical field
The present invention relates to the field of data processing, and in particular to a data loading method and device.
Background
Before data in a database can be queried, the data tables in the database must first be loaded into the system; queries can be run only after loading is complete.
Existing data loading methods typically slice the data to be loaded according to the table partition information and then load the sliced data into a distributed database.
However, with this approach the loaded data may concentrate in a few data table partitions while the other partitions hold little or no data. The resulting skew in the data slices degrades data loading performance and the load balance across the data table partitions.
Summary of the invention
In view of this, embodiments of the present invention aim to provide a data loading method and device that solve the problems of low data loading performance and unbalanced load across data table partitions caused by data skew.
The technical solution of the embodiments of the present invention is implemented as follows:
A data loading method, including:
sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file;
sampling the primary key field of the sorted data to be loaded to generate a first primary key field;
generating partition information of a data table according to the first primary key field, and partitioning the data table according to the partition information of the data table;
grouping the data file according to the partition information of the data table, and generating partition files of the data table according to the grouping result;
loading the partition files of the data table into the corresponding partitions of the data table.
The method described above further includes:
slicing the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer;
correspondingly, sorting the data to be loaded according to the primary key field of the data to be loaded and generating a data file includes:
sorting each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and generating n data files.
In the method described above, sampling the primary key field of the sorted data to be loaded to generate the first primary key field includes:
sampling the primary key field of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields;
sorting the n groups of second primary key fields as a whole, and sampling the n groups of second primary key fields after the overall sort to generate the first primary key field.
In the method described above, generating the partition information of the data table according to the first primary key field and partitioning the data table according to the partition information of the data table includes:
obtaining the start field and end field of each partition interval of the data table according to the first primary key field;
partitioning the data table according to the start field and end field of each partition interval of the data table.
In the method described above, grouping the data file according to the partition information of the data table and generating the partition files of the data table according to the grouping result includes:
dividing the i-th data file into N_i groups of data files according to the partition information of the data table;
screening the N_1 + N_2 + … + N_n groups of data files for those that match the j-th partition information of the data table, and generating the j-th partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and s and N_i are positive integers.
The method described above further includes:
slicing the data in the partition files of the data table;
correspondingly, loading the partition files of the data table into the corresponding partitions of the data table includes:
loading the sliced partition files of the data table into the corresponding partitions of the data table.
A data loading device, including:
a sorting module, configured to sort the data to be loaded according to the primary key field of the data to be loaded, and to generate a data file;
a sampling module, configured to sample the primary key field of the sorted data to be loaded and generate a first primary key field;
a partitioning module, configured to generate partition information of a data table according to the first primary key field, and to partition the data table according to the partition information of the data table;
a processing module, configured to group the data file according to the partition information of the data table, and to generate partition files of the data table according to the grouping result;
a loading module, configured to load the partition files of the data table into the corresponding partitions of the data table.
The device described above further includes:
a slicing module, configured to slice the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer;
the sorting module is specifically configured to sort each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and to generate n data files.
In the device described above, the sampling module is specifically configured to sample the primary key field of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields, to sort the n groups of second primary key fields as a whole, and to sample the n groups of second primary key fields after the overall sort to generate the first primary key field.
In the device described above, the processing module includes:
a grouping unit, configured to divide the i-th data file into N_i groups of data files according to the partition information of the data table;
a screening unit, configured to screen the N_1 + N_2 + … + N_n groups of data files for those that match the j-th partition information of the data table, and to generate the j-th partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and s and N_i are positive integers.
With the data loading method and device provided by the embodiments of the present invention, the data to be loaded is sorted according to its primary key field and a data file is generated; the primary key field of the sorted data is sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data file is grouped according to the partition information of the data table, and partition files of the data table are generated according to the grouping result; and the partition files are loaded into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading task from concentrating on a subset of the partitions and therefore improves the data response rate and balances the load across the data table partitions.
Description of the drawings
Fig. 1 is a schematic flowchart of a data loading method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another data loading method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another data loading method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another data loading method provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of a data loading method provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a data loading device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another data loading device provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another data loading device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 is a schematic flowchart of a data loading method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: sort the data to be loaded according to the primary key field of the data to be loaded, and generate a data file.
Specifically, step 101 may be performed by a data loading device. The primary key field uniquely identifies the data to be loaded.
It should be noted that the data file generated from the sorted data is an ordered data file. File index information is written when the data file is generated, so the ordered data file becomes an indexable ordered data file in which the value of a given primary key field can be located quickly through the index.
Step 102: sample the primary key field of the sorted data to be loaded to generate a first primary key field.
Specifically, step 102 may be performed by the data loading device.
It should be noted that the primary key field of the sorted data may be sampled at fixed intervals, which gives a better sampling result. The interval can be set according to actual needs: if a more uniform sample is required, the interval can be made relatively small; if there is no strict requirement on the sampling result, the interval can be made relatively large. Correspondingly, a small interval yields relatively many first primary key fields, and a large interval yields relatively few.
Step 103: generate partition information of the data table according to the first primary key field, and partition the data table according to the partition information of the data table.
Specifically, step 103 may be performed by the data loading device.
Step 104: group the data file according to the partition information of the data table, and generate partition files of the data table according to the grouping result.
Specifically, step 104 may be performed by the data loading device.
It should be noted that a data file generated in step 101 may contain some data belonging to partition A of the data table as well as some data belonging to the adjacent partition B. The data file therefore needs to be split (that is, grouped according to the partition information of the data table) to generate the partition files of the data table.
Step 105: load the partition files of the data table into the corresponding partitions of the data table.
Specifically, step 105 may be performed by the data loading device. One partition of the data table corresponds to one or more partition files.
It should be noted that loading the partition files of the data table into the corresponding partitions means loading each partition file generated in step 104 into the data table partition to which that partition file belongs.
With the data loading method provided by this embodiment of the present invention, the data to be loaded is sorted according to its primary key field and a data file is generated; the primary key field of the sorted data is sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data file is grouped according to the partition information of the data table, and partition files of the data table are generated according to the grouping result; and the partition files are loaded into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading task from concentrating on a subset of the partitions and therefore improves the data response rate and balances the load across the data table partitions.
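Steps 101 to 105 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: in-memory lists stand in for the data file and the table partitions, the function names `sample_fixed_interval` and `load_balanced` are hypothetical, and each sampled key is assumed to close one partition interval (the `(start, end]` convention used in the interval example of the Fig. 3 embodiment), with one extra partition for keys above the last sample.

```python
from bisect import bisect_left


def sample_fixed_interval(sorted_keys, interval):
    """Step 102: keep every `interval`-th key of an already sorted key list."""
    return sorted_keys[interval - 1::interval]


def load_balanced(records, interval, key=lambda r: r):
    # Step 101: sort by primary key; the sorted list stands in for the data file.
    data_file = sorted(records, key=key)
    # Step 102: fixed-interval sampling yields the "first primary key fields".
    first_keys = sample_fixed_interval([key(r) for r in data_file], interval)
    # Step 103 (assumed convention): each sampled key is the inclusive upper
    # bound of one partition; a final partition catches keys above the last sample.
    partitions = [[] for _ in range(len(first_keys) + 1)]
    # Steps 104-105: route every record to the partition whose interval holds its key.
    for record in data_file:
        partitions[bisect_left(first_keys, key(record))].append(record)
    return first_keys, partitions


# Skewed input: nine small keys and one outlier.
boundaries, parts = load_balanced([1000, 7, 3, 9, 1, 5, 8, 2, 6, 4], interval=3)
print(boundaries)               # [3, 6, 9]
print([len(p) for p in parts])  # [3, 3, 3, 1]
```

Because the boundaries come from the data itself rather than from a fixed key range, the outlier key 1000 does not leave a wide, nearly empty partition range behind it.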
Fig. 2 is a schematic flowchart of another data loading method provided by an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step 201: the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Specifically, the n groups of preprocessed data obtained by slicing the data to be loaded may be handed to n distributed tasks, each of which processes one group of preprocessed data.
Step 202: the data loading device sorts each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.
It should be noted that if the n groups of preprocessed data from step 201 are handed to n distributed tasks, each distributed task processes one group as follows: it sorts that group of preprocessed data by primary key field and generates a data file.
Step 203: the data loading device samples the primary key field of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields.
It should be noted that if the n groups of preprocessed data from step 201 are handed to n distributed tasks, each distributed task processes one group as follows: it samples the primary key field of that sorted group of preprocessed data and generates second primary key fields.
It should also be noted that each distributed task may sample the primary key field of its sorted group of preprocessed data at fixed intervals, which gives a better sampling result. The interval can be set according to actual needs: if a more uniform sample is required, the interval can be made relatively small; if there is no strict requirement on the sampling result, the interval can be made relatively large.
Step 204: the data loading device sorts the n groups of second primary key fields as a whole, and samples the n groups of second primary key fields after the overall sort to generate the first primary key field.
Step 205: the data loading device generates partition information of the data table according to the first primary key field, and partitions the data table according to the partition information of the data table.
Step 206: the data loading device groups the data files according to the partition information of the data table, and generates partition files of the data table according to the grouping result.
Step 207: the data loading device loads the partition files of the data table into the corresponding partitions of the data table.
It should be noted that, for steps or concepts in this embodiment that are the same as in other embodiments, reference may be made to the descriptions in those embodiments; details are not repeated here.
With the data loading method provided by this embodiment of the present invention, the data to be loaded is sliced to obtain n groups of preprocessed data; each of the n groups is sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key field of each of the n sorted groups is sampled to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate the first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated according to the grouping result; finally, the partition files are loaded into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading task from concentrating on a subset of the partitions and therefore improves the data response rate and balances the load across the data table partitions.
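The two-level sampling of steps 201 to 204 can be sketched as follows. This is a minimal single-process illustration, assuming plain integer keys; the function names, the round-robin slicing and the two interval parameters are assumptions made for the example, and in the method described above each group would be handled by its own distributed task.

```python
def sample(sorted_keys, interval):
    """Fixed-interval sampling of an already sorted key list."""
    return sorted_keys[interval - 1::interval]


def first_primary_keys(keys, n, local_interval, global_interval):
    # Step 201: slice the input into n groups (round-robin, for illustration only).
    groups = [keys[i::n] for i in range(n)]
    # Step 202: each task sorts its own group, giving n ordered data files.
    data_files = [sorted(group) for group in groups]
    # Step 203: each task samples its own file, giving n groups of second keys.
    second_keys = [sample(data_file, local_interval) for data_file in data_files]
    # Step 204: sort all second keys together, then sample once more.
    merged = sorted(k for group in second_keys for k in group)
    return sample(merged, global_interval)


print(first_primary_keys(list(range(40, 0, -1)), n=4,
                         local_interval=5, global_interval=4))  # [20, 40]
```

Only the sampled keys travel to the global sort, so the final boundaries are computed from a small fraction of the input while still reflecting how the keys are distributed across all n groups.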
Fig. 3 is a schematic flowchart of another data loading method provided by an embodiment of the present invention. As shown in Fig. 3, the method includes the following steps:
Step 301: the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Step 302: the data loading device sorts each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.
Step 303: the data loading device samples the primary key field of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields.
Step 304: the data loading device sorts the n groups of second primary key fields as a whole, and samples the n groups of second primary key fields after the overall sort to generate the first primary key field.
Step 305: the data loading device obtains the start field and end field of each partition interval of the data table according to the first primary key field.
Specifically, assume that the primary key field is the user ID (0 to 30000) and that the sampled first primary key fields are 10000, 18000 and 24000. The start field and end field of the first partition interval of the data table obtained from the first primary key fields are then 0 and 10000, and the first partition interval can be written as partition 1: (0, 10000]. The start field and end field of the second partition interval are 10000 and 18000, and the second partition interval can be written as partition 2: (10000, 18000]. The start field and end field of the third partition interval are 18000 and 24000, and the third partition interval can be written as partition 3: (18000, 24000]. The start field and end field of the fourth partition interval are 24000 and 30000, and the fourth partition interval can be written as partition 4: (24000, 30000].
Step 306: the data loading device partitions the data table according to the start field and end field of each partition interval of the data table.
Specifically, partitioning the data table according to the start field and end field of each partition interval means that, once the data table has been partitioned, a given partition of the data table can store only the data files that belong to that partition.
Step 307: the data loading device groups the data files according to the partition information of the data table, and generates partition files of the data table according to the grouping result.
Step 308: the data loading device loads the partition files of the data table into the corresponding partitions of the data table.
It should be noted that, for steps or concepts in this embodiment that are the same as in other embodiments, reference may be made to the descriptions in those embodiments; details are not repeated here.
With the data loading method provided by this embodiment of the present invention, the data to be loaded is sliced to obtain n groups of preprocessed data; each of the n groups is sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key field of each of the n sorted groups is sampled to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate the first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated according to the grouping result; finally, the partition files are loaded into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading task from concentrating on a subset of the partitions and therefore improves the data response rate and balances the load across the data table partitions.
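The user ID example of step 305 can be reproduced with a short Python sketch. The helper names `partition_ranges` and `partition_of` are hypothetical, and passing the minimum and maximum key (0 and 30000) as explicit arguments is an assumption made for illustration; the `(start, end]` interval convention is taken from the example itself.

```python
from bisect import bisect_left


def partition_ranges(first_keys, min_key, max_key):
    """Turn sampled keys into (start, end] partition intervals."""
    bounds = [min_key, *first_keys, max_key]
    return list(zip(bounds, bounds[1:]))


def partition_of(key, ranges):
    """Return the 1-based number of the partition whose (start, end] holds key."""
    ends = [end for _, end in ranges]
    idx = bisect_left(ends, key)
    if idx == len(ranges) or key <= ranges[0][0]:
        raise ValueError(f"key {key} lies outside every partition interval")
    return idx + 1


ranges = partition_ranges([10000, 18000, 24000], min_key=0, max_key=30000)
print(ranges)
# [(0, 10000), (10000, 18000), (18000, 24000), (24000, 30000)]
print(partition_of(10000, ranges))  # 1, because intervals are closed on the right
print(partition_of(10001, ranges))  # 2
```

Note that with the intervals written exactly as in the example, user ID 0 itself falls outside partition 1, since `(0, 10000]` is open on the left; the sketch reports that case as an error rather than guessing an intended behavior.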
Fig. 4 is a schematic flowchart of another data loading method provided by an embodiment of the present invention. As shown in Fig. 4, the method includes the following steps:
Step 401: the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Step 402: the data loading device sorts each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.
Step 403: the data loading device samples the primary key field of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields.
Step 404: the data loading device sorts the n groups of second primary key fields as a whole, and samples the n groups of second primary key fields after the overall sort to generate the first primary key field.
Step 405: the data loading device obtains the start field and end field of each partition interval of the data table according to the first primary key field.
Step 406: the data loading device partitions the data table according to the start field and end field of each partition interval of the data table.
Step 407: divide the i-th data file into N_i groups of data files according to the partition information of the data table.
Specifically, according to the partition information of the data table, the 1st data file is divided into N_1 groups of data files, the 2nd data file is divided into N_2 groups of data files, …, and the n-th data file is divided into N_n groups of data files.
It should be noted that if the partition information of the data table shows that the i-th data file contains data belonging to 3 partitions, the file is divided into 3 groups of data files, that is, N_i = 3.
Step 408: screen the N_1 + N_2 + … + N_n groups of data files for those that match the j-th partition information of the data table, and generate the j-th partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and s and N_i are positive integers.
Specifically, assume that in step 406 the data table is divided into s partitions according to the start field and end field of each partition interval. Then the N_1 + N_2 + … + N_n groups of data files are screened for those that match the 1st partition information of the data table, and the 1st partition file of the data table is generated (this partition file corresponds to the 1st partition); the N_1 + N_2 + … + N_n groups of data files are screened for those that match the 2nd partition information, and the 2nd partition file of the data table is generated (this partition file corresponds to the 2nd partition); …; and the N_1 + N_2 + … + N_n groups of data files are screened for those that match the s-th partition information, and the s-th partition file of the data table is generated (this partition file corresponds to the s-th partition).
Step 409: the data loading device loads the partition files of the data table into the corresponding partitions of the data table.
It should be noted that, for steps or concepts in this embodiment that are the same as in other embodiments, reference may be made to the descriptions in those embodiments; details are not repeated here.
With the data loading method provided by this embodiment of the present invention, the data to be loaded is sliced to obtain n groups of preprocessed data; each of the n groups is sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key field of each of the n sorted groups is sampled to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate the first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated according to the grouping result; finally, the partition files are loaded into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading task from concentrating on a subset of the partitions and therefore improves the data response rate and balances the load across the data table partitions.
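The grouping and screening of steps 407 and 408 can be sketched as follows. This is an illustration under stated assumptions: each data file is a sorted list of integer keys, partitions are numbered from 0 and identified by the inclusive upper bounds of their intervals, the last bound is assumed to be at least the largest key, and the names `group_file` and `build_partition_files` are hypothetical. The groups are simply concatenated into partition files; the source does not say whether a partition file is re-sorted.

```python
from bisect import bisect_left
from itertools import groupby


def group_file(sorted_keys, ends):
    """Step 407: split one sorted data file into N_i groups, one per partition it touches."""
    return [(p, list(g))
            for p, g in groupby(sorted_keys, key=lambda k: bisect_left(ends, k))]


def build_partition_files(data_files, ends):
    """Step 408: collect, for each partition j, the groups that belong to it."""
    all_groups = [grp for data_file in data_files for grp in group_file(data_file, ends)]
    return [[k for p, keys in all_groups if p == j for k in keys]
            for j in range(len(ends))]


ends = [10000, 18000, 24000, 30000]  # upper bounds of partitions 0..3
files = [[5, 9000, 12000, 25000], [10000, 10001, 29999]]

print(len(group_file(files[0], ends)))     # 3, i.e. N_1 = 3
print(build_partition_files(files, ends))
# [[5, 9000, 10000], [12000, 10001], [], [25000, 29999]]
```

Because every data file is already sorted, the keys of one partition are contiguous within it, which is why a single pass with `groupby` is enough to cut a file into its N_i groups.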
Further, the data loading method provided by the embodiments of the present invention also includes:
slicing the data in the partition files of the data table;
correspondingly, loading the partition files of the data table into the corresponding partitions of the data table includes:
loading the sliced partition files of the data table into the corresponding partitions of the data table.
It should be noted that if a partition file of the data table is too large to load conveniently, the data in that partition file can be sliced and each slice handed to a separate subtask. This keeps the amount of data handled by each task reasonable, which addresses data skew and unbalanced load across the database table partitions.
A specific embodiment is given below to illustrate the data loading method provided by the present invention. Fig. 5 is a schematic diagram of the data loading method provided by the embodiment of the present invention. Assume that 1000G of user data needs to be loaded into a database, that the primary key field of the data is the user identification (ID), and that, by configuration, each table partition handles 10G of data, so 100 partitions need to be allocated. As shown in Fig. 5, the method is as follows:
Data slicing: number of data slices = (total data volume) / (maximum data volume handled by each slice). Assuming the maximum data volume handled by each slice is 256M, the number of data slices = 1000G / 256M = 4000. Therefore, the file information is read first, the 1000G of data files is sliced into pieces of 256M each, producing 4000 slices, and the 4000 slices are handed to distributed tasks for processing, with each task processing one 256M slice.
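The slice arithmetic above can be checked with a short Python sketch. It assumes binary units (1G = 1024M), which is what makes 1000G / 256M come out to exactly 4000; the helper names `slice_count` and `slice_ranges` are hypothetical, and rounding up for a final partial slice is an assumption for totals that do not divide evenly.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def slice_count(total_bytes, max_slice_bytes):
    """Number of slices needed so that no slice exceeds max_slice_bytes."""
    if max_slice_bytes <= 0:
        raise ValueError("max_slice_bytes must be positive")
    # Integer ceiling division; a trailing partial slice still counts as one slice.
    return (total_bytes + max_slice_bytes - 1) // max_slice_bytes


def slice_ranges(total_bytes, max_slice_bytes):
    """Yield (start, end) byte offsets for each slice; end is exclusive."""
    for start in range(0, total_bytes, max_slice_bytes):
        yield start, min(start + max_slice_bytes, total_bytes)


print(slice_count(1000 * GIB, 256 * MIB))  # 4000
```

In a real loader the byte offsets would also have to be aligned to record boundaries; that detail is omitted here.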
Local sampling and intermediate data file generation: each distributed task sorts its own data by the user ID field, then samples the sorted user ID fields and generates an intermediate data file. The intermediate data file is ordered, and file index information is written when it is generated, so that a given user ID field can be located quickly through the index.
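A minimal Python sketch of this per-task step follows. The text does not specify the sampling rule or the index layout, so two assumptions are made: every `sample_step`-th sorted key is sampled, and the index is a sparse list of (key, position) pairs written every `index_step` records. The function names and the `user_id` field name are hypothetical.

```python
from bisect import bisect_right


def local_sort_and_sample(records, key, sample_step, index_step):
    """Sort one slice by primary key, then derive a sample and a sparse index.

    Returns (sorted_records, sampled_keys, index), where index is a list of
    (key, position) pairs taken every index_step records.
    """
    ordered = sorted(records, key=lambda r: r[key])
    keys = [r[key] for r in ordered]
    sampled = keys[sample_step - 1::sample_step]
    index = [(keys[pos], pos) for pos in range(0, len(keys), index_step)]
    return ordered, sampled, index


def seek(index, target):
    """Return the record position from which a scan for target should start."""
    index_keys = [k for k, _ in index]
    i = bisect_right(index_keys, target) - 1
    return index[i][1] if i >= 0 else 0
```

Because the intermediate file is sorted, a lookup only needs to scan forward from the position returned by `seek` until it reaches the target key or a larger one.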
Final sampling: the data sampled in the previous step is sent to a pre-partitioning task, which sorts all of the locally sampled user ID fields and then samples them according to the number of pre-partitions, where number of pre-partitions = (total data volume) / (data volume handled by each table partition). Assuming each table partition handles 10G of data, the number of pre-partitions = 1000G / 10G = 100, so 100 data records are sampled; here each data record is a user ID field.
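The final sampling step might look like the following Python sketch. Picking keys at evenly spaced ranks of the globally sorted sample is an assumption made here (the text only says the sorted keys are sampled according to the number of pre-partitions), and the names `partition_count` and `final_sample` are hypothetical.

```python
GIB = 1024 ** 3


def partition_count(total_bytes, bytes_per_partition):
    """Number of pre-partitions, rounding up for a trailing partial partition."""
    return (total_bytes + bytes_per_partition - 1) // bytes_per_partition


def final_sample(local_samples, count):
    """Sort all locally sampled keys and pick `count` keys at evenly spaced ranks.

    The last key returned is the largest sampled key.
    """
    merged = sorted(k for sample in local_samples for k in sample)
    n = len(merged)
    if n < count:
        raise ValueError("fewer sampled keys than requested partitions")
    return [merged[(i * n) // count - 1] for i in range(1, count + 1)]
```

For the example in the text, `partition_count(1000 * GIB, 10 * GIB)` gives 100, so `final_sample` would be asked for 100 user ID values. If the key distribution contains many duplicates, adjacent sampled keys can coincide; deduplication is not handled in this sketch.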
Pre-partitioning: the data table is pre-divided into 100 partitions according to the 100 sampled user ID fields. Specifically, the 100 user ID fields are used in turn as the start fields and end fields of the partitions, which yields the interval information of each partition, expressed as (start value, end value]. Assuming the user ID fields obtained by the final sampling are 100000, 200000, 300000, …, the resulting data table partitions are: partition 1: (0, 100000], partition 2: (100000, 200000], partition 3: (200000, 300000], ….
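The half-open intervals (start value, end value] map naturally onto a binary search over the sorted end values. The Python sketch below illustrates this; the function names are hypothetical, and the lower bound of 0 for the first partition is taken from the example in the text.

```python
from bisect import bisect_left


def build_intervals(boundary_keys, lower=0):
    """Turn sorted boundary keys into (start, end] intervals."""
    starts = [lower] + list(boundary_keys[:-1])
    return list(zip(starts, boundary_keys))


def partition_of(key, boundary_keys, lower=0):
    """Index of the interval (start, end] that contains key, or None if outside."""
    if key <= lower:
        return None
    # bisect_left returns the first end value >= key, which matches the
    # closed right edge of each interval.
    j = bisect_left(boundary_keys, key)
    return j if j < len(boundary_keys) else None
```

With boundaries 100000, 200000, 300000, the key 100000 falls in the first interval and 100001 in the second, matching the (start, end] convention.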
Partition file generation: the previously generated intermediate data files are grouped by partition (each group's information includes the partition start field, the partition end field, and a file list), producing grouped intermediate data files. A partition file is generated from all the intermediate data files that belong to a given partition and is assigned to that partition. The partition file can also be sliced again at this point: each partition can have multiple slices, and each slice's information contains a group of file read information entries (each entry includes the file location, the position at which reading starts, and the position at which reading ends). For example, if each slice handles 1G of files, each partition has 10 (10G/1G) slices, and each slice handles one group of files.
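The slice information described above (a group of file read entries, each with a location, a start position, and an end position) can be sketched in Python as follows. This is an illustrative layout only: the class name `FileReadInfo`, the function `plan_slices`, and the use of byte offsets are assumptions, and a real implementation would also align offsets to record boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileReadInfo:
    path: str   # file location
    start: int  # byte offset where reading begins
    end: int    # byte offset where reading stops (exclusive)


def plan_slices(files, slice_size):
    """Pack (path, size_in_bytes) files into slices of at most slice_size bytes.

    Returns a list of slices; each slice is a list of FileReadInfo entries.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    slices, current, room = [], [], slice_size
    for path, size in files:
        offset = 0
        while offset < size:
            take = min(room, size - offset)
            current.append(FileReadInfo(path, offset, offset + take))
            offset += take
            room -= take
            if room == 0:  # slice is full, start a new one
                slices.append(current)
                current, room = [], slice_size
    if current:
        slices.append(current)
    return slices
```

A slice may span the tail of one file and the head of the next, which is why each slice carries a group of read entries rather than a single one. For a 10G partition file and 1G slices, `plan_slices` yields 10 slices, matching the example in the text.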
Partition file loading: the finally generated partition files of the data table are loaded into the partitions of the data table.
Fig. 6 is a structural diagram of a data loading device provided by an embodiment of the present invention. As shown in Fig. 6, the device 5 includes:
a sorting module 51, configured to sort the data to be loaded according to the primary key field of the data to be loaded, and to generate data files;
a sampling module 52, configured to sample the primary key fields of the sorted data to be loaded, and to generate first primary key fields;
a partitioning module 53, configured to generate the partition information of the data table according to the first primary key fields, and to partition the data table according to the partition information of the data table;
a processing module 54, configured to group the data files according to the partition information of the data table, and to generate the partition files of the data table according to the grouping result;
a loading module 55, configured to load the partition files of the data table into the corresponding partitions of the data table.
The data loading device provided by the embodiment of the present invention sorts the data to be loaded according to the primary key field of the data to be loaded and generates data files; samples the primary key fields of the sorted data to be loaded to generate first primary key fields; generates the partition information of the data table according to the first primary key fields and partitions the data table according to that partition information; groups the data files according to the partition information of the data table and generates the partition files of the data table according to the grouping result; and loads the partition files of the data table into the corresponding partitions of the data table. In this way, each partition of the data table bears a share of the data loading tasks, which prevents the loading tasks from being concentrated on a subset of the partitions; the data response rate is therefore improved and the load across the data table partitions is balanced.
Fig. 7 is a structural diagram of another data loading device provided by an embodiment of the present invention. As shown in Fig. 7, the device 5 further includes:
a slicing module 56, configured to slice the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Further, the sorting module 51 is specifically configured to sort each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and to generate n data files.
The sampling module 52 is specifically configured to sample the primary key fields of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields; and to sort the n groups of second primary key fields as a whole and sample the sorted n groups of second primary key fields to generate the first primary key fields.
The partitioning module 53 is specifically configured to obtain the start fields and end fields of the partition intervals of the data table according to the first primary key fields, and to partition the data table according to the start fields and end fields of the partition intervals of the data table.
Fig. 8 is a structural diagram of yet another data loading device provided by an embodiment of the present invention. As shown in Fig. 8, the processing module 54 includes:
a grouping unit 541, configured to divide the i-th data file into Ni groups of data files according to the partition information of the data table;
a screening unit 542, configured to screen, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the N1+N2+…+Nn groups of data files, and to generate the j-th partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and s and Ni are positive integers.
Further, the slicing module 56 is also configured to slice the data in the partition files of the data table.
The loading module 55 is also configured to load the sliced partition files of the data table into the corresponding partitions of the data table.
It should be noted that, for the interaction between the modules and units in this embodiment, reference may be made to the method embodiments corresponding to Figs. 1 to 4; details are not repeated here.
The data loading device provided by the embodiment of the present invention slices the data to be loaded to obtain n groups of preprocessed data; sorts each of the n groups of preprocessed data according to the primary key field of the preprocessed data and generates n data files; samples the primary key fields of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields; sorts the n groups of second primary key fields as a whole and samples them to generate first primary key fields; generates the partition information of the data table according to the first primary key fields and partitions the data table according to that partition information; groups the data files according to the partition information of the data table and generates the partition files of the data table according to the grouping result; and finally loads the partition files of the data table into the corresponding partitions of the data table. In this way, each partition of the data table bears a share of the data loading tasks, which prevents the loading tasks from being concentrated on a subset of the partitions; the data response rate is therefore improved and the load across the data table partitions is balanced.
In practical applications, the sorting module 51, sampling module 52, partitioning module 53, processing module 54, grouping unit 541, screening unit 542, loading module 55, and slicing module 56 can each be implemented by a central processing unit (Central Processing Unit, CPU), a microprocessor (Micro Processor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like located in the data storage device.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The foregoing is only a preferred embodiment of the present invention, is not intended to limit the scope of the present invention.

Claims (10)

1. A data loading method, characterized in that the method includes:
sorting the data to be loaded according to the primary key field of the data to be loaded, and generating data files;
sampling the primary key fields of the sorted data to be loaded, and generating first primary key fields;
generating partition information of a data table according to the first primary key fields, and partitioning the data table according to the partition information of the data table;
grouping the data files according to the partition information of the data table, and generating partition files of the data table according to the grouping result;
loading the partition files of the data table into the corresponding partitions of the data table.
2. The method according to claim 1, characterized in that the method further includes:
slicing the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer;
correspondingly, the sorting of the data to be loaded according to the primary key field of the data to be loaded, and generating data files, includes:
sorting each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and generating n data files.
3. The method according to claim 2, characterized in that the sampling of the primary key fields of the sorted data to be loaded to obtain the first primary key fields includes:
sampling the primary key fields of each of the n sorted groups of preprocessed data, and generating n groups of second primary key fields;
sorting the n groups of second primary key fields as a whole, and sampling the sorted n groups of second primary key fields to generate the first primary key fields.
4. The method according to claim 1, characterized in that the generating of partition information of the data table according to the first primary key fields, and partitioning the data table according to the partition information of the data table, includes:
obtaining the start fields and end fields of the partition intervals of the data table according to the first primary key fields;
partitioning the data table according to the start fields and end fields of the partition intervals of the data table.
5. The method according to claim 2, characterized in that the grouping of the data files according to the partition information of the data table, and generating partition files of the data table according to the grouping result, includes:
dividing the i-th data file into Ni groups of data files according to the partition information of the data table;
screening, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the N1+N2+…+Nn groups of data files, and generating the j-th partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and s and Ni are positive integers.
6. The method according to claim 1, characterized in that the method further includes:
slicing the data in the partition files of the data table;
correspondingly, the loading of the partition files of the data table into the corresponding partitions of the data table includes:
loading the sliced partition files of the data table into the corresponding partitions of the data table.
7. A data loading device, characterized in that the device includes:
a sorting module, configured to sort the data to be loaded according to the primary key field of the data to be loaded, and to generate data files;
a sampling module, configured to sample the primary key fields of the sorted data to be loaded, and to generate first primary key fields;
a partitioning module, configured to generate partition information of a data table according to the first primary key fields, and to partition the data table according to the partition information of the data table;
a processing module, configured to group the data files according to the partition information of the data table, and to generate partition files of the data table according to the grouping result;
a loading module, configured to load the partition files of the data table into the corresponding partitions of the data table.
8. The device according to claim 7, characterized in that the device further includes:
a slicing module, configured to slice the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer;
the sorting module is specifically configured to sort each of the n groups of preprocessed data according to the primary key field of the preprocessed data, and to generate n data files.
9. The device according to claim 8, characterized in that
the sampling module is specifically configured to sample the primary key fields of each of the n sorted groups of preprocessed data to generate n groups of second primary key fields; and to sort the n groups of second primary key fields as a whole and sample the sorted n groups of second primary key fields to generate the first primary key fields.
10. The device according to claim 8, characterized in that the processing module includes:
a grouping unit, configured to divide the i-th data file into Ni groups of data files according to the partition information of the data table;
a screening unit, configured to screen, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the N1+N2+…+Nn groups of data files, and to generate the j-th partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and s and Ni are positive integers.
CN201611085703.4A 2016-11-30 2016-11-30 Data loading method and device Active CN108121745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611085703.4A CN108121745B (en) 2016-11-30 2016-11-30 Data loading method and device

Publications (2)

Publication Number Publication Date
CN108121745A true CN108121745A (en) 2018-06-05
CN108121745B CN108121745B (en) 2021-08-06

Family

ID=62227013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611085703.4A Active CN108121745B (en) 2016-11-30 2016-11-30 Data loading method and device

Country Status (1)

Country Link
CN (1) CN108121745B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032766A (en) * 2018-06-14 2018-12-18 阿里巴巴集团控股有限公司 A kind of transaction methods, device and electronic equipment
CN111061738A (en) * 2019-12-16 2020-04-24 中国建设银行股份有限公司 Data table pre-grouping method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102486798A (en) * 2010-12-03 2012-06-06 腾讯科技(深圳)有限公司 Data loading method and device
CN105095413A (en) * 2015-07-09 2015-11-25 北京京东尚科信息技术有限公司 Method and apparatus for solving data skew
US20150356162A1 (en) * 2012-12-27 2015-12-10 Tencent Technology (Shenzhen) Company Limited Method and system for implementing analytic function based on mapreduce

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐海蛟博士: "Hadoop Mapreduce分区、分组、二次排序过程详解", 《HTTP://BLOG.SINA.COM.CN/S/BLOG_D76227260101D948.HTML》 *
贺正红 等: "面向HBase的大规模数据加载研究", 《计算机系统应用》 *

Also Published As

Publication number Publication date
CN108121745B (en) 2021-08-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant