CN108121745B - Data loading method and device - Google Patents


Info

Publication number
CN108121745B
Authority
CN
China
Prior art keywords
data
data table
partition
file
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611085703.4A
Other languages
Chinese (zh)
Other versions
CN108121745A (en)
Inventor
陈叶超
刘云飞
齐骥
金振江
柯亮
钱岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a data loading method comprising: sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file; sampling the primary key field of the sorted data to be loaded to generate a first primary key field; generating partition information of a data table according to the first primary key field, and partitioning the data table according to the partition information of the data table; grouping the data files according to the partition information of the data table, and generating partition files of the data table according to the grouping result; and loading each partition file of the data table into the corresponding partition of the data table. An embodiment of the invention also provides a data loading device.

Description

Data loading method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a data loading method and apparatus.
Background
When data is queried in a database, the data table must first be loaded into the system; the data can be queried only after loading is complete.
Existing data loading methods usually slice the data to be loaded according to table partition information and then load the sliced data into a distributed database.
However, this approach may concentrate the loaded data in a few partitions of the data table while the other partitions receive no data or only a small amount. The resulting skew in the data slices degrades data loading performance and unbalances the load across the partitions of the data table.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data loading method and apparatus to solve the problems of low data loading performance and unbalanced load across data table partitions caused by data skew.
The technical scheme of the embodiment of the invention is realized as follows:
A data loading method, comprising:
sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file;
sampling the primary key field of the sorted data to be loaded to generate a first primary key field;
generating partition information of a data table according to the first primary key field, and partitioning the data table according to the partition information of the data table;
grouping the data files according to the partition information of the data table, and generating partition files of the data table according to grouping results;
and loading the partition file of the data table into the corresponding partition of the data table.
The method as described above, further comprising:
slicing the data to be loaded to obtain n groups of preprocessed data; wherein n is a positive integer;
correspondingly, sorting the data to be loaded according to the primary key field of the data to be loaded and generating a data file includes:
sorting the n groups of preprocessed data according to the primary key field of the preprocessed data, and generating n data files.
In the method as described above, sampling the primary key field of the sorted data to be loaded to generate a first primary key field includes:
sampling the primary key fields of the n groups of sorted preprocessed data respectively to generate n groups of second primary key fields;
and sorting the n groups of second primary key fields as a whole, sampling the globally sorted second primary key fields, and generating the first primary key field.
In the method as described above, generating partition information of a data table according to the first primary key field and partitioning the data table according to the partition information of the data table includes:
acquiring a start field and an end field of a partition interval of the data table according to the first primary key field;
and partitioning the data table according to the start field and the end field of the partition interval of the data table.
In the method as described above, grouping the data files according to the partition information of the data table and generating the partition files of the data table according to the grouping result includes:
dividing the i-th data file into $N_i$ groups of data files according to the partition information of the data table;
according to the j-th partition information of the data table, screening, from the $N_1 + N_2 + \dots + N_n$ groups of data files, the data files that satisfy the j-th partition information, and generating the j-th partition file of the data table; wherein $i = 1, 2, \dots, n$, $j = 1, 2, \dots, s$, and each $N_i$ is a positive integer.
The method as described above, further comprising:
slicing data in a partition file of the data table;
correspondingly, loading the partition file of the data table into the corresponding partition of the data table includes:
and loading the partition file of the data table subjected to data slicing into the corresponding partition of the data table.
A data loading apparatus comprising:
the sorting module is used for sorting the data to be loaded according to the primary key field of the data to be loaded and generating a data file;
the sampling module is used for sampling the primary key field of the sorted data to be loaded and generating a first primary key field;
the partitioning module is used for generating partition information of a data table according to the first primary key field and partitioning the data table according to the partition information of the data table;
the processing module is used for grouping the data files according to the partition information of the data table and generating partition files of the data table according to grouping results;
and the loading module is used for loading the partition file of the data table into the corresponding partition of the data table.
The apparatus as described above, further comprising:
the slicing module is used for slicing the data to be loaded to obtain n groups of preprocessing data; wherein n is a positive integer;
the sorting module is specifically configured to sort the n sets of preprocessed data according to the primary key field of the preprocessed data, and generate n data files.
In the foregoing apparatus, the sampling module is specifically configured to sample the primary key fields of the n groups of sorted preprocessed data respectively and generate n groups of second primary key fields; and to sort the n groups of second primary key fields as a whole, sample the globally sorted second primary key fields, and generate the first primary key field.
The apparatus as described above, the processing module comprising:
a grouping unit for dividing the i-th data file into $N_i$ groups of data files according to the partition information of the data table;
a screening unit for screening, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the $N_1 + N_2 + \dots + N_n$ groups of data files, and generating the j-th partition file of the data table; wherein $i = 1, 2, \dots, n$, $j = 1, 2, \dots, s$, and each $N_i$ is a positive integer.
With the data loading method and device provided by the embodiments of the invention, the data to be loaded is sorted according to its primary key field and a data file is generated; the primary key field of the sorted data is sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned accordingly; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; and each partition file is loaded into the corresponding partition of the data table. In this way, every partition of the data table can take on a share of the loading work, so the work is not concentrated in a subset of partitions, data loading performance improves, and the load across partitions is balanced.
Drawings
Fig. 1 is a schematic flowchart of a data loading method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another data loading method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another data loading method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another data loading method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a data loading method according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a data loading apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 is a schematic flowchart of a data loading method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
Step 101, sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file.
Specifically, the step 101 of sorting the data to be loaded according to the primary key field of the data to be loaded and generating the data file may be implemented by the data loading apparatus. The primary key field is used to uniquely identify the data that needs to be loaded.
It should be noted that the data file generated from the sorted data is an ordered data file. File index information is written when the data file is generated, so the ordered data file becomes an indexable ordered data file in which the value of a given primary key field can be located quickly through the index.
Step 102, sampling the primary key field of the sorted data to be loaded, and generating a first primary key field.
Specifically, the step 102 of sampling the primary key field of the sorted data to be loaded, and generating the first primary key field may be implemented by the data loading apparatus.
It should be noted that a fixed-interval sampling method may be used to sample the primary key field of the sorted data to be loaded, so as to achieve a better sampling effect. The size of the fixed interval can be set according to actual needs: a relatively small interval gives a more uniform sampling effect and yields relatively many first primary key fields, while a relatively large interval, acceptable when there is no strict requirement on the sampling effect, yields relatively few first primary key fields.
Step 103, generating partition information of the data table according to the first primary key field, and partitioning the data table according to the partition information of the data table.
Specifically, the step 103 of generating partition information of the data table according to the first primary key field, and partitioning the data table according to the partition information of the data table may be implemented by the data loading apparatus.
Step 104, grouping the data files according to the partition information of the data table, and generating the partition files of the data table according to the grouping result.
Specifically, the step 104 of grouping the data files according to the partition information of the data table and generating the partition files of the data table according to the grouping result may be implemented by the data loading apparatus.
It should be noted that a data file generated in step 101 may contain data belonging to partition A of the data table as well as data belonging to the adjacent partition B. The data file therefore needs to be divided (that is, the data files are grouped according to the partition information of the data table) to generate the partition files of the data table.
Step 105, loading the partition file of the data table into the corresponding partition of the data table.
Specifically, the step 105 of loading the partition file of the data table into the corresponding partition of the data table may be implemented by the data loading apparatus. Each partition of the data table corresponds to one or more partition files.
Loading the partition file of the data table into the corresponding partition of the data table means loading the partition file of the data table generated in step 104 into the data table partition to which the partition file belongs.
In the data loading method provided by this embodiment of the invention, the data to be loaded is sorted according to its primary key field and a data file is generated; the primary key field of the sorted data is sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned accordingly; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; and each partition file is loaded into the corresponding partition of the data table. In this way, every partition of the data table can take on a share of the loading work, so the work is not concentrated in a subset of partitions, data loading performance improves, and the load across partitions is balanced.
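The fixed-interval sampling described for step 102 can be illustrated with a minimal Python sketch. The function name `sample_fixed_interval` and the choice to take the last key of each run of `interval` keys are assumptions made for illustration; the source does not specify which key within an interval is sampled.

```python
def sample_fixed_interval(sorted_keys, interval):
    """Return every `interval`-th key from an already sorted sequence.

    Assumption: the sample is the last key of each run of `interval` keys.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(sorted_keys[interval - 1::interval])


# Hypothetical primary key values, sorted as in step 101.
keys = sorted([17, 3, 42, 8, 25, 31, 12, 5, 39, 21, 28, 14])

fine = sample_fixed_interval(keys, 2)    # small interval: more sampled keys
coarse = sample_fixed_interval(keys, 4)  # large interval: fewer sampled keys

print(fine)    # [5, 12, 17, 25, 31, 42]
print(coarse)  # [12, 25, 42]
```

As the text notes, the smaller interval yields more sampled keys (six here) than the larger one (three here).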
Fig. 2 is a schematic flow chart of another data loading method according to an embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:
Step 201, the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Specifically, n groups of preprocessed data obtained by slicing the data to be loaded by the data loading device may be processed by n distributed tasks, and each distributed task processes one group of preprocessed data.
Step 202, the data loading device sorts the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.
It should be noted that if the n groups of preprocessed data from step 201 are handed to n distributed tasks, each distributed task processes one group as follows: it sorts the group of preprocessed data according to the primary key field and generates one data file.
Step 203, the data loading device samples the primary key fields of the n groups of sorted preprocessed data respectively to generate n groups of second primary key fields.
It should be noted that if the n groups of preprocessed data from step 201 are handed to n distributed tasks, each distributed task processes one group as follows: it samples the primary key field of its sorted group of preprocessed data and generates one group of second primary key fields.
It should be further noted that each distributed task may use a fixed-interval sampling method to sample the primary key field of its sorted group of preprocessed data, so as to achieve a better sampling effect. The size of the fixed interval can be set according to actual needs: relatively small if a more uniform sampling effect is required, and relatively large if there is no strict requirement on the sampling effect.
Step 204, the data loading device sorts the n groups of second primary key fields as a whole, samples the globally sorted second primary key fields, and generates a first primary key field.
Step 205, the data loading device generates partition information of the data table according to the first primary key field, and partitions the data table according to the partition information of the data table.
Step 206, the data loading device groups the data files according to the partition information of the data table, and generates the partition files of the data table according to the grouping result.
Step 207, the data loading device loads the partition file of the data table into the corresponding partition of the data table.
It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.
In the data loading method provided by this embodiment of the invention, the data to be loaded is sliced to obtain n groups of preprocessed data; the n groups of preprocessed data are sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key fields of the n groups of sorted preprocessed data are sampled respectively to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned accordingly; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; finally, each partition file is loaded into the corresponding partition of the data table. In this way, every partition of the data table can take on a share of the loading work, so the work is not concentrated in a subset of partitions, data loading performance improves, and the load across partitions is balanced.
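The two-stage sampling of steps 202 to 204 can be sketched as follows. This is a single-process illustration under stated assumptions: the slices are processed in a loop rather than by distributed tasks, the function names are hypothetical, and the final stage picks `num_partitions - 1` evenly spaced split points, which is one possible reading of "sampling the globally sorted second primary key fields".

```python
import heapq


def sample_fixed_interval(sorted_keys, interval):
    """Return every `interval`-th key of a sorted sequence (hypothetical helper)."""
    return list(sorted_keys[interval - 1::interval])


def two_stage_split_points(slices, local_interval, num_partitions):
    """Derive split points (first primary key fields) from n slices of keys."""
    # Stage 1: sort each slice and sample it locally (second primary key fields).
    local_samples = [
        sample_fixed_interval(sorted(s), local_interval) for s in slices
    ]
    # Stage 2: merge the already sorted local samples into one global order.
    merged = list(heapq.merge(*local_samples))
    if num_partitions < 2 or not merged:
        return []
    # Assumption: pick num_partitions - 1 evenly spaced keys as split points.
    return [
        merged[(len(merged) * k) // num_partitions]
        for k in range(1, num_partitions)
    ]


slices = [
    [9, 1, 5, 3, 7, 11],
    [2, 10, 6, 4, 8, 12],
    [15, 13, 17, 14, 16, 18],
]
split_points = two_stage_split_points(slices, local_interval=2, num_partitions=3)
print(split_points)  # [8, 14]
```

Only the small local samples are merged and sorted globally, so the global step never has to sort the full data set.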
Fig. 3 is a schematic flowchart of another data loading method according to an embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:
Step 301, the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Step 302, the data loading device sorts the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.
Step 303, the data loading device samples the primary key fields of the n groups of sorted preprocessed data respectively to generate n groups of second primary key fields.
Step 304, the data loading device sorts the n groups of second primary key fields as a whole, samples the globally sorted second primary key fields, and generates a first primary key field.
Step 305, the data loading device obtains a start field and an end field of the partition interval of the data table according to the first primary key field.
Specifically, assume that the primary key field is a user ID ranging from 0 to 30000 and that the sampled first primary key fields are 10000, 18000, and 24000. The start field and end field of the first partition interval of the data table are then 0 and 10000, so the first partition interval can be written as partition 1: (0, 10000). Likewise, the second partition interval has start field 10000 and end field 18000 and can be written as partition 2: (10000, 18000); the third has start field 18000 and end field 24000 and can be written as partition 3: (18000, 24000); and the fourth has start field 24000 and end field 30000 and can be written as partition 4: (24000, 30000).
Step 306, the data loading device partitions the data table according to the start field and the end field of the partition interval of the data table.
Specifically, partitioning the data table according to the start field and the end field of the partition interval of the data table means that, once the data table is partitioned, each partition of the data table can store only the data files that belong to that partition.
Step 307, the data loading device groups the data files according to the partition information of the data table, and generates partition files of the data table according to the grouping result.
Step 308, the data loading device loads the partition file of the data table into the corresponding partition of the data table.
It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.
In the data loading method provided by this embodiment of the invention, the data to be loaded is sliced to obtain n groups of preprocessed data; the n groups of preprocessed data are sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key fields of the n groups of sorted preprocessed data are sampled respectively to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned accordingly; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; finally, each partition file is loaded into the corresponding partition of the data table. In this way, every partition of the data table can take on a share of the loading work, so the work is not concentrated in a subset of partitions, data loading performance improves, and the load across partitions is balanced.
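The user ID example given for step 305 can be reproduced with a short Python sketch that turns sampled keys into start and end fields. The function names are hypothetical. The source does not say whether the interval boundaries are inclusive, so the lookup below assumes half-open intervals in which a key equal to a split point belongs to the partition that starts at that split point.

```python
from bisect import bisect_right


def build_intervals(key_min, key_max, split_points):
    """Turn sampled keys into (start field, end field) pairs, one per partition."""
    bounds = [key_min, *sorted(split_points), key_max]
    return list(zip(bounds[:-1], bounds[1:]))


def partition_index(key, split_points):
    """Zero-based partition for a key; `split_points` must be sorted.

    Assumption: intervals are half-open, [start, end).
    """
    return bisect_right(split_points, key)


split_points = [10000, 18000, 24000]
intervals = build_intervals(0, 30000, split_points)
print(intervals)
# [(0, 10000), (10000, 18000), (18000, 24000), (24000, 30000)]
print(partition_index(9999, split_points))   # 0, i.e. partition 1
print(partition_index(10000, split_points))  # 1, i.e. partition 2
```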
Fig. 4 is a schematic flowchart of another data loading method according to an embodiment of the present invention, and as shown in fig. 4, the method includes the following steps:
Step 401, the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.
Step 402, the data loading device sorts n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.
Step 403, the data loading device samples the primary key fields of the n groups of sorted preprocessed data respectively to generate n groups of second primary key fields.
Step 404, the data loading device sorts the n sets of second primary key fields as a whole, and samples the n sets of second primary key fields after the whole sorting to generate a first primary key field.
Step 405, the data loading apparatus obtains a start field and an end field of the partition interval of the data table according to the first primary key field.
Step 406, the data loading device partitions the data table according to the start field and the end field of the partition interval of the data table.
Step 407, dividing the i-th data file into $N_i$ groups of data files according to the partition information of the data table.
Specifically, according to the partition information of the data table, the 1st data file is divided into $N_1$ groups of data files, the 2nd data file into $N_2$ groups of data files, …, and the n-th data file into $N_n$ groups of data files.
It should be noted that if, according to the partition information of the data table, the i-th data file is determined to contain data belonging to 3 partitions, the file is divided into 3 groups of data files, that is, $N_i = 3$.
Step 408, according to the j-th partition information of the data table, screening, from the $N_1 + N_2 + \dots + N_n$ groups of data files, the data files that satisfy the j-th partition information, and generating the j-th partition file of the data table; wherein $i = 1, 2, \dots, n$, $j = 1, 2, \dots, s$, and each $N_i$ is a positive integer.
Specifically, assume that in step 406 the data table is divided into s partitions according to the start field and the end field of the partition interval of the data table. Then, according to the 1st partition information of the data table, the data files that satisfy the 1st partition information are screened from the $N_1 + N_2 + \dots + N_n$ groups of data files, and the 1st partition file of the data table is generated (this partition file corresponds to the 1st partition); according to the 2nd partition information of the data table, the data files that satisfy the 2nd partition information are screened from the $N_1 + N_2 + \dots + N_n$ groups of data files, and the 2nd partition file of the data table is generated (this partition file corresponds to the 2nd partition); …; and according to the s-th partition information of the data table, the data files that satisfy the s-th partition information are screened from the $N_1 + N_2 + \dots + N_n$ groups of data files, and the s-th partition file of the data table is generated (this partition file corresponds to the s-th partition).
Step 409, the data loading device loads the partition file of the data table into the corresponding partition of the data table.
It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.
In the data loading method provided by this embodiment of the invention, the data to be loaded is sliced to obtain n groups of preprocessed data; the n groups of preprocessed data are sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key fields of the n groups of sorted preprocessed data are sampled respectively to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned accordingly; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; finally, each partition file is loaded into the corresponding partition of the data table. In this way, every partition of the data table can take on a share of the loading work, so the work is not concentrated in a subset of partitions, data loading performance improves, and the load across partitions is balanced.
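Steps 407 and 408 can be sketched in Python as below. Data files are modelled as sorted lists of keys, and all names are hypothetical. Two assumptions go beyond the source: keys equal to a split point are assigned to the partition that starts there, and the groups selected for a partition are merged into a single sorted partition file (the source also allows a partition to correspond to several partition files).

```python
from bisect import bisect_right
from collections import defaultdict
import heapq


def group_file(sorted_keys, split_points):
    """Step 407: split one sorted data file into groups, one per partition it touches.

    The number of groups returned is N_i for this file.
    """
    groups = defaultdict(list)
    for key in sorted_keys:
        groups[bisect_right(split_points, key)].append(key)
    return dict(groups)


def build_partition_files(data_files, split_points):
    """Step 408: for each partition j, screen the matching groups and merge them."""
    grouped = [group_file(f, split_points) for f in data_files]
    num_partitions = len(split_points) + 1  # s partitions
    partition_files = []
    for j in range(num_partitions):
        matching = [g[j] for g in grouped if j in g]
        # Assumption: merge the sorted groups into one sorted partition file.
        partition_files.append(list(heapq.merge(*matching)))
    return partition_files


data_files = [[1, 4, 12], [6, 9, 15, 22], [3, 25]]
split_points = [5, 10, 20]

print(len(group_file(data_files[1], split_points)))  # 3, so N_2 = 3
print(build_partition_files(data_files, split_points))
# [[1, 3, 4], [6, 9], [12, 15], [22, 25]]
```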
Further, the data loading method provided by the embodiment of the present invention further includes:
slicing data in a partition file of a data table;
correspondingly, loading the partition file of the data table into the corresponding partition of the data table includes:
and loading the partition file of the data table subjected to data slicing into the corresponding partition of the data table.
It should be noted that, if a partition file of the data table is too large to be loaded easily, the data in the partition file may be sliced, with each slice processed by one sub-task. This keeps the amount of data handled by each task reasonable and addresses the problems of data skew and unbalanced load across the partitions of the data table.
The following provides a specific embodiment to explain the data loading method provided by the present invention. Fig. 5 is a schematic diagram of a data loading method according to an embodiment of the present invention. Assume that 1000G of user data needs to be loaded into a database, that the primary key field of the data is the user identifier (ID), and that each table partition handles 10G of data, so that 100 partitions need to be allocated. As shown in Fig. 5, the method includes the following steps:
Slicing data: the number of data slices = (total amount of data)/(maximum amount of data processed per slice). Assuming that the maximum amount of data processed per slice is 256M, the number of data slices is 1000G/256M = 4000 (taking 1G = 1024M). Therefore, file information is read first, the 1000G data file is sliced at 256M per slice to generate 4000 data slices, and the 4000 slices are handed to distributed tasks for processing, each task processing one 256M slice.
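The slice count in this step is a ceiling division of the total size by the per-slice maximum. A minimal Python sketch of that arithmetic follows. The function name `plan_slices`, the byte-range representation, and the binary units (1G = 1024M) are assumptions made for illustration; a real loader would also need to align slice boundaries with record boundaries.

```python
GIB = 1024 ** 3  # assumption: "G" in the example means 2**30 bytes
MIB = 1024 ** 2  # assumption: "M" in the example means 2**20 bytes


def plan_slices(total_bytes, max_slice_bytes):
    """Return (offset, length) byte ranges that cover the whole input.

    Hypothetical helper: the name and the return shape are illustrative
    and do not come from the patent text.
    """
    if total_bytes < 0 or max_slice_bytes <= 0:
        raise ValueError("total must be non-negative and slice size positive")
    # Ceiling division without floating point.
    count = -(-total_bytes // max_slice_bytes)
    return [
        (i * max_slice_bytes,
         min(max_slice_bytes, total_bytes - i * max_slice_bytes))
        for i in range(count)
    ]


slices = plan_slices(1000 * GIB, 256 * MIB)
print(len(slices))  # 4000, one slice per distributed task
```

Only the last slice can be shorter than the maximum, so every task receives a bounded amount of work.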
Local sampling and generating an intermediate data file: each distributed task sorts its own data according to the user ID field, samples the user ID field after sorting, and generates an intermediate data file. The intermediate data files are ordered; file index information is written when each intermediate data file is generated, so that a given user ID field can be located quickly through the index.
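The work of a single distributed task in this step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: records are modeled as dictionaries with a `user_id` key, the intermediate file is an in-memory sorted list, the index is a sparse list of (key, position) pairs, and the names `sort_and_sample` and `seek` are hypothetical. Primary keys are assumed to be unique.

```python
from bisect import bisect_right


def sort_and_sample(records, sample_every, index_every):
    """Sort one slice by user ID, sample keys, and build a sparse index."""
    ordered = sorted(records, key=lambda record: record["user_id"])
    keys = [record["user_id"] for record in ordered]
    # Fixed-interval sampling: keep every `sample_every`-th sorted key.
    samples = keys[sample_every - 1::sample_every]
    # Sparse index: remember the position of every `index_every`-th key.
    index = [(keys[pos], pos) for pos in range(0, len(keys), index_every)]
    return ordered, samples, index


def seek(index, user_id):
    """Return the position from which a scan for `user_id` should start."""
    index_keys = [key for key, _ in index]
    slot = bisect_right(index_keys, user_id) - 1
    return index[slot][1] if slot >= 0 else 0


records = [{"user_id": uid} for uid in (50, 10, 40, 20, 30, 60)]
ordered, samples, index = sort_and_sample(records, sample_every=2, index_every=3)
print(samples)          # [20, 40, 60]
print(index)            # [(10, 0), (40, 3)]
print(seek(index, 45))  # 3
```

Because the file is sorted, a reader only needs to scan forward from the position returned by `seek` instead of reading the file from the beginning.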
Final sampling: the data sampled in the previous step is sent to a pre-partitioning task. The pre-partitioning task fully sorts the locally sampled user ID fields and then samples them according to the number of pre-partitions, where the number of pre-partitions = (total amount of data)/(amount of data processed by each table partition). Assuming that each table partition processes 10G of data, the number of pre-partitions is 1000G/10G = 100, so 100 data records are sampled, where each data record is a user ID field.
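A minimal sketch of the pre-partitioning task's sampling is given below. It assumes the locally sampled keys arrive as one list per task and that the final sample is taken at evenly spaced positions of the globally sorted list; the function name `pick_split_keys` and the exact spacing rule are assumptions, since the text only states that sampling follows the number of pre-partitions.

```python
def pick_split_keys(local_samples, partition_count):
    """Globally sort the locally sampled keys and pick one key per partition."""
    merged = sorted(key for group in local_samples for key in group)
    if not 0 < partition_count <= len(merged):
        raise ValueError("need at least one sampled key per partition")
    # Evenly spaced positions; integer arithmetic avoids rounding surprises.
    return [
        merged[(i + 1) * len(merged) // partition_count - 1]
        for i in range(partition_count)
    ]


# Number of pre-partitions from the example: 1000G / 10G.
partition_count = 1000 // 10
print(partition_count)  # 100

# Toy input: keys sampled by two tasks, split into three partitions.
print(pick_split_keys([[20, 40, 60], [10, 30, 50]], 3))  # [20, 40, 60]
```

Sampling the already-sampled keys keeps the pre-partitioning task small: it sorts only the local samples rather than the full 1000G of data.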
Pre-partitioning: the data table is pre-partitioned into 100 partitions according to the 100 sampled user ID fields. Specifically, the 100 user ID fields are taken in sequence as the start field and end field of each partition, thereby generating partition interval information, which can be represented as (start value, end value). Assuming that the finally sampled user ID fields are 100000, 200000, 300000, …, the partitions of the data table are partition 1 (0, 100000), partition 2 (100000, 200000), partition 3 (200000, 300000), ….
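The interval construction in this step can be illustrated with a short sketch. The text does not say whether interval endpoints are inclusive, so the code assumes half-open intervals [start, end) and a lowest key of 0; both choices, and the names `build_intervals` and `find_partition`, are assumptions. Partition numbers in the code are 0-based, whereas the text counts from partition 1.

```python
from bisect import bisect_right


def build_intervals(split_keys, lowest=0):
    """Turn sampled keys into consecutive (start, end) partition intervals."""
    bounds = [lowest] + sorted(split_keys)
    return list(zip(bounds[:-1], bounds[1:]))


def find_partition(intervals, key):
    """Return the 0-based partition whose [start, end) holds key, else None."""
    if not intervals or key < intervals[0][0]:
        return None
    ends = [end for _, end in intervals]
    slot = bisect_right(ends, key)
    return slot if slot < len(intervals) else None


intervals = build_intervals([100000, 200000, 300000])
print(intervals)  # [(0, 100000), (100000, 200000), (200000, 300000)]
print(find_partition(intervals, 150000))  # 1
```

Under this reading a key equal to the last sampled value falls outside every interval, so a practical system would probably leave the final partition open-ended.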
Generating a partition file: the generated intermediate data files are grouped according to the partitions (each piece of grouping information includes a partition start field, a partition end field, and a file list) to produce grouped intermediate data files; a partition file is then generated from all the intermediate data files belonging to a given partition and assigned to the corresponding partition. Here, the partition file may be sliced again: each partition may have a plurality of slices, and each piece of slice information includes a set of file reading information (each piece of file reading information includes a file position, a start reading position in the file, and an end reading position in the file). For example, if each slice processes 1G of files, each partition has 10 (10G/1G) slices, and each slice processes a set of files.
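The grouping information and the slice information described in this step can be modeled as plain data. The sketch below is one possible reading: intermediate files are sorted in-memory key lists, read positions are record positions rather than byte offsets, intervals are half-open, and the names `group_files` and `reslice` are hypothetical.

```python
from bisect import bisect_left


def group_files(sorted_files, intervals):
    """For each partition, list the (file, start, stop) ranges that belong to it."""
    groups = []
    for start, end in intervals:
        reads = []
        for name, keys in sorted_files.items():
            lo = bisect_left(keys, start)  # first key >= partition start
            hi = bisect_left(keys, end)    # first key >= partition end
            if lo < hi:
                reads.append((name, lo, hi))
        groups.append({"start": start, "end": end, "files": reads})
    return groups


def reslice(reads, max_records):
    """Split one partition's read ranges into slices of at most max_records."""
    if max_records <= 0:
        raise ValueError("max_records must be positive")
    slices, current, room = [], [], max_records
    for name, lo, hi in reads:
        while lo < hi:
            take = min(room, hi - lo)
            current.append((name, lo, lo + take))
            lo += take
            room -= take
            if room == 0:  # slice is full, start a new one
                slices.append(current)
                current, room = [], max_records
    if current:
        slices.append(current)
    return slices


files = {"f0": [5, 15, 25], "f1": [8, 12, 30]}
groups = group_files(files, [(0, 10), (10, 20), (20, 40)])
print(groups[0]["files"])  # [('f0', 0, 1), ('f1', 0, 1)]
print(reslice([("f0", 0, 3), ("f1", 0, 2)], max_records=2))
```

Because every intermediate file is already sorted, each partition maps to one contiguous range per file, which is why a start and an end reading position are enough to describe what a sub-task must read.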
Loading a partition file: the finally generated partition files of the data table are loaded into the partitions of the data table.
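To see why the steps of this embodiment tend to balance the partitions, the whole flow can be run on a small in-memory data set. The sketch below is a simplification under stated assumptions: keys are unique integers, slices and partitions are Python lists, and, unlike the (start value, end value) intervals in the text, the code uses one fewer split key than partitions so that the first and last partitions are open-ended. The function name `load` and all parameter values are illustrative.

```python
import random
from bisect import bisect_left


def load(keys, slice_size, sample_every, partition_count):
    """Return the keys that end up in each table partition."""
    # Slicing: cut the input into fixed-size groups.
    slices = [keys[i:i + slice_size] for i in range(0, len(keys), slice_size)]
    # Local sort and fixed-interval sampling, one "task" per slice.
    files = [sorted(group) for group in slices]
    local = [f[sample_every - 1::sample_every] for f in files]
    # Final sampling: sort all local samples, then pick evenly spaced keys.
    merged = sorted(key for group in local for key in group)
    splits = [
        merged[(i + 1) * len(merged) // partition_count - 1]
        for i in range(partition_count - 1)
    ]
    # Grouping and loading: route each key to the partition that covers it.
    partitions = [[] for _ in range(partition_count)]
    for f in files:
        for key in f:
            partitions[bisect_left(splits, key)].append(key)
    return partitions


keys = list(range(10_000))
random.Random(7).shuffle(keys)
sizes = [len(p) for p in load(keys, slice_size=1000,
                              sample_every=10, partition_count=4)]
print(sizes)
```

With these parameters each of the four sizes should be close to the ideal 2500, because the split keys are quantiles of samples drawn evenly from every sorted slice.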
Fig. 6 is a schematic structural diagram of a data loading apparatus according to an embodiment of the present invention, and as shown in fig. 6, the apparatus 5 includes:
the sorting module 51 is configured to sort the data to be loaded according to the primary key field of the data to be loaded, and generate a data file;
the sampling module 52 is configured to sample the primary key field of the sorted data to be loaded, and generate a first primary key field;
the partitioning module 53 is configured to generate partitioning information of the data table according to the first primary key field, and partition the data table according to the partitioning information of the data table;
the processing module 54 is configured to group the data files according to the partition information of the data table, and generate partition files of the data table according to a grouping result;
and the loading module 55 is configured to load the partition file of the data table into the corresponding partition of the data table.
The data loading device provided by the embodiment of the invention sorts the data to be loaded according to the primary key field of the data to be loaded and generates a data file; samples the primary key field of the sorted data to be loaded to generate a first primary key field; generates partition information of the data table according to the first primary key field and partitions the data table according to that partition information; groups the data files according to the partition information of the data table and generates partition files of the data table according to the grouping result; and loads the partition files of the data table into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading from concentrating on a few partitions, improves data loading performance, and balances the load across the partitions of the data table.
Fig. 7 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention, and as shown in fig. 7, the apparatus 5 further includes:
a slicing module 56, configured to slice data to be loaded to obtain n sets of preprocessed data; wherein n is a positive integer.
Further, the sorting module 51 is specifically configured to sort the n groups of preprocessed data according to the primary key field of the preprocessed data, and generate n data files.
The sampling module 52 is specifically configured to sample the primary key fields of the n sorted groups of preprocessed data separately and generate n groups of second primary key fields; and to sort the n groups of second primary key fields globally and sample the globally sorted n groups of second primary key fields to generate a first primary key field.
The partitioning module 53 is specifically configured to obtain a start field and an end field of a partition interval of the data table according to the first primary key field; and partitioning the data table according to the start field and the end field of the partition interval of the data table.
Fig. 8 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention, and as shown in fig. 8, the processing module 54 includes:
a grouping unit 541, configured to divide the ith data file into $N_i$ groups of data files according to the partition information of the data table.
A screening unit 542, configured to screen, according to the jth partition information of the data table, the data files that satisfy the jth partition information from the $N_1 + N_2 + \dots + N_n$ groups of data files, and to generate the jth partition file of the data table; where i = 1, 2, …, n, j = 1, 2, …, s, and the $N_i$ are all positive integers.
Further, the slicing module 56 is further configured to slice data in the partition file of the data table.
The loading module 55 is further configured to load the partition file of the data table after data slicing into the corresponding partition of the data table.
It should be noted that, for the interaction between the modules and units in the present embodiment, reference may be made to the method embodiments corresponding to Figs. 1 to 4, and details are not described herein again.
The data loading device provided by the embodiment of the invention slices the data to be loaded to obtain n groups of preprocessed data; sorts the n groups of preprocessed data according to the primary key field of the preprocessed data and generates n data files; samples the primary key fields of the n sorted groups of preprocessed data separately to generate n groups of second primary key fields; sorts the n groups of second primary key fields globally and samples them to generate a first primary key field; generates partition information of the data table according to the first primary key field and partitions the data table according to that partition information; groups the data files according to the partition information of the data table and generates partition files of the data table according to the grouping result; and finally loads the partition files of the data table into the corresponding partitions of the data table. In this way, every partition of the data table takes on a share of the data loading task, which prevents the loading from concentrating on a few partitions, improves data loading performance, and balances the load across the partitions of the data table.
In practical applications, the sorting module 51, the sampling module 52, the partitioning module 53, the processing module 54, the grouping unit 541, the screening unit 542, the loading module 55, and the slicing module 56 may each be implemented by a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like located in a data storage device.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method for loading data, the method comprising:
sorting the data to be loaded according to the main key field of the data to be loaded, and generating a data file;
sampling the sorted main key fields of the data to be loaded at fixed intervals to generate a first main key field;
generating partition information of a data table according to the first primary key field, and partitioning the data table according to the partition information of the data table;
grouping the data files according to the partition information of the data table, and generating partition files of the data table according to grouping results;
and loading the partition file of the data table into the corresponding partition of the data table.
2. The method of claim 1, further comprising:
slicing the data to be loaded to obtain n groups of preprocessed data; wherein n is a positive integer;
correspondingly, the sorting the data to be loaded according to the primary key field of the data to be loaded and generating a data file includes:
and sequencing the n groups of preprocessed data according to the primary key fields of the preprocessed data, and generating n data files.
3. The method of claim 2, wherein sampling the sorted primary key field of the data to be loaded to obtain a first primary key field comprises:
respectively sampling the main key fields of the n groups of sorted preprocessed data to generate n groups of second main key fields;
and integrally sequencing the n groups of second main key fields, sampling the integrally sequenced n groups of second main key fields, and generating the first main key field.
4. The method of claim 1, wherein the generating partition information of a data table according to the first primary key field and partitioning the data table according to the partition information of the data table comprises:
acquiring a start field and an end field of a partition interval of the data table according to the first primary key field;
and partitioning the data table according to the start field and the end field of the partition interval of the data table.
5. The method according to claim 2, wherein the grouping the data files according to the partition information of the data table and generating the partition files of the data table according to the grouping result comprises:
dividing the ith data file into $N_i$ groups of data files according to the partition information of the data table;
screening, according to the jth partition information of the data table, the data files that satisfy the jth partition information from the $N_1 + N_2 + \dots + N_n$ groups of data files, and generating the jth partition file of the data table; wherein i = 1, 2, …, n, j = 1, 2, …, s, and the $N_i$ are all positive integers.
6. The method of claim 1, further comprising:
slicing data in a partition file of the data table;
correspondingly, loading the partition file of the data table into the corresponding partition of the data table includes:
and loading the partition file of the data table subjected to data slicing into the corresponding partition of the data table.
7. A data loading apparatus, characterized in that the apparatus comprises:
the sorting module is used for sorting the data needing to be loaded according to the main key field of the data needing to be loaded and generating a data file;
the sampling module is used for sampling the main key fields of the sequenced data to be loaded at fixed intervals to generate a first main key field;
the partitioning module is used for generating partitioning information of a data table according to the first main key field and partitioning the data table according to the partitioning information of the data table;
the processing module is used for grouping the data files according to the partition information of the data table and generating partition files of the data table according to grouping results;
and the loading module is used for loading the partition file of the data table into the corresponding partition of the data table.
8. The apparatus of claim 7, further comprising:
the slicing module is used for slicing the data to be loaded to obtain n groups of preprocessing data; wherein n is a positive integer;
the sorting module is specifically configured to sort the n sets of preprocessed data according to the primary key field of the preprocessed data, and generate n data files.
9. The apparatus of claim 8,
the sampling module is specifically configured to sample the primary key fields of the n groups of sorted preprocessed data respectively, and generate n groups of second primary key fields; and integrally sequencing the n groups of second main key fields, sampling the integrally sequenced n groups of second main key fields, and generating the first main key field.
10. The apparatus of claim 8, wherein the processing module comprises:
a grouping unit for dividing the ith data file into $N_i$ groups of data files according to the partition information of the data table;
a screening unit for screening, according to the jth partition information of the data table, the data files that satisfy the jth partition information from the $N_1 + N_2 + \dots + N_n$ groups of data files, and generating the jth partition file of the data table; wherein i = 1, 2, …, n, j = 1, 2, …, s, and the $N_i$ are all positive integers.
CN201611085703.4A 2016-11-30 2016-11-30 Data loading method and device Active CN108121745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611085703.4A CN108121745B (en) 2016-11-30 2016-11-30 Data loading method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611085703.4A CN108121745B (en) 2016-11-30 2016-11-30 Data loading method and device

Publications (2)

Publication Number Publication Date
CN108121745A CN108121745A (en) 2018-06-05
CN108121745B true CN108121745B (en) 2021-08-06

Family

ID=62227013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611085703.4A Active CN108121745B (en) 2016-11-30 2016-11-30 Data loading method and device

Country Status (1)

Country Link
CN (1) CN108121745B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032766A (en) * 2018-06-14 2018-12-18 阿里巴巴集团控股有限公司 A kind of transaction methods, device and electronic equipment
CN111061738A (en) * 2019-12-16 2020-04-24 中国建设银行股份有限公司 Data table pre-grouping method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102486798A (en) * 2010-12-03 2012-06-06 腾讯科技(深圳)有限公司 Data loading method and device
CN105095413A (en) * 2015-07-09 2015-11-25 北京京东尚科信息技术有限公司 Method and apparatus for solving data skew
US20150356162A1 (en) * 2012-12-27 2015-12-10 Tencent Technology (Shenzhen) Company Limited Method and system for implementing analytic function based on mapreduce

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
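Detailed explanation of the Hadoop MapReduce partitioning, grouping, and secondary sorting process; Dr. Xu Haijiao; 《http://blog.sina.com.cn/s/blog_d76227260101d948.html》; 20130928; pp. 1-4 *
Research on large-scale data loading for HBase; He Zhenghong et al.; 《Computer Systems & Applications》; 20160615; Vol. 25, No. 6; pp. 231-237 *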

Also Published As

Publication number Publication date
CN108121745A (en) 2018-06-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant