CN108121745B

CN108121745B - Data loading method and device

Info

Publication number: CN108121745B
Application number: CN201611085703.4A
Authority: CN
Inventors: 陈叶超; 刘云飞; 齐骥; 金振江; 柯亮; 钱岭
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2021-08-06
Anticipated expiration: 2036-11-30
Also published as: CN108121745A

Abstract

An embodiment of the invention provides a data loading method comprising the following steps: sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file; sampling the primary key field of the sorted data to be loaded to generate a first primary key field; generating partition information of a data table according to the first primary key field, and partitioning the data table according to the partition information of the data table; grouping the data files according to the partition information of the data table, and generating partition files of the data table according to the grouping result; and loading each partition file of the data table into the corresponding partition of the data table. An embodiment of the invention also provides a data loading device.

Description

Data loading method and device

Technical Field

The present invention relates to the field of data processing, and in particular, to a data loading method and apparatus.

Background

Before data in a database can be queried, the data table in the database must first be loaded into the system; the data can be queried only after loading is complete.

The existing data loading method usually slices data to be loaded according to table partition information, and then loads the sliced data into a distributed database.

However, this data loading method may concentrate the loaded data in a few partitions of the data table, while the other partitions of the data table hold no data or only a small amount. The data slices are therefore skewed, which degrades data loading performance and unbalances the load across the partitions of the data table.

Disclosure of Invention

In view of this, embodiments of the present invention are intended to provide a data loading method and apparatus that solve the problems of low data loading performance and unbalanced load across data table partitions caused by data skew.

The technical scheme of the embodiment of the invention is realized as follows:

a method of data loading, comprising:

sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file;

sampling the primary key field of the sorted data to be loaded to generate a first primary key field;

generating partition information of a data table according to the first primary key field, and partitioning the data table according to the partition information of the data table;

grouping the data files according to the partition information of the data table, and generating partition files of the data table according to grouping results;

and loading the partition file of the data table into the corresponding partition of the data table.

The method as described above, further comprising:

slicing the data to be loaded to obtain n groups of preprocessed data; wherein n is a positive integer;

correspondingly, the sorting the data to be loaded according to the primary key field of the data to be loaded and generating a data file includes:

sorting the n groups of preprocessed data according to the primary key field of the preprocessed data, and generating n data files.

In the method as described above, sampling the primary key field of the sorted data to be loaded to generate a first primary key field includes:

sampling the primary key fields of the n sorted groups of preprocessed data, respectively, to generate n groups of second primary key fields;

and sorting the n groups of second primary key fields as a whole, sampling the globally sorted second primary key fields, and generating the first primary key field.

In the method as described above, generating partition information of a data table according to the first primary key field and partitioning the data table according to the partition information of the data table includes:

acquiring a start field and an end field of a partition interval of the data table according to the first primary key field;

and partitioning the data table according to the start field and the end field of the partition interval of the data table.

In the method as described above, grouping the data files according to the partition information of the data table and generating the partition files of the data table according to the grouping result includes:

dividing the i-th data file into N_i groups of data files according to the partition information of the data table;

screening, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the N_1 + N_2 + … + N_n groups of data files, and generating the j-th partition file of the data table; wherein i = 1, 2, …, n; j = 1, 2, …, s; and each N_i is a positive integer.

The method as described above, further comprising:

slicing data in a partition file of the data table;

correspondingly, loading the partition file of the data table into the corresponding partition of the data table includes:

and loading the partition file of the data table subjected to data slicing into the corresponding partition of the data table.

A data loading apparatus comprising:

a sorting module, configured to sort the data to be loaded according to the primary key field of the data to be loaded and generate a data file;

a sampling module, configured to sample the primary key field of the sorted data to be loaded and generate a first primary key field;

a partitioning module, configured to generate partition information of a data table according to the first primary key field and partition the data table according to the partition information of the data table;

the processing module is used for grouping the data files according to the partition information of the data table and generating partition files of the data table according to grouping results;

and the loading module is used for loading the partition file of the data table into the corresponding partition of the data table.

The apparatus as described above, further comprising:

the slicing module is used for slicing the data to be loaded to obtain n groups of preprocessing data; wherein n is a positive integer;

the sorting module is specifically configured to sort the n sets of preprocessed data according to the primary key field of the preprocessed data, and generate n data files.

In the foregoing apparatus, the sampling module is specifically configured to sample the primary key fields of the n sorted groups of preprocessed data, respectively, and generate n groups of second primary key fields; and to sort the n groups of second primary key fields as a whole, sample the globally sorted second primary key fields, and generate the first primary key field.

In the apparatus as described above, the processing module comprises:

a grouping unit, configured to divide the i-th data file into N_i groups of data files according to the partition information of the data table;

a screening unit, configured to screen, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the N_1 + N_2 + … + N_n groups of data files, and generate the j-th partition file of the data table; wherein i = 1, 2, …, n; j = 1, 2, …, s; and each N_i is a positive integer.

With the data loading method and device provided by the embodiments of the invention, the data to be loaded is sorted according to its primary key field and a data file is generated; the primary key field of the sorted data is sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; and each partition file is loaded into the corresponding partition of the data table. Each partition of the data table therefore takes on a share of the data loading task, which prevents loading from concentrating in a few partitions, improves data loading performance, and balances the load across the partitions of the data table.

Drawings

Fig. 1 is a schematic flowchart of a data loading method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another data loading method according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of another data loading method according to an embodiment of the present invention;

Fig. 4 is a schematic flowchart of another data loading method according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a data loading method according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a data loading apparatus according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

Fig. 1 is a schematic flowchart of a data loading method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:

Step 101, sorting the data to be loaded according to the primary key field of the data to be loaded, and generating a data file.

Specifically, the step 101 of sorting the data to be loaded according to the primary key field of the data to be loaded and generating the data file may be implemented by the data loading apparatus. The primary key field is used to uniquely identify the data that needs to be loaded.

It should be noted that the data files generated from the sorted data are ordered data files. File index information is written when a data file is generated, so the ordered data file becomes an indexable ordered data file in which the value of a given primary key field can be located quickly through the index.
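The indexable ordered data file can be illustrated with a minimal sketch. The in-memory row list, the sparse index layout, and the names `build_sorted_file`, `lookup`, and `index_stride` are assumptions made for illustration; the embodiment does not specify the file or index format.

```python
import bisect

def build_sorted_file(records, key=lambda r: r[0], index_stride=2):
    """Sort records by primary key and build a sparse index over the sorted rows.

    Hypothetical layout: the index holds (primary key, row offset) for every
    index_stride-th row of the sorted data.
    """
    rows = sorted(records, key=key)
    index = [(key(rows[i]), i) for i in range(0, len(rows), index_stride)]
    return rows, index

def lookup(rows, index, wanted, key=lambda r: r[0]):
    """Find a primary key by jumping to the nearest index entry, then scanning."""
    index_keys = [k for k, _ in index]
    pos = bisect.bisect_right(index_keys, wanted) - 1
    if pos < 0:
        return None  # wanted is smaller than every indexed key
    start = index[pos][1]
    end = index[pos + 1][1] if pos + 1 < len(index) else len(rows)
    for row in rows[start:end]:
        if key(row) == wanted:
            return row
    return None
```

Only the rows between two neighbouring index entries are scanned, which is one way an index can make a primary key value quick to locate.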

Step 102, sampling the primary key field of the sorted data to be loaded, and generating a first primary key field.

Specifically, the step 102 of sampling the primary key field of the sorted data to be loaded, and generating the first primary key field may be implemented by the data loading apparatus.

It should be noted that a fixed-interval sampling method may be adopted to sample the primary key field of the sorted data to be loaded, so as to achieve a better sampling effect. The size of the fixed interval can be set according to actual needs: if a more uniform sampling effect is required, the interval can be set relatively small; if there is no strict requirement on the sampling effect, the interval can be set relatively large. Correspondingly, a small interval yields a relatively large number of sampled first primary key fields, and a large interval yields a relatively small number.
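Fixed-interval sampling over sorted keys can be sketched as follows. The function name `sample_fixed_interval` and the choice to take the last key of each interval are assumptions; the embodiment only states that the interval is fixed.

```python
def sample_fixed_interval(sorted_keys, interval):
    """Return every interval-th key from an already sorted key list.

    Assumption: the sample is the last key of each full interval.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return sorted_keys[interval - 1::interval]
```

For ten sorted keys, an interval of 2 yields five samples and an interval of 5 yields two, which matches the relationship between interval size and sample count described above.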

Step 103, generating partition information of the data table according to the first primary key field, and partitioning the data table according to the partition information of the data table.

Specifically, the step 103 of generating partition information of the data table according to the first primary key field, and partitioning the data table according to the partition information of the data table may be implemented by the data loading apparatus.

Step 104, grouping the data files according to the partition information of the data table, and generating the partition files of the data table according to the grouping result.

Specifically, the step 104 of grouping the data files according to the partition information of the data table and generating the partition files of the data table according to the grouping result may be implemented by the data loading apparatus.

It should be noted that a data file generated in step 101 may contain data belonging to partition A of the data table as well as data belonging to partition B, which is adjacent to partition A. The data file therefore needs to be divided (that is, the data files are grouped according to the partition information of the data table) to generate the partition files of the data table.

Step 105, loading the partition file of the data table into the corresponding partition of the data table.

Specifically, the step 105 of loading the partition file of the data table into the corresponding partition of the data table may be implemented by the data loading apparatus. Each partition of the data table corresponds to one or more partition files.

Loading the partition file of the data table into the corresponding partition of the data table means loading the partition file of the data table generated in step 104 into the data table partition to which the partition file belongs.

In the data loading method provided by this embodiment of the invention, the data to be loaded is sorted according to its primary key field and a data file is generated; the primary key field of the sorted data is sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; and each partition file is loaded into the corresponding partition of the data table. Each partition of the data table therefore takes on a share of the data loading task, which prevents loading from concentrating in a few partitions, improves data loading performance, and balances the load across the partitions of the data table.

Fig. 2 is a schematic flow chart of another data loading method according to an embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:

step 201, the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.

Specifically, n groups of preprocessed data obtained by slicing the data to be loaded by the data loading device may be processed by n distributed tasks, and each distributed task processes one group of preprocessed data.

Step 202, the data loading device sorts the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.

It should be noted that if, in step 201, the n groups of preprocessed data are handed to n distributed tasks, each distributed task processes one group of preprocessed data as follows: it sorts the group of preprocessed data according to the primary key field and generates a data file.

Step 203, the data loading device respectively samples the primary key fields of the n groups of sorted preprocessed data to generate n groups of second primary key fields.

It should be noted that if, in step 201, the n groups of preprocessed data are handed to n distributed tasks, each distributed task processes one group of preprocessed data as follows: it samples the primary key field of the sorted group of preprocessed data to generate a group of second primary key fields.

It should be further noted that each distributed task may use a fixed-interval sampling method to sample the primary key field of its sorted group of preprocessed data, so as to achieve a better sampling effect. The size of the fixed interval can be set according to actual needs: if a more uniform sampling effect is required, the interval can be set relatively small, and if there is no strict requirement on the sampling effect, the interval can be set relatively large.
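The per-task work of steps 202 and 203 can be sketched as below. Threads from Python's `concurrent.futures` stand in for the distributed tasks purely for illustration; the embodiment does not name a task framework, and `sort_and_sample` and `run_tasks` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_and_sample(group, interval):
    """One task: sort a group by primary key, then sample at a fixed interval."""
    ordered = sorted(group)
    return ordered, ordered[interval - 1::interval]

def run_tasks(groups, interval):
    """Run one task per group; results come back in the order of the groups."""
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        return list(pool.map(lambda g: sort_and_sample(g, interval), groups))
```

Each result pairs the sorted group (the content of one data file) with its group of second primary key fields.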

Step 204, the data loading device sorts the n groups of second primary key fields as a whole, samples the globally sorted second primary key fields, and generates a first primary key field.

Step 205, the data loading device generates partition information of the data table according to the first primary key field, and partitions the data table according to the partition information of the data table.

Step 206, the data loading device groups the data files according to the partition information of the data table, and generates the partition files of the data table according to the grouping result.

Step 207, the data loading device loads the partition file of the data table into the corresponding partition of the data table.

It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.

In the data loading method provided by this embodiment of the invention, the data to be loaded is sliced to obtain n groups of preprocessed data; the n groups of preprocessed data are sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key fields of the n sorted groups are sampled separately to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; finally, each partition file is loaded into the corresponding partition of the data table. Each partition of the data table therefore takes on a share of the data loading task, which prevents loading from concentrating in a few partitions, improves data loading performance, and balances the load across the partitions of the data table.
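The overall sorting and second-level sampling of step 204 can be sketched as follows. This is a minimal illustration that assumes each group of second primary key fields is already sorted (as it is when sampled from a sorted group) and that the final sample is taken at an evenly spaced step; `final_sample` and `count` are hypothetical names.

```python
import heapq

def final_sample(local_samples, count):
    """Merge n sorted groups of sampled keys and sample `count` keys overall."""
    # heapq.merge performs the overall sort cheaply because every input
    # group is already in ascending order.
    merged = list(heapq.merge(*local_samples))
    if count <= 0 or not merged:
        return []
    step = max(1, len(merged) // count)
    return merged[step - 1::step][:count]
```

For example, three groups of three sampled keys each can be reduced to three first primary key fields spread across the whole key range.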

Fig. 3 is a schematic flowchart of another data loading method according to an embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:

step 301, the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.

Step 302, the data loading device sorts the n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.

Step 303, the data loading device respectively samples the primary key fields of the n groups of sorted preprocessed data to generate n groups of second primary key fields.

Step 304, the data loading device sorts the n groups of second primary key fields as a whole, samples the globally sorted second primary key fields, and generates a first primary key field.

Step 305, the data loading device obtains a start field and an end field of the partition interval of the data table according to the first primary key field.

Specifically, assume that the primary key field is a user ID (0 to 30000) and that the sampled first primary key fields are 10000, 18000, and 24000. The start field and the end field of the first partition interval of the data table obtained according to the first primary key field are then 0 and 10000, and the first partition interval may be represented as partition 1: (0, 10000); the start field and the end field of the second partition interval are 10000 and 18000, and the second partition interval may be represented as partition 2: (10000, 18000); the start field and the end field of the third partition interval are 18000 and 24000, and the third partition interval may be represented as partition 3: (18000, 24000); and the start field and the end field of the fourth partition interval are 24000 and 30000, and the fourth partition interval may be represented as partition 4: (24000, 30000).
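The derivation of start and end fields from the sampled keys can be written as a short sketch. The function name `partition_intervals` is hypothetical, and the embodiment does not state which end of each interval is inclusive, so the tuples below only record the (start, end) pair.

```python
def partition_intervals(sampled_keys, lower, upper):
    """Turn sampled primary keys into consecutive (start, end) intervals.

    lower and upper are the assumed bounds of the primary key range.
    """
    bounds = [lower] + sorted(sampled_keys) + [upper]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
```

With the sampled keys 10000, 18000, and 24000 over the range 0 to 30000, this reproduces the four partition intervals of the example.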

Step 306, the data loading device partitions the data table according to the start field and the end field of the partition interval of the data table.

Specifically, partitioning the data table according to the start field and the end field of the partition interval of the data table means that, after the data table is partitioned, each partition of the data table can store only the data files that belong to that partition.

Step 307, the data loading device groups the data files according to the partition information of the data table, and generates partition files of the data table according to the grouping result.

Step 308, the data loading device loads the partition file of the data table into the corresponding partition of the data table.

Fig. 4 is a schematic flowchart of another data loading method according to an embodiment of the present invention, and as shown in fig. 4, the method includes the following steps:

step 401, the data loading device slices the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.

Step 402, the data loading device sorts n groups of preprocessed data according to the primary key field of the preprocessed data, and generates n data files.

Step 403, the data loading device samples the primary key fields of the n groups of sorted preprocessed data respectively to generate n groups of second primary key fields.

Step 404, the data loading device sorts the n sets of second primary key fields as a whole, and samples the n sets of second primary key fields after the whole sorting to generate a first primary key field.

Step 405, the data loading apparatus obtains a start field and an end field of the partition interval of the data table according to the first primary key field.

Step 406, the data loading device partitions the data table according to the start field and the end field of the partition interval of the data table.

Step 407, dividing the i-th data file into N_i groups of data files according to the partition information of the data table.

Specifically, according to the partition information of the data table, the 1st data file is divided into N_1 groups of data files, the 2nd data file is divided into N_2 groups of data files, …, and the n-th data file is divided into N_n groups of data files.

It should be noted that if, according to the partition information of the data table, the i-th data file is determined to contain data of 3 partitions, the file is divided into 3 groups of data files, i.e., N_i = 3.

Step 408, screening, according to the j-th partition information of the data table, the data files that satisfy the j-th partition information from the N_1 + N_2 + … + N_n groups of data files, and generating the j-th partition file of the data table; wherein i = 1, 2, …, n; j = 1, 2, …, s; and each N_i is a positive integer.

Specifically, assume that in step 406 the data table is divided into s partitions according to the start field and the end field of the partition interval of the data table. Then, according to the 1st partition information of the data table, the data files that satisfy the 1st partition information are screened from the N_1 + N_2 + … + N_n groups of data files, and the 1st partition file of the data table is generated (this partition file corresponds to the 1st partition); according to the 2nd partition information of the data table, the data files that satisfy the 2nd partition information are screened from the N_1 + N_2 + … + N_n groups of data files, and the 2nd partition file of the data table is generated (this partition file corresponds to the 2nd partition); …; and according to the s-th partition information of the data table, the data files that satisfy the s-th partition information are screened from the N_1 + N_2 + … + N_n groups of data files, and the s-th partition file of the data table is generated (this partition file corresponds to the s-th partition).
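Steps 407 and 408 can be sketched as follows. The sketch assumes that each data file is a sorted list of primary keys, that the partition boundaries are the sorted first primary key fields, and that a key equal to a boundary belongs to the later partition; `split_file` and `build_partition_files` are hypothetical names.

```python
import bisect

def split_file(sorted_keys, boundaries):
    """Step 407 sketch: split one sorted file into groups, keyed by partition."""
    groups = {}
    start = 0
    for j in range(len(boundaries) + 1):
        if j < len(boundaries):
            # Index of the first key that is >= the j-th boundary.
            end = bisect.bisect_left(sorted_keys, boundaries[j])
        else:
            end = len(sorted_keys)
        if end > start:
            groups[j] = sorted_keys[start:end]  # non-empty groups only
        start = end
    return groups

def build_partition_files(files, boundaries):
    """Step 408 sketch: collect, per partition, the groups that fall inside it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for keys in files:
        for j, group in split_file(keys, boundaries).items():
            partitions[j].append(group)
    return partitions
```

The number of entries returned by `split_file` for the i-th file plays the role of N_i, and the j-th list returned by `build_partition_files` corresponds to the j-th partition file.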

Step 409, the data loading device loads the partition file of the data table into the corresponding partition of the data table.

In the data loading method provided by this embodiment of the invention, the data to be loaded is sliced to obtain n groups of preprocessed data; the n groups of preprocessed data are sorted according to the primary key field of the preprocessed data, and n data files are generated; the primary key fields of the n sorted groups are sampled separately to generate n groups of second primary key fields; the n groups of second primary key fields are sorted as a whole and sampled to generate a first primary key field; partition information of the data table is generated according to the first primary key field, and the data table is partitioned according to that partition information; the data files are grouped according to the partition information of the data table, and partition files of the data table are generated from the grouping result; finally, each partition file is loaded into the corresponding partition of the data table. Each partition of the data table therefore takes on a share of the data loading task, which prevents loading from concentrating in a few partitions, improves data loading performance, and balances the load across the partitions of the data table.

Further, the data loading method provided by the embodiment of the present invention further includes:

slicing data in a partition file of a data table;

It should be noted that if a partition file of the data table is too large to be loaded easily, the data in the partition file may be sliced and each slice processed by one sub-task. This keeps the amount of data processed by each task reasonable and addresses the problems of data skew and unbalanced load across the partitions of the data table.

A specific embodiment is given below to explain the data loading method provided by the present invention. Fig. 5 is a schematic diagram of a data loading method according to an embodiment of the present invention. Assume that 1000G of user data needs to be loaded into a database, that the primary key field of the data is the user identifier (ID), and that each table partition processes 10G of data, so that 100 partitions need to be allocated. As shown in fig. 5, the method includes the following steps:

Slicing data: the number of data slices = (total amount of data)/(maximum amount of data processed per slice). Assuming that the maximum amount of data processed per slice is 256M, the number of data slices = 1000G/256M = 4000. Therefore, file information is read first, the 1000G data file is sliced at 256M per slice to generate 4000 slices, and the 4000 slices are handed to distributed tasks for processing, with each task processing one 256M slice.
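The slice count can be checked with a short sketch. It assumes binary units (1G = 1024M), which is what makes 1000G/256M come out to exactly 4000; the names `slice_count` and `slice_ranges` are hypothetical.

```python
GIB = 1024 ** 3  # assumption: "G" in the example means 1024 M
MIB = 1024 ** 2

def slice_count(total_bytes, max_slice_bytes):
    """Number of slices, rounding up so a trailing partial slice is counted."""
    return -(-total_bytes // max_slice_bytes)

def slice_ranges(total_bytes, max_slice_bytes):
    """Byte ranges (start, end) handed to the individual tasks."""
    return [(start, min(start + max_slice_bytes, total_bytes))
            for start in range(0, total_bytes, max_slice_bytes)]
```

Here `slice_count(1000 * GIB, 256 * MIB)` evaluates to 4000, and `slice_ranges` lists the byte range each of the 4000 tasks would read.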

Local sampling and generating an intermediate data file: each distributed task sorts its data according to the user ID field, samples the user ID field after sorting, and generates an intermediate data file. The intermediate data files are ordered; file index information is written when they are generated, so a given user ID field can be located quickly through the index.

Final sampling: the data obtained by local sampling in the previous step is sent to a pre-partitioning task. The pre-partitioning task fully sorts the locally sampled user ID fields and then samples them according to the number of pre-partitions, where the number of pre-partitions = (total amount of data)/(amount of data processed by each table partition). Assuming that each table partition processes 10G of data, the number of pre-partitions = 1000G/10G = 100, so 100 data records are sampled, where each data record is a user ID field.

Pre-partitioning: 100 partitions are pre-created for the data table according to the 100 sampled user ID fields. Specifically, the 100 user ID fields are taken in sequence as the start fields and end fields of the partitions, thereby generating partition interval information, which can be represented as (start value, end value). Assuming that the finally sampled user ID fields are 100000, 200000, 300000, …, the partitions of the data table are partition 1: (0, 100000), partition 2: (100000, 200000), partition 3: (200000, 300000), ….

Generating partition files: the generated intermediate data files are grouped according to the partitions (each piece of grouping information comprises a partition start field, a partition end field, and a file list) to produce grouped intermediate data files; a partition file is then generated from all the intermediate data files belonging to a given partition and assigned to the corresponding partition. The partition file may be sliced again here: each partition may have a plurality of slices, and each piece of slice information includes a set of file reading information (each piece of file reading information includes a file position, a read start position, and a read end position). For example, if each slice processes 1G of files, each partition has 10 (10G/1G) slices, and each slice processes a set of files.
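The re-slicing of a partition file into slices of file reading information can be sketched as follows. The tuple layout (file position, read start, read end) follows the description above, while the greedy packing strategy and the name `slice_partition` are assumptions.

```python
def slice_partition(segments, max_slice_bytes):
    """Pack (path, start, end) segments of one partition into bounded slices.

    Each returned slice is a list of file reading entries whose total length
    does not exceed max_slice_bytes.
    """
    if max_slice_bytes <= 0:
        raise ValueError("max_slice_bytes must be positive")
    slices, current, used = [], [], 0
    for path, start, end in segments:
        pos = start
        while pos < end:
            take = min(end - pos, max_slice_bytes - used)
            current.append((path, pos, pos + take))
            pos += take
            used += take
            if used == max_slice_bytes:  # slice is full, start a new one
                slices.append(current)
                current, used = [], 0
    if current:
        slices.append(current)
    return slices
```

A single slice may therefore read from several intermediate files, which is why each piece of slice information carries a set of file reading entries.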

Loading partition files: the finally generated partition files of the data table are loaded into the corresponding partitions of the data table.
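The steps of this embodiment can be tied together in a small in-memory sketch. It is an illustration under stated assumptions, not the patented implementation: record counts replace byte sizes, lists replace files, the final sample keeps `num_partitions - 1` split keys (the example above samples one key per partition), and `load_pipeline` and its parameters are hypothetical names.

```python
import bisect
import heapq

def load_pipeline(records, slice_size, local_interval, num_partitions):
    """records: list of (user_id, payload). Returns (boundaries, partitions)."""
    # 1. Slice the input into fixed-size chunks, one per task.
    slices = [records[i:i + slice_size]
              for i in range(0, len(records), slice_size)]
    # 2. Each task sorts its slice by user ID: the intermediate data files.
    files = [sorted(s) for s in slices]
    # 3. Local sampling: every local_interval-th user ID of each sorted file.
    local = [[r[0] for r in f][local_interval - 1::local_interval]
             for f in files]
    # 4. Final sampling: merge the local samples and pick the split keys.
    merged = list(heapq.merge(*local))
    step = max(1, len(merged) // num_partitions)
    boundaries = merged[step - 1::step][:num_partitions - 1]
    # 5. Group every file by partition and collect the partition contents.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for f in files:
        keys = [r[0] for r in f]
        cuts = ([0] + [bisect.bisect_left(keys, b) for b in boundaries]
                + [len(f)])
        for j in range(len(cuts) - 1):
            partitions[j].extend(f[cuts[j]:cuts[j + 1]])
    return boundaries, partitions
```

On a shuffled set of twelve user IDs with three partitions, every record lands in exactly one partition and no partition is left empty, which is the balancing effect the method aims for.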

Fig. 6 is a schematic structural diagram of a data loading apparatus according to an embodiment of the present invention, and as shown in fig. 6, the apparatus 5 includes:

the sorting module 51 is configured to sort the data to be loaded according to the primary key field of the data to be loaded, and generate a data file;

the sampling module 52 is configured to sample the primary key field of the sorted data to be loaded, and generate a first primary key field;

the partitioning module 53 is configured to generate partitioning information of the data table according to the first primary key field, and partition the data table according to the partitioning information of the data table;

the processing module 54 is configured to group the data files according to the partition information of the data table, and generate partition files of the data table according to a grouping result;

and the loading module 55 is configured to load the partition file of the data table into the corresponding partition of the data table.

The data loading apparatus provided by this embodiment of the invention sorts the data to be loaded according to the primary key field of the data to be loaded and generates a data file; samples the primary key field of the sorted data to be loaded to generate a first primary key field; generates partition information of the data table according to the first primary key field and partitions the data table according to that partition information; groups the data files according to the partition information of the data table and generates partition files of the data table according to the grouping result; and loads the partition files of the data table into the corresponding partitions of the data table. In this way, every partition of the data table bears part of the data loading task, which prevents the loading task from being concentrated in a subset of the partitions, improves data loading performance, and balances the load across the partitions of the data table.
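The balancing effect can be illustrated with a small sketch that compares partition sizes when the split keys are sampled from the sorted primary keys with partition sizes when the key range is simply cut into equal-width intervals. The equal-width scheme is only an assumed stand-in for partition information fixed in advance; all function names and the test data are hypothetical.

```python
import bisect
from typing import List, Sequence


def split_points_from_samples(sorted_keys: Sequence[int], num_partitions: int) -> List[int]:
    """Pick num_partitions - 1 split keys at fixed intervals of the sorted keys."""
    step = len(sorted_keys) / num_partitions
    return [sorted_keys[int(k * step)] for k in range(1, num_partitions)]


def split_points_equal_width(sorted_keys: Sequence[int], num_partitions: int) -> List[float]:
    """Cut the key range [min, max] into equal-width intervals (assumed baseline)."""
    lo, hi = sorted_keys[0], sorted_keys[-1]
    width = (hi - lo) / num_partitions
    return [lo + k * width for k in range(1, num_partitions)]


def partition_sizes(sorted_keys: Sequence[int], split_points: Sequence[float]) -> List[int]:
    """Count the keys per partition; a key equal to a split point goes to the right."""
    sizes = [0] * (len(split_points) + 1)
    for key in sorted_keys:
        sizes[bisect.bisect_right(split_points, key)] += 1
    return sizes


# Skewed primary keys: squares are dense near the low end of the key range.
keys = sorted(i * i for i in range(1000))
sampled_sizes = partition_sizes(keys, split_points_from_samples(keys, 4))
range_sizes = partition_sizes(keys, split_points_equal_width(keys, 4))
```

On this skewed input the sampled split keys give four partitions of equal size, while the equal-width split places half of the keys in the first partition.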

Fig. 7 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention, and as shown in fig. 7, the apparatus 5 further includes:

a slicing module 56, configured to slice the data to be loaded to obtain n groups of preprocessed data, where n is a positive integer.

Further, the sorting module 51 is specifically configured to sort the n groups of preprocessed data according to the primary key field of the preprocessed data, and generate n data files.

The sampling module 52 is specifically configured to sample the primary key fields of the n sorted groups of preprocessed data, respectively, to generate n groups of second primary key fields; and to sort the n groups of second primary key fields as a whole and sample the overall sorted result to generate the first primary key field.
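The two-level sampling performed by the sampling module can be sketched as below. The fixed-interval sampling follows the wording of the claims; the step sizes, the function names, and the use of plain in-memory lists are assumptions made for illustration.

```python
import heapq
from typing import List, Sequence


def sample_fixed(sorted_keys: Sequence[int], step: int) -> List[int]:
    """Take every `step`-th key of an already sorted sequence."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(sorted_keys[::step])


def first_primary_keys(groups: Sequence[Sequence[int]], local_step: int, global_step: int) -> List[int]:
    """Derive the first primary key field from n groups of preprocessed data."""
    # Sort each group by primary key and sample it: n groups of second primary key fields.
    second = [sample_fixed(sorted(group), local_step) for group in groups]
    # Sort all second primary key fields as a whole (each group is already sorted).
    merged = list(heapq.merge(*second))
    # Sample the overall sorted keys again to obtain the first primary key field.
    return sample_fixed(merged, global_step)


example_groups = [[9, 1, 5, 3, 7], [2, 8, 4, 6, 0]]
example_first_keys = first_primary_keys(example_groups, local_step=2, global_step=2)
```

Sampling each group locally keeps the number of keys that must be sorted as a whole small, since only the sampled keys of the n groups, not all primary keys, reach the second pass.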

The partitioning module 53 is specifically configured to obtain a start field and an end field of each partition interval of the data table according to the first primary key field, and to partition the data table according to the start field and the end field of each partition interval of the data table.
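One possible way to turn the first primary key field into partition intervals is sketched below. It assumes that each sampled key serves as a boundary between two adjacent partitions and that the first and last partitions are unbounded (represented by `None`); this passage does not specify these details, so they are assumptions, as is the function name.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[Optional[int], Optional[int]]  # (start field, end field)


def partition_intervals(first_keys: Sequence[int]) -> List[Interval]:
    """Build (start, end) pairs from sampled keys; None marks an open end."""
    bounds = sorted(set(first_keys))  # drop duplicate boundaries
    starts: List[Optional[int]] = [None] + bounds
    ends: List[Optional[int]] = bounds + [None]
    return list(zip(starts, ends))


example_intervals = partition_intervals([40, 10, 40, 70])
```

Here three distinct boundary keys produce four intervals, each covering keys from its start field (inclusive) up to its end field (exclusive).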

Fig. 8 is a schematic structural diagram of another data loading apparatus according to an embodiment of the present invention, and as shown in fig. 8, the processing module 54 includes:

a grouping unit 541, configured to divide the ith data file into N_i groups of data files according to the partition information of the data table.

a screening unit 542, configured to screen, according to the jth partition information of the data table, the data files that satisfy the jth partition information from the N_1 + N_2 + … + N_n groups of data files, and to generate the jth partition file of the data table; where i = 1, 2, …, n; j = 1, 2, …, s; and each N_i is a positive integer.
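The cooperation of the grouping unit and the screening unit can be sketched as follows, assuming that the partition information is given as a sorted list of split keys and that each data file is an in-memory list of sorted primary keys. The function names are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Sequence


def group_file(sorted_keys: Sequence[int], split_points: Sequence[int]) -> Dict[int, List[int]]:
    """Grouping unit: split one sorted data file into one group per partition it touches.

    Returns {partition index: keys}; the number of entries plays the role of N_i.
    """
    groups: Dict[int, List[int]] = defaultdict(list)
    for key in sorted_keys:
        groups[bisect.bisect_right(split_points, key)].append(key)
    return dict(groups)


def build_partition_files(data_files: Sequence[Sequence[int]],
                          split_points: Sequence[int]) -> List[List[List[int]]]:
    """Screening unit: for each partition j, collect the matching groups of all files."""
    grouped = [group_file(data_file, split_points) for data_file in data_files]
    num_partitions = len(split_points) + 1
    return [[g[j] for g in grouped if j in g] for j in range(num_partitions)]


example_files = [[1, 4, 9], [2, 3, 12]]
example_partition_files = build_partition_files(example_files, split_points=[3, 10])
```

In this example the first data file yields 2 groups and the second yields 3, so the screening step selects from 2 + 3 = 5 groups when it assembles the three partition files.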

Further, the slicing module 56 is further configured to slice data in the partition file of the data table.

The loading module 55 is further configured to load the partition file of the data table after data slicing into the corresponding partition of the data table.

It should be noted that, for the interaction between the modules and units in this embodiment, reference may be made to the method embodiments corresponding to Figs. 1 to 4; details are not repeated here.

The data loading apparatus provided by this embodiment of the invention slices the data to be loaded to obtain n groups of preprocessed data; sorts the n groups of preprocessed data according to the primary key field of the preprocessed data and generates n data files; samples the primary key fields of the n sorted groups of preprocessed data, respectively, to generate n groups of second primary key fields; sorts the n groups of second primary key fields as a whole and samples the result to generate a first primary key field; generates partition information of the data table according to the first primary key field and partitions the data table according to that partition information; groups the data files according to the partition information of the data table and generates partition files of the data table according to the grouping result; and finally loads the partition files of the data table into the corresponding partitions of the data table. In this way, every partition of the data table bears part of the data loading task, which prevents the loading task from being concentrated in a subset of the partitions, improves data loading performance, and balances the load across the partitions of the data table.

In practical applications, the sorting module 51, the sampling module 52, the partitioning module 53, the processing module 54, the grouping unit 541, the screening unit 542, the loading module 55, and the slicing module 56 may each be implemented by a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like located in a data storage device.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A method for loading data, the method comprising:

sampling the primary key field of the sorted data to be loaded at fixed intervals to generate a first primary key field;

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein sampling the sorted primary key field of the data to be loaded to obtain a first primary key field comprises:

4. The method of claim 1, wherein the generating partition information of a data table according to the first primary key field and partitioning the data table according to the partition information of the data table comprises:

5. The method according to claim 2, wherein the grouping the data files according to the partition information of the data table and generating the partition files of the data table according to the grouping result comprises:

screening, according to the jth partition information of the data table, the data files that satisfy the jth partition information from the N_1 + N_2 + … + N_n groups of data files, and generating the jth partition file of the data table; wherein i = 1, 2, …, n; j = 1, 2, …, s; and each N_i is a positive integer.

6. The method of claim 1, further comprising:

slicing data in a partition file of the data table;

7. A data loading apparatus, characterized in that the apparatus comprises:

the sampling module is configured to sample the primary key field of the sorted data to be loaded at fixed intervals to generate a first primary key field;

8. The apparatus of claim 7, further comprising:

9. The apparatus of claim 8,

the sampling module is specifically configured to sample the primary key fields of the n sorted groups of preprocessed data, respectively, to generate n groups of second primary key fields; and to sort the n groups of second primary key fields as a whole and sample the overall sorted result to generate the first primary key field.

10. The apparatus of claim 8, wherein the processing module comprises:

a screening unit, configured to screen, according to the jth partition information of the data table, the data files that satisfy the jth partition information from the N_1 + N_2 + … + N_n groups of data files, and to generate the jth partition file of the data table; wherein i = 1, 2, …, n; j = 1, 2, …, s; and each N_i is a positive integer.