CN102486798A - Data loading method and device - Google Patents


Info

Publication number
CN102486798A
CN102486798A CN2010105822860A CN201010582286A CN102486798A CN 102486798 A CN102486798 A CN 102486798A CN 2010105822860 A CN2010105822860 A CN 2010105822860A CN 201010582286 A CN201010582286 A CN 201010582286A CN 102486798 A CN102486798 A CN 102486798A
Authority
CN
China
Prior art keywords
subregion
partition
data
partition table
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105822860A
Other languages
Chinese (zh)
Inventor
赵伟
郭玮
李均
洪坤乾
张书彬
徐钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data loading method and a data loading device, which belong to the field of databases. The method comprises: creating a partition table according to a partition type, wherein the partition table comprises a partition field; reading the data to be loaded row by row; obtaining the partition field of the current row according to the partition information of the partition table; determining, according to that partition field, the partition of the partition table to which the row belongs; and loading the row into that partition. The device comprises a creating module, a reading module, an obtaining module, a determining module and a loading module. Because the partition field of each row is obtained from the partition information of the partition table and the target partition is determined from it, each row is loaded into its partition automatically. This realizes automatic data loading on a distributed file system, ensures that data is loaded into the correct partition, and avoids the loading errors caused by manually specifying partitions.

Description

Data loading method and device
Technical field
The present invention relates to the field of databases, and in particular to a data loading method and device.
Background art
Tables in a data warehouse often store very large volumes of data, so partitioning is widely used: the data in a table is divided into several partitions according to a strategy such as ranges or discrete values. The benefit of partitioning is that scans and queries can be confined to the relevant partitions, which avoids full table scans and improves analysis efficiency; operations such as updating, deleting and backing up data can also be performed per partition, which improves management efficiency. The massive data stored in a distributed data warehouse can therefore be partitioned as well. However, partitioning is only meaningful if data is loaded into the correct partition; otherwise analysis results become inaccurate and data management becomes chaotic.
In prior-art distributed data warehouses (such as Hadoop Hive, a data warehouse tool built on a distributed system architecture), loading data into the correct partition is ensured manually, and the system does not check partition correctness. When data is loaded, the partition field is not among the fields of the table but exists only in the user's mind. Therefore, when loading data into a table, the user must know which partition the current data belongs to; after the user specifies the partition, the system copies the data into that partition.
In the course of realizing the present invention, the inventors found that the prior art has at least the following shortcomings:
When loading data, the user must specify which data is loaded into which partition, so the user must be thoroughly familiar with the system and understand both the principle of partitioning and the method of loading data into partitions; the barrier to entry is high and usability is poor. Because partitions are specified manually, user carelessness may lead to a wrong partition or a mistyped file name, yet the system performs the erroneous load without any check or prompt, causing data loading errors. Once data has been loaded into the system, there is no partition field in the data, so it is impossible to verify whether the data was loaded correctly.
Summary of the invention
In order to load data into the correct partition automatically on a distributed data warehouse, embodiments of the present invention provide a data loading method and device. The technical solution is as follows:
In one aspect, a data loading method is provided, the method comprising:
creating a partition table according to a partition type, the partition table comprising a partition field;
reading the data to be loaded row by row;
obtaining the partition field of the current row according to the partition information of the partition table;
determining, according to the partition field of the row, the partition of the partition table to which the row belongs;
loading the row into the partition to which it belongs.
Wherein obtaining the partition field of the current row according to the partition information of the partition table specifically comprises:
obtaining the partition field of the partition table from the partition information of the partition table;
matching the partition field of the current row according to the partition field of the partition table.
Specifically, determining the partition to which the row belongs according to the partition field of the row comprises:
matching the partition field of the row against the partition spaces in the partition information, and determining the partition corresponding to the matching partition space as the partition of the partition table to which the row belongs.
Further, after loading the row into the partition to which it belongs, the method also comprises:
scanning all generated partition files and, if a partition contains a partition file smaller than the default block size of the file system, merging all files under that partition's directory into one large file.
Further, after creating the partition table according to the partition type, the method also comprises:
recording metadata of the partition table, the metadata comprising the partition information of the partition table and the mapping between the partition table and the distributed file system.
Still further, after reading the data to be loaded row by row, the method also comprises:
parsing the fields of the current row according to the metadata of the partition table and judging whether the format of the row is correct; if so, performing the step of obtaining the partition field of the row according to the partition information of the partition table; otherwise, reporting an error.
In another aspect, a data loading device is provided, the device comprising:
a creating module, configured to create a partition table according to a partition type, the partition table comprising a partition field;
a reading module, configured to read the data to be loaded row by row;
an obtaining module, configured to obtain, according to the partition information of the partition table, the partition field of the row read by the reading module;
a determining module, configured to determine, according to the partition field obtained by the obtaining module, the partition of the partition table to which the row belongs;
a loading module, configured to load the row into the partition determined by the determining module.
Wherein the obtaining module is specifically configured to obtain the partition field of the partition table from the partition information of the partition table, and to match the partition field of the current row according to the partition field of the partition table.
Specifically, the determining module is configured to match the partition field of the row against the partition spaces in the partition information, and to determine the partition corresponding to the matching partition space as the partition of the partition table to which the row belongs.
Further, following the loading module, the device also comprises:
a merging module, configured to scan all generated partition files and, if a partition contains a partition file smaller than the default block size of the file system, to merge all files under that partition's directory into one large file.
Further, following the creating module, the device also comprises:
a recording module, configured to record metadata of the partition table created by the creating module, the metadata comprising the partition information of the partition table and the mapping between the partition table and the distributed file system.
Still further, following the reading module, the device also comprises:
a parsing module, configured to parse the fields of the current row according to the metadata of the partition table and judge whether the format of the row is correct; if so, the obtaining module is invoked; otherwise, an error is reported.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
By recording the metadata of the partition table, and determining the partition to which each row belongs from the partition information in that metadata and the partition field of the data to be loaded, each row is loaded into its partition automatically. This realizes automatic data loading on a distributed file system, ensures that data is loaded into the correct partition, and avoids the loading errors caused by manually specifying partitions. In addition, data is parsed as it is read rather than copied directly, which ensures that the format of the loaded data is correct. Because the loaded data contains the partition field, it is also possible, after loading, to verify from the partition field whether the data belongs to the partition it was loaded into.
Description of drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the data loading method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the data loading method provided by Embodiment two of the present invention;
Fig. 3 is a schematic diagram of how tables and partitions map to structures on the distributed file system, provided by Embodiment two of the present invention;
Fig. 4 is an execution flowchart of the query plan provided by Embodiment two of the present invention;
Fig. 5 is a schematic structural diagram of the first data loading device provided by Embodiment three of the present invention;
Fig. 6 is a schematic structural diagram of the second data loading device provided by Embodiment three of the present invention;
Fig. 7 is a schematic structural diagram of the third data loading device provided by Embodiment three of the present invention;
Fig. 8 is a schematic structural diagram of the fourth data loading device provided by Embodiment three of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment one
This embodiment of the present invention provides a data loading method. Referring to Fig. 1, the method flow is as follows:
101: Create a partition table according to a partition type, the partition table comprising a partition field;
102: Read the data to be loaded row by row;
103: Obtain the partition field of the current row according to the partition information of the partition table;
104: Determine, according to the partition field of the row, the partition of the partition table to which the row belongs;
105: Load the row into the partition to which it belongs.
In the method provided by this embodiment, the partition field of each row is obtained according to the partition information of the partition table, the partition to which the row belongs is determined, and the row is loaded into that partition automatically. This realizes automatic data loading on a distributed file system, ensures that data is loaded into the correct partition, and avoids the loading errors caused by manually specifying partitions. Because the loaded data contains the partition field, it is also possible, after loading, to verify from the partition field whether the data belongs to the partition it was loaded into.
Embodiment two
This embodiment of the present invention provides a data loading method. Referring to Fig. 2, the method flow is as follows:
201: Create a partition table according to a partition type, the partition table comprising a partition field;
In this embodiment the table is created with a CREATE TABLE statement; it may also be created in other ways, which is not specifically limited here. The partition type is defined when the table is created, which forms the partition table. The partition type may be range, list, or a composite type in which the two are nested. This embodiment supports two levels of nesting, that is, the four composite partition types range-range, range-list, list-list and list-range; other partition types are not specifically limited.
Specifically, tables are created and partition types are defined as follows:
(1) Creating a partition table of the range type
Range partitioning is used to map rows to partitions based on ranges of column values. It is useful when data can be divided into logical ranges, such as the months of a year, and performs best when the data is distributed evenly across the whole range. If range partitioning would produce partitions of markedly different sizes because of uneven division, other partitioning methods should be considered.
For example, the command format for creating a partition table of the range type is:
CREATE TABLE table_name(
Col_nam1 TYPE,
Col_nam2 TYPE,
……
)
PARTITION BY RANGE(Col_namX)
(
PARTITION par_name1 VALUES LESS THAN(value1),
PARTITION par_name2 VALUES LESS THAN(value2),
……
);
The above command creates a partition table with range partitions.
First, a table is created in the usual way, including the table name, fields and so on, as in the CREATE TABLE statement above.
Then the created table is range partitioned: the partition type is specified as range, and a partition field is chosen from the fields of the table (the partition field must be one of the table's columns), after which at least one partition space is defined, as in the PARTITION statements above. Here PARTITION BY RANGE(Col_namX) specifies that the partition type is range and that the partition field is Col_namX. Each partition space is then defined in the form: PARTITION <name> VALUES LESS THAN (<constant>).
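The way a row's partition field could be matched against VALUES LESS THAN bounds can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the partition names and bounds are hypothetical, and binary search over sorted bounds is an illustrative choice that the text does not prescribe.

```python
from bisect import bisect_right

# Hypothetical range partitions: (name, exclusive upper bound), sorted by bound.
RANGE_PARTITIONS = [("p_q1", 4), ("p_q2", 7), ("p_q3", 10)]


def find_range_partition(value, partitions=RANGE_PARTITIONS):
    """Return the first partition whose upper bound is greater than value."""
    bounds = [bound for _, bound in partitions]
    # bisect_right gives the index of the first bound strictly greater than value.
    index = bisect_right(bounds, value)
    if index == len(partitions):
        raise ValueError(f"value {value!r} is not covered by any range partition")
    return partitions[index][0]
```

A value equal to a bound falls into the next partition, because each bound is exclusive.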
(2) Creating a partition table of the list type
List partitioning is used when explicit control over how rows map to partitions is needed. In the description of each partition, a list of discrete values is specified for the partition column. List partitioning is designed for data that is divided by discrete values, and it can naturally group and organize unordered and unrelated data sets together.
For example, the command format for creating a partition table of the list type is:
CREATE TABLE table_name(
Col_nam1 TYPE,
Col_nam2 TYPE,
……
)
PARTITION BY LIST(Col_namX)
(
PARTITION par_name1 VALUES IN(valueA,valueB,valueC……),
PARTITION par_name2 VALUES IN(valueX,valueY,valueZ……),
……
PARTITION default
);
The above command creates a partition table with list partitions.
First, a table is created in the usual way, including the table name, fields and so on, as in the CREATE TABLE statement above.
Then the created table is list partitioned: the partition type is specified as list, and a partition field is chosen from the fields of the table (the partition field must be one of the table's columns), after which at least one partition space is defined, as in the PARTITION statements above. Here PARTITION BY LIST(Col_namX) specifies that the partition type is list and that the partition field is Col_namX. Each partition space is then defined in the form: PARTITION <name> VALUES IN (<constant>, ...).
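A list partition lookup with a default partition can be sketched in Python as below. The partition names and values are hypothetical, and inverting the definitions into a value-to-partition index is an assumption made for illustration; the text only states that each partition lists its discrete values.

```python
# Hypothetical list partitions: name -> discrete values it accepts.
LIST_PARTITIONS = {
    "p_north": {"beijing", "tianjin"},
    "p_south": {"shenzhen", "guangzhou"},
}
DEFAULT_PARTITION = "default"


def build_value_index(partitions):
    """Invert the definitions into value -> partition name."""
    index = {}
    for name, values in partitions.items():
        for value in values:
            if value in index:
                raise ValueError(f"value {value!r} listed in two partitions")
            index[value] = name
    return index


def find_list_partition(value, index, default=DEFAULT_PARTITION):
    """Return the partition listing value, else the default partition."""
    return index.get(value, default)
```

Rejecting a value that appears in two partitions keeps the mapping unambiguous, which is the point of explicit list definitions.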
(3) Creating a partition table of a composite type
This embodiment uses range-list as the example composite partition format; the formats of the range-range, list-range and list-list composite partitions are similar to that of range-list.
A partition table of the range-list type combines range partitioning and list partitioning: data is first partitioned by range, and each range partition is then list partitioned.
For example, the command format for creating a partition table of the range-list type is:
CREATE TABLE table_name(
Col_nam1 TYPE,
Col_nam2 TYPE,
……
)
PARTITION BY RANGE(Col_namX)
SUBPARTITION BY LIST(Col_namY)
(
SUBPARTITION par_name1 VALUES IN(valueA,valueB,valueC……),
SUBPARTITION par_name2 VALUES IN(valueX,valueY,valueZ……),
……
SUBPARTITION default)
(
PARTITION par_name1 VALUES LESS THAN(value1),
PARTITION par_name2 VALUES LESS THAN(value2),
……);
The above command creates a partition table with range-list partitions.
First, a table is created in the usual way, including the table name, fields and so on, as in the CREATE TABLE statement above.
Then the created table is range-list partitioned: first-level partitioning (range) is performed first, followed by second-level partitioning (list). The partitioning methods are similar to the range and list partitioning described above and are not repeated here. In the command above, the PARTITION part defines the first-level partitions and the SUBPARTITION part defines the second-level partitions.
202: Record the metadata of the partition table; the metadata comprises the partition information of the partition table and the mapping between the partition table and the distributed file system;
In this step, after the partition table has been created, the corresponding directories are first created on HDFS (Hadoop Distributed File System), and the metadata of the partition table is then added to the metadata system.
Specifically, logical concepts in the data warehouse such as tables and partitions are all mapped to directories on HDFS.
For example, referring to Fig. 3, if the created partition table is TEST, a directory TEST is created on HDFS. If TEST has two partitions P1 and P2, there are two directories P1 and P2 under the TEST directory; if TEST also has second-level partitions SP1 and SP2, two directories SP1 and SP2 are created in each of the P1 and P2 directories.
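The directory layout in this example can be enumerated with a short Python sketch. The function name and the warehouse root path are hypothetical; the sketch only computes path strings and does not call any HDFS API.

```python
import posixpath


def partition_directories(warehouse_root, table, partitions, subpartitions=()):
    """List the directories for a table, its partitions, and subpartitions."""
    table_dir = posixpath.join(warehouse_root, table)
    dirs = [table_dir]
    for partition in partitions:
        partition_dir = posixpath.join(table_dir, partition)
        dirs.append(partition_dir)
        # Every first-level partition gets the same set of second-level directories.
        for sub in subpartitions:
            dirs.append(posixpath.join(partition_dir, sub))
    return dirs
```

For the table TEST with partitions P1 and P2 and second-level partitions SP1 and SP2, this yields the table directory, two partition directories and four second-level directories.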
The directory structure on HDFS can be regarded as the physical structure of the table and its partitions; the metadata of the partition table in the metadata system links this physical structure to the user-defined range, list or composite partitions.
Further, the metadata of the partition table is added to the metadata system. This metadata comprises the partition information of the partition table and the mapping from the logical structure of the partition table and its partitions to the directory structure on HDFS: for example, the partition table is mapped to a directory on HDFS, each partition is mapped to a subdirectory of the partition table's directory, and the data in the files is mapped to the rows and fields of the partition table. The metadata in the metadata system is used when SQL (Structured Query Language) queries are executed. The metadata information of the partition table is shown in the following table:
[Table image BSA00000381755100081]
Table 1: Metadata information of the partition table
The Pri_partition attribute stores the first-level partition information of the partition table; if the partition table has no first-level partition, this field is empty. The Sub_partition attribute stores the second-level partition information; if the partition table has no second-level partition, this field is empty. The metadata information of the partition information is shown in the following table:
[Table image BSA00000381755100091]
Table 2: Metadata information of the partition information
Each level of partitioning in a partition table can contain multiple layers of partition information, with different partition spaces delimiting the different layers. A partition space is a container of (key, value) pairs, where the key is the partition name and the value is the partition definition. For example, after a range partition P1 is created whose definition is that the partition field value is less than 10, the (key, value) pair (P1, 10) is saved in the partition space; if a list partition P2 is created whose definition is that the partition field value is one of (2, 4, 6), the (key, value) pair (P2, 2:4:6) is saved in the partition space.
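The two example entries can be read back into typed definitions roughly as follows. This is a sketch under stated assumptions: the function name parse_partition_space is hypothetical, values are assumed to be integers, and the colon separator is taken from the (P2, 2:4:6) example rather than from a documented storage format.

```python
def parse_partition_space(kind, space):
    """Turn stored (name, definition) strings into typed definitions.

    kind is "range" or "list"; space maps partition name -> definition string.
    """
    parsed = {}
    for name, definition in space.items():
        if kind == "range":
            # Assumed: a range definition is a single exclusive upper bound.
            parsed[name] = int(definition)
        elif kind == "list":
            # Assumed: list values are joined with ":" as in the example.
            parsed[name] = [int(v) for v in definition.split(":")]
        else:
            raise ValueError(f"unknown partition type: {kind}")
    return parsed
```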
203: Read the data to be loaded row by row;
When data is loaded, the LOAD statement generates a query plan for automated partition loading. The query plan is an execution tree made up of three operators, FileScanOp, PartitionerOp and FileSinkOp, each of which represents a particular execution action. The plan is executed from the start node of the tree to the end node to obtain the final result. The three operators perform, without limitation, the following operations: the FileScanOp operator scans the file, reads it row by row and passes each row to the PartitionerOp operator; the PartitionerOp operator checks the partition field of each incoming row and routes the row to the FileSinkOp operator of the corresponding partition; the FileSinkOp operator writes the rows passed to it into the specified file.
For example, to load the contents of the file /data/data1.txt on HDFS into the partition table test, the LOAD statement is as follows:
LOAD DATA INPATH '/data/data1.txt' INTO TABLE test;
Specifically, reading the data to be loaded row by row is implemented by the FileScanOp operator. The FileScanOp operator holds the path of the file to be scanned and the method for parsing the file. The parsing method can be obtained from the metadata of the partition table and is used to read the data to be loaded from the file row by row.
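To make the division of labor among the three operators concrete, here is a minimal in-memory sketch in Python. Only the operator names come from the text; the function signatures, the comma delimiter and the in-memory sink (standing in for files on HDFS) are assumptions for illustration.

```python
from collections import defaultdict


def file_scan_op(lines, delimiter=","):
    """Yield one parsed row (list of fields) per input line."""
    for line in lines:
        yield line.rstrip("\n").split(delimiter)


def partitioner_op(rows, field_index, choose_partition):
    """Pair each row with the partition chosen from its partition field."""
    for row in rows:
        yield choose_partition(row[field_index]), row


def file_sink_op(routed_rows):
    """Collect rows per partition; a real sink would write to partition files."""
    output = defaultdict(list)
    for partition, row in routed_rows:
        output[partition].append(row)
    return dict(output)


def load(lines, field_index, choose_partition):
    """Chain the three operators from scan to sink."""
    scanned = file_scan_op(lines)
    routed = partitioner_op(scanned, field_index, choose_partition)
    return file_sink_op(routed)
```

Because the first two stages are generators, rows flow through the chain one at a time, which mirrors the row-by-row reading described in step 203.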
204: Parse the fields of the current row according to the metadata of the partition table and judge whether the format of the row is correct; if so, perform step 205; otherwise, perform step 209;
Specifically, the FileScanOp operator obtains the metadata of the partition table into which the data is to be stored and parses the fields of the current row using the file parsing method in that metadata. During parsing it judges whether the format of the row is correct. One way to judge is to match the fields of the row against information in the partition table's metadata such as the data file format and the column names: if they match, the format of the row is judged correct; if not, it is judged incorrect. This embodiment uses the above judgment only as an example and is not limited to it.
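One possible form of this format check is sketched below in Python. The column list, the comma delimiter and the use of Python type constructors as parsers are all assumptions; the text only requires that the row's fields be matched against the table's metadata.

```python
# Hypothetical column metadata: (column name, parser) in table order.
COLUMNS = [("user_id", int), ("city", str), ("amount", float)]


def parse_row(line, columns=COLUMNS, delimiter=","):
    """Parse one line into a dict, raising ValueError on a format mismatch."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    row = {}
    for (name, parser), raw in zip(columns, fields):
        try:
            row[name] = parser(raw)
        except ValueError as exc:
            raise ValueError(f"column {name!r}: cannot parse {raw!r}") from exc
    return row
```

A caller could treat the ValueError as the branch to step 209 and a returned dict as the branch to step 205.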
This step is optional; performing it ensures that the format of the loaded data is correct. This embodiment is described with step 204 performed; in an actual implementation of the invention, step 204 may be skipped and step 205 performed directly after step 203.
205: Obtain the partition field of the current row according to the partition information of the partition table;
Specifically, the PartitionerOp operator obtains the partition field of the current row according to the partition information of the partition table: it obtains the partition field of the partition table from the partition information, and then matches the partition field of the row according to the partition field of the partition table.
206: Determine, according to the partition field of the row, the partition of the partition table to which the row belongs;
Specifically, the partition field of the row is matched against the partition spaces in the partition information, and the partition corresponding to the matching partition space is determined as the partition to which the row belongs.
There are many ways to match the partition field of a row against the partition spaces in the partition information, and this embodiment does not limit them. As an example, the PartitionerOp operator builds a series of evaluators (Evaluator) and uses them to determine the partition to which the row belongs. First, a series of evaluators is built corresponding to each layer of partition information in the metadata. The partition field of the row is then tested against these evaluators: if a test matches, the row belongs to the partition represented by that Evaluator; if not, testing continues with the next Evaluator; if even the last Evaluator does not match, an exception is returned to the user. If a default partition exists, the last Evaluator is that of the default partition, and its test always matches. If there are second-level partitions, evaluators for them are built as well; after first-level matching is complete, the row is matched against the second-level partitions, which finally determines the second-level partition, under a particular first-level partition, to which the row belongs.
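The evaluator chain described above can be sketched in Python with closures. The helper names and the exception class are hypothetical; the behavior follows the text: evaluators are tried in order, a default evaluator placed last always matches, and an error is raised when nothing matches.

```python
class NoPartitionError(Exception):
    """Raised when no evaluator matches and there is no default partition."""


def range_evaluator(name, upper):
    """Evaluator for a range partition with an exclusive upper bound."""
    return name, lambda value: value < upper


def list_evaluator(name, values):
    """Evaluator for a list partition with a fixed set of values."""
    allowed = set(values)
    return name, lambda value: value in allowed


def default_evaluator(name="default"):
    """Evaluator that always matches, so it must be placed last."""
    return name, lambda value: True


def resolve(value, evaluators):
    """Test evaluators in order and return the first matching partition."""
    for name, matches in evaluators:
        if matches(value):
            return name
    raise NoPartitionError(f"no partition accepts {value!r}")


def resolve_two_level(primary_value, sub_value, primary, secondary):
    """Match the first-level partition, then the second-level one."""
    return resolve(primary_value, primary), resolve(sub_value, secondary)
```

Range evaluators must be listed in ascending order of their bounds for the first match to be the correct partition.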
207: Load the row into the partition to which it belongs;
Specifically, after PartitionerOp determines the partition to which the row belongs, the row is routed to the FileSinkOp operator of that partition, which calls the HDFS interface to write the row into the directory of that partition.
208: Scan all generated partition files; if a partition contains a partition file smaller than the default block size of the file system, merge all files under that partition's directory into one large file; the flow ends;
Specifically, when the generated partition files are scanned, the size of each file is checked. If every file is larger than the default block size of the file system, no merge is triggered. If at least one file is found to be smaller than the default block size, all files under that partition's directory are merged into one file.
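The merge decision can be expressed as a small Python function. The function name and the input shape (a mapping from partition directory to file sizes in bytes) are assumptions; the block size is passed in by the caller rather than assumed, and no actual merging or HDFS access is performed.

```python
def partitions_to_merge(file_sizes_by_partition, block_size):
    """Return partition directories holding at least one file below block_size.

    file_sizes_by_partition maps a partition directory to its file sizes in bytes.
    """
    return [
        partition
        for partition, sizes in file_sizes_by_partition.items()
        if any(size < block_size for size in sizes)
    ]
```

A partition whose files are all at least one block in size is left untouched; any partition with a single undersized file is flagged so that all of its files can be merged.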
This step is optional; performing it can improve the operating efficiency of HDFS. This embodiment is described with step 208 performed; in an actual implementation of the invention, this step may be omitted.
209: If the format of the row is incorrect, report an error and prompt the user to check whether the data is correct; the flow ends.
Specifically, the user is prompted to check whether the data file being loaded is correct, for example for path errors or field errors, and can then reload the correct data.
This step is optional and is performed only when step 204 is included. If step 204 is not performed in an implementation of the invention, this step is not performed either.
As shown in Fig. 4, the query plan above is ultimately converted into a Map function that runs locally on each machine of the cluster. Each Map function reads local data and then, according to the partition information and the partition field, automatically loads the data into the correct partition. Because the data files reside in the HDFS cluster, during automated loading the MapReduce framework tries to run the Map function on the machine that stores the data file, so that computation is local. The output of the Map function remains on the file system, which reduces network I/O; and the data is loaded in parallel, which is efficient.
In the method provided by this embodiment, the metadata of the partition table is recorded, and the partition to which each row belongs is determined from the partition information in that metadata and the partition field of the data to be loaded, so that each row is loaded into its partition automatically. This realizes automatic data loading on HDFS, ensures that data is loaded into the correct partition, and avoids the loading errors caused by manually specifying partitions. In addition, data is parsed as it is read rather than copied directly, which ensures that the format of the loaded data is correct. Because the loaded data contains the partition field, it is also possible, after loading, to verify from the partition field whether the data belongs to the partition it was loaded into.
Embodiment three
Referring to Fig. 5, this embodiment of the invention provides a data loading device, which comprises:
Creating module 501, configured to create a partition table according to the partition type, the partition table comprising a partition field;
Reading module 502, configured to read the data to be loaded line by line;
Obtaining module 503, configured to obtain, according to the partition information of the partition table, the partition field of the line of data read by the reading module 502;
Determining module 504, configured to determine, according to the partition field of the data obtained by the obtaining module 503, the partition to which the line of data belongs in the partition table;
Loading module 505, configured to load the line of data into the corresponding partition determined by the determining module 504.
The obtaining module 503 is specifically configured to obtain the partition field of the partition table from the partition information of the partition table, and to match the partition field of the line of data according to the partition field of the partition table.
Specifically, the determining module 504 is configured to match the partition field of the data against the partition spaces in the partition information, and to determine the partition corresponding to the matching partition space as the partition to which the line of data belongs in the partition table.
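The matching performed by modules 503 and 504 can be illustrated with a small sketch. It assumes range partitioning, in which each partition space is bounded by an exclusive upper bound; the source text does not fix the partition type here, and the structure `PARTITION_INFO`, the column names, and the partition names are hypothetical.

```python
import bisect

# Hypothetical partition information recorded for a table.
PARTITION_INFO = {
    "columns": ["id", "name", "age"],   # column order of each line
    "field": "age",                     # partition field of the table
    "bounds": [18, 40, 65],             # exclusive upper bound of each partition space
    "names": ["p0", "p1", "p2", "p_default"],
}


def find_partition(line, info, delimiter=","):
    """Return the name of the partition to which a line of data belongs."""
    fields = line.split(delimiter)
    # Locate the line's partition field using the table's partition field.
    value = int(fields[info["columns"].index(info["field"])])
    # Match the value against the partition spaces: the first bound that is
    # strictly greater than the value identifies the partition.
    return info["names"][bisect.bisect_right(info["bounds"], value)]
```

Under these assumptions a value of 17 falls in `p0`, a value of 18 falls in `p1` because the upper bound is exclusive, and any value of 65 or more falls in `p_default`.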
Further, referring to Fig. 6, the device also comprises, after the loading module 505:
Merging module 506, configured to scan all generated partition files and, if a partition contains a partition file smaller than the default block size of the file system, to merge all files under the directory of that partition into one large file.
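A minimal sketch of the merging step follows. It uses the local file system as a stand-in for the distributed file system; the function name `merge_small_files`, the output name `merged.dat`, and the demonstration block size are assumptions made for illustration. Skipping partitions that hold fewer than two files is also a choice made in this sketch, since merging a single file would change nothing.

```python
import os
import tempfile


def merge_small_files(partition_dir, block_size, merged_name="merged.dat"):
    """Merge all files of a partition if any of them is smaller than block_size.

    Returns True if a merge was performed, otherwise False.
    """
    names = sorted(n for n in os.listdir(partition_dir)
                   if os.path.isfile(os.path.join(partition_dir, n)))
    paths = [os.path.join(partition_dir, n) for n in names]
    has_small_file = any(os.path.getsize(p) < block_size for p in paths)
    if len(paths) < 2 or not has_small_file:
        return False

    merged_path = os.path.join(partition_dir, merged_name)
    tmp_path = merged_path + ".tmp"
    # Write to a temporary file first so the originals stay intact until
    # the merged copy is complete.
    with open(tmp_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                out.write(src.read())
    for path in paths:
        os.remove(path)
    os.rename(tmp_path, merged_path)
    return True


# Demonstration on a temporary directory that plays the role of one partition.
with tempfile.TemporaryDirectory() as partition_dir:
    for name, data in (("part-0", b"a\n"), ("part-1", b"b\n")):
        with open(os.path.join(partition_dir, name), "wb") as f:
            f.write(data)
    merged = merge_small_files(partition_dir, block_size=1024)
    remaining = os.listdir(partition_dir)
    with open(os.path.join(partition_dir, "merged.dat"), "rb") as f:
        content = f.read()
```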
Further, referring to Fig. 7, the device also comprises, after the creating module 501:
Recording module 507, configured to record the metadata of the partition table according to the partition table created by the creating module 501, the metadata comprising the partition information of the partition table and the mapping relationship between the partition table and the distributed file system.
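One possible shape for such a metadata record is sketched below. The class `PartitionTableMetadata`, its attribute names, and the directory layout (one directory per partition under a table directory) are assumptions for illustration only; the source states only that the metadata holds the partition information and the mapping between the table and the distributed file system.

```python
from dataclasses import dataclass


@dataclass
class PartitionTableMetadata:
    """Hypothetical metadata record for one partition table."""
    table_name: str
    columns: list          # column names, in file order
    partition_type: str    # e.g. "range"; the concrete types are an assumption
    partition_field: str   # name of the partition field
    partitions: dict       # partition name -> description of its partition space
    dfs_root: str          # root directory of tables on the file system

    def partition_path(self, partition_name):
        """Map a partition of this table to a directory on the file system."""
        if partition_name not in self.partitions:
            raise KeyError(f"unknown partition: {partition_name}")
        return f"{self.dfs_root}/{self.table_name}/{partition_name}"


meta = PartitionTableMetadata(
    table_name="users",
    columns=["id", "name", "age"],
    partition_type="range",
    partition_field="age",
    partitions={"p0": "age < 18", "p1": "age < 40"},
    dfs_root="/warehouse",
)
```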
Still further, referring to Fig. 8, the device also comprises, after the reading module 502:
Parsing module 508, configured to parse the fields of the line of data according to the metadata of the partition table and to judge whether the format of the line of data is correct; if so, the obtaining module 503 is executed; otherwise, an error is reported.
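The format check performed by the parsing module can be sketched as follows. The schema representation (a list of column names paired with conversion functions), the comma delimiter, and the use of `ValueError` to report an error are assumptions of this sketch rather than details given in the source.

```python
def parse_line(line, schema, delimiter=","):
    """Parse one line against the table schema, raising ValueError on bad format.

    schema is a list of (column_name, converter) pairs taken from the metadata.
    """
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(
            f"expected {len(schema)} fields, found {len(fields)}: {line!r}")
    row = {}
    for (name, convert), raw in zip(schema, fields):
        try:
            row[name] = convert(raw)
        except ValueError as exc:
            raise ValueError(f"field {name!r}: cannot parse {raw!r}") from exc
    return row


# Hypothetical schema for a three-column table.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def is_well_formed(line):
    """Return True if the line parses, False if an error would be reported."""
    try:
        parse_line(line, SCHEMA)
        return True
    except ValueError:
        return False
```

Only lines that pass this check would proceed to the step that obtains the partition field; a line with a missing field or a non-numeric `age` would be rejected.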
In summary, in this embodiment of the invention the metadata of the partition table is recorded; the partition to which each line of data belongs in the partition table is determined according to the partition information of the partition table in the metadata and the partition field of the data to be loaded; and the line of data is automatically loaded into the corresponding partition. Data is thereby loaded automatically on HDFS, which ensures that data is loaded into the correct partition and avoids the loading errors caused by manually specifying partitions. Moreover, the data is parsed when it is read rather than copied directly, which ensures that the loaded data is correctly formatted. Because the loaded data contains the partition field, after the data has been loaded into a partition it can also be verified, by means of the partition field, whether the loaded data belongs to that partition.
It should be noted that, when the data loading device provided by the above embodiment loads data automatically, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the data loading device provided by the above embodiment and the data loading method embodiments belong to the same concept; for its specific implementation, see the method embodiments, which are not repeated here.
The sequence numbers of the above embodiments of the invention are for description only and do not indicate the relative merits of the embodiments.
All or part of the steps in the embodiments of the invention may be implemented in software, and the corresponding software program may be stored in a readable storage medium such as an optical disc or a hard disk.
The above are merely preferred embodiments of the invention and are not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (12)

1. the method for a data load is characterized in that, said method comprises:
According to the type of the subregion table that creates the division, comprise the subregion field in the said partition table;
Reading line by line needs loaded data;
Obtain the subregion field of this line data according to the partition information of said partition table;
Confirm the subregion of this line data under in said partition table according to the subregion field of said data;
This line data is loaded in the corresponding affiliated subregion.
2. The method according to claim 1, characterized in that obtaining the partition field of the line of data according to the partition information of the partition table specifically comprises:
Obtaining the partition field of the partition table from the partition information of the partition table;
Matching the partition field of the line of data according to the partition field of the partition table.
3. The method according to claim 1, characterized in that determining, according to the partition field of the data, the partition to which the line of data belongs in the partition table specifically comprises:
Matching the partition field of the data against the partition spaces in the partition information, and determining the partition corresponding to the matching partition space as the partition to which the line of data belongs in the partition table.
4. The method according to claim 1, characterized in that, after loading the line of data into the corresponding partition, the method further comprises:
Scanning all generated partition files and, if the partition contains a partition file smaller than the default block size of the file system, merging all files under the directory of the partition into one large file.
5. The method according to claim 1, characterized in that, after creating the partition table according to the partition type, the method further comprises:
Recording the metadata of the partition table according to the partition table, the metadata comprising the partition information of the partition table and the mapping relationship between the partition table and the distributed file system.
6. The method according to claim 1, characterized in that, after reading the data to be loaded line by line, the method further comprises:
Parsing the fields of the line of data according to the metadata of the partition table and judging whether the format of the line of data is correct; if so, executing the step of obtaining the partition field of the line of data according to the partition information of the partition table; otherwise, reporting an error.
7. A data loading device, characterized in that the device comprises:
A creating module, configured to create a partition table according to the partition type, the partition table comprising a partition field;
A reading module, configured to read the data to be loaded line by line;
An obtaining module, configured to obtain, according to the partition information of the partition table, the partition field of the line of data read by the reading module;
A determining module, configured to determine, according to the partition field of the data obtained by the obtaining module, the partition to which the line of data belongs in the partition table;
A loading module, configured to load the line of data into the corresponding partition determined by the determining module.
8. The device according to claim 7, characterized in that the obtaining module is specifically configured to obtain the partition field of the partition table from the partition information of the partition table, and to match the partition field of the line of data according to the partition field of the partition table.
9. The device according to claim 7, characterized in that the determining module is specifically configured to match the partition field of the data against the partition spaces in the partition information, and to determine the partition corresponding to the matching partition space as the partition to which the line of data belongs in the partition table.
10. The device according to claim 7, characterized in that the device further comprises, after the loading module:
A merging module, configured to scan all generated partition files and, if the partition contains a partition file smaller than the default block size of the file system, to merge all files under the directory of the partition into one large file.
11. The device according to claim 7, characterized in that the device further comprises, after the creating module:
A recording module, configured to record the metadata of the partition table according to the partition table created by the creating module, the metadata comprising the partition information of the partition table and the mapping relationship between the partition table and the distributed file system.
12. The device according to claim 7, characterized in that the device further comprises, after the reading module:
A parsing module, configured to parse the fields of the line of data according to the metadata of the partition table and to judge whether the format of the line of data is correct; if so, the obtaining module is executed; otherwise, an error is reported.
CN2010105822860A 2010-12-03 2010-12-03 Data loading method and device Pending CN102486798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105822860A CN102486798A (en) 2010-12-03 2010-12-03 Data loading method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105822860A CN102486798A (en) 2010-12-03 2010-12-03 Data loading method and device

Publications (1)

Publication Number Publication Date
CN102486798A true CN102486798A (en) 2012-06-06

Family

ID=46152291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105822860A Pending CN102486798A (en) 2010-12-03 2010-12-03 Data loading method and device

Country Status (1)

Country Link
CN (1) CN102486798A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982186A (en) * 2012-12-26 2013-03-20 厦门市美亚柏科信息股份有限公司 Method and system for maintaining range partitioning table based on oracle database system
CN106933928A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 Task storage method and device based on external data file
CN107301214A (en) * 2017-06-09 2017-10-27 广州虎牙信息科技有限公司 Data migration method, device and terminal device in HIVE
CN108121745A (en) * 2016-11-30 2018-06-05 中移(苏州)软件技术有限公司 A kind of data load method and device
CN108205576A (en) * 2016-12-20 2018-06-26 航天信息股份有限公司 A kind of method and system for using and analyzing data based on Tool for Data Warehouse HIVE partitioned storages
CN108255628A (en) * 2016-12-29 2018-07-06 北京国双科技有限公司 A kind of data processing method and device
CN109271402A (en) * 2018-09-04 2019-01-25 拉扎斯网络科技(上海)有限公司 Data managing method, device, equipment and computer storage medium
CN109902126A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Support the loading system and its implementation of HIVE auto-partition
CN109918021A (en) * 2014-11-05 2019-06-21 华为技术有限公司 Data processing method and device
CN110019507A (en) * 2017-09-21 2019-07-16 北京国双科技有限公司 The synchronous method and device of data
CN110222017A (en) * 2019-05-13 2019-09-10 北京百度网讯科技有限公司 Processing method, device, equipment and the computer readable storage medium of real time data
CN110471896A (en) * 2019-06-21 2019-11-19 深圳中琛源科技股份有限公司 A kind of data processing method, system and server
CN112035486A (en) * 2020-09-03 2020-12-04 中国银行股份有限公司 Partition establishing method, device and equipment for partition table

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1541363A (en) * 2001-02-28 2004-10-27 Data loading from remote data source
US7143087B2 (en) * 2002-02-01 2006-11-28 John Fairweather System and method for creating a distributed network architecture
CN101217349A (en) * 2007-01-05 2008-07-09 中兴通讯股份有限公司 A bit collection device and method in hybrid automatic repeat request
CN101799748A (en) * 2009-02-06 2010-08-11 中国移动通信集团公司 Method for determining data sample class and system thereof
CN101799808A (en) * 2009-02-10 2010-08-11 中国移动通信集团公司 Data processing method and system thereof
CN101876983A (en) * 2009-04-30 2010-11-03 国际商业机器公司 Method for partitioning database and system thereof


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王松旺等: "Oracle分区技术在公共卫生数据共享中的应用研究", 《中国卫生工程学》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982186B (en) * 2012-12-26 2016-01-20 厦门市美亚柏科信息股份有限公司 Based on maintaining method and the system of the range partition table of oracle database system
CN102982186A (en) * 2012-12-26 2013-03-20 厦门市美亚柏科信息股份有限公司 Method and system for maintaining range partitioning table based on oracle database system
CN109918021A (en) * 2014-11-05 2019-06-21 华为技术有限公司 Data processing method and device
CN106933928B (en) * 2015-12-31 2019-12-10 北京国双科技有限公司 task storage method and device based on external data file
CN106933928A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 Task storage method and device based on external data file
CN108121745A (en) * 2016-11-30 2018-06-05 中移(苏州)软件技术有限公司 A kind of data load method and device
CN108121745B (en) * 2016-11-30 2021-08-06 中移(苏州)软件技术有限公司 Data loading method and device
CN108205576A (en) * 2016-12-20 2018-06-26 航天信息股份有限公司 A kind of method and system for using and analyzing data based on Tool for Data Warehouse HIVE partitioned storages
CN108255628A (en) * 2016-12-29 2018-07-06 北京国双科技有限公司 A kind of data processing method and device
CN107301214A (en) * 2017-06-09 2017-10-27 广州虎牙信息科技有限公司 Data migration method, device and terminal device in HIVE
CN107301214B (en) * 2017-06-09 2020-08-28 广州虎牙信息科技有限公司 Data migration method and device in HIVE and terminal equipment
CN110019507A (en) * 2017-09-21 2019-07-16 北京国双科技有限公司 The synchronous method and device of data
CN110019507B (en) * 2017-09-21 2021-06-08 北京国双科技有限公司 Data synchronization method and device
CN109271402A (en) * 2018-09-04 2019-01-25 拉扎斯网络科技(上海)有限公司 Data managing method, device, equipment and computer storage medium
CN109902126A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Support the loading system and its implementation of HIVE auto-partition
CN110222017A (en) * 2019-05-13 2019-09-10 北京百度网讯科技有限公司 Processing method, device, equipment and the computer readable storage medium of real time data
CN110222017B (en) * 2019-05-13 2021-09-21 北京百度网讯科技有限公司 Real-time data processing method, device and equipment and computer readable storage medium
CN110471896A (en) * 2019-06-21 2019-11-19 深圳中琛源科技股份有限公司 A kind of data processing method, system and server
CN110471896B (en) * 2019-06-21 2023-11-28 陕西融华电子科技有限公司 Data processing method, system and server
CN112035486A (en) * 2020-09-03 2020-12-04 中国银行股份有限公司 Partition establishing method, device and equipment for partition table
CN112035486B (en) * 2020-09-03 2023-09-19 中国银行股份有限公司 Partition establishing method, device and equipment of partition table


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20120606
