CN103559272A

CN103559272A - Method and device for importing data into dimension table

Info

Publication number: CN103559272A
Application number: CN201310541634.3A
Authority: CN
Inventors: 洪超
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2013-11-05
Filing date: 2013-11-05
Publication date: 2014-02-05

Abstract

The invention discloses a method and device for importing data into a dimension table. The method for importing data into the dimension table comprises the following steps: establishing a unique index of a target dimension table, wherein the target dimension table is a dimension table for receiving data source data in a database; setting the property of the unique index as a preset property, wherein the preset property indicates that the data source data are not inserted and the database does not report errors under the condition that the data source data already exist in the target dimension table; importing the data source data into the target dimension table. According to the method and the device, the problem of low importing efficiency of data into a large dimension table is solved, and the effect of increasing the data importing efficiency is further achieved.

Description

Method and device for importing data into a dimension table

Technical field

The present invention relates to the field of databases, and in particular to a method and device for importing data into a dimension table.

Background art

As data volumes grow, many companies build their analysis systems on databases, which contain dimensions and metrics. A dimension table stores the values of a dimension. For example, DimUrl stores the URL dimension, so that related metrics (such as visits and page views) can be analyzed in the database from the perspective of the URL. Logically, each row of a dimension table represents one unique record of that dimension; each row of the DimUrl table, for instance, represents one unique URL. Once a data warehouse grows beyond a certain scale, large dimension tables become unavoidable. Such tables often need many rows imported every day, and the table must still be unique after each import. Importing rows into a large dimension table therefore has to satisfy two conditions at once: 1. the import must be fast; 2. every record in the large dimension table must remain unique.

When ETL (extract, transform, load) is carried out with the SSIS tool, imports into large dimension tables currently generally use the Lookup control. For each row to be inserted, the Lookup control determines whether the row already exists in the large dimension table; if it exists, the row is not inserted, and if it does not exist, the row is inserted. This scheme imports row by row and is very inefficient.

No effective solution has yet been proposed for the low efficiency of importing data into large dimension tables in the related art.

Summary of the invention

The main purpose of the present invention is to provide a method and device for importing data into a dimension table, so as to solve the problem in the prior art that importing data into a large dimension table is inefficient.

To achieve the above purpose, according to one aspect of the present invention, a method for importing data into a dimension table is provided, comprising: establishing a unique index of a target dimension table, wherein the target dimension table is the dimension table in a database that receives data source data; setting the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error; and importing the data source data into the target dimension table.

Further, before the data source data are imported into the target dimension table, the method for importing data into a dimension table also comprises: checking whether the data source data contain duplicates; and, if duplicates are found, deleting the duplicated part of the data source data or selecting any one item from the duplicated data source data as the data to be imported.

Further, importing the data source data into the target dimension table comprises: importing the data source data into a temporary table of the database; establishing a unique index of the temporary table; and importing the data in the temporary table into the target dimension table.

Further, before the data source data are imported into the target dimension table, the method for importing data into a dimension table also comprises: calculating a mapping value of each item of data source data, wherein the length of the mapping value is less than the length of the corresponding data source data.

Further, the mapping value is a hash value.

Further, the unique index of the target dimension table is established according to the key value of the target dimension table.

To achieve the above purpose, according to another aspect of the present invention, a device for importing data into a dimension table is provided. The device is used to carry out any of the methods for importing data into a dimension table provided above.

To achieve the above purpose, according to another aspect of the present invention, a device for importing data into a dimension table is provided, comprising: an establishing unit for establishing a unique index of a target dimension table, wherein the target dimension table is the dimension table in a database that receives data source data; a setting unit for setting the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error; and an importing unit for importing the data source data into the target dimension table.

Further, the device for importing data into a dimension table also comprises: a checking unit for checking, before the data source data are imported into the target dimension table, whether the data source data contain duplicates; and a processing unit for, when duplicates are found, deleting the duplicated part of the data source data or selecting any one item from the duplicated data source data as the data to be imported.

Further, the importing unit comprises: a first importing subunit for importing the data source data into a temporary table of the database; an establishing subunit for establishing a unique index of the temporary table; and a second importing subunit for importing the data in the temporary table into the target dimension table.

Further, the device for importing data into a dimension table also comprises: a computing unit for calculating a mapping value of each item of data source data, wherein the length of the mapping value is less than the length of the corresponding data source data.

Further, the computing unit uses a hash algorithm to calculate the mapping value.

Further, the establishing unit establishes the unique index of the target dimension table according to the key value of the target dimension table.

The present invention establishes a unique index of a target dimension table, wherein the target dimension table is the dimension table in a database that receives data source data; sets the property of the unique index to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error; and imports the data source data into the target dimension table. Because a unique index has been established on the target dimension table, the data source data can be matched against the data already in the target dimension table in a many-to-many (batch) fashion during the import. Compared with the prior art, in which the Lookup control determines row by row whether each item of data source data exists in the database, efficiency is greatly improved. In addition, because the property of the unique index is set so that the database neither reports an error nor inserts the data when the data source data are found to exist already in the target dimension table, the import is not interrupted and the uniqueness of the data is still guaranteed. This solves the problem that importing data into a large dimension table is inefficient and thereby improves data import efficiency.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for importing data into a dimension table according to the first embodiment of the present invention;

Fig. 2 is a flowchart of the method for importing data into a dimension table according to the second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the device for importing data into a dimension table according to the first embodiment of the present invention; and

Fig. 4 is a schematic structural diagram of the device for importing data into a dimension table according to the second embodiment of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments can be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

The present invention provides a method for importing data into a dimension table. The method provided by the embodiments of the present invention is described in detail below:

Fig. 1 is a flowchart of the method for importing data into a dimension table according to the first embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps S102 to S106:

Step S102: establish a unique index of the target dimension table, wherein the target dimension table is the dimension table in the database that receives data source data. Specifically, the unique index of the target dimension table can be established according to the key value of the target dimension table.

Step S104: set the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error. Specifically, the embodiment of the present invention uses the database program Microsoft SQL Server and sets the "ignore duplicate value" property of the unique index to true. With this setting, when data source data are found to exist already in the target dimension table, the database neither reports an error nor inserts the data.

Step S106: import the data source data into the target dimension table. The data are imported directly; there is no need for the Lookup control to determine row by row whether each item of data source data exists in the database.

Because a unique index has been established on the target dimension table, the data source data can be matched against the data already in the target dimension table in a many-to-many (batch) fashion during the import. Compared with the prior art, in which the Lookup control determines row by row whether each item of data source data exists in the database, efficiency is greatly improved. In addition, because the property of the unique index is set so that the database neither reports an error nor inserts the data when the data source data are found to exist already in the target dimension table, the import is not interrupted and the uniqueness of the data is still guaranteed. This solves the problem that importing data into a large dimension table is inefficient and thereby improves data import efficiency.
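Steps S102 to S106 can be illustrated with a minimal sketch. It is not the SQL Server setup of the embodiment: as an assumption made only so the example is self-contained, it uses Python's built-in `sqlite3` module, where a `UNIQUE ON CONFLICT IGNORE` column constraint plays the role of the unique index with its "ignore duplicate value" property. The table name `DimUrl` comes from the background section; the column names and the helper `import_rows` are hypothetical.

```python
import sqlite3


def import_rows(conn, urls):
    """Batch-insert URLs into DimUrl; return how many rows were actually added."""
    before = conn.execute("SELECT COUNT(*) FROM DimUrl").fetchone()[0]
    # Step S106: plain batch insert, with no per-row existence check in client code.
    conn.executemany("INSERT INTO DimUrl (Url) VALUES (?)", [(u,) for u in urls])
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM DimUrl").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
# Steps S102 and S104 (SQLite stand-in, an assumption of this sketch):
# UNIQUE creates a unique index on Url; ON CONFLICT IGNORE makes the database
# skip a duplicate row silently instead of raising an error.
conn.execute(
    "CREATE TABLE DimUrl ("
    " UrlKey INTEGER PRIMARY KEY,"
    " Url TEXT NOT NULL UNIQUE ON CONFLICT IGNORE)"
)

first = import_rows(conn, ["http://a.example/", "http://b.example/"])
# The second batch overlaps the first; the overlapping row is skipped quietly.
second = import_rows(conn, ["http://b.example/", "http://c.example/"])
print(first, second)
```

The second call completes without an exception even though one of its rows already exists, which is the behavior the preset property is meant to provide.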

Further, before the data source data are imported into the target dimension table, the method for importing data into a dimension table of the embodiment of the present invention also comprises: checking whether the data source data contain duplicates and, if duplicates are found, deleting the duplicated part of the data source data, or selecting any one item from the duplicated data source data as the data to be imported. Because the data source data may contain duplicates, deduplicating them in SSIS before the import reduces the volume of data from the data source and further improves import efficiency.
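The deduplication step can be sketched as follows. The embodiment performs it in SSIS; this Python version is only an illustration of the idea, and the function name `dedupe_keep_first` and the sample rows are hypothetical. It keeps the first item of each group of duplicates, which is one way of "selecting any one item" from them.

```python
def dedupe_keep_first(rows, key=lambda row: row):
    """Return rows with duplicates removed, keeping the first row of each key."""
    seen = set()
    unique_rows = []
    for row in rows:
        k = key(row)
        if k not in seen:  # first time this key value is encountered
            seen.add(k)
            unique_rows.append(row)
    return unique_rows


source = [
    "http://a.example/",
    "http://b.example/",
    "http://a.example/",
    "http://c.example/",
    "http://b.example/",
]
deduped = dedupe_keep_first(source)
print(deduped)
```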

Where the data source data take the form of files (for example CSV files), the duplicated data can simply be deleted. If the data source is a database, however, deleting duplicated data is a time-consuming write operation; in that case, one item is selected from the duplicated data source data as the data that will subsequently be imported into the target dimension table.

Further, before the data source data are imported into the target dimension table, the method for importing data into a dimension table also comprises: calculating a mapping value of each item of data source data, wherein the length of the mapping value is less than the length of the corresponding data source data. Specifically, in the embodiment of the present invention, a hash algorithm can be used to calculate the mapping value of each item of data source data. A hash algorithm maps longer binary data to shorter binary data, and different data are, for practical purposes, mapped to different hash values. Because the data source data may be very long, matching on hash values can improve the efficiency of data matching. It should be noted that the embodiment of the present invention uses a hash algorithm to map the data source data to shorter data, but is not limited to this; other mappings that can map the data may also be used.
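As an illustration of the mapping value, the sketch below maps a URL of arbitrary length to a fixed 16-byte value with `hashlib.md5`. The patent does not name a specific hash algorithm, so MD5 and the helper name `url_hash` are assumptions of this example. Hash collisions are theoretically possible, so a real design would have to decide how to handle them.

```python
import hashlib


def url_hash(url: str) -> bytes:
    """Map a URL of any length to a fixed 16-byte hash value (MD5 digest)."""
    return hashlib.md5(url.encode("utf-8")).digest()


long_url = "http://example.com/" + "path/" * 200 + "?q=dimension"
short_key = url_hash(long_url)
# The mapping value is far shorter than the data it stands for,
# so comparing or indexing it is cheaper than comparing the full URL.
print(len(long_url.encode("utf-8")), len(short_key))
```

Matching can then be done on the short key column instead of the long URL column, for example by building the unique index on the hash value.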

Fig. 2 is a flowchart of the method for importing data into a dimension table according to the second embodiment of the present invention. The method provided by this second embodiment can serve as a preferred implementation of the method of the first embodiment. As shown in Fig. 2, the method comprises the following steps S202 to S212:

Step S202: establish a unique index of the target dimension table, wherein the target dimension table is the dimension table in the database that receives data source data. Specifically, the unique index can be established according to the key value of the target dimension table.

Step S204: set the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error. Specifically, the embodiment of the present invention uses the database program SQL Server and sets the "ignore duplicate value" property of the unique index to true. With this setting, when data source data are found to exist already in the target dimension table, the database neither reports an error nor inserts the data.

Step S206: check whether the data source data contain duplicates and, if they do, delete the duplicated part of the data source data or select any one item from the duplicated data source data as the data to be imported. Because the data source data may contain duplicates, deduplicating them in SSIS before the import reduces the volume of data from the data source and further improves import efficiency.

Step S208: import the data source data into a temporary table of the database.

Step S210: establish a unique index of the temporary table.

Step S212: import the data in the temporary table into the target dimension table.

When the volume of data source data is large, SQL Server does not know that the data source data have already been deduplicated in SSIS, so it assumes that the data may contain duplicates. Duplicated data would cause the same record to be looked up repeatedly in the dimension table or its index during insertion, that is, a "rewind". To reduce the cost of rewinds, SQL Server, when building the execution plan for the batch query, may load all data in the relevant matching columns of the dimension table into memory. In this embodiment, however, the data source data have already been deduplicated and contain no duplicates, so loading all data in the relevant matching columns of the dimension table into memory incurs unnecessary memory overhead.

In the other case, when SQL Server judges that the cost of loading all data in the relevant matching columns of the dimension table into memory is higher than the cost of rewinds, its execution plan uses Nested Loops to query row by row instead of performing a batch operation, which likewise reduces import efficiency.

The method for importing data into a dimension table of the second embodiment of the present invention first deduplicates the data source data, then puts the data source data into a temporary table of the database and establishes a unique index on the data in the temporary table. The data in the temporary table are therefore unique, which tells the database that the data source data are unique and contain no duplicates. When SQL Server performs the batch query, it no longer checks the data source data for duplicates and performs batch matching directly. This removes the overhead of loading all data in the relevant matching columns of the dimension table into memory, and Nested Loops row-by-row queries are not used either. When a large amount of data is imported into a dimension table at one time, the method provided by the second embodiment is more efficient than the method provided by the first embodiment.
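The flow of steps S202 to S212 can be sketched as below. As before, this is an illustration under stated assumptions, not the SQL Server and SSIS setup of the embodiment: it uses Python's built-in `sqlite3`, a `UNIQUE ON CONFLICT IGNORE` constraint stands in for the unique index with the "ignore duplicate value" property, and the names `StageUrl`, `UX_StageUrl_Url` and `import_via_temp_table` are hypothetical. SQLite's query planner also differs from SQL Server's, so the sketch shows the order of operations only, not the execution-plan effect described above.

```python
import sqlite3


def import_via_temp_table(conn, rows):
    """Deduplicate rows, stage them in a temporary table, then batch-import."""
    unique_rows = list(dict.fromkeys(rows))  # S206: drop duplicates, keep first
    # S208: load the deduplicated source data into a temporary table.
    conn.execute("CREATE TEMP TABLE StageUrl (Url TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO StageUrl (Url) VALUES (?)", [(u,) for u in unique_rows]
    )
    # S210: the unique index declares to the database that staged rows are unique.
    conn.execute("CREATE UNIQUE INDEX UX_StageUrl_Url ON StageUrl (Url)")
    # S212: one set-based statement moves the staged rows into the dimension table;
    # rows that already exist there are skipped without an error.
    conn.execute("INSERT INTO DimUrl (Url) SELECT Url FROM StageUrl")
    conn.execute("DROP TABLE StageUrl")
    conn.commit()


conn = sqlite3.connect(":memory:")
# S202 and S204 (SQLite stand-in, an assumption of this sketch).
conn.execute(
    "CREATE TABLE DimUrl ("
    " UrlKey INTEGER PRIMARY KEY,"
    " Url TEXT NOT NULL UNIQUE ON CONFLICT IGNORE)"
)
conn.execute("INSERT INTO DimUrl (Url) VALUES ('http://a.example/')")
conn.commit()

import_via_temp_table(
    conn,
    ["http://a.example/", "http://b.example/", "http://b.example/", "http://c.example/"],
)
urls = [row[0] for row in conn.execute("SELECT Url FROM DimUrl ORDER BY Url")]
print(urls)
```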

Further, before the data source data are imported into the target dimension table, the method for importing data into a dimension table also comprises: calculating a mapping value of each item of data source data, wherein the length of the mapping value is less than the length of the corresponding data source data. Specifically, in the embodiment of the present invention, a hash algorithm can be used to calculate the mapping value of each item of data source data. A hash algorithm maps longer binary data to shorter binary data, and different data are, for practical purposes, mapped to different hash values. Because the data source data may be very long, matching on hash values can improve the efficiency of data matching. It should be noted that the embodiment of the present invention uses a hash algorithm to map the data source data to shorter data, but is not limited to this; other mappings that can map the data may also be used.

The embodiment of the present invention also provides a device for importing data into a dimension table. The device is mainly used to carry out the method for importing data into a dimension table provided above. The device for importing data into a dimension table provided by the embodiment of the present invention is described in detail below:

Fig. 3 is a structural diagram of the device for importing data into a dimension table according to the first embodiment of the present invention. As shown in Fig. 3, the device comprises: an establishing unit 10, a setting unit 20 and an importing unit 30.

The establishing unit 10 is used to establish a unique index of the target dimension table, wherein the target dimension table is the dimension table in the database that receives data source data. In addition, the unique index of the target dimension table can be established according to the key value of the target dimension table.

The setting unit 20 is used to set the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error. Specifically, the embodiment of the present invention uses the database program Microsoft SQL Server and sets the "ignore duplicate value" property of the unique index to true. With this setting, when data source data are found to exist already in the target dimension table, the database neither reports an error nor inserts the data.

The importing unit 30 is used to import the data source data into the target dimension table. The data are imported directly; there is no need for the Lookup control to determine row by row whether each item of data source data exists in the database.

Further, the device for importing data into a dimension table of the embodiment of the present invention also comprises a checking unit 40 and a processing unit 50. The checking unit 40 is used to check, before the data source data are imported into the target dimension table, whether the data source data contain duplicates. The processing unit 50 is used, when duplicates are found in the data source data, to delete the duplicated part of the data source data, or to select any one item from the duplicated data source data as the data to be imported. Because the data source data may contain duplicates, deduplicating them in SSIS before the import reduces the volume of data from the data source and further improves import efficiency.

Where the data source data take the form of files (for example CSV files), the duplicated data can simply be deleted. If the data source is a database, however, deleting duplicated data is a time-consuming write operation; in that case, one item is selected from the duplicated data source data as the data that will subsequently be imported into the target dimension table.

Further, the device for importing data into a dimension table of the embodiment of the present invention also comprises a computing unit. The computing unit is used to calculate, before the data source data are imported into the target dimension table, a mapping value of each item of data source data, wherein the length of the mapping value is less than the length of the corresponding data source data. A hash algorithm maps longer binary data to shorter binary data, and different data are, for practical purposes, mapped to different hash values. Because the data source data may be very long, matching on hash values can improve the efficiency of data matching. The embodiment of the present invention uses a hash algorithm to map the data source data to shorter data, but is not limited to this.

Fig. 4 is a structural diagram of the device for importing data into a dimension table according to the second embodiment of the present invention. As shown in Fig. 4, the device comprises: an establishing unit 10, a setting unit 20, an importing unit 30, a checking unit 40 and a processing unit 50. The importing unit 30 comprises a first importing subunit 301, an establishing subunit 302 and a second importing subunit 303.

The checking unit 40 is used to check, before the data source data are imported into the target dimension table, whether the data source data contain duplicates.

The processing unit 50 is used, when duplicates are found in the data source data, to delete the duplicated part of the data source data or to select any one item from the duplicated data source data as the data to be imported. Because the data source data may contain duplicates, deduplicating them in SSIS before the import reduces the volume of data from the data source and further improves import efficiency.

The importing unit 30 is used to import the data source data into the target dimension table and mainly comprises the first importing subunit 301, the establishing subunit 302 and the second importing subunit 303. The first importing subunit 301 is used to import the data source data into a temporary table of the database. The establishing subunit 302 is used to establish a unique index of the temporary table. The second importing subunit 303 is used to import the data in the temporary table into the target dimension table.

The device for importing data into a dimension table of the second embodiment of the present invention first deduplicates the data source data, then puts the data source data into a temporary table of the database and establishes a unique index on the data in the temporary table. The data in the temporary table are therefore unique, which tells the database that the data source data are unique and contain no duplicates. When SQL Server performs the batch query, it no longer checks the data source data for duplicates and performs batch matching directly. This removes the overhead of loading all data in the relevant matching columns of the dimension table into memory, and Nested Loops row-by-row queries are not used either. When a large amount of data is imported into a dimension table at one time, the device provided by the second embodiment is more efficient than the device provided by the first embodiment.

Further, the device for importing data into a dimension table of the second embodiment of the present invention also comprises a computing unit. The computing unit is used to calculate, before the data source data are imported into the target dimension table, a mapping value of each item of data source data, wherein the length of the mapping value is less than the length of the corresponding data source data. A hash algorithm maps longer binary data to shorter binary data, and different data are, for practical purposes, mapped to different hash values. Because the data source data may be very long, matching on hash values can improve the efficiency of data matching. The embodiment of the present invention uses a hash algorithm to map the data source data to shorter data, but is not limited to this.

As can be seen from the above description, the present invention makes it possible to import data into a dimension table in batches and thereby improves the efficiency of data import.

It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for importing data into a dimension table, characterized by comprising:

establishing a unique index of a target dimension table, wherein the target dimension table is the dimension table in a database that receives data source data;

setting the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error; and

importing the data source data into the target dimension table.

2. The method for importing data into a dimension table according to claim 1, characterized in that, before the data source data are imported into the target dimension table, the method also comprises:

checking whether the data source data contain duplicates; and

if the data source data are found to contain duplicates, deleting the duplicated part of the data source data or selecting any one item from the duplicated data source data as the data to be imported.

3. The method for importing data into a dimension table according to claim 2, characterized in that importing the data source data into the target dimension table comprises:

importing the data source data into a temporary table of the database;

establishing a unique index of the temporary table; and

importing the data in the temporary table into the target dimension table.

4. The method for importing data into a dimension table according to claim 1, characterized in that, before the data source data are imported into the target dimension table, the method also comprises:

calculating a mapping value of each item of the data source data, wherein the length of the mapping value is less than the length of the corresponding data source data.

5. The method for importing data into a dimension table according to claim 4, characterized in that the mapping value is a hash value.

6. The method for importing data into a dimension table according to claim 1, characterized in that the unique index of the target dimension table is established according to the key value of the target dimension table.

7. A device for importing data into a dimension table, characterized by comprising:

an establishing unit for establishing a unique index of a target dimension table, wherein the target dimension table is the dimension table in a database that receives data source data;

a setting unit for setting the property of the unique index of the target dimension table to a preset property, wherein the preset property indicates that, when the data source data already exist in the target dimension table, the data source data are not inserted and the database does not report an error; and

an importing unit for importing the data source data into the target dimension table.

8. The device for importing data into a dimension table according to claim 7, characterized in that the device also comprises:

a checking unit for checking, before the data source data are imported into the target dimension table, whether the data source data contain duplicates; and

a processing unit for, when the data source data are found to contain duplicates, deleting the duplicated part of the data source data or selecting any one item from the duplicated data source data as the data to be imported.

9. The device for importing data into a dimension table according to claim 8, characterized in that the importing unit comprises:

a first importing subunit for importing the data source data into a temporary table of the database;

an establishing subunit for establishing a unique index of the temporary table; and

a second importing subunit for importing the data in the temporary table into the target dimension table.

10. The device for importing data into a dimension table according to claim 7, characterized in that the device also comprises:

a computing unit for calculating a mapping value of each item of the data source data, wherein the length of the mapping value is less than the length of the corresponding data source data.

11. The device for importing data into a dimension table according to claim 10, characterized in that the computing unit uses a hash algorithm to calculate the mapping value.

12. The device for importing data into a dimension table according to claim 7, characterized in that the establishing unit establishes the unique index of the target dimension table according to the key value of the target dimension table.