CN103049533A

CN103049533A - Method for quickly loading data into database

Info

Publication number: CN103049533A
Application number: CN2012105660757A
Authority: CN
Inventors: 张树杰; 王颖泽; 冯玉; 李祥凯; 任永杰; 王珊
Original assignee: Beijing Kingbase Information Technologies Co Ltd
Current assignee: Beijing Kingbase Information Technologies Co Ltd
Priority date: 2012-12-23
Filing date: 2012-12-23
Publication date: 2013-04-17

Abstract

The invention discloses a method for quickly loading data into a database. When a data file is written into the database, the data file is loaded in parallel; during loading, a direct-write mode is used, in which tuples are generated and then written directly into the data files of the database. Parallel threads raise CPU (central processing unit) utilization, and because the configured write mode writes data directly, the various transaction checks normally performed when a data file is written into the database are omitted, so the write efficiency of data files is effectively improved.

Description

A method for quickly loading data into a database

Technical field

The present invention relates to a method for quickly loading data files into a database, and belongs to the technical field of databases.

Background art

With the widespread adoption of Internet applications, accessing and storing massive amounts of data has become a bottleneck in database system design. The data-write interfaces of traditional databases mostly work in a single-threaded manner, which is inefficient when writing massive amounts of data. Moreover, existing database servers generally use multi-core CPUs, so single-threaded writing wastes a great deal of CPU resources. In addition, when external data is written into the database through a data-write interface, the database system usually performs a number of transaction checks, which also reduce the write efficiency of data files.

Chinese patent application No. 200910080927.X discloses a method and system for importing batch data into a database. In that technical solution, the process that analyzes the data in a data file runs concurrently with the process that writes the analyzed data into the database. Analyzed data is stored in a cache until analysis is complete. Whenever the data in the cache reaches a preset amount, that data is written to the database in a single operation and deleted from the cache. After analysis is complete, all data remaining in the cache is written to the database in a single operation. According to that application, data analysis and writing are fast, and the solution is particularly suitable for importing massive amounts of data into a database.

In addition, Ma Li et al., in the paper "A fast mass-data reading method based on a multi-core environment" (published in the proceedings of the 16th National Conference on Information Storage Technology (IST2010), 2010), point out that with the development of multi-core computers, multi-core PCs can already complete many large-scale computing tasks. When processing massive amounts of data, however, reading data from memory and secondary storage often becomes the bottleneck that limits application speed, so the superior hardware performance of a multi-core system cannot be fully exploited. The paper proposes a fast mass-data extraction method for multi-core environments. It is based on memory-mapped files and uses a partitioning scheme based on view-mapping granularity together with load balancing that combines dynamic and static strategies, achieving high-speed parallel extraction of massive data on a multi-core platform.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for quickly loading data into a database. Through technical means such as parallel threads and direct data writing, the method can significantly improve the loading efficiency of data files.

To achieve the above object, the present invention adopts the following technical solution:

A method for quickly loading data into a database, in which, while a data file is being written into the database, the data file is loaded in parallel; during loading, a direct-write mode is used, in which each tuple, once generated, is written directly into the data file of the database.

More preferably, before the data file is loaded, a configuration file is first created, and the number of parallel threads is set in the configuration file according to the hardware conditions of the database server.

More preferably, while the data file is being written into the database, the indexes of the database table are either maintained in real time or regenerated after the data has finished loading.

More preferably, after the database management system (DBMS) parses the configuration file and creates the loading environment, the data file is parsed first: according to the type of the data file, a reading thread reads one record that conforms to the database table, writes the record into a data slot, and passes the data slot to a parsing thread; the parsing thread parses the data into basic types that the database recognizes and places them in a data slot; a composing thread reads the data from the data slot and composes a tuple; after the tuple is composed, it is written according to the write mode and the index handling mode specified in the configuration file.

More preferably, if an exception occurs while the data file is being loaded, whether to ignore the exception is determined according to the configuration file; if the exception is ignored, the reading thread skips the current record and continues loading the next record; if the exception is not ignored, the data loading process exits.

More preferably, after the data loading process exits, if the loaded data is not to be kept, the database table is operated on so that the loaded data becomes invisible.

The method for quickly loading database data provided by the present invention raises CPU utilization through parallel threads and, by configuring the direct-write mode, dispenses with the various transaction checks performed when a data file is written, thereby effectively improving the write efficiency of data files.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall architecture of the method for quickly loading database data provided by the present invention;

Fig. 2 is a schematic diagram of the parsing process of the configuration file;

Fig. 3 is a schematic diagram of the writing process of the data file.

Detailed description

The technical idea of the present invention is to improve the loading efficiency of data files through technical means such as parallel threads and direct data writing. In a specific implementation, on the one hand, the configuration file specifies that data is loaded in parallel and sets the number of parallel threads; on the other hand, it specifies the direct-write (Direct) mode, which removes the successive layers of checks in the data file writing process. These technical means effectively improve the loading speed of data files. A detailed explanation is given below with reference to the accompanying drawings.

As shown in Fig. 1, after the database has started, if a data file needs to be written into the database quickly, a configuration file for the write process is first created according to the characteristics of the data file itself, the database table, and the hardware conditions. This configuration file can set the degree of parallelism, the write mode, the write target, the address of the data file, index support, log support, and the way exceptions that occur during loading are handled.
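The patent does not give a syntax for the configuration file, so the following is only a minimal sketch of what such a file could carry. The `key = value` layout, the key names (`input`, `table`, `write_mode`, and so on) and the `LoadConfig` type are all assumptions made for illustration; only the set of configurable items comes from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class WriteMode(Enum):
    BUFFER = "buffer"  # tuples pass through transaction checks before being written
    DIRECT = "direct"  # tuples are written straight into the table's data file


class IndexMode(Enum):
    MAINTAIN = "maintain"  # keep indexes up to date while loading
    REBUILD = "rebuild"    # regenerate indexes after loading finishes


@dataclass(frozen=True)
class LoadConfig:
    data_path: str              # address of the data file
    target_table: str           # write target
    file_format: str = "CSV"    # CSV, TEXT or BINARY
    parallelism: int = 1        # number of parallel threads
    write_mode: WriteMode = WriteMode.DIRECT
    index_mode: IndexMode = IndexMode.REBUILD
    logging: bool = False       # log support
    ignore_errors: bool = True  # exception handling during loading


def parse_config(text: str) -> LoadConfig:
    """Parse hypothetical 'key = value' lines into a LoadConfig."""
    raw = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        raw[key.strip().lower()] = value.strip()

    def flag(name: str, default: bool) -> bool:
        return raw.get(name, str(default)).lower() in ("true", "yes", "on", "1")

    return LoadConfig(
        data_path=raw["input"],
        target_table=raw["table"],
        file_format=raw.get("format", "CSV").upper(),
        parallelism=int(raw.get("parallelism", "1")),
        write_mode=WriteMode(raw.get("write_mode", "direct").lower()),
        index_mode=IndexMode(raw.get("index_mode", "rebuild").lower()),
        logging=flag("logging", False),
        ignore_errors=flag("ignore_errors", True),
    )
```

For example, `parse_config("input = /tmp/orders.csv\ntable = orders\nparallelism = 4")` (a made-up path and table name) yields a configuration with four threads, direct writing and index rebuilding after the load.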

The degree of parallelism (single-threaded or multi-threaded) needs to be configured according to the hardware conditions of the database server on which the database runs, such as the number of CPU cores. Configuring the degree of parallelism appropriately improves CPU utilization. In the present invention, data is preferably written by multiple threads in parallel, and the number of parallel threads is determined by a configuration parameter in the configuration file.
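One plausible way to tie the configured thread count to the hardware is sketched below. The function name `choose_thread_count` and the policy of capping the configured value at the number of CPU cores are assumptions for illustration; the patent only states that the value is set in the configuration file according to hardware conditions.

```python
import os


def choose_thread_count(configured, cpu_cores=None):
    """Return the number of loader threads to start.

    configured: value from the configuration file, or None if unset.
    cpu_cores:  override for the detected core count (useful in tests).
    """
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    if configured is None:
        return cores  # assumed default: one thread per core
    # Assumed policy: never fewer than 1 thread, never more than the core count.
    return max(1, min(configured, cores))
```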

The write mode is selected according to the required write efficiency and the condition of the data file itself; for example, the buffered-write (Buffer) mode or the direct-write (Direct) mode can be selected. In the buffered-write mode, a tuple is written into the data file only after operations such as transaction checks. In the direct-write mode, no such checks are needed and the tuple can be written into the data file directly. If the data file to be written already satisfies the transaction-check requirements of the write target (the database table) and higher loading efficiency is needed, the direct-write mode is preferred: after a tuple is generated, it is written directly into the data file of the database.
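The difference between the two write modes can be illustrated with a toy in-memory table. This is a sketch under stated assumptions, not the DBMS's actual write path: the `Table` class, its `checks` list (standing in for the transaction checks) and the `rows` list (standing in for the table's data file) are all hypothetical.

```python
class Table:
    """Toy stand-in for a database table and its data file."""

    def __init__(self, checks):
        self.checks = checks  # callables: tuple -> bool, standing in for transaction checks
        self.rows = []        # stands in for the table's data file
        self.checks_run = 0   # counts how many checks were executed

    def write_buffered(self, row):
        # Buffer mode: every check must pass before the tuple is stored.
        for check in self.checks:
            self.checks_run += 1
            if not check(row):
                raise ValueError(f"tuple rejected: {row!r}")
        self.rows.append(row)

    def write_direct(self, row):
        # Direct mode: the caller guarantees validity, so no checks run.
        self.rows.append(row)


def write_tuple(table, row, mode):
    """Dispatch one tuple according to the configured write mode."""
    if mode == "direct":
        table.write_direct(row)
    elif mode == "buffer":
        table.write_buffered(row)
    else:
        raise ValueError(f"unknown write mode: {mode!r}")
```

The trade-off is visible in the sketch: direct mode does less work per tuple, but it is only safe when the input is already known to satisfy the table's requirements.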

In the present invention, the data file needs to conform to a fixed standard format, which can be one of three formats: CSV, TEXT, or Binary. Specifying the file format in advance in the configuration file determines how the data is parsed in subsequent operations.
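Selecting a parser from the configured format could look like the sketch below. The details of each format are assumptions, since the patent does not define them: TEXT is taken to be tab-delimited lines, and Binary is taken to be fixed-length records with a made-up layout (`RECORD`, a little-endian 32-bit integer followed by a 64-bit integer).

```python
import csv
import io
import struct

# Hypothetical binary layout: int32 id followed by int64 value, little-endian.
RECORD = struct.Struct("<iq")


def parse_csv(data: bytes):
    text = io.StringIO(data.decode("utf-8"), newline="")
    return [tuple(row) for row in csv.reader(text)]


def parse_text(data: bytes, delimiter: str = "\t"):
    lines = data.decode("utf-8").splitlines()
    return [tuple(line.split(delimiter)) for line in lines if line]


def parse_binary(data: bytes):
    # iter_unpack requires len(data) to be a multiple of RECORD.size.
    return list(RECORD.iter_unpack(data))


PARSERS = {"CSV": parse_csv, "TEXT": parse_text, "BINARY": parse_binary}


def parse_records(file_format: str, data: bytes):
    """Pick the parser named in the configuration and apply it."""
    try:
        parser = PARSERS[file_format.upper()]
    except KeyError:
        raise ValueError(f"unsupported format: {file_format!r}") from None
    return parser(data)
```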

Index support covers writing data into a database table on which indexes already exist. In the present invention, one of two approaches can be selected: maintaining the indexes in real time while data is written, or regenerating the indexes after the data has finished loading.
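The two index-handling approaches can be contrasted with a toy example in which the index is a dictionary from key value to row positions. The function `load_rows` and the mode names `"maintain"` and `"rebuild"` are hypothetical; a real DBMS would update a B-tree or similar structure instead.

```python
from collections import defaultdict


def load_rows(rows, key_column, index_mode):
    """Load rows into a list-backed table and build a key -> positions index."""
    if index_mode not in ("maintain", "rebuild"):
        raise ValueError(f"unknown index mode: {index_mode!r}")

    table = []
    index = defaultdict(list)

    for row in rows:
        table.append(row)
        if index_mode == "maintain":
            # Real-time maintenance: update the index with every written tuple.
            index[row[key_column]].append(len(table) - 1)

    if index_mode == "rebuild":
        # Deferred approach: one pass over the table after loading finishes.
        for position, row in enumerate(table):
            index[row[key_column]].append(position)

    return table, dict(index)
```

Both modes end with the same index; they differ in when the indexing cost is paid. Deferring the rebuild keeps the per-tuple write path short, which is usually why bulk loaders offer it.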

After the configuration file has specified the database table to write to, the write mode, the storage address of the data to be written, and other such information, a database system function passes the information in the configuration file to the database management system (DBMS). Referring to Fig. 1, the DBMS uses a parser to parse the data according to the information in the configuration file; the data parsed here can be of any of the three file types CSV, TEXT, and Binary. A writer then composes the data (that is, the tuples) and writes it. The parsing and composing of data are preferably carried out by multiple threads in parallel, which improves the efficiency of data generation and thus the loading speed of the data file.

As shown in Fig. 2, the configuration file enters through the database system interface and is then parsed by the DBMS, which obtains the configuration information and, according to the configuration content, creates the loading environment, creates the thread information and starts the parallel threads, and creates the data slot information used for loading. The parallel threads have different subtasks and include a reading thread, a parsing thread, a composing thread, a writing thread, and so on.

As shown in Fig. 3, after the DBMS parses the configuration file and creates the loading environment, the data file is parsed first: according to the type of the data file, the reading thread reads one record that conforms to the database table, writes the record into a data slot, and passes the data slot to the parsing thread. The parsing thread parses the data into basic types that the database can recognize and places them in a data slot; the composing thread then reads the data from the data slot and merges it into a tuple. After the tuple is composed, it is written according to the write mode and the index handling mode specified in the configuration file.
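The read, parse and compose stages described above form a classic producer-consumer pipeline. The following is a minimal sketch using one thread per stage, with `queue.Queue` objects standing in for the data slots. The function `run_pipeline`, the comma-separated input, and the in-memory `table` list are assumptions for illustration; the actual slot structure and thread counts are not specified in the patent, and exception handling is reduced to shutting the pipeline down cleanly.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def run_pipeline(lines, column_types):
    """Load comma-separated lines through read -> parse -> compose stages."""
    raw_slots = queue.Queue()     # reading thread -> parsing thread
    parsed_slots = queue.Queue()  # parsing thread -> composing thread
    table = []                    # stands in for the table's data file

    def reader():
        # Reading thread: put one record at a time into a data slot.
        for line in lines:
            raw_slots.put(line)
        raw_slots.put(_DONE)

    def parser():
        # Parsing thread: convert text fields into basic database types.
        try:
            while True:
                line = raw_slots.get()
                if line is _DONE:
                    return
                fields = line.rstrip("\n").split(",")
                parsed_slots.put([t(f) for t, f in zip(column_types, fields)])
        finally:
            parsed_slots.put(_DONE)  # always let the composer terminate

    def composer():
        # Composing thread: merge parsed values into a tuple and write it.
        while True:
            values = parsed_slots.get()
            if values is _DONE:
                return
            table.append(tuple(values))

    threads = [threading.Thread(target=stage) for stage in (reader, parser, composer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return table
```

Because each stage here has a single thread and the queues are first-in, first-out, tuples arrive in input order. Running several parsing threads, as the configured degree of parallelism would allow, trades that ordering guarantee for throughput.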

If an exception occurs during the data file loading process described above, the configuration in the configuration file determines whether the exception is ignored. If it is ignored, the reading thread skips the current record and continues loading the next record; if it is not ignored, the data loading process exits. After the data loading process exits, the configuration in the configuration file determines whether the data already loaded is kept. If the data is not kept, the database table is operated on so that the loaded data becomes invisible. In addition, if the upper limit on the number of skipped exceptions is exceeded during loading, this indicates that the data file being loaded contains many errors. In that case, the data already loaded can be handled in the way specified by the configuration file: either the loaded data is kept, or the loaded data is cleared.
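The exception-handling policy above can be summarized in a short sketch. The function `load_with_policy` and its parameter names are hypothetical, a `ValueError` from the record parser stands in for a loading exception, and "making loaded data invisible" is modeled simply by returning an empty result.

```python
def load_with_policy(records, parse, ignore_errors, max_skipped, keep_loaded):
    """Load records, skipping or aborting on bad input as configured.

    Returns (visible_rows, skipped_count, aborted).
    """
    loaded = []
    skipped = 0
    aborted = False

    for record in records:
        try:
            loaded.append(parse(record))
        except ValueError:
            if ignore_errors and skipped < max_skipped:
                skipped += 1  # skip this record and continue with the next one
                continue
            aborted = True    # not ignored, or skip limit exceeded: exit loading
            break

    # After an abort, the configuration decides whether loaded data stays visible.
    visible = loaded if (not aborted or keep_loaded) else []
    return visible, skipped, aborted
```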

The method for quickly loading database data proposed by the present invention has been described in detail above. For those of ordinary skill in the art, any obvious modification made to it without departing from the essence of the present invention will constitute an infringement of the patent right of the present invention, and the party responsible will bear the corresponding legal liability.

Claims

1. A method for quickly loading data into a database, characterized in that:

while a data file is being written into the database, the data file is loaded in parallel; during loading, a direct-write mode is used, in which each tuple, once generated, is written directly into the data file of the database.

2. The method for quickly loading data into a database according to claim 1, characterized in that:

before the data file is loaded, a configuration file is first created, and the number of parallel threads is set in the configuration file according to the hardware conditions of the database server.

3. The method for quickly loading data into a database according to claim 1, characterized in that:

when the direct-write mode is used, no transaction checks are performed.

4. The method for quickly loading data into a database according to claim 1, characterized in that:

while the data file is being written into the database, the indexes of the database table are maintained in real time.

5. The method for quickly loading data into a database according to claim 1, characterized in that:

in the process of writing the data file into the database, the indexes of the database table are regenerated after the data has finished loading.

6. The method for quickly loading data into a database according to claim 2, 4 or 5, characterized in that:

the handling mode of the indexes is set in the configuration file.

7. The method for quickly loading data into a database according to any one of claims 2 to 5, characterized in that:

after the database management system (DBMS) parses the configuration file and creates the loading environment, the data file is parsed first: according to the type of the data file, a reading thread reads one record that conforms to the database table, writes the record into a data slot, and passes the data slot to a parsing thread; the parsing thread parses the data into basic types that the database recognizes and places them in a data slot; a composing thread reads the data from the data slot and composes a tuple; after the tuple is composed, it is written according to the write mode and the index handling mode specified in the configuration file.

8. The method for quickly loading data into a database according to claim 1 or 2, characterized in that:

if an exception occurs while the data file is being loaded, whether to ignore the exception is determined according to the configuration file; if the exception is ignored, the reading thread skips the current record and continues loading the next record; if the exception is not ignored, the data loading process exits.

9. The method for quickly loading data into a database according to claim 8, characterized in that:

after the data loading process exits, if the loaded data is not to be kept, the database table is operated on so that the loaded data becomes invisible.