CN105069149A

CN105069149A - Structured line data-oriented distributed parallel data importing method

Info

Publication number: CN105069149A
Application number: CN201510520609.6A
Authority: CN
Inventors: 段翰聪; 张建; 闵革勇; 柳陆; 王瑾; 曾祥楷; 陈超; 朱越
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2015-08-24
Filing date: 2015-08-24
Publication date: 2015-11-18
Anticipated expiration: 2035-08-24
Also published as: CN105069149B

Abstract

The invention discloses a structured line data-oriented distributed parallel data importing method, comprising: step 1: obtaining a data importing rule and generating an importing task according to the data importing rule; step 2: obtaining data scale, dividing an original data table into a plurality of sub-tables according to the data scale, and at the same time, dividing the importing task into a corresponding number of importing sub-tasks according to the number of sub-tables; step 3: reading sub-table data in parallel by the plurality of importing sub-tasks, packaging the data into a protocol message, and sending the protocol message to a data importing subsystem; and step 4: creating a data underlying index and a data distribution index by the data importing subsystem according to the original data sent in the step 3, and importing the data underlying index and the data distribution index into a distributed memory database engine, so that the technical effects of quickly importing structured data stored in a disk type database into the distributed memory database system and improving the data importing efficiency of a column-oriented database are achieved.

Description

A distributed parallel import method for structured columnar data

Technical field

The present invention relates to the field of computer software, and in particular to a distributed parallel import method for structured columnar data.

Background art

The commercial relational database systems in common use are designed primarily to guarantee the ACID properties of data access, and they provide powerful data management and access services for commercial and transactional applications. The real-time performance of their data services, however, is difficult to guarantee, for the following fundamental reason:

Traditional databases are disk-based: the primary copy of the data resides on hard disk, and when a user needs to access data the DBMS loads it into main memory, so data management is essentially disk-based caching. Disk is an extremely slow storage medium compared with main memory, and disk access speed also depends on the physical location of the data and the current position of the disk head. In addition, managing a cache or a buffer, whether at the operating system layer or at the DBMS layer, carries a considerable cost. Even if all on-disk data is cached in memory, the management overhead remains large and cannot meet the real-time requirements of most application scenarios.

In an in-memory database, the whole database, or the data accessed by active transactions, is kept in memory, so transactions no longer access the disk at all. Because the whole database resides in memory, it is no longer treated as a large set of storage files but as a large body of addressable data in memory. This differs from the cache or buffer approach of disk databases, breaks completely with the design principles of traditional disk database systems, and raises new design problems of its own. For example, the data organization, access methods, and query processing algorithms of traditional disk database systems are all designed to minimize the number of disk accesses and to use disk space efficiently, even sacrificing CPU time to reduce the number of I/O operations (for example, query processing that produces large amounts of intermediate data), whereas the design of an in-memory database is mainly concerned with using CPU time and memory space efficiently. Data organizations, access methods, and query processing algorithms that are quite effective for traditional disk database systems are not necessarily effective for in-memory database systems; conversely, some approaches regarded as useless for traditional disk database systems become feasible.

A distributed in-memory database is an advanced database solution that stores data across multiple independent data nodes and uses memory as the main storage medium, giving users high-performance, highly concurrent, highly scalable queries over massive data. In-memory databases have developed rapidly in recent years and are increasingly widely applied in practice.

Compared with a traditional disk-based database, a distributed in-memory database does not use hard disk as the primary storage medium; it therefore reduces disk I/O and exploits the speed of memory to achieve fast data access. Memory, however, is a precious resource, so in production a distributed in-memory database mainly serves as a computing platform rather than as the master database that stores the data, and the data is still kept mainly in traditional disk databases. How to import the massive structured data stored in a traditional disk database into memory quickly is therefore the first problem a distributed in-memory database has to solve.

Traditional disk-based databases mostly use row-oriented storage, which places each row in contiguous physical locations, much like a traditional record and file system. The database engine then extracts the required columns for each query. Row storage organizes data as rows of many columns, so that all columns of a row can be found in one operation. The drawback is that a full row must be processed every time, rather than only the columns that are actually needed. This storage mode suits OLTP (Online Transaction Processing, dominated by data reads and writes) scenarios and does not suit OLAP (Online Analytical Processing, dominated by data analysis) scenarios. OLAP workloads read much more than they write, each read covers a large amount of data, and the probability of reading a particular field is far greater than that of reading full rows, so row storage is a poor fit for them. Column-oriented storage organizes each column of the structured data as a separate column of data and stores it on the medium after establishing the relationships between the columns. Its advantage is that a simple query on the values of one column only needs to fetch that column, without reading the other columns of the same record; reads are fast and require relatively little memory. A search for a particular value in a column can go directly to the storage area of that column without scanning full rows. Data compression also becomes easier, because the values in a column usually share the same data type. The columnar storage model is therefore a better fit for in-memory databases.
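The difference between the two layouts can be illustrated with a minimal Python sketch. The table contents, the field names, and the helper `to_columns` are made up for illustration and do not come from the patent.

```python
# Illustrative only: the records and field names below are invented.
rows = [
    {"id": 1, "city": "Chengdu", "amount": 30},
    {"id": 2, "city": "Beijing", "amount": 10},
    {"id": 3, "city": "Chengdu", "amount": 20},
]


def to_columns(rows):
    """Convert a row-oriented table into a column-oriented one."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing one field still walks over every full record.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: the same query touches only the "amount" column.
total_from_columns = sum(columns["amount"])
```

Both totals are equal; the point of the sketch is that the columnar query never touches `id` or `city`, and that each column holds values of a single type, which is what makes it easy to compress.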

The drawback of column-based access is that loading is usually slow, because the source data in the external data source is represented as rows or records. Converting massive row-oriented data into columnar data is generally slow, so many in-memory databases are used for analysis and computation over updated data rather than over the full data set. For a distributed in-memory database whose application scenario is analysis over the full data set, a fast and efficient data import scheme is needed that converts the structured row-oriented data stored in a row-oriented database into compressed columnar data suitable for in-memory storage, improving the access efficiency and storage efficiency of the in-memory database.

In summary, in the course of implementing the technical solution of the embodiments of the present application, the inventors found that the above technology has at least the following technical problems:

1. When the data volume is large, serial import is clearly inefficient; multi-channel parallel import is needed to bring the massive data stored in the data persistence layer into the distributed in-memory database system.

2. In the prior art, because columnar and row-oriented databases differ in structure, converting row-stored data into columnar data requires a large amount of data structure conversion. When the data volume is very large, the conversion workload is correspondingly large, the import is slow, and import efficiency is low. Combined with problem 1, a fast and efficient parallel data import scheme is therefore needed that converts massive row-oriented data into columnar data and then imports it into the system in parallel.

Summary of the invention

The invention provides a distributed parallel import method for structured columnar data. It solves the technical problem that existing data import methods are slow and inefficient, which lowers the access efficiency and storage efficiency of the in-memory database. It achieves the technical effect of quickly importing structured data stored in a disk-based database into a distributed in-memory database system, supports customized import according to user requirements, provides an incremental data update function, and improves the data import efficiency of columnar databases.

To solve the above technical problem, the present application provides a distributed parallel import method for structured columnar data, which quickly imports structured data stored in a disk-based database into a distributed in-memory database system. The method imports data according to user requirements and provides an incremental data update function; to address the slow loading of columnar data, it adopts a multi-channel parallel fast import scheme that improves the data import efficiency of columnar databases.

The distributed parallel import method for structured columnar data of the present invention comprises the following steps:

Step 1: the database reading component generates an importing task according to the data importing rule;

Step 2: according to the scale of the data table to be read, the database reading task is divided into multiple sub-tasks, each of which reads part of the data table (a sub-table), and a sub-task queue is established to manage the sub-tasks;

Step 3: after reading its sub-table, each sub-task packages the raw data it has read and sends it to the data importing subsystem;

Step 4: the data importing subsystem creates a data underlying index and a data distribution index from the raw data, imports both indexes into the distributed in-memory database engine, and maintains the import state.

Preferably, step 1 obtains the importing rule, creates the data importing task according to the data importing rule, and maintains the state of the data importing task.

Preferably, in step 2, the database reading component reads the database metadata, obtains the scale of the data table to be read, calculates the sub-table scale that each sub-task should read, and generates the sub-task queue.

Specifically, in step 2, the database reading sub-tasks are created as follows: the database metadata is first read to obtain the data table scale; the sub-table scale sent to the data importing subsystem each time is determined from the resources of the physical machine node that currently hosts the database and the total scale of the data tables that currently need to be imported; the number of threads that read sub-tables in parallel is then determined, and the sub-task queue is formed.
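The planning in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `plan_subtasks` and its parameters are hypothetical, the sub-table size is taken as a given input rather than derived from node resources, and the last sub-table is allowed to be smaller than the others when the row count does not divide evenly.

```python
import math


def plan_subtasks(total_rows, rows_per_subtable, max_threads):
    """Cut a table of total_rows rows horizontally into row ranges.

    Returns (queue, thread_count). queue is a list of half-open
    (start, end) row ranges, one per sub-task; thread_count is the
    number of threads that would read sub-tables in parallel.
    """
    if total_rows < 0 or rows_per_subtable <= 0 or max_threads <= 0:
        raise ValueError("sizes must be positive (total_rows may be zero)")
    count = math.ceil(total_rows / rows_per_subtable)
    queue = [
        (i * rows_per_subtable, min((i + 1) * rows_per_subtable, total_rows))
        for i in range(count)
    ]
    # Never start more reader threads than there are sub-tasks.
    thread_count = min(max_threads, count)
    return queue, thread_count


queue, thread_count = plan_subtasks(total_rows=10, rows_per_subtable=4, max_threads=8)
```

With these inputs the queue holds three ranges covering rows 0 to 10, and three reader threads are planned. In a real deployment `rows_per_subtable` and `max_threads` would come from the metadata and the node's resource information, which the patent does not quantify.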

Preferably, in step 3, after the sub-table reading threads call back, the structured data that has been read is encapsulated in an internal data protocol format and sent to the data importing subsystem.
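The patent does not disclose its internal data protocol format. Purely as an assumed stand-in, the sketch below frames each message as a 4-byte big-endian length prefix followed by a JSON body; the field names `task`, `subtable`, and `rows` and the functions `pack_message` and `unpack_message` are hypothetical.

```python
import json
import struct


def pack_message(task_id, subtable_id, rows):
    """Encapsulate sub-table rows as: 4-byte length prefix + JSON body."""
    body = json.dumps(
        {"task": task_id, "subtable": subtable_id, "rows": rows}
    ).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def unpack_message(data):
    """Parse one framed message and return its payload as a dict."""
    (length,) = struct.unpack(">I", data[:4])
    return json.loads(data[4:4 + length].decode("utf-8"))


message = pack_message("import-1", 0, [[1, "Chengdu"], [2, "Beijing"]])
payload = unpack_message(message)
```

The length prefix lets the receiving side know how many bytes belong to one message when several are sent back to back over the same connection; any binary serialization could replace JSON here.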

Preferably, in step 4, the data importing subsystem receives the protocol message, parses it and extracts the raw data, creates the data underlying index and the data distribution index from the raw data, requests the addresses of distributed in-memory database engine nodes from the central node, imports the data underlying index into the corresponding nodes, maintains the import state, and reports import progress to the central node.

The distributed parallel import method for structured columnar data comprises steps 1 to 4 above and further comprises the following step:

Step 5: the database reading component monitors update information of the source database, periodically obtains incremental update data, and sends the incremental data to the data importing subsystem; the data importing subsystem builds a data underlying index for the incremental data and imports the corresponding index data into the in-memory database engine.

Step 5 may specifically be:

Step 501: the database reading component registers a monitoring event on the source database table to obtain incremental update information for the table, and reads the increment size at timed intervals;

Step 502: when the incremental update data of the database table reaches a certain scale, the database table is read, the incremental update data is obtained, packaged into a data message, and sent to the data importing subsystem;

Step 503: the data importing subsystem obtains the incremental data, builds the data underlying index, requests from the central node the storage location of the incremental data in the distributed in-memory database engine, imports the incremental data underlying index into the distributed in-memory database engine, and monitors the incremental update state.
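Steps 501 and 502 amount to accumulating changes and shipping them once a size threshold is reached. The sketch below illustrates that idea only; the class `IncrementMonitor`, its methods, and the `send` callback are hypothetical names, and a real deployment would rely on the source database's own change notification mechanism rather than an in-process list.

```python
class IncrementMonitor:
    """Collects row changes and ships them in batches of a minimum size."""

    def __init__(self, threshold, send):
        self.threshold = threshold  # assumed "certain scale" of step 502
        self.send = send            # callable that forwards one batch
        self.pending = []

    def on_change(self, row):
        """Callback registered on the source table (step 501)."""
        self.pending.append(row)

    def poll(self):
        """Timed check (step 502): send the batch once it is large enough."""
        if len(self.pending) < self.threshold:
            return False
        batch, self.pending = self.pending, []
        self.send(batch)
        return True


sent = []
monitor = IncrementMonitor(threshold=2, send=sent.append)
monitor.on_change({"id": 1})
first_poll = monitor.poll()   # below the threshold, nothing is sent
monitor.on_change({"id": 2})
second_poll = monitor.poll()  # threshold reached, the batch is sent
```

Batching by threshold trades freshness for fewer, larger index updates on the in-memory side; the patent leaves the threshold value open.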

The invention provides a distributed parallel import method for structured columnar data. The technique offers a solution for quickly importing structured data stored in a disk-based database into a distributed in-memory database system: it uses distributed multi-channel parallel import to bring massive structured data into the distributed in-memory database system quickly, and it provides an incremental data update function that quickly imports incremental updates of the structured data stored in the data persistence layer into the distributed in-memory database system.

Description of the drawings

Fig. 1 is a detailed flow diagram of the distributed parallel import method for structured columnar data in Embodiment 1 of the present application;

Fig. 2 is a schematic diagram of the composition of the system corresponding to the method in Embodiment 1 of the present application;

Fig. 3 is a schematic diagram of the basic framework and import flow of the system;

Fig. 4 is a flow diagram of the incremental update process of the system.

Detailed description

For a better understanding of the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments.

Referring to Figs. 1 and 2, Fig. 1 is a schematic diagram of one embodiment of the present invention, describing the process of importing the required data tables from an external database into the distributed in-memory database engine. The hardware implementing the invention comprises multiple networked servers, of which one server hosts the database, one server acts as the central node, and the remaining servers act as data importing compute nodes.

The method of the present invention comprises the following steps:

The database reading component is the interface between the external data source and the in-memory database, responsible for adapting to the external data source and reading its data. It encapsulates the development libraries of the databases commonly used in industry and, depending on configuration, calls the appropriate access interface to operate on the database.

After it starts, the database reading component listens for connections from external nodes. After the central node starts, it first connects to the database reading component according to its configuration and obtains the metadata of the source database, such as the tables in the database and their structure. The central node presents the metadata to the user as a reference for data import. Based on the metadata, the user selects the table names and field names to import and generates an importing rule, which is issued to the data importing subsystem; the data importing subsystem connects to the database reading component and requests the data to import. On receiving the importing rule, the database reading component generates the corresponding importing task from it.

Because the data volume of the external database tables is very large, the database reading component reads data according to a certain strategy in order to reduce the load on the physical node. The basic idea of the strategy is as follows:

The database reading component receives the importing rule from the data importing subsystem and generates a data importing task. It first reads the database metadata to determine the data table scale. Based on the table scale, the physical resources of the current node, and the total number of tasks, it cuts the data table horizontally by row count into several sub-tables of equal size; reading each sub-table is one sub-task. The length of the sub-task read queue, that is, the number of parallel reading threads, is determined from the number of sub-tasks and the physical resource information. Once the sub-task queue is determined, the sub-tables are read in parallel by multiple threads; after all worker threads in the sub-task queue have returned, the data read by each thread is merged into a data message, and the merged data message is sent to the relevant node of the data importing subsystem.
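The parallel read and merge described above can be sketched with Python's standard thread pool. An in-memory list stands in for the source table, and `read_subtable` and `parallel_read` are hypothetical names; in practice each sub-task would issue a range query against the external database.

```python
from concurrent.futures import ThreadPoolExecutor


def read_subtable(table, row_range):
    """Stand-in for a range query: return rows [start, end) of the table."""
    start, end = row_range
    return table[start:end]


def parallel_read(table, ranges, thread_count):
    """Read all sub-tables in parallel, then merge them in queue order."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # pool.map yields results in the order of `ranges`, so the merged
        # data keeps the original row order even if threads finish out of order.
        parts = list(pool.map(lambda r: read_subtable(table, r), ranges))
    merged = []
    for part in parts:
        merged.extend(part)
    return merged


table = [(i, "row-%d" % i) for i in range(10)]
merged = parallel_read(table, [(0, 4), (4, 8), (8, 10)], thread_count=3)
```

Leaving the `with` block waits for every worker to return, which corresponds to merging only after all threads in the sub-task queue have come back.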

Step 4: the data importing subsystem creates a data underlying index and a data distribution index from the raw data, imports both indexes into the distributed in-memory database engine, and maintains the import state;

What the database reading component passes to the data importing subsystem is structured data, namely the tables stored in the database. Before the data is imported into the distributed in-memory database engine it must be preprocessed, which includes building the data underlying index and the data distribution index. The preprocessing strategy is as follows. When the data importing subsystem receives the first message, it cuts the sub-table vertically and hands each column of data to a worker thread. The worker thread first sorts the column data and then builds the data underlying index and the data distribution index for that column. Once both indexes have been built, it requests data resources from the central node and, after obtaining the resource information, sends the data to the corresponding node of the distributed in-memory database engine. When the data importing subsystem receives the remaining sub-table data, it builds the data underlying index in the same way; after the index is built, it checks whether the number of data slices to be sent equals the current number of data slices and, if so, sends the message. When a distributed in-memory database engine node receives a subsequent data underlying index, it updates the existing data underlying index and returns the updated index to the sending node. The data importing subsystem node builds the subsequent data distribution index from the updated data underlying index and sends it to the node that stores the data distribution index. These operations are repeated until all data of the column has been sent. After the data of the whole table has been sent, the subsystem reports to the central node that the table was imported successfully.
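The patent does not specify the concrete layout of the two indexes. The sketch below is one possible reading, stated as an assumption: the data underlying index of a column is taken to be its values sorted together with their row numbers, and the data distribution index is taken to be a list of value ranges, one per engine node. All function names are hypothetical, and values within a column are assumed to be mutually comparable (no missing values).

```python
from concurrent.futures import ThreadPoolExecutor


def build_underlying_index(values, first_row_id=0):
    """Sort one column, keeping (value, row_id) so rows can be recovered."""
    return sorted((v, first_row_id + i) for i, v in enumerate(values))


def build_distribution_index(underlying, node_count):
    """Assign contiguous slices of the sorted column to nodes (assumed scheme)."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    size = -(-len(underlying) // node_count)  # ceiling division
    distribution = []
    for node in range(node_count):
        chunk = underlying[node * size:(node + 1) * size]
        if chunk:
            distribution.append(
                {"node": node, "min": chunk[0][0], "max": chunk[-1][0], "count": len(chunk)}
            )
    return distribution


def index_subtable(rows, fields, node_count):
    """Cut a sub-table vertically and index each column in a worker thread."""
    if not rows:
        return {field: ([], []) for field in fields}
    columns = list(zip(*rows))  # vertical cut: one tuple of values per column

    def work(values):
        underlying = build_underlying_index(values)
        return underlying, build_distribution_index(underlying, node_count)

    with ThreadPoolExecutor(max_workers=max(1, len(columns))) as pool:
        results = list(pool.map(work, columns))
    return dict(zip(fields, results))


rows = [(1, "Chengdu", 30), (2, "Beijing", 10), (3, "Chengdu", 20)]
indexes = index_subtable(rows, ["id", "city", "amount"], node_count=2)
amount_underlying, amount_distribution = indexes["amount"]
```

Sorting each column first is what makes a range-based distribution index possible: every node then holds a contiguous value range, so a lookup can be routed to one node by comparing against `min` and `max`.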

Step 5: the database reading component monitors update information of the source database, periodically obtains incremental update data, and sends the incremental data to the data importing subsystem; the data importing subsystem builds a data underlying index for the incremental data and imports the corresponding index data into the in-memory database engine.

Step 5 may specifically be:

Step 503: the data importing subsystem obtains the incremental data, builds the data underlying index, requests from the central node the storage location of the incremental data in the distributed in-memory database engine, imports the incremental data underlying index into the distributed in-memory database engine, and monitors the incremental update state;

A concrete example is given below to illustrate the technical solution of the present application:

Fig. 3 shows the basic framework and basic import flow of the system, in which the database reading component is deployed on the external data source. The basic steps are as follows: 1. the central node first sends a request to the database reading component to obtain the database metadata; 2. after obtaining the database metadata, it issues the data importing rule to the data importing subsystem; 3. the data importing subsystem requests data from the external data source through the database reading component; 4. the database reading component reads the data in parallel according to the data importing rule; 5. after the data is read, it is sent to the data importing subsystem cluster; 6. the data importing subsystem cluster converts the data into the data underlying index and the data distribution index; 7. the data importing subsystem requests the location information of the distributed in-memory database engine from the central node; 8. the data underlying index and the data distribution index are imported into the distributed in-memory database engine cluster; 9. the data importing subsystem monitors the import flow in real time and reports to the central node when the import finishes.

Fig. 4 is a flow diagram of the incremental update process of the system. The basic steps are as follows: 1. the database reading component registers a database monitoring mechanism to monitor updated data in the database; 2. the database reading component periodically obtains the incremental updates and reports them; 3. the central node issues an incremental importing task; 4. the data importing subsystem requests data from the database reading component; 5. the database reading component reads the data in parallel; 6. the database reading component sends the data to the data importing subsystem; 7. the data importing subsystem builds the data underlying index and the data distribution index; 8. the data importing subsystem requests the data storage location from the central node; 9. the data importing subsystem sends the data underlying index and the data distribution index to the distributed in-memory database engine cluster; 10. the data importing subsystem monitors the incremental import flow in real time and reports to the central node when the import finishes.

The foregoing describes preferred embodiments of the present invention. Unless they are obviously contradictory or one presupposes another, the preferred implementations in the preferred embodiments may be combined freely. The specific parameters in the embodiments serve only to state the inventors' verification process clearly and do not limit the scope of patent protection of the present invention, which remains defined by the claims; any equivalent structural change made using the description and drawings of the present invention likewise falls within the protection scope of the present invention.

Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims

1. A distributed parallel import method for structured columnar data, characterized in that it comprises the following steps:

Step 1: a data importing rule is generated according to user requirements; the database reading component responsible for reading database data obtains the data importing rule and generates an importing task according to it;

Step 2: the scale of the data to be imported is obtained, the original data table is divided into a plurality of sub-tables according to the data scale, and the importing task is divided into a corresponding number of importing sub-tasks according to the number of sub-tables;

Step 3: the plurality of importing sub-tasks read sub-table data in parallel; after the data is read, it is encapsulated in a protocol message, which the database reading component sends to the data importing subsystem responsible for converting the data;

Step 4: the data importing subsystem creates a data underlying index and a data distribution index from the raw data sent in step 3, the data underlying index being used to store the actual database data and the data distribution index being used to store data distribution positions, and imports the data underlying index and the data distribution index into the distributed in-memory database engine.

2. The distributed parallel import method for structured columnar data according to claim 1, characterized in that step 1 is specifically: obtaining the importing rule, creating the data importing task according to the data importing rule, and maintaining the state of the data importing task.

3. The distributed parallel import method for structured columnar data according to claim 1, characterized in that step 2 is specifically: reading the database metadata, obtaining the scale of the data table to be read, calculating the sub-table scale that each sub-task should read, and generating the sub-task queue.

4. The distributed parallel import method for structured columnar data according to claim 1, characterized in that step 3 is specifically: the plurality of sub-tasks read sub-table data in parallel using multiple threads; after the threads call back, the data is merged, and the merged data is encapsulated in a protocol message and sent to the data importing subsystem.

5. The distributed parallel import method for structured columnar data according to claim 1, characterized in that, in step 4, the data underlying index and the data distribution index are created in parallel by multiple threads from the raw data sent in step 3, node resources in the distributed in-memory database storage engine for storing the indexes are requested from the central node, and the created data underlying index and data distribution index are sent to the distributed in-memory database storage engine.

6. The distributed parallel import method for structured columnar data according to claim 1, characterized in that, in step 4, the data importing subsystem maintains the data import state and reports the current data import progress to the central node.

7. The distributed parallel import method for structured columnar data according to any one of claims 1 to 6, characterized in that the method further comprises the following step:

8. The distributed parallel import method for structured columnar data according to claim 7, characterized in that step 5 specifically comprises: