CN107784039A - Data loading method, apparatus and system - Google Patents


Publication number
CN107784039A
Authority
CN
China
Prior art keywords
data
loaded
source
loading
loading end
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610799125.4A
Other languages
Chinese (zh)
Inventor
程亦超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610799125.4A
Publication of CN107784039A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a data loading method, apparatus and system. The system includes at least one data source, a data loading apparatus and multiple destination loading ends. The data loading apparatus is configured to extract source data from the at least one data source and perform preset processing on the source data to obtain data to be loaded; to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends; and to control each data block to be loaded to be loaded to the destination loading end to which it belongs. The application thereby processes, classifies and separately stores the data of one or more data sources. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple destination loading ends.

Description

Data loading method, apparatus and system
Technical Field
This application relates to the technical field of data processing, and in particular to a data loading method, a data loading apparatus and a data loading system.
Background
ETL (Extraction-Transformation-Loading) is the process of extracting, transforming and loading data, and is an important part of building a data warehouse. The user extracts the required data from a data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
Traditional ETL tools and software solve the problem of importing data from one data source into one destination in a data warehouse, for example from a table in database A into another table in database B. In practice, however, some application scenarios need to import data into multiple destinations, so a multi-destination data loading mechanism within an ETL task is needed.
Summary of the Invention
The technical problem to be solved by this application is to provide a task management method and apparatus for a big data processing platform that partly or wholly solve the above technical problem.
To solve the above problem, this application discloses a data loading system, including at least one data source, a data loading apparatus and multiple destination loading ends;
The data loading apparatus is configured to extract source data from the at least one data source and perform preset processing on the source data to obtain data to be loaded; to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends; and to control each data block to be loaded to be loaded to the destination loading end to which it belongs.
This application also discloses a data loading method, including:
Extracting source data from at least one data source, and performing preset processing on the source data to obtain data to be loaded;
Dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends;
Controlling each data block to be loaded to be loaded to the destination loading end to which it belongs.
Preferably, extracting source data from at least one data source includes:
Extracting mixed source data from multiple data sources, the data sources including a business system, a system module or an application program.
Preferably, performing preset processing on the source data to obtain data to be loaded includes:
Dividing the mixed source data into multiple source data sets according to the data source identifiers it carries;
Looking up the processing strategy for the data source corresponding to each source data set;
Performing preset processing on each source data set according to the processing strategy found, the preset processing including data cleaning and data conversion.
Preferably, dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends includes:
Looking up, in a loading end routing table, the destination loading end corresponding to at least one data attribute of the data to be loaded;
Dividing the data to be loaded into multiple data blocks to be loaded that belong to different destination loading ends.
Preferably, before dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends, the method further includes:
Parsing designated content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attribute including time information, data source information or data service type.
Preferably, parsing the designated content of the data to be loaded to obtain at least one data attribute of the data to be loaded includes:
Performing rule matching on the designated content of the data to be loaded to extract at least one data attribute from that content.
Preferably, parsing the designated content of the data to be loaded to obtain at least one data attribute of the data to be loaded includes:
Splitting the designated content of the data to be loaded to obtain at least one data attribute of the data to be loaded.
Preferably, the method further includes:
Receiving designated content pre-selected by a user through a setting interface.
Preferably, before performing preset processing on the source data to obtain data to be loaded, the method further includes:
Generating a source data loading task corresponding to the extracted source data, and adding it to a first task queue;
Extracting a pending source data loading task from the first task queue according to a set first processing order rule.
Preferably, after dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends, the method further includes:
For each data block to be loaded, generating a data block loading task corresponding to that data block, and adding it to a second task queue preset for the destination loading end corresponding to that data block.
Preferably, controlling each data block to be loaded to be loaded to the destination loading end to which it belongs includes:
For each second task queue, extracting a data block loading task from the second task queue according to a set second processing order rule;
Executing the data block loading task to load the data block to be loaded to the destination loading end to which it belongs.
Preferably, controlling the data block to be loaded to be loaded to the destination loading end to which it belongs includes:
Calling at least one loading thread of the destination loading end to load the data to be loaded to the destination loading end.
Preferably, before controlling the data block to be loaded to be loaded to the destination loading end to which it belongs, the method further includes:
Determining that the destination loading end exists.
Preferably, the method further includes:
If the destination loading end does not exist, creating the destination loading end and creating a loading thread corresponding to the destination loading end.
This application also discloses a data loading apparatus, including:
A source data extraction module, configured to extract source data from at least one data source;
A source data processing module, configured to perform preset processing on the source data to obtain data to be loaded;
A data attribution module, configured to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends;
A loading module, configured to control each data block to be loaded to be loaded to the destination loading end to which it belongs.
Preferably, the source data extraction module is specifically configured to extract mixed source data from multiple data sources, the data sources including a business system, a system module or an application program.
Preferably, the source data processing module includes:
A source data division submodule, configured to divide the mixed source data into multiple source data sets according to the data source identifiers it carries;
A processing strategy lookup submodule, configured to look up the processing strategy for the data source corresponding to each source data set;
A preset processing submodule, configured to perform preset processing on each source data set according to the processing strategy found, the preset processing including data cleaning and data conversion.
Preferably, the data attribution module includes:
A loading end lookup submodule, configured to look up, in a loading end routing table, the destination loading end corresponding to at least one data attribute of the data to be loaded;
A data division submodule, configured to divide the data to be loaded into multiple data blocks to be loaded that belong to different destination loading ends.
Preferably, the apparatus further includes:
A data attribute parsing module, configured to parse, before the data to be loaded is divided according to at least one data attribute into multiple data blocks to be loaded that belong to different destination loading ends, designated content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attribute including time information, data source information or data service type.
Preferably, the data attribute parsing module is specifically configured to perform rule matching on the designated content of the data to be loaded to extract at least one data attribute from that content.
Preferably, the data attribute parsing module is specifically configured to split the designated content of the data to be loaded to obtain at least one data attribute of the data to be loaded.
Preferably, the apparatus further includes:
A designated content receiving module, configured to receive designated content pre-selected by a user through a setting interface.
Preferably, the apparatus further includes:
A source data loading task generation module, configured to generate, before preset processing is performed on the source data to obtain data to be loaded, a source data loading task corresponding to the extracted source data;
A first task queue adding module, configured to add the source data loading task to a first task queue;
A source data loading task extraction module, configured to extract a pending source data loading task from the first task queue according to a set first processing order rule.
Preferably, the apparatus further includes:
A data block loading task generation module, configured to generate, after the data to be loaded is divided according to at least one data attribute into multiple data blocks to be loaded that belong to different destination loading ends, a data block loading task for each data block to be loaded;
A second task queue adding module, configured to add the data block loading task to a second task queue preset for the destination loading end corresponding to the data block to be loaded.
Preferably, the loading module includes:
A data block loading task extraction module, configured to extract, for each second task queue, a data block loading task from the second task queue according to a set second processing order rule;
A data block loading task execution module, configured to execute the data block loading task to load the data block to be loaded to the destination loading end to which it belongs.
Preferably, the loading module is specifically configured to call at least one loading thread of the destination loading end to load the data to be loaded to the destination loading end.
Preferably, the apparatus further includes:
A destination loading end determining module, configured to determine, before the data block to be loaded is loaded to the destination loading end to which it belongs, that the destination loading end exists.
Preferably, the apparatus further includes:
A creation module, configured to create, if the destination loading end does not exist, the destination loading end and a loading thread corresponding to the destination loading end.
Compared with the prior art, this application has the following advantages:
According to the embodiments of this application, source data is extracted from at least one data source and preset processing yields data to be loaded. The data to be loaded is divided by data attribute into data blocks that belong to different destination loading ends, and each data block is then loaded to the destination loading end to which it belongs. The data of one or more data sources is thereby processed, classified and separately stored. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple destination loading ends.
Brief Description of the Drawings
Fig. 1 is an application schematic of a data loading system of this application;
Fig. 2 is a flowchart of embodiment 1 of a data loading method of this application;
Fig. 3 is a flowchart of embodiment 2 of a data loading method of this application;
Fig. 4 is an architecture diagram of a data loading system of this application;
Fig. 5 is a data processing flowchart of an implementation of an embodiment of this application based on Morphline;
Fig. 6 is a structural block diagram of an embodiment of a task management apparatus for a big data processing platform of this application.
Detailed Description
To make the above objects, features and advantages of this application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.
This application provides a data loading system including at least one data source, a data loading apparatus and multiple destination loading ends; Fig. 1 shows an application schematic of the system. The data loading apparatus extracts source data from at least one data source and performs preset processing on it to obtain data to be loaded; divides the data to be loaded, according to at least one of its data attributes, into multiple data blocks to be loaded that belong to different destination loading ends; and controls each data block to be loaded to be loaded to the destination loading end to which it belongs. Specifically, the following method steps may be used:
Referring to Fig. 2, a flowchart of embodiment 1 of a data loading method of this application is shown, which may specifically include the following steps:
Step 101: extract source data from at least one data source, and perform preset processing on the source data to obtain data to be loaded.
The embodiment extracts data from a data source; the way data is extracted can be set according to actual needs.
The data source may be a database, a business system, a system module (that is, a module of a business system) or an application program. Accordingly, data extracted from a database may be a data table or data within a data table; data extracted from a business system may be business data it generates, log data it records, and so on; data extracted from a system module may be the module's processing results, log data recorded for the module, and so on; data extracted from an application program may be its processing results, data it has crawled, its running log, and so on. In practice, data may also be extracted from any other workable data source; the above are only examples, and this application does not limit the specific type of data source.
After source data is extracted from the data source, preset processing is performed to obtain the data to be loaded to the loading ends. In an ETL flow, the preset processing may be cleaning and converting the data. Data cleaning filters out data that does not meet requirements, which falls into three main categories: incomplete data, erroneous data and duplicate data. Data conversion mainly covers converting inconsistent data (integrating data of the same type from different sources), converting data granularity (aggregating data from different sources to the granularity of the data warehouse), and computing on the data according to preset rules. Which processing is used can be set according to actual needs, for example data cleaning operations on the source data such as adding or deleting columns; this application does not limit this.
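The three categories of data removed by cleaning can be illustrated with a minimal Python sketch. The function name `clean`, the record layout and the validity rule (a non-negative `amount`) are assumptions made for this illustration and do not come from the application.

```python
def clean(records, required_fields, is_valid):
    """Drop incomplete, erroneous and duplicate records, keeping first occurrences."""
    seen = set()
    kept = []
    for record in records:
        # Incomplete: a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Erroneous: the caller-supplied validity check fails.
        if not is_valid(record):
            continue
        # Duplicate: an identical record has already been kept.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},  # incomplete
    {"id": 3, "amount": -5},    # erroneous under the assumed rule below
    {"id": 1, "amount": 10},    # duplicate of the first row
]
cleaned = clean(rows, ["id", "amount"], lambda r: r["amount"] >= 0)
```

Only the first row survives in `cleaned`; a real cleaning step would apply whatever completeness and validity rules the data warehouse model requires.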
Extracting data from multiple data sources yields mixed source data; data may be extracted from multiple data sources of any kind according to actual needs. When preset processing is applied to mixed source data, a single processing strategy may be reused or different strategies may be applied. Specifically, a processing strategy can be set in advance for each kind of data source; the mixed source data is divided into multiple source data sets according to the data source identifiers it carries, the strategy configured for the data source to which each source data set belongs is looked up, and preset processing is applied to that source data according to the strategy.
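A minimal Python sketch of this per-source handling follows. The `source_id` field, the two strategy functions and the `STRATEGIES` table are hypothetical names chosen for the illustration; the application does not prescribe a record format or a strategy interface.

```python
from collections import defaultdict


# Hypothetical per-source processing strategies (cleaning plus conversion).
def process_orders(record):
    # Drop incomplete records; convert the amount to a float.
    if record.get("amount") is None:
        return None
    return {**record, "amount": float(record["amount"])}


def process_logs(record):
    # Trim surrounding whitespace from the message field.
    return {**record, "message": record.get("message", "").strip()}


STRATEGIES = {"orders": process_orders, "logs": process_logs}


def preprocess(mixed_source_data):
    """Split mixed source data by source identifier, then apply each source's strategy."""
    by_source = defaultdict(list)
    for record in mixed_source_data:
        by_source[record["source_id"]].append(record)

    to_load = []
    for source_id, records in by_source.items():
        strategy = STRATEGIES[source_id]  # look up the strategy for this source
        for record in records:
            processed = strategy(record)
            if processed is not None:
                to_load.append(processed)
    return to_load


mixed = [
    {"source_id": "orders", "amount": "12.5"},
    {"source_id": "logs", "message": "  started  "},
    {"source_id": "orders", "amount": None},
]
result = preprocess(mixed)
```

Reusing one strategy for every source, as the text also allows, amounts to mapping every identifier in the table to the same function.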
Step 102: according to at least one data attribute of the data to be loaded, divide the data to be loaded into multiple data blocks to be loaded that belong to different destination loading ends.
The embodiment divides the data to be loaded obtained from one or more data sources. The basis for division is the destination loading end to which the data to be loaded belongs; in other words, the data is divided by destination loading end, and the resulting data blocks correspond to different destination loading ends.
The destination loading end is determined from the data attributes of the data to be loaded. A data attribute may include one or more of time information, data source information identifying the data source to which the data belongs, and data service type. It may be extracted directly from designated content of the data to be loaded, obtained by statistical analysis of that content, or computed with a preset algorithm; alternatively, a feature of the data to be loaded, such as its data format, may be used as the data attribute. The preset algorithm can be chosen according to actual needs, for example a hash algorithm or a message digest algorithm; this application does not limit this.
In ETL processing, an ETL task transfers the data to be loaded into a data warehouse. The destination loading ends to which the data belongs within the data warehouse can be determined from its data attributes. A loading end of the data warehouse may be a storage unit of the data warehouse, for example a file or a data table in it.
The correspondence between data attributes and loading ends may be determined in any suitable way. For example, mappings between data attributes and loading ends can be preset, and the destination loading end indicated in the mapping is looked up by the data attribute of the data to be loaded. Alternatively, the destination loading end is obtained by computing on the attribute data; for example, hashing the attribute data yields the number of a destination loading end, and the loading end with that number is taken as the destination corresponding to the attribute data.
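Both ways of resolving a destination can be sketched in a few lines of Python. The table contents, the destination names and the choice of MD5 are assumptions for illustration only; the application names no particular hash function. A digest from `hashlib` is used rather than Python's built-in `hash`, because the latter is randomized per process for strings and would not give a stable routing.

```python
import hashlib

# Hypothetical preset mapping from a data attribute (here a log date) to a loading end.
ROUTING_TABLE = {
    "2016-08-30": "table_20160830",
    "2016-08-31": "table_20160831",
}

# Hypothetical numbered loading ends for the hash-based alternative.
DESTINATIONS = ["dest_0", "dest_1", "dest_2"]


def route_by_table(attribute):
    """Return the loading end mapped to the attribute, or None if there is no entry."""
    return ROUTING_TABLE.get(attribute)


def route_by_hash(attribute):
    """Derive a loading end number from a stable digest of the attribute."""
    digest = hashlib.md5(attribute.encode("utf-8")).hexdigest()
    return DESTINATIONS[int(digest, 16) % len(DESTINATIONS)]
```

The table gives explicit control over where each attribute value goes, while the hash spreads arbitrary attribute values over a fixed set of loading ends without any configuration.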
Specifically, before the destination loading end is identified, the data to be loaded may be decomposed, according to whether it carries a data attribute, into multiple data units of equal or different size. The destination loading end is determined for each data unit from the data attribute it carries, and a data block obtained by the division may be a set of one or more data units.
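Grouping data units into blocks by destination might look like the following Python sketch. The function `partition`, the `date` field and the `log_` naming scheme are hypothetical and only illustrate the idea that a block is the set of units sharing a destination.

```python
from collections import defaultdict


def partition(units, attribute_of, destination_of):
    """Group data units into one block per destination loading end."""
    blocks = defaultdict(list)
    for unit in units:
        destination = destination_of(attribute_of(unit))
        blocks[destination].append(unit)
    return dict(blocks)


units = [
    {"date": "2016-08-30", "value": 1},
    {"date": "2016-08-31", "value": 2},
    {"date": "2016-08-30", "value": 3},
]
blocks = partition(
    units,
    attribute_of=lambda unit: unit["date"],
    destination_of=lambda date: "log_" + date.replace("-", ""),
)
```

Here the two units dated 2016-08-30 end up in the same block, and the unit dated 2016-08-31 forms a block of its own.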
Step 103: control each data block to be loaded to be loaded to the destination loading end to which it belongs.
After the destination loading ends are determined and the data to be loaded is divided into multiple data blocks accordingly, each data block is loaded to the destination loading end to which it belongs. One or more loading threads may be used to load the data blocks; when one thread needs to load multiple data blocks, it may load them in a set priority order.
Before loading, the data blocks may be saved to separate files according to their destination loading ends and the data blocks in each file then loaded; this approach consumes twice the disk space. Alternatively, the data blocks may be added to preset queues according to their destination loading ends and loading tasks scheduled from the queues, which reduces the storage space occupied compared with the former approach.
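The queue-based variant can be sketched in Python as one in-memory FIFO queue per destination. The names `queues`, `loaded`, `enqueue_block` and `drain` are assumptions, and the `loaded` dictionary merely stands in for real destinations such as files or tables.

```python
from collections import deque

# Hypothetical in-memory structures: one FIFO queue of data blocks per destination.
queues = {}
loaded = {}  # stand-in for the real destinations (files, tables, ...)


def enqueue_block(destination, block):
    """Add a data block to the queue preset for its destination."""
    queues.setdefault(destination, deque()).append(block)


def drain(destination):
    """Load every queued block for one destination, oldest first."""
    pending = queues.get(destination, deque())
    while pending:
        block = pending.popleft()
        loaded.setdefault(destination, []).extend(block)


enqueue_block("table_a", [1, 2])
enqueue_block("table_b", [3])
enqueue_block("table_a", [4])
for name in list(queues):
    drain(name)
```

Because blocks go straight from the queue to the destination, no intermediate per-destination file is written, which is the storage saving the text describes.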
According to the embodiments of this application, source data is extracted from at least one data source and preset processing yields data to be loaded. The data to be loaded is divided by data attribute into data blocks that belong to different destination loading ends, and each data block is then loaded to the destination loading end to which it belongs. The data of one or more data sources is thereby processed, classified and separately stored. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple destination loading ends.
The embodiments can be applied to a variety of application scenarios; several specific examples are given below:
Application scenario 1:
Source data is extracted from a single data source, taking a transaction system as an example. A transaction log covering a period of time is extracted from the transaction system; because the log records trading activity at multiple different times, different data units in the log have different log dates.
After the transaction log is cleaned and converted, the log dates it contains are extracted as data attributes and the log is divided by log date into data blocks for different log dates. Data blocks for different log dates correspond to different destination loading ends, and each divided log data block is loaded to the destination loading end to which it belongs, so that the transaction log is processed, classified by date and stored by date.
Application scenario 2:
Source data is extracted from multiple data sources, taking multiple application programs as an example. Processing results are extracted from the application programs. Because each program's results are small in volume, configuring a separate ETL flow for each is time-consuming and laborious and wastes processing and storage resources. Applied to this scenario, the application processes the data of multiple application programs and stores it by class.
The processing results extracted from the application programs are first cleaned and converted; the program identifiers carried in the results are then extracted as data attributes, and the results are divided by program identifier into data blocks for different program identifiers. Data blocks for different program identifiers correspond to different destination loading ends, and each data block is loaded to the destination loading end to which it belongs, so that data from multiple data sources is processed, classified by application program and stored.
Cleaning and converting the results of multiple application programs may reuse one processing flow, or a separate flow may be preset for each, so that data from different sources is processed as required.
The above applications are only examples; in a specific implementation, the kinds of data source and data attribute can be set according to actual needs.
In the embodiments of this application, the step of dividing data blocks by data attribute may be implemented within the transform step of ETL processing. Adding a dispatch stage to ETL processing identifies destination loading ends and distributes the data to be loaded, so that ETL data processing involves four methods in total: extract, transform, dispatch and load.
The embodiments may be implemented as an ETL programming and development tool. Because the tool supports distributing data to multiple destinations of a data warehouse, it can meet the need to distribute data to multiple destinations in specific application scenarios. The tool may also provide a programming interface through which programmers extend and maintain it, while details such as performance and fault tolerance are managed by the underlying framework.
In the embodiments of this application, preferably, the mapping between data attributes and loading ends may be set and stored in a preset loading end routing table; accordingly, when the destination loading ends to which the data to be loaded belongs are identified, the destination loading end corresponding to the data attribute of the data to be loaded is looked up in the routing table.
In the embodiments of this application, preferably, data attributes may be obtained from designated content of the data to be loaded. Before the destination loading ends are identified according to at least one data attribute of the data to be loaded in an ETL task, at least one data attribute may be obtained by parsing the designated content of the data to be loaded. The designated content may be data at a specified position in the data to be loaded that identifies the data type, for example a particular data column.
The designated content of the data to be loaded may be set by the programmer. In the embodiments of this application, preferably, designated content selected by the user through a setting interface of the dispatch method may also be received. Specifically, multiple candidate items may be shown at a relevant position in the setting interface; for example, multiple data columns and their related information are shown in the ETL programming and development tool, and the user selects, through the setting interface, the data column to be parsed to identify the destination loading end.
In a preferred example, when the designated content of the data to be loaded is parsed to obtain at least one data attribute, rule matching is applied to the designated content (for example, matching with a regular expression), at least one data attribute is extracted from it, and the matched result is used as the data attribute of the data to be loaded.
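As an illustration of regular-expression matching, the Python sketch below extracts a date from a log line and treats it as the data attribute. The pattern, the log line format and the function name `extract_attribute` are assumptions for this example.

```python
import re

# Assumed rule: the attribute is the first ISO-style date (YYYY-MM-DD) in the content.
DATE_RULE = re.compile(r"(\d{4}-\d{2}-\d{2})")


def extract_attribute(content, rule=DATE_RULE):
    """Return the first match of the rule in the content, or None if nothing matches."""
    match = rule.search(content)
    return match.group(1) if match else None


attribute = extract_attribute("2016-08-30 12:01:07 order=42 status=paid")
```

Content that does not match the rule yields `None`, which a caller could route to a default destination or reject.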
In another preferred example, when the designated content of the data to be loaded is parsed to obtain at least one data attribute, the attribute may be obtained by splitting the designated content. For example, the data column characterizing the data type (type) is split to obtain the data size (size) and data length (length) of the data to be loaded, and the size and length are used as the data attributes of each item of data to be loaded.
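The application does not specify how the type column is encoded, so the Python sketch below assumes a simple `key=value` layout separated by semicolons purely for illustration; the function name `split_attributes` is likewise hypothetical.

```python
def split_attributes(type_column, pair_sep=";", kv_sep="="):
    """Split an assumed 'key=value;key=value' column into a dict of attributes."""
    attributes = {}
    for part in type_column.split(pair_sep):
        if kv_sep in part:
            key, value = part.split(kv_sep, 1)
            attributes[key.strip()] = value.strip()
    return attributes


attrs = split_attributes("size=1024; length=16")
```

The resulting `size` and `length` values can then serve as the data attributes used to pick a destination loading end.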
Referring to Fig. 3, a flowchart of embodiment 2 of a data loading method in an ETL task of this application is shown, which may specifically include the following steps:
Step 201: extract source data from at least one data source.
Step 202: generate a source data loading task corresponding to the extracted source data, and add it to a first task queue.
Each operation of fetching source data corresponds to one source data loading task. After source data is extracted, the corresponding source data loading task can be generated and added to the first task queue, which holds source data loading tasks.
Step 203: according to a set first processing order rule, extract a pending source data loading task from the first task queue.
Source data loading tasks in the first task queue can be processed in order according to the set processing order rule, for example in order of task generation time, or in order of the priority indicated by the priority identifier each task carries. Which processing order rule is used can be set according to actual needs.
Step 204, default processing is carried out to the source data and obtains data to be loaded.
In the specific implementation, the source data loading tasks of first task queue can carry out default processing using multithreading, So as to improve the treatment effeciency of source data.
Step 205, according at least one data attribute of data to be loaded, the data to be loaded are divided into ownership not With multiple data blocks to be loaded of purpose loading end.
Step 206, the existing purpose loading end is determined.
Before data block to be loaded is added into purpose loading end, it is also necessary to determine whether purpose loading be present End, if existing, the step of can further performing loading data block, if being not present, need to create corresponding to purpose add Carry end.
Step 207, if the purpose loading end is not present, create the purpose loading end and create the corresponding purpose The loading thread of loading end.
When creating new loading end, a new loading end can be registered by setting entrance, new loading end can be by Name order name according to other loading ends is set by the user title or carried out according to the naming logistics of user's setting Name, such as:If the value of certain record size this row is XL, tablename/size=' XL ' partition tables are created.
Step 208: for each data block to be loaded, generate a data block loading task corresponding to that data block, and add it to the second task queue preset for the purpose loading end to which the data block belongs.
After the data to be loaded has been divided into data blocks according to the different purpose loading ends, a corresponding data block loading task can be generated for each data block and added to the second task queue set up for the respective purpose loading end.
A second task queue is provided for each purpose loading end and holds the data block loading tasks obtained by splitting each source data loading task. Specifically, the corresponding second task queue can be created for a loading end when that loading end is created.
Step 209: for each second task queue, according to a set second processing order rule, take a data block loading task from that second task queue.
The data block loading tasks in the second task queue can be processed in sequence according to the set processing order rule. For example, they can be processed in order of their generation time, or in order of the priority indicated by a priority identifier carried by each task. Which processing order rule is used can be set according to actual needs.
Step 210: execute the data block loading task to load the data block to be loaded to the purpose loading end to which it belongs.
In a specific implementation, the data block loading tasks in the second task queue can be executed by multiple threads, which improves the loading efficiency of the data blocks.
Specifically, one or more loading threads can be created for a loading end when that loading end is created.
According to the embodiments of the present application, source data is extracted from at least one data source and preset processing yields the data to be loaded. The data to be loaded is divided, according to its data attributes, into data blocks belonging to different purpose loading ends, and each data block is then loaded to the purpose loading end to which it belongs. In this way the data of one or more data sources is processed, classified, and stored in a distributed manner. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple purpose loading ends.
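The overall flow summarized above can be sketched in a few lines, leaving out the task queues and threads. Everything here is an illustrative assumption: the in-memory lists standing in for data sources and loading ends, the function names, and the trivial cleaning step. Only the tablename/size='XL' naming pattern comes from the description.

```python
from collections import defaultdict


def extract(sources):
    """Extract: yield every record from every data source."""
    for source in sources:
        yield from source


def preprocess(record):
    """Preset processing: a trivial cleaning and conversion step."""
    return {"name": record["name"].strip(), "size": record["size"].upper()}


def split_into_blocks(records, attribute):
    """Group the data to be loaded into one block per purpose loading end."""
    blocks = defaultdict(list)
    for record in records:
        end_name = f"tablename/{attribute}='{record[attribute]}'"
        blocks[end_name].append(record)
    return blocks


def run_etl(sources, attribute, loading_ends):
    """Run one ETL task that loads into several purpose loading ends."""
    to_load = [preprocess(r) for r in extract(sources)]
    for end_name, block in split_into_blocks(to_load, attribute).items():
        # A missing loading end is created on first use.
        loading_ends.setdefault(end_name, []).extend(block)
    return loading_ends


sources = [
    [{"name": " shirt ", "size": "xl"}, {"name": "hat", "size": "s"}],
    [{"name": "coat", "size": "XL"}],
]
loaded = run_etl(sources, "size", {})
for end_name, rows in loaded.items():
    print(end_name, [r["name"] for r in rows])
```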
To help those skilled in the art better understand the scheme of the embodiments, the above embodiments are illustrated below using an ETL processing flow as an example. Figure 4 is a schematic architecture diagram of a data loading system of the present application, which can be divided into three main parts: an extractor, a purpose distributor, and loading thread pools. The role of each part is as follows:
1. The extractor is responsible for extracting source data from the data sources and generating the corresponding source data loading tasks. Each extracted source data loading task is buffered in the first task queue.
2. The purpose distributor consumes source data loading tasks from the first task queue. It first cleans and converts the source data to obtain the data to be loaded, then determines, according to the data attributes of the data to be loaded, the purpose loading end to which the data belongs, and divides the data to be loaded into multiple data blocks according to purpose loading end. For each data block to be loaded it generates a corresponding data block loading task and adds the task to the second task queue configured for the corresponding purpose loading end.
3. The loading thread pools are responsible for taking data block loading tasks from the second task queues and loading the data blocks to the corresponding purpose loading ends. If a purpose loading end does not exist yet, it is created by the purpose distributor. Figure 4 shows two second task queues, corresponding to purpose loading end A and purpose loading end B respectively. Each queue has its own loading thread pool, and each loading thread pool contains two loading threads.
Each task in the first task queue and the second task queues can carry a task identifier, such as name1:q1; name2:q2, where name represents the identifier and q represents the record content.
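A task that carries an identifier, together with a queue whose processing order rule is configurable, might look like the sketch below. The `Task` fields, the convention that a smaller priority value is more urgent, and the `TaskQueue` class are assumptions for illustration; the source only names generation time and a priority identifier as possible ordering criteria.

```python
import heapq
import itertools
from dataclasses import dataclass, field

_sequence = itertools.count()  # stands in for the task generation time


@dataclass
class Task:
    name: str          # task identifier, e.g. "name1"
    record: str        # record content, e.g. "q1"
    priority: int = 0  # assumption: a smaller value is more urgent
    seq: int = field(default_factory=lambda: next(_sequence))


class TaskQueue:
    """Task queue whose processing order rule is a pluggable sort key."""

    def __init__(self, order_key):
        self._order_key = order_key
        self._heap = []

    def put(self, task):
        # seq breaks ties, so Task objects themselves are never compared.
        heapq.heappush(self._heap, (self._order_key(task), task.seq, task))

    def get(self):
        return heapq.heappop(self._heap)[-1]

    def __len__(self):
        return len(self._heap)


def by_generation_time(task):
    return task.seq


def by_priority(task):
    return task.priority


fifo, urgent_first = TaskQueue(by_generation_time), TaskQueue(by_priority)
for q in (fifo, urgent_first):
    q.put(Task("name1", "q1", priority=5))
    q.put(Task("name2", "q2", priority=1))

fifo_order = [fifo.get().name for _ in range(len(fifo))]
priority_order = [urgent_first.get().name for _ in range(len(urgent_first))]
print(fifo_order, priority_order)
```

The same class can serve as the first task queue or as a second task queue, since only the sort key differs between ordering rules.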
The process by which the purpose distributor creates a purpose loading end can include:
1. Registering the new purpose loading end using the registration logic that the user has customized through the settings entry.
2. Allocating a new task queue for the purpose loading end. All records distributed to that purpose loading end are buffered in this task queue.
3. Creating loading threads and adding them to the loading thread pool. Once created, the loading threads consume records from the corresponding queue and load them. Each queue can have multiple loading threads carrying out the loading work.
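The three creation steps above can be sketched with the standard `queue` and `threading` modules. The class `LoadingEndRegistry`, its method names, and the use of plain lists as loading targets are hypothetical; the two threads per loading end mirror the arrangement described for Figure 4.

```python
import queue
import threading


class LoadingEndRegistry:
    """Creates purpose loading ends on demand, each with a queue and threads."""

    def __init__(self, register_logic, threads_per_end=2):
        self._register_logic = register_logic  # user-defined registration logic
        self._threads_per_end = threads_per_end
        self._queues = {}
        self._lock = threading.Lock()

    def queue_for(self, end_name):
        """Return the task queue of a loading end, creating the end if absent."""
        with self._lock:
            if end_name not in self._queues:
                sink = self._register_logic(end_name)   # 1. register the loading end
                tasks = queue.Queue()                    # 2. allocate its task queue
                self._queues[end_name] = tasks
                for _ in range(self._threads_per_end):   # 3. create loading threads
                    threading.Thread(
                        target=self._load, args=(tasks, sink), daemon=True
                    ).start()
            return self._queues[end_name]

    @staticmethod
    def _load(tasks, sink):
        while True:
            block = tasks.get()
            try:
                sink.extend(block)  # stand-in for writing to the real target
            finally:
                tasks.task_done()

    def wait_until_loaded(self):
        """Block until every queued data block has been loaded."""
        for tasks in list(self._queues.values()):
            tasks.join()


tables = {}
registry = LoadingEndRegistry(lambda name: tables.setdefault(name, []))
registry.queue_for("A").put([1, 2])
registry.queue_for("B").put([3])
registry.queue_for("A").put([4])
registry.wait_until_loaded()
print(sorted(tables["A"]), tables["B"])
```

Because two threads drain queue A concurrently, the order in which blocks arrive at a loading end is not guaranteed, which is why the example sorts before printing.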
When implementing the embodiments of the present application, a Pipeline (pipeline model) can be used to build an ETL data processing pipeline. The Pipeline model constructs a workflow by composing modules: each function is instantiated as an action, a group of actions is placed in an array or list, and the data is then passed to this action list. The data passes through the functions in order, as on an assembly line, until it is finally loaded. This makes it possible to achieve the design goals of high cohesion and low coupling.
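A minimal sketch of such an action list follows. The three actions and their names are invented for illustration, and a plain list stands in for the loading target.

```python
def run_pipeline(actions, data):
    """Pass the data through each action in order, like an assembly line."""
    for action in actions:
        data = action(data)
    return data


def read_lines(text):
    """Action 1: split raw text into non-empty lines."""
    return [line for line in text.splitlines() if line.strip()]


def structure(lines):
    """Action 2: turn each "name,size" line into a record."""
    return [dict(zip(("name", "size"), line.split(","))) for line in lines]


def load(records, target):
    """Action 3: append the records to the target and return it."""
    target.extend(records)
    return target


warehouse = []
actions = [read_lines, structure, lambda records: load(records, warehouse)]
run_pipeline(actions, "shirt,XL\n\nhat,S\n")
print(warehouse)
```

Each action only knows its own input and output, so actions can be reordered or replaced without touching the others.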
The embodiments of the present application preferably construct the Pipeline using the Java Builder pattern, which effectively improves the readability of the code.
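The source names the Java Builder pattern but gives no code. As a rough analogue only, the same idea is sketched here in Python; the class `PipelineBuilder` and its `add` and `build` methods are hypothetical names, not part of any library mentioned in the text.

```python
class PipelineBuilder:
    """Collects actions through chained calls, then builds a callable pipeline."""

    def __init__(self):
        self._actions = []

    def add(self, action):
        self._actions.append(action)
        return self  # returning self is what allows the chained style

    def build(self):
        actions = tuple(self._actions)  # freeze the steps at build time

        def pipeline(data):
            for action in actions:
                data = action(data)
            return data

        return pipeline


pipeline = (
    PipelineBuilder()
    .add(str.strip)
    .add(str.upper)
    .add(lambda s: s.split(","))
    .build()
)
result = pipeline("  xl,72 ")
print(result)
```

The chained calls read top to bottom in the order the data flows, which is the readability benefit the Builder pattern is chosen for.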
The embodiments of the present application can also implement the ETL processing based on Morphline. Morphline is a Java library that can be regarded as a container holding various commands and can be embedded in any Java program. Commands are loaded into Morphline as plug-ins to perform tasks.
Figure 5 shows a data processing flowchart for implementing the embodiments based on Morphline. Events obtained by the Flume log collection system (for example the system log, syslog) pass through the Morphline Sink for data extraction, which yields multiple records. The records are then processed by a series of commands (Cmd), for example reading lines (readLine), structuring the data (grok), and loading the data (loadSolr), which sends the generated documents to Solr. As this flow shows, data entering a Morphline pipeline can only be sent to one destination.
The embodiments of the present application improve the basic framework of Morphline. Processing logic that obtains the data attributes of the data to be loaded is added in the data conversion stage, a distributor is added, and the single loading end is extended to multiple loading ends. In this way the whole chain of data extraction, conversion, distribution, and loading to multiple loading ends is defined through Morphline. Because every processing stage of Morphline uses the lightweight mechanism of function calls, the storage space that would otherwise be occupied by keeping a large number of functions is avoided, and the data to be loaded can be cached in memory, which improves ETL processing efficiency.
The data attributes used to identify the purpose loading end can be defined in the Morphline configuration file. The data attributes are then identified by parsing the configuration file, and the purpose loading end is determined from them.
For brevity, the method embodiments are described as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Referring to Figure 6, a structural block diagram of an embodiment of a data loading device in an ETL task of the present application is shown. The device may specifically include the following modules:
Source data extraction module 301, configured to extract source data from at least one data source;
Source data processing module 302, configured to perform preset processing on the source data to obtain data to be loaded;
To-be-loaded data attribution module 303, configured to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
Loading module 304, configured to control each data block to be loaded to be loaded to the purpose loading end to which it belongs.
In the embodiments of the present application, preferably, the source data extraction module is specifically configured to extract mixed source data from multiple data sources, the data sources including business systems, system modules, or application programs.
In the embodiments of the present application, preferably, the source data processing module includes:
A source data division submodule, configured to divide the mixed source data into multiple pieces of source data according to the data source identifiers they carry;
A processing strategy lookup submodule, configured to look up the processing strategy for the data source corresponding to each piece of source data;
A preset processing submodule, configured to perform preset processing on each piece of source data according to the processing strategy found, the preset processing including data cleaning and data conversion.
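The three submodules above can be sketched as follows. The source identifiers `order_system` and `app_log`, the record fields, and the two cleaning functions are hypothetical stand-ins for whatever business systems and strategies an actual deployment would use.

```python
from collections import defaultdict


def clean_orders(record):
    """Hypothetical strategy for an order system: convert the amount to a number."""
    return {"source": record["source"], "amount": float(record["amount"])}


def clean_logs(record):
    """Hypothetical strategy for an application log: normalize the message text."""
    return {"source": record["source"], "message": record["message"].strip().lower()}


# Processing strategy for each data source, keyed by data source identifier.
STRATEGIES = {"order_system": clean_orders, "app_log": clean_logs}


def divide_by_source(mixed):
    """Divide mixed source data by the data source identifier each record carries."""
    groups = defaultdict(list)
    for record in mixed:
        groups[record["source"]].append(record)
    return groups


def process_mixed(mixed):
    """Look up the strategy per source and apply it to that source's records."""
    to_load = []
    for source_id, records in divide_by_source(mixed).items():
        strategy = STRATEGIES[source_id]
        to_load.extend(strategy(r) for r in records)
    return to_load


mixed = [
    {"source": "order_system", "amount": "12.50"},
    {"source": "app_log", "message": "  Login OK "},
    {"source": "order_system", "amount": "3"},
]
processed = process_mixed(mixed)
print(processed)
```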
In the embodiments of the present application, preferably, the to-be-loaded data attribution module includes:
A loading end lookup submodule, configured to look up, in a loading end routing table, the purpose loading end corresponding to at least one data attribute of the data to be loaded;
A data division submodule, configured to divide the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends.
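A loading end routing table can be as simple as a mapping from attribute values to loading end names, as in the sketch below. The table contents, the field names `business_type` and `time`, and the `unrouted` fallback are assumptions for the example; the source describes creating a missing purpose loading end rather than falling back to a default.

```python
from collections import defaultdict

# Hypothetical routing table: (business type, month) -> purpose loading end.
ROUTING_TABLE = {
    ("order", "2016-08"): "orders_2016_08",
    ("order", "2016-09"): "orders_2016_09",
    ("log", "2016-08"): "logs_2016_08",
}
DEFAULT_END = "unrouted"  # assumption: fallback for attributes not in the table


def lookup_end(record):
    """Look up the purpose loading end for a record's data attributes."""
    key = (record["business_type"], record["time"][:7])
    return ROUTING_TABLE.get(key, DEFAULT_END)


def divide(records):
    """Divide the data to be loaded into blocks, one per purpose loading end."""
    blocks = defaultdict(list)
    for record in records:
        blocks[lookup_end(record)].append(record)
    return dict(blocks)


blocks = divide([
    {"business_type": "order", "time": "2016-08-31", "id": 1},
    {"business_type": "log", "time": "2016-08-30", "id": 2},
    {"business_type": "order", "time": "2016-09-01", "id": 3},
    {"business_type": "refund", "time": "2016-09-01", "id": 4},
])
print({end: [r["id"] for r in rows] for end, rows in blocks.items()})
```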
In the embodiments of the present application, preferably, the device further includes:
A data attribute parsing module, configured to parse the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded, before the data to be loaded is divided, according to the at least one data attribute, into multiple data blocks to be loaded that belong to different purpose loading ends. The data attributes include time information, data source information, or data business type.
In the embodiments of the present application, preferably, the data attribute parsing module is specifically configured to extract at least one data attribute from the set content of the data to be loaded by applying rule matching to the set content.
In the embodiments of the present application, preferably, the data attribute parsing module is specifically configured to obtain at least one data attribute of the data to be loaded by splitting the set content in the data to be loaded.
In the embodiments of the present application, preferably, the device further includes:
A set content receiving module, configured to receive the set content preselected by the user through the settings interface.
In the embodiments of the present application, preferably, the device further includes:
A source data loading task generation module, configured to generate a source data loading task corresponding to the extracted source data before the preset processing is performed on the source data to obtain the data to be loaded;
A first task queue adding module, configured to add the source data loading task to the first task queue;
A source data loading task extraction module, configured to take a pending source data loading task from the first task queue according to the set first processing order rule.
In the embodiments of the present application, preferably, the device further includes:
A data block loading task generation module, configured to generate, for each data block to be loaded, a data block loading task corresponding to that data block, after the data to be loaded has been divided, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
A second task queue adding module, configured to add the data block loading task to the second task queue preset for the purpose loading end to which the data block to be loaded belongs.
In the embodiments of the present application, preferably, the loading module includes:
A data block loading task extraction module, configured to take, for each second task queue, a data block loading task from that second task queue according to the set second processing order rule;
A data block loading task execution module, configured to execute the data block loading task to load the data block to be loaded to the purpose loading end to which it belongs.
In the embodiments of the present application, preferably, the loading module is specifically configured to call at least one loading thread of the purpose loading end to load the data to be loaded to the purpose loading end.
In the embodiments of the present application, preferably, the device further includes:
A purpose loading end determining module, configured to determine whether the purpose loading end exists before the data block to be loaded is controlled to be loaded to the purpose loading end to which it belongs.
In the embodiments of the present application, preferably, the device further includes:
A creation module, configured to create the purpose loading end, and create the loading threads corresponding to the purpose loading end, if the purpose loading end does not exist.
According to the embodiments of the present application, at least one data attribute extracted from the data to be loaded is used as the basis for identifying the purpose loading end in the data warehouse to which each piece of data to be loaded belongs, and each piece of data to be loaded is then loaded to the purpose loading end to which it belongs. This provides a mechanism for importing the data of a single ETL task into multiple purpose loading ends.
Because the device embodiment essentially corresponds to the method embodiments shown in the preceding Figures 1-2, parts that are not described in detail in this embodiment can be found in the related description of the preceding embodiments and are not repeated here.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Finally, it should be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The data loading method in an ETL task and the data loading device in an ETL task provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (16)

1. A data loading system, characterized by comprising at least one data source, a data loading device, and multiple purpose loading ends;
The data loading device is configured to extract source data from the at least one data source and perform preset processing on the source data to obtain data to be loaded; to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends; and to control each data block to be loaded to be loaded to the purpose loading end to which it belongs.
2. A data loading method, characterized by comprising:
Extracting source data from at least one data source, and performing preset processing on the source data to obtain data to be loaded;
Dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
Controlling each data block to be loaded to be loaded to the purpose loading end to which it belongs.
3. The method according to claim 2, characterized in that extracting source data from at least one data source comprises:
Extracting mixed source data from multiple data sources, the data sources including business systems, system modules, or application programs.
4. The method according to claim 3, characterized in that performing preset processing on the source data to obtain data to be loaded comprises:
Dividing the mixed source data into multiple pieces of source data according to the data source identifiers they carry;
Looking up the processing strategy for the data source corresponding to each piece of source data;
Performing preset processing on each piece of source data according to the processing strategy found, the preset processing including data cleaning and data conversion.
5. The method according to claim 2, characterized in that dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends comprises:
Looking up, in a loading end routing table, the purpose loading end corresponding to at least one data attribute of the data to be loaded;
Dividing the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends.
6. The method according to claim 2, characterized in that, before the data to be loaded is divided, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends, the method further comprises:
Parsing the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attributes including time information, data source information, or data business type.
7. The method according to claim 6, characterized in that parsing the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded comprises:
Extracting at least one data attribute from the set content of the data to be loaded by applying rule matching to the set content.
8. The method according to claim 6, characterized in that parsing the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded comprises:
Obtaining at least one data attribute of the data to be loaded by splitting the set content in the data to be loaded.
9. The method according to claim 6, characterized in that the method further comprises:
Receiving the set content preselected by the user through the settings interface.
10. The method according to claim 2, characterized in that, before the preset processing is performed on the source data to obtain the data to be loaded, the method further comprises:
Generating a source data loading task corresponding to the extracted source data, and adding it to a first task queue;
Taking a pending source data loading task from the first task queue according to a set first processing order rule.
11. The method according to claim 10, characterized in that, after the data to be loaded is divided, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends, the method further comprises:
For each data block to be loaded, generating a data block loading task corresponding to that data block, and adding it to a second task queue preset for the purpose loading end to which the data block to be loaded belongs.
12. The method according to claim 11, characterized in that controlling each data block to be loaded to be loaded to the purpose loading end to which it belongs comprises:
For each second task queue, taking a data block loading task from that second task queue according to a set second processing order rule;
Executing the data block loading task to load the data block to be loaded to the purpose loading end to which it belongs.
13. The method according to claim 2, characterized in that controlling the data block to be loaded to be loaded to the purpose loading end to which it belongs comprises:
Calling at least one loading thread of the purpose loading end to load the data to be loaded to the purpose loading end.
14. The method according to claim 13, characterized in that, before the data block to be loaded is controlled to be loaded to the purpose loading end to which it belongs, the method further comprises:
Determining that the purpose loading end exists.
15. The method according to claim 14, characterized in that the method further comprises:
If the purpose loading end does not exist, creating the purpose loading end and creating the loading threads corresponding to the purpose loading end.
16. A data loading device, characterized by comprising:
A source data extraction module, configured to extract source data from at least one data source;
A source data processing module, configured to perform preset processing on the source data to obtain data to be loaded;
A to-be-loaded data attribution module, configured to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
A loading module, configured to control each data block to be loaded to be loaded to the purpose loading end to which it belongs.
CN201610799125.4A 2016-08-31 2016-08-31 A kind of data load method, apparatus and system Pending CN107784039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610799125.4A CN107784039A (en) 2016-08-31 2016-08-31 A kind of data load method, apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610799125.4A CN107784039A (en) 2016-08-31 2016-08-31 A kind of data load method, apparatus and system

Publications (1)

Publication Number Publication Date
CN107784039A true CN107784039A (en) 2018-03-09

Family

ID=61451891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610799125.4A Pending CN107784039A (en) 2016-08-31 2016-08-31 A kind of data load method, apparatus and system

Country Status (1)

Country Link
CN (1) CN107784039A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008042A (en) * 2019-03-28 2019-07-12 北京易华录信息技术股份有限公司 A kind of algorithm Cascading Methods and system based on container
CN110457348A (en) * 2018-05-02 2019-11-15 北京三快在线科技有限公司 A kind of data processing method and device
CN112214453A (en) * 2020-09-14 2021-01-12 上海微亿智造科技有限公司 Large-scale industrial data compression storage method, system and medium
CN112256775A (en) * 2020-09-27 2021-01-22 建信金融科技有限责任公司 Method and device for timed data loading of Oracle database
CN115859370A (en) * 2023-03-02 2023-03-28 萨科(深圳)科技有限公司 Transaction data processing method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329676A (en) * 2007-06-20 2008-12-24 华为技术有限公司 Data paralleling abstracting method and apparatus and database system
CN103077192A (en) * 2012-12-24 2013-05-01 中标软件有限公司 Data processing method and system thereof
CN104731891A (en) * 2015-03-17 2015-06-24 浪潮集团有限公司 Method for mass data extraction in ETL
CN104850638A (en) * 2015-05-25 2015-08-19 广州精点计算机科技有限公司 ETL process parallel decision method and apparatus
CN104915414A (en) * 2015-06-04 2015-09-16 北京京东尚科信息技术有限公司 Data extraction method and device
CN105760221A (en) * 2016-02-02 2016-07-13 中博信息技术研究院有限公司 Task dispatching system with distributed calculating frame
CN105808778A (en) * 2016-03-30 2016-07-27 中国银行股份有限公司 Method and device for extracting, transforming and loading mass data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曾一: "《大学计算机基础》", 30 September 2015 *

Similar Documents

Publication Publication Date Title
CN107784039A (en) A kind of data load method, apparatus and system
CN103631922B (en) Extensive Web information extracting method and system based on Hadoop clusters
CN107577805A (en) A kind of business service system towards the analysis of daily record big data
CN105468744B (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN106202207A (en) A kind of index based on HBase ORM and searching system
CN104850593B (en) A kind of storage of emergency materials data and circulation monitoring method based on big data
CN112396108A (en) Service data evaluation method, device, equipment and computer readable storage medium
CN102664915B (en) Service selection method based on resource constraint in cloud manufacturing environment
CN106407216A (en) Clue tracing audit system developed on basis of semantic net construction path and construction method of clue tracing audit system
Vo et al. A multi-core approach to efficiently mining high-utility itemsets in dynamic profit databases
CN111061679B (en) Method and system for rapid configuration of technological innovation policy based on rete and drools rules
CN107291770A (en) The querying method and device of mass data in a kind of distributed system
CN108287889B (en) A kind of multi-source heterogeneous data storage method and system based on elastic table model
CN101604319A (en) Xinhua Finance Media's business data center system
CN105824892A (en) Method for synchronizing and processing data by data pool
Chen et al. Data mining-based dispatching system for solving the local pickup and delivery problem
CN107871055A (en) A kind of data analysing method and device
KR20140076010A (en) A system for simultaneous and parallel processing of many twig pattern queries for massive XML data and method thereof
CN106257447A (en) The video storage of cloud storage server and search method, video cloud storage system
Min et al. Data mining and economic forecasting in DW-based economical decision support system
CN115221337A (en) Data weaving processing method and device, electronic equipment and readable storage medium
Asghari et al. A semi-automatic system for data management and cleaning
Rückemann Creation of Objects and Concordances for Knowledge Processing and Advanced Computing
Peral et al. Energy consumption prediction by using an integrated multidimensional modeling approach and data mining techniques with Big Data
Jemal et al. MapReduce-DBMS: an integration model for big data management and optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180309