CN107784039A - Data loading method, apparatus and system - Google Patents
- Publication number
- CN107784039A (application CN201610799125.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- loaded
- source
- loading
- loading end
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application provides a data loading method, apparatus and system. The system includes at least one data source, a data loading apparatus and multiple destination loading ends. The data loading apparatus extracts source data from the at least one data source and applies preset processing to the source data to obtain data to be loaded; divides the data to be loaded, according to at least one data attribute of that data, into multiple data blocks to be loaded that belong to different destination loading ends; and controls each data block to be loaded so that it is loaded onto the destination loading end to which it belongs. The application thereby processes, classifies and separately stores the data of one or more data sources. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple destination loading ends.
Description
Technical field
This application relates to the technical field of data processing, and in particular to a data loading method, a data loading apparatus, and a data loading system.
Background
ETL (Extraction-Transformation-Loading), the process of extracting, transforming and loading data, is an important part of building a data warehouse: the user extracts the required data from a database, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
Traditional ETL tools and software solve the problem of importing data from one data source into one destination in a data warehouse, for example from a table in database A to another table in database B. In practice, however, some application scenarios need to import data into multiple destinations, so a multi-destination data loading mechanism for ETL tasks is needed.
Summary of the invention
The technical problem to be solved by this application is to provide a task management method and apparatus for a big data processing platform that partly or wholly solves the above technical problem.
To solve the above problem, this application discloses a data loading system comprising at least one data source, a data loading apparatus and multiple destination loading ends;
the data loading apparatus is configured to extract source data from the at least one data source and apply preset processing to the source data to obtain data to be loaded; to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends; and to control each data block to be loaded so that it is loaded onto the destination loading end to which it belongs.
This application also discloses a data loading method, comprising:
extracting source data from at least one data source, and applying preset processing to the source data to obtain data to be loaded;
dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends;
controlling each data block to be loaded so that it is loaded onto the destination loading end to which it belongs.
Preferably, extracting source data from at least one data source comprises:
extracting mixed source data from multiple data sources, the data sources including business systems, system modules or application programs.
Preferably, applying preset processing to the source data to obtain data to be loaded comprises:
dividing the mixed source data into multiple pieces of source data according to the data source identifiers they carry;
looking up the processing strategy for the data source corresponding to each piece of source data;
applying preset processing to each piece of source data according to the processing strategy found, the preset processing including data cleaning and data conversion.
Preferably, dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends comprises:
looking up, in a loading end routing table, the destination loading end corresponding to the at least one data attribute of the data to be loaded;
dividing the data to be loaded into multiple data blocks to be loaded that belong to different destination loading ends.
Preferably, before dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends, the method further comprises:
parsing specified content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attribute including time information, data source information or data business type.
Preferably, parsing the specified content of the data to be loaded to obtain at least one data attribute of the data to be loaded comprises:
extracting at least one data attribute from the specified content of the data to be loaded by performing rule matching on the specified content.
Preferably, parsing the specified content of the data to be loaded to obtain at least one data attribute of the data to be loaded comprises:
obtaining at least one data attribute of the data to be loaded by splitting the specified content of the data to be loaded.
Preferably, the method further comprises:
receiving the specified content preselected by a user through a settings interface.
Preferably, before applying preset processing to the source data to obtain data to be loaded, the method further comprises:
generating a source data loading task corresponding to the extracted source data, and adding it to a first task queue;
extracting a source data loading task to be processed from the first task queue according to a set first processing order rule.
Preferably, after dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends, the method further comprises:
for each data block to be loaded, generating a data block loading task corresponding to that data block, and adding it to a second task queue preset for the destination loading end corresponding to the data block.
Preferably, controlling each data block to be loaded so that it is loaded onto the destination loading end to which it belongs comprises:
for each second task queue, extracting data block loading tasks from the second task queue according to a set second processing order rule;
executing the data block loading task to load the data block to be loaded onto the destination loading end to which it belongs.
Preferably, controlling the data block to be loaded so that it is loaded onto the destination loading end to which it belongs comprises:
invoking at least one loading thread of the destination loading end to load the data to be loaded onto the destination loading end.
Preferably, before controlling the data block to be loaded so that it is loaded onto the destination loading end to which it belongs, the method further comprises:
determining that the destination loading end exists.
Preferably, the method further comprises:
if the destination loading end does not exist, creating the destination loading end and creating a loading thread corresponding to the destination loading end.
This application also discloses a data loading apparatus, comprising:
a source data extraction module, configured to extract source data from at least one data source;
a source data processing module, configured to apply preset processing to the source data to obtain data to be loaded;
a to-be-loaded data attribution module, configured to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends;
a loading module, configured to control each data block to be loaded so that it is loaded onto the destination loading end to which it belongs.
Preferably, the source data extraction module is specifically configured to extract mixed source data from multiple data sources, the data sources including business systems, system modules or application programs.
Preferably, the source data processing module comprises:
a source data division submodule, configured to divide the mixed source data into multiple pieces of source data according to the data source identifiers they carry;
a processing strategy lookup submodule, configured to look up the processing strategy for the data source corresponding to each piece of source data;
a preset processing submodule, configured to apply preset processing to each piece of source data according to the processing strategy found, the preset processing including data cleaning and data conversion.
Preferably, the to-be-loaded data attribution module comprises:
a loading end lookup submodule, configured to look up, in a loading end routing table, the destination loading end corresponding to the at least one data attribute of the data to be loaded;
a data division submodule, configured to divide the data to be loaded into multiple data blocks to be loaded that belong to different destination loading ends.
Preferably, the apparatus further comprises:
a data attribute parsing module, configured to parse, before the data to be loaded is divided according to at least one data attribute into multiple data blocks to be loaded that belong to different destination loading ends, specified content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attribute including time information, data source information or data business type.
Preferably, the data attribute parsing module is specifically configured to extract at least one data attribute from the specified content of the data to be loaded by performing rule matching on the specified content.
Preferably, the data attribute parsing module is specifically configured to obtain at least one data attribute of the data to be loaded by splitting the specified content of the data to be loaded.
Preferably, the apparatus further comprises:
a specified content receiving module, configured to receive the specified content preselected by a user through a settings interface.
Preferably, the apparatus further comprises:
a source data loading task generation module, configured to generate, before preset processing is applied to the source data to obtain data to be loaded, a source data loading task corresponding to the extracted source data;
a first task queue adding module, configured to add the source data loading task to a first task queue;
a source data loading task extraction module, configured to extract a source data loading task to be processed from the first task queue according to a set first processing order rule.
Preferably, the apparatus further comprises:
a data block loading task generation module, configured to generate, after the data to be loaded has been divided according to at least one data attribute into multiple data blocks to be loaded that belong to different destination loading ends, a data block loading task for each data block to be loaded;
a second task queue adding module, configured to add the data block loading task to a second task queue preset for the destination loading end corresponding to the data block to be loaded.
Preferably, the loading module comprises:
a data block loading task extraction module, configured to extract, for each second task queue, data block loading tasks from the second task queue according to a set second processing order rule;
a data block loading task execution module, configured to execute the data block loading task to load the data block to be loaded onto the destination loading end to which it belongs.
Preferably, the loading module is specifically configured to invoke at least one loading thread of the destination loading end to load the data to be loaded onto the destination loading end.
Preferably, the apparatus further comprises:
a destination loading end determination module, configured to determine, before the data block to be loaded is loaded onto the destination loading end to which it belongs, that the destination loading end exists.
Preferably, the apparatus further comprises:
a creation module, configured to create, if the destination loading end does not exist, the destination loading end and a loading thread corresponding to the destination loading end.
Compared with the prior art, this application has the following advantages:
According to the embodiments of this application, source data is extracted from at least one data source and subjected to preset processing to obtain data to be loaded; the data to be loaded is divided by data attribute into data blocks that belong to different destination loading ends; and each data block to be loaded is then loaded onto the destination loading end to which it belongs. The data of one or more data sources is thereby processed, classified and stored separately. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple destination loading ends.
Brief description of the drawings
Fig. 1 is an application schematic diagram of a data loading system of this application;
Fig. 2 is a flowchart of embodiment 1 of a data loading method of this application;
Fig. 3 is a flowchart of embodiment 2 of a data loading method of this application;
Fig. 4 is an architecture diagram of a data loading system of this application;
Fig. 5 is a data processing flowchart of an implementation of the embodiments of this application based on Morphline;
Fig. 6 is a structural block diagram of an embodiment of a task management apparatus for a big data processing platform of this application.
Detailed description
To make the above objects, features and advantages of this application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.
This application provides a data loading system comprising at least one data source, a data loading apparatus and multiple destination loading ends. Fig. 1 shows an application schematic diagram of a data loading system of this application. The data loading apparatus extracts source data from the at least one data source and applies preset processing to the source data to obtain data to be loaded; divides the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different destination loading ends; and controls each data block to be loaded so that it is loaded onto the destination loading end to which it belongs. Specifically, the following method steps may be used:
Fig. 2 shows a flowchart of embodiment 1 of a data loading method of this application, which may specifically include the following steps:
Step 101: extract source data from at least one data source, and apply preset processing to the source data to obtain data to be loaded.
The embodiments of this application extract data from data sources, and the way data is extracted can be set according to actual requirements. A data source can be a database, a business system, a system module (that is, a module of a business system) or an application program. Accordingly, the data extracted from a database can be a data table or the data in a data table; the data extracted from a business system can be business data generated by the business system, log data recorded by the business system, and so on; the data extracted from a system module can be the module's processing results, log data recorded for the module, and so on; and the data extracted from an application program can be the program's processing results, data crawled by the program, the program's run logs, and so on. In practice, data can also be extracted from any other feasible data source; the above are only examples, and this application does not limit the specific type of data source.
After source data has been extracted from a data source, preset processing is applied to it to obtain the data to be loaded onto the loading ends. When applied to an ETL processing flow, the preset processing can be cleaning and converting the data. The task of data cleaning is to filter out data that does not meet requirements, which falls into three main categories: incomplete data, erroneous data and duplicate data. The task of data conversion is mainly to convert inconsistent data (integrating data of the same type from different sources), to convert data granularity (aggregating data from different sources according to the data warehouse granularity) and to calculate data according to preset rules. Which processing is used can be set according to actual requirements, for example data cleaning operations such as adding or deleting columns in the source data; this application does not limit this.
Extracting data from multiple data sources yields mixed source data, and data can be extracted from multiple data sources of any kind according to actual requirements. When preset processing is applied to mixed source data, the same processing strategy can be reused, or different processing strategies can be used for different sources. Specifically, a processing strategy can be set in advance for each kind of data source; the mixed source data is divided into multiple pieces of source data according to the data source identifiers it carries, the processing strategy set for the data source to which each piece of source data belongs is looked up, and preset processing is applied to the source data according to that strategy.
Step 102, according at least one data attribute of data to be loaded, the data to be loaded are divided into ownership not
With multiple data blocks to be loaded of purpose loading end.
The embodiment of the present application divides to the data to be loaded obtained from one or more data sources, and partitioning standards are to treat
Data to be loaded are in other words divided by the purpose loading end that loading data are belonged to according to corresponding purpose loading end,
Divide obtained data block to be loaded and correspond to different purpose loading ends respectively.
The destination loading end to which data belongs is determined from the data attributes of the data to be loaded. A data attribute can include one or more of time information, data source information identifying the data source to which the data belongs, and data business type. It can be extracted directly from specified content of the data to be loaded, obtained by statistical analysis of that specified content, or computed with some preset algorithm; alternatively, features of the data to be loaded such as its data format can serve as data attributes. The preset algorithm can be chosen according to actual requirements, for example a hash algorithm or a message digest algorithm; this application does not limit this.
In an ETL processing flow, an ETL task transfers the data to be loaded into a data warehouse, and the data attributes of the data to be loaded determine the destination loading ends in the data warehouse to which the data belongs. A loading end of the data warehouse can specifically be a storage unit of the data warehouse, for example a file or a data table in the data warehouse.
The correspondence between data attributes and loading ends can be determined in any suitable way. For example, a mapping between each data attribute and a loading end can be preset, and the destination loading end indicated in the mapping is looked up by the data attribute of the data to be loaded. Alternatively, the destination loading end can be computed from the attribute data, for example by hashing the attribute data to obtain the number of a destination loading end and using the loading end indicated by that number as the destination loading end for the attribute data.
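Both ways of resolving a destination loading end, a preset mapping and a hash of the attribute, can be sketched as below. The table contents, the `trade_log/dt=...` naming and the function names are assumptions made for illustration; the patent does not prescribe a particular hash function.

```python
import zlib

# Hypothetical routing table: data attribute value -> destination loading end.
ROUTING_TABLE = {
    "2016-08-30": "trade_log/dt=2016-08-30",
    "2016-08-31": "trade_log/dt=2016-08-31",
}

def route_by_table(attribute):
    """Look up the destination loading end; None means no mapping is configured."""
    return ROUTING_TABLE.get(attribute)

def route_by_hash(attribute, num_loading_ends):
    """Hash the attribute to the number of a destination loading end."""
    # zlib.crc32 is used because Python's built-in hash() of str varies between runs.
    return zlib.crc32(attribute.encode("utf-8")) % num_loading_ends
```

A table gives explicit control over where each attribute value goes, while hashing needs no configuration but fixes the number of loading ends in advance.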
Specifically, before the loading ends are identified, the data to be loaded can be decomposed into multiple data units according to whether a data attribute is carried; the data units can be of the same or different sizes. The destination loading end of each data unit is determined from the data attribute it carries, and a data block obtained by the division can be a set of one or more data units.
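Grouping data units into data blocks by destination loading end can be sketched as follows. The function and parameter names (`divide_into_blocks`, `attribute_of`, `loading_end_for`) and the sample records are hypothetical; the sketch only assumes that each data unit carries an attribute from which a loading end can be derived.

```python
from collections import defaultdict

def divide_into_blocks(data_units, attribute_of, loading_end_for):
    """Group data units into one block per destination loading end."""
    blocks = defaultdict(list)
    for unit in data_units:
        blocks[loading_end_for(attribute_of(unit))].append(unit)
    return dict(blocks)

# Example: data units keyed by a date attribute.
units = [
    {"date": "2016-08-30", "order": 1},
    {"date": "2016-08-31", "order": 2},
    {"date": "2016-08-30", "order": 3},
]
blocks = divide_into_blocks(
    units,
    attribute_of=lambda unit: unit["date"],
    loading_end_for=lambda date: f"trade_log/dt={date}",
)
```

Here each resulting block is the set of data units sharing one loading end, and units keep their original relative order inside a block.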
Step 103: control each data block to be loaded so that it is loaded onto the destination loading end to which it belongs.
After the destination loading ends to which the data to be loaded belongs have been determined and the data has been divided accordingly into multiple data blocks to be loaded, each data block to be loaded is loaded onto the destination loading end to which it belongs. One or more loading threads can be used to load the data blocks; when one thread needs to load multiple data blocks to be loaded, it can load them one by one in a set priority order.
Before the data is loaded, the data blocks to be loaded can be saved to separate files according to the destination loading ends to which they belong, and the data block in each file is then loaded; this approach consumes twice the disk space. Alternatively, the data blocks to be loaded can be added to preset queues according to the destination loading ends to which they belong, so that tasks are scheduled from the queues to load the data blocks; compared with the former approach, this reduces the storage space used.
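The queue-based alternative can be sketched with Python's standard `queue` and `threading` modules. This is an assumption-laden illustration: `start_loader` and `load_block` are hypothetical names, and `load_block` stands in for whatever actually writes a block to the destination.

```python
import queue
import threading

def start_loader(loading_end, load_block):
    """Start one loading thread that drains the queue of one destination loading end."""
    blocks = queue.Queue()

    def worker():
        while True:
            block = blocks.get()
            if block is None:  # sentinel: no more blocks for this loading end
                break
            load_block(loading_end, block)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return blocks, thread
```

A caller would put data blocks on the returned queue and finally put `None` to let the thread finish. Blocks stay in memory until they are loaded, so no intermediate copy is written to disk.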
According to the embodiments of this application, source data is extracted from at least one data source and subjected to preset processing to obtain data to be loaded; the data to be loaded is divided by data attribute into data blocks that belong to different destination loading ends; and each data block to be loaded is then loaded onto the destination loading end to which it belongs. The data of one or more data sources is thereby processed, classified and stored separately. Applied to ETL processing, the embodiments provide a mechanism for importing the data of a single ETL task into multiple destination loading ends.
The embodiments of this application can be applied to many scenarios; several specific examples are given below:
Application scenario 1:
Source data is extracted from a single data source; take a transaction system as the data source. Transaction logs covering a period of time are extracted from the transaction system. Because the transaction log records trading activity at many different times, different data units in the transaction log have different log dates.
After the transaction log has been cleaned and converted, the multiple log dates contained in the transaction log are extracted as the data attribute of this application, and the transaction log is divided by log date into data blocks for the different log dates. Data blocks for different log dates correspond to different destination loading ends, and each log data block is then loaded onto the destination loading end to which it belongs, so that the transaction log is processed, classified by date and stored separately by date.
Application scenario 2:
Source data is extracted from multiple data sources; take multiple application programs as the data sources. Processing results are extracted from the application programs. Because each program's processing results are small in volume, configuring a separate ETL processing flow for each is time-consuming and laborious and wastes processing and storage resources. In this scenario, the application processes the data of multiple application programs and stores it by category.
After extracting the processing results from the application programs, the application first cleans and converts them, then extracts the program identifier carried in the processing results as the data attribute and divides the processing results by program identifier into data blocks for the different program identifiers. Data blocks for different program identifiers correspond to different destination loading ends, and each data block is then loaded onto the destination loading end to which it belongs, so that data from multiple data sources is processed, classified by application program and stored.
The processing results of the multiple application programs can be cleaned and converted with the same processing flow, or a separate processing flow can be preset for each, so that data from different sources is processed as required.
The above applications are only examples; in a specific implementation, the kinds of data source and data attribute can be set according to actual requirements.
In the embodiments of this application, the step of dividing data blocks by data attribute can be implemented within the conversion step of the ETL process. By adding a distribution (dispatch) stage to the ETL process, the destination loading ends are identified and the data to be loaded is distributed, so the ETL data processing involves four stages in total: extraction, conversion, distribution and loading.
The embodiments of this application can be implemented as an ETL programming and development tool. Because it supports distributing data to multiple destinations of a data warehouse, it can meet the need to distribute data to multiple destinations in specific application scenarios. The tool can also provide a programming interface through which programmers extend and maintain it, while details such as performance and fault tolerance are managed by the underlying framework.
In the embodiments of this application, preferably, the mapping between data attributes and loading ends can be set and stored in a preset loading end routing table. Accordingly, when identifying the destination loading ends to which the data to be loaded belongs, the destination loading end corresponding to the data attribute of the data to be loaded can be looked up in the routing table.
In the embodiments of this application, preferably, the data attribute can be obtained from specified content of the data to be loaded. Before the destination loading ends to which the data to be loaded belongs are identified according to at least one data attribute of the data to be loaded in the ETL task, the at least one data attribute can be obtained by parsing the specified content of the data to be loaded. The specified content can be the data at a specified position in the data to be loaded that identifies the data type, for example a particular column of data.
The specified content of the data to be loaded can be set by the programmer. In the embodiments of this application, preferably, the specified content selected by the user through the settings interface of the dispatch method can also be received. Specifically, multiple candidates for the specified content can be shown at the relevant position of the settings interface, for example multiple data columns and their related information shown in the ETL programming and development tool, and through the settings interface the user can select the data column that, once parsed, is used to identify the destination loading end.
In a preferred example, when the specified content of the data to be loaded is parsed to obtain at least one data attribute of the data to be loaded, rule matching is performed on the specified content (for example, matching with a regular expression), at least one data attribute is extracted from the specified content, and the matched result is used as the data attribute of the data to be loaded.
In another preferred example, when the specified content of the data to be loaded is parsed to obtain at least one data attribute of the data to be loaded, the at least one data attribute can be obtained by splitting the specified content of the data to be loaded. For example, the data column that represents the data type, type, is split to obtain the data size (size) and data length (length) of the data to be loaded, and the size and length corresponding to the data to be loaded are used as the data attributes of each piece of data to be loaded.
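Obtaining attributes by splitting can be sketched as below. The patent does not state how the type column is encoded, so the `XL_120` format and the underscore separator are purely assumed for illustration, as is the function name.

```python
def split_type_column(type_value, separator="_"):
    """Split a type value such as 'XL_120' into size and length attributes."""
    size, _, length = type_value.partition(separator)
    return {
        "size": size,
        # length is None when the value carries no numeric length part
        "length": int(length) if length.isdigit() else None,
    }
```

Under that assumed format, `split_type_column("XL_120")` gives a size of `XL` and a length of `120`, either of which could then serve as the routing attribute.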
Fig. 3 shows a flowchart of embodiment 2 of a data loading method in an ETL task of this application, which may specifically include the following steps:
Step 201: extract source data from at least one data source.
Step 202: generate a source data loading task corresponding to the extracted source data, and add it to a first task queue.
Each operation of fetching source data corresponds to one source data loading task. After source data is extracted, the corresponding source data loading task can be generated and added to the first task queue, which holds source data loading tasks.
Step 203, according to the first processing sequence rule of setting, pending source number is extracted from the first task queue
According to loading tasks.
Source data loading tasks in the first task queue can be processed in sequence according to the set processing order rule, for example in the order in which the tasks were generated, or in the order of the priority indicated by the priority mark carried by each task. Which processing order rule is used can be set according to actual requirements.
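As a rough illustration of such a processing order rule, the sketch below orders tasks by priority mark and, within the same priority, by generation order. It is an assumption-laden Python sketch: the task names, the convention that a lower number means higher priority, and the helper functions are hypothetical and are not taken from the application.

```python
import itertools
import queue

# Hypothetical first task queue ordered by (priority, generation sequence).
# Lower priority numbers are served first; the sequence number keeps tasks
# with the same priority in the order in which they were generated.
first_task_queue = queue.PriorityQueue()
_sequence = itertools.count()


def add_source_task(name, priority=0):
    """Add a source data loading task with its priority mark."""
    first_task_queue.put((priority, next(_sequence), name))


def next_source_task():
    """Extract the next pending source data loading task."""
    _priority, _seq, name = first_task_queue.get()
    return name


add_source_task("extract-orders", priority=1)
add_source_task("extract-logs", priority=0)
add_source_task("extract-users", priority=1)

order = [next_source_task() for _ in range(3)]
print(order)
```

Using the same priority for every task reduces this rule to plain generation-time ordering.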
Step 204, carry out default processing on the source data to obtain data to be loaded.
In a specific implementation, the source data loading tasks in the first task queue can be processed by multiple threads, which improves the processing efficiency of the source data.
Step 205, according to at least one data attribute of the data to be loaded, divide the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends.
Step 206, determine whether the purpose loading end exists.
Before a data block to be loaded is loaded onto a purpose loading end, it must also be determined whether that purpose loading end exists. If it exists, the step of loading the data block can be performed; if it does not exist, the corresponding purpose loading end needs to be created.
Step 207, if the purpose loading end does not exist, create the purpose loading end and create a loading thread corresponding to the purpose loading end.
When a new loading end is created, it can be registered through a setting entry. The new loading end can be named following the naming order of the other loading ends, given a name set by the user, or named according to naming logic set by the user. For example, if the value of the size column of a record is XL, the partition table tablename/size='XL' is created.
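The create-if-missing check and the naming example above can be sketched as follows. This Python sketch is only an illustration under stated assumptions: the in-memory registry and both helper functions are hypothetical, and an empty list stands in for actually creating a partition table.

```python
# Hypothetical registry of existing purpose loading ends, keyed by name.
loading_ends = {}


def loading_end_name(table, column, value):
    """Naming logic assumed for illustration: one partition per column value."""
    return f"{table}/{column}='{value}'"


def ensure_loading_end(name):
    """Create the loading end if it is missing; return True if it was created."""
    if name in loading_ends:
        return False
    loading_ends[name] = []  # stands in for creating the partition table
    return True


name = loading_end_name("tablename", "size", "XL")
created_first = ensure_loading_end(name)
created_again = ensure_loading_end(name)
print(name, created_first, created_again)
```

The second call returns False because the loading end already exists, which corresponds to going straight to the loading step.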
Step 208, for each data block to be loaded, generate a data block loading task corresponding to that data block, and add it to the second task queue preset for the purpose loading end corresponding to the data block.
After the data to be loaded is divided into data blocks to be loaded according to the different purpose loading ends, a corresponding data block loading task can be generated for each data block to be loaded, and the generated data block loading tasks are added to the second task queues set for the respective purpose loading ends.
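A minimal sketch of this step is given below: records are grouped into data blocks by one attribute, and one loading task per block is placed on the queue of its loading end. The sketch is in Python and rests on assumptions: the records, the use of the size attribute as the routing key, and the representation of a task as a tuple are all hypothetical.

```python
from collections import defaultdict
from queue import Queue

# Hypothetical data to be loaded; `size` is the data attribute used for routing.
records = [
    {"id": 1, "size": "XL"},
    {"id": 2, "size": "S"},
    {"id": 3, "size": "XL"},
]


def divide_into_blocks(rows, attribute):
    """Group records into data blocks, one block per attribute value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[attribute]].append(row)
    return dict(blocks)


# One second task queue per purpose loading end, created on first use.
second_task_queues = defaultdict(Queue)

for destination, block in divide_into_blocks(records, "size").items():
    # The data block loading task is represented here as a plain tuple.
    second_task_queues[destination].put((destination, block))

queued = {dest: q.qsize() for dest, q in second_task_queues.items()}
print(queued)
```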
A second task queue is provided for each purpose loading end, to store the data block loading tasks split from each source data loading task. Specifically, the second task queue corresponding to a loading end can be created when the new loading end is created.
Step 209, for each second task queue, according to a set second processing order rule, extract data block loading tasks from the second task queue.
Data block loading tasks in the second task queue can be processed in sequence according to the set processing order rule, for example in the order in which the tasks were generated, or in the order of the priority indicated by the priority mark carried by each task. Which processing order rule is used can be set according to actual requirements.
Step 210, execute the data block loading tasks so as to load the data blocks to be loaded onto the purpose loading ends to which they belong.
In a specific implementation, the data block loading tasks in the second task queue can be executed by multiple threads, which improves the loading efficiency of the data blocks.
Specifically, one or more loading threads corresponding to a loading end can be created when the new loading end is created.
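The multithreaded loading described above might look like the following Python sketch. It is not the application's implementation: the use of two threads, the sentinel value used to stop them, and the list that stands in for a purpose loading end are assumptions chosen to keep the example self-contained.

```python
import threading
from queue import Queue

SENTINEL = None  # tells a loading thread to stop


def loading_thread(tasks, destination, lock):
    """Consume data block loading tasks and append their rows to the destination."""
    while True:
        block = tasks.get()
        if block is SENTINEL:
            break
        with lock:
            destination.extend(block)


second_task_queue = Queue()
loading_end_a = []  # stands in for one purpose loading end
lock = threading.Lock()

# Two loading threads for this loading end (the number is arbitrary here).
threads = [
    threading.Thread(target=loading_thread, args=(second_task_queue, loading_end_a, lock))
    for _ in range(2)
]
for thread in threads:
    thread.start()

for block in ([1, 2], [3], [4, 5, 6]):
    second_task_queue.put(block)
for _ in threads:
    second_task_queue.put(SENTINEL)  # one stop signal per thread
for thread in threads:
    thread.join()

print(sorted(loading_end_a))
```

The arrival order of rows depends on thread scheduling, so the result is sorted before it is printed.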
According to the embodiment of the present application, source data is extracted from at least one data source and subjected to default processing to obtain data to be loaded; the data to be loaded is divided, according to its data attributes, into data blocks that belong to different purpose loading ends; and each data block to be loaded is then controlled to be loaded onto the purpose loading end to which it belongs. The data of one or more data sources is thereby processed, classified, and distributed for storage. Applied in an ETL process, the embodiment of the present application can provide a mechanism for importing the data in one ETL task into multiple purpose loading ends.
To help those skilled in the art better understand the scheme of the embodiment of the present application, the above embodiment is illustrated below by taking its application in an ETL process as an example. FIG. 4 is an architecture diagram of a data loading system of the present application, which can be mainly divided into three parts: an extractor, a purpose distributor, and a loading thread pool. The role of each part is as follows:
1. The extractor is responsible for extracting source data from the data sources and generating the corresponding source data loading tasks; the source data loading task of each extraction is buffered in the first task queue.
2. The purpose distributor consumes source data loading tasks from the first task queue. It first cleans and converts the source data to obtain data to be loaded, then determines the purpose loading end to which the data to be loaded belongs according to the data attributes of the data to be loaded, and divides the data to be loaded into multiple data blocks to be loaded according to purpose loading end. For each data block to be loaded it generates a corresponding data block loading task, and adds the data block loading task to the second task queue configured for the corresponding purpose loading end.
3. The loading thread pool is responsible for fetching data block loading tasks from the second task queue and loading them onto the corresponding purpose loading end; if the purpose loading end does not yet exist, it is created by the purpose distributor. For example, FIG. 4 shows two second task queues, corresponding to purpose loading end A and purpose loading end B respectively; each queue has its own loading thread pool, and each loading thread pool includes two loading threads.
Each task in the first task queue and the second task queue can carry a task identifier, for example name1:q1; name2:q2, where name represents the identifier and q represents the record content.
The process by which the purpose distributor creates a purpose loading end can include:
1. Registering a new purpose loading end using the registration logic that the user has customized through the setting entry.
2. Allocating a new task queue for the purpose loading end; all records distributed to that purpose loading end will be cached in this task queue.
3. Creating loading threads and adding them to the loading thread pool. After the loading thread pool is created, it consumes records from the corresponding queue and loads them; each queue can have multiple loading threads to complete the loading work.
When implementing the embodiment of the present application, a Pipeline (pipeline model) can be used to build an ETL data processing pipeline. The Pipeline model constructs a streaming workflow by combining modules: each function is instantiated as an action, a group of actions is put into an array or list, and the data is then passed to this list of actions. The data is operated on by each function in turn, as on an assembly line, until the final loading is achieved. In this way the design objectives of high cohesion and low coupling can be achieved.
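The pipeline model described above, a list of actions through which the data is passed in order, can be sketched in a few lines. This is an illustrative Python sketch only; the three stage functions are hypothetical placeholders for real cleaning, conversion, and loading logic.

```python
from functools import reduce


def clean(rows):
    """Drop rows that are empty after trimming whitespace."""
    return [row for row in rows if row.strip()]


def transform(rows):
    """Normalize each row: trim whitespace and convert to upper case."""
    return [row.strip().upper() for row in rows]


def load(rows):
    """Stand-in for the final load: return what would be written."""
    return {"loaded": rows}


# The pipeline is just an ordered list of actions.
pipeline = [clean, transform, load]


def run(actions, data):
    """Pass the data through each action in order."""
    return reduce(lambda value, action: action(value), actions, data)


result = run(pipeline, [" a ", "", "b"])
print(result)
```

Because each action only depends on its input, stages can be added, removed, or reordered by editing the list, which is the low-coupling property the text refers to.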
The embodiment of the present application preferably constructs the Pipeline using the Java Builder pattern, to effectively improve the readability of the code.
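The readability benefit of a builder can be seen in the sketch below, in which a pipeline is assembled through chained method calls. The text names the Java Builder pattern; the sketch is written in Python purely for brevity, and the class and method names (Pipeline, PipelineBuilder, extract, transform, load, build) are hypothetical.

```python
class Pipeline:
    """An immutable sequence of stages applied in order."""

    def __init__(self, stages):
        self._stages = tuple(stages)

    def run(self, data):
        for stage in self._stages:
            data = stage(data)
        return data


class PipelineBuilder:
    """Fluent builder: each method records a stage and returns the builder."""

    def __init__(self):
        self._stages = []

    def extract(self, func):
        self._stages.append(func)
        return self

    def transform(self, func):
        self._stages.append(func)
        return self

    def load(self, func):
        self._stages.append(func)
        return self

    def build(self):
        return Pipeline(self._stages)


pipeline = (
    PipelineBuilder()
    .extract(lambda source: source.split(","))
    .transform(lambda rows: [int(row) for row in rows])
    .load(lambda rows: sum(rows))
    .build()
)
print(pipeline.run("1,2,3"))
```

The chained calls read in the same order as the ETL stages themselves, which is the readability gain a builder is meant to provide.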
The embodiment of the present application can also implement the ETL process based on Morphline. Morphline is a Java function library that can be regarded as a container storing various commands; it can be embedded in any Java program, and commands are loaded into Morphline as plug-ins to perform tasks.
FIG. 5 shows a data processing flowchart for implementing the embodiment of the present application based on Morphline. The Flume log collection system obtains events (for example, the system log syslog); the Morphline Sink extracts multiple records from the data, which then pass through multiple commands (Cmd), for example reading lines (readLine), structuring data (grok), and loading data (loadSolr), so that the generated documents (doc) are sent to Solr. As can be seen from this flow, data entering a Morphline pipeline can only be sent to one destination.
The embodiment of the present application is applied by improving the basic framework of Morphline: the processing logic for obtaining the data attributes of the data to be loaded is implemented in the data conversion step, a distributor is added, and the single loading end is extended to multiple loading ends. The whole chain of data extraction, conversion, distribution, and loading onto multiple loading ends is thereby defined through Morphline. Because every processing step of Morphline is implemented as a lightweight function call, the storage space that a large number of save operations would occupy can be avoided, and the data to be loaded can be cached in memory, which can improve the efficiency of ETL processing.
The data attributes used to identify the purpose loading end can be defined in the Morphline configuration file, and the data attributes are then identified by parsing the configuration file in order to determine the purpose loading end.
For simplicity of description, the method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Referring to FIG. 6, a structural block diagram of an embodiment of a data loading device in an ETL task of the present application is shown, which may specifically include the following modules:
Source data extraction module 301, configured to extract source data from at least one data source;
Source data processing module 302, configured to carry out default processing on the source data to obtain data to be loaded;
Data-to-be-loaded attribution module 303, configured to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
Loading module 304, configured to control each data block to be loaded to be loaded onto the purpose loading end to which it belongs.
In the embodiment of the present application, preferably, the source data extraction module is specifically configured to extract mixed source data from multiple data sources, the data sources including an operation system, a system module, or an application program.
In the embodiment of the present application, preferably, the source data processing module includes:
Source data dividing submodule, configured to divide the mixed source data into multiple source data according to the data source identifiers they carry;
Processing strategy searching submodule, configured to search for the processing strategy for the data source corresponding to each source data;
Default processing submodule, configured to carry out default processing on each source data according to the processing strategy found, the default processing including data cleansing and data conversion.
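The three submodules above can be sketched together as follows: mixed source data is divided by its data source identifier, a processing strategy is looked up per data source, and the strategy is applied. This Python sketch is illustrative only; the record layout, the source names, and the strategies themselves are assumptions.

```python
from collections import defaultdict

# Hypothetical mixed source data; each record carries its data source identifier.
mixed = [
    {"source": "orders", "value": " 10 "},
    {"source": "logs", "value": "ERROR disk"},
    {"source": "orders", "value": "7"},
]

# Hypothetical processing strategies, one per data source (cleansing + conversion).
strategies = {
    "orders": lambda value: int(value.strip()),
    "logs": lambda value: value.lower(),
}


def split_by_source(records):
    """Divide mixed source data into per-source lists using the identifier."""
    per_source = defaultdict(list)
    for record in records:
        per_source[record["source"]].append(record["value"])
    return dict(per_source)


def process(records):
    """Apply the strategy looked up for each data source."""
    return {
        source: [strategies[source](value) for value in values]
        for source, values in split_by_source(records).items()
    }


processed = process(mixed)
print(processed)
```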
In the embodiment of the present application, preferably, the data-to-be-loaded attribution module includes:
Loading end searching submodule, configured to search a loading end routing table for the purpose loading end corresponding to at least one data attribute of the data to be loaded;
Data dividing submodule, configured to divide the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends.
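A loading end routing table of this kind can be sketched as a simple lookup from a data attribute to a purpose loading end. The Python sketch below is only an assumption-based illustration: the table contents, the loading end names, and the fallback loading end for unmatched records are hypothetical and are not specified by the application.

```python
# Hypothetical loading end routing table: (attribute name, value) -> loading end.
routing_table = {
    ("source", "orders"): "warehouse.orders",
    ("source", "logs"): "warehouse.logs",
}


def find_loading_end(attributes, default="warehouse.unrouted"):
    """Return the first loading end whose routing key matches an attribute.

    The fallback loading end for unmatched records is an assumption of this sketch.
    """
    for key, value in attributes.items():
        destination = routing_table.get((key, value))
        if destination is not None:
            return destination
    return default


def divide(records):
    """Divide data to be loaded into blocks keyed by purpose loading end."""
    blocks = {}
    for record in records:
        blocks.setdefault(find_loading_end(record["attrs"]), []).append(record["id"])
    return blocks


records = [
    {"id": 1, "attrs": {"source": "orders"}},
    {"id": 2, "attrs": {"source": "logs"}},
    {"id": 3, "attrs": {"source": "orders"}},
    {"id": 4, "attrs": {"source": "metrics"}},
]
blocks = divide(records)
print(blocks)
```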
In the embodiment of the present application, preferably, the device further includes:
Data attribute parsing module, configured to parse, before the data to be loaded is divided according to at least one data attribute of the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends, the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attribute including time information, data source information, or data service type.
In the embodiment of the present application, preferably, the data attribute parsing module is specifically configured to carry out rule matching on the set content of the data to be loaded, so as to extract at least one data attribute from the set content of the data to be loaded.
In the embodiment of the present application, preferably, the data attribute parsing module is specifically configured to split the set content in the data to be loaded, so as to obtain at least one data attribute of the data to be loaded.
In the embodiment of the present application, preferably, the device further includes:
Set content receiving module, configured to receive the set content pre-selected by the user through a setting interface.
In the embodiment of the present application, preferably, the device further includes:
Source data loading task generation module, configured to generate, before the default processing is carried out on the source data to obtain the data to be loaded, a source data loading task corresponding to the extracted source data;
First task queue adding module, configured to add the source data loading task to the first task queue;
Source data loading task extraction module, configured to extract a pending source data loading task from the first task queue according to a set first processing order rule.
In the embodiment of the present application, preferably, the device further includes:
Data block loading task generation module, configured to generate, after the data to be loaded is divided according to at least one data attribute of the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends, a data block loading task corresponding to each data block to be loaded;
Second task queue adding module, configured to add the data block loading task to the second task queue preset for the purpose loading end corresponding to the data block to be loaded.
In the embodiment of the present application, preferably, the loading module includes:
Data block loading task extraction module, configured to extract, for each second task queue, data block loading tasks from the second task queue according to a set second processing order rule;
Data block loading task execution module, configured to execute the data block loading tasks so as to load the data blocks to be loaded onto the purpose loading ends to which they belong.
In the embodiment of the present application, preferably, the loading module is specifically configured to call at least one loading thread of the purpose loading end to load the data to be loaded onto the purpose loading end.
In the embodiment of the present application, preferably, the device further includes:
Purpose loading end determining module, configured to determine, before the data block to be loaded is controlled to be loaded onto the purpose loading end to which it belongs, whether the purpose loading end exists.
In the embodiment of the present application, preferably, the device further includes:
Creation module, configured to create, if the purpose loading end does not exist, the purpose loading end and a loading thread corresponding to the purpose loading end.
According to the embodiment of the present application, at least one extracted data attribute of the data to be loaded is used as the basis for identifying the purpose loading end in the data warehouse to which each piece of data to be loaded belongs, and each piece of data to be loaded is then controlled to be loaded onto the purpose loading end to which it belongs, thereby providing a mechanism for importing the data in one ETL task into multiple purpose loading ends.
Because the device embodiment basically corresponds to the method embodiments shown in the foregoing FIGS. 1-2, for the parts not detailed in the description of this embodiment, refer to the related description in the foregoing embodiments, which is not repeated here.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations.
The data loading method in an ETL task and the data loading device in an ETL task provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.
Claims (16)
1. A data loading system, characterized by comprising at least one data source, a data loading device, and multiple purpose loading ends;
the data loading device is configured to extract source data from the at least one data source, and carry out default processing on the source data to obtain data to be loaded; to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends; and to control each data block to be loaded to be loaded onto the purpose loading end to which it belongs.
2. A data loading method, characterized by comprising:
extracting source data from at least one data source, and carrying out default processing on the source data to obtain data to be loaded;
dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
controlling each data block to be loaded to be loaded onto the purpose loading end to which it belongs.
3. The method according to claim 2, characterized in that extracting source data from at least one data source comprises:
extracting mixed source data from multiple data sources, the data sources including an operation system, a system module, or an application program.
4. The method according to claim 3, characterized in that carrying out default processing on the source data to obtain data to be loaded comprises:
dividing the mixed source data into multiple source data according to the data source identifiers they carry;
searching for the processing strategy for the data source corresponding to each source data;
carrying out default processing on each source data according to the processing strategy found, the default processing including data cleansing and data conversion.
5. The method according to claim 2, characterized in that dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends comprises:
searching a loading end routing table for the purpose loading end corresponding to at least one data attribute of the data to be loaded;
dividing the data to be loaded into multiple data blocks to be loaded that belong to different purpose loading ends.
6. The method according to claim 2, characterized in that, before dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends, the method further comprises:
parsing the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded, the data attribute including time information, data source information, or data service type.
7. The method according to claim 6, characterized in that parsing the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded comprises:
carrying out rule matching on the set content of the data to be loaded, so as to extract at least one data attribute from the set content of the data to be loaded.
8. The method according to claim 6, characterized in that parsing the set content of the data to be loaded to obtain at least one data attribute of the data to be loaded comprises:
splitting the set content in the data to be loaded, so as to obtain at least one data attribute of the data to be loaded.
9. The method according to claim 6, characterized in that the method further comprises:
receiving the set content pre-selected by the user through a setting interface.
10. The method according to claim 2, characterized in that, before carrying out default processing on the source data to obtain data to be loaded, the method further comprises:
generating a source data loading task corresponding to the extracted source data, and adding it to a first task queue;
extracting a pending source data loading task from the first task queue according to a set first processing order rule.
11. The method according to claim 10, characterized in that, after dividing the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends, the method further comprises:
for each data block to be loaded, generating a data block loading task corresponding to the data block to be loaded, and adding it to the second task queue preset for the purpose loading end corresponding to the data block to be loaded.
12. The method according to claim 11, characterized in that controlling each data block to be loaded to be loaded onto the purpose loading end to which it belongs comprises:
for each second task queue, extracting data block loading tasks from the second task queue according to a set second processing order rule;
executing the data block loading tasks so as to load the data blocks to be loaded onto the purpose loading ends to which they belong.
13. The method according to claim 2, characterized in that controlling the data block to be loaded to be loaded onto the purpose loading end to which it belongs comprises:
calling at least one loading thread of the purpose loading end to load the data to be loaded onto the purpose loading end.
14. The method according to claim 13, characterized in that, before controlling the data block to be loaded to be loaded onto the purpose loading end to which it belongs, the method further comprises:
determining whether the purpose loading end exists.
15. The method according to claim 14, characterized in that the method further comprises:
if the purpose loading end does not exist, creating the purpose loading end and creating a loading thread corresponding to the purpose loading end.
16. A data loading device, characterized by comprising:
a source data extraction module, configured to extract source data from at least one data source;
a source data processing module, configured to carry out default processing on the source data to obtain data to be loaded;
a data-to-be-loaded attribution module, configured to divide the data to be loaded, according to at least one data attribute of the data to be loaded, into multiple data blocks to be loaded that belong to different purpose loading ends;
a loading module, configured to control each data block to be loaded to be loaded onto the purpose loading end to which it belongs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610799125.4A CN107784039A (en) | 2016-08-31 | 2016-08-31 | A kind of data load method, apparatus and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107784039A (en) | 2018-03-09 |
Family
ID=61451891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610799125.4A Pending CN107784039A (en) | 2016-08-31 | 2016-08-31 | A kind of data load method, apparatus and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107784039A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008042A (en) * | 2019-03-28 | 2019-07-12 | 北京易华录信息技术股份有限公司 | A kind of algorithm Cascading Methods and system based on container |
CN110457348A (en) * | 2018-05-02 | 2019-11-15 | 北京三快在线科技有限公司 | A kind of data processing method and device |
CN112214453A (en) * | 2020-09-14 | 2021-01-12 | 上海微亿智造科技有限公司 | Large-scale industrial data compression storage method, system and medium |
CN112256775A (en) * | 2020-09-27 | 2021-01-22 | 建信金融科技有限责任公司 | Method and device for timed data loading of Oracle database |
CN115859370A (en) * | 2023-03-02 | 2023-03-28 | 萨科(深圳)科技有限公司 | Transaction data processing method and device, computer equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329676A (en) * | 2007-06-20 | 2008-12-24 | 华为技术有限公司 | Data paralleling abstracting method and apparatus and database system |
CN103077192A (en) * | 2012-12-24 | 2013-05-01 | 中标软件有限公司 | Data processing method and system thereof |
CN104731891A (en) * | 2015-03-17 | 2015-06-24 | 浪潮集团有限公司 | Method for mass data extraction in ETL |
CN104850638A (en) * | 2015-05-25 | 2015-08-19 | 广州精点计算机科技有限公司 | ETL process parallel decision method and apparatus |
CN104915414A (en) * | 2015-06-04 | 2015-09-16 | 北京京东尚科信息技术有限公司 | Data extraction method and device |
CN105760221A (en) * | 2016-02-02 | 2016-07-13 | 中博信息技术研究院有限公司 | Task dispatching system with distributed calculating frame |
CN105808778A (en) * | 2016-03-30 | 2016-07-27 | 中国银行股份有限公司 | Method and device for extracting, transforming and loading mass data |
Non-Patent Citations (1)
Title |
---|
曾一: "《大学计算机基础》", 30 September 2015 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180309 |