CN101882165B - Multithreading data processing method based on ETL (Extract Transform Loading) - Google Patents

Multithreading data processing method based on ETL (Extract Transform Loading)

Info

Publication number
CN101882165B
Authority
CN
China
Prior art keywords
data
error
send
formation
synchronization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010241787
Other languages
Chinese (zh)
Other versions
CN101882165A (en)
Inventor
周钢
陈俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CVIC Software Engineering Co Ltd
Original Assignee
CVIC Software Engineering Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CVIC Software Engineering Co Ltd
Priority to CN 201010241787
Publication of CN101882165A
Application granted
Publication of CN101882165B

Landscapes

  • Detection And Prevention Of Errors In Transmission (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multithreaded data processing method based on ETL (Extract, Transform, Load), comprising: dividing the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization; executing the extraction, sending and synchronization of data in parallel, each on its own independent thread; and persisting erroneous data. By using a multithreaded processing framework, the invention parallelizes the ETL data extraction process and greatly improves data throughput, extraction rate and hardware resource utilization. By handling the erroneous data produced during extraction, sending and synchronization, it also improves fault tolerance and reduces the probability that an error during data extraction paralyzes the whole ETL process.

Description

Multithreaded data processing method based on ETL
Technical field
The present invention relates to the technical field of data processing, and in particular to a multithreaded data processing method based on ETL.
Background art
A data warehouse is an integrated data platform that is now very widely used by enterprises. It is an independent data environment. Compared with relational databases, data warehouse technology has no strict mathematical foundation and is oriented more toward practical engineering applications; it relies on a data extraction mechanism to import data from transaction processing environments, external data sources and offline data storage media into the data warehouse. Data extraction, transformation and loading (ETL, Extraction-Transformation-Loading) is therefore a very important part of a data warehouse, an essential step that links what comes before with what follows.
ETL is responsible for extracting data from distributed, heterogeneous data sources, such as relational data and flat data files, into a temporary intermediate layer, then cleaning, transforming and integrating it, and finally loading it into a data warehouse or data mart, where it becomes the basis for online analytical processing and data mining.
When prior-art ETL tools process data, the data extraction process uses a single-threaded mechanism, which leads to low hardware resource utilization, low data throughput and a low extraction rate.
In addition, if an error occurs during data extraction, the whole ETL process is paralyzed.
Summary of the invention
In view of this, the present invention provides a multithreaded data processing method based on ETL to solve the prior-art problems of low hardware resource utilization, low data throughput and low extraction rate. The technical solution is as follows:
A multithreaded data processing method based on ETL, comprising: dividing the data extraction process of ETL into three distinct stages, namely extraction, sending and synchronization, and using separate, independent threads to execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to rules, encapsulates the data and stores it in the to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to the error data message queue;
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue; when the queue contains data to be sent, the data is sent to the to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue;
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue; when the queue contains data to be synchronized, the data is parsed and the target table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue;
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely extraction error, sending error or synchronization error.
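The four steps above can be sketched as four threads connected by three in-memory queues. This is a minimal illustration only, not the patented implementation: the names `stage`, `run_pipeline` and `STOP`, the `send` and `sync` callables, and the dictionary standing in for persistent storage are all assumptions introduced here, and a shutdown sentinel is used so the sketch terminates, whereas the method describes threads that poll continuously.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel so the sketch can shut down cleanly


def stage(name, inbox, outbox, errors, work):
    """Poll `inbox`, apply `work`, forward the result; route failures to `errors`."""
    while True:
        item = inbox.get()
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)  # pass the shutdown signal downstream
            break
        try:
            result = work(item)
            if outbox is not None:
                outbox.put(result)
        except Exception as exc:
            errors.put((name, item, str(exc)))


def run_pipeline(rows, send, sync):
    """Run extraction, sending, synchronization and error persistence in parallel."""
    to_send, to_sync, errors = queue.Queue(), queue.Queue(), queue.Queue()
    saved = {"extract": [], "send": [], "sync": []}  # stand-in for persistent storage

    def extract():  # step 10
        for row in rows:
            try:
                to_send.put({"payload": row["value"]})  # "encapsulate" the row
            except Exception as exc:
                errors.put(("extract", row, str(exc)))
        to_send.put(STOP)

    def persist():  # step 13: save erroneous data grouped by cause
        while True:
            item = errors.get()
            if item is STOP:
                break
            cause, data, reason = item
            saved[cause].append((data, reason))

    workers = [
        threading.Thread(target=extract),
        threading.Thread(target=stage, args=("send", to_send, to_sync, errors, send)),  # step 11
        threading.Thread(target=stage, args=("sync", to_sync, None, errors, sync)),     # step 12
    ]
    persister = threading.Thread(target=persist)
    persister.start()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    errors.put(STOP)  # all producers are done, so the persistence thread may stop
    persister.join()
    return saved
```

Because each stage only talks to its neighbours through a queue, a failure in one stage ends up in the error queue instead of stopping the other threads, which is the fault-tolerance property the method aims for.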
Preferably, in the above method, extracting source table data in real time according to rules in step 10 is specifically:
extracting source table data in real time through Structured Query Language (SQL) batch query statements.
Preferably, in the above method, sending the data to the to-be-synchronized message queue when the queue contains data to be sent in step 11 is specifically:
sending the data to be sent to the to-be-synchronized message queue through the HTTP protocol or the Transmission Control Protocol (TCP).
Preferably, in the above method, parsing the data and synchronizing the target table data when the queue contains data to be synchronized in step 12 is specifically:
parsing the data to be synchronized and synchronizing the target table data in SQL batch mode.
Preferably, in the above method, the specific operations for synchronizing the target table data include:
inserting, updating and deleting data.
Preferably, in the above method, after the erroneous data is saved in step 13 according to the cause of the error, namely extraction error, sending error or synchronization error, the method further comprises:
synchronizing the erroneous data according to a preset data synchronization mode.
As can be seen from the above technical solution, the present invention divides the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization, and uses separate, independent threads to execute the extraction, sending and synchronization of data and the handling of erroneous data in parallel. This significantly improves data throughput, extraction rate and hardware resource utilization. By handling the erroneous data produced during extraction, sending and synchronization, it also improves fault tolerance and reduces the probability that an error during data extraction paralyzes the whole ETL process.
Description of drawings
To illustrate the technical solution of the present invention more clearly, the drawing used in the description is briefly introduced below. Obviously, the drawing described below shows only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from it without creative effort.
Fig. 1 is a schematic diagram of the framework of the ETL-based multithreaded data processing method provided by the present invention.
Detailed description of the embodiments
An embodiment of the present invention discloses a multithreaded data processing method based on ETL, comprising: dividing the data extraction process of ETL into three distinct stages, namely extraction, sending and synchronization, and using separate, independent threads to execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to rules, encapsulates the data and stores it in the to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to the error data message queue;
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue; when the queue contains data to be sent, the data is sent to the to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue;
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue; when the queue contains data to be synchronized, the data is parsed and the target table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue;
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely extraction error, sending error or synchronization error.
The present invention divides the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization, and uses separate, independent threads to execute the extraction, sending and synchronization of data and the saving of erroneous data in parallel. This significantly improves data throughput, extraction rate and hardware resource utilization. By handling the erroneous data produced during extraction, sending and synchronization, it also improves fault tolerance and reduces the probability that an error during data extraction paralyzes the whole ETL process.
To help those skilled in the art better understand and implement the present invention, the technical solution of the embodiment is described in detail below with reference to the accompanying drawing.
Fig. 1 is a schematic diagram of the framework of the ETL-based multithreaded data processing method provided by the present invention. The present invention parallelizes the ETL data extraction process by means of the data extraction unit, data sending unit, data synchronization unit and error data persistence unit in the framework. The detailed process is as follows:
The data extraction process of ETL is divided into three distinct stages, namely extraction, sending and synchronization, and separate, independent threads execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to rules, encapsulates the data and stores it in the to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to the error data message queue.
The data extraction unit repeatedly reads the valid data in the source data table, encapsulates it into a data packet, and sends the packet to the to-be-sent message queue. If an error occurs during encapsulation, the packet is stored in the error data message queue instead. Specifically, SQL batch query statements can be used to read the valid data in the source data table.
The data extraction unit guarantees that extraction is real-time: extracted data is encapsulated and saved to the message queue immediately, and the extraction process is not affected by the sending and synchronization processes.
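One polling pass of the extraction unit might look like the following sketch. It assumes a hypothetical SQLite source table `source_table(id, name, amount)`, JSON as the packet encoding, and an incremental "rule" of `id > last_id`; none of these details come from the patent, which only specifies rule-based extraction with SQL batch queries.

```python
import json
import queue
import sqlite3


def extract_once(conn, to_send, errors, last_id=0, batch_size=2):
    """Read rows newer than `last_id` in batches and enqueue one packet per row.

    Returns the highest row id seen, to be passed in on the next polling pass.
    """
    cursor = conn.execute(
        "SELECT id, name, amount FROM source_table WHERE id > ? ORDER BY id",
        (last_id,),
    )
    while True:
        batch = cursor.fetchmany(batch_size)  # batch read rather than row by row
        if not batch:
            return last_id
        for row_id, name, amount in batch:
            last_id = row_id
            try:
                # Encapsulation: float() fails on NULL or non-numeric amounts.
                packet = json.dumps({"id": row_id, "name": name, "amount": float(amount)})
                to_send.put(packet)
            except (TypeError, ValueError) as exc:
                # The bad row goes to the error queue; extraction carries on.
                errors.put(("extract", (row_id, name, amount), str(exc)))
```

In a threaded deployment the extraction thread would call this function in a loop; with `sqlite3` the connection should be opened inside that thread, since connections are not shared across threads by default.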
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue; when the queue contains data to be sent, the data is sent to the to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue.
The data sending unit repeatedly reads the data in the to-be-sent message queue and transfers it to the to-be-synchronized message queue. If an error occurs during sending, the erroneous data is stored in the error data message queue. Specifically, the data to be sent can be sent to the to-be-synchronized message queue through the HTTP protocol or the TCP protocol.
The data sending unit only needs to move data from the to-be-sent message queue to the to-be-synchronized message queue, which guarantees that data transfer is real-time.
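The TCP variant of the sending step can be sketched as below. The framing (a 4-byte big-endian length prefix), the one-connection-per-packet design, and the names `send_loop`, `receive` and `STOP` are assumptions made for illustration; the patent only states that HTTP or TCP may carry the data to the to-be-synchronized queue.

```python
import queue
import socket
import struct
import threading

STOP = None  # hypothetical end-of-stream marker for the sketch


def send_loop(to_send, errors, address):
    """Sending thread body: poll the to-be-sent queue and push packets over TCP."""
    while True:
        packet = to_send.get()
        if packet is STOP:
            break
        try:
            with socket.create_connection(address, timeout=2) as conn:
                data = packet.encode("utf-8")
                conn.sendall(struct.pack("!I", len(data)) + data)  # length-prefixed frame
        except OSError as exc:
            # A failed send is recorded, and the loop moves on to the next packet.
            errors.put(("send", packet, str(exc)))


def _read_exact(conn, size):
    """Read exactly `size` bytes, since recv() may return fewer."""
    buffer = b""
    while len(buffer) < size:
        chunk = conn.recv(size - len(buffer))
        if not chunk:
            raise ConnectionError("peer closed early")
        buffer += chunk
    return buffer


def receive(server, to_sync, count):
    """Target side: accept `count` packets and place them on the to-be-synchronized queue."""
    for _ in range(count):
        conn, _address = server.accept()
        with conn:
            (size,) = struct.unpack("!I", _read_exact(conn, 4))
            to_sync.put(_read_exact(conn, size).decode("utf-8"))
```

Opening one connection per packet keeps the sketch short; a long-lived connection with reconnect logic would be the more usual choice when throughput matters.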
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue; when the queue contains data to be synchronized, the data is parsed and the target table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue.
The data synchronization unit repeatedly reads the data in the to-be-synchronized message queue. When the queue contains data to be synchronized, the unit parses the data and synchronizes the target table data; if an error occurs during synchronization, the erroneous data is sent to the error data message queue. Specifically, the data to be synchronized can be parsed and the target table data synchronized in SQL batch mode. The specific operations for synchronizing the target table data include inserting, updating and deleting data.
The data synchronization unit synchronizes the data to be synchronized into the data tables of the target database, which guarantees that data synchronization is real-time.
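The parse-then-batch-apply behaviour of the synchronization unit can be illustrated as follows. The packet layout (JSON with `op`, `id` and `name` fields), the table `target_table` and the function `sync_batch` are hypothetical; the sketch applies consecutive packets that share an operation as one `executemany` batch so that the original order of operations is preserved.

```python
import json
import queue
import sqlite3
from itertools import groupby

# Hypothetical statements for the three operations named in the method.
SQL = {
    "insert": "INSERT INTO target_table (id, name) VALUES (:id, :name)",
    "update": "UPDATE target_table SET name = :name WHERE id = :id",
    "delete": "DELETE FROM target_table WHERE id = :id",
}


def sync_batch(conn, packets, errors):
    """Parse packets and apply insert/update/delete to the target table in batches."""
    parsed = []
    for packet in packets:
        try:
            message = json.loads(packet)
            operation = message["op"]
            if operation not in SQL:
                raise KeyError(operation)
            parsed.append((operation, {"id": message["id"], "name": message.get("name")}))
        except (ValueError, KeyError, TypeError) as exc:
            errors.put(("sync", packet, str(exc)))  # unparseable packet

    # Group consecutive packets with the same operation into one batch.
    for operation, run in groupby(parsed, key=lambda item: item[0]):
        params = [values for _, values in run]
        try:
            with conn:  # commit the batch, or roll it back if any row fails
                conn.executemany(SQL[operation], params)
        except sqlite3.Error as exc:
            for values in params:
                errors.put(("sync", {"op": operation, **values}, str(exc)))
```

Rolling back a whole batch on failure is a design choice of this sketch: it sends every row of the failed batch to the error queue, leaving the later re-synchronization step to sort out which rows were actually bad.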
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely extraction error, sending error or synchronization error.
The error data persistence unit repeatedly reads the data in the error data message queue and saves it separately according to the type of error. Specifically, the data can be saved as files or in a database. After the erroneous data is saved, it can also be synchronized according to a preset data synchronization mode, for example manually.
The error data persistence unit allows erroneous data that the multithreaded process failed to synchronize to be synchronized by other means, which ensures the safety of the ETL data conversion and the integrity of the data.
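File-based persistence of erroneous data, grouped by cause, could be sketched like this. The one-file-per-cause layout, the JSON Lines format, and the names `persist_errors`, `load_for_resync` and `STOP` are assumptions for illustration; the patent states only that erroneous data is saved by cause, as files or in a database, and may later be re-synchronized by a preset method.

```python
import json
import queue
from pathlib import Path

STOP = None  # hypothetical shutdown marker for the sketch


def persist_errors(errors, out_dir):
    """Persistence thread body: append each error to a file named after its cause."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    while True:
        item = errors.get()
        if item is STOP:
            break
        cause, data, reason = item  # cause is "extract", "send" or "sync"
        path = out_dir / f"{cause}_errors.jsonl"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps({"data": data, "reason": reason}) + "\n")


def load_for_resync(out_dir, cause):
    """Read back the saved data for one cause, e.g. for manual re-synchronization."""
    path = Path(out_dir) / f"{cause}_errors.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line)["data"] for line in handle]
```

Keeping each cause in its own file means an operator can, for example, replay only the sending failures after a network outage without touching rows that failed for other reasons.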
As can be seen from the above embodiment, the embodiment of the present invention uses a multithreaded processing framework to parallelize the ETL data extraction process. Specifically, the ETL data extraction process is divided into three distinct stages, namely extraction, sending and synchronization, and separate, independent threads execute the extraction, sending and synchronization of data and the handling of erroneous data in parallel. This significantly improves data throughput, extraction rate and hardware resource utilization. By handling the erroneous data produced during extraction, sending and synchronization, it also improves fault tolerance and reduces the probability that an error during data extraction paralyzes the whole ETL process.
From the description of the above method embodiment, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks or optical discs.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A multithreaded data processing method based on ETL, characterized by comprising: dividing the data extraction process of ETL into three distinct stages, namely extraction, sending and synchronization, and using separate, independent threads to execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to rules, encapsulates the data and stores it in the to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to the error data message queue;
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue; when the queue contains data to be sent, the data is sent to the to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue;
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue; when the queue contains data to be synchronized, the data is parsed and the target table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue;
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely extraction error, sending error or synchronization error.
2. The method according to claim 1, characterized in that, in step 10, extracting source table data in real time according to rules is specifically:
extracting source table data in real time through SQL batch query statements.
3. The method according to claim 1, characterized in that, in step 11,
sending the data to the to-be-synchronized message queue when the queue contains data to be sent is specifically:
sending the data to be sent to the to-be-synchronized message queue through the HTTP protocol or the TCP protocol.
4. The method according to claim 1, characterized in that, in step 12,
parsing the data and synchronizing the target table data when the queue contains data to be synchronized is specifically:
parsing the data to be synchronized and synchronizing the target table data in SQL batch mode.
5. The method according to claim 4, characterized in that the specific operations for synchronizing the target table data include:
inserting, updating and deleting data.
6. The method according to claim 1, characterized in that, in step 13, after the erroneous data is saved according to the cause of the error, namely extraction error, sending error or synchronization error, the method further comprises:
synchronizing the erroneous data according to a preset data synchronization mode.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010241787 CN101882165B (en) 2010-08-02 2010-08-02 Multithreading data processing method based on ETL (Extract Transform Loading)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010241787 CN101882165B (en) 2010-08-02 2010-08-02 Multithreading data processing method based on ETL (Extract Transform Loading)

Publications (2)

Publication Number Publication Date
CN101882165A CN101882165A (en) 2010-11-10
CN101882165B true CN101882165B (en) 2012-06-27

Family

ID=43054179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010241787 Active CN101882165B (en) 2010-08-02 2010-08-02 Multithreading data processing method based on ETL (Extract Transform Loading)

Country Status (1)

Country Link
CN (1) CN101882165B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591936A (en) * 2011-12-27 2012-07-18 四川九洲电器集团有限责任公司 Player data processing method
CN102663020A (en) * 2012-03-21 2012-09-12 北京英孚斯迈特信息技术有限公司 CDC data distribution method and device thereof
CN103096168B (en) * 2012-12-25 2016-03-02 四川九洲电器集团有限责任公司 A kind of data communication method for parallel processing based on IPTV set top box
CN104182502B (en) * 2014-08-18 2017-10-27 浪潮(北京)电子信息产业有限公司 A kind of data pick-up method and device
CN104317843B (en) * 2014-10-11 2017-08-25 上海瀚之友信息技术服务有限公司 A kind of data syn-chronization ETL system
CN104391929A (en) * 2014-11-21 2015-03-04 浪潮通用软件有限公司 Data flow transmitting method in ETL (extract, transform and load)
CN105094974B (en) * 2015-08-14 2019-12-20 上海斐讯数据通信技术有限公司 Burst signal processing method and system
CN105550319B (en) * 2015-12-12 2019-06-25 天津南大通用数据技术股份有限公司 The optimization method of persistence under a kind of cluster Consistency service high concurrent
CN106777933B (en) * 2016-12-02 2019-05-10 郑州云海信息技术有限公司 A kind of collecting method, apparatus and system
CN108062407A (en) * 2017-12-28 2018-05-22 成都飞机工业(集团)有限责任公司 A kind of project visualizes management and control data pick-up method
CN109710624B (en) * 2018-12-19 2021-06-11 泰康保险集团股份有限公司 Data processing method, device, medium and electronic equipment
CN110472102A (en) * 2019-08-22 2019-11-19 北京锐安科技有限公司 A kind of data processing method, device, equipment and storage medium
CN111339113B (en) * 2020-02-28 2023-04-28 湖南九鼎科技(集团)有限公司 ETL technology-based formula direct method and system
CN111552730B (en) * 2020-04-28 2024-01-26 杭州数梦工场科技有限公司 Data distribution method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1897025A (en) * 2006-04-27 2007-01-17 南京联创科技股份有限公司 Parallel ETL technology of multi-thread working pack in mass data process
CN101105793A (en) * 2006-07-11 2008-01-16 阿里巴巴公司 Data processing method and system of data library

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2079020B1 (en) * 2008-01-03 2013-03-20 Accenture Global Services Limited System amd method for automating ETL applications

Also Published As

Publication number Publication date
CN101882165A (en) 2010-11-10

Similar Documents

Publication Publication Date Title
CN101882165B (en) Multithreading data processing method based on ETL (Extract Transform Loading)
US11327945B2 (en) Method and device for storing high-concurrency data
CN109284334B (en) Real-time database synchronization method and device, electronic equipment and storage medium
CN102970158B (en) Log storage and processing method and log server
CN102104544B (en) Order preserving method for fragmented message flow in IP (Internet Protocol) tunnel of multi-nuclear processor with accelerated hardware
CN104506496B (en) The method of near-realtime data increment distribution based on Oracle Streams technologies
CN107105009A (en) Job scheduling method and device based on Kubernetes system docking workflow engines
CN104731956A (en) Method and system for synchronizing data and related database
CN104699723A (en) Data exchange adapter and system and method for synchronizing data among heterogeneous systems
CN110008194A (en) A kind of rapid file acquisition methods based on block chain and interspace file system IPFS
CN107273542B (en) High-concurrency data synchronization method and system
CN104809199A (en) Database synchronization method and device
CN105721526B (en) The synchronous method and device of a kind of terminal, server file
CN104378234A (en) Cross-data-center data transmission processing method and system
CN104683472A (en) Data transmission method capable of supporting large data volume
CN102223368A (en) System and method capable of realizing operation identification during monitoring of remote desktop protocol (RDP)
CN104965810B (en) The method and device of data message is quickly handled under multiple-core mode
CN106776072A (en) Information push method and system
CN113242244A (en) Data transmission method, device and system
CN109165225A (en) A kind of kudu data import system and method based on bytestream format
CN104484174B (en) The treating method and apparatus of the compressed file of RAR forms
CN104239537A (en) Method for realizing generating and processing flow for large-data pre-processing text data
CN110955645B (en) Big data integration processing method and system
CN108845794A (en) A kind of streaming operation frame, method, readable medium and storage control
CN105592097B (en) A kind of client-based asynchronous interactive information approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Free format text: FORMER OWNER: CVIC SOFTWARE ENGINEERING CO., LTD.

Effective date: 20131227

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20131227

Address after: 250014 Shandong city of Ji'nan Province - Shandong Lixia District Road No. 41-1

Patentee after: CVIC Software Engineering Co., Ltd.

Address before: 250014 No. 41-1 Shandong Road, Shandong, Ji'nan

Patentee before: Shandong CVIC Software Engineering Co., Ltd.

Patentee before: CVIC Software Engineering Co., Ltd.