CN102508919B - Data processing method and system - Google Patents

Publication number
CN102508919B
Authority
CN
China
Prior art keywords
data
rule
task
etl
task list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110370530.1A
Other languages
Chinese (zh)
Other versions
CN102508919A (en)
Inventor
钟国南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
SUNRISE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-11-18
Filing date
2011-11-18
Publication date
2014-10-29
Application filed by SUNRISE TECHNOLOGY Co Ltd
Priority to CN201110370530.1A
Publication of CN102508919A
Application granted
Publication of CN102508919B
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing and discloses a data processing method and system. The method comprises: encapsulating ETL (Extraction-Transformation-Loading) rules as dynamic library files and registering the information of each dynamic library file in a task table of a back-end database; and scanning the task table and performing ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule. The invention enables the processing of large data volumes and multiple data sources in both non-visual and visual systems, thereby improving processing efficiency.

Description

Data processing method and system
Technical field
The present invention relates to the technical field of data processing, and in particular to a data processing method and system.
Background
ETL is the process of data extraction (Extract), transformation (Transform), and loading (Load). It is an important step in building a data warehouse.
Typically, a user's data sources are distributed across various subsystems and nodes. ETL transfers the data on these subsystems to a server, either automatically or under manual control, where it is extracted, cleaned, and transformed, and then loaded into the data warehouse. Because existing business data sources are numerous, the keys to ETL processing are ensuring data consistency, truly understanding the business meaning of the data, integrating data across multiple platforms, data sources, and systems, improving data quality as far as possible, and accommodating constantly changing business requirements.
Existing ETL tools typically use one of the following two processing modes:
(1) Using a Windows graphical interface
In a visual interface, each flow and operation step (for example, data source, transformation rule, and loading into the database) is configured by clicking through the interface and recorded in a file. In the background, a parser and a scheduler are started to parse and schedule this file. The whole process only requires the ETL developer to be familiar with the development flow and with databases and does not require programming skills, but its application is limited on systems without a graphical interface.
(2) Script processing
Each step of an ETL task, such as data source, transformation rule, and loading into the database, is described in a different script, and these scripts are combined into a script file scheme. After background scheduling, a script parser parses them. This mode requires the ETL developer to be able to write scripts, and its processing efficiency is low.
Summary of the invention
The invention provides a data processing method and system that can process large data volumes and multiple data sources in both non-visual and visual systems and improve processing efficiency.
To this end, the embodiments of the present invention provide the following technical solutions:
A data processing method, comprising:
encapsulating an ETL rule as a dynamic library file, and registering the information of the dynamic library file in a task table of a back-end database;
scanning the task table, and performing ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule.
Preferably, the ETL rule and the information of the dynamic library file are set and published by a user.
Optionally, the ETL rule comprises any one or more of the following: a data fetching rule, a data splitting rule, a data conversion rule, a data merging rule, a data sorting rule, a data aggregation rule, a network data collection rule, a data loading rule, and a data configuration rule.
Optionally, the information of the dynamic library file comprises any one or more of the following for each task: start time, start period, redo flag, task type flag, task description, task identifier, whether the task is enabled, and whether the task has subtasks.
Preferably, performing ETL processing of data for each task in the task table according to its corresponding ETL rule comprises:
extracting source data from a data source for each task in the task table according to its corresponding ETL rule;
converting the obtained source data into the target data required by the system;
storing the target data in a target database.
Preferably, the tasks in the task table are scheduled through a background multi-process concurrency mechanism.
A data processing system, comprising:
a rule encapsulation unit, configured to encapsulate an ETL rule as a dynamic library file and register the information of the dynamic library file in a task table of a back-end database;
a scheduling unit, configured to scan the task table and perform ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule.
Preferably, the system further comprises:
a rule setting unit, configured to obtain the ETL rule and the information of the dynamic library file that are set and published by the user.
Preferably, the scheduling unit comprises:
an extraction subunit, configured to extract source data from a production database for each task in the task table according to its corresponding ETL rule;
a conversion subunit, configured to convert the source data extracted by the extraction subunit into the target data required by the system;
a storing subunit, configured to store the target data converted by the conversion subunit in a target database.
Preferably, the scheduling unit is specifically configured to schedule the tasks in the task table through a multi-process concurrency mechanism.
The data processing method and system provided by the invention encapsulate ETL rules as dynamic library files and register the information of each dynamic library file in a task table of a back-end database; the task table is scanned, and ETL processing of data is performed for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule. Users do not need programming skills, and large data volumes and multiple data sources can be processed. Processing efficiency is high, and the approach is not affected by the system environment: it can be used in both non-visual and visual systems.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the accompanying drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of the data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the data processing system according to an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.
The data processing method and system provided by the invention encapsulate ETL rules as dynamic library files and register the information of each dynamic library file in a task table of a back-end database; the task table is scanned, and ETL processing of data is performed for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule. Thus users do not need programming skills, and large data volumes and multiple data sources can be processed. Processing efficiency is high, and the approach is not affected by the system environment: it can be used in both non-visual and visual systems.
Fig. 1 is a flowchart of the data processing method according to an embodiment of the present invention, which comprises the following steps:
Step 101: encapsulate the ETL rule as a dynamic library file, and register the information of the dynamic library file in a task table of a back-end database.
In practical applications, the ETL rules and the information of the dynamic library files may be defined by the user and published to a server. The ETL rules may include the various rules used in ETL applications, such as a data fetching rule, a data splitting rule, a data conversion rule, a data merging rule, a data sorting rule, a data aggregation rule, a network data collection rule, a data loading rule, a data configuration rule, and so on.
In step 101, the server may encapsulate these ETL rules as dynamic library files and register the information of each dynamic library file in the task table of the back-end database. The information of the dynamic library file may include any one or more of the following: start time, start period, redo flag, task type flag, task description, task identifier, whether the task is enabled, and whether the task has subtasks. This information may be defined by the user and published to the server at the same time as the ETL rules.
The start time describes the point in time at which the task is triggered for scheduling. The task description improves readability by explaining what the task does. The start period indicates how often the task is started, and the task identifier uniquely identifies the task.
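The registration in step 101 can be pictured with a minimal sketch. It assumes a SQLite database and a hypothetical table named `etl_task`; the column names, the `library_path` column, and the example path are illustrative choices that mirror the fields listed above and are not taken from the patent.

```python
import sqlite3

# Hypothetical schema: one row per task, one rule library per row.
# Column names are assumptions that mirror the fields listed in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS etl_task (
    task_id      TEXT PRIMARY KEY,   -- unique task identifier
    description  TEXT,               -- what the task does
    task_type    TEXT,               -- task type flag
    library_path TEXT NOT NULL,      -- assumed: location of the rule library
    start_time   TEXT NOT NULL,      -- when the task first becomes due
    period_secs  INTEGER NOT NULL,   -- start period in seconds
    redo         INTEGER DEFAULT 0,  -- redo flag
    enabled      INTEGER DEFAULT 1,  -- whether the task is enabled
    has_subtasks INTEGER DEFAULT 0   -- whether the task has subtasks
)
"""


def register_task(conn, task_id, library_path, start_time, period_secs,
                  description="", task_type="", redo=False,
                  enabled=True, has_subtasks=False):
    """Register one rule library as one task row."""
    conn.execute(
        "INSERT INTO etl_task (task_id, description, task_type, library_path,"
        " start_time, period_secs, redo, enabled, has_subtasks)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (task_id, description, task_type, library_path, start_time,
         period_secs, int(redo), int(enabled), int(has_subtasks)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
register_task(conn, "daily_report", "/opt/etl/rules/libdaily_report.so",
              "2011-11-18T02:00:00", 86400,
              description="Summarize daily business records")
```

Making the task identifier the primary key is one way to enforce the uniqueness the text requires: registering a second task with the same identifier fails instead of silently overwriting the first.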
To facilitate development by ETL developers and the unified management of tasks, a unified ETL processing API (Application Programming Interface) may also be provided. The ETL processing API may also be cross-platform, so that ETL developers can carry out ETL development on different system platforms, for example by using tool macro APIs such as SRC_TABLE and DES_TABLE, where SRC_TABLE is a macro API for operating on source data and DES_TABLE is a macro API for operating on target data.
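As a rough analogy for a rule packaged as a loadable file behind a unified interface, the sketch below loads a rule at run time from a file path and calls a single agreed entry point. It uses a Python module as a stand-in for a dynamic library; the entry point `run`, its parameters `src_table` and `des_table` (which only echo the macro names above), and the file name are all hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source of one hypothetical rule. Every rule exposes the same entry point,
# run(src_table, des_table); this signature is an assumption.
RULE_SOURCE = '''
def run(src_table, des_table):
    """Copy rows with a positive amount from the source to the target."""
    for row in src_table:
        if row["amount"] > 0:
            des_table.append(row)
    return len(des_table)
'''


def load_rule(path):
    """Load a rule file at run time and return its uniform entry point."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run


with tempfile.TemporaryDirectory() as tmp:
    rule_path = Path(tmp) / "filter_positive.py"
    rule_path.write_text(RULE_SOURCE)
    run = load_rule(rule_path)  # the caller only needs the registered path
    source = [{"amount": 5}, {"amount": -1}, {"amount": 2}]
    target = []
    loaded = run(source, target)
```

Because the caller depends only on the path and the entry point, new rules can be added without changing the code that invokes them. For a compiled shared library, Python's `ctypes.CDLL` would play the role that `importlib` plays here.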
Step 102: scan the task table, and perform ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule.
The scanning process and the process of encapsulating and registering ETL rules may be performed on different platforms. For example, a scheduler scans the task table (periodically or at set times) and schedules the tasks in the task table according to the information of the dynamic library files. Specifically, the scheduler may schedule the tasks in the task table through a background multi-process concurrency mechanism.
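One possible shape for such a scheduler is sketched below. It assumes each task is a dictionary with hypothetical keys `enabled`, `start_time`, `period_secs`, and `last_run` (the last of these is not a field named in the text), and it uses Python's `multiprocessing.Pool` as one way to run due tasks in separate processes.

```python
from datetime import datetime, timedelta
from multiprocessing import Pool


def next_start(task):
    """First run happens at start_time, later runs one period after the last."""
    if task["last_run"] is None:
        return task["start_time"]
    return task["last_run"] + timedelta(seconds=task["period_secs"])


def scan(task_table, now):
    """Return the enabled tasks whose next start time has been reached."""
    return [t for t in task_table if t["enabled"] and now >= next_start(t)]


def run_task(task):
    # Placeholder for loading the task's rule and running its ETL processing.
    return task["task_id"]


def dispatch(due, workers=4):
    """Run each due task in a worker process and collect the results."""
    if not due:
        return []
    with Pool(processes=min(workers, len(due))) as pool:
        return pool.map(run_task, due)


two_am = datetime(2011, 11, 18, 2, 0)
task_table = [
    {"task_id": "a", "enabled": True, "start_time": two_am,
     "period_secs": 3600, "last_run": None},
    {"task_id": "b", "enabled": True, "start_time": two_am,
     "period_secs": 3600, "last_run": datetime(2011, 11, 18, 2, 30)},
    {"task_id": "c", "enabled": True, "start_time": two_am,
     "period_secs": 3600, "last_run": datetime(2011, 11, 18, 1, 0)},
    {"task_id": "d", "enabled": False, "start_time": two_am,
     "period_secs": 3600, "last_run": None},
]
due_ids = [t["task_id"] for t in scan(task_table, datetime(2011, 11, 18, 3, 0))]
```

In a script, `dispatch(scan(task_table, now))` would normally be called under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that spawn rather than fork worker processes.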
The scheduler processes each task in the task table roughly as follows:
For each task in the task table, the scheduler extracts source data from a data source (such as a production database) according to the task's corresponding ETL rule, converts the obtained source data into the target data required by the system, and stores the target data in a target database.
The above process may further include sorting and aggregating the converted target data and then storing the aggregated data in the target database.
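The extract, convert, sort, aggregate, and store sequence can be sketched end to end as follows. The example assumes two in-memory SQLite databases standing in for the data source and the target database; the table names `records` and `region_fee`, the columns, and the sample rows are all invented for illustration.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter


def extract(source_db):
    """Extract the source rows from the data source."""
    return source_db.execute(
        "SELECT region, fee_cents FROM records").fetchall()


def convert(rows):
    """Convert source rows to the target format (normalized region codes)."""
    return [(region.strip().upper(), fee) for region, fee in rows]


def sort_and_aggregate(rows):
    """Sort by region, then sum the fees within each region."""
    ordered = sorted(rows, key=itemgetter(0))
    return [(region, sum(fee for _, fee in group))
            for region, group in groupby(ordered, key=itemgetter(0))]


def store(target_db, rows):
    """Store the aggregated rows in the target database."""
    target_db.executemany(
        "INSERT INTO region_fee (region, total_cents) VALUES (?, ?)", rows)
    target_db.commit()


source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE records (region TEXT, fee_cents INTEGER)")
source_db.executemany("INSERT INTO records VALUES (?, ?)",
                      [("gz ", 120), ("SZ", 300), ("GZ", 80), (" sz", 50)])

target_db = sqlite3.connect(":memory:")
target_db.execute("CREATE TABLE region_fee (region TEXT, total_cents INTEGER)")

store(target_db, sort_and_aggregate(convert(extract(source_db))))
result = target_db.execute(
    "SELECT region, total_cents FROM region_fee ORDER BY region").fetchall()
```

Keeping each stage as a separate function mirrors the way the text assigns extraction, conversion, and storage to separate steps, and lets the optional sort-and-aggregate stage be inserted or omitted per task.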
For the convenience of developers, a series of APIs may also be provided. These APIs may be defined by developers, and the scheduler calls them to implement the above processing. For example, the following APIs may be provided:
1. Data fetching API, for extracting source data, including: network fetching API, database fetching API, Excel fetching API, Access fetching API, etc.
2. Merge API, for merging data.
3. Data splitting API, for splitting data.
4. Conversion API, for converting data, for example converting a vertical table into a horizontal table. Macro APIs such as SRC_TABLE and DES_TABLE may be used for this processing.
5. Aggregation API, for aggregating data. For example, this type of API can aggregate by index, by row, or by column.
6. Index API, for searching large data volumes. It uses a row index technique, in which row numbers are placed in shared memory as the index.
7. Log interface, for logging the calls made to each interface, to support maintenance and to show the current state of the system to the user.
Of course, each of the above APIs may be selected by the user according to actual needs, and the embodiments of the present invention are not limited in this respect.
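Two of the APIs listed above, the vertical-to-horizontal table conversion and the row index, can be illustrated with a short sketch. The function names and sample data are hypothetical, and the index here is an ordinary in-process dictionary rather than the shared-memory structure the text describes.

```python
def vertical_to_horizontal(rows, key, column, value):
    """Pivot long-format rows into one wide row per key value."""
    wide = {}
    for row in rows:
        target = wide.setdefault(row[key], {key: row[key]})
        target[row[column]] = row[value]
    return list(wide.values())


def build_row_index(rows, key):
    """Map each key value to the row numbers where it occurs."""
    index = {}
    for line_no, row in enumerate(rows):
        index.setdefault(row[key], []).append(line_no)
    return index


long_rows = [
    {"user": "u1", "metric": "calls", "count": 12},
    {"user": "u1", "metric": "sms", "count": 30},
    {"user": "u2", "metric": "calls", "count": 7},
]
wide_rows = vertical_to_horizontal(long_rows, "user", "metric", "count")
index = build_row_index(long_rows, "user")
rows_for_u1 = [long_rows[i] for i in index["u1"]]
```

Storing row numbers rather than the rows themselves keeps the index small, so a lookup by key reads only the matching rows instead of scanning the whole data set.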
It can be seen that the data processing method provided by the invention encapsulates ETL rules as dynamic library files and registers the information of each dynamic library file in a task table of a back-end database; the task table is scanned, and ETL processing of data is performed for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule. Thus users do not need programming skills, and large data volumes and multiple data sources can be processed. Processing efficiency is high, and the approach is not affected by the system environment: it can be used in both non-visual and visual systems, for example on system platforms such as Linux, AIX, Solaris, and Windows.
Correspondingly, an embodiment of the present invention also provides a data processing system; Fig. 2 is a schematic structural diagram of this system.
In this embodiment, the system comprises:
A rule encapsulation unit 201, configured to encapsulate an ETL rule as a dynamic library file and register the information of the dynamic library file in a task table of a back-end database.
A scheduling unit 202, configured to scan the task table and perform ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule.
In practical applications, the ETL rules and the information of the dynamic library files may be defined by the user and published to a server. The ETL rules may include the various rules used in ETL applications, such as a data fetching rule, a data splitting rule, a data conversion rule, a data merging rule, a data sorting rule, a data aggregation rule, a network data collection rule, a data loading rule, a data configuration rule, and so on.
To this end, in the embodiments of the present invention, the system may further comprise a rule setting unit 203, configured to obtain the ETL rules set and published by the user and the information of the dynamic library files.
Correspondingly, the rule encapsulation unit 201 encapsulates these ETL rules as dynamic library files and registers the information of each dynamic library file in the task table of the back-end database. The information of the dynamic library file may include any one or more of the following: start time, start period, redo flag, task type flag, task description, task identifier, whether the task is enabled, and whether the task has subtasks. This information may be defined by the user and published to the server at the same time as the ETL rules.
In this embodiment, the scheduling unit 202 may be implemented in various ways. One specific structure of the scheduling unit 202 comprises an extraction subunit, a conversion subunit, and a storing subunit, where:
the extraction subunit is configured to extract source data from a production database for each task in the task table according to its corresponding ETL rule;
the conversion subunit is configured to convert the source data extracted by the extraction subunit into the target data required by the system;
the storing subunit is configured to store the target data converted by the conversion subunit in a target database.
Of course, in practical applications the scheduling unit 202 may further comprise other functional units, for example a functional unit for sorting, aggregating, and otherwise processing the target data converted by the conversion subunit.
To further improve the efficiency of processing large data volumes, the scheduling unit 202 preferably schedules the tasks in the task table through a multi-process concurrency mechanism.
It can be seen that the data processing system provided by the invention encapsulates ETL rules as dynamic library files and registers the information of each dynamic library file in a task table of a back-end database; the task table is scanned, and ETL processing of data is performed for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule. Thus users do not need programming skills, and large data volumes and multiple data sources can be processed. Processing efficiency is high, and the approach is not affected by the system environment: it can be used in both non-visual and visual systems, for example on system platforms such as Linux, AIX, Solaris, and Windows.
It should be noted that the different units of the data processing system of the embodiments of the present invention may be integrated in one device (such as a computer) or distributed across different devices.
The processing performed by the method and system of the embodiments of the present invention is further described below by way of example.
For example, consider a report platform for mobile services. Because the number of mobile subscribers is huge, after the services of several hundred million mobile phone subscribers are processed, the report platform produces hundreds of millions of business records, and an ETL tool capable of handling large data volumes is needed to process them. With the method and system provided by the embodiments of the present invention, a report database can be built and different report tasks configured in a database table; different tasks have different ETL rules, and these different ETL rules can be hidden behind a unified interface. The server schedules these ETL rules to process the business records, which effectively improves processing efficiency.
The embodiments in this specification are described progressively; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant parts can be found in the description of the method embodiment. The system embodiment described above is only illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The embodiments of the present invention have been described in detail above, and specific implementations have been used herein to explain the invention. The description of the above embodiments is only intended to help in understanding the method and device of the invention. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (4)

1. A data processing method, characterized by comprising:
encapsulating an ETL rule as a dynamic library file, and registering the information of the dynamic library file in a task table of a back-end database;
wherein the ETL rule comprises any one or more of the following: a data fetching rule, a data splitting rule, a data conversion rule, a data merging rule, a data sorting rule, a data aggregation rule, a network data collection rule, a data loading rule, and a data configuration rule;
the information of the dynamic library file comprises any one or more of the following for each task: start time, start period, redo flag, task type flag, task description, task identifier, whether the task is enabled, and whether the task has subtasks;
scanning the task table, scheduling the tasks in the task table through a background multi-process concurrency mechanism, and performing ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule; wherein performing ETL processing of data for each task in the task table according to its corresponding ETL rule comprises:
extracting source data from a data source for each task in the task table according to its corresponding ETL rule; converting the obtained source data into the target data required by the system; and storing the target data in a target database.
2. The method according to claim 1, characterized in that the ETL rule and the information of the dynamic library file are set and published by a user.
3. A data processing system, characterized by comprising:
a rule encapsulation unit, configured to encapsulate an ETL rule as a dynamic library file and register the information of the dynamic library file in a task table of a back-end database;
wherein the ETL rule comprises any one or more of the following: a data fetching rule, a data splitting rule, a data conversion rule, a data merging rule, a data sorting rule, a data aggregation rule, a network data collection rule, a data loading rule, and a data configuration rule;
the information of the dynamic library file comprises any one or more of the following for each task: start time, start period, redo flag, task type flag, task description, task identifier, whether the task is enabled, and whether the task has subtasks;
a scheduling unit, configured to scan the task table, schedule the tasks in the task table through a background multi-process concurrency mechanism, and perform ETL processing of data for each task in the task table according to its corresponding ETL rule, where each task in the task table corresponds to one ETL rule; wherein the scheduling unit comprises:
an extraction subunit, configured to extract source data from a data source for each task in the task table according to its corresponding ETL rule;
a conversion subunit, configured to convert the source data extracted by the extraction subunit into the target data required by the system;
a storing subunit, configured to store the target data converted by the conversion subunit in a target database.
4. The system according to claim 3, characterized by further comprising:
a rule setting unit, configured to obtain the ETL rule and the information of the dynamic library file that are set and published by the user.
CN201110370530.1A 2011-11-18 2011-11-18 Data processing method and system Expired - Fee Related CN102508919B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110370530.1A CN102508919B (en) 2011-11-18 2011-11-18 Data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110370530.1A CN102508919B (en) 2011-11-18 2011-11-18 Data processing method and system

Publications (2)

Publication Number Publication Date
CN102508919A CN102508919A (en) 2012-06-20
CN102508919B true CN102508919B (en) 2014-10-29

Family

ID=46221005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110370530.1A Expired - Fee Related CN102508919B (en) 2011-11-18 2011-11-18 Data processing method and system

Country Status (1)

Country Link
CN (1) CN102508919B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236580A (en) * 2010-04-26 2011-11-09 阿里巴巴集团控股有限公司 Method for distributing node to ETL (Extraction-Transformation-Loading) task and dispatching system

Also Published As

Publication number Publication date
CN102508919A (en) 2012-06-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Address after: 19, building 368, 510300 South Guangzhou Avenue, Guangdong, Guangzhou

Applicant after: Sunrise Technology Co., Ltd.

Address before: 19, building 368, 510300 South Guangzhou Avenue, Guangdong, Guangzhou

Applicant before: Snrise Corporation

COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: SNRISE CORPORATION TO: CONGXING TECHNOLOGY CO., LTD.

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HONGKONG SHIYE DEVELOPMENT CO., LTD.

Free format text: FORMER OWNER: CONGXING TECHNOLOGY CO., LTD.

Effective date: 20150805

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150805

Address after: Room 32, building 3205, Bank of America, 12 Cecil Harcourt Road, central, Hongkong, China

Patentee after: Hongkong world industry development Co., Ltd.

Address before: 19, building 368, 510300 South Guangzhou Avenue, Guangdong, Guangzhou

Patentee before: Sunrise Technology Co., Ltd.

ASS Succession or assignment of patent right

Owner name: TELEFON AB L.M. ERICSSON (SE)

Free format text: FORMER OWNER: HONGKONG SHIYE DEVELOPMENT CO., LTD.

Effective date: 20150909

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150909

Address after: Stockholm

Patentee after: Telefon AB L.M. Ericsson [SE]

Address before: Room 32, building 3205, Bank of America, 12 Cecil Harcourt Road, central, Hongkong, China

Patentee before: Hongkong world industry development Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141029

Termination date: 20181118
