CN110609860A - Data ETL processing method, device, equipment and storage medium - Google Patents

Info

Publication number
CN110609860A
Authority
CN
China
Prior art keywords
data
log
type
warehouse
loading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810529602.4A
Other languages
Chinese (zh)
Inventor
唐堂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Chongqing Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Chongqing Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201810529602.4A priority Critical patent/CN110609860A/en
Publication of CN110609860A publication Critical patent/CN110609860A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data ETL processing method, a data ETL processing device, data ETL processing equipment and a storage medium. The method comprises the following steps: acquiring a change log of changes in a database log of a source data warehouse; identifying and marking the type of each log in the change log; extracting the data that has changed in the source data warehouse according to the type; performing data cleaning and data conversion on the extracted data; loading, from the cleaned and converted data, the data corresponding to logs marked as insert type or modification type into a target data warehouse; and loading the data that has not changed in the source data warehouse into the target data warehouse. According to the data ETL processing method, device, equipment and storage medium, only the changed data in the source data warehouse undergoes ETL extraction, and the unchanged data does not, so that incremental extraction of the data is achieved and the efficiency of the data ETL can be improved.

Description

Data ETL processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing ETL data.
Background
Extract-Transform-Load (ETL) describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL is an important step in building a data warehouse: the required data is extracted from a data source, the extracted data is cleaned and converted, and the cleaned and converted data is loaded into the data warehouse according to a predefined data warehouse model. A Data Warehouse (DW/DWH) is a strategic collection of data that provides all types of data support for decision-making processes at all levels of an enterprise.
At present, data ETL is mainly performed in a full ETL mode. The full ETL mode is similar to database migration or data copying: the data of all data tables in the data source is extracted, then cleaned and converted, and then loaded.
In the full ETL mode, all of the data is processed on every run, no matter how little of the data in the data tables of the data source has changed. The data ETL therefore consumes a lot of time and places a heavy load on the source data system and the data warehouse, which seriously affects the efficiency of the data ETL.
Disclosure of Invention
The embodiment of the invention provides a data ETL processing method, a device, equipment and a storage medium, which can improve the efficiency of data ETL.
In one aspect, an embodiment of the present invention provides a data ETL processing method, where the method includes:
acquiring a change log of a database log of a source data warehouse;
identifying and marking the type of each log in the change log;
extracting data changed in the source data warehouse according to the type;
carrying out data cleaning and data conversion on the data;
loading, from the cleaned and converted data, the data corresponding to logs marked as insert type or modification type into a target data warehouse;
and loading the data of which the source data warehouse is unchanged into the target data warehouse.
In one embodiment of the present invention, identifying and marking the type of each log in the change log comprises:
the type and change time of each log in the change log is identified and marked.
In one embodiment of the invention, extracting data that has changed in the source data warehouse according to type comprises:
and extracting the data corresponding to the marked logs, wherein the types of the marked logs are an insertion type, a modification type and a deletion type.
In one embodiment of the invention, before extracting data that has changed in the source data warehouse according to type, the method further comprises:
and storing the data corresponding to each log in an in-memory database.
In one embodiment of the present invention, the in-memory database is a distributed in-memory database.
In one embodiment of the present invention, loading data that has not changed in the source data warehouse into the target data warehouse comprises:
extracting unchanged data from the source data warehouse;
the extracted data is loaded into a target data warehouse.
In one embodiment of the invention, extracting unchanged data from the source data warehouse comprises:
and extracting unchanged data from the source data warehouse by using a left join in Structured Query Language (SQL).
In another aspect, an embodiment of the present invention provides an ETL processing apparatus, where the apparatus includes:
the acquisition module is used for acquiring a change log of a database log of the source data warehouse;
the marking module is used for identifying and marking the type of each log in the change logs;
the extraction module is used for extracting the data changed in the source data warehouse according to the type;
the cleaning conversion module is used for cleaning and converting data;
the first loading module is used for loading, from the cleaned and converted data, the data corresponding to logs marked as insert type or modification type into a target data warehouse;
and the second loading module is used for loading the data of which the source data warehouse is unchanged into the target data warehouse.
In an embodiment of the present invention, the marking module is specifically configured to:
the type and change time of each log in the change log is identified and marked.
In an embodiment of the present invention, the extraction module is specifically configured to:
and extracting the data corresponding to the marked logs, wherein the types of the marked logs are an insertion type, a modification type and a deletion type.
In an embodiment of the present invention, the data ETL processing apparatus according to the embodiment of the present invention further includes:
and the storage module is used for storing the data corresponding to each log in the memory database.
In one embodiment of the present invention, the in-memory database is a distributed in-memory database.
In one embodiment of the invention, the second load module comprises:
the extraction unit is used for extracting unchanged data from the source data warehouse;
and the loading unit is used for loading the extracted data into the target data warehouse.
In an embodiment of the present invention, the extraction unit is specifically configured to:
and extracting unchanged data from the source data warehouse by using a left join in Structured Query Language (SQL).
In another aspect, an embodiment of the present invention provides a data ETL processing device, where the device includes: a memory and a processor;
the memory is used for storing executable program codes;
the processor is used for reading the executable program codes stored in the memory to execute the data ETL processing method provided by the embodiment of the invention.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement the data ETL processing method provided by the embodiments of the present invention.
According to the data ETL processing method, the data ETL processing device, the data ETL processing equipment and the data ETL processing storage medium, only changed data in the source data warehouse are extracted, unchanged data in the source data warehouse are not extracted, incremental extraction of the data is achieved, and therefore the efficiency of the data ETL can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart illustrating a data ETL processing method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a data ETL processing apparatus according to an embodiment of the present invention;
fig. 3 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing a data ETL processing method and apparatus according to an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In order to solve the problem in the prior art, embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for processing data ETL, so as to improve efficiency of data ETL. First, a data ETL processing method provided in an embodiment of the present invention is described below.
Fig. 1 is a schematic flow chart illustrating a data ETL processing method according to an embodiment of the present invention. The data ETL processing method can comprise the following steps:
S101: Obtain a change log of changes in the database log of the source data warehouse.
S102: Identify and mark the type of each log in the change log.
S103: Extract the data that has changed in the source data warehouse according to the marked type.
S104: Perform data cleaning and data conversion on the extracted data.
S105: Load, from the cleaned and converted data, the data corresponding to logs marked as insert type or modification type into a target data warehouse.
S106: Load the data that has not changed in the source data warehouse into the target data warehouse.
The database log of the embodiment of the invention may include online redo logs (Online Redo Log) and archived redo logs (Archive Redo Log). The Online Redo Log, also referred to as a transaction log (Transaction Log), stores all changes made to the database and records all insert, update, delete, commit, rollback, and database schema change operations, i.e., any modification to the database. The Archive Redo Log, often simply called an archive log (Archive Log), is a backup of inactive redo logs. By using the Archive Log, all redo history can be retained: when the database is in archive mode and a log switch occurs, the background archiving process saves the content of the redo log to the archive log. When the database fails, it can be completely restored using the data file backup, the archive log, and the redo log.
In the following, table 1 (mobile phone table) is taken as an example, and it is understood that table 1 is one table in the database of the source data warehouse.
TABLE 1
The log changes for Table 1 are read and stored in a transaction queue; the logs of changes that have been committed are then parsed from the transaction queue and stored in the change log.
For example, assume that the log changes read for Table 1 are as follows:
January 1, 2018, 17:45:50, update Table 1 set memory = 6 where id = 1.
January 2, 2018, 15:21:54, insert into Table 1 (Huawei, P20, 15000, 6, 5.5, 128, 3999, Android).
January 2, 2018, 15:24:21, alter table Table 1 add system string.
January 2, 2018, 15:21:54, insert into Table 1 (Huawei, P20, 15000, 6, 5.5, 128, 3999, Android).
January 2, 2018, 15:31:05, update Table 1 set pixel = 1500 where id = 5.
January 2, 2018, 15:45:32, update Table 1 set price = 3599 where id = 5.
January 2, 2018, 16:21:21, delete from Table 1 where id = 2.
The data in table 1 changes according to the log change described above, as shown in table 2.
TABLE 2
id | Brand | Model | Pixel | Memory | Screen size | Storage capacity | Price | System
1 | Xiaomi | 6X | 2000 | 6 | 5.99 | 64 | 2039 |
3 | Apple | 7Plus | 1200 | 4 | 5.5 | 128 | 4658 |
4 | Vivo | X21 | 500 | 6 | 6.28 | 64 | 2898 |
5 | Huawei | P20 | 1500 | 6 | 5.5 | 128 | 3599 | Android
It will be appreciated that when the second log change, "January 2, 2018, 15:21:54, insert into Table 1 (Huawei, P20, 15000, 6, 5.5, 128, 3999, Android)", is executed, Table 1 has only 8 columns, whereas the insert requires 9 columns. In general, the id field is an auto-increment field that does not need to be specified in an insert operation, so the 8 supplied values plus the id field make 9 columns. Therefore, the transaction operation corresponding to this log fails, and when the log is parsed it can be determined that the transaction operation was never actually committed. Accordingly, the records stored in the change log are as follows:
January 1, 2018, 17:45:50, update Table 1 set memory = 6 where id = 1.
January 2, 2018, 15:24:21, alter table Table 1 add system string.
January 2, 2018, 15:21:54, insert into Table 1 (Huawei, P20, 15000, 6, 5.5, 128, 3999, Android).
January 2, 2018, 15:31:05, update Table 1 set pixel = 1500 where id = 5.
January 2, 2018, 15:45:32, update Table 1 set price = 3599 where id = 5.
January 2, 2018, 16:21:21, delete from Table 1 where id = 2.
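By way of illustration only, the step of parsing committed changes from the transaction queue into the change log might be sketched in Python as follows. The LogRecord class, its committed flag, the table name phone, and the function drain_committed are assumptions made for this sketch; the embodiment does not specify how the commit status is represented in the queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LogRecord:
    changed_at: str   # change time as read from the database log
    statement: str    # the logged operation
    committed: bool   # hypothetical flag: True if the transaction was committed


def drain_committed(queue):
    """Move committed records from the transaction queue into a change log."""
    change_log = []
    while queue:
        record = queue.popleft()
        if record.committed:
            change_log.append(record)
    return change_log


# Hypothetical queue contents; "phone" stands in for Table 1.
queue = deque([
    LogRecord("2018-01-01 17:45:50", "update phone set memory = 6 where id = 1", True),
    LogRecord("2018-01-02 15:21:54", "insert into phone (...)", False),  # failed insert
    LogRecord("2018-01-02 15:24:21", "alter table phone add system string", True),
])
change_log = drain_committed(queue)
```

In this sketch the failed insert is dropped because it was never committed, which mirrors the example above in which the second log change does not appear in the change log.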
In one embodiment of the invention, the log types may include create (create) type, find (select) type, insert (insert) type, update (update) type, modify (alter) type, sort (order) type, and undo (drop) type, among others.
When the log type is identified, the type of each log can be identified according to the keywords in the log.
For example, for the first log in the change log, "January 1, 2018, 17:45:50, update Table 1 set memory = 6 where id = 1", the keyword update is identified in the log, and the type of the log is marked as update.
For another example, for the second log in the change log, "January 2, 2018, 15:24:21, alter table Table 1 add system string", the keyword alter is identified in the log, and the type of the log is marked as alter.
By way of further example, for the third log in the change log, "January 2, 2018, 15:21:54, insert into Table 1 (Huawei, P20, 15000, 6, 5.5, 128, 3999, Android)", the keyword insert is identified in the log, and the type of the log is marked as insert.
For another example, for the sixth log in the change log, "January 2, 2018, 16:21:21, delete from Table 1 where id = 2", the keyword delete is identified in the log, and the type of the log is marked as delete.
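By way of illustration only, the keyword-based identification described above might be sketched in Python as follows. The function name mark_log_type, the keyword tuple, the table name phone, and the choice to inspect only the leading keyword of each statement are assumptions made for this sketch and are not prescribed by the embodiment.

```python
import re

# Assumed set of recognizable keywords; each keyword doubles as the type label.
LOG_KEYWORDS = ("create", "select", "insert", "update", "alter", "order", "delete", "drop")


def mark_log_type(statement: str) -> str:
    """Return the log type given by the leading keyword of a logged statement."""
    match = re.match(r"\s*([A-Za-z]+)", statement)
    keyword = match.group(1).lower() if match else ""
    return keyword if keyword in LOG_KEYWORDS else "unknown"


# Hypothetical statements; "phone" stands in for Table 1.
change_log = [
    "update phone set memory = 6 where id = 1",
    "alter table phone add system string",
    "insert into phone values ('P20', 15000, 6, 5.5, 128, 3999, 'Android')",
    "delete from phone where id = 2",
]
marked = [(mark_log_type(statement), statement) for statement in change_log]
```

Each entry of marked pairs a log with its type, so that later steps can select logs by type without parsing the statement again.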
In one embodiment of the invention, the change time of each log may also be marked.
Illustratively, for the first log in the change log, "January 1, 2018, 17:45:50, update Table 1 set memory = 6 where id = 1", the change time of the log is identified and marked as January 1, 2018, 17:45:50.
By marking the change time of each log, the efficiency of the data ETL can be further improved.
Illustratively, assume that data ETL processing was last performed at 00:00:00 on January 2, 2018. When data ETL processing is performed at 00:00:00 on January 3, 2018, only the data corresponding to logs in the range from 00:00:00 on January 2, 2018 to 00:00:00 on January 3, 2018 is extracted. Data corresponding to logs before 00:00:00 on January 2, 2018 does not need to be extracted, which improves the efficiency of the data ETL.
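By way of illustration only, selecting logs by their marked change time might be sketched in Python as follows. The function logs_in_window, the dictionary layout of a marked log, and the half-open interval (start inclusive, end exclusive) are assumptions made for this sketch; the embodiment does not state how the boundary instants are treated.

```python
from datetime import datetime


def logs_in_window(marked_logs, last_run, this_run):
    """Keep logs whose change time lies in [last_run, this_run)."""
    return [log for log in marked_logs if last_run <= log["changed_at"] < this_run]


marked_logs = [
    {"changed_at": datetime(2018, 1, 1, 17, 45, 50), "type": "update"},
    {"changed_at": datetime(2018, 1, 2, 15, 24, 21), "type": "alter"},
    {"changed_at": datetime(2018, 1, 2, 16, 21, 21), "type": "delete"},
]
selected = logs_in_window(marked_logs, datetime(2018, 1, 2), datetime(2018, 1, 3))
```

Only the two logs marked on January 2, 2018 fall inside the window; the log from the previous day is skipped because an earlier run is assumed to have processed it.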
After the type of each log in the change logs is marked, extracting the changed data of the source data warehouse according to the marked type.
In one embodiment of the present invention, extracting data that has changed in the source data warehouse according to the marked type may include: and extracting the data corresponding to the marked logs, wherein the types of the marked logs are an insertion type, a modification type and a deletion type.
The data extracted are shown in table 3.
TABLE 3
Data cleaning (data cleansing) and data conversion (data transformation) are performed on the extracted data.
Data cleaning converts dirty data into data that meets the data quality requirements by using related techniques such as mathematical statistics, data mining, or predefined cleaning rules. It is a process of re-examining and verifying the data, with the goal of deleting duplicate information, correcting existing errors, and ensuring data consistency. Here, dirty data (Dirty Read) means that data in the source system is outside a given range or meaningless to the actual business, or that the data format is illegal, or that the source system contains irregular coding and ambiguous business logic. Dirty data mainly includes incomplete data, erroneous data, and repeated data.
The embodiment of the invention can adopt any one or a combination of a method for solving incomplete data (namely missing values), a method for detecting and solving error values, a method for detecting and eliminating repeated records and a method for detecting and solving inconsistency (inside data sources and between data sources) when data are cleaned. The embodiment of the present invention does not limit the specific method used for data cleaning, and any available data cleaning method can be applied to the embodiment of the present invention.
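By way of illustration only, one possible combination of predefined cleaning rules covering incomplete, erroneous, and repeated data might be sketched in Python as follows. The function clean_rows, the required fields, and the rule that a price must be positive are assumptions made for this sketch, and the sample rows are invented dirty data rather than rows from the tables of the embodiment.

```python
def clean_rows(rows, required=("id", "model", "price")):
    """Drop incomplete rows, rows with an invalid price, and repeated ids."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if any(row.get(field) is None for field in required):
            continue  # incomplete data: a required value is missing
        if row["price"] <= 0:
            continue  # erroneous data under the assumed rule "price must be positive"
        if row["id"] in seen_ids:
            continue  # repeated data: keep only the first row for each id
        seen_ids.add(row["id"])
        cleaned.append(row)
    return cleaned


# Invented dirty rows for illustration only.
rows = [
    {"id": 101, "model": "A", "price": 2000},
    {"id": 101, "model": "A", "price": 2000},   # duplicate
    {"id": 102, "model": "B", "price": None},   # missing value
    {"id": 103, "model": "C", "price": -1},     # out-of-range value
    {"id": 104, "model": "D", "price": 3500},
]
cleaned = clean_rows(rows)
```

This sketch discards offending rows; an implementation could equally repair them, for example by filling missing values, since the embodiment does not limit the cleaning method.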
The main tasks of data conversion are the conversion of inconsistent data, the conversion of data granularity, and the computation of business rules. Conversion of inconsistent data is a data integration process in which data of the same type from different business systems is unified. Conversion of data granularity: business systems typically store very detailed data, whereas data in a data warehouse is used for analysis and does not need to be so detailed, so business system data is typically aggregated to the granularity of the data warehouse. Computation of business rules: different enterprises have different business rules and different data indicators, and these indicators sometimes cannot be obtained by simple addition and subtraction; they need to be computed in the ETL and then stored in the data warehouse for analysis.
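By way of illustration only, the conversion of data granularity might be sketched in Python as an aggregation of detailed records to a coarser level. The function aggregate_daily, the field names day, product, and amount, and the sample records are assumptions made for this sketch and do not come from the embodiment.

```python
from collections import defaultdict


def aggregate_daily(transactions):
    """Sum detailed transaction amounts per (day, product) pair."""
    totals = defaultdict(int)
    for transaction in transactions:
        totals[(transaction["day"], transaction["product"])] += transaction["amount"]
    return dict(totals)


# Invented detail records for illustration only.
transactions = [
    {"day": "2018-01-02", "product": "P20", "amount": 2},
    {"day": "2018-01-02", "product": "P20", "amount": 3},
    {"day": "2018-01-02", "product": "X21", "amount": 1},
]
daily = aggregate_daily(transactions)
```

Three detailed records collapse into two warehouse-level rows, one per product per day.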
For example, assume that the data after data cleaning and data conversion is also as shown in Table 3. The data in Table 3 corresponding to logs marked as insert type or modification type is loaded into the target data warehouse. That is, the data "id = 1, brand = Xiaomi, model = 6X, pixel = 2000, memory = 6, screen size = 5.99, storage capacity = 64, price = 2039" and "id = 5, brand = Huawei, model = P20, pixel = 1500, memory = 6, screen size = 5.5, storage capacity = 128, price = 3599" in Table 3 is written into the target data warehouse.
In an embodiment of the present invention, the SQL statement that loads, from the cleaned and converted data, the data corresponding to logs marked as insert type or modification type into the target data warehouse is as follows:
insert into Table 4 select * from Table 3 where log type in ('insert', 'update').
It is to be understood that table 4 is a table in the target data warehouse. Table 4 is as follows:
TABLE 4
id | Brand | Model | Pixel | Memory | Screen size | Storage capacity | Price | System
1 | Xiaomi | 6X | 2000 | 6 | 5.99 | 64 | 2039 |
5 | Huawei | P20 | 1500 | 6 | 5.5 | 128 | 3599 | Android
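By way of illustration only, the batch loading of insert-type and update-type rows might be reproduced in Python with an in-memory SQLite database as follows. The table names changed and target, the reduced column set, and the log_type column are assumptions made for this sketch; the values of the deleted row are not given in the source, so NULL placeholders are used for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "changed" stands in for Table 3, "target" for Table 4 (reduced columns).
    CREATE TABLE changed (id INTEGER, model TEXT, price INTEGER, log_type TEXT);
    CREATE TABLE target  (id INTEGER, model TEXT, price INTEGER);
    INSERT INTO changed VALUES
        (1, '6X',  2039, 'update'),
        (5, 'P20', 3599, 'insert'),
        (2, NULL,  NULL, 'delete');  -- placeholder values for the deleted row
""")

# One set-based statement loads every insert-type and update-type row.
conn.execute("""
    INSERT INTO target (id, model, price)
    SELECT id, model, price FROM changed
    WHERE log_type IN ('insert', 'update')
""")
loaded = conn.execute("SELECT id, model, price FROM target ORDER BY id").fetchall()
```

The delete-type row is filtered out by the WHERE clause, so only the rows with id 1 and id 5 reach the target table.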
Then, the data that has not changed in the source data warehouse is loaded into the target data warehouse. That is, the data "id = 3, brand = Apple, model = 7Plus, pixel = 1200, memory = 4, screen size = 5.5, storage capacity = 128, price = 4658" and "id = 4, brand = Vivo, model = X21, pixel = 500, memory = 6, screen size = 6.28, storage capacity = 64, price = 2898" in Table 2 is written into the target data warehouse.
In one embodiment of the present invention, loading data of which the source data warehouse has not changed into the target data warehouse may include: extracting unchanged data from the source data warehouse; the extracted data is loaded into a target data warehouse.
In one embodiment of the present invention, extracting unchanged data from the source data warehouse may include: extracting unchanged data from the source data warehouse by using a left join in SQL.
The SQL statement that uses a left join to extract unchanged data from the source data warehouse and load the extracted data into the target data warehouse is as follows:
insert into Table 4 select Table 2.* from Table 2 left join Table 3 on Table 2.id = Table 3.id where Table 3.id is null.
Among them, the results of the left connection through the id field in tables 2 and 3 are shown in table 5.
TABLE 5
The row data corresponding to the delete type in Table 3 can be filtered out by "Table 2 left join Table 3 on Table 2.id = Table 3.id" in the SQL statement.
By "select Table 2.* from Table 2" and "where Table 3.id is null" in the SQL statement, the data "id = 3, brand = Apple, model = 7Plus, pixel = 1200, memory = 4, screen size = 5.5, storage capacity = 128, price = 4658" and "id = 4, brand = Vivo, model = X21, pixel = 500, memory = 6, screen size = 6.28, storage capacity = 64, price = 2898" can be extracted from Table 5 above.
The data extracted from Table 5 can be inserted into Table 4 by "insert into Table 4" in the SQL statement. Table 4, after the data extracted from Table 5 above is inserted, is as follows:
id | Brand | Model | Pixel | Memory | Screen size | Storage capacity | Price | System
1 | Xiaomi | 6X | 2000 | 6 | 5.99 | 64 | 2039 |
5 | Huawei | P20 | 1500 | 6 | 5.5 | 128 | 3599 | Android
3 | Apple | 7Plus | 1200 | 4 | 5.5 | 128 | 4658 |
4 | Vivo | X21 | 500 | 6 | 6.28 | 64 | 2898 |
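By way of illustration only, the left join that selects unchanged rows might be reproduced in Python with an in-memory SQLite database as follows. The table names source, changed, and target and the reduced column set are assumptions made for this sketch; they stand in for Table 2, Table 3, and Table 4 respectively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source  (id INTEGER, model TEXT, price INTEGER);  -- Table 2
    CREATE TABLE changed (id INTEGER, log_type TEXT);              -- Table 3
    CREATE TABLE target  (id INTEGER, model TEXT, price INTEGER);  -- Table 4
    INSERT INTO source VALUES
        (1, '6X', 2039), (3, '7Plus', 4658), (4, 'X21', 2898), (5, 'P20', 3599);
    INSERT INTO changed VALUES (1, 'update'), (5, 'insert'), (2, 'delete');
    -- Changed rows are assumed to have been loaded already by the previous step.
    INSERT INTO target VALUES (1, '6X', 2039), (5, 'P20', 3599);
""")

# Rows of source with no match in changed have changed.id = NULL after the
# left join, so the IS NULL filter keeps exactly the unchanged rows.
conn.execute("""
    INSERT INTO target
    SELECT source.* FROM source
    LEFT JOIN changed ON source.id = changed.id
    WHERE changed.id IS NULL
""")
target_ids = [row[0] for row in conn.execute("SELECT id FROM target ORDER BY id")]
```

The deleted row (id 2) never appears because it no longer exists in the source table, and the rows with id 1 and id 5 are excluded because they match an entry in the changed table.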
According to the embodiment of the invention, SQL statements replace the traditional loading algorithm, and the insert statement replaces the update and delete statements of the traditional loading technology. Data can thus be loaded in batches, which avoids loading data row by row through an Open Database Connectivity (ODBC)/Java Database Connectivity (JDBC) interface as in the traditional loading technology, and the efficiency of the data ETL can be improved. Moreover, for a data warehouse based on a shared-nothing architecture, the execution efficiency of the insert statement is much higher than that of the update and delete statements, which further improves the efficiency of the data ETL.
In an embodiment of the present invention, before extracting data of which the source data warehouse has changed according to the marked type, the data ETL processing method provided by the embodiment of the present invention further includes: and storing the data corresponding to each log in an in-memory database.
The memory database is a database which directly operates by putting data in a memory. Compared with a magnetic disk, the data read-write speed of the memory is higher by several orders of magnitude, and the application performance can be greatly improved by storing data in the memory compared with accessing from the magnetic disk. Data in the memory database are stored in the memory, so that the super-strong speed advantage of memory access can be exerted, and the updating speed and the extraction speed of the data are greatly improved.
In an embodiment of the present invention, the memory database may be a distributed memory database, which may be deployed on multiple nodes of a network, and provides a uniform access interface to the outside, and has a linear performance expansion capability.
By means of the distributed in-memory database technology, concurrency performance can be effectively improved, and the efficiency of the data ETL can be improved. In addition, capacity and performance can be scaled out linearly. Where the same record is updated multiple times, the distributed in-memory database keeps only the data record resulting from the latest update operation. This avoids performing ETL processing once per update operation when the same record in a data table in the database of the source data warehouse has been updated multiple times, effectively reduces the amount of incremental data that the data warehouse ETL has to process, and can improve the efficiency of the data ETL.
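By way of illustration only, keeping just the latest state of a record that has been changed several times might be sketched in Python as follows. A plain dictionary stands in for the distributed in-memory database, and the function latest_per_record and the layout of a change entry are assumptions made for this sketch.

```python
def latest_per_record(changes):
    """Collapse a time-ordered list of changes to the last state of each id."""
    store = {}  # stand-in for the in-memory database, keyed by record id
    for change in changes:
        store[change["id"]] = change  # a later change overwrites an earlier one
    return list(store.values())


# Changes to the record with id 5, in the order they were logged, plus one other record.
changes = [
    {"id": 5, "op": "insert", "pixel": 15000, "price": 3999},
    {"id": 5, "op": "update", "pixel": 1500, "price": 3999},
    {"id": 5, "op": "update", "pixel": 1500, "price": 3599},
    {"id": 1, "op": "update", "memory": 6},
]
latest = latest_per_record(changes)
```

Four logged changes reduce to two records, so cleaning, conversion, and loading each run once per record instead of once per change.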
According to the data ETL processing method provided by the embodiment of the invention, only the changed data in the source data warehouse is extracted, and the unchanged data in the source data warehouse is not extracted, so that the incremental extraction of the data is realized, and the efficiency of the data ETL can be improved.
Corresponding to the above method embodiments, the embodiment of the present invention further provides a data ETL processing apparatus.
As shown in fig. 2, fig. 2 is a schematic structural diagram of a data ETL processing apparatus according to an embodiment of the present invention. The data ETL processing device may include:
the obtaining module 201 is configured to obtain a change log of a change of a database log of a source data warehouse.
A marking module 202 for identifying and marking a type of each log in the change log.
And the extraction module 203 is used for extracting the data with the changed source data warehouse according to the marked type.
And a cleaning conversion module 204 for performing data cleaning and data conversion on the extracted data.
The first loading module 205 is configured to load, from the data subjected to data cleaning and data conversion, the data corresponding to logs marked as insert type or modification type into the target data warehouse.
And a second loading module 206, configured to load data of which the source data warehouse has not changed into the target data warehouse.
In an embodiment of the present invention, the marking module 202 may be specifically configured to:
the type and change time of each log in the change log is identified and marked.
In an embodiment of the present invention, the extraction module 203 may be specifically configured to:
and extracting the data corresponding to the marked logs, wherein the types of the marked logs are an insertion type, a modification type and a deletion type.
In an embodiment of the present invention, the data ETL processing apparatus according to the embodiment of the present invention may further include:
and a storage module (not shown in the figure) for storing the data corresponding to each log in the change log in the memory database.
In one embodiment of the present invention, the in-memory database is a distributed in-memory database.
In an embodiment of the present invention, the second loading module 206 may include:
the extraction unit is used for extracting unchanged data from the source data warehouse;
and the loading unit is used for loading the extracted data into the target data warehouse.
In an embodiment of the present invention, the extracting unit may be specifically configured to:
and extracting unchanged data from the source data warehouse by using a left join in Structured Query Language (SQL).
The data ETL processing device only extracts the data which are changed in the source data warehouse, and does not extract the data which are not changed in the source data warehouse, so that incremental extraction of the data is realized, and the efficiency of the data ETL can be improved.
Fig. 3 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing a data ETL processing method and apparatus according to an embodiment of the present application. As shown in fig. 3, computing device 300 includes an input device 301, an input interface 302, a central processor 303, a memory 304, an output interface 305, and an output device 306. The input interface 302, the central processing unit 303, the memory 304, and the output interface 305 are connected to each other through a bus 310, and the input device 301 and the output device 306 are connected to the bus 310 through the input interface 302 and the output interface 305, respectively, and further connected to other components of the computing device 300.
Specifically, the input device 301 receives input information from the outside and transmits the input information to the central processor 303 through the input interface 302; central processor 303 processes the input information based on computer-executable instructions stored in memory 304 to generate output information, stores the output information temporarily or permanently in memory 304, and then transmits the output information to output device 306 through output interface 305; output device 306 outputs the output information external to computing device 300 for use by the user.
That is, the computing device shown in fig. 3 may also be implemented as a data ETL processing device, which may include: a memory storing computer-executable instructions; and a processor which, when executing computer executable instructions, may implement the data ETL processing method and apparatus described in connection with fig. 1 and 2.
Embodiments of the present application further provide a computer-readable storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement the data ETL processing method provided by the embodiments of the present application.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
The above describes only specific embodiments of the present invention. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. It should be understood that the scope of protection of the present invention is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall fall within the scope of protection of the present invention.

Claims (10)

1. A data extract-transform-load (ETL) processing method, the method comprising:
acquiring a change log of a database log of a source data warehouse;
identifying and marking the type of each log in the change log;
extracting the changed data of the source data warehouse according to the type;
performing data cleaning and data conversion on the data;
loading, into a target data warehouse, the data that has undergone data cleaning and data conversion and that corresponds to the logs marked as the insertion type or the modification type;
and loading the unchanged data of the source data warehouse into the target data warehouse.
2. The method of claim 1, wherein identifying and marking the type of each log in the change log comprises:
identifying and marking the type and the change time of each log in the change log.
3. The method of claim 1, wherein extracting data that has changed in the source data warehouse according to the type comprises:
extracting the data corresponding to the logs marked as the insertion type, the modification type, or the deletion type.
4. The method of claim 1, wherein prior to said extracting data that has changed in said source data warehouse according to said type, said method further comprises:
storing the data corresponding to each log in an in-memory database.
5. The method of claim 4, wherein the in-memory database is a distributed in-memory database.
6. The method of claim 1, wherein loading unchanged data of the source data warehouse into the target data warehouse comprises:
extracting unchanged data from the source data warehouse;
and loading the extracted data into the target data warehouse.
7. The method of claim 6, wherein said extracting unchanged data from said source data warehouse comprises:
extracting unchanged data from the source data warehouse by using a left join (LEFT JOIN) in the Structured Query Language (SQL).
8. A data extract-transform-load ETL processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a change log of a database log of the source data warehouse;
the marking module is used for identifying and marking the type of each log in the change logs;
the extraction module is used for extracting the data changed in the source data warehouse according to the type;
the cleaning conversion module is used for cleaning and converting the data;
the first loading module is used for loading, into a target data warehouse, the data that has undergone data cleaning and data conversion and that corresponds to the logs marked as the insertion type or the modification type;
and the second loading module is used for loading the unchanged data of the source data warehouse into the target data warehouse.
9. A data extract-transform-load ETL processing apparatus, characterized in that said apparatus comprises: a memory and a processor;
the memory is used for storing executable program codes;
the processor is configured to read the executable program code stored in the memory to perform the data extract-transform-load ETL processing method of any one of claims 1-7.
10. A computer readable storage medium having computer program instructions stored thereon; the computer program instructions when executed by a processor implement the data extract-transform-load ETL processing method of any of claims 1-7.
CN201810529602.4A 2018-05-29 2018-05-29 Data ETL processing method, device, equipment and storage medium Pending CN110609860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810529602.4A CN110609860A (en) 2018-05-29 2018-05-29 Data ETL processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110609860A true CN110609860A (en) 2019-12-24

Family

ID=68887473

Country Status (1)

Country Link
CN (1) CN110609860A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075304A (en) * 2006-05-18 2007-11-21 河北全通通信有限公司 Method for constructing decision supporting system of telecommunication industry based on database
US20100280990A1 (en) * 2009-04-30 2010-11-04 Castellanos Maria G Etl for process data warehouse
CN102841897A (en) * 2011-06-23 2012-12-26 阿里巴巴集团控股有限公司 Incremental data extracting method, device and system
CN102915336A (en) * 2012-09-18 2013-02-06 北京金和软件股份有限公司 Incremental data capturing and extraction method based on timestamps and logs
CN105488187A (en) * 2015-12-02 2016-04-13 北京四达时代软件技术股份有限公司 Method and device for extracting multi-source heterogeneous data increment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159272A (en) * 2019-12-31 2020-05-15 青梧桐有限责任公司 Data quality monitoring and early warning method and system based on data warehouse and ETL
CN113377763A (en) * 2020-03-10 2021-09-10 阿里巴巴集团控股有限公司 Database table switching method and device, electronic equipment and computer storage medium
CN111694853A (en) * 2020-06-02 2020-09-22 北京北大软件工程股份有限公司 Lineage-based data increment acquisition method and device, storage medium and electronic equipment
CN111694853B (en) * 2020-06-02 2023-12-08 北京北大软件工程股份有限公司 Data increment collection method and device based on lineage, storage medium and electronic equipment
CN112256523A (en) * 2020-09-23 2021-01-22 贝壳技术有限公司 Service data processing method and device
CN112650744A (en) * 2020-12-31 2021-04-13 广州晟能软件科技有限公司 Data management method for preventing secondary pollution of data
CN112650744B (en) * 2020-12-31 2024-04-30 广州晟能软件科技有限公司 Data treatment method for preventing secondary pollution of data
CN113158233A (en) * 2021-03-29 2021-07-23 重庆首亨软件股份有限公司 Data preprocessing method and device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191224