CN108388610B - Data ETL processing method and device - Google Patents
Data ETL processing method and device Download PDFInfo
- Publication number
- CN108388610B CN108388610B CN201810122336.3A CN201810122336A CN108388610B CN 108388610 B CN108388610 B CN 108388610B CN 201810122336 A CN201810122336 A CN 201810122336A CN 108388610 B CN108388610 B CN 108388610B
- Authority
- CN
- China
- Prior art keywords
- data
- source
- target
- target table
- etl processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to a data ETL processing method and device. The method comprises the following steps: acquiring a source table taking a primary key of a target table as a foreign key; judging whether the data included by the foreign key in the source table exist in the target table; if at least one piece of data included by the foreign key in the source table does not exist in the target table, storing the data of the foreign key which does not exist in the target table into a loss record table, wherein the structure of the loss record table is consistent with that of the target table, so that the data stored in the loss record table can be used for determining which data are lost or inaccurate, and business analysis such as data loss rate or inaccuracy rate is realized.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing ETL data.
Background
At present, in the process of informatization construction, database construction can effectively solve the problems of data separation, service dispersion, information intercommunication incapability and the like caused by information isolated islands. Fig. 1 is a schematic diagram showing database construction in the related art. As shown in fig. 1, the database construction may refer to integrating data resources of separate business systems together to construct a system having functions of data acquisition, processing, storage, distribution, sharing, and the like. The database construction can realize the systematized management of the data, and has important significance in the informatization construction process.
The ETL (Extract-Transform-Load) is a core link of database construction. ETL can refer to the process of extracting, converting, and loading data from a source table of a source database into a target table of a target database. In the related art, the accuracy of the ETL processing procedure is usually determined by comparing the number of records or comparing the content of keywords. The accuracy of comparing the number of records or the content of the keywords is low, the application range is small, and the inaccuracy of ETL processing or the processing of wrong data cannot be determined, so that the data quality in the target database is poor.
Disclosure of Invention
In view of this, the present disclosure provides a data ETL processing method and apparatus, so as to solve the problem in the related art that data quality in a target database is poor due to inaccurate data ETL processing or processing errors.
According to an aspect of the present disclosure, there is provided a data ETL processing method, including:
acquiring a source table taking a primary key of a target table as a foreign key;
judging whether the data included by the foreign key in the source table exist in the target table;
and if at least one piece of data included by the foreign key in the source table does not exist in the target table, storing the data of the foreign key which does not exist in the target table into a loss record table, wherein the structures of the loss record table and the target table are consistent.
In one implementation, the method further comprises:
if all the data included by the foreign key in the source table exist in the target table, determining that the ETL processing process of the data is accurate; or
And if at least one piece of data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
In one implementation, after storing the data of the source table in which the foreign key does not exist in the target table into a loss record table, the method further comprises:
carrying out ETL processing on the data in the loss record table;
and comparing the result with the target table aiming at each result obtained after the ETL processing, if the result does not exist in the target table, storing the record corresponding to the result into the target table, and deleting or marking the record corresponding to the result in the lost record table.
In one implementation, performing ETL processing on data in the loss record table includes:
and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
In one implementation, after storing all data of the source table in which the foreign key does not exist in the target table in a loss record table, the method further comprises:
determining the total number of data in the target table and the total number of data in the lost record table;
and calculating the loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
According to another aspect of the present disclosure, there is provided a data ETL processing apparatus including:
the acquisition module is used for acquiring a source table which takes the primary key of the target table as a foreign key;
the judging module is used for judging whether the data included by the foreign key words in the source table exist in the target table or not;
a recording module, configured to store, if at least one piece of data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table into a missing recording table, where the missing recording table and the target table have a same structure.
In one implementation, the apparatus further comprises:
a first determining module, configured to determine that an ETL processing procedure of data is accurate if all data included in the foreign key in the source table is present in the target table; or if at least one data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
In one implementation, the obtaining module is further configured to:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
In one implementation, the apparatus further comprises:
the first processing module is used for carrying out ETL processing on the data in the loss record table;
and the second processing module is used for comparing each result obtained after the ETL processing with the target table, storing the record corresponding to the result into the target table if the result does not exist in the target table, and deleting or marking the record corresponding to the result in the lost record table.
In one implementation, the first processing module is configured to:
and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
In one implementation, the apparatus further comprises:
a second determining module, configured to determine a total number of data in the target table and a total number of data in the lost record table;
and the calculating module is used for calculating the loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
The data ETL processing method and device disclosed by the invention can be used for judging whether data included by the foreign key in the source table exist in the target table or not by acquiring the source table which takes the primary key of the target table as the foreign key, and storing the data of the foreign key which does not exist in the target table into the loss record table if at least one piece of data included by the foreign key in the source table does not exist in the target table, wherein the structure of the loss record table is consistent with that of the target table, so that inaccurate ETL processing or wrong processing data can be recorded, the business analysis such as data loss rate or inaccuracy rate can be realized, and the lost data can be searched for in the ETL processing.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram showing database construction in the related art.
Fig. 2 shows a schematic diagram of ETL in the related art.
Fig. 3 is a schematic diagram illustrating a data ETL processing method in the related art.
Fig. 4 illustrates a flow chart of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 5 illustrates a flow chart of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 6 shows a flow chart of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 8 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure.
Fig. 10 is a block diagram illustrating a data ETL processing apparatus 900 according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In the related art, ETL (Extract-Transform-Load) is a core link of database construction. ETL can refer to the process of extracting, converting, and loading data from a source table of a source database into a target table of a target database. The source database includes databases of each service system, such as various types of databases of MySQL, Oracle, PostgreSQL, Hive, Hadoop, Excel, text file, and the like, which is not limited in this disclosure.
The purpose of ETL processing is to integrate scattered, disordered and standard non-uniform data in a service system together, and provide analysis basis for the decision of the service system. The ETL process includes the following stages: data extraction, data cleaning conversion and data loading. In each stage of the ETL process, the longest time is spent as a purge transition stage of data. The data cleaning and converting phase comprises data cleaning and data conversion.
Wherein the task of data cleansing is to filter out the unsatisfactory data. And in the data cleaning process, the filtered result is delivered to a service administration department, and the service administration department confirms whether to filter or correct the result and then extracts the result. The unsatisfactory data includes incomplete data, erroneous data, duplicate data, and the like. The task of data transformation is mainly to perform transformation of inconsistent data, transformation of data granularity, and computation of some business rules.
Fig. 2 shows a schematic diagram of ETL in the related art. As shown in fig. 2, the process of data acquisition and logical processing by ETL includes: firstly, data are collected and extracted from a source table stored in a heterogeneous source database, and then the processed data are stored in a target table of a target database through the processing processes of cleaning, processing, converting and the like. Wherein, the service system data source, the text file and other types of data can form a source database.
In the ETL processing process of data, problems such as inaccurate data processing, even data loss, etc. may occur, which may cause poor data quality in the target database, thereby affecting the service logic of the service system based on the target database, and causing serious consequences.
In the related art, the accuracy of the ETL processing procedure is usually determined by comparing the number of records or comparing the content of keywords. The comparing the number of records refers to comparing whether the number of records in the source table of the source database is consistent with the number of records in the target table of the target database every time ETL processing of the complete or incremental data is performed, if not, ETL processing needs to be performed on all data in the source table again, otherwise, ETL processing is not performed. Comparing the key content means that each piece of data in the source table of the source database is found to be corresponding data in the target table of the target database each time ETL processing of the complete or incremental data is performed, whether the key content is consistent or not is compared, if the key content is not consistent, ETL processing needs to be performed on all the data in the source table again, and otherwise, ETL processing is not performed.
Fig. 3 is a schematic diagram illustrating a data ETL processing method in the related art. As shown in fig. 3:
as shown in fig. 3, the procedure for comparing the number of records is as follows: compare-re-ETL-store. Specifically, each time ETL processing of full or incremental data is performed, the number of records Count1 in Table a1 Table in the source database and the number of records Count2 in Table a2 Table in the target database are calculated. Comparing the Count1 with the Count2, if the Count1 and the Count2 are equal, the ETL processing procedure is considered to be accurate. Otherwise, clearing the data in the Table A2 Table, performing ETL processing on the data in the Table A1 Table again, including the processing of extracting, converting, loading and the like on the data in the Table A1 Table, and storing the data after ETL processing in the Table A2 Table in the target database.
The full data may refer to all data in the source database, and the incremental data may refer to data added to the source database within a certain time period.
Therefore, the data quality in the target database is poor due to the fact that the number of the comparison records cannot determine that ETL is processed inaccurately or data with errors is processed. When it is determined that the data ETL processing process is inaccurate, ETL processing needs to be performed on all data in the source table again, which is tedious to operate and wastes data ETL processing resources.
It should be noted that the number of comparison records is mainly applicable to a single table in a single source database to a single table in a target database. For a single table from multiple tables in a single source database to a single table in a target database, or a multiple tables in a multiple source database to a single table in a target database, if the data in the source table is duplicated, the data processed by the ETL is the data without duplication, so that the comparison of the number of records is not applicable.
As shown in fig. 3, the flow of using the keyword content is as follows: compare-re-ETL-store. Specifically, each time ETL processing of full or incremental data is performed, each piece of data in Table a1 Table in the source database is found out to be corresponding to the Table a2 Table in the target database. And comparing the key of each piece of data in the Table A1 Table with the key of the corresponding data in the Table A2 Table, and if the key of each piece of data in the Table A1 Table is the same as the key of the corresponding data in the Table A2 Table, determining that the ETL processing process of the data is accurate. Otherwise, clearing the data in the Table A2 Table, performing ETL processing on the data in the Table A1 Table again, including the processing of extracting, converting, loading and the like on the data in the Table A1 Table, and storing the data after ETL processing in the Table A2 Table in the target database.
The full data may refer to all data in the source database, and the incremental data may refer to data added to the source database within a certain time period.
Therefore, the operation of comparing the keyword contents is complicated and the efficiency is low. In addition, the comparison of the keyword content cannot determine that the ETL is processed inaccurately or processed with wrong data, resulting in poor quality of data in the target database. When it is determined that the data ETL processing process is inaccurate, ETL processing needs to be performed on all data in the source table again, which is tedious to operate and wastes data ETL processing resources.
To solve the technical problems in the related art, fig. 4 is a flowchart illustrating a data ETL processing method according to an embodiment of the present disclosure. As shown in fig. 4, the method includes steps S41 through S43.
In step S41, a source table having the primary key of the target table as the foreign key is acquired.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes: comparing the primary key of the target table with all source tables of the source database; and acquiring the source table taking the primary key of the target table as the foreign key from all the source tables of the source database.
In step S42, it is determined whether or not the data included in the foreign key in the source table is present in the target table.
In step S43, if at least one data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table is stored in the missing record table, wherein the missing record table and the target table have the same structure.
According to the data ETL processing method, the loss record table with the structure consistent with that of the target table is established, and the data lost or processed inaccurately in the ETL processing process is recorded through the established loss record table, so that the data lost or processed inaccurately in the ETL processing process can be conveniently found. When the ETL processing process of the data is determined to be inaccurate, which data are lost or inaccurate can be determined through the data stored in the loss record table, and business analysis such as data loss rate or inaccuracy rate is achieved.
In one implementation, a primary key and an ETL processing policy of a target table are determined. And carrying out ETL processing on the data in the source table of the source database according to the determined primary key of the target table and the ETL processing strategy. The ETL processing strategy at least comprises data extraction, data cleaning conversion and data loading, and the data obtained after the ETL processing is stored in a target table, so that the target table is established. After the target table is established, a loss record table is established according to the structure of the target data table. The loss record table is used to store data that is lost or processed inaccurately during the ETL process to build the target table.
It is understood that there is only one target table and one or more source tables during ETL processing of data. In the case that there are multiple source tables, the multiple source tables may be from the same database, or the multiple source tables may be from different databases, which is not limited in this disclosure. The target database and the source database are different databases.
A Primary Key (also called a Primary Key) is one or more fields in a table, and is used to uniquely identify a record in the table. In a two-table relationship, a primary key may be used to reference a particular record in one table from the other table. The primary key is a uniquely identified key that is part of the table definition. The primary key of a table may be one key, or may be composed of multiple keys, which is not limited in this disclosure.
For example, a student table (school number, name, gender, class), the school number uniquely identifies a record in the student table, whereby the school number may serve as the primary key of the student table. The course number can uniquely identify a record in the course table, and thus can be used as a main key word of the course table. The score table (school number, course number and score) can not uniquely identify one record in the score table by a single attribute, and the combination of the school number and the course number can uniquely identify one record in the course table, so that the combination of the school number and the course number can be used as a main key word of the score table.
If a Key is used as the primary Key of one table and the Key exists in another table, the Key is called a Foreign Key (also called a Foreign Key) of the other table. A table may have one or more foreign keys. Foreign keys can represent associations between two tables, whereby a foreign key can be used to establish an association with another table. A table having a foreign key of another table as a primary key is referred to as a primary table, and a table having the foreign key is referred to as a secondary table of the primary table.
For example, student tables (school number, name, gender, class), the school number being the primary key of the student table. The course table (course number, course name, and score), and the course number is the main key word of the course table. Score table (school number, course number, score), the combination of school number and course number is the main key of score table. The schooling number in the achievement list is not the primary key word of the achievement list, but corresponds to the schooling number in the student list, and the schooling number in the student list is the primary key word of the student list, so that the schooling number in the achievement list is called as the foreign key word of the student list. Similarly, the course number in the achievement list is the foreign keyword of the course list.
In one implementation, after storing all data of the source table in which the foreign key does not exist in the target table in the loss record table, the method further comprises: determining the total number of data in the target table and the total number of data in the lost record table; and calculating the loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
Wherein, the loss rate is the number of lost pieces divided by the total number of data pieces. And under the condition that the ETL processing process of the data is accurate, the number of the lost pieces is zero, and the total number of the data is the total number of the data in the target table. Under the condition that the ETL processing process of the data is inaccurate, if the data in the lost record table corresponding to the target table is not repeated, the number of the lost pieces is the total number of the data in the lost record table corresponding to the target table, and the total number of the data is the sum of the total number of the data in the target table and the total number of the data in the lost record table corresponding to the target table.
In the actual data processing process, the total amount of data is huge, often millions, millions or even billions of mass data, and the amount of lost or error data is small. Therefore, in the data ETL processing method of the present disclosure, the total number of data in the lost record table corresponding to the target table may be used as the number of lost pieces, and the total number of data in the target table, or the sum of the total number of data in the target table and the total number of data in the lost record table corresponding to the target table may be used as the total number of data.
According to the data ETL processing method, the loss record table which is consistent with the target table structure is established, and the data which are lost or processed inaccurately in the ETL processing process are recorded through the established loss record table, so that the total number of the data which are lost or processed inaccurately in the ETL processing process can be counted, and support is provided for calculating the loss rate of the data ETL processing.
Fig. 5 illustrates a flow chart of a data ETL processing method according to an embodiment of the present disclosure. As shown in fig. 5, the method includes steps S51 through S54.
In step S51, a source table having the primary key of the target table as the foreign key is acquired.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes: comparing the primary key of the target table with all source tables of the source database; and acquiring the source table taking the primary key of the target table as the foreign key from all the source tables of the source database.
In step S52, it is determined whether or not the data included in the foreign key in the source table is present in the target table.
In step S53, if at least one data included in the foreign key in the source table is not present in the target table, it is determined that the ETL processing is inaccurate.
As one example, the primary key of the target table is determined to be ID0, and the ETL processing policy is determined. All source tables of the source database include a Table 1 Table, a Table 2 Table, a Table 3 Table, a Table 4 Table, and a Table 5 Table. And carrying out ETL processing on the data in the source Table of the source database according to the primary key ID0 and an ETL processing strategy, and storing the data obtained after the ETL processing into a target Table, thereby establishing a target Table 0 Table. For example, data is collected from a Table 1 Table, a Table 2 Table, and a Table 3 Table in the source database, ETL processing is performed on the collected data, and the data after ETL processing is stored in a Table 0 Table in the target database.
The flow of the data ETL processing method is as follows: if the primary key of the Table 0 of the target Table is ID0, a source Table with ID0 as a foreign key is obtained from all source tables of the source database, and the source Table includes a Table 1 Table, a Table 2 Table and a Table 3 Table. The data in the source Table 1, Table 2 and Table 3 tables including the ID0 are compared one by one with the target Table 0 Table. If at least one piece of data included in the ID0 in the Table 1, Table 2, and Table 3 of the source Table is not present in the Table 0 of the target Table, it is determined that the processing procedure of the data ETL is not accurate.
In step S54, if all the data included in the foreign key in the source table exists in the target table, it is determined that the data ETL processing procedure is accurate.
The flow of the data ETL processing method is as follows: if the primary key of the Table 0 of the target Table is ID0, a source Table with ID0 as a foreign key is obtained from all source tables of the source database, and the source Table includes a Table 1 Table, a Table 2 Table and a Table 3 Table. The data included in ID0 in the source Table 1 Table, Table 2 Table, and Table 3 Table are compared one by one with the target Table 0 Table. If all the data included in the ID0 in the Table 1, Table 2 and Table 3 of the source Table are present in the Table 0 of the target Table, it is determined that the processing procedure of the data ETL is accurate.
Fig. 6 shows a flow chart of a data ETL processing method according to an embodiment of the present disclosure. As shown in fig. 6, the method includes steps S61 through S64.
In step S61, a source table having the primary key of the target table as the foreign key is acquired.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes: comparing the primary key of the target table with all source tables of the source database; and acquiring the source table taking the primary key of the target table as the foreign key from all the source tables of the source database.
In step S62, it is determined whether or not the data included in the foreign key in the source table is present in the target table.
In step S63, if at least one data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table is stored in the missing record table, wherein the missing record table and the target table have the same structure.
In step S64, ETL processing is performed on the data in the loss record table; and comparing the result with the target table according to each result obtained after the ETL processing, if the result does not exist in the target table, storing the record corresponding to the result into the target table, and deleting or marking the record corresponding to the result in the lost record table.
Wherein the record at least comprises data before ETL processing and a result after ETL processing is carried out on the data. Since the structures of the target table and the lost record table are consistent, the record can be directly stored in the target.
If a record in the lost record table is marked or deleted, it may indicate that a result obtained after ETL processing in the record already exists in the target table.
In one implementation, performing ETL processing on data in a loss record table includes: and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table. By adopting the ETL processing strategy adopted when the target table is formed, the result of ETL processing of the data in the lost record table can be the result which should be obtained when the target table is formed, and therefore the accuracy of the finally obtained target table is improved.
According to the data ETL processing method, when the fact that the data ETL processing process is inaccurate is determined, all data in the source table do not need to be subjected to ETL processing again, the ETL processing strategy adopted by the target table is formed, the data in the lost record table is subjected to ETL processing, and for each result obtained after the ETL processing, if the result does not exist in the target table, the record corresponding to the result is stored in the target table, so that the target table with accurate data ETL processing can be obtained, operation can be simplified, data ETL processing resources can be saved, the data quality in the target database can be guaranteed, and support is provided for the service logic of a service system based on the target database.
According to the data ETL processing method, the loss record table with the structure consistent with that of the target table is established, and the data lost or processed inaccurately in the ETL processing process is recorded through the established loss record table, so that the data lost or processed inaccurately in the ETL processing process can be conveniently found. When the ETL processing process of the data is determined to be inaccurate, which data are lost or inaccurate can be determined through the data stored in the lost record table, and the ETL processing is further performed on the data in the lost record table without performing the ETL processing on all the data in the source table again, so that the operation is simplified, and the ETL processing resources of the data are saved.
Fig. 7 shows a schematic diagram of a data ETL processing method according to an embodiment of the present disclosure. As one example, the primary key of the target table is determined to be ID0, and the ETL processing policy is determined. As shown in fig. 7, all the source tables of the source database include a Table 1 Table, a Table 2 Table, a Table 3 Table, a Table 4 Table, and a Table 5 Table. And carrying out ETL processing on the data in the source Table of the source database according to the primary key ID0 and an ETL processing strategy, and storing the data obtained after the ETL processing into a target Table, thereby establishing a target Table 0 Table. For example, data is collected from a Table 1 Table, a Table 2 Table, and a Table 3 Table in the source database, ETL processing is performed on the collected data, and the data after ETL processing is stored in a Table 0 Table in the target database.
The Table 0 Table is a target Table, and the Table 0 Table includes data 1, data 3, data 4, and data 7. A Table 1 Table, a Table 2 Table and a Table 3 Table are source tables, and the Table 1 Table comprises data 1 and data 2; table 2 Table includes data 3 and data 4; table 3 Table includes data 5, data 6, and data 7. Data 5 and data 1 are repeated, and data 1 is retained after the duplication is removed in the ETL processing process.
As shown in fig. 7, the flow of the data ETL processing method of the present disclosure is specifically as follows.
And establishing a loss record Table Table 0' Table corresponding to the target Table Table 0 Table. The target Table 0 Table has the same structure as its corresponding loss record Table 0' Table. If the primary key of the Table 0 of the target Table is ID0, a source Table with ID0 as a foreign key is obtained, and the source Table includes a Table 1 Table, a Table 2 Table and a Table 3 Table. Wherein, the Table 0 Table includes data 1, data 3, data 4 and data 7. Table 1 Table includes data 1 and data 2; table 2 Table includes data 3 and data 4; table 3 Table includes data 5, data 6, and data 7.
And 2, comparing each data in the source Table Table 1, the Table 2 and the Table 3 with the target Table Table 0 to obtain a lost record Table Table 0'. For example, the loss record Table 0' Table includes data 2 and data 6.
And 3, carrying out ETL processing on the data in the loss record Table Table 0'.
And 4, comparing the result with the target Table 0 according to each result obtained after the ETL treatment, and if the result does not exist in the target Table 0, executing the following step 5.
And 5, storing the record corresponding to the result into the target Table Table 0, and deleting or marking the record corresponding to the result in the lost record Table Table 0'. The ETL processing policy is consistent with the ETL processing policy adopted when generating the target Table 0, and the ETL processing policy includes at least data extraction, data cleaning conversion, and data loading.
For example, ETL processing is performed on data 2 and data 6 in the Table of the loss record Table 0' to obtain data 2 and data 6 after ETL processing (data 2 and data 6 do not overlap). And comparing the data 2 with the target Table 0, if the data 2 does not exist in the target Table 0, storing the data 2 into the target Table 0, and deleting or marking the data 2 in the loss record Table 0'. The data 6 is compared with the target Table 0, and if the data 6 exists in the target Table 0, the data 6 in the loss record Table 0' is deleted or marked.
For another example, ETL processing is performed on data 2 and data 6 in the Table of the loss record Table 0' to obtain data 2 after ETL processing (data 2 and data 6 are repeated, and data 2 is retained after deduplication in the ETL processing process). And comparing the data 2 with the target Table 0, if the data 2 does not exist in the target Table 0, storing the data 2 into the target Table 0, and deleting or marking the data 2 in the loss record Table 0'.
As an example, table 1 shows the structure of a target table and a missing record table according to an embodiment of the present disclosure. The target table is a natural person table, and the table name is zrr. The primary key of natural person table zrr is a natural person identity id. The loss record table is a natural person loss record table and takes the table name of zrr _ lost. The natural person loss record table zrr _ lost is consistent with the structure of the natural person table zrr.
As shown in table 1, the natural person table zrr is used to store natural person information, and the natural person loss record table zrr _ lost is used to store natural person information that is lost or inaccurately processed during ETL processing. The natural person information may include natural person identity id, id type, name, phone, address, relationship type, relationship code, and the like, which is not limited in this disclosure.
TABLE 1
Natural person identity id | id type | Address | Name (I) | Telephone set | Type of relationship | Relationship code |
id | id_type | address | name | phone | relation | relation_code |
Table 2 shows source table 1 according to an embodiment of the present disclosure. As shown in Table 2, Source Table 1 is a natural human investment Table, which is named zrrtz. The natural human investment table zrrtz is a table in the source database. In this embodiment, the default investor and the investee are both natural persons, and thus the foreign key corresponding to the natural person investment table zrrtz and the natural person table zrr is: investor id and investor id. As shown in table 2, the natural person investment table zrrtz also includes information such as investor name, certificate type, address, gender, and telephone number.
TABLE 2
Investor id | Invested name | Document type | Investor id | Investor name | Address | Sex | Telephone set | |
cerno_end | name_end | certype | cerno_start | name_start | address | sex | tel | |
id1 | |
1 | id4 | Xiao Hong | #### | #### | #### | |
| Wang San | 1 | id1 | WangSi tea | #### | #### | #### | |
| Wang Da | 1 | id2 | Wang San | #### | #### | #### |
Table 3 shows source table 2 according to an embodiment of the present disclosure. As shown in Table 3, the source Table 2 is the tour conductor Table, and the table name is dyry. The tour guide people table dyry is a table in the source database. In general, the tour guides are all natural persons, and thus the foreign keys corresponding to the tour guide table dyry and the natural person table zrr are: an identification number. As shown in table 3, the tour guide person table dyry further includes information such as name, tour guide card number, gender, address, language name, ethnicity, academic calendar, and contact number.
TABLE 3
Tour guide card number | Name (I) | Sex | Identity card number | Address | Name of language | Nationality | Study calendar | Contact telephone | |
DYKH | name | sex | sfz | adress | yzmc | mz | xl | lxdh | |
| Zhang San | 1 | id5 | #### | #### | #### | #### | #### | |
| Wang Wu | 1 | id6 | #### | #### | #### | #### | #### | |
DYKH3 | Xiao Hong | 0 | id4 | #### | #### | #### | #### | #### | |
| Xiaoming liquor | 1 | id7 | #### | #### | #### | #### | #### |
Table 4 shows source table 3 according to an embodiment of the present disclosure. As shown in Table 4, Source Table 3 is the natural human role table, named zrrrz. The natural human role table zrrrz is a table in the source database. The corresponding foreign key of the natural human job table zrrrz and the natural human table zrr is: certificate numbers. As shown in table 4, the natural person job table zrrz further includes information such as job id, job unit id, business name, home address, mobile phone, and job position.
TABLE 4
Job id | Job unit id | Name of an enterprise | Name (I) | Home address | Certificate number | Mobile telephone | Position of employment |
id | pri | ent_name | name | house_add | cerno | mob_tel | position |
P1 | 001 | |
Zhang San | #### | id5 | #### | General manager |
P2 | 002 | |
Wang Wu | #### | id6 | #### | CEO |
It will be appreciated that the data in the natural people table zrr in the target database is derived from tables with natural people information that are foreign keys to the natural people identity id from all of the source tables in the source database through some complex logic processes, such as investors and investors in the natural people investment table zrrtz, tour guides in the tour guides table dyry, and staff in the natural people job table zrrrz.
As an example, ETL processing is performed on data in the source database to obtain a target table, i.e., the natural people table zrr. Table 5 shows a target table according to an embodiment of the present disclosure. As shown in table 5, natural person table zrr includes natural person information with natural person identities id1, id2, id3, id4, id5, id 6.
TABLE 5
Natural person identity id | id type | Address | Name (I) | Telephone set | Type of relationship | Relationship code |
id | id_type | address | name | phone | | relation_code |
id5 | ||||||
1 | #### | Zhang San | #### | Administrative penalty | 050100, | |
|
1 | #### | Wang Wu | #### | Administrative penalty | 100,040,000 |
|
3 | #### | Xiao Hong | #### | Administrative penalty | 050100, |
|
1 | #### | WangSi tea | #### | Administrative penalty | 050100, |
|
1 | #### | Wang San | #### | Administrative penalty | 050100, |
|
1 | #### | Wang Da | #### | Administrative penalty | 050100, |
After the ETL process is complete, a source table with the primary key natural person identity id of the natural person table zrr as the foreign key is determined from the source database: natural human job table zrrrz, tour guide staff table dyry and natural human investment table zrrtz. From which it is looked up whether the data is present in the natural people table zrr based on the natural people occupational table zrrz, the tourist guide table dyry, and the natural people investment table zrrtz. If all of the data is found to be present in the natural person table zrr, the data ETL process is determined to be accurate. If part of the data is found not to exist in the natural person table zrr, it is determined that the data ETL process is inaccurate.
Further, if it is found that data does not exist in the natural person table zrr, the data is considered to be lost data and is stored in the natural person loss record table zrr _ lost. This results in a loss record table, i.e., the natural person loss record table zrr _ lost. Table 6 shows a loss record table according to an embodiment of the present disclosure. As shown in table 6, the natural person loss record table zrr _ lost includes natural person information whose natural person identity id is id 7.
TABLE 6
Natural person identity id | id type | Address | Name (I) | Telephone set | Type of relationship | Relationship code |
id | id_type | address | name | phone | | relation_code |
id7 | ||||||
1 | #### | Xiaoming liquor | #### | Administrative penalty | 050100, |
The ETL processing strategy adopted by the natural person table zrr is formed, and the ETL processing is carried out on the data in the natural person loss record table zrr _ lost, so that the natural person information with the natural person identity id of 7 is obtained. Comparing the natural person information with the natural person identity id of id7 with the natural person table zrr, if the natural person information does not exist in the natural person table zrr, storing the natural person information with the natural person identity id of 7 in the natural person table zrr, and deleting or marking the natural person information with the natural person identity id of 7 in the natural person loss record table zrr _ lost.
From the above, it can be seen that: if a plurality of source tables in the source database are all directed at a natural person, only one piece of natural person information can be obtained after ETL processing. For example, for "small red" in the source table (i.e., "small red" in table 2 and "small red" in table 3), the final destination table has only one piece of natural human information of "small red".
Therefore, the method and the device can also aim at the situation that the recorded data is inaccurate after complex data acquisition and ETL processing. The complex data collection and ETL processing may refer to a case where a plurality of pieces of data are collected and combined into one or more pieces of data through ETL processing.
Fig. 8 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
an obtaining module 11, configured to obtain a source table using a primary key of a target table as a foreign key; a judging module 12, configured to judge whether all data included in the foreign key in the source table exist in the target table; a recording module 13, configured to store, if at least one piece of data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table into a missing recording table, where the missing recording table and the target table have a same structure.
Fig. 9 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure. As shown in fig. 9:
in one implementation, the apparatus further comprises: a first determining module 14, configured to determine that the processing procedure of the data ETL is accurate if all the data included in the foreign key in the source table is present in the target table; or if at least one data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
In one implementation, the obtaining module 11 is further configured to: comparing the primary key of the target table with all source tables of a source database; and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
In one implementation, the apparatus further comprises: a first processing module 15, configured to perform ETL processing on data in the loss record table; a second processing module 16, configured to compare, for each result obtained after ETL processing, the result with the target table, if the result does not exist in the target table, store a record corresponding to the result in the target table, and delete or mark a record corresponding to the result in the lost record table.
In one implementation, the first processing module 15 is configured to: and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
In one implementation, the apparatus further comprises: a second determining module 17, configured to determine a total number of data in the target table and a total number of data in the lost record table; and the calculating module 18 is configured to calculate a loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
According to the data ETL processing device, the loss recording table with the structure consistent with that of the target table is established, and the data lost or processed inaccurately in the ETL processing process is recorded through the established loss recording table, so that the data lost or processed inaccurately in the ETL processing process can be conveniently found. When the ETL processing process of the data is determined to be inaccurate, which data is lost or inaccurate can be determined through the data stored in the loss record table, so that service analysis such as data loss rate or inaccuracy rate can be realized.
For the device embodiment, since it basically corresponds to the method embodiment, the relevant points may be referred to the partial description of the method embodiment, and will not be repeated here.
Fig. 10 is a block diagram illustrating a data ETL processing apparatus 900 according to an exemplary embodiment. Referring to fig. 10, the apparatus 900 may include a processor 901, a machine-readable storage medium 902 having stored thereon machine-executable instructions. The processor 901 and the machine-readable storage medium 902 may communicate via a system bus 903. Also, the processor 901 performs the asynchronous communication method described above by reading machine-executable instructions in the machine-readable storage medium 902 corresponding to the asynchronous communication logic.
The machine-readable storage medium 902 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: RAM (random Access Memory), volatile Memory, non-volatile Memory, flash Memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, dvd, etc.), or similar storage medium, or a combination thereof.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (8)
1. A method for processing ETL data, comprising:
acquiring a source table taking a primary key of a target table as a foreign key;
judging whether the data included by the foreign key in the source table exist in the target table;
if at least one piece of data included by the foreign key in the source table does not exist in the target table, storing the data of the foreign key which does not exist in the target table into a loss record table, wherein the structures of the loss record table and the target table are consistent;
after storing all data of the source table in which the foreign key does not exist in the target table into a loss record table, the method further comprises:
determining the total number of data in the target table and the total number of data in the lost record table;
calculating the loss rate of ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table;
carrying out ETL processing on the data in the loss record table;
and comparing the result with the target table aiming at each result obtained after the ETL processing, if the result does not exist in the target table, storing the record corresponding to the result into the target table, and deleting or marking the record corresponding to the result in the lost record table.
2. The method of claim 1, further comprising:
if all the data included by the foreign key in the source table exist in the target table, determining that the ETL processing process of the data is accurate; or
And if at least one piece of data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
3. The method of claim 1, wherein obtaining a source table having a primary key of a target table as a foreign key comprises:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
4. The method of claim 1, wherein performing ETL processing on the data in the loss record table comprises:
and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
5. A data ETL processing apparatus, comprising:
the acquisition module is used for acquiring a source table which takes the primary key of the target table as a foreign key;
the judging module is used for judging whether the data included by the foreign key words in the source table exist in the target table or not;
a recording module, configured to store, if at least one piece of data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table into a missing recording table, where structures of the missing recording table and the target table are consistent;
the device further comprises:
a second determining module, configured to determine a total number of data in the target table and a total number of data in the lost record table;
a calculating module, configured to calculate a loss rate of ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table
The first processing module is used for carrying out ETL processing on the data in the loss record table;
and the second processing module is used for comparing each result obtained after the ETL processing with the target table, storing the record corresponding to the result into the target table if the result does not exist in the target table, and deleting or marking the record corresponding to the result in the lost record table.
6. The apparatus of claim 5, further comprising:
a first determining module, configured to determine that an ETL processing procedure of data is accurate if all data included in the foreign key in the source table is present in the target table; or if at least one data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
7. The apparatus of claim 5, wherein the obtaining module is further configured to:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
8. The apparatus of claim 5, wherein the first processing module is configured to: and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810122336.3A CN108388610B (en) | 2018-02-07 | 2018-02-07 | Data ETL processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810122336.3A CN108388610B (en) | 2018-02-07 | 2018-02-07 | Data ETL processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108388610A CN108388610A (en) | 2018-08-10 |
CN108388610B true CN108388610B (en) | 2021-04-27 |
Family
ID=63075461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810122336.3A Active CN108388610B (en) | 2018-02-07 | 2018-02-07 | Data ETL processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108388610B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109828889A (en) * | 2019-01-31 | 2019-05-31 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium in monitoring data library |
CN111061740B (en) * | 2019-12-17 | 2023-08-04 | 北京软通智慧科技有限公司 | Data synchronization method, device and storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU9010198A (en) * | 1997-08-14 | 1999-03-08 | Aoraki Corporation Limited | Relational database coexistence in object oriented environments |
US8065323B2 (en) * | 2009-02-23 | 2011-11-22 | Oracle International Corporation | Offline validation of data in a database system for foreign key constraints |
US8868484B2 (en) * | 2010-07-08 | 2014-10-21 | Oracle International Corporation | Efficiently updating rows in a data warehouse |
CN102411588A (en) * | 2010-09-26 | 2012-04-11 | 金蝶软件(中国)有限公司 | Comparison checking method and system of data table |
CN103309888A (en) * | 2012-03-14 | 2013-09-18 | 北京四维图新科技股份有限公司 | Method and device for verifying data of electronic map |
US9251179B2 (en) * | 2012-04-12 | 2016-02-02 | International Business Machines Corporation | Managing record location lookup caching in a relational database |
CN103164514B (en) * | 2013-02-25 | 2017-05-03 | 用友网络科技股份有限公司 | Detection device and detection method for data reference relationship |
CN104679794A (en) * | 2013-12-03 | 2015-06-03 | 中兴通讯股份有限公司 | Data difference analysis method and device |
CN107103025B (en) * | 2017-01-05 | 2020-08-14 | 亚信科技(中国)有限公司 | Data processing method and data processing platform |
-
2018
- 2018-02-07 CN CN201810122336.3A patent/CN108388610B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN108388610A (en) | 2018-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10025904B2 (en) | Systems and methods for managing a master patient index including duplicate record detection | |
JP5923307B2 (en) | Assertion-based record linkage in a decentralized autonomous medical environment | |
US10572461B2 (en) | Systems and methods for managing a master patient index including duplicate record detection | |
US20180144061A1 (en) | Edge store designs for graph databases | |
MXPA04006390A (en) | Real time data warehousing. | |
US11347719B2 (en) | Multi-table data validation tool | |
US11636078B2 (en) | Personally identifiable information storage detection by searching a metadata source | |
US10445370B2 (en) | Compound indexes for graph databases | |
CN109918386A (en) | A kind of data reconstruction method and device, computer readable storage medium | |
CN108388610B (en) | Data ETL processing method and device | |
US11609897B2 (en) | Methods and systems for improved search for data loss prevention | |
CN101452556A (en) | Customer information processing system and method | |
CN112463737A (en) | System and method for rapidly acquiring data aiming at multi-format data intelligent matching template | |
CN110019542B (en) | Generation of enterprise relationship, generation of organization member database and identification of same name member | |
CN111091883A (en) | Medical text processing method and device, storage medium and equipment | |
US10929441B1 (en) | System and techniques for data record merging | |
CN104866603A (en) | Calling method of metadata and metadata managing system | |
US20180144060A1 (en) | Processing deleted edges in graph databases | |
US20090259659A1 (en) | Identifying entities of interest | |
CN112612818B (en) | Data processing method and device, computing equipment and storage medium | |
CN110851431B (en) | Data processing method and device for data center station | |
CN110502675B (en) | Voice dialing user classification method based on data analysis and related equipment | |
JP2002063363A (en) | Insurance contract information management system and customer index generating method | |
Gohel et al. | A commodity data cleaning system | |
JP2013020381A (en) | Name gathering management system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |