CN108388610B - Data ETL processing method and device - Google Patents

Data ETL processing method and device Download PDF

Info

Publication number
CN108388610B
CN108388610B CN201810122336.3A CN201810122336A CN108388610B CN 108388610 B CN108388610 B CN 108388610B CN 201810122336 A CN201810122336 A CN 201810122336A CN 108388610 B CN108388610 B CN 108388610B
Authority
CN
China
Prior art keywords
data
source
target
target table
etl processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810122336.3A
Other languages
Chinese (zh)
Other versions
CN108388610A (en
Inventor
唐逸鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd filed Critical New H3C Big Data Technologies Co Ltd
Priority to CN201810122336.3A priority Critical patent/CN108388610B/en
Publication of CN108388610A publication Critical patent/CN108388610A/en
Application granted granted Critical
Publication of CN108388610B publication Critical patent/CN108388610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a data ETL processing method and device. The method comprises the following steps: acquiring a source table taking a primary key of a target table as a foreign key; judging whether the data included by the foreign key in the source table exist in the target table; if at least one piece of data included by the foreign key in the source table does not exist in the target table, storing the data of the foreign key which does not exist in the target table into a loss record table, wherein the structure of the loss record table is consistent with that of the target table, so that the data stored in the loss record table can be used for determining which data are lost or inaccurate, and business analysis such as data loss rate or inaccuracy rate is realized.

Description

Data ETL processing method and device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing ETL data.
Background
At present, in the process of informatization construction, database construction can effectively solve the problems of data separation, service dispersion, information intercommunication incapability and the like caused by information isolated islands. Fig. 1 is a schematic diagram showing database construction in the related art. As shown in fig. 1, the database construction may refer to integrating data resources of separate business systems together to construct a system having functions of data acquisition, processing, storage, distribution, sharing, and the like. The database construction can realize the systematized management of the data, and has important significance in the informatization construction process.
The ETL (Extract-Transform-Load) is a core link of database construction. ETL can refer to the process of extracting, converting, and loading data from a source table of a source database into a target table of a target database. In the related art, the accuracy of the ETL processing procedure is usually determined by comparing the number of records or comparing the content of keywords. The accuracy of comparing the number of records or the content of the keywords is low, the application range is small, and the inaccuracy of ETL processing or the processing of wrong data cannot be determined, so that the data quality in the target database is poor.
Disclosure of Invention
In view of this, the present disclosure provides a data ETL processing method and apparatus, so as to solve the problem in the related art that data quality in a target database is poor due to inaccurate data ETL processing or processing errors.
According to an aspect of the present disclosure, there is provided a data ETL processing method, including:
acquiring a source table taking a primary key of a target table as a foreign key;
judging whether the data included by the foreign key in the source table exist in the target table;
and if at least one piece of data included by the foreign key in the source table does not exist in the target table, storing the data of the foreign key which does not exist in the target table into a loss record table, wherein the structures of the loss record table and the target table are consistent.
In one implementation, the method further comprises:
if all the data included by the foreign key in the source table exist in the target table, determining that the ETL processing process of the data is accurate; or
And if at least one piece of data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
In one implementation, after storing the data of the source table in which the foreign key does not exist in the target table into a loss record table, the method further comprises:
carrying out ETL processing on the data in the loss record table;
and comparing the result with the target table aiming at each result obtained after the ETL processing, if the result does not exist in the target table, storing the record corresponding to the result into the target table, and deleting or marking the record corresponding to the result in the lost record table.
In one implementation, performing ETL processing on data in the loss record table includes:
and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
In one implementation, after storing all data of the source table in which the foreign key does not exist in the target table in a loss record table, the method further comprises:
determining the total number of data in the target table and the total number of data in the lost record table;
and calculating the loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
According to another aspect of the present disclosure, there is provided a data ETL processing apparatus including:
the acquisition module is used for acquiring a source table which takes the primary key of the target table as a foreign key;
the judging module is used for judging whether the data included by the foreign key words in the source table exist in the target table or not;
a recording module, configured to store, if at least one piece of data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table into a missing recording table, where the missing recording table and the target table have a same structure.
In one implementation, the apparatus further comprises:
a first determining module, configured to determine that an ETL processing procedure of data is accurate if all data included in the foreign key in the source table is present in the target table; or if at least one data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
In one implementation, the obtaining module is further configured to:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
In one implementation, the apparatus further comprises:
the first processing module is used for carrying out ETL processing on the data in the loss record table;
and the second processing module is used for comparing each result obtained after the ETL processing with the target table, storing the record corresponding to the result into the target table if the result does not exist in the target table, and deleting or marking the record corresponding to the result in the lost record table.
In one implementation, the first processing module is configured to:
and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
In one implementation, the apparatus further comprises:
a second determining module, configured to determine a total number of data in the target table and a total number of data in the lost record table;
and the calculating module is used for calculating the loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
The data ETL processing method and device disclosed by the invention can be used for judging whether data included by the foreign key in the source table exist in the target table or not by acquiring the source table which takes the primary key of the target table as the foreign key, and storing the data of the foreign key which does not exist in the target table into the loss record table if at least one piece of data included by the foreign key in the source table does not exist in the target table, wherein the structure of the loss record table is consistent with that of the target table, so that inaccurate ETL processing or wrong processing data can be recorded, the business analysis such as data loss rate or inaccuracy rate can be realized, and the lost data can be searched for in the ETL processing.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram showing database construction in the related art.
Fig. 2 shows a schematic diagram of ETL in the related art.
Fig. 3 is a schematic diagram illustrating a data ETL processing method in the related art.
Fig. 4 illustrates a flow chart of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 5 illustrates a flow chart of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 6 shows a flow chart of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a data ETL processing method according to an embodiment of the present disclosure.
Fig. 8 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure.
Fig. 10 is a block diagram illustrating a data ETL processing apparatus 900 according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In the related art, ETL (Extract-Transform-Load) is a core link of database construction. ETL can refer to the process of extracting, converting, and loading data from a source table of a source database into a target table of a target database. The source database includes databases of each service system, such as various types of databases of MySQL, Oracle, PostgreSQL, Hive, Hadoop, Excel, text file, and the like, which is not limited in this disclosure.
The purpose of ETL processing is to integrate scattered, disordered and standard non-uniform data in a service system together, and provide analysis basis for the decision of the service system. The ETL process includes the following stages: data extraction, data cleaning conversion and data loading. In each stage of the ETL process, the longest time is spent as a purge transition stage of data. The data cleaning and converting phase comprises data cleaning and data conversion.
Wherein the task of data cleansing is to filter out the unsatisfactory data. And in the data cleaning process, the filtered result is delivered to a service administration department, and the service administration department confirms whether to filter or correct the result and then extracts the result. The unsatisfactory data includes incomplete data, erroneous data, duplicate data, and the like. The task of data transformation is mainly to perform transformation of inconsistent data, transformation of data granularity, and computation of some business rules.
Fig. 2 shows a schematic diagram of ETL in the related art. As shown in fig. 2, the process of data acquisition and logical processing by ETL includes: firstly, data are collected and extracted from a source table stored in a heterogeneous source database, and then the processed data are stored in a target table of a target database through the processing processes of cleaning, processing, converting and the like. Wherein, the service system data source, the text file and other types of data can form a source database.
In the ETL processing process of data, problems such as inaccurate data processing, even data loss, etc. may occur, which may cause poor data quality in the target database, thereby affecting the service logic of the service system based on the target database, and causing serious consequences.
In the related art, the accuracy of the ETL processing procedure is usually determined by comparing the number of records or comparing the content of keywords. The comparing the number of records refers to comparing whether the number of records in the source table of the source database is consistent with the number of records in the target table of the target database every time ETL processing of the complete or incremental data is performed, if not, ETL processing needs to be performed on all data in the source table again, otherwise, ETL processing is not performed. Comparing the key content means that each piece of data in the source table of the source database is found to be corresponding data in the target table of the target database each time ETL processing of the complete or incremental data is performed, whether the key content is consistent or not is compared, if the key content is not consistent, ETL processing needs to be performed on all the data in the source table again, and otherwise, ETL processing is not performed.
Fig. 3 is a schematic diagram illustrating a data ETL processing method in the related art. As shown in fig. 3:
as shown in fig. 3, the procedure for comparing the number of records is as follows: compare-re-ETL-store. Specifically, each time ETL processing of full or incremental data is performed, the number of records Count1 in Table a1 Table in the source database and the number of records Count2 in Table a2 Table in the target database are calculated. Comparing the Count1 with the Count2, if the Count1 and the Count2 are equal, the ETL processing procedure is considered to be accurate. Otherwise, clearing the data in the Table A2 Table, performing ETL processing on the data in the Table A1 Table again, including the processing of extracting, converting, loading and the like on the data in the Table A1 Table, and storing the data after ETL processing in the Table A2 Table in the target database.
The full data may refer to all data in the source database, and the incremental data may refer to data added to the source database within a certain time period.
Therefore, the data quality in the target database is poor due to the fact that the number of the comparison records cannot determine that ETL is processed inaccurately or data with errors is processed. When it is determined that the data ETL processing process is inaccurate, ETL processing needs to be performed on all data in the source table again, which is tedious to operate and wastes data ETL processing resources.
It should be noted that the number of comparison records is mainly applicable to a single table in a single source database to a single table in a target database. For a single table from multiple tables in a single source database to a single table in a target database, or a multiple tables in a multiple source database to a single table in a target database, if the data in the source table is duplicated, the data processed by the ETL is the data without duplication, so that the comparison of the number of records is not applicable.
As shown in fig. 3, the flow of using the keyword content is as follows: compare-re-ETL-store. Specifically, each time ETL processing of full or incremental data is performed, each piece of data in Table a1 Table in the source database is found out to be corresponding to the Table a2 Table in the target database. And comparing the key of each piece of data in the Table A1 Table with the key of the corresponding data in the Table A2 Table, and if the key of each piece of data in the Table A1 Table is the same as the key of the corresponding data in the Table A2 Table, determining that the ETL processing process of the data is accurate. Otherwise, clearing the data in the Table A2 Table, performing ETL processing on the data in the Table A1 Table again, including the processing of extracting, converting, loading and the like on the data in the Table A1 Table, and storing the data after ETL processing in the Table A2 Table in the target database.
The full data may refer to all data in the source database, and the incremental data may refer to data added to the source database within a certain time period.
Therefore, the operation of comparing the keyword contents is complicated and the efficiency is low. In addition, the comparison of the keyword content cannot determine that the ETL is processed inaccurately or processed with wrong data, resulting in poor quality of data in the target database. When it is determined that the data ETL processing process is inaccurate, ETL processing needs to be performed on all data in the source table again, which is tedious to operate and wastes data ETL processing resources.
To solve the technical problems in the related art, fig. 4 is a flowchart illustrating a data ETL processing method according to an embodiment of the present disclosure. As shown in fig. 4, the method includes steps S41 through S43.
In step S41, a source table having the primary key of the target table as the foreign key is acquired.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes: comparing the primary key of the target table with all source tables of the source database; and acquiring the source table taking the primary key of the target table as the foreign key from all the source tables of the source database.
In step S42, it is determined whether or not the data included in the foreign key in the source table is present in the target table.
In step S43, if at least one data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table is stored in the missing record table, wherein the missing record table and the target table have the same structure.
According to the data ETL processing method, the loss record table with the structure consistent with that of the target table is established, and the data lost or processed inaccurately in the ETL processing process is recorded through the established loss record table, so that the data lost or processed inaccurately in the ETL processing process can be conveniently found. When the ETL processing process of the data is determined to be inaccurate, which data are lost or inaccurate can be determined through the data stored in the loss record table, and business analysis such as data loss rate or inaccuracy rate is achieved.
In one implementation, a primary key and an ETL processing policy of a target table are determined. And carrying out ETL processing on the data in the source table of the source database according to the determined primary key of the target table and the ETL processing strategy. The ETL processing strategy at least comprises data extraction, data cleaning conversion and data loading, and the data obtained after the ETL processing is stored in a target table, so that the target table is established. After the target table is established, a loss record table is established according to the structure of the target data table. The loss record table is used to store data that is lost or processed inaccurately during the ETL process to build the target table.
It is understood that there is only one target table and one or more source tables during ETL processing of data. In the case that there are multiple source tables, the multiple source tables may be from the same database, or the multiple source tables may be from different databases, which is not limited in this disclosure. The target database and the source database are different databases.
A Primary Key (also called a Primary Key) is one or more fields in a table, and is used to uniquely identify a record in the table. In a two-table relationship, a primary key may be used to reference a particular record in one table from the other table. The primary key is a uniquely identified key that is part of the table definition. The primary key of a table may be one key, or may be composed of multiple keys, which is not limited in this disclosure.
For example, a student table (school number, name, gender, class), the school number uniquely identifies a record in the student table, whereby the school number may serve as the primary key of the student table. The course number can uniquely identify a record in the course table, and thus can be used as a main key word of the course table. The score table (school number, course number and score) can not uniquely identify one record in the score table by a single attribute, and the combination of the school number and the course number can uniquely identify one record in the course table, so that the combination of the school number and the course number can be used as a main key word of the score table.
If a Key is used as the primary Key of one table and the Key exists in another table, the Key is called a Foreign Key (also called a Foreign Key) of the other table. A table may have one or more foreign keys. Foreign keys can represent associations between two tables, whereby a foreign key can be used to establish an association with another table. A table having a foreign key of another table as a primary key is referred to as a primary table, and a table having the foreign key is referred to as a secondary table of the primary table.
For example, student tables (school number, name, gender, class), the school number being the primary key of the student table. The course table (course number, course name, and score), and the course number is the main key word of the course table. Score table (school number, course number, score), the combination of school number and course number is the main key of score table. The schooling number in the achievement list is not the primary key word of the achievement list, but corresponds to the schooling number in the student list, and the schooling number in the student list is the primary key word of the student list, so that the schooling number in the achievement list is called as the foreign key word of the student list. Similarly, the course number in the achievement list is the foreign keyword of the course list.
In one implementation, after storing all data of the source table in which the foreign key does not exist in the target table in the loss record table, the method further comprises: determining the total number of data in the target table and the total number of data in the lost record table; and calculating the loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
Wherein, the loss rate is the number of lost pieces divided by the total number of data pieces. And under the condition that the ETL processing process of the data is accurate, the number of the lost pieces is zero, and the total number of the data is the total number of the data in the target table. Under the condition that the ETL processing process of the data is inaccurate, if the data in the lost record table corresponding to the target table is not repeated, the number of the lost pieces is the total number of the data in the lost record table corresponding to the target table, and the total number of the data is the sum of the total number of the data in the target table and the total number of the data in the lost record table corresponding to the target table.
In the actual data processing process, the total amount of data is huge, often millions, millions or even billions of mass data, and the amount of lost or error data is small. Therefore, in the data ETL processing method of the present disclosure, the total number of data in the lost record table corresponding to the target table may be used as the number of lost pieces, and the total number of data in the target table, or the sum of the total number of data in the target table and the total number of data in the lost record table corresponding to the target table may be used as the total number of data.
According to the data ETL processing method, the loss record table which is consistent with the target table structure is established, and the data which are lost or processed inaccurately in the ETL processing process are recorded through the established loss record table, so that the total number of the data which are lost or processed inaccurately in the ETL processing process can be counted, and support is provided for calculating the loss rate of the data ETL processing.
Fig. 5 illustrates a flow chart of a data ETL processing method according to an embodiment of the present disclosure. As shown in fig. 5, the method includes steps S51 through S54.
In step S51, a source table having the primary key of the target table as the foreign key is acquired.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes: comparing the primary key of the target table with all source tables of the source database; and acquiring the source table taking the primary key of the target table as the foreign key from all the source tables of the source database.
In step S52, it is determined whether or not the data included in the foreign key in the source table is present in the target table.
In step S53, if at least one data included in the foreign key in the source table is not present in the target table, it is determined that the ETL processing is inaccurate.
As one example, the primary key of the target table is determined to be ID0, and the ETL processing policy is determined. All source tables of the source database include a Table 1 Table, a Table 2 Table, a Table 3 Table, a Table 4 Table, and a Table 5 Table. And carrying out ETL processing on the data in the source Table of the source database according to the primary key ID0 and an ETL processing strategy, and storing the data obtained after the ETL processing into a target Table, thereby establishing a target Table 0 Table. For example, data is collected from a Table 1 Table, a Table 2 Table, and a Table 3 Table in the source database, ETL processing is performed on the collected data, and the data after ETL processing is stored in a Table 0 Table in the target database.
The flow of the data ETL processing method is as follows: if the primary key of the Table 0 of the target Table is ID0, a source Table with ID0 as a foreign key is obtained from all source tables of the source database, and the source Table includes a Table 1 Table, a Table 2 Table and a Table 3 Table. The data in the source Table 1, Table 2 and Table 3 tables including the ID0 are compared one by one with the target Table 0 Table. If at least one piece of data included in the ID0 in the Table 1, Table 2, and Table 3 of the source Table is not present in the Table 0 of the target Table, it is determined that the processing procedure of the data ETL is not accurate.
In step S54, if all the data included in the foreign key in the source table exists in the target table, it is determined that the data ETL processing procedure is accurate.
The flow of the data ETL processing method is as follows: if the primary key of the Table 0 of the target Table is ID0, a source Table with ID0 as a foreign key is obtained from all source tables of the source database, and the source Table includes a Table 1 Table, a Table 2 Table and a Table 3 Table. The data included in ID0 in the source Table 1 Table, Table 2 Table, and Table 3 Table are compared one by one with the target Table 0 Table. If all the data included in the ID0 in the Table 1, Table 2 and Table 3 of the source Table are present in the Table 0 of the target Table, it is determined that the processing procedure of the data ETL is accurate.
Fig. 6 shows a flow chart of a data ETL processing method according to an embodiment of the present disclosure. As shown in fig. 6, the method includes steps S61 through S64.
In step S61, a source table having the primary key of the target table as the foreign key is acquired.
In one implementation, obtaining a source table having a primary key of a target table as a foreign key includes: comparing the primary key of the target table with all source tables of the source database; and acquiring the source table taking the primary key of the target table as the foreign key from all the source tables of the source database.
In step S62, it is determined whether or not the data included in the foreign key in the source table is present in the target table.
In step S63, if at least one data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table is stored in the missing record table, wherein the missing record table and the target table have the same structure.
In step S64, ETL processing is performed on the data in the loss record table; and comparing the result with the target table according to each result obtained after the ETL processing, if the result does not exist in the target table, storing the record corresponding to the result into the target table, and deleting or marking the record corresponding to the result in the lost record table.
Wherein the record at least comprises data before ETL processing and a result after ETL processing is carried out on the data. Since the structures of the target table and the lost record table are consistent, the record can be directly stored in the target.
If a record in the lost record table is marked or deleted, it may indicate that a result obtained after ETL processing in the record already exists in the target table.
In one implementation, performing ETL processing on data in a loss record table includes: and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table. By adopting the ETL processing strategy adopted when the target table is formed, the result of ETL processing of the data in the lost record table can be the result which should be obtained when the target table is formed, and therefore the accuracy of the finally obtained target table is improved.
According to the data ETL processing method, when the fact that the data ETL processing process is inaccurate is determined, all data in the source table do not need to be subjected to ETL processing again, the ETL processing strategy adopted by the target table is formed, the data in the lost record table is subjected to ETL processing, and for each result obtained after the ETL processing, if the result does not exist in the target table, the record corresponding to the result is stored in the target table, so that the target table with accurate data ETL processing can be obtained, operation can be simplified, data ETL processing resources can be saved, the data quality in the target database can be guaranteed, and support is provided for the service logic of a service system based on the target database.
According to the data ETL processing method, the loss record table with the structure consistent with that of the target table is established, and the data lost or processed inaccurately in the ETL processing process is recorded through the established loss record table, so that the data lost or processed inaccurately in the ETL processing process can be conveniently found. When the ETL processing process of the data is determined to be inaccurate, which data are lost or inaccurate can be determined through the data stored in the lost record table, and the ETL processing is further performed on the data in the lost record table without performing the ETL processing on all the data in the source table again, so that the operation is simplified, and the ETL processing resources of the data are saved.
Fig. 7 shows a schematic diagram of a data ETL processing method according to an embodiment of the present disclosure. As one example, the primary key of the target table is determined to be ID0, and the ETL processing policy is determined. As shown in fig. 7, all the source tables of the source database include a Table 1 Table, a Table 2 Table, a Table 3 Table, a Table 4 Table, and a Table 5 Table. And carrying out ETL processing on the data in the source Table of the source database according to the primary key ID0 and an ETL processing strategy, and storing the data obtained after the ETL processing into a target Table, thereby establishing a target Table 0 Table. For example, data is collected from a Table 1 Table, a Table 2 Table, and a Table 3 Table in the source database, ETL processing is performed on the collected data, and the data after ETL processing is stored in a Table 0 Table in the target database.
The Table 0 Table is a target Table, and the Table 0 Table includes data 1, data 3, data 4, and data 7. A Table 1 Table, a Table 2 Table and a Table 3 Table are source tables, and the Table 1 Table comprises data 1 and data 2; table 2 Table includes data 3 and data 4; table 3 Table includes data 5, data 6, and data 7. Data 5 and data 1 are repeated, and data 1 is retained after the duplication is removed in the ETL processing process.
As shown in fig. 7, the flow of the data ETL processing method of the present disclosure is specifically as follows.
And establishing a loss record Table Table 0' Table corresponding to the target Table Table 0 Table. The target Table 0 Table has the same structure as its corresponding loss record Table 0' Table. If the primary key of the Table 0 of the target Table is ID0, a source Table with ID0 as a foreign key is obtained, and the source Table includes a Table 1 Table, a Table 2 Table and a Table 3 Table. Wherein, the Table 0 Table includes data 1, data 3, data 4 and data 7. Table 1 Table includes data 1 and data 2; table 2 Table includes data 3 and data 4; table 3 Table includes data 5, data 6, and data 7.
Step 1, comparing data 1 in a Table 1 with a Table 0 in a target Table, if the data 1 exists in the Table 0 in the target Table, comparing data 2 in the Table 1 with the Table 0 in the target Table, and if the data 2 does not exist in the Table 0 in the target Table, storing the data 2 in a Table 0' in a loss record Table. The same processing as that of the data in Table 1 is performed for the data in Table 2 and Table 3, which is not described herein again.
And 2, comparing each data in the source Table Table 1, the Table 2 and the Table 3 with the target Table Table 0 to obtain a lost record Table Table 0'. For example, the loss record Table 0' Table includes data 2 and data 6.
And 3, carrying out ETL processing on the data in the loss record Table Table 0'.
And 4, comparing the result with the target Table 0 according to each result obtained after the ETL treatment, and if the result does not exist in the target Table 0, executing the following step 5.
And 5, storing the record corresponding to the result into the target Table Table 0, and deleting or marking the record corresponding to the result in the lost record Table Table 0'. The ETL processing policy is consistent with the ETL processing policy adopted when generating the target Table 0, and the ETL processing policy includes at least data extraction, data cleaning conversion, and data loading.
For example, ETL processing is performed on data 2 and data 6 in the Table of the loss record Table 0' to obtain data 2 and data 6 after ETL processing (data 2 and data 6 do not overlap). And comparing the data 2 with the target Table 0, if the data 2 does not exist in the target Table 0, storing the data 2 into the target Table 0, and deleting or marking the data 2 in the loss record Table 0'. The data 6 is compared with the target Table 0, and if the data 6 exists in the target Table 0, the data 6 in the loss record Table 0' is deleted or marked.
For another example, ETL processing is performed on data 2 and data 6 in the Table of the loss record Table 0' to obtain data 2 after ETL processing (data 2 and data 6 are repeated, and data 2 is retained after deduplication in the ETL processing process). And comparing the data 2 with the target Table 0, if the data 2 does not exist in the target Table 0, storing the data 2 into the target Table 0, and deleting or marking the data 2 in the loss record Table 0'.
As an example, table 1 shows the structure of a target table and a missing record table according to an embodiment of the present disclosure. The target table is a natural person table, and the table name is zrr. The primary key of natural person table zrr is a natural person identity id. The loss record table is a natural person loss record table and takes the table name of zrr _ lost. The natural person loss record table zrr _ lost is consistent with the structure of the natural person table zrr.
As shown in table 1, the natural person table zrr is used to store natural person information, and the natural person loss record table zrr _ lost is used to store natural person information that is lost or inaccurately processed during ETL processing. The natural person information may include natural person identity id, id type, name, phone, address, relationship type, relationship code, and the like, which is not limited in this disclosure.
TABLE 1
Natural person identity id id type Address Name (I) Telephone set Type of relationship Relationship code
id id_type address name phone relation relation_code
Table 2 shows source table 1 according to an embodiment of the present disclosure. As shown in Table 2, Source Table 1 is a natural human investment Table, which is named zrrtz. The natural human investment table zrrtz is a table in the source database. In this embodiment, the default investor and the investee are both natural persons, and thus the foreign key corresponding to the natural person investment table zrrtz and the natural person table zrr is: investor id and investor id. As shown in table 2, the natural person investment table zrrtz also includes information such as investor name, certificate type, address, gender, and telephone number.
TABLE 2
Investor id Invested name Document type Investor id Investor name Address Sex Telephone set
cerno_end name_end certype cerno_start name_start address sex tel
id1 WangSi tea 1 id4 Xiao Hong #### #### ####
id2 Wang San 1 id1 WangSi tea #### #### ####
id3 Wang Da 1 id2 Wang San #### #### ####
Table 3 shows source table 2 according to an embodiment of the present disclosure. As shown in Table 3, the source Table 2 is the tour conductor Table, and the table name is dyry. The tour guide people table dyry is a table in the source database. In general, the tour guides are all natural persons, and thus the foreign keys corresponding to the tour guide table dyry and the natural person table zrr are: an identification number. As shown in table 3, the tour guide person table dyry further includes information such as name, tour guide card number, gender, address, language name, ethnicity, academic calendar, and contact number.
TABLE 3
Tour guide card number Name (I) Sex Identity card number Address Name of language Nationality Study calendar Contact telephone
DYKH name sex sfz adress yzmc mz xl lxdh
DYKH1 Zhang San 1 id5 #### #### #### #### ####
DYKH2 Wang Wu 1 id6 #### #### #### #### ####
DYKH3 Xiao Hong 0 id4 #### #### #### #### ####
DYKH5 Xiaoming liquor 1 id7 #### #### #### #### ####
Table 4 shows source table 3 according to an embodiment of the present disclosure. As shown in Table 4, Source Table 3 is the natural human role table, named zrrrz. The natural human role table zrrrz is a table in the source database. The corresponding foreign key of the natural human job table zrrrz and the natural human table zrr is: certificate numbers. As shown in table 4, the natural person job table zrrz further includes information such as job id, job unit id, business name, home address, mobile phone, and job position.
TABLE 4
Job id Job unit id Name of an enterprise Name (I) Home address Certificate number Mobile telephone Position of employment
id pri ent_name name house_add cerno mob_tel position
P1 001 Enterprise 1 Zhang San #### id5 #### General manager
P2 002 Enterprise 2 Wang Wu #### id6 #### CEO
It will be appreciated that the data in the natural people table zrr in the target database is derived from tables with natural people information that are foreign keys to the natural people identity id from all of the source tables in the source database through some complex logic processes, such as investors and investors in the natural people investment table zrrtz, tour guides in the tour guides table dyry, and staff in the natural people job table zrrrz.
As an example, ETL processing is performed on data in the source database to obtain a target table, i.e., the natural people table zrr. Table 5 shows a target table according to an embodiment of the present disclosure. As shown in table 5, natural person table zrr includes natural person information with natural person identities id1, id2, id3, id4, id5, id 6.
TABLE 5
Natural person identity id id type Address Name (I) Telephone set Type of relationship Relationship code
id id_type address name phone relation relation_code
id5
1 #### Zhang San #### Administrative penalty 050100,
id6 1 #### Wang Wu #### Administrative penalty 100,040,000
id4 3 #### Xiao Hong #### Administrative penalty 050100,
id1 1 #### WangSi tea #### Administrative penalty 050100,
id2 1 #### Wang San #### Administrative penalty 050100,
id3 1 #### Wang Da #### Administrative penalty 050100,
After the ETL process is complete, a source table with the primary key natural person identity id of the natural person table zrr as the foreign key is determined from the source database: natural human job table zrrrz, tour guide staff table dyry and natural human investment table zrrtz. From which it is looked up whether the data is present in the natural people table zrr based on the natural people occupational table zrrz, the tourist guide table dyry, and the natural people investment table zrrtz. If all of the data is found to be present in the natural person table zrr, the data ETL process is determined to be accurate. If part of the data is found not to exist in the natural person table zrr, it is determined that the data ETL process is inaccurate.
Further, if it is found that data does not exist in the natural person table zrr, the data is considered to be lost data and is stored in the natural person loss record table zrr _ lost. This results in a loss record table, i.e., the natural person loss record table zrr _ lost. Table 6 shows a loss record table according to an embodiment of the present disclosure. As shown in table 6, the natural person loss record table zrr _ lost includes natural person information whose natural person identity id is id 7.
TABLE 6
Natural person identity id id type Address Name (I) Telephone set Type of relationship Relationship code
id id_type address name phone relation relation_code
id7
1 #### Xiaoming liquor #### Administrative penalty 050100,
The ETL processing strategy adopted by the natural person table zrr is formed, and the ETL processing is carried out on the data in the natural person loss record table zrr _ lost, so that the natural person information with the natural person identity id of 7 is obtained. Comparing the natural person information with the natural person identity id of id7 with the natural person table zrr, if the natural person information does not exist in the natural person table zrr, storing the natural person information with the natural person identity id of 7 in the natural person table zrr, and deleting or marking the natural person information with the natural person identity id of 7 in the natural person loss record table zrr _ lost.
From the above, it can be seen that: if a plurality of source tables in the source database are all directed at a natural person, only one piece of natural person information can be obtained after ETL processing. For example, for "small red" in the source table (i.e., "small red" in table 2 and "small red" in table 3), the final destination table has only one piece of natural human information of "small red".
Therefore, the method and the device can also aim at the situation that the recorded data is inaccurate after complex data acquisition and ETL processing. The complex data collection and ETL processing may refer to a case where a plurality of pieces of data are collected and combined into one or more pieces of data through ETL processing.
Fig. 8 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
an obtaining module 11, configured to obtain a source table using a primary key of a target table as a foreign key; a judging module 12, configured to judge whether all data included in the foreign key in the source table exist in the target table; a recording module 13, configured to store, if at least one piece of data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table into a missing recording table, where the missing recording table and the target table have a same structure.
Fig. 9 shows a block diagram of a data ETL processing apparatus according to an embodiment of the present disclosure. As shown in fig. 9:
in one implementation, the apparatus further comprises: a first determining module 14, configured to determine that the processing procedure of the data ETL is accurate if all the data included in the foreign key in the source table is present in the target table; or if at least one data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
In one implementation, the obtaining module 11 is further configured to: comparing the primary key of the target table with all source tables of a source database; and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
In one implementation, the apparatus further comprises: a first processing module 15, configured to perform ETL processing on data in the loss record table; a second processing module 16, configured to compare, for each result obtained after ETL processing, the result with the target table, if the result does not exist in the target table, store a record corresponding to the result in the target table, and delete or mark a record corresponding to the result in the lost record table.
In one implementation, the first processing module 15 is configured to: and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
In one implementation, the apparatus further comprises: a second determining module 17, configured to determine a total number of data in the target table and a total number of data in the lost record table; and the calculating module 18 is configured to calculate a loss rate of the ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table.
According to the data ETL processing device, the loss recording table with the structure consistent with that of the target table is established, and the data lost or processed inaccurately in the ETL processing process is recorded through the established loss recording table, so that the data lost or processed inaccurately in the ETL processing process can be conveniently found. When the ETL processing process of the data is determined to be inaccurate, which data is lost or inaccurate can be determined through the data stored in the loss record table, so that service analysis such as data loss rate or inaccuracy rate can be realized.
For the device embodiment, since it basically corresponds to the method embodiment, the relevant points may be referred to the partial description of the method embodiment, and will not be repeated here.
Fig. 10 is a block diagram illustrating a data ETL processing apparatus 900 according to an exemplary embodiment. Referring to fig. 10, the apparatus 900 may include a processor 901, a machine-readable storage medium 902 having stored thereon machine-executable instructions. The processor 901 and the machine-readable storage medium 902 may communicate via a system bus 903. Also, the processor 901 performs the asynchronous communication method described above by reading machine-executable instructions in the machine-readable storage medium 902 corresponding to the asynchronous communication logic.
The machine-readable storage medium 902 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: RAM (random Access Memory), volatile Memory, non-volatile Memory, flash Memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, dvd, etc.), or similar storage medium, or a combination thereof.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A method for processing ETL data, comprising:
acquiring a source table taking a primary key of a target table as a foreign key;
judging whether the data included by the foreign key in the source table exist in the target table;
if at least one piece of data included by the foreign key in the source table does not exist in the target table, storing the data of the foreign key which does not exist in the target table into a loss record table, wherein the structures of the loss record table and the target table are consistent;
after storing all data of the source table in which the foreign key does not exist in the target table into a loss record table, the method further comprises:
determining the total number of data in the target table and the total number of data in the lost record table;
calculating the loss rate of ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table;
carrying out ETL processing on the data in the loss record table;
and comparing the result with the target table aiming at each result obtained after the ETL processing, if the result does not exist in the target table, storing the record corresponding to the result into the target table, and deleting or marking the record corresponding to the result in the lost record table.
2. The method of claim 1, further comprising:
if all the data included by the foreign key in the source table exist in the target table, determining that the ETL processing process of the data is accurate; or
And if at least one piece of data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
3. The method of claim 1, wherein obtaining a source table having a primary key of a target table as a foreign key comprises:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
4. The method of claim 1, wherein performing ETL processing on the data in the loss record table comprises:
and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
5. A data ETL processing apparatus, comprising:
the acquisition module is used for acquiring a source table which takes the primary key of the target table as a foreign key;
the judging module is used for judging whether the data included by the foreign key words in the source table exist in the target table or not;
a recording module, configured to store, if at least one piece of data included in the foreign key in the source table does not exist in the target table, the data of the foreign key that does not exist in the target table into a missing recording table, where structures of the missing recording table and the target table are consistent;
the device further comprises:
a second determining module, configured to determine a total number of data in the target table and a total number of data in the lost record table;
a calculating module, configured to calculate a loss rate of ETL processing of the data according to the total number of the data in the target table and the total number of the data in the loss record table
The first processing module is used for carrying out ETL processing on the data in the loss record table;
and the second processing module is used for comparing each result obtained after the ETL processing with the target table, storing the record corresponding to the result into the target table if the result does not exist in the target table, and deleting or marking the record corresponding to the result in the lost record table.
6. The apparatus of claim 5, further comprising:
a first determining module, configured to determine that an ETL processing procedure of data is accurate if all data included in the foreign key in the source table is present in the target table; or if at least one data included in the foreign key in the source table does not exist in the target table, determining that the ETL processing process of the data is inaccurate.
7. The apparatus of claim 5, wherein the obtaining module is further configured to:
comparing the primary key of the target table with all source tables of a source database;
and acquiring a source table taking the primary key of the target table as a foreign key from all source tables of the source database.
8. The apparatus of claim 5, wherein the first processing module is configured to: and carrying out ETL processing on the data in the loss record table by forming an ETL processing strategy adopted by the target table.
CN201810122336.3A 2018-02-07 2018-02-07 Data ETL processing method and device Active CN108388610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810122336.3A CN108388610B (en) 2018-02-07 2018-02-07 Data ETL processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810122336.3A CN108388610B (en) 2018-02-07 2018-02-07 Data ETL processing method and device

Publications (2)

Publication Number Publication Date
CN108388610A CN108388610A (en) 2018-08-10
CN108388610B true CN108388610B (en) 2021-04-27

Family

ID=63075461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810122336.3A Active CN108388610B (en) 2018-02-07 2018-02-07 Data ETL processing method and device

Country Status (1)

Country Link
CN (1) CN108388610B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109828889A (en) * 2019-01-31 2019-05-31 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium in monitoring data library
CN111061740B (en) * 2019-12-17 2023-08-04 北京软通智慧科技有限公司 Data synchronization method, device and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU9010198A (en) * 1997-08-14 1999-03-08 Aoraki Corporation Limited Relational database coexistence in object oriented environments
US8065323B2 (en) * 2009-02-23 2011-11-22 Oracle International Corporation Offline validation of data in a database system for foreign key constraints
US8868484B2 (en) * 2010-07-08 2014-10-21 Oracle International Corporation Efficiently updating rows in a data warehouse
CN102411588A (en) * 2010-09-26 2012-04-11 金蝶软件(中国)有限公司 Comparison checking method and system of data table
CN103309888A (en) * 2012-03-14 2013-09-18 北京四维图新科技股份有限公司 Method and device for verifying data of electronic map
US9251179B2 (en) * 2012-04-12 2016-02-02 International Business Machines Corporation Managing record location lookup caching in a relational database
CN103164514B (en) * 2013-02-25 2017-05-03 用友网络科技股份有限公司 Detection device and detection method for data reference relationship
CN104679794A (en) * 2013-12-03 2015-06-03 中兴通讯股份有限公司 Data difference analysis method and device
CN107103025B (en) * 2017-01-05 2020-08-14 亚信科技(中国)有限公司 Data processing method and data processing platform

Also Published As

Publication number Publication date
CN108388610A (en) 2018-08-10

Similar Documents

Publication Publication Date Title
US10025904B2 (en) Systems and methods for managing a master patient index including duplicate record detection
JP5923307B2 (en) Assertion-based record linkage in a decentralized autonomous medical environment
US10572461B2 (en) Systems and methods for managing a master patient index including duplicate record detection
US20180144061A1 (en) Edge store designs for graph databases
MXPA04006390A (en) Real time data warehousing.
US11347719B2 (en) Multi-table data validation tool
US11636078B2 (en) Personally identifiable information storage detection by searching a metadata source
US10445370B2 (en) Compound indexes for graph databases
CN109918386A (en) A kind of data reconstruction method and device, computer readable storage medium
CN108388610B (en) Data ETL processing method and device
US11609897B2 (en) Methods and systems for improved search for data loss prevention
CN101452556A (en) Customer information processing system and method
CN112463737A (en) System and method for rapidly acquiring data aiming at multi-format data intelligent matching template
CN110019542B (en) Generation of enterprise relationship, generation of organization member database and identification of same name member
CN111091883A (en) Medical text processing method and device, storage medium and equipment
US10929441B1 (en) System and techniques for data record merging
CN104866603A (en) Calling method of metadata and metadata managing system
US20180144060A1 (en) Processing deleted edges in graph databases
US20090259659A1 (en) Identifying entities of interest
CN112612818B (en) Data processing method and device, computing equipment and storage medium
CN110851431B (en) Data processing method and device for data center station
CN110502675B (en) Voice dialing user classification method based on data analysis and related equipment
JP2002063363A (en) Insurance contract information management system and customer index generating method
Gohel et al. A commodity data cleaning system
JP2013020381A (en) Name gathering management system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant