WO2020192064A1

WO2020192064A1 - Incremental data consistency implementation method and device

Info

Publication number: WO2020192064A1
Application number: PCT/CN2019/109102
Authority: WO
Inventors: 彭虎; 傅尚强; 刘洋; 孙迁
Original assignee: 苏宁云计算有限公司; 苏宁易购集团股份有限公司
Priority date: 2019-03-28
Filing date: 2019-09-29
Publication date: 2020-10-01
Also published as: CN110046168B; CN110046168A; CA3176450A1

Abstract

An incremental data consistency implementation method and device, relating to the technical field of data warehouses. The method comprises: initializing all data of data tables having an association relationship in a service system, and loading the data into a first database to generate a plurality of full data tables (S1); synchronizing real-time data of the data tables to the plurality of full data tables and to a plurality of incremental data tables of a second database, respectively, on the basis of logs of a service database (S2); extracting all service unique identifiers in the plurality of incremental data tables, and merging said identifiers in the second database to generate an incremental identifier merged table (S3); and querying, according to the incremental identifier merged table, the plurality of full data tables to obtain the service data related to the incremental identifier merged table, and correspondingly writing the service data into a consistent incremental data table of the second database (S4). The process has essentially no impact on the normal operation of the service database, and its consumption of database resources is relatively low.

Description

Method and device for realizing consistency of incremental data

Technical field

The invention relates to the technical field of data warehouses, in particular to a method and device for realizing incremental data consistency.

Background art

The ODS (Operational Data Store) layer of a big data warehouse holds relational data tables, and a consistent incremental data table must be constructed to ensure that the incremental data across associated data tables is consistent. Take retail transaction orders as an example. Consistent incremental data tables for the order header table and each order sub-table ensure that every changed order number is present in every order incremental data table. A changed order number then never exists in some tables while being absent from others, a situation that would make the incremental tables impossible to join.

In current technology, the following methods are generally adopted to achieve the consistency of incremental data:

Method 1: Incrementally load the order header table and sub-table data from the business system into incremental data tables on the big data platform, use Hive/Spark to generate the corresponding full data tables, derive the complete set of changed order numbers from the incremental data tables, brute-force match that set against each full table, and finally generate a consistent incremental data table for each table.

Method 2: Incrementally write the changed order numbers of the order header table and each sub-table into an order-number-change intermediate table inside the business system. Then, for each order number in the intermediate table, query the business system through its database indexes to fetch the business data of the header table and each sub-table for that order number, and write it into the consistent incremental data table of the data warehouse.

The above two schemes are simple to implement, but there are certain defects and shortcomings:

For Method 1: generating a full data table with Hive requires reading and writing the full order data. Assuming 10 billion orders and 2 million new orders per day, updating 2 million records requires reading and writing 10 billion records each time, and generating the consistent incremental data table requires reading the full data table in full once more. In total the full data is read twice and written once, so resource consumption on the big data platform is high and efficiency is low.
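The cost described above can be made concrete with a little arithmetic. The following is only an illustrative calculation using the figures quoted in the text (10 billion orders, 2 million changes per day, two full reads and one full write); the function and variable names are hypothetical.

```python
# Figures taken from the text above; names are illustrative only.
TOTAL_ORDERS = 10_000_000_000   # 10 billion existing orders
DAILY_CHANGES = 2_000_000       # 2 million new or changed orders per day


def method1_daily_io(total_rows: int) -> dict:
    """Rows touched per day by Method 1: full data read twice, written once."""
    return {"rows_read": 2 * total_rows, "rows_written": total_rows}


io = method1_daily_io(TOTAL_ORDERS)
rows_touched = io["rows_read"] + io["rows_written"]
amplification = rows_touched / DAILY_CHANGES

print(f"rows touched per day: {rows_touched:,}")
print(f"rows touched per changed row: {amplification:,.0f}")
```

Under these assumptions, each changed row costs on the order of thousands of row reads and writes, which is the inefficiency the text attributes to Method 1.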

For Method 2: the business system must create an order-number-change intermediate table, and write permission on that table is required; the business system's table data is read twice; the process relies on the business system's data table indexes; and the whole process depends heavily on the business system. In addition, the extraction process may cause database locks. Especially during promotion periods, when the system is degraded, data cannot be extracted at all, the entire big data computation stalls, and analysis data cannot be produced on time.

In addition, a Hive-based data warehouse cannot support index queries by order number, and therefore cannot support order backtracking scenarios. For example, business analysis scenarios such as after-sales customer service need to join the order data corresponding to the business. The time range of those orders is wide and uncertain, perhaps within one month or more than one year back. Because Hive tables have essentially no indexing capability, this type of business analysis is difficult to implement on Hive tables.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art or related technologies. To this end, the present invention provides a method and device for achieving incremental data consistency.

The specific technical solutions provided by the embodiments of the present invention are as follows:

In the first aspect, the present invention provides a method for achieving incremental data consistency, including:

Initializing all the data of each data table that has an association relationship in the business system, and loading it into the first database to generate multiple full data tables;

Synchronizing, based on the database log of the business system, the real-time data of each data table to the multiple full data tables and to the multiple incremental data tables of the second database, respectively;

Extracting all the business unique identifiers in the multiple incremental data tables, and merging them in the second database to generate an incremental identifier merged table;

Querying, according to the incremental identifier merged table, the multiple full data tables to obtain the business data related to the incremental identifier merged table, and correspondingly writing it into the consistent incremental data table of the second database.

In a preferred embodiment, synchronizing the real-time data of each data table to the multiple full data tables and the multiple incremental data tables of the second database, based on the database log of the business system, includes:

Parsing the real-time data of each data table from the database log of the business system, and synchronizing it to a real-time data stream;

Writing the data in the real-time data stream into the multiple full data tables; and

Writing the data in the real-time data stream into the multiple incremental data tables.

In a preferred embodiment, the first database is a KV database, and the second database is a Hive database.

In a preferred embodiment, querying the multiple full data tables to obtain the business data related to the incremental identifier merged table includes:

For each order number in the incremental identifier merged table, querying the business data matching that order number in each of the multiple full data tables through the SQL query interface to obtain the query result.

In a preferred embodiment, the method further includes:

Receiving a data backtracking query instruction, querying the first database for business data associated with the data backtracking query instruction through a SQL query interface, and returning a data backtracking query result.

In the second aspect, a device for realizing incremental data consistency is provided, including:

The initialization module is used to initialize all the data of the associated data tables in the business system, and load them into the first database to generate multiple full data tables;

The real-time synchronization module is configured to synchronize the real-time data of each data table to the multiple full data tables and multiple incremental data tables of the second database based on the database log of the business system;

An identifier merging module, configured to extract all the business unique identifiers in the multiple incremental data tables, and merge them in the second database to generate an incremental identifier merged table;

A query module, configured to query the multiple full data tables, according to the incremental identifier merged table, to obtain the business data related to the incremental identifier merged table;

A writing module, configured to write the business data related to the incremental identifier merged table into the consistent incremental data table of the second database.

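In a preferred embodiment, the real-time synchronization module is specifically used for:

Parsing the real-time data of each data table from the database log of the business system, and synchronizing it to a real-time data stream;

Writing the data in the real-time data stream into the multiple full data tables; and

Writing the data in the real-time data stream into the multiple incremental data tables.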

In a preferred embodiment, the query module is specifically used for:

For each order number in the incremental identifier merged table, querying the business data matching the order number in the multiple full data tables through the SQL query interface.

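In a preferred embodiment, the query module is also used for:

Receiving a data backtracking query instruction, querying the first database for business data associated with the data backtracking query instruction through the SQL query interface, and returning a data backtracking query result.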

With the method and device for realizing incremental data consistency provided by the present invention, the real-time data of each data table in the business database is synchronized to the data warehouse by using the database log. Compared with the prior art, which reads the business system table data by creating an order-number-change intermediate table and relies heavily on the business system's data table indexes, the present invention has essentially no impact on the normal operation of the business database during data collection. Moreover, when querying the business data related to the incremental identifier merged table, the data only needs to be read in full once, which consumes fewer database resources, and the consistent incremental data table obtained by writing the query result ensures that the incremental data across the data tables is consistent. In addition, because data analysis on the consistent incremental data table supports analysis based on incremental data, a daily data analysis scenario only needs to retrieve the current day's data in each table to complete all order-related analysis; no historical partition data needs to be retrieved, so database resource consumption is small.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present invention, the following will briefly introduce the accompanying drawings used in the description of the embodiments. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.

Figure 1 shows a flow chart of a method for achieving incremental data consistency;

Figure 2 shows the implementation flow chart of the consistency of order incremental data of the operational data warehouse ODS;

Figure 3 shows a block diagram of a device for realizing incremental data consistency.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are merely some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

Unless the context clearly requires otherwise, the words "comprising", "including" and similar words throughout the specification and claims should be interpreted in an inclusive sense rather than an exclusive or exhaustive sense; in other words, in the sense of "including but not limited to".

In the description of the present invention, it should be understood that the terms "first", "second", etc. are only used for descriptive purposes and cannot be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "plurality" means two or more.

Embodiment one

The embodiment of the present invention provides a method for achieving incremental data consistency, which can be applied to a data warehouse (for example, an operational data warehouse ODS). As shown in FIG. 1, the method includes the steps:

S1: Initialize all the data of each data table having an association relationship in the business system, and load it into the first database to generate multiple full data tables.

In this embodiment, the data tables having an association relationship may be in a one-to-one or a one-to-many relationship. For tables in a one-to-many relationship, one data table serves as the parent table and the other data tables serve as child tables. For example, in a retail transaction order scenario, the order header table is the parent table, and the order product table, order payment table, and order extension table are all child tables.

Specifically, all data of the data table is extracted from the business database corresponding to the business system based on the ETL tool, and after cleaning and conversion, it is loaded into the first database to form multiple full data tables corresponding to each data table.

For example, all data of the order header table, order product table, order payment table, and order extension table in the business database can be loaded into the first database to generate a corresponding full data table for each of the order header table, order product table, order payment table, and order extension table.

The first database may be a KV (Key-Value) database, that is, a database that stores data as key-value pairs. Data is stored and accessed with the key as the identifier, so the value corresponding to a key can be looked up quickly, and the database offers good read and write performance to external callers. Redis is a representative key-value database.
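Step S1 can be pictured with a small sketch. This is not the patented implementation: a plain Python dict stands in for the KV database, the sample rows stand in for what an ETL tool would extract, and the table names, column names, and `PRIMARY_KEYS` mapping are assumptions made for illustration. Each full data table is keyed by order number so that all rows of one order can be fetched with a single key lookup.

```python
# Hypothetical primary-key columns per source table (assumption for this sketch).
PRIMARY_KEYS = {
    "order_header": ("order_no",),
    "order_product": ("order_no", "line_no"),
}

# Hypothetical rows as an ETL tool might extract them from the business database.
source_tables = {
    "order_header": [
        {"order_no": "A001", "status": "PAID"},
        {"order_no": "A002", "status": "NEW"},
    ],
    "order_product": [
        {"order_no": "A001", "line_no": 1, "sku": "S-1"},
        {"order_no": "A001", "line_no": 2, "sku": "S-2"},
        {"order_no": "A002", "line_no": 1, "sku": "S-3"},
    ],
}


def initial_load(tables):
    """Build one full data table per source table.

    Layout: full_tables[table][order_no][primary_key] -> row.
    A dict stands in for the KV database here.
    """
    full_tables = {}
    for name, rows in tables.items():
        kv = {}
        for row in rows:
            pk = tuple(row[col] for col in PRIMARY_KEYS[name])
            kv.setdefault(row["order_no"], {})[pk] = row
        full_tables[name] = kv
    return full_tables


full_tables = initial_load(source_tables)
print(list(full_tables["order_product"]["A001"].values()))
```

Keying by order number first and primary key second is one possible layout; it keeps the one-to-many child rows of an order together under the parent's identifier.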

S2: Based on the database log of the business system, the real-time data of each data table is synchronized to multiple full data tables and multiple incremental data tables of the second database.

Among them, real-time data is newly added or newly modified data in each data table.

Among them, the second database is the Hive database. Hive database is a data warehouse tool based on Hadoop, which can map structured data files into a database table, and provides simple SQL query functions, which can convert SQL statements into MapReduce tasks for execution. Its advantages are low learning costs, simple MapReduce statistics can be quickly realized through SQL-like statements, no need to develop special MapReduce applications, and it is very suitable for statistical analysis of data warehouses.

Specifically, parse the real-time data of each data table from the database log of the business system, and synchronize the real-time data to the real-time data stream;

Put the data in the real-time data stream into multiple full data tables;

And write the data in the real-time data stream into multiple incremental data tables in the second database.

Among them, the database log records operation information on the business database. The database log can specifically be a Binlog database log, and the Binlog database log can be parsed regularly through the Binlog parser.

In this embodiment, the database log can be obtained when the database log is updated, where the update includes adding, deleting, or modifying fields of the data table of the business database.

It should be noted that the embodiment of the present invention does not specifically limit the execution order of the step of writing the data in the real-time data stream into the multiple full data tables and the step of writing that data into the multiple incremental data tables; preferably, the two steps are executed simultaneously.
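The fan-out in step S2 can be sketched as follows. The event shape (`table`, `order_no`, `pk`, `row`) is a hypothetical format for already-parsed log records, not the format of any real Binlog parser; dicts and lists stand in for the KV database and the Hive incremental tables. The sketch shows that the full table is upserted in place while the incremental table only accumulates changes.

```python
from collections import defaultdict


def apply_change_events(events, full_tables, incremental_tables):
    """Write each parsed log event to both sinks."""
    for ev in events:
        table, order_no, pk, row = ev["table"], ev["order_no"], ev["pk"], ev["row"]
        # Sink 1: upsert into the full data table (stand-in for the KV database).
        full_tables[table].setdefault(order_no, {})[pk] = row
        # Sink 2: append to the incremental data table (stand-in for a Hive table).
        incremental_tables[table].append(row)


full_tables = defaultdict(dict)
incremental_tables = defaultdict(list)

# Hypothetical stream of parsed change events: an insert followed by an update.
events = [
    {"table": "order_header", "order_no": "A001", "pk": ("A001",),
     "row": {"order_no": "A001", "status": "NEW"}},
    {"table": "order_header", "order_no": "A001", "pk": ("A001",),
     "row": {"order_no": "A001", "status": "PAID"}},
    {"table": "order_product", "order_no": "A003", "pk": ("A003", 1),
     "row": {"order_no": "A003", "line_no": 1, "sku": "S-9"}},
]
apply_change_events(events, full_tables, incremental_tables)

print(full_tables["order_header"]["A001"])      # latest state only
print(len(incremental_tables["order_header"]))  # every change is kept
```

For simplicity the two writes happen one after the other inside the loop; the text prefers executing them simultaneously, which a real deployment would do with two independent stream consumers.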

S3: Extract all service unique identifiers in multiple incremental data tables, and merge them in the second database to generate an incremental identifier merged table.

Among them, the business unique identifier can uniquely identify a business record in the database table. In the order application scenario, the business unique identifier is the order number.

Specifically, all the business unique identifiers in the multiple incremental data tables are extracted, merged, and deduplicated to generate the incremental identifier merged table.

In this embodiment, all business unique identifiers can be merged into one set, duplicate identifiers eliminated, and the deduplicated identifiers stored in the Hive database as the incremental identifier merged table.
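The merge-and-deduplicate step amounts to a set union over the identifier column of every incremental table. A minimal sketch, assuming each incremental table is a list of row dicts and the identifier column is named `order_no` (both assumptions for illustration):

```python
def merge_incremental_ids(incremental_tables, id_column="order_no"):
    """Union the identifier column of every incremental table and deduplicate."""
    ids = set()
    for rows in incremental_tables.values():
        ids.update(row[id_column] for row in rows)
    # Sorted only to make the result deterministic for display.
    return sorted(ids)


# Hypothetical incremental tables: each table saw a different subset of orders change.
incremental_tables = {
    "order_header": [{"order_no": "A001", "status": "PAID"}],
    "order_product": [
        {"order_no": "A001", "line_no": 1, "sku": "S-1"},
        {"order_no": "A003", "line_no": 1, "sku": "S-9"},
    ],
    "order_payment": [{"order_no": "A002", "amount": 25}],
}

merged_ids = merge_incremental_ids(incremental_tables)
print(merged_ids)
```

Note that no single incremental table contains all three order numbers; only the merged table does, which is why it drives the lookup in the next step.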

S4: According to the incremental identifier merged table, query the multiple full data tables to obtain the business data related to the incremental identifier merged table, and correspondingly write it into the consistent incremental data table of the second database.

Specifically, the process may include:

For each order number in the incremental identifier merged table, the business data matching the order number in the multiple full data tables is queried through the SQL query interface.

In the specific implementation process, the KV database query can be integrated into SQL by developing the SQL query interface to reduce the difficulty of development and realize the real-time association of the Hive database and the KV database through SQL.

Since the Hive database and the KV database can be joined through SQL, the full data tables in the KV database support fast retrieval by order number and provide index-style lookup, without increasing the pressure on the Hadoop platform or the business system.
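Step S4 can be sketched as a keyed lookup driven by the merged identifier table. The SQL query interface described in the text is not reproduced here; direct dict lookups stand in for it, and the table layout `full_tables[table][order_no][primary_key] -> row` is an assumption of this sketch.

```python
def build_consistent_increment(merged_ids, full_tables):
    """For every changed order number, fetch its rows from every full table."""
    consistent = {name: [] for name in full_tables}
    for order_no in merged_ids:
        for name, kv in full_tables.items():
            # Point lookup by order number; no scan of the full table is needed.
            consistent[name].extend(kv.get(order_no, {}).values())
    return consistent


# Hypothetical full tables (stand-in for the KV database).
full_tables = {
    "order_header": {
        "A001": {("A001",): {"order_no": "A001", "status": "PAID"}},
        "A002": {("A002",): {"order_no": "A002", "status": "NEW"}},
    },
    "order_product": {
        "A001": {
            ("A001", 1): {"order_no": "A001", "line_no": 1, "sku": "S-1"},
            ("A001", 2): {"order_no": "A001", "line_no": 2, "sku": "S-2"},
        },
        "A002": {("A002", 1): {"order_no": "A002", "line_no": 1, "sku": "S-3"}},
    },
}

# Only order A001 changed today, and possibly only in the header table.
consistent = build_consistent_increment(["A001"], full_tables)
print(consistent)
```

Even if only the header row of A001 changed, the product rows of A001 also land in the consistent incremental table, so the daily tables can be joined on order number without touching history.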

Further, in addition to the above steps, the method provided in the embodiment of the present invention may further include:

Based on the consistent incremental data table in the second database, analyze indicators, dimensions, and attributes related to the business theme, where the business theme can be order placement, order fulfillment, or returns and exchanges.

Since data analysis on the consistent incremental data table supports analysis based on incremental data, a daily data analysis scenario only needs to retrieve the current day's data in each table to complete all order-related analysis, including the order wide table, the payment wide table, and the return and exchange analysis wide table. No historical partition data needs to be retrieved during the analysis, so database resource consumption is small.

The method may further include: receiving a data backtracking query instruction, querying the first database for the business data associated with the data backtracking query instruction through the SQL query interface, and returning the data backtracking query result.

For example, take customer service complaint analysis. An order that a customer complains about today may have been placed a long time ago. Hive tables make efficient, fast retrieval over a long history difficult, whereas searching the full data tables for the customer's order information through the SQL query interface of the KV database enables backtracking queries of business data. This effectively serves business scenarios such as after-sales customer service, which need to retrieve past order data and use the related orders as an analysis dimension, with high retrieval performance and low database resource consumption.
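The complaint scenario can be sketched as a join between today's complaints and the full order header table, resolved by point lookups rather than a scan over historical partitions. The complaint records, field names, and dict-based KV stand-in are all hypothetical.

```python
def enrich_complaints(complaints, full_tables):
    """Attach the order header to each complaint via a lookup by order number."""
    enriched = []
    for complaint in complaints:
        header_rows = full_tables["order_header"].get(complaint["order_no"], {})
        # The header table holds at most one row per order; None if unknown.
        header = next(iter(header_rows.values()), None)
        enriched.append({**complaint, "order": header})
    return enriched


# Hypothetical full table containing an order placed long before the complaint.
full_tables = {
    "order_header": {
        "A001": {("A001",): {"order_no": "A001", "created": "2017-06-18"}},
    }
}
complaints = [
    {"complaint_id": 1, "order_no": "A001"},
    {"complaint_id": 2, "order_no": "Z999"},  # order not present in the store
]

enriched = enrich_complaints(complaints, full_tables)
print(enriched)
```

The age of the order does not affect the lookup cost, which is the property the text contrasts with partition-based retrieval in Hive.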

The following takes an order scenario as an example to further illustrate the method for realizing incremental data consistency provided in the first embodiment of the present invention. Figure 2 shows the implementation process of order incremental data consistency in the ODS of a data warehouse, which includes:

Step 1: Initialize all the data of the parent table and each child table in the business system and load it into the KV database to form multiple full data tables;

Step 2: Synchronize data from the business system to the data stream in real time through the database log;

Step 3: Map the real-time data stream data to the incremental data table of the Hive library;

Step 4: Write the real-time data stream data into the full data table in the KV library;

Step 5: Extract all the order numbers from each incremental data table and merge them into the incremental order number merged table of the Hive database;

Step 6: Query and call the data of each full data table through the SQL query interface according to the incremental order number merge table, and write the query result into the consistent incremental data table of the Hive library.

Through the above steps, the consistent incremental data table of the Hive library of the data warehouse ODS and the full data table of the KV library can be generated.
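Steps 1 to 6 can be strung together in one compact, in-memory sketch. Everything here is a stand-in: dicts replace the KV database, lists replace Hive tables, and the tuple-shaped change events replace parsed database log records. The table and column names are assumptions for illustration.

```python
def run_pipeline(initial_rows, change_events):
    """initial_rows: {table: {pk: row}}; change_events: [(table, pk, row), ...]."""
    # Step 1: initial load into the full data tables, keyed by order number.
    full = {t: {} for t in initial_rows}
    for t, rows in initial_rows.items():
        for pk, row in rows.items():
            full[t].setdefault(row["order_no"], {})[pk] = row

    # Steps 2 to 4: replay change events into incremental and full tables.
    incr = {t: [] for t in initial_rows}
    for t, pk, row in change_events:
        incr[t].append(row)
        full[t].setdefault(row["order_no"], {})[pk] = row

    # Step 5: merge and deduplicate the changed order numbers.
    merged = sorted({r["order_no"] for rows in incr.values() for r in rows})

    # Step 6: look up every changed order in every full table.
    consistent = {
        t: [r for no in merged for r in full[t].get(no, {}).values()]
        for t in full
    }
    return incr, merged, consistent


initial_rows = {
    "order_header": {("A001",): {"order_no": "A001", "status": "NEW"}},
    "order_payment": {("A001", 1): {"order_no": "A001", "amount": 10}},
}
# Only the header of A001 changes; the payment table sees no change today.
change_events = [
    ("order_header", ("A001",), {"order_no": "A001", "status": "PAID"}),
]

incr, merged, consistent = run_pipeline(initial_rows, change_events)
print(incr["order_payment"])        # raw increment: payment row missing
print(consistent["order_payment"])  # consistent increment: payment row present
```

The raw incremental payment table is empty for A001, while the consistent incremental payment table contains its row, which is the inconsistency the method is designed to remove.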

With the method for realizing incremental data consistency provided by the present invention, the real-time data of each data table in the business database is synchronized to the data warehouse by using the database log. Compared with the prior art, which reads the business system table data by creating an order-number-change intermediate table and relies heavily on the business system's data table indexes, the present invention has essentially no impact on the normal operation of the business database during data collection. Moreover, when querying the business data related to the incremental identifier merged table, the data only needs to be read in full once, which consumes fewer database resources, and the consistent incremental data table obtained by writing the query result ensures that the incremental data across the data tables is consistent. In addition, because data analysis on the consistent incremental data table supports analysis based on incremental data, a daily data analysis scenario only needs to retrieve the current day's data in each table to complete all order-related analysis; no historical partition data needs to be retrieved, so database resource consumption is small.

Embodiment two

The embodiment of the present invention provides an incremental data consistency realization device. As shown in FIG. 3, the device includes:

The initialization module 31 is used to initialize all the data of each data table having an association relationship in the business system, and load it into the first database to generate multiple full data tables;

The real-time synchronization module 32 is used to synchronize the real-time data of each data table to multiple full data tables and multiple incremental data tables of the second database based on the log of the business database;

The identifier merging module 33 is used to extract all the business unique identifiers in the multiple incremental data tables, and merge them in the second database to generate an incremental identifier merged table;

The query module 34 is configured to query the multiple full data tables, according to the incremental identifier merged table, to obtain the business data related to the incremental identifier merged table; and

The writing module 35 is configured to write the business data related to the incremental identifier merged table into the consistent incremental data table of the second database.

Further, the real-time synchronization module 32 is specifically used for:

Analyze the real-time data of each data table from the database log of the business system, and synchronize the real-time data to the real-time data stream;

Put the data in the real-time data stream into multiple full data tables; and

Write the data in the real-time data stream into multiple incremental data tables.

Further, the first database is a KV database, and the second database is a Hive database.

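Further, the query module 34 is specifically used for:

For each order number in the incremental identifier merged table, querying the business data matching the order number in the multiple full data tables through the SQL query interface.

Further, the query module 34 is also used for:

Receiving a data backtracking query instruction, querying the first database for business data associated with the data backtracking query instruction through the SQL query interface, and returning a data backtracking query result.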

With the incremental data consistency realization device provided by the present invention, the real-time data of each data table in the business database is synchronized to the data warehouse by using the database log. Compared with the prior art, which reads the business system table data by creating an order-number-change intermediate table and relies heavily on the business system's data table indexes, the present invention has essentially no impact on the normal operation of the business database during data collection. Moreover, when querying the business data related to the incremental identifier merged table, the data only needs to be read in full once, which consumes fewer database resources, and the consistent incremental data table obtained by writing the query result ensures that the incremental data across the data tables is consistent. In addition, because data analysis on the consistent incremental data table supports analysis based on incremental data, a daily data analysis scenario only needs to retrieve the current day's data in each table to complete all order-related analysis; no historical partition data needs to be retrieved, so database resource consumption is small.

All the above-mentioned optional technical solutions can be combined in any way to form an optional embodiment of the present invention, which will not be repeated here.

It should be noted that, when the incremental data consistency realization device provided in the above embodiment executes the incremental data consistency realization method, the division into the above functional modules is used only for illustration. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment and the method embodiment provided above belong to the same concept; for the specific implementation process, refer to the method embodiment, which will not be repeated here.

Those of ordinary skill in the art can understand that all or part of the steps in the foregoing embodiments can be implemented by hardware, or by a program instructing relevant hardware to be completed. The program can be stored in a computer-readable storage medium. The storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for realizing incremental data consistency, characterized in that it comprises:

initializing all the data of each data table that has an association relationship in a business system, and loading it into a first database to generate multiple full data tables;

synchronizing, based on the database log of the business system, the real-time data of each data table to the multiple full data tables and to multiple incremental data tables of a second database, respectively;

extracting all the business unique identifiers in the multiple incremental data tables, and merging them in the second database to generate an incremental identifier merged table; and

querying, according to the incremental identifier merged table, the multiple full data tables to obtain the business data related to the incremental identifier merged table, and correspondingly writing it into a consistent incremental data table of the second database.

2. The method according to claim 1, characterized in that synchronizing, based on the database log of the business system, the real-time data of each data table to the multiple full data tables and the multiple incremental data tables of the second database comprises:

parsing the real-time data of each data table from the database log of the business system, and synchronizing it to a real-time data stream;

writing the data in the real-time data stream into the multiple full data tables; and

writing the data in the real-time data stream into the multiple incremental data tables.

3. The method according to claim 1 or 2, wherein the first database is a KV database, and the second database is a Hive database.

4. The method according to claim 3, wherein querying, according to the incremental identifier merged table, the multiple full data tables to obtain the business data related to the incremental identifier merged table comprises:

for each order number in the incremental identifier merged table, querying the business data matching the order number in the multiple full data tables through an SQL query interface.

5. The method according to claim 3, wherein the method further comprises:

receiving a data backtracking query instruction, querying the first database for business data associated with the data backtracking query instruction through an SQL query interface, and returning a data backtracking query result.

6. A device for realizing incremental data consistency, characterized in that it comprises:

an initialization module, configured to initialize all the data of each data table that has an association relationship in a business system, and load it into a first database to generate multiple full data tables;

a real-time synchronization module, configured to synchronize, based on the database log of the business system, the real-time data of each data table to the multiple full data tables and to multiple incremental data tables of a second database;

an identifier merging module, configured to extract all the business unique identifiers in the multiple incremental data tables, and merge them in the second database to generate an incremental identifier merged table;

a query module, configured to query the multiple full data tables, according to the incremental identifier merged table, to obtain the business data related to the incremental identifier merged table; and

a writing module, configured to write the business data related to the incremental identifier merged table into a consistent incremental data table of the second database.

7. The device according to claim 6, wherein the real-time synchronization module is specifically configured to:

parse the real-time data of each data table from the database log of the business system, and synchronize it to a real-time data stream;

write the data in the real-time data stream into the multiple full data tables; and

write the data in the real-time data stream into the multiple incremental data tables.

8. The device according to claim 6 or 7, wherein the first database is a KV database, and the second database is a Hive database.

9. The device according to claim 8, wherein the query module is specifically configured to:

for each order number in the incremental identifier merged table, query the business data matching the order number in the multiple full data tables through an SQL query interface.

10. The device according to claim 8, wherein the query module is further configured to:

receive a data backtracking query instruction, query the first database for business data associated with the data backtracking query instruction through an SQL query interface, and return a data backtracking query result.