CN110879812B

CN110879812B - Spark-based data synchronization method in e-commerce platform

Info

Publication number: CN110879812B
Application number: CN201911138971.1A
Authority: CN
Inventors: 张秀超
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2019-11-20
Filing date: 2019-11-20
Publication date: 2023-06-20
Anticipated expiration: 2039-11-20
Also published as: CN110879812A

Abstract

The invention provides a Spark-based data synchronization method in an e-commerce platform and belongs to the technical field of data processing. MySQL serves as the relational source database of the e-commerce platform and generates incremental data in real time. An ETL tool reads the MySQL binlog to obtain the incremental data, tags it, and stores it in HBase. A synchronization rule is set, and the tagged data is synchronized from HBase to Hive via Spark. This solves the problem of synchronizing business data among multiple systems in the e-commerce platform.

Description

Spark-based data synchronization method in e-commerce platform

Technical Field

The invention relates to data processing technology, and in particular to a Spark-based data synchronization method in an e-commerce platform.

Background

In current internet e-commerce platforms, multiple systems are inevitably involved, and data synchronization among them is particularly important. When departments have different requirements for business data, for example analysis or report display, the data needs to be migrated from the current business database to the corresponding target database. For data analysis, data is typically extracted into the target database through an ETL tool, but big data platform data lakes such as Hive do not handle delete operations well.

Existing approaches use the increments of a DB2 database for incremental synchronization. Their synchronization strategy is too rigid, with no design for elastic scaling, and their remedies for data synchronization failures are insufficient.

When data is synchronized from an e-commerce platform to a big data platform, historical data may be updated or deleted. If Hive is chosen as the sole storage component, the Hive transactional table mechanism must be enabled. The disadvantages are that Hive transactional tables perform poorly on update (or delete) operations and that Spark cannot be used to read transactional table data for computation. If HBase is chosen as the sole storage component, the advantage is that HBase supports updates (or deletes), but the disadvantage is that Spark computations that read HBase table data perform far worse than those on Hive and cannot meet the time requirements of summary calculations.

Disclosure of Invention

To solve the above technical problems, the invention provides a Spark-based data synchronization method in an e-commerce platform. The method synchronizes business data from the relational database of the e-commerce platform to a big data platform. It is mainly used for synchronizing massive data, is suitable for systems with high data accuracy requirements, and supplies massive data for subsequent multi-dimensional summary calculations on the big data platform. It solves the problem of synchronizing business data among multiple systems in the e-commerce platform.

The invention can synchronize the incremental data generated in real time every day in a business database such as MySQL in the e-commerce platform into a big data platform, where the data is then processed, extracted, summarized and calculated.

The technical scheme of the invention is as follows:

A Spark-based data synchronization method in an e-commerce platform, in which HBase is used as an incremental table to store intermediate incremental data, Hive stores the full data synchronized up to and including yesterday, the incremental data in the HBase database is synchronized to the Hive database through Spark, and business summary calculation and daily settlement are then carried out.

Further,

the components involved are MySQL, NiFi, HBase, Hive and Spark.

MySQL serves as the relational source database of the e-commerce platform and generates incremental data in real time. An ETL tool reads the MySQL binlog to obtain the incremental data, tags it, and stores it in HBase. A synchronization rule is set, and the tagged data is synchronized from HBase to Hive via Spark.

The method comprises the following specific steps:

1) Obtaining incremental data: NiFi extracts data from MySQL, writes the data into the HBase table, and adds the "OPT_TIME" and "OPT_TYPE" fields to each record;

2) Setting a synchronization rule: when HBase incremental data is synchronized to Hive, if a partition involved in the Hive table has updates, the partition must be truncated and the latest data inserted into it; key data from the last three days is synchronized during the daytime, data from the last seven days is synchronized at night, and other data is synchronized on the weekend;

3) Spark synchronizes the incremental data of HBase to Hive.
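
The synchronization rule in step 2) can be illustrated with a minimal sketch that picks which pending Hive partitions a given run should handle. The function name `partitions_to_sync`, the slot labels and the reading of "other data" as everything older than the seven-day window are assumptions made for illustration; the text does not fix these details, and the key/non-key distinction for daytime data is left out here.

```python
from datetime import date, timedelta

# Window sizes taken from "last three days" and "last seven days" in the rule.
WINDOW_DAYS = {"daytime": 3, "night": 7}


def partitions_to_sync(pending, today, slot):
    """Return the pending partition dates that a run in `slot` should handle.

    pending: iterable of partition dates that have incremental changes.
    slot:    "daytime", "night" or "weekend" (labels are hypothetical).
    """
    if slot == "weekend":
        # Assumption: "other data" is everything outside the seven-day window.
        cutoff = today - timedelta(days=WINDOW_DAYS["night"])
        return sorted(d for d in pending if d <= cutoff)
    cutoff = today - timedelta(days=WINDOW_DAYS[slot])
    # Keep partitions newer than the cutoff, up to and including today.
    return sorted(d for d in pending if cutoff < d <= today)
```

Under these assumptions, a daytime run only rewrites the most recent partitions, which keeps the truncate-and-insert work short while the business is busy.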

Further,

"OPT_TIME" represents the update time in the format "yyyyMMddHHmmss";

"OPT_TYPE" indicates the data update type, which is one of three: "INSERT", "UPDATE" and "DELETE";

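As an illustration of the two tag fields, the sketch below adds them to a record before it is written to the incremental table. It is not the NiFi flow itself: the helper `tag_record` is hypothetical, and the time pattern is read as a four-digit year followed by month, day, hour, minute and second.

```python
from datetime import datetime

# The three update types named in the text.
OPT_TYPES = ("INSERT", "UPDATE", "DELETE")


def tag_record(row, opt_type, when):
    """Return a copy of `row` with the OPT_TIME and OPT_TYPE tag fields added."""
    if opt_type not in OPT_TYPES:
        raise ValueError(f"unknown OPT_TYPE: {opt_type!r}")
    tagged = dict(row)  # do not mutate the caller's record
    # Assumed reading of the pattern: yyyy MM dd HH mm ss, no separators.
    tagged["OPT_TIME"] = when.strftime("%Y%m%d%H%M%S")
    tagged["OPT_TYPE"] = opt_type
    return tagged
```

Because the timestamp is fixed-width and zero-padded, plain string comparison on "OPT_TIME" orders records chronologically.
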
Further,

Spark reads the incremental data increDF according to the "OPT_TIME" field and splits it into the deleteDF, updateDF and insertDF datasets according to the "OPT_TYPE" field.
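
The read-and-split step can be sketched in plain Python, with lists of dictionaries standing in for Spark DataFrames. The function name `split_increment` and the half-open time range are assumptions for illustration; the text only states that the read is driven by "OPT_TIME" and the split by "OPT_TYPE".

```python
def split_increment(records, start, end):
    """Select records with start <= OPT_TIME < end and split them by OPT_TYPE.

    Returns (incre, deletes, updates, inserts), mirroring the roles of the
    increment, delete, update and insert datasets named in the text.
    """
    # Fixed-width timestamps compare correctly as strings.
    incre = [r for r in records if start <= r["OPT_TIME"] < end]
    buckets = {"DELETE": [], "UPDATE": [], "INSERT": []}
    for r in incre:
        buckets[r["OPT_TYPE"]].append(r)
    return incre, buckets["DELETE"], buckets["UPDATE"], buckets["INSERT"]
```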

Further,

the list of partitions affected by the updates, partitionLists, is obtained from increDF, and Spark reads the data in those Hive partitions as hiveDF;

the records of deleteDF and updateDF are removed from hiveDF according to the composite primary key, and hiveDF is then merged with updateDF and insertDF.

Further, the Hive table partitions listed in partitionLists are truncated, and hiveDF is then inserted into Hive;

finally, the increDF data in the HBase table is deleted according to the primary key.
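
The partition lookup and merge described above can be sketched as follows, again with lists of dictionaries standing in for DataFrames. The helper names, the partition column and the key columns are hypothetical. The sketch also assumes each key appears at most once in the increment, since the text does not say how repeated changes to one record are resolved.

```python
def composite_key(row, key_fields):
    """Build the composite primary key of a row as a tuple."""
    return tuple(row[f] for f in key_fields)


def affected_partitions(incre_rows, partition_field):
    """List the distinct partition values touched by the increment."""
    return sorted({r[partition_field] for r in incre_rows})


def merge_rows(hive_rows, delete_rows, update_rows, insert_rows, key_fields):
    """Drop deleted and outdated rows, then append updated and new rows."""
    stale = {composite_key(r, key_fields) for r in delete_rows + update_rows}
    kept = [r for r in hive_rows if composite_key(r, key_fields) not in stale]
    return kept + update_rows + insert_rows
```

The merged result is what would be written back after the affected partitions are truncated, so a failed run leaves the increment in place and can be repeated.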

The data synchronization can run on its own, without depending on other systems.

Data in the relational database is synchronized into the big data platform, and data from multiple systems of the e-commerce platform can be brought into the big data platform through this system.

Data synchronization uses HBase as an intermediate store, and a Spark program is invoked according to the configured synchronization strategy, so that data is synchronized to the big data platform quickly and efficiently.

The invention has the following beneficial effects:

Data in the relational database of the e-commerce platform is synchronized to Hive on the big data platform for summary calculation, which ensures data accuracy while achieving higher synchronization performance and efficiency.

Drawings

FIG. 1 is a schematic flow chart of synchronizing e-commerce platform data to the big data platform.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.

The method involves the components MySQL, NiFi, HBase, Hive and Spark.

MySQL: the business database of the e-commerce platform.

NiFi: data extraction and write-back.

HBase: holds the incremental table and stores intermediate incremental data.

Hive: stores the full data up to and including yesterday.

Spark: synchronizes incremental data from the HBase database to the Hive database.

The method comprises the following specific steps:

1. Acquiring incremental data. NiFi extracts data from MySQL, writes it to the HBase table, and adds the "OPT_TIME" and "OPT_TYPE" fields to each record. "OPT_TIME" indicates the update time in the format "yyyyMMddHHmmss", for example "201910180423"; "OPT_TYPE" indicates the data update type, which is one of three: "INSERT", "UPDATE" and "DELETE".

2. Setting a synchronization rule. With the current approach, when HBase incremental data is synchronized into Hive, any partition of the Hive table that has updates must first be truncated and then have the latest data inserted. Because data submissions in the current business can be delayed, synchronizing the Hive partitions takes too long each time. By setting a synchronization rule, key data from the last three days can be synchronized during the daytime, data from the last seven days can be synchronized at night, and the remaining data can be synchronized on the weekend when the business is not busy, which ensures data consistency.

3. Spark synchronizes the incremental data of HBase to Hive. Specifically, Spark reads the incremental data increDF according to the "OPT_TIME" field and splits it into the deleteDF, updateDF and insertDF datasets according to the "OPT_TYPE" field. The list of partitions affected by the updates, partitionLists, is obtained from increDF, and Spark reads the data in those Hive partitions as hiveDF. The records of deleteDF and updateDF are removed from hiveDF according to the composite primary key, and hiveDF is then merged with updateDF and insertDF. The Hive table partitions listed in partitionLists are truncated, and hiveDF is then inserted into Hive. Finally, the increDF data is deleted from the HBase table according to the primary key.

The method synchronizes business data from the relational database of the e-commerce platform into a big data platform. It is mainly used for synchronizing massive data, is suitable for systems with high data accuracy requirements, and supplies massive data for subsequent multi-dimensional summary calculations on the big data platform. It solves the problem of synchronizing business data among multiple systems in the e-commerce platform.

The foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A spark-based data synchronization method in an e-commerce platform is characterized in that,

HBase is used as an incremental table to store intermediate incremental data, Hive stores the full data synchronized up to and including yesterday, the incremental data in the HBase database is synchronized to the Hive database through Spark, and business summary calculation and daily settlement are then carried out;

the method comprises the following specific steps:

2) Setting a synchronization rule: when HBase incremental data is synchronized to Hive, if a partition involved in the Hive table has updates, the partition must be truncated and the latest data inserted into it; key data from the last three days is synchronized during the daytime, data from the last seven days is synchronized at night, and other data is synchronized on the weekend;

3) Spark synchronizes the incremental data of HBase to Hive;

"OPT_TIME" represents the update time in the format "yyyyMMddHHmmss";

Spark reads the incremental data increDF according to the "OPT_TIME" field and splits it into the deleteDF, updateDF and insertDF datasets according to the "OPT_TYPE" field;

the list of partitions affected by the updates, partitionLists, is obtained from increDF, and Spark reads the data in those Hive partitions as hiveDF;

the records of deleteDF and updateDF are removed from hiveDF according to the composite primary key, and hiveDF is then merged with updateDF and insertDF;

the Hive table partitions listed in partitionLists are truncated, and hiveDF is then inserted into Hive;

2. The method of claim 1, characterized in that,

MySQL serves as the relational source database of the e-commerce platform and generates incremental data in real time, and an ETL tool reads the MySQL binlog to obtain the incremental data, tags it, and stores it in HBase.

3. The method of claim 2, characterized in that,

a synchronization rule is set, and the tagged data is synchronized from HBase to Hive through Spark.