CN110879812B - Spark-based data synchronization method in e-commerce platform - Google Patents

Info

Publication number: CN110879812B
Authority: CN (China)
Prior art keywords: data, hive, hbase, spark, opt
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201911138971.1A
Other languages: Chinese (zh)
Other versions: CN110879812A (en)
Inventor: 张秀超
Current Assignee: Inspur Software Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Inspur Software Co Ltd
Priority date: 2019-11-20 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2019-11-20
Publication date: 2023-06-20
Application filed by Inspur Software Co Ltd
Priority to CN201911138971.1A
Publication of CN110879812A
Application granted
Publication of CN110879812B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23: Updating
    • G06F16/2365: Ensuring data consistency and integrity
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Spark-based data synchronization method in an e-commerce platform and belongs to the technical field of data processing. MYSQL, the relational database of the e-commerce platform, serves as the data source and generates incremental data in real time. An ETL tool reads the MYSQL binlog to obtain the incremental data, tags it, and stores it in HBASE. A configured synchronization rule then synchronizes the tagged data from HBASE to HIVE via SPARK. This solves the problem of synchronizing business data among multiple systems in the e-commerce platform.

Description

Spark-based data synchronization method in e-commerce platform
Technical Field
The invention relates to a data processing technology, in particular to a spark-based data synchronization method in an e-commerce platform.
Background
In current internet e-commerce platforms, multiple systems are inevitably involved, and data synchronization among them is particularly important. Different departments have different requirements for business data; for example, for analysis or report display, data must be migrated from the current business database to the corresponding target database. For data analysis, the data is typically extracted into the target database with an ETL tool, but big data platform storage such as HIVE does not handle delete operations well.
Existing incremental synchronization based on DB2 database increments uses a single synchronization strategy with no elastic scaling design, and its error recovery for failed synchronization is insufficient.
When data is synchronized from an e-commerce platform to a big data platform, historical data may be updated or deleted. If HIVE is chosen as the only storage component, the HIVE transactional table mechanism must be enabled; its drawbacks are that transactional tables perform poorly on update and delete operations, and SPARK cannot read transactional table data for computation. If HBASE is chosen as the only storage component, the advantage is that HBASE supports updates and deletes; the drawback is that SPARK computation over HBASE table data is far slower than over HIVE and cannot meet the time requirements of summary calculation.
Disclosure of Invention
To solve the above technical problems, the invention provides a Spark-based data synchronization method in an e-commerce platform. It is applied to synchronizing business data from the relational database of the e-commerce platform to a big data platform, is mainly used for synchronizing massive data, suits systems with high data accuracy requirements, and supplies massive data for subsequent multi-dimensional summary calculation on the big data platform. It solves the problem of synchronizing business data among multiple systems in the e-commerce platform.
The invention synchronizes the incremental data generated in real time each day from a business database such as MYSQL in the e-commerce platform into the big data platform, where the data is then processed, extracted, summarized and calculated.
The technical scheme of the invention is as follows:
A SPARK-based data synchronization method in an e-commerce platform: HBASE serves as the increment table and stores intermediate incremental data; HIVE stores the full data synchronized up to and including yesterday; the incremental data in the HBASE library is synchronized to the HIVE library through SPARK, after which business summary calculation and daily settlement are carried out.
Further,
the components involved are MYSQL, NIFI, HBASE, HIVE and SPARK.
MYSQL, as the relational database of the e-commerce platform, is the data source and generates incremental data in real time. An ETL tool reads the MYSQL binlog to obtain the incremental data, tags it, and stores it in HBASE. A configured synchronization rule then synchronizes the tagged data from HBASE to HIVE via SPARK.
The method comprises the following specific steps:
1) Obtaining incremental data; NIFI extracts data from MYSQL, writes the data into HBASE table, and adds "OPT_TIME" and "OPT_TYPE" fields for each record;
2) Setting a synchronization rule; HBASE incremental data is synchronized to HIVE, and if a partition of the HIVE table has updates, the partition must be truncated and the latest data inserted into it; key data from the last three days is synchronized in the daytime; data from the last seven days is synchronized at night; other data is synchronized on the weekend;
3) SPARK synchronizes the incremental data of HBASE to HIVE.
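The synchronization rule in step 2 can be sketched as a small scheduling function. This is only an illustration in plain Python: the function name `sync_window`, the scope labels, and the 08:00 to 20:00 definition of "daytime" are assumptions, since the text does not state exact hours or how a scheduler would be wired up.

```python
from datetime import datetime, timedelta

# Assumed boundaries of "daytime"; the text does not state exact hours.
DAY_START_HOUR = 8
DAY_END_HOUR = 20


def sync_window(now):
    """Pick the data scope and the oldest partition date for one sync run.

    Returns (scope, oldest_date); oldest_date is None when unbounded.
    The scope labels are hypothetical names used only in this sketch.
    """
    if now.weekday() >= 5:
        # Saturday or Sunday: synchronize all remaining data.
        return "other", None
    if DAY_START_HOUR <= now.hour < DAY_END_HOUR:
        # Weekday daytime: only key data from the last three days.
        return "key", (now - timedelta(days=3)).date()
    # Weekday night: data from the last seven days.
    return "recent", (now - timedelta(days=7)).date()


# Example: a Friday morning run is limited to key data of the last three days.
friday_scope, friday_oldest = sync_window(datetime(2019, 10, 18, 10, 0))
```

Keeping the daytime window narrow limits how many HIVE partitions have to be truncated and rewritten while the business is busy.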
Further,
"OPT_TIME" represents the update time in the format "yyyMMddHHmmss";
"OPT_TYPE" indicates the data update type, one of three types: "INSERT", "UPDATE" and "DELETE";
Further,
SPARK reads the incremental data increDF according to the "OPT_TIME" field and divides it into the data sets deleteDF, updateDF and insertDF according to the "OPT_TYPE" field.
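The split by operation type can be sketched as follows. Plain Python lists of dicts stand in for Spark DataFrames here; the function name, the `id` column, and the idea of selecting increments by a start and end "OPT_TIME" are assumptions made for illustration, and the sample values assume a 14-digit timestamp.

```python
def split_increments(incre_rows, start, end):
    """Select rows with start <= OPT_TIME <= end and group them by OPT_TYPE."""
    groups = {"DELETE": [], "UPDATE": [], "INSERT": []}
    for row in incre_rows:
        # Fixed-width digit strings compare in chronological order.
        if start <= row["OPT_TIME"] <= end:
            groups[row["OPT_TYPE"]].append(row)
    return groups["DELETE"], groups["UPDATE"], groups["INSERT"]


incre_rows = [
    {"id": 1, "OPT_TIME": "20191018042300", "OPT_TYPE": "INSERT"},
    {"id": 2, "OPT_TIME": "20191018050000", "OPT_TYPE": "UPDATE"},
    {"id": 3, "OPT_TIME": "20191018060000", "OPT_TYPE": "DELETE"},
    {"id": 4, "OPT_TIME": "20191101000000", "OPT_TYPE": "INSERT"},  # outside the window
]

delete_df, update_df, insert_df = split_increments(
    incre_rows, "20191018000000", "20191018235959"
)
```

With real DataFrames the same grouping would typically be three filters on the "OPT_TYPE" column.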
Further,
the list of partitions affected by the update, partitionLists, is obtained from increDF, and SPARK reads the data hiveDF in those HIVE partitions;
the rows of deleteDF and updateDF are removed from hiveDF according to the composite primary key, and hiveDF is then merged with updateDF and insertDF.
Further, the HIVE table partitions in partitionLists are truncated, and hiveDF is then inserted into HIVE;
finally, the increDF data is deleted from the HBASE table according to the primary key.
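The merge of hiveDF with the three increment sets can be sketched as below. This is a plain-Python stand-in rather than Spark code: the function name, the column names `shop_id`, `order_id` and `amount`, and the assumption that the tag fields have already been dropped from the increment rows are all illustrative.

```python
def merge_partition(hive_rows, delete_rows, update_rows, insert_rows, key_cols):
    """Apply one batch of increments to the rows read from the HIVE partitions."""
    def key_of(row):
        return tuple(row[col] for col in key_cols)

    # Keys whose current version must go: deleted, or about to be replaced.
    stale_keys = {key_of(r) for r in delete_rows} | {key_of(r) for r in update_rows}
    kept = [r for r in hive_rows if key_of(r) not in stale_keys]
    # Union the surviving rows with the new versions and the newly inserted rows.
    return kept + list(update_rows) + list(insert_rows)


hive_df = [
    {"shop_id": 1, "order_id": 100, "amount": 20},
    {"shop_id": 1, "order_id": 101, "amount": 8},
    {"shop_id": 1, "order_id": 102, "amount": 9},
]
delete_df = [{"shop_id": 1, "order_id": 101, "amount": 8}]
update_df = [{"shop_id": 1, "order_id": 100, "amount": 30}]
insert_df = [{"shop_id": 2, "order_id": 100, "amount": 5}]

merged = merge_partition(
    hive_df, delete_df, update_df, insert_df, key_cols=("shop_id", "order_id")
)
```

In Spark, the removal step corresponds to a left anti join on the key columns and the final step to a union of the three DataFrames.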
The data synchronization runs independently and does not depend on other systems.
Data in the relational database is synchronized into the big data platform, and the data of multiple systems of the e-commerce platform can be brought into the big data platform through this system.
With HBASE as the intermediate store, a SPARK program is invoked according to the configured synchronization strategy, so that data is synchronized to the big data platform quickly and efficiently.
The beneficial effects of the invention are as follows:
Data in the relational database of the e-commerce platform is synchronously updated to HIVE on the big data platform for summary calculation, which ensures data accuracy while giving the synchronization higher performance and efficiency.
Drawings
FIG. 1 is a schematic flow chart of the synchronization of the data of the e-commerce platform to the big data platform.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.
The method involves the components MYSQL, NIFI, HBASE, HIVE and SPARK.
MYSQL: the business database of the e-commerce platform.
NIFI: data extraction and write-back.
HBASE: holds the increment table and stores intermediate incremental data.
HIVE: stores the full data up to and including yesterday.
SPARK: synchronizes the incremental data in the HBASE library to the HIVE library.
The method comprises the following specific steps:
1. Acquire incremental data. NIFI extracts data from MYSQL, writes it to the HBASE table, and adds the "OPT_TIME" and "OPT_TYPE" fields to each record. "OPT_TIME" indicates the update time in the format "yyyMMddHHmmss", for example "201910180423"; "OPT_TYPE" indicates the data update type, one of three types: "INSERT", "UPDATE" and "DELETE".
2. Set a synchronization rule. In the current method, HBASE incremental data is synchronized into HIVE; if a partition of the HIVE table has updates, the partition must first be truncated and the latest data then inserted. Because data is sometimes submitted late in the current business, each synchronization of HIVE partitions would otherwise take too long. With the synchronization rule, key data from the last three days is synchronized in the daytime, data from the last seven days is synchronized at night, and other data is synchronized on the weekend when the business is not busy, to ensure data consistency.
3. SPARK synchronizes the incremental data of HBASE to HIVE. The specific method is as follows: SPARK reads the incremental data increDF according to the "OPT_TIME" field and divides it into the data sets deleteDF, updateDF and insertDF according to the "OPT_TYPE" field. The list of affected partitions, partitionLists, is obtained from increDF, and SPARK reads the data hiveDF in those HIVE partitions. The rows of deleteDF and updateDF are removed from hiveDF according to the composite primary key, and hiveDF is then merged with updateDF and insertDF. The HIVE table partitions in partitionLists are truncated, and hiveDF is inserted into HIVE. Finally, the increDF data is deleted from the HBASE table according to the primary key.
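The overall flow of step 3 can be sketched end to end with in-memory stand-ins. Everything concrete here is an assumption for illustration: a dict keyed by primary key plays the role of the HBASE increment table, a dict keyed by partition value plays the role of the partitioned HIVE table, the column names (`shop_id`, `order_id`, `dt`, `amount`) are hypothetical, and increments are selected by an "OPT_TIME" cutoff written with 14 digits.

```python
TAG_FIELDS = ("OPT_TIME", "OPT_TYPE")


def sync_increments(hbase, hive, key_cols, partition_col, cutoff):
    """Apply all increments recorded up to cutoff and return the touched partitions."""
    def key_of(row):
        return tuple(row[col] for col in key_cols)

    def strip_tags(row):
        return {k: v for k, v in row.items() if k not in TAG_FIELDS}

    # increDF: increments recorded up to the cutoff.
    incre = [r for r in hbase.values() if r["OPT_TIME"] <= cutoff]
    # partitionLists: partitions touched by at least one increment.
    partition_list = sorted({r[partition_col] for r in incre})

    for part in partition_list:
        changes = [r for r in incre if r[partition_col] == part]
        stale = {key_of(r) for r in changes if r["OPT_TYPE"] in ("DELETE", "UPDATE")}
        fresh = [strip_tags(r) for r in changes if r["OPT_TYPE"] in ("UPDATE", "INSERT")]
        kept = [r for r in hive.get(part, []) if key_of(r) not in stale]
        # Replacing the list models "truncate the partition, then insert hiveDF".
        hive[part] = kept + fresh

    # Remove the processed increments from the increment table by primary key.
    for r in incre:
        del hbase[key_of(r)]
    return partition_list


hbase = {
    (1, 100): {"shop_id": 1, "order_id": 100, "dt": "2019-10-17", "amount": 30,
               "OPT_TIME": "20191018042300", "OPT_TYPE": "UPDATE"},
    (1, 101): {"shop_id": 1, "order_id": 101, "dt": "2019-10-17", "amount": 8,
               "OPT_TIME": "20191018050000", "OPT_TYPE": "DELETE"},
    (2, 200): {"shop_id": 2, "order_id": 200, "dt": "2019-10-18", "amount": 5,
               "OPT_TIME": "20191018060000", "OPT_TYPE": "INSERT"},
    (2, 201): {"shop_id": 2, "order_id": 201, "dt": "2019-10-19", "amount": 7,
               "OPT_TIME": "20191019010000", "OPT_TYPE": "INSERT"},  # after the cutoff
}
hive = {
    "2019-10-16": [{"shop_id": 1, "order_id": 90, "dt": "2019-10-16", "amount": 3}],
    "2019-10-17": [
        {"shop_id": 1, "order_id": 100, "dt": "2019-10-17", "amount": 20},
        {"shop_id": 1, "order_id": 101, "dt": "2019-10-17", "amount": 8},
        {"shop_id": 1, "order_id": 102, "dt": "2019-10-17", "amount": 9},
    ],
}

touched = sync_increments(hbase, hive, ("shop_id", "order_id"), "dt", "20191018235959")
```

Only the partitions named in the partition list are rewritten, and increments newer than the cutoff stay in the increment table for a later run.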
The method is applied to synchronizing business data from the relational database of the e-commerce platform into a big data platform. It is mainly used for synchronizing massive data, suits systems with high data accuracy requirements, and supplies massive data for subsequent multi-dimensional summary calculation on the big data platform. It solves the problem of synchronizing business data among multiple systems in the e-commerce platform.
The foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (3)

1. A spark-based data synchronization method in an e-commerce platform is characterized in that,
HBASE serves as the increment table and stores intermediate incremental data; HIVE stores the full data synchronized up to and including yesterday; the incremental data in the HBASE library is synchronized to the HIVE library through SPARK, and business summary calculation and daily settlement are then carried out;
the method comprises the following specific steps:
1) Obtaining incremental data; NIFI extracts data from MYSQL, writes the data into HBASE table, and adds "OPT_TIME" and "OPT_TYPE" fields for each record;
2) Setting a synchronization rule; the HBASE incremental data is synchronized to HIVE, and if a partition of the HIVE table has updates, the partition must be truncated and the latest data inserted into it; key data from the last three days is synchronized in the daytime; data from the last seven days is synchronized at night; other data is synchronized on the weekend;
3) SPARK synchronizes the incremental data of HBASE to HIVE;
"OPT_TIME" represents the update time in the format "yyyMMddHHmmss";
"OPT_TYPE" indicates the data update type, one of three types: "INSERT", "UPDATE" and "DELETE";
SPARK reads the incremental data increDF according to the "OPT_TIME" field, and divides the data into the data sets deleteDF, updateDF and insertDF according to the "OPT_TYPE" field;
the list of partitions affected by the update, partitionLists, is obtained from increDF, and SPARK reads the data hiveDF in those HIVE partitions;
the rows of deleteDF and updateDF are deleted from hiveDF according to the composite primary key, and hiveDF is then merged with updateDF and insertDF;
the HIVE table partitions in partitionLists are truncated, and hiveDF is then inserted into HIVE;
and finally the increDF data is deleted from the HBASE table according to the primary key.
2. The method of claim 1, characterized in that
MYSQL, as the relational database of the e-commerce platform, is the data source and generates incremental data in real time, and an ETL tool reads the MYSQL binlog to obtain the incremental data and stores it in HBASE.
3. The method of claim 2, characterized in that
a synchronization rule is set, and the tagged data is synchronized from HBASE to HIVE through SPARK.
CN201911138971.1A 2019-11-20 2019-11-20 Spark-based data synchronization method in e-commerce platform Active CN110879812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911138971.1A CN110879812B (en) 2019-11-20 2019-11-20 Spark-based data synchronization method in e-commerce platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911138971.1A CN110879812B (en) 2019-11-20 2019-11-20 Spark-based data synchronization method in e-commerce platform

Publications (2)

Publication Number Publication Date
CN110879812A CN110879812A (en) 2020-03-13
CN110879812B CN110879812B (en) 2023-06-20

Family

ID=69729014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911138971.1A Active CN110879812B (en) 2019-11-20 2019-11-20 Spark-based data synchronization method in e-commerce platform

Country Status (1)

Country Link
CN (1) CN110879812B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286941B (en) * 2020-12-23 2021-03-23 武汉物易云通网络科技有限公司 Big data synchronization method and device based on Binlog + HBase + Hive
CN117407445B (en) * 2023-10-27 2024-06-04 上海势航网络科技有限公司 Data storage method, system and storage medium for Internet of vehicles data platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015062181A1 (en) * 2013-11-04 2015-05-07 广东电子工业研究院有限公司 Method for achieving automatic synchronization of multisource heterogeneous data resources
CN106021422A (en) * 2016-05-13 2016-10-12 北京思特奇信息技术股份有限公司 Relational database-based method and system for forming Hive data warehouse
CN107330003A (en) * 2017-06-12 2017-11-07 上海藤榕网络科技有限公司 Method of data synchronization, system, memory and data syn-chronization equipment
CN110019477A (en) * 2017-12-27 2019-07-16 航天信息股份有限公司 A kind of method and system carrying out big data processing using HIVE backup table

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
岑凯伦 ; 于红岩 ; 杨腾霄 ; .大数据下基于Spark的电商实时推荐系统的设计与实现.现代计算机(专业版).2016,(24),全文. *

Also Published As

Publication number Publication date
CN110879812A (en) 2020-03-13

Similar Documents

Publication Publication Date Title
US10180946B2 (en) Consistent execution of partial queries in hybrid DBMS
CN107402963B (en) Search data construction method, incremental data pushing device and equipment
JP7410181B2 (en) Hybrid indexing methods, systems, and programs
CN110879813B (en) Binary log analysis-based MySQL database increment synchronization implementation method
US9953051B2 (en) Multi-version concurrency control method in database and database system
CN106649378B (en) Data synchronization method and device
CN107544984B (en) Data processing method and device
US8924365B2 (en) System and method for range search over distributive storage systems
KR100481771B1 (en) Field level replication method
CN110647579A (en) Data synchronization method and device, computer equipment and readable medium
CN104899295B (en) A kind of heterogeneous data source data relation analysis method
CN102426609A (en) Index generation method and index generation device based on MapReduce programming architecture
CN110879812B (en) Spark-based data synchronization method in e-commerce platform
CN108563711A (en) A kind of time series data storage method based on timing node
CN110674154A (en) Spark-based method for inserting, updating and deleting data in Hive
CN111651519B (en) Data synchronization method, data synchronization device, electronic equipment and storage medium
CN108416043A (en) Multi-platform spatial data fusion and synchronous method
CN106503214A (en) A kind of complex rule matching process based on Redis memory databases
CN108363791A (en) A kind of method of data synchronization and device of database
CN114138907A (en) Data processing method, computer device, storage medium, and computer program product
CN103488710A (en) Efficient-storage unsteady data structure for big data pages
CN115185955A (en) Data lake data processing method and system
CN114020844A (en) Data monitoring synchronization method based on configuration
CN113094442A (en) Full data synchronization method, device, equipment and medium
CN112699187A (en) Associated data processing method, device, equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 271000 Langchao science and Technology Park, 527 Dongyue street, Tai'an City, Shandong Province

Applicant after: INSPUR SOFTWARE Co.,Ltd.

Address before: No. 1036, Shandong high tech Zone wave road, Ji'nan, Shandong

Applicant before: INSPUR SOFTWARE Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant