CN112052295A - Data synchronization method and device, electronic equipment and readable storage medium

Info

Publication number: CN112052295A
Application number: CN202010783675.3A
Authority: CN
Inventors: 熊龙; 杨全文; 马智超; 周巍; 郭煜
Original assignee: China Citic Bank Corp Ltd
Current assignee: China Citic Bank Corp Ltd
Priority date: 2020-08-06
Filing date: 2020-08-06
Publication date: 2020-12-08

Abstract

The application relates to the technical field of computer data processing, and in particular to a data synchronization method, a data synchronization device, electronic equipment and a readable storage medium. The method comprises: collecting and reporting first data information of a MySQL database and writing the first data information into a Kafka cluster database; consuming, by a Storm cluster, the first data information in the Kafka cluster and performing configuration processing on the consumed first data information according to a preset rule to obtain second data information; and writing, by the Storm cluster, the second data information into an HBase database. The data synchronization scheme of the application improves the accuracy and scalability of real-time synchronization of MySQL and HBase data.

Description

Data synchronization method and device, electronic equipment and readable storage medium

Technical Field

The present application relates to the field of computer data processing technologies, and in particular, to a data synchronization method and apparatus, an electronic device, and a readable storage medium.

Background

With the rapid development of services, various data synchronization scenarios have emerged, in which data changes in a MySQL database need to be synchronized to an HBase database within seconds to support real-time service queries and data analysis. In the prior art, changed data in the MySQL relational database is mainly captured by a log subscription tool through the Binlog mechanism and then synchronized to the HBase database through the Kafka message middleware. However, owing to faults in the underlying software and hardware environment, DDL updates and the like, this prior-art scheme can lose data written to or consumed from Kafka and can cause data processing components to report errors; data lost during transmission then leaves the MySQL and HBase data inconsistent.

Disclosure of Invention

The present application aims to solve at least one of the above technical drawbacks. The technical scheme adopted by the application is as follows:

in a first aspect, an embodiment of the present application provides a data synchronization method, where the method includes:

collecting and reporting first data information of a MySQL database;

writing the first data information into a Kafka cluster database;

the Storm cluster consumes the first data information in the Kafka cluster and configures the consumed first data information according to a preset rule to obtain second data information;

the Storm cluster writes the second data information to the Hbase database.

Optionally, the collecting the first data information of the MySQL database includes:

acquiring structural information and data change information of a MySQL database table;

reporting the DDL information of the acquired table structure information and the data change information;

wherein the DDL information includes: database name, sub-library IP, data table name, DDL statement and SQL statement.
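As an illustration only, the reported DDL information listed above can be modeled as a small record. This is a minimal sketch: the class name `DdlInfo`, the field names and the sample values are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class DdlInfo:
    """One reported DDL record; all field names are illustrative."""
    database_name: str
    shard_ip: str        # the "sub-library IP" of the MySQL shard
    table_name: str
    ddl_statement: str
    sql_statement: str

    def to_json(self) -> str:
        # Sorted keys make equal records serialize to identical payloads.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical sample report.
report = DdlInfo(
    database_name="orders_db",
    shard_ip="10.0.0.12",
    table_name="t_order",
    ddl_statement="ALTER TABLE t_order ADD COLUMN remark VARCHAR(64)",
    sql_statement="ALTER TABLE t_order ADD COLUMN remark VARCHAR(64)",
)
payload = report.to_json()
```

A frozen dataclass is used here so that a report cannot be modified after it has been collected.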

Optionally, the method further comprises:

controlling, according to received configuration parameters, the enabling state of acquiring the first data information of the MySQL database; wherein the configuration parameters include an enabling parameter for controlling whether acquisition is enabled.

Optionally, the Storm cluster configuring the consumed first data information according to a preset rule includes:

and according to the received preset rule, carrying out configuration processing on the first data information, wherein the configuration processing comprises the following steps:

and performing filtering, mapping and converting operations on tables, metadata and fields which can be included in the first data information.

Optionally, before the Storm cluster performs configuration processing on the consumed first data information according to a preset rule, the method further includes:

receiving reported DDL information;

judging that the DDL information needs to be processed;

judging that the DDL statement and the SQL statement included in the DDL information do not exist in the cache region;

after the Storm cluster performs configuration processing on the consumed first data information according to a preset rule, the method further comprises the following steps:

and caching the second data information according to a preset period.

Optionally, the method further comprises:

acquiring third data information of the MySQL database in a preset time period;

obtaining Hbase data of an Hbase database extracted according to the MySQL data rule;

comparing the third data information with the extracted Hbase data;

and if the comparison is inconsistent, correcting the Hbase database data.

In a second aspect, an embodiment of the present application provides a data synchronization apparatus, where the apparatus includes:

the device comprises an acquisition module, a writing module, a consumption module and a processing module; wherein,

the acquisition module is used for acquiring and reporting first data information of the MySQL database;

the writing module is used for writing the first data information into a Kafka cluster database;

the consumption module is used for consuming the first data information in the Kafka cluster by the Storm cluster;

the processing module is used for the Storm cluster to configure the consumed first data information according to a preset rule to obtain second data information;

and the writing module is used for writing the second data information into the Hbase database by the Storm cluster.

Optionally, the acquisition module is further configured to:

Optionally, the apparatus further comprises a configuration module, wherein,

the configuration module is used for controlling, according to received configuration parameters, the enabling state of acquiring the first data information of the MySQL database; wherein the configuration parameters include an enabling parameter for controlling whether acquisition is enabled.

Optionally, the apparatus further comprises a judging module and a storing module, wherein

The acquisition module is used for receiving the reported DDL information;

the judging module is used for judging that the DDL information needs to be processed and also used for judging that DDL statements and SQL statements contained in the DDL information do not exist in a cache region;

the storage module is further configured to cache the second data information according to a predetermined period.

Optionally, the apparatus further comprises a comparison module, wherein the comparison module is configured to:

acquiring third data information of the MySQL database in a preset time period;

comparing the third data information with the extracted Hbase data;

and if the comparison is inconsistent, correcting the Hbase database data.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory;

the memory is used for storing operation instructions;

and the processor is used for executing the data synchronization method by calling the operation instruction.

In a fourth aspect, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the above-described method of data synchronization.

According to the data synchronization scheme disclosed by the embodiment of the application, first data information of a MySQL database is collected and reported; writing the first data information into a Kafka cluster database; the Storm cluster consumes the first data information in the Kafka cluster and configures the consumed first data information according to a preset rule to obtain second data information; the Storm cluster writes the second data information to the Hbase database. The technical scheme provided by the embodiment of the application has the beneficial effects that the accuracy and the expandability of real-time synchronization of Mysql and Hbase data are improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic flowchart of a data synchronization method according to an embodiment of the present application;

fig. 2 is a schematic structural diagram of a data synchronization apparatus according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

To more clearly describe the embodiments of the present application, some definitions, concepts or devices that may be used in the embodiments are described below:

MySQL is an open source relational database management system (RDBMS) that uses the most common database management language, Structured Query Language (SQL), for database management.

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented and scalable distributed storage system; with HBase technology, a large-scale structured storage cluster can be built on inexpensive PC servers. It is a database suitable for unstructured data storage.

Kafka is an open source distributed message engine (message middleware) and is also a stream processing platform. Kafka supports message passing between applications in a publish/subscribe manner and, on top of its messaging functionality, adds Kafka Connect and Kafka Streams to support connecting data to other systems (Elasticsearch, Hadoop, etc.).

NTP (Network Time Protocol) is a network protocol for time synchronization.

Spark Streaming is a framework provided by Spark for real-time computation over big data.

A Storm cluster is a real-time, distributed and highly fault-tolerant computing system. Storm can process large volumes of data in real time while ensuring high reliability, so that every message is processed. Storm is also fault tolerant and distributed, so it can be scaled out across machines for large-volume data processing. It also has the following properties:

(1) Easy to scale. To scale out, you only need to add machines and change the corresponding topology settings. Storm uses Hadoop ZooKeeper for cluster coordination, which helps keep large clusters running well.

(2) The processing of every message can be guaranteed.

(3) Storm cluster management is simple.

(4) Fault tolerance: once a topology is submitted, Storm runs it until the topology is killed or shut down. When an error occurs during execution, Storm also reassigns the tasks.

SQL (Structured Query Language) is a special-purpose programming language for database querying and programming, used to access data and to query, update and manage relational database systems. SQL includes statements in four major categories: Data Definition Language (DDL), Data Manipulation Language (DML), Data Control Language (DCL) and Transaction Control Language (TCL). Query statement rewriting is mainly realized by rewriting two types of SQL statements, namely DDL statements and DML statements.

DDL (Data Definition Language), the database schema definition language, is a language for describing the real-world entities to be stored in a database. It is an integral part of SQL (Structured Query Language).

Spark Streaming is a framework for processing stream data that is built on Spark. Its basic principle is to divide the stream data into small time slices (a few seconds each) and to process each small portion of data in a manner similar to batch processing. Spark Streaming is built on Spark for two reasons: on the one hand, Spark's low-latency execution engine (100 ms+) can be used for real-time computation, although it is not as fast as dedicated stream processing software; on the other hand, compared with record-based processing frameworks (e.g., Storm), part of an RDD data set with narrow dependencies can be recomputed from the source data for fault tolerance. Furthermore, the micro-batch approach makes it compatible with the logic and algorithms of both batch and real-time data processing, which suits applications that require joint analysis of historical data and real-time data.
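The time-slicing principle described above can be illustrated without Spark itself. The following is a minimal plain-Python sketch, not Spark Streaming code: the function name `micro_batches`, the event format and the two-second slice width are assumptions chosen for the example.

```python
from collections import defaultdict


def micro_batches(events, slice_seconds):
    """Group (timestamp, payload) events into fixed-width time slices."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Floor division maps every timestamp to the start of its slice.
        slice_start = int(ts // slice_seconds) * slice_seconds
        batches[slice_start].append(payload)
    # Return slices in time order so each one can be handled like a small batch job.
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical stream of (seconds, payload) events.
events = [(0.5, "a"), (1.9, "b"), (2.1, "c"), (5.0, "d")]
batches = micro_batches(events, slice_seconds=2)
```

Each returned slice can then be processed with ordinary batch logic, which is the property the paragraph above attributes to the micro-batch approach.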

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments in conjunction with the accompanying drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

To make the purpose, technical solution and advantages of the present application clearer, fig. 1 discloses a flowchart of a data synchronization method provided by an embodiment of the present application, and as shown in fig. 1, the data synchronization method includes:

s101, collecting and reporting first data information of a MySQL database;

In an alternative embodiment, the method further comprises: introducing a configuration agent to uniformly manage and issue the configuration parameters of the data synchronization system, wherein the configuration parameters comprise enabling parameters. The local acquisition server receives the enabling parameters issued by the configuration agent and sends them to the local acquisition client, so as to control the enabling state in which the local acquisition client acquires the first data information of the MySQL database, wherein the enabling state comprises starting acquisition, suspending acquisition, stopping acquisition, periodic or scheduled acquisition, and the like.
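The enabling control described above might be sketched as follows. This is a simplified illustration under stated assumptions: the names `AcquisitionState` and `AcquisitionClient` and the configuration key `"enable"` are hypothetical, and periodic or scheduled acquisition is omitted because it would require a scheduler.

```python
from enum import Enum


class AcquisitionState(Enum):
    STARTED = "start"
    PAUSED = "pause"
    STOPPED = "stop"


class AcquisitionClient:
    """Stand-in for the local acquisition client; collects only while started."""

    def __init__(self):
        self.state = AcquisitionState.STOPPED
        self.collected = []

    def apply_config(self, config):
        # "enable" is a hypothetical key for the enabling parameter.
        self.state = AcquisitionState(config["enable"])

    def collect(self, change):
        # Changes are dropped unless acquisition is currently enabled.
        if self.state is AcquisitionState.STARTED:
            self.collected.append(change)
            return True
        return False


client = AcquisitionClient()
client.apply_config({"enable": "start"})
client.collect({"table": "t_order", "id": 1})
client.apply_config({"enable": "pause"})
client.collect({"table": "t_order", "id": 2})  # ignored while paused
```

Using an enumeration means that an unknown enabling value raises a `ValueError` instead of silently leaving the client in an undefined state.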

In the embodiment of the present application, acquiring first data information of the MySQL database by using the Canal acquisition client includes:

S102, writing the first data information into a Kafka cluster database;

In step S102, the Canal server and the Canal acquisition client acquire the data of the MySQL shards according to the configuration parameters issued by the configuration agent and write the data into the Kafka cluster.

S103, consuming the first data information in the Kafka cluster by the Storm cluster, and configuring the consumed first data information according to a preset rule to obtain second data information;

In step S103, the Storm cluster performs configuration processing on the first data information according to a preset rule received from the DDL management platform, where the configuration processing includes performing filtering, mapping and converting operations on the tables, metadata and fields that the first data information may include. The DDL management platform is mainly used for managing DDL information, which comprises receiving reported DDL information, updating the preset DDL processing rules (the preset rules for short) and issuing them to the Storm cluster.
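The filtering, mapping and converting operations can be illustrated with a small rule table. This is a hedged sketch only: the rule layout, the function name `apply_rules` and the sample table and column names are assumptions, and the application does not specify how the preset rules are encoded.

```python
def apply_rules(event, rules):
    """Return the transformed event, or None when the table is filtered out."""
    table_rule = rules.get(event["table"])
    if table_rule is None:
        return None  # filtering at table level
    row = {}
    for column, value in event["row"].items():
        col_rule = table_rule["columns"].get(column)
        if col_rule is None:
            continue  # filtering at field level
        convert = col_rule.get("convert", lambda v: v)
        # Mapping renames the field; converting changes its value representation.
        row[col_rule["target"]] = convert(value)
    return {"table": table_rule["target"], "row": row}


# Hypothetical preset rules for one subscribed table.
rules = {
    "t_order": {
        "target": "hbase_order",
        "columns": {
            "id": {"target": "order_id", "convert": str},
            "amount": {"target": "amount_cents",
                       "convert": lambda v: int(round(float(v) * 100))},
        },
    }
}
first = {"table": "t_order", "row": {"id": 7, "amount": "12.50", "internal_flag": 1}}
second = apply_rules(first, rules)
```

Here `first` plays the role of the first data information and `second` the role of the second data information; the unlisted field `internal_flag` is filtered out.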

In a further embodiment, before the Storm cluster performs configuration processing on the consumed first data information according to a preset rule, the method further includes:

receiving reported DDL information;

judging whether the DDL information needs to be processed; if processing is needed, further

judging whether the DDL statement and the SQL statement included in the DDL information exist in a cache region; if they do not exist, starting the Storm cluster to perform metadata change processing on the consumed first data information according to the preset rules, wherein the preset rules comprise:

CREATE TABLE (create a new table): add metadata information for the new table;

DROP TABLE (delete a table): set the metadata information corresponding to the table to invalid;

ALTER TABLE ... ADD COLUMN (add a column): add the information of the newly added column for the table;

ALTER TABLE ... DROP COLUMN (delete a column): set the deleted column of the table to invalid;

ALTER TABLE ... CHANGE (update a column name): update the corresponding column name of the table, leaving the mapping name unchanged;

ALTER TABLE ... MODIFY (change a column type): change the column information of the table.
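One possible reading of the metadata change rules above is sketched below. The in-memory metadata layout, the operation labels such as `"ADD_COLUMN"` and the function names are assumptions made for illustration; in particular, the dictionary key of each column stands in for the mapping name, which a column rename leaves unchanged.

```python
def _find_column(table_meta, source_name):
    """Locate a column entry by its current MySQL column name."""
    for col in table_meta["columns"].values():
        if col["source_name"] == source_name:
            return col
    raise KeyError(source_name)


def apply_ddl(metadata, op, table, column=None, new_name=None, col_type=None):
    """Apply one DDL event to the in-memory metadata dictionary."""
    if op == "CREATE_TABLE":
        metadata[table] = {"valid": True, "columns": {}}
    elif op == "DROP_TABLE":
        metadata[table]["valid"] = False
    elif op == "ADD_COLUMN":
        # The dictionary key plays the role of the mapping name.
        metadata[table]["columns"][column] = {
            "source_name": column, "type": col_type, "valid": True}
    elif op == "DROP_COLUMN":
        _find_column(metadata[table], column)["valid"] = False
    elif op == "CHANGE_COLUMN":
        # Only the source column name changes; the mapping name (key) stays.
        _find_column(metadata[table], column)["source_name"] = new_name
    elif op == "MODIFY_COLUMN":
        _find_column(metadata[table], column)["type"] = col_type
    else:
        raise ValueError(f"unsupported DDL operation: {op}")
    return metadata


meta = {}
apply_ddl(meta, "CREATE_TABLE", "t_order")
apply_ddl(meta, "ADD_COLUMN", "t_order", column="remark", col_type="VARCHAR(64)")
apply_ddl(meta, "CHANGE_COLUMN", "t_order", column="remark", new_name="note")
apply_ddl(meta, "MODIFY_COLUMN", "t_order", column="note", col_type="VARCHAR(128)")
```

Tables and columns are marked invalid rather than deleted, which mirrors the "set to invalid" wording of the rules.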

For an original table in the first data information, because the HBase table structure is consistent with the acquired MySQL data table structure, the data is processed and stored as received (the JSON data is processed and stored by column), and no column filtering is performed.

For a subscription table in the first data information, the data is processed and stored according to the subscribed columns, and the table is reconfigured manually when its structure changes.

In an optional embodiment, after the metadata change processing is performed on the first data information, the second data information is cached according to a predetermined period, for example, 15 minutes.
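The cache check and the periodic caching can be sketched together as a cache whose entries expire after the predetermined period. The text does not say whether the period is a refresh interval or a retention time; this sketch assumes retention, and the names `PeriodicCache` and `should_process` are hypothetical.

```python
import time


class PeriodicCache:
    """Remembers keys for a fixed period (15 minutes by default)."""

    def __init__(self, period_seconds=15 * 60, clock=time.monotonic):
        self._period = period_seconds
        self._clock = clock
        self._expiry = {}  # key -> time at which the entry expires

    def contains(self, key):
        expiry = self._expiry.get(key)
        return expiry is not None and expiry > self._clock()

    def add(self, key):
        self._expiry[key] = self._clock() + self._period


def should_process(cache, ddl_statement, sql_statement):
    """Return True only when the statement pair is not already cached."""
    key = (ddl_statement, sql_statement)
    if cache.contains(key):
        return False
    cache.add(key)
    return True


# A controllable clock keeps the example deterministic.
now = [0.0]
cache = PeriodicCache(period_seconds=900, clock=lambda: now[0])
stmt = "ALTER TABLE t_order ADD COLUMN remark VARCHAR(64)"
first_seen = should_process(cache, stmt, stmt)
repeated = should_process(cache, stmt, stmt)
```

Injecting the clock as a parameter makes the expiry behavior testable without waiting for the period to elapse.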

S104, the Storm cluster writes the second data information into the HBase database.

According to the embodiment of the application, synchronous transmission from MySQL data to HBase data is realized by introducing the DDL management platform, the Canal server and acquisition client, and the Storm cluster. This solves the prior-art problem that data loss during transmission leaves the MySQL data inconsistent with the HBase data, and improves the accuracy and scalability of real-time synchronization of MySQL and HBase data.

In an optional embodiment of the present application, the data synchronization method further includes, after the synchronous data transmission described in the foregoing embodiment is completed, using Spark Streaming to compare the data in the MySQL database with the data in the HBase database, so as to ensure that the data of the two databases is consistent. The comparison process is briefly described as follows:

step 1, acquiring third data information of a MySQL database in a preset time period;

First, the time period to be queried is determined; the MySQL shard data is then loaded by the timestamp version column using Spark Streaming and packed into memory to form the MySQL comparison data, namely the third data information.

Step 2, obtaining Hbase data of the Hbase database extracted according to the MySQL data rule;

The ROWKEY value is spliced according to the ROWKEY rule of the MySQL data, and the data of the HBase database is loaded according to the ROWKEY value to form the HBase comparison data, wherein the HBase comparison data uses the update time as its version number.

Step 3, comparing the third data information with the extracted Hbase data; and if the comparison is inconsistent, correcting the Hbase database data. The logical rule of comparison is:

(1) when the timestamp column value of the MySQL third data is equal to the timestamp column value of HBase, no correction is needed;

(2) when the timestamp column value of the MySQL third data is earlier than the timestamp column value of HBase and the HBase timestamp column value falls within the time period of the current query (from the query start time to the query end time), a correction is made;

(3) when the timestamp column value of the MySQL third data is earlier than the timestamp column value of HBase and the HBase timestamp column value does not fall within the time period of the current query, the NTP time rollback threshold is subtracted from the HBase timestamp column value; if the result falls within the time period of the current query, a correction is made;

(4) under the same conditions as in (3), if the result does not fall within the time period of the current query, no correction is needed;

(5) when the timestamp column value of the MySQL third data is later than the timestamp column value of HBase, a correction is made.
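The comparison rules above can be condensed into a single decision function. This is a minimal sketch under stated assumptions: timestamps are plain numbers, the query time period is treated as inclusive at both ends, and the function and parameter names are hypothetical.

```python
def needs_correction(mysql_ts, hbase_ts, query_start, query_end, ntp_rollback):
    """Decide whether the HBase record must be corrected."""

    def in_window(ts):
        # Inclusive bounds are an assumption; the text does not specify them.
        return query_start <= ts <= query_end

    if mysql_ts == hbase_ts:
        return False  # equal timestamps: no correction
    if mysql_ts > hbase_ts:
        return True   # MySQL is newer: correct HBase
    # From here on, the MySQL timestamp is earlier than the HBase timestamp.
    if in_window(hbase_ts):
        return True   # HBase timestamp lies inside the queried period
    # Otherwise allow for clock rollback by subtracting the NTP threshold.
    return in_window(hbase_ts - ntp_rollback)


# Hypothetical values: query period [0, 200], rollback threshold 5.
decisions = [
    needs_correction(100, 100, 0, 200, 5),  # timestamps equal
    needs_correction(100, 150, 0, 200, 5),  # MySQL earlier, HBase in period
    needs_correction(100, 203, 0, 200, 5),  # in period only after rollback
    needs_correction(100, 210, 0, 200, 5),  # outside period even after rollback
    needs_correction(150, 100, 0, 200, 5),  # MySQL later
]
```

The two cases in which the HBase timestamp lies outside the queried period collapse into the final `return`, since they differ only in whether the adjusted timestamp falls within that period.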

Based on the data comparison of the embodiment, the problem of inconsistent information of the two databases caused by MySQL data rollback is avoided, and the accuracy of data synchronization is further improved.

Based on the data synchronization method provided by the embodiment shown in fig. 1, fig. 2 shows a data synchronization apparatus provided by an embodiment of the present application. As shown in fig. 2, the apparatus may mainly include: an acquisition module 201, a writing module 202, a consumption module 203 and a processing module 204; wherein,

the acquisition module 201 is used for acquiring and reporting first data information of the MySQL database;

the writing module 202 is used for writing the first data information into a Kafka cluster database;

the consumption module 203 is used for the Storm cluster to consume the first data information in the Kafka cluster;

the processing module 204 is used for the Storm cluster to configure the consumed first data information according to a preset rule to obtain second data information;

the writing module 202 is further used for the Storm cluster to write the second data information into the HBase database.

In an optional embodiment of the present application, the acquisition module is further configured to:

In an alternative embodiment of the present application, the apparatus further comprises a configuration module, wherein,

In an optional embodiment of the present application, the apparatus further comprises a determining module and a storing module, wherein

The acquisition module is used for receiving the reported DDL information;

In an optional embodiment of the present application, the apparatus further comprises a comparison module, wherein the comparison module is configured to:

acquiring third data information of the MySQL database in a preset time period;

comparing the third data information with the extracted Hbase data;

and if the comparison is inconsistent, correcting the Hbase database data.

It is understood that the above modules of the data synchronization apparatus in the present embodiment have functions of implementing the corresponding steps of the method in the embodiment shown in fig. 1. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The modules can be software and/or hardware, and each module can be implemented independently or by integrating a plurality of modules. For the functional description of each module, reference may be specifically made to the corresponding description of the method in the embodiment shown in fig. 1, and details are not repeated here.

The embodiment of the application provides an electronic device, which comprises a processor and a memory;

a memory for storing operating instructions;

and the processor is used for executing the data synchronization method provided by any embodiment of the application by calling the operation instruction.

As an example, fig. 3 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applicable. As shown in fig. 3, the electronic device 2000 includes a processor 2001 and a memory 2003, wherein the processor 2001 is coupled to the memory 2003, for example via a bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. It should be noted that, in practical applications, the number of transceivers 2004 is not limited to one, and the structure of the electronic device 2000 does not constitute a limitation on the embodiments of the present application.

The processor 2001 is applied in the embodiments of the present application to implement the method shown in the above method embodiment. The transceiver 2004 may include a receiver and a transmitter, and is applied in the embodiments of the present application to enable the electronic device to communicate with other devices.

The processor 2001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 2001 may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 2002 may include a path that conveys information between the aforementioned components. The bus 2002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 2002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 3, but this does not mean only one bus or one type of bus.

The memory 2003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

Optionally, the memory 2003 is used for storing application program code for performing the disclosed aspects, and is controlled in execution by the processor 2001. The processor 2001 is configured to execute application program codes stored in the memory 2003 to implement the data synchronization method provided in any of the embodiments of the present application.

The electronic device provided by the embodiment of the application is applicable to any embodiment of the method, and is not described herein again.

The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the data synchronization method shown in the above method embodiment.

The computer-readable storage medium provided in the embodiments of the present application is applicable to any of the embodiments of the foregoing method, and is not described herein again.

According to the data synchronization scheme disclosed by the embodiment of the application, first data information of a MySQL database is collected and reported; writing the first data information into a Kafka cluster database; the Storm cluster consumes the first data information in the Kafka cluster and configures the consumed first data information according to a preset rule to obtain second data information; and the Storm cluster writes the second data information into an Hbase database so as to improve the accuracy and the expandability of real-time synchronization of Mysql and Hbase data.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for synchronizing data, the method comprising:

collecting and reporting first data information of a MySQL database;

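writing the first data information into a Kafka cluster database;

the Storm cluster consumes the first data information in the Kafka cluster and configures the consumed first data information according to a preset rule to obtain second data information;

the Storm cluster writes the second data information to the Hbase database.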

2. The data synchronization method of claim 1, wherein collecting first data information of the MySQL database comprises:

3. The data synchronization method of claim 2, further comprising:

4. The data synchronization method according to claim 2 or 3, wherein the Storm cluster performing configuration processing on the consumed first data information according to a preset rule comprises:

5. The data synchronization method according to claim 4, wherein before the Storm cluster performs configuration processing on the consumed first data information according to a preset rule, the method further comprises:

receiving reported DDL information;

judging that the DDL information needs to be processed;

and caching the second data information according to a preset period.

6. The data synchronization method of claim 5, further comprising:

acquiring third data information of the MySQL database in a preset time period;

comparing the third data information with the extracted Hbase data;

and if the comparison is inconsistent, correcting the Hbase database data.

7. A data synchronization apparatus, the apparatus comprising: an acquisition module, a writing module, a consumption module and a processing module; wherein,

8. The data synchronization apparatus of claim 7, wherein the acquisition module is further configured to:

9. The data synchronization apparatus of claim 8, wherein the apparatus further comprises a configuration module, wherein,

10. The data synchronization apparatus of claim 9, further comprising a determination module and a storage module, wherein

The acquisition module is used for receiving the reported DDL information;

11. The data synchronization apparatus of claim 10, further comprising a comparison module, wherein the comparison module is configured to:

acquiring third data information of the MySQL database in a preset time period;

comparing the third data information with the extracted Hbase data;

and if the comparison is inconsistent, correcting the Hbase database data.

12. An electronic device comprising a processor and a memory;

the memory is used for storing operation instructions;

the processor is used for executing the method of any one of claims 1-6 by calling the operation instruction.

13. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method of any one of claims 1-6.