CN105243067B - Method and device for realizing real-time incremental data synchronization

Info

Publication number: CN105243067B
Application number: CN201410321182.2A
Authority: CN
Inventors: 杨威; 白军伟; 王啸风; 冯是聪
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Zhizhi Heshu Technology Co ltd
Priority date: 2014-07-07
Filing date: 2014-07-07
Publication date: 2019-06-28
Anticipated expiration: 2034-07-07
Also published as: CN105243067A

Abstract

The invention discloses a method and a device for realizing real-time incremental data synchronization. The method comprises: generating, according to the table structure information of a relational database, a mapping relation file corresponding to the relational database in the distributed, column-oriented open source database HBase; acquiring the operation log of the relational database in real time; and acquiring the change data of the relational database from the acquired operation log, and updating the acquired change data into the HBase of Hadoop according to the established mapping relation file. The invention achieves real-time incremental update synchronization of data from the relational database to Hadoop, which effectively reduces the burden on the Hadoop platform and improves the user experience.

Description

Method and device for realizing real-time incremental data synchronization

Technical Field

The invention relates to the technical field of big data, in particular to a method and a device for realizing real-time incremental synchronous data.

Background

The rapid development of the internet has produced a rapidly growing volume of data. The emergence of massive data and changes in data structure pose great challenges to data management, analysis and processing in every industry. Traditional processing methods based on relational databases can no longer effectively store, analyze and process the ever-growing variety of business data. For this reason, many industries have begun to use the distributed system infrastructure Hadoop to analyze data. At present, the mainstream way to synchronize relational database data to the Hadoop platform is a one-time full import of the data through Sqoop. Sqoop is an efficient tool for transferring data between relational databases and the Hadoop distributed file system (HDFS): it can import data from a relational database into HDFS and export data from HDFS into a relational database.

When data in the relational database changes and the updated data is to be imported into Hadoop, the data in the relational database must be fully re-imported at regular intervals. A full import means importing all data currently in the relational database into Hadoop. This not only burdens the Hadoop distributed system but is also time-consuming. At present, however, there is no method that achieves real-time incremental update synchronization of data from a relational database to Hadoop, that is, one that synchronizes only the changed data in the relational database to Hadoop.

Disclosure of Invention

In order to solve the above problems, the invention provides a method and a device for realizing real-time incremental synchronization of data, which achieve real-time incremental update synchronization of data from a relational database to Hadoop, effectively reduce the burden on the Hadoop platform and improve the user experience.

In order to achieve the above object, the present invention discloses a method for implementing real-time incremental synchronization data, which is applied to data import from a relational database to a distributed system architecture, and comprises:

generating a mapping relation file corresponding to the relational database in a column-oriented database HBase according to the table structure information of the relational database;

acquiring an operation log of the relational database in real time;

and acquiring the change data of the relational database according to the acquired operation log, and updating the acquired change data into the HBase of Hadoop according to the established mapping relation file.

Further, the identity and the starting point of the relational database are configured in advance; the obtaining of the operation log of the relational database in real time includes:

and acquiring an operation log of the relational database corresponding to the identity from the initial site according to the identity and the initial site.

Further, the obtaining the operation log of the relational database includes:

receiving the change data of the operation log of the relational database corresponding to the identity identifier, and storing the received change data in a message queue in sequence; or,

and when the request for acquiring the changed data is not received and the changed data in the message queue exceeds a threshold value, sequentially storing the changed data in the message queue into the corresponding directory file.

Further, after obtaining the operation log of the relational database in real time, the method further includes:

updating the initial site of the relational database;

and acquiring the next operation log of the relational database according to the updated initial site.

Further, the method further comprises: and storing the obtained change data in a local file, and recording the update history.

The invention also discloses a device for realizing real-time incremental synchronous data, which is applied to data import from a relational database to a distributed system architecture and comprises a table building module, a log obtaining module, a plurality of log analyzing client modules and a data updating module, wherein:

the table building module is used for generating a mapping relation file corresponding to the relational database in a distributed and column-oriented open source database HBase according to the table structure information of the relational database;

the log acquisition module is used for acquiring the operation log of the relational database in real time;

each log analysis client module is respectively connected with the log acquisition module and is used for receiving the operation logs and the change data sent by the log acquisition module and sending the obtained change data to the data updating module;

and the data updating module is used for receiving the change data sent by each log analysis client module in real time and updating the obtained change data into the HBase of Hadoop according to the mapping relation file established by the table establishing module.

Further, the log obtaining module is specifically configured to:

configuring the unique identity and the unique start site of the relational database in advance;

Further, the log obtaining module is further configured to:

receiving data of an operation log of the relational database corresponding to the identity identifier, and storing the received data in a message queue in sequence for the corresponding log analysis client module to request to obtain; or,

and if the log analysis client module does not request to acquire the data, when the data in the message queue exceeds a threshold value, sequentially storing the data in the message queue into a corresponding directory file.

Further, the log obtaining module is further configured to:

updating the initial site of the relational database;

Further, the log parsing client module is further configured to: and when the data updating module is not started, storing the received change data in a local file.

Further, the obtained change data is saved in a local file, and the update history is recorded.

The method for realizing real-time incremental synchronization data provided by the technical scheme of the application is applied to data import from a relational database to a distributed system architecture, and comprises the following steps: generating a mapping relation file corresponding to the relational database in a distributed and column-oriented open source database HBase according to the table structure information of the relational database; acquiring an operation log of the relational database in real time; and acquiring the change data of the relational database according to the acquired operation log, and updating the acquired change data into the HBase of Hadoop according to the established mapping relation file. According to the technical scheme, real-time incremental updating synchronization from the relational database to the Hadoop is achieved, meanwhile, the burden of a Hadoop platform is effectively reduced, and user experience is enhanced.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method for implementing real-time incremental synchronization of data in accordance with the present invention;

FIG. 2 is a schematic diagram of the structure of the apparatus for implementing real-time incremental synchronization data according to the present invention;

Detailed Description

The invention is described in detail below with reference to the figures and the specific embodiments.

Fig. 1 is a flowchart of a method for implementing real-time incremental synchronization of data according to the present invention, which is applied to import data from a relational database to a distributed system architecture (Hadoop), and as shown in fig. 1, the method includes the following steps:

step 101, according to the table structure information of the relational database, generating a mapping relation file corresponding to the relational database in a distributed and column-oriented database (HBase).

In this step, an association table corresponding to the relational database may be generated in the data warehouse tool (Hive) and in HBase. The association table is a data table created in Hive and HBase whose structure is consistent with the table structure in the relational database. After the historical data has been imported into the Hadoop platform, updates to a data table in the relational database are applied to the corresponding table in Hive and HBase. HBase is used for data storage, and Hive provides the query function. The association table may be created by a Hive script.
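As an illustration of creating the association table by Hive script, the following minimal sketch builds a Hive DDL statement for a table backed by an HBase table through Hive's HBase storage handler. The function name `hive_ddl`, the sample table and column names, and the choice to map the primary key to the HBase row key are assumptions made for this example; the patent does not give the script it uses.

```python
def hive_ddl(hive_table, hbase_table, family, pk, columns):
    """Build a Hive DDL statement for a table stored in an existing HBase table.

    columns: list of (hive_column, hive_type, hbase_qualifier) in table order.
    The primary-key column is mapped to the HBase row key (":key"); this is an
    assumption of the sketch, not something the patent specifies.
    """
    col_defs = ", ".join(f"`{name}` {typ}" for name, typ, _ in columns)
    mapping = ",".join(
        ":key" if name == pk else f"{family}:{qual}" for name, _, qual in columns
    )
    return (
        f"CREATE EXTERNAL TABLE `{hive_table}` ({col_defs}) "
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}') "
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical table: Hive keeps the relational column names, HBase uses A, B, C.
ddl = hive_ddl(
    "info", "SQOOP_INFO", "C", "id",
    [("id", "bigint", "A"), ("name", "string", "B"), ("age", "int", "C")],
)
print(ddl)
```

With this arrangement Hive queries use the relational column names, while the data itself lives in HBase under the shorter customized column names.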

The way the mapping relation file is generated in this step is well known to those skilled in the art: the table structure information can be read from the relational database by a program, or obtained by executing SQL statements through a corresponding interface after connecting to the relational database. Specifically, after the table structure information of the relational database is obtained, the corresponding HBase column names in the Hadoop platform are customized (for example, the column names in the relational database can be mapped in order to the letters A to Z). In Hive, each column name of the table structure is the same as the column name in the relational database; the customized column names are the ones used in HBase.

If the primary key in a data table of the relational database is a composite key, the fields of the composite key are spliced together according to a certain rule and used as the row key of HBase. The rule can be customized according to the characteristics and processing requirements of the data, provided that the field spliced from each record according to the rule is unique. For example, the fields of the composite primary key may be joined directly with underscores.
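The underscore splicing rule mentioned above can be sketched as follows. The function name `make_row_key` and the sample record are hypothetical; the sketch also assumes that key values never contain the separator themselves, since otherwise two different records could produce the same row key.

```python
def make_row_key(record, pk_columns, sep="_"):
    """Join the primary-key fields of a record, in order, into one HBase row key."""
    parts = [str(record[col]) for col in pk_columns]
    return sep.join(parts)


# Hypothetical record with a two-column composite primary key.
record = {"order_id": 17, "line_no": 3, "amount": "9.50"}
row_key = make_row_key(record, ["order_id", "line_no"])
print(row_key)  # 17_3
```

If key values may contain underscores, a different separator or an escaping scheme would be needed to keep the spliced key unique.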

And step 102, acquiring an operation log of the relational database in real time.

Firstly, the unique identification and the starting point of the relational database are configured in advance.

In this step, in addition to configuring the unique identity and the start point for the relational database, other necessary information may be configured, such as an IP address, a service port number, a database user name and a password of the host where the relational database is located.

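Secondly, obtaining the operation log of the relational database in real time specifically includes: acquiring, according to the identity and the start site, the operation log of the relational database corresponding to the identity, starting from the start site.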

Finally, after the operation log of the relational database is obtained in real time, updating the initial site of the relational database; and acquiring the next operation log of the relational database according to the updated initial site.

It should be noted that the start site is a variable held in memory. Each time an operation log is successfully acquired, the program modifies the value of this variable, that is, updates the start site, and also writes the value to the configuration file. Saving the value to a file prevents the current start site from being lost if the program terminates abnormally, so that the next time the program starts it can recover, from the start site saved in the file, the progress it had reached when it last exited.

A site is the position identifier of an operation record in the log of the relational database. The start site is the position from which acquisition of the operation log begins, and the current site is the current log position in the relational database. If no start site is configured, it defaults to the current site of the relational database. In a Mysql database, for example, the site can be obtained with show master status.
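The in-memory variable backed by a file can be sketched as below. The class name `StartPosition`, the JSON file layout and the sample binary log file name are assumptions of this example; the patent only states that the value is kept in memory and written to the configuration file after each successful acquisition.

```python
import json
import os
import tempfile


class StartPosition:
    """Start site kept in memory and persisted to a file after every update."""

    def __init__(self, path, default=None):
        self.path = path
        self.value = default
        # On startup, resume from the position saved by the previous run, if any.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.value = json.load(fh)["start_position"]

    def update(self, new_value):
        self.value = new_value
        # Write to a temporary file first, then rename it over the old file,
        # so a crash during the write cannot leave a half-written position.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump({"start_position": new_value}, fh)
        os.replace(tmp, self.path)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "position.json")
    tracker = StartPosition(path)
    tracker.update({"file": "mysql-bin.000001", "pos": 154})
    # A fresh instance simulates the program restarting after an abnormal exit.
    restored = StartPosition(path).value

print(restored)
```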

Preferably, a dedicated thread may be configured for each relational database. The thread receives the change data of the operation log of the relational database corresponding to the identity and stores the received change data in order in a message queue of limited size; or,

and when the request for acquiring the changed data is not received and the changed data in the message queue exceeds a threshold value, storing the changed data in the message queue into the corresponding directory file in sequence.

Thus, once there is an initiating data request, the corresponding change data stored in the corresponding directory file will be sent preferentially.
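The queue behaviour described above (a bounded queue that spills to a directory file when nobody requests the data, and serves the spilled data first) can be sketched as follows. The class name `SpillingQueue`, the one-JSON-object-per-line file format and the single-threaded usage are assumptions of this example; a real deployment with one receiving thread per database would also need locking around the queue and the file.

```python
import json
import os
import tempfile
from collections import deque


class SpillingQueue:
    """Bounded in-memory queue that spills to a file once a threshold is exceeded."""

    def __init__(self, spill_path, threshold):
        self.spill_path = spill_path
        self.threshold = threshold
        self.queue = deque()

    def put(self, change):
        self.queue.append(change)
        if len(self.queue) > self.threshold:
            self._spill()

    def _spill(self):
        # Append queued changes to the file in arrival order, emptying the queue.
        with open(self.spill_path, "a", encoding="utf-8") as fh:
            while self.queue:
                fh.write(json.dumps(self.queue.popleft()) + "\n")

    def drain(self):
        """Return all pending changes, spilled ones first, in arrival order."""
        changes = []
        if os.path.exists(self.spill_path):
            with open(self.spill_path, encoding="utf-8") as fh:
                changes = [json.loads(line) for line in fh if line.strip()]
            os.remove(self.spill_path)
        changes.extend(self.queue)
        self.queue.clear()
        return changes


with tempfile.TemporaryDirectory() as tmp_dir:
    q = SpillingQueue(os.path.join(tmp_dir, "db1.changes"), threshold=2)
    for i in range(5):
        q.put({"seq": i})
    drained = q.drain()

print([c["seq"] for c in drained])
```

Serving the spilled file before the in-memory queue is what keeps the changes in their original order for the requesting client.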

In addition to configuring the unique identity and the start site of the relational database in advance, the method further comprises: importing the full data of the relational database into the database in the Hadoop platform through Sqoop.

Note that the identity and the start site of the relational database are configured in advance and are unique, so this solution imports the full relational database into the Hadoop platform only once. After that full import, subsequent data updates are imported into the Hadoop platform incrementally and in real time, rather than by the usual approach of repeating the full import.

And 103, acquiring the change data of the relational database according to the acquired operation log, and updating the acquired change data into the HBase of Hadoop according to the established mapping relation file.

Further, the obtained change data is also saved in a local file, and the update history is recorded.

It should be noted that, for a relational database, three operations are of main interest: insert, update and delete. HBase, by contrast, essentially only appends data; its update and delete operations (there is no separate insert operation) are very similar in nature, and both are finalized during subsequent compaction (Compact).

For this reason, the update operation of HBase corresponds to both the insert and the update operations of the relational database. An insert maps to a simple update operation in HBase. Specifically:

First, updates are distinguished by whether the primary key is changed. If the primary key is not changed, the update still maps to a simple update operation in HBase; if the primary key is changed, the update maps to several operations in HBase. Because HBase can store multiple versions of historical data by timestamp, when the primary key is updated the corresponding value is first fetched using the old primary key; the data consisting of the new primary key and the old value is stored in HBase; the new primary key and the new value are then written to HBase; and finally the old value is deleted from HBase using the old primary key. In this way the multi-version history is still preserved in HBase.

Finally, a delete operation still maps to a delete in HBase, but HBase only marks the data as deleted; the physical deletion is performed during the Compact process. (Note that once the delete operation completes, the data is no longer visible externally, so query results are not affected.)
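The mapping from relational operations to HBase operations can be sketched as below. `FakeHBaseTable` is an in-memory stand-in written for this example, not an HBase client, and it does not model timestamps, multiple versions or delete markers; the operation type names follow the values used in the data format of Example one, and the function name `apply_change` is hypothetical.

```python
class FakeHBaseTable:
    """In-memory stand-in for an HBase table: row key -> {column: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, values):
        self.rows.setdefault(row_key, {}).update(values)

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key):
        self.rows.pop(row_key, None)


def apply_change(table, op, old_key, new_key, new_values):
    """Apply one relational change to the table using put/get/delete only."""
    if op in ("INSERT", "UPDATE_FIELDCOMP"):
        # Inserts and updates that keep the primary key are both plain puts.
        table.put(new_key, new_values)
    elif op == "UPDATE_FIELDCOMP_PK":
        old_row = table.get(old_key)    # 1. fetch the values under the old key
        table.put(new_key, old_row)     # 2. store them under the new key
        table.put(new_key, new_values)  # 3. apply the new values
        table.delete(old_key)           # 4. remove the row under the old key
    elif op == "DELETE":
        table.delete(old_key)
    else:
        raise ValueError(f"unknown operation type: {op}")


table = FakeHBaseTable()
apply_change(table, "INSERT", None, "1", {"B": "alex", "C": "25"})
apply_change(table, "UPDATE_FIELDCOMP", "1", "1", {"B": "tom"})
apply_change(table, "UPDATE_FIELDCOMP_PK", "1", "222", {})
after_pk_update = dict(table.rows)
apply_change(table, "DELETE", "222", None, {})
print(after_pk_update, table.rows)
```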

Fig. 2 is a schematic diagram of the structure of the apparatus for implementing real-time incremental synchronization data according to the present invention, which is applied to data import from a relational database to a distributed system architecture (Hadoop). As shown in fig. 2, the apparatus includes a table building module, a log obtaining module, a plurality of log analyzing client modules and a data updating module, wherein:

and the table building module is used for generating a mapping relation file corresponding to the relational database in a distributed and column-oriented open source database (HBase) according to the table structure information of the relational database.

Further, the table building module is further configured to:

and if the primary key in the data table of the relational database is a combined key, splicing the combined key according to a certain rule to be used as the row primary key of the HBase.

It should be noted that the table building module is further configured to generate an association table corresponding to the relational database in the data warehouse tool (Hive) and in HBase. The association table is a data table created in Hive and HBase whose structure is consistent with the table structure in the relational database. After the historical data has been imported into the Hadoop platform, updates to a data table in the relational database are applied to the corresponding table in Hive and HBase. HBase is used for data storage, and Hive provides the query function. The association table may be created by a Hive script.

And the log acquisition module is used for acquiring the operation log of the relational database in real time.

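Firstly, the log obtaining module is specifically configured to: configure the unique identity and the unique start site of the relational database in advance.

In addition, the log obtaining module is further configured to: receive data of the operation log of the relational database corresponding to the identity, and store the received data in order in a message queue for the corresponding log analysis client module to request; or, if the log analysis client module does not request the data, when the data in the message queue exceeds a threshold value, store the data in the message queue in order in a corresponding directory file.

Finally, the log obtaining module is further configured to:

update the start site of the relational database; and acquire the next operation log of the relational database according to the updated start site.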

Each log analysis client module is respectively connected with the log acquisition module and used for receiving the operation logs and the change data sent by the log acquisition module and sending the obtained change data to the data updating module.

Before receiving the operation log and the change data sent by the log acquisition module, each log analysis client module is further used for sending a data acquisition request to the log acquisition module according to the unique identity of the corresponding relational database and the connection relation between the log analysis client module and the log acquisition module.

Each log parsing client module is further to: and when the data updating module is not started, storing the received changed data in a local file.

And the data updating module is used for receiving the change data sent by each log analysis client module in real time and updating the obtained change data to the HBase of the distributed system infrastructure Hadoop according to the mapping relation file established by the table establishing module.

The data update module is further to: and storing the obtained change data in a local file, and recording the update history.

In addition, the above apparatus further comprises: a full import module, used for importing the full data of the relational database into HBase in the Hadoop platform through Sqoop.

In this apparatus, the number of log analysis client modules is the same as the number of relational databases.

Example one

In this embodiment, the relational database Mysql is taken as an example to explain in detail how to realize the real-time incremental data update from the Mysql database to the HBase database in Hadoop.

Firstly, the target Mysql database is configured: binary logging is enabled and set to row mode. The information of the Mysql database to be synchronized is then configured in the table building module. Once configuration is complete, the table building module is run; it creates the association table corresponding to the relational database in Hive and HBase and generates the mapping relation file between the data table in the relational database and the data table in HBase, for use by the data updating module. Suppose the target database contains a table info with the following structure:

Field name	Field type	Description
id	bigint	Auto-increment primary key
name	varchar(10)
age	int

The generated mapping relation file content is as follows:

{"COLUMN_FAMILY":"C","SQOOP_INFO":{"COLS":{"AGE":"C","ID":"A","NAME":"B"},"IS_PK_INT":true,"PK":["ID"]}}

Here COLUMN_FAMILY is the name of the column family in HBase. The column family has three columns, named A, B and C, which correspond to ID, NAME and AGE of the INFO table. The primary key of the INFO table is ID, of integer type, and the name of the data table in HBase is SQOOP_INFO.
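A mapping of this shape can be produced from the table structure with a few lines of code. The sketch below is only one possible way to do it: the function name `build_mapping` and its parameters are hypothetical, the A to Z naming limits it to 26 columns, and the HBase table name is passed in rather than derived, since the patent does not state how the name SQOOP_INFO is formed.

```python
import json
from string import ascii_uppercase


def build_mapping(hbase_table, columns, pk_columns, pk_is_int, family="C"):
    """Map relational column names, in table order, to HBase column names A, B, C, ..."""
    if len(columns) > len(ascii_uppercase):
        raise ValueError("this naming scheme supports at most 26 columns")
    cols = {name.upper(): ascii_uppercase[i] for i, name in enumerate(columns)}
    return {
        "COLUMN_FAMILY": family,
        hbase_table: {
            "COLS": cols,
            "IS_PK_INT": pk_is_int,
            "PK": [name.upper() for name in pk_columns],
        },
    }


mapping = build_mapping("SQOOP_INFO", ["id", "name", "age"], ["id"], True)
text = json.dumps(mapping, sort_keys=True, separators=(",", ":"))
print(text)
```

For the info table, the printed text is identical to the mapping relation file content shown above.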

Secondly, information such as the unique identity is configured for the target Mysql database in the relational database log acquisition module, which is then started as a background process. The log acquisition module is started before the full import because, if the target Mysql database holds a large amount of data, importing all of it into Hadoop takes some time, and the data may change during the import (new rows inserted, existing rows updated or deleted). The log acquisition module therefore needs to be running beforehand so that it records the Mysql operation log for this period, and the changes made during the import can be synchronized to the HBase database in Hadoop once the import completes.

And finally, after the relational database log acquisition module is started, the full data import is carried out with Sqoop. After the import is completed, the relational database log analysis client module and the data updating module are started as background processes. The log analysis client module is configured with the unique identity of the target Mysql database and the connection information of the log acquisition module, and uses them to request data from the log acquisition module. The data updating module sends requests to the log analysis client module to obtain data, and updates the obtained data into the HBase database of Hadoop according to the mapping relation file generated in the first step.

After all modules are deployed, the data changes caused by any operation on the target Mysql database can be seen, with a short delay, through a Hive query or an HBase query, and the corresponding data update history can be found in the log of the data updating module. The following are several examples of the data format transmitted between the relational database log acquisition module, the relational database log analysis client module and the data updating module:

{"COLUMN_NEWVALUE":["1","alex","25"],"COLUMN_OLDVALUE":["","",""],"OPERATETABLE":"info","PK_NEWVALUE":["1"],"PK_OLDVALUE":[""],"DATABASE":"sqoop","PK":["id"],"OPERATETYPE":"INSERT","USERNAME":"sqoop","COLUMN":["id","name","age"],"OPERATETIME":"2014-05-30 03:20:40:000","TABLESPACE":""}

This entry is an insert operation, so its old values are all empty. The following are two update operations: the first does not update the primary key and the second does. This can be seen from the operation type OPERATETYPE, whose value is defined as "UPDATE_FIELDCOMP_PK" when the primary key is updated and as "UPDATE_FIELDCOMP" when it is not:

{"COLUMN_NEWVALUE":["tom"],"COLUMN_OLDVALUE":["alex"],"OPERATETABLE":"info","PK_NEWVALUE":["1"],"PK_OLDVALUE":["1"],"DATABASE":"sqoop","PK":["id"],"OPERATETYPE":"UPDATE_FIELDCOMP","USERNAME":"sqoop","COLUMN":["name"],"OPERATETIME":"2014-05-30 03:21:28.000","TABLESPACE":""}

{"COLUMN_NEWVALUE":[],"COLUMN_OLDVALUE":[],"OPERATETABLE":"info","PK_NEWVALUE":["222"],"PK_OLDVALUE":["1"],"DATABASE":"sqoop","PK":["id"],"OPERATETYPE":"UPDATE_FIELDCOMP_PK","USERNAME":"sqoop","COLUMN":[],"OPERATETIME":"2014-05-30 03:24:23.000","TABLESPACE":""}

next, the following is the delete operation:

{"COLUMN_NEWVALUE":["",""],"COLUMN_OLDVALUE":["alex","25"],"OPERATETABLE":"info","PK_NEWVALUE":[""],"PK_OLDVALUE":["1"],"DATABASE":"sqoop","PK":["id"],"OPERATETYPE":"DELETE","USERNAME":"sqoop","COLUMN":["name","age"],"OPERATETIME":"2014-05-30 03:29:39.000","TABLESPACE":""}

the keywords defined in the above four operations will now be described as follows:

COLUMN_NEWVALUE: the new values, listed in order, of the fields of the row whose values changed, after the operation is performed;

COLUMN_OLDVALUE: the old values, listed in order, of the fields of the row whose values changed, before the operation is performed;

OPERATETABLE: the table on which the operation is performed;

PK_NEWVALUE: the new value of the primary key of the row after the operation is performed;

PK_OLDVALUE: the old value of the primary key of the row before the operation is performed;

DATABASE: the database name in Mysql corresponding to the operation;

PK: the column name of the primary key in the data table;

OPERATETYPE: the operation type: INSERT (insert); UPDATE_FIELDCOMP (update in which the primary key is not changed); UPDATE_FIELDCOMP_PK (update in which the primary key is changed); DELETE (delete);

USERNAME: the user name of the database;

COLUMN: the name of each column in the data table whose value changes;

OPERATETIME: the time at which the operation occurred;

TABLESPACE: the database table space.
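To show how these keywords and the mapping relation file fit together, the following sketch turns one change record into an operation type, an HBase row key and a set of HBase cells. The function name `to_hbase_cells`, the "family:column" cell naming and the underscore join of primary-key values are assumptions of this example; the record is a shortened copy of the first update example above.

```python
import json

# Mapping relation file content for the info table, as given in this example.
MAPPING = {
    "COLUMN_FAMILY": "C",
    "SQOOP_INFO": {
        "COLS": {"AGE": "C", "ID": "A", "NAME": "B"},
        "IS_PK_INT": True,
        "PK": ["ID"],
    },
}

RECORD = (
    '{"COLUMN_NEWVALUE":["tom"],"COLUMN_OLDVALUE":["alex"],'
    '"OPERATETABLE":"info","PK_NEWVALUE":["1"],"PK_OLDVALUE":["1"],'
    '"DATABASE":"sqoop","PK":["id"],"OPERATETYPE":"UPDATE_FIELDCOMP",'
    '"COLUMN":["name"]}'
)


def to_hbase_cells(record_json, mapping, hbase_table):
    """Translate a change record into (operation type, row key, cells)."""
    record = json.loads(record_json)
    family = mapping["COLUMN_FAMILY"]
    cols = mapping[hbase_table]["COLS"]
    # Join primary-key values with underscores (a single value stays unchanged).
    row_key = "_".join(record["PK_NEWVALUE"])
    cells = {
        f"{family}:{cols[name.upper()]}": value
        for name, value in zip(record["COLUMN"], record["COLUMN_NEWVALUE"])
    }
    return record["OPERATETYPE"], row_key, cells


result = to_hbase_cells(RECORD, MAPPING, "SQOOP_INFO")
print(result)
```

Only the changed column appears in the cells, which matches the record format: COLUMN and COLUMN_NEWVALUE list just the fields whose values changed.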

It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present application is not limited to any specific form of hardware or software combination.

The above description is only a preferred example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for realizing real-time increment synchronous data is applied to data import from a relational database to a distributed system architecture Hadoop, and comprises the following steps:

acquiring an operation log of the relational database in real time;

acquiring the change data of the relational database according to the acquired operation log, and updating the acquired change data into HBase of Hadoop according to the established mapping relation file;

wherein the obtaining the operation log of the relational database comprises:

receiving change data of an operation log of a relational database corresponding to the identity, and storing the received change data in a message queue in sequence; or,

2. The method of claim 1, wherein the identity and start point of the relational database are preconfigured; the acquiring the operation log of the relational database in real time comprises the following steps:

3. The method according to claim 1 or 2, wherein after the obtaining the operation log of the relational database in real time, the method further comprises:

updating the start site of the relational database;

4. The method of claim 3, further comprising: and storing the obtained change data in a local file, and recording the update history.

5. A device for realizing real-time increment synchronous data, applied to data import from a relational database to a distributed system architecture Hadoop, comprising a table building module, a log obtaining module, a plurality of log analyzing client modules and a data updating module, wherein:

each log analysis client module is respectively connected with the log acquisition module and used for receiving the operation logs and the change data sent by the log analysis client module and sending the obtained change data to the data updating module;

the data updating module is used for receiving the change data sent by each log analysis client module in real time and updating the obtained change data into an HBase of Hadoop according to the mapping relation file established by the table establishing module;

the log obtaining module is further configured to:

receiving data of an operation log of a relational database corresponding to the identity identifier, and sequentially storing the received data in a message queue for the corresponding log analysis client module to request to obtain; or,

6. The apparatus of claim 5, wherein the log obtaining module is specifically configured to:

pre-configuring the unique identity and the unique start site of the relational database;

7. The apparatus of claim 5 or 6, wherein the log obtaining module is further configured to:

updating the start site of the relational database;

8. The apparatus of claim 7, wherein the log parsing client module is further configured to: and when the data updating module is not started, storing the received change data in a local file.

9. The apparatus of claim 7, wherein the data update module is further configured to: and storing the obtained change data in a local file, and recording the update history.