CN109063196B - Data processing method and device, electronic equipment and computer readable storage medium - Google Patents

Info

Publication number
CN109063196B
Authority
CN
China
Prior art keywords
data
relational database
message queue
operation information
current operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811025121.6A
Other languages
Chinese (zh)
Other versions
CN109063196A (en
Inventor
周瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rajax Network Technology Co Ltd
Original Assignee
Rajax Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rajax Network Technology Co Ltd filed Critical Rajax Network Technology Co Ltd
Priority to CN201811025121.6A priority Critical patent/CN109063196B/en
Publication of CN109063196A publication Critical patent/CN109063196A/en
Application granted granted Critical
Publication of CN109063196B publication Critical patent/CN109063196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure disclose a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium. The method comprises: obtaining a current operation on a relational database; determining operation information related to the current operation; caching the operation information into a message queue; and acquiring the operation information from the message queue in real time by using a Flume engine, and transmitting the operation information to a non-relational database for storage. This technical solution achieves real-time synchronization between the non-relational database and the relational database, and provides users with a real-time data platform that supports interactive queries.

Description

Data processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In a web system that interacts with users, online service data is usually stored in a relational database, so the data in a traditional relational database is highly valuable. However, a relational database does not support the large-scale analytical queries and mass-data analysis required in the big data field. The data therefore needs to be migrated from the relational database to a non-relational database, where a data model is established and further data analysis is performed.
Disclosure of Invention
The embodiment of the disclosure provides a data processing method and device, electronic equipment and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a data processing method.
Specifically, the data processing method includes:
obtaining current operation on a relational database;
determining operation information related to the current operation;
caching the operation information into a message queue;
and acquiring the operation information from the message queue in real time by using a Flume engine, and transmitting the operation information to a non-relational database for storage.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining a current operation on a relational database includes:
the current operation is obtained in response to any one of data addition, data modification, and data deletion to the relational database.
With reference to the first aspect, in a second implementation manner of the first aspect, the determining operation information related to the current operation includes:
analyzing the current operation, and determining an operation type and/or a data source corresponding to the current operation;
and encapsulating the operation type and/or the data source into a standard format.
With reference to the first aspect, in a third implementation manner of the first aspect, the caching the operation information in a message queue includes:
and buffering the operation information into a Kafka message queue.
With reference to the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the method further includes:
configuring the Flume engine to use the Kafka message queue as a data source of the Flume engine.
With reference to the third implementation manner of the first aspect or the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the non-relational database is a Kudu database, and the method further includes:
configuring the Flume engine to use the Kudu database as a data destination of the Flume engine.
With reference to the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, configuring the Flume engine to use the Kudu database as a data destination includes:
adding an operation API of the Kudu database in the Flume engine;
and configuring the data destination of the Flume engine as the operation API.
In a second aspect, an embodiment of the present disclosure provides a data processing apparatus, including:
an acquisition module configured to acquire a current operation on the relational database;
a determination module configured to determine operation information related to the current operation;
the caching module is configured to cache the operation information into a message queue;
and the transmission module is configured to acquire the operation information from the message queue in real time by using a Flume engine and transmit the operation information to a non-relational database for storage.
With reference to the second aspect, in a first implementation manner of the second aspect, the obtaining module includes:
an obtaining sub-module configured to obtain the current operation in response to any one of data addition, data modification, and data deletion to the relational database.
With reference to the second aspect, in a second implementation manner of the second aspect, the determining module includes:
the determining submodule is configured to analyze the current operation and determine an operation type and/or a data source corresponding to the current operation;
an encapsulation submodule configured to encapsulate the operation type and/or data source into a standard format.
With reference to the second aspect, in a third implementation manner of the second aspect, the cache module includes:
and the caching submodule is configured to cache the operation information into a Kafka message queue.
With reference to the third implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the apparatus further includes:
a first configuration module configured to configure the Flume engine to use the Kafka message queue as a data source of the Flume engine.
With reference to the third implementation manner of the second aspect or the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the non-relational database is a Kudu database, and the apparatus further includes:
a second configuration module configured to configure the Flume engine to use the Kudu database as a data destination of the Flume engine.
With reference to the fifth implementation manner of the second aspect, in a sixth implementation manner of the second aspect, the second configuration module includes:
an adding submodule configured to add an operation API of the Kudu database in the Flume engine;
a configuration submodule configured to configure the data destination of the Flume engine as the operation API.
The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the data processing apparatus includes a memory and a processor, the memory is used for storing one or more computer instructions for supporting the data processing apparatus to execute the data processing method in the first aspect, and the processor is configured to execute the computer instructions stored in the memory. The data processing apparatus may further comprise a communication interface for the data processing apparatus to communicate with other devices or a communication network.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer-readable storage medium for storing computer instructions for a data processing apparatus, which contains computer instructions for executing the data processing method in the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
By acquiring operations on the relational database in real time, caching the related operation information in the message queue, and using the Flume engine to transmit the cached operation information to the non-relational database for storage in real time, the embodiments of the present disclosure synchronize operations on the relational database to the non-relational database in real time. Compared with the prior art, synchronization between the non-relational database and the relational database becomes real-time, and a real-time data platform that supports interactive queries is provided for users.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 2 shows a flow chart of step S102 according to the embodiment shown in FIG. 1;
FIG. 3 illustrates a flow diagram for configuring the Flume engine in a data processing method according to an embodiment of the present disclosure;
FIG. 4 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of the determination module 402 according to the embodiment shown in FIG. 4;
FIG. 6 shows a block diagram of a second configuration module in a data processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In the prior art, Apache's open-source software Sqoop is used to transfer data from a relational database to a big data platform, such as HDFS or HBase, for storage, and a data analysis engine is then used to analyze the data. Sqoop is dedicated to transferring data from relational databases to non-relational databases; during the transfer, it imports the data by using Hadoop MapReduce. Because Sqoop runs offline, it is suited to transferring data in batches, which takes a long time, and it cannot transmit data from the relational database to the big data platform in real time. As a result, the client experience is poor for queries with high real-time requirements. How to import data from the relational database to the non-relational database in real time, so as to satisfy the client's real-time query requirements, is therefore one of the problems to be solved at present.
Fig. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure. As shown in fig. 1, the data processing method includes the following steps S101 to S104:
in step S101, a current operation on the relational database is acquired;
in step S102, operation information related to the current operation is determined;
in step S103, the operation information is buffered in a message queue;
in step S104, the operation information is obtained from the message queue in real time by using a Flume engine, and is transmitted to a non-relational database for storage.
In the prior art, when big data analysis is performed on data in a relational database, a common approach is to use Apache Sqoop to import the MySQL relational database into HDFS in batches, as a scheduled task run every morning, and then build a Hive data warehouse on which users can perform ad hoc operations on the data with interactive SQL queries. During data migration, Sqoop jobs are converted into Hadoop MapReduce jobs for execution; when the data volume is large, execution takes a long time and the data is far from real-time.
In view of the foregoing drawbacks, an embodiment of the present disclosure provides a data processing method in which operations on a relational database are obtained in real time, the related operation information is cached in a message queue, and the cached operation information is transmitted to a non-relational database in real time by a Flume engine for storage. This synchronizes the non-relational database with the relational database and provides users with a non-relational real-time database platform that supports interactive queries.
The relational database is built on a relational model. The current operation on the relational database may be a data write, data read, data modification, data deletion, and so on; for example, the SQL operations of a MySQL relational database mainly include SELECT, DELETE, UPDATE, and INSERT operations. The current operation in this embodiment may be any operation on the relational database, or may be restricted to a predefined subset of operations, for example, operations that may change the contents of the relational database.
The embodiment can capture the operation of the user on the current relational database in real time and intercept the operation information related to the current operation. The operation information related to the current operation includes, but is not limited to, the type of the current operation (e.g., SELECT operation, DELETE operation, UPDATE operation, or INSERT operation), the data location of the current operation, the content of the current operation, the storage source (e.g., machine room information storing the data table of the current operation) where the data change occurs due to the current operation, and the like.
The message queue can be a pre-established real-time message queue. While operation information related to the current operation is being stored in the message queue, the Flume engine can take it out of the queue in real time and transmit it to the non-relational database for storage. The message queue can be a high-throughput queue that supports distributed parallel processing, so that the strict real-time requirements of data synchronization between a relational database and a non-relational database can be met.
Flume (a log collection system) is a distributed, reliable, and highly available log collection system suited to efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Flume supports customizing various data senders in the logging system for collecting data; it also provides the ability to perform simple processing on the data and write it to various data recipients (e.g., text, HDFS, HBase).
One data flow unit of Flume is a Flume engine, which comprises three modules: a data source (Source), a data storage (Channel), and a data destination (Sink). The data source of a Flume engine can be of various types, such as Avro Source and Exec Source. In this embodiment, the captured operation information related to the current operation is cached in the message queue and the Flume engine collects data from the message queue; that is, the message queue is the data source of Flume, the data storage of Flume may use memory storage, and the data destination is set to a non-relational database.
The Flume engine thus takes the message queue as its data source and the non-relational database as its data destination. During execution of the method, the Flume engine acquires operation information from the message queue, temporarily stores it in memory, and sends it to the non-relational database, so that the operation information in the message queue is transmitted to the non-relational database for storage in real time.
By acquiring operations on the relational database in real time, caching the related operation information in the message queue, and using the Flume engine to transmit the cached operation information to the non-relational database for storage in real time, the embodiments of the present disclosure synchronize operations on the relational database to the non-relational database in real time. Compared with the prior art, synchronization between the non-relational database and the relational database becomes real-time, and a real-time data platform that supports interactive queries is provided for users.
In an optional implementation manner of this embodiment, the relational database may be one or more of Oracle, DB2, PostgreSQL, Microsoft SQL Server, Microsoft Access, MySQL, and Inspur K-DB.
In an optional implementation manner of this embodiment, the non-relational database may be one or more of HDFS, HIVE, HBASE, Kudu, and other non-relational databases.
For example, assume that the current relational database is MySQL and the non-relational database is Kudu. First, the current operation on the MySQL relational database is obtained; second, the operation information related to the current operation is determined; third, the operation information is cached in a message queue; and finally, the configured Flume engine obtains the operation information from the message queue in real time and transmits it to the non-relational database Kudu for storage in real time. The current operation on the MySQL relational database is thereby synchronized to Kudu in real time, which satisfies data queries with high real-time requirements, increases the utilization of the whole system, and improves the user experience. The technical solution of the present disclosure is described above by taking the MySQL relational database and the non-relational database Kudu as examples, which should not be construed as limiting the present disclosure.
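The four steps S101 to S104 can be illustrated with a minimal sketch. It uses in-memory stand-ins only: a Python queue.Queue plays the role of the message queue and a plain list plays the role of the non-relational database. The function names capture and run_pipeline, the table name, and the row contents are hypothetical and do not come from the disclosure; a real deployment would use MySQL, a message queue such as Kafka, a Flume engine, and Kudu.

```python
import json
import queue


def capture(op_type, table, row_before, row_after):
    """Steps S101-S102: describe one captured operation as a fixed-format string."""
    return json.dumps({
        "type": op_type,
        "table": table,
        "row_before": row_before,
        "row_after": row_after,
    })


def run_pipeline(operations):
    """Steps S103-S104, with in-memory stand-ins for the queue and the store."""
    message_queue = queue.Queue()  # stand-in for the message queue
    store = []                     # stand-in for the non-relational database
    for op in operations:
        message_queue.put(capture(*op))   # S103: cache the operation information
    while not message_queue.empty():      # S104: take messages out and store them
        store.append(json.loads(message_queue.get()))
    return store


synced = run_pipeline([
    ("INSERT", "db.orders", {}, {"id": 1, "status": "new"}),
    ("UPDATE", "db.orders", {"id": 1, "status": "new"}, {"id": 1, "status": "paid"}),
])
```

The sketch drains the queue after all operations are enqueued only to stay short; in the described method the consumer runs continuously alongside the producer.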
In an optional implementation manner of this embodiment, the step S101, that is, the step of obtaining the current operation on the relational database, further includes:
the current operation is obtained in response to any one of data addition, data modification, and data deletion to the relational database.
In this optional implementation, when the current operation is predefined as data update, and includes any one of data addition, data modification, and data deletion, the current operation is captured. Because the content in the relational database is changed by data addition, data modification and data deletion, the content needs to be synchronized to the non-relational database in real time, and other operations, such as query operations, which do not change the content in the relational database, may not execute the data processing method provided by the embodiment of the present disclosure, so that resources may be saved. Wherein, the data addition comprises the addition of new columns, new column-level integrity constraints, new table-level integrity constraints and the like in the relational database; the data modification comprises modification of original column definition, modification of column names, modification of data types and the like; the data deletion includes deleting columns in the table or specified integrity constraints, etc.
In this embodiment, the current operation of the user on the relational database may be monitored in real time, and when the user performs any one of data addition, data modification, and data deletion on the relational database, the current operation is captured, and the data processing method provided by the embodiment of the present disclosure is performed on the current operation.
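As a rough illustration of capturing only operations that change the database, the following sketch filters statements by their leading SQL keyword. It assumes that operations arrive as SQL text and that INSERT, UPDATE, and DELETE are the change operations of interest; the name is_change_operation is hypothetical, and a production capture mechanism may work quite differently (for example, from the database's change log rather than from statement text).

```python
CHANGE_KEYWORDS = {"INSERT", "UPDATE", "DELETE"}  # assumed set of change operations


def is_change_operation(sql):
    """Return True if the statement's leading keyword changes table contents."""
    parts = sql.strip().split(None, 1)
    return bool(parts) and parts[0].upper() in CHANGE_KEYWORDS


statements = [
    "SELECT * FROM orders",
    "  update orders SET status = 'paid' WHERE id = 1",
    "DELETE FROM orders WHERE id = 2",
]
to_sync = [s for s in statements if is_change_operation(s)]
```

Query statements are skipped, which matches the resource-saving rationale given above.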
In an optional implementation manner of this embodiment, as shown in fig. 2, the step S102, that is, the step of determining the operation information related to the current operation, further includes the following steps S201 to S202:
in step S201, the current operation is analyzed, and an operation type and/or a data source corresponding to the current operation are determined;
in step S202, the operation type and/or data source is encapsulated into a standard format.
In this alternative implementation, the captured current operation is parsed to determine its operation type and/or data source. The operation type may be, for example, one of data addition, data modification, and data deletion, and the data source may include, but is not limited to, a data table name, a column name, a storage location, and the like in the database. It should be noted that, if the current operation is not restricted in advance to operations that may change the contents of the database, the current operation may also be a query operation; in that case, once its operation type has been determined, the corresponding data source need not be analyzed.
After the operation type and/or the data source corresponding to the current operation are determined, the operation type and/or the data source can be packaged into a standard format, so that the non-relational database can perform synchronous operation according to the standard format.
In one embodiment, the operation information, i.e., the operation type and/or the data source, involved in the current operation may be converted into a fixed-format string and stored in the message queue. For example, taking the UPDATE operation as an example, the fixed format conversion is as follows:
{
  "type": "UPDATE",
  "table": "dbname.tablename",
  "zone": "xg1",
  "row_after": {
    "column1": "xxx",
    "column2": "xxx",
    "column3": "xxx"
  },
  "row_before": {
    "column1": "xxx",
    "column2": "xxx",
    "column3": "xxx"
  }
}
The operation information stored in this string clearly indicates the operation type, the database table name, the machine room in which the data actually changed, and other information. The type field indicates the operation type of the current operation, the table field indicates the specific database table name, the zone field indicates the machine room where the data changed, the row_before field gives the value of each column before the data change, and the row_after field gives the value of each column after the change.
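The fixed-format conversion above can be sketched as a small helper that serializes the fields into a JSON string. The field names follow the example in the text; the function name encapsulate and the sample values are assumptions made for illustration.

```python
import json


def encapsulate(op_type, table, zone, row_before, row_after):
    """Package the operation type and data source into the fixed string format."""
    record = {
        "type": op_type,           # operation type, e.g. UPDATE
        "table": table,            # database and table name
        "zone": zone,              # machine room where the data changed
        "row_after": row_after,    # column values after the change
        "row_before": row_before,  # column values before the change
    }
    return json.dumps(record)


message = encapsulate(
    "UPDATE", "dbname.tablename", "xg1",
    row_before={"column1": "old"},
    row_after={"column1": "new"},
)
```

Because the consumer on the other side of the queue parses the same format, both sides only need to agree on these field names.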
In an optional implementation manner of this embodiment, the step S103 of buffering the operation information in a message queue further includes:
and buffering the operation information into a Kafka message queue.
In this optional implementation, Kafka is a high-throughput distributed publish-subscribe messaging system that can serve both online real-time processing and offline batch processing. Kafka has the following characteristics: high throughput (even on very ordinary hardware, Kafka can support millions of messages per second); support for partitioning messages across Kafka servers and consumer clusters; and support for parallel data loading into Hadoop.
Therefore, in this embodiment, the Kafka message queue is established first and used as a data exchange hub: the operation information from the relational database is first cached in the Kafka message queue and then transmitted to the non-relational database in real time by the Flume engine, which realizes real-time and efficient exchange of different types of data from the relational database to the non-relational database.
In an optional implementation manner of this embodiment, the method further includes:
configuring the Flume engine to use the Kafka message queue as a data source of the Flume engine.
In this optional implementation, Flume is driven by a configuration file. The configuration file is created first, and the Flume collection task is then started by running Flume's flume-ng script with that configuration file specified. In this embodiment, the configuration file configures the established Kafka message queue as the data source of the Flume engine and the non-relational database as its data destination.
In one embodiment, the established configuration file configures the Kafka message queue as the data source of the Flume engine through the following settings:
channels: specifies the Channel(s) to which the Source connects.
type: specifies the component type name, here org.apache.flume.source.kafka.KafkaSource.
kafka.bootstrap.servers: specifies the brokers of the Kafka cluster.
kafka.consumer.group.id: specifies the consumer group.
kafka.topics: specifies the Kafka topics to read.
kafka.topics.regex: specifies a regular expression that selects the topics to read.
batchSize: the maximum number of messages that the Flume Source writes to the Channel in one batch.
batchDurationMillis: the maximum wait time, in milliseconds, before a batch is written from the Flume Source to the Channel.
backoffSleepIncrement: the increment by which the wait time grows when the Kafka topic is empty.
maxBackoffSleep: the maximum wait time triggered when the Kafka topic is empty.
useFlumeEventFormat: whether to use Flume's event format as the data transmission format.
setTopicHeader: whether to store the topic of the retrieved message in a header.
topicHeader: defines the name of the header that holds the topic.
migrateZookeeperOffsets: whether to migrate offsets stored in ZooKeeper to Kafka, for compatibility with older versions of Flume.
kafka.consumer.security.protocol: sets the security protocol for data transmission.
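To show how a few of these settings might look in an actual Flume configuration file, the sketch below renders them as properties lines. The property keys are those listed above; the agent name a1, source name r1, channel name c1, broker addresses, consumer group, topic name, and batch values are all placeholders chosen for illustration, and the helper render_flume_properties is hypothetical.

```python
def render_flume_properties(agent, component_kind, name, settings):
    """Render settings as '<agent>.<kind>.<name>.<key> = <value>' lines."""
    prefix = f"{agent}.{component_kind}.{name}"
    return "\n".join(f"{prefix}.{key} = {value}" for key, value in settings.items())


# Placeholder values; only the property keys come from the list above.
kafka_source = {
    "type": "org.apache.flume.source.kafka.KafkaSource",
    "channels": "c1",
    "kafka.bootstrap.servers": "broker1:9092,broker2:9092",
    "kafka.consumer.group.id": "db-sync",
    "kafka.topics": "db-operations",
    "batchSize": 1000,
    "batchDurationMillis": 1000,
}

source_config = render_flume_properties("a1", "sources", "r1", kafka_source)
print(source_config)
```

The printed lines could be saved into the configuration file that is passed to the flume-ng script; suitable batch values depend on the workload.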
In an optional implementation manner of this embodiment, the non-relational database is a Kudu database, and the method further includes:
configuring the Flume engine to use the Kudu database as a data destination of the Flume engine.
In this optional implementation, Kudu is a new columnar storage system open-sourced by Cloudera. It is suited to building data warehouses that must respond quickly, compensates for the high response latency of HDFS, and is suited to data analysis in OLAP scenarios. Kudu stores structured tables with predefined, typed columns; each table has a primary key with a uniqueness constraint, which serves as an index to support fast random access. Kudu works in a manner similar to a log-structured storage system: insert, delete, and update operations are first applied in memory and later merged into persistent columnar storage. Because of Kudu's strong query and storage capabilities, this embodiment uses Kudu to provide users with a data platform for real-time queries.
The prior-art Flume engine does not support Kudu as a data destination; this embodiment makes Kudu the data destination of the Flume engine by improving the implementation of the Flume engine.
In an optional implementation manner of this embodiment, as shown in fig. 3, the step of configuring the Flume engine to use the Kudu database as a data destination of the Flume engine further includes the following steps S301 to S302:
in step S301, an operation API of the Kudu database is added in the Flume engine;
in step S302, the data destination of the Flume engine is configured as the operation API.
In this optional implementation, an operation API capable of operating the Kudu database may be added to the Flume engine by using an interface provided in the Kudu source code, and the location of the operation API is specified in the configuration file of the Flume engine, so that after taking the operation information out of the message queue, the Flume engine calls the operation API to store it in the Kudu database. Because the operation information taken out of the message queue by the Flume engine has a fixed format, the operation API can parse the operation information according to that format, obtain the data operation type, the corresponding data source, and so on, and perform the corresponding operation on the Kudu database, thereby synchronizing the data in the relational database to the non-relational database.
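The parsing-and-applying behaviour described for the operation API can be sketched without any Kudu dependency. In the sketch below a Python dict keyed by primary key stands in for a Kudu table; the function name apply_operation, the primary key column id, and the sample rows are assumptions for illustration and do not reflect the actual Kudu client interface.

```python
import json


def apply_operation(table_rows, message, key="id"):
    """Parse one fixed-format message and apply it to a dict keyed by primary key."""
    op = json.loads(message)
    op_type = op["type"]
    if op_type not in ("INSERT", "UPDATE", "DELETE"):
        raise ValueError(f"unsupported operation type: {op_type}")
    if op_type in ("UPDATE", "DELETE"):
        # Remove the old row; for UPDATE this also covers a changed primary key.
        table_rows.pop(op["row_before"][key], None)
    if op_type in ("INSERT", "UPDATE"):
        row = op["row_after"]
        table_rows[row[key]] = row
    return table_rows


rows = {}  # stand-in for a table in the non-relational database
apply_operation(rows, json.dumps(
    {"type": "INSERT", "row_before": {}, "row_after": {"id": 1, "v": "a"}}))
apply_operation(rows, json.dumps(
    {"type": "UPDATE", "row_before": {"id": 1, "v": "a"}, "row_after": {"id": 1, "v": "b"}}))
```

Keying the stand-in table by primary key mirrors the point above that each Kudu table has a unique primary key that supports fast random access.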
In one embodiment, in the established Flume configuration file, Kudu is set as the data destination of the Flume engine through the following configuration:
type: specifies the code location of KuduSink, which is the operation API of the Kudu database.
masterAddresses: specifies the address of the Kudu master, typically in the form IP:port.
tableName: specifies the name of the table created in Kudu.
batchSize: specifies the amount of data sent to Kudu in each batch.
producer: specifies the producer that writes event data to Kudu.
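As a hedged illustration of how these five settings might appear in a configuration file and be checked before the engine starts, the Python sketch below parses a properties-style fragment and reports missing keys. The agent and sink names (`a1`, `k1`), the class names under `com.example`, the address, and the table name are placeholders invented for this example, not values from the disclosure.

```python
# Keys named in the list above.
REQUIRED_KEYS = ("type", "masterAddresses", "tableName", "batchSize", "producer")

# Hypothetical configuration fragment; every value is a placeholder.
SINK_CONFIG = """\
a1.sinks.k1.type = com.example.flume.KuduSink
a1.sinks.k1.masterAddresses = 192.0.2.10:7051
a1.sinks.k1.tableName = orders_sync
a1.sinks.k1.batchSize = 100
a1.sinks.k1.producer = com.example.flume.JsonOperationsProducer
"""


def parse_sink(text, prefix="a1.sinks.k1."):
    """Collect `prefix`-scoped settings from properties-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith(prefix):
            settings[key[len(prefix):]] = value
    return settings


def missing_keys(settings):
    """Return the required keys that are absent, in declaration order."""
    return [key for key in REQUIRED_KEYS if key not in settings]
```

Calling `missing_keys(parse_sink(SINK_CONFIG))` returns an empty list for the fragment above.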
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 4 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 4, the data processing apparatus includes an obtaining module 401, a determining module 402, a caching module 403, and a transmitting module 404:
an obtaining module 401 configured to obtain a current operation on the relational database;
a determining module 402 configured to determine operation information related to the current operation;
a caching module 403 configured to cache the operation information into a message queue;
a transmitting module 404 configured to acquire the operation information from the message queue in real time by using a Flume engine, and transmit the operation information to a non-relational database for storage.
In the prior art, when big data analysis is performed on data in a relational database, a common approach is to import the MySQL relational database into HDFS in batches with Apache Sqoop as a scheduled task in the early morning of each day and then build a Hive data warehouse, on which users complete ad hoc operations on the data through interactive SQL queries. During data migration, Sqoop jobs are converted into Hadoop MapReduce jobs for execution; when the data volume is large, execution takes a long time and the data is far from real time.
In view of the foregoing drawbacks, an embodiment of the present disclosure provides a data processing apparatus in which the obtaining module 401 obtains operations on a relational database in real time, the determining module 402 determines the related operation information from the obtained operations, the caching module 403 caches that operation information in a message queue, and the transmitting module 404 uses a Flume engine to transmit the operation information cached in the message queue to a non-relational database for storage in real time. This keeps the non-relational database synchronized with the relational database and provides users with a non-relational real-time data platform that supports interactive queries.
The relational database is built on the relational model. Current operations on the relational database may include data writes, reads, modifications, and deletions; for example, the SQL operations of a MySQL relational database mainly include SELECT, DELETE, UPDATE, and INSERT operations. The current operation in this embodiment may be any operation on the relational database, or may be a predefined subset of operations, for example, operations that may change the contents of the relational database.
The embodiment can capture the operation of the user on the current relational database in real time and intercept the operation information related to the current operation. The operation information related to the current operation includes, but is not limited to, the type of the current operation (e.g., SELECT operation, DELETE operation, UPDATE operation, or INSERT operation), the data location of the current operation, the content of the current operation, the storage source (e.g., machine room information storing the data table of the current operation) where the data change occurs due to the current operation, and the like.
The message queue may be a pre-established real-time message queue: as operation information related to the current operation is stored in the message queue, the Flume engine can take it out in real time and transmit it to the non-relational database for storage. The message queue may be a high-throughput queue that supports distributed parallel processing, so that it meets the strict real-time requirements of data synchronization between the relational database and the non-relational database.
Flume is a distributed, reliable, and highly available log collection system suited to efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Flume supports customizing various data senders in a log system to collect data; it also provides the ability to perform simple processing on the data and write it to various data receivers (e.g., text, HDFS, HBase).
One data flow unit of Flume is a Flume engine, which comprises three modules: a data source (source), a data storage (channel), and a data destination (sink). The data source of a Flume engine can be of various types, such as an Avro Source or an Exec Source. In this embodiment, the captured operation information related to the current operation is cached in the message queue, so the Flume engine collects data from the message queue; that is, the message queue is the data source of Flume, the data storage of Flume may use in-memory storage, and the data destination is set to the non-relational database.
The Flume engine takes the message queue as its data source and the non-relational database as its data destination. During execution, the Flume engine acquires operation information from the message queue, temporarily holds it in memory, and then sends it to the non-relational database, so that the operation information in the message queue is transmitted to the non-relational database for storage in real time.
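The flow through the engine's three modules can be sketched in Python as a toy, single-threaded simulation. It does not use Flume itself: a `queue.Queue` stands in for the message queue, a bounded `deque` for the in-memory data storage, and a list for the non-relational database. The function name `run_agent` and the `capacity` parameter are assumptions made for illustration.

```python
import queue
from collections import deque


def run_agent(source, sink, capacity=100):
    """Move every pending message from `source` to `sink` via a bounded buffer.

    source: queue.Queue standing in for the message queue (data source)
    sink:   list standing in for the non-relational database (data destination)
    Returns the number of messages delivered.
    """
    channel = deque()  # in-memory data storage between source and destination
    delivered = 0
    while True:
        # Source side: pull pending messages until the buffer is full
        # or the queue is empty.
        while len(channel) < capacity:
            try:
                channel.append(source.get_nowait())
            except queue.Empty:
                break
        if not channel:
            return delivered  # nothing left to forward
        # Destination side: drain the buffer in arrival order.
        while channel:
            sink.append(channel.popleft())
            delivered += 1
```

The bounded buffer is what lets the source and the destination run at different speeds; a long-running engine would block and wait for new messages instead of returning when the queue is empty.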
By acquiring operations on the relational database in real time, caching the related operation information in the message queue, and using the Flume engine to transmit the cached operation information to the non-relational database for storage in real time, the embodiment of the disclosure synchronizes operations on the relational database to the non-relational database in real time. Compared with the prior art, synchronization between the non-relational database and the relational database becomes real time, and users are provided with a real-time data platform that supports interactive queries.
In an optional implementation manner of this embodiment, the relational database may be one or more of Oracle, DB2, PostgreSQL, Microsoft SQL Server, Microsoft Access, MySQL, and Inspur K-DB.
In an optional implementation manner of this embodiment, the non-relational database may be one or more of HDFS, Hive, HBase, Kudu, and other non-relational databases.
For example, assume that the current relational database is MySQL and the non-relational database is Kudu. First, the obtaining module 401 obtains the current operation on the MySQL relational database; second, the determining module 402 determines the operation information related to the current operation; third, the caching module 403 caches the operation information in the message queue; finally, the transmitting module 404 uses the configured Flume engine to acquire the operation information from the message queue in real time and transmit it to the non-relational database Kudu for storage. The current operation on the MySQL relational database is thereby synchronized to Kudu in real time, which satisfies data queries with strict real-time requirements, improves the utilization of the whole system, and improves the user experience. The technical solution of the present disclosure is described above using the MySQL relational database and the non-relational database Kudu as examples, which should not be construed as limiting the present disclosure.
In an optional implementation manner of this embodiment, the obtaining module 401 includes:
an obtaining sub-module configured to obtain the current operation in response to any one of data addition, data modification, and data deletion to the relational database.
In this optional implementation, the current operation is predefined as a data update, which is any one of data addition, data modification, and data deletion, and the current operation is captured when it occurs. Because data addition, modification, and deletion change the contents of the relational database, those changes need to be synchronized to the non-relational database in real time; other operations that do not change the contents of the relational database, such as queries, need not be processed by the data processing apparatus provided by the embodiment of the disclosure, which saves resources. Data addition includes adding new columns, new column-level integrity constraints, new table-level integrity constraints, and the like in the relational database; data modification includes modifying an original column definition, a column name, a data type, and the like; data deletion includes deleting columns in a table, specified integrity constraints, and the like.
In this embodiment, the current operation of the user on the relational database may be monitored in real time, and when the user performs any one of data addition, data modification, and data deletion on the relational database, the current operation is captured, and the data processing apparatus provided in the embodiment of the present disclosure is used for the current operation.
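The capture rule can be sketched as a simple filter on the leading SQL keyword. This is a simplification for illustration only: the disclosure does not say how operations are intercepted, the name `should_capture` is hypothetical, and treating schema changes as `ALTER` statements is an assumption.

```python
# Leading keywords treated as content-changing. Including ALTER (for column
# and integrity-constraint changes) is an assumption of this sketch.
CHANGE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "ALTER"}


def should_capture(statement):
    """Return True when the statement may change the relational database."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() in CHANGE_KEYWORDS


statements = [
    "SELECT * FROM orders",
    "update orders set status = 'paid' where id = 7",
    "ALTER TABLE orders ADD COLUMN note VARCHAR(64)",
]
# Only the UPDATE and ALTER statements pass the filter.
captured = [s for s in statements if should_capture(s)]
```

Filtering at capture time means queries never reach the message queue, which is where the resource saving comes from.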
In an optional implementation manner of this embodiment, as shown in fig. 5, the determining module 402 includes:
a determining submodule 501 configured to parse the current operation, and determine an operation type and/or a data source corresponding to the current operation;
an encapsulation submodule 502 configured to encapsulate the operation type and/or data source into a standard format.
In this alternative implementation, the captured current operation is parsed to determine the operation type and/or data source of the current operation. The operation type may be, for example, one of data addition, data modification, and data deletion, and the data source may include, but is not limited to, a data table name, a column name, a storage location, and the like in the database. It should be noted that if the current operation is not restricted in advance to operations that may change the contents of the database, the current operation may also be a query operation; for a query operation, once its operation type has been determined, the corresponding data source need not be parsed.
After the operation type and/or the data source corresponding to the current operation are determined, the operation type and/or the data source can be packaged into a standard format, so that the non-relational database can perform synchronous operation according to the standard format.
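A minimal sketch of the parse step, assuming the captured operation is available as SQL text. The regular expressions handle only simple, unquoted table names, and the function name `determine_operation` and the `zone` argument are hypothetical; a production parser would need a real SQL grammar.

```python
import re

# One pattern per content-changing statement; group 1 is the table name.
_PATTERNS = {
    "INSERT": re.compile(r"^\s*INSERT\s+INTO\s+([\w.]+)", re.IGNORECASE),
    "UPDATE": re.compile(r"^\s*UPDATE\s+([\w.]+)", re.IGNORECASE),
    "DELETE": re.compile(r"^\s*DELETE\s+FROM\s+([\w.]+)", re.IGNORECASE),
}


def determine_operation(statement, zone):
    """Return the operation type and data source, or None if unrecognized."""
    for op_type, pattern in _PATTERNS.items():
        match = pattern.match(statement)
        if match:
            return {"type": op_type, "table": match.group(1), "zone": zone}
    return None
```

The returned dict already has the shape of the standard format, so it can be serialized directly before being cached.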
In one embodiment, the operation information, i.e., the operation type and/or the data source, involved in the current operation may be converted into a fixed-format string and stored in the message queue. For example, taking the UPDATE operation as an example, the fixed format conversion is as follows:
{
  "type": "UPDATE",
  "table": "dbname.tablename",
  "zone": "xg1",
  "row_after": {
    "column1": "xxx",
    "column2": "xxx",
    "column3": "xxx"
  },
  "row_before": {
    "column1": "xxx",
    "column2": "xxx",
    "column3": "xxx"
  }
}
The operation information stored in this string clearly indicates the operation type, the database table name, the machine room in which the data changed, and other information. The type field indicates the operation type of the current operation, the table field indicates the specific database table name, the zone field indicates the machine room in which the change occurred, the row_before field holds the value of each column before the data change, and the row_after field holds the value of each column after the data change.
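To show how a consumer might use the two row images, the sketch below builds an UPDATE message in the format above and derives which columns actually changed. The helper names `encode_update` and `changed_columns` are hypothetical and are not part of the disclosure.

```python
import json


def encode_update(table, zone, row_before, row_after):
    """Serialize an UPDATE operation into the fixed-format string."""
    return json.dumps({
        "type": "UPDATE",
        "table": table,
        "zone": zone,
        "row_after": row_after,
        "row_before": row_before,
    })


def changed_columns(message):
    """Map each changed column to its (old value, new value) pair."""
    op = json.loads(message)
    before, after = op["row_before"], op["row_after"]
    return {
        column: (before.get(column), after[column])
        for column in after
        if before.get(column) != after[column]
    }
```

Carrying both images lets the receiving side locate the row by its old values and write only the new ones.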
In an optional implementation manner of this embodiment, the caching module 403 includes:
a caching submodule configured to cache the operation information into a Kafka message queue.
In this optional implementation, Kafka is a high-throughput distributed publish-subscribe messaging system that can serve both online real-time processing and offline batch processing. Kafka has the following characteristics: high throughput, in that even on very ordinary hardware it can support millions of messages per second; support for partitioning messages across Kafka servers and consumer clusters; and support for parallel data loading into Hadoop.
Therefore, in this embodiment, the Kafka message queue is established first and used as a data exchange hub: the operation information from the relational database is first cached in the Kafka message queue and is then transmitted to the non-relational database in real time through the Flume engine, which achieves real-time, efficient exchange of heterogeneous data from the relational database to the non-relational database.
In an optional implementation manner of this embodiment, the apparatus further includes: a first configuration module configured to configure the Flume engine to use the Kafka message queue as a data source for the Flume engine.
In this optional implementation, Flume is driven by a configuration file. The configuration file is created first; the Flume script flume-ng is then run with that configuration file specified to start the Flume collection task. In this embodiment, the configuration file sets the Kafka message queue established in this embodiment as the data source of the Flume engine and the non-relational database as its data destination.
In one embodiment, the established configuration file configures the Kafka message queue as the data source of the Flume engine through the following settings:
channels: specifies the channels to which the source connects.
type: specifies the component type, here org.apache.flume.source.kafka.KafkaSource.
kafka.bootstrap.servers: specifies the brokers of the Kafka cluster.
kafka.consumer.group.id: specifies the consumer group.
kafka.topics: specifies the Kafka topics to read.
kafka.topics.regex: specifies a regular expression that selects the topics to read.
batchSize: the maximum number of messages that the Flume source writes to the channel in one batch.
batchDurationMillis: the maximum time to wait before the Flume source writes a batch to the channel.
backoffSleepIncrement: the increment by which the wait time grows when the Kafka topic is empty.
maxBackoffSleep: the maximum wait time when the Kafka topic is empty.
useFlumeEventFormat: whether to use the Flume event format as the data transmission format.
setTopicHeader: whether to store the topic of the received message in an event header.
topicHeader: the name of the header that stores the topic.
migrateZookeeperOffsets: migrates offsets stored in ZooKeeper to Kafka, to support older versions of Flume.
kafka.consumer.security.protocol: sets the security protocol for data transmission.
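For orientation, the Python sketch below renders a source section and a memory channel section in the `agent.component.key = value` layout that Flume configuration files use. The agent, source, and channel names (`a1`, `r1`, `c1`), the broker address, the group id, the topic name, and all numeric values are placeholders chosen for this example; only the property keys and the KafkaSource class name come from the list above and from Flume's memory channel.

```python
def render_properties(prefix, settings):
    """Render a dict as `prefix.key = value` lines."""
    return [f"{prefix}.{key} = {value}" for key, value in settings.items()]


AGENT = "a1"  # hypothetical agent name

kafka_source = {
    "type": "org.apache.flume.source.kafka.KafkaSource",
    "channels": "c1",
    "kafka.bootstrap.servers": "192.0.2.20:9092",  # placeholder broker
    "kafka.consumer.group.id": "db-sync",          # placeholder group
    "kafka.topics": "db-operations",               # placeholder topic
    "batchSize": 1000,
    "batchDurationMillis": 1000,
}

# In-memory data storage between the source and the destination.
memory_channel = {
    "type": "memory",
    "capacity": 10000,
    "transactionCapacity": 1000,
}

lines = [f"{AGENT}.sources = r1", f"{AGENT}.channels = c1"]
lines += render_properties(f"{AGENT}.sources.r1", kafka_source)
lines += render_properties(f"{AGENT}.channels.c1", memory_channel)
agent_config = "\n".join(lines)
```

Writing `agent_config` to a file yields a fragment that could be combined with a destination section and passed to flume-ng; it is a starting point, not a tuned configuration.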
In an optional implementation manner of this embodiment, the non-relational database is a Kudu database, and the apparatus further includes: a second configuration module configured to configure the Flume engine to use the Kudu database as a data destination for the Flume engine.
In this optional implementation, Kudu is a columnar storage system open-sourced by Cloudera. It is suited to building data warehouses that must respond quickly, compensates for the high response latency of HDFS, and is suited to data analysis in OLAP scenarios. Kudu stores structured tables with predefined, typed columns; each table has a primary key with a uniqueness constraint, which serves as an index to support fast random access. Kudu works in a manner similar to a log-structured storage system: insert, delete, and update operations are first applied in memory and are then merged into persistent columnar storage. Because Kudu offers strong query and storage capabilities, this embodiment uses Kudu to provide users with a data platform for real-time queries.
The prior-art Flume engine does not support Kudu as a data destination; this embodiment modifies the implementation of the Flume engine so that Kudu can serve as its data destination.
In an optional implementation manner of this embodiment, as shown in fig. 6, the second configuration module includes:
an adding submodule 601 configured to add an operation API of the Kudu database to the Flume engine;
a configuration submodule 602 configured to configure a data destination of the Flume engine as the operation API.
In this optional implementation, an operation API capable of operating on the Kudu database may be added to the Flume engine by using the interfaces provided in the Kudu source code, and the location of the operation API is specified in the configuration file of the Flume engine, so that after the Flume engine takes operation information out of the message queue, it calls the operation API to store that information in the Kudu database. Because the operation information taken from the message queue has a fixed format, the operation API can parse it according to that format, obtain the data operation type, the corresponding data source, and so on, and perform the corresponding operation on the Kudu database, thereby synchronizing the data in the relational database to the non-relational database.
In one embodiment, in the established Flume configuration file, Kudu is set as the data destination of the Flume engine through the following configuration:
type: specifies the code location of KuduSink, which is the operation API of the Kudu database.
masterAddresses: specifies the address of the Kudu master, typically in the form IP:port.
tableName: specifies the name of the table created in Kudu.
batchSize: specifies the amount of data sent to Kudu in each batch.
producer: specifies the producer that writes event data to Kudu.
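The effect of the batchSize setting can be sketched as follows: events are grouped and handed to a writer one batch at a time, with a final partial batch flushed at the end. The function name `flush_in_batches` and the `write_batch` callback are hypothetical; the callback stands in for whatever call sends a batch to Kudu.

```python
def flush_in_batches(events, batch_size, write_batch):
    """Send `events` to `write_batch` in groups of at most `batch_size`.

    Returns the number of batches written.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    written = 0
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            write_batch(batch)
            written += 1
            batch = []  # start a fresh list so the sent batch is not mutated
    if batch:  # flush the final, partially filled batch
        write_batch(batch)
        written += 1
    return written
```

Larger batches reduce the number of round trips to the store, while smaller batches shorten the delay before a change becomes visible, so the value is a trade-off between throughput and latency.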
Fig. 7 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an embodiment of the present disclosure.
As shown in fig. 7, the electronic apparatus 700 includes a Central Processing Unit (CPU) 701, which can execute the various processes in the embodiment shown in fig. 1 described above according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic apparatus 700. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 710 as necessary, so that a computer program read from it is installed into the storage section 708 as necessary.
In particular, according to embodiments of the present disclosure, the method described above with reference to fig. 1 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program comprising program code for performing the method of fig. 1. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description covers only the preferred embodiments of the disclosure and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by interchanging the above features with (but not limited to) features of similar function disclosed in this disclosure.

Claims (10)

1. A data processing method, comprising:
obtaining current operation on a relational database;
determining operation information related to the current operation;
caching the operation information into a message queue;
obtaining the operation information from the message queue in real time by using a Flume engine, and transmitting the operation information to a non-relational database for storage,
wherein the determining operation information related to the current operation includes:
analyzing the current operation, and determining an operation type and/or a data source corresponding to the current operation;
encapsulating the operation type and/or data source into a standard format,
wherein the non-relational database is a Kudu database, and the method further comprises:
adding an operation API of the Kudu database to the Flume engine, wherein the operation API is used for parsing the operation type and/or data source encapsulated into the standard format and synchronizing them to the Kudu database;
configuring a data destination of the Flume engine as the operation API,
the obtaining the current operation on the relational database comprises:
obtaining the current operation in response to adding or deleting an integrity constraint in the relational database.
2. The data processing method of claim 1, wherein obtaining the current operation on the relational database comprises:
the current operation is obtained in response to any one of data addition, data modification, and data deletion to the relational database.
3. The data processing method of claim 1, wherein caching the operation information into a message queue comprises:
caching the operation information into a Kafka message queue.
4. The data processing method of claim 3, wherein the method further comprises:
configuring the Flume engine to use the Kafka message queue as a data source of the Flume engine.
5. A data processing apparatus, comprising:
an obtaining module configured to obtain a current operation on the relational database;
a determining module configured to determine operation information related to the current operation;
a caching module configured to cache the operation information into a message queue;
a transmitting module configured to acquire the operation information from the message queue in real time by using a Flume engine and transmit the operation information to a non-relational database for storage,
wherein the determining module comprises:
a determining submodule configured to parse the current operation and determine an operation type and/or a data source corresponding to the current operation;
an encapsulation submodule configured to encapsulate the operation type and/or data source into a standard format,
wherein the non-relational database is a Kudu database, and the apparatus further comprises:
a second configuration module configured to configure the Flume engine to use the Kudu database as a data destination for the Flume engine, the second configuration module comprising:
an adding submodule configured to add an operation API of the Kudu database to the Flume engine, the operation API being used for parsing the operation type and/or data source encapsulated into the standard format and synchronizing them to the Kudu database;
a configuration submodule configured to configure a data destination of the Flume engine as the operation API,
the obtaining the current operation on the relational database comprises:
obtaining the current operation in response to adding or deleting an integrity constraint in the relational database.
6. The data processing apparatus of claim 5, wherein the obtaining module comprises:
an obtaining sub-module configured to obtain the current operation in response to any one of data addition, data modification, and data deletion to the relational database.
7. The data processing apparatus of claim 5, wherein the caching module comprises:
a caching submodule configured to cache the operation information into a Kafka message queue.
8. The data processing apparatus of claim 7, wherein the apparatus further comprises:
a first configuration module configured to configure the Flume engine to use the Kafka message queue as a data source for the Flume engine.
9. An electronic device comprising a memory and a processor; wherein
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of any of claims 1-4.
10. A computer-readable storage medium having stored thereon computer instructions, characterized in that the computer instructions, when executed by a processor, carry out the method steps of any of claims 1-4.
CN201811025121.6A 2018-09-03 2018-09-03 Data processing method and device, electronic equipment and computer readable storage medium Active CN109063196B (en)

Publications (2)

Publication Number    Publication Date
CN109063196A (en)    2018-12-21
CN109063196B (en)    2021-08-27

Family

ID=64759586

Country Status (1)

Country Link
CN (1) CN109063196B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960710B (en) * 2019-01-16 2023-04-11 平安科技(深圳)有限公司 Data synchronization method and system between databases
CN112015952A (en) * 2019-06-03 2020-12-01 食亨(上海)科技服务有限公司 Data processing system and method
CN110417898B (en) * 2019-07-31 2022-02-22 拉扎斯网络科技(上海)有限公司 Data transmission method, device, client, electronic equipment and storage medium
CN110401721B (en) * 2019-08-06 2022-07-08 北京达佳互联信息技术有限公司 Method, device and system for distributing content data
CN110781235A (en) * 2019-10-24 2020-02-11 珠海格力电器股份有限公司 Big data based purchase data processing method and device, terminal and storage medium
CN111046100B (en) * 2019-11-25 2024-03-08 武汉达梦数据库股份有限公司 Method and system for synchronizing relational database to non-relational database
CN111200637B (en) * 2019-12-20 2022-07-08 新浪网技术(中国)有限公司 Cache processing method and device
CN111651464B (en) * 2020-04-15 2024-02-23 北京皮尔布莱尼软件有限公司 Data processing method, system and computing device
CN111815324A (en) * 2020-06-28 2020-10-23 北京金山云网络技术有限公司 Bill processing method, device and system
CN112286875A (en) * 2020-10-23 2021-01-29 青岛以萨数据技术有限公司 System framework for processing real-time data stream and real-time data stream processing method
CN113051274B (en) * 2021-03-31 2023-02-07 上海天旦网络科技发展有限公司 Mass tag storage system and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629268B (en) * 2012-03-09 2014-12-17 华为技术有限公司 Data synchronization method, system and date access device
CN103823815B (en) * 2012-11-19 2017-05-17 中国联合网络通信集团有限公司 server and database access method
CN104809199B (en) * 2015-04-24 2018-11-16 联动优势科技有限公司 A kind of method and apparatus of database synchronization
CN107038162B (en) * 2016-02-03 2021-03-02 北京嘀嘀无限科技发展有限公司 Real-time data query method and system based on database log
US10303675B2 (en) * 2016-05-20 2019-05-28 FinancialForce.com, Inc. Custom lightning connect adapter for google sheets web-based spreadsheet program
CN107943979A (en) * 2017-11-29 2018-04-20 山东鲁能软件技术有限公司 The quasi real time synchronous method and device of data between a kind of database
CN108335075B (en) * 2018-03-02 2020-12-11 华南理工大学 Logistics big data oriented processing system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
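"Secondary development of a Flume sink for writing to Kudu, adding custom primary keys"; cf05948; GitHub; 2017-09-15; pp. 1-9 *
"How to use Flume to collect Kafka data and write it to Kudu"; Fayson; Tencent Cloud; 2018-07-11; pp. 1-11 *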

Also Published As

Publication number Publication date
CN109063196A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109063196B (en) Data processing method and device, electronic equipment and computer readable storage medium
US20230144450A1 (en) Multi-partitioning data for combination operations
CN110147398B (en) Data processing method, device, medium and electronic equipment
CN110019350B (en) Data query method and device based on configuration information
CN105138592B (en) Log data storage and search method based on distributed architecture
CN109413127B (en) Data synchronization method and device
CN112307037B (en) Data synchronization method and device
CN108073625B (en) System and method for metadata information management
WO2019219010A1 (en) Data migration method and device and computer readable storage medium
CN107977396B (en) Method and device for updating data table of KeyValue database
CN111258978B (en) Data storage method
CN111858760B (en) Data processing method and device for heterogeneous database
CN104794190A (en) Method and device for effectively storing big data
CN110837423A (en) Method and device for automatically acquiring data of guided transport vehicle
US10902069B2 (en) Distributed indexing and aggregation
CN104750855A (en) Method and device for optimizing big data storage
CN112416991A (en) Data processing method and device and storage medium
CN111221851A (en) Lucene-based mass data query and storage method and device
CN103034650B (en) Data processing system and method
CN113190517B (en) Data integration method and device, electronic equipment and computer readable medium
CN111241189A (en) Method and device for synchronizing data
CN114443599A (en) Data synchronization method and device, electronic equipment and storage medium
KR20100132752A (en) Distributed data processing system
KR101830504B1 (en) In-Memory DB Connection Support Type Scheduling Method and System for Real-Time Big Data Analysis in Distributed Computing Environment
CN107665241B (en) Real-time data multi-dimensional duplicate removal method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant