CN112988741A - Real-time service data merging method and device and electronic equipment - Google Patents

Real-time service data merging method and device and electronic equipment

Info

Publication number
CN112988741A
Authority
CN
China
Prior art keywords
data
real
service
management system
service data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110157033.7A
Other languages
Chinese (zh)
Inventor
田刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd
Priority to CN202110157033.7A
Publication of CN112988741A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The disclosure relates to a real-time service data merging method, apparatus, system, electronic device and computer-readable medium. The method comprises: acquiring a binary log file from a service database in real time, wherein the service data are distinguished by a primary key; parsing the binary log file to generate service data; writing the service data into a distributed publish-subscribe messaging system in real time to generate cluster data; importing the cluster data from the distributed publish-subscribe messaging system into a columnar database management system in a real-time consumption manner based on the primary key; and merging the service data through the columnar database management system. The method, apparatus, system, electronic device and computer-readable medium can improve data backtracking efficiency, shorten data production time and reduce resource consumption while preserving the accuracy, timeliness and ordering of the original real-time data backtracking.

Description

Real-time service data merging method and device and electronic equipment
Technical Field
The present disclosure relates to the field of computer information processing, and in particular to a method, an apparatus, a system, an electronic device and a computer-readable medium for merging real-time service data.
Background
With the development of each company's business, more and more operators are moving data analysis work from offline processing to online real-time processing in order to provide better service for users, respond to sudden changes in data, and monitor operation strategies in real time. Operators can monitor operation strategies and adjust strategy layout through real-time and quasi-real-time indexes.
To meet this demand, the prior art often uses the Canal (database incremental log parsing) framework to pull binary log file data from the service MySQL database in real time, write it into Kafka (a distributed publish-subscribe messaging system) in real time, and import it into ClickHouse (a columnar database management system). The binary log data is merged and deduplicated in ClickHouse so that the data reaches a state consistent with the latest data in MySQL. Real-time and quasi-real-time index calculation and real-time label generation are then completed, which promotes service development and accelerates service perception.
However, as the data volume of individual tables in MySQL grows, more and more data need to be merged, and the original scheme of merging data through a common view suffers from slow execution, high resource consumption, an unstable ClickHouse cluster and similar problems.
Therefore, a new real-time service data merging method, apparatus, system, electronic device and computer-readable medium are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present disclosure provides a method, an apparatus, a system, an electronic device and a computer-readable medium for merging real-time service data, which can improve data backtracking efficiency, shorten data production time and reduce resource consumption while preserving the accuracy, timeliness and ordering of the original real-time data backtracking.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, a method for merging real-time service data is provided, the method including: acquiring a binary log file from a service database in real time, wherein the service data are distinguished by a primary key; parsing the binary log file to generate service data; writing the service data into a distributed publish-subscribe messaging system in real time to generate cluster data; importing the cluster data from the distributed publish-subscribe messaging system into a columnar database management system in a real-time consumption manner based on the primary key; and merging the service data through the columnar database management system.
Optionally, the method further comprises: extracting service data from the merged cluster data in real time; and generating a service index and/or a service label in real time based on the service data.
Optionally, acquiring the binary log file from the service database in real time includes: acquiring the binary log file from the service database in real time in an incremental log parsing manner.
Optionally, importing the cluster data from the distributed publish-subscribe messaging system into the columnar database management system in a real-time consumption manner based on the primary key comprises: acquiring cluster data from the distributed publish-subscribe messaging system in a real-time consumption manner; extracting primary keys of a plurality of service data contained in the cluster data; and importing the plurality of service data into the columnar database management system based on the primary keys.
Optionally, before importing the plurality of service data into the columnar database management system based on the primary keys, the method includes: creating a plurality of local tables in the columnar database management system based on the number of primary keys.
Optionally, importing the plurality of service data into the columnar database management system based on the primary keys comprises: calculating hash values of the primary keys respectively; and storing the plurality of service data in a plurality of local tables of the columnar database management system, respectively, based on the hash values.
Optionally, merging the service data through the columnar database management system includes: acquiring the cluster data through a materialized view of the columnar database management system; and merging the service data in the cluster data based on the plurality of local tables.
Optionally, before the materialized view of the columnar database management system acquires the cluster data, the method includes: creating a materialized view based on a plurality of local tables of the columnar database management system, the materialized view being used for cluster data merging; creating a cluster distributed table based on the materialized view; and generating a common view based on the cluster distributed table.
Optionally, acquiring the cluster data through the materialized view of the columnar database management system includes: processing, by the materialized view of the columnar database management system, the cluster data in an asynchronous manner.
Optionally, merging the service data in the cluster data based on the plurality of local tables includes: performing a deduplication operation on the service data based on a common view function; and merging the service data after the deduplication operation.
According to an aspect of the present disclosure, a real-time service data merging apparatus is provided, the apparatus including: a log module for acquiring a binary log file from a service database in real time, the service data being distinguished by a primary key; a parsing module for parsing the binary log file to generate service data; a writing module for writing the service data into a distributed publish-subscribe messaging system in real time to generate cluster data; an import module for importing the cluster data from the distributed publish-subscribe messaging system into a columnar database management system in a real-time consumption manner based on the primary key; and a merging module for merging the service data through the columnar database management system.
Optionally, the apparatus further comprises: a service module for extracting service data from the merged cluster data in real time, and generating a service index and/or a service label in real time based on the service data.
Optionally, the log module is further configured to acquire the binary log file from the service database in real time in an incremental log parsing manner.
Optionally, the import module includes: an acquisition unit for acquiring cluster data from the distributed publish-subscribe messaging system in a real-time consumption manner; a primary key unit for extracting primary keys of a plurality of service data contained in the cluster data; and an import unit for importing the plurality of service data into the columnar database management system based on the primary keys.
Optionally, the import module further includes: a local table unit for creating a plurality of local tables in the columnar database management system based on the number of primary keys.
Optionally, the import unit is further configured to calculate hash values of the primary keys respectively, and to store the plurality of service data in a plurality of local tables of the columnar database management system, respectively, based on the hash values.
Optionally, the merging module includes: a cluster unit for acquiring the cluster data through a materialized view of the columnar database management system; and a merging unit for merging the service data in the cluster data based on the plurality of local tables.
Optionally, the merging module further includes: a creating unit for creating a materialized view based on a plurality of local tables of the columnar database management system, the materialized view being used for cluster data merging; creating a cluster distributed table based on the materialized view; and generating a common view based on the cluster distributed table.
Optionally, the cluster unit is further configured such that the materialized view of the columnar database management system processes the cluster data in an asynchronous manner.
Optionally, the merging unit is further configured to perform a deduplication operation on the service data based on a common view function, and to merge the service data after the deduplication operation.
According to an aspect of the present disclosure, a real-time service data merging system is provided, the system including: a service database for storing real-time service data, wherein the service data are recorded in binary log files and are distinguished by a primary key; an incremental log parsing server for parsing the binary log file to generate service data; a distributed publish-subscribe messaging system into which the service data are written in real time to generate cluster data; and a columnar database management system for acquiring the cluster data based on the primary key and merging the service data.
According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to an aspect of the disclosure, a computer-readable medium is provided, on which a computer program is stored, the program, when executed by a processor, implementing the method described above.
According to the real-time service data merging method, apparatus, system, electronic device and computer-readable medium of the present disclosure, a binary log file is acquired from a service database in real time, the service data being distinguished by a primary key; the binary log file is parsed to generate service data; the service data are written into a distributed publish-subscribe messaging system in real time to generate cluster data; the cluster data are imported from the distributed publish-subscribe messaging system into a columnar database management system in a real-time consumption manner based on the primary key; and the service data are merged through the columnar database management system. In this way, data backtracking efficiency can be improved, data production time shortened and resource consumption reduced while preserving the accuracy, timeliness and ordering of the original real-time data backtracking.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram illustrating a real-time service data merging system according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a real-time service data merging method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a real-time service data merging method according to another exemplary embodiment.
Fig. 4 is a flowchart illustrating a real-time service data merging method according to another exemplary embodiment.
Fig. 5 is a diagram illustrating a real-time service data merging method according to another exemplary embodiment.
Fig. 6 is a block diagram illustrating a real-time service data merging apparatus according to an exemplary embodiment.
Fig. 7 is a block diagram illustrating a real-time service data merging system according to another exemplary embodiment.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 9 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, system implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
Fig. 1 is a schematic diagram illustrating a real-time service data merging system according to an exemplary embodiment.
As shown in fig. 1, system architecture 10 may include a service database 101, an incremental log parsing server 102, a distributed publish-subscribe messaging system 103, and a columnar database management system 104. A network serves as the medium providing communication links between service database 101, incremental log parsing server 102, distributed publish-subscribe messaging system 103, and columnar database management system 104. The network may include various connection types, such as wired or wireless communication links or fiber optic cables.
The service database 101 can store real-time service data, which specifically includes user data and real-time behavior data generated by users, and the service data are distinguished by a primary key; the incremental log parsing server 102 may, for example, parse the binary log file to generate service data; the distributed publish-subscribe messaging system 103 may, for example, receive the service data written in real time to generate cluster data; columnar database management system 104 may, for example, acquire the cluster data based on the primary key and merge the service data.
More specifically, in a practical application scenario, in order for the columnar database management system 104 to turn cluster-wide data merging into local data merging by using ReplacingMergeTree, it is first ensured that data with the same primary key is written only to the same local table of the columnar database management system 104. Then, by creating a materialized view on the ReplacingMergeTree engine, the data is merged locally inside the view so that only the row with the latest timestamp is kept. However, ReplacingMergeTree can merge only data within the same partition, and partitions are generally divided by date, so rows with the same primary key that fall on different dates cannot be merged. In the disclosed embodiments, a cluster table and a corresponding common view are therefore created so that only the most recent copy of each primary key remains across the cluster.
With this processing, data with the same primary key is merged as far as possible when the underlying data is merged, which reduces the data volume. When the cluster-wide merge is then performed through the common view, the data of each local table is gathered and merged; that is, data is first merged locally and then aggregated. Memory consumption is greatly reduced, and the speed of merging data through the common view is greatly improved.
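The "merge locally first, then converge" idea can be illustrated with a small in-memory sketch. It is only an analogy under stated assumptions: the row shape `(primary_key, partition_date, version_timestamp, payload)` and the function names are hypothetical, and plain Python dictionaries stand in for the local tables and the cluster-wide view.

```python
def merge_within_partitions(rows):
    """Mimic a local, per-partition merge: keep the newest row per (date, key)."""
    newest = {}
    for key, day, ts, payload in rows:
        slot = (day, key)
        if slot not in newest or ts > newest[slot][2]:
            newest[slot] = (key, day, ts, payload)
    return list(newest.values())


def merge_globally(local_results):
    """Mimic the cluster-wide step: keep the newest row per key across
    all local tables and all partitions."""
    newest = {}
    for rows in local_results:
        for row in rows:
            key, ts = row[0], row[2]
            if key not in newest or ts > newest[key][2]:
                newest[key] = row
    return newest


# Rows with the same primary key live in the same (hypothetical) local table.
local_table_a = [
    ("order-1", "2021-02-01", 100, "created"),
    ("order-1", "2021-02-01", 105, "paid"),
    ("order-1", "2021-02-02", 230, "shipped"),  # different date partition
]
local_table_b = [
    ("order-2", "2021-02-01", 110, "created"),
    ("order-2", "2021-02-01", 120, "cancelled"),
]

merged_a = merge_within_partitions(local_table_a)
merged_b = merge_within_partitions(local_table_b)
latest = merge_globally([merged_a, merged_b])
```

In this sketch the local step cannot collapse the two `order-1` rows that sit in different date partitions, so the global step still has work to do, but it receives three rows instead of five.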
Fig. 2 is a flowchart illustrating a real-time service data merging method according to an exemplary embodiment. The real-time service data merging method 20 at least includes steps S202 to S210.
As shown in fig. 2, in S202, a binary log file is acquired from a service database in real time, the service data being distinguished by a primary key. This may include acquiring the binary log file from the service database in real time in an incremental log parsing manner.
The binary log file may be a binlog file. A binlog is a binary log that records all database table structure changes (e.g., CREATE, ALTER TABLE …) and table data modifications (INSERT, UPDATE, DELETE …). The binlog does not record operations such as SELECT and SHOW, because such operations do not modify the data itself.
The incremental log parsing manner can be implemented with Canal, which provides incremental data subscription and consumption based on parsing the incremental log of a MySQL database. More specifically, Canal simulates the interaction protocol of a MySQL slave, presents itself as a MySQL slave, and sends a dump request to the MySQL master; the MySQL master receives the dump request and starts pushing the binlog to the slave (namely, Canal); Canal then parses the binlog objects (originally a byte stream).
In S204, the binary log file is parsed to generate service data.
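As a rough illustration of what "generating service data" from a parsed log entry might look like, the sketch below flattens one change event into per-row records tagged with their primary key. The JSON field names (`database`, `table`, `type`, `isDdl`, `es`, `pkNames`, `data`) and the sample values are assumptions chosen for illustration; they are not taken from the disclosure and the actual parser output may differ.

```python
import json


def to_service_records(message):
    """Flatten one change event (a JSON string) into per-row service records.

    The event layout is hypothetical; table-structure (DDL) events are skipped
    because they carry no row data.
    """
    event = json.loads(message)
    if event.get("isDdl"):
        return []
    pk_names = event["pkNames"]
    records = []
    for row in event["data"]:
        records.append({
            "table": f'{event["database"]}.{event["table"]}',
            # The primary key distinguishes service data downstream.
            "primary_key": tuple(row[name] for name in pk_names),
            "op": event["type"],     # INSERT, UPDATE or DELETE
            "version": event["es"],  # assumed to be the change timestamp
            "row": row,
        })
    return records


sample = json.dumps({
    "database": "biz", "table": "application", "type": "UPDATE",
    "isDdl": False, "es": 1612345678000, "pkNames": ["id"],
    "data": [{"id": "42", "status": "approved"}],
})
records = to_service_records(sample)
```

Carrying the primary key and a version on every record is what allows later stages to route rows by key and keep only the newest one.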
In S206, the service data is written into the distributed publish-subscribe message system in real time to generate cluster data.
The distributed publish-subscribe messaging system can be Kafka, a high-throughput distributed publish-subscribe messaging system written in Scala and Java that can handle all the action stream data of consumers on a website. Because of throughput requirements, such data is typically handled through log processing and log aggregation. Kafka is a viable solution for systems that, like Hadoop, handle log data and offline analysis but also require real-time processing. The purpose of Kafka is to unify online and offline message processing through the parallel loading mechanism of Hadoop, and also to provide real-time messages through clustering.
In S208, the cluster data is imported from the distributed publish-subscribe messaging system into a columnar database management system in a real-time consumption manner based on the primary key. For example, cluster data can be acquired from the distributed publish-subscribe messaging system in a real-time consumption manner; primary keys of a plurality of service data contained in the cluster data are extracted; and the plurality of service data are imported into the columnar database management system based on the primary keys.
The details of "importing the cluster data from the distributed publish-subscribe message system to a columnar database management system in a real-time consumption manner based on primary key" will be described in the embodiment corresponding to fig. 3.
The columnar database management system may be ClickHouse, in which data is always stored by columns and is also processed as vectors (vectors or column blocks). Wherever possible, operations are dispatched on vectors rather than on individual values; this is referred to as vectorized query execution, and it reduces the actual data processing overhead.
In one embodiment, data may be imported from the distributed publish-subscribe messaging system into the columnar database management system through Gohangout, a data transport pipeline tool written in Go. It supports consuming data from Kafka and writing it to ES, ClickHouse and other targets; users can also filter data with Filter, parse different types of data, and then convert data types with Convert.
It should be noted that, because ClickHouse does not support transactional operations, it cannot be used as a traditional (OLTP) database, and it is not suited to high-request-rate key-value access, blob or document storage, or over-normalized data.
More specifically, data merging can be performed by using ReplacingMergeTree in ClickHouse in combination with argMax. ReplacingMergeTree is a table engine that ClickHouse provides for data merging; it merges data according to a configured primary key and merge condition. However, to merge data globally across the cluster, a global deduplication operation using argMax on the cluster table is also required. In the present disclosure, ClickHouse cluster table data is globally deduplicated by ReplacingMergeTree in combination with argMax.
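The argMax-style deduplication described above can be pictured as "for each primary key, take every column's value from the row with the largest version". The following sketch reproduces that semantics in plain Python; the column names `id`, `status` and `update_time` are hypothetical, and the tie-breaking rule (first row wins) is a choice made for this sketch only.

```python
from itertools import groupby
from operator import itemgetter


def arg_max(values, versions):
    """Return the value paired with the largest version (first one on ties)."""
    best = max(range(len(versions)), key=versions.__getitem__)
    return values[best]


def latest_per_key(rows, key="id", version="update_time"):
    """Group rows by primary key and keep, per column, the newest value."""
    ordered = sorted(rows, key=itemgetter(key))
    result = []
    for k, group in groupby(ordered, key=itemgetter(key)):
        group = list(group)
        versions = [r[version] for r in group]
        merged = {key: k}
        for column in group[0]:
            if column != key:
                merged[column] = arg_max([r[column] for r in group], versions)
        result.append(merged)
    return result


rows = [
    {"id": 1, "status": "applied", "update_time": 10},
    {"id": 2, "status": "applied", "update_time": 11},
    {"id": 1, "status": "approved", "update_time": 15},
]
deduplicated = latest_per_key(rows)
```

Because this step is evaluated over whatever rows remain at query time, it yields one row per key even when earlier merges have not yet removed every duplicate.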
In S210, the service data is merged by the columnar database management system. For example, the cluster data may be acquired by a materialized view of the columnar database management system, and the service data in the cluster data may be merged based on the plurality of local tables.
The details of "merging the business data by the columnar database management system" will be described in the embodiment corresponding to fig. 4.
In one embodiment, the method further comprises: extracting service data from the merged cluster data in real time; and generating a service index and/or a service label in real time based on the service data. Service data can be acquired through the ods_view, prd_apv_ap_application_real_mv_view command and then analyzed further.
The real-time service data merging method of the present disclosure aims to solve the problem that, as the data volume of individual tables in MySQL grows, more and more data need to be merged, and the conventional scheme of merging data through a common view executes slowly and consumes many resources, which makes the ClickHouse cluster unstable. By deliberately designing the data flow at the data synchronization end, the ReplacingMergeTree engine is put to better use: cluster-wide deduplication becomes, in effect, local deduplication followed by cluster merging. By using the materialized view function of ClickHouse, data deduplication changes from batch processing to stream processing, which smooths load peaks and troughs and greatly reduces the resource consumption caused by many tasks executing the common view to deduplicate data at the same time. While preserving the accuracy, timeliness and ordering of the original real-time data backtracking, the data backtracking efficiency is improved, the data production time is shortened, and the resource consumption is reduced.
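The shift from batch to stream deduplication can be sketched as follows. This is a loose analogy rather than a description of how any database implements materialized views: the class `StreamingDeduplicator`, the row shape `(key, version, payload)` and the sample blocks are all hypothetical. The point is only that work done on each inserted block spreads the cost over time, while the batch variant pays for all rows in one pass.

```python
class StreamingDeduplicator:
    """Keeps the newest row per key, updated incrementally on every insert."""

    def __init__(self):
        self.latest = {}
        self.rows_seen = 0

    def on_insert(self, block):
        # Called once per inserted block, so each call handles few rows.
        for key, version, payload in block:
            self.rows_seen += 1
            current = self.latest.get(key)
            if current is None or version > current[0]:
                self.latest[key] = (version, payload)


def batch_deduplicate(all_rows):
    """Deduplicate everything in a single pass over the full history."""
    latest = {}
    for key, version, payload in all_rows:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, payload)
    return latest


blocks = [
    [("u1", 1, "registered"), ("u2", 1, "registered")],
    [("u1", 2, "applied")],
    [("u2", 3, "applied"), ("u1", 4, "approved")],
]

stream = StreamingDeduplicator()
for block in blocks:
    stream.on_insert(block)

batch = batch_deduplicate([row for block in blocks for row in block])
```

Both paths end in the same state; the streaming path simply never has to scan the whole history at once.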
According to the real-time service data merging method of the present disclosure, a binary log file is acquired from a service database in real time, the service data being distinguished by a primary key; the binary log file is parsed to generate service data; the service data are written into a distributed publish-subscribe messaging system in real time to generate cluster data; the cluster data are imported from the distributed publish-subscribe messaging system into a columnar database management system in a real-time consumption manner based on the primary key; and the service data are merged through the columnar database management system. In this way, data backtracking efficiency can be improved, data production time shortened and resource consumption reduced while preserving the accuracy, timeliness and ordering of the original real-time data backtracking.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Fig. 3 is a flowchart illustrating a real-time service data merging method according to another exemplary embodiment. The process 30 shown in Fig. 3 is a detailed description of S208 of the process shown in Fig. 2, "importing the cluster data from the distributed publish-subscribe message system to the columnar database management system in a real-time consumption manner based on the primary key".
As shown in Fig. 3, in S302, a plurality of local tables are created in the columnar database management system based on the number of primary keys. The number of local tables is determined according to the number of primary keys extracted from the service data, and the number of local tables is greater than or equal to the number of primary keys.
In S304, the cluster data are obtained from the distributed publish-subscribe message system in a real-time consumption manner.
In S306, the primary keys of the plurality of service data included in the cluster data are extracted.
In S308, hash values of the primary keys are calculated, respectively.
In S310, the plurality of business data are stored in a plurality of local tables of the columnar database management system, respectively, based on the hash values. Data with the same primary key are placed into the same local table by means of a hash algorithm.
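As a non-limiting illustration, steps S306 to S310 can be sketched as follows. The helper names `local_table_for` and `route`, the `id` field used as the primary key, and the choice of CRC32 as the hash function are assumptions made for this sketch only; the disclosure does not specify a particular hash algorithm.

```python
import zlib


def local_table_for(primary_key: str, num_tables: int) -> int:
    """Map a primary key to the index of a local table (hypothetical helper)."""
    # CRC32 is an assumed choice: it is stable across processes, whereas
    # Python's built-in hash() is randomized per run for strings.
    return zlib.crc32(primary_key.encode("utf-8")) % num_tables


def route(rows, num_tables):
    """Distribute rows over local tables; each row is a dict with an 'id' key."""
    tables = [[] for _ in range(num_tables)]
    for row in rows:
        tables[local_table_for(str(row["id"]), num_tables)].append(row)
    return tables


# Change records in binlog order; 'version' is an assumed ordering column.
rows = [
    {"id": 1, "status": "created", "version": 1},
    {"id": 2, "status": "created", "version": 1},
    {"id": 1, "status": "paid", "version": 2},
]
tables = route(rows, num_tables=4)
```

Because the table index depends only on the primary key, both records for `id` 1 land in the same local table, in their original order.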
To preserve the ordering of data under the same primary key of the same table from MySQL to ClickHouse, Canal can be used to parse the MySQL binlog and send it to the corresponding Kafka topic, so that data for a given primary key remain ordered within their partition. However, when gohangout consumes the Kafka data into the ClickHouse local tables, this ordering is broken, because gohangout writes data by polling (round-robin). The present disclosure therefore provides a new data flow from Kafka to ClickHouse: writing to ClickHouse by polling is replaced with writing by a hash algorithm, which ensures that data with the same primary key enter the same ClickHouse local table, so that as much of the merging as possible completes locally and efficiency improves. In the prior art, by contrast, data with the same primary key are scattered across different local tables when flowing from Kafka to ClickHouse, and in the final merging step all data must be transmitted over the cluster network to a single node for global merging.
Fig. 4 is a flowchart illustrating a real-time service data merging method according to another exemplary embodiment. The process 40 shown in Fig. 4 is a detailed description of S210 of the process shown in Fig. 2, "merging the business data by the columnar database management system".
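The difference between the two write strategies can be illustrated with a small simulation. This is a sketch under stated assumptions: the function names, the `id` and `version` fields, and the CRC32 hash are hypothetical, and `merge_locally` merely stands in for a merge that each local table performs on its own data.

```python
from itertools import cycle
import zlib


def write_round_robin(rows, num_tables):
    """Polling write: each row goes to the next table in turn."""
    tables = [[] for _ in range(num_tables)]
    targets = cycle(range(num_tables))
    for row in rows:
        tables[next(targets)].append(row)
    return tables


def write_by_hash(rows, num_tables):
    """Hash write: the table is chosen from the primary key alone."""
    tables = [[] for _ in range(num_tables)]
    for row in rows:
        idx = zlib.crc32(str(row["id"]).encode("utf-8")) % num_tables
        tables[idx].append(row)
    return tables


def merge_locally(tables):
    """Keep the highest version per key within each table, then concatenate."""
    merged = []
    for table in tables:
        latest = {}
        for row in table:
            cur = latest.get(row["id"])
            if cur is None or row["version"] > cur["version"]:
                latest[row["id"]] = row
        merged.extend(latest.values())
    return merged


# Three successive updates to each of two records, in binlog order.
rows = [{"id": k, "version": v} for v in (1, 2, 3) for k in ("a", "b")]
after_rr = merge_locally(write_round_robin(rows, 3))
after_hash = merge_locally(write_by_hash(rows, 3))
```

With polling, every table holds one row for each key, so local merging removes nothing and all six rows still require a global pass. With hash-based writing, local merging alone leaves exactly one row, the latest, per key.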
As shown in FIG. 4, in S402, a materialized view is created based on a plurality of local tables of the columnar database management system, the materialized view being used for cluster data merging.
A materialized view is a database object that contains the result of a query; it may be a local copy of remote data, or it may be used to generate a summary table by aggregating data tables. A materialized view that stores data based on a remote table may also be referred to as a snapshot. For replication, a materialized view allows read-only copies of remote data to be maintained locally. For a data warehouse, the materialized views created are typically aggregate views, single-table aggregate views, and join views. Implementing large, time-consuming table joins with materialized views may improve query efficiency.
In S404, a cluster distributed table is created based on the materialized view. After the materialized view is created, any data written to the local table are also written to the materialized view. Merging in the materialized view is not immediate but asynchronous, and is triggered when partition data are merged.
In S406, the common view is generated based on the cluster distributed table. Fig. 5 is a diagram of the ClickHouse internal data flow process. Within ClickHouse, the ordering of data under the same primary key of the same table from MySQL to ClickHouse can be ensured.
In S408, the materialized view of the columnar database management system obtains the cluster data for business data merging. The business data in the cluster data may be merged based on the plurality of local tables. Wherein the materialized view of the columnar database management system processes the cluster data in an asynchronous manner.
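The stream-style behaviour described in S402 to S408 can be modelled with a toy example. The classes `LocalTable` and `MaterializedView` below are hypothetical stand-ins written for this sketch, not ClickHouse APIs; they assume that the view receives each newly inserted block as it arrives and that merging happens later as a separate step.

```python
class LocalTable:
    """Toy stand-in for a local table that notifies attached views on insert."""

    def __init__(self):
        self.rows = []
        self.views = []

    def insert(self, block):
        self.rows.extend(block)
        for view in self.views:
            view.on_insert(block)  # only the newly inserted block is passed on


class MaterializedView:
    """Toy view: stores incoming blocks as parts and merges them later."""

    def __init__(self, source):
        self.parts = []
        source.views.append(self)

    def on_insert(self, block):
        self.parts.append(list(block))

    def merge(self):
        """Collapse all parts, keeping the highest version per primary key."""
        latest = {}
        for part in self.parts:
            for row in part:
                cur = latest.get(row["id"])
                if cur is None or row["version"] >= cur["version"]:
                    latest[row["id"]] = row
        self.parts = [sorted(latest.values(), key=lambda r: r["id"])]


table = LocalTable()
view = MaterializedView(table)
table.insert([{"id": 1, "version": 1}, {"id": 2, "version": 1}])
table.insert([{"id": 1, "version": 2}])
parts_before = len(view.parts)  # two unmerged parts, one per insert
view.merge()                    # deferred merge, run here by hand
```

Each insert costs only the work for its own block, which is the sense in which deduplication moves from batch processing to stream processing; in this sketch the deferred merge is invoked manually, whereas a real engine would schedule it in the background.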
In one embodiment, the business data may be deduplicated, for example based on the common view function, and the deduplicated business data may then be merged. A common view can also be built on top of the distributed table to perform argMax deduplication.
According to the method of the present disclosure, once data with the same primary key are guaranteed to be stored on the same node and in order, the data merging function of the ClickHouse ReplacingMergeTree engine is used to merge the data quickly and effectively and to deduplicate them globally, thereby restoring the MySQL data. Data analysis is then carried out on the basis that the data in ClickHouse remain consistent with the data in MySQL. At the same time, contention for cluster resources is reduced, load peaks are lowered, and cluster stability is improved.
In the method of the present disclosure, space is traded for time: the data merging computation implemented inside the ClickHouse engine is used, which can greatly reduce resource consumption and increase merging speed.
It is worth mentioning that the process in which gohangout consumes data and writes them to the corresponding local tables may also be implemented with Flink. Global data merging in ClickHouse could likewise be implemented with Flink, but the cost is high, and data verification and data quality monitoring are difficult.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, the program performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
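The effect of argMax-style deduplication in the common view can be sketched in a few lines. The function `arg_max_dedup` and the column names `id` and `version` are assumptions for illustration; the sketch reproduces the idea of selecting, for each primary key, the row whose version value is greatest, and is not the query used in the embodiment.

```python
def arg_max_dedup(rows, key="id", version="version"):
    """Return one row per key: the row whose version is greatest."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[version] > best[k][version]:
            best[k] = row
    return list(best.values())


rows = [
    {"id": 1, "status": "created", "version": 1},
    {"id": 2, "status": "created", "version": 1},
    {"id": 1, "status": "shipped", "version": 3},
    {"id": 1, "status": "paid", "version": 2},
]
deduped = arg_max_dedup(rows)
```

Because the selection is made at read time, a query through such a view can return one row per primary key even if the background merge has not yet removed older versions.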
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 6 is a block diagram illustrating a real-time service data merging apparatus according to another exemplary embodiment. As shown in fig. 6, the real-time service data merging apparatus 60 includes: a log module 602, a parsing module 604, a writing module 606, an importing module 608, and a merging module 610.
The log module 602 is configured to obtain a binary log file from a service database in real time, where the service data are distinguished by a primary key; the log module 602 is further configured to obtain the binary log file from the service database in real time in an incremental log analysis manner.
The parsing module 604 is configured to parse the binary log file to generate service data;
the writing module 606 is configured to write the service data into the distributed publish-subscribe message system in real time to generate cluster data;
an import module 608 for importing the cluster data from the distributed publish-subscribe message system to a columnar database management system in a real-time consumption manner based on a primary key; the import module 608 includes: the acquisition unit is used for acquiring cluster data from the distributed publishing and subscribing message system in a real-time consumption mode; a primary key unit, configured to extract primary keys of a plurality of service data included in the cluster data; an importing unit for importing the plurality of business data into the columnar database management system based on the primary key. The import unit is further used for respectively calculating hash values of the primary keys; storing the plurality of business data in a plurality of local tables of a columnar database management system, respectively, based on the hash values. A local table unit to create a plurality of local tables in the columnar database management system based on the number of primary keys.
The merging module 610 is configured to merge the business data through the columnar database management system. The merging module 610 includes: the cluster unit is used for acquiring the cluster data by the materialized view of the column type database management system; the cluster unit is also used for processing the cluster data through a materialized view of the column type database management system in an asynchronous mode. And the merging unit is used for merging the service data in the cluster data based on the plurality of local tables. The merging unit is further configured to perform a deduplication operation on the service data based on a common view function; and merging the service data after the deduplication operation. A creating unit for creating a materialized view based on a plurality of local tables of the columnar database management system, the materialized view being used for cluster data merging; creating a cluster distributed table based on the materialized view; generating the generic view based on the clustered distributed table.
The real-time service data merging device 60 may further include: the service module is used for extracting service data from the combined cluster data in real time; and generating a service index and/or a service label in real time based on the service data.
Fig. 7 is a block diagram illustrating a real-time service data merging apparatus according to an exemplary embodiment. As shown in fig. 7, the real-time service data merging apparatus 70 includes: a business database 702, an incremental log parsing server 704, a distributed publish-subscribe message system 706, and a columnar database management system 708.
The service database 702 is used for storing real-time service data, the service data is a binary log file, and the service data is distinguished through a main key;
the incremental log analysis server 704 is used for analyzing the binary log file to generate service data;
the distributed publish-subscribe message system 706 is configured to write the service data into the cluster in real time to generate cluster data;
columnar database management system 708 is configured to obtain the cluster data based on the primary key and merge the business data.
According to the real-time service data merging device of the present disclosure, a binary log file is obtained from a service database in real time, the service data being distinguished by a primary key; the binary log file is parsed to generate service data; the service data are written into a distributed publish-subscribe message system in real time to generate cluster data; the cluster data are imported from the distributed publish-subscribe message system into a columnar database management system in a real-time consumption manner based on the primary key; and the service data are merged by the columnar database management system. In this way, while the accuracy, timeliness, and ordering of the original real-time data backtracking are preserved, data backtracking efficiency can be improved, data production time can be shortened, and resource consumption can be reduced.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 8, electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one memory unit 820, a bus 830 connecting the various system components (including the memory unit 820 and the processing unit 810), a display unit 840, and the like.
Wherein the storage unit stores program code that can be executed by the processing unit 810, such that the processing unit 810 performs the steps according to various exemplary embodiments of the present disclosure in this specification. For example, the processing unit 810 may perform the steps as shown in fig. 2, 3, 4.
The memory unit 820 may include readable media in the form of volatile memory units such as a random access memory unit (RAM)8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.
The memory unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 800' (e.g., keyboard, pointing device, bluetooth device, etc.) such that a user can communicate with devices with which the electronic device 800 interacts, and/or any devices (e.g., router, modem, etc.) with which the electronic device 800 can communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. The network adapter 860 may communicate with other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 9, the technical solution according to the embodiment of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present disclosure.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions of: acquiring a binary log file from a service database in real time, wherein the service data are distinguished by a primary key; parsing the binary log file to generate service data; writing the service data into a distributed publish-subscribe message system in real time to generate cluster data; importing the cluster data from the distributed publish-subscribe message system to a columnar database management system in a real-time consumption manner based on a primary key; and merging the business data through the columnar database management system.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly and located in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements, instrumentalities, or implementations described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A real-time service data merging method is characterized by comprising the following steps:
acquiring a binary log file from a service database in real time, wherein the service data are distinguished by a primary key;
parsing the binary log file to generate service data;
writing the service data into a distributed publishing and subscribing message system in real time to generate cluster data;
importing the cluster data from the distributed publish-subscribe message system to a columnar database management system in a real-time consumption manner based on a primary key;
and merging the business data through the column type database management system.
2. The method of claim 1, further comprising:
extracting service data from the combined cluster data in real time;
and generating a service index and/or a service label in real time based on the service data.
3. The method of any of claims 1-2, wherein retrieving the binary log file from the service database in real-time comprises:
and acquiring the binary log file from the service database in real time in an incremental log analysis mode.
4. The method of any of claims 1-3, wherein importing the cluster data from the distributed publish-subscribe message system into a columnar database management system in a real-time consumption manner based on a primary key comprises:
acquiring cluster data from the distributed publishing and subscribing message system in a real-time consumption manner;
extracting primary keys of a plurality of service data contained in the cluster data;
importing the plurality of business data into the columnar database management system based on the primary key.
5. The method of any of claims 1-4, wherein prior to importing the plurality of business data into the columnar database management system based on the primary key, comprising:
creating a plurality of local tables in the columnar database management system based on the number of primary keys.
6. The method of any of claims 1-5, wherein importing the plurality of business data into the columnar database management system based on the primary key comprises:
respectively calculating hash values of the primary keys;
storing the plurality of business data in a plurality of local tables of a columnar database management system, respectively, based on the hash values.
7. The method of any of claims 1-6, wherein merging the business data by the columnar database management system comprises:
the materialized view of the column-type database management system acquires the cluster data;
merging the business data in the cluster data based on the plurality of local tables.
8. A real-time service data merging apparatus, comprising:
the log module is used for acquiring a binary log file from a service database in real time, the service data being distinguished by a primary key;
the parsing module is used for parsing the binary log file to generate service data;
the writing module is used for writing the service data into the distributed publishing and subscribing message system in real time to generate cluster data;
the import module is used for importing the cluster data into a columnar database management system from the distributed publishing and subscribing message system in a real-time consumption mode based on a primary key;
and the merging module is used for merging the service data through the column type database management system.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110157033.7A 2021-02-04 2021-02-04 Real-time service data merging method and device and electronic equipment Pending CN112988741A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110157033.7A CN112988741A (en) 2021-02-04 2021-02-04 Real-time service data merging method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112988741A true CN112988741A (en) 2021-06-18

Family

ID=76347199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110157033.7A Pending CN112988741A (en) 2021-02-04 2021-02-04 Real-time service data merging method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112988741A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169083A (en) * 2017-05-11 2017-09-15 聚龙融创科技有限公司 Public security bayonet socket magnanimity vehicle data storage and retrieval method and device, electronic equipment
US20170277616A1 (en) * 2016-03-25 2017-09-28 Linkedin Corporation Replay-suitable trace recording by service container
CN108965355A (en) * 2017-05-18 2018-12-07 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for data transmission
CN109992469A (en) * 2017-12-29 2019-07-09 北京奇虎科技有限公司 A kind of method and device merging log
US20200204557A1 (en) * 2018-12-19 2020-06-25 International Business Machines Corporation Decentralized database identity management system
CN112163048A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for realizing OLAP analysis based on ClickHouse

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FANGJIN YANG等: "Druid: a real-time analytical data store", SIGMOD \'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA,JUNE 2014,PAGES 157–168, 30 June 2014 (2014-06-30), pages 157, XP055389304, DOI: 10.1145/2588555.2595631 *
张波: "基于大数据技术的公安移动通信数据处理平台设计与实现", 中国优秀硕士学位论文全文数据库 (信息科技辑), 15 January 2017 (2017-01-15), pages 138 - 228 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113596117A (en) * 2021-07-14 2021-11-02 北京淇瑀信息科技有限公司 Real-time data processing method, system, device and medium
CN113596117B (en) * 2021-07-14 2023-09-08 北京淇瑀信息科技有限公司 Real-time data processing method, system, equipment and medium
CN114841570A (en) * 2022-05-07 2022-08-02 金腾科技信息(深圳)有限公司 Data processing method, device, equipment and medium for customer relationship management system
CN115934846A (en) * 2023-02-06 2023-04-07 北京仁科互动网络技术有限公司 Data synchronization method of columnar storage database clickhouse
CN117149914A (en) * 2023-10-27 2023-12-01 成都优卡数信信息科技有限公司 Storage method based on ClickHouse
CN117149914B (en) * 2023-10-27 2024-01-26 成都优卡数信信息科技有限公司 Storage method based on ClickHouse

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination