CN114153809A - Parallel real-time incremental statistic method based on database logs - Google Patents

Parallel real-time incremental statistic method based on database logs Download PDF

Info

Publication number
CN114153809A
CN114153809A CN202111220905.6A CN202111220905A CN114153809A CN 114153809 A CN114153809 A CN 114153809A CN 202111220905 A CN202111220905 A CN 202111220905A CN 114153809 A CN114153809 A CN 114153809A
Authority
CN
China
Prior art keywords
data
clickhouse
node
event
engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111220905.6A
Other languages
Chinese (zh)
Inventor
马杲灵
游飞龙
杨照能
石明恩
汪睿铭
杜华飞
宋赣源
石尧
孙青松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Shulian Mingpin Technology Co ltd
Original Assignee
Guizhou Shulian Mingpin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Shulian Mingpin Technology Co ltd filed Critical Guizhou Shulian Mingpin Technology Co ltd
Priority to CN202111220905.6A priority Critical patent/CN114153809A/en
Publication of CN114153809A publication Critical patent/CN114153809A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for carrying out parallel real-time increment statistics based on business system data, which comprises a single machine mode and a distributed mode, wherein the method comprises the steps of monitoring an operation log of a business database, synchronizing the real-time increment of the business data to a ClickHouse node, carrying out the real-time statistics of index data by virtue of an increment statistics function of a ClickHouse node data synchronization table, transferring the index data calculation pressure of the business database to the ClickHouse node, improving the index data calculation performance, realizing the high-concurrency sub-second-level query of the real-time index data and solving the index data statistics of frequently updated mass data.

Description

Parallel real-time incremental statistic method based on database logs
Technical Field
The invention relates to the technical field of computer science big data, in particular to a method for carrying out parallel real-time incremental statistics based on business system data.
Background
The service system often has some statistical requirements of real-time index data, and after the service data changes, a user can inquire the changed statistical index data in real time. Business systems typically use sophisticated relational databases (e.g., MySQL, SQL Server, Oracle, PostgreSQL, etc.) as business data stores. The following methods are commonly used for real-time statistics:
1) and performing real-time query from the business database by using the statistic SQL-like statement.
2) And storing the statistical index data calculated in real time into a memory cache by using the memory cache, and clearing the corresponding cache index data after the data influencing the index statistics is changed, so that index recalculation is triggered and the cache is updated when the statistical index data is accessed next time.
3) And the service database performs index data statistics and query from the standby node by using a read-write separation technology.
4) The business database uses the fragment storage technology to horizontally split the table with large data volume, so that a single statistic SQL-like statement can be distributed to the database fragment nodes to be executed in parallel.
5) And the server side carries out index statistical calculation at regular time, and stores and updates statistical index data after the calculation is finished.
6) And a trigger is established in the database, and when the service data is changed, the corresponding index data is updated through the trigger.
For the method 1), because the whole table data scanning is mostly needed when the statistical SQL-like statements are executed, when the data volume of the service table is large or the index data query concurrency is increased, the pressure of the service database is seriously increased and the normal correspondence cannot be achieved.
With regard to the method 2), after the memory cache is introduced, the execution frequency of the statistical SQL-like statements can be reduced, but if the data changes frequently, the cache is also frequently cleared to perform index recalculation, so that the optimization effect cannot be achieved, and meanwhile, the fast response of the statistical operation based on the service table with large data volume during the index recalculation cannot be ensured. And the method needs to artificially analyze which service data changes can affect which statistical indexes, and carries out corresponding coding operation to clear the corresponding index cache, thereby increasing the complexity of development and implementation.
For the method 3), a read-write separation technology is used, index data statistics is executed in the standby database, and although index data calculation does not affect the read-write performance of the main service database, it is also not guaranteed that the statistical operation based on the table with large data volume can be quickly responded.
For method 4), the fragmentation storage technology can distribute the statistical computation pressure to each fragmentation database node, but index computation still requires whole table scanning, which cannot reduce the computation overhead of statistics, resulting in the need to increase a large amount of hardware resources to meet the high concurrent real-time statistical requirements.
With regard to the method 5), the execution frequency of the statistics SQL-like statements is reduced to relieve the database statistics pressure, but the timely update of the statistics index data cannot be guaranteed.
For the method 6), the method is used for carrying out incremental updating on the statistical index data, so that the pressure of a database is greatly reduced, the statistical performance is improved, but the method also needs to artificially analyze which service data changes can affect which statistical indexes, and carries out corresponding coding processing; for some indexes which cannot be subjected to incremental calculation according to the current statistical indexes by combining incremental data, intermediate state data of the index statistical data need to be stored and updated in the incremental calculation process; for the condition that the index statistical service logic changes or BUG occurs in the incremental calculation process, the index recalculation needs to be realized by more stock data; for the service database cluster stored in the fragmentation mode, the problem of merging of index calculation results also needs to be considered, and the implementation complexity of the method is greatly increased under the above conditions.
Disclosure of Invention
The invention aims to perform real-time data synchronization and index data real-time calculation based on a service database log, realize high-concurrency sub-second-level query of real-time index data, solve the problems of index data statistics of frequently updated mass data and parallel real-time incremental statistics, and provide a method for performing parallel real-time incremental statistics based on service system data, wherein the method comprises a single-machine mode and a distributed mode.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
as an implementable mode, the method for parallel real-time incremental statistics based on database logs in a stand-alone mode comprises the following steps:
step S1: starting a data event consumer, creating a data synchronization table corresponding to the ClickHouse node, and synchronizing the full data of the service database table into the created data synchronization table;
step S2: starting a data event producer, reading service database logs according to the sequence of service database log positions recorded by redis, and pushing the service database logs to a message queue according to the sequence of reading the service database logs successively; the service database log comprises a data newly-added event and a table structure change event;
step S3: after the event data consumer completes the full data synchronization, the incremental synchronization operation is automatically started, and the data newly-added event and/or the table structure change event in the message queue are consumed, so that the data newly-added event and/or the table structure change event are synchronized into a data synchronization table of the ClickHouse node;
step S4: according to the statistical requirements of the index data, creating a materialized view of an AggregatingMergeTree engine based on a data synchronization table corresponding to the ClickHouse node, and designating SQL sentences needing to perform statistics in the materialized view creating sentences;
step S5: and (3) carrying out result query of statistical index data on the materialized view of the AggregatingMergeTree engine by using SQL statements.
In the scheme, the operation log of the service database is monitored, the service data are incrementally synchronized to the ClickHouse node in real time, the service data are not invaded, the index data are counted in real time by means of the incremental counting function of the ClickHouse node data synchronization table, the index data calculation pressure of the service database is transferred to the ClickHouse node, the index data calculation performance is improved, and the high-concurrence sub-second-level query of the real-time index data is realized.
Further, the step S1 specifically includes the following steps:
step S11: starting a data event consumer, and checking whether a data synchronization table corresponding to the clickwouse node is established or not after starting;
step S12: if the data synchronization table corresponding to the ClickHouse node is not created, a Collapsing MergeTree engine table is created to serve as the data synchronization table corresponding to the ClickHouse node; the state field of the CollapsingMergeTree engine table is sign, when the sign is 1, the data is represented to be valid, and when the sign is-1, the data is represented to be invalid;
step S13: after a data synchronization table corresponding to the ClickHouse node is established, synchronizing the full data of the service database table into a Collapsing MergeTree engine table; during full data synchronization, a read-only lock is added to the synchronized business database tables.
In the above scheme, the collapsingmergee engine table may be replaced with a versioned collapsingmergee engine table, and the synchronization data is held by the collapsingmergee engine table or the versioned collapsingmergee engine table.
Further, the step S2 specifically includes the following steps:
step S21: starting a data event producer, reading the service database logs according to the sequence of the service database log positions recorded by the redis, and starting reading from the tail end of the service database logs if the redis does not record the service database logs;
step S22: if a data newly added log is read, converting the data newly added log into a data newly added event, and setting a state field sign to be 1; if the data modification log is read, converting the data modification log into two data newly-added events, wherein the first event is data before modification, and setting a state field sign to be-1, and the second event is data after modification, and setting the state field sign to be 1; if the data deletion log is read, converting the data deletion log into a data newly-added event, and setting a state field sign to be-1;
step S23: if the DDL operation log is read, converting the DDL operation log into a table structure change event and simultaneously converting the DDL operation log into a DDL statement corresponding to the ClickHouse node;
step S24: and pushing the converted data newly-added event and the table structure change event to a message queue according to the sequence of reading the service database logs.
In the scheme, the original data adding, modifying and deleting operations are converted into data adding events through the setting of the state field sign, so that the random disk read-write operation is converted into the sequential disk write operation, and the sub-second-level data real-time increment synchronization is realized.
Furthermore, if the redis has not recorded the service database log, after the service database log is successfully pushed to the message queue, the sequence of reading the service database log is recorded in the redis.
Further, the specific step of step S3 includes: for the data adding event in the message queue, executing DML statement insertion data according to the ClickHouse node grammar; and for the table structure change event, executing the converted DDL statement, and synchronizing the data newly-added event and/or the table structure change event into a data synchronization table of the ClickHouse node.
Further, the step S4 specifically includes the following steps:
step S41: according to the statistical requirements, creating a materialized view of an AggregatingMergeTree engine based on the collaksHouse node corresponding to the synchronous CollapsingMergeTree engine table;
step S42: in a materialized view establishing statement of an AggregatingMergeTree engine, an SQL statement needing to be counted is specified, and partial index conversion of the statistic SQL statement is carried out, so that data with invalid state fields are prevented from being counted.
In the scheme, by means of the increment counting function of the AggregatingMergeTree engine materialized view of the ClickHouse node, a user can carry out full and increment index counting only by specifying the required statistical SQL statement in the materialized view table building statement and carrying out some statistical index conversion based on the state field sign, so that the workload and the complexity of index development are reduced, and the requirement of rapid recalculation of index data can be met.
The statistical index conversion means that, for example, count (×) is originally required to perform row number statistics, sum (sign) may be replaced; for example, sum (income _ sign) may be replaced by sum (income) if sum (income) is originally needed to be used for carrying out summary statistics on income fields; if the average of the income field needs to be averaged by avg (income), sum (income _ sign)/sum (sign) may be substituted.
As another practicable mode, the method for parallel real-time incremental statistics based on the database logs of distributed deployment comprises the following steps:
step S1: deploying a plurality of data event consumers and a plurality of ClickHouse nodes to form a ClickHouse fragment cluster, wherein the number of the ClickHouse nodes is consistent with that of the business database fragment nodes, and one data event consumer is responsible for synchronizing the data of one corresponding business database fragment node into one ClickHouse node;
step S2: deploying a plurality of data event producers, wherein one data event producer is responsible for reading a database log of a service database fragmentation node;
when a data event producer is started, performing step S2 of the method of claim 1;
step S3: each data event consumer is responsible for automatic increment synchronous operation of the data event of the corresponding business database fragment node, so that the data event of the business database fragment node is synchronized to the corresponding ClickHouse node;
step S4: creating a local materialized view by using a data synchronization table of a service database synchronized by each node in the ClickHouse fragmentation cluster;
step S5: and creating a distributed table at each ClickHouse node, associating the distributed table with a local materialized view, and performing result query of statistical index data on the distributed table of any node in the ClickHouse fragment cluster by using SQL statements.
In the scheme, for the service data fragment cluster, the parallel data synchronization and the parallel index data statistics are supported by the distributed deployment of the ClickHouse fragment cluster through the distributed deployment of the real-time data synchronization and the distributed deployment of the ClickHouse fragment cluster, and the problem of performing the parallel real-time incremental statistics on the frequently updated mass data is solved.
Further, the specific steps of step S1 are:
step S11: starting each data event consumer, and checking whether a data synchronization table of a corresponding ClickHouse node is established or not after starting;
step S12: if the data synchronization table corresponding to the ClickHouse node is not created, a Collapsing MergeTree engine table is created to be used as the data synchronization table corresponding to the ClickHouse node, or a VersionedCollapsing MergeTree engine table is created to be used as the data synchronization table corresponding to the ClickHouse node;
the state field of the CollapsingMergeTree engine table is sign, when the sign is 1, the data is represented to be valid, and when the sign is-1, the data is represented to be invalid;
the business database log update time stamp of the business database fragment node corresponding to the ClickHouse node is used as the version field of the VersionedCollapsingMergeTree engine table, so that the VersionedCollapsingMergeTree engine table can be merged according to the version field added by the main key when the business database logs are merged, and the sequence influence of the business database logs is avoided;
step S13: and after a data synchronization table corresponding to the ClickHouse node is created, synchronizing the full data of the service database table into a Collapsing MergeTree engine table or a VersionedCollapsing MergeTree engine table.
In the scheme, the sequence problem of automatic data merging is solved through a VersionedCollapsing MergeTree engine table field.
Further, the step S1 specifically includes: when a plurality of ClickHouse nodes are deployed, two or more ClickHouse sharded copy sets are created for one ClickHouse node; replacing the CollapsisMergeTree engine table in the ClickHouse slice copy set with a ReplicatedCollapsisMergeTree engine table, or replacing the VersionedCollapsisMergeTree engine table in the ClickHouse slice copy set with a ReplicatedVersionedCollapsisMergeTree engine table.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention synchronizes the service data to the ClickHouse node by monitoring the operation log of the service database, uses the Collapsing MergeTree engine table or VersionedCollapsing MergeTree engine table of the ClickHouse node to store the synchronous data, and changes the original data adding, modifying and deleting operations into the data adding event by setting the status field 'sign', thereby converting the random disk read-write operation into the sequential disk write operation and realizing the real-time increment synchronization of the data at the sub-second level.
(2) On the basis, the method transfers the calculation pressure of the statistical index data of the service database to the ClickHouse node by means of the incremental statistical function of the AggregatingMergeTree engine materialized view of the ClickHouse node, reduces the development and implementation complexity, improves the index data calculation performance, and accordingly realizes the index high-concurrency sub-second-level query.
(3) The invention also uses the horizontal sub-table technology to perform distributed storage of data, and supports parallel data synchronization and parallel index data statistics by distributed deployment of the ClickHouse fragmentation clusters, thereby solving the problem of performing parallel real-time incremental statistics on the mass data which is frequently updated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic data flow diagram according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Also, in the description of the present invention, the terms "first", "second", and the like are used for distinguishing between descriptions and not necessarily for describing a relative importance or implying any actual relationship or order between such entities or operations.
Example 1:
the invention is realized by the following technical scheme, as shown in figure 1, the method for parallel real-time incremental statistics of database logs under a single machine mode comprises the following steps:
step S1: and starting a data event consumer, creating a data synchronization table corresponding to the ClickHouse node, and synchronizing the full data of the service database table into the created data synchronization table.
And starting a data event consumer, checking whether a data synchronization table corresponding to the clickwouse node is created or not after starting, and if so, determining that the data synchronization table corresponding to the clickwouse node is a Collapsing MergeTree engine table. If not, a CollapsingMergeTree engine table is created. The status field of the CollapsingMergeTree engine table is sign, when the sign is 1, the data is valid, and when the sign is-1, the data is invalid.
After the Collapsing MergeTree engine table is created, the full data of the business database table is synchronized into the Collapsing MergeTree engine table, and a read-only lock is added to the synchronized business database table during the full data synchronization.
Step S2: starting a data event producer, reading service database logs according to the sequence of service database log positions recorded by redis, and pushing the service database logs to a message queue according to the sequence of reading the service database logs successively; the service database log comprises a data addition event and a table structure change event.
And starting a data event producer, reading the service database logs according to the sequence of the service database log positions recorded by the redis, and starting reading from the tail end of the service database logs if the redis does not record the service database logs.
If a data newly added log is read, converting the data newly added log into a data newly added event, and setting a state field sign to be 1; if the data modification log is read, converting the data modification log into two data newly-added events, wherein the first event is data before modification, and setting a state field sign to be-1, and the second event is data after modification, and setting the state field sign to be 1; and if the data deletion log is read, converting the data deletion log into a data newly-added event, and setting the state field sign to-1. And if the DDL operation log is read, converting the DDL operation log into a table structure change event and simultaneously converting the DDL operation log into a DDL statement corresponding to the ClickHouse node.
Therefore, the service database log is converted into two events, namely a data adding event or a table structure changing event. And pushing the converted data newly-added event and the table structure change event to a message queue according to the sequence of reading the service database logs.
And for the condition that the business database logs are not recorded in the redis, after the business database logs are successfully pushed to the message queue, recording the sequence of reading the business database logs in the redis as the business database log positions recorded in the redis.
Step S3: after the event data consumer completes the full data synchronization, the increment synchronization operation is automatically started, and the data in the consumption message queue is added with events and/or table structure change events.
For the data adding event in the message queue, executing DML statement insertion data according to the ClickHouse node grammar; and for the table structure change event, executing the converted DDL statement, and synchronizing the data newly-added event and/or the table structure change event into a data synchronization table of the ClickHouse node to complete the increment synchronization operation.
Step S4: according to the statistical requirement of the index data, a materialized view of the AggregatingMergeTree engine is created based on a data synchronization table corresponding to the ClickHouse node, and SQL sentences needing to be counted are specified in the materialized view creating sentences.
And according to the statistical requirement, creating a materialized view of the AggregatingMergeTree engine based on the collaskHouse node corresponding to the synchronized CollapsingMergeTree engine table. In a materialized view creation statement of an aggregatingMergeTree engine, an SQL statement needing to be subjected to statistics is specified, and partial statistics SQL statement index conversion is carried out, for example, sum (sign) is used for replacing count (x), so that data with invalid state fields are prevented from being counted.
Step S5: and (3) carrying out result query of statistical index data on the materialized view of the AggregatingMergeTree engine by using SQL statements.
Because the AggregtingMergeTree engine can perform background automatic incremental aggregation and storage according to the specified statistical SQL statement aggregation function when the materialized view is created, the materialized view of the AggregtingMergeTree engine can be subjected to statistical index data result query by using SQL statements, and therefore high-concurrency sub-second-level query of real-time index data is achieved.
Preferably, the backup service database log can be used for data synchronization, so that the influence of the operation of the lock table as the service database during the full-scale data synchronization can be reduced.
As another possible implementation, as shown in fig. 1, a method for parallel real-time incremental statistics of database logs based on distributed deployment is used to improve performance of data synchronization and statistical computation of a sharded cluster service database, and includes the following steps:
step S1: a plurality of data event consumers and a plurality of ClickHouse nodes are deployed to form a ClickHouse fragment cluster, the number of the ClickHouse nodes is consistent with that of the business database fragment nodes, and one data event consumer is responsible for synchronizing the data of one corresponding business database fragment node into one ClickHouse node.
After a certain data event consumer is started, the step S1 in the stand-alone mode is executed, and since the data event read-write sequence of the distributed message queue cannot be guaranteed, the versioned collectingmergetree engine table can be created to replace the collectingmergetree engine table, so that the distributed message queue is used to improve the fault tolerance and the message read-write performance of the system. And taking the service database log update time stamp of the service database fragment node corresponding to the ClickHouse node as a version field of the VersionedCollapsingMergeTree engine table, and ensuring that the VersionedCollapsingMergeTree engine table can be merged according to the version field added by the main key when the service database logs are merged without being influenced by the sequence of the service database logs.
And after a data synchronization table corresponding to the ClickHouse node is created, synchronizing the full data of the service database table fragment node into a Collapsing MergeTree engine table or a VersionedCollapsing MergeTree engine table.
Step S2: and deploying a plurality of data event producers, wherein one data event producer is responsible for reading the database log of one business database fragmentation node.
When a data event producer is started, step S2 of the method in the stand-alone mode is performed. When the distributed message queue is used, the order of the service database logs is not influenced.
Step S3: each data event consumer is responsible for automatic increment synchronization operation of the data events of the corresponding business database fragment nodes, so that the data events of the business database fragment nodes are synchronized to the corresponding ClickHouse nodes.
The incremental sync operation is the same as step S3 of the method in the stand-alone mode for each data event consumer consuming a data event in the distributed message queue.
Step S4: and creating a local materialized view by using a data synchronization table of a service database synchronized by each node in the ClickHouse fragmentation cluster.
When creating a local materialized view synchronized with each node, as in step S4 of the method in the aforementioned standalone mode, it can be understood that when creating a materialized view of an aggregatingmerge merge engine based on the collapsingmergee engine table, a corresponding materialized view of the aggregatingmerge engine may also be created based on the versioned collapsingmergee table.
Step S5: and creating a distributed table at each ClickHouse node, associating the distributed table with a local materialized view, and performing result query of statistical index data on the distributed table of any node in the ClickHouse fragment cluster by using SQL statements.
The local materialized view can be subjected to background automatic incremental aggregation and storage, the statistical data results of the local materialized view can be merged when the distributed table is queried, and the result query of the statistical index data can be performed on the distributed table of any node in the ClickHouse fragment cluster by using SQL statements, so that the real-time index high-concurrence sub-second-level query is realized.
As a preferred scheme, when a plurality of ClickHouse nodes are deployed, a ClickHouse shard copy set can be used, one ClickHouse shard node creates two or more ClickHouse shard copy sets, at this time, a collapsingmergere engine table needs to be replaced by a replied collapsingmergere engine table, or a Versioned collapsingmere engine table in the ClickHouse shard copy set needs to be replaced by a replied Versioned Collapsis Mergetre engine table, the copies of the same ClickHouse node shard node can automatically synchronize data by using the replied class engine table, and at this time, one ClickHouse shard node only needs to synchronize data to one copy. The partition data synchronous replication can be performed conveniently, and the condition that index data query is inconsistent may exist due to asynchronous execution of the partition data synchronous replication, so that a user can balance and select between partition fault tolerance and data consistency according to actual service requirements.
Example 2:
in this embodiment, a case description is made on the basis of the embodiment 1, as shown in embodiment 1, in this scheme, a service database log is used to perform real-time incremental synchronization of data to a clickwouse node, a redis is used to record a reading position of the service database log, and a message queue stores data events. The service database, the data real-time synchronization, the ClickHouse node, the redis and the message queue can all use distributed deployment to realize the parallel real-time incremental data synchronization and the parallel real-time incremental statistics.
Next, a practical application case is described, which uses MySQL as a business database, kafaka as a message queue, and three clickhause nodes as a sharded cluster for statistics.
Assuming that a employment information table is in a MySQL service database, the table name is employee _ info, the current employment information situation of all the personnel is stored, an area _ code field in the table represents an administrative division, a work _ status field represents the employment situation, when the work _ status =1, the employment is represented, and the field can be updated in real time according to the specific service requirement. The employment number of each administrative region needs to be counted in real time on the project, because the data magnitude of the employment information table is in the tens of millions, and the employment number of each administrative region needs to be frequently and concurrently inquired according to the business requirements, in order to reduce the counting pressure of the MySQL business database and display the counting result on a front-end interface in real time, the scheme is used for realizing parallel real-time incremental counting, and the method comprises the following specific steps:
step S1: the method is started after a data event consumer configures a MySQL service database, kafka, a ClickHouse cluster address and related parameters, the data event consumer creates a local table, application _ info _ local and a distributed table, application _ info _ all of a Collapsing MergeTree engine on each node in a ClickHouse cluster, and then the full quantity of application _ info employment information table data of the MySQL service database is synchronized into the distributed table, application _ info _ all of the ClickHouse node.
Step S2: the method is started after a data event producer configures addresses and related parameters of a MySQL service database, redis and kafka, and the data event producer reads binlog logs in the MySQL service database to push generated data events to the kafka.
Step S3: and the data event consumer switches to an increment synchronous operation mode after the data full-scale synchronization is completed, sequentially consumes the data events in the kafka, and synchronizes the data events to the distributed table application _ info _ all of the ClickHouse node.
Step S4: creating a local materialized view, namely, the amount of employment in each administrative area is counted on each local materialized view by creating a local materialized view, namely, the amount of employment in each administrative area is counted on each local materialized view.
Step S5: and creating a distributed table (apployment _ info _ view) on each node of the ClickHouse cluster, and associating the local materialized views on the corresponding nodes, so as to summarize the statistical result of each local materialized view.
Because the locally materialized view can realize automatic aggregation on the locally-expressed element _ info _ local incremental data and save the aggregation result, the query distributed table locally-expressed element _ info _ view can merge the locally-materialized view result data. Therefore, statistical query can be carried out on the distributed table employee _ info _ view, and parallel real-time incremental statistics of employment people of each administrative region is achieved.
The invention synchronizes the service data to the ClickHouse node by monitoring the operation log of the service database, uses the Collapsing MergeTree engine table or VersionedCollapsing MergeTree engine table of the ClickHouse node to store the synchronous data, and changes the original data adding, modifying and deleting operations into the data adding event by setting the status field 'sign', thereby converting the random disk read-write operation into the sequential disk write operation and realizing the real-time increment synchronization of the data at the sub-second level. On the basis, the method transfers the calculation pressure of the statistical index data of the service database to the ClickHouse node by means of the incremental statistical function of the AggregatingMergeTree engine materialized view of the ClickHouse node, reduces the development and implementation complexity, improves the index data calculation performance, and accordingly realizes the index high-concurrency sub-second-level query.
The invention also uses the horizontal sub-table technology to perform distributed storage of data, and supports parallel data synchronization and parallel index data statistics by distributed deployment of the ClickHouse fragmentation clusters, thereby solving the problem of performing parallel real-time incremental statistics on the mass data which is frequently updated.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. The method for parallel real-time incremental statistics of database logs under a single machine mode is characterized by comprising the following steps of: the method comprises the following steps:
step S1: starting a data event consumer, creating a data synchronization table corresponding to the ClickHouse node, and synchronizing the full data of the service database table into the created data synchronization table;
step S2: starting a data event producer, reading service database logs according to the sequence of service database log positions recorded by redis, and pushing the service database logs to a message queue according to the sequence of reading the service database logs successively; the service database log comprises a data newly-added event and a table structure change event;
step S3: after the event data consumer completes the full data synchronization, the incremental synchronization operation is automatically started, and the data newly-added event and/or the table structure change event in the message queue are consumed, so that the data newly-added event and/or the table structure change event are synchronized into a data synchronization table of the ClickHouse node;
step S4: according to the statistical requirements of the index data, creating a materialized view of an AggregatingMergeTree engine based on a data synchronization table corresponding to the ClickHouse node, and designating SQL sentences needing to perform statistics in the materialized view creating sentences;
step S5: and (3) carrying out result query of statistical index data on the materialized view of the AggregatingMergeTree engine by using SQL statements.
2. The method of claim 1, wherein the method comprises the steps of: the step S1 specifically includes the following steps:
step S11: starting a data event consumer, and checking whether a data synchronization table corresponding to the clickwouse node is established or not after starting;
step S12: if the data synchronization table corresponding to the ClickHouse node is not created, a Collapsing MergeTree engine table is created to serve as the data synchronization table corresponding to the ClickHouse node; the state field of the CollapsingMergeTree engine table is sign, when the sign is 1, the data is represented to be valid, and when the sign is-1, the data is represented to be invalid;
step S13: after a data synchronization table corresponding to the ClickHouse node is established, synchronizing the full data of the service database table into a Collapsing MergeTree engine table; during full data synchronization, a read-only lock is added to the synchronized business database tables.
3. The method of claim 2, wherein the method comprises the steps of: the step S2 specifically includes the following steps:
step S21: starting a data event producer, reading the service database logs according to the sequence of the service database log positions recorded by the redis, and starting reading from the tail end of the service database logs if the redis does not record the service database logs;
step S22: if a data newly added log is read, converting the data newly added log into a data newly added event, and setting a state field sign to be 1; if the data modification log is read, converting the data modification log into two data newly-added events, wherein the first event is data before modification, and setting a state field sign to be-1, and the second event is data after modification, and setting the state field sign to be 1; if the data deletion log is read, converting the data deletion log into a data newly-added event, and setting a state field sign to be-1;
step S23: if the DDL operation log is read, converting the DDL operation log into a table structure change event and simultaneously converting the DDL operation log into a DDL statement corresponding to the ClickHouse node;
step S24: and pushing the converted data newly-added event and the table structure change event to a message queue according to the sequence of reading the service database logs.
4. The method of claim 3, wherein the method comprises the steps of: if the business database logs are not recorded in the redis, the business database logs are successfully pushed to the message queue, and then the sequence of reading the business database logs is recorded in the redis.
5. The method of claim 3, wherein the method comprises the steps of: the specific steps of step S3 include: for the data adding event in the message queue, executing DML statement insertion data according to the ClickHouse node grammar; and for the table structure change event, executing the converted DDL statement, and synchronizing the data newly-added event and/or the table structure change event into a data synchronization table of the ClickHouse node.
6. The method of claim 1, wherein the method comprises the steps of: the step S4 specifically includes the following steps:
step S41: according to the statistical requirements, creating a materialized view of an AggregatingMergeTree engine based on the collaksHouse node corresponding to the synchronous CollapsingMergeTree engine table;
step S42: in a materialized view establishing statement of an AggregatingMergeTree engine, an SQL statement needing to be counted is specified, and partial index conversion of the statistic SQL statement is carried out, so that data with invalid state fields are prevented from being counted.
7. The method for parallel real-time incremental statistics of database logs based on distributed deployment is characterized by comprising the following steps: the method comprises the following steps:
step S1: deploying a plurality of data event consumers and a plurality of ClickHouse nodes to form a ClickHouse fragment cluster, wherein the number of the ClickHouse nodes is consistent with that of the business database fragment nodes, and one data event consumer is responsible for synchronizing the data of one corresponding business database fragment node into one ClickHouse node;
step S2: deploying a plurality of data event producers, wherein one data event producer is responsible for reading a database log of a service database fragmentation node;
when a data event producer is started, performing step S2 of the method of claim 1;
step S3: each data event consumer is responsible for automatic increment synchronous operation of the data event of the corresponding business database fragment node, so that the data event of the business database fragment node is synchronized to the corresponding ClickHouse node;
step S4: creating a local materialized view by using a data synchronization table of a service database synchronized by each node in the ClickHouse fragmentation cluster;
step S5: and creating a distributed table at each ClickHouse node, associating the distributed table with a local materialized view, and performing result query of statistical index data on the distributed table of any node in the ClickHouse fragment cluster by using SQL statements.
8. The method for parallel real-time incremental statistics of database logs based on distributed deployment according to claim 7, wherein: the specific steps of step S1 are:
step S11: starting each data event consumer, and checking whether a data synchronization table of a corresponding ClickHouse node is established or not after starting;
step S12: if the data synchronization table corresponding to the ClickHouse node is not created, a Collapsing MergeTree engine table is created to be used as the data synchronization table corresponding to the ClickHouse node, or a VersionedCollapsing MergeTree engine table is created to be used as the data synchronization table corresponding to the ClickHouse node;
the state field of the CollapsingMergeTree engine table is sign, when the sign is 1, the data is represented to be valid, and when the sign is-1, the data is represented to be invalid;
the business database log update time stamp of the business database fragment node corresponding to the ClickHouse node is used as the version field of the VersionedCollapsingMergeTree engine table, so that the VersionedCollapsingMergeTree engine table can be merged according to the version field added by the main key when the business database logs are merged, and the sequence influence of the business database logs is avoided;
step S13: and after a data synchronization table corresponding to the ClickHouse node is created, synchronizing the full data of the service database table into a Collapsing MergeTree engine table or a VersionedCollapsing MergeTree engine table.
9. The method for parallel real-time incremental statistics of database logs based on distributed deployment according to claim 8, wherein: the step S1 specifically includes: when a plurality of ClickHouse nodes are deployed, two or more ClickHouse sharded copy sets are created for one ClickHouse node; replacing the CollapsisMergeTree engine table in the ClickHouse slice copy set with a ReplicatedCollapsisMergeTree engine table, or replacing the VersionedCollapsisMergeTree engine table in the ClickHouse slice copy set with a ReplicatedVersionedCollapsisMergeTree engine table.
CN202111220905.6A 2021-10-20 2021-10-20 Parallel real-time incremental statistic method based on database logs Pending CN114153809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111220905.6A CN114153809A (en) 2021-10-20 2021-10-20 Parallel real-time incremental statistic method based on database logs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111220905.6A CN114153809A (en) 2021-10-20 2021-10-20 Parallel real-time incremental statistic method based on database logs

Publications (1)

Publication Number Publication Date
CN114153809A true CN114153809A (en) 2022-03-08

Family

ID=80462478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111220905.6A Pending CN114153809A (en) 2021-10-20 2021-10-20 Parallel real-time incremental statistic method based on database logs

Country Status (1)

Country Link
CN (1) CN114153809A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996353A (en) * 2022-05-26 2022-09-02 厦门立林科技有限公司 Distributed intelligent real-time data stream distribution method and system
CN116150175A (en) * 2023-04-18 2023-05-23 云账户技术(天津)有限公司 Heterogeneous data source-oriented data consistency verification method and device
CN117149914A (en) * 2023-10-27 2023-12-01 成都优卡数信信息科技有限公司 Storage method based on ClickHouse

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996353A (en) * 2022-05-26 2022-09-02 厦门立林科技有限公司 Distributed intelligent real-time data stream distribution method and system
CN114996353B (en) * 2022-05-26 2024-08-23 厦门立林科技有限公司 Distributed intelligent real-time data stream distribution method and system
CN116150175A (en) * 2023-04-18 2023-05-23 云账户技术(天津)有限公司 Heterogeneous data source-oriented data consistency verification method and device
CN117149914A (en) * 2023-10-27 2023-12-01 成都优卡数信信息科技有限公司 Storage method based on ClickHouse
CN117149914B (en) * 2023-10-27 2024-01-26 成都优卡数信信息科技有限公司 Storage method based on ClickHouse

Similar Documents

Publication Publication Date Title
CN111723160B (en) Multi-source heterogeneous incremental data synchronization method and system
US11921746B2 (en) Data replication method and apparatus, computer device, and storage medium
KR102307371B1 (en) Data replication and data failover within the database system
CN114153809A (en) Parallel real-time incremental statistic method based on database logs
Brewer Spanner, truetime and the cap theorem
US20150032695A1 (en) Client and server integration for replicating data
CN111125260A (en) Data synchronization method and system based on SQL Server
US9244838B2 (en) System, method, and computer-readable medium for grouping database level object access counts per processing module in a parallel processing system
JP7438603B2 (en) Transaction processing methods, apparatus, computer devices and computer programs
US20130110873A1 (en) Method and system for data storage and management
US9672244B2 (en) Efficient undo-processing during data redistribution
CN111259004B (en) Method for indexing data in storage engine and related device
CN109298978B (en) Recovery method and system for database cluster of specified position
WO2023077971A1 (en) Transaction processing method and apparatus, and computing device and storage medium
CN110245134B (en) Increment synchronization method applied to search service
CN111522880A (en) Method for improving data read-write performance based on mysql database cluster
CN114579614A (en) Real-time data full-scale acquisition method and device and computer equipment
Wang et al. Apache IoTDB: A time series database for IoT applications
CN110489092B (en) Method for solving read data delay problem under database read-write separation architecture
CN111917834A (en) Data synchronization method and device, storage medium and computer equipment
WO2022002044A1 (en) Method and apparatus for processing distributed database, and network device and computer-readable storage medium
CN115114294A (en) Self-adaption method and device of database storage mode and computer equipment
CN112328702A (en) Data synchronization method and system
CN115292414A (en) Method for synchronizing service data to data bins
CN115080666A (en) Data synchronization method, system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination