CN112269802A

CN112269802A - Method and system for frequent deletion, modification and check optimization based on Clickhouse

Info

Publication number: CN112269802A
Application number: CN202011111740.4A
Authority: CN
Inventors: 周小润; 刘跃红; 管正爽
Original assignee: Yinsheng Payment Service Co Ltd
Current assignee: Yinsheng Payment Service Co Ltd
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2021-01-26

Abstract

The invention discloses a method and a system for frequent deletion, modification and check optimization based on Clickhouse, relating to the technical field of data processing; the method comprises the following steps: s1, synchronizing data, analyzing and storing the real-time operation log data in the synchronous original database to a kafka cluster and mongodb; s2, consumption of running data, wherein the Application end Application is respectively connected with the kafka cluster, the mongodb cluster and the clickhouse cluster, when the Application end Application consumes the running data, the Application end Application firstly queries the mongodb data, and if the query is not available, the Application end Application queries the clickhouse cluster data; s3, operation on the query data; s4, timing synchronization of incremental data; the invention has the beneficial effects that: the method can support the operations of quick real-time insertion and update, and support high-concurrency query aiming at real-time data.

Description

Method and system for frequent deletion, modification and check optimization based on Clickhouse

Technical Field

The invention relates to the technical field of big data processing, in particular to a method and a system for frequent deletion, modification and check optimization based on Clickhouse.

Background

At present, big data are widely applied to the fields of finance, electronic commerce, logistics, enterprise government affairs and the like, and are applied to the industrial internet in recent years, and with the acceleration of the internet process of everything in the 5G era, the storage of mass data, the real-time query of mass data and the OLAP analysis bring challenges. Clickhouse is being widely applied to real-time processing and OLAP online analysis with its outstanding query and write performance, some systems can store values of individual columns individually, but cannot process analysis queries efficiently due to optimization of other scenarios, such as HBase, BigTable, Cassandra and HyperTable. In these systems, throughput of about one hundred thousand lines per second can be achieved, but not as high as hundreds of millions of lines per second as in Clickhouse.

Clickhouse was developed by yandex. Has been used continuously for many years as a core component of the system. To date, the system has over 13 trillion records in a clickwouse, and over 200 billions of events are processed per day. It allows dynamic queries and generation of reports directly from the raw data. The method is widely applied to scenes such as real-time reports, online real-time analysis, user images, intelligent recommendation systems and the like. However, there are inherent disadvantages in the use process of Clickhouse, which face some technical points to be solved:

1. transactions are not supported, true deletion/updating is not supported;

2. high concurrency is not supported, official recommendations qps of 100, the number of connections can be increased by modifying the configuration file, but in case the server is good enough;

3. the method is not suitable for real-time insertion or small-batch insert, update and delete operations, because the ClickHouse bottom layer can continuously carry out asynchronous data combination, the query performance can be influenced, and the method needs to be avoided as much as possible when real-time data writing is carried out.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method and a system for frequently deleting, modifying and checking optimization based on Clickhouse, which can support the operations of quick real-time insertion and updating and support high-concurrency query aiming at real-time data.

The technical scheme adopted by the invention for solving the technical problems is as follows: in a method for frequent censorship optimization based on Clickhouse, the improvement comprising the steps of:

s1, synchronizing data, analyzing and storing the real-time operation log data in the synchronous original database to a kafka cluster and mongodb;

s2, consumption of running data, wherein the Application end Application is respectively connected with the kafka cluster, the mongodb cluster and the clickhouse cluster, when the Application end Application consumes the running data, the Application end Application firstly queries the mongodb data, and if the query is not available, the Application end Application queries the clickhouse cluster data;

s3, operating the query data, and directly operating mongodb when the Application end queries the data in mongodb and updates or deletes the data; when the Application end does not inquire data in mongodb and inquires data in the clickhouse cluster, writing the inquiry records of the Application back to mongodb for inquiry;

and S4, synchronizing the timing of the incremental data, wherein mongodb is connected with a clickhouse cluster, and mongodb synchronizes the incremental data to the clickhouse cluster.

Further, in step S1, the original database is an Oracle database or Mysql database.

Further, in step S1, data synchronization is implemented by using OGG or Canal.

Further, in step S2, when the Application queries data of the last three days, the query is sent to mongodb.

Further, in step S2, when the Application queries the history data, the Application goes to the clickhouse cluster for querying.

Further, in step S4, the incremental data is synchronized to the clickhouse cluster by timing, so as to ensure the integrity of the clickhouse cluster data.

On the other hand, the invention also discloses a system for frequently deleting, modifying and checking optimization based on Clickhouse, which is improved in that the system comprises an original database, a kafka cluster, mongodb, Application at an Application end and a Clickhouse cluster;

the original database is used for storing real-time operation log data;

the kafka cluster and the mongodb are respectively connected with the original database, and the real-time operation log data in the original database are analyzed and stored in the kafka cluster and the mongodb;

the Application end is respectively connected with the kafka cluster, the mongodb cluster and the clickhouse cluster, and is used for realizing data query on the mongodb cluster and data query on the clickhouse cluster;

the clickhouse cluster is connected with the mongodb, and the mongodb synchronizes incremental data to the clickhouse cluster in a timing mode.

In the above structure, the original database is an Oracle database or a Mysql database.

In the above structure, the Application queries mongodb's data as data for the last three days of operation.

In the above structure, the Application queries the data of the clickhouse cluster as historical data.

The invention has the beneficial effects that: the method and the system for frequent deletion, modification and check optimization based on the Clickhouse can support rapid real-time insertion and update operations, support high-concurrency query aiming at real-time data, rapidly respond to query requests and achieve millisecond-level response.

Drawings

Fig. 1 is a schematic flow chart of a method for frequent censoring and searching optimization based on Clickhouse in the present invention.

Fig. 2 is a schematic diagram of a system for frequent censoring optimization based on Clickhouse according to the present invention.

Fig. 3 is a schematic flow chart of incremental data synchronization based on the Clickhouse frequent censoring optimization method of the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

The conception, the specific structure, and the technical effects produced by the present invention will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the features, and the effects of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and those skilled in the art can obtain other embodiments without inventive effort based on the embodiments of the present invention, and all embodiments are within the protection scope of the present invention. In addition, all the connection/connection relations referred to in the patent do not mean that the components are directly connected, but mean that a better connection structure can be formed by adding or reducing connection auxiliary components according to specific implementation conditions. All technical characteristics in the invention can be interactively combined on the premise of not conflicting with each other.

Referring to fig. 1, the invention discloses a method for frequent censorship optimization based on Clickhouse, and specifically, the method comprises the following steps:

in this embodiment, in step S1, the original database is an Oracle database or Mysql database; and, adopt OGG or Canal to realize the synchronization of the data;

in the embodiment, when the Application terminal queries data of the last three days, the Application terminal queries the mongdb; when the Application at the Application end inquires the historical data, inquiring in the clickhouse cluster; it should be noted that the historical data may include the data of the last three days, or may not include the data of the last three days, and may be determined according to actual conditions, and in the case of relatively high real-time performance, the historical data is generally the data including the data of the last three days.

And the integrity of the clickhouse cluster data is ensured by timing and synchronizing the incremental data to the clickhouse cluster.

In the present embodiment, as shown in fig. 3, for the timing synchronization of the incremental data in step S4, the present invention provides a specific embodiment, wherein the incremental data is first imported into the temporary table "table name _ tmp

Confirming whether the polling is finished or not, writing a formal table into a temporary table if the polling is finished, then confirming whether the polling is finished or not, naming the formal table as a table name _ tmp _ tmp and naming the temporary table as the table name _ tmp if the polling is finished

And renaming to a formal table and deleting a table name _ tmp _ tmp to finish the synchronization of the incremental data.

Referring to fig. 2, the invention further discloses a system for frequent pruning, searching and optimizing based on Clickhouse, which comprises an original database, a kafka cluster 10, a mongodb20, an Application30 and a Clickhouse cluster 40;

the original database is used for storing real-time operation log data; in this embodiment, the original database is an Oracle database or a Mysql database;

the kafka cluster 10 and the mongodb20 are respectively connected with the original database, and the real-time operation log data in the original database are analyzed and stored in the kafka cluster 10 and the mongodb 20;

the Application30 is respectively connected with the kafka cluster 10, the mongodb20 and the clickhouse cluster 40, and the Application30 is used for realizing data query on the mongodb20 and data query on the clickhouse cluster 40; referring to fig. 3, the Application30 queries mongodb20 data as data of the last three days of operation; the Application30 queries the data of the clickhouse cluster 40 as historical data;

the clickhouse cluster 40 is connected with mongodb20, and mongodb20 synchronizes incremental data to the clickhouse cluster 40 in a timing mode.

The steps of the business process are disclosed by the method and the system for frequently deleting, modifying, checking and optimizing based on the Clickhouse. Since the client frequently performs high-frequency operations of adding and deleting when performing real-time query, but the Clickhouse cluster 40 is not friendly to the deleting operation and affects performance greatly, and even if the client inserts a single high-frequency insert, the performance of the Clickhouse cluster 40 is also affected seriously, such operations should be avoided and processed in batches, and since real-time data needs to have a place to record the operations and needs to meet the requirement of real-time performance, the performance of the small data amount (three-day data) mongodb is excellent; on one hand, real-time processing can be achieved, and performance of the Clickhouse cluster 40 is not affected.

How the invention solves the problem that the Clickhouse cluster 40 is not suitable for the high concurrency scene is explained below, since most of data needing to be operated by most of real-time applications are hot data of about three days in the last day, mongodb is introduced, and since the mongodb has better support for the concurrent query and the small data volume supports millisecond-level response, the problem can be perfectly solved.

Further, we describe in detail how the system of the present invention can achieve high concurrency millisecond level responses, as described above, most requests will be processed by mongodb20, which is millisecond level; in addition, the remaining small part of requests are sent to the Clickhouse cluster 40, at this time, the concurrent queries are not very high, and the Clickhouse cluster 40 also responds to the queries in about a few milliseconds, so that the millisecond-level response of the whole system is ensured.

Data needs to be synchronized by timing to remove mongodb20, the synchronization period depends on a service scene, and when a Clickhouse cluster 40 needs to be used for online analysis of OLAP (on-line analytical processing), the synchronization operation can be accelerated, for example, once synchronization is completed for 2000 pieces; if the service only needs to go out of the T +1 data service, the service can be synchronized once a day.

In addition, the direct multi-table operation can cause the merge partition to affect the unstable table performance of the external service; therefore, the incremental data is firstly imported into the temporary table and then the formal table data is written back into the temporary table, and the data is renamed into the formal table to provide services for the outside in a rename mode after the data import is finished.

Therefore, through the description above, the method and the system for frequent deletion, modification, and check optimization based on Clickhouse of the present invention can support operations of fast real-time insertion and update, support high concurrent query for real-time data, quickly respond to query requests, and can achieve millisecond-level response.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for frequent deletion, modification and optimization based on Clickhouse is characterized by comprising the following steps:

s3, operating the query data, and directly operating mongodb when the Application end queries the data in mongodb and updates or deletes the data; when the Application end does not inquire data in mongodb and inquires data in the clickhouse cluster, writing the inquiry records of the Application back to the preposed mongodb for inquiry;

2. The method as claimed in claim 1, wherein in step S1, the original database is an Oracle database or Mysql database.

3. The method according to claim 1, wherein in step S1, data synchronization is implemented by using OGG or Canal.

4. The method as claimed in claim 1, wherein in step S2, when the Application queries data of the last three days, the query is sent to mongodb.

5. The method as claimed in claim 4, wherein in step S2, when the Application queries the history data, the Application queries the Clickhouse cluster.

6. The method according to claim 1, wherein in step S4, the integrity of Clickhouse cluster data is ensured by timing and synchronizing incremental data to the Clickhouse cluster.

7. A system based on Clickhouse frequent deletion, modification and check optimization is characterized by comprising an original database, a kafka cluster, mongodb, Application side Application and a Clickhouse cluster;

the original database is used for storing real-time operation log data;

8. The system of claim 7, wherein the original database is an Oracle database or Mysql database.

9. The system of claim 7, wherein the Application queries mongodb's data as data operating on the last three days.

10. The system of claim 7, wherein the Application queries Clickhouse cluster data as historical data.