CN112163048A

CN112163048A - Method and device for realizing OLAP analysis based on ClickHouse

Info

Publication number: CN112163048A
Application number: CN202011006169.XA
Authority: CN
Inventors: 高响; 韩锦; 李强; 高明明
Original assignee: Changzhou Weiyizhi Technology Co Ltd
Current assignee: Changzhou Weiyizhi Technology Co Ltd
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2021-01-01

Abstract

The invention provides a method and a device for realizing OLAP analysis based on ClickHouse, wherein the method comprises the following steps: receiving Kafka data in real time through a Kafka engine table of ClickHouse; performing data partitioning on the Kafka data; performing TTL management on the partitions; performing data sharding; building a primary key index; and generating dynamic code. By receiving Kafka data in real time through the Kafka engine table of ClickHouse, the method performs big data processing in the manner of a database, which is simpler to use, easy to maintain and fast to query. TTL management of the data partitions solves the problem of wasted data storage resources, and data sharding increases the large-scale parallel computing capacity of the cluster, so that query results are returned quickly. The cluster can be scaled out linearly into a large-scale distributed cluster, giving the method the capacity to process massive data.

Description

Method and device for realizing OLAP analysis based on ClickHouse

Technical Field

The invention relates to the technical field of data management, in particular to a method for realizing OLAP (Online Analytical Processing) analysis based on ClickHouse (an open-source columnar database), a computer device and a non-transitory computer-readable storage medium.

Background

In the field of big data analysis, the traditional approach requires a combination of different frameworks and technologies to achieve the final result. Big data analysis therefore becomes expensive in terms of labor cost, technical capability, hardware cost and maintenance cost, which deters many small and medium-sized enterprises and forces them to lease the data analysis services of large third-party companies.

Transaction processing scenarios, such as adding items to a shopping cart, placing an order or paying in an e-commerce setting, require a large number of in-place insert, update and delete operations. A data analysis (OLAP) scenario is different: data is generally imported in batches and then used for flexible exploration along any dimension, BI tool insight, report making and the like. After the data is written once, an analyst tries to mine and analyze it from various angles until information such as business value and business trends is discovered. This is a process of trial and error, constant adjustment and continuous optimization, in which data is read far more often than it is written.

In the related art, a Hadoop (distributed system infrastructure) system is generally adopted for data analysis, but this architecture has the following disadvantages:

(1) A large number of small files cannot be stored efficiently. If a large number of small files are stored, they occupy a large amount of NameNode (name node) memory for file directory and block information, which is undesirable because NameNode memory is always limited. In addition, the seek time for a small file may exceed its read time, which violates the design goals of HDFS.

(2) Hadoop is an offline system and generally has difficulty supporting ad hoc queries.

Disclosure of Invention

The invention provides a method for realizing OLAP analysis based on ClickHouse. The method receives Kafka data in real time through a Kafka engine table of ClickHouse and performs big data processing in the manner of a database, which is simpler to use, easy to maintain and fast to query. It solves the problem of wasted data storage resources through TTL management of the data partitions, increases the large-scale parallel computing capability of the cluster through data sharding so that query results are returned quickly, and can be scaled out linearly into a large-scale distributed cluster, thereby gaining the capability to process massive data.

The invention also provides a device for realizing OLAP analysis based on the ClickHouse.

The invention also provides computer equipment.

The invention also proposes a non-transitory computer-readable storage medium.

The technical scheme adopted by the invention is as follows:

An embodiment of the first aspect of the invention provides a method for realizing OLAP analysis based on ClickHouse, which comprises the following steps: receiving Kafka data in real time through a Kafka (an open-source stream processing platform) engine table of ClickHouse; performing data partitioning on the Kafka data; performing TTL (Time To Live) management on the partitions; performing data sharding; building a primary key index; and generating dynamic code.

According to one embodiment of the invention, receiving the Kafka data in real time through the Kafka engine table of ClickHouse comprises: creating a Kafka engine table in ClickHouse to connect to the unbounded stream data in Kafka; creating an entity table; creating a distributed table that specifies the entity table; and creating a view that reads from the specified Kafka engine table and writes to the distributed table.

According to one embodiment of the invention, the data partition comprises: partitioning data according to months, partitioning data according to weeks, and directly taking each value of an Enum (enumeration) type column as a partition.

According to an embodiment of the present invention, the TTL management includes: column level TTL, row level TTL, and partition level TTL.

According to one embodiment of the invention, the data sharding comprises: random sharding, constant (fixed) sharding, column-value sharding and custom-expression sharding.

According to an embodiment of the invention, generating the dynamic code comprises: ClickHouse dynamically generating code directly from the current SQL (Structured Query Language) statement.

An embodiment of a second aspect of the present invention provides an apparatus for implementing OLAP analysis based on ClickHouse, including: a data receiving module for receiving Kafka data in real time through a Kafka engine table of ClickHouse; a partitioning module for performing data partitioning on the Kafka data; a management module for performing TTL management on the partitions; a sharding module for performing data sharding; an index module for building a primary key index; and a generation module for generating dynamic code.

According to an embodiment of the present invention, the data receiving module is specifically configured to: create a Kafka engine table in ClickHouse to connect to the unbounded stream data in Kafka; create an entity table; create a distributed table that specifies the entity table; and create a view that reads from the specified Kafka engine table and writes to the distributed table.

According to an embodiment of the present invention, the partition module is specifically configured to: the data are partitioned according to the month, the data are partitioned according to the week number, and each value of the column of the Enum type is directly taken as a partition.

According to one embodiment of the invention, the data sharding comprises: random sharding, constant (fixed) sharding, column-value sharding and custom-expression sharding.

According to an embodiment of the invention, the generation module is specifically configured to: dynamically generate code directly from the current SQL statement.

An embodiment of a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method for implementing OLAP analysis based on ClickHouse according to the embodiment of the first aspect of the present invention is implemented.

An embodiment of a fourth aspect of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for implementing OLAP analysis based on ClickHouse according to the embodiment of the first aspect of the present invention.

The invention has the beneficial effects that:

The invention performs big data processing in the manner of a database, which is simpler to use, easy to maintain and fast to query. It includes both storage and computing capability without depending on additional storage components, achieves high availability entirely on its own, supports complete SQL syntax, has a low learning cost and high flexibility, and lays the foundation for extremely fast analysis performance.

Drawings

FIG. 1 is a flow diagram of a method for implementing OLAP analysis based on ClickHouse according to one embodiment of the invention;

FIG. 2 is a diagram comparing a ClickHouse query flow with an OLAP query flow in the related art;

FIG. 3 is a block diagram of an apparatus for implementing OLAP analysis based on ClickHouse according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a method for implementing OLAP analysis based on ClickHouse according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

S1, receiving Kafka data in real time through a Kafka engine table of ClickHouse.

Further, receiving the Kafka data in real time through the Kafka engine table of ClickHouse may include: creating a Kafka engine table in ClickHouse to connect to the unbounded stream data in Kafka; creating an entity table; creating a distributed table that specifies the entity table; and creating a view that reads from the specified Kafka engine table and writes to the distributed table.

Specifically, (1) the Kafka engine table may be created with the following command:

CREATE TABLE IF NOT EXISTS table_name ON CLUSTER cluster_name (
    field_a String,
    field_b String
)
ENGINE = Kafka('Kafka address', 'Kafka topic', 'consumer group', 'data format');

The Kafka engine table is created through the above step, and the Kafka data is received by this engine table.

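(2) The entity table may be created with the following command:

CREATE TABLE IF NOT EXISTS entity_table_name ON CLUSTER cluster_name (field_a field_type, field_b field_type) ENGINE = deduplication_engine(partition_field, (primary_key_index), 8192);

The command above specifies the deduplication engine, the partition field and the primary key index. Here deduplication_engine, partition_field and primary_key_index are placeholders for the chosen engine and fields, and 8192 is the index granularity.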

(3) The distributed table that specifies the entity table may be created with the following command:

CREATE TABLE IF NOT EXISTS distributed_table AS entity_table ENGINE = Distributed(cluster_name, database_name, entity_table, sipHash64(production_date));

The command above creates a distributed table on top of the entity table and specifies the sharding key. The fields of these tables are filled by the view created in the next step.

(4) The view that reads from the specified Kafka engine table and writes to the distributed table may be created with the following command:

CREATE MATERIALIZED VIEW IF NOT EXISTS view_table ON CLUSTER cluster_name TO distributed_table
AS SELECT *
FROM kafka_engine_table;

Through the above steps, the view reads the data from the specified Kafka engine table, and the data is written into the entity table.

S2, performing data partitioning on the Kafka data.

Further, according to an embodiment of the present invention, the data partitioning may include: partitioning the data by month, partitioning the data by week, and directly taking each value of an Enum-type column as a partition.

Specifically, ClickHouse supports the PARTITION BY clause. When a table is created, the data partitioning operation can be specified with any legal expression, for example partitioning the data by month through toYYYYMM(), partitioning the data by week through toMonday(), or directly taking each value of an Enum-type column as a partition. Data partitioning has two main applications in ClickHouse:

Partition pruning is performed on the partition key, so that only the necessary data is queried. The flexible partition expression allows partitions to be set according to the SQL pattern, so that they fit the business characteristics as closely as possible.
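The partition expressions described above can be illustrated with a small Python model. This is only a sketch and not ClickHouse code: the helpers `to_yyyymm`, `to_monday` and `partition_rows` are hypothetical stand-ins that mimic what toYYYYMM() and toMonday() compute, and the sample rows are invented for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def to_yyyymm(d: date) -> int:
    """Mimic toYYYYMM(): year * 100 + month, e.g. 2020-09-23 -> 202009."""
    return d.year * 100 + d.month


def to_monday(d: date) -> date:
    """Mimic toMonday(): round a date down to the Monday of its week."""
    return d - timedelta(days=d.weekday())


def partition_rows(rows, key):
    """Group rows into partitions by an arbitrary partition expression."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)
    return dict(partitions)


# Invented sample data: a date column and an Enum-like status column.
rows = [
    {"day": date(2020, 9, 21), "status": "paid"},
    {"day": date(2020, 9, 23), "status": "new"},
    {"day": date(2020, 10, 1), "status": "paid"},
]

by_month = partition_rows(rows, lambda r: to_yyyymm(r["day"]))
by_week = partition_rows(rows, lambda r: to_monday(r["day"]))
by_status = partition_rows(rows, lambda r: r["status"])

# Partition pruning: a query for September 2020 reads a single partition
# and never touches the rows stored under 202010.
september_rows = by_month.get(202009, [])
```

The same grouping function serves all three partitioning schemes, which mirrors the point that any legal expression can act as the partition key.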

S3, performing TTL management on the partitions.

According to an embodiment of the present invention, the TTL management may include: column level TTL, row level TTL, and partition level TTL.

Specifically, in an analysis scenario the value of data decreases continuously over time, so TTL management needs to be performed on the partitions to eliminate expired partition data. Most businesses keep only the data of the most recent months for cost reasons, and ClickHouse provides data lifecycle management through TTL.

ClickHouse supports several different granularities of TTL:

1) Column-level TTL: when part of the data in a column has expired, the expired values are replaced by the default value; when all the data in the column has expired, the column is deleted.

2) Row-level TTL: when a row has expired, the row is deleted directly.

3) Partition-level TTL: when a partition has expired, the partition is deleted directly.

According to the business logic and data characteristics, the TTL can be set flexibly for different types of data and services. For example, if a specific scenario only needs the data of the last 7 days, a partition can be deleted directly once its age exceeds 7 days.
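The three TTL granularities can be sketched in Python as follows. This is an illustrative model under stated assumptions, not ClickHouse's implementation: the function names, the `ts` and `payload` fields and the sample data are hypothetical, a partition is assumed to expire once every row in it has expired, and the deletion of a fully expired column is not modeled.

```python
from datetime import datetime, timedelta


def apply_row_ttl(rows, now, ttl):
    """Row-level TTL: drop every row whose timestamp is older than ttl."""
    return [r for r in rows if now - r["ts"] < ttl]


def apply_column_ttl(rows, now, ttl, column, default):
    """Column-level TTL: replace an expired column value with its default."""
    return [
        {**r, column: default} if now - r["ts"] >= ttl else r
        for r in rows
    ]


def apply_partition_ttl(partitions, now, ttl):
    """Partition-level TTL: drop a partition once all of its rows expired."""
    return {
        key: rows
        for key, rows in partitions.items()
        if any(now - r["ts"] < ttl for r in rows)
    }


# Invented sample data with a 7-day retention window.
now = datetime(2020, 9, 23)
week = timedelta(days=7)
rows = [
    {"ts": datetime(2020, 9, 22), "payload": "fresh"},
    {"ts": datetime(2020, 9, 1), "payload": "stale"},
]
partitions = {
    202009: rows,
    202008: [{"ts": datetime(2020, 8, 15), "payload": "old"}],
}

kept = apply_row_ttl(rows, now, week)
masked = apply_column_ttl(rows, now, week, "payload", "")
live = apply_partition_ttl(partitions, now, week)
```

With these inputs, only the row from 2020-09-22 survives the row-level rule, the stale row keeps its place but loses its payload under the column-level rule, and the August partition is dropped as a whole.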

S4, performing data sharding.

According to one embodiment of the invention, data sharding may comprise: random sharding, constant (fixed) sharding, column-value sharding and custom-expression sharding.

Specifically, ClickHouse supports both a standalone mode and a distributed cluster mode. In the distributed mode, ClickHouse divides the data into a plurality of shards and distributes the shards to different nodes. Different sharding strategies each have advantages for different SQL patterns. ClickHouse provides a rich set of sharding strategies, so that each service can choose one according to its actual requirements.

1) Random sharding: the written data is distributed randomly to one of the nodes in the distributed cluster.

2) Constant (fixed) sharding: the written data is distributed to one fixed node.

3) Column-value sharding: the data is sharded by hashing the value of a certain column.

4) Custom-expression sharding: any legal expression may be specified, and the data is sharded by hashing the computed value of the expression.

Data sharding allows ClickHouse to make full use of the large-scale parallel computing capability of the whole cluster and return query results quickly. More importantly, the variety of sharding functions opens up room for service optimization. For example, with hash sharding, a JOIN can avoid shuffling data between nodes and be executed locally on each shard. Support for custom sharding allows the most suitable sharding strategy to be tailored to different services and SQL patterns, and a reasonable sharding expression can solve the problem of data skew among shards. In addition, sharding allows ClickHouse to be scaled out linearly into a large-scale distributed cluster, so that ClickHouse has the capacity to process massive data.
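The four sharding strategies can be compared with a short Python sketch. It rests on several assumptions that are not in the original text: shards are taken to have equal weight, so a row goes to the shard given by its key modulo the shard count; `zlib.crc32` stands in for a hash function such as the sipHash64 used in the command above; and the `user_id` and `day` fields are invented.

```python
import random
import zlib


def random_key(row) -> int:
    """Random sharding: a fresh random key for every written row."""
    return random.getrandbits(32)


def constant_key(row) -> int:
    """Constant (fixed) sharding: every row gets the same key."""
    return 0


def column_value_key(row) -> int:
    """Column-value sharding: hash the value of one column."""
    return zlib.crc32(str(row["user_id"]).encode())


def expression_key(row) -> int:
    """Custom-expression sharding: hash the value of any expression."""
    return zlib.crc32(f'{row["user_id"]}:{row["day"]}'.encode())


def distribute(rows, key_fn, shard_count):
    """Route each row to the shard given by key modulo shard count."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[key_fn(row) % shard_count].append(row)
    return shards


rows = [{"user_id": u, "day": "2020-09-23"} for u in (1, 2, 3, 1, 2, 3)]

by_random = distribute(rows, random_key, 3)
by_constant = distribute(rows, constant_key, 3)
by_user = distribute(rows, column_value_key, 3)
by_expression = distribute(rows, expression_key, 3)
```

Because `column_value_key` is deterministic, all rows of one `user_id` land on the same shard, which is the property that lets a JOIN on that column run locally without moving data between nodes.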

S5, building a primary key index.

Specifically, the data of each column is divided according to the index granularity (8192 rows by default), and the first row of each granule is referred to as a mark row. The primary key index stores the primary key value of each mark row. For a query whose WHERE condition contains the primary key, the corresponding granule can be located directly by a binary search on the primary key index, which avoids a full table scan and accelerates the query.
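The lookup described above can be sketched in Python with the standard `bisect` module. This is a simplified model, not ClickHouse's storage format: the function names are hypothetical, the table is reduced to one sorted list of primary key values, and the keys are assumed to be unique.

```python
from bisect import bisect_right

INDEX_GRANULARITY = 8192  # default number of rows per granule, per the text


def build_sparse_index(sorted_keys, granularity=INDEX_GRANULARITY):
    """Store the primary key value of the mark row of every granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]


def find_granule(sparse_index, key):
    """Binary-search the marks for the one granule that may hold key."""
    return max(bisect_right(sparse_index, key) - 1, 0)


def lookup(sorted_keys, sparse_index, key, granularity=INDEX_GRANULARITY):
    """Scan only the located granule instead of the whole table."""
    start = find_granule(sparse_index, key) * granularity
    granule = sorted_keys[start:start + granularity]
    return [start + i for i, k in enumerate(granule) if k == key]


# 50,000 sorted, unique primary key values (the even numbers below 100,000).
keys = list(range(0, 100000, 2))
index = build_sparse_index(keys)

positions = lookup(keys, index, 40000)
```

The index holds one entry per 8192 rows, so it stays small enough to search in memory, and each lookup reads at most one granule rather than all 50,000 rows.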

S6, generating dynamic code.

Further, according to an embodiment of the present invention, generating dynamic code includes: dynamically generating code directly from the current SQL statement.

Specifically, conventional OLAP implementations usually use the volcano model for expression evaluation: a query is converted into individual operators, such as HashJoin (hash join), Scan (table scan), IndexScan (index scan) and Aggregation (aggregation). To connect different operators, a uniform interface such as open/next/close is adopted between them. Each operator implements the virtual functions of a parent class. Since a single SQL statement in an analysis scenario typically processes hundreds of millions of rows, the cost of the virtual function calls can no longer be ignored. In addition, each operator has to account for many variables such as column type, column size and number of columns, and the resulting large number of if-else branches causes CPU branch mispredictions.

ClickHouse can dynamically generate code directly from the current SQL statement, and then compile and execute it. As shown in Fig. 2, the left side is the OLAP query flow in the related art and the right side is the ClickHouse query flow. ClickHouse can generate code directly for expressions, which not only eliminates a large number of virtual function calls (i.e., the calls through multiple function pointers in Fig. 2), but also eliminates unnecessary if-else branches, because the types, number and other properties of the expression parameters are already known when the code is generated.
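The contrast between the two query flows can be modeled in Python. This is only an analogy under stated assumptions: the classes `Col`, `Const` and `BinOp` and the function `generate` are hypothetical, and Python's `exec` stands in for whatever compilation step the database performs. The interpreted path dispatches a method call and an operator check per node and per row, while the generated path evaluates one specialized function.

```python
class Expr:
    """Base class; eval() plays the role of the per-row virtual call."""

    def eval(self, row):
        raise NotImplementedError


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def eval(self, row):
        return row[self.name]

    def emit(self):
        return f"row[{self.name!r}]"


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def eval(self, row):
        return self.value

    def emit(self):
        return repr(self.value)


class BinOp(Expr):
    def __init__(self, op, left, right):
        if op not in ("+", "*"):
            raise ValueError(f"unsupported operator: {op}")
        self.op, self.left, self.right = op, left, right

    def eval(self, row):
        # Interpreted path: two nested calls plus a branch on every row.
        lhs, rhs = self.left.eval(row), self.right.eval(row)
        return lhs + rhs if self.op == "+" else lhs * rhs

    def emit(self):
        return f"({self.left.emit()} {self.op} {self.right.emit()})"


def generate(expr):
    """Emit source for this one expression and compile it once."""
    source = f"def compiled(row):\n    return {expr.emit()}\n"
    namespace = {}
    exec(source, namespace)
    return namespace["compiled"]


# price * qty + 1, evaluated over invented rows in both ways.
expr = BinOp("+", BinOp("*", Col("price"), Col("qty")), Const(1))
rows = [{"price": 2, "qty": 3}, {"price": 5, "qty": 4}]

interpreted = [expr.eval(r) for r in rows]
compiled = generate(expr)
generated = [compiled(r) for r in rows]
```

Both paths return the same values; the difference is that the generated function contains no operator dispatch at all, because the shape of the expression was fixed when the source was emitted.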

From the above, the method for realizing OLAP analysis based on ClickHouse according to the present invention solves, through the Kafka engine table of ClickHouse, the problem of excessively complicated steps in the conventional OLAP data integration process. In the conventional process, for example, Kafka data collected by Flume or another data collection component has to be written into the data storage destination, and each additional component adds labor cost and machine cost.

The step of TTL management of the data partitions solves the problem of wasted data storage resources. In an analysis scenario, the value of data decreases continuously over time, and most businesses keep only the data of the most recent months for cost reasons. Through TTL management, historical data storage can be managed more flexibly, and expired data can be eliminated down to the column level or the row level.

The data sharding step increases the large-scale parallel computing capacity of the cluster so that query results are returned quickly, and allows the cluster to be scaled out linearly into a large-scale distributed cluster, which gives it the capacity to process massive data.

In summary, according to the method for implementing OLAP analysis based on ClickHouse in the embodiment of the present invention, Kafka data is received in real time through the Kafka engine table of ClickHouse, data partitioning is performed on the Kafka data, TTL management is performed on the partitions, data sharding is performed, a primary key index is built, and dynamic code is generated. The method therefore performs big data processing in the manner of a database, which is simpler to use, easy to maintain and fast to query. It solves the problem of wasted data storage resources through TTL management of the data partitions, increases the large-scale parallel computing capacity of the cluster through data sharding so that query results are returned quickly, and can be scaled out linearly into a large-scale distributed cluster, thereby gaining the capacity to process massive data.

Corresponding to the method for realizing OLAP analysis based on the ClickHouse, the invention also provides a device for realizing OLAP analysis based on the ClickHouse. Since the device embodiment of the present invention corresponds to the method embodiment described above, details that are not disclosed in the device embodiment may refer to the method embodiment described above, and are not described again in the present invention.

FIG. 3 is a block diagram of an apparatus for implementing OLAP analysis based on ClickHouse according to an embodiment of the invention. As shown in Fig. 3, the apparatus includes: a data receiving module 1, a partitioning module 2, a management module 3, a sharding module 4, an index module 5 and a generation module 6.

The data receiving module 1 receives Kafka data in real time through a Kafka engine table of ClickHouse; the partitioning module 2 is used for performing data partitioning on the Kafka data; the management module 3 is used for performing TTL management on the partitions; the sharding module 4 is used for performing data sharding; the index module 5 is used for building a primary key index; and the generation module 6 is used for generating dynamic code.

According to an embodiment of the present invention, the data receiving module 1 is specifically configured to: create a Kafka engine table in ClickHouse to connect to the unbounded stream data in Kafka; create an entity table; create a distributed table that specifies the entity table; and create a view that reads from the specified Kafka engine table and writes to the distributed table.

According to an embodiment of the present invention, the partition module 2 is specifically configured to: the data are partitioned according to the month, the data are partitioned according to the week number, and each value of the column of the Enum type is directly taken as a partition.

According to one embodiment of the invention, TTL management comprises: column level TTL, row level TTL, and partition level TTL.

According to one embodiment of the invention, data sharding comprises: random sharding, constant (fixed) sharding, column-value sharding and custom-expression sharding.

According to one embodiment of the invention, the generation module 6 is specifically configured to: dynamically generate code directly from the current SQL statement.

According to the device for realizing OLAP analysis based on ClickHouse, the data receiving module receives Kafka data in real time through the Kafka engine table of ClickHouse; the partitioning module performs data partitioning on the Kafka data; the management module performs TTL management on the partitions; the sharding module performs data sharding; the index module builds the primary key index; and the generation module generates dynamic code. The device therefore performs big data processing in the manner of a database, which is simpler to use, easy to maintain and fast to query. It solves the problem of wasted data storage resources through TTL management of the data partitions, increases the large-scale parallel computing capacity of the cluster through data sharding so that query results are returned quickly, and can be scaled out linearly into a large-scale distributed cluster, thereby gaining the capacity to process massive data.

The invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method for implementing OLAP analysis based on ClickHouse according to the above embodiments of the present invention is implemented.

According to the computer device provided by the embodiment of the invention, when the processor executes the computer program stored in the memory, Kafka data is received in real time through a Kafka engine table of ClickHouse, data partitioning is performed on the Kafka data, TTL management is performed on the partitions, data sharding is performed, a primary key index is built, and dynamic code is generated.

The present invention also proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method for implementing OLAP analysis based on ClickHouse according to the above-described embodiments of the present invention.

According to the non-transitory computer-readable storage medium of the embodiment of the invention, when the computer program stored on the storage medium is executed by a processor, Kafka data is received in real time through a Kafka engine table of ClickHouse, data partitioning is performed on the Kafka data, TTL management is performed on the partitions, data sharding is performed, a primary key index is built, and dynamic code is generated.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for realizing OLAP analysis based on ClickHouse is characterized by comprising the following steps:

receiving Kafka data in real time through a Kafka engine table of ClickHouse;

performing data partitioning on the Kafka data;

performing TTL management on the partitions;

performing data sharding;

building a primary key index;

generating dynamic code.

2. The method for implementing OLAP analysis based on ClickHouse according to claim 1, wherein receiving the Kafka data in real time through the Kafka engine table of ClickHouse comprises:

creating a Kafka engine table in ClickHouse to connect to the unbounded stream data in Kafka;

creating an entity table;

creating a distributed table that specifies the entity table;

creating a view that reads from the specified Kafka engine table and writes to the distributed table.

3. The method of claim 1, wherein the data partitioning comprises:

partitioning the data by month, partitioning the data by week, and directly taking each value of an Enum-type column as a partition.

4. The method for implementing OLAP analysis based on ClickHouse as claimed in claim 1, wherein the TTL management includes: column level TTL, row level TTL, and partition level TTL.

5. The method of claim 1, wherein the data sharding comprises:

random sharding, constant (fixed) sharding, column-value sharding and custom-expression sharding.

6. The method of claim 1, wherein the generating dynamic code comprises:

dynamically generating code directly from the current SQL statement.

7. An apparatus for implementing OLAP analysis based on ClickHouse, comprising:

a data receiving module for receiving Kafka data in real time through a Kafka engine table of ClickHouse;

a partitioning module for performing data partitioning on the Kafka data;

a management module for performing TTL management on the partitions;

a sharding module for performing data sharding;

an index module for building a primary key index;

a generation module for generating dynamic code.

8. The device for implementing OLAP analysis based on ClickHouse according to claim 7, wherein the data receiving module is specifically configured to:

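creating a Kafka engine table in ClickHouse to connect to the unbounded stream data in Kafka;

creating an entity table;

creating a distributed table that specifies the entity table;

creating a view that reads from the specified Kafka engine table and writes to the distributed table.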

9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of implementing OLAP analysis based on ClickHouse according to any one of claims 1 to 6 when executing the program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method of implementing OLAP analysis based on ClickHouse according to any one of claims 1 to 6.