CN114661706A - Clickhouse data writing plug-in method based on jlogstash - Google Patents

Clickhouse data writing plug-in method based on jlogstash

Info

Publication number
CN114661706A
CN114661706A (application CN202011534160.6A)
Authority
CN
China
Prior art keywords
data
clickhouse
database
thread
writing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011534160.6A
Other languages
Chinese (zh)
Inventor
钱奕辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yunzhe Technology Co ltd
Original Assignee
Hangzhou Yunzhe Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yunzhe Technology Co ltd filed Critical Hangzhou Yunzhe Technology Co ltd
Priority to CN202011534160.6A
Publication of CN114661706A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Clickhouse data writing plug-in method based on jlogstash, which comprises the following steps: classifying data according to a specified keyword, with each class of data corresponding to a table of the Clickhouse database; judging whether the database and the data table exist in Clickhouse and, if not, creating them; verifying the fields of the data to be written and, if a new field is found, triggering a field addition operation on the Clickhouse table; having the producer thread deliver the verified data to the corresponding processor thread, where it waits to be consumed; performing a health check on the Clickhouse cluster after each round of consumption by the consumer thread; and, when the consumer thread consumes data, starting the Clickhouse write flow under two limits, time and count. The invention automatically modifies table fields when fields are added or removed, effectively improves the speed of the data writing process, and supports the writing of irregular data.

Description

Clickhouse data writing plug-in method based on jlogstash
Technical Field
The invention relates to the technical field of data writing plug-ins, and in particular to a Clickhouse data writing plug-in method based on jlogstash.
Background
Real-time collection and analysis of logs and performance data is an important means of understanding the operating state of a company's business and of locating and analyzing faults, and among the many implementations in use today the ELK scheme is the most common. The ELK scheme uses Logstash, Elasticsearch, and Kibana as its technology stack to collect, analyze, and display data. The applicant has developed its own jlogstash framework based on the open-source Logstash project, using Java as the development language and improving performance to roughly five times that of the original (Ruby) version.
Besides being displayed, the collected data also needs to be analyzed and processed, and here the performance of Elasticsearch becomes a bottleneck, so the OLAP database Clickhouse was introduced. Clickhouse is a column-oriented database that uses locally attached storage as its storage scheme, which greatly improves IO performance, and it uses SQL as its query language, which greatly reduces the cost of adoption. At the same time, Clickhouse natively supports distributed deployment, including a high-availability scheme of shards and replicas, which guarantees linear cluster scaling and cluster stability during operation.
The Clickhouse write plug-ins currently on the market only support writing data after the user has created the database and data tables in advance; they support writing to a single database table only, not to multiple tables at the same time; the written data format is fixed; and the table structure cannot be changed once the write plug-in has started. No effective solution to these technical problems has been proposed so far.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a Clickhouse data writing plug-in method based on jlogstash to overcome the above technical problems. During the writing process, the field names and field types of each piece of data are examined; if fields have been added or removed, the table fields are modified automatically without interrupting the overall data writing process, so irregular data can be written without restarting the write plug-in.
In order to achieve the above object, the present invention provides a Clickhouse data writing plug-in method based on jlogstash, comprising the following steps:
(1) classifying data according to a specified keyword, with each class of data corresponding to a table of the Clickhouse database;
(2) judging whether the database and the data table exist in Clickhouse and, if not, creating them; related table information is prepared in advance, and if the keyword has corresponding information in the preset data, that preset information is read and used for the creation;
(3) performing field verification on the data to be written: the table field information of the database is cached locally, each record is verified field by field, and if a new field is found, a Clickhouse table field addition operation is triggered;
(4) the producer thread delivers the verified data to the corresponding processor thread, where it waits to be consumed by that processor thread;
(5) after each round of consumption, the consumer thread performs a health check on the Clickhouse cluster; if a node in the cluster is found to be offline, writing to that node is stopped, and no data is written to it until it returns to normal;
(6) when the consumer thread consumes data, it is limited by both time and count, and whichever limit is triggered first starts the Clickhouse write flow.
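Step (6) above is effectively a flush trigger with two limits. The following minimal Java sketch shows one way such a time-or-count trigger could be arranged; the class and parameter names (BatchConsumer, maxBatchSize, flushIntervalMs) and the placeholder batch-write method are illustrative assumptions rather than the plug-in's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Minimal sketch of the time-or-count flush trigger described in step (6). */
public class BatchConsumer implements Runnable {
    private final BlockingQueue<Map<String, Object>> queue;   // filled by the producer thread
    private final int maxBatchSize;                           // count limit (assumed name)
    private final long flushIntervalMs;                       // time limit (assumed name)
    private final List<Map<String, Object>> buffer = new ArrayList<>();

    public BatchConsumer(BlockingQueue<Map<String, Object>> queue,
                         int maxBatchSize, long flushIntervalMs) {
        this.queue = queue;
        this.maxBatchSize = maxBatchSize;
        this.flushIntervalMs = flushIntervalMs;
    }

    @Override
    public void run() {
        long lastFlush = System.currentTimeMillis();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Wait for the next record, but never longer than the remaining time window.
                long remaining = flushIntervalMs - (System.currentTimeMillis() - lastFlush);
                Map<String, Object> record = queue.poll(Math.max(remaining, 1), TimeUnit.MILLISECONDS);
                if (record != null) {
                    buffer.add(record);
                }
                boolean countReached = buffer.size() >= maxBatchSize;
                boolean timeReached = System.currentTimeMillis() - lastFlush >= flushIntervalMs;
                if (!buffer.isEmpty() && (countReached || timeReached)) {
                    writeBatchToClickhouse(buffer);   // whichever limit fires first starts the write
                    buffer.clear();
                    lastFlush = System.currentTimeMillis();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void writeBatchToClickhouse(List<Map<String, Object>> batch) {
        // Placeholder: in the real plug-in this would issue a batch INSERT into the local table.
    }
}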
In a preferred embodiment of the present invention, during the creation of the database and the data table, data writing to the Clickhouse database is paused until the creation thread has completed and returned a success message, after which writing continues.
In a preferred embodiment of the present invention, while table fields are being added to the Clickhouse database, data verification is paused until the modification thread has completed and returned a success message, after which verification continues.
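The two pause-and-resume behaviours described above could, for instance, be implemented with a read-write lock: ordinary inserts share the read lock, while the thread that creates a table or adds a column takes the write lock, so all inserts block until the DDL statement has returned success. The sketch below is only an illustration under that assumption; the class name SchemaGate and its methods are invented for this example.

import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Sketch of the "pause writing while DDL runs" behaviour. Normal inserts share the read
 * lock; the creation/modification thread takes the write lock, so inserts block until
 * the DDL statement has completed and returned success.
 */
public class SchemaGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Called around every normal data write. */
    public void withWritePermit(Runnable insert) {
        lock.readLock().lock();
        try {
            insert.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Called by the thread that creates a database/table or adds a column. */
    public void withSchemaChange(Runnable ddl) {
        lock.writeLock().lock();
        try {
            ddl.run();          // e.g. CREATE TABLE ... or ALTER TABLE ... ADD COLUMN
        } finally {
            lock.writeLock().unlock();
        }
    }
}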
Compared with the prior art, the invention has the following technical effects:
(1) A preset table structure is provided: during the initialization stage of the writer, a built-in file is read to load preset table information, each entry of which corresponds to the table name, table structure, engine, and settings of one table. When data from a new user enters the data link, the writer suspends all writing work for that new user, completes the database and table creation in Clickhouse according to the preset table information, and then releases the write lock so that data starts to be written normally;
(2) Multi-table writing is provided: unlike many open-source Clickhouse writers, which set the target table name in a configuration item, the business needs to separate data according to its data set and write it into different tables. For this situation, a separate thread is opened for each table, with each thread responsible for the data writing work of one table, achieving thread-level isolation and avoiding data confusion. The main thread receives data uniformly and distributes it to the different write threads through queues; the write threads consume their queues at a fixed interval and in fixed quantities and then write the data into Clickhouse in batches. Because Clickhouse data writing is batch-based, the writer's internal data-flow queue supports dual control by time and data count: when the configured condition is reached the data is consumed, and until then the current queue blocks. Data is written directly into the local table, which, compared with writing into the distributed table, effectively avoids the consistency problem of data written through the distributed table;
(3) A dynamic schema (dynamic table structure) is provided: before each piece of data is written into the processing queue, a field verification step is performed, in which every field of each record is checked against the locally stored table structure. When a field not present in the cache is found, the Clickhouse table structure is fetched again to avoid the cache inconsistency caused by multiple jlogstash nodes and multiple threads, and the check is repeated. If the second check still identifies a new field, the current flow is paused and a table modification action is triggered to add the new field to the current table; after the modification is complete, the verification of the current record continues, and once all checks have passed the data is handed to the table's processor thread. The method therefore has high practical and popularization value.
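As a rough illustration of effect (3), the sketch below checks an incoming field against a locally cached table schema, reloads the schema from Clickhouse once on a miss (to rule out a stale cache across jlogstash nodes and threads), and only then issues an ALTER TABLE ... ADD COLUMN. It uses the standard java.sql API over a Clickhouse JDBC connection; the class name DynamicSchema, the type argument, and the helper method are assumptions made for this example, not the plug-in's actual code.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the dynamic-schema check: refresh the cache once before altering the table. */
public class DynamicSchema {
    private final Connection conn;                                   // Clickhouse JDBC connection
    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();

    public DynamicSchema(Connection conn) {
        this.conn = conn;
    }

    public synchronized void ensureColumn(String table, String field, String clickhouseType)
            throws SQLException {
        Set<String> columns = cache.computeIfAbsent(table, this::loadColumns);
        if (columns.contains(field)) {
            return;                                                  // field already known locally
        }
        // First miss: reload from Clickhouse in case another node already added the column.
        columns = loadColumns(table);
        cache.put(table, columns);
        if (columns.contains(field)) {
            return;
        }
        // Still unknown after the second check: pause the flow and extend the table.
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE " + table + " ADD COLUMN `" + field + "` " + clickhouseType);
        }
        columns.add(field);
    }

    private Set<String> loadColumns(String table) {
        Set<String> cols = ConcurrentHashMap.newKeySet();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("DESCRIBE TABLE " + table)) {
            while (rs.next()) {
                cols.add(rs.getString(1));                           // first column is the field name
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        return cols;
    }
}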
Drawings
Fig. 1 is a schematic flowchart of embodiment 1 of the jlogstash-based Clickhouse data writing plug-in method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of embodiment 2 of the jlogstash-based Clickhouse data writing plug-in method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments only illustrate the basic idea of the invention; they show only the components related to the invention rather than the number, shape, and size of the components in an actual implementation, where the type, quantity, and proportion of components may vary and the layout may be more complicated.
Some exemplary embodiments of the invention have been described for illustrative purposes, and it is to be understood that the invention may be practiced otherwise than as specifically described.
In this embodiment, a Clickhouse data writing plug-in method based on jlogstash includes the following steps:
(1) classifying data according to a specified keyword, with each class of data corresponding to a table of the Clickhouse database;
(2) judging whether the database and the data table exist in Clickhouse and, if not, creating them; related table information is prepared in advance, and if the keyword has corresponding information in the preset data, that preset information is read and used for the creation;
(3) performing field verification on the data to be written: the table field information of the database is cached locally, each record is verified field by field, and if a new field is found, a Clickhouse table field addition operation is triggered;
(4) the producer thread delivers the verified data to the corresponding processor thread, where it waits to be consumed by that processor thread;
(5) after each round of consumption, the consumer thread performs a health check on the Clickhouse cluster; if a node in the cluster is found to be offline, writing to that node is stopped, and no data is written to it until it returns to normal (the keyword routing and health check of steps (1), (4) and (5) are sketched after the embodiment notes below);
(6) when the consumer thread consumes data, it is limited by both time and count, and whichever limit is triggered first starts the Clickhouse write flow; this effectively avoids the situation where frequent writes of small data volumes put excessive merge pressure on the Clickhouse cluster and cause the write speed to drop sharply.
In some embodiments, during the creation of the database and the data table, data writing to the Clickhouse database is paused until the creation thread has completed and returned a success message, after which writing continues.
In some embodiments, while table fields are being added to the Clickhouse database, data verification is paused until the modification thread has completed and returned a success message, after which verification continues.
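To make steps (1), (4) and (5) more concrete, the sketch below routes each verified record to a per-table queue keyed by the specified keyword and probes a node with a trivial query before writing to it. The class names, the routing-key field, and the SELECT 1 health probe are illustrative assumptions; a Clickhouse JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of steps (1), (4) and (5): keyword routing to per-table queues plus a node health probe. */
public class TableRouter {
    // One queue (and, in the plug-in, one processor thread) per target table.
    private final Map<String, BlockingQueue<Map<String, Object>>> queues = new ConcurrentHashMap<>();
    private final String routingKey;   // the "specified keyword" field, e.g. "dataSetName" (assumed)

    public TableRouter(String routingKey) {
        this.routingKey = routingKey;
    }

    /** Producer side, steps (1) and (4): classify by keyword and hand over to that table's queue. */
    public void dispatch(Map<String, Object> record) throws InterruptedException {
        String table = String.valueOf(record.get(routingKey));
        queues.computeIfAbsent(table, t -> new LinkedBlockingQueue<>()).put(record);
    }

    /** Consumer side, step (5): probe a node before writing and treat it as offline on failure. */
    public static boolean isNodeHealthy(String jdbcUrl) {
        try (Connection c = DriverManager.getConnection(jdbcUrl);
             Statement st = c.createStatement()) {
            st.execute("SELECT 1");
            return true;
        } catch (SQLException e) {
            return false;   // node is skipped; no data is written to it until it recovers
        }
    }
}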
Example 1
As shown in fig. 1, this example includes a jlogstash core runtime module, an input module, a filter module, an output plug-in module, and a Clickhouse cluster module. The core runtime module ensures that the project starts and runs; the plug-in modules customize how data is processed as it flows through, implementing the ETL process; and the Clickhouse data storage server is responsible for receiving the logs written by the output.clickhouse plug-in and persisting them to disk for later use.
The method comprises the following specific steps:
(1) upload the jlogstash core runtime package;
(2) set up the jlogstash run script;
(3) upload the output.clickhouse plug-in package.
Example 2
As shown in fig. 2, using the write plug-in of the present invention to write data into Clickhouse automatically and intelligently involves the following specific steps:
(1) upload the plug-in package to the server where jlogstash is located and place it in the specified location as required;
(2) modify the yaml configuration file to set the cluster name, cluster addresses, write interval, and other information;
(3) modify the yaml configuration file to set the table engine, partition information, table name, table fields, and other information (a configuration-loading sketch follows step (4));
(4) start jlogstash through the script and start the task.
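Steps (2) and (3) above configure the plug-in through yaml files. Under the assumption that those files are parsed with a standard YAML library such as SnakeYAML, a loading sketch might look as follows; the file names and keys shown are hypothetical and do not reflect the plug-in's real configuration schema.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

/** Sketch: reading the two yaml configuration files mentioned in steps (2) and (3). */
public class PluginConfigLoader {

    @SuppressWarnings("unchecked")
    public static Map<String, Object> load(Path yamlFile) throws Exception {
        try (InputStream in = Files.newInputStream(yamlFile)) {
            return (Map<String, Object>) new Yaml().load(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // File names and keys below are hypothetical, not the plug-in's real schema.
        Map<String, Object> cluster = load(Path.of("clickhouse-cluster.yaml"));
        Map<String, Object> tables  = load(Path.of("clickhouse-tables.yaml"));
        System.out.println("cluster address: " + cluster.get("clusterAddress"));
        System.out.println("write interval : " + cluster.get("writeIntervalMs"));
        System.out.println("table presets  : " + tables.get("tables"));
    }
}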
The invention has the beneficial effects that:
(1) Preset table structure: during the initialization stage of the writer, a built-in file is read to load preset table information, each entry of which corresponds to the table name, table structure, engine, settings, and other information of one table. When data from a new user enters the data link, the writer suspends all writing work for that new user and completes the database and table creation work in Clickhouse according to the read preset table information; after this work is finished, the write lock is released and the data starts to be written normally;
For example, suppose a data set has three fields a, b, and c, but the first piece of data arriving on the data link contains only fields a and b, while the second contains only fields a and c. This causes a serious problem: when the first piece of data arrives, the Clickhouse table is created according to its data fields, but when the second piece arrives a new field appears, and an ADD COLUMN operation must then be performed on the table structure.
(2) Multi-table writing: unlike many open-source Clickhouse writers, which set the target table name in a configuration item, the business needs to separate data by data set and write it into different tables; a separate thread is therefore opened for each table, achieving thread-level isolation and avoiding data confusion. The main thread receives data uniformly and distributes it to the different write threads through queues; the write threads consume their queues at a fixed interval and in fixed quantities and then write the data into Clickhouse in batches. Because Clickhouse data writing is batch-based, the number of Parts is reduced, which greatly decreases Clickhouse's background asynchronous Merge operations and avoids the "Too many parts" exception. The writer's internal data-flow queue supports dual control by time and data count: when the configured condition is met the data is consumed, and until then the current queue blocks. Data is written directly into the local table, which, compared with writing into the distributed table, effectively avoids the consistency problem of data written through the distributed table; that consistency problem cannot be perceived during writing, because the process of data moving from the distributed table down to the local tables is completed by the Clickhouse internals and is not directly connected to the write operation, so the writer cannot tell whether the data it wrote has actually landed in Clickhouse.
(3) Dynamic Schema (dynamic table structure): before each piece of data is written into the processing queue, a field verification step is performed, in which every field of each record is checked against the locally stored table structure. When a field not present in the cache is found, the Clickhouse table structure is fetched once more to avoid the cache inconsistency caused by multiple jlogstash nodes and multiple threads, and the check is repeated. If the second check still identifies the field as new, the current flow is paused and a table modification action is triggered, that is, a new table field is added to the current table; after the modification is complete, the verification of the current record continues, and once all checks have passed the data is delivered to the table's processor thread;
For example, a piece of data has two tags and its original storage format is {"a1":111, "b1":222}; during writing, the tags are flattened and processed into the following format:
[flattened-format example shown as an image in the original filing, Figure RE-GDA0002982291110000101]
If an empty value is encountered, it is automatically filled with null.
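A minimal sketch of this flattening, assuming the tags arrive as a nested map under a key named "tags" and are expanded into prefixed columns: the helper below copies the record, expands each known tag column, and leaves null where a record does not carry that tag. The key name, the "tag." prefix, and the class name are illustrative assumptions.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch of flattening a record's tag map into columns, filling absent tags with null. */
public class TagFlattener {

    @SuppressWarnings("unchecked")
    public static Map<String, Object> flatten(Map<String, Object> record,
                                              Set<String> knownTagColumns) {
        Map<String, Object> flat = new LinkedHashMap<>(record);
        Object tags = flat.remove("tags");                        // e.g. {"a1":111, "b1":222}
        Map<String, Object> tagMap = (tags instanceof Map)
                ? new HashMap<>((Map<String, Object>) tags)
                : new HashMap<>();
        for (String column : knownTagColumns) {                   // every tag column the table has
            flat.put("tag." + column, tagMap.get(column));        // absent tags become null
        }
        return flat;
    }
}

With known tag columns a1, b1, and c1 and the record above, the flattened output would carry tag.a1 = 111, tag.b1 = 222, and tag.c1 = null.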
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical idea of the invention shall be covered by the claims of the present invention.

Claims (3)

1. A Clickhouse data writing plug-in method based on jlogstash, characterized by comprising the following steps:
(1) classifying data according to a specified keyword, with each class of data corresponding to a table of the Clickhouse database;
(2) judging whether the database and the data table exist in Clickhouse and, if not, creating them; related table information is prepared in advance, and if the keyword has corresponding information in the preset data, that preset information is read and used for the creation;
(3) performing field verification on the data to be written: the table field information of the database is cached locally, each record is verified field by field, and if a new field is found, a Clickhouse table field addition operation is triggered;
(4) the producer thread delivers the verified data to the corresponding processor thread, where it waits to be consumed by that processor thread;
(5) after each round of consumption, the consumer thread performs a health check on the Clickhouse cluster; if a node in the cluster is found to be offline, writing to that node is stopped, and no data is written to it until it returns to normal;
(6) when the consumer thread consumes data, it is limited by both time and count, and whichever limit is triggered first starts the Clickhouse write flow.
2. The method as claimed in claim 1, wherein, during the creation of the database and the data table, data writing to the Clickhouse database is paused until the creation thread has completed and returned a success message, after which writing continues.
3. The method as claimed in claim 1, wherein, while table fields are being added to the Clickhouse database, data verification is paused until the modification thread has completed and returned a success message, after which verification continues.
CN202011534160.6A 2020-12-23 2020-12-23 Clickhouse data writing plug-in method based on jlogstash Pending CN114661706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011534160.6A CN114661706A (en) 2020-12-23 2020-12-23 Clickhouse data writing plug-in method based on jlogstash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011534160.6A CN114661706A (en) 2020-12-23 2020-12-23 Clickhouse data writing plug-in method based on jlogstash

Publications (1)

Publication Number Publication Date
CN114661706A true CN114661706A (en) 2022-06-24

Family

ID=82025554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011534160.6A Pending CN114661706A (en) 2020-12-23 2020-12-23 Clickhouse data writing plug-in method based on jlogstash

Country Status (1)

Country Link
CN (1) CN114661706A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149969A (en) * 2023-04-04 2023-05-23 湖南中青能科技有限公司 Database model matching anomaly monitoring and processing method
CN116149969B (en) * 2023-04-04 2023-06-20 湖南中青能科技有限公司 Database model matching anomaly monitoring and processing method

Similar Documents

Publication Publication Date Title
US9792327B2 (en) Self-described query execution in a massively parallel SQL execution engine
US8099631B2 (en) Call-stacks representation for easier analysis of thread dump
US11526465B2 (en) Generating hash trees for database schemas
US10198346B1 (en) Test framework for applications using journal-based databases
CN111858759B (en) HTAP database system based on consensus algorithm
US20100169289A1 (en) Two Phase Commit With Grid Elements
US10133767B1 (en) Materialization strategies in journal-based databases
CN112860777B (en) Data processing method, device and equipment
US20110202564A1 (en) Data store switching apparatus, data store switching method, and non-transitory computer readable storage medium
US9390111B2 (en) Database insert with deferred materialization
Matallah et al. Experimental comparative study of NoSQL databases: HBASE versus MongoDB by YCSB
CN114661706A (en) Clickhouse data writing plug-in method based on jlogstack
CN117171108B (en) Virtual model mapping method and system
Qian et al. An evaluation of Lucene for keywords search in large-scale short text storage
CN113962597A (en) Data analysis method and device, electronic equipment and storage medium
US20120323946A1 (en) Data framework to enable rich processing of data from any arbitrary data source
CN113918535A (en) Data reading method, device, equipment and storage medium
Raman et al. BoDS: A benchmark on data sortedness
CN107622123B (en) ASM file system-oriented file analysis method
US6963957B1 (en) Memory paging based on memory pressure and probability of use of pages
CN108769137A (en) Distributed structure/architecture data storing and reading method and device based on multigroup framework
CN111459931A (en) Data duplication checking method and data duplication checking device
Harter Emergent properties in modular storage: A study of apple desktop applications, Facebook messages, and docker containers
Kumarasinghe et al. Performance comparison of NoSQL databases in pseudo distributed mode: Cassandra, MongoDB & Redis
Sandberg High performance querying of time series market data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination