CN112364019A - Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source - Google Patents

Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Info

Publication number
CN112364019A
Authority
CN
China
Prior art keywords
data
clickhouse
type
spark
dataframe
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011214153.8A
Other languages
Chinese (zh)
Other versions
CN112364019B (en)
Inventor
周朝卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongying Youchuang Information Technology Co Ltd
Original Assignee
Zhongying Youchuang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongying Youchuang Information Technology Co Ltd filed Critical Zhongying Youchuang Information Technology Co Ltd
Priority to CN202011214153.8A priority Critical patent/CN112364019B/en
Publication of CN112364019A publication Critical patent/CN112364019A/en
Application granted granted Critical
Publication of CN112364019B publication Critical patent/CN112364019B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for fast data writing into ClickHouse through a custom Spark data source. A ClickHouse data source is customized on the basis of Spark's external data source interface, and a general-purpose write function is realized: Spark data can be written into ClickHouse quickly by calling a single statement, no code customized to the service requirement needs to be developed, data write efficiency is improved significantly, and balanced writing of the data is guaranteed.

Description

Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source
Technical Field
The invention relates to the field of computer technology, and in particular to a method and a device for fast data writing into ClickHouse through a custom Spark data source.
Background
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark can read data in a distributed manner, apply various transformations and processing to the data, and write the results to the target storage in a distributed manner. ClickHouse is a columnar database management system for the real-time analysis of big data. Through vectorized execution and low-level CPU SIMD instructions, it processes massive data in parallel and thereby accelerates data processing.
At present, writing data from Spark into a ClickHouse table has two problems. The first is a low degree of function reuse: a customized program has to be developed for each requirement, the code has to be adjusted whenever a service is added or changed, and maintenance grows ever more complex as the customized programs accumulate. Specifically, the data to be written (for example, a file or another database) are read with Spark and converted into a DataFrame. For each field of the DataFrame, its data type has to be compared with the corresponding field of the ClickHouse table and converted into a type compatible with that table. During the actual write, business logic such as type conversion has to be applied to every field of every record. If the ClickHouse table has a particularly large number of columns (since ClickHouse is a columnar store, tables with hundreds or thousands of columns are perfectly normal in a production environment), the code becomes very lengthy and error-prone, because every column requires its own field handling. Because the DataFrame fields and schema differ between services, any other service that needs to write data into ClickHouse can only redevelop the write code from scratch. If the requirements change, for example the data type of a ClickHouse field changes or a new field is added to the table, the code has to be modified. And when a DataFrame field value is empty, or a field of the ClickHouse table is absent from the DataFrame, the write fails.
The second problem concerns balanced writing: to guarantee the stability and performance of the import, data should be imported through the local tables of ClickHouse. The local tables, however, are distributed over the nodes of the cluster, while Spark's DataFrame is a distributed data set. Writing through the distributed table instead raises many problems: there is no consistency check of the data, data are easily lost when the node hosting the distributed table goes down, and heavy pressure is put on ZooKeeper. At present there is no way to write data uniformly to each local node of ClickHouse, that is, balanced storage of the data cannot be guaranteed.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The invention aims to customize a ClickHouse data source on the basis of Spark's external data source interface, realize a general-purpose function for writing into ClickHouse, enable Spark data to be written into ClickHouse by calling a single statement, ensure that the data are stored evenly on the local nodes of ClickHouse, and remove the need to develop code customized to the service requirement.
To achieve this aim, the following scheme is provided:
A method for fast data writing into ClickHouse through a custom Spark data source comprises the following steps:
1. creating a ClickHouse data source capable of balanced writing by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
2. releasing the implemented code as a binary jar package, which is referenced each time data are written into ClickHouse;
3. converting the data to be written into a data set of Spark's DataFrame type and calling a specific statement to complete the fast write into ClickHouse.
Further, the data source related interfaces and classes RelationProvider, CreatableRelationProvider and DataSourceRegister of Spark are inherited, and the data-import algorithms shortName and createRelation of the inherited interfaces and classes are redefined. The shortName algorithm returns a character string as the data source type; in the createRelation algorithm, the sqlContext parameter indicates the SQL context variable, the mode parameter indicates the data storage mode, the parameters parameter carries the externally supplied custom parameters, and the data parameter indicates the data to be stored.
Further, the specific statement is:
df.write
.format("clickhouse")
.mode("overwrite")
.option("table","OUTPUT_TABLE")
.option("cluster","cluster001")
.option("host","192.168.118.22")
.save()
Further, the createRelation algorithm determines the shard information contained in the cluster by reading the metadata system table of ClickHouse, and writes Spark's DataFrame data of different partitions evenly into the shards of ClickHouse according to a specified strategy.
Further, the createRelation algorithm obtains the schemas of ClickHouse and of the DataFrame respectively, automatically adapts the data types and default values, and writes the data evenly into the local nodes corresponding to the partitions.
Further, the aforesaid writing of Spark's DataFrame data of different partitions evenly into the shards of ClickHouse according to a specified strategy comprises:
1) obtaining the hosts where the shards of the cluster are located and the replicas of the shards;
2) calculating the total number of partitions of the Spark task;
3) calling the foreachPartition method of the DataFrame to write the data into the shards;
4) obtaining the partition number inside each partition, and determining the target shard of ClickHouse by polling or at random according to that number;
5) connecting, according to the determined shard number, to the one or more hosts corresponding to that shard, calling the BalancedClickhouseDataSource method to select one host at random, and performing the balanced write of the data.
Further, the automatic adaptation of the data types and default values and the balanced writing of the data into the local nodes corresponding to the partitions comprise the following steps:
1) Obtaining the schema of the DataFrame
The schema information of the DataFrame is obtained through df.schema.
2) Obtaining the schema of ClickHouse
Connect to the ClickHouse database and execute the command DESC <table name> programmatically to obtain three pieces of information: the field name, the data type of the field, and the default value of the field.
3) Connecting to ClickHouse and creating a Statement object
4) Traversing the data of the DataFrame
5) Traversing each column of each record according to the DataFrame schema obtained in step 1).
6) If the value of the column is empty, or the column exists in ClickHouse but is absent from the DataFrame, continue from step 7) below; if the column has a value, skip to step 9).
7) If a default value was set when the ClickHouse table was built, that default value is used.
8) If no default value was set when the ClickHouse table was built, the default value of the ClickHouse data type is used.
9) Performing the data type conversion according to the rules
10) Adding the processed data to the Statement object of JDBC
11) Judging the number of data records added to the Statement object
Further, the rules of the data type conversion are as follows:
1) When the field type of ClickHouse is Int8, Int16, Int32, UInt8 or UInt16, the data type of the Spark DataFrame is converted to the Int type.
2) When the field type of ClickHouse is UInt32, Int64 or UInt64, the data type of the Spark DataFrame is converted to the Long type.
3) When the field type of ClickHouse is Float32, the data type of the Spark DataFrame is converted to the Float type.
4) When the field type of ClickHouse is Float64, the data type of the Spark DataFrame is converted to the Double type.
5) When the field type of ClickHouse is a Decimal type, the data type of the Spark DataFrame is converted to the Decimal type.
6) When the field type of ClickHouse is an Array type, the data type of the Spark DataFrame is converted to the List type.
7) When the field type of ClickHouse is a LowCardinality data type, the data type of the Spark DataFrame is converted to a character string type.
8) When the field type of ClickHouse is any other data type, the data type of the Spark DataFrame is converted to a character string type.
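These rules can be expressed compactly. The following Scala sketch is illustrative only: targetSparkType is a hypothetical helper name, and parameterized ClickHouse types such as Decimal(P, S), Array(T) and LowCardinality(T) are matched by prefix:
// Illustrative sketch of the conversion rules above; not the literal implementation.
def targetSparkType(chType: String): String = chType match {
  case "Int8" | "Int16" | "Int32" | "UInt8" | "UInt16" => "Int"
  case "UInt32" | "Int64" | "UInt64"                   => "Long"
  case "Float32"                                       => "Float"
  case "Float64"                                       => "Double"
  case t if t.startsWith("Decimal")                    => "Decimal"
  case t if t.startsWith("Array")                      => "List"
  case t if t.startsWith("LowCardinality")             => "String"
  case _                                               => "String" // all other types fall back to string
}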
Further, a threshold can be set for the number of data records in the Statement object: when the number of records is greater than or equal to the threshold, the data in the Statement object are written into ClickHouse in batch; if the number of records is smaller than the threshold, traversal of the data in the DataFrame continues.
The invention also provides a device for fast data writing into ClickHouse through a custom Spark data source, comprising a creating module, a packaging module and a calling module.
In the creating module, a ClickHouse data source is created by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
in the packaging module, the code implementing the ClickHouse data source generated by the creating module is released as a specific binary jar package;
in the calling module, the jar package generated by the packaging module is referenced in any project that needs to write data into ClickHouse, the data to be written are converted into a data set of Spark's DataFrame type, and a specific statement is called to complete the fast write into ClickHouse.
By customizing a ClickHouse data source for Spark, the method and the device automatically adapt the data types between the ClickHouse fields and the Spark DataFrame fields, automatically replace empty values with the ClickHouse default values, and write the Spark data to the local nodes of ClickHouse by polling or at random during the write, thereby realizing a general-purpose function for writing Spark data into ClickHouse. Afterwards only one statement needs to be called to write Spark data into ClickHouse, the balanced storage of the data on the local nodes of ClickHouse is ensured, and no code customized to the service requirement needs to be developed. Data write efficiency is improved significantly, and at the same time the data are guaranteed to be written evenly to every node in the cluster.
Drawings
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
FIG. 1 shows the balanced data writing method of the method and device for fast data writing into ClickHouse through a custom Spark data source according to an embodiment of the present invention;
FIG. 2 shows the Spark partition data writing process of the method and device for fast data writing into ClickHouse through a custom Spark data source according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The terms used in the embodiments are explained as follows:
DataFrame: an immutable distributed data collection in Spark that contains both the data and the corresponding schema information; it is analogous to a database table, e.g. the field names of the DataFrame schema correspond to the column names of a database table.
First, the data source related interfaces and classes of Spark need to be inherited and their algorithms redefined: the createRelation and shortName algorithms of RelationProvider, CreatableRelationProvider and DataSourceRegister. The return type of the shortName algorithm is a character string that defines the name of the data source; here it only needs to return the string "clickhouse".
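For orientation, a minimal skeleton of such a data source is sketched below, assuming the Spark 2.x DataSource V1 interfaces; the class name ClickHouseSource and the stub bodies are illustrative, not the literal implementation of the invention:
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider}

class ClickHouseSource extends RelationProvider with CreatableRelationProvider with DataSourceRegister {
  // shortName defines the name used in .format("clickhouse")
  override def shortName(): String = "clickhouse"

  // read path (not the focus of this description)
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = ???

  // write path: mode is the save mode, parameters carries the user options
  // (table, cluster, host), data is the DataFrame to be saved
  override def createRelation(sqlContext: SQLContext, mode: SaveMode,
      parameters: Map[String, String], data: DataFrame): BaseRelation = {
    // 1) read ClickHouse metadata; 2) adapt types and defaults;
    // 3) write each partition evenly to a shard (described below)
    ???
  }
}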
The createRelation algorithm used for data source writes has the following parameters:
sqlContext: SQLContext, Spark's SQL context variable.
mode: SaveMode, the mode for saving the data, e.g. overwrite or append.
parameters: Map[String, String], the externally supplied custom parameters.
data: DataFrame, the data to be saved.
The createRelation implementation proceeds as follows:
By reading the system data of ClickHouse, it is determined which fields the table contains, the data types of the fields and their default values, as well as which nodes the cluster contains and the address/port information of those nodes. Based on this information, the data types and default values are adapted automatically, and the data are written evenly into the local nodes of ClickHouse.
The balanced writing of the data is realized as follows: first, the number of partitions of the DataFrame to be written is determined (a partition is the smallest unit of data contained in a DataFrame); then the partitions are traversed and each partition is assigned to a certain shard of the cluster, realizing load balancing among the shards. When the data are written into a shard, the shard corresponds to several replicas, so writing to any one replica under the shard suffices, and load balancing among the replicas can be ensured. The specific steps are as follows:
step 1: looking up system table systems, clusters, and determining fragmentation information contained in a cluster, specifically including: the host where the shards of the cluster are located, and the copy of the shards. Multiple copies of the same tile are aggregated together, e.g., to form the following data:
slicing numbering host (multiple hosts separated by comma)
1Ch001,ch002
2Ch003,ch004
Numbering of segments Main unit (multiple main units separated by comma)
1 Ch001,ch002
2 Ch003,ch004
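A hedged JDBC sketch of this lookup follows; the URL, the cluster name cluster001 and the grouping variable are illustrative, while shard_num, host_address and port are columns of ClickHouse's system.clusters table:
import java.sql.DriverManager

// read the shard/replica layout of the cluster from system.clusters
val conn = DriverManager.getConnection("jdbc:clickhouse://192.168.118.22:8123/default")
val rs = conn.createStatement().executeQuery(
  "SELECT shard_num, host_address, port FROM system.clusters WHERE cluster = 'cluster001'")
var shards = Map.empty[Int, List[String]] // shard number -> replica hosts
while (rs.next()) {
  val num  = rs.getInt("shard_num")
  val host = rs.getString("host_address") + ":" + rs.getInt("port")
  shards += (num -> (host :: shards.getOrElse(num, Nil)))
}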
Step 2: calculating the total partition number of Spark task
The calculation method comprises the following steps: rdd. getnumcolours
Where df is the DataFrame dataset variable.
And step 3: invoking the foreachpart method of the DataFrame to write the data into the fragment
Foreachpartion, the following operations are invoked inside each partition.
1) Obtain the number of the partition
val partID=TaskContext.get().partitionId()
2) Determine the ClickHouse shard from the partition number
The shard can be determined by polling or at random.
For example, using polling: suppose the total number of ClickHouse shards is 5. Then the data of the partition with partition number 3 are written to shard 3 (the remainder of 3 divided by 5), and the partition with partition number 6 is written to shard 1 (the remainder of 6 divided by 5).
3) Execute the writing of the data
According to the determined shard number, connect to the one or more hosts corresponding to that shard and execute the write.
Since one shard number corresponds to one or more hosts, the BalancedClickhouseDataSource method can be called to select one host at random for the write, thereby achieving load balancing among the several hosts under the same shard.
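A minimal sketch of this call, assuming the BalancedClickhouseDataSource class of the ru.yandex.clickhouse JDBC driver and illustrative host names:
import ru.yandex.clickhouse.BalancedClickhouseDataSource

// all replica hosts of the chosen shard in one JDBC URL (illustrative hosts)
val ds = new BalancedClickhouseDataSource("jdbc:clickhouse://ch001:8123,ch002:8123/default")
val shardConn = ds.getConnection // a random healthy host of the shard is selected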
Calling mode:
df.foreachPartition(iter => {
  // obtain the number of the partition
  val partID = TaskContext.get().partitionId()
  // determine the ClickHouse shard from the partition number, e.g. by polling:
  // partID modulo the total number of shards
  // connect to ClickHouse according to the determined shard information and write the data
})
Inside each partition, the following operations are performed for the data write:
Read the system data of ClickHouse to determine which fields the table contains, the data types of the fields and their default values; based on this information, automatically adapt the data types and default values and write the data evenly into the local node corresponding to the partition. The specific steps are as follows:
(1) Obtain the schema of the DataFrame
The schema information of the DataFrame is obtained through df.schema.
(2) Obtain the schema of ClickHouse
Connect to the ClickHouse database and execute the command DESC <table name> programmatically to obtain three pieces of information: the field name, the data type of the field, and the default value of the field.
(3) Connect to ClickHouse and create a Statement object
(4) Traverse the data of the DataFrame
(5) Traverse each column of each record according to the DataFrame schema obtained in step (1).
(6) If the value of the column is empty, or the column exists in ClickHouse but is absent from the DataFrame, proceed from step (7) below; if the column has a value, skip to step (9).
(7) If a default value was set when the ClickHouse table was built, that default value is used.
(8) If no default value was set when the ClickHouse table was built, the default value of the ClickHouse data type is used.
For example, the numeric default is 0, the character string default is null, the date type default is "1970-01-01 00:00:00", the array type default is [], and so on.
(9) Data type conversion
The DataFrame data types are aligned with the data types of the ClickHouse table.
1) When the field type of ClickHouse is Int8, Int16, Int32, UInt8 or UInt16, the data type of the Spark DataFrame is converted to the Int type.
2) When the field type of ClickHouse is UInt32, Int64 or UInt64, the data type of the Spark DataFrame is converted to the Long type.
3) When the field type of ClickHouse is Float32, the data type of the Spark DataFrame is converted to the Float type.
4) When the field type of ClickHouse is Float64, the data type of the Spark DataFrame is converted to the Double type.
5) When the field type of ClickHouse is a Decimal type, the data type of the Spark DataFrame is converted to the Decimal type.
6) When the field type of ClickHouse is an Array type, the data type of the Spark DataFrame is converted to the List type.
7) When the field type of ClickHouse is a LowCardinality data type, the data type of the Spark DataFrame is converted to a character string type.
8) When the field type of ClickHouse is any other data type, the data type of the Spark DataFrame is converted to a character string type.
(10) Add the processed data to the Statement object of JDBC
(11) Judge the number of data records added to the Statement object
A threshold is set, for example 10000: when the number of records is greater than or equal to 10000, the data in the Statement object are written into ClickHouse in batch, and batch insertion improves efficiency significantly; if the number of records is less than 10000, traversal of the data in the DataFrame continues.
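The batched write can be sketched as follows with standard JDBC; writePartition and its parameters are illustrative names, not part of the invention:
import java.sql.Connection

// accumulate rows on a PreparedStatement and flush every `threshold` records
def writePartition(conn: Connection, rows: Iterator[Seq[Any]], insertSql: String, threshold: Int = 10000): Unit = {
  val stmt = conn.prepareStatement(insertSql) // e.g. "INSERT INTO OUTPUT_TABLE VALUES (?, ?, ...)"
  var pending = 0
  for (row <- rows) {
    row.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v.asInstanceOf[AnyRef]) }
    stmt.addBatch()
    pending += 1
    if (pending >= threshold) { stmt.executeBatch(); pending = 0 } // batch write to ClickHouse
  }
  if (pending > 0) stmt.executeBatch() // flush the remaining records
  stmt.close()
}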
A general-purpose ClickHouse data source is realized through the above process, and the code implemented above is then packaged into a specific binary jar package.
The jar package generated above is referenced in any project that needs to write data into ClickHouse. A Spark DataFrame is generated from the data and assigned to a variable df, and the following statement is then called on df to complete the function of writing the data into ClickHouse.
df.write
  .format("clickhouse")             // selects the custom clickhouse data source
  .mode("overwrite")                // overwrite: if the ClickHouse table already has data, it is overwritten
  .option("table", "OUTPUT_TABLE")  // which ClickHouse table to write to
  .option("cluster", "cluster001")  // the name of the ClickHouse cluster to write to
  .option("host", "192.168.118.22") // the address of any one or more servers of the ClickHouse cluster
  .save()                           // triggers the save action
Wherein:
df is the data source variable of type DataFrame.
The complete package and class name, or the short name registered through DataSourceRegister, is specified in format.
The table name, the cluster name and the host of any one node are specified in the options.
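As a practical note, for the short name "clickhouse" in format to resolve, Spark's DataSourceRegister mechanism expects the implementing class to be listed in a ServiceLoader file on the classpath; a sketch, using the illustrative class name from the earlier skeleton:
# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# one fully qualified class name per line (adjust if the class lives in a package)
ClickHouseSource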
Based on the same inventive concept, an embodiment of the present invention further provides a device for fast data writing into ClickHouse through a custom Spark data source. Since the principle by which the device solves the problem is similar to that of the method for fast data writing into ClickHouse through a custom Spark data source, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again. As used hereinafter, the term "module" may include a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
A device for fast data writing into ClickHouse through a custom Spark data source comprises a creating module, a packaging module and a calling module.
In the creating module, a ClickHouse data source is created by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
in the packaging module, the code implementing the ClickHouse data source generated by the creating module is released as a specific binary jar package;
in the calling module, the jar package generated by the packaging module is referenced in any project that needs to write data into ClickHouse, the data to be written are converted into a data set of Spark's DataFrame type, and a specific statement is called to complete the fast write into ClickHouse.
The technical scheme provided by the embodiments of the present invention has the following beneficial technical effects: by customizing a ClickHouse data source for Spark, a general-purpose function for writing into ClickHouse is realized and only needs to be called within a project, which avoids having to develop code each time a new service is added or the service logic is adjusted, improves data write efficiency significantly, and at the same time guarantees that the data are written evenly to every node in the cluster.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for fast data writing into ClickHouse through a custom Spark data source, characterized by comprising the following steps:
creating a ClickHouse data source capable of balanced writing by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
releasing the implemented code as a binary jar package, which is referenced each time data are written into ClickHouse;
converting the data to be written into a data set of Spark's DataFrame type and calling a specific statement to complete the fast write into ClickHouse.
2. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 1, further comprising:
inheriting the data source related interfaces and classes RelationProvider, CreatableRelationProvider and DataSourceRegister of Spark, and redefining the data-import algorithms shortName and createRelation of the inherited interfaces and classes; the shortName algorithm returns a character string as the data source type; in the createRelation algorithm, the sqlContext parameter indicates the SQL context variable, the mode parameter indicates the data storage mode, the parameters parameter carries the externally supplied custom parameters, and the data parameter indicates the data to be stored.
3. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 1, wherein:
the specific statement is:
df.write
.format("clickhouse")
.mode("overwrite")
.option("table","OUTPUT_TABLE")
.option("cluster","cluster001")
.option("host","192.168.118.22")
.save().
4. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 2, wherein:
the createRelation algorithm determines the shard information contained in the cluster by reading the metadata system table of ClickHouse, and writes Spark's DataFrame data of different partitions evenly into the shards of ClickHouse according to a specified strategy.
5. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 2, wherein:
the createRelation algorithm obtains the schemas of ClickHouse and of the DataFrame respectively, automatically adapts the data types and default values, and writes the data evenly into the local nodes corresponding to the partitions.
6. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 4, wherein:
the specified strategy for the balanced writing into the shards of ClickHouse comprises:
1) obtaining the hosts where the shards of the cluster are located and the replicas of the shards;
2) calculating the total number of partitions of the Spark task;
3) calling the foreachPartition method of the DataFrame to write the data into the shards;
4) obtaining the partition number inside each partition, and determining the target shard of ClickHouse by polling or at random according to that number;
5) connecting, according to the determined shard number, to the one or more hosts corresponding to that shard, calling the BalancedClickhouseDataSource method to select one host at random, and performing the balanced write of the data.
7. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 5, comprising the following steps:
1) obtaining the schema of the DataFrame;
the schema information of the DataFrame is obtained through df.schema;
2) obtaining the schema of ClickHouse;
connecting to the ClickHouse database and programmatically executing the command DESC <table name> to obtain three pieces of information: the field name, the data type of the field, and the default value of the field;
3) connecting to ClickHouse and creating a Statement object;
4) traversing the data of the DataFrame;
5) traversing each column of each record according to the DataFrame schema obtained in step 1);
6) if the value of the column is empty, or the column exists in ClickHouse but is absent from the DataFrame, proceeding from step 7) below; if the column has a value, skipping to step 9);
7) if a default value was set when the ClickHouse table was built, that default value is used;
8) if no default value was set when the ClickHouse table was built, the default value of the ClickHouse data type is used;
9) performing the data type conversion according to the rules;
10) adding the processed data to the Statement object of JDBC;
11) judging the number of data records added to the Statement object.
8. The method of claim 6, wherein the rules of the data type conversion are as follows:
1) when the field type of ClickHouse is Int8, Int16, Int32, UInt8 or UInt16, the data type of the Spark DataFrame is converted to the Int type;
2) when the field type of ClickHouse is UInt32, Int64 or UInt64, the data type of the Spark DataFrame is converted to the Long type;
3) when the field type of ClickHouse is Float32, the data type of the Spark DataFrame is converted to the Float type;
4) when the field type of ClickHouse is Float64, the data type of the Spark DataFrame is converted to the Double type;
5) when the field type of ClickHouse is a Decimal type, the data type of the Spark DataFrame is converted to the Decimal type;
6) when the field type of ClickHouse is an Array type, the data type of the Spark DataFrame is converted to the List type;
7) when the field type of ClickHouse is a LowCardinality data type, the data type of the Spark DataFrame is converted to a character string type;
8) when the field type of ClickHouse is any other data type, the data type of the Spark DataFrame is converted to a character string type.
9. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 6, wherein:
a threshold is set for the number of data records in the Statement object; when the number is greater than or equal to the threshold, the data in the Statement object are written into ClickHouse in batch; if the number of records is smaller than the threshold, traversal of the data in the DataFrame continues.
10. A device for fast data writing into ClickHouse through a custom Spark data source, characterized by comprising a creating module, a packaging module and a calling module, wherein:
in the creating module, a ClickHouse data source is created by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
in the packaging module, the code implementing the ClickHouse data source generated by the creating module is released as a specific binary jar package;
in the calling module, the jar package generated by the packaging module is referenced in any project that needs to write data into ClickHouse, the data to be written are converted into a data set of Spark's DataFrame type, and a specific statement is called to complete the fast write into ClickHouse.
CN202011214153.8A 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source Active CN112364019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011214153.8A CN112364019B (en) 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011214153.8A CN112364019B (en) 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Publications (2)

Publication Number Publication Date
CN112364019A true CN112364019A (en) 2021-02-12
CN112364019B CN112364019B (en) 2022-10-04

Family

ID=74513515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011214153.8A Active CN112364019B (en) 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Country Status (1)

Country Link
CN (1) CN112364019B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003660A (en) * 2021-11-05 2022-02-01 广州宸祺出行科技有限公司 Method and device for efficiently synchronizing real-time data to ClickHouse based on flash

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022874A (en) * 2016-05-17 2016-10-12 北京奇虎科技有限公司 Order data processing method, order system, and flow charging system
US20180004778A1 (en) * 2016-07-01 2018-01-04 Salesforce.Com, Inc. Field types defined via custom metadata types
CN108959549A (en) * 2018-06-29 2018-12-07 北京奇虎科技有限公司 Method for writing data, calculates equipment and computer storage medium at device
CN110837506A (en) * 2019-11-07 2020-02-25 中电福富信息科技有限公司 Mycat-based data fragmentation and read-write separation method and system
CN111125216A (en) * 2019-12-10 2020-05-08 中盈优创资讯科技有限公司 Method and device for importing data into Phoenix
CN111782134A (en) * 2019-06-14 2020-10-16 北京京东尚科信息技术有限公司 Data processing method, device, system and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022874A (en) * 2016-05-17 2016-10-12 北京奇虎科技有限公司 Order data processing method, order system, and flow charging system
US20180004778A1 (en) * 2016-07-01 2018-01-04 Salesforce.Com, Inc. Field types defined via custom metadata types
CN108959549A (en) * 2018-06-29 2018-12-07 北京奇虎科技有限公司 Method for writing data, calculates equipment and computer storage medium at device
CN111782134A (en) * 2019-06-14 2020-10-16 北京京东尚科信息技术有限公司 Data processing method, device, system and computer readable storage medium
CN110837506A (en) * 2019-11-07 2020-02-25 中电福富信息科技有限公司 Mycat-based data fragmentation and read-write separation method and system
CN111125216A (en) * 2019-12-10 2020-05-08 中盈优创资讯科技有限公司 Method and device for importing data into Phoenix

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003660A (en) * 2021-11-05 2022-02-01 广州宸祺出行科技有限公司 Method and device for efficiently synchronizing real-time data to ClickHouse based on flash
CN114003660B (en) * 2021-11-05 2022-06-03 广州宸祺出行科技有限公司 Method and device for efficiently synchronizing real-time data to ClickHouse based on flash

Also Published As

Publication number Publication date
CN112364019B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
US11893018B2 (en) Dispersing data and parity across a set of segments stored via a computing system
US20170083573A1 (en) Multi-query optimization
US10353893B2 (en) Data partitioning and ordering
US10552393B2 (en) System and method for use of a dynamic flow in a multidimensional database environment
US11693912B2 (en) Adapting database queries for data virtualization over combined database stores
CN109241159B (en) Partition query method and system for data cube and terminal equipment
US20240143576A1 (en) Registering additional type systems using a hub data model for data processing
CN112364019B (en) Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source
CN112912870A (en) Tenant identifier conversion
EP3158477B1 (en) Dynamic n-dimensional cubes for hosted analytics
CN114443680A (en) Database management system, related apparatus, method and medium
CN109947736B (en) Method and system for real-time computing
CN106599241A (en) Big data visual management method for GIS software
CN116719822B (en) Method and system for storing massive structured data
EP3293644B1 (en) Loading data for iterative evaluation through simd registers
WO2005041067A1 (en) Distributed memory type information processing system
Purdilă et al. Single‐scan: a fast star‐join query processing algorithm
US20220215021A1 (en) Data Query Method and Apparatus, Computing Device, and Storage Medium
CN111191106B (en) DSL construction method, system, electronic device and medium
US20170031982A1 (en) Maintaining Performance in the Presence of Insertions, Deletions, and Streaming Queries
CN112749197A (en) Data fragment refreshing method, device, equipment and storage medium
CN113868138A (en) Method, system, equipment and storage medium for acquiring test data
CN114391152A (en) Clustering process using business data
US10558637B2 (en) Modularized data distribution plan generation
US20240160614A1 (en) Method for fast and efficient computing of tree form-factors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder
CP02 Change in the address of a patent holder

Address after: Room 702-2, No. 4811 Caoan Road, Jiading District, Shanghai, 201800

Patentee after: CHINA UNITECHS

Address before: Room 1004-4, 10 / F, 1112 Hanggui Road, Anting Town, Jiading District, Shanghai 201800

Patentee before: CHINA UNITECHS