CN112364019A - Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source - Google Patents

Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Info

Publication number
CN112364019A
Authority
CN
China
Prior art keywords
data
clickhouse
type
spark
dataframe
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011214153.8A
Other languages
Chinese (zh)
Other versions
CN112364019B (en)
Inventor
周朝卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongying Youchuang Information Technology Co Ltd
Original Assignee
Zhongying Youchuang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongying Youchuang Information Technology Co Ltd filed Critical Zhongying Youchuang Information Technology Co Ltd
Priority to CN202011214153.8A priority Critical patent/CN112364019B/en
Publication of CN112364019A publication Critical patent/CN112364019A/en
Application granted granted Critical
Publication of CN112364019B publication Critical patent/CN112364019B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for fast data writing into ClickHouse through a custom Spark data source. A ClickHouse data source is customized on the basis of Spark's external data source interface, and a general-purpose write function is realized: Spark data can be written into ClickHouse quickly by calling a single statement, no code customized to the service requirement needs to be developed, data write efficiency is improved significantly, and balanced writing of the data is guaranteed.

Description

Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source
Technical Field
The invention relates to the field of computer technology, and in particular to a method and a device for fast data writing into ClickHouse through a custom Spark data source.
Background
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark can read data in a distributed manner, apply various transformations and processing to the data, and write the results to the target storage in a distributed manner. ClickHouse is a columnar database management system for the real-time analysis of big data. Through vectorized execution and low-level CPU SIMD instructions, it processes massive data in parallel and thereby accelerates data processing.
At present, writing data from Spark into a ClickHouse table has two problems. The first is a low degree of function reuse: a customized program has to be developed for each requirement, the code has to be adjusted whenever a service is added or changed, and maintenance grows ever more complex as the customized programs accumulate. Specifically, the data to be written (for example, a file or another database) are read with Spark and converted into a DataFrame. For each field of the DataFrame, its data type has to be compared with the corresponding field of the ClickHouse table and converted into a type compatible with that table. During the actual write, business logic such as type conversion has to be applied to every field of every record. If the ClickHouse table has a particularly large number of columns (since ClickHouse is a columnar store, tables with hundreds or thousands of columns are perfectly normal in a production environment), the code becomes very lengthy and error-prone, because every column requires its own field handling. Because the DataFrame fields and schema differ between services, any other service that needs to write data into ClickHouse can only redevelop the write code from scratch. If the requirements change, for example the data type of a ClickHouse field changes or a new field is added to the table, the code has to be modified. And when a DataFrame field value is empty, or a field of the ClickHouse table is absent from the DataFrame, the write fails.
The second problem concerns balanced writing: to guarantee the stability and performance of the import, data should be imported through the local tables of ClickHouse. The local tables, however, are distributed over the nodes of the cluster, while Spark's DataFrame is a distributed data set. Writing through the distributed table instead raises many problems: there is no consistency check of the data, data are easily lost when the node hosting the distributed table goes down, and heavy pressure is put on ZooKeeper. At present there is no way to write data uniformly to each local node of ClickHouse, that is, balanced storage of the data cannot be guaranteed.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The invention aims to customize a ClickHouse data source on the basis of Spark's external data source interface, realize a general-purpose function for writing into ClickHouse, enable Spark data to be written into ClickHouse by calling a single statement, ensure that the data are stored evenly on the local nodes of ClickHouse, and remove the need to develop code customized to the service requirement.
To achieve this aim, the following scheme is provided:
A method for fast data writing into ClickHouse through a custom Spark data source comprises the following steps:
1. creating a ClickHouse data source capable of balanced writing by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
2. releasing the implemented code as a binary jar package, which is referenced each time data are written into ClickHouse;
3. converting the data to be written into a data set of Spark's DataFrame type and calling a specific statement to complete the fast write into ClickHouse.
Further, the data source related interfaces and classes RelationProvider, CreatableRelationProvider and DataSourceRegister of Spark are inherited, and the data-import algorithms shortName and createRelation of the inherited interfaces and classes are redefined. The shortName algorithm returns a character string as the data source type; in the createRelation algorithm, the sqlContext parameter indicates the SQL context variable, the mode parameter indicates the data storage mode, the parameters parameter carries the externally supplied custom parameters, and the data parameter indicates the data to be stored.
Further, the specific statement is:
df.write
.format("clickhouse")
.mode("overwrite")
.option("table","OUTPUT_TABLE")
.option("cluster","cluster001")
.option("host","192.168.118.22")
.save()
Further, the createRelation algorithm determines the shard information contained in the cluster by reading the metadata system table of ClickHouse, and writes Spark's DataFrame data of different partitions evenly into the shards of ClickHouse according to a specified strategy.
Further, the createRelation algorithm obtains the schemas of ClickHouse and of the DataFrame respectively, automatically adapts the data types and default values, and writes the data evenly into the local nodes corresponding to the partitions.
Further, the aforesaid writing of Spark's DataFrame data of different partitions evenly into the shards of ClickHouse according to a specified strategy comprises:
1) obtaining the hosts where the shards of the cluster are located and the replicas of the shards;
2) calculating the total number of partitions of the Spark task;
3) calling the foreachPartition method of the DataFrame to write the data into the shards;
4) obtaining the partition number inside each partition, and determining the target shard of ClickHouse by polling or at random according to that number;
5) connecting, according to the determined shard number, to the one or more hosts corresponding to that shard, calling the BalancedClickhouseDataSource method to select one host at random, and performing the balanced write of the data.
Further, the automatic adaptation of the data types and default values and the balanced writing of the data into the local nodes corresponding to the partitions comprise the following steps:
1) Obtaining the schema of the DataFrame
The schema information of the DataFrame is obtained through df.schema.
2) Obtaining the schema of ClickHouse
Connect to the ClickHouse database and execute the command DESC <table name> programmatically to obtain three pieces of information: the field name, the data type of the field, and the default value of the field.
3) Connecting to ClickHouse and creating a Statement object
4) Traversing the data of the DataFrame
5) Traversing each column of each record according to the DataFrame schema obtained in step 1).
6) If the value of the column is empty, or the column exists in ClickHouse but is absent from the DataFrame, continue from step 7) below; if the column has a value, skip to step 9).
7) If a default value was set when the ClickHouse table was built, that default value is used.
8) If no default value was set when the ClickHouse table was built, the default value of the ClickHouse data type is used.
9) Performing the data type conversion according to the rules
10) Adding the processed data to the Statement object of JDBC
11) Judging the number of data records added to the Statement object
Further, the rules of the data type conversion are as follows:
1) When the field type of ClickHouse is Int8, Int16, Int32, UInt8 or UInt16, the data type of the Spark DataFrame is converted to the Int type.
2) When the field type of ClickHouse is UInt32, Int64 or UInt64, the data type of the Spark DataFrame is converted to the Long type.
3) When the field type of ClickHouse is Float32, the data type of the Spark DataFrame is converted to the Float type.
4) When the field type of ClickHouse is Float64, the data type of the Spark DataFrame is converted to the Double type.
5) When the field type of ClickHouse is a Decimal type, the data type of the Spark DataFrame is converted to the Decimal type.
6) When the field type of ClickHouse is an Array type, the data type of the Spark DataFrame is converted to the List type.
7) When the field type of ClickHouse is a LowCardinality data type, the data type of the Spark DataFrame is converted to a character string type.
8) When the field type of ClickHouse is any other data type, the data type of the Spark DataFrame is converted to a character string type.
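These rules can be expressed compactly. The following Scala sketch is illustrative only: targetSparkType is a hypothetical helper name, and parameterized ClickHouse types such as Decimal(P, S), Array(T) and LowCardinality(T) are matched by prefix:
// Illustrative sketch of the conversion rules above; not the literal implementation.
def targetSparkType(chType: String): String = chType match {
  case "Int8" | "Int16" | "Int32" | "UInt8" | "UInt16" => "Int"
  case "UInt32" | "Int64" | "UInt64"                   => "Long"
  case "Float32"                                       => "Float"
  case "Float64"                                       => "Double"
  case t if t.startsWith("Decimal")                    => "Decimal"
  case t if t.startsWith("Array")                      => "List"
  case t if t.startsWith("LowCardinality")             => "String"
  case _                                               => "String" // all other types fall back to string
}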
Further, a threshold can be set for the number of data records in the Statement object: when the number of records is greater than or equal to the threshold, the data in the Statement object are written into ClickHouse in batch; if the number of records is smaller than the threshold, traversal of the data in the DataFrame continues.
The invention also provides a device for fast data writing into ClickHouse through a custom Spark data source, comprising a creating module, a packaging module and a calling module.
In the creating module, a ClickHouse data source is created by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
in the packaging module, the code implementing the ClickHouse data source generated by the creating module is released as a specific binary jar package;
in the calling module, the jar package generated by the packaging module is referenced in any project that needs to write data into ClickHouse, the data to be written are converted into a data set of Spark's DataFrame type, and a specific statement is called to complete the fast write into ClickHouse.
By customizing a ClickHouse data source for Spark, the method and the device automatically adapt the data types between the ClickHouse fields and the Spark DataFrame fields, automatically replace empty values with the ClickHouse default values, and write the Spark data to the local nodes of ClickHouse by polling or at random during the write, thereby realizing a general-purpose function for writing Spark data into ClickHouse. Afterwards only one statement needs to be called to write Spark data into ClickHouse, the balanced storage of the data on the local nodes of ClickHouse is ensured, and no code customized to the service requirement needs to be developed. Data write efficiency is improved significantly, and at the same time the data are guaranteed to be written evenly to every node in the cluster.
Drawings
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
FIG. 1 shows the balanced data writing method of the method and device for fast data writing into ClickHouse through a custom Spark data source according to an embodiment of the present invention;
FIG. 2 shows the Spark partition data writing process of the method and device for fast data writing into ClickHouse through a custom Spark data source according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The terms used in the embodiments are explained as follows:
DataFrame: an immutable distributed data collection in Spark that contains both the data and the corresponding schema information; it is analogous to a database table, e.g. the field names of the DataFrame schema correspond to the column names of a database table.
First, the data source related interfaces and classes of Spark need to be inherited and their algorithms redefined: the createRelation and shortName algorithms of RelationProvider, CreatableRelationProvider and DataSourceRegister. The return type of the shortName algorithm is a character string that defines the name of the data source; here it only needs to return the string "clickhouse".
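For orientation, a minimal skeleton of such a data source is sketched below, assuming the Spark 2.x DataSource V1 interfaces; the class name ClickHouseSource and the stub bodies are illustrative, not the literal implementation of the invention:
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider}

class ClickHouseSource extends RelationProvider with CreatableRelationProvider with DataSourceRegister {
  // shortName defines the name used in .format("clickhouse")
  override def shortName(): String = "clickhouse"

  // read path (not the focus of this description)
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = ???

  // write path: mode is the save mode, parameters carries the user options
  // (table, cluster, host), data is the DataFrame to be saved
  override def createRelation(sqlContext: SQLContext, mode: SaveMode,
      parameters: Map[String, String], data: DataFrame): BaseRelation = {
    // 1) read ClickHouse metadata; 2) adapt types and defaults;
    // 3) write each partition evenly to a shard (described below)
    ???
  }
}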
The createRelation algorithm used for data source writes has the following parameters:
sqlContext: SQLContext, Spark's SQL context variable.
mode: SaveMode, the mode for saving the data, e.g. overwrite or append.
parameters: Map[String, String], the externally supplied custom parameters.
data: DataFrame, the data to be saved.
The createRelation implementation proceeds as follows:
By reading the system data of ClickHouse, it is determined which fields the table contains, the data types of the fields and their default values, as well as which nodes the cluster contains and the address/port information of those nodes. Based on this information, the data types and default values are adapted automatically, and the data are written evenly into the local nodes of ClickHouse.
The balanced writing of the data is realized as follows: first, the number of partitions of the DataFrame to be written is determined (a partition is the smallest unit of data contained in a DataFrame); then the partitions are traversed and each partition is assigned to a certain shard of the cluster, realizing load balancing among the shards. When the data are written into a shard, the shard corresponds to several replicas, so writing to any one replica under the shard suffices, and load balancing among the replicas can be ensured. The specific steps are as follows:
step 1: looking up system table systems, clusters, and determining fragmentation information contained in a cluster, specifically including: the host where the shards of the cluster are located, and the copy of the shards. Multiple copies of the same tile are aggregated together, e.g., to form the following data:
slicing numbering host (multiple hosts separated by comma)
1Ch001,ch002
2Ch003,ch004
Numbering of segments Main unit (multiple main units separated by comma)
1 Ch001,ch002
2 Ch003,ch004
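A hedged JDBC sketch of this lookup follows; the URL, the cluster name cluster001 and the grouping variable are illustrative, while shard_num, host_address and port are columns of ClickHouse's system.clusters table:
import java.sql.DriverManager

// read the shard/replica layout of the cluster from system.clusters
val conn = DriverManager.getConnection("jdbc:clickhouse://192.168.118.22:8123/default")
val rs = conn.createStatement().executeQuery(
  "SELECT shard_num, host_address, port FROM system.clusters WHERE cluster = 'cluster001'")
var shards = Map.empty[Int, List[String]] // shard number -> replica hosts
while (rs.next()) {
  val num  = rs.getInt("shard_num")
  val host = rs.getString("host_address") + ":" + rs.getInt("port")
  shards += (num -> (host :: shards.getOrElse(num, Nil)))
}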
Step 2: calculating the total partition number of Spark task
The calculation method comprises the following steps: rdd. getnumcolours
Where df is the DataFrame dataset variable.
And step 3: invoking the foreachpart method of the DataFrame to write the data into the fragment
Foreachpartion, the following operations are invoked inside each partition.
1) Obtain the number of the partition
val partID=TaskContext.get().partitionId()
2) Determine the ClickHouse shard from the partition number
The shard can be determined by polling or at random.
For example, using polling: suppose the total number of ClickHouse shards is 5. Then the data of the partition with partition number 3 are written to shard 3 (the remainder of 3 divided by 5), and the partition with partition number 6 is written to shard 1 (the remainder of 6 divided by 5).
3) Execute the writing of the data
According to the determined shard number, connect to the one or more hosts corresponding to that shard and execute the write.
Since one shard number corresponds to one or more hosts, the BalancedClickhouseDataSource method can be called to select one host at random for the write, thereby achieving load balancing among the several hosts under the same shard.
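A minimal sketch of this call, assuming the BalancedClickhouseDataSource class of the ru.yandex.clickhouse JDBC driver and illustrative host names:
import ru.yandex.clickhouse.BalancedClickhouseDataSource

// all replica hosts of the chosen shard in one JDBC URL (illustrative hosts)
val ds = new BalancedClickhouseDataSource("jdbc:clickhouse://ch001:8123,ch002:8123/default")
val shardConn = ds.getConnection // a random healthy host of the shard is selected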
Calling mode:
df.foreachPartition(iter => {
  // obtain the number of the partition
  val partID = TaskContext.get().partitionId()
  // determine the ClickHouse shard from the partition number, e.g. by polling:
  // partID modulo the total number of shards
  // connect to ClickHouse according to the determined shard information and write the data
})
Inside each partition, the following operations are performed for the data write:
Read the system data of ClickHouse to determine which fields the table contains, the data types of the fields and their default values; based on this information, automatically adapt the data types and default values and write the data evenly into the local node corresponding to the partition. The specific steps are as follows:
(1) Obtain the schema of the DataFrame
The schema information of the DataFrame is obtained through df.schema.
(2) Obtain the schema of ClickHouse
Connect to the ClickHouse database and execute the command DESC <table name> programmatically to obtain three pieces of information: the field name, the data type of the field, and the default value of the field.
(3) Connect to ClickHouse and create a Statement object
(4) Traverse the data of the DataFrame
(5) Traverse each column of each record according to the DataFrame schema obtained in step (1).
(6) If the value of the column is empty, or the column exists in ClickHouse but is absent from the DataFrame, proceed from step (7) below; if the column has a value, skip to step (9).
(7) If a default value was set when the ClickHouse table was built, that default value is used.
(8) If no default value was set when the ClickHouse table was built, the default value of the ClickHouse data type is used.
For example, the numeric default is 0, the character string default is null, the date type default is "1970-01-01 00:00:00", the array type default is [], and so on.
(9) Data type conversion
The DataFrame data types are aligned with the data types of the ClickHouse table.
1) When the field type of ClickHouse is Int8, Int16, Int32, UInt8 or UInt16, the data type of the Spark DataFrame is converted to the Int type.
2) When the field type of ClickHouse is UInt32, Int64 or UInt64, the data type of the Spark DataFrame is converted to the Long type.
3) When the field type of ClickHouse is Float32, the data type of the Spark DataFrame is converted to the Float type.
4) When the field type of ClickHouse is Float64, the data type of the Spark DataFrame is converted to the Double type.
5) When the field type of ClickHouse is a Decimal type, the data type of the Spark DataFrame is converted to the Decimal type.
6) When the field type of ClickHouse is an Array type, the data type of the Spark DataFrame is converted to the List type.
7) When the field type of ClickHouse is a LowCardinality data type, the data type of the Spark DataFrame is converted to a character string type.
8) When the field type of ClickHouse is any other data type, the data type of the Spark DataFrame is converted to a character string type.
(10) Add the processed data to the Statement object of JDBC
(11) Judge the number of data records added to the Statement object
A threshold is set, for example 10000: when the number of records is greater than or equal to 10000, the data in the Statement object are written into ClickHouse in batch, and batch insertion improves efficiency significantly; if the number of records is less than 10000, traversal of the data in the DataFrame continues.
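The batched write can be sketched as follows with standard JDBC; writePartition and its parameters are illustrative names, not part of the invention:
import java.sql.Connection

// accumulate rows on a PreparedStatement and flush every `threshold` records
def writePartition(conn: Connection, rows: Iterator[Seq[Any]], insertSql: String, threshold: Int = 10000): Unit = {
  val stmt = conn.prepareStatement(insertSql) // e.g. "INSERT INTO OUTPUT_TABLE VALUES (?, ?, ...)"
  var pending = 0
  for (row <- rows) {
    row.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v.asInstanceOf[AnyRef]) }
    stmt.addBatch()
    pending += 1
    if (pending >= threshold) { stmt.executeBatch(); pending = 0 } // batch write to ClickHouse
  }
  if (pending > 0) stmt.executeBatch() // flush the remaining records
  stmt.close()
}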
A general-purpose ClickHouse data source is realized through the above process, and the code implemented above is then packaged into a specific binary jar package.
The jar package generated above is referenced in any project that needs to write data into ClickHouse. A Spark DataFrame is generated from the data and assigned to a variable df, and the following statement is then called on df to complete the function of writing the data into ClickHouse.
df.write
  .format("clickhouse")             // selects the custom clickhouse data source
  .mode("overwrite")                // overwrite: if the ClickHouse table already has data, it is overwritten
  .option("table", "OUTPUT_TABLE")  // which ClickHouse table to write to
  .option("cluster", "cluster001")  // the name of the ClickHouse cluster to write to
  .option("host", "192.168.118.22") // the address of any one or more servers of the ClickHouse cluster
  .save()                           // triggers the save action
Wherein:
df is the data source variable of type DataFrame.
The complete package and class name, or the short name registered through DataSourceRegister, is specified in format.
The table name, the cluster name and the host of any one node are specified in the options.
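As a practical note, for the short name "clickhouse" in format to resolve, Spark's DataSourceRegister mechanism expects the implementing class to be listed in a ServiceLoader file on the classpath; a sketch, using the illustrative class name from the earlier skeleton:
# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# one fully qualified class name per line (adjust if the class lives in a package)
ClickHouseSource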
Based on the same inventive concept, an embodiment of the present invention further provides a device for fast data writing into ClickHouse through a custom Spark data source. Since the principle by which the device solves the problem is similar to that of the method for fast data writing into ClickHouse through a custom Spark data source, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again. As used hereinafter, the term "module" may include a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
A device for fast data writing into ClickHouse through a custom Spark data source comprises a creating module, a packaging module and a calling module.
In the creating module, a ClickHouse data source is created by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
in the packaging module, the code implementing the ClickHouse data source generated by the creating module is released as a specific binary jar package;
in the calling module, the jar package generated by the packaging module is referenced in any project that needs to write data into ClickHouse, the data to be written are converted into a data set of Spark's DataFrame type, and a specific statement is called to complete the fast write into ClickHouse.
The technical scheme provided by the embodiments of the present invention has the following beneficial technical effects: by customizing a ClickHouse data source for Spark, a general-purpose function for writing into ClickHouse is realized and only needs to be called within a project, which avoids having to develop code each time a new service is added or the service logic is adjusted, improves data write efficiency significantly, and at the same time guarantees that the data are written evenly to every node in the cluster.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for fast data writing into ClickHouse through a custom Spark data source, characterized by comprising the following steps:
creating a ClickHouse data source capable of balanced writing by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
releasing the implemented code as a binary jar package, which is referenced each time data are written into ClickHouse;
converting the data to be written into a data set of Spark's DataFrame type and calling a specific statement to complete the fast write into ClickHouse.
2. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 1, further comprising:
inheriting the data source related interfaces and classes RelationProvider, CreatableRelationProvider and DataSourceRegister of Spark, and redefining the data-import algorithms shortName and createRelation of the inherited interfaces and classes; the shortName algorithm returns a character string as the data source type; in the createRelation algorithm, the sqlContext parameter indicates the SQL context variable, the mode parameter indicates the data storage mode, the parameters parameter carries the externally supplied custom parameters, and the data parameter indicates the data to be stored.
3. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 1, wherein:
the specific statement is:
df.write
.format("clickhouse")
.mode("overwrite")
.option("table","OUTPUT_TABLE")
.option("cluster","cluster001")
.option("host","192.168.118.22")
.save().
4. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 2, wherein:
the createRelation algorithm determines the shard information contained in the cluster by reading the metadata system table of ClickHouse, and writes Spark's DataFrame data of different partitions evenly into the shards of ClickHouse according to a specified strategy.
5. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 2, wherein:
the createRelation algorithm obtains the schemas of ClickHouse and of the DataFrame respectively, automatically adapts the data types and default values, and writes the data evenly into the local nodes corresponding to the partitions.
6. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 4, wherein:
the specified strategy for the balanced writing into the shards of ClickHouse comprises:
1) obtaining the hosts where the shards of the cluster are located and the replicas of the shards;
2) calculating the total number of partitions of the Spark task;
3) calling the foreachPartition method of the DataFrame to write the data into the shards;
4) obtaining the partition number inside each partition, and determining the target shard of ClickHouse by polling or at random according to that number;
5) connecting, according to the determined shard number, to the one or more hosts corresponding to that shard, calling the BalancedClickhouseDataSource method to select one host at random, and performing the balanced write of the data.
7. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 5, comprising the following steps:
1) obtaining the schema of the DataFrame;
the schema information of the DataFrame is obtained through df.schema;
2) obtaining the schema of ClickHouse;
connecting to the ClickHouse database and programmatically executing the command DESC <table name> to obtain three pieces of information: the field name, the data type of the field, and the default value of the field;
3) connecting to ClickHouse and creating a Statement object;
4) traversing the data of the DataFrame;
5) traversing each column of each record according to the DataFrame schema obtained in step 1);
6) if the value of the column is empty, or the column exists in ClickHouse but is absent from the DataFrame, proceeding from step 7) below; if the column has a value, skipping to step 9);
7) if a default value was set when the ClickHouse table was built, that default value is used;
8) if no default value was set when the ClickHouse table was built, the default value of the ClickHouse data type is used;
9) performing the data type conversion according to the rules;
10) adding the processed data to the Statement object of JDBC;
11) judging the number of data records added to the Statement object.
8. The method of claim 6, wherein the rules of the data type conversion are as follows:
1) when the field type of ClickHouse is Int8, Int16, Int32, UInt8 or UInt16, the data type of the Spark DataFrame is converted to the Int type;
2) when the field type of ClickHouse is UInt32, Int64 or UInt64, the data type of the Spark DataFrame is converted to the Long type;
3) when the field type of ClickHouse is Float32, the data type of the Spark DataFrame is converted to the Float type;
4) when the field type of ClickHouse is Float64, the data type of the Spark DataFrame is converted to the Double type;
5) when the field type of ClickHouse is a Decimal type, the data type of the Spark DataFrame is converted to the Decimal type;
6) when the field type of ClickHouse is an Array type, the data type of the Spark DataFrame is converted to the List type;
7) when the field type of ClickHouse is a LowCardinality data type, the data type of the Spark DataFrame is converted to a character string type;
8) when the field type of ClickHouse is any other data type, the data type of the Spark DataFrame is converted to a character string type.
9. The method for fast data writing into ClickHouse through a custom Spark data source as claimed in claim 6, wherein:
a threshold is set for the number of data records in the Statement object; when the number is greater than or equal to the threshold, the data in the Statement object are written into ClickHouse in batch; if the number of records is smaller than the threshold, traversal of the data in the DataFrame continues.
10. A device for fast data writing into ClickHouse through a custom Spark data source, characterized by comprising a creating module, a packaging module and a calling module, wherein:
in the creating module, a ClickHouse data source is created by redefining the data-import algorithms in the related interfaces and classes of Spark's data source API;
in the packaging module, the code implementing the ClickHouse data source generated by the creating module is released as a specific binary jar package;
in the calling module, the jar package generated by the packaging module is referenced in any project that needs to write data into ClickHouse, the data to be written are converted into a data set of Spark's DataFrame type, and a specific statement is called to complete the fast write into ClickHouse.
CN202011214153.8A 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source Active CN112364019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011214153.8A CN112364019B (en) 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011214153.8A CN112364019B (en) 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Publications (2)

Publication Number Publication Date
CN112364019A true CN112364019A (en) 2021-02-12
CN112364019B CN112364019B (en) 2022-10-04

Family

ID=74513515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011214153.8A Active CN112364019B (en) 2020-11-04 2020-11-04 Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source

Country Status (1)

Country Link
CN (1) CN112364019B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003660A (en) * 2021-11-05 2022-02-01 广州宸祺出行科技有限公司 Method and device for efficiently synchronizing real-time data to ClickHouse based on flash

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022874A (en) * 2016-05-17 2016-10-12 北京奇虎科技有限公司 Order data processing method, order system, and flow charging system
US20180004778A1 (en) * 2016-07-01 2018-01-04 Salesforce.Com, Inc. Field types defined via custom metadata types
CN108959549A (en) * 2018-06-29 2018-12-07 北京奇虎科技有限公司 Method for writing data, calculates equipment and computer storage medium at device
CN110837506A (en) * 2019-11-07 2020-02-25 中电福富信息科技有限公司 Mycat-based data fragmentation and read-write separation method and system
CN111125216A (en) * 2019-12-10 2020-05-08 中盈优创资讯科技有限公司 Method and device for importing data into Phoenix
CN111782134A (en) * 2019-06-14 2020-10-16 北京京东尚科信息技术有限公司 Data processing method, device, system and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022874A (en) * 2016-05-17 2016-10-12 北京奇虎科技有限公司 Order data processing method, order system, and flow charging system
US20180004778A1 (en) * 2016-07-01 2018-01-04 Salesforce.Com, Inc. Field types defined via custom metadata types
CN108959549A (en) * 2018-06-29 2018-12-07 北京奇虎科技有限公司 Method for writing data, calculates equipment and computer storage medium at device
CN111782134A (en) * 2019-06-14 2020-10-16 北京京东尚科信息技术有限公司 Data processing method, device, system and computer readable storage medium
CN110837506A (en) * 2019-11-07 2020-02-25 中电福富信息科技有限公司 Mycat-based data fragmentation and read-write separation method and system
CN111125216A (en) * 2019-12-10 2020-05-08 中盈优创资讯科技有限公司 Method and device for importing data into Phoenix

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003660A (en) * 2021-11-05 2022-02-01 广州宸祺出行科技有限公司 Method and device for efficiently synchronizing real-time data to ClickHouse based on flash
CN114003660B (en) * 2021-11-05 2022-06-03 广州宸祺出行科技有限公司 Method and device for efficiently synchronizing real-time data to ClickHouse based on flash

Also Published As

Publication number Publication date
CN112364019B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
US11893018B2 (en) Dispersing data and parity across a set of segments stored via a computing system
US20170083573A1 (en) Multi-query optimization
US10353893B2 (en) Data partitioning and ordering
US10552393B2 (en) System and method for use of a dynamic flow in a multidimensional database environment
US11693912B2 (en) Adapting database queries for data virtualization over combined database stores
CN109241159B (en) Partition query method and system for data cube and terminal equipment
US20240143576A1 (en) Registering additional type systems using a hub data model for data processing
CN112364019B (en) Method and device for realizing fast data writing into ClickHouse by self-defined Spark data source
CN112912870A (en) Tenant identifier conversion
EP3158477B1 (en) Dynamic n-dimensional cubes for hosted analytics
CN114443680A (en) Database management system, related apparatus, method and medium
CN109947736B (en) Method and system for real-time computing
CN106599241A (en) Big data visual management method for GIS software
CN116719822B (en) Method and system for storing massive structured data
EP3293644B1 (en) Loading data for iterative evaluation through simd registers
WO2005041067A1 (en) Distributed memory type information processing system
Purdilă et al. Single‐scan: a fast star‐join query processing algorithm
US20220215021A1 (en) Data Query Method and Apparatus, Computing Device, and Storage Medium
CN111191106B (en) DSL construction method, system, electronic device and medium
US20170031982A1 (en) Maintaining Performance in the Presence of Insertions, Deletions, and Streaming Queries
CN112749197A (en) Data fragment refreshing method, device, equipment and storage medium
CN113868138A (en) Method, system, equipment and storage medium for acquiring test data
CN114391152A (en) Clustering process using business data
US10558637B2 (en) Modularized data distribution plan generation
US20240160614A1 (en) Method for fast and efficient computing of tree form-factors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder
CP02 Change in the address of a patent holder

Address after: Room 702-2, No. 4811 Caoan Road, Jiading District, Shanghai, 201800

Patentee after: CHINA UNITECHS

Address before: Room 1004-4, 10 / F, 1112 Hanggui Road, Anting Town, Jiading District, Shanghai 201800

Patentee before: CHINA UNITECHS