CA3140571A1 - Method and apparatus for clickhouse-based crowd selection - Google Patents
Method and apparatus for clickhouse-based crowd selection Download PDFInfo
- Publication number
- CA3140571A1 CA3140571A1 CA3140571A CA3140571A CA3140571A1 CA 3140571 A1 CA3140571 A1 CA 3140571A1 CA 3140571 A CA3140571 A CA 3140571A CA 3140571 A CA3140571 A CA 3140571A CA 3140571 A1 CA3140571 A1 CA 3140571A1
- Authority
- CA
- Canada
- Prior art keywords
- tag
- query
- clickhouse
- type
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 43
- 230000002776 aggregation Effects 0.000 claims abstract description 77
- 238000004220 aggregation Methods 0.000 claims abstract description 77
- 238000000638 solvent extraction Methods 0.000 claims abstract description 25
- 238000004590 computer program Methods 0.000 claims description 6
- 230000000007 visual effect Effects 0.000 claims description 6
- 238000012545 processing Methods 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 abstract 1
- 230000009286 beneficial effect Effects 0.000 description 5
- 238000004904 shortening Methods 0.000 description 3
- 238000005259 measurement Methods 0.000 description 2
- 230000002411 adverse Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000002349 favourable effect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0242—Determining effectiveness of advertisements
- G06Q30/0244—Optimization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- Finance (AREA)
- Computational Linguistics (AREA)
- Accounting & Taxation (AREA)
- Fuzzy Systems (AREA)
- Entrepreneurship & Innovation (AREA)
- Mathematical Physics (AREA)
- Game Theory and Decision Science (AREA)
- Probability & Statistics with Applications (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and apparatus for Clickhouse-based crowd selection relate to big data technologies, and satisfy requirements of the business side for fast and efficient computation of crowd packages.
The method involves: configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items; based on feature types of tag values, partitioning the CH
table into aggregation tables of different types; acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and by means of multithreading executing the Clickhouse query clauses parallelly, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results. The apparatus implements the described method.
The method involves: configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items; based on feature types of tag values, partitioning the CH
table into aggregation tables of different types; acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and by means of multithreading executing the Clickhouse query clauses parallelly, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results. The apparatus implements the described method.
Description
METHOD AND APPARATUS FOR CLICKHOUSE-BASED CROWD SELECTION
BACKGROUND OF THE INVENTION
Technical Field [0001] The present invention relates to the technical field of big data, and more particularly to a method and apparatus for Clickhouse-based crowd selection.
Description of Related Art
BACKGROUND OF THE INVENTION
Technical Field [0001] The present invention relates to the technical field of big data, and more particularly to a method and apparatus for Clickhouse-based crowd selection.
Description of Related Art
[0002] Currently, precise selection of crowd from massive internet data is critical to successful placement of advertising information. Usually, this can be achieved by tagging user characteristic behaviors, using the obtained tags as criteria to screen the target crowd, and placing real-time advertisement. Since advertising campaigns are usually audience specific, criteria used for screening different crowds are highly diverse.
[0003] Conventionally, off-line compute engines such as Elasticsearch and spark are used to determine this selection. However, as the business side has increasingly demanding requirements for computation performance in terms of crowd packages, these existing compute engines are becoming incompetent to satisfy users when it comes to quasi real-time performance. Besides, the data volume of selection criteria has now reached a scale of hundreds of millions, making data pulling extremely time-consuming, and this is adverse to business needs.
SUMMARY OF THE INVENTION
SUMMARY OF THE INVENTION
[0004] The objective of the present invention is to provide a method and an apparatus for Clickhouse-based crowd selection, which provide the business side with fast and efficient computation of crowd packages.
Date recue / Date received 202 1-1 1-26
Date recue / Date received 202 1-1 1-26
[0005] To achieve the foregoing objective, in a first aspect, the present invention provides a method for Clickhouse-based crowd selection, which comprises:
[0006] configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
[0007] partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values;
[0008] acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and
[0009] executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
[0010] Preferably, the step of configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items comprises:
[0011] configurating the tag information through a visual interface of a business system, wherein the tag information includes three kinds of configuration information items and the corresponding tag original-table data, wherein the three kinds of configuration information items are tag codes, tag names, and tag values; and
[0012] automatically creating spark tasks to process the tag original-table data, and introducing the tag original-table data into the CH table in the Clickhouse engine in formats of tag codes, tag names, and tag values.
[0013] More preferably, the step of based on feature types of tag values, partitioning the CH
table into aggregation tables of different types comprises:
table into aggregation tables of different types comprises:
[0014] dividing the feature types of the tag values into an int type, a double type, a string type, and a date type; and
[0015] according to the feature types of the tag values in the tag original-table data, partitioning Date recue / Date received 202 1-1 1-26 the CH table correspondingly into an it-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table.
[0016] Further, after the step of partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values, the method further comprises:
[0017] distributing the aggregation tables across nodes of the Clickhouse engine, for performing distributed query.
[0018] Preferably, the step of acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses comprises:
[0019] selecting the query tags through the visual interface of the business system to form the sql query criteria, and sending the sql query criteria to kafka; and
[0020] consuming sql query criteria data, receiving drop tables through SparkStreaming tasks, and converting the sql query criteria into the Clickhouse query clauses.
[0021] Preferably, the step of executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results comprises:
[0022] using the Spark tasks to execute the Clickhouse query clauses parallelly by means of multithreading, while acquiring detail data from the corresponding aggregation tables and storing the detail data into the Hive table, so as to obtain the crowd selection query results through computation.
[0023] Preferably, after the step of executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results, the method further comprises:
Date recue / Date received 202 1-1 1-26
Date recue / Date received 202 1-1 1-26
[0024] storing the detail data of the Hive table into redis, for the business system to use according to business needs.
[0025] As compared to the prior art, the method for Clickhouse-based crowd selection of the present invention has the following beneficial effects:
[0026] With the disclosed method for Clickhouse-based crowd selection, a user is enabled to configure tag information through a back end of a business system and create spark tasks.
Then the user can introduce original-table data of tags into a CH table of a Clickhouse engine according to configured information items. Afterward, according to feature types of tag values, a CH table in a Clickhouse engine can be partitioned into different types of aggregation table. Subsequently, the user can provide query tags to generate sql query criteria.
The criteria are then converted into Clickhouse query clauses. By means of multithreading, the Clickhouse query clauses are executed parallelly. Query detail data are acquired from aggregation tables of the corresponding types, and stored into a Hive table, so that crowd selection query results can be obtained.
Then the user can introduce original-table data of tags into a CH table of a Clickhouse engine according to configured information items. Afterward, according to feature types of tag values, a CH table in a Clickhouse engine can be partitioned into different types of aggregation table. Subsequently, the user can provide query tags to generate sql query criteria.
The criteria are then converted into Clickhouse query clauses. By means of multithreading, the Clickhouse query clauses are executed parallelly. Query detail data are acquired from aggregation tables of the corresponding types, and stored into a Hive table, so that crowd selection query results can be obtained.
[0027] It is thus clear that the present invention uses a concept of Clickhouse + Spark computation to introduce the computation process for criteria of tag combinations into a Clickhouse engine, and use the combination of the Spark compute engine and the Clickhouse compute engine to generate the crowd package, thereby shortening the time required by computation of detail data satisfying the criteria of tag combinations from minutes to seconds.
[0028] In a second aspect, the present invention provides an apparatus for Clickhouse-based crowd selection, which is applicable to the method for Clickhouse-based crowd selection of the foregoing technical scheme. The apparatus comprises:
[0029] a configuring unit, for configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
[0030] a table-partitioning unit, for partitioning the CH table into aggregation tables of different Date recue / Date received 202 1-1 1-26 types on the basis of feature types of tag values;
[0031] a query unit, for acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and
[0032] a processing unit, for executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
[0033] Preferably, the table-partitioning unit is for dividing the feature types of the tag values into an int type, a double type, a string type, and a date type, according to the feature types of the tag values in the tag original-table data, partitioning the CH table correspondingly into an it-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table.
[0034] As compared to the prior art, the disclosed apparatus for Clickhouse-based crowd selection provides beneficial effects that are similar to those provided by the disclosed method for Clickhouse-based crowd selection as enumerated above, and thus no repetitions are made herein.
[0035] In a third aspect, the present invention provides a computer-readable storage medium, storing thereon a computer program. When the computer program is executed by a processor, it implements the steps of the method for Clickhouse-based crowd selection as described previously.
[0036] As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclosed method for Clickhouse-based crowd selection as enumerated above, and thus no repetitions are made herein.
BRIEF DESCRIPTION OF THE DRAWINGS
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] The accompanying drawings are provided herein for better understanding of the present invention and form a part of this disclosure. The illustrative embodiments and their Date recue / Date received 202 1-1 1-26 descriptions are for explaining the present invention and by no means form any improper limitation to the present invention, wherein:
[0038] FIG. 1 is a flowchart of the method for Clickhouse-based crowd selection of one embodiment of the present invention;
[0039] FIG. 2 is a schematic drawing illustrating that a CH table is partitioned into four types of aggregation tables according to an embodiment of the present invention; and
[0040] FIG. 3 is another flowchart of the method for Clickhouse-based crowd selection of the embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
DETAILED DESCRIPTION OF THE INVENTION
[0041] To make the foregoing objectives, features, and advantages of the present invention clearer and more understandable, the following description will be directed to some embodiments as depicted in the accompanying drawings to detail the technical schemes disclosed in these embodiments. It is, however, to be understood that the embodiments referred herein are only a part of all possible embodiments and thus not exhaustive. Based on the embodiments of the present invention, all the other embodiments can be conceived without creative labor by people of ordinary skill in the art, and all these and other embodiments shall be embraced in the scope of the present invention.
Embodiment 1
Embodiment 1
[0042] Referring to FIG. 1 through FIG. 3, the present embodiment provides a method for Clickhouse-based crowd selection that comprises:
[0043] configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values; acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and by means of multithreading executing the Clickhouse query clauses parallelly, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive Date recue / Date received 202 1-1 1-26 table, and obtaining crowd selection query results.
partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values; acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and by means of multithreading executing the Clickhouse query clauses parallelly, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive Date recue / Date received 202 1-1 1-26 table, and obtaining crowd selection query results.
[0044] With the disclosed method for Clickhouse-based crowd selection, a user is enabled to configure tag information through a back end of a business system and create spark tasks.
Then the user can introduce original-table data of tags into a CH table of a Clickhouse engine according to configured information items. Afterward, according to feature types of tag values, a CH table in a Clickhouse engine can be partitioned into different types of aggregation table. Subsequently, the user can provide query tags to generate sql query criteria.
The criteria are then converted into Clickhouse query clauses. By means of multithreading, the Clickhouse query clauses are executed parallelly. Query detail data are acquired from aggregation tables of the corresponding types, and stored into a Hive table, so that crowd selection query results can be obtained.
Then the user can introduce original-table data of tags into a CH table of a Clickhouse engine according to configured information items. Afterward, according to feature types of tag values, a CH table in a Clickhouse engine can be partitioned into different types of aggregation table. Subsequently, the user can provide query tags to generate sql query criteria.
The criteria are then converted into Clickhouse query clauses. By means of multithreading, the Clickhouse query clauses are executed parallelly. Query detail data are acquired from aggregation tables of the corresponding types, and stored into a Hive table, so that crowd selection query results can be obtained.
[0045] It is thus clear that the present invention uses a concept of Clickhouse + Spark computation to introduce the computation process for criteria of tag combinations into a Clickhouse engine, and use the combination of the Spark compute engine and the Clickhouse compute engine to generate the crowd package, thereby shortening the time required by computation of detail data satisfying the criteria of tag combinations from minutes to seconds.
[0046] In the foregoing embodiment, the step of configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items comprises:
[0047] configurating the tag information through a visual interface of a business system, wherein the tag information includes three kinds of configuration information items and the corresponding tag original-table data, wherein the three kinds of configuration information items are tag codes, tag names, and tag values; and automatically creating spark tasks to process the tag original-table data, and introducing the tag original-table data into the CH
table in the Clickhouse engine in formats of tag codes, tag names, and tag values.
Date recue / Date received 202 1-1 1-26
table in the Clickhouse engine in formats of tag codes, tag names, and tag values.
Date recue / Date received 202 1-1 1-26
[0048] In a particular implementation, the information items may include three types, namely tag codes, tag names, and tag values. For instance, if it is desired to know preferences of male members about some merchandise item, the tag code, tag name, and tag value of the information items are set as labelname, labelvalue, and userid, respectively.
Then the tag original-table data are introduced into the CH table in the formats of the foregoing information items. For an example where one of the entries of the CH table is labelname =A, then labelvalue =male(1), userid=[1,2,3,4].
Then the tag original-table data are introduced into the CH table in the formats of the foregoing information items. For an example where one of the entries of the CH table is labelname =A, then labelvalue =male(1), userid=[1,2,3,4].
[0049] As a further example, in order to query a crowd package of 18-year-old male members, a user may, through the back end of the business system, visually configure a tag A=gender (labelname=A, labelvalue=1 representing male), and a tag B= age (labelname=B, labelvalue=18), so as to obtain the sql query criteria of "age=18 && gender =male." Then, data of male members and data of 18-year-old members are acquired from the corresponding aggregation tables. These data are assembled and computed before fed back to the user as the query results for crowd selection.
[0050] In the foregoing embodiment, the step of based on feature types of tag values, partitioning the CH table into aggregation tables of different types comprises:
[0051] dividing the feature types of the tag values into an int type, a double type, a string type, and a date type; and according to the feature types of the tag values in the tag original-table data, partitioning the CH table correspondingly into an it-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table.
[0052] As shown in FIG. 2, in a particular implementation, through the back end of the business system, the commonly used tag values are divided into four types, namely the int type, the double type, the string type, and the date type, so as to satisfy diverse business needs about query criteria. Then according to the feature types of the tag values in the tag original-table data, the CH table can correspondingly be partitioned into the it-type tag-value aggregation Date recue / Date received 202 1-1 1-26 table, the double-type tag-value aggregation table, the string-type tag-value aggregation table, and the date-type tag-value aggregation table. This is to say that all the tag values in the int-type tag-value aggregation table are of the int type, and all the tag values in the double-type tag-value aggregation table are of the double type, and so on.
[0053] For example, for checking the access count, the relevant tag values are obviously of the int type, and the conditional operators for crowd package data are count days>= '15' and count days<'30'. In other words, it is to query for the access count generated between the date of 15 and the date of 30.
[0054] For example, for checking the coordinate locations of users, the relevant tag values are obviously of the double type. This type is suitable for queries of coordinate locations, and helps select a member crowd in a certain latitude-longitude coordinate range.
[0055] For example, for checking member levels of users, the relevant tag values are obviously of the string type. This type is favorable to precise query matching.
[0056] For example, for checking access dates of users, the relevant tag values are obviously of the date type. This type is suitable for comparison of dates, such as criteria for crowd packages like a first-for-purchase time later than a certain date.
[0057] To satisfy business needs, in the present embodiment, four types of CH
aggregation tables are introduced, with the following table structures:
Field Type Description labelname String Tag Code labelvalue String/Int64/double/Date Tag Value uv AggregateFunction(groupBitmap, Bitmap generated from unique UInt64) aggregation function numbers of members Date recue / Date received 202 1-1 1-26
aggregation tables are introduced, with the following table structures:
Field Type Description labelname String Tag Code labelvalue String/Int64/double/Date Tag Value uv AggregateFunction(groupBitmap, Bitmap generated from unique UInt64) aggregation function numbers of members Date recue / Date received 202 1-1 1-26
[0058] As can be seen, all labelname fields are of the String type. This is usually used to express a tag code. The labelvalue fields may be of the String type, the Int type, the double type, or the Date type, usually for expressing a tag value. uv is usually used to express a tag name.
[0059] With the four types of aggregation tables, data can be acquired from corresponding types of CH aggregation tables according to business needs automatically for subsequent computation, thereby significantly improving computation performance of the system.
[0060] For example, for tags related to the number of days users sleep for, the criteria are described as users sleeping for more than a certain number of days or fewer than a certain number of days, i.e., count days>= '15' and count days<'30'. The aggregation tables of the Int type are correspondingly selected for data acquiring and to support data operators.
[0061] In the foregoing embodiment, after the step of partitioning the CH
table into aggregation tables of different types on the basis of feature types of tag values, the method further comprises: distributing the aggregation tables across nodes of the Clickhouse engine, for performing distributed query.
table into aggregation tables of different types on the basis of feature types of tag values, the method further comprises: distributing the aggregation tables across nodes of the Clickhouse engine, for performing distributed query.
[0062] In a particular implementation, since the aggregation tables are local tables distributed across nodes of the Clickhouse engine, a distributed table directing to the aggregation tables at the nodes may be used to realize distributed query computation. As demonstrated in tests, when tag data were introduced into a Clickhouse engine from an original table (such as HDFS), the standalone performance was 1 million/sec, showing linear expansion in terms of performance. It was also found through actual measurement that 50 billion data could be introduced into the CH table in one hour. The measurement was performed in a Clickhouse cluster environment, with the following machine configuration: 32 cores, 256G, 4T drive, using a cluster of 6 machines.
Date recue / Date received 202 1-1 1-26
Date recue / Date received 202 1-1 1-26
[0063] In the foregoing embodiment, the step of acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses comprises:
[0064] selecting the query tags through the visual interface of the business system to form the sql query criteria, and sending the sql query criteria to kafka; and consuming sql query criteria data, receiving drop tables through SparkStreaming tasks, and converting the sql query criteria into the Clickhouse query clauses.
[0065] In the foregoing embodiment, the step of executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results comprises:
[0066] using the Spark tasks to execute the Clickhouse query clauses parallelly by means of multithreading, while acquiring detail data from the corresponding aggregation tables and storing the detail data into the Hive table, so as to obtain the crowd selection query results through computation.
[0067] In the foregoing embodiment, after the step of executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results, the method further comprises:
[0068] storing the detail data of the Hive table into redis, for the business system to use according to business needs.
[0069] Referring to FIG. 3, for easy understanding, an example is used to explain the scheme as discussed in the foregoing embodiment:
[0070] A user may select query tags through the page of the business system to generate sql query criteria and send them to kafka. SparkStreaming consumes the corresponding kafka data and converts sql into Clickhouse query clauses that are then stored into Mysql. Spark Date recue / Date received 202 1-1 1-26 tasks read Clickhouse query clauses from Mysql. Afterward, by means of multithreading, the Clickhouse query clauses are executed parallelly to query detail data from the CH aggregation tables corresponding to the distributed nodes. The detail data are then stored into a Hive table for the business system to query and use, thereby outputting crowd selection query results.
In addition, the detail data in the hive table may be transferred to redis according to business needs, so as to provide the business system with other query services.
In addition, the detail data in the hive table may be transferred to redis according to business needs, so as to provide the business system with other query services.
[0071] To sum up, the present embodiment has the following innovations:
1. The present invention uses a concept of Clickhouse + Spark computation to introduce the computation process for criteria of tag combinations into a Clickhouse engine, and use the combination of the Spark compute engine and the Clickhouse compute engine to generate the crowd package, thereby shortening the time required by computation of detail data satisfying the criteria of tag combinations from minutes to seconds.
2. Addition or removal of tags can be visually configured through the back end of the business system, thereby eliminating hassles about complicated operations for adding or removing tags.
3. The present invention leverages the benefits of Clickhouse to solve issues about aggregation performance related to large volume of data, thereby providing better user experience.
4. The present invention saves hardware resources. According to early-stage assessment, Clickhouse only needs half of the hardware computation resources required by elasticsearch. As to storage capacity, for identical data, one Clickhouse machine is equivalent to 27 elasticsearch machines.
Embodiment 2
1. The present invention uses a concept of Clickhouse + Spark computation to introduce the computation process for criteria of tag combinations into a Clickhouse engine, and use the combination of the Spark compute engine and the Clickhouse compute engine to generate the crowd package, thereby shortening the time required by computation of detail data satisfying the criteria of tag combinations from minutes to seconds.
2. Addition or removal of tags can be visually configured through the back end of the business system, thereby eliminating hassles about complicated operations for adding or removing tags.
3. The present invention leverages the benefits of Clickhouse to solve issues about aggregation performance related to large volume of data, thereby providing better user experience.
4. The present invention saves hardware resources. According to early-stage assessment, Clickhouse only needs half of the hardware computation resources required by elasticsearch. As to storage capacity, for identical data, one Clickhouse machine is equivalent to 27 elasticsearch machines.
Embodiment 2
[0072] The present embodiment provides a Clickhouse-based crowd selection apparatus, which comprises:
[0073] a configuring unit, for configurating tag information to create spark tasks, and Date recue / Date received 202 1-1 1-26 introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
[0074] a table-partitioning unit, for based on feature types of tag values, partitioning the CH
table into aggregation tables of different types;
table into aggregation tables of different types;
[0075] a query unit, for acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and
[0076] a processing unit, for executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
[0077] Preferably, the table-partitioning unit is for dividing the feature types of the tag values into an int type, a double type, a string type, and a date type, according to the feature types of the tag values in the tag original-table data, partitioning the CH table correspondingly into an it-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table.
[0078] As compared to the prior art, the disclosed apparatus for Clickhouse-based crowd selection provides beneficial effects that are similar to those provided by the method for Clickhouse-based crowd selection as enumerated above, and thus no repetitions are made herein.
Embodiment 3
Embodiment 3
[0079] The present embodiment provides a computer-readable storage medium, storing thereon a computer program. When the computer program is executed by a processor, it implements the steps of the method for Clickhouse-based crowd selection as described previously.
[0080] As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclose method as enumerated above, and thus no repetitions are made herein.
Date recue / Date received 202 1-1 1-26
Date recue / Date received 202 1-1 1-26
[0081] As will be appreciated by people of ordinary skill in the art, implementation of all or a part of the steps of the method of the present invention as described previously may be realized by having a program instruct related hardware components. The program may be stored in a computer-readable storage medium, and the program is about performing the individual steps of the methods described in the foregoing embodiments. The storage medium may be a ROM/RAM, a magnetic disk, an optical disk, a memory card or the like.
[0082] The present invention has been described with reference to the preferred embodiments and it is understood that the embodiments are not intended to limit the scope of the present invention. Moreover, as the contents disclosed herein should be readily understood and can be implemented by a person skilled in the art, all equivalent changes or modifications which do not depart from the concept of the present invention should be encompassed by the appended claims. Hence, the scope of the present invention shall only be defined by the appended claims.
Date recue / Date received 202 1-1 1-26
Date recue / Date received 202 1-1 1-26
Claims (10)
1. A method for Clickhouse-based crowd selection, comprising:
configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values;
acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values;
acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
2. The method of claim 1, wherein the step of configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items comprises:
configurating the tag information through a visual interface of a business system, wherein the tag information includes three kinds of configuration information items and the corresponding tag original-table data, wherein the three kinds of configuration information items are tag codes, tag names, and tag values; and automatically creating spark tasks to process the tag original-table data, and introducing the tag Date recue / Date received 2021-11-26 original-table data into the CH table in the Clickhouse engine in formats of tag codes, tag names, and tag values.
configurating the tag information through a visual interface of a business system, wherein the tag information includes three kinds of configuration information items and the corresponding tag original-table data, wherein the three kinds of configuration information items are tag codes, tag names, and tag values; and automatically creating spark tasks to process the tag original-table data, and introducing the tag Date recue / Date received 2021-11-26 original-table data into the CH table in the Clickhouse engine in formats of tag codes, tag names, and tag values.
3. The method of claim 2, wherein the step of partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values comprises:
dividing the feature types of the tag values into an int type, a double type, a string type, and a date type; and partitioning the CH table correspondingly into an int-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table according to the feature types of the tag values in the tag original-table data.
dividing the feature types of the tag values into an int type, a double type, a string type, and a date type; and partitioning the CH table correspondingly into an int-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table according to the feature types of the tag values in the tag original-table data.
4. The method of claim 3, wherein after the step of partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values, the method further comprises:
distributing the aggregation tables across nodes of the Clickhouse engine, for performing distributed query.
distributing the aggregation tables across nodes of the Clickhouse engine, for performing distributed query.
5. The method of claim 4, wherein the step of acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses comprises:
selecting the query tags through the visual interface of the business system to form the sql query criteria, and sending the sql query criteria to kafka; and consuming sql query criteria data, receiving drop tables through SparkStreaming tasks, and converting the sql query criteria into the Clickhouse query clauses.
Date recue / Date received 2021-11-26
selecting the query tags through the visual interface of the business system to form the sql query criteria, and sending the sql query criteria to kafka; and consuming sql query criteria data, receiving drop tables through SparkStreaming tasks, and converting the sql query criteria into the Clickhouse query clauses.
Date recue / Date received 2021-11-26
6. The method of any of claims 1 through 5, wherein the step of executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results comprises:
using the Spark tasks to execute the Clickhouse query clauses parallelly by means of multithreading, while acquiring detail data from the corresponding aggregation tables and storing the detail data into the Hive table, so as to obtain the crowd selection query results through computation.
using the Spark tasks to execute the Clickhouse query clauses parallelly by means of multithreading, while acquiring detail data from the corresponding aggregation tables and storing the detail data into the Hive table, so as to obtain the crowd selection query results through computation.
7. The method of claim 1, wherein after the step of executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results, the method further comprises:
storing the detail data of the Hive table into redis, for the business system to use according to business needs.
storing the detail data of the Hive table into redis, for the business system to use according to business needs.
8. An apparatus for Clickhouse-based crowd selection, comprising:
a configuring unit, for configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
a table-partitioning unit, for partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values;
a query unit, for acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and Date recue / Date received 2021-11-26 a processing unit, for executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
a configuring unit, for configurating tag information to create spark tasks, and introducing original-table data of tags into a CH table of a Clickhouse engine according to configured information items;
a table-partitioning unit, for partitioning the CH table into aggregation tables of different types on the basis of feature types of tag values;
a query unit, for acquiring query tags from a user to generate sql query criteria, and converting the sql query criteria into Clickhouse query clauses; and Date recue / Date received 2021-11-26 a processing unit, for executing the Clickhouse query clauses parallelly by means of multithreading, acquiring query detail data from the aggregation tables of the corresponding types, storing the detail data into a Hive table, and obtaining crowd selection query results.
9. The apparatus of claim 8, wherein the table-partitioning unit is for dividing the feature types of the tag values into an int type, a double type, a string type, and a date type, and partitioning the CH table correspondingly into an int-type tag-value aggregation table, a double-type tag-value aggregation table, a string-type tag-value aggregation table, and a date-type tag-value aggregation table according to the feature types of the tag values in the tag original-table data.
10. A computer-readable storage medium storing therein a computer program, wherein the computer program when executed by a processor performing a method as described in any of claims 1 through 7.
Date recue / Date received 2021-11-26
Date recue / Date received 2021-11-26
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011347988.0 | 2020-11-26 | ||
CN202011347988.0A CN112396462B (en) | 2020-11-26 | 2020-11-26 | Crowd circling method and device based on click house |
Publications (1)
Publication Number | Publication Date |
---|---|
CA3140571A1 true CA3140571A1 (en) | 2022-05-26 |
Family
ID=74605283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA3140571A Pending CA3140571A1 (en) | 2020-11-26 | 2021-11-26 | Method and apparatus for clickhouse-based crowd selection |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112396462B (en) |
CA (1) | CA3140571A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115840772A (en) * | 2022-11-11 | 2023-03-24 | 中电金信软件有限公司 | Passenger group data statistical method and device, electronic equipment and storage medium |
CN118069680A (en) * | 2024-02-02 | 2024-05-24 | 天翼爱音乐文化科技有限公司 | Data processing method and system, electronic equipment and storage medium |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113590923A (en) * | 2021-07-28 | 2021-11-02 | 深圳市酷开网络科技股份有限公司 | Crowd delineating task splitting method, device, equipment and storage medium |
CN113724007A (en) * | 2021-08-31 | 2021-11-30 | 平安国际智慧城市科技股份有限公司 | Patient population selection method, device, equipment and computer-readable storage medium |
CN114549042B (en) * | 2022-01-10 | 2022-10-18 | 北京元年科技股份有限公司 | Cost sharing method and system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462317A (en) * | 2014-12-01 | 2015-03-25 | 苏州朗米尔照明科技有限公司 | Data screening method based on tag |
CN109937413B (en) * | 2016-07-20 | 2023-04-21 | 深圳市东信时代信息技术有限公司 | Processing method and system for massive crowd characteristic data |
CN110968582B (en) * | 2019-11-01 | 2022-12-30 | 苏宁云计算有限公司 | Crowd generation method and device |
-
2020
- 2020-11-26 CN CN202011347988.0A patent/CN112396462B/en active Active
-
2021
- 2021-11-26 CA CA3140571A patent/CA3140571A1/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115840772A (en) * | 2022-11-11 | 2023-03-24 | 中电金信软件有限公司 | Passenger group data statistical method and device, electronic equipment and storage medium |
CN118069680A (en) * | 2024-02-02 | 2024-05-24 | 天翼爱音乐文化科技有限公司 | Data processing method and system, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112396462B (en) | 2022-11-22 |
CN112396462A (en) | 2021-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA3140571A1 (en) | Method and apparatus for clickhouse-based crowd selection | |
CN103177061B (en) | Unique value estimation in partition table | |
US9251212B2 (en) | Profiling in a massive parallel processing environment | |
CN109597974B (en) | Report generation method and device | |
JP2011054159A (en) | System to modify website for organic search optimization | |
CN106940721B (en) | Data processing method and system for user-defined selection box | |
CN112115113B (en) | Data storage system, method, device, equipment and storage medium | |
CN108021629B (en) | Method, device and system for generating advertisement material data | |
CN112256977A (en) | Componentized commodity recommendation method, computing device and computer storage medium | |
CN112732763A (en) | Data aggregation method and device, electronic equipment and medium | |
CN109033157B (en) | Complex data retrieval method and system based on user-defined search condition tree | |
CN117472965A (en) | Database data query method and device, electronic equipment and storage medium | |
CN110297945B (en) | Data information processing method and system based on XBRL | |
US20190340179A1 (en) | Result set output criteria | |
CN108874856B (en) | Page generation system, method and server | |
CN112948441B (en) | Multi-dimensional data collection method and equipment for financial data | |
CN108197164A (en) | Business data storage method and device | |
CN114840551A (en) | Database table processing method and device, electronic equipment and storage medium | |
CN114201157A (en) | Method and system for customizing target service module by low code | |
CN113836379A (en) | Intelligent recommendation method and system based on customer image | |
CN112783633A (en) | Data updating system and method based on resource mutual exclusion scheduling model | |
CN111427922A (en) | Data analysis method, system, device and storage medium based on distributed architecture | |
CN112784195A (en) | Page data publishing method and system | |
CN113485799B (en) | Method, device, equipment and storage medium for executing same timing task by multiple libraries | |
US20080120265A1 (en) | System and method for processing data elements |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request |
Effective date: 20220916 |
|
EEER | Examination request |
Effective date: 20220916 |
|
EEER | Examination request |
Effective date: 20220916 |
|
EEER | Examination request |
Effective date: 20220916 |
|
EEER | Examination request |
Effective date: 20220916 |