CN113360529A - Data query method and device for Kylin cluster - Google Patents

Info

Publication number
CN113360529A
Authority
CN
China
Prior art keywords
cube
data
dimension
historical
queried
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110609692.XA
Other languages
Chinese (zh)
Inventor
劳天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110609692.XA priority Critical patent/CN113360529A/en
Publication of CN113360529A publication Critical patent/CN113360529A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a data query method and device for a Kylin cluster, and relates to the technical field of computers. A specific implementation of the method comprises: acquiring data to be processed; determining, by using a dimension determination model, one or more dimensions included in the data to be processed, and determining the cube corresponding to the one or more dimensions in the Kylin cluster, wherein the dimension determination model is trained on historical cube parameters and cube generation results; receiving a data query request, wherein the data query request indicates a dimension to be queried; and querying data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried. The method and device can reduce interference from human factors in the cube creation process, obtain an optimal cube configuration, apply broadly, reduce testing time and resource cost, and improve data processing efficiency.

Description

Data query method and device for Kylin cluster
Technical Field
The invention relates to the technical field of computers, in particular to a data query method and device for a Kylin cluster.
Background
Apache Kylin is an open-source, distributed analytical data warehouse that supports real-time and offline processing and analysis of large-scale data. In a Kylin cluster, large-scale data are aggregated by creating a cube, a computing unit that combines multiple dimensions, and the computed results are stored in an HBase database to support SQL queries.
In the existing cube creation process, a creator determines the configuration parameters of the cube from experience; after the cube is created, cubes under different configurations are tested with various data to determine whether the created cube is suitable.
The existing approach depends on the creator's personal experience. Creators vary in skill and turnover is high, so a stable and optimal cube cannot be guaranteed and the approach does not generalize; repeated testing of the cube also consumes considerable time and resources.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data query method and apparatus for a Kylin cluster, which can reduce human factor interference in a cube creation process, obtain an optimal cube configuration, have strong applicability, reduce test time and resource cost, and improve data processing efficiency.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a data query method for a Kylin cluster, including:
acquiring data to be processed;
determining one or more dimensions included in the data to be processed by using a dimension determination model, and determining cube corresponding to the one or more dimensions in the Kylin cluster; the dimension determination model is obtained by training according to historical cube parameters and cube generation results;
receiving a data query request, wherein the data query request indicates a dimension to be queried;
and querying data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried.
Optionally, the training step of the dimension determination model includes:
acquiring multiple groups of historical cube parameters; the historical cube parameters comprise historical hive parameters, historical cube dimensions, historical cube calculation indexes and historical hbase parameters;
and taking the multiple groups of historical cube parameters as the input of the dimension determination model, and taking whether the cube is successfully generated and/or the time required for cube generation as the output of the dimension determination model, so as to train the dimension determination model.
Optionally, the method further comprises: generating an sql statement according to the dimension to be queried;
and querying the target cube corresponding to the dimension to be queried according to the sql statement.
Optionally, the historical cube parameter further includes: one or more sets of historical sql parameters corresponding to each set of the historical cube parameters.
Optionally, the querying, according to the target cube corresponding to the dimension to be queried, data corresponding to the dimension to be queried from a database includes:
determining whether aggregated data corresponding to the dimension to be queried exists in an HBase database according to the target cube;
if so, acquiring the aggregated data as data corresponding to the dimension to be queried according to the target cube;
if not, determining target data corresponding to the target cube, aggregating the target data by using the target cube, and taking the aggregation result as the data corresponding to the dimension to be queried.
Optionally, the method further comprises:
and pre-aggregating the data to be processed of the corresponding dimension by using the cube, and storing the pre-aggregation result in an HBase database.
Optionally, the determining the cube corresponding to the one or more dimensions in the Kylin cluster includes:
and determining whether the cube corresponding to the dimension exists in the Kylin cluster, and if not, generating the cube corresponding to the dimension.
Optionally, the dimension determination model is trained based on a logistic algorithm or a regression algorithm.
According to still another aspect of the embodiments of the present invention, there is provided a data query apparatus for a Kylin cluster, including:
the acquisition module is used for acquiring data to be processed;
the data processing module is used for determining one or more dimensions included in the data to be processed by using a dimension determination model and determining cube corresponding to the one or more dimensions in the Kylin cluster; the dimension determination model is obtained by training according to historical cube parameters and cube generation results;
the receiving module is used for receiving a data query request, wherein the data query request indicates a dimension to be queried;
and the query module is used for querying the data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried.
According to another aspect of the embodiments of the present invention, there is provided a data query electronic device of a Kylin cluster, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the data query method of the Kylin cluster provided by the present invention.
According to still another aspect of an embodiment of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the data query method of the Kylin cluster provided by the present invention.
One embodiment of the above invention has the following advantages or benefits: the dimensions and calculation indexes are determined by the dimension determination model, and the cube is then generated from them. This overcomes the technical problems that the existing cube creation approach depends on personal experience, produces cubes that are unstable and not optimal, does not generalize, and consumes considerable testing time and resources. It thereby achieves the technical effects of reducing interference from human factors in the cube creation process, obtaining an optimal cube configuration, applying broadly, reducing testing time and resource cost, and improving data processing efficiency.
Further effects of the above optional implementations will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 shows an exemplary system architecture to which the data query method or data query device for a Kylin cluster of the present invention may be applied;
FIG. 2 is a schematic diagram of a main flow of a data query method of a Kylin cluster according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main flow of a dimension determination model training method according to a first embodiment of the invention;
FIG. 4 is a schematic diagram of a main flow of a data query method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the main flow of a dimension determination model training method according to a second embodiment of the present invention;
FIG. 6 is a schematic diagram of the main modules of a data query device of a Kylin cluster according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows an exemplary system architecture to which the data query method or data query device for a Kylin cluster according to an embodiment of the present invention may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a data query application, a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background management server that provides support for data query websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data, such as a data query request, and feed back the processing result (for example, target order data) to the terminal devices 101, 102, 103.
It should be noted that the data query method of the Kylin cluster provided by the embodiment of the present invention is generally executed by the server 105, and accordingly, the data query device of the Kylin cluster is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 is a schematic diagram of a main flow of a data query method for a Kylin cluster according to an embodiment of the present invention, and as shown in fig. 2, the data query method for the Kylin cluster of the present invention includes:
Step S201, data to be processed is acquired.
In the embodiment of the invention, so that the required data can be queried, the data to be processed is first acquired and processed, and the processing result is stored so that a user can query against it. For example, the data to be processed is the order data of a certain commodity.
In the embodiment of the present invention, the data source of the data to be processed may be a Hive table.
Step S202, determining one or more dimensions included in the data to be processed by using a dimension determination model, and determining cube corresponding to the one or more dimensions in the Kylin cluster; the dimension determination model is obtained by training according to historical cube parameters and cube generation results.
In the embodiment of the invention, after the data to be processed is obtained, the dimension determination model is used to determine one or more dimensions included in the data to be processed and the calculation indexes corresponding to those dimensions, and one or more cubes corresponding to the one or more dimensions are generated in the Kylin cluster according to the determined dimensions and calculation indexes. The data to be processed is then calculated according to the dimensions and calculation indexes of each cube to obtain processing results under the different dimensions. The processing result of each cube is stored in a database, so that a user can query the data processing result of the cube corresponding to a given dimension and calculation index. A cube can be understood as the result obtained by calculating data in the Hive table according to one or more specified dimensions and calculation indexes.
In the embodiment of the present invention, for example, the data to be processed is the order data of a certain commodity, and the order data is determined to include the following dimensions: total order amount, number of orders in each city, gender of the ordering user, age of the ordering user, ordering time period of the user, access source (such as sharing, query and the like), and so on. The calculation indexes corresponding to the dimensions are: summation for the total order amount; screening + summation for the number of orders in each city; screening + summation + percentage for the gender of the ordering user; screening + curve fitting for the age of the ordering user; screening + piecewise curve fitting for the ordering time period of the user; screening + summation for the access source; and so on. According to the determined dimensions and their calculation indexes, the cube corresponding to the dimensions is generated in the Kylin cluster. One cube may include the results calculated for all dimensions and their calculation indexes, for example a single cube containing the calculation results for the total order amount, number of orders in each city, gender of the ordering user, age of the ordering user, ordering time period of the user, access source, and so on. Alternatively, each cube may correspond to the result calculated for one dimension and its calculation index, for example a total order amount cube, an orders-per-city cube, an ordering user gender cube, an ordering user age cube, an ordering time period cube, an access source cube, and so on.
The data to be processed is calculated according to the dimensions and calculation indexes of each cube to obtain processing results under the different dimensions, including: the total order amount result, the orders-per-city result, the ordering user gender result, the ordering user age result, the ordering time period result, the access source result, and the like. The processing result of each cube is stored in a database, so that a user can query the data processing result for the dimension corresponding to the cube. For example, if a user wants to query the total order amount of a certain commodity, the total order amount result of that commodity is queried and a total of 25000 is obtained; if a user wants to query the number of orders for a certain commodity in Wuhan, the orders-per-city result of that commodity is queried and 483 orders are obtained for Wuhan; if a user wants to query the male-to-female ratio among users of a certain commodity, the ordering user gender result of that commodity is queried and a ratio of 6:4 is obtained.
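The per-dimension pre-computation described above can be illustrated with a minimal sketch. The record fields (`city`, `gender`, `amount`), the sample rows, and the `build_cube` helper are hypothetical and are not part of Kylin; a real cube would be built by the Kylin cluster and stored in HBase rather than in a Python dictionary.

```python
from collections import defaultdict

# Hypothetical order records standing in for rows of a Hive table.
orders = [
    {"city": "Wuhan", "gender": "M", "amount": 120.0},
    {"city": "Wuhan", "gender": "F", "amount": 80.0},
    {"city": "Beijing", "gender": "M", "amount": 200.0},
]

def build_cube(rows, dimension, measure):
    """Pre-aggregate rows by one dimension: order count and sum of the measure."""
    cube = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        cell = cube[row[dimension]]
        cell["count"] += 1
        cell["total"] += row[measure]
    return dict(cube)

# One cube per dimension, as in the "each cube corresponds to one dimension" variant.
city_cube = build_cube(orders, "city", "amount")
gender_cube = build_cube(orders, "gender", "amount")

# A query for "number of orders in Wuhan" becomes a lookup in the stored result.
wuhan_orders = city_cube["Wuhan"]["count"]
```

Once the result is stored, answering a query costs a lookup rather than a scan of the raw order data.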
In the embodiment of the invention, the processing result processed according to the dimension corresponding to the cube and the calculation index can be stored in the Hbase database, so that a user can query through the Hbase database.
In existing kylin clusters, cubes are mainly created by an expert-driven approach that relies on personal experience, usually that of an OLAP (Online Analytical Processing) technical expert. When creating a cube, the expert determines the dimensions of the data to be processed from related information such as the data query range and system resources, drawing on extensive experience accumulated while using kylin, and the cube is then created according to those dimensions. However, a cube created entirely from personal experience is not necessarily reasonable; personal experience is also difficult to preserve, so staff turnover greatly affects the use of the kylin cluster.
According to the method, the dimensions of the data to be processed and their corresponding calculation indexes are determined by the dimension determination model, and the cube is created according to them. This reduces the intervention of human factors in cube creation in the kylin cluster, avoids excessive dependence on personal experience, and gives the method a wide range of application.
In the embodiment of the invention, before the cube corresponding to the determined dimension is generated according to the determined dimension and the calculation index corresponding to the dimension, whether the cube with the same dimension exists in the Kylin cluster is judged, if yes, the data to be processed is directly sent to the existing cube, and the data to be processed is calculated; and if not, generating a cube corresponding to the determined dimension in the Kylin cluster, and calculating the data to be processed.
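The check-then-create logic in this paragraph might be sketched as follows. `CubeRegistry` and its dictionary-based cube records are hypothetical stand-ins for the Kylin cluster's own cube metadata; treating dimension order as irrelevant is also an assumption of this sketch.

```python
class CubeRegistry:
    """Toy in-memory stand-in for the set of cubes known to a cluster."""

    def __init__(self):
        self._cubes = {}

    def get_or_create(self, dimensions, measures):
        """Return (cube, created); reuse a cube that has the same dimensions."""
        key = frozenset(dimensions)  # assumption: dimension order does not matter
        if key in self._cubes:
            return self._cubes[key], False
        cube = {"dimensions": sorted(dimensions), "measures": list(measures)}
        self._cubes[key] = cube
        return cube, True

registry = CubeRegistry()
first, created_first = registry.get_or_create(["city", "gender"], ["sum"])
second, created_second = registry.get_or_create(["gender", "city"], ["sum"])
```

Here the second call reuses the cube created by the first, so the data to be processed would be sent to the existing cube instead of triggering a new build.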
In the embodiment of the invention, the dimension determination model can be obtained by training according to the historical cube parameters and cube generation results.
As shown in fig. 3, a first embodiment of the present invention discloses a dimension determination model training method, which includes the following steps:
Step 301, acquiring data of multiple sets of historical cube parameters. The historical cube parameters comprise historical hive parameters, historical cube dimensions, historical cube calculation indexes and historical hbase parameters.
In this embodiment of the present invention, the data of the historical cube parameter is from the historical log of the kylin cluster, and may include: historical hive parameters, historical cube dimensions, historical cube calculation indexes and historical hbase parameters. The historical hive parameters comprise the size of the hive table, the number of partitions of the hive table, the partition size of the hive table, the data type (such as commodity data, person data, geographic data, resource data, system configuration data, safety data and the like) of the hive table and the like; the historical cube dimensions include cube dimensions corresponding to the data types of the hive table, for example, if the data types of the hive table are commodity data, the cube dimensions include total orders, order quantity of each city, gender of an order placing user, age of the order placing user, order placing time period of the user, access source and the like; the historical cube calculation indexes comprise calculation methods of data corresponding to the cube dimensionality, such as screening/query, addition, subtraction, multiplication and division, maximum value, minimum value, average value, variance, percentage, sorting, extraction, replacement, rounding, citation, linear fitting, curve fitting and the like; the historical hbase parameters comprise column values of the hbase database, the number of fragments of the hbase database, the size of the fragments of the hbase database and the like.
In this embodiment of the present invention, the history cube parameter may further include: a historical map parameter and a historical reduce parameter. Apache Kylin converts the fact table and the dimension table stored in Hive into Cube through MapReduce, specifically, generates a dimension dictionary through MapReduce by using data in the Hive table, and further generates data of each dimension combination. The historical map parameters comprise the number of maps, the types of map functions and the like; the historical reduce parameters include the number of reduces, the type of reduce function, etc.
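As a rough illustration of building a dimension dictionary in map and reduce phases, the sketch below emits (dimension, value) pairs and then assigns an integer ID to each distinct value. This is a simplified single-process analogy under assumed names (`map_phase`, `reduce_phase`, the sample rows); it is not Kylin's actual MapReduce job or dictionary encoding.

```python
def map_phase(rows, dimensions):
    """Map step: emit one (dimension, value) pair per dimension column of each row."""
    for row in rows:
        for dim in dimensions:
            yield dim, row[dim]

def reduce_phase(pairs):
    """Reduce step: collect distinct values per dimension and assign integer IDs."""
    distinct = {}
    for dim, value in pairs:
        distinct.setdefault(dim, set()).add(value)
    return {
        dim: {value: idx for idx, value in enumerate(sorted(values))}
        for dim, values in distinct.items()
    }

# Hypothetical rows standing in for a Hive fact table.
rows = [
    {"city": "Wuhan", "gender": "M"},
    {"city": "Beijing", "gender": "F"},
    {"city": "Wuhan", "gender": "F"},
]
dictionary = reduce_phase(map_phase(rows, ["city", "gender"]))
```

The number of map and reduce tasks used for such a job is the kind of value the historical map and reduce parameters would record.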
Step 302, obtaining the historical cube generation results corresponding to the multiple sets of historical cube parameters. The historical cube generation results include whether the cube was successfully generated and the time required to generate the cube.
In this embodiment of the present invention, the historical cube generation result data comes from the history log of the kylin cluster and may include: whether the cube was successfully generated and the time required to generate the cube.
Step 303, data cleaning.
In this embodiment of the present invention, after acquiring data of a history cube parameter and a history cube generation result from a history log of a kylin cluster, the acquired data is cleaned, and the step of cleaning the data may include:
Step 3031, cleaning out invalid data generated during testing or when faults occurred, and retaining the historical cube parameters and historical cube generation results from actual use.
Step 3032, cleaning out the data of historical cube parameters for which the size of the hive table is smaller than a preset hive table size threshold.
Step 3033, cleaning out the data of historical cube parameters and historical cube generation results for which the number of dimensions corresponding to the data in the hive table is smaller than a preset dimension threshold.
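The three cleaning rules above can be sketched as a single filter. The record layout (`is_test`, `had_fault`, `hive_table_mb`, `dimensions`) and the threshold values are hypothetical; the source does not specify how log records are structured or what thresholds are used.

```python
def clean_records(records, min_hive_table_mb, min_dimensions):
    """Keep only records that pass the three cleaning rules."""
    cleaned = []
    for rec in records:
        # Step 3031: drop data produced during testing or during a fault.
        if rec.get("is_test") or rec.get("had_fault"):
            continue
        # Step 3032: drop records whose hive table is below the size threshold.
        if rec["hive_table_mb"] < min_hive_table_mb:
            continue
        # Step 3033: drop records with fewer dimensions than the threshold.
        if len(rec["dimensions"]) < min_dimensions:
            continue
        cleaned.append(rec)
    return cleaned

# Hypothetical log records and thresholds, for illustration only.
history = [
    {"id": 1, "hive_table_mb": 500, "dimensions": ["city", "gender"], "is_test": False},
    {"id": 2, "hive_table_mb": 500, "dimensions": ["city", "gender"], "is_test": True},
    {"id": 3, "hive_table_mb": 5, "dimensions": ["city", "gender"], "is_test": False},
    {"id": 4, "hive_table_mb": 500, "dimensions": ["city"], "is_test": False},
    {"id": 5, "hive_table_mb": 800, "dimensions": ["city", "age"], "had_fault": True},
]
kept = clean_records(history, min_hive_table_mb=100, min_dimensions=2)
```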
Step 304, data formatting.
In the embodiment of the invention, after the cleaning of the historical cube parameter data and the historical cube generation result data is completed, the data is formatted; it can be formatted into JSON strings and written to files for storage. Formatting unifies the format of the model input data and reduces error fluctuation in model training.
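One possible form of this formatting step is one JSON string per record, written line by line. The field names are hypothetical, and an in-memory `io.StringIO` buffer stands in for the file so that the sketch stays self-contained.

```python
import io
import json

def write_json_lines(records, fh):
    """Write each record as one JSON string per line, with a stable key order."""
    for rec in records:
        fh.write(json.dumps(rec, sort_keys=True) + "\n")

def read_json_lines(fh):
    """Read the records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]

# Hypothetical cleaned record: parameters plus the generation result.
records = [
    {"hive_table_mb": 500, "dimensions": ["city", "gender"],
     "success": 1, "build_seconds": 840},
]

buffer = io.StringIO()  # stands in for a file opened for writing
write_json_lines(records, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
```

Sorting the keys gives every line the same field order, which is one simple way to keep the model input format uniform.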
Step 305, data normalization.
In the embodiment of the invention, the historical cube parameters and historical cube generation results are usually in numerical form; for example, the size of the hive table is 500M, successful cube generation is 1, and failed cube generation is 0. Normalizing the data speeds up model training and improves its efficiency.
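The source does not say which normalization is used. As one common choice, the sketch below applies min-max scaling, which maps each numeric feature into the range [0, 1]; the function name and sample values are illustrative assumptions.

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical hive table sizes in MB taken from three history records.
hive_table_mb = [100, 300, 500]
normalized = min_max_normalize(hive_table_mb)
```

Labels that are already 0 or 1, such as the success flag, are unchanged by this scaling.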
Step 306, using the multiple groups of historical cube parameters as the input of the dimension determination model, and using whether the cube is successfully generated and/or the time length required for generating the cube as the output of the dimension determination model, so as to train the dimension determination model.
In the embodiment of the invention, after the normalized data is obtained, the multiple groups of normalized historical hive parameters, historical cube dimensions, historical cube calculation indexes and historical hbase parameters are used as the input of the dimension determination model, the normalized historical cube generation results are used as the output of the dimension determination model, and the dimension determination model is trained. The recall rate and accuracy rate of the trained dimension determination model are then evaluated, and the parameters are adjusted iteratively according to the recall rate and accuracy rate until the dimension determination model achieves the expected accuracy.
In the embodiment of the invention, the algorithm adopted by the dimension determination model can be a logistic algorithm or a regression algorithm, and the dimension determination model is obtained through training with the logistic algorithm or the regression algorithm.
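As a minimal sketch of the training and evaluation described above, the following code fits a logistic regression by stochastic gradient descent to predict whether cube generation succeeds, then computes recall and accuracy. The two features (normalized hive table size and normalized dimension count), the toy data, and the hyperparameters are assumptions made for illustration; the source does not disclose the model's actual features or settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Predicted probability that cube generation succeeds."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with plain stochastic gradient descent on log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def recall_and_accuracy(y_true, y_pred):
    """Recall on the positive class and overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, correct / len(y_true)

# Toy data: [normalized hive table size, normalized dimension count] -> success flag.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.8], [0.7, 1.0]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
predictions = [1 if predict(weights, bias, x) >= 0.5 else 0 for x in X]
recall, accuracy = recall_and_accuracy(y, predictions)
```

Predicting the generation time instead of the success flag would call for a regression model over the same inputs, which matches the "regression algorithm" alternative named in the text.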
Step 307, build a web service.
In the embodiment of the invention, after an accurate dimension determination model is obtained, the dimension determination model is deployed as a web service, so that the service can receive a request carrying one or more groups of cube parameters and cube generation results, determine the optimal cube parameters from the request, and return them to the requester.
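The source does not state how the service decides which candidate is optimal. One plausible rule, sketched below, keeps the candidates whose predicted success probability passes a threshold and returns the one with the shortest predicted generation time. The `choose_best` function, the predictor callables, and the threshold are all hypothetical, and the HTTP layer of the web service is omitted.

```python
def choose_best(candidates, predict_success, predict_duration, min_success=0.5):
    """Pick the viable candidate with the shortest predicted generation time.

    predict_success and predict_duration are callables standing in for the
    trained model; returns None when no candidate is predicted to succeed.
    """
    viable = [c for c in candidates if predict_success(c) >= min_success]
    if not viable:
        return None
    return min(viable, key=predict_duration)

# Hypothetical candidate parameter groups with stubbed model outputs attached.
candidates = [
    {"name": "a", "p_success": 0.9, "seconds": 300},
    {"name": "b", "p_success": 0.4, "seconds": 100},
    {"name": "c", "p_success": 0.8, "seconds": 200},
]
best = choose_best(
    candidates,
    predict_success=lambda c: c["p_success"],
    predict_duration=lambda c: c["seconds"],
)
```

Candidate "b" is the fastest but is filtered out because it is predicted to fail, so "c" is returned.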
In the embodiment of the invention, according to the dimension determination model training method of the first embodiment of the invention, the dimension determination model can be obtained, so that the cube parameters can be determined according to the web service deployed by the model, the artificial interference in the cube creation process is reduced, the test time and the resource cost are reduced, and the application range is wide.
Step S203, a data query request is received, wherein the data query request indicates the dimension to be queried.
In the embodiment of the present invention, when the processing result processed according to the dimension corresponding to the cube and the calculation index is stored in the Hbase database, the Hbase database may receive the data query request.
Step S204, querying data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried.
In the embodiment of the invention, after a data query request is received, the target cube corresponding to the dimension to be queried is determined according to the dimension to be queried indicated by the data query request. The data corresponding to the dimension to be queried is then queried from the HBase database according to the target cube. The HBase database stores the processing results of a plurality of cubes, and the corresponding data can be queried from it once the target cube is determined.
As shown in fig. 4, a first embodiment of the present invention discloses a data query method, which includes the following steps:
Step S401, receiving a data query request, wherein the data query request indicates the dimension to be queried.
Step S402, generating an sql statement according to the dimension to be queried.
Step S403, querying the target cube corresponding to the dimension to be queried according to the sql statement.
Step S404, determining, according to the target cube, whether aggregated data corresponding to the dimension to be queried exists in the HBase database; if so, going to step S405; if not, going to step S406.
Step S405, acquiring the aggregated data according to the target cube as the data corresponding to the dimension to be queried.
Step S406, determining target data corresponding to the target cube, aggregating the target data by using the target cube, and taking the aggregation result as the data corresponding to the dimension to be queried.
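Steps S401 to S406 can be sketched as a lookup with a fallback to on-the-fly aggregation. Everything here is a simplified assumption: a dictionary stands in for the HBase database, the target cube is resolved by dimension name rather than by parsing the sql statement, the aggregation is a plain count, and the table name `orders` is made up.

```python
from collections import Counter

def build_sql(dimension, table="orders"):
    """S402: generate an sql statement for the requested dimension."""
    return f"SELECT {dimension}, COUNT(*) FROM {table} GROUP BY {dimension}"

def query_dimension(dimension, cubes, kv_store, source_rows):
    """Return (sql, data, origin) for the dimension named in the request (S401)."""
    sql = build_sql(dimension)                      # S402
    target_cube = cubes[dimension]                  # S403, simplified lookup
    if target_cube in kv_store:                     # S404
        return sql, kv_store[target_cube], "precomputed"            # S405
    counts = dict(Counter(row[dimension] for row in source_rows))   # S406
    return sql, counts, "aggregated"

# Hypothetical cube names, stored results, and source rows.
cubes = {"city": "city_cube", "gender": "gender_cube"}
kv_store = {"city_cube": {"Wuhan": 483}}
rows = [
    {"city": "Wuhan", "gender": "M"},
    {"city": "Wuhan", "gender": "F"},
    {"city": "Beijing", "gender": "M"},
]

city_sql, city_data, city_origin = query_dimension("city", cubes, kv_store, rows)
_, gender_data, gender_origin = query_dimension("gender", cubes, kv_store, rows)
```

The city query is served from the stored result, while the gender query falls through to aggregation because no stored result exists for its cube.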
In this embodiment of the present invention, before receiving the data query request, the method further includes: pre-aggregating the data to be processed of the corresponding dimensions by using the cube, and storing the pre-aggregation result in an HBase database. In the Apache Kylin system, pre-aggregation means that, before multi-dimensional analysis is performed on big data, the possible dimension combinations are computed in advance and the results are stored in a high-performance key-value (kv) database, such as an HBase database. When a multi-dimensional query is later issued, the result is read directly from the kv database, so that queries can return within seconds.
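As a rough illustration of this idea, the sketch below computes a sum for every non-empty combination of dimensions in advance and keeps the results in a dictionary that plays the role of the kv database. The function name `pre_aggregate`, the key layout, and the sample rows are assumptions made for the example; they are not taken from Kylin or HBase.

```python
from collections import defaultdict
from itertools import combinations


def pre_aggregate(rows, dimensions, measure):
    """Sum `measure` for every non-empty combination of `dimensions`.

    Returns a dict keyed by (dimension combination, dimension values),
    which mimics the layout of a key-value store.
    """
    kv = defaultdict(float)
    dims = sorted(dimensions)
    for size in range(1, len(dims) + 1):
        for combo in combinations(dims, size):
            for row in rows:
                kv[(combo, tuple(row[d] for d in combo))] += row[measure]
    return dict(kv)


rows = [
    {"city": "Beijing", "category": "books", "amount": 10.0},
    {"city": "Beijing", "category": "toys", "amount": 5.0},
    {"city": "Shanghai", "category": "books", "amount": 7.0},
]
kv_store = pre_aggregate(rows, ["city", "category"], "amount")

# A later query is a single key lookup, with no scan of the raw rows.
beijing_total = kv_store[(("city",), ("Beijing",))]
```

The cost of this approach is visible in the loop structure: every additional dimension multiplies the number of combinations that must be computed and stored.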
During data aggregation in the Apache Kylin system, Kylin creates a cube, a computing unit composed of multiple dimensions. The more dimensions a cube contains, the more query combinations it can support, but the more kv database storage space it consumes and the longer it takes to create. With the dimension determination model, an optimal cube can be created using the fewest resources, yielding the best query speed and results.
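The trade-off can be made concrete by counting dimension combinations. For n dimensions there are 2^n - 1 non-empty combinations, assuming no pruning is applied. The helper name `combination_count` below is hypothetical.

```python
def combination_count(num_dimensions: int) -> int:
    """Number of non-empty dimension combinations: C(n,1)+...+C(n,n) = 2**n - 1."""
    if num_dimensions < 0:
        raise ValueError("num_dimensions must be non-negative")
    return 2 ** num_dimensions - 1


# Doubling the number of dimensions roughly squares the number of combinations.
growth = {n: combination_count(n) for n in (5, 10, 20)}
```

Under this assumption, 5 dimensions give 31 combinations while 20 dimensions give more than a million, which is why the choice of dimensions dominates both storage and build time.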
As shown in fig. 5, a second embodiment of the present invention discloses a dimension determination model training method, including the following steps:
Step 501, acquiring multiple groups of historical cube parameters; the historical cube parameters comprise historical hive parameters, historical cube dimensions, historical cube calculation indexes, historical hbase parameters, and one or more groups of historical sql parameters corresponding to each group of historical cube parameters.
In the embodiment of the invention, the historical cube parameter data comes from the historical logs of a Kylin cluster, wherein the historical sql parameters comprise the dimensions queried by the sql, the calculation indexes queried by the sql, the query duration of the sql, whether the sql query timed out, and the like.
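One possible in-memory shape for a single group of historical parameters is sketched below. All class and field names are hypothetical; the source lists the kinds of parameters but does not define a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HistoricalSqlParams:
    """One historical sql execution against a cube (field names assumed)."""
    query_dimensions: List[str]
    query_indexes: List[str]
    duration_seconds: float
    timed_out: bool


@dataclass
class HistoricalCubeParams:
    """One group of historical cube parameters (field names assumed)."""
    hive_params: Dict[str, float]
    cube_dimensions: List[str]
    cube_indexes: List[str]
    hbase_params: Dict[str, float]
    sql_history: List[HistoricalSqlParams] = field(default_factory=list)

    def timeout_rate(self) -> float:
        """Fraction of recorded sql queries that timed out."""
        if not self.sql_history:
            return 0.0
        return sum(q.timed_out for q in self.sql_history) / len(self.sql_history)
```

A derived value such as `timeout_rate` is one example of a feature that could be computed from the sql history before training; the source does not say which derived features are used.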
Step 502, obtaining historical cube generation results corresponding to multiple sets of historical cube parameters.
Step 503, data cleaning.
Step 5031, cleaning invalid data generated during testing or when a fault occurred, and retaining only the historical cube parameters and historical cube generation results from actual use.
Step 5032, cleaning the historical cube parameter data for which the size of the hive table is smaller than a preset hive table size threshold.
Step 5033, cleaning the historical cube parameters and historical cube generation result data for which the number of dimensions of the corresponding data in the hive table is smaller than a preset dimension threshold.
Step 5034, cleaning the historical sql parameter data of the non-aggregate type. Cleaning the non-aggregate sql query data eliminates the influence of non-aggregate data query requests on cube generation, so that the generated cube is more accurate and meets the data processing requirement.
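The four cleaning rules above can be sketched as a single filter over a list of records. The record keys (`is_test`, `failed_by_fault`, `hive_table_bytes`, `dimension_count`, `is_aggregate`) and the function name are assumptions made for illustration; the source does not specify how these conditions are encoded in the logs.

```python
def clean_records(records, min_hive_table_bytes, min_dimension_count):
    """Apply the filters of steps 5031-5034 to a list of record dicts."""
    kept = []
    for rec in records:
        # Step 5031: drop data produced by tests or by faults.
        if rec["is_test"] or rec["failed_by_fault"]:
            continue
        # Step 5032: drop records whose hive table is below the size threshold.
        if rec["hive_table_bytes"] < min_hive_table_bytes:
            continue
        # Step 5033: drop records with fewer dimensions than the threshold.
        if rec["dimension_count"] < min_dimension_count:
            continue
        # Step 5034: keep only aggregate sql queries in the history.
        cleaned = dict(rec)
        cleaned["sql_history"] = [
            q for q in rec["sql_history"] if q["is_aggregate"]
        ]
        kept.append(cleaned)
    return kept
```

The function copies each kept record before trimming its sql history, so the input list is left unmodified.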
Step 504, standardizing the data.
Step 505, normalizing the data.
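The source does not give formulas for these two rescaling steps. Two common choices, shown here purely as assumptions, are z-score standardization (zero mean, unit variance) and min-max normalization (values mapped into the range 0 to 1).

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean and divide by the population std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Rescaling matters here because the inputs mix very different magnitudes, for example a hive table size in bytes next to a dimension count in single digits.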
Step 506, using the multiple sets of historical cube parameters as the input of the dimension determination model, and using the historical cube generation result as the output of the dimension determination model, so as to train the dimension determination model.
In this embodiment of the present invention, the historical cube generation result may include: whether the cube was successfully generated, and the length of time required to generate the cube.
Step 507, constructing the web service.
In the embodiment of the invention, the algorithm adopted by the dimension determination model can be a logic algorithm or a regression algorithm, and the dimension determination model is obtained through the training of the logic algorithm or the regression algorithm.
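As an illustration of the training step, the sketch below fits a logistic regression by batch gradient descent to predict whether cube generation succeeds. Reading the "logic algorithm" as logistic regression is an assumption, as are the two toy features (a rescaled dimension count and a rescaled hive table size), the labels, and the hyperparameters. Predicting the generation duration would need a separate regression model, which is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_success(weights, bias, x):
    """Predicted probability that a cube with features `x` builds successfully."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_success(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy history: [rescaled dimension count, rescaled hive table size] -> success.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(X, y)
```

In this toy data, small cubes on small tables succeed and large ones fail, so the fitted weights come out negative; real history would not be this cleanly separable.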
In the embodiment of the invention, the dimension determination model training method of the second embodiment yields an accurate dimension determination model, and optimal cube parameters can then be determined through a web service deployed from the model. This reduces manual intervention in the cube creation process, reduces testing time and resource cost, and is widely applicable.
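A web service of this kind could look roughly like the sketch below, which uses only the Python standard library. Everything specific is an assumption: the request format (a JSON list of candidate cube parameters), the response (the best-scoring candidate), and `toy_model`, a placeholder that stands in for the trained dimension determination model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def toy_model(params):
    """Placeholder for the trained model: fewer dimensions score higher."""
    return 1.0 / (1.0 + len(params["dimensions"]))


def choose_best(candidates, model=toy_model):
    """Return the candidate cube parameters with the highest model score."""
    return max(candidates, key=model)


def handle_request(body: bytes) -> bytes:
    """Decode a JSON list of candidates and encode the best one as JSON."""
    candidates = json.loads(body)
    return json.dumps(choose_best(candidates)).encode("utf-8")


class RecommendHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the recommended cube parameters."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), RecommendHandler).serve_forever()
```

Keeping the scoring logic in `handle_request`, separate from the HTTP handler, lets it be exercised without starting a server.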
In an existing Kylin cluster, a cube created by the expert method must be tested with actual data to judge whether it is feasible. For example, the data of a certain day is selected and fed into the created cube, and the feasibility of the cube is judged from whether creation succeeds, how long creation takes, and the sql hit rate after creation. However, repeated testing occupies system resources; when the data volume is large the tests take a long time, so obtaining an optimal cube consumes considerable resources and time. Moreover, even after this time and cost are spent, a satisfactory cube may still not be obtained.
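The sql hit rate mentioned above is not defined in the source. One plausible definition, used here as an assumption, is the fraction of historical queries whose dimensions are all covered by the cube. The function name `sql_hit_rate` is hypothetical.

```python
def sql_hit_rate(cube_dimensions, historical_queries):
    """Fraction of queries whose dimensions are all covered by the cube."""
    if not historical_queries:
        return 0.0
    cube = set(cube_dimensions)
    hits = sum(1 for dims in historical_queries if set(dims) <= cube)
    return hits / len(historical_queries)


queries = [["city"], ["city", "day"], ["category"], ["day"]]
rate = sql_hit_rate(["city", "day"], queries)
```

With these sample queries, three of the four touch only `city` and `day`, so the cube over those two dimensions serves three quarters of the workload.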
According to the invention, the optimal cube corresponding to the dimensionality can be generated by utilizing the dimensionality determination model according to the data to be processed, so that the resource and time cost of the test is greatly reduced, and the data processing efficiency is further improved.
In the embodiment of the invention, the data to be processed is acquired; one or more dimensions included in the data to be processed are determined by using a dimension determination model, and the cube corresponding to the one or more dimensions in the Kylin cluster is determined, where the dimension determination model is trained on historical cube parameters and cube generation results; a data query request indicating a dimension to be queried is received; and the data corresponding to the dimension to be queried is queried from a database according to the target cube corresponding to that dimension. These technical means reduce the interference of human factors in the cube creation process, yield an optimal cube configuration with strong applicability, reduce testing time and resource cost, and improve data processing efficiency.
Fig. 6 is a schematic diagram of main modules of a data query apparatus of a Kylin cluster according to an embodiment of the present invention, and as shown in fig. 6, the data query apparatus 600 of the Kylin cluster of the present invention includes:
the obtaining module 601 is configured to obtain data to be processed.
In the embodiment of the present invention, in order to query the required data, the obtaining module 601 is required to obtain the data to be processed, and a data source of the data to be processed may be a Hive table.
A data processing module 602, configured to determine, by using a dimension determination model, one or more dimensions included in the to-be-processed data, and determine cube in the Kylin cluster corresponding to the one or more dimensions; the dimension determination model is obtained by training according to historical cube parameters and cube generation results.
In this embodiment of the present invention, after the obtaining module 601 obtains data to be processed, the data processing module 602 determines one or more dimensions and calculation indexes corresponding to the dimensions included in the data to be processed by using a dimension determination model, and generates one or more cubes corresponding to the one or more dimensions in a Kylin cluster according to the determined dimensions and the calculation indexes corresponding to the dimensions. The data processing module 602 calculates the data to be processed according to the dimension and the calculation index corresponding to the cube, and obtains processing results of the data to be processed in different dimensions. And storing the processing result of each cube to a database, so that a user can query the data processing result of the cube corresponding to the dimension and the calculation index.
A receiving module 603, configured to receive a data query request, where the data query request indicates a dimension to be queried.
In this embodiment of the present invention, the receiving module 603 is configured to receive a data query request, where the data query request indicates a dimension to be queried.
The query module 604 is configured to query, according to the target cube corresponding to the dimension to be queried, data corresponding to the dimension to be queried from a database.
In this embodiment of the present invention, after the receiving module 603 receives the data query request, the query module 604 determines the target cube corresponding to the dimension to be queried according to the dimension indicated by the data query request, and queries the data corresponding to the dimension to be queried from the HBase database according to the target cube.
In the embodiment of the invention, through the acquisition module, the data processing module, the receiving module, the query module and other modules, the interference of human factors in the cube creation process can be reduced, the optimal cube configuration is obtained, the applicability is strong, the testing time and the resource cost are reduced, and the data processing efficiency is improved.
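The cooperation of the four modules can be sketched as a small composition of callables. The class name `DataQueryDevice`, its field names, and the call order in `run` are hypothetical; the source describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DataQueryDevice:
    """Hypothetical wiring of modules 601-604 as plain callables."""
    obtain: Callable[[], Any]          # module 601: returns data to be processed
    process: Callable[[Any], Any]      # module 602: data -> cubes and results
    receive: Callable[[], Any]         # module 603: returns a query request
    query: Callable[[Any, Any], Any]   # module 604: (cubes, request) -> data

    def run(self):
        """Run the modules in the order described for apparatus 600."""
        data = self.obtain()
        cubes = self.process(data)
        request = self.receive()
        return self.query(cubes, request)


# Toy wiring: the "cube" is a dict of precomputed totals keyed by name.
device = DataQueryDevice(
    obtain=lambda: [1, 2, 3],
    process=lambda data: {"sum": sum(data)},
    receive=lambda: "sum",
    query=lambda cubes, request: cubes[request],
)
result = device.run()
```

Passing the modules in as callables keeps each one replaceable, which matches the statement that the modules may be implemented in software or hardware.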
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing a terminal device according to an embodiment of the present invention, and as shown in fig. 7, a computer system 700 of a terminal device according to an embodiment of the present invention includes:
a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM703, various programs and data necessary for the operation of the system 700 are also stored. The CPU701, the ROM702, and the RAM703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes an acquisition module, a data processing module, a receiving module, and a query module. The names of the modules do not form a limitation on the modules themselves in some cases, for example, a data processing module may also be described as a "module that determines a dimension of data to be processed according to a dimension determination model and constructs a cube corresponding to the dimension".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: acquiring data to be processed; determining one or more dimensions included in the data to be processed by using a dimension determination model, and determining cube corresponding to the one or more dimensions in the Kylin cluster; the dimension determination model is obtained by training according to historical cube parameters and cube generation results; receiving a data query request, wherein the data query request indicates a dimension to be queried; and querying data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried.
According to the technical scheme of the embodiment of the invention, based on the historical records of Kylin cube creation and the historical execution records of sql on those cubes, a relation model is established between the features used in the cube creation process and the final cube creation result, and the relation model is deployed as a web service so that optimal cube parameters can be determined. The method reduces the interference of human factors in the cube creation process, obtains an optimal cube configuration, has strong applicability, reduces testing time and resource cost, and improves data processing efficiency.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A data query method of a Kylin cluster is characterized by comprising the following steps:
acquiring data to be processed;
determining one or more dimensions included in the data to be processed by using a dimension determination model, and determining cube corresponding to the one or more dimensions in the Kylin cluster; the dimension determination model is obtained by training according to historical cube parameters and cube generation results;
receiving a data query request, wherein the data query request indicates a dimension to be queried;
and querying data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried.
2. The method of claim 1, wherein the step of training the dimension determination model comprises:
acquiring multiple groups of historical cube parameters; the historical cube parameters comprise historical hive parameters, historical cube dimensions, historical cube calculation indexes and historical hbase parameters;
and taking the multiple groups of historical cube parameters as the input of the dimension determination model, and taking the time length required by whether the cube is successfully generated and/or the cube generation as the output of the dimension determination model so as to train the dimension determination model.
3. The method of claim 2, further comprising:
generating an sql statement according to the dimension to be queried;
and querying the target cube corresponding to the dimension to be queried according to the sql statement.
4. The method of claim 3, wherein
the historical cube parameters further include: one or more sets of historical sql parameters corresponding to each set of the historical cube parameters.
5. The method according to claim 1, wherein the querying data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried comprises:
determining whether aggregated data corresponding to the dimension to be queried exists in an HBase database according to the target cube;
if so, acquiring the aggregated data as data corresponding to the dimension to be queried according to the target cube;
if not, determining target data corresponding to the target cube, aggregating the target data by using the target cube, and taking an aggregation result as data corresponding to the dimension to be queried.
6. The method of claim 5, further comprising:
pre-aggregating the data to be processed of the corresponding dimension by using the cube, and storing a pre-aggregation result in an HBase database.
7. The method of claim 1, wherein the determining the cube in the Kylin cluster corresponding to the one or more dimensions comprises:
and determining whether the cube corresponding to the dimension exists in the Kylin cluster, and if not, generating the cube corresponding to the dimension.
8. The method according to any one of claims 1 to 7,
the dimension determination model is obtained based on logic algorithm or regression algorithm training.
9. A data query apparatus of a Kylin cluster, comprising:
the acquisition module is used for acquiring data to be processed;
the data processing module is used for determining one or more dimensions included in the data to be processed by using a dimension determination model and determining cube corresponding to the one or more dimensions in the Kylin cluster; the dimension determination model is obtained by training according to historical cube parameters and cube generation results;
the receiving module is used for receiving a data query request, wherein the data query request indicates a dimension to be queried;
and the query module is used for querying the data corresponding to the dimension to be queried from a database according to the target cube corresponding to the dimension to be queried.
10. A data-querying electronic device of a Kylin cluster, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202110609692.XA 2021-06-01 2021-06-01 Data query method and device for Kylin cluster Pending CN113360529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110609692.XA CN113360529A (en) 2021-06-01 2021-06-01 Data query method and device for Kylin cluster

Publications (1)

Publication Number Publication Date
CN113360529A true CN113360529A (en) 2021-09-07

Family

ID=77530999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110609692.XA Pending CN113360529A (en) 2021-06-01 2021-06-01 Data query method and device for Kylin cluster

Country Status (1)

Country Link
CN (1) CN113360529A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination