CN113779060A - Data query method and device
- Publication number
- CN113779060A (application CN202110103292.1A)
- Authority
- CN
- China
- Prior art keywords
- query
- task
- historical
- subtasks
- subtask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
- G06F16/24539—Query rewriting; Transformation using cached or materialised query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data query method and a data query device, and relates to the technical field of computers. One embodiment of the method comprises: receiving a query request, selecting corresponding historical query subtasks according to the query task carried in the query request, and generating a historical task set; determining a query task splitting strategy according to the historical task set, and splitting the query task into a plurality of query subtasks; and executing the plurality of query subtasks in parallel, and combining the corresponding query results in order to generate the query result corresponding to the query request. This embodiment overcomes the poor compatibility, low query speed, and unstable query process of existing query modes, and achieves the technical effect of improving data query efficiency.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for querying data.
Background
Java projects often need to query Hive table data in order to export it. The prior art offers two solutions for this scenario: one obtains the data through the HiveServer/HiveServer2 service provided by Hive itself; the other obtains the data through Presto, an open-source distributed SQL query engine.
In the process of implementing the invention, at least the following problems were found in the prior art: for security reasons, querying Hive table data directly through HiveServer/HiveServer2 is generally not permitted; Presto is not sufficiently compatible with Hive in data formats and Hive SQL syntax, and is slow and unstable when exporting tables with large data volumes.
Disclosure of Invention
In view of the above, the present invention provides a data query method and apparatus that obtain export data by submitting query SQL (Structured Query Language). To facilitate invocation and code reuse in Java projects, the query function is packaged as a Jar package, Hive-Client. (A Jar, or Java archive, is a package file format generally used to aggregate a large number of Java class files, related metadata, and resource files into one file for developing Java platform applications or libraries.) Hive-Client is a module/client that encapsulates the Hive query function and can be called directly by a user through an interface. In addition, to address the slowness and instability of exporting tables with large data volumes, query tasks are intelligently split and automatically merged based on historical query tasks, which improves data query efficiency.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method for querying data, including:
receiving a query request, wherein the query request carries a query task;
selecting a historical query subtask according to the query task, and generating a historical task set;
determining a query task splitting strategy according to the historical task set, splitting the query task based on the strategy, and generating a plurality of query subtasks;
executing a plurality of query subtasks, and acquiring a first query result corresponding to each query subtask;
and merging the first query results to generate a second query result corresponding to the query request.
Preferably, the determining a query task splitting policy according to the historical task set includes:
determining a time query range and query efficiency corresponding to the historical query subtasks in the historical task set;
and determining the splitting period of the query task according to the time query range and the query efficiency.
Preferably, the determining a splitting period of the query task according to the time query scope and the query efficiency includes:
selecting a predetermined number of historical query subtasks from a historical task set;
classifying the historical query subtasks selected from the historical task set according to the time query range, and calculating the first average query efficiency of each type of historical query subtasks;
and selecting a time query range corresponding to the historical query subtask classification with the highest average query efficiency as a splitting period of the query task.
Preferably, the determining a splitting period of the query task according to the time query scope and the query efficiency includes:
selecting a predetermined number of historical query subtasks from a historical task set;
classifying the historical query subtasks selected from the historical task set according to the time query range, and calculating the first average query efficiency of each type of historical query subtasks;
calculating the average value of the first average query efficiency to obtain a second average query efficiency;
and determining the matched historical query subtask classification according to the second average query efficiency, and taking the time query range as the splitting period of the query task.
Preferably, selecting a historical query subtask according to the query task, and generating a historical task set, includes:
determining a value range of a splitting period according to a query target object, selecting a historical query subtask based on the value range, and generating a historical task set; and/or
selecting a historical query subtask according to the query field of the query task, and generating a historical task set.
Preferably, after receiving the query request, further determining whether the query time range of the query task is smaller than a preset threshold, and if so, directly executing task query.
Preferably, when executing the plurality of query subtasks, for each query subtask, the following query steps are performed:
analyzing the query subtask, and determining task parameters and execution parameters of a query engine;
creating an initial query result file and a corresponding field according to the task parameters;
and executing the subtask query by a query engine according to the execution parameters, and writing a query result into the initial query result file to generate the first query result.
Preferably, merging the first query result to generate a second query result corresponding to the query request includes:
setting identifications for a plurality of query subtasks corresponding to the query tasks;
judging whether the plurality of query subtasks are executed correctly or not according to the identification and generating the first query result;
if so, merging the first query result.
Preferably, the subtask query step is encapsulated and invoked through an interface.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for querying data, including:
the receiving module is used for receiving a query request, and the query request carries a query task;
the historical task set generation module is used for selecting a historical query subtask according to the query task and generating a historical task set;
the task splitting module determines a query task splitting strategy according to the historical task set, splits the query task based on the strategy and generates a plurality of query subtasks;
the subtask execution module is used for executing a plurality of query subtasks and acquiring a first query result corresponding to each query subtask;
and the merging module merges the first query results to generate second query results corresponding to the query requests.
According to still another aspect of an embodiment of the present invention, there is provided a data query apparatus including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned data query method.
According to still another aspect of embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the aforementioned data query method.
One embodiment of the above invention has the following advantages or benefits: because the technical means of intelligent task splitting based on historical query tasks and querying split subtasks through packaged clients are adopted, the technical problems of poor compatibility, low query speed and unstable data query process of the existing query mode are solved, and the technical effect of improving the data query efficiency is achieved.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a data query method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main flow of query subtask execution according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of the main process of query task execution according to yet another embodiment of the present invention;
FIG. 4 is a schematic diagram of partial query results for a query subtask, according to yet another embodiment of the present invention;
FIG. 5 is a schematic diagram of the main blocks of a data query device according to yet another embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a schematic block diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a data query method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
Step S101: receiving a query request, wherein the query request carries a query task.
Taking the export of data from a Hive table with a large data volume as an example, after a user's data export query request is received, the SQL query in the request is parsed to determine the user's specific query task, including information such as the query fields, the query time range, and the query object.
A common SQL statement has the following form:
SELECT index 1, ..., index n FROM tableA WHERE dt >= 'date 1' AND dt <= 'date 2'
Here, index 1 to index n are the fields queried by the user; tableA is the user's query target object, generally the name of the target data table, that is, the table from which the fields are queried; dt represents the query date, and "dt >= 'date 1' AND dt <= 'date 2'" indicates that the query start time is date 1 and the query end time is date 2. Correspondingly, the query time range is T = date 2 - date 1.
Optionally, when the query time range of the query task is smaller than a preset threshold, the query may be executed directly, without the subsequent task splitting.
Step S102: selecting historical query subtasks according to the query task, and generating a historical task set.
The invention splits the user's task intelligently based on historical query subtasks. The historical query subtasks that were executed after existing historical query tasks were split are therefore analyzed first, to obtain for each historical subtask information such as its start time, end time, queried data table name, query time consumed, and number of rows in the generated data file. The query task splitting strategy is then determined based on this information.
Further, the historical query subtasks may be screened according to information such as the query time range and the query target object of the query task determined in step S101, so as to generate a historical task set from which the query task splitting strategy can be determined more accurately in step S103. The historical task set consists of the screened historical query subtasks.
In an embodiment of the present invention, the historical query subtasks may be screened according to a query target object in the query task, where the query target object is usually a query data table name. Because the data fields recorded by the same query data table are relatively fixed and the data volume generated every day is equivalent, the historical query subtask which is the same as the query data table in the query task can be selected to generate the historical task set.
In addition, the historical query subtasks may also be screened based on the query time range. In general, the minimum and maximum query time range that a given query data table can support may be determined; for example, the range may be set to [5 days, 40 days], meaning that the splitting period of the query task must fall within this range. Correspondingly, when the historical query subtasks are screened, those whose query time range falls within this range are selected to generate the historical task set.
In another implementation of the present invention, the historical query subtasks may also be screened according to the query fields of the query task. Because the execution efficiency of a query depends on the number of fields queried, historical query subtasks whose query fields match those of the query task may be selected to generate the historical task set. Specifically, the query fields of a selected historical subtask may be identical to those of the query task, or partially identical. Alternatively, a query weight may be set for each field according to the average query time of that field, the weight values of the query task and of each historical subtask are calculated, and the historical subtasks whose weight values are close to that of the query task are selected to generate the historical task set.
It should be noted that the above ways of generating the historical task set may be used alone or in combination, and the present invention is not limited in this respect.
In addition, to determine the historical task set more accurately, historical query subtasks with an incomplete query time range can be excluded during selection. A historical query subtask with an incomplete query time range is one that was produced when a historical query task was split, whose query time range is smaller than the splitting period of that task, and whose query time range also differs from the splitting periods of the other historical query tasks. For example, if the query time range of a historical query task is 43 days and its splitting period is 20 days, the task is split into two 20-day query subtasks and one 3-day query subtask. If the splitting periods of the other historical query tasks are found to be only 5 days, 10 days, and 30 days (in addition to 20 days), none of them equals the 3-day range of the last subtask, so the historical query subtasks with a 3-day query time range can be filtered out when the historical task set is generated.
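The screening described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names `HistoricalSubtask` and `HistoryTaskSetFilter`, their fields, and the `select` method are hypothetical and do not appear in the original text. The sketch keeps the historical subtasks that queried the same table and whose query time range lies within the supported range; with a range of [5 days, 40 days], a 3-day remainder subtask is dropped as a side effect.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of one historical query subtask (illustrative names only). */
class HistoricalSubtask {
    final String tableName;    // name of the queried data table
    final int rangeDays;       // query time range T, in days
    final double totalMinutes; // total query time consumed, in minutes

    HistoricalSubtask(String tableName, int rangeDays, double totalMinutes) {
        this.tableName = tableName;
        this.rangeDays = rangeDays;
        this.totalMinutes = totalMinutes;
    }
}

class HistoryTaskSetFilter {
    /** Keeps subtasks on the given table whose range lies within [minDays, maxDays]. */
    static List<HistoricalSubtask> select(List<HistoricalSubtask> history,
                                          String table, int minDays, int maxDays) {
        List<HistoricalSubtask> taskSet = new ArrayList<>();
        for (HistoricalSubtask h : history) {
            boolean sameTable = h.tableName.equals(table);
            boolean inRange = h.rangeDays >= minDays && h.rangeDays <= maxDays;
            if (sameTable && inRange) {
                taskSet.add(h);
            }
        }
        return taskSet;
    }
}
```

Screening by query field, or by field weights, would add a further predicate to the same loop.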
Step S103: determining a query task splitting strategy according to the historical task set, splitting the query task based on the strategy, and generating a plurality of query subtasks.
The query task splitting strategy is the principle by which the query task is split. It can be specified directly by a user, or the task can be split based on the fields of the query task and/or based on its query time range. So that the split subtasks take similar times when executed in parallel, and so that their query results can be merged easily afterwards, in an embodiment of the present invention the task is preferably split based on its query time range.
In a specific implementation, a two-element variable [T, t] may be generated for each historical query subtask, where T represents the query time range, T = query end time - query start time, and t represents the unit query time, t = total query time / T. The smaller the value of t, the higher the query efficiency; the larger the value of t, the lower the query efficiency.
After the two-element variable of each historical query subtask in the historical task set is determined, the splitting period of the query task can be determined according to the query time range and the query efficiency it records.
In one embodiment, a predetermined number of historical query subtasks may be selected from the historical task set according to the sequence of query efficiency from high to low, the selected historical query subtasks are classified according to the time query range, and a first average query efficiency of each type of historical query subtasks is calculated; and selecting a time query range corresponding to the historical query subtask classification with the highest average query efficiency as a splitting period of the query task.
Table 1 shows an example in which the 8 historical query subtasks with the lowest unit query time (that is, the highest query efficiency) are selected from the historical task set.
TABLE 1
Query time range (days) | Unit query time (minutes) |
10 | 5 |
20 | 5 |
30 | 4 |
40 | 6 |
10 | 5.5 |
20 | 6.5 |
30 | 5 |
40 | 6 |
Through the query time range, 8 historical query subtasks in table 1 can be divided into 4 types, and the first average unit query time consumption corresponding to each type of historical query subtask is calculated, which is specifically shown in table 2.
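TABLE 2
Query time range (days) | First average unit query time (minutes) |
10 | 5.25 |
20 | 5.75 |
30 | 4.5 |
40 | 6 |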
Further, the query subtask class with the smallest first average unit query time (4.5 minutes) is selected from Table 2, and its query time range (30 days) is used as the splitting period.
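As a rough illustration of this selection, the sketch below groups the samples of Table 1 by query time range, averages the unit query time within each group, and picks the range with the smallest average. The class and method names are hypothetical, and each sample is assumed to be a pair {query time range in days, unit query time in minutes}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SplitPeriodByBestAverage {
    /** Groups samples by query time range and averages the unit query time per group. */
    static Map<Integer, Double> averageByRange(double[][] samples) {
        Map<Integer, List<Double>> groups = new TreeMap<>();
        for (double[] s : samples) {
            groups.computeIfAbsent((int) s[0], k -> new ArrayList<>()).add(s[1]);
        }
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<Double>> e : groups.entrySet()) {
            double sum = 0;
            for (double v : e.getValue()) {
                sum += v;
            }
            averages.put(e.getKey(), sum / e.getValue().size());
        }
        return averages;
    }

    /** Returns the range whose average unit query time is smallest (highest efficiency). */
    static int bestRange(Map<Integer, Double> averages) {
        int best = -1;
        double bestTime = Double.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : averages.entrySet()) {
            if (e.getValue() < bestTime) {
                bestTime = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Fed with the eight rows of Table 1, `averageByRange` yields 5.25, 5.75, 4.5, and 6 minutes for 10, 20, 30, and 40 days, and `bestRange` returns 30.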
In another embodiment, to account for differences in the number of fields queried by each historical query subtask and in server load during query execution, the first average unit query times may themselves be averaged to obtain a second average unit query time. The matching historical query subtask class is then determined according to the second average unit query time, and its query time range is used as the splitting period of the query task. Following the example in Table 2, the second average unit query time is (5.25 + 5.75 + 4.5 + 6) / 4 = 5.375 (minutes). The matching class in Table 2 is the one whose first average unit query time is closest to the second average unit query time, so the splitting period is determined to be 10 days.
Step S104: executing the plurality of query subtasks, and acquiring a first query result corresponding to each query subtask.
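The second-average variant can be sketched in the same spirit. The names below are hypothetical; the input is assumed to map each query time range (days) to its first average unit query time (minutes), as in Table 2. When two classes are equally close to the second average, this sketch keeps the first one encountered, which is an assumption the text does not address.

```java
import java.util.Map;

class SplitPeriodByMeanOfAverages {
    /** Averages the per-class averages to obtain the second average unit query time. */
    static double secondAverage(Map<Integer, Double> firstAverages) {
        double sum = 0;
        for (double v : firstAverages.values()) {
            sum += v;
        }
        return sum / firstAverages.size();
    }

    /** Returns the range whose first average is closest to the second average. */
    static int closestRange(Map<Integer, Double> firstAverages) {
        double target = secondAverage(firstAverages);
        int best = -1;
        double bestGap = Double.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : firstAverages.entrySet()) {
            double gap = Math.abs(e.getValue() - target);
            if (gap < bestGap) { // strict comparison: the first closest class wins a tie
                bestGap = gap;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With the Table 2 values, the second average is 5.375 minutes and the closest class is the 10-day class (5.25 minutes).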
The method obtains export data by submitting query SQL. Specifically, hive -e or hive -f can be used as the Hive query command: hive -e executes SQL statements directly, while hive -f executes a file of SQL statements. Taking hive -e as an example, in the embodiment of the present invention a Jar package, Hive-Client, is encapsulated, and each query subtask executes its query through Hive-Client. Specifically, to enable invocation and code reuse in Java projects, a child process may be created from the current Java process through the Runtime class to execute the SQL, with the child process represented by a Process object. The specific Hive-Client query process is described in further detail in subsequent embodiments of the present invention.
Since the hive -e SQL is executed by Hive-Client, a Hive access environment must be configured on the server that runs Hive-Client. In practice, deploying a Hive access environment for every Java project that uses Hive-Client is too cumbersome and inefficient. Therefore, in the embodiment of the invention, the Hive environment is configured on a designated server, a web service is developed on top of the Hive-Client Jar, and two service interfaces, RPC and HTTP, are provided so that users can call the interfaces directly, which reduces the cost of using Hive-Client.
Step S105: merging the first query results to generate a second query result corresponding to the query request.
After the concurrent Hive-Client executions complete, the query results of the subtasks can be merged to obtain the query result corresponding to the query task. To ensure that all subtasks have executed correctly, a split-task identifier can be set to indicate whether the current task is a split query subtask. If the identifier is present, the data file merging process is started only after all other query subtasks carrying the same split-task identifier have finished executing, and the merged data file is then pushed to the user. Optionally, the packaged subtask data files may also be pushed together with the merged data file.
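A minimal sketch of launching the Hive command line from a Java process is given below. It is not the patent's Hive-Client: the class and method names are hypothetical, and `ProcessBuilder` is used here in place of the Runtime class mentioned in the text because it makes it easy to keep standard error separate from the result stream. The sketch assumes a `hive` executable is available on the PATH of the machine that runs it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class HiveCommandRunner {
    /** Builds the argument list for "hive -e <sql>"; no shell is involved. */
    static List<String> buildCommand(String sql) {
        return Arrays.asList("hive", "-e", sql);
    }

    /** Runs the command, appends its standard output to 'output', and returns the exit code. */
    static int run(String sql, StringBuilder output) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(sql));
        // Send the child's stderr to this process's stderr so log lines do not mix with result rows.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        return process.waitFor();
    }
}
```

Passing the SQL as a separate argument, rather than concatenating it into a shell string, avoids shell quoting problems with the quoted date literals.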
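The wait-then-merge behavior can be sketched as follows, under several assumptions: the subtasks passed in are taken to be exactly those sharing one split-task identifier, each subtask is modeled as a `Callable` that returns its result rows instead of writing a data file, and the class name `SubtaskMerger` is hypothetical. Merging happens only after every subtask has finished, and the rows are concatenated in the order in which the subtasks were listed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SubtaskMerger {
    /** Runs all subtasks in parallel, then merges their rows in subtask order. */
    static List<String> runAndMerge(List<Callable<List<String>>> subtasks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subtasks.size()));
        try {
            // invokeAll blocks until every subtask has completed or failed,
            // and returns the futures in the same order as the input list.
            List<Future<List<String>>> futures = pool.invokeAll(subtasks);
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                merged.addAll(future.get()); // throws if this subtask failed
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for subtasks", e);
        } catch (ExecutionException e) {
            // One failed subtask aborts the merge, so no partial result is returned.
            throw new IllegalStateException("a query subtask failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```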
According to the embodiment of the invention, the splitting strategy is formed based on the historical query task, so that the query task can be split intelligently, and the query efficiency is improved; meanwhile, when the historical task set is determined, the factors such as the target object of the query task, the query time range and the query field are further combined, so that the selection of the historical task set is more accurate, the reliability of the splitting strategy is ensured, and the query speed is further improved.
Fig. 2 is a schematic diagram of a main flow of query subtask execution according to an embodiment of the present invention, in which each query subtask can execute a query task through a Hive-Client by encapsulating a Jar package Hive-Client.
Step 201: the object is initialized based on the query SQL parameters.
Specifically, the object may be initialized through a Java constructor call such as HiveClient hiveClient = new HiveClient(parameters), where parameters is the object parameter that encapsulates the query SQL information.
Step 202: parsing the parameters and determining the execution parameters of the query engine.
Specifically, the SQL can be parsed through a function resolveHiveProperties() to determine the execution parameters of the query engine. That is, different Hive execution parameters, such as the number of MapReduce tasks to request and the JVM parameters, are set for different Hive tables so as to achieve the best execution effect.
Step 203: parsing the parameters and determining the user task parameters.
The task parameters submitted by the user are parsed through a function resolveParameters() to obtain the task parameters, including header parameters, enumeration value conversion, and the like.
Step 204: a data file and header are generated.
A data result file is created through a function generatefileandadd() based on the task parameters parsed in step 203, and the header data of the file is set.
Step 205: submitting the SQL for execution, obtaining the result, and writing it into the data file.
The query results are continuously appended, through the function executeSQLAndAppendData(), to the data result file created in step 204. Specifically, the SQL query may be executed using the distributed query mechanism of the Hive engine in the prior art; for example, the logical query plan corresponding to a subtask is parsed into a specific physical execution plan, which is distributed to different threads on different nodes of the cluster for execution.
Preferably, to make Hive-Client easier to use, steps 202 to 205 may be further encapsulated, and the query is executed through hiveClient.process(). The return value may be a status code, for example int returnCode = hiveClient.process(), and the status of the current query, such as success, failure, or suspension, is determined from the status code; the query result is written directly into the data file through executeSQLAndAppendData() in step 205. With this encapsulation, a user of the data query method only needs to call hiveClient.process() to obtain the query data file and the status of the corresponding query.
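The encapsulation of steps 202 to 205 behind a single call can be pictured with the skeleton below. Everything in it is an assumption made for illustration: the class name `HiveClientSketch`, the status code values, the exact method names, and the step bodies, which only record that they ran. It shows the control flow of running the four steps in order and reporting a status code, not the patent's actual Hive-Client.

```java
import java.util.ArrayList;
import java.util.List;

class HiveClientSketch {
    static final int SUCCESS = 0; // assumed status code values
    static final int FAILURE = 1;

    final List<String> steps = new ArrayList<>(); // records which steps ran, for illustration
    private final String sql;

    HiveClientSketch(String sql) {
        this.sql = sql;
    }

    /** Runs steps 202 to 205 in order and reports the outcome as a status code. */
    int process() {
        try {
            resolveHiveProperties();   // step 202: engine execution parameters
            resolveParameters();       // step 203: user task parameters
            generateFileAndHeader();   // step 204: result file and header
            executeSqlAndAppendData(); // step 205: run the SQL, append rows
            return SUCCESS;
        } catch (RuntimeException e) {
            return FAILURE;
        }
    }

    void resolveHiveProperties() { steps.add("resolveHiveProperties"); }

    void resolveParameters() { steps.add("resolveParameters"); }

    void generateFileAndHeader() { steps.add("generateFileAndHeader"); }

    void executeSqlAndAppendData() {
        // Placeholder check standing in for a real query submission.
        if (sql == null || sql.trim().isEmpty()) {
            throw new IllegalArgumentException("empty SQL");
        }
        steps.add("executeSqlAndAppendData");
    }
}
```

A caller would then only need something like `int returnCode = new HiveClientSketch(sql).process();` and a check of the returned code.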
The Hive-Client query method provided by the embodiment of the invention determines the execution parameters and the task parameters of the query engine through parameter analysis, so that the query result can be provided more efficiently, and the technical problems of poor compatibility, low query speed and unstable query process of the existing query method are solved. Meanwhile, by further encapsulating the query process, the user can conveniently and directly use the Hive-Client to execute the sub-task query, and the use experience of the user is improved.
FIG. 3 is a schematic diagram of the main process of query task execution according to an embodiment of the present invention.
Take the following query task as an example.
SQL: SELECT date, partition, direct order amount FROM tableB WHERE dt >= '2020-10-01' AND dt <= '2020-12-29'.
Based on the number of records generated in tableB each day, usually on the order of one hundred million, the minimum and maximum values of the query task splitting period are determined, for example a minimum of 5 days and a maximum of 50 days. This means that when the query time range of a query task is less than 5 days, the Hive-Client query can be called directly without splitting the task. When the query task needs to be split, the query time range of each split subtask must not exceed 50 days, so that an overly long query time range does not reduce the query success rate of the subtask. From the SQL query task above, the query time range of the current task is 90 days, so the task needs to be split.
The task splitting component is responsible for splitting the query task according to the intelligent splitting strategy. The selection of the historical task set and the determination of the splitting strategy are similar to steps S102 and S103 in the previous embodiment. For example, it may be determined through steps S102 and S103 that the splitting period of the current query task is 30 days, and the task splitting component then splits the current query task into 3 subtasks, SQL1 to SQL3, as follows:
SQL1: SELECT date, partition, direct order amount FROM tableB WHERE dt >= '2020-10-01' AND dt <= '2020-10-30'.
SQL2: SELECT date, partition, direct order amount FROM tableB WHERE dt >= '2020-10-31' AND dt <= '2020-11-29'.
SQL3: SELECT date, partition, direct order amount FROM tableB WHERE dt >= '2020-11-30' AND dt <= '2020-12-29'.
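As an illustration of the split shown above, the following sketch cuts an inclusive date range into consecutive windows of a given splitting period and renders one SQL statement per window. The function names, the SQL template, and its English column names are assumptions made for the example; they are not taken from the embodiment.

```python
from datetime import date, timedelta
from typing import List, Tuple

DateRange = Tuple[date, date]


def split_date_range(start: date, end: date, period_days: int) -> List[DateRange]:
    """Split inclusive [start, end] into consecutive windows of at most period_days."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    ranges = []
    cursor = start
    while cursor <= end:
        # The last window is clipped to the end of the original range.
        window_end = min(cursor + timedelta(days=period_days - 1), end)
        ranges.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return ranges


def build_subtask_sql(template: str, ranges: List[DateRange]) -> List[str]:
    """Render one SQL statement per window."""
    return [template.format(start=s.isoformat(), end=e.isoformat()) for s, e in ranges]


# Hypothetical template; the column names are placeholders.
TEMPLATE = ("SELECT dt, partition_name, direct_order_amount FROM tableB "
            "WHERE dt >= '{start}' AND dt <= '{end}'")

ranges = split_date_range(date(2020, 10, 1), date(2020, 12, 29), 30)
subtasks = build_subtask_sql(TEMPLATE, ranges)
for sql in subtasks:
    print(sql)
```

With a 30-day period, the 90-day range yields the three windows used by SQL1 to SQL3.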
Each subtask obtains its query result by calling Hive-Client to execute the subtask query; the specific process is described in embodiment 2. Taking the SQL1 task as an example, a HiveClient object is first initialized (new HiveClient) to encapsulate the information of the query SQL1; the query is then executed through int returnCode = hiveClient. Similar to the embodiment in FIG. 2, when executed, the HiveClient product() call parses the parameters, determines the execution parameters of the query engine and the user task parameters, generates a data file and a header according to the parsed parameters, and then submits SQL1 for execution, obtains the result, and writes it into the data file. FIG. 4 shows partial query results for the SQL1 query subtask. The header includes "date", "partition", and "direct order amount", and the corresponding query results are appended to the table.
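The execution flow described above (parse parameters, create the data file and header, run the query, append the results) might look roughly like the sketch below. It does not reproduce the Hive-Client API: `run_subtask`, `parse_parameters`, the parameter keys, and the stand-in `fake_engine` are all hypothetical, and a CSV file is assumed as the data file format.

```python
import csv
import os
import tempfile

# Hypothetical keys treated as query-engine execution parameters.
ENGINE_KEYS = {"queue", "timeout_seconds"}


def parse_parameters(params):
    """Separate engine execution parameters from user task parameters."""
    engine_params = {k: v for k, v in params.items() if k in ENGINE_KEYS}
    task_params = {k: v for k, v in params.items() if k not in ENGINE_KEYS}
    return engine_params, task_params


def run_subtask(params, run_query):
    """Create the data file with its header, run the query, append the rows.

    Returns 0 on success and 1 on failure, mirroring an integer return code.
    """
    engine_params, task_params = parse_parameters(params)
    path = task_params["output_path"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(task_params["header"])
    try:
        rows = run_query(task_params["sql"], engine_params)
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
    except Exception:
        return 1
    return 0


def fake_engine(sql, engine_params):
    """Stand-in for a real query engine; returns made-up rows."""
    return [("2020-10-01", "p1", 120), ("2020-10-02", "p1", 95)]


def demo():
    """Run one subtask against the fake engine and read the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sql1.csv")
        code = run_subtask(
            {"sql": "SELECT ...", "queue": "default", "output_path": path,
             "header": ["date", "partition", "direct order amount"]},
            fake_engine,
        )
        with open(path, newline="", encoding="utf-8") as f:
            return code, list(csv.reader(f))
```

Writing the header before the query is submitted means the file exists even when the query fails, so a caller can tell an empty result from a missing one by checking the return code.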
After the subtasks SQL1 to SQL3 have all been executed correctly, the automatic subtask merging component merges their query results and packages them, and the user is notified that the result can be downloaded.
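A minimal sketch of this merge step is given below, assuming each subtask carries an identifier and an integer return code (0 meaning success). The `SubtaskResult` type and `merge_if_complete` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class SubtaskResult:
    task_id: str            # identifier set for the subtask
    return_code: int        # 0 means the subtask executed correctly
    header: Tuple[str, ...]
    rows: List[tuple]


def merge_if_complete(
    expected_ids: Sequence[str],
    results: Dict[str, SubtaskResult],
) -> Optional[Tuple[Tuple[str, ...], List[tuple]]]:
    """Merge rows in subtask order, or return None if any subtask is missing or failed."""
    header = None
    merged: List[tuple] = []
    for task_id in expected_ids:
        result = results.get(task_id)
        if result is None or result.return_code != 0:
            return None
        if header is None:
            header = result.header
        elif result.header != header:
            raise ValueError("subtask headers differ; results cannot be concatenated")
        merged.extend(result.rows)
    if header is None:
        return None
    return header, merged


HEADER = ("date", "partition", "direct order amount")
results = {
    "SQL1": SubtaskResult("SQL1", 0, HEADER, [("2020-10-01", "p1", 120)]),
    "SQL2": SubtaskResult("SQL2", 0, HEADER, [("2020-10-31", "p1", 80)]),
    "SQL3": SubtaskResult("SQL3", 0, HEADER, [("2020-11-30", "p1", 60)]),
}
merged = merge_if_complete(["SQL1", "SQL2", "SQL3"], results)
```

Merging in the order of the expected identifiers keeps the combined rows in date order regardless of which subtask finished first.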
FIG. 5 is a schematic diagram of the main modules of a data query apparatus according to an embodiment of the present invention.
The data query apparatus 500 includes:
a receiving module 501, configured to receive a query request, where the query request carries a query task;
a historical task set generating module 502, configured to select historical query subtasks according to the query task and generate a historical task set;
a task splitting module 503, configured to determine a query task splitting policy according to the historical task set, split the query task based on the policy, and generate a plurality of query subtasks;
a subtask execution module 504, configured to execute the plurality of query subtasks and obtain a first query result corresponding to each query subtask;
and a merging module 505, configured to merge the first query results to generate a second query result corresponding to the query request.
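To make the role of the historical task set generating module 502 more concrete, the sketch below filters a log of past subtasks down to those that resemble the current query task. The record layout and the matching rules (same target table, same set of query fields, time range within the allowed splitting-period bounds) are assumptions for illustration; the embodiment may select history differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class HistoricalSubtask:
    table: str               # query target object
    fields: FrozenSet[str]   # queried fields
    range_days: int          # time query range of the subtask
    efficiency: float        # hypothetical metric, e.g. rows returned per second


def select_history(
    history: Iterable[HistoricalSubtask],
    table: str,
    fields: Iterable[str],
    min_days: int,
    max_days: int,
) -> List[HistoricalSubtask]:
    """Keep past subtasks on the same table and fields whose range is within bounds."""
    wanted = frozenset(fields)
    return [
        h for h in history
        if h.table == table
        and h.fields == wanted
        and min_days <= h.range_days <= max_days
    ]


log = [
    HistoricalSubtask("tableB", frozenset({"date", "partition"}), 30, 210.0),
    HistoricalSubtask("tableB", frozenset({"date", "partition"}), 60, 90.0),
    HistoricalSubtask("tableA", frozenset({"date", "partition"}), 30, 300.0),
    HistoricalSubtask("tableB", frozenset({"date"}), 10, 110.0),
]
task_set = select_history(log, "tableB", ["date", "partition"], 5, 50)
```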
Fig. 6 shows an exemplary system architecture 600 of a data query method or data query device to which embodiments of the present invention may be applied.
As shown in FIG. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and servers 605, 606, 607, 608. The network 604 provides the medium for communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired or wireless communication links or fiber optic cables. The server 605 may also be connected to the servers 606, 607, 608 through any of the above connection types, and the servers 606, 607, 608 may be cluster nodes that execute the specific physical execution plan into which the server 605 parses the logical query plan.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 in order to send a query request. The terminal devices 601, 602, 603 can access the query service provided by the server 605 over the web through RPC and/or HTTP interfaces.
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides a data query function. A Hive environment is configured on it, and a web service may be developed based on the Hive-Client JAR to provide the data query function.
It should be noted that the data query method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the data query apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as: a processor including a receiving module, and so on. In some cases the names of these modules do not limit the modules themselves; for example, the receiving module may also be described as a "module for receiving query requests".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to:
receiving a query request, wherein the query request carries a query task;
selecting a historical query subtask according to the query task, and generating a historical task set;
determining a query task splitting strategy according to the historical task set, splitting the query task based on the strategy, and generating a plurality of query subtasks;
executing a plurality of query subtasks, and acquiring a first query result corresponding to each query subtask;
and merging the first query results to generate a second query result corresponding to the query request.
According to the technical scheme of the embodiment of the invention, the technical means of intelligently splitting the task based on the historical query task and querying the split subtasks through the packaged client are adopted, so that the technical problems of poor compatibility, low data query speed and unstable query process of the conventional query mode are solved, and the technical effect of improving the data query efficiency is achieved.
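The history-based choice of a splitting period can be illustrated with the sketch below, which follows the two variants recited in the claims: group past subtasks by time query range, average the efficiency of each group, then pick either the group with the highest average or the group whose average is closest to the mean of the group averages. The efficiency metric, the sample data, and the reading of "matched" as "closest" are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Each sample is (time query range in days, query efficiency) for one past subtask.
Sample = Tuple[int, float]


def average_efficiency_by_range(samples: Iterable[Sample]) -> Dict[int, float]:
    """Classify samples by time range and compute each class's average efficiency."""
    groups = defaultdict(list)
    for range_days, efficiency in samples:
        groups[range_days].append(efficiency)
    return {days: mean(values) for days, values in groups.items()}


def period_with_highest_efficiency(samples: Iterable[Sample]) -> int:
    """Variant 1: the range whose class has the highest average efficiency."""
    averages = average_efficiency_by_range(samples)
    return max(averages, key=averages.get)


def period_closest_to_overall_average(samples: Iterable[Sample]) -> int:
    """Variant 2: the range whose class average is closest to the mean of class averages."""
    averages = average_efficiency_by_range(samples)
    overall = mean(list(averages.values()))
    return min(averages, key=lambda days: abs(averages[days] - overall))


# Made-up history: a predetermined number of past subtasks.
samples = [(10, 100.0), (10, 120.0), (30, 200.0), (30, 220.0), (50, 150.0)]
best = period_with_highest_efficiency(samples)
typical = period_closest_to_overall_average(samples)
```

On this sample data the first variant selects 30 days and the second selects 50 days, which shows that the two variants can disagree: the first favors peak efficiency, the second a representative one.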
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (12)
1. A method for querying data, comprising:
receiving a query request, wherein the query request carries a query task;
selecting a historical query subtask according to the query task, and generating a historical task set;
determining a query task splitting strategy according to the historical task set, splitting the query task based on the strategy, and generating a plurality of query subtasks;
executing a plurality of query subtasks, and acquiring a first query result corresponding to each query subtask;
and merging the first query results to generate a second query result corresponding to the query request.
2. The method of claim 1, wherein determining a query task splitting policy from a set of historical tasks comprises:
determining a time query range and query efficiency corresponding to the historical query subtasks in the historical task set;
and determining the splitting period of the query task according to the time query range and the query efficiency.
3. The method of claim 2, wherein determining the split period of the query task according to the time query scope and the query efficiency comprises:
selecting a predetermined number of historical query subtasks from a historical task set;
classifying the historical query subtasks selected from the historical task set according to the time query range, and calculating the first average query efficiency of each type of historical query subtasks;
and selecting a time query range corresponding to the historical query subtask classification with the highest average query efficiency as a splitting period of the query task.
4. The method of claim 2, wherein determining the split period of the query task according to the time query scope and the query efficiency comprises:
selecting a predetermined number of historical query subtasks from a historical task set;
classifying the historical query subtasks selected from the historical task set according to the time query range, and calculating the first average query efficiency of each type of historical query subtasks;
calculating the average value of the first average query efficiency to obtain a second average query efficiency;
and determining the matched historical query subtask classification according to the second average query efficiency, and taking the time query range as the splitting period of the query task.
5. The method of any of claims 1-4, wherein selecting a historical query subtask based on the query task to generate a set of historical tasks comprises:
determining a value range of a splitting period according to a query target object, selecting a historical query subtask based on the value range, and generating a historical task set; and/or
selecting a historical query subtask according to the query field of the query task to generate a historical task set.
6. The method of claim 1, wherein after receiving the query request, further determining whether a query time range of the query task is smaller than a preset threshold, and if so, directly executing the task query.
7. The method according to any of claims 1-4, wherein in said executing a plurality of query subtasks, for each query subtask, the following query steps are performed:
analyzing the query subtask, and determining task parameters and execution parameters of a query engine;
creating an initial query result file and a corresponding field according to the task parameters;
and executing the subtask query by a query engine according to the execution parameters, and writing a query result into the initial query result file to generate the first query result.
8. The method according to any of claims 1-4, wherein merging the first query results to generate a second query result corresponding to the query request comprises:
setting identifications for a plurality of query subtasks corresponding to the query tasks;
judging whether the plurality of query subtasks are executed correctly or not according to the identification and generating the first query result;
if so, merging the first query result.
9. The method of claim 7, wherein the subtask query step is encapsulated and invoked via an interface.
10. A data query apparatus, comprising:
the receiving module is used for receiving a query request, and the query request carries a query task;
the historical task set generation module is used for selecting a historical query subtask according to the query task and generating a historical task set;
the task splitting module determines a query task splitting strategy according to the historical task set, splits the query task based on the strategy and generates a plurality of query subtasks;
the subtask execution module is used for executing a plurality of query subtasks and acquiring a first query result corresponding to each query subtask;
and the merging module merges the first query results to generate second query results corresponding to the query requests.
11. A data query device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110103292.1A CN113779060A (en) | 2021-01-26 | 2021-01-26 | Data query method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113779060A true CN113779060A (en) | 2021-12-10 |
Family
ID=78835540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110103292.1A Pending CN113779060A (en) | 2021-01-26 | 2021-01-26 | Data query method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113779060A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114817293A (en) * | 2022-03-31 | 2022-07-29 | 华能信息技术有限公司 | Data query method and system based on distributed SQL |
CN115016873A (en) * | 2022-05-05 | 2022-09-06 | 上海乾臻信息科技有限公司 | Front-end data interaction method and system, electronic equipment and readable storage medium |
CN116739319A (en) * | 2023-08-15 | 2023-09-12 | 中国兵器装备集团兵器装备研究所 | Method and system for improving task execution time safety of intelligent terminal |
CN117349323A (en) * | 2023-12-05 | 2024-01-05 | 苏州元脑智能科技有限公司 | Database data processing method and device, storage medium and electronic equipment |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101464884A (en) * | 2008-12-31 | 2009-06-24 | 阿里巴巴集团控股有限公司 | Distributed task system and data processing method using the same |
US20120005021A1 (en) * | 2010-07-02 | 2012-01-05 | Yahoo! Inc. | Selecting advertisements using user search history segmentation |
CN102779183A (en) * | 2012-07-02 | 2012-11-14 | 华为技术有限公司 | Data inquiry method, equipment and system |
US20120310916A1 (en) * | 2010-06-04 | 2012-12-06 | Yale University | Query Execution Systems and Methods |
WO2015076662A1 (en) * | 2013-11-20 | 2015-05-28 | Mimos Berhad | A system and method for predicting query in a search engine |
WO2015074466A1 (en) * | 2013-11-22 | 2015-05-28 | 华为技术有限公司 | Data search method and apparatus |
CN106407190A (en) * | 2015-07-27 | 2017-02-15 | 阿里巴巴集团控股有限公司 | Event record querying method and device |
CN106557578A (en) * | 2016-11-23 | 2017-04-05 | 中国工商银行股份有限公司 | The inquiry of historical data method and system |
CN107798056A (en) * | 2017-09-05 | 2018-03-13 | 海纳信成(北京)信息技术有限公司 | A kind of data query method and device |
US20180322168A1 (en) * | 2017-05-04 | 2018-11-08 | Salesforce.Com, Inc. | Technologies for asynchronous querying |
US20190087457A1 (en) * | 2017-09-21 | 2019-03-21 | Oracle International Corporation | Function semantic based partition-wise sql execution and partition pruning |
US20190236185A1 (en) * | 2018-01-26 | 2019-08-01 | Vmware, Inc. | Splitting a time-range query into multiple sub-queries for serial execution |
CN110096489A (en) * | 2019-04-30 | 2019-08-06 | 阿里巴巴集团控股有限公司 | A kind of data query method, system, device and electronic equipment |
Non-Patent Citations (1)
Title |
---|
谷震离;: "SQL Server数据库应用程序性能优化方法", 计算机工程与设计, no. 15, 16 August 2006 (2006-08-16) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114817293A (en) * | 2022-03-31 | 2022-07-29 | 华能信息技术有限公司 | Data query method and system based on distributed SQL |
CN114817293B (en) * | 2022-03-31 | 2022-11-08 | 华能信息技术有限公司 | Data query method and system based on distributed SQL |
CN115016873A (en) * | 2022-05-05 | 2022-09-06 | 上海乾臻信息科技有限公司 | Front-end data interaction method and system, electronic equipment and readable storage medium |
CN116739319A (en) * | 2023-08-15 | 2023-09-12 | 中国兵器装备集团兵器装备研究所 | Method and system for improving task execution time safety of intelligent terminal |
CN116739319B (en) * | 2023-08-15 | 2023-10-13 | 中国兵器装备集团兵器装备研究所 | Method and system for improving task execution time safety of intelligent terminal |
CN117349323A (en) * | 2023-12-05 | 2024-01-05 | 苏州元脑智能科技有限公司 | Database data processing method and device, storage medium and electronic equipment |
CN117349323B (en) * | 2023-12-05 | 2024-02-27 | 苏州元脑智能科技有限公司 | Database data processing method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||