CN107480220B

CN107480220B - Rapid text query method based on online aggregation

Info

Publication number: CN107480220B
Application number: CN201710647283.2A
Authority: CN
Inventors: 伍赛; 庞志飞; 陈珂; 陈刚; 寿黎但; 胡天磊
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-08-01
Filing date: 2017-08-01
Publication date: 2021-01-12
Anticipated expiration: 2037-08-01
Also published as: CN107480220A

Abstract

The invention discloses a quick text query method based on online aggregation. Firstly, reading a text file in a specified data mode; secondly, input query is carried out in an online aggregation mode, and all queries are organized in a tree mode; and finally, calculating corresponding query confidence degrees or confidence intervals according to different aggregation functions, and stopping the query process when the confidence degrees or the confidence intervals meet preset critical values. The method can directly inquire on the oversized text data, quickly obtain an approximate result and realize quick inquiry response on the oversized text data.

Description

Rapid text query method based on online aggregation

Technical Field

The invention relates to an online aggregation query technology, an index technology and a query visualization method in the field of human-computer interaction in the field of databases, in particular to a rapid text query method based on online aggregation.

Background

Due to the explosive growth of information in recent years, data analysts must capture value in order to analyze these data in depth using the latest technologies in the industry today. Before starting the analysis work, the data analyst needs to perform some pre-processing operations on the data, such as performing some specific queries to check whether the data is necessary for further analysis. However, such operations can be very time consuming to accomplish with conventional database systems because the task of importing and indexing very large data sets into a database is complex and often takes a long time. In practical applications, users usually do not need to obtain accurate results, but only want to obtain a rough "profile" of the data, so the online aggregation technology is widely used.

Database crawling is a technique for reorganizing data in a database system. The database engine reorganizes the current data for predicate logic in the query. When similar query requests again, the database engine can be quickly positioned to a data interval meeting the requirement, and then the query speed is accelerated. For example, a current query is grouped by a certain attribute, and the system reorganizes the data by grouping the attribute. When a group is inquired by a careful request, the system can be quickly positioned to the corresponding group to process the data in sequence, thereby quickly returning the result. The database engine is cache-friendly to the data organization algorithm, i.e. during the rearrangement process, the condition of cache miss does not occur, thereby greatly accelerating the data rearrangement process.

The text-based SQL query technology refers to a technology that a user directly performs SQL query on text data without importing the data into a traditional database system. Generally speaking, SQL queries on text involve two phases, schema definition and format conversion. Schema definition refers to a format in which a user or a system automatically derives text data, such as column names and data types. The format conversion refers to that the system dynamically converts text data into a corresponding data structure in a memory, and then queries and calculates. For a database system with a small scale, the technology can enable a user to quickly develop a query without complicated database import operation. However, for a large-scale database system, the problems of too long import time and too slow response can occur.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a quick text query method based on online aggregation.

As shown in fig. 1, the technical solution adopted by the present invention to solve the technical problem is as follows:

the invention randomly samples the text data, queries through the online aggregation technology and returns the approximate query result acceptable by the user.

The query system described in the text refers to a system formed by the method of the invention.

The method comprises the following steps:

1) firstly, arranging a text file in a database into a specified data mode;

2) inputting queries in an online aggregation manner, wherein all queries are organized in a tree manner;

3) and calculating the query confidence or confidence interval of the query by using different aggregation functions, and stopping the query process when the confidence or confidence interval meets a preset critical value.

The step 3) is specifically as follows: the query system calculates the confidence degree or confidence interval of the query according to the aggregation function in the user query, such as avg, count, var and other functions, and according to the central limit theorem or the Hough inequality.

The data mode of the step 1) is formed by each column of each record in the text file, different columns represent the attributes of different records respectively, and each column of each record mainly comprises two parts of a column name and a data type.

The recorded attributes indicate the specific meaning of each column, e.g. for a CSV (comma separated text file) about the student's achievements, the first column might indicate the student's name, the second column might indicate the achievements, etc.

The data types are classified into a text type, a number type, and a time type. For example, for a student achievement file, the name is a text type and the achievement is a number type.

For a text file, the data schema is a unique and deterministic data structure.

The step 2) is specifically as follows:

2.1) compiling the query requirement into a standard SQL statement by a user, inputting the SQL statement for query to obtain a query result, wherein the query is a random sampling query which is carried out in an online aggregation mode aiming at a text file;

2.2) organizing all queries currently performed on the text file in a tree structure mode, wherein root nodes in the tree structure are virtual query nodes (empty nodes) established in advance and do not perform specific query tasks; after a query requirement is input, a node is established and distributed in the tree structure, and a single query result and a final query result which are obtained in real time are stored in a buffer area corresponding to the node on the memory; distributing the current query to a node under the root node or the root node according to the comparison condition of the current query result and the previous query result;

the method comprises the following steps: and if the current query is similar to the previous query and has overlapping query requirements, distributing the current query to the node corresponding to the previous similar query to form the relationship between the parent node and the child node. And if the current query result is not similar to all the previous query results and the query requirements are not overlapped, distributing the current query to the root node.

2.3) each node establishes a buffer area on the memory, and each node stores the query data corresponding to the query to the buffer area; and when the child node is generated, if the query data of the child node and the query data of the parent node are close to each other and have the query requirements overlapping, the query data of the child node is acquired from the parent node and is directly used as the query data of the child node, so that the query process is accelerated.

The query in the step 2.1) is a query which continuously performs random sampling for a plurality of times from the text file until the query result enables the user to be satisfied, and each query is performed on the text file once to obtain query data.

And storing the query data of the single query into the memory buffer area of the corresponding node after each query.

Inputting a query requirement to query, obtaining each single query result, storing the single query result into a buffer area, taking a union set of the single query results of multiple queries, storing the union set into a coverage buffer area in real time, taking the satisfied query result of a user as a final query result, and obtaining different final query results corresponding to different query requirements.

The invention enables the sampled data to have the property of one-time reading and multi-time multiplexing through the arrangement and the application of the buffer.

Furthermore, the invention feeds back the current query state including the current confidence degree, the confidence interval and other information to the user in real time through the asynchronous call query state interface.

When the user queries by adopting the method, a plurality of queries can be simultaneously carried out. When the query progress is slow and the memory is tense, the user stops any query process. When a user thinks that a certain query result has no value, the query is completely deleted from the tree structure to save the memory.

The invention saves memory resources by stopping or deleting a certain query process. The method comprises the following steps: when a query process is stopped, its sub-query processes are not affected. When a node of a query is deleted from the tree structure, the node corresponding to its sub-query is stopped and deleted.

When the query is finished, a result is returned, and the result is displayed on the console by default. The user can download and save the files in the JSON format according to the requirements.

According to the method, the SQL query is directly carried out on the text data without importing the oversized text data into a traditional database system, and the approximate result is quickly returned within the acceptable confidence degree and confidence interval defined by the user by online gathering of the randomly sampled text data sets.

The query system designed by the invention supports standard SQL query and realizes quick query response on the oversized text data. The super large text data referred to in the present invention refers to a CSV file having a single file size exceeding 1 GB.

The invention carries out random dynamic sampling on data in the process of query execution by expanding a conventional query system, calculates the current approximate result and confidence interval, and returns the approximate query result under the condition of meeting a certain confidence coefficient. The error of the query result gradually decreases as the number of samples increases.

The invention has the beneficial effects that:

the invention carries out SQL query on the text data by an online aggregation method and simultaneously provides an output form derived from the result text. The method achieves excellent performance in SQL query facing ultra-large text data sets. The real-time query response requirements of the user can be met, and the results can be presented in a text or visual mode which is easy to accept by the user.

Drawings

FIG. 1 is a flow chart of the steps performed by the present invention.

FIG. 2 is a tree query organization diagram.

Detailed Description

The technical solution of the present invention will now be further explained with reference to specific embodiments and examples.

Referring to fig. 1, the embodiment of the present invention and the implementation process thereof are as follows:

step 1: first, a text file in a database is arranged into a prescribed data pattern.

The data mode is composed of records, each record is composed of columns, different columns represent the attributes of different records respectively, and each column of each record is mainly composed of two parts of a column name and a data type.

For a text file, the data schema is a unique and deterministic data structure.

Step 2: the user writes standard SQL statements according to the self-defined column names and submits the SQL statements to the query system, and an SQL executor in the system is responsible for compiling, explaining and executing the SQL statements. In the SQL writing process, a user needs to use a correct column name, and if the SQL of the user has a syntax error, the system immediately returns an error prompt.

The query is to continuously perform a plurality of different randomly sampled queries on the text file and supplement the randomly sampled queries to historical query results until the query results satisfy users, and each query is to perform a randomly sampled query on the text file once to obtain query data. And storing the single query result into the memory buffer area of the corresponding node after each query.

While the query is in progress, the user can view the current query results at any time through the graphical interface.

And step 3: organizing real-time query results in a query process in a tree structure mode, wherein root nodes in the tree structure are virtual query nodes (empty nodes) established in advance and do not perform specific query tasks; after a query requirement is input, a node is established and distributed in the tree structure, and a single query result and a final query result which are obtained in real time are stored in a buffer area corresponding to the node on the memory; distributing the current query result to a node under the root node or the root node according to the comparison condition of the current query result and the previous query result;

when assigning a parent, the system will first find the query that is closest to the WHERE clause of the query and act as the parent for the query. Wherein the WHERE clause represents a portion of the SQL statement that follows the WHERE keyword and precedes the FROM keyword. In the WHERE clause, there are one or more attributes, and the similarity of two WHERE clauses is represented by the degree of attribute overlap. For example, the degree of coincidence of WHERE age >20 AND wheereage <40AND discrict 1 is 1, AND the degree of coincidence of WHERE age >20 AND wheere discrict 1 is 0. If the query does not have a WHERE clause, the system takes the root node as the parent node of the query. The data stream can be better organized by organizing the query through the tree structure, and the maximization of data multiplexing is realized.

And 4, step 4: as shown in fig. 2, in the query of the same sub-tree, data has a property of "read once and multiplex many times", a parent node stores the adopted data in its own buffer, and when a child query is generated, a large number of samples are quickly obtained from the buffer of the parent query, thereby accelerating the query process.

For example, as shown in FIG. 2, assume that there are queries 1-4 now in the system. Query 1 refers to the query requirement for average revenue, and query 1 is the most advanced query as the root node. Queries 2-4, which are children of query 1, are queries issued while continuing to explore data based on query 1. The symbol:: represents the concatenation of the operator. In the process of executing query 1, query 1 reads data from the data buffer pool to its own data buffer, and performs a query operation with the data. At this point, assume that the user was motivated by query 1, wanting to see income for gender grouped and age < 25. When query 2 enters the system, the system will designate query 1 as the parent node for query 2, since the query 2 and query 1 part operations (SELECT AVG (salt)) are the same.

When query 2 starts executing, query 2 is the average income case for queries grouped under genre and age <25, the system will load the data of query 1 buffer directly into the buffer of query 2 without reading from the file. Meanwhile, the WHERE clause is added to the query 2 compared with the query 1, so that the data in the buffer area of the query 1 can be reorganized through the WHERE clause.

When query 3 begins execution, query 3 is the average income of the query age >30 people. Likewise, the system designates query 1 as its parent. Since the buffer of query 1 has been rearranged for this attribute of age and the WHERE clause of query 3 is age >30 while query 2 is in progress, query 3 directly reads the data of age >25 of query 1, while the buffer of query 1 continues to be subdivided into 3 segments according to age.

When query 4 begins execution, query 4 is the average revenue case for queries grouped by district. Since the WHERE clause of query 4 does not contain age, data cannot be quickly obtained from the buffer of age in query 1. At this point, query 1 creates a buffer for the district. When a new query about distribute enters the system, the buffer area is directly used, so that the query speed is accelerated.

A higher confidence is obtained because no more samples will be read per query. The data flow is from the father node to each child node, and each child node obtains more samples needed by itself. It should be noted that. When the query tree is higher, the samples obtained by the query nodes at the bottom layer become less, and in this case, the query nodes need to obtain more samples from the files additionally.

And 5: the query system calculates the confidence or confidence interval of the query according to the aggregation function in the user query, such as AVG, COUNT, VAR and other functions, and according to the central limit theorem or Hough inequality, and when the two indexes meet the defined critical value, the query is stopped.

In order to be able to return approximate results quickly in a small amount of sample space, the present embodiment adopts the hough inequality as a statistical model of the query system.

The variable T representing a record containing m recordsText file, note t₁,t₂,...,t_m∈T，t₁Representing the first record and T representing the set of query records. Assume that the existing query is as follows:

SELECT AVG(expression(t_i))FROM T

wherein AVG represents the aggregation function of averaging, SELECT and FROM are SQL statement keywords, expression (t)_i) Indicating that a record t is being recorded from_iAnd (4) a mathematical expression of any column value in (1).

S represents a set of query data randomly picked from T. Let | S | ═ n and | T | ═ m, we can get P (si ∈ S) ═ 1/m,1 ≦ i ≦ n,

that is, the probability of each sample in the sample set being selected is equal, where m represents the total number of query records in the text file, n represents the total number of randomly selected query data, s_iRepresenting each query data randomly picked. Therefore, an estimated aggregate result may be obtained for a given particular query data.

Wherein the content of the first and second substances,

representing the estimated aggregate value and n representing the number of randomly picked samples.

For the true aggregate results, denoted as a, as follows.

Let a and b be t_iUpper and lower bounds. The confidence interval may be estimated by the hough inequality according to the current query data. Wherein ε represents the upper error limit.

Wherein P represents the probability that the difference between the true and estimated values is less than epsilon.

The confidence is expressed by the symbol p and solved by equaling the above equation to p, i.e.

And calculating an upper error limit epsilon by adopting the following formula:

where ε represents the upper error limit and p represents the confidence.

Therefore, the query system can ensure that the real aggregated result falls in the interval in a sampling mode under the condition that the probability p is not less than

In (1).

And the query system alternately optimizes epsilon and p, and after the user or system self-defined critical values are met, the query is finished and the result is returned.

When the query system finds that the current confidence is higher than a preset threshold value or the confidence interval is smaller than a preset interval in the calculation process, the query is stopped, and a final result is returned.

Step 6: the query system feeds back the current query state including the current confidence, confidence interval and other indexes to the user in real time through an asynchronous call query state interface.

Query time is long relative to normal database query time, since queries based on online aggregation typically require iteration multiple times to reach a final result.

When a user wants to view the current results immediately, the results of the current query may be asynchronously requested from the query system. During the iteration process, each query will keep the last query snapshot in the buffer, including the last calculation result. When the query is accidentally exited or the user stops, the query can be quickly recovered from the current time point by using the snapshot information and continuously executed.

And 7: the user can stop or delete a query by specifying that its sub-queries are unaffected after the query is stopped.

After the father node is stopped, the father node does not perform any filtering operation on the data flow, meanwhile, any data cannot be cached in the buffer area of the father node, and the father node only serves as a forwarding node and forwards the data flow to the child node of the father node.

And when the parent node is restarted, the original operation on the data stream is continued to be recovered. When a user specifies to delete a query, its sub-queries are stopped and deleted, and the buffers reserved by each node are also released.

And 8: when the query is finished, a result is returned, and the result is displayed on the console by default. The user can download and save the files in the JSON format according to the requirements. Wherein the result set is represented as a JSON array. Each record in the result set is a JSON object. The key of each JSON object is a column name, and the value is the value corresponding to the column name.

Claims

1. A quick text query method based on online aggregation is characterized in that: the method comprises the following steps:

1) firstly, arranging a text file in a database into a specified data mode;

the step 2) is specifically as follows:

the query in the step 2.1) is to continuously perform random sampling query for multiple times from the text file until the query result is satisfied by a user, and each query is to perform random sampling query for the text file once to obtain query data;

2.2) organizing all queries currently performed on the text file in a tree structure mode, wherein root nodes in the tree structure are virtual query nodes established in advance and do not perform specific query tasks; establishing and distributing a node in a tree structure after inputting a query requirement; distributing the current query to a node under the root node or the root node according to the comparison condition of the current query result and the previous query result;

2.3) each node establishes a buffer area on the memory, and each node stores the query data corresponding to the query to the buffer area; when the child node is generated, if the query data of the child node and the query data of the father node are close to each other and have query requirements overlapping, acquiring the query data from the father node and directly using the query data as the query data of the child node;

2. The fast text query method based on online aggregation according to claim 1, wherein: the data mode of the step 1) is formed by each column of each record in the text file, different columns represent the attributes of different records respectively, and each column of each record mainly comprises two parts of a column name and a data type.

3. The method of claim 2, wherein the method comprises: inputting a query requirement to query, obtaining each single query result, storing the single query result into a buffer area, taking a union set of the single query results of multiple queries, storing the union set into a coverage buffer area in real time, taking the satisfied query result of a user as a final query result, and obtaining different final query results corresponding to different query requirements.