CN109710642B

CN109710642B - Index aggregation parallel processing system based on big data architecture

Info

Publication number: CN109710642B
Application number: CN201811549059.0A
Authority: CN
Inventors: 李秋实; 谢莹莹; 郭庆; 宋怀明; 蒋丹东
Original assignee: Zhongke Shuguang International Information Industry Co ltd
Current assignee: Zhongke Shuguang International Information Industry Co ltd
Priority date: 2018-12-18
Filing date: 2018-12-18
Publication date: 2021-07-27
Anticipated expiration: 2038-12-18
Also published as: CN109710642A

Abstract

The invention provides an index aggregation parallel processing system based on a big data architecture. The system comprises: a service interface, used for submitting parallel SQL to a parallel SQL service device and for further verifying the parallel SQL through a metadata service device; the parallel SQL service device, used for providing an overall query service, an index query service and an aggregation query service; and the metadata service device, which comprises a metadata base, a database system and a permission system, and is responsible for SQL verification, table and data source extraction, deciding whether to jump to the database service device or the index service device, and repackaging the parallel SQL. The invention can reduce the extra overhead of traditional big data SQL queries, and extends traditional SQL syntax, which cannot support multiple types of aggregation requirements.

Description

Index aggregation parallel processing system based on big data architecture

Technical Field

The invention relates to the technical field of big data, and in particular to an index aggregation parallel processing system based on a big data architecture.

Background

With the rapid growth of network data in recent years and the arrival of the big data era, traditional database technology can no longer meet the needs of a growing number of big data business systems. New-generation big data database technology, in turn, often cannot be tightly combined with an index system and is not effectively compatible with business requirements such as large-scale report calculation and aggregate statistical analysis. There are two main reasons:

1. Commonly used Structured Query Language (SQL) dialects (such as HiveSQL and Spark-SQL) often cannot make effective use of indexes at large data volumes;

2. Standard database SQL syntax offers only limited support for large-scale aggregation requirements.

For example, when SQL is required to handle the following requirements simultaneously:

SELECT ... FROM A WHERE ${filter1} GROUP BY ${grouping1}

SELECT ... FROM A WHERE ${filter2} GROUP BY ${grouping2}

...

SELECT ... FROM A WHERE ${filterN} GROUP BY ${groupingN}

Traditional big data technology, index technology and their related optimization methods cannot compute all of these results by querying the data of A only once, so the performance overhead can reach several times or even tens of times that of a single query. Existing optimization methods only address how to use an index to optimize the query performance of a single SQL statement more effectively.
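To make this overhead concrete, the following minimal Python sketch contrasts running N filtered GROUP BY counts as N separate scans with computing all of them in one scan. The in-memory row list standing in for table A, the `CountingTable` class and both function names are hypothetical illustrations, not part of the described system.

```python
from collections import Counter


class CountingTable:
    """Hypothetical stand-in for table A that counts how often it is scanned."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0

    def scan(self):
        self.scans += 1
        return iter(self._rows)


def separate_queries(table, specs):
    """One full scan per (filter, grouping) pair, like N independent statements."""
    results = []
    for keep, group_key in specs:
        counts = Counter()
        for row in table.scan():
            if keep(row):
                counts[group_key(row)] += 1
        results.append(dict(counts))
    return results


def single_pass(table, specs):
    """A single scan that feeds every (filter, grouping) pair."""
    counters = [Counter() for _ in specs]
    for row in table.scan():
        for counts, (keep, group_key) in zip(counters, specs):
            if keep(row):
                counts[group_key(row)] += 1
    return [dict(c) for c in counters]


rows = [
    {"user": "a", "company": "x", "ct": 12},
    {"user": "a", "company": "y", "ct": 3},
    {"user": "b", "company": "x", "ct": 20},
]
# Each spec is (filter predicate, grouping key); both are illustrative.
specs = [
    (lambda r: True, lambda r: r["user"]),
    (lambda r: r["ct"] > 10, lambda r: r["company"]),
]

table_n = CountingTable(rows)
table_1 = CountingTable(rows)
result_n = separate_queries(table_n, specs)
result_1 = single_pass(table_1, specs)
```

Both functions return the same result sets, but `table_n.scans` grows with the number of statements while `table_1.scans` stays at one.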

Disclosure of Invention

The index aggregation parallel processing system based on a big data architecture provided by the invention can reduce the extra overhead of traditional big data SQL queries, and extends traditional SQL syntax, which cannot support multiple types of aggregation requirements.

In a first aspect, the present invention provides a parallel processing system for index aggregation based on big data architecture, including:

a service interface, used for submitting parallel SQL to the parallel SQL service device and for further verifying the parallel SQL through the metadata service device;

a parallel SQL service device, used for providing an overall query service, an index query service and an aggregation query service;

a metadata service device, which comprises a metadata base, a database system and a permission system, and is responsible for SQL verification, table and data source extraction, deciding whether to jump to the database service device or the index service device, and repackaging the parallel SQL.

Optionally, the metadata service device performs unified integration on the query through a commonly used Spark-SQL component, and determines how to modify the parallel SQL description into a direct-processing SQL statement or a non-SQL parallel query plan.

Optionally, the parallel SQL service device is compatible with other optimization strategies of traditional SQL and big data related SQL.

Optionally, the parallel SQL syntax adopts a pipeline form; by default, the system executes the statements in pipeline order and records the result returned by each parallel SQL statement.

Optionally, the parallel index aggregation syntax provided by the parallel SQL service device is: each aggregated parallel SQL statement contains one or more aggregation elements and links the intermediate result set of <search>, which is specified by the USING keyword.

Optionally, the index aggregation syntax provided by the parallel SQL service device is: an aggregation filter condition and a grouping filter condition are added through the WHERE keyword, an aggregation group is designated through the GROUP BY keyword, and the aggregation result set is mapped through the MAP keyword.

The index aggregation parallel processing system based on a big data architecture provided by the embodiment of the invention not only supports the query SQL syntax of traditional big data databases but also, by combining index technology, extends traditional SQL syntax to cover the various aggregation requirements it cannot support, thereby addressing the large extra overhead of traditional big data SQL queries. In addition, the system is compatible with other big data optimization technologies, with related business system optimizations and other optimization means, and with various general-purpose databases, big data warehouses and their SQL syntaxes.

Drawings

Fig. 1 is a block diagram illustrating an optimization method for a database according to an embodiment of the present invention;

Fig. 2 is a block diagram illustrating an optimization method for an index system according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an index aggregation parallel processing system based on a big data architecture according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a general approach that combines big data technology with database index technology. It not only supports the query SQL syntax of traditional big data databases but also, by combining index technology, extends traditional SQL syntax to cover the various aggregation requirements it cannot support.

First, a syntax form is introduced.

The parallel SQL syntax uses a pipeline approach, for example:

< parallel SQL statement > |.

By default, the system executes the statements in pipeline order and records the result returned by each parallel SQL statement. The pipe in parallel SQL is similar to the pipe in a shell: whereas a shell pipe passes parameters through the output stream, the pipe in parallel SQL passes parameters through the result set.
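The pipeline behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the naive split on `|` ignores pipes inside string literals, and `run_pipeline`, `toy_execute` and `ROWS` are hypothetical names, with `toy_execute` standing in for whatever actually evaluates a stage.

```python
def run_pipeline(statement, execute_stage):
    """Split a pipelined statement on '|' and run the stages in order.

    execute_stage(stage_text, previous_result) is a caller-supplied callback.
    Each stage receives the result set of the stage before it, and the result
    of every stage is recorded.
    """
    stages = [s.strip() for s in statement.split("|") if s.strip()]
    recorded = []
    previous = None
    for stage in stages:
        previous = execute_stage(stage, previous)
        recorded.append((stage, previous))
    return recorded


# Toy data and executor, for illustration only.
ROWS = [{"user": "a"}, {"user": "b"}, {"user": "a"}]


def toy_execute(stage, previous):
    if previous is None:
        # Pretend the first stage is a search that returns rows.
        return list(ROWS)
    # Later stages only report how many rows they received.
    return [{"stage": stage, "input_rows": len(previous)}]


history = run_pipeline(
    "SELECT * FROM twitter | count(1) GROUP BY user", toy_execute
)
```

Here `history` holds one `(stage, result)` pair per stage, which mirrors the requirement that the result of each parallel SQL statement is recorded.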

Next, the syntax is introduced.

1. Index query syntax (search):

2. Parallel index aggregation syntax (aggregations): the traditional SQL standard cannot express this syntax.

3. Index aggregation syntax (aggregation):

By adopting the parallel SQL syntax design, multiple aggregations can be carried out in parallel at the same time. Combined with common big data indexing technologies, such as Solr, Elasticsearch and other index systems that support docValues, the aggregation is carried out in the index, which greatly reduces the extra overhead of the query. For example:

SELECT * FROM twitter | count(1) GROUP BY user, avg(age) WHERE ct > 10 GROUP BY company

The query returns two result sets. The optimization method for the database is shown in Fig. 1.

See Fig. 2 for the optimization method for the index system.
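As one hedged illustration of carrying out the aggregation in the index, the sketch below builds an Elasticsearch search request body that would compute both result sets of the twitter example in a single request. The text does not specify a request format, so this mapping is an assumption: the aggregation names (`by_user`, `ct_gt_10`, `by_company`, `avg_age`) are hypothetical, and `user` and `company` are assumed to be keyword fields. It uses the standard `terms`, `filter` and `avg` aggregations; `count(1)` corresponds to the `doc_count` that each `terms` bucket reports.

```python
import json


def build_request():
    """Build one search body covering both aggregations of the example."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "query": {"match_all": {}},  # SELECT * FROM twitter
        "aggs": {
            # count(1) GROUP BY user
            "by_user": {"terms": {"field": "user"}},
            # avg(age) WHERE ct > 10 GROUP BY company
            "ct_gt_10": {
                "filter": {"range": {"ct": {"gt": 10}}},
                "aggs": {
                    "by_company": {
                        "terms": {"field": "company"},
                        "aggs": {"avg_age": {"avg": {"field": "age"}}},
                    }
                },
            },
        },
    }


request_body = build_request()
print(json.dumps(request_body, indent=2))
```

Because both aggregations sit under one `aggs` object, the index evaluates them over the same matching documents in one request rather than in one request per statement.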

The syntax retains the characteristics of SQL: it supports not only pipelined parallel SQL but also the standard SQL of various data sources. The following table lists the compatible SQL syntax (SQL):

As shown in Fig. 3, the index aggregation parallel processing system based on a big data architecture according to an embodiment of the present invention includes:

a service interface, used for providing a WEB service interface, submitting parallel SQL to the parallel SQL service device, and further verifying the parallel SQL through the metadata service device;

a parallel SQL service device, used for providing an overall query service, an index query service and an aggregation query service, and compatible with other optimization strategies of traditional SQL and big data related SQL;

a metadata service device, which comprises a metadata base, a database system and a permission system, and is mainly responsible for SQL verification, table and data source extraction, deciding whether to jump to the metadata service device or a parallel SQL service device, and repackaging the parallel SQL, wherein the repackaging refers to adding SQL filtering and includes control of data table, row and column permissions.

The metadata service device performs unified integration on the query through a commonly used Spark-SQL component, and determines how to modify the parallel SQL description into a SQL statement which can be directly processed or a non-SQL parallel query plan.
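The repackaging step, which adds SQL filtering to enforce table, row and column permissions, might look like the following minimal Python sketch. The `Permission` structure, the `repackage` function and the choice to silently drop disallowed columns are assumptions made for illustration; the text does not define how permissions are stored or applied.

```python
from dataclasses import dataclass, field


@dataclass
class Permission:
    """Hypothetical permission record for one user."""
    tables: set                 # tables the user may query
    columns: dict               # table name -> list of allowed columns
    row_filters: dict = field(default_factory=dict)  # table -> SQL predicate


def repackage(table, requested_columns, perm):
    """Rebuild a simple SELECT so that it respects the user's permissions."""
    # Table-level control: reject tables the user may not read.
    if table not in perm.tables:
        raise PermissionError(f"no access to table {table!r}")
    # Column-level control: keep only the allowed columns.
    allowed = perm.columns.get(table, [])
    cols = [c for c in requested_columns if c in allowed]
    if not cols:
        raise PermissionError(f"no permitted columns requested on {table!r}")
    sql = f"SELECT {', '.join(cols)} FROM {table}"
    # Row-level control: append the user's row filter, if any.
    row_filter = perm.row_filters.get(table)
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


perm = Permission(
    tables={"twitter"},
    columns={"twitter": ["user", "age"]},
    row_filters={"twitter": "company = 'x'"},
)
sql = repackage("twitter", ["user", "ssn", "age"], perm)
```

In this sketch the disallowed `ssn` column is dropped and the row filter is appended, so the statement that reaches the query engine already carries the permission constraints. A stricter design could reject the whole query whenever a disallowed column is requested.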

It will be understood by those skilled in the art that all or part of the processes of the embodiments of the methods described above may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A big data architecture based index aggregation parallel processing system, comprising:

the metadata service device comprises a metadata base, a database system and a permission system, is responsible for SQL verification, table and data source extraction and is responsible for deciding to jump to the database service device or the index service device and parallel SQL repackaging;

wherein, the repackaging refers to adding SQL filtering and comprises the control of data table, row and column authorities;

the index aggregation syntax provided by the parallel SQL service device is as follows: an aggregation filter condition and a grouping filter condition are added through the WHERE keyword, an aggregation group is designated through the GROUP BY keyword, and the aggregation result set is mapped through the MAP keyword.

2. The system according to claim 1, wherein the metadata service device performs unified integration of the query through commonly used Spark-SQL components to determine how to modify the parallel SQL description into a directly processable SQL statement or non-SQL parallel query plan.

3. The system according to claim 1, wherein the parallel SQL service device is compatible with other optimization strategies of legacy SQL and big data related SQL.

4. The system of claim 1, wherein the parallel SQL syntax is pipelined, the system defaults to executing in order according to the pipeline order, and records the result returned by each parallel SQL statement.

5. The system according to claim 1, wherein the parallel index aggregation syntax provided by the parallel SQL service device is: each aggregated parallel SQL statement contains one or more aggregation elements and links the intermediate result set of <search>, which is specified by the USING keyword.