CN109710642B - Index aggregation parallel processing system based on big data architecture - Google Patents
Index aggregation parallel processing system based on big data architecture
- Publication number
- CN109710642B
- Authority
- CN
- China
- Prior art keywords
- sql
- parallel
- service device
- aggregation
- index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides an index aggregation parallel processing system based on a big data architecture. The system comprises: a service interface, used for submitting parallel SQL to the parallel SQL service device and for verifying it through the metadata service device; a parallel SQL service device, used for providing an overall query service, an index query service, and an aggregation query service; and a metadata service device, which comprises a metadata database, a database system, and a permission system, and which is responsible for SQL verification, table and data source extraction, deciding whether to jump to the database service device or the index service device, and repackaging the parallel SQL. The invention can reduce the extra overhead of traditional big data SQL queries and extends SQL to address the inability of traditional SQL syntax to support multiple types of aggregation requirements.
Description
Technical Field
The invention relates to the technical field of big data, in particular to a parallel processing system for index aggregation based on a big data architecture.
Background
With the rapid growth of network data in recent years and the arrival of the big data era, traditional database technology can no longer serve the growing number of big data business systems. At the same time, the new generation of big data database technology often cannot be tightly combined with an index system and is not effectively compatible with business requirements such as large-scale report calculation and aggregation statistical analysis. The two main reasons are:
1. The common Structured Query Language (SQL) dialects in use today (such as HiveSQL and Spark-SQL) often cannot make effective use of indexes on large data volumes;
2. Standard database SQL syntax offers only limited support for large aggregation requirements.
For example, suppose SQL is required to handle the following requirements simultaneously:
SELECT ... FROM A WHERE ${filter1} GROUP BY ${grouping1}
SELECT ... FROM A WHERE ${filter2} GROUP BY ${grouping2}
...
SELECT ... FROM A WHERE ${filterN} GROUP BY ${groupingN}
Traditional big data technology, index technology, and their related optimization methods cannot compute all of these results with a single query over the data in A, so the performance overhead can reach several times or even tens of times that of a single query. Existing optimization methods only address how to use the index to optimize the query performance of a single SQL statement more effectively.
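The cost difference described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the rows, the column names, and the functions `scan_per_query` and `scan_once` are hypothetical, and only a count aggregate is shown. The sketch contrasts reading A once per (filter, grouping) pair with reading A once for all pairs.

```python
from collections import defaultdict

# Hypothetical rows standing in for table A (assumption, not from the patent).
ROWS = [
    {"user": "ann", "company": "x", "age": 30, "ct": 12},
    {"user": "bob", "company": "x", "age": 40, "ct": 5},
    {"user": "ann", "company": "y", "age": 50, "ct": 20},
]

# Each spec is a (filter, grouping) pair, mirroring ${filterN} / ${groupingN}.
SPECS = [
    (lambda r: True, lambda r: r["user"]),
    (lambda r: r["ct"] > 10, lambda r: r["company"]),
]


def scan_per_query(rows, specs):
    """Baseline: one full pass over the data for every (filter, grouping) pair."""
    results, scans = [], 0
    for keep, group_key in specs:
        scans += 1
        counts = defaultdict(int)
        for row in rows:
            if keep(row):
                counts[group_key(row)] += 1
        results.append(dict(counts))
    return results, scans


def scan_once(rows, specs):
    """Single pass: each row is tested against all specs as it is read."""
    counts = [defaultdict(int) for _ in specs]
    for row in rows:
        for i, (keep, group_key) in enumerate(specs):
            if keep(row):
                counts[i][group_key(row)] += 1
    return [dict(c) for c in counts], 1
```

Both functions return the same result sets; the baseline reads the data N times, while the single-pass variant reads it once, which is the saving the invention aims at.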
Disclosure of Invention
The index aggregation parallel processing system based on a big data architecture provided by the invention can reduce the extra overhead of traditional big data SQL queries and extends SQL to address the inability of traditional SQL syntax to support multiple types of aggregation requirements.
In a first aspect, the present invention provides an index aggregation parallel processing system based on a big data architecture, including:
a service interface, used for submitting parallel SQL to the parallel SQL service device and for verifying it through the metadata service device;
a parallel SQL service device, used for providing an overall query service, an index query service, and an aggregation query service;
a metadata service device, which comprises a metadata database, a database system, and a permission system, and which is responsible for SQL verification, table and data source extraction, deciding whether to jump to the database service device or the index service device, and repackaging the parallel SQL.
Optionally, the metadata service device integrates queries in a unified way through the commonly used Spark-SQL component and determines how to rewrite the parallel SQL description into an SQL statement that can be processed directly or into a non-SQL parallel query plan.
Optionally, the parallel SQL service device is compatible with other optimization strategies of traditional SQL and big data related SQL.
Optionally, the parallel SQL syntax uses a pipeline; by default the system executes statements in pipeline order and records the result returned by each parallel SQL statement.
Optionally, the parallel index aggregation syntax provided by the parallel SQL service device is as follows: each aggregation parallel SQL statement contains one or more aggregation elements and is linked to the intermediate result set of <search>, which is specified by the USING keyword.
Optionally, the index aggregation syntax provided by the parallel SQL service device is as follows: an aggregation filter condition and a grouping filter condition are added through the WHERE keyword, the aggregation grouping is specified through the GROUP BY keyword, and the aggregation result set is mapped through the MAP keyword.
The index aggregation parallel processing system based on a big data architecture provided by the embodiment of the invention supports the query SQL syntax of traditional big data databases and, by combining index technology, extends SQL to address the inability of traditional SQL syntax to support multiple types of aggregation requirements, thereby reducing the large extra overhead of traditional big data SQL queries. In addition, it is compatible with other big data optimization technologies, with business system optimizations and other related optimization means, and with various general-purpose databases, big data warehouses, and their SQL syntaxes.
Drawings
Fig. 1 is a block diagram illustrating an optimization method for a database according to an embodiment of the present invention;
Fig. 2 is a block diagram illustrating an optimization method for an index system according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an index aggregation parallel processing system based on a big data architecture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a general approach that combines big data technology with database index technology. It supports the query SQL syntax of traditional big data databases and, by combining index technology, extends SQL to address the inability of traditional SQL syntax to support multiple types of aggregation requirements.
First, the syntax form is introduced.
The parallel SQL syntax uses a pipeline approach, for example:
<parallel SQL statement> | ...
By default, the system executes the statements in pipeline order and records the result returned by each parallel SQL statement. The pipe in parallel SQL is similar to the pipe in a shell: a shell pipe passes parameters through the output stream, whereas a parallel SQL pipe passes parameters through the result set.
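The pipeline behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_pipeline`, `search`, and `count_by_company` are hypothetical names, and each stage is modeled as a plain function that takes the previous result set and returns a new one.

```python
def run_pipeline(stages, source):
    """Run stages in pipeline order and record the result of every stage.

    Each stage receives the result set of the previous stage, which mirrors
    result-set based parameter passing rather than a shell byte stream.
    """
    recorded = []
    result = source
    for stage in stages:
        result = stage(result)
        recorded.append(result)
    return recorded


def search(rows):
    """Hypothetical search stage: keep rows with ct > 10."""
    return [r for r in rows if r["ct"] > 10]


def count_by_company(rows):
    """Hypothetical aggregation stage: count rows per company."""
    counts = {}
    for r in rows:
        counts[r["company"]] = counts.get(r["company"], 0) + 1
    return counts


ROWS = [
    {"company": "x", "ct": 12},
    {"company": "x", "ct": 5},
    {"company": "y", "ct": 20},
]
RECORDED = run_pipeline([search, count_by_company], ROWS)
```

Here `RECORDED[0]` holds the intermediate search result and `RECORDED[1]` the aggregation computed from it, so the result of every statement in the pipeline remains available afterwards.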
Next, the syntax is introduced.
1. Index query syntax (search):
2. Parallel index aggregation syntax (aggregations): the traditional SQL standard cannot describe this syntax.
3. Index aggregation syntax (aggregation):
With this parallel SQL syntax design, parallel aggregation becomes possible. Combined with common big data indexing technologies, such as Solr, Elasticsearch, and other index systems that support docValues, the aggregation is carried out on the index, which greatly reduces the extra overhead of the query. For example:
SELECT * FROM twitter | count(1) GROUP BY user, avg(age) WHERE ct > 10 GROUP BY company
The query returns two result sets. The optimization method for the database is shown in Fig. 1.
See Fig. 2 for the optimization method for the index system.
The syntax retains the characteristics of SQL: it supports not only pipeline-style parallel SQL but also the standard SQL of various data sources. The following table lists the compatible SQL syntax (SQL):
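To make the structure of the example statement concrete, the following Python sketch splits it into its search part and its aggregation elements. It is a simplification under loud assumptions: the grammar is inferred only from the single example (with whitespace between tokens), WHERE is assumed to precede GROUP BY, the MAP and USING keywords are not handled, and a GROUP BY with several comma-separated columns would be split incorrectly. The names `split_top_level` and `parse_parallel_sql` are hypothetical.

```python
import re

# One aggregation element: func(args) [WHERE cond] [GROUP BY column].
ELEMENT = re.compile(
    r"^(?P<func>\w+\(.*?\))"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+GROUP\s+BY\s+(?P<group>.+))?$",
    re.IGNORECASE,
)


def split_top_level(text, sep=","):
    """Split on separators that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_parallel_sql(statement):
    """Return the search part and the parsed aggregation elements."""
    search, _, aggregations = statement.partition("|")
    elements = []
    for text in split_top_level(aggregations):
        match = ELEMENT.match(text)
        if match is None:
            raise ValueError(f"cannot parse aggregation element: {text!r}")
        elements.append(match.groupdict())
    return {"search": search.strip(), "aggregations": elements}


PLAN = parse_parallel_sql(
    "SELECT * FROM twitter | count(1) GROUP BY user, "
    "avg(age) WHERE ct > 10 GROUP BY company"
)
```

The resulting `PLAN` contains one search statement and two aggregation elements, which corresponds to the two result sets that the example query returns.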
As shown in Fig. 3, the index aggregation parallel processing system based on a big data architecture according to an embodiment of the present invention includes:
a service interface, used for providing a WEB service interface, submitting parallel SQL to the parallel SQL service device, and verifying it through the metadata service device;
a parallel SQL service device, used for providing an overall query service, an index query service, and an aggregation query service, and compatible with other optimization strategies of traditional SQL and big data related SQL;
a metadata service device, which comprises a metadata database, a database system, and a permission system, and which is mainly responsible for SQL verification, table and data source extraction, deciding whether to jump to the metadata service device or a parallel SQL service device, and repackaging the parallel SQL, wherein repackaging refers to adding SQL filtering and includes control of data table, row, and column permissions.
The metadata service device integrates queries in a unified way through the commonly used Spark-SQL component and determines how to rewrite the parallel SQL description into an SQL statement that can be processed directly or into a non-SQL parallel query plan.
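The responsibilities of the metadata service device (verification, data source extraction, routing, and repackaging with permission filters) can be sketched as follows. Everything here is a hypothetical illustration rather than the patented design: the `CATALOG` contents, the `TableMeta` fields, and the `handle` function are assumptions, routing is shown between a database source and an index source as stated in the summary and claims, and the SQL is assembled by string concatenation purely for readability.

```python
from dataclasses import dataclass


@dataclass
class TableMeta:
    source: str      # "database" or "index": which service handles the table
    row_filter: str  # row-level permission predicate, empty if unrestricted


# Hypothetical metadata catalog (assumption, not from the patent).
CATALOG = {
    "twitter": TableMeta(source="index", row_filter="company = 'x'"),
    "orders": TableMeta(source="database", row_filter=""),
}


def handle(table, where, catalog=CATALOG):
    """Verify the table, pick the target service, and repackage the SQL."""
    meta = catalog.get(table)
    if meta is None:
        # Verification step: reject tables unknown to the metadata catalog.
        raise ValueError(f"unknown table: {table}")
    # Repackaging step: append the permission filter to the user's filter.
    predicates = [p for p in (where, meta.row_filter) if p]
    sql = f"SELECT * FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(f"({p})" for p in predicates)
    return meta.source, sql
```

Calling `handle("twitter", "ct > 10")` routes the query to the index source and returns the statement with the row permission predicate appended, while an unknown table is rejected at the verification step.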
The index aggregation parallel processing system based on a big data architecture provided by the embodiment of the invention supports the query SQL syntax of traditional big data databases and, by combining index technology, extends SQL to address the inability of traditional SQL syntax to support multiple types of aggregation requirements, thereby reducing the large extra overhead of traditional big data SQL queries. In addition, it is compatible with other big data optimization technologies, with business system optimizations and other related optimization means, and with various general-purpose databases, big data warehouses, and their SQL syntaxes.
It will be understood by those skilled in the art that all or part of the processes of the embodiments of the methods described above may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A big data architecture based index aggregation parallel processing system, comprising:
a service interface, used for submitting parallel SQL to the parallel SQL service device and for verifying it through the metadata service device;
a parallel SQL service device, used for providing an overall query service, an index query service, and an aggregation query service;
a metadata service device, which comprises a metadata database, a database system, and a permission system, and which is responsible for SQL verification, table and data source extraction, deciding whether to jump to the database service device or the index service device, and repackaging the parallel SQL;
wherein the repackaging refers to adding SQL filtering and includes control of data table, row, and column permissions;
and wherein the index aggregation syntax provided by the parallel SQL service device is as follows: an aggregation filter condition and a grouping filter condition are added through the WHERE keyword, the aggregation grouping is specified through the GROUP BY keyword, and the aggregation result set is mapped through the MAP keyword.
2. The system according to claim 1, wherein the metadata service device performs unified integration of the query through commonly used Spark-SQL components to determine how to modify the parallel SQL description into a directly processable SQL statement or non-SQL parallel query plan.
3. The system according to claim 1, wherein the parallel SQL service device is compatible with other optimization strategies of legacy SQL and big data related SQL.
4. The system of claim 1, wherein the parallel SQL syntax is pipelined, the system defaults to executing in order according to the pipeline order, and records the result returned by each parallel SQL statement.
5. The system according to claim 1, wherein the parallel index aggregation syntax provided by the parallel SQL service device is as follows: each aggregation parallel SQL statement contains one or more aggregation elements and is linked to the intermediate result set of <search>, which is specified by the USING keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811549059.0A CN109710642B (en) | 2018-12-18 | 2018-12-18 | Index aggregation parallel processing system based on big data architecture |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811549059.0A CN109710642B (en) | 2018-12-18 | 2018-12-18 | Index aggregation parallel processing system based on big data architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109710642A (en) | 2019-05-03 |
CN109710642B (en) | 2021-07-27 |
Family
ID=66256833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811549059.0A Active CN109710642B (en) | 2018-12-18 | 2018-12-18 | Index aggregation parallel processing system based on big data architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710642B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101196890A (en) * | 2006-12-08 | 2008-06-11 | 国际商业机器公司 | Method and device for analyzing information and application performance during polymerized data base operation |
CN103942234A (en) * | 2013-01-21 | 2014-07-23 | 中国电信股份有限公司 | Method for operating multiple heterogeneous databases, middleware device and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9251207B2 (en) * | 2007-11-29 | 2016-02-02 | Microsoft Technology Licensing, Llc | Partitioning and repartitioning for data parallel operations |
CN102622414B (en) * | 2012-02-17 | 2013-11-06 | 清华大学 | Peer-to-peer structure based distributed high-dimensional indexing parallel query framework |
Also Published As
Publication number | Publication date |
---|---|
CN109710642A (en) | 2019-05-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||