CN109710642B - Index aggregation parallel processing system based on big data architecture - Google Patents

Info

Publication number
CN109710642B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811549059.0A
Other languages
Chinese (zh)
Other versions
CN109710642A (en)
Inventor
李秋实
谢莹莹
郭庆
宋怀明
蒋丹东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Shuguang International Information Industry Co ltd
Original Assignee
Zhongke Shuguang International Information Industry Co ltd

Abstract

The invention provides an index aggregation parallel processing system based on a big data architecture. The system comprises: a service interface, used to submit parallel SQL to the parallel SQL service device and to have it verified by the metadata service device; a parallel SQL service device, used to provide an overall query service, an index query service and an aggregation query service; and a metadata service device, which comprises a metadata base, a database system and a permission system, and is responsible for SQL verification, table and data source extraction, deciding whether to jump to the database service device or the index service device, and repackaging the parallel SQL. The invention can reduce the extra overhead of traditional big data SQL queries and extends traditional SQL syntax, which cannot support multiple types of aggregation requirements.

Description

Index aggregation parallel processing system based on big data architecture
Technical Field
The invention relates to the technical field of big data, in particular to a parallel processing system for index aggregation based on a big data architecture.
Background
With the rapid growth of network data in recent years and the arrival of the big data era, traditional database technology can no longer meet the needs of a growing number of big data business systems. At the same time, the new generation of big data database technology often cannot be tightly combined with an index system and is not effectively compatible with business requirements such as large-scale report calculation and aggregate statistical analysis. The two main reasons are:
1. Existing common Structured Query Language (SQL) syntaxes (such as HiveSQL and Spark-SQL) often cannot make effective use of indexes at large data volumes;
2. Standard database SQL syntax offers only limited support for large-scale aggregation requirements.
For example, suppose SQL is required to handle the following requirements simultaneously:
SELECT ... FROM A WHERE ${filter1} GROUP BY ${grouping1}
SELECT ... FROM A WHERE ${filter2} GROUP BY ${grouping2}
...
SELECT ... FROM A WHERE ${filterN} GROUP BY ${groupingN}
Traditional big data technology, index technology and their related optimization methods cannot compute all of these results by querying the data of A only once, so the performance overhead can reach several times or even tens of times that of a single query. Existing optimization methods only address how to use an index to optimize the query performance of a single SQL statement more effectively.
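The cost difference described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the sample rows, the `aggregate_once` helper and the two (filter, grouping) pairs are all hypothetical, and only counting is shown. The point is that N filtered, grouped counts can be computed while the data of A is scanned a single time, instead of once per query.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for table A.
ROWS = [
    {"user": "u1", "company": "c1", "age": 30, "ct": 12},
    {"user": "u1", "company": "c2", "age": 40, "ct": 5},
    {"user": "u2", "company": "c1", "age": 20, "ct": 11},
]


def aggregate_once(rows, specs):
    """Evaluate several (filter, grouping) count queries in one scan of rows."""
    results = [defaultdict(int) for _ in specs]
    rows_read = 0
    for row in rows:  # the data of A is read exactly once
        rows_read += 1
        for out, (predicate, key) in zip(results, specs):
            if predicate(row):
                out[key(row)] += 1
    return [dict(r) for r in results], rows_read


# Two hypothetical requirements: count per user, and count per company where ct > 10.
specs = [
    (lambda r: True, lambda r: r["user"]),
    (lambda r: r["ct"] > 10, lambda r: r["company"]),
]
results, rows_read = aggregate_once(ROWS, specs)
print(results, rows_read)
```

Running the N statements separately would read the rows N times; here the number of rows read stays equal to the size of A regardless of how many aggregations are requested.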
Disclosure of Invention
The index aggregation parallel processing system based on a big data architecture provided by the invention can reduce the extra overhead of traditional big data SQL queries and extends traditional SQL syntax, which cannot support multiple types of aggregation requirements.
In a first aspect, the present invention provides a parallel processing system for index aggregation based on big data architecture, including:
the service interface is used for submitting the parallel SQL to the parallel SQL service device and further verifying the parallel SQL through the metadata service device;
the parallel SQL service device is used for providing overall query service, index query service and aggregation query service;
the metadata service device comprises a metadata base, a database system and a permission system, is responsible for SQL verification, table and data source extraction, and is responsible for determining to jump to the database service device or the index service device and parallel SQL repackaging.
Optionally, the metadata service device performs unified integration on the query through a commonly used Spark-SQL component, and determines how to modify the parallel SQL description into a direct-processing SQL statement or a non-SQL parallel query plan.
Optionally, the parallel SQL service device is compatible with other optimization strategies of traditional SQL and big data related SQL.
Optionally, the parallel SQL syntax adopts a pipeline manner, the system defaults to execute in sequence according to a pipeline order, and records a result returned by each parallel SQL statement.
Optionally, the parallel index aggregation syntax provided by the parallel SQL service device is: each aggregation parallel SQL statement contains one or more aggregation elements and is linked to the intermediate result set of <search>, which is specified by the USING keyword.
Optionally, the index aggregation syntax provided by the parallel SQL service device is: an aggregation filter condition and a grouping filter condition are added through the WHERE keyword, the aggregation group is designated through the GROUP BY keyword, and the aggregation result set is mapped through the MAP keyword.
The index aggregation parallel processing system based on a big data architecture provided by the embodiment of the invention not only supports the query SQL syntax of traditional big data databases, but also, by combining index technology, extends traditional SQL syntax, which cannot support multiple types of aggregation requirements, thereby addressing the large extra overhead of traditional big data SQL queries. In addition, the system is compatible with other big data optimization technologies, optimizations of related business systems and other related optimization means, and with various general-purpose databases, big data warehouses and their SQL syntaxes.
Drawings
Fig. 1 is a block diagram illustrating an optimization method for a database according to an embodiment of the present invention;
Fig. 2 is a block diagram illustrating an optimization method for an index system according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an index aggregation parallel processing system based on a big data architecture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a general approach that combines big data technology with database index technology. It not only supports the query SQL syntax of traditional big data databases, but also uses index technology to extend traditional SQL syntax, which cannot support multiple types of aggregation requirements.
First, a syntax form is introduced.
The parallel SQL syntax uses a pipeline approach, for example:
< parallel SQL statement > |.
By default, the system executes the statements in pipeline order and records the result returned by each parallel SQL statement. The pipe in parallel SQL is similar to the pipe in a shell: a shell pipe passes parameters through the output stream, whereas a parallel SQL pipe passes parameters through the result set.
Next, the syntax is introduced.
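The pipeline behaviour described above (ordered execution, result-set passing, and recording of each stage's result) can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_pipeline`, the stage names `source`, `keep_above` and `total`, and the toy statement are hypothetical and are not part of the parallel SQL syntax defined by the patent.

```python
def run_pipeline(statement, handlers):
    """Run '|'-separated stages in order, passing each result set to the next stage."""
    recorded = []
    result_set = None
    for stage in (part.strip() for part in statement.split("|")):
        name, _, arg = stage.partition(" ")
        # Each stage receives the previous stage's result set, not a text stream.
        result_set = handlers[name](arg, result_set)
        recorded.append((stage, result_set))
    return recorded


# Hypothetical stage handlers used only for this illustration.
def source(arg, _previous):
    return [int(x) for x in arg.split(",")]


def keep_above(arg, previous):
    return [x for x in previous if x > int(arg)]


def total(_arg, previous):
    return [sum(previous)]


HANDLERS = {"source": source, "keep_above": keep_above, "total": total}
history = run_pipeline("source 1,5,9 | keep_above 4 | total", HANDLERS)
for stage, result in history:
    print(stage, "->", result)
```

Splitting on every `|` character is a simplification: a real parser would have to ignore pipe characters that appear inside string literals.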
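1. Index query syntax (search):
[Figure BDA0001910156030000041: index query syntax definition; image not reproduced in this text]
2. Parallel index aggregation syntax (aggregations), which the traditional SQL standard cannot describe:
[Figure BDA0001910156030000042: parallel index aggregation syntax definition; image not reproduced in this text]
3. Index aggregation syntax (aggregation):
[Figure BDA0001910156030000051: index aggregation syntax definition; image not reproduced in this text]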
With this parallel SQL syntax design, multiple aggregations can be performed in parallel at the same time. Combined with common big data indexing technologies, such as Solr, Elasticsearch and other index systems that support docValue, the aggregation is carried out on the index, which greatly reduces the extra overhead of the query. For example:
SELECT * FROM twitter | count(1) GROUP BY user, avg(age) WHERE ct > 10 GROUP BY company
The query returns two result sets. The optimization method for the database is shown in Fig. 1.
The optimization method for the index system is shown in Fig. 2.
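To make the structure of the twitter example concrete, the following Python sketch splits that statement into its search part and its two aggregation elements. The grammar it assumes (one pipe, comma-separated elements of the form `func(arg) [WHERE ...] GROUP BY column`) is inferred from that single example only; the full syntax is defined in figures that are not reproduced here, so the regular expression and the `parse_parallel_sql` helper are hypothetical.

```python
import re

# Assumed shape of one aggregation element, inferred from the example statement only.
ELEMENT = re.compile(
    r"^(?P<func>\w+)\((?P<arg>[^)]*)\)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"\s+GROUP BY\s+(?P<group>\w+)$",
    re.IGNORECASE,
)


def parse_parallel_sql(statement):
    """Split a statement into its search part and a list of aggregation elements."""
    search, _, aggregations = statement.partition("|")
    elements = []
    for text in aggregations.split(","):
        match = ELEMENT.match(text.strip())
        if match is None:
            raise ValueError(f"unrecognized aggregation element: {text!r}")
        elements.append(match.groupdict())
    return search.strip(), elements


search, elements = parse_parallel_sql(
    "SELECT * FROM twitter | count(1) GROUP BY user, "
    "avg(age) WHERE ct > 10 GROUP BY company"
)
print(search)
for element in elements:
    print(element)
```

Each parsed element corresponds to one of the two result sets mentioned in the text. Splitting on commas would break for functions that take several arguments, so this only covers statements shaped like the example.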
The syntax retains the characteristics of SQL: it supports not only pipeline-style parallel SQL but also the standard SQL of various data sources. The compatible SQL syntax (SQL) is listed in the following table:
[Figure BDA0001910156030000052: table of compatible SQL syntax; image not reproduced in this text]
as shown in fig. 3, the parallel processing system for index aggregation based on big data architecture according to an embodiment of the present invention includes:
the service interface is used for providing a WEB service interface, submitting parallel SQL to the parallel SQL service device and further verifying through the metadata service device;
the parallel SQL service device is used for providing overall query service, index query service and aggregation query service, and is compatible with other optimization strategies of traditional SQL and big data related SQL;
the metadata service device comprises a metadata base, a database system and a permission system, and is mainly responsible for SQL verification, table and data source extraction, determining whether to jump to the metadata service device or a parallel SQL service device, and repackaging the parallel SQL, wherein repackaging refers to adding SQL filtering and includes control of data table, row and column permissions.
The metadata service device performs unified integration on the query through a commonly used Spark-SQL component, and determines how to modify the parallel SQL description into a SQL statement which can be directly processed or a non-SQL parallel query plan.
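The repackaging step, described above as adding SQL filtering to enforce table, row and column permissions, might look roughly like the following Python sketch. The `PERMISSIONS` structure, the `repackage` function and the sample rule are assumptions made for illustration; the patent does not specify how permission metadata is stored or how the filter is attached.

```python
# Hypothetical permission metadata: permitted columns and an optional row filter per table.
PERMISSIONS = {
    "twitter": {
        "columns": {"user", "age", "company"},
        "row_filter": "company = 'c1'",
    },
}


def repackage(table, columns, permissions):
    """Return a SELECT restricted to the permitted table, columns and rows."""
    rule = permissions.get(table)
    if rule is None:  # table-level permission check
        raise PermissionError(f"no access to table {table!r}")
    denied = sorted(set(columns) - rule["columns"])
    if denied:  # column-level permission check
        raise PermissionError(f"no access to columns {denied}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if rule.get("row_filter"):  # row-level permission added as an SQL filter
        sql += f" WHERE {rule['row_filter']}"
    return sql


repackaged = repackage("twitter", ["user", "age"], PERMISSIONS)
print(repackaged)
```

The sketch builds SQL text from identifiers that are assumed to come from trusted metadata; input taken directly from users would need validation or parameter binding before being combined in this way.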
The index aggregation parallel processing system based on a big data architecture provided by the embodiment of the invention not only supports the query SQL syntax of traditional big data databases, but also, by combining index technology, extends traditional SQL syntax, which cannot support multiple types of aggregation requirements, thereby addressing the large extra overhead of traditional big data SQL queries. In addition, the system is compatible with other big data optimization technologies, optimizations of related business systems and other related optimization means, and with various general-purpose databases, big data warehouses and their SQL syntaxes.
It will be understood by those skilled in the art that all or part of the processes of the embodiments of the methods described above may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A big data architecture based index aggregation parallel processing system, comprising:
the service interface is used for submitting the parallel SQL to the parallel SQL service device and further verifying the parallel SQL through the metadata service device;
the parallel SQL service device is used for providing overall query service, index query service and aggregation query service;
the metadata service device comprises a metadata base, a database system and a permission system, is responsible for SQL verification, table and data source extraction and is responsible for deciding to jump to the database service device or the index service device and parallel SQL repackaging;
wherein, the repackaging refers to adding SQL filtering and comprises the control of data table, row and column authorities;
the index aggregation syntax provided by the parallel SQL service device is as follows: an aggregation filter condition and a grouping filter condition are added through the WHERE keyword, an aggregation group is designated through the GROUP BY keyword, and an aggregation result set is mapped through the MAP keyword.
2. The system according to claim 1, wherein the metadata service device performs unified integration of the query through commonly used Spark-SQL components to determine how to modify the parallel SQL description into a directly processable SQL statement or non-SQL parallel query plan.
3. The system according to claim 1, wherein the parallel SQL service device is compatible with other optimization strategies of legacy SQL and big data related SQL.
4. The system of claim 1, wherein the parallel SQL syntax is pipelined, the system defaults to executing in order according to the pipeline order, and records the result returned by each parallel SQL statement.
5. The system according to claim 1, wherein the parallel index aggregation syntax provided by the parallel SQL service device is: each aggregation parallel SQL statement contains one or more aggregation elements and is linked to the intermediate result set of <search>, which is specified by the USING keyword.
CN201811549059.0A 2018-12-18 2018-12-18 Index aggregation parallel processing system based on big data architecture Active CN109710642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811549059.0A CN109710642B (en) 2018-12-18 2018-12-18 Index aggregation parallel processing system based on big data architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811549059.0A CN109710642B (en) 2018-12-18 2018-12-18 Index aggregation parallel processing system based on big data architecture

Publications (2)

Publication Number Publication Date
CN109710642A CN109710642A (en) 2019-05-03
CN109710642B true CN109710642B (en) 2021-07-27

Family

ID=66256833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811549059.0A Active CN109710642B (en) 2018-12-18 2018-12-18 Index aggregation parallel processing system based on big data architecture

Country Status (1)

Country Link
CN (1) CN109710642B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196890A (en) * 2006-12-08 2008-06-11 国际商业机器公司 Method and device for analyzing information and application performance during polymerized data base operation
CN103942234A (en) * 2013-01-21 2014-07-23 中国电信股份有限公司 Method for operating multiple heterogeneous databases, middleware device and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9251207B2 (en) * 2007-11-29 2016-02-02 Microsoft Technology Licensing, Llc Partitioning and repartitioning for data parallel operations
CN102622414B (en) * 2012-02-17 2013-11-06 清华大学 Peer-to-peer structure based distributed high-dimensional indexing parallel query framework

Also Published As

Publication number Publication date
CN109710642A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN109299102B (en) HBase secondary index system and method based on Elastcissearch
CN106547796B (en) Database execution method and device
CN103064875B (en) A kind of spatial service data distributed enquiring method
JP6338817B2 (en) Data management system and method using database middleware
CN105786808B (en) A kind of method and apparatus for distributed execution relationship type computations
CN105824957A (en) Query engine system and query method of distributive memory column-oriented database
CN106611053B (en) Data cleaning and indexing method
CN106777108A (en) A kind of data query method and apparatus based on mixing storage architecture
EP2849089A1 (en) Virtual table indexing mechanism and method capable of realizing multi-attribute compound condition query
CN111177148B (en) Method for automatically building and dividing tables of hydropower database
CN107506464A (en) A kind of method that HBase secondary indexs are realized based on ES
CN107291964B (en) A method of fuzzy query is realized based on HBase
US8583655B2 (en) Using an inverted index to produce an answer to a query
CN102750354B (en) Method for analyzing and processing non-structured data query operating language
CN109947791A (en) A kind of database statement optimization method, device, equipment and storage medium
JP2018506775A (en) Identifying join relationships based on transaction access patterns
CN112395303A (en) Query execution method and device, electronic equipment and computer readable medium
WO2017112861A1 (en) System and method for adaptive filtering of data requests
CN111723161A (en) Data processing method, device and equipment
CN108073641B (en) Method and device for querying data table
CN117033424A (en) Query optimization method and device for slow SQL (structured query language) statement and computer equipment
US7610272B2 (en) Materialized samples for a business warehouse query
KR101955376B1 (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method
US9870399B1 (en) Processing column-partitioned data for row-based operations in a database system
WO2015131579A1 (en) Data storage method, apparatus and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant