CN107229672A - A big data SQL query method and system for SolrCloud

Info

Publication number: CN107229672A
Application number: CN201710261610.0A
Authority: CN
Inventors: 沈志宏; 周园春; 吴章生; 黎建辉
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2017-04-20
Filing date: 2017-04-20
Publication date: 2017-10-03

Abstract

The invention discloses a big data SQL query method and system for SolrCloud. The method comprises: 1) mapping a document collection in Solr to a SQL table, with the documents and fields in Solr mapped to the rows and columns of the SQL table, respectively; 2) parsing a received SQL query statement and parsing the SQL relational query conditions in it into a RexNode object; 3) translating the RexNode object layer by layer according to its concrete structure to obtain the corresponding Solr query conditions, and converting the NOT conditions in the SQL query statement into corresponding Solr query conditions; 4) decomposing the Solr query conditions obtained in step 3) into page-by-page queries against the SolrCloud cluster and sending them to the SolrCloud server for querying. The invention addresses the inability of existing database systems to support full-text search.

Description

A big data SQL query method and system for SolrCloud

Technical field

The present invention relates to the technical fields of big data and databases, and proposes a big data SQL query method and system for SolrCloud.

Background

With the development and spread of networks, applications produce and must process ever larger volumes of data. This explosive growth steadily increases the workload of today's database systems, and growing data volumes require more and more applications to scale out to larger clusters for computation; distributed big data computing is therefore the necessary path for handling massive data. Solr is a high-performance search server that provides fast and powerful full-text search. SolrCloud is a distributed solution based on ZooKeeper and Solr; it adds distributed capabilities to Solr and is used to build highly available, highly scalable Solr server clusters with automatic fault tolerance, distributed indexing, and distributed querying.

Solr provides a query language for retrieval over large-scale document data, with very rich query functionality, including single-character wildcard matching, matching of zero or more characters, fuzzy queries based on edit distance, proximity queries (searching for words within a given distance of each other), range queries, and so on. Solr query syntax also supports combining multiple query conditions with AND, OR, NOT, and so on, and provides features such as field filtering and paging.

However, on the one hand, developers in the database field are accustomed to manipulating databases with the SQL language. On the other hand, given the diversity of big data systems, the builders of some information systems increasingly want to decouple them from the specific underlying storage management system (such as a MySQL cluster, MongoDB, or SolrCloud). A SQL query engine for SolrCloud is therefore needed.

At present there are some SolrCloud query engines based on Hive and SparkSQL. For example, hive-solr (https://github.com/qindongliang/hive-solr) is a Hive-based plug-in that provides HiveQL queries for Solr, and spark-solr (https://github.com/lucidworks/spark-solr) is a SparkSQL-based Solr query plug-in that can not only map a Solr data source to an RDD (resilient distributed dataset) in the Spark framework but also register it as a SparkSQL table and write and query data through SQL statements. These schemes have prominent drawbacks: (1) they depend heavily on big data computing frameworks such as Hive and SparkSQL, which creates a dependence on a complex environment, ultimately a large software stack made up of ZooKeeper, Hadoop, and so on; (2) Hive depends on the Hadoop MapReduce computing framework and SparkSQL depends on the Spark computing framework. In Hive and SparkSQL, executing a SQL statement usually means converting it into a series of computing tasks that follow a common pattern (such as Map-Shuffle-Reduce), and both the startup and scheduling of the framework and the adaptation and conversion of algorithms during execution introduce serious latency. Schemes based on Hive or SparkSQL are therefore better suited to batch SQL jobs than to online interactive SQL queries.

In addition, higher versions of SolrCloud ship with a server-side module that provides a SQL interface service. However, for information systems based on SolrCloud 4.0 or 5.0, adopting such a version means reworking the production environment, including a product version upgrade, program changes, and data migration, all of which bring potential risk and considerable workload.

Against this background, the present invention proposes an interactive big data SQL query system for SolrCloud.

Summary of the invention

The object of the invention is to provide a big data SQL query method and system for SolrCloud. The system implements interactive SQL querying for SolrCloud on the client side, and at the same time addresses the inability of existing database systems to support full-text search.

To achieve this object, the invention adopts the following technical solution:

An interactive big data SQL query system for SolrCloud, consisting mainly of module 1 and module 2 (corresponding to Fig. 1):

Module 1 (SQL analysis and execution engine): this module parses SQL statements, formulates SQL execution plans, calls module 2 to complete the execution of SQL commands, and returns SQL query results to the user program. The user can submit SQL statements repeatedly and obtain query results; this process is an interactive query.

Module 2 (Solr adapter): this module is responsible for carrying out specific SQL execution tasks. Its main work includes Solr query translation, NOT operation optimization, Solr querying and optimization, table schema mapping, configuration parsing, and SolrCloud connection.

Module 2 comprises five submodules:

Module 2.1 (Solr query translation module): translates SQL relational query conditions into Solr query conditions;

Module 2.2 (NOT operation optimization module): converts NOT conditions in the query conditions into corresponding Solr query conditions; for example, NOT (a=1 AND b=2) is converted into a!=1 OR b!=2;

Module 2.3 (Solr query and optimization module): executes Solr queries and returns query results. To avoid long client waits caused by large data volumes, this module is designed with page-by-page buffering;

Module 2.4 (table schema mapping module): maps a document collection (Collection) in Solr to a SQL table, and maps the documents (Document) and fields (Field) in Solr to the rows and columns of the SQL table;

Module 2.5 (configuration parsing and SolrCloud connection module): parses the incoming configuration parameters, including the location of the Solr server, the name of the Solr document collection, the columns of the SQL table, and the corresponding Solr field names. It also establishes the connection to the server by calling the SolrCloud client API.
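As a rough illustration of modules 2.4 and 2.5, the following Python sketch parses column and mapping strings in the comma-separated form used in the usage example of this description, and maps one Solr document to one SQL row. The function names are hypothetical, and the fallback of an unmapped column to a Solr field of the same name is an assumption made for the sketch, not something the description states.

```python
def parse_columns(spec):
    """Parse 'id integer,name char,age integer' into [(name, sql_type), ...]."""
    columns = []
    for item in spec.split(","):
        name, sql_type = item.split()
        columns.append((name, sql_type))
    return columns


def parse_column_mapping(spec):
    """Parse 'name->name_s,age->age_i' into {sql_column: solr_field}."""
    mapping = {}
    for item in spec.split(","):
        column, field = item.split("->")
        mapping[column.strip()] = field.strip()
    return mapping


def document_to_row(doc, columns, mapping):
    """Map one Solr document (a dict) to one SQL row (a tuple).

    Assumption: a column without an explicit mapping reads the Solr field
    of the same name, and a missing field becomes None (SQL NULL).
    """
    return tuple(doc.get(mapping.get(name, name)) for name, _ in columns)


columns = parse_columns("id integer,name char,age integer")
mapping = parse_column_mapping("name->name_s,age->age_i")
row = document_to_row({"id": 1, "name_s": "bluejoe", "age_i": 30}, columns, mapping)
print(row)  # (1, 'bluejoe', 30)
```

Parsing once and reusing the result matches the statement that the schema mapping and the configuration parsing are performed only on the first operation against a table.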

Further, module 1 performs SQL statement parsing and execution planning based on the open source project Apache Calcite (http://calcite.apache.org). Apache Calcite is a framework for distributed SQL query engines and does not depend on any big data computing framework, so it offers good portability.

Further, module 1 is started from a configuration file, which must conform to the JSON format that Apache Calcite defines for schemas.

Further, module 2.1 receives the RexNode object that module 1 passes over for a SQL query condition and translates it layer by layer according to the concrete structure of the RexNode object. For example, for "a=1 AND b=2", the two binary operations a=1 and b=2 are first translated into a:1 and b:2 respectively, and the AND operation is then translated. The module supports the various kinds of SQL query conditions, including AND, OR, NOT, greater than, less than, is not null, is null, equal to, like, not equal to, greater than or equal to, and less than or equal to. The translation process is described in specific embodiment 2.

Further, module 2.1 intelligently splits a SQL query condition into the part that Solr can accept and the part that it cannot. For example, an expression such as name like 'b%' and length(name)>20 can be split into the two parts name like 'b%' and length(name)>20. Module 1 performs secondary filtering on the query results of module 2, so length(name)>20 is applied as a condition to filter the query results once more.
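The split described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: conditions are modelled as (kind, left, right) tuples, a function call such as length(name) is modelled as the tuple ("length", "name"), and a comparison is treated as acceptable to Solr only when its left operand is a plain column name. The names SUPPORTED_KINDS, is_pushable, and split_conjunction are hypothetical.

```python
# Comparison kinds that this sketch assumes Solr can evaluate directly.
SUPPORTED_KINDS = {"=", "!=", ">", ">=", "<", "<=", "LIKE"}


def is_pushable(cond):
    """A condition is pushable when it compares a plain column to a literal."""
    kind, left, _right = cond
    return kind in SUPPORTED_KINDS and isinstance(left, str)


def split_conjunction(cond):
    """Split an AND tree into (conditions for Solr, conditions left for SQL)."""
    kind, left, right = cond
    if kind == "AND":
        left_push, left_rest = split_conjunction(left)
        right_push, right_rest = split_conjunction(right)
        return left_push + right_push, left_rest + right_rest
    if is_pushable(cond):
        return [cond], []
    return [], [cond]


# name like 'b%' and length(name) > 20
condition = ("AND",
             ("LIKE", "name", "b%"),
             (">", ("length", "name"), 20))
pushable, residual = split_conjunction(condition)
print(pushable)  # [('LIKE', 'name', 'b%')]
print(residual)  # [('>', ('length', 'name'), 20)]
```

Only the first list would be translated into a Solr query; the second list is covered by the secondary filtering that module 1 applies to the returned records.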

Further, the NOT operation optimization process of module 2.2 is described in specific embodiment 3.

Further, module 2.3 implements a page-by-page buffering mechanism. The first Solr query obtains the total number of records M; from M and the page size N the number of pages P is calculated, and the overall query is decomposed into P consecutive subqueries that share the same query conditions but have different offsets. The query request for page x is actually submitted to Solr only when the user program requests page x. This page-by-page buffering spares the user program long waits and reduces network resource consumption.
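A minimal Python sketch of this lazy paging follows. It assumes P = ceil(M / N) and a hypothetical callable fetch_page(start, rows) that returns the total hit count together with one page of records (in Solr, paging is controlled by the start and rows request parameters). The class name PagedResults and the in-memory stand-in for Solr are illustrative only.

```python
import math


class PagedResults:
    """Fetches page x from the backend only when page x is first requested."""

    def __init__(self, fetch_page, page_size):
        self._fetch_page = fetch_page      # hypothetical: (start, rows) -> (total, docs)
        self._page_size = page_size        # N
        total, first_page = fetch_page(0, page_size)   # the first query yields M
        self.total = total                 # M
        self.page_count = math.ceil(total / page_size)  # P
        self._pages = {0: first_page}      # buffer of pages already fetched

    def page(self, x):
        if not 0 <= x < self.page_count:
            raise IndexError(f"page {x} out of range")
        if x not in self._pages:
            # Same query conditions, different offset.
            _, docs = self._fetch_page(x * self._page_size, self._page_size)
            self._pages[x] = docs
        return self._pages[x]

    def __iter__(self):
        for x in range(self.page_count):
            yield from self.page(x)


# Demo against an in-memory stand-in for Solr holding 120 records.
records = list(range(120))
calls = []


def fake_fetch(start, rows):
    calls.append((start, rows))
    return len(records), records[start:start + rows]


results = PagedResults(fake_fetch, page_size=50)
print(results.page_count)     # 3
print(len(results.page(2)))   # 20
print(calls)                  # [(0, 50), (100, 50)]
```

In the demo, page 1 is never requested, so no subquery with offset 50 is sent.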

Further, module 2.4 performs the schema mapping for a table only the first time that table is operated on.

Further, module 2.5 receives the configuration information obtained by module 1 and parses it only the first time the table is operated on.

Further, module 2.5 establishes and maintains the SolrCloud connection; besides SolrCloud clusters, single-node Solr servers are also supported.

The beneficial effects of the invention are as follows:

The invention provides an interactive big data SQL query system for SolrCloud. Compared with the traditional approach of querying through the API or Solr query syntax, the system supports interactive querying of SolrCloud in standard SQL, which makes program development more general. The system also carries Solr's fuzzy query and proximity query capabilities over into SQL statements, allowing inexact retrieval on fields through the like operator and functions.

Specifically, the advantages of the system are:

(1) It provides a non-API way of operating on a SolrCloud system. Current SolrCloud-based applications mainly read and write SolrCloud through APIs in languages such as C++ or Java. This places high demands on developers and is highly sensitive to the API client version. With the method of the invention, operations on SolrCloud document data can be carried out in standard SQL; developers may not even need to write a program, since familiar SQL is enough to meet the requirement.

(2) No big data computing framework such as Hadoop or Spark is needed, so the system is highly portable; and because no big data computing framework has to be loaded, the system performs well.

(3) It imposes no requirement on the SolrCloud version and is a client-side option, so deployment is very convenient. SolrCloud provides a built-in SQL engine from version 6.0 onward; for information systems based on SolrCloud 4.0 or 5.0, using the invention saves the work of version upgrade, program changes, and data migration.

Brief description of the drawings

Fig. 1 is a schematic diagram of the internal structure of the system of the invention;

Fig. 2 is a flow chart of the execution of a SQL query in the invention;

Fig. 3 is the UML model diagram of the Solr condition description classes in the invention.

Detailed description of the embodiments

To make the above features and advantages of the invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings.

1. Description of the execution process of a SQL query

In the system of the invention, the execution flow of a SQL query is shown in Fig. 2, where:

Step 1: the user program submits a SQL query statement S to the system;

Step 2: the Calcite query engine in the system parses the query statement, parses the WHERE condition in it into a RexNode object RN, and jumps to step 3. If there is no WHERE condition, it creates a Solr query that searches all records (denoted SF2) and jumps to step 6;

Step 3: the Calcite query engine passes the object RN to the Solr adapter (module 2);

Step 4: the query translation module of the Solr adapter translates RN into the Solr query SF1;

Step 5: the NOT operation optimization module converts the NOT conditions in the Solr query SF1, yielding SF2;

Step 6: the Solr query and optimization module decomposes SF2 into page-by-page queries against the SolrCloud cluster, connects to the SolrCloud server, and obtains the query result R1;

Step 7: the Calcite query engine applies the initial filter condition RN (that is, the RexNode object of the initial filter condition) to R1 as a secondary filter, yielding R2; this step handles any conditions that the Solr adapter did not filter on;

Step 8: the Calcite query engine returns the SQL query result R2 to the user program.

2. Description of the Solr query translation method

Step 1: receive the RexNode object and create a triple T (kind, left, right), where kind is the type of query condition corresponding to the RexNode (AND, OR, NOT, greater than, less than, etc.), left is the left operand, and right is the right operand;

Step 2: match against the triple T; 20 cases are distinguished in total, as follows:

The invention defines a series of classes to describe the individual Solr query conditions, and generates the corresponding class object for each case in the table above. For example, in the 19th case a NotSolrFilter object is constructed.

These classes include AndSolrFilter, NotSolrFilter, OrSolrFilter, GtSolrFilter, NotNullSolrFilter, IsNullSolrFilter, EqualsSolrFilter, LikeSolrFilter, NotEqualsSolrFilter, GeSolrFilter, LeSolrFilter, LtSolrFilter, and so on; in addition, a condition that cannot be converted is defined as UnrecognizedSolrFilter. All of these classes implement the same interface, SolrFilter; the UML class diagram is shown in Fig. 3.

The SolrFilter interface defines one method, toSolrQueryString(), which returns the query string corresponding to the current Solr query condition. For example, the EqualsSolrFilter class is defined as follows:
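For illustration only, and not as the class definition of the invention (the system itself is written in Scala), the interface and a few of its implementations can be sketched in Python. The class names follow the description; the constructor arguments, the translate helper, and the choice to render an unrecognized condition as the match-all query *:* (leaving it to the secondary filter) are assumptions of this sketch. Quoting and escaping of values is omitted.

```python
from abc import ABC, abstractmethod


class SolrFilter(ABC):
    @abstractmethod
    def to_solr_query_string(self):
        """Return the Solr query string for this condition."""


class EqualsSolrFilter(SolrFilter):
    def __init__(self, field, value):
        self.field, self.value = field, value

    def to_solr_query_string(self):
        return f"{self.field}:{self.value}"


class GtSolrFilter(SolrFilter):
    def __init__(self, field, value):
        self.field, self.value = field, value

    def to_solr_query_string(self):
        # Curly braces denote an exclusive bound in Solr range syntax.
        return f"{self.field}:{{{self.value} TO *}}"


class AndSolrFilter(SolrFilter):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def to_solr_query_string(self):
        return (f"({self.left.to_solr_query_string()} AND "
                f"{self.right.to_solr_query_string()})")


class UnrecognizedSolrFilter(SolrFilter):
    def to_solr_query_string(self):
        return "*:*"  # assumption: match everything, filter later in SQL


def translate(triple):
    """Translate a (kind, left, right) triple layer by layer."""
    kind, left, right = triple
    if kind == "AND":
        return AndSolrFilter(translate(left), translate(right))
    if kind == "=":
        return EqualsSolrFilter(left, right)
    if kind == ">":
        return GtSolrFilter(left, right)
    return UnrecognizedSolrFilter()


query = translate(("AND", ("=", "a", 1), ("=", "b", 2)))
print(query.to_solr_query_string())  # (a:1 AND b:2)
```

The sketch covers only three of the condition kinds; the remaining kinds would follow the same pattern.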

3. Description of NOT operation optimization

The principle of NOT operation optimization is to replace operations of the form Not(Filter) so that no Not operator appears in the converted expression. The optimization rules are as follows:

(1) if a Not(x>y) condition is encountered, it is replaced with x≤y;

(2) if a Not(x≥y) condition is encountered, it is replaced with x<y;

(3) if a Not(x<y) condition is encountered, it is replaced with x≥y;

(4) if a Not(x≤y) condition is encountered, it is replaced with x>y;

(5) if a Not(x==y) condition is encountered, it is replaced with x!=y;

(6) if a Not(x!=y) condition is encountered, it is replaced with x==y;

(7) if a Not(x is null) condition is encountered, it is replaced with x is not null;

(8) if a Not(x is not null) condition is encountered, it is replaced with x is null;

(9) if a Not(Not(xxx)) condition is encountered, it is replaced with xxx, and NOT operation optimization continues on the condition xxx;

(10) if an And(xxx) condition is encountered, NOT operation optimization continues on the conditions xxx within it;

(11) if an Or(xxx) condition is encountered, NOT operation optimization continues on the conditions xxx within it;

(12) if a Not(And(xxx, yyy)) condition is encountered, it is replaced with (Not xxx) Or (Not yyy), and NOT operation optimization continues on the conditions xxx and yyy;

(13) if a Not(Or(xxx, yyy)) condition is encountered, it is replaced with (Not xxx) And (Not yyy), and NOT operation optimization continues on the conditions xxx and yyy;

(14) in all other cases, no NOT operation optimization is performed;
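The rules above can be sketched as one recursive function. The Python below is an illustration under assumptions, not the code of the invention: expressions are modelled as tuples such as ("NOT", e), ("AND", e1, e2), (">", "x", 5), or ("IS NULL", "x"), and the names NEGATED and optimize_not are hypothetical.

```python
# Rules (1) to (8): each comparison and its negated counterpart.
NEGATED = {
    ">": "<=", ">=": "<", "<": ">=", "<=": ">",
    "==": "!=", "!=": "==",
    "IS NULL": "IS NOT NULL", "IS NOT NULL": "IS NULL",
}


def optimize_not(expr):
    kind = expr[0]
    if kind in ("AND", "OR"):                      # rules (10) and (11)
        return (kind, optimize_not(expr[1]), optimize_not(expr[2]))
    if kind == "NOT":
        inner = expr[1]
        inner_kind = inner[0]
        if inner_kind in NEGATED:                  # rules (1) to (8)
            return (NEGATED[inner_kind],) + inner[1:]
        if inner_kind == "NOT":                    # rule (9)
            return optimize_not(inner[1])
        if inner_kind == "AND":                    # rule (12)
            return ("OR",
                    optimize_not(("NOT", inner[1])),
                    optimize_not(("NOT", inner[2])))
        if inner_kind == "OR":                     # rule (13)
            return ("AND",
                    optimize_not(("NOT", inner[1])),
                    optimize_not(("NOT", inner[2])))
    return expr                                    # rule (14)


# NOT (a == 1 AND b == 2)  becomes  a != 1 OR b != 2
print(optimize_not(("NOT", ("AND", ("==", "a", 1), ("==", "b", 2)))))
# ('OR', ('!=', 'a', 1), ('!=', 'b', 2))
```

A condition with no negated counterpart in the table, such as NOT applied to a like condition, falls under rule (14) and is returned unchanged.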

The above rules are summarized in the following table:

4. Description of program classes

The system of the invention is developed in Scala and mainly comprises the following classes:

●SolrTable: a subclass of AbstractTable provided by Apache Calcite that also implements the ScannableTable and FilterableTable interfaces provided by Apache Calcite. SolrTable represents the SQL table corresponding to a Solr document collection and mainly implements methods such as scan() to obtain the records that satisfy the conditions;

●SolrTableFactory: an implementation of TableFactory provided by Apache Calcite, used to create SolrTable objects from the specified parameters;

●SolrQueryResultsIterator: an iterator over the records in a Solr query result;

●SolrQueryResults: encapsulates a Solr query result and contains a SolrQueryResultsIterator object;

●SqlFilter2SolrFilterTranslator: the core class, which implements the translation of SQL queries into Solr queries;

●SolrFilter: represents a converted Solr query condition. The invention also defines the subclasses of SolrFilter, including AndSolrFilter, NotSolrFilter, OrSolrFilter, GtSolrFilter, NotNullSolrFilter, IsNullSolrFilter, EqualsSolrFilter, LikeSolrFilter, NotEqualsSolrFilter, GeSolrFilter, LeSolrFilter, LtSolrFilter, and so on; in addition, a condition that cannot be converted is defined as UnrecognizedSolrFilter;

●SolrTableConf: a utility class that mainly provides methods for parsing parameters of types such as text and integer.

5. Usage example of the system of the invention

Sample code for using the system of the invention is given below:

The above program is an ordinary JDBC client program. It differs from other programs in that the JDBC URL has the form "jdbc:calcite:model=src/java/test/model.json", which specifies the driver of the Apache Calcite engine and uses model.json as the configuration file. The content of model.json is as follows:

It defines a SQL database named solr, in which a SQL table named docs is defined. The table is created by org.apache.calcite.adapter.solr.SolrTableFactory using the following properties:

●solrServerURL: address of the Solr server, e.g. http://bluejoe1:8983/solr/collection1

●solrZkHosts: ZooKeeper address used by SolrCloud, e.g. bluejoe1:9983

●solrCollection: name of the Solr document collection, e.g. collection1

●columns: comma-separated column information, e.g. id integer,name char,age integer

●columnMapping: comma-separated column mapping information, e.g. name->name_s,age->age_i means that the name column is stored as the name_s field in Solr documents and the age column is stored as the age_i field

●pageSize: number of records per page used for paging, e.g. 50

When the user specifies the solrZkHosts parameter, the system connects to the corresponding SolrCloud cluster; when the user specifies the solrServerURL parameter, the system connects to the corresponding single node. Only one of the two parameters may be specified.
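To make the properties above concrete, the following Python sketch assembles a model of the general shape that Apache Calcite uses for JSON models with custom tables (a schema containing a table of type custom, with a factory class and an operand map). It is a hedged illustration, not the model.json of the invention: placing the properties in operand, passing every value as a string, and rejecting a configuration that sets both or neither connection parameter are assumptions, and build_model is a hypothetical helper.

```python
import json


def build_model(operand):
    """Wrap table properties in a Calcite-style model with one custom table."""
    has_url = "solrServerURL" in operand
    has_zk = "solrZkHosts" in operand
    if has_url == has_zk:
        # Assumption: exactly one connection parameter must be present.
        raise ValueError("specify exactly one of solrServerURL or solrZkHosts")
    return {
        "version": "1.0",
        "defaultSchema": "solr",
        "schemas": [{
            "name": "solr",
            "tables": [{
                "name": "docs",
                "type": "custom",
                "factory": "org.apache.calcite.adapter.solr.SolrTableFactory",
                "operand": operand,
            }],
        }],
    }


model = build_model({
    "solrZkHosts": "bluejoe1:9983",          # SolrCloud cluster mode
    "solrCollection": "collection1",
    "columns": "id integer,name char,age integer",
    "columnMapping": "name->name_s,age->age_i",
    "pageSize": "50",
})
print(json.dumps(model, indent=2))
```

Replacing solrZkHosts with solrServerURL in the operand would correspond to the single-node mode described above.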

The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person of ordinary skill in the art may modify the technical solution or make equivalent substitutions without departing from the spirit and scope of the invention, and the scope of protection of the invention shall be as defined by the claims.

Claims

1. A big data SQL query method for SolrCloud, comprising the steps of:

1) a table schema mapping module maps a document collection in Solr to a SQL table, mapping the documents and fields in Solr to the rows and columns of the SQL table, respectively;

2) a SQL analysis and execution engine parses a received SQL query statement and parses the SQL relational query conditions in it into a RexNode object;

3) a Solr query translation module translates the RexNode object layer by layer according to its concrete structure to obtain the corresponding Solr query conditions; a NOT operation optimization module converts the NOT conditions in the SQL query statement into corresponding Solr query conditions;

4) a Solr query and optimization module decomposes the Solr query conditions obtained in step 3) into page-by-page queries against the SolrCloud cluster and sends them to the SolrCloud server for querying;

5) the Solr query and optimization module takes the query result of the SolrCloud server as the query result of the SQL query statement.

2. The method of claim 1, characterized in that in step 5), the Solr query and optimization module sends the query result of the SolrCloud server to the SQL analysis and execution engine, the SQL analysis and execution engine filters the query result using the RexNode object, and the resulting filter result is taken as the query result of the SQL query statement.

3. The method of claim 1 or 2, characterized in that in step 4), the Solr query and optimization module sends the page-by-page queries to the SolrCloud server for querying through a configuration parsing and SolrCloud connection module; wherein the configuration parsing and SolrCloud connection module is responsible for parsing the incoming configuration parameters, including the location of the SolrCloud server, the name of the document collection in Solr, the column names of the SQL table, and the corresponding Solr field names, and for establishing a connection with the SolrCloud server.

4. The method of claim 1 or 2, characterized in that a page-by-page buffer is provided in the Solr query and optimization module; the total number of records M is obtained from the first Solr query, the number of pages P is calculated from M and the page size N, and the overall query is decomposed into P consecutive subqueries, each having the same query conditions but a different offset; the query request for page x is submitted to the SolrCloud cluster only when page x is requested.

5. The method of claim 1 or 2, characterized in that the Solr query translation module translates the RexNode object layer by layer according to its concrete structure as follows:

51) a translation result table of triples is set up; a plurality of classes is created according to the Solr query conditions, each class corresponding to one or more triples that have the same translation result; each class implements the same interface SolrFilter, which returns the query string corresponding to the current Solr query condition;

52) a triple T (kind, left, right) is created from the received RexNode object, where kind is the type of query condition corresponding to the RexNode object, left is the left operand, and right is the right operand;

53) the matching Solr query condition is looked up among the classes according to the triple T.

6. A big data SQL query system for SolrCloud, characterized by comprising a SQL analysis and execution engine, a table schema mapping module, a Solr query translation module, and a Solr query and optimization module; wherein:

the SQL analysis and execution engine is configured to parse a received SQL query statement and parse the SQL relational query conditions in it into a RexNode object;

the table schema mapping module is configured to map a document collection in Solr to a SQL table, mapping the documents and fields in Solr to the rows and columns of the SQL table, respectively;

the Solr query translation module is configured to translate the RexNode object layer by layer according to its concrete structure to obtain the corresponding Solr query conditions; a NOT operation optimization module converts the NOT conditions in the SQL query statement into corresponding Solr query conditions;

the Solr query and optimization module is configured to decompose the obtained Solr query conditions into page-by-page queries against the SolrCloud cluster, send them to the SolrCloud server for querying, and take the query result of the SolrCloud server as the query result of the SQL query statement.

7. The system of claim 6, characterized by further comprising a SQL analysis and execution engine; the Solr query and optimization module sends the query result of the SolrCloud server to the SQL analysis and execution engine, the SQL analysis and execution engine filters the query result using the RexNode object, and the resulting filter result is taken as the query result of the SQL query statement.

8. The system of claim 6 or 7, characterized in that the Solr query and optimization module sends the page-by-page queries to the SolrCloud server for querying through a configuration parsing and SolrCloud connection module; wherein the configuration parsing and SolrCloud connection module is responsible for parsing the incoming configuration parameters, including the location of the SolrCloud server, the name of the document collection in Solr, the column names of the SQL table, and the corresponding Solr field names, and for establishing a connection with the SolrCloud server.

9. The system of claim 6 or 7, characterized in that a page-by-page buffering mechanism is provided in the Solr query and optimization module; the total number of records M is obtained from the first Solr query, the number of pages P is calculated from M and the page size N, and the overall query is decomposed into P consecutive subqueries, each having the same query conditions but a different offset; the query request for page x is submitted to the SolrCloud cluster only when page x is requested.

10. The system of claim 6 or 7, characterized in that a translation result table of triples is provided in the Solr query translation module; a plurality of classes is created according to the Solr query conditions, each class corresponding to one or more triples that have the same translation result; each class implements the same interface SolrFilter, which returns the query string corresponding to the current Solr query condition; the Solr query translation module creates a triple T (kind, left, right) from the received RexNode object, where kind is the type of query condition corresponding to the RexNode object, left is the left operand, and right is the right operand; and the Solr query translation module looks up the matching Solr query condition among the classes according to the triple T.