A big data SQL query method and system for SolrCloud
Technical field
The present invention relates to the technical fields of big data and databases, and proposes a big data SQL query method and system for SolrCloud.
Background
With the development and spread of networks, the volume of data that applications produce and must process keeps growing. This explosive growth steadily increases the workload of existing database systems, and ever-larger data volumes require more and more applications to scale out to larger clusters for computation; distributed big data computing has therefore become the necessary route to handling massive data. Solr is a high-performance search server that provides fast and powerful full-text search. SolrCloud is a distributed solution based on ZooKeeper and Solr; it adds distributed capabilities to Solr and is used to build Solr server clusters that are highly available, highly scalable and automatically fault tolerant, with distributed indexing and distributed querying.
Solr provides a query language for retrieving large-scale document data, and its query capabilities are very rich. They include single-character wildcard matching, matching of zero or more characters, fuzzy queries based on edit distance, proximity queries (searching for words within a given distance of each other), range queries, and so on. Solr query syntax also supports combining multiple query conditions with operators such as AND, OR and NOT, and provides features such as field filtering and paging of query results.
However, on the one hand, developers in the database field are still accustomed to manipulating databases with the SQL language. On the other hand, because big data systems are so diverse, the builders of some information systems increasingly want to decouple them from the specific underlying storage management system (for example a MySQL cluster, MongoDB or SolrCloud). An SQL query engine for SolrCloud is therefore needed.
At present there are some SolrCloud query engines based on Hive and SparkSQL. For example, hive-solr (https://github.com/qindongliang/hive-solr) is a Hive-based plug-in that provides HiveQL queries for Solr, and spark-solr (https://github.com/lucidworks/spark-solr) is a SparkSQL-based Solr query plug-in that can not only map a Solr data source to an RDD (resilient distributed dataset) in the Spark framework, but also register it as a SparkSQL table and write and query data through SQL statements. These schemes have prominent drawbacks: (1) They depend too heavily on a big data computing framework such as Hive or SparkSQL, which creates a dependence on a complex environment, ultimately a large software stack made up of ZooKeeper, Hadoop and so on. (2) Hive depends on the Hadoop MapReduce computing framework and SparkSQL depends on the Spark computing framework; in Hive and SparkSQL, executing an SQL statement usually means converting it into a series of computing tasks that follow a general pattern (such as Map-Shuffle-Reduce), and both the start-up and scheduling of the framework and the adaptation and conversion of algorithms during execution introduce serious latency. Schemes based on Hive or SparkSQL are therefore better suited to executing batch SQL jobs than to online interactive SQL queries.
In addition, later versions of SolrCloud ship with a server-side module that provides an SQL interface service. However, for information systems based on SolrCloud 4.0 or 5.0, adopting such a version means that the production environment has to be reworked, including a product version upgrade, program changes and data migration, all of which bring potential risk and a considerable workload.
Against this background, the present invention proposes an interactive big data SQL query system for SolrCloud.
Summary of the invention
The object of the present invention is to provide a big data SQL query method and system for SolrCloud. The system implements interactive SQL query capability for SolrCloud on the client side, and at the same time addresses the inability of existing database systems to satisfy full-text search needs.
To achieve the above object, the technical solution adopted by the present invention is as follows:
An interactive big data SQL query system for SolrCloud, consisting mainly of module 1 and module 2 (corresponding to Fig. 1):
Module 1 (SQL analysis and execution engine): this module parses the SQL statement, draws up the SQL execution plan, calls module 2 to carry out the SQL command, and returns the SQL query result to the user program. The user can repeatedly submit SQL statements and obtain query results; this process is an interactive query.
Module 2 (Solr adapter): this module is responsible for carrying out the specific SQL execution tasks. Its main work includes Solr query translation, NOT operation optimization, Solr query execution and optimization, table schema mapping, command parsing and the SolrCloud connection.
Module 2 comprises 5 submodules:
Module 2.1 (Solr query translation module): translates SQL relational query conditions into Solr query conditions;
Module 2.2 (NOT operation optimization module): converts the NOT conditions in a query condition into equivalent query conditions in Solr, for example NOT (a=1 AND b=2) is converted into a!=1 OR b!=2;
Module 2.3 (Solr query and optimization module): executes the Solr query and returns the query result. To avoid the client-side waiting caused by large data volumes, this module implements page-by-page buffering;
Module 2.4 (table schema mapping module): maps a document collection (Collection) in Solr to an SQL table, and maps the documents (Document) and fields (Field) in Solr to the rows and columns of that SQL table;
Module 2.5 (command parsing and SolrCloud connection module): parses the incoming configuration parameters, including the location of the Solr server, the name of the Solr document collection, the columns of the SQL table and the corresponding Solr field names, and establishes the connection to the server by calling the SolrCloud client API.
Further, module 1 performs SQL statement parsing and execution planning on the basis of the open-source project Apache Calcite (http://calcite.apache.org). Apache Calcite is a framework for building SQL query engines and does not depend on any big data computing framework, so it offers better portability.
Further, module 1 is started from a configuration file, which must conform to the JSON format that Apache Calcite defines for a schema.
Further, module 2.1 receives the RexNode object passed over by module 1, which corresponds to the SQL query condition, and translates it layer by layer according to the concrete structure of the RexNode object. For example, for "a=1 AND b=2", the two binary operations a=1 and b=2 are first translated into a:1 and b:2 respectively, and the AND operation is translated afterwards. The module supports all kinds of SQL query conditions, including AND, OR, NOT, greater than, less than, IS NOT NULL, IS NULL, equal to, LIKE, not equal to, greater than or equal to, and less than or equal to. The translation process is described in item 2 of the embodiments.
Further, module 2.1 can split an SQL query condition into the part that Solr can accept and the part that it cannot. For example, an expression such as name like 'b%' and length(name)>20 is split into the two parts name like 'b%' and length(name)>20. Module 1 applies a secondary filter to the query result of module 2, so length(name)>20 is applied as a condition when the query result is filtered again.
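The split described above can be sketched as follows. This is a minimal Scala illustration under stated assumptions, not the code of the system: the Expr types, the names conjuncts, solrCanEvaluate and split, and the assumption that a function call such as length(name) cannot be expressed as a Solr query condition are all hypothetical.

```scala
// Hypothetical sketch: split a WHERE condition into a part that is sent
// to Solr and a residual part that is left to the secondary filter.
object ConditionSplitter {
  sealed trait Expr
  final case class Like(field: String, pattern: String) extends Expr
  final case class Eq(field: String, value: String) extends Expr
  // Stands for func(field) > bound, e.g. length(name) > 20.
  final case class FuncGt(func: String, field: String, bound: Int) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Flatten nested ANDs into a list of conjuncts.
  def conjuncts(e: Expr): List[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case other     => List(other)
  }

  // Assumption: only plain field comparisons are accepted by Solr here.
  def solrCanEvaluate(e: Expr): Boolean = e match {
    case _: Like | _: Eq => true
    case _               => false
  }

  // Returns (conjuncts for Solr, conjuncts Solr cannot accept).
  def split(where: Expr): (List[Expr], List[Expr]) =
    conjuncts(where).partition(solrCanEvaluate)

  def main(args: Array[String]): Unit = {
    val where = And(Like("name", "b%"), FuncGt("length", "name", 20))
    val (forSolr, residual) = split(where)
    println(s"sent to Solr: $forSolr")
    println(s"left for the secondary filter: $residual")
  }
}
```

Because the secondary filter re-applies the complete original condition, sending only a subset of the conjuncts to Solr can return too many rows but never too few, which keeps the final result correct.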
Further, the NOT operation optimization process of module 2.2 is described in item 3 of the embodiments.
Further, module 2.3 implements a page-by-page buffering mechanism. It first obtains the total number of records M through a Solr query and, from M and the page size N, calculates the number of pages P = ⌈M/N⌉, so that the overall query is decomposed into P consecutive subqueries. Every subquery has the same query condition but a different offset. The query request for page x is actually submitted to Solr only when the user program requests page x. This page-by-page buffering spares the user program long waits and saves network resources.
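The page-by-page buffering can be illustrated with a lazy iterator. The sketch below is an assumption-laden illustration rather than the system's implementation: PagedBuffer, pageCount and records are hypothetical names, and fetchPage(offset, rows) stands in for a real Solr request with the same query condition and a different offset.

```scala
// Hypothetical sketch of page-by-page buffering over M records with page size N.
object PagedBuffer {
  // Number of pages P = ceil(M / N), computed with integer arithmetic.
  def pageCount(total: Long, pageSize: Int): Int =
    ((total + pageSize - 1) / pageSize).toInt

  // Lazily yields all records. fetchPage is only called when iteration
  // actually reaches the corresponding page.
  def records[A](total: Long, pageSize: Int)(fetchPage: (Long, Int) => Seq[A]): Iterator[A] =
    Iterator.range(0, pageCount(total, pageSize)).flatMap { page =>
      fetchPage(page.toLong * pageSize, pageSize).iterator
    }

  def main(args: Array[String]): Unit = {
    val data = (1 to 120).toVector // in-memory stand-in for a Solr collection
    var requests = 0
    val it = records(data.size.toLong, 50) { (offset, rows) =>
      requests += 1
      data.slice(offset.toInt, offset.toInt + rows)
    }
    println(it.take(10).toList)
    println(s"page requests issued: $requests")
  }
}
```

In this example only the first of the three pages is requested, because the consumer stops after ten records.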
Further, module 2.4 performs the schema mapping of a table only when that table is operated on for the first time.
Further, module 2.5 receives the configuration information obtained by module 1 and parses it only when the table is operated on for the first time.
Further, module 2.5 establishes and maintains the SolrCloud connection; besides SolrCloud clusters, it also supports single-node Solr servers.
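The parameter handling of module 2.5 might look like the sketch below. The parameter names solrZkHosts, solrServerURL and columnMapping are the ones listed in the usage example of this description; the object SolrAdapterConfig, the Target types and both functions are hypothetical, and no real Solr client is created here.

```scala
// Hypothetical sketch: choose the connection mode and parse the column mapping.
object SolrAdapterConfig {
  sealed trait Target
  final case class Cluster(zkHosts: String) extends Target      // SolrCloud via ZooKeeper
  final case class SingleNode(serverUrl: String) extends Target // single Solr server

  // Exactly one of the two parameters must be present.
  def target(operand: Map[String, String]): Either[String, Target] =
    (operand.get("solrZkHosts"), operand.get("solrServerURL")) match {
      case (Some(zk), None)  => Right(Cluster(zk))
      case (None, Some(url)) => Right(SingleNode(url))
      case _                 => Left("specify exactly one of solrZkHosts and solrServerURL")
    }

  // "name->name_s,age->age_i" becomes Map(name -> name_s, age -> age_i).
  def columnMapping(spec: String): Map[String, String] =
    spec.split(",").toList.map(_.trim).filter(_.nonEmpty).map { pair =>
      pair.split("->", 2).map(_.trim) match {
        case Array(column, field) => column -> field
        case _ => throw new IllegalArgumentException(s"bad column mapping: $pair")
      }
    }.toMap

  def main(args: Array[String]): Unit = {
    println(target(Map("solrZkHosts" -> "bluejoe1:9983")))
    println(columnMapping("name->name_s,age->age_i"))
  }
}
```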
The beneficial effects of the present invention are as follows:
The invention provides an interactive big data SQL query system for SolrCloud. Compared with the traditional approach of querying through the API or Solr query syntax, the system supports interactive queries against SolrCloud in standard SQL, which makes program development more general. The system also carries Solr's fuzzy query and proximity query capabilities over into SQL statements, allowing inexact retrieval on fields through the like operator and functions.
The specific advantages of the system are:
(1) It provides a non-API way of operating on a SolrCloud system. Current SolrCloud-based applications mainly read and write SolrCloud through APIs for languages such as C++ and Java; this places higher demands on developers and is easily affected by the version of the API client. With the method of the invention, operations on SolrCloud document data can be performed in standard SQL, and developers can meet their needs with ordinary SQL statements without even writing a program.
(2) It does not need a big data computing framework such as Hadoop or Spark, so the system is highly portable; and because no big data computing framework has to be loaded, the system also performs well.
(3) It places no requirement on the SolrCloud version and is an optional client-side component, so deployment is very convenient. SolrCloud has also provided a built-in SQL engine since version 6.0, but for information systems based on SolrCloud 4.0 or 5.0, using the present invention saves the work of a version upgrade, program changes and data migration.
Brief description of the drawings
Fig. 1 is a schematic diagram of the internal structure of the system of the invention;
Fig. 2 is a flow chart of the execution of an SQL query in the invention;
Fig. 3 is the UML model diagram of the Solr condition description classes in the invention.
Detailed description of the embodiments
To make the above features and advantages of the invention easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.
1. Description of the SQL query execution flow
In the system of the invention, the execution flow of an SQL query is shown in Fig. 2, where:
Step 1: the user program submits an SQL query statement S to the system;
Step 2: the Calcite query engine in the system parses the query statement, parses its WHERE condition into a RexNode object RN, and jumps to step 3. If there is no WHERE condition, it creates a Solr query that searches all records (denoted SF2) and jumps to step 6;
Step 3: the Calcite query engine passes the object RN to the Solr adapter (module 2);
Step 4: the query translation module of the Solr adapter translates RN into a Solr query SF1;
Step 5: the NOT operation optimization module converts the NOT conditions in the Solr query SF1, yielding SF2;
Step 6: the Solr query and optimization module decomposes SF2 into page-by-page queries against the SolrCloud cluster, connects to the SolrCloud server and obtains the query result R1;
Step 7: the Calcite query engine applies a secondary filter to R1 using the initial filter condition RN (i.e. the RexNode object of the initial filter condition) and obtains R2; this step handles the conditions that the Solr adapter did not filter on;
Step 8: the Calcite query engine returns the SQL query result R2 to the user program.
2. Description of the Solr query translation method
Step 1: receive the RexNode object and create a triple T (kind, left, right), where kind is the type of the query condition corresponding to the RexNode (AND, OR, NOT, greater than, less than, etc.), left is the left operand and right is the right operand;
Step 2: match on the triple T; 20 cases are distinguished in total, as follows:
The invention defines a series of classes to describe each kind of Solr query condition, and generates the corresponding class object for each of the cases in the table above. For example, a NotSolrFilter object is constructed for case 19.
These classes include AndSolrFilter, NotSolrFilter, OrSolrFilter, GtSolrFilter, NotNullSolrFilter, IsNullSolrFilter, EqualsSolrFilter, LikeSolrFilter, NotEqualsSolrFilter, GeSolrFilter, LeSolrFilter, LtSolrFilter, etc.; in addition, a condition that cannot be translated is represented as UnrecognizedSolrFilter. They all implement the same interface, SolrFilter; the UML class diagram is shown in Fig. 3.
The SolrFilter interface defines one method, toSolrQueryString(), which returns the query string corresponding to the current Solr query condition. For example, the EqualsSolrFilter class is defined as follows:
3. Description of the NOT operation optimization
The principle of the NOT operation optimization is to replace operations of the form Not(Filter), so that no Not operator appears in the converted expression. The optimization rules are as follows:
(1) if a Not(x>y) condition is encountered, replace it with x≤y;
(2) if a Not(x≥y) condition is encountered, replace it with x<y;
(3) if a Not(x<y) condition is encountered, replace it with x≥y;
(4) if a Not(x≤y) condition is encountered, replace it with x>y;
(5) if a Not(x==y) condition is encountered, replace it with x!=y;
(6) if a Not(x!=y) condition is encountered, replace it with x==y;
(7) if a Not(x is null) condition is encountered, replace it with x is not null;
(8) if a Not(x is not null) condition is encountered, replace it with x is null;
(9) if a Not(Not(xxx)) condition is encountered, replace it with xxx and continue applying the NOT operation optimization to the condition xxx;
(10) if an And(xxx) condition is encountered, continue applying the NOT operation optimization to the conditions xxx inside it;
(11) if an Or(xxx) condition is encountered, continue applying the NOT operation optimization to the conditions xxx inside it;
(12) if a Not(And(xxx, yyy)) condition is encountered, replace it with (Not xxx) Or (Not yyy) and continue applying the NOT operation optimization to the two resulting conditions;
(13) if a Not(Or(xxx, yyy)) condition is encountered, replace it with (Not xxx) And (Not yyy) and continue applying the NOT operation optimization to the two resulting conditions;
(14) in all other cases, no NOT operation optimization is performed.
The above rules are shown in the following table:
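Rules (1) to (14) can be sketched as a recursive rewrite over a small condition tree. This is an illustration under assumptions, not the system's code: the types Cond, Cmp, IsNull, NotNull, And, Or, Not and Other and the object NotOptimizer are hypothetical stand-ins for the SolrFilter classes.

```scala
// Hypothetical sketch of the NOT operation optimization rules.
sealed trait Cond
final case class Cmp(op: String, field: String, value: String) extends Cond // op is one of > >= < <= == !=
final case class IsNull(field: String) extends Cond
final case class NotNull(field: String) extends Cond
final case class And(left: Cond, right: Cond) extends Cond
final case class Or(left: Cond, right: Cond) extends Cond
final case class Not(cond: Cond) extends Cond
final case class Other(text: String) extends Cond // anything the rules do not cover

object NotOptimizer {
  // Each comparison operator mapped to its negation (rules 1 to 6).
  private val negated = Map(
    ">" -> "<=", ">=" -> "<", "<" -> ">=", "<=" -> ">", "==" -> "!=", "!=" -> "==")

  def optimize(c: Cond): Cond = c match {
    case Not(Cmp(op, f, v)) if negated.contains(op) => Cmp(negated(op), f, v)
    case Not(IsNull(f))  => NotNull(f)                                   // rule 7
    case Not(NotNull(f)) => IsNull(f)                                    // rule 8
    case Not(Not(x))     => optimize(x)                                  // rule 9
    case And(l, r)       => And(optimize(l), optimize(r))                // rule 10
    case Or(l, r)        => Or(optimize(l), optimize(r))                 // rule 11
    case Not(And(l, r))  => Or(optimize(Not(l)), optimize(Not(r)))       // rule 12
    case Not(Or(l, r))   => And(optimize(Not(l)), optimize(Not(r)))      // rule 13
    case other           => other                                        // rule 14
  }

  def main(args: Array[String]): Unit = {
    // NOT (a=1 AND b=2) becomes a!=1 OR b!=2.
    println(optimize(Not(And(Cmp("==", "a", "1"), Cmp("==", "b", "2")))))
  }
}
```

In rules 12 and 13 the negation is pushed onto each operand and the rewrite then recurses, so a nested Not is eliminated wherever one of rules 1 to 9 applies.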
4. Description of the program classes
The system of the invention is developed in Scala and mainly includes the following classes:
●SolrTable: a subclass of the AbstractTable class provided by Apache Calcite that also implements the ScannableTable and FilterableTable interfaces provided by Apache Calcite. SolrTable represents the SQL table corresponding to a Solr document collection and mainly implements methods such as scan() to obtain the records that satisfy the conditions;
●SolrTableFactory: an implementation of the TableFactory interface provided by Apache Calcite, used to create SolrTable objects from the specified parameters;
●SolrQueryResultsIterator: an iterator over the records in a Solr query result;
●SolrQueryResults: encapsulates a Solr query result and contains a SolrQueryResultsIterator object;
●SqlFilter2SolrFilterTranslator: the core class, which translates SQL query conditions into Solr queries;
●SolrFilter: represents a converted Solr query condition. The invention also defines the subclasses of SolrFilter, including AndSolrFilter, NotSolrFilter, OrSolrFilter, GtSolrFilter, NotNullSolrFilter, IsNullSolrFilter, EqualsSolrFilter, LikeSolrFilter, NotEqualsSolrFilter, GeSolrFilter, LeSolrFilter, LtSolrFilter, etc.; in addition, a condition that cannot be translated is represented as UnrecognizedSolrFilter;
●SolrTableConf: a utility class that mainly provides methods for parsing parameters of types such as text and integer.
5. Usage example of the system
Sample code for using the system of the invention is given below:
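A minimal sketch of such a client in Scala, assuming that Apache Calcite and the Solr adapter of the invention are on the classpath and that model.json defines the table docs in the schema solr. The object SolrSqlClient, the helper jdbcUrl and the query itself are hypothetical. The identifiers are double-quoted because Calcite's default lexical policy folds unquoted identifiers to upper case.

```scala
import java.sql.DriverManager

// Hypothetical JDBC client; it needs the Calcite JDBC driver, the Solr
// adapter and a reachable Solr server in order to run.
object SolrSqlClient {
  // Builds the Calcite JDBC URL that points at the model file.
  def jdbcUrl(modelPath: String): String = s"jdbc:calcite:model=$modelPath"

  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl("src/java/test/model.json"))
    try {
      val stmt = conn.createStatement()
      // Assumed columns id, name and age; the filter is only an example.
      val sql =
        """select "id", "name", "age" from "solr"."docs" where "age" > 30 and "name" like 'b%'"""
      val rs = stmt.executeQuery(sql)
      while (rs.next()) {
        val id = rs.getInt("id")
        val name = rs.getString("name")
        val age = rs.getInt("age")
        println(s"$id\t$name\t$age")
      }
    } finally {
      conn.close()
    }
  }
}
```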
The program above is an ordinary JDBC client program. It differs from other programs in that its JDBC URL has the form "jdbc:calcite:model=src/java/test/model.json", which specifies that the driver of the Apache Calcite engine is used with model.json as the configuration file. The content of model.json is as follows:
It defines an SQL database (schema) named solr, in which an SQL table named docs is defined. The table is created by org.apache.calcite.adapter.solr.SolrTableFactory with the following properties:
●solrServerURL: the address of the Solr server, for example http://bluejoe1:8983/solr/collection1
●solrZkHosts: the ZooKeeper address used by SolrCloud, for example bluejoe1:9983
●solrCollection: the name of the Solr document collection, for example collection1
●columns: comma-separated column definitions, for example id integer,name char,age integer
●columnMapping: comma-separated column mappings, for example name->name_s,age->age_i, which means that the name column is stored as the name_s field in Solr documents and the age column is stored as the age_i field
●pageSize: the number of records per page used when paging, for example 50
When the user specifies the solrZkHosts parameter, the system connects to the corresponding SolrCloud cluster; when the user specifies the solrServerURL parameter, the system connects to the corresponding single node. Exactly one of these two parameters may be specified.
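Putting the properties above together, a model.json for a SolrCloud cluster might look like the following sketch. The layout (version, defaultSchema, schemas, and a table of type custom with a factory and an operand) follows Apache Calcite's JSON model format for custom tables; the operand values are the example values listed above, and passing all of them as strings is an assumption.

```json
{
  "version": "1.0",
  "defaultSchema": "solr",
  "schemas": [
    {
      "name": "solr",
      "tables": [
        {
          "name": "docs",
          "type": "custom",
          "factory": "org.apache.calcite.adapter.solr.SolrTableFactory",
          "operand": {
            "solrZkHosts": "bluejoe1:9983",
            "solrCollection": "collection1",
            "columns": "id integer,name char,age integer",
            "columnMapping": "name->name_s,age->age_i",
            "pageSize": "50"
          }
        }
      ]
    }
  ]
}
```

To connect to a single Solr node instead, solrZkHosts would be replaced by solrServerURL.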
The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention is defined by the claims.