CN107515887A

CN107515887A - An interactive query method suitable for multiple big data management systems

Info

Publication number: CN107515887A
Application number: CN201710515380.6A
Authority: CN
Inventors: 沈志宏; 李跃鹏; 黎建辉
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2017-06-29
Filing date: 2017-06-29
Publication date: 2017-12-26
Anticipated expiration: 2037-06-29
Also published as: CN107515887B

Abstract

The present invention relates to an interactive query method suitable for multiple big data management systems. Its steps include: 1) establishing a linked document model comprising document sets and link sets, where a link set is the set formed by the associations between documents; 2) converting the different original data models into the linked document model, which connects the different data sources into a single whole; 3) establishing, on the basis of the linked document model, a unified query language suitable for heterogeneous data; 4) using that unified query language to query relational databases, graph databases, and file systems in a unified way. The invention is the first to propose a unified query language for heterogeneous data management systems, enabling unified queries over relational databases, graph databases, and file systems.

Description

An interactive query method suitable for multiple big data management systems

Technical field

The present invention relates to a query language, and in particular to an interactive query method suitable for big data management systems. It belongs to the technical field of big data and databases.

Background art

With the continued spread of computers, the demand for data management and processing has become increasingly urgent. People have proposed different data models for different data forms and characteristics, and have built corresponding data management systems to manage and analyze data. Among the more influential data models is the E-R model: since it was proposed in the 1970s, it has largely dominated the database world for more than 40 years. Over the last decade, as Internet and Internet of Things applications have deepened, the generation of large-scale structured, semi-structured, and unstructured data has triggered the NoSQL movement [Cattell R. Scalable SQL and NoSQL data stores[J]. ACM SIGMOD Record, 2010, 39(4): 12-27]. The database world has shifted from the initial monopoly of SQL to a situation in which traditional SQL, NoSQL, and NewSQL each hold their own territory.

Building a complete big data application system requires fully taking into account the challenges of the 4Vs [Gupta R, Gupta H, Mohania M. Cloud computing and big data analytics: what is new from databases perspective[C]//Proc of 1st BDA. New Delhi, India: Springer Berlin Heidelberg, 2012: 42-61], as well as further analysis of big data, association mining, and even scientific discovery. Take scientific data in biology as an example. There are micro-level data, such as the large numbers of gene sequence files and protein sequence files produced daily by instruments for sequencing, mass spectrometry, and nuclear magnetic resonance, together with data on the structure and function of proteins. There are macro-level data, such as species information, physiological and biochemical characteristics, and reaction conditions, traditionally stored in MongoDB or SQL databases. There is also a large body of knowledge such as literature and patents. To better support knowledge discovery, researchers often also introduce bio-ontologies and use RDF relation networks to manage the large-scale associations among species, proteins, genes, and other data. This micro- and macro-level information ultimately forms one organic database, so that life can be understood and studied as a whole. Data-driven scientific discovery usually requires scheduling a series of data pipelines, and these pipelines span multiple stages such as data collection, batch writing, querying, analysis, and visualization. This raises a major question: how can pipeline programmers stop worrying about the differences between the underlying data storage models and access and operate on data in a unified way? Mapped onto data management technology, the question becomes how to cross the boundaries between SQL, NoSQL, and NewSQL databases, achieve universal data access across heterogeneous data models, and provide a unified data operation interface for computing frameworks such as Hadoop and Spark.

Relational databases today range from distributed databases to in-memory databases; the main ones include MySQL, PostgreSQL, Oracle, and SQLite. They guarantee the consistency of data access through ACID properties and transactions, process data using tables, rows, and keys, and suit applications with fixed structure and strong consistency requirements. In October 1986, ANSI adopted SQL as the standard language for relational database management systems (ANSI X3.135-1986), and ISO later adopted it as an international standard. SQL thus became the most popular relational database query language in use today.

NoSQL databases include key-value databases, column-oriented databases, document databases, and graph databases. Because NoSQL databases still lack a unified query language, some research has focused on wrapping NoSQL databases with SQL query interfaces. For example, Hive provides the SQL-like HQL query language, which makes NoSQL databases easier to use. Spark SQL is a SQL implementation built on Spark's DataFrame-based big data processing framework and supports SQL-based big data processing and analysis. Through DataFrames, Spark can provide SQL-based query and analysis over many current databases, such as MySQL, HBase, Cassandra, and MongoDB.

As an important branch of NoSQL databases, graph databases are often used to manage large-scale association information, such as the associations between species and genes, people's social networks, or Amazon's warehouse and retail master data system, and they support fast association retrieval based on the property graph model. Typical graph databases today include Neo4j, Titan, and Virtuoso. For graph databases, Neo4j proposed the Cypher query language, which expresses association queries over the graph data model concisely using a SQL-like syntax and makes graph databases easier to use. The TinkerPop project proposed Gremlin, a graph traversal language for property graphs that supports multiple graph databases, such as Titan, OrientDB, and TinkerGraph, and has been called the Perl of the graph database world. In addition, the RDF model is a semantic description framework based on the graph model and is well suited to expressing semantic information and its associations; typical RDF stores today include Jena and Virtuoso. In 2004 the RDF Data Access Working Group first published the RDF query language SPARQL, and in 2008 the SPARQL protocol and query language formally became a W3C Recommendation. SPARQL uses a structured query style and implements association queries through subgraph matching in its WHERE clause; most RDF stores now support standard SPARQL queries.

It is not hard to see that SQL and NoSQL databases still lack a unified query language. Graph databases in particular, because of their distinctive query and analysis style, are generally placed at the opposite pole from the SQL query language. When choosing a data model, people therefore often have to decide: an SQL database (including NoSQL databases that support SQL) or a graph database? This choice usually leads to differences in the upper-layer application: should it use the Gremlin or SPARQL query languages, with their strong association analysis capabilities, or the traditional SQL query language based on two-dimensional tables?

Against this background, the present invention proposes a new query language, Simba, to enable unified queries over relational databases, graph databases, and file systems.

Summary of the invention

The object of the present invention is to provide an interactive query method suitable for multiple big data management systems which, through the unified query language Simba, can query relational databases, graph databases, and file systems.

The technical solution adopted by the present invention is as follows:

An interactive query method suitable for multiple big data management systems, the steps of which include:

1) establishing a linked document model comprising document sets and link sets, where a link set is the set formed by the associations between documents;

2) converting the different original data models into the linked document model, which connects the different data sources into a single whole;

3) establishing, on the basis of the linked document model, a unified query language suitable for heterogeneous data;

4) using the unified query language suitable for heterogeneous data to query relational databases, graph databases, and file systems in a unified way.

Further, the unified query language for heterogeneous data management systems comprises four clauses: FIND, WITH, WHERE, and RETURN. The FIND clause determines the basic variables of the query, which must represent documents; the WITH clause defines the intermediate variables used in the matching condition syntax; the WHERE clause specifies the conditions that returned results must satisfy; the RETURN clause contains the data references to be returned to the user.

Further, the basic query space in the FIND clause consists of one or more classes of documents, and the linked document model is required not to compare two classes of documents that have no link between them; the WITH clause implicitly defines an expansion of the documents and links in the basic query space; the WHERE clause can not only implicitly define documents and links that expand the search space but can also apply the selection operation of the linked document model; the RETURN clause contains URLs at the document, link, or attribute level, or variables that represent such URLs. The RETURN clause mainly performs the projection operation of the linked document model, and the result returned is a linked document.

Further, execution of the unified query language is divided into four steps: determining the documents, establishing relations between documents, selection, and projection.

Further, the linked document model connects the different data sources into a single network, and the data reference syntax of the unified query language uses a URL-like form to access the data in the network uniformly.

Further, an intermediate variable in the unified query language represents a document collection associated with the basic search space, a numeric value, or a string. Intermediate variables are used in the matching syntax, and the corresponding condition matching operation is carried out according to the type of the intermediate variable.

The invention proposes an intermediate language that does not depend on any specific operating system or programming language. Because the Simba language covers the operations of several data models, some of which a given database cannot perform directly, in practice an SDK (Software Development Kit) for the Simba language can be developed for a database system, performing compensating operations on top of the database's native query language. For example, a Java package (or C++ package) that understands and executes the Simba language can be developed for MongoDB; a client program can then operate MongoDB in the Simba language by calling the API (Application Programming Interface) in the SDK. In other words, the data management system is queried through a SimbaQL translator. This mode is shown in Fig. 1.

The other mode is for the database to implement a communication protocol based directly on the Simba language; a client program sends Simba commands as network requests and receives the required query results. This mode is shown in Fig. 2.

As shown in Fig. 3, the Simba query language of the present invention comprises the following parts:

1. SimbaQL syntactic structure: specifies the overall structure of SimbaQL and the syntax of the FIND, WITH, WHERE, and RETURN clauses;

2. Data reference syntax: specifies how to reference data in a data source from SimbaQL;

3. Intermediate variable syntax: specifies how to define and use intermediate variables in SimbaQL;

4. Matching condition syntax: specifies how to write matching conditions in SimbaQL;

5. SimbaQL parser: a Java-based SimbaQL parser is provided for writing SimbaQL client programs or query engine executors.

Compared with the prior art, the advantages of the present invention are as follows:

(1) It is the first to propose a unified query language for heterogeneous data management systems. The language enables unified queries over relational databases, graph databases, and file systems: it can retrieve records in a relational table that satisfy given attribute conditions, retrieve vertices in a graph database that satisfy given association conditions, and retrieve specific files in a file system. With current development techniques, an application must use SQL, Cypher/Gremlin, and API calls to query relational databases, graph databases, and file systems respectively; this divergence means developers must master several languages and their programs are not portable. With SimbaQL, only one unified syntax is needed. The difference is shown in Fig. 4 and Fig. 5.

(2) It summarizes the typical query patterns of big data management systems and simplifies complex SQL and graph query functions. SimbaQL aims to cover the majority of query needs while letting mainstream data management systems support the language easily, so it abandons the complex features of SQL and graph queries, such as subqueries or UNION operations on query results. SimbaQL suggests leaving such secondary operations to a big data computing framework; SimbaQL itself only performs simple data query and extraction.

(3) SimbaQL targets databases with several data models and therefore includes the operations of several data models. If the query language of one model cannot perform an operation of another model, the SimbaQL implementation performs it on that model's behalf. For example, MongoDB's query language cannot perform a JOIN of documents, whereas SimbaQL supports JOIN, so the SimbaQL implementation compensates for this operation.

(4) SimbaQL introduces intermediate variables to express, and hide, information the user is not interested in. Take finding two related entities as an example:

FIND x, y WITH $m=x.child WHERE $m.child=y RETURN x, y

The statement introduces the intermediate variable m, which stands for a child of x, and y is in turn a child of m. The query returns all grandparent-grandchild pairs, but the application issuing the query does not need to care who m actually is.

This approach also avoids repeated spelling and allows a shorthand style.

(5) SimbaQL introduces multi-level attribute references. For example, x.knows.knows.name denotes the name of some person z known by some person y whom x knows. Traditional query languages do not support multi-level references. This approach effectively reduces code repetition and reads intuitively.
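The grandparent query in advantage (4) can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the `child` dictionary and the function name `grandparent_pairs` are hypothetical stand-ins for a `child` link set and for whatever engine actually evaluates the statement.

```python
# Hypothetical `child` link set: parent id -> list of child ids.
child = {
    "ann": ["bob"],
    "bob": ["cat", "dan"],
}


def grandparent_pairs(child_links):
    """Evaluate: FIND x, y WITH $m=x.child WHERE $m.child=y RETURN x, y"""
    result = []
    for x, middles in child_links.items():
        for m in middles:                      # $m = x.child
            for y in child_links.get(m, []):   # $m.child = y
                result.append((x, y))          # m itself is never returned
    return result


pairs = grandparent_pairs(child)
```

The intermediate variable only appears inside the loops, which mirrors the point made above: the caller receives x and y and never needs to know who m is.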

Brief description of the drawings

Fig. 1 shows the mode in which the data management system is queried through a SimbaQL translator.

Fig. 2 shows the mode in which the data management system is queried directly through the SimbaQL network protocol.

Fig. 3 shows the structure of the present invention.

Fig. 4 shows that, in the prior art, different management systems must be queried with different languages.

Fig. 5 shows the present invention querying different management systems uniformly with the SimbaQL language.

Fig. 6 is a schematic diagram of the structure of the LDM model.

Detailed description of the embodiments

The present invention is further described below through specific embodiments and the accompanying drawings.

The SimbaQL designed in the present invention is based on the Linked Document intermediate model (Linked Document Model, LDM for short). Through the mapping between LDM operations and the operations of other models, together with the compensating operations of the SDK, it achieves unified querying over databases with different data models.
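A compensating operation in the SDK can be pictured as follows. This is a sketch under stated assumptions, not the SDK described in the patent: the two lists stand in for the results of two separate native queries against a store that cannot join them, and `compensate_join` is a hypothetical helper that performs a hash join on the client side.

```python
def compensate_join(left_docs, right_docs, left_key, right_key):
    """Hash join done in the SDK layer after both sides have been fetched."""
    index = {}
    for doc in right_docs:
        index.setdefault(doc[right_key], []).append(doc)
    return [
        (left, right)
        for left in left_docs
        for right in index.get(left[left_key], [])
    ]


# Stand-ins for two result sets fetched with the store's native query language.
persons = [{"id": "p1", "name": "bluejoe"}, {"id": "p2", "name": "alice"}]
software = [{"id": "s1", "name": "simba", "inventor": "p1"}]

pairs = compensate_join(persons, software, "id", "inventor")
```

Building the index on one side keeps the join linear in the combined input size, which matters because the compensation runs outside the database.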

1. Linked Document model

1) Linked Document model definition

A document is a set made up of a group of attributes, and an attribute is a set formed from data of the same type. By default every document contains a primary key attribute that uniquely identifies it. The primary key plays a role similar to an IP address and must be a globally unique identifier; the other attributes may be of any type, including document, link, or a custom type. A link is a special document that must contain the two attributes (from: primary key, to: primary key) and represents an association between documents, that is, a relation between two pieces of data. For example, a knows link between one person document and another person document means that the first person knows the second. Every document set and link set must have a name identifier that conveys the semantics of the documents or links it contains. Documents or links of the same class may have different numbers of attributes, which means that {'id': 'fffff0', 'name': 'bluejoe', 'age': 30} can be a member of the person class of documents and also a member of the teacher class of documents.

An LDM is a two-tuple (document sets, link sets), where a link set is the set of relations of various kinds between two classes of documents. The general structure of the LDM is shown in Fig. 6, in which Documents denotes the document sets, Links denotes the link sets, PersonDocument denotes the collection of person documents, SoftwareDocument denotes the collection of software documents, InventLink denotes the set of links in which a person invents software, 1 and 2 denote the unique primary keys of documents, and attr1, attr2, ... denote document attributes.
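The definition above can be written down as a small data structure. The sketch below is illustrative only: the class names `Document`, `Link`, and `LDM` and the field names `source` and `target` (standing for the from and to primary keys) are assumptions, not names taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """A document: a globally unique primary key plus arbitrary attributes."""
    id: str
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Link:
    """A link is a special document carrying `from` and `to` primary keys."""
    id: str
    source: str  # the `from` primary key
    target: str  # the `to` primary key
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class LDM:
    """The two-tuple (document sets, link sets); every set has a name."""
    documents: Dict[str, List[Document]] = field(default_factory=dict)
    links: Dict[str, List[Link]] = field(default_factory=dict)


bluejoe = Document("fffff0", {"name": "bluejoe", "age": 30})
simba = Document("s1", {"name": "simba"})

model = LDM()
# The same document may be a member of several named document sets.
model.documents["person"] = [bluejoe]
model.documents["teacher"] = [bluejoe]
model.documents["software"] = [simba]
model.links["invent"] = [Link("l1", source=bluejoe.id, target=simba.id)]
```

Keeping documents and links in separately named sets matches the requirement that each set carries its own name, and lets one document appear under both person and teacher.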

2) LDM conversion rules

The LDM is intended for querying and analyzing data, and it provides two kinds of conversion rules: conversion from an original data model to the LDM, and conversion from the LDM to the form required by existing programming models.

A) Original data model → LDM

A data model conversion is formally defined as (G, L, M), where G denotes the schema of the global model, that is, the LDM; L denotes a local data model (relational model, key-value model, document model, or property graph model); and M denotes the mapping rules from L to G. Conversion from an original data model to the LDM is primarily concerned with the semantics of the data; conversion at the data type level can be decided by developers according to system requirements. The original data models covered below are the relational model, key-value model, document model, and property graph model, and the main conversion rules are shown in Table 1. A custom conversion rule extracts, according to the characteristics of the original data model, the set of data that satisfies certain features. Examples: extracting the data whose key contains person in a key-value model as the Person class document collection; extracting the vertices whose label is Person in a property graph model as Person class documents; extracting the relation in which the personid of a Person document equals the personid of a Software document in a document model as the link set invent.

Table 1. Conversion rules from original data models to the LDM

LDM	Relational model	Key-value model	Document model	Property graph model
Attribute	Attribute	Key	Attribute	Attribute
Document	Record	Pair	Document	Vertex
Document collection	Table	Custom	Collection	Custom
Link	Foreign key	Custom	Custom	Edge
Link collection	Foreign key	Custom	Custom	Custom

Note that in the LDM every document set and link set must have a name. Therefore, when converting the foreign keys of the relational model and the other custom parts, the person performing the conversion must supply a name that serves as the semantics of the set's elements. For example, in a property graph model, the nodes whose label is 'person' can be taken as the person class documents in the LDM, and the nodes that contain the attribute 'teacher' can be taken as the teacher class documents in the LDM; these two classes of documents may in fact correspond to the same node.

In addition, conversion to the LDM is not limited to the models above; developers can define conversion rules to the LDM for other data models as needed, such as file systems or column-oriented databases.
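The relational column of the conversion rules (record to document, table to document collection, foreign key to link) can be sketched as follows. The function `relational_to_ldm`, its argument layout, and the link name `invented_by` are all hypothetical; they only illustrate that the converter has to supply the link set's name.

```python
def relational_to_ldm(tables, foreign_keys):
    """Convert relational data into (document sets, link sets).

    tables:       {table_name: [row dict containing an 'id' key]}
    foreign_keys: [(link_name, table, fk_column, referenced_table)]
    """
    # Each table becomes a document collection, each record a document.
    document_sets = {
        name: [dict(row) for row in rows] for name, rows in tables.items()
    }
    # Each foreign key becomes a named link set of (from, to) primary keys.
    link_sets = {}
    for link_name, table, column, _referenced_table in foreign_keys:
        link_sets[link_name] = [
            {"from": row["id"], "to": row[column]}
            for row in tables[table]
            if row.get(column) is not None
        ]
    return document_sets, link_sets


tables = {
    "person": [{"id": "p1", "name": "bluejoe"}],
    "software": [{"id": "s1", "name": "simba", "inventor_id": "p1"}],
}
# The link name is supplied by whoever performs the conversion.
docs, links = relational_to_ldm(
    tables, [("invented_by", "software", "inventor_id", "person")]
)
```

Rows with a null foreign key simply produce no link, so the document collection and the link set can have different sizes.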

B) LDM → programming model

Conversion from the LDM to a programming model is primarily concerned with the relations in the data structure. The data structures accepted by currently popular programming models, such as MapReduce, Spark SQL, and Pregel, are mainly arrays, tables, and graphs. The conversion rules from the LDM to these three data structures are therefore given below, as shown in Table 2.

Table 2. Conversion rules from the LDM to arrays, tables, and graphs

3) LDM operation rules

The LDM operation rules are defined on the basis of the operations of the relational model, key-value model, document model, and property graph model. These include the set, join, selection, and projection operations of the relational model; the get operation of the key-value model; the selection and projection operations of the document model; and the traversal and selection operations of the property graph model. The LDM operation rules fall into three main classes: set operations, link operations, and document operations. The specific rules are shown in Table 3.

Table 3. LDM operation rules

4) LDM data access rules

Because the LDM connects the databases into a network, a URL-like form can be used to reference the data in the network. The URL form is as follows:

Here, datasource denotes the data source, such as MySQL or MongoDB; document denotes a document that the data source maps to in the LDM; link denotes a link that the data source maps to in the LDM; identity denotes the primary key of a document; and propertyName denotes the attribute name of a document.

Data can be referenced at different levels. For example, a reference to the name attribute of the person documents in a MySQL database can be written as:

MySQL.person.name

A reference to the father link of the person documents in a MongoDB database can be written as:

MongoDB.person.father

A link stands for the document collection that the link leads to, and the reference can be continued to deeper levels, for example:

MongoDB.person.father.name

The data corresponding to a data reference URL is in fact the result of the LDM establishing the relation and then applying projection. For example, the data denoted by MongoDB.person.father is the result of establishing the father link between two Person class documents and projecting onto the father relation.
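How such a reference might be resolved can be sketched against a tiny in-memory network. Everything here is an assumption made for illustration: the `SOURCES` layout, the sample documents, and the `resolve` function. The sketch covers only the three reference shapes shown above and does not handle an identity segment.

```python
# Hypothetical in-memory stand-in for one registered data source.
SOURCES = {
    "MongoDB": {
        "documents": {
            "person": [
                {"id": "p1", "name": "tom"},
                {"id": "p2", "name": "jack"},
            ]
        },
        # Link sets: name -> list of (from id, to id) pairs.
        "links": {"father": [("p1", "p2")]},
    }
}


def resolve(reference):
    """Resolve datasource.document[.link ...][.propertyName]."""
    source_name, doc_set, *rest = reference.split(".")
    source = SOURCES[source_name]
    by_id = {d["id"]: d for docs in source["documents"].values() for d in docs}
    current = list(source["documents"][doc_set])
    for segment in rest:
        if segment in source["links"]:
            # Establish the relation: follow the link from the current documents.
            ids = {d["id"] for d in current}
            current = [by_id[t] for f, t in source["links"][segment] if f in ids]
        else:
            # Projection onto an attribute ends the reference.
            return [d[segment] for d in current if segment in d]
    return current
```

For instance, `resolve("MongoDB.person.father.name")` first follows the father link and then projects onto name, which is the "establish relation, then project" reading given above.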

2. SimbaQL syntactic structure

Just as SQL is based on the relational model, SimbaQL is based on the Linked Document model, and every SimbaQL statement can be converted into a Linked Document operation formula. The formula is composed of the following operations from Table 3: "establish link" and "selection" among the link operations, and "selection" and "projection" among the document operations. A SimbaQL query statement mainly comprises four clauses, FIND, WITH, WHERE, and RETURN, with the following syntactic structure:

FIND <documents>

WITH <variables>

WHERE <conditions>

RETURN <urls>

Here, the FIND clause determines the basic variables of the query, each of which must correspond to a class of documents in the Linked Document model; the WITH clause defines the intermediate variables used in the matching condition syntax, which may be of several data types and are not limited to documents; the WHERE clause specifies the conditions that returned results must satisfy; the RETURN clause contains the data references to be returned to the user. The LDM computation corresponding to SimbaQL is described below.
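The four-clause structure can be illustrated with a simple clause splitter. This is a rough sketch, not the ANTLR4-based parser of the invention: the regular expression, the function name `split_clauses`, and the choice to treat keywords case-insensitively are assumptions, and the sketch would be confused by a string literal that itself contains a keyword such as WHERE.

```python
import re

# FIND is mandatory; WITH, WHERE, and RETURN are treated as optional here.
CLAUSE_RE = re.compile(
    r"^\s*FIND\s+(?P<find>.+?)"
    r"(?:\s+WITH\s+(?P<with>.+?))?"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+RETURN\s+(?P<ret>.+?))?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_clauses(statement):
    """Split a SimbaQL statement into the text of each clause present."""
    match = CLAUSE_RE.match(statement)
    if match is None:
        raise ValueError("not a FIND statement")
    return {k: v for k, v in match.groupdict().items() if v is not None}


full = split_clauses("FIND x, y WITH $m=x.child WHERE $m.child=y RETURN x, y")
partial = split_clauses("FIND person p WHERE p.invent.name='simba'")
```

A real implementation would hand each clause to a grammar-driven parser; the splitter only shows how the clauses partition a statement.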

First, the basic query space in FIND consists of one or more classes of documents, and SimbaQL requires that the LDM not compare two classes of documents that have no link between them. For example, a query over person objects and software objects can be written as:

FIND MySQL.person p, MySQL.software s

If there is no link between the person documents and the software documents, selection can only be applied to person and software separately; a selection such as p.inventid=s.id is not allowed.

The variables defined in WITH implicitly define an expansion of the documents and links in the basic query space, for example:

FIND person p WITH $soft=p.invent

The statement above indicates that the Linked Document being searched contains the Software documents and the link invent. It is equivalent to:

FIND person p, software s WITH $soft=p.invent

The WHERE clause can not only implicitly define documents and links that expand the search space but can also apply the LDM selection operation. For example:

FIND person p WHERE p.invent.name='simba'

This implicitly determines that the LDM contains the document sets (person, software) and the link set (invent), and requires the Software documents to satisfy the condition that the name attribute is 'simba'.

The RETURN clause may contain URLs at the document, link, or attribute level, or variables that represent such URLs. The clause mainly performs the LDM projection operation, and the result returned is a Linked Document.

In summary, execution of SimbaQL is divided into four steps: determining the documents, establishing relations between documents, selection, and projection. Suppose the basic search space in FIND is A, B; the WHERE clause implicitly determines the document C, the link L1 between A and C, and the selection condition condition; the projection space in the RETURN clause is space; and the documents obtained by the other processing steps are doc. Then the LDM computation corresponding to the SimbaQL statement is:

$Result = \sigma_{space}\,\pi_{condition}\big((A \times_{d} B)_{A} \times_{L1} C\big)$

For example, the SimbaQL statement:

FIND person p, software s WHERE p.name='bluejoe' and p.invent.name='simbaql' RETURN p.name

corresponds to the LDM computation:

$Result = \sigma_{p.name}\,\pi_{p.name='bluejoe' \text{ and } software='simbaql'}(Person \times_{invent} Software)$
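The four execution steps for this example can be traced over a few in-memory records. The sample data and variable names below are invented for illustration, and the list comprehensions are only a stand-in for whatever the query engine actually does.

```python
# Hypothetical sample data for the person and software document sets.
persons = [{"id": "p1", "name": "bluejoe"}, {"id": "p2", "name": "alice"}]
software = [{"id": "s1", "name": "simbaql"}, {"id": "s2", "name": "other"}]
# (from, to) primary-key pairs of the invent link set.
invent = [("p1", "s1"), ("p2", "s2")]

persons_by_id = {p["id"]: p for p in persons}
software_by_id = {s["id"]: s for s in software}

# Steps 1 and 2: determine the documents and relate them through invent.
joined = [(persons_by_id[f], software_by_id[t]) for f, t in invent]

# Step 3: keep the pairs that satisfy the WHERE condition.
selected = [
    (p, s)
    for p, s in joined
    if p["name"] == "bluejoe" and s["name"] == "simbaql"
]

# Step 4: project onto p.name, as requested by RETURN.
result = [p["name"] for p, _ in selected]
```

Only person and software pairs connected by an invent link ever reach the selection step, which reflects the rule that unlinked document classes are not compared.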

3. Data reference syntax (also called attribute expression syntax, as shown in Fig. 3)

Here, datasource denotes a data source registered in the linked document model, document denotes a class of documents in the linked document model, link denotes a link set in the linked document model (a URL may contain several links), identity denotes the id of a document, and propertyName denotes an attribute of a document.

MySQL.person.name

MongoDB.person.father

MongoDB.person.father.name

4. Intermediate variable syntax

An intermediate variable can represent a document collection associated with the basic search space, a numeric value, or a string. A variable is written as the $ symbol followed by an identifier and is defined with the WITH clause:

For example:

WITH $c1=p.knows.knows (document collection)

WITH $c2=123 (numeric value)

WITH $c3='bluejoe' (string)

Intermediate variables are used in the matching syntax, and the corresponding condition matching operation is carried out according to the type of the intermediate variable. When an intermediate variable is a document collection, its main role is to stand in for part of a data reference URL.
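The three kinds of intermediate variable can be told apart with a small classifier. This is a sketch only: `parse_definition`, the regular expression, and the type labels are hypothetical, and numeric values are limited to integers for brevity.

```python
import re

DEFINITION_RE = re.compile(r"^\$(\w+)\s*=\s*(.+)$")


def parse_definition(text):
    """Parse one `$name=value` definition and classify its value."""
    match = DEFINITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"bad variable definition: {text!r}")
    name, raw = match.group(1), match.group(2).strip()
    if len(raw) >= 2 and raw[0] == raw[-1] == "'":
        return name, "string", raw[1:-1]
    if re.fullmatch(r"-?\d+", raw):
        return name, "number", int(raw)
    # Anything else is taken to be a data reference such as p.knows.knows.
    return name, "documents", raw
```

Knowing the type up front is what allows the matching step to choose between comparing values and substituting the variable into a data reference.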

5. Matching condition syntax

A matching condition is an expression, introduced by the WHERE clause, whose return value is of type bool. The syntax rules for expressions are as follows:

1) Aggregate filtering of document collections A and B: (<document A>.link | <document A>) = <document B>

2) Filtering of a document collection: (<document>.attribute | <link>.attribute) operator <basic data type>

3) <expression> AND|OR <expression>

The operators currently supported are: >, <, =, >=, <=. When <link>.attribute or <document A>.link denotes a document collection, the "=" operator means "exists". For example, p.knows.name='bluejoe' means that among the people p knows there is one named "bluejoe", and p.knows=p1 means that p1 is among the people p knows.
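The "exists" reading of "=" can be made concrete with a short sketch. The `people` dictionary and both function names are hypothetical; they model a knows link as a list of ids stored on each person.

```python
# Hypothetical person documents; `knows` holds the ids reached by the link.
people = {
    "p": {"name": "tom", "knows": ["p1", "p2"]},
    "p1": {"name": "bluejoe", "knows": []},
    "p2": {"name": "alice", "knows": ["p1"]},
}


def knows_name_equals(person_id, name):
    """p.knows.name='name': true if ANY known person has that name."""
    return any(people[k]["name"] == name for k in people[person_id]["knows"])


def knows_equals(person_id, other_id):
    """p.knows=p1: true if other_id is among the people person_id knows."""
    return other_id in people[person_id]["knows"]
```

In both cases the left-hand side is a collection, so the comparison succeeds as soon as one member matches rather than requiring every member to match.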

Note that although the WHERE clause corresponds to the LDM selection operation, selection in SimbaQL only selects on the attribute values of documents. For example:

FIND person p, software s WHERE p.invent=s AND s.name='SimbaQL' RETURN p.name

Although the selection condition includes p.invent=s, the real selection condition is actually s.name='SimbaQL'.

6. SimbaQL parser

The SimbaQL parser mainly comprises the following groups of classes:

Classes for the overall syntactic structure: Statement, SearchSpace, VariableDefines, Conditions, SubSpace; their meanings are given in the table below.

Syntax tree abstract classes and interfaces: Node (node), Condition (condition), Variable (variable), Document (document), AttributeDocument (a document reached through an attribute, such as p.knows or $p.knows), ValueExprecession (a value-typed expression).

Concrete syntax tree classes: RawDocument, RawAttribute, RawVarible, WithVarible, StringValue, IntegerValue, TerminalCondition, And, Or, Not, VaribleAttribute, DocumentRefference, Operator. RawDocument represents a document such as Person p in FIND Person p; RawAttribute represents an attribute such as p.name; VaribleAttribute represents an attribute reached through a variable, such as $k.name; WithVarible represents a variable such as $k in WITH $k=p.knows; RawVarible represents a document variable defined by FIND, such as p; StringValue and IntegerValue represent strings and integers respectively; And, Or, and Not represent the logical connectives and, or, and not in expressions; DocumentRefference represents the knows link in p.knows or p.knows.name; Operator represents a comparison operator; TerminalCondition represents an expression that cannot be split further, such as p.age>30.

The basic information on the classes above is shown in Table 4:

The abstract syntax tree essential information of table 4.

In addition to the Java classes of the abstract syntax tree described above, the parsing program also includes the ANTLR4 lexical and syntax parsing classes: SimbaQLLexer, SimbaQLParser, SimbaQLBaseListener, SimbaQLBaseVisitor, SimbaQLVisitor. SimbaQLLexer is the lexical analyzer for SimbaQL statements generated by ANTLR4, used to judge whether the words in a statement conform to the grammar; SimbaQLParser is the SimbaQL syntax parser; SimbaQLBaseListener and SimbaQLBaseVisitor are the base classes for traversing the syntax tree in listener mode and visitor mode respectively; SimbaQLVisitor inherits from SimbaQLBaseVisitor and traverses the syntax tree in visitor mode.

Abstract syntax tree builder class: AstBuilder, which builds the abstract syntax tree. It takes a SimbaQL statement as input and outputs a syntax tree rooted at Statement.
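The input/output contract of such a builder can be sketched at clause level. The Python code below is not the ANTLR4-based AstBuilder; it is a simplified stand-in that splits a statement into its FIND, WITH, WHERE, and RETURN parts with a regular expression. The `Statement` fields, the function `build_statement`, and the assumption that WITH and WHERE are optional are all hypothetical, and keywords appearing inside string literals are not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    find: str
    with_: Optional[str]
    where: Optional[str]
    returns: str


# Clause order assumed here: FIND [WITH] [WHERE] RETURN.
_PATTERN = re.compile(
    r"^\s*FIND\s+(?P<find>.+?)"
    r"(?:\s+WITH\s+(?P<with_>.+?))?"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"\s+RETURN\s+(?P<returns>.+?)\s*$",
    re.DOTALL,
)


def build_statement(text):
    """Split a SimbaQL statement into clause strings; raise on malformed input."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError("expected FIND ... RETURN ...")
    return Statement(**match.groupdict())


stmt = build_statement(
    "FIND person p, software s WHERE p.invent=s AND s.name='SimbaQL' RETURN p.name"
)
print(stmt)
```

A full builder would go on to parse each clause string into the node classes listed earlier; this sketch stops at the top-level structure.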

Syntax error checking classes: AstChecker, DBchecker. AstChecker detects query statements that do not conform to the grammar; DBchecker detects content in a query statement that conflicts with the data source, for example a query statement that contains p.knows when the data source has no knows link.
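The kind of check attributed to DBchecker can be sketched as a lookup of each reference against what the data source defines. The schema layout, the function `check_references`, and the message format below are assumptions for illustration, not the actual DBchecker interface.

```python
# Hypothetical description of a data source: per document class,
# the attributes and links it defines.
SCHEMA = {
    "person": {"attributes": {"name", "age"}, "links": {"invent"}},
    "software": {"attributes": {"name"}, "links": set()},
}


def check_references(variable_classes, references):
    """Return one message per reference whose first step is unknown to the source.

    variable_classes maps a query variable to its document class, e.g. {"p": "person"}.
    references are dotted paths from the query, e.g. ["p.knows", "p.name"].
    """
    errors = []
    for ref in references:
        parts = ref.split(".")
        if len(parts) < 2:
            continue  # a bare variable such as "p" has nothing to look up
        variable, first = parts[0], parts[1]
        doc_class = variable_classes[variable]
        members = SCHEMA[doc_class]["attributes"] | SCHEMA[doc_class]["links"]
        if first not in members:
            errors.append(
                f"{ref}: '{first}' is not defined for '{doc_class}' in the data source"
            )
    return errors


print(check_references({"p": "person"}, ["p.knows", "p.invent", "p.name"]))
```

With this sample schema only p.knows is reported, which mirrors the example in the text of a query that uses a knows link the data source does not have.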

Syntax parsing examples (which output the parse tree): SimbaParser, Treeprinter. SimbaParser is an example program that builds the parse tree of a query statement and prints the structure of that tree; Treeprinter is the program that prints the parse tree.

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as defined in the claims.

Claims

1. An interactive query method suitable for a variety of big data management systems, comprising the following steps:

1) establishing an associated document model, which includes a document set and an association set, the association set being the set formed by the associations between documents;

2) converting different original data models into the associated document model, and connecting different data sources into one through the associated document model;

3) establishing, on the basis of the associated document model, a unified query language suitable for multivariate data;

4) using the unified query language suitable for multivariate data to realize unified querying of relational databases, graph databases, and file systems.

2. The method according to claim 1, characterized in that a document in the document set of the associated document model is a set formed by a group of attributes, and an attribute is a set formed by data of the same type; each document includes by default a primary key attribute, which is a globally unique identifier; the document set and the association set each have a name identifier that states the semantics of the documents and associations in the set.

3. The method according to claim 1, characterized in that the unified query language suitable for multivariate data management systems comprises four clauses: FIND, WITH, WHERE, and RETURN; the FIND clause determines the basic variables of the query, and these variables must represent documents; the WITH clause determines the intermediate variables used in the matching-condition syntax; the WHERE clause determines the conditions that the returned results must satisfy; the RETURN clause contains the data references to be returned to the user.

4. The method according to claim 3, characterized in that: the basic query space in the FIND clause consists of one or more classes of documents, and the associated document model requires that two classes of documents with no association cannot be compared; the WITH clause implicitly defines the expansion of the documents and associations in the basic query space; the WHERE clause can not only implicitly define the documents and associations that expand the query space, but can also perform the selection operation of the associated document model; the RETURN clause contains URLs at the document, link, and attribute levels, or variables representing URLs; this clause mainly performs the projection operation of the associated document model, and the returned result is an associated document.

5. The method according to claim 3, characterized in that the execution of the unified query language is divided into four steps: determining the documents, establishing the relations between documents, selection, and projection.

6. The method according to claim 3, characterized in that different data sources are connected into one through the associated document model to form a network, and the data-reference syntax of the unified query language adopts a URL-like form to access the data in the network uniformly.

7. The method according to claim 3, characterized in that an intermediate variable in the unified query language represents a document set, a numeric value, or a character string related to the basic query space; intermediate variables are used in the matching syntax, and the corresponding condition matching operation is performed according to the type of the intermediate variable.

8. The method according to claim 3, characterized in that the matching condition in the unified query language is an expression introduced by the WHERE clause whose return value is of bool type, and the syntax rules of the expression are as follows:

1) aggregate filtering of document sets A and B: (<Document A>.link | <Document A>) = <Document B>;

2) document set filtering: (<Document>.attribute | <Association>.attribute) operator basic-data-type;

3) <Expression> AND|OR <Expression>.

9. The method according to claim 3, characterized in that the parsing program of the unified query language comprises: overall syntactic-structure classes, syntax-tree abstract classes and interfaces, and syntax-tree concrete classes.

10. The method according to claim 1, characterized in that, in practical applications, an SDK for the database system is developed for the unified query language, and some compensating operations are performed on top of the native database query language, so that a client program operates the database with the unified query language by calling the API in the SDK; or a communication protocol is designed for the database directly on the basis of the unified query language, and the client program obtains the required query results by sending network requests.