CN107515887B - Interactive query method suitable for various big data management systems - Google Patents

Interactive query method suitable for various big data management systems

Info

Publication number
CN107515887B
Authority
CN
China
Prior art keywords
document
data
association
query
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710515380.6A
Other languages
Chinese (zh)
Other versions
CN107515887A (en
Inventor
沈志宏
李跃鹏
黎建辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS filed Critical Computer Network Information Center of CAS
Priority to CN201710515380.6A priority Critical patent/CN107515887B/en
Publication of CN107515887A publication Critical patent/CN107515887A/en
Application granted granted Critical
Publication of CN107515887B publication Critical patent/CN107515887B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an interactive query method suitable for various big data management systems, which comprises the following steps: 1) establishing an associated document model which comprises a document set and an association set, wherein the association set is a set formed by the associations between documents; 2) converting different original data models into associated document models, and connecting different data sources into a whole through the associated document models; 3) establishing a unified query language suitable for multivariate data based on the associated document model; 4) using the unified query language suitable for multivariate data to realize unified query of relational databases, graph databases and file systems. The invention provides, for the first time, a unified query language suitable for multivariate data management systems, and can realize unified query of relational databases, graph databases and file systems.

Description

Interactive query method suitable for various big data management systems
Technical Field
The invention relates to a query language, in particular to an interactive query language and a query method suitable for a big data management system, and belongs to the technical field of big data and databases.
Background
With the spread of computers, the demand for data management and processing has become increasingly urgent. Different data models have been proposed for different data forms and characteristics, each with a corresponding data management system for managing and analyzing the data. The more influential data models, such as the E-R model, have dominated the database world for over 40 years since the 1970s. In recent decades, with the spread of Internet and Internet of Things applications, the generation of large-scale structured, semi-structured and unstructured data triggered the NoSQL movement [Cattell R.]. The database world has shifted from the initial dominance of SQL to a three-way split among traditional SQL, NoSQL and NewSQL.
Constructing a complete big data application system requires fully considering the 4V characteristics of the data [Gupta R, Gupta H, Mohania M. Cloud computing and big data analytics: what is new from databases perspective? [C]//Proc of 1st BDA. New Delhi, India: Springer Berlin Heidelberg, 2012: 42-61], and then carrying out analysis, association mining and even scientific discovery on the big data. Taking biological scientific data as an example: sequencing, mass spectrometry, nuclear magnetic resonance and other instruments generate a large amount of microscopic data every day, such as gene sequence files, protein sequence files, and protein structure and function data; macroscopic data such as species information, physiological and biochemical properties, and reaction conditions are stored in MongoDB or traditional SQL databases; and there is also a large amount of knowledge information such as literature and patents. To better support knowledge discovery, researchers often introduce biological ontologies and manage the large-scale associations among species, proteins, genes and other data as an RDF (Resource Description Framework) association network. Together, this micro-level and macro-level information forms an organic whole for understanding and studying life at an overall level. Data-driven scientific discovery usually has to be completed by scheduling a series of data pipelines that span data acquisition, batch writing, query, analysis, visualization and other steps. This raises a major problem: how can pipeline programmers access and manipulate data in a uniform manner without considering the differences in the underlying data storage models?
This problem maps to data management technology as follows: how to cross the boundaries of SQL, NoSQL and NewSQL databases, realize uniform data access across multiple data models, and provide a uniform data operation interface for computing frameworks such as Hadoop and Spark.
At present, relational databases range from distributed databases to in-memory databases and mainly include MySQL, PostgreSQL, Oracle, SQLite and the like. They ensure consistent data access through ACID transactions and process data using tables, columns and keys, which makes them suitable for applications with fixed structures and strong consistency requirements. In October 1986, ANSI adopted SQL as the standard language for relational database management systems (ANSI X3.135-1986), and ISO subsequently adopted it as an international standard. SQL is thus the most widely used relational database query language today.
NoSQL databases include key-value databases, column databases, document databases and graph databases. Because NoSQL databases currently lack a uniform query language, some research has focused on wrapping NoSQL databases with an SQL query interface. For example, Hive provides HQL, an SQL-like query language, which reduces the difficulty of using NoSQL databases. Spark SQL is an SQL implementation built on the Spark DataFrame big data processing framework and supports SQL-based big data processing and analysis. Based on DataFrame, Spark can provide SQL query and analysis capability over big data for many current databases such as MySQL, HBase, Cassandra and MongoDB.
As an important branch of NoSQL databases, graph databases are often used to manage large-scale association information, such as associations between species and genes, social relationships among people, and Amazon's warehouse and retail master data systems, and they support fast association retrieval based on the attribute graph model. Typical graph databases currently include Neo4j, Titan, Virtuoso and the like. For graph databases, Neo4j provides the Cypher query language, which uses a concise SQL-like syntax to express association queries over the graph data model and reduces the difficulty of using graph databases. The TinkerPop project proposes Gremlin, a graph traversal and analysis language for attribute graphs, which supports various graph databases such as Titan, OrientDB and TinkerGraph and has been called the Perl of the graph database world. In addition, the RDF model is a semantic description framework based on a graph model and is suitable for expressing semantic information and its associations; typical RDF databases currently include Jena, Virtuoso and the like. The RDF Data Access Working Group published the first RDF query language, SPARQL, in 2004, and the SPARQL protocol and query language formally became a W3C recommendation in 2008. SPARQL adopts a structured query style and realizes association queries through subgraph matching in its WHERE clause; at present, most RDF databases support standard SPARQL queries.
Clearly, there is still no unified query language for SQL and NoSQL databases, and graph databases, with their special query and analysis methods, are usually placed on the opposite side from the SQL query language. Therefore, when selecting a data model, one often has to choose between an SQL database (including SQL-enabled NoSQL databases) and a graph database. This choice usually leads to differences in upper-level applications: should one adopt the Gremlin and SPARQL query languages, with their strong association analysis capability, or the traditional SQL query language based on two-dimensional tables?
Based on the above background, the present invention provides a new query language Simba, which is used to implement unified query of relational databases, graph databases, and file systems.
Disclosure of Invention
The invention aims to provide an interactive query method suitable for various big data management systems, which can realize query aiming at a relational database, a graph database and a file system through a uniform query language Simba.
The technical scheme adopted by the invention is as follows:
An interactive query method suitable for a plurality of big data management systems comprises the following steps:
1) establishing an associated document model which comprises a document set and an association set, wherein the association set is a set formed by the associations between documents;
2) converting different original data models into associated document models, and connecting different data sources into a whole through the associated document models;
3) establishing a unified query language suitable for multivariate data based on the associated document model;
4) using the unified query language suitable for multivariate data to realize unified query of relational databases, graph databases and file systems.
Further, the unified query language suitable for multivariate data management systems comprises four clauses: FIND, WITH, WHERE and RETURN. The FIND statement determines the basic variables of the query, each of which must represent a document; the WITH statement determines intermediate variables used in the matching condition syntax; the WHERE statement determines the conditions that the returned results must satisfy; the RETURN statement contains the data references to be returned to the user.
Further, the basic query space in the FIND statement is composed of one or more classes of documents, and the associated document model is not allowed to compare two classes of documents that have no association between them; variables defined in the WITH statement implicitly define the expansion of the documents and associations in the basic query space; the WHERE statement can implicitly define documents and associations that expand the query space and can also perform the selection operation of the associated document model; the RETURN statement contains a URL at the document, link or attribute level, or a variable representing such a URL, and mainly performs the projection operation of the associated document model, the returned result being an associated document.
Further, the execution process of the unified query language is divided into four steps: determining documents, establishing relations among the documents, selecting and projecting.
Further, different data sources are connected into a whole through the associated document model to form a network, and the data reference grammar of the uniform query language is formed in a URL-like form to uniformly access data in the network.
Further, intermediate variables in the unified query language represent document sets, numerical values and character strings related to the basic search space, the intermediate variables are used in matching grammar, and corresponding condition matching operation is carried out according to types of the intermediate variables.
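The four-clause structure described above can be illustrated with a small sketch. The following Python fragment is not the parser of the invention; it is a minimal regex-based illustration that splits a statement into its FIND, WITH, WHERE and RETURN parts. The function name split_clauses and the pattern are assumptions made for this example, and keywords that appear inside quoted strings are not handled.

```python
import re

# Hypothetical helper: split a SimbaQL statement into its four clauses.
# Keywords are matched case-insensitively; WITH, WHERE and RETURN are optional.
CLAUSE_RE = re.compile(
    r"^\s*FIND\s+(?P<find>.+?)"
    r"(?:\s+WITH\s+(?P<with>.+?))?"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+RETURN\s+(?P<ret>.+?))?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_clauses(statement):
    """Return a dict with the text of each clause (None if the clause is absent)."""
    m = CLAUSE_RE.match(statement)
    if m is None:
        raise ValueError("not a SimbaQL statement: " + statement)
    return {
        "FIND": m["find"],
        "WITH": m["with"],
        "WHERE": m["where"],
        "RETURN": m["ret"],
    }


clauses = split_clauses(
    "FIND x, y WITH $m = x.child WHERE $m.child = y RETURN x, y"
)
print(clauses)
```

A real implementation would tokenize the statement first so that string literals containing the word WHERE or RETURN are not mistaken for clause boundaries.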
The intermediate language provided by the invention is independent of any specific operating system or programming language. Because the Simba language covers operations of several data models, some of which a given database cannot complete directly, in practice an SDK (Software Development Kit) for the Simba language can be developed for the database system to perform compensation operations on top of the database's native query language. For example, a Java package (or C++ package) capable of understanding and executing the Simba language can be developed for the MongoDB database, so that a client can operate MongoDB in the Simba language by calling the API (Application Programming Interface) in the SDK; that is, the data management system is queried through a SimbaQL translator, as shown in Fig. 1.
Alternatively, the database designs the communication protocol directly based on the Simba language, and the client program can obtain the required query result by sending the network request of the Simba command, and the mode is as shown in fig. 2.
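The compensation idea mentioned for the SDK approach can be sketched as follows. This Python example assumes the documents have already been fetched from a store whose native query language lacks a join; plain lists of dictionaries stand in for the query results, no real database driver is called, and the helper name compensate_join and the field names are hypothetical.

```python
def compensate_join(left_docs, right_docs, left_key, right_key):
    """Client-side hash join of two lists of dict documents.

    Pairs every left document with every right document for which
    left[left_key] == right[right_key]. Documents missing the key are skipped.
    """
    index = {}
    for right in right_docs:
        if right_key in right:
            index.setdefault(right[right_key], []).append(right)

    joined = []
    for left in left_docs:
        if left_key in left:
            for right in index.get(left[left_key], []):
                joined.append((left, right))
    return joined


# Stand-in data: results of two separate queries against the underlying store.
persons = [{"id": 1, "name": "bluejoe"}, {"id": 2, "name": "alice"}]
software = [
    {"id": 10, "name": "simba", "personid": 1},
    {"id": 11, "name": "untracked"},  # no personid, so it joins with nobody
]

for person, soft in compensate_join(persons, software, "id", "personid"):
    print(person["name"], "->", soft["name"])
```

Building the index on one side keeps the join linear in the size of the two inputs, which matters when the compensation runs in the client rather than in the database.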
As shown in FIG. 3, the Simba query language of the present invention comprises the following parts:
1. SimbaQL syntax structure: gives the overall structure of SimbaQL and the syntax of each clause such as FIND, WITH, WHERE and RETURN;
2. Data reference syntax: explains how to refer to data in a data source in SimbaQL;
3. Intermediate variable syntax: explains how intermediate variables are defined and used in SimbaQL;
4. Matching condition syntax: explains how to write matching conditions in SimbaQL;
5. SimbaQL parsing program: provides a Java-based SimbaQL parsing program for writing SimbaQL client programs or query engine execution programs.
Compared with the prior art, the invention has the following advantages:
(1) A unified query language suitable for multivariate data management systems is proposed for the first time; the language enables unified query of relational databases, graph databases and file systems. It can retrieve records in a relational table that satisfy specified attribute conditions, vertices in a graph database that satisfy specified association conditions, and specific files in a file system. With current development techniques, an application must query relational databases, graph databases and file systems separately through SQL, Cypher/Gremlin and file system APIs (application programming interfaces); these differing methods make it necessary to master several languages and make programs non-portable. With SimbaQL, only one uniform syntax is required. This difference is shown in Fig. 4 and Fig. 5.
(2) The typical query patterns of big data management systems are summarized, and the complex SQL and graph query functions are simplified. The goal of SimbaQL is to cover most query requirements while being easy for mainstream data management systems to support, so complex functions of SQL and graph queries, such as sub-queries or the UNION of query results, are omitted. SimbaQL suggests that these secondary operations be completed by a big data computing framework; SimbaQL itself only performs simple data query and extraction.
(3) SimbaQL targets databases of multiple data models, and its queries include operations from each of these models. Therefore, if the query language of one model cannot perform operations of another model, the SimbaQL implementation can perform them on its behalf. For example, MongoDB's query language cannot perform JOIN operations on documents, while SimbaQL supports JOIN, so the SimbaQL implementation compensates for this operation.
(4) SimbaQL introduces the properties of intermediate variables to express and hide information that is not of interest to the user. Take the example of finding two entities with associations:
FIND x, y WITH $m = x.child WHERE $m.child = y RETURN x, y
This statement introduces the intermediate variable $m, which represents a child of x whose own child is y. The query returns all grandparent and grandchild pairs, but the application issuing the query does not need to care who $m specifically is.
Meanwhile, spelling repetition is avoided by the method, and the quick writing method is realized.
(5) SimbaQL introduces multi-level attribute references, for example: x.knows.knows.name represents the name of someone z who is known by someone y whom x knows. Conventional query languages do not support multi-level references. This approach effectively reduces code repetition and is intuitive.
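To make the role of the intermediate variable in advantage (4) concrete, the grandparent query can be evaluated by hand on toy data. The sketch below is only an illustration in Python: the dictionary child_links, the names in it and the function grandparent_pairs are invented for the example, and a real engine would not hard-code the query.

```python
# Toy "child" association: parent id -> list of child ids (hypothetical data).
child_links = {
    "ann": ["bob"],
    "bob": ["cat", "dan"],
    "cat": [],
    "dan": [],
}


def grandparent_pairs(links):
    """Evaluate: FIND x, y WITH $m = x.child WHERE $m.child = y RETURN x, y."""
    pairs = []
    for x, middles in links.items():
        for m in middles:                 # $m = x.child
            for y in links.get(m, []):    # $m.child = y
                pairs.append((x, y))      # $m itself is never returned
    return pairs


print(grandparent_pairs(child_links))
```

Only x and y appear in the result; the value bound to $m is used during matching and then discarded, which is the hiding effect the text describes.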
Drawings
Figure 1 illustrates the manner in which a query is made to a data management system by a SimbaQL translator.
Fig. 2 illustrates the manner in which queries are made directly to the data management system via the SimbaQL network protocol.
Fig. 3 shows a block diagram of the content of the present invention.
Fig. 4 shows that different management systems need to be queried in different languages in the prior art.
FIG. 5 shows the present invention using SimbaQL language to query different management systems in a unified way.
Fig. 6 is a schematic structural diagram of the LDM model.
Detailed Description
The invention is further illustrated by the following specific examples and the accompanying drawings.
The SimbaQL design of the invention is based on the Linked Document intermediate model (an associated document model, LDM for short), and achieves unified query of databases of various data models through the mapping between LDM operations and the operations of other models, together with the compensation operations of the SDK.
1. Linked Document model
1) Linked Document model definition
A document is a collection of attributes, and an attribute is a collection of data of the same type. Each document contains by default a uniquely identifying primary key (main code) attribute. The main code attribute plays a role similar to an IP address and must be a globally unique identifier; other attributes may be of any type, including document, association, custom types and so on. An association is a special document that must contain two attributes (from: main code, to: main code) and represents a relationship between two pieces of data; for example, a knows association between one person document and another person document represents that the first person knows the second. Both document sets and association sets must have a name identifier that conveys the semantics of the documents and associations in the set. The number of attributes in documents or associations of the same type may differ, which means that {'id': 'fff0', 'name': 'bluejoe', 'age': 30} can be a member of both the person class documents and the teacher class documents.
The LDM is a 2-tuple (document set, association set), where the association set consists of multiple sets of relationships between two classes of documents. The general structure of the LDM is shown in Fig. 6, in which Documents represents the document set, Links represents the association set, PersonDocument represents a set of documents such as persons, SoftwareDocument represents a set of documents such as software, InventLink represents a set of associations such as a person inventing software, 1 and 2 represent the main codes that uniquely identify documents, and attr1, attr2, … represent attributes of the documents.
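The definition above can be mirrored in a small in-memory data structure. The following Python sketch is an assumption-laden illustration, not the patented implementation: the class names Document, Link and LinkedDocumentModel, their fields and the targets method are all hypothetical, and for brevity Link is modeled as a separate class rather than as a special kind of document.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    id: str                                    # main code, assumed globally unique
    attrs: dict = field(default_factory=dict)  # any other attributes


@dataclass
class Link:
    id: str
    from_id: str                               # main code of the source document
    to_id: str                                 # main code of the target document
    attrs: dict = field(default_factory=dict)


@dataclass
class LinkedDocumentModel:
    documents: dict = field(default_factory=dict)  # set name -> {id: Document}
    links: dict = field(default_factory=dict)      # set name -> list of Link

    def add_document(self, set_name, doc):
        self.documents.setdefault(set_name, {})[doc.id] = doc

    def add_link(self, set_name, link):
        self.links.setdefault(set_name, []).append(link)

    def targets(self, link_set, from_id):
        """Main codes reachable from from_id through the named association set."""
        return [l.to_id for l in self.links.get(link_set, []) if l.from_id == from_id]


ldm = LinkedDocumentModel()
bluejoe = Document("fff0", {"name": "bluejoe", "age": 30})
ldm.add_document("person", bluejoe)    # the same document may belong to
ldm.add_document("teacher", bluejoe)   # several named document sets
ldm.add_document("software", Document("s1", {"name": "simba"}))
ldm.add_link("invent", Link("l1", from_id="fff0", to_id="s1"))
print(ldm.targets("invent", "fff0"))
```

Keeping document sets and association sets in separate name-keyed dictionaries follows the (document set, association set) pair of the definition and lets one document appear under several set names.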
2) LDM conversion rules
LDM is oriented toward the query and analysis of data and provides two types of conversion rules: conversion from the original data model to LDM, and conversion from LDM to existing programming models, which requires format conversion.
a) Original data model → LDM
The data model conversion is formally defined as (G, L, M), where G represents the schema of the global model, i.e. LDM; L represents the local data model (relational model, key-value model, document model, attribute graph model); and M represents the mapping rules from L to G. Converting the original data model into LDM mainly concerns the semantics of the data; conversion at the data type level can be decided by developers according to system requirements. The original data models covered below are the relational, key-value, document and attribute graph models, and the main conversion rules are shown in Table 1. A custom conversion rule extracts a data set that satisfies certain characteristics, according to the characteristics of the original data model. For example: in the key-value model, data whose keys contain Person is extracted as a Person class document set; in the attribute graph model, vertices whose label is Person are extracted as Person class documents; in the document model, the relation in which the personid of a Person class document equals the personid of a Software document is extracted as the connection set invent.
TABLE 1 Conversion rules from the original data models to LDM
LDM | Relational model | Key-value model | Document model | Attribute graph model
Attribute | Attribute | Key | Attribute | Attribute
Document | Record | Pair | Document | Vertex
Document set | Table | Custom | Collection | Custom
Connection | Foreign key | Custom | Custom | Edge
Connection set | Foreign key | Custom | Custom | Custom
It should be noted that in LDM both document sets and association sets must have a name, so when converting foreign keys of the relational model and the other custom parts, the converter must supply a name that conveys the semantics of the set's elements. For example, in the attribute graph model, nodes with the label 'person' may serve as person class documents in the LDM; nodes containing the attribute 'teacher' may likewise serve as teacher class documents, and in fact both classes of documents may correspond to the same node.
In addition, the conversion of the original model to the LDM may not be limited to the above model, and a developer may define conversion rules of other data models to the LDM, such as a file system, a column database, etc., according to requirements.
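The relational column of Table 1 (record to document, table to document set, foreign key to connection) can be sketched in a few lines. The Python function below is a hypothetical illustration: relational_to_ldm, its argument layout, and the choice that a connection points from the referenced row to the row holding the foreign key are assumptions of this example, and the converter supplies the connection name as the text requires.

```python
def relational_to_ldm(tables, foreign_keys):
    """Convert toy relational data into (document sets, connection sets).

    tables:       {table_name: [row dict containing an 'id' column]}
    foreign_keys: [(link_name, fk_table, fk_column, referenced_table)]
    """
    # Table -> document set, record -> document keyed by its main code.
    documents = {
        name: {row["id"]: dict(row) for row in rows}
        for name, rows in tables.items()
    }

    # Foreign key -> named connection set; dangling references are dropped.
    links = {}
    for link_name, fk_table, fk_column, ref_table in foreign_keys:
        links[link_name] = [
            {"from": row[fk_column], "to": row["id"]}
            for row in tables[fk_table]
            if row.get(fk_column) in documents[ref_table]
        ]
    return documents, links


tables = {
    "person": [{"id": 1, "name": "bluejoe"}],
    "software": [
        {"id": 10, "name": "simba", "personid": 1},
        {"id": 11, "name": "orphan", "personid": 99},  # no such person
    ],
}
documents, links = relational_to_ldm(
    tables, [("invent", "software", "personid", "person")]
)
print(links)
```

The same shape of function could be written for the other columns of Table 1, with the custom entries replaced by a caller-supplied predicate that selects which keys, vertices or documents belong to a named set.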
b) LDM → Programming model
Converting LDM into programming models mainly concerns relationships at the data structure level. The data structures accepted by currently popular programming models such as MapReduce, Spark SQL and Pregel are mainly arrays, tables and graphs. The rules for converting LDM into these three data structures are therefore given below, as shown in Table 2.
TABLE 2 conversion rules of LDM to array, table, graph
Figure BDA0001336546600000071
3) LDM operation rule
The operation rules of the LDM are defined on the basis of the operations of the relational model, key-value model, document model and attribute graph model. These include the set, join, selection and projection operations of the relational model; the get operation of the key-value model; the selection and projection operations of the document model; and the traversal and selection operations of the attribute graph model. The operations of the LDM fall into three categories: set operations, association operations and document operations. The specific operation rules are shown in Table 3.
TABLE 3 operational rules of LDM
Figure BDA0001336546600000072
Figure BDA0001336546600000081
4) LDM data access rules
Since LDM links databases to a network, we can use a URL-like format to reference data in the network. This URL is of the form:
<datasource>.<document>.<link>.<identity>.<propertyName>
Here datasource represents a data source such as MySQL or MongoDB, document represents a document mapped from the data source to the LDM, link represents an association mapped from the data source to the LDM, identity represents the main code of the document, and propertyName represents an attribute name of the document.
Data can be referenced at different levels, for example, a reference to the name property of a person document in a MySQL database can be expressed as:
MySQL.person.name
Referencing the father association of a person document in a MongoDB database can be expressed as:
MongoDB.person.father
the association represents the document set corresponding to the association, and further reference can be carried out, such as
MongoDB.person.father.name
The data corresponding to the data reference URL is actually the result after the relational operation and the projection operation of the LDM. For example, the data represented by mongodb.
2. SimbaQL grammar structure
As with SQL and the relational model, each SimbaQL statement can be converted, based on the Linked Document model, into a Linked Document operation formula composed of the following operations from Table 3: "establish association" and "selection operation" in "association operations"; "selection operation" and "projection operation" in "document operations". A SimbaQL query statement mainly comprises the four clauses FIND, WITH, WHERE and RETURN, and its syntax structure is as follows:
FIND <documents>
WITH <variables>
WHERE <conditions>
RETURN <urls>
The FIND clause determines the basic variables of the query, and each variable must correspond to a class of documents in the Linked Document; the WITH statement determines intermediate variables used in the matching condition syntax, which may be of various data types and are not limited to documents; the WHERE statement determines the conditions that the returned results must satisfy; the RETURN statement contains the data references to be returned to the user. The LDM operation process corresponding to SimbaQL is given below.
First, the basic query space in FIND consists of one or more classes of documents, and SimbaQL requires that LDM cannot make comparisons between two classes of documents without associations. For example, querying data of a person object and a software object can be expressed as:
FIND MySQL.person p,MySQL.software s
If there is no association between the person document and the software document, selection operations can only be performed on person and software separately; a selection such as p.inventid = s.id is not allowed.
The variables defined in the WITH implicitly define the extension of the documents and associations in the basic query space, such as:
FIND person p WITH $soft = p.invent
The above statement indicates that the Linked Document being searched contains Software documents and the invent association. It is equivalent to:
FIND person p, software s WITH $soft = p.invent
The WHERE statement can not only implicitly define the document and the association for expanding the query space, but also can perform LDM selection operation. Such as:
FIND person p WHERE p.invent.name = 'simba'
then it is implicitly determined that the set of documents (person, software) and the associated set (invent) are included in the LDM, and the software document is required to satisfy the condition that the name attribute is 'simba'.
The RETURN statement may contain a URL of a document, a link, an attribute hierarchy, or a variable representing a URL. The statement mainly executes the projection operation of the LDM, and the returned result is a Linked Document.
In summary, the execution of SimbaQL is divided into four steps: determining documents, establishing relations among the documents, selection, and projection. Assume that the basic search space in FIND is A, B; that C and the association L1 between A and C are implicitly determined in the WHERE statement; that the selection condition is condition; that the projection space in the RETURN statement is space; and that the other documents obtained through the operations are doc. Then the LDM operation corresponding to the SimbaQL statement is:
$$result = \sigma_{space}\,\pi_{condition}\big((A \times_{d} B)\; A \times_{L1} C\big)$$
such as the SimbaQL statement:
FIND person p, software s WHERE p.name = 'bluejoe' AND p.invent.name = 'simbaql' RETURN p.name
the corresponding LDM operation is:
$$result = \sigma_{p.name}\,\pi_{p.name='bluejoe'\ \mathrm{and}\ software='simbaql'}(Person \times_{invent} Software)$$
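The four execution steps can be traced on toy data for the example statement above. The Python sketch below is a hand-written illustration under stated assumptions: the function run_example, the list-of-dictionaries data layout and the invent pair list are invented for this example, and the query is hard-coded rather than parsed.

```python
def run_example(persons, software, invent):
    """Trace: FIND person p, software s
              WHERE p.name = 'bluejoe' AND p.invent.name = 'simbaql'
              RETURN p.name
    """
    # Step 1: determine documents (the person and software document sets).
    person_by_id = {p["id"]: p for p in persons}
    software_by_id = {s["id"]: s for s in software}

    # Step 2: establish relations between the documents through invent.
    linked = [
        (person_by_id[src], software_by_id[dst])
        for src, dst in invent
        if src in person_by_id and dst in software_by_id
    ]

    # Step 3: selection on attribute values.
    selected = [
        (p, s) for p, s in linked
        if p["name"] == "bluejoe" and s["name"] == "simbaql"
    ]

    # Step 4: projection onto the returned reference p.name.
    return [p["name"] for p, _ in selected]


persons = [{"id": 1, "name": "bluejoe"}, {"id": 2, "name": "alice"}]
software = [{"id": 10, "name": "simbaql"}, {"id": 11, "name": "other"}]
invent = [(1, 10), (2, 11)]  # (person id, software id) pairs
print(run_example(persons, software, invent))
```

Step 2 only pairs documents that are connected by an association, which reflects the rule that two classes of documents without an association cannot be compared.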
3. Data reference syntax (also called attribute expression syntax, as shown in FIG. 3)
Since LDM links databases to a network, we can use a URL-like format to reference data in the network. This URL is of the form:
<datasource>.<document>.<link*>.<identity>.<propertyName>
the data source represents a data source registered in the associated document, the document represents a document class in the associated document, the link represents a connection set in the associated document, the URL may contain a plurality of links, the identity represents the id of the document, and the propertyName represents the attribute of the document.
Data can be referenced at different levels, for example, a reference to the name property of a person document in a MySQL database can be expressed as:
MySQL.person.name
Referencing the father association of a person document in a MongoDB database can be expressed as:
MongoDB.person.father
the association represents the document set corresponding to the association, and further reference can be carried out, such as
MongoDB.person.father.name
The data corresponding to the data reference URL is actually the result after the relational operation and the projection operation of the LDM. For example, the data represented by mongodb.
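How a URL-like reference might be resolved against linked data can be sketched as follows. This Python example is only an illustration: the nested dictionary layout of ldm, the "target" and "pairs" keys, and the function resolve are assumptions made here, and the optional identity segment of the reference is not handled.

```python
def resolve(ldm, url):
    """Resolve '<datasource>.<document>.<link*>.<propertyName>' on toy data.

    Each segment after the document set is either the name of a link set
    (followed to its target documents) or, in last position, a property name
    (projected from the current documents).
    """
    source, doc_set, *rest = url.split(".")
    data = ldm[source]
    current = list(data["documents"][doc_set].values())
    for pos, segment in enumerate(rest):
        link = data["links"].get(segment)
        if link is not None:
            ids = {doc["id"] for doc in current}
            targets = data["documents"][link["target"]]
            current = [targets[dst] for src, dst in link["pairs"] if src in ids]
        elif pos == len(rest) - 1:
            return [doc[segment] for doc in current if segment in doc]
        else:
            raise KeyError("unknown link: " + segment)
    return current


# Hypothetical data source registered under the name "MongoDB".
ldm = {
    "MongoDB": {
        "documents": {
            "person": {
                1: {"id": 1, "name": "son"},
                2: {"id": 2, "name": "dad"},
            }
        },
        "links": {"father": {"target": "person", "pairs": [(1, 2)]}},
    }
}

print(resolve(ldm, "MongoDB.person.name"))
print(resolve(ldm, "MongoDB.person.father.name"))
```

Because every link segment yields a new document set, several links can be chained in one reference, which is what the `<link*>` part of the format allows.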
4. Intermediate variable syntax
An intermediate variable may represent a document set, a value or a string related to the basic search space. A variable is written as a $ symbol followed by an identifier and is defined with the WITH statement, such as:
WITH $c1 = p.knows.knows (document set)
WITH $c2 = 123 (value)
WITH $c3 = 'bluejoe' (string)
The intermediate variables are used in the matching grammar, and corresponding condition matching operation is carried out according to the types of the intermediate variables. When the intermediate variable is a document set, its main role is to replace a part of the content of the data reference URL.
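The three kinds of intermediate variable can be told apart from the right-hand side of a definition. The following Python sketch is a hypothetical illustration: the function classify_variable and its regular expressions are assumptions of this example and cover only the three forms shown above.

```python
import re

# "$name = right-hand side", with optional spaces around "=".
DEFINITION_RE = re.compile(r"^\s*\$(\w+)\s*=\s*(.+?)\s*$")


def classify_variable(definition):
    """Return (name, kind, value) for one variable definition of a WITH clause."""
    m = DEFINITION_RE.match(definition)
    if m is None:
        raise ValueError("not a variable definition: " + definition)
    name, rhs = m.groups()

    if re.fullmatch(r"'[^']*'", rhs):           # quoted text -> string
        return (name, "string", rhs[1:-1])
    if re.fullmatch(r"-?\d+", rhs):             # digits -> integer value
        return (name, "value", int(rhs))
    if re.fullmatch(r"-?\d+\.\d+", rhs):        # decimal -> numeric value
        return (name, "value", float(rhs))
    return (name, "document set", rhs)          # otherwise a data reference


for text in ["$c1 = p.knows.knows", "$c2 = 123", "$c3 = 'bluejoe'"]:
    print(classify_variable(text))
```

Knowing the kind of a variable up front lets the matching step choose between comparing single values and testing membership in a document set.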
5. Matching condition grammar
The match condition is an expression of Boolean type returned by the WHERE statement. The grammatical rules of the expression are as follows:
1) Association screening of document sets A and B: (<document A>.link | <document A>) = <document B>
2) Screening of a document set: (<document>.attribute | <association>.attribute) operator <value of a basic data type>
3) < expression > AND | OR < expression >
The currently supported operators include comparison operators such as = and >. When <association>.attribute or <document A>.link denotes a collection of documents, the = operator means "exists". For example, p.knows.name = 'bluejoe' means that a person named 'bluejoe' exists among the persons known to p, and p.knows = p1 means that p1 exists among the persons known to p.
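The "exists" reading of = on a document collection can be sketched in Python as follows. The in-memory dictionaries, the helper names follow, exists_equal, and contains, and the sample persons are all assumptions for illustration; they are not part of SimbaQL.

```python
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


def follow(docs: Iterable[Document], link: str) -> List[Document]:
    """Collect the documents reachable from docs through one link."""
    reached: List[Document] = []
    for doc in docs:
        reached.extend(doc.get(link, []))
    return reached


def exists_equal(docs: Iterable[Document], attribute: str, value: Any) -> bool:
    """<association>.attribute = value: true if any document in the set matches."""
    return any(doc.get(attribute) == value for doc in docs)


def contains(docs: Iterable[Document], target: Document) -> bool:
    """<document A>.link = <document B>: true if the target is in the set."""
    return any(doc is target for doc in docs)


# Made-up sample data: p knows p1 and p2.
p1 = {"name": "bluejoe", "knows": []}
p2 = {"name": "alice", "knows": []}
p = {"name": "bob", "knows": [p1, p2]}

known = follow([p], "knows")
print(exists_equal(known, "name", "bluejoe"))  # p.knows.name = 'bluejoe'
print(contains(known, p1))                     # p.knows = p1
```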
It should be noted that although the WHERE statement corresponds to the selection operation of the LDM, the selection operation in SimbaQL only selects the attribute value of the document. Such as:
FIND person p, software s WHERE p.invent = s AND s.name = 'SimbaQL' RETURN p.name
Although the condition includes p.invent = s, that part only establishes the association between the two document sets; the actual selection condition is s.name = 'SimbaQL'.
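The division of labour in the query above can be sketched over in-memory data. This is only an illustration under stated assumptions: the function find_inventor_names, the id-based invent lists, and the sample records are invented for the example and say nothing about how SimbaQL executes against real data sources.

```python
from typing import Any, Dict, List


def find_inventor_names(
    persons: List[Dict[str, Any]],
    software: Dict[str, Dict[str, Any]],
    software_name: str,
) -> List[str]:
    """Evaluate the example query over plain Python data."""
    names: List[str] = []
    for p in persons:                               # FIND person p
        for software_id in p["invent"]:             # p.invent = s relates p to s
            s = software[software_id]
            if s["name"] == software_name:          # selection on s.name
                names.append(p["name"])             # projection: RETURN p.name
    return names


# Made-up sample data.
persons = [
    {"name": "bluejoe", "invent": ["s1"]},
    {"name": "alice", "invent": ["s2"]},
]
software = {"s1": {"name": "SimbaQL"}, "s2": {"name": "OtherTool"}}

print(find_inventor_names(persons, software, "SimbaQL"))
```

The inner loop plays the role of the association, while the only value comparison is the one on s.name.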
6. SimbaQL parser
The SimbaQL parsing program mainly includes the following classes:
Overall grammar structure classes: Statement, SearchSpace, VariableDefines, Conditions, and SubSpace, whose meanings are given in Table 1 below.
Syntax tree abstract classes and interfaces: Node, Condition, Variable, Document, AttributeDocument (a document derived from an attribute, such as p.knows or $p.knows), and ValueExpression (an expression of value type).
Syntax tree concrete classes: RawDocument, RawAttribute, RawVariable, WithVariable, StringValue, IntegerValue, TerminalCondition, And, Or, Not, VaribleAttribute, DocumentReference, and Operator. RawDocument represents a document, such as Person p in FIND Person p; RawAttribute represents an attribute such as p.name; VaribleAttribute represents an attribute derived from a variable, such as $k.name; WithVariable represents a variable defined by WITH, such as $k in WITH $k = p.knows; RawVariable represents a document variable defined by FIND, such as p; StringValue and IntegerValue represent strings and integers, respectively; And, Or, and Not represent the logical connectives AND, OR, and NOT in an expression; DocumentReference represents a connection such as knows in p.knows.name; Operator represents a comparison operator; and TerminalCondition represents an indivisible expression such as p.age > 30.
The basic information of the above classes is shown in Table 4:
Table 4. Basic information of the abstract syntax tree classes
[Table 4 appears as images BDA0001336546600000121 and BDA0001336546600000131 in the original publication.]
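How the condition classes of such a syntax tree might fit together can be sketched in Python. The class names And, Or, Not, and TerminalCondition follow the list above, but the fields, the evaluate method, and the flat row dictionary are assumptions for this example; the actual classes are Java classes whose members are given only in Table 4.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict

# Only the comparison operators that appear in the text are included here.
OPERATORS = {"=": operator.eq, ">": operator.gt}


class Condition:
    """Base class for nodes that evaluate to a Boolean."""

    def evaluate(self, row: Dict[str, Any]) -> bool:
        raise NotImplementedError


@dataclass
class TerminalCondition(Condition):
    """An indivisible comparison such as p.age > 30."""
    attribute: str
    op: str
    value: Any

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return OPERATORS[self.op](row[self.attribute], self.value)


@dataclass
class And(Condition):
    left: Condition
    right: Condition

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return self.left.evaluate(row) and self.right.evaluate(row)


@dataclass
class Or(Condition):
    left: Condition
    right: Condition

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return self.left.evaluate(row) or self.right.evaluate(row)


@dataclass
class Not(Condition):
    operand: Condition

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return not self.operand.evaluate(row)


# p.age > 30 AND NOT p.name = 'bluejoe'
condition = And(
    TerminalCondition("p.age", ">", 30),
    Not(TerminalCondition("p.name", "=", "bluejoe")),
)
print(condition.evaluate({"p.age": 35, "p.name": "alice"}))
```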
In addition to the Java classes of the abstract syntax tree above, the parser also contains the lexical and syntactic parsing classes of ANTLR 4: SimbaQLLexer, SimbaQLParser, SimbaQLBaseListener, SimbaQLBaseVisitor, and SimbaQLVisitor. SimbaQLLexer is the lexical analyzer for SimbaQL statements generated by ANTLR 4 and checks whether the words in a statement conform to the grammar; SimbaQLParser is the syntactic parser for SimbaQL; SimbaQLBaseListener and SimbaQLBaseVisitor are the base classes for accessing the syntax tree in listener mode and visitor mode, respectively; SimbaQLVisitor inherits from SimbaQLBaseVisitor and accesses the syntax tree in visitor mode.
Abstract syntax tree building class: AstBuilder constructs the abstract syntax tree; it takes a SimbaQL statement as input and outputs a syntax tree rooted at Statement.
Syntax error checking classes: AstChecker and DBchecker. AstChecker detects query statements that do not conform to the syntax; DBchecker detects content in a query statement that conflicts with the data source, for example a query statement that contains p.knows when the data source has no knows connection.
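The kind of check attributed to DBchecker can be sketched in Python as a walk over a link path. The schema layout (document class mapped to its connections and their target classes), the alias table, and the function find_unknown_links are hypothetical; the sketch handles link paths only, not attribute names.

```python
from typing import Dict, List

# Hypothetical schema: document class -> {connection name -> target document class}.
Schema = Dict[str, Dict[str, str]]


def find_unknown_links(path: str, aliases: Dict[str, str], schema: Schema) -> List[str]:
    """Return error messages for a link path such as "p.knows.knows"."""
    alias, *segments = path.split(".")
    if alias not in aliases:
        return [f"unknown variable '{alias}'"]
    current = aliases[alias]
    for segment in segments:
        target = schema.get(current, {}).get(segment)
        if target is None:
            # The data source defines no such connection for this document class.
            return [f"{current} has no '{segment}' connection"]
        current = target
    return []


schema: Schema = {
    "person": {"knows": "person", "invent": "software"},
    "software": {},
}
aliases = {"p": "person"}

print(find_unknown_links("p.knows.knows", aliases, schema))
print(find_unknown_links("p.likes", aliases, schema))
```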
Syntax parsing example (outputs the syntax parse tree): SimbaParser and TreePrinter. SimbaParser is an example program that constructs the syntax parse tree of a query statement and prints its structure; TreePrinter is the program that prints a syntax parse tree.
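Printing a parse tree amounts to a depth-first walk with indentation, which the following Python sketch shows. The tuple-based tree, the function format_tree, and the particular nesting of Statement, SearchSpace, and Conditions are assumptions for illustration; the source does not show the printed layout.

```python
from typing import List, Tuple

# A node is (label, children); this shape is an assumption for the sketch.
TreeNode = Tuple[str, list]


def format_tree(node: TreeNode, depth: int = 0) -> List[str]:
    """Return one indented line per node, parents before children."""
    label, children = node
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(format_tree(child, depth + 1))
    return lines


tree: TreeNode = (
    "Statement",
    [
        ("SearchSpace", [("RawDocument person p", [])]),
        ("Conditions", [("TerminalCondition p.age > 30", [])]),
    ],
)
print("\n".join(format_tree(tree)))
```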
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (6)

1. An interactive query method suitable for a plurality of big data management systems comprises the following steps:
1) establishing an association document model which comprises a document set and an association set, wherein the association set is a set formed by the association between documents;
2) converting different original data models into associated document models, and connecting different data sources into a whole through the associated document models;
3) establishing a uniform query language suitable for multivariate data based on the associated document model;
4) unified query of a relational database, a graph database and a file system is realized by using a unified query language suitable for multivariate data;
each document in the document set of the associated document model is a set formed by a group of attributes, and each attribute is a set formed by data of the same type; each document contains a primary key attribute by default, and the primary key attribute is a globally unique identifier; each document set and each association set has a name identifier that explains the semantics of the documents and associations in the set;
the uniform query language suitable for the multivariate data management system comprises four clauses of FIND, WITH, WHERE and RETURN; the FIND statement determines the basic variables of the query, which must represent the document; the WITH statement determines intermediate variables used in matching the conditional grammar; the WHERE statement determines the conditions which need to be met when the returned result is returned; the RETURN statement contains the data reference that needs to be returned to the user;
the basic query space in the FIND statement is composed of one type of document or multiple types of documents, and the associated document model does not permit comparison between two types of documents that have no association; the expansion of the documents and associations in the basic query space is implicitly defined in the WITH statement; the WHERE statement can implicitly define the documents and associations that expand the query space and can also perform the selection operation of the associated document model; the RETURN statement comprises a URL at the document, link, or attribute level, or a variable representing such a URL; this statement mainly executes the projection operation of the associated document model, and the returned result is an associated document;
connecting different data sources into a whole through the associated document model to form a network, and forming data reference grammar of the uniform query language in a form similar to URL (uniform resource locator) to uniformly access data in the network; the URL is of the form:
<datasource>.<document>.<link>.<identity>.<propertyName>
here datasource represents a data source, document represents a document mapped from the data source to the associated document model, link represents an association mapped from the data source to the associated document model, identity represents the primary key of the document, and propertyName represents the attribute name of the document.
2. The method of claim 1, wherein the execution of the unified query language is divided into four steps: determining documents, establishing relations among the documents, selecting and projecting.
3. The method of claim 1, wherein intermediate variables in the unified query language represent document sets, values, strings associated with the basic search space, the intermediate variables being used in the matching syntax, and corresponding conditional matching operations being performed according to the type of the intermediate variables.
4. The method of claim 1, wherein the matching condition in the unified query language is an expression of Boolean type returned by the WHERE statement, and the syntax rules of the expression are as follows:
1) association screening of document sets A and B: (<document A>.link | <document A>) = <document B>;
2) screening of a document set: (<document>.attribute | <association>.attribute) operator <value of a basic data type>;
3) < expression > AND | OR < expression >.
5. The method of claim 1, wherein the parser in the unified query language comprises: the whole grammar structure related class, grammar tree abstract class and interface, grammar tree concrete class.
6. The method of claim 1, wherein in real applications, an SDK of a database system is developed for the unified query language and some compensation operations are performed on the basis of a local database query language, and then a client program operates the database using the unified query language by calling an API in the SDK; or a communication protocol is designed for the database directly based on the uniform query language, and the client program obtains the required query result by sending a network request.
CN201710515380.6A 2017-06-29 2017-06-29 Interactive query method suitable for various big data management systems Active CN107515887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710515380.6A CN107515887B (en) 2017-06-29 2017-06-29 Interactive query method suitable for various big data management systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710515380.6A CN107515887B (en) 2017-06-29 2017-06-29 Interactive query method suitable for various big data management systems

Publications (2)

Publication Number Publication Date
CN107515887A CN107515887A (en) 2017-12-26
CN107515887B true CN107515887B (en) 2021-01-08

Family

ID=60721837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710515380.6A Active CN107515887B (en) 2017-06-29 2017-06-29 Interactive query method suitable for various big data management systems

Country Status (1)

Country Link
CN (1) CN107515887B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109951B (en) * 2017-12-29 2022-12-06 华为技术有限公司 Correlation query method, database application system and server
CN109033260B (en) * 2018-07-06 2021-08-31 天津大学 Knowledge graph interactive visual query method based on RDF
CN110765151A (en) * 2018-07-27 2020-02-07 北京国双科技有限公司 Calculation formula processing method and device
CN109241054A (en) * 2018-08-02 2019-01-18 成都松米科技有限公司 A kind of multimodal data library system, implementation method and server
CN111221785A (en) * 2018-11-27 2020-06-02 中云开源数据技术(上海)有限公司 Semantic data lake construction method of multi-source heterogeneous data
CN112148925B (en) * 2019-06-27 2024-03-01 北京百度网讯科技有限公司 User identification association query method, device, equipment and readable storage medium
CN111475534B (en) * 2020-05-12 2023-04-14 北京爱笔科技有限公司 Data query method and related equipment
CN112084248A (en) * 2020-09-11 2020-12-15 党丹 Intelligent data retrieval, lookup and model acquisition method based on graph database
CN112632037B (en) * 2020-12-24 2023-04-07 浪潮通用软件有限公司 Method and device for graphically defining query data set
CN113761290A (en) * 2021-03-10 2021-12-07 中科天玑数据科技股份有限公司 Query method and query system for realizing full-text search graph database based on SQL
CN113282625B (en) * 2021-05-31 2022-10-04 重庆富民银行股份有限公司 SQL-based API data query and processing system and method
CN113515610B (en) * 2021-06-21 2022-09-13 中盾创新数字科技(北京)有限公司 File management method based on object-oriented language processing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073701A (en) * 2010-12-30 2011-05-25 浪潮集团山东通用软件有限公司 Semantic definition-based multi-data source data querying method
CN105468702A (en) * 2015-11-18 2016-04-06 中国科学院计算机网络信息中心 Large-scale RDF data association path discovery method
CN106294402A (en) * 2015-05-21 2017-01-04 阿里巴巴集团控股有限公司 The data search method of a kind of heterogeneous data source and device thereof
CN106372177A (en) * 2016-08-30 2017-02-01 东华大学 Query expansion method supporting correlated query and fuzzy grouping of mixed data type

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7702625B2 (en) * 2006-03-03 2010-04-20 International Business Machines Corporation Building a unified query that spans heterogeneous environments

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
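Research on the application of OpenCSDB linked data in scientific databases; Shen Zhihong et al.; Journal of Library Science in China; 20120915; full text *
A survey of NoSQL-based RDF data storage and query techniques; Wang Linbin et al.; Application Research of Computers; 20150531; Vol. 32 (No. 5); full text *
Research and implementation of a federated query system for massive RDF data based on database clusters; Xu Shichao et al.; E-Science Technology & Application; 20160120; Vol. 7 (No. 1); abstract, page 25 paragraphs 1-6, sections 2-3, Fig. 5 *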

Similar Documents

Publication Publication Date Title
CN107515887B (en) Interactive query method suitable for various big data management systems
Mena et al. OBSERVER: An approach for query processing in global information systems based on interoperation across pre-existing ontologies
US20100017395A1 (en) Apparatus and methods for transforming relational queries into multi-dimensional queries
CN106934062A (en) A kind of realization method and system of inquiry elasticsearch
JP6720641B2 (en) Data constraint of multilingual data tier
US20110161352A1 (en) Extensible indexing framework using data cartridges
Unbehauen et al. Knowledge extraction from structured sources
US11914631B2 (en) Systems and methods for using an ontology to generate database entries and access and search a database
CN107491476B (en) Data model conversion and query analysis method suitable for various big data management systems
Abbes et al. MongoDB-based modular ontology building for big data integration
Scharffe et al. Correspondence patterns for ontology alignment
CN117093599A (en) Unified SQL query method for heterogeneous data sources
Michel et al. Translation of Heterogeneous Databases into RDF, and Application to the Construction of a SKOS Taxonomical Reference
Michel et al. A generic mapping-based query translation from SPARQL to various target database query languages
Natarajan et al. [Retracted] Schema‐Based Mapping Approach for Data Transformation to Enrich Semantic Web
CN116108194A (en) Knowledge graph-based search engine method, system, storage medium and electronic equipment
Michel et al. Bridging the semantic web and NoSQL worlds: generic SPARQL query translation and application to MongoDB
Feng et al. Geoqamap-geographic question answering with maps leveraging LLM and open knowledge base (short paper)
Palopoli et al. Experiences using DIKE, a system for supporting cooperative information system and data warehouse design
CN113221528B (en) Automatic generation and execution method of clinical data quality evaluation rule based on openEHR model
CN113032366A (en) SQL syntax tree analysis method based on Flex and Bison
Su-Cheng et al. Mapping of extensible markup language-to-ontology representation for effective data integration
Alaoui et al. Semantic Oriented Data Modeling for Enterprise Application Engineering Using Semantic Web Languages
Kalna et al. MDA transformation process of a PIM logical decision-making from NoSQL database to big data NoSQL PSM
Malik et al. Technique for transformation of data from RDB to XML then to RDF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant