CN106372177B

CN106372177B - Query extension method supporting associated queries over mixed data types and fuzzy grouping

Info

Publication number: CN106372177B
Application number: CN201610783143.3A
Authority: CN
Inventors: 黄晓虎; 王杰; 薛皓; 王梅
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2016-08-30
Filing date: 2016-08-30
Publication date: 2019-09-27
Anticipated expiration: 2036-08-30
Also published as: CN106372177A

Abstract

The present invention provides a query extension method that supports associated queries over mixed data types and fuzzy grouping, comprising the following steps: step 1, build the architecture; step 2, store the data; step 3, extend the query syntax; step 4, parse the query; step 5, perform the hybrid join; step 6, perform fuzzy grouping; step 7, encapsulate and return the result. In a distributed database environment, mixed-type data cannot be joined by a given rule, and aggregation over data of a specified type is functionally limited. To address this, the invention provides users with an extended SQL syntax for aggregation and joining, so that extended queries such as fuzzy grouping and fuzzy joins can be completed by writing the specified statements. This extends the functionality and adaptability of distributed databases.

Description

Query extension method supporting associated queries over mixed data types and fuzzy grouping

Technical field

The present invention relates to a method for querying mixed-type data that supports fuzzy joins and fuzzy grouping.

Background art

With the rapid development of computer and information technology and its growing adoption across industries, data on the scale of hundreds of TB, or even tens to hundreds of PB, is generated and collected every day. The sheer volume and heterogeneity of this data pose a major challenge to traditional database technology, and to centralized databases in particular. To give widely used open-source centralized databases such as MySQL and PostgreSQL distributed support, a series of database middleware products has emerged. Such middleware offers users a transparent way to build a database cluster, allows existing single-machine centralized databases and applications to be migrated smoothly to the cloud, and has become an important distributed data management solution. At the same time, distributed database middleware can integrate different types of underlying databases and applications. If relational databases and NoSQL stores are integrated uniformly at the bottom layer, mixed data from different sources and with different structures can be stored, searched, and managed adaptively, enabling effective management of heterogeneous big data. However, the query capability of current SQL statements is limited and cannot support the most frequently used query operations, such as joins and grouping, over mixed-type data. For mixed-type data, it is therefore necessary to implement unified storage through middleware and to extend the query capability so that associated queries over mixed data are supported.

Summary of the invention

The purpose of the present invention is to extend the functionality of SQL statements on top of distributed database middleware, so that mixed-data queries including fuzzy grouping and fuzzy joins can be completed.

To achieve the above purpose, the technical solution of the present invention provides a query extension method supporting associated queries over mixed data types and fuzzy grouping, characterized in that it comprises the following steps:

Step 1: build the mixed-data storage architecture;

Step 2: store the mixed-type data. Taking the column as the unit, data is stored in the corresponding node database according to its data type. This step comprises:

Step 2.1: create the table. According to the configuration file and the field types specified in the SQL statement, the tables containing the specified field data are stored in the corresponding databases, specifically:

Step 2.1.1: obtain the configuration information and confirm that the table to be created is included in the vertical-sharding configuration;

Step 2.1.2: parse and decompose the statement. According to the field type and field length of each field in the SQL CREATE TABLE statement, the fields are divided into structured and unstructured attributes and written to a file that serves as the index for subsequent insert, delete, update, and query operations; on this basis, a sub-statement is built for each database shard;

Step 2.1.3: route and distribute. Each sub-statement obtained in step 2.1.2 is bound to the database of the corresponding type in the configuration and routed to it, completing table creation;

Step 2.2: insert data. According to the correspondence in the index file between the data to be inserted and the databases, the data is stored in the tables of the corresponding databases, specifically:

Step 2.2.1: obtain the configuration information and confirm that the name of the table being inserted into is included in the vertical-sharding configuration;

Step 2.2.2: parse and decompose the statement. The index file generated in step 2.1.2 is queried, and a sub-statement is built for each database shard according to the correspondence between attributes and tables in the file;

Step 2.2.3: route and distribute. Each sub-statement obtained in step 2.2.2 is bound to the database of the corresponding type in the configuration and routed to it, completing data insertion.
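The insert path in step 2.2 can be illustrated with a minimal Python sketch. It assumes an in-memory index that maps each field of a table to a database name; the `INDEX` contents, the `split_insert` helper, and the choice to copy a key field into every fragment are all hypothetical and are not taken from the patent text.

```python
from collections import defaultdict

# Hypothetical index: table name -> {field name -> database name}.
INDEX = {"doc": {"id": "mysql", "title": "mysql", "body": "mongodb"}}


def split_insert(table, row, key="id", index=INDEX):
    """Split one logical row into one fragment per target database."""
    per_db = defaultdict(dict)
    for field, value in row.items():
        per_db[index[table][field]][field] = value
    # Assumption: every fragment carries the key field so that the
    # fragments of one row can be matched up again later.
    for fragment in per_db.values():
        fragment.setdefault(key, row[key])
    return dict(per_db)


parts = split_insert("doc", {"id": 1, "title": "t", "body": "long text"})
print(parts)
```

Each fragment would then be turned into a sub-statement and routed to the database it is bound to.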

Step 3: extend the query syntax for mixed queries. Following the principles of simplicity and functionality, the SQL statement is designed as follows:

SELECT * | expression [AS output_name] [...] FROM from_item

[GROUP BY column] [CONTAIN r DIVIDED BY d] | [START WITH num1 PER num2] [WHERE condition]

Here, expression denotes a field name or an expression; from_item denotes the table to be queried, that is, the unit corresponding to a table in each database, denoted table1; the GROUP BY clause and the WHERE clause specify the grouping operation and the join operation, respectively:

1) GROUP BY specifies the field column to group by, and CONTAIN ... DIVIDED BY ... or START WITH ... PER ... specifies a fuzzy grouping operation on a string column or an integer column, respectively;

2) WHERE specifies the query condition condition, which here comprises a join condition and a sub-statement. The join condition contains the join field c1 in table1 and the join type; the sub-statement contains the queried table table2 and the field c2 of that table that takes part in the join;

Step 4: parse the query. Before routing, the system parses the specified SQL statement and obtains the relevant parameters, specifically:

Step 4.1: parse the return type, obtaining the result presentation form that follows the SELECT keyword;

Step 4.2: parse for a fuzzy join. Determine whether the SQL statement contains the FUZZY IN keyword; if so, execute step 5, otherwise execute step 4.3;

Step 4.3: parse for fuzzy grouping. Determine whether the SQL statement contains the CONTAIN or START WITH keyword; if so, execute step 6; otherwise the statement is judged not to be one of the newly added statements, and it is routed by default to obtain the final result.
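The keyword checks in steps 4.2 and 4.3 amount to a small dispatcher. The following Python sketch assumes case-insensitive matching on whole keywords; the function name and the returned labels are hypothetical. It is deliberately naive: a keyword that appears inside a string literal would also match.

```python
import re


def classify_statement(sql):
    """Return the handler a statement would be routed to (steps 4.2 and 4.3)."""
    text = sql.upper()
    if re.search(r"\bFUZZY\s+IN\b", text):
        return "fuzzy_join"    # continue with step 5
    if re.search(r"\bCONTAIN\b", text) or re.search(r"\bSTART\s+WITH\b", text):
        return "fuzzy_group"   # continue with step 6
    return "default"           # not an extended statement: default routing


print(classify_statement("SELECT c1 FROM t1 WHERE c1 FUZZY IN (SELECT c2 FROM t2)"))
print(classify_statement("SELECT COUNT(a) FROM t GROUP BY a START WITH 0 PER 10"))
print(classify_statement("SELECT * FROM t"))
```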

Step 5: hybrid join, which supports join operations with fuzzy matching over multi-source heterogeneous data. This step comprises:

Step 5.1: split and rewrite the query. The system splits the original statement at the keyword FUZZY IN, extracts the main query and the conditional sub-statement of FUZZY IN as the query statements for table1 and table2 respectively, and saves the join type in memory;

Step 5.2: bind routes. The configuration information is queried, the query statements for table1 and table2 are each bound to the corresponding routing node, and the statements are routed;

Step 5.3: execute the queries. Each node executes its sub-statement, and the result sets are returned in turn to the routing layer;

Step 5.4: perform the FUZZY IN join. The join type FUZZY IN is read from memory, the result set obtained from table1 is filtered on column c1, and only the rows whose c1 value is a substring of a c2 value in table2 are retained and returned;

Step 6: fuzzy grouping. The column to be grouped is grouped either by contained specified strings (character type) or by a specified interval (numeric type). This step comprises:

Step 6.1: determine the grouping type. If the statement contains the keyword START WITH, it is identified as numeric grouping by a fixed interval, and step 6.2 is executed; if it contains the keyword CONTAIN, it is identified as character grouping by contained strings, and step 6.3 is executed;

Step 6.2: numeric grouping. The relevant parameters are parsed, the grouping rule is set from them, and the grouped result set is obtained. This step comprises:

Step 6.2.1: parse the relevant parameters, extracting the start value s = num1 and the interval Δ = num2 specified in the statement;

Step 6.2.2: query the initial result set. The GROUP BY and START WITH ... clauses are stripped from the statement, the query request is sent to the database, and the initial result set t is obtained;

Step 6.2.3: traverse each record v in the initial result set t, determine the group number k to which the record belongs according to the formula [formula missing in source], and encapsulate the result in "k:v" form;

Step 6.3: character grouping. The relevant parameters are parsed, the grouping rule is set from them, and the grouped result set is obtained. This step comprises:

Step 6.3.1: parse the relevant parameters, extracting the string r and the string delimiter d specified in the statement; r is split on d into multiple substrings, each substring corresponds to one group, and each group is assigned a group number k;

Step 6.3.2: query the initial result set. The GROUP BY and START WITH ... clauses are stripped from the statement, the query request is sent to the database, and the initial result set t is obtained;

Step 6.3.3: traverse each record v in the initial result set t, select the records that contain each substring from step 6.3.1, and encapsulate them in "k:v" form.

Step 6.4: return the grouping result. The "k:v" result set produced by grouping is encapsulated as a list in a result-set object and returned to the routing layer;

Step 7: encapsulate and return the result. According to the return type obtained in step 4.1, the table header information is set, and the header information and the corresponding result content are encapsulated in turn as a byte stream and returned.

Preferably, step 1 comprises:

Step 1.1: build the database environment, installing relational and non-relational databases on a single machine or on multiple machines;

Step 1.2: build the MYCAT middleware platform, add the different types of databases as bottom-layer nodes of the middleware through the configuration file, and specify the database type of each node, comprising the following steps:

Step 1.2.1: install MYCAT, completing the installation by importing the MYCAT source code into Eclipse;

Step 1.2.2: set up the environment, adding the JAR packages required to access the specified databases to the system runtime environment through BUILD PATH in Eclipse;

Step 1.2.3: configure the node information. Table (table) and node (dataNode) information is added to the configuration file "schema.xml", the correspondence between tables and dataNodes is specified, the vertical-sharding rule is added, and the addresses, user names, passwords, and other information of the databases to be added are written to the configuration file.

The present invention provides a strategy in which SQL statements are extended at the distributed database middleware layer, routed according to the SQL statement, and the result set that satisfies the conditions is obtained at the underlying database layer or at the middleware layer.

Specifically, the specified SQL statement is routed at the distributed database middleware layer, and the result set satisfying the specified conditions is obtained at the underlying database layer or at the middleware layer. The invention is characterized by supporting fuzzy grouping in specified formats and fuzzy joins over mixed data, thereby extending SQL statements with respect to mixed-data query capability.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of step 5 of the present invention.

Detailed description of the embodiments

To make the present invention clearer and easier to understand, preferred embodiments are described in detail below.

The present invention provides a method that, by extending SQL statements, implements fuzzy joins over mixed data types and fuzzy grouping over data of specified types. In a distributed database environment, mixed-type data cannot be joined by a given rule, and aggregation over data of a specified type is functionally limited. To address this, the invention provides users with an extended SQL syntax for aggregation and joining, so that extended queries such as fuzzy grouping and fuzzy joins can be completed by writing the specified statements, which extends the functionality and adaptability of distributed databases. Taking MySQL and MongoDB as the bottom-layer node databases as an example, the specific steps are as follows:

Step 1: build the architecture. Using MYCAT as the database middleware, a distributed database environment is built, and the environment variables and bottom-layer node information are set. This step comprises:

Step 1.1: build the database environment, installing relational and non-relational databases on a single machine or on multiple machines. Here the relational database is MySQL and the non-relational database is MongoDB;

Step 1.2: build the MYCAT middleware platform, adding each database host from step 1.1 as a shard node in the configuration and specifying the correspondence between each node and its database, as follows:

Step 1.2.1: install MYCAT, completing the installation by importing the open-source MYCAT code into Eclipse;

Step 1.2.3: configure the node information. Table (table) and node (dataNode) information is added to the configuration file "schema.xml", the correspondence between tables and dataNodes is specified, the vertical-sharding rule is added, and the addresses, user names, passwords, and other information of the databases to be added are written to the configuration file.

Step 2: store the mixed-type data. Taking the column as the unit, data is stored in the corresponding node database according to its data type. This step comprises:

Step 2.1.2: parse and decompose the statement. According to the field type and field length of each field in the SQL CREATE TABLE statement, the fields are divided into structured and unstructured fields, and a sub-statement is built for each;

Step 2.1.3: create the index file. According to the correspondence between tables and databases in the configuration file and the type of each database shard, all fields in the CREATE TABLE statement are written to a file in the form "table name: {field name: database name}", which serves as the index for subsequent insert, delete, update, and query operations;

Step 2.1.4: route and distribute. Each sub-statement obtained in step 2.1.2 is bound to the database of the corresponding type in the configuration and routed to it, completing table creation.
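Steps 2.1.2 and 2.1.3 can be sketched in Python as follows. The text states only that fields are divided by type and length, so the concrete rule used here (the set `UNSTRUCTURED_TYPES` and the threshold `MAX_STRUCTURED_LEN`) is an illustrative assumption, as are all function names. The index entry follows the "table name: {field name: database name}" form described above.

```python
import json

# Assumptions for illustration only: which types count as unstructured,
# and the length above which a field is treated as unstructured.
UNSTRUCTURED_TYPES = {"text", "blob", "file"}
MAX_STRUCTURED_LEN = 255


def classify(col_type, length):
    """Pick the target database for one field."""
    if col_type.lower() in UNSTRUCTURED_TYPES or (length or 0) > MAX_STRUCTURED_LEN:
        return "mongodb"
    return "mysql"


def build_index(table, columns):
    """columns is a list of (name, type, length or None) tuples."""
    return {table: {name: classify(t, n) for name, t, n in columns}}


def sub_statements(table, columns):
    """Build one CREATE TABLE sub-statement per target database."""
    groups = {}
    for name, t, n in columns:
        decl = f"{name} {t}" + (f"({n})" if n else "")
        groups.setdefault(classify(t, n), []).append(decl)
    return {db: f"CREATE TABLE {table} ({', '.join(decls)})"
            for db, decls in groups.items()}


cols = [("id", "int", None), ("name", "varchar", 50), ("body", "text", None)]
index = build_index("doc", cols)
stmts = sub_statements("doc", cols)
print(json.dumps(index))   # content that would be written to the index file
print(stmts)
```

The sub-statements are kept as SQL text because the middleware, not the client, is responsible for translating them for each bottom-layer node.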

Step 3: extend the query. Following the designed extended query syntax, SQL-like statements are written for fuzzy grouping over data of a specified type and for fuzzy joins over mixed-type data, specifically:

The SQL statement for numeric fuzzy grouping is as follows:

SELECT COUNT(COLUMN) FROM TABLE GROUP BY COLUMN START WITH num1 PER num2;

This queries the records of column COLUMN in TABLE starting from num1, divides the record values into groups of width num2, and returns the number of records in each group.
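The numeric grouping statement can be illustrated with a short Python sketch. The grouping formula itself is not legible in this text, so the sketch assumes the natural reading k = floor((v - num1) / num2) and assumes that values below num1 are skipped; both are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def numeric_group(values, start, step):
    """Pair each value with its bucket number k (assumed formula)."""
    pairs = []
    for v in values:
        if v < start:
            continue  # assumption: values below the start value are ignored
        k = (v - start) // step  # assumed: k = floor((v - start) / step)
        pairs.append((k, v))
    return pairs


pairs = numeric_group([3, 12, 17, 25, 40], start=10, step=10)
counts = Counter(k for k, _ in pairs)  # COUNT(COLUMN) per group
print(pairs)
print(dict(counts))
```

With start 10 and step 10, the values 12 and 17 fall in group 0, 25 in group 1, and 40 in group 3.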

The SQL statement for character fuzzy grouping is as follows:

SELECT COUNT(COLUMN) FROM TABLE GROUP BY COLUMN CONTAIN r DIVIDED BY d;

This queries the records in TABLE whose COLUMN value contains one of the strings in the string set r, where r uses d as its separator, and returns the number of records in each group.
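A minimal Python sketch of the character grouping follows. It assumes that group numbers start at 0 in the order the substrings appear in r, and that a record containing several substrings is reported once per matching group; neither detail is stated in the text, and the function name is hypothetical.

```python
def contain_group(values, r, d):
    """Split r on d; a value joins group k if it contains the k-th substring."""
    substrings = [s for s in r.split(d) if s]
    pairs = []
    for v in values:
        for k, s in enumerate(substrings):
            if s in v:
                pairs.append((k, v))
    return substrings, pairs


subs, pairs = contain_group(
    ["red apple", "green apple", "red car"], r="red,apple", d=","
)
print(subs)
print(pairs)
```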

The SQL statement for a fuzzy join is as follows:

SELECT c1 FROM table1 WHERE COLUMN FUZZY IN (SELECT c2 FROM table2);

This queries the c1 records in table1 and the c2 records in table2 separately, and uses FUZZY IN to obtain the records of table1 whose c1 value is a substring of a c2 value in table2.
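The substring semantics of FUZZY IN can be shown with a few lines of Python. This sketch operates on two plain lists that stand in for the result sets returned by the two nodes; the function name is hypothetical.

```python
def fuzzy_in(c1_rows, c2_rows):
    """Keep each c1 value that is a substring of at least one c2 value."""
    return [a for a in c1_rows if any(a in b for b in c2_rows)]


result = fuzzy_in(["cat", "dog", "bird"], ["wildcat", "hotdog stand"])
print(result)
```

Here "cat" and "dog" are retained because each occurs inside a c2 value, while "bird" is filtered out.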

Step 4: parse the query. Before routing, the system parses the specified SQL statement and obtains the relevant parameters. This step comprises:

Step 5.4: perform the FUZZY IN join. The join type FUZZY IN is read from memory, the result set obtained from table1 is filtered on column c1, and only the rows whose c1 value is a substring of a c2 value in table2 are retained and returned.

The detailed flow of step 5 is shown in Fig. 1.

In Fig. 1, the index file is created in step 2.1.3; unstructured data of types such as string and file is stored in MongoDB, and ordinary types are stored in MySQL. For a fuzzy join, the statements are distributed to the corresponding database shards according to the field-to-database correspondence in the index file, the execution results of the shards are obtained, and the FUZZY IN join condition is applied to filter out the records whose c1 value is not a substring of c2, yielding the final result.
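The statement split that starts this flow (step 5.1) can be sketched with a regular expression. The pattern below assumes the simple one-condition form used in the example statement, with a single WHERE clause followed by FUZZY IN and a parenthesized sub-query; it is not a general SQL parser, and the function and key names are hypothetical.

```python
import re

# Matches: <main query> WHERE <column> FUZZY IN (<sub-query>)[;]
FUZZY_IN = re.compile(
    r"^(?P<main>.+?)\s+WHERE\s+(?P<col>\w+)\s+FUZZY\s+IN\s*"
    r"\((?P<sub>.+)\)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_fuzzy_in(sql):
    """Split a FUZZY IN statement into the two per-table queries."""
    m = FUZZY_IN.match(sql.strip())
    if m is None:
        raise ValueError("not a FUZZY IN statement")
    return {
        "table1_query": m.group("main"),
        "join_column": m.group("col"),
        "table2_query": m.group("sub").strip(),
        "join_type": "FUZZY IN",  # kept in memory for step 5.4
    }


parts = split_fuzzy_in(
    "SELECT c1 FROM table1 WHERE c1 FUZZY IN (SELECT c2 FROM table2);"
)
print(parts)
```

The two extracted queries would then be bound to their routing nodes, and the saved join type would drive the filtering once both result sets are back.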

Step 6.3.1: parse the relevant parameters, extracting the string r and the string delimiter d specified in the statement; r is split on d into multiple substrings, each substring corresponds to one group, and each group is assigned a group number k;

Step 6.3.3: traverse each record v in the initial result set t, select the records that contain each substring from step 6.3.1, and encapsulate them in "k:v" form;

Step 6.4: return the grouping result. The "k:v" result set produced by grouping is encapsulated as a list in a result-set object and returned to the routing layer.

It can be seen that this technique does not demand a high level of operating skill from the user, offers the user considerable flexibility, and can make full use of the distinctive capabilities of the underlying databases.

Claims

1. A query extension method supporting associated queries over mixed data types and fuzzy grouping, characterized in that it comprises the following steps:

Step 1: build the mixed-data storage architecture;

Step 2: store the mixed-type data, wherein, taking the column as the unit, data is stored in the corresponding node database according to its data type, this step comprising:

Step 2.1: create the table, wherein, according to the configuration file and the field types specified in the SQL statement, the tables containing the specified field data are stored in the corresponding databases, specifically:

Step 2.1.2: parse and decompose the statement, wherein, according to the field type and field length of each field in the SQL CREATE TABLE statement, the fields are divided into structured and unstructured attributes and written to a file that serves as the index for subsequent insert, delete, update, and query operations, and on this basis a sub-statement is built for each database shard;

Step 2.1.3: route and distribute, wherein each sub-statement obtained in step 2.1.2 is bound to the database of the corresponding type in the configuration and routed to it, completing table creation;

Step 2.2: insert data, wherein, according to the correspondence in the index file between the data to be inserted and the databases, the data is stored in the tables of the corresponding databases, specifically:

Step 2.2.2: parse and decompose the statement, wherein the index file generated in step 2.1.2 is queried, and a sub-statement is built for each database shard according to the correspondence between attributes and tables in the file;

Step 2.2.3: route and distribute, wherein each sub-statement obtained in step 2.2.2 is bound to the database of the corresponding type in the configuration and routed to it, completing data insertion;

Step 3: extend the query syntax for mixed queries, wherein, following the principles of simplicity and functionality, the SQL statement is designed as follows:

SELECT * | expression [AS output_name] [...] FROM from_item

wherein expression denotes a field name or an expression; from_item denotes the table to be queried, that is, the unit corresponding to a table in each database, denoted table1; and the GROUP BY clause and the WHERE clause in the statement specify the grouping operation and the join operation, respectively:

1) GROUP BY specifies the field column to group by, and CONTAIN ... DIVIDED BY ... or START WITH ... PER ... specifies a fuzzy grouping operation on a string column or an integer column, respectively;

2) WHERE specifies the query condition condition, which here comprises a join condition and a sub-statement, the join condition containing the join field c1 in table1 and the join type, and the sub-statement containing the queried table table2 and the field c2 of that table that takes part in the join;

Step 4: parse the query, wherein, before routing, the system parses the specified SQL statement and obtains the relevant parameters, specifically:

Step 4.2: parse for a fuzzy join, determining whether the SQL statement contains the FUZZY IN keyword; if so, step 5 is executed, otherwise step 4.3 is executed;

Step 4.3: parse for fuzzy grouping, determining whether the SQL statement contains the CONTAIN or START WITH keyword; if so, step 6 is executed; otherwise the statement is judged not to be one of the newly added statements, and it is routed by default to obtain the final result;

Step 5.1: split and rewrite the query, wherein the system splits the original statement at the keyword FUZZY IN, extracts the main query and the conditional sub-statement of FUZZY IN as the query statements for table1 and table2 respectively, and saves the join type in memory;

Step 5.2: bind routes, wherein the configuration information is queried, the query statements for table1 and table2 are each bound to the corresponding routing node, and the statements are routed;

Step 5.3: execute the queries, wherein each node executes its sub-statement, and the result sets are returned in turn to the routing layer;

Step 5.4: perform the FUZZY IN join, wherein the join type FUZZY IN is read from memory, the result set obtained from table1 is filtered on column c1, and only the rows whose c1 value is a substring of a c2 value in table2 are retained and returned;

Step 6: fuzzy grouping, wherein the column to be grouped is grouped either by contained specified strings (character type) or by a specified interval (numeric type), this step comprising:

Step 6.1: determine the grouping type, wherein, if the statement contains the keyword START WITH, it is identified as numeric grouping by a fixed interval, and step 6.2 is executed; if it contains the keyword CONTAIN, it is identified as character grouping by contained strings, and step 6.3 is executed;

Step 6.2: numeric grouping, wherein the relevant parameters are parsed, the grouping rule is set from them, and the grouped result set is obtained, this step comprising:

Step 6.2.2: query the initial result set, wherein the GROUP BY and START WITH ... clauses are stripped from the statement, the query request is sent to the database, and the initial result set t is obtained;

Step 6.2.3: traverse each record v in the initial result set t, determine the group number k to which the record belongs according to the formula [formula missing in source], and encapsulate the result in "k:v" form;

Step 6.3: character grouping, wherein the relevant parameters are parsed, the grouping rule is set from them, and the grouped result set is obtained, this step comprising:

Step 6.3.1: parse the relevant parameters, extracting the string r and the string delimiter d specified in the statement, wherein r is split on d into multiple substrings, each substring corresponds to one group, and each group is assigned a group number k;

Step 6.3.2: query the initial result set, wherein the GROUP BY and START WITH ... clauses are stripped from the statement, the query request is sent to the database, and the initial result set t is obtained;

Step 6.3.3: traverse each record v in the initial result set t, select the records that contain each substring from step 6.3.1, and encapsulate them in "k:v" form;

Step 6.4: return the grouping result, wherein the "k:v" result set produced by grouping is encapsulated as a list in a result-set object and returned to the routing layer;

Step 7: encapsulate and return the result, wherein, according to the return type obtained in step 4.1, the table header information is set, and the header information and the corresponding result content are encapsulated in turn as a byte stream and returned.

2. The query extension method supporting associated queries over mixed data types and fuzzy grouping according to claim 1, characterized in that step 1 comprises:

Step 1.1: build the database environment, installing relational and non-relational databases on a single machine or on multiple machines;

Step 1.2: build the MYCAT middleware platform, add the different types of databases as bottom-layer nodes of the middleware through the configuration file, and specify the database type of each node, comprising the following steps:

Step 1.2.2: set up the environment, adding the JAR packages required to access the specified databases to the system runtime environment through BUILD PATH in Eclipse;

Step 1.2.3: configure the node information, wherein table (table) and node (dataNode) information is added to the configuration file "schema.xml", the correspondence between tables and dataNodes is specified, the vertical-sharding rule is added, and the addresses, user names, and passwords of the databases to be added are written to the configuration file.