CN107491561B - Ontology-based urban traffic heterogeneous data integration system and method - Google Patents

Ontology-based urban traffic heterogeneous data integration system and method

Info

Publication number
CN107491561B
CN107491561B CN201710873196.9A
Authority
CN
China
Prior art keywords
query
sub
module
data
concept
Prior art date
Legal status
Active
Application number
CN201710873196.9A
Other languages
Chinese (zh)
Other versions
CN107491561A (en)
Inventor
王海泉
张雅素
赵洁洁
吴世敏
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201710873196.9A priority Critical patent/CN107491561B/en
Publication of CN107491561A publication Critical patent/CN107491561A/en
Application granted granted Critical
Publication of CN107491561B publication Critical patent/CN107491561B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2228: Indexing structures
    • G06F 16/2246: Trees, e.g. B+trees
    • G06F 16/2282: Tablespace storage structures; Management thereof
    • G06F 16/24: Querying
    • G06F 16/242: Query formulation
    • G06F 16/2433: Query languages
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 16/24554: Unary operations; Data partitioning operations
    • G06F 16/24556: Aggregation; Duplicate elimination
    • G06F 16/248: Presentation of query results
    • G06F 16/25: Integrating or interfacing systems involving database management systems
    • G06F 16/258: Data format conversion from or to a database

Abstract

The invention relates to an ontology-based urban traffic heterogeneous data integration system and method that resolve the syntactic, semantic, and system heterogeneity of urban traffic data and improve the efficiency of data processing. The system consists of five modules: a query decomposition module, an ontology and database mapping module, a sub-query generation module, a result merging module, and a wrapper module. By establishing mappings between database tables and fields on one side and ontology concepts and attributes on the other, the method manages both conventional and special data in the urban traffic field and resolves heterogeneity while integrating the data. The invention takes full account of the characteristics that distinguish urban traffic data from data in other fields and solves heterogeneity problems in urban traffic data that general-purpose methods cannot, thereby providing data analysts with a uniform query interface and improving data processing efficiency.

Description

Ontology-based urban traffic heterogeneous data integration system and method
Technical Field
The invention relates to an ontology-based urban traffic heterogeneous data integration system and method, and belongs to the field of database/data integration.
Background
To solve the heterogeneity problems between data sources, researchers in China and abroad have carried out extensive work on data integration. The data integration architectures that currently have practical value and occupy the mainstream fall into two types: the ETL architecture and the middleware architecture.
The data processing of the ETL architecture is divided into three steps: Extract, Transform, and Load. In the extraction step, the system reads data from the various data sources. In the transformation step, the data is converted uniformly according to predefined transformation rules so that it is syntactically consistent with the schema of the target database tables. In the loading step, the result of the previous step is imported into the target data source. To remove noise and inconsistencies, the data is typically cleaned after extraction. Because the ETL architecture converts heterogeneous data in advance and stores it in a data warehouse, users query the converted data; for data that is updated frequently, the warehouse may therefore not provide the most up-to-date version, so the ETL architecture is unsuitable for frequently updated data. For data that changes little, however, the convert-in-advance strategy reduces query response time and is more efficient than the middleware architecture.
The middleware architecture establishes a global view, decomposes and converts a user's query into queries against the individual data sources, then merges and converts the data the sources return and delivers the result. A user issues a query to the data integration system through an application; the middleware first decomposes the query into sub-queries against multiple data sources according to the global schema, and a wrapper module receives each sub-query and converts it into code the data source can execute directly. When a data source returns data, the wrapper module performs format conversion as required by the mapping, and the mediator merges the query results from the different data sources and finally returns them to the user. The important components of the middleware architecture are the mapping module, the query decomposition module, and the result-return process. The mapping module stores the mapping between each data source and the global view; query decomposition breaks a global query into several sub-queries; and the result-return process converts the results fed back by the data sources into a uniform form and returns them. Because the middleware architecture converts the data at the moment a user query is received, it guarantees data freshness, but the system response time is relatively long since every query requires a data conversion.
The schema in the middleware architecture can be implemented in many ways. Because an ontology can make unclear and implicit knowledge explicit, adopting an ontology as the schema addresses the problem of semantic heterogeneity well. Many researchers have therefore introduced ontologies when designing data integration systems; by bringing in normative domain knowledge, they strengthen the recognition and establishment of semantic information in data source management, with good results for semantic interoperability.
A large number of data integration systems based on the above architectures already exist in the traffic field, but the methods they use target the conventional data integration that is common to all fields. Their designs do not consider characteristics specific to urban traffic data (such as the spatio-temporal nature of trajectory data), so they cannot complete tasks such as managing and querying urban traffic data.
Disclosure of Invention
The invention solves the following problem: it overcomes the defects of the prior art and provides an ontology-based urban traffic heterogeneous data integration system and method. Targeting the particularities of urban traffic data, the system and method improve and extend the ontology-based middleware data integration architecture to form an ontology-based urban traffic heterogeneous data integration system. The system can integrate, manage, and query unconventional data in the urban traffic field (such as trajectory data), reducing the low-level data processing work of urban traffic and thereby improving the efficiency of data mining.
The technical scheme of the invention is as follows: an ontology-based urban traffic heterogeneous data integration system comprising an ontology and database mapping module, a query decomposition module, a sub-query generation module, a query result merging module, and a wrapper module. The query decomposition module, the sub-query generation module, and the query result merging module form the mediator of the data integration system.
The ontology and database mapping module is responsible for resolving the urban traffic ontology concepts and attributes involved in a global query into database tables and fields. The module uses two files: a file describing the global ontology of urban traffic domain knowledge, and a mapping rule file relating the data sources to the global ontology. The global ontology describes the relationships between concepts in the traffic domain; the concepts also carry attributes describing their characteristics, and together they form the standard vocabulary a user refers to when writing a global query statement. The mapping rule file records the correspondence between the database tables and fields of each urban traffic data source in the system and the concepts and attributes of the global ontology. Given the ontology concepts and attributes involved in a query, the module looks up the mapping rules between the data sources and the global ontology stored in the mapping rule file and returns the data sources, database tables, and field names involved to the query decomposition module.
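A minimal sketch of the lookup this module performs, with the mapping rules held in a Python dict for illustration (the real module reads them from the mapping rule file); the concept, attribute, table, and field names here are hypothetical, not taken from the patent's actual files:

```python
# Hypothetical mapping rules: (concept, attribute) -> (data source, table, field).
MAPPING_RULES = {
    ("Taxi", "taxiID"): ("sr6", "trajectory", "taxi_id"),
    ("Point", "lon"):   ("sr6", "trajectory", "lon"),
    ("Point", "lat"):   ("sr6", "trajectory", "lat"),
}

def resolve(concept: str, attribute: str):
    """Return the (data source, table, field) a concept/attribute maps to,
    or None when no rule matches (the module then reports that no relevant
    data source exists)."""
    return MAPPING_RULES.get((concept, attribute))
```

The query decomposition module would call such a lookup once per concept/attribute pair it extracts from the global query.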
the query decomposition module: the global query language used by the system is adapted from SQL, with a table mapping function and a format conversion function for trajectory data added to handle trajectory data and other special cases among urban traffic heterogeneous data. A user writes a global query statement in this language, and the query decomposition module parses the statement to determine the data sources, database tables, and fields the query must access. Parsing proceeds by building a query tree, which records the concepts and attributes in the global query statement together with the database tables and fields the query involves. The query tree is built as follows: parse the select, from, and where clauses of the global query statement; extract the ontology concepts and attributes, table mapping functions, and attribute format conversion functions they involve; then build the query tree layer by layer from the root node, repeatedly calling the ontology and database mapping module to resolve the database tables and fields corresponding to the concepts and attributes in the statement. The module can be called through an external interface of the system, in which case the global query statement is passed as a parameter; a user can also type a query statement into a text box of the query interface;
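The clause decomposition that precedes query tree construction can be sketched as follows; this is a naive regex-based parse for illustration only, not the patent's actual grammar, and the concept/attribute names in the usage note are hypothetical:

```python
import re

def decompose(query: str):
    """Split a global select/from/where query into the three clause queues
    the query decomposition module works on.  A deliberately simple parse:
    real table mapping functions and nested conditions are not handled."""
    m = re.match(r"select\s+(.*?)\s+from\s+(.*?)(?:\s+where\s+(.*))?$",
                 query.strip(), re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("not a select/from/where query")
    select_q = [s.strip() for s in m.group(1).split(",")]
    from_q = [s.strip() for s in m.group(2).split(",")]
    where_q = [s.strip() for s in m.group(3).split(" and ")] if m.group(3) else []
    return select_q, from_q, where_q
```

For example, `decompose("select Taxi.taxiID from Taxi")` yields the select queue `["Taxi.taxiID"]`, the from queue `["Taxi"]`, and an empty where queue.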
a sub-query generation module: generates a sub-query for each traffic data source according to the query tree produced by the query decomposition module and provides the wrapper module of each data source with all the information needed for one query. It traverses the query tree, extracts all the data sources the query involves, reads the corresponding entries in the system's data source configuration file (which stores, among other things, the connection method and a brief description of the contents of each existing data source), generates for each involved data source a sub-query comprising the sub-query statement, the data source type, and the database connection configuration, and sends the sub-queries to the wrapper modules for execution;
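A possible shape for the sub-query object handed to each wrapper, following the description above (sub-query statement, data source type, and connection configuration); the field names and the example configuration values are assumptions, not the patent's actual data structure:

```python
from dataclasses import dataclass, field

@dataclass
class SubQuery:
    statement: str                                  # sub-query against one data source
    source_type: str                                # e.g. "SQL", "HBase", "HDFS", "Hive"
    connection: dict = field(default_factory=dict)  # url, user name, password, etc.

def generate_subqueries(sources_in_tree, source_config):
    """For every data source found while traversing the query tree, pair its
    sub-query statement with the entry from the data source configuration file."""
    return [SubQuery(stmt, source_config[name]["type"], source_config[name])
            for name, stmt in sources_in_tree.items()]
```

Each `SubQuery` instance is then dispatched to the wrapper module of its data source.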
a wrapper module: each data source in the system has a corresponding wrapper module. After receiving a sub-query from the sub-query generation module, the wrapper obtains the data the sub-query specifies through three steps (sub-query conversion, sub-query execution, and format conversion) and returns the data to the query result merging module for joining. According to the type of the corresponding data source (SQL sources, or NoSQL sources such as HBase, HDFS, and Hive), the wrapper modules are divided into SQL-class and non-SQL-class wrappers. An SQL-class wrapper first converts the sub-query into code the data source can execute directly, then reads the data source connection configuration in the sub-query, establishes a connection to the data source, and executes the query, thereby resolving the system heterogeneity of the data sources. When a data source returns query results, the wrapper first converts them according to the format conversion function in the sub-query, thereby resolving format heterogeneity. A non-SQL-class wrapper works essentially the same way, except that in the sub-query conversion step it only needs to parse the sub-query and does not need to generate directly executable code for the data source;
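How an SQL-class wrapper's execute-then-convert sequence could look, sketched with SQLite standing in for an arbitrary SQL data source; the `convert` callback is a stand-in for the format conversion function carried in the sub-query, and the whole function is an illustration of the three-step flow rather than the patent's implementation:

```python
import sqlite3

def run_sql_subquery(sql, convert=None, db_path=":memory:"):
    """SQL-class wrapper sketch: execute a directly-executable statement
    against the data source, then apply the sub-query's format conversion
    function to each returned row before handing rows to the merger."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
    finally:
        conn.close()
    return [convert(r) for r in rows] if convert else rows
```

A non-SQL-class wrapper would replace the `execute` call with a source-specific client API and apply the conversion entirely in the format conversion step.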
a query result merging module: collects the query results sent by the wrapper modules and merges them according to the relationships among the sub-queries. The module selects a merging strategy according to the type of the global query: single-concept or multi-concept. A single-concept query is further divided into fully bound and partially bound cases. Fully bound means the from clause of the global query statement contains only one concept, and when the concept and attribute names are replaced with database tables and fields according to the mapping rules, every attribute finds a database field it maps to. Partially bound means that at least one attribute cannot find a mapped database field during this replacement, so the field must be obtained from another data source describing the same batch of entities. A multi-concept query is one whose from clause contains several concepts. When the global query is a single-concept query, the fully bound case merges the several result sets by a simple union into a new result set, while the partially bound case joins the several result sets and removes the redundant overlapping fields to obtain the final result set. When the global query is a multi-concept query, the fully or partially bound results of the sub-queries belonging to the same concept are merged first, and the merged results of the different concepts are then merged again to obtain the final result set. When the query is issued through the system's external interface, the final result set is returned to the user as a list object; a user querying through the graphical query interface sees the result as a visual table.
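The two single-concept merging strategies can be sketched as follows, with rows represented as tuples (full binding) and as dicts keyed by field name (partial binding) purely for illustration; the key-based join is an assumed realization of "connecting result sets and removing redundant intersecting fields":

```python
def merge_full_binding(result_sets):
    """Single-concept, fully bound: union of the row sets."""
    merged = set()
    for rs in result_sets:
        merged.update(rs)
    return merged

def merge_partial_binding(left, right, key):
    """Single-concept, partially bound: join rows describing the same
    entities on a shared key and drop the duplicated key field from the
    right-hand rows."""
    index = {row[key]: row for row in right}
    out = []
    for row in left:
        other = index.get(row[key])
        if other is not None:
            extra = {k: v for k, v in other.items() if k != key}
            out.append({**row, **extra})
    return out
```

Multi-concept merging applies these per concept first, then combines the concepts' merged results.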
An ontology-based urban traffic heterogeneous data integration method comprises the following implementation steps:
(1) Query decomposition: decompose the select, from, and where clauses of the global query statement, build a query tree describing the query structure, and resolve the global ontology concepts and attributes involved in the three clauses into database tables and fields by consulting the ontology-database mapping in the mapping rule file;
(2) Sub-query generation: generate a sub-query for each traffic data source according to the query tree produced in step (1), providing the wrapper module of each data source with all the information needed for one query; traverse the query tree, extract all the data sources involved, read the corresponding entries in the system's data source configuration file, generate for each involved data source a sub-query comprising the sub-query statement, the data source type, and the database connection configuration, and send the sub-queries to the wrapper modules for execution;
(3) Wrapper sub-query execution: obtain the data each sub-query specifies through the three steps of sub-query conversion, sub-query execution, and format conversion, and return the data to the mediator's query result merging module for joining;
For an SQL-class wrapper module, the sub-query is first converted into SQL statements the data source can execute directly, a connection to the data source is established, and the query is executed. Part of the format conversion is embedded in the SQL code during sub-query conversion, so the simpler conversions are done by the database itself; when the data source returns results, each wrapper completes the remaining format conversion according to the format conversion function in the sub-query, thereby resolving format heterogeneity;
A non-SQL-class wrapper module works essentially the same way, except that in the sub-query conversion step it only needs to parse the sub-query and does not generate directly executable code for the data source; in addition, the format conversion function is executed entirely in the format conversion step;
(4) Query result merging: according to the result of query decomposition, complete the merging task for the three cases of single-concept full binding, single-concept partial binding, and multi-concept query. For single-concept full binding, merging is completed by a set union over all query result sets. For single-concept partial binding, the information from several data sources must be cross-merged, i.e., the Cartesian product of the data sets is taken to obtain the final merged result set. For a multi-concept query involving only the cross operation between the concepts' data sets, the Cartesian product of the data sets is likewise taken; when the multi-concept query also involves partial binding, the partially bound results of the sub-queries belonging to the same concept are merged first, and the merged results of the different concepts are then merged;
(5) Query result return: return the query result to the user in the form corresponding to the way the global query statement was input.
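The cross-merge used in step (4) amounts to a Cartesian product of the result sets; a minimal sketch, with rows represented as tuples for illustration:

```python
from itertools import product

def cross_merge(datasets):
    """Cartesian product of several result sets, each product tuple
    flattened into one combined row."""
    return [sum(rows, ()) for rows in product(*datasets)]
```

For two result sets of sizes m and n this yields m x n combined rows, which is why the patent reserves it for the partial-binding and multi-concept cases.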
Compared with the prior art, the invention has the advantages that:
(1) the invention is a data integration system targeted at the urban traffic field and can process special urban traffic data that existing systems cannot;
(2) data queries use a query language similar to SQL, modified for the special data characteristics of the urban traffic field; since its general syntax resembles SQL, users can use the system directly without much learning cost;
(3) by writing data format conversion functions in the database and ontology mapping file, the invention specifies the format in which the data sources' results are returned to the user in each query, helping the user overcome the problem of inconsistent data source formats during data analysis.
Drawings
FIG. 1 is a block diagram of the constituent modules of the present invention;
FIG. 2 is a fragment of a database and ontology mapping file;
FIG. 3 is a fragment of a database configuration information file;
FIG. 4 is a schematic diagram of an urban traffic ontology used by the system;
FIG. 5 is a listing of all concepts and attributes contained in the urban traffic ontology used by the system;
FIG. 6 is a schematic diagram of the query tree structure, where OntClass is an ontology class and DBTable represents a table to which its parent node (an OntClass) is mapped in the database; OntProperty is an attribute corresponding to its grandparent node (an OntClass) in a query, and DBAttribute is a field to which its parent node (an OntProperty) is mapped in the database;
FIG. 7 is a data structure composition of query tree nodes;
FIG. 8 is a data structure composition of a sub-query.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described below in detail.
As shown in FIG. 1, the present invention comprises an ontology and database mapping module, a query decomposition module, a sub-query generation module, a query result merging module, and a wrapper module; the ontology and database mapping module, the query decomposition module, the sub-query generation module, and the query result merging module belong to the system mediator. The detailed design and implementation of each module is as follows:
1. ontology and database mapping module
The ontology and database mapping module is called by the query decomposition module; it maps the ontology concepts in a global query to the database by consulting the mapping rule file and the database configuration information file. The module's input parameters are either an ontology concept and attribute, or the parameters of a table mapping function. The table mapping function is embedded in the query language used by the system to handle complex query cases over urban traffic heterogeneous data: through its parameters it selects one of the several tables mapped by a concept. Such concept-to-table mappings are often difficult to establish; for example, the "trajectory" concept maps to many tables, and each such concept has its own table mapping function. The module uses the mapping rule file, the database configuration information file, and the global ontology. The mapping rule file is shown in FIG. 2: the upper part is the mapping rule file itself, where the left side gives the tables and fields in a data source (i.e. sr6 is the name of the data source, trajector is the table name, and time, lon, lat, company id, taxi id, etc. are fields) and the right side gives the attributes in the ontology; the lower half shows how the data is stored in the database, where the left side is the database, in which a track (Trajectory) table stores fields such as time, lon, lat, company ID, and taxi ID, and the right side is the ontology content the fields map to. The nested content of the database configuration information file is shown in FIG. 3, where each sr tag pair represents a data source, and a data source entry includes the data source name (name), data source type (type), data source URL (url), connection user name (user name), and connection password (password). The global ontology describes the business logic of the urban traffic domain; it is built on the basis of the databases rather than extracted from the domain's business.
The Taxi service part of the urban traffic ontology is shown in FIG. 4; it mainly includes several important concepts and their attributes, such as track points (TrajectoryDot), taxis (Taxi), and location points (Point). A detailed list of the concepts and attributes is given in FIG. 5, where the first column is the concept and the second column the attributes corresponding to it.
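A hypothetical table mapping function for the trajectory concept, assuming date-partitioned tables named following the "data source_table" examples given later in the text (e.g. sr1_20120101); the naming scheme and the parameter set are assumptions for illustration:

```python
def trajectory_table(source: str, date: str) -> str:
    """Hypothetical table mapping function: the trajectory concept maps to
    many per-date tables, and the function selects one by its parameters."""
    return f"{source}_{date}"
```

Embedded in a global query, such a function would let the mapping module bind the concept to exactly one of its many tables.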
When the module's input parameters are an ontology concept and attribute, the processing procedure is as follows:
Step 1: Read the global ontology and check that the parameters are valid, i.e., that the input concept and attribute exist in the ontology;
Step 2: Read the mapping rule file and search it for a mapping rule matching the concept and attribute;
Step 3: If a relevant mapping rule exists, return its database part, i.e., the data source, table, and field names; if no relevant mapping exists, report that no relevant data source is available.
When the module's input parameter is a table mapping function, the processing procedure is as follows:
Step 1: Check each parameter of the table mapping function for validity;
Step 2: Read the concept-specific data source configuration file corresponding to the table mapping function and search it for a data source satisfying the conditions;
Step 3: If a relevant data source exists, return all of its relevant information, such as the data source name and table name; if none exists, report that no relevant data source is available.
2. Query decomposition module
The query decomposition module determines all the data sources and contents involved in a global query by parsing the global query statement and constructing a query tree. The specific steps are as follows:
Step 1: Parse the syntactic components of the global query statement, decompose its select, from, and where clauses, and store the elements of the three clauses in three queues respectively;
Step 2: Construct the query tree: traverse the contents of the three clause queues and map the ontology concepts and attributes to one or more database tables and fields by repeatedly calling the ontology and database mapping module.
The structure of the query tree is shown in FIG. 6; it comprises five kinds of nodes: Root, OntClass, DBTable, OntProperty, and DBAttribute. The tree has five layers, which record the concepts and attributes in the global query and the data source fields the query involves. The structure is as follows: the content of the root node is empty. The second layer holds the ontology classes, i.e., the constituents of the from clause. The third layer holds the database tables resolved from the mapping rules; since one concept may correspond to tables in several data sources, one ontology concept node may have several database table child nodes. The fourth layer holds the ontology attributes, i.e., the constituents of the select clause; an attribute node is the child of a particular data source table node, indicating that the table can return the corresponding field value. The fifth layer holds the database table fields resolved from the ontology attributes according to the mapping rules; as with ontology classes, one ontology attribute may map to fields in different data sources, so an attribute node may have several child nodes. The data structure of a query tree node is shown in FIG. 7. The name string is the name of the node, i.e., the name of the object the node represents. The bindname variable stores the binding name of the node, which points directly to a database table; it mainly stores the database table name obtained from a table mapping function, and for a node whose stored object is not a table mapping function its value is the same as the name attribute.
The type string stores the node's type: a second-layer node is of type Function or Concept depending on whether it stores a table mapping function or an ontology concept; third-layer nodes are of type Table, fourth-layer nodes of type Property, and fifth-layer nodes of type Attribute. The source variable indicates which clause the node's object comes from and may take the values select, from, and where. The childlist stores all children of the node. The super variable stores the node's parent. The joinrelationship list, which is meaningful only in queries containing multiple concepts, records the join keys of the database tables to which two concepts correspond. The info string records extra information for some nodes: third-layer nodes store the key values of their parent concepts; in fifth-layer nodes whose stored object comes from the where clause (source value where), the info variable stores the value of the where element, and otherwise it is null.
The variable description above applies only to the second- to fifth-layer nodes. The first-layer node (i.e. the root node) only stores the positions of the second-layer nodes to provide access to the tree and holds no other actual information. Therefore, all of its members are null except the type variable, whose value is Root.
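The node data structure described above (FIG. 7) can be sketched as a small Python class. This is an illustrative reconstruction from the description, not the patent's actual implementation; the Python types and the example concept name "Taxi" are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class QueryTreeNode:
    """One node of the five-layer query tree (sketch following the description)."""
    name: str = ""        # name of the object represented by the node
    bindname: str = ""    # bound database table name; equals name when no table mapping applies
    type: str = ""        # Root / Function / Concept / Table / Property / Attribute
    source: str = ""      # originating clause: "select", "from", or "where"
    info: str = ""        # extra info: concept keys (layer 3) or where expression (layer 5)
    child: List["QueryTreeNode"] = field(default_factory=list)
    super_: Optional["QueryTreeNode"] = None  # parent node ("super" is reserved in Python)
    # join keys of the tables of two concepts, used only for multi-concept queries
    joinrelationship: List[Tuple[str, str]] = field(default_factory=list)

# The root node stores only type="Root" and the positions of the second-layer nodes
root = QueryTreeNode(type="Root")
concept = QueryTreeNode(name="Taxi", bindname="Taxi", type="Concept",
                        source="from", super_=root)
root.child.append(concept)
```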
The query tree is constructed as follows:
step 1: and establishing a root node, wherein values of other members are null except for the type member value of root.
Step 2: reading elements in the from queue one by one, if the elements are ontology concepts, directly generating nodes to be added into a child node list of a root node, and assigning the name members and the bindname members of the nodes as ontology concept names; if the node is a table mapping function, the function name is used as the value of the node name and the bindname, and the node is generated and then added into a child node queue of the root node. The source type of this level (second level) node is "from", indicating that a from clause is relevant. In addition, the ontology concept and the key of the concept in the table mapping function are read from the mapping rule file and stored in the info variable of the node.
And step 3: for each node (ontology class or table mapping function) of the second layer, searching a data source mapped in the mapping file according to the value of the name attribute or directly executing the mapping function, generating a node for the searched related database table and adding the node into the child node list of the second layer node to form a third layer node. For the third layer node of which the father node is an ontology class, directly taking the name of the mapped database table as the values of the name attribute and the bindname attribute; for a node whose parent node is a mapping function, the name of the database table corresponding to the mapping rule is the name of a type of table mapped by the mapping function, and the format of the database table is "data source + concept name", for example, sr1_ target, and the name is assigned to the name variable. For the value of the bindname variable, the concept name in the mapped content needs to be replaced with the actual database table, and the format of the value is database _ table, for example sr1_ 20120101. The source type of the level (third level) node is "from".
And 4, step 4: for each node a (storing a data source and a database table name) of the third layer, reading the element b of the select queue one by one according to the name variable value of a, and searching whether a related mapping rule (namely, a data attribute with a name of [ a.name ]. b exists in a mapping file), wherein the a.name contains both the data source name and the table name), if so, generating a node of the fourth layer and adding the node into the child node queue of a, and the name variable value and the bindname variable value are both the names of the select queue element b. Property, namely, the source of the attribute is specified, at this time, it needs to be judged whether the Concept value of the element is consistent with the name of the third-layer node a, if so, the node is generated according to the method, and if not, the next select queue element is analyzed continuously. The source type of the level (fourth level) node generated in this step is "select", indicating that a select clause is concerned.
And 5: for each element a in the third layer, reading the elements b in the where queue one by one according to the name attribute of a and judging whether the element a is a connection key specifying statement of multi-concept query (the form of the connection key specifying statement is, for example, concept1.property1 ═ concept2.property2), wherein the statement is characterized in that the right side of the equal sign contains a character of. If the statement is a connection key specification statement, a connection relation triple is generated according to the connection relation and is placed in a joinrelationship list of the third-layer element, and the next where queue element b is analyzed. Otherwise, whether the related mapping rule exists is searched according to the ontology attribute on the left side of the b expression. And if so, generating a fourth layer node and adding the fourth layer node into the subnode queue, wherein the name and the bindname attribute are both names of the where queue element b, and assigning the whole expression to the info attribute. Property ", at this time, it needs to be judged whether the Concept of the where element is consistent with the name variable of a, if so, a node is generated according to the above method, and if not, the next where queue element is analyzed continuously. The source type of the level (fourth level) node generated in this step is "where", indicating that a where clause is related.
Step 6: for each node at the fourth layer, the related mapping rule is searched according to the value of the name variable (as in the previous step) and a fifth-layer node is generated and added into the child node queue, wherein the values of the name and the bindname variable are the mapping content (for example, sr1_ taxiinfo. For a grandparent node (a related node located at the second layer) as a node of the table mapping function, the name variable value is a data source table attribute in the mapping content, but the table name in the bindname needs to be converted into a table name obtained after mapping (for example, sr3_ 20101. taxi). If the source value is where, the node needs to inherit the value of the info variable from the parent node, that is, the expression of the constraint query result, and for the child nodes of the table mapping function, the concept in the info value of the parent node needs to be replaced with the actual table name.
3. Sub-query generation module
A sub-query is decomposed from the global query and contains all the information the wrapper module needs to perform a query. The data structure of the sub-query is shown in FIG. 8. By traversing the query tree generated by the query decomposition module, the sub-query generation module can generate all sub-queries decomposed from one global query. According to the rules of query decomposition, each third-layer node (representing a database table) corresponds to one sub-query, so a sub-query contains only one concept, but possibly several attributes. The string arrays select and where and the string from store the elements contained in the three corresponding clauses of the sub-query statement, respectively. subQueryString is the sub-query statement, and sourceNum stores the number of the data source. sourceType indicates the kind of data source; its values include MySQL, SQLServer, Oracle, and so on. connectionString, userName, and password store the connection string, user name, and password used when connecting to the database. The string array key stores the key of the concept corresponding to the sub-query; the key of a concept may consist of several attributes and can be obtained from the key annotation attribute of the corresponding ontology concept.
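The SubQuery record of FIG. 8 can be sketched as a Python dataclass; the field names follow the description above, while the Python types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubQuery:
    """Sketch of the sub-query record described for FIG. 8."""
    select: List[str] = field(default_factory=list)  # elements of the select clause
    where: List[str] = field(default_factory=list)   # elements of the where clause
    from_: str = ""                # the single database table ("from" is reserved in Python)
    subQueryString: str = ""       # the assembled sub-query statement
    sourceNum: int = 0             # number of the data source
    sourceType: str = ""           # MySQL, SQLServer, Oracle, ...
    connectionString: str = ""     # database connection configuration
    userName: str = ""
    password: str = ""
    key: List[str] = field(default_factory=list)  # key attributes of the concept

q = SubQuery(from_="sr1_taxiinfo", sourceType="MySQL", key=["taxi_id"])
```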
The method comprises the following steps:
step 1: and acquiring third-layer elements of the query tree, and initializing a SubQuery structure for each third-layer element.
Step 2: a from clause is generated. For each third tier element, the value of the bindname attribute is assigned to the from attribute.
And step 3: select and where clauses are generated. For each third level element, the child node of its child node (i.e., the fifth level element that is an indirect subclass of the element) is found. Judging the related nodes positioned at the fifth layer one by one, and if the source attribute is select, adding the bindname attribute value into a select queue; and if the source attribute is the where, adding the info attribute value into the where queue.
And 4, step 4: and combining the query character strings according to the contents of the elements in the select, where and from clauses.
And 5: and intercepting the SourceName from the from clause, and accordingly finding out the information of the data source type (SourceType), the connection string (connectionString), the user name (userName) and the password (password) of the database connection from the data source configuration information file.
All generated sub-queries are sent to the corresponding wrapper modules, which acquire the required data from the databases according to the information in the sub-queries.
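Steps 1-5 above can be sketched as follows, with plain dictionaries standing in for query-tree nodes. The node shapes, the "sourcename before the first underscore" extraction, and the source_config lookup are illustrative assumptions, not the patent's actual code.

```python
def generate_subqueries(third_layer_nodes, source_config):
    """Sketch of sub-query generation: one sub-query per third-layer (table) node."""
    subqueries = []
    for table_node in third_layer_nodes:
        select, where = [], []
        # children of children: the fifth-layer database field nodes
        for prop_node in table_node["child"]:
            for field_node in prop_node["child"]:
                if field_node["source"] == "select":
                    select.append(field_node["bindname"])
                elif field_node["source"] == "where":
                    where.append(field_node["info"])
        stmt = "SELECT {} FROM {}".format(", ".join(select), table_node["bindname"])
        if where:
            stmt += " WHERE " + " AND ".join(where)
        # assumed convention: the data source name precedes the first underscore
        source_name = table_node["bindname"].split("_")[0]
        subqueries.append({"subQueryString": stmt,
                           **source_config.get(source_name, {})})
    return subqueries

# Usage with one third-layer node carrying a single select field
table_node = {
    "bindname": "sr1_taxiinfo",
    "child": [  # fourth layer: ontology attribute nodes
        {"child": [  # fifth layer: database field nodes
            {"source": "select", "bindname": "sr1_taxiinfo.speed", "info": ""}
        ]}
    ],
}
subs = generate_subqueries([table_node], {"sr1": {"sourceType": "MySQL"}})
```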
4. Query result merging module
The query result merging module is responsible for merging the query results returned by the data source wrapper modules to form a complete query result data set. According to the type of query, result merging falls into three cases: result merging of single-concept full-binding queries, result merging of single-concept partial-binding queries, and result merging of multi-concept queries.
Single-concept full binding means that the from clause of the global query statement contains only one concept and that, when the concepts and attribute names of the global query statement are replaced with database tables and fields according to the mapping rules, every attribute can find the database field it maps to. Result merging of single-concept full-binding queries can be accomplished by performing a union operation on all query result sets.
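A minimal sketch of this union-style merge, assuming each wrapper returns its rows as a list of dictionaries with identical columns:

```python
def merge_full_binding(result_sets):
    """Single-concept full binding: set union of the row sets of all data sources.

    Rows are deduplicated on their full content, matching union semantics.
    """
    seen, merged = set(), []
    for rows in result_sets:
        for row in rows:
            fingerprint = tuple(sorted(row.items()))  # hashable identity of the row
            if fingerprint not in seen:
                seen.add(fingerprint)
                merged.append(row)
    return merged

src1 = [{"taxi_id": 1, "speed": 40}]
src2 = [{"taxi_id": 1, "speed": 40}, {"taxi_id": 2, "speed": 55}]
union = merge_full_binding([src1, src2])
```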
Single-concept partial binding means that the from clause of the global query statement contains only one concept but that, when the concepts and attribute names are replaced with database tables and fields according to the mapping rules, at least one attribute cannot find a mapped field in a given table, so those fields must be obtained from other data sources describing the same batch of entities. Result merging of single-concept partial-binding queries involves cross-merging information from multiple data sources, i.e., a Cartesian product of the data sets must be computed to obtain the final merged result set, similar to the join operation of SQL.
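A minimal sketch of the partial-binding merge for two result sets, assuming dictionary rows and a single shared key attribute; restricting the Cartesian product to matching keys makes it behave like an SQL inner join, with the redundant key column from the right-hand set dropped:

```python
def merge_partial_binding(left, right, key):
    """Join two result sets of the same concept on its key attribute."""
    merged = []
    for l in left:
        for r in right:                      # Cartesian product ...
            if l[key] == r[key]:             # ... restricted to matching key values
                row = dict(l)
                # drop the intersecting key column from the right-hand row
                row.update({k: v for k, v in r.items() if k != key})
                merged.append(row)
    return merged

speeds = [{"taxi_id": 1, "speed": 40}]
positions = [{"taxi_id": 1, "position": "A"}]
joined = merge_partial_binding(speeds, positions, "taxi_id")
```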
A multi-concept query is a query whose from clause contains several concepts. Such a query may involve only cross operations between the data sets of the multiple concepts, or it may also include partial-binding cases. When the query involves only the cross operation among the multi-concept data sets, it suffices to compute the Cartesian product of the data sets; when the multi-concept query is mixed with partial binding, the partial-binding results of the sub-queries belonging to the same concept should be merged first, and then the merged results of the different concepts combined.
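The pure cross operation between per-concept result sets can be sketched with a Cartesian product; this assumes the per-concept sets are already merged and that their column names do not collide:

```python
from itertools import product

def merge_multi_concept(datasets):
    """Multi-concept cross operation: Cartesian product of per-concept result sets."""
    merged = []
    for combo in product(*datasets):   # one row drawn from each concept's data set
        row = {}
        for part in combo:
            row.update(part)           # concatenate the rows into one result row
        merged.append(row)
    return merged

taxis = [{"taxi_id": 1}]
roads = [{"road_id": "R1"}, {"road_id": "R2"}]
crossed = merge_multi_concept([taxis, roads])
```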
The processing procedure of the query merging module is as follows:
step 1: collecting data sets from the wrapper module, and directly jumping to the step 3 without merging if only one data set exists; if a plurality of data sets exist, entering the step 2;
step 2: judging the query relationship among a plurality of data sets according to the query tree of the query generated by the query decomposition module, and executing the result merging process of all binding queries, partial binding queries or multi-concept queries;
and step 3: and returning the query result to the user through an interface or an interface mode.
5. Wrapper module
The wrapper module is responsible for further processing the sub-queries sent by the mediator, acquiring the corresponding data from the database, and converting the format of the data. Wrapper modules are divided into SQL-class and non-SQL-class wrapper modules. An SQL-class wrapper module is designed for data sources that can be queried using the SQL language; a non-SQL-class wrapper module is a special wrapper module for NoSQL data sources such as HBase and HDFS that cannot be queried with SQL. The processing of a wrapper module mainly comprises three steps: sub-query conversion, sub-query execution, and format conversion.
The processing procedure of the SQL-class wrapper module is as follows:
step 1: and converting the sub-query. Firstly, redundant data source information contained in each component in the sub-query is removed, then a format conversion function in the global query is analyzed and stored, and finally, an SQL statement which can be directly executed by a data source is combined according to the information. For a plurality of format conversion functions specified in the global query language, a part of the format conversion functions can be converted into SQL embedded functions, and the format conversion is directly carried out by a database; format conversion that is not handled by a partial database is done by the wrapper module itself.
Step 2: the sub-query executes. And connecting the data source according to the configuration information of the data source in the sub-query, and executing the SQL query statement.
Step 3: Format conversion. Receive the query result sent by the data source in step 2; the format conversions that could not be handled by the SQL statement in step 1 are completed here using the format conversion functions built into the wrapper module.
And 4, step 4: and returning the result to a result merging module of the mediator.
The processing procedure of the non-SQL class wrapper module is as follows:
step 1: and converting the sub-query. Similar to the processing procedure of the SQL type wrapper module, the removal of the information of the sub-query redundant data source and the analysis of the format conversion function need to be carried out, but SQL sentences do not need to be generated, and only the analysis result needs to be stored.
Step 2: the sub-query executes. And executing query according to the analysis result to obtain related data, wherein the specific process of the step is related to the type of the data source, the sub-query execution processes of the HBase and the HDFS data source are different, but the query result is stored in the memory.
And step 3: and (5) converting the format. And calling a built-in function in the non-SQL class wrapper module, and converting the query result according to the analysis result in the step 1.
And 4, step 4: and returning the result to a result merging module of the mediator.
The above examples are provided only to describe the present invention and are not intended to limit its scope, which is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention and are intended to fall within its scope.

Claims (2)

1. An ontology-based urban traffic heterogeneous data integration system, characterized in that: the system comprises an ontology-database mapping module, a query decomposition module, a sub-query generation module, a wrapper module, and a query result merging module, wherein the query decomposition module, the sub-query generation module, and the query result merging module form the mediator of the data integration system;
the ontology-database mapping module: responsible for mapping and resolving the urban traffic ontology concepts and attributes involved in the global query to database tables and fields; it comprises two files, namely a global ontology describing knowledge of the urban traffic field, and a mapping rule file between the data sources and the global ontology; the urban traffic global ontology describes the relationships between concepts of the traffic field, and the concepts also carry attributes describing their characteristics, which form the standard vocabulary a user refers to when writing a global query statement; the mapping rule file records the correspondence between the database tables and fields of each urban traffic data source and the global ontology concepts and attributes; given the ontology concepts and attributes involved in a query as input, the module returns the related data sources, database tables, and field names to the query decomposition module by looking up the mapping rules between the data sources and the global ontology stored in the mapping rule file;
the query decomposition module: the global query language used is adapted from the SQL language, with a table mapping function and a format conversion function for trajectory data added to handle trajectory data and various special cases in urban traffic heterogeneous data; the user writes a global query statement in this language, and the query decomposition module parses it to determine the data sources, database tables, and fields the query must access; the parsing is carried out by building a query tree, which records the concepts and attributes in the global query statement and the database tables and fields related to the query, respectively; the query tree is built as follows: parse the select, from, and where clauses of the global query statement, extract the ontology concepts and attributes, table mapping functions, and attribute format conversion functions involved, then build the query tree layer by layer from the root node, resolving the database tables and fields corresponding to the ontology concepts and attributes involved in the query statement by repeatedly calling the ontology-database mapping module until the tree is complete; the module can be called through an external interface of the system, in which case the global query statement must be passed as a parameter; the user can also enter query statements in a text box of the query interface;
a sub-query generation module: generates sub-queries for each traffic data source according to the query tree generated by the query decomposition module, providing the wrapper module corresponding to each data source with all the information required for one query; it traverses the query tree, extracts all data sources involved in the query, reads the corresponding entries in the data source configuration file included in the system, generates for each involved data source a sub-query comprising the sub-query statement, the data source type, and the database connection configuration information, and sends the sub-queries to the wrapper modules for execution;
a wrapper module: each data source is provided with a corresponding wrapper module; after receiving a sub-query from the sub-query generation module, the wrapper module obtains the data specified by the sub-query through three steps of sub-query conversion, sub-query execution, and format conversion, and returns the data to the query result merging module for joining; wrapper modules are divided into SQL-class and non-SQL-class wrapper modules according to the type of their data source; an SQL-class wrapper module first converts the sub-query into code the data source can execute directly, then reads the data source connection configuration information in the sub-query, establishes a connection with the data source, and executes the query, thereby solving the problem of data source system heterogeneity; when the data source returns the query result, the wrapper module first converts the format of the result according to the format conversion functions in the sub-query, thereby solving the problem of format heterogeneity; the working process of a non-SQL-class wrapper module is basically the same as that of an SQL-class wrapper module, except that in the sub-query conversion step the non-SQL-class wrapper module only needs to parse the sub-query and does not generate code directly executable by the data source;
a query result merging module: collects the query results sent by each wrapper module and merges them according to the relationship types among the sub-queries; the module selects different merging strategies according to the type of the global query, which is either a single-concept query or a multi-concept query, the single-concept query being further divided into full binding and partial binding: single-concept full binding means that the from clause of the global query statement contains only one concept and that, when the concepts and attribute names of the global query statement are replaced with database tables and fields according to the mapping rules, every attribute can find the database field it maps to; partial binding means that, when the replacement is carried out, at least one attribute cannot find its mapped database field, so those fields must be obtained from other data sources describing the same batch of entities; a multi-concept query is a query whose from clause contains several concepts; when the global query is a single-concept query, for fully bound sub-queries a union operation on the several result sets yields the new result set; for partial binding, the result sets are joined and the redundant intersecting fields removed to obtain the final result set; when the global query is a multi-concept query, the fully or partially bound results of the sub-queries belonging to the same concept are merged first, and then the merged results of the different concepts are merged again to obtain the final result set; when the query is issued through the external interface of the system, the final result set is returned to the user as a list object; the user can also query through a graphical query interface, in which case the query result is presented to the user as a visual table.
2. An ontology-based urban traffic heterogeneous data integration method, characterized by comprising the following implementation steps:
(1) query decomposition: decompose the select, from, and where clause components of the global query statement, build a query tree describing the query structure, and resolve the global ontology concepts and attributes involved in the three clauses into database tables and fields through the ontology-database mapping rules in the mapping rule file;
(2) sub-query generation: generate sub-queries for each traffic data source according to the query tree built in the query decomposition step, providing the wrapper module corresponding to each data source with all the information required for one query; traverse the query tree, extract all data sources involved in the query, read the corresponding entries in the data source configuration file included in the system, generate for each involved data source a sub-query comprising the sub-query statement, the data source type, and the database connection configuration information, and send the sub-queries to the wrapper modules for execution;
(3) wrapper sub-query execution: obtain the data specified by each sub-query through three steps of sub-query conversion, sub-query execution, and format conversion, and return the data to the query result merging module of the mediator for joining;
for the SQL-class wrapper modules, the sub-queries must first be converted into SQL statements the data source can execute directly, after which a connection with the data source is established and the queries are executed; some of the format conversion functions are embedded into the SQL code during sub-query conversion, so the simpler data format conversions are completed by the database; when the data source returns a query result, the remaining format conversion work is completed by each wrapper module according to the format conversion functions in the sub-query, thereby solving the problem of format heterogeneity;
for the non-SQL-class wrapper modules, the working process is basically the same as that of the SQL-class wrapper modules, except that in the sub-query conversion step only the sub-query needs to be parsed and no code directly executable by the data source is generated; in addition, the format conversion functions are executed in the format conversion step;
(4) query result merging: according to the outcome of query decomposition, complete the result merging task for the three cases of single-concept full binding, single-concept partial binding, and multi-concept query; for single-concept full binding, merging is accomplished by performing a set union on all query result sets; for single-concept partial binding, the information of several data sources must be cross-merged, i.e., a Cartesian product of the data sets is computed to obtain the final merged result set; for a multi-concept query involving only the cross operation between the multi-concept data sets, it suffices to compute the Cartesian product of the data sets; when the multi-concept query is mixed with partial binding, the partial-binding results of the sub-queries belonging to the same concept are merged first, and then the merged results of the different concepts are merged;
(5) query result return: return the query result to the user in the form corresponding to the way the global query statement was input.
CN201710873196.9A 2017-09-25 2017-09-25 Ontology-based urban traffic heterogeneous data integration system and method Active CN107491561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710873196.9A CN107491561B (en) 2017-09-25 2017-09-25 Ontology-based urban traffic heterogeneous data integration system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710873196.9A CN107491561B (en) 2017-09-25 2017-09-25 Ontology-based urban traffic heterogeneous data integration system and method

Publications (2)

Publication Number Publication Date
CN107491561A CN107491561A (en) 2017-12-19
CN107491561B true CN107491561B (en) 2020-05-26

Family

ID=60652999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710873196.9A Active CN107491561B (en) 2017-09-25 2017-09-25 Ontology-based urban traffic heterogeneous data integration system and method

Country Status (1)

Country Link
CN (1) CN107491561B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108456A (en) * 2017-12-28 2018-06-01 重庆邮电大学 A kind of information resources distributed enquiring method based on metadata
CN108460021B (en) * 2018-03-16 2021-10-12 安徽大学 Method for extracting problem method pairs in thesis title
CN110781211B (en) * 2018-07-31 2022-04-05 网宿科技股份有限公司 Data analysis method and device
CN109597837B (en) * 2018-11-29 2023-12-01 深圳前海微众银行股份有限公司 Time sequence data storage method, time sequence data query method and related equipment
CN109710775A (en) * 2018-12-29 2019-05-03 北京航天云路有限公司 A kind of knowledge mapping dynamic creation method based on more rules
CN110209699B (en) * 2019-05-22 2021-04-06 浙江大学 Data interface dynamic generation and execution method based on openEHR Composition template
CN111126015B (en) * 2019-11-25 2023-12-26 金蝶软件(中国)有限公司 Report form compiling method and equipment
CN112163010B (en) * 2020-08-26 2024-04-12 蓝卓数字科技有限公司 Cross-data source query method and device for database
CN112100457A (en) * 2020-09-22 2020-12-18 国网辽宁省电力有限公司电力科学研究院 Multi-source heterogeneous data integration method based on metadata
CN112486592B (en) * 2020-11-30 2024-04-02 成都新希望金融信息有限公司 Distributed data processing method, device, server and readable storage medium
CN112667662B (en) * 2020-12-25 2022-11-08 银盛支付服务股份有限公司 Common method for analyzing sql and nosql based on json key values
CN112685557A (en) * 2020-12-30 2021-04-20 北京久其软件股份有限公司 Visualized information resource management method and device
CN112965989A (en) * 2021-03-04 2021-06-15 浪潮云信息技术股份公司 Main body scattered data query and research and judgment method
CN114490709B (en) * 2021-12-28 2023-03-24 北京百度网讯科技有限公司 Text generation method and device, electronic equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102542027A (en) * 2011-12-22 2012-07-04 北京航空航天大学深圳研究院 Construction method of data integration system for studying ontology based on relation schema
CN104809151A (en) * 2015-03-11 2015-07-29 同济大学 Multi-dimension based traffic heterogeneous data integrating method
CN105183834A (en) * 2015-08-31 2015-12-23 上海电科智能系统股份有限公司 Ontology library based transportation big data semantic application service method


Non-Patent Citations (4)

Title
"Ontology-based Integrated Information Platform for Digital City"; Jun Zhai et al.; 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing; 2008-10-14; pp. 1-4 *
"Ontology-based semantic fusion model in urban traffic information integration" [在城市交通信息集成中基于本体的语义融合模型]; Yang Wangdong et al.; Journal of Transportation Systems Engineering and Information Technology; 2010-04-15; Vol. 10, No. 2; pp. 42-48 *
"Research and application of ontology-based heterogeneous database integration technology" [基于本体的异构数据库集成技术研究与应用]; Gao Wenhao; China Masters' Theses Full-text Database, Information Science and Technology (monthly); 2009-09-15; No. 09; I138-429, chapters 1, 3, 4, and 5 *
"Ontology-based traffic heterogeneous data integration system" [基于本体论的交通异构数据集成系统]; Liu Wentao et al.; Computer Systems & Applications; 2010-03-15; Vol. 19, No. 3; pp. 7-11 *

Also Published As

Publication number Publication date
CN107491561A (en) 2017-12-19

Similar Documents

Publication Publication Date Title
CN107491561B (en) Ontology-based urban traffic heterogeneous data integration system and method
CN106934062B (en) Implementation method and system for querying elastic search
US11907247B2 (en) Metadata hub for metadata models of database objects
CN105989150B (en) A kind of data query method and device based on big data environment
US6795825B2 (en) Database querying system and method
CN110837492B (en) Method for providing data service by multi-source data unified SQL
CN104123288B (en) A kind of data query method and device
US7693812B2 (en) Querying data and an associated ontology in a database management system
US20130006968A1 (en) Data integration system
CN107515887B (en) Interactive query method suitable for various big data management systems
US7487174B2 (en) Method for storing text annotations with associated type information in a structured data store
US20240012810A1 (en) Clause-wise text-to-sql generation
CN102033748A (en) Method for generating data processing flow codes
US9824128B1 (en) System for performing single query searches of heterogeneous and dispersed databases
CN107169033A (en) Relation data enquiring and optimizing method with parallel framework is changed based on data pattern
US11334549B2 (en) Semantic, single-column identifiers for data entries
Chaves-Fraga et al. Enhancing virtual ontology based access over tabular data with Morph-CSV
Iglesias-Molina et al. An ontological approach for representing declarative mapping languages
CN114372174A (en) XML document distributed query method and system
Černjeka et al. NoSQL document store translation to data vault based EDW
Stojanović et al. An overview of data integration principles for heterogeneous databases
CN114238416A (en) Method and system for generating bloodline of FlinkSQL field
Ouaret et al. An overview of XML warehouse design approaches and techniques
Alam et al. Towards a semantic web stack applicable for both RDF and topic maps: a survey
Malik et al. Technique for transformation of data from RDB to XML then to RDF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant