Heterogeneous data fusion method based on ontology mapping
Technical Field
The invention relates to a heterogeneous data fusion method based on ontology mapping, and belongs to the technical field of data processing.
Background
With the continuous development of new technologies such as big data and cloud computing, the volume of data in every field has grown explosively. The data are widely distributed across complex network environments and may exist in many different formats. How to enable interaction between these data has therefore become a crucial problem.
At the end of the 20th century, the concept of the semantic web was proposed. Its main purpose is to process data so that semantic data can be mutually understood at the knowledge level and can therefore be used and analyzed by computers. The ontology is a form of knowledge representation in the semantic web and a key technology for processing data there; it can effectively express the relationships between items of knowledge. However, since data come from different organizations, different ontologies are generated, and the heterogeneity between them hinders data fusion. Ontology mapping makes data sharing and information interaction between different data sources possible: a semantic mapping relationship is created between heterogeneous ontologies, which addresses the difficulty of fusing and sharing data across different fields.
When data are fused and shared, the semantics of all data are generally required to be consistent. A semantic conflict is a semantic inconsistency that arises when two objects describing the same real-world thing differ in description, structure or content. In practice, however, the databases of different data sources are mutually independent and highly autonomous, so they cannot be fully consistent in semantics, which greatly hinders data fusion. In the past, the schema matching required for data fusion was generally carried out by manual identification, manual judgment and manual field matching, and semantic conflicts were commonly resolved by manually eliminating semantic inconsistencies during schema integration. This approach is severely limited, its accuracy is unsatisfactory, and it consumes a large amount of human resources.
Clearly, manually specifying schema matches is a tedious, time-consuming, error-prone and costly process. With the rapid development of information technology and the explosive growth of data sources, increasingly complex databases need to be processed; their schemas are correspondingly large, and the number of matches to be made increases greatly. A faster and less labor-intensive matching method is therefore needed, which requires automated support for the schema matching process.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a heterogeneous data fusion method based on ontology mapping. A metadata dictionary is constructed from the database system, from which a local ontology model is obtained; similarity is then calculated between the ontology under the local schema and the global ontology; and the similarity is used to decide whether fusion is needed, after which the data are mapped and heterogeneous data fusion is realized.
The technical problems mainly solved by the invention are: constructing local ontology models for different data sources, calculating the similarity of the ontologies under the local schema, and mapping the ontologies under the local schema to the global schema according to the similarity.
The invention adopts the following technical scheme:
a heterogeneous data fusion method based on ontology mapping comprises the following steps:
(1) establishing a metadata dictionary for data from different data sources, and then constructing a local ontology model;
(2) performing semantic similarity calculation on the local ontology and the global ontology to obtain similarity;
(3) according to the similarity, mapping the data according to the mapping rules from the local schema to the global schema, eliminating semantic conflicts and realizing heterogeneous data fusion.
In step (1), the data sources are pre-processed to determine the data fields ultimately required, that is, the standard format of the data to be generated; the subsequent work of constructing a local ontology for each data source schema is then carried out.
A data schema is the logical representation of the data in a data source. In a relational database, the schema is the definition of the data tables, including the attribute names of each table, the order of the attributes, the domains of the attributes, and the primary key and foreign key information. Data fusion is a process of schema integration, which integrates different schemas into a uniform form. Because semantic conflicts between ontologies arise during schema integration, a local ontology of each data source schema should be constructed before mapping.
In order to construct a local ontology, for databases of different sources, source data and connection information of various heterogeneous data sources need to be acquired.
Preferably, the constructing the local ontology of the data source schema specifically includes:
for databases from different sources, the source data and connection information of each heterogeneous data source are obtained, including: who provides the database, the name of the database, how many tables it contains, which fields each table contains, the relationships between the tables, the attributes of each field, the relationships between the attributes, the primary and foreign keys of the tables, and other such information.
In the invention, the required information is queried through built-in database statements; for example, as in the prior art, a SHOW DATABASES statement can be used to determine the database names, and a SHOW TABLES statement can be used to obtain the tables contained in a database and hence their number.
The metadata dictionary is stored as a dictionary of key-value pairs. For example, if the key for the database name is defined as dbname, the corresponding value may be user_info; if the key for the number of tables in the database is defined as dbnum, the corresponding value may be 5; and if the key for the relationship between tables is defined as dbrelation, the corresponding value may be one_to_one. Different databases correspond to different dictionary records, which may share the same keys but hold different values.
The metadata dictionary provides the basis for constructing the local ontology, whose purpose is to describe the concepts of the data source clearly and in a standardized way.
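As an illustration of how such a key-value record might be assembled, the following is a minimal sketch rather than the method of the invention. It uses Python's built-in sqlite3 module as a stand-in for a MySQL server, so the SQLite catalogue table sqlite_master and PRAGMA table_info replace the SHOW statements mentioned above. The keys dbname and dbnum come from the text; the function name and the keys dbtables and dbfields are hypothetical.

```python
import sqlite3


def build_metadata_dictionary(conn, dbname):
    """Collect schema facts from a SQLite connection into a key-value record."""
    cur = conn.cursor()
    # sqlite_master plays the role that SHOW TABLES plays in MySQL.
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cur.fetchall()]
    fields = {}
    for table in tables:
        # Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk).
        cur.execute(f'PRAGMA table_info("{table}")')
        fields[table] = [
            {"name": col[1], "type": col[2], "primary_key": bool(col[5])}
            for col in cur.fetchall()
        ]
    return {
        "dbname": dbname,      # key named in the text
        "dbnum": len(tables),  # key named in the text
        "dbtables": tables,    # hypothetical key
        "dbfields": fields,    # hypothetical key
    }


# Illustrative in-memory database; table and column names are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, ipone TEXT)")
record = build_metadata_dictionary(conn, "user_info")
```

Against a MySQL source, the two catalogue queries would be replaced by the corresponding SHOW statements or information_schema queries; the shape of the resulting record would stay the same.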
Preferably, the construction of the local ontology model is performed in two steps, i.e. the data source is analyzed first, and then the local ontology is defined:
the data sources are the different sources of data, and the attributes extracted from each of them form a metadata dictionary. The field content of the metadata dictionary comprises the data element class, the data table attribute class, the relationships among tables, and the description of the data source. The data element class comprises all fields contained in the tables and the attributes corresponding to those fields; the data table attribute class comprises the relationships among the field attributes and the primary and foreign key information of the tables; the description of the data source class comprises the data source address, the source data type, the source data name, the source data provider, and the data element class and data table attribute class in the metadata dictionary;
each record of the metadata dictionary is a local ontology.
After the ontologies have been constructed, semantic conflicts are detected by calculating the semantic similarity between the local ontology and the global ontology. Semantic similarity describes how similar different terms are; once it has been determined, it is used to judge whether a semantic conflict occurs. The semantic similarity is usually a real number between 0 and 1.
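The structure of one such record can be sketched with plain data classes. This is only one possible representation under stated assumptions: the class and attribute names are hypothetical, and the sample values (address, provider, table contents) are placeholders rather than data from the invention.

```python
from dataclasses import dataclass, field


@dataclass
class DataSourceDescription:
    """Description of the data source class: address, type, name, provider."""
    address: str
    data_type: str
    name: str
    provider: str


@dataclass
class LocalOntology:
    """One metadata dictionary record, treated as a local ontology."""
    source: DataSourceDescription
    # Data element class: table -> {field name: field attribute}.
    data_elements: dict = field(default_factory=dict)
    # Data table attribute class: table -> primary/foreign key information.
    table_attributes: dict = field(default_factory=dict)
    # Relationships among tables: (table_a, table_b) -> relation label.
    table_relations: dict = field(default_factory=dict)


# Placeholder values for illustration only.
ontology = LocalOntology(
    source=DataSourceDescription(
        address="localhost:3306",
        data_type="relational",
        name="user_info",
        provider="example_provider",
    ),
    data_elements={"person": {"id": "INTEGER", "name": "TEXT"}},
    table_attributes={"person": {"primary_key": "id", "foreign_keys": []}},
)
```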
Preferably, the step (2) is further:
similarity calculation is realized by establishing a graph convolution network, whose result is used to judge whether a semantic conflict occurs and hence whether the data are to be mapped. Specifically, all data element classes are first selected from the metadata dictionary to obtain the constituent elements of the local schema and the global schema; the constituent elements are taken as nodes of the graph convolution network and the relationships among them as edges. The elements and relationships can be determined by selecting from the metadata dictionary the fields whose keys describe element relationships;
wherein each data source is called a local schema, and the target database generated after fusion is called the global schema;
in the construction process, a database corresponds to an entity in the local ontology, and the relationships between entities correspond to the relations in the local ontology model; these relationships include one-to-one, one-to-many and many-to-many. The database table name, table attributes, field names, field types and attribute value domains of an entity represent the attribute information of the data, while the primary key and foreign key constraints represent the structural information of the data. The attribute information and the structural information are combined as the labels of the graph convolution network, and the characteristics of the local ontology and the global ontology are described through this label information. Ontologies whose probabilities are close are considered similar (this is learned and calculated automatically by the graph convolution network), and mapping from local to global is then carried out.
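The node-and-edge construction described here can be sketched as follows. This is a minimal illustration that assumes a simplified record layout: the key dbname comes from the text, while the keys tables and relations, the function name and the edge labels has_table and has_field are hypothetical.

```python
def build_schema_graph(record):
    """Turn one metadata record into (nodes, edges) for a schema graph.

    Nodes are the constituent elements (database, tables, fields);
    edges are labelled relationships between those elements.
    """
    db = record["dbname"]
    nodes = [db]
    edges = []
    for table, fields in record["tables"].items():
        table_node = f"{db}.{table}"
        nodes.append(table_node)
        edges.append((db, table_node, "has_table"))
        for name in fields:
            field_node = f"{table_node}.{name}"
            nodes.append(field_node)
            edges.append((table_node, field_node, "has_field"))
    # Table-to-table relations such as one_to_one or one_to_many.
    for (left, right), relation in record.get("relations", {}).items():
        edges.append((f"{db}.{left}", f"{db}.{right}", relation))
    return nodes, edges


# Placeholder record for illustration only.
example_record = {
    "dbname": "db1",
    "tables": {"person": ["name", "age"], "contact": ["ipone"]},
    "relations": {("person", "contact"): "one_to_one"},
}
nodes, edges = build_schema_graph(example_record)
```

Building one such graph for each local schema and one for the global schema gives the node and edge sets that a graph convolution network would take as input.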
Preferably, the input layer of the graph convolution network receives the local ontology nodes and the global ontology nodes, each of which carries both structure information and attribute information, with each entity supplied as a vector. The embedding layer embeds the input vectors into low-dimensional dense vectors: embedding the structure information yields the structure vectors of the local and global ontology nodes, and embedding the local ontology attribute information yields an attribute vector that aggregates the attribute information. The structure vector and the attribute vector are concatenated to give the complete vector representation of each local and global ontology node. (Both the embedding and the concatenation are standard operations of a graph convolution network in the prior art; the network learns them automatically without manual operation.) The complete vectors of the local ontology and the global ontology are then passed to the hidden layers for learning. After learning, the output layer outputs, through a sigmoid function, the similarity probability between the local ontology nodes and the global ontology nodes of the different data. If the semantic similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred, and the entities with the semantic conflict are mapped.
The graph convolution network adopted by the invention is prior art and is used as a tool for calculating the similarity: the input passes through the input layer, the embedding layer and n hidden layers, and the semantic similarity is finally output. The inputs and outputs can be defined as required, and the remaining internal operations are completed automatically without manual participation. As this technology is mature, it is not described further here.
Preferably, data mapping proceeds as follows: the similarity is calculated in the relational databases (assume two different data sources, db1 and db2), it is judged whether a semantic conflict exists between db1 and db2, and the entities in the local schema whose similarity is greater than or equal to 0.7, and which therefore have a semantic conflict, are mapped to the global schema according to the following rules:
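Since the text leaves the network architecture open, the following is only a sketch of one common choice: a single graph convolution layer using the symmetrically normalized adjacency with self-loops, followed by a sigmoid over the dot product of two node embeddings. The weights are fixed by hand rather than learned, the toy graph and features are placeholders, and all function names are hypothetical; only the sigmoid output and the 0.7 threshold come from the text.

```python
import math


def normalized_adjacency(n, edges):
    """Return D^(-1/2) (A + I) D^(-1/2) for an undirected graph on n nodes."""
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    for i in range(n):
        a[i][i] = 1.0  # self-loop
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def similarity(h, i, j):
    """Similarity probability of nodes i and j from their embeddings."""
    return sigmoid(sum(p * q for p, q in zip(h[i], h[j])))


# Toy graph: nodes 0-1 form a local entity and its field,
# nodes 2-3 form a global entity and its field.
a_hat = normalized_adjacency(4, [(0, 1), (2, 3)])
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.5], [0.5, 1.0]]  # hand-picked, not trained
embeddings = gcn_layer(a_hat, features, weights)

THRESHOLD = 0.7  # threshold stated in the text
score = similarity(embeddings, 0, 2)
needs_mapping = score >= THRESHOLD
```

In a real system the weights would be learned from labelled pairs and several hidden layers would normally be stacked; the sketch only shows where the sigmoid output and the threshold fit.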
1) when the field names corresponding to an entity attribute are synonyms (different names with the same meaning), the conflict can be eliminated by renaming, and the entity attribute name that appears first is taken as the corresponding attribute name in the global schema after mapping;
for example, to express the field "telephone number", the attribute name is ipone in db1 and iponenumber in db2; since db1 appears earlier than db2, the "ipone" of db1, which appears first, is used as the attribute name in the global schema.
2) When the field names corresponding to an entity attribute are homonyms (the same name with different meanings), the conflict can be eliminated by renaming or by establishing a homonym dictionary of attribute names, and the entity attribute name that appears first is taken as the corresponding attribute name in the global schema after mapping;
for example, if the attribute name "name" appears in both db1 and db2, denoting the person's own name in db1 and the "great name" in db2, and db1 appears earlier than db2, then the meaning in db1 (the person's own name), which appears first, is used as the meaning in the global schema.
3) When the same attribute is given different values in different databases and a conflict arises from the different expressions, the conflict can be resolved through the concept of semantic values: the various expressions that represent the same attribute are unified in the global data so as to standardize the form of expression;
for example, if the attribute age appears in both db1 and db2, with age 16 in db1 and age 20 in db2, and db1 appears earlier than db2, then the value of age in db1, which appears first, is used as the value in the global schema.
Through these mapping rules, the semantic conflicts occurring in heterogeneous data can be eliminated by mapping similar ontologies.
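The three rules share one principle, namely that whatever appears first wins, which can be sketched as below. This is an illustrative sketch, not the mapping algorithm of the invention: the function name, the argument layout and the homonym bookkeeping are assumptions. The field names ipone and iponenumber and the ages 16 and 20 follow the examples in the text, while the remaining values are placeholders.

```python
def fuse_rows(sources, synonyms, meanings):
    """Merge rows from ordered sources into one global row.

    sources:  list of (db_name, row_dict), in order of appearance
    synonyms: later field name -> first-appearing field name (rule 1)
    meanings: (db_name, field) -> meaning, used to spot homonyms (rule 2)
    """
    global_row = {}
    global_meaning = {}
    homonyms = {}
    for db, row in sources:
        for field_name, value in row.items():
            name = synonyms.get(field_name, field_name)  # rule 1: rename
            meaning = meanings.get((db, field_name))
            known = global_meaning.get(name)
            if name in global_meaning and meaning is not None and known not in (None, meaning):
                # Rule 2: same name, different meaning. Keep the first meaning
                # and record the later one in a homonym dictionary.
                homonyms.setdefault(name, []).append((db, meaning))
                continue
            global_meaning.setdefault(name, meaning)
            global_row.setdefault(name, value)  # rule 3: first value wins
    return global_row, homonyms


sources = [
    ("db1", {"name": "A", "age": 16, "ipone": "000"}),        # placeholder values
    ("db2", {"name": "B", "age": 20, "iponenumber": "111"}),  # placeholder values
]
synonyms = {"iponenumber": "ipone"}
meanings = {("db1", "name"): "own name", ("db2", "name"): "great name"}
global_row, homonyms = fuse_rows(sources, synonyms, meanings)
```

Recording the later meaning in a dictionary instead of silently discarding it is a design choice of this sketch; it mirrors the option of establishing a homonym dictionary mentioned in rule 2.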
In the present invention, all the details which are not described in detail can be carried out by using the prior art.
The invention has the beneficial effects that:
the heterogeneous data fusion method based on ontology mapping can perform schema integration on heterogeneous data of multiple data types from different data sources, and eliminates semantic conflicts through similarity calculation and the formulated mapping rules, so that the heterogeneous data are fused into standard data of a uniform type, which greatly facilitates statistical analysis of the data and sharing of information.
In the prior art, similarity is calculated by conventional mathematical formulas, with low accuracy and poor reliability; the invention instead calculates the similarity by means of a graph convolution network.
The application range of the invention comprises the fields of medicine, education, finance, safety and the like which need data integration analysis, and the invention can be used in any type of occasions which need to fuse a large amount of heterogeneous data into standard data, and has wide application prospect.
Drawings
FIG. 1 is a flow chart of a heterogeneous data fusion method based on ontology mapping according to the present invention;
FIG. 2 is a diagram illustrating the data source classes in the present invention;
FIG. 3 is a schematic diagram illustrating the relationship between information in a metadata dictionary according to the present invention;
FIG. 4 is a flow chart illustrating semantic conflict detection according to the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantages of the invention clearer, a detailed description is given below with reference to the accompanying drawings and specific examples, without limitation thereto. Anything not described in detail follows conventional techniques in the art.
Example 1:
a heterogeneous data fusion method based on ontology mapping, as shown in fig. 1-4, includes the following steps:
(1) establishing a metadata dictionary for data from different data sources, and then constructing a local ontology model;
(2) performing semantic similarity calculation on the local ontology and the global ontology to obtain similarity;
(3) according to the similarity, mapping the data according to the mapping rules from the local schema to the global schema, eliminating semantic conflicts and realizing heterogeneous data fusion.
The overall flow of the method is shown in FIG. 1.
Example 2:
a heterogeneous data fusion method based on ontology mapping, differing from Example 1 in that heterogeneous data provided by different fields, different organizations and different personnel are fused. In step (1), the data sources are pre-processed to determine the data fields ultimately required, that is, the standard format of the data to be generated; the subsequent work of constructing a local ontology for each data source schema is then carried out.
A data schema is the logical representation of the data in a data source. In a relational database, the schema is the definition of the data tables, including the attribute names of each table, the order of the attributes, the domains of the attributes, and the primary key and foreign key information. Data fusion is a process of schema integration, which integrates different schemas into a uniform form. Because semantic conflicts between ontologies arise during schema integration, a local ontology of each data source schema should be constructed before mapping.
In order to construct a local ontology, for databases of different sources, source data and connection information of various heterogeneous data sources need to be acquired.
Example 3:
a heterogeneous data fusion method based on ontology mapping, differing from Example 2 in that the local ontology of a data source schema is constructed specifically as follows:
for databases from different sources, the source data and connection information of each heterogeneous data source are obtained, including: who provides the database, the name of the database, how many tables it contains, which fields each table contains, the relationships between the tables, the attributes of each field, the relationships between the attributes, the primary and foreign keys of the tables, and other such information.
In the invention, the required information is queried through built-in database statements; for example, a SHOW DATABASES statement can be used to determine the database names, and a SHOW TABLES statement can be used to obtain the tables contained in a database and hence their number.
The metadata dictionary is stored as a dictionary of key-value pairs. For example, if the key for the database name is defined as dbname, the corresponding value may be user_info; if the key for the number of tables in the database is defined as dbnum, the corresponding value may be 5; and if the key for the relationship between tables is defined as dbrelation, the corresponding value may be one_to_one. Different databases correspond to different dictionary records, which may share the same keys but hold different values.
The metadata dictionary provides the basis for constructing the local ontology, whose purpose is to describe the concepts of the data source clearly and in a standardized way.
The construction of the local ontology model is divided into two steps: the data source is analyzed first, and the local ontology is then defined:
the data sources are the different sources of data, and the attributes extracted from each of them form a metadata dictionary. The field content of the metadata dictionary comprises the data element class, the data table attribute class, the relationships among tables, and the description of the data source. The data element class comprises all fields contained in the tables and the attributes corresponding to those fields; the data table attribute class comprises the relationships among the field attributes and the primary and foreign key information of the tables. The description of the data source class is shown in FIG. 2 and includes the data source address, the source data type, the source data name and the source data provider; the relationships among the data element class, the data table attribute class and the tables in the metadata dictionary are shown in FIG. 3.
Each record of the metadata dictionary is a local ontology.
Example 4:
a heterogeneous data fusion method based on ontology mapping, differing from Example 3 in that, after the ontologies have been constructed, semantic conflicts are detected by calculating the semantic similarity between the local ontology and the global ontology. Semantic similarity describes how similar different terms are; once it has been determined, it is used to judge whether a semantic conflict occurs. The semantic similarity is usually a real number between 0 and 1.
Step (2) is further as follows:
similarity calculation is realized by establishing a graph convolution network, whose result is used to judge whether a semantic conflict occurs and hence whether the data are to be mapped; the semantic conflict detection model is shown in FIG. 4. Specifically, all data element classes are first selected from the metadata dictionary to obtain the constituent elements of the local schema and the global schema; the constituent elements are taken as nodes of the graph convolution network and the relationships among them as edges. The elements and relationships can be determined by selecting from the metadata dictionary the fields whose keys describe element relationships;
wherein each data source is called a local schema, and the target database generated after fusion is called the global schema;
in the construction process, a database corresponds to an entity in the local ontology, and the relationships between entities correspond to the relations in the local ontology model; these relationships include one-to-one, one-to-many and many-to-many. The database table name, table attributes, field names, field types and attribute value domains of an entity represent the attribute information of the data, while the primary key and foreign key constraints represent the structural information of the data. The attribute information and the structural information are combined as the labels of the graph convolution network, and the characteristics of the local ontology and the global ontology are described through this label information. Ontologies whose probabilities are close are considered similar (this is learned and calculated automatically by the graph convolution network), and mapping from local to global is then carried out.
The input layer of the graph convolution network receives the local ontology nodes and the global ontology nodes, each of which carries both structure information and attribute information, with each entity supplied as a vector. The embedding layer embeds the input vectors into low-dimensional dense vectors: embedding the structure information yields the structure vectors of the local and global ontology nodes, and embedding the local ontology attribute information yields an attribute vector that aggregates the attribute information. The structure vector and the attribute vector are concatenated to give the complete vector representation of each local and global ontology node. (Both the embedding and the concatenation are standard operations of a graph convolution network in the prior art; the network learns them automatically without manual operation.) The complete vectors of the local ontology and the global ontology are then passed to the hidden layers for learning. After learning, the output layer outputs, through a sigmoid function, the similarity probability between the local ontology nodes and the global ontology nodes of the different data. If the semantic similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred, and the entities with the semantic conflict are mapped.
The graph convolution network adopted by the invention is prior art and is used as a tool for calculating the similarity: the input passes through the input layer, the embedding layer and n hidden layers, and the semantic similarity is finally output. The inputs and outputs can be defined as required, and the remaining internal operations are completed automatically without manual participation. As this technology is mature, it is not described further here.
Example 5:
a heterogeneous data fusion method based on ontology mapping, differing from Example 4 in that data mapping proceeds as follows: the similarity is calculated in the relational databases (assume two different data sources, db1 and db2), it is judged whether a semantic conflict exists between db1 and db2, and the entities in the local schema whose similarity is greater than or equal to 0.7 are mapped to the global schema according to the following rules:
1) when the field names corresponding to an entity attribute are synonyms (different names with the same meaning), the conflict can be eliminated by renaming, and the entity attribute name that appears first is taken as the corresponding attribute name in the global schema after mapping;
for example, to express the field "telephone number", the attribute name is ipone in db1 and iponenumber in db2; since db1 appears earlier than db2, the "ipone" of db1, which appears first, is used as the attribute name in the global schema.
2) When the field names corresponding to an entity attribute are homonyms (the same name with different meanings), the conflict can be eliminated by renaming or by establishing a homonym dictionary of attribute names, and the entity attribute name that appears first is taken as the corresponding attribute name in the global schema after mapping;
for example, if the attribute name "name" appears in both db1 and db2, denoting the person's own name in db1 and the "great name" in db2, and db1 appears earlier than db2, then the meaning in db1 (the person's own name), which appears first, is used as the meaning in the global schema.
3) When the same attribute is given different values in different databases and a conflict arises from the different expressions, the conflict can be resolved through the concept of semantic values: the various expressions that represent the same attribute are unified in the global data so as to standardize the form of expression;
for example, if the attribute age appears in both db1 and db2, with age 16 in db1 and age 20 in db2, and db1 appears earlier than db2, then the value of age in db1, which appears first, is used as the value in the global schema.
Through these mapping rules, the semantic conflicts occurring in heterogeneous data can be eliminated by mapping similar ontologies.
Example 6:
a heterogeneous data fusion method based on ontology mapping, in which two different data sources db1 and db2 are assumed: db1 has the four fields name, age, address and ipone, and db2 has the four fields name, age, add and iponenumber;
First, the metadata dictionary is constructed. Taking db1 as an example, the four fields of the data source show that the database name is db1, the database relationship is one-to-one, the field names are name, age, address and ipone, and the number of fields is 4. The resulting metadata dictionary record is therefore dbname: db1, dbrelation: one_to_one, dbzd: name, age, address, ipone, dbzdcount: 4; the record for db2 is obtained in the same way. Each data source is a local ontology: db1 corresponds to local ontology 1, db2 corresponds to local ontology 2, and the relationship between the ontologies is one-to-one.
Second, the similarity is calculated: the metadata dictionary is input into the graph convolution network, which learns automatically and outputs the similarity between the ontologies. When the similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred and field matching is performed.
Third, mapping is carried out. The mapping follows the mapping rules, which in this example are as follows:
to express the field "telephone number", the attribute name is ipone in db1 and iponenumber in db2; since db1 appears earlier than db2, the "ipone" of db1, which appears first, is used as the attribute name in the global schema;
if the attribute name "name" appears in both db1 and db2, denoting the person's own name in db1 and the "great name" in db2, and db1 appears earlier than db2, then the meaning in db1 (the person's own name), which appears first, is used as the meaning in the global schema;
if the attribute age appears in both db1 and db2, with age 16 in db1 and age 20 in db2, and db1 appears earlier than db2, then the value of age in db1, which appears first, is used as the value in the global schema.
Through these mapping rules, the semantic conflicts occurring in heterogeneous data can be eliminated by mapping similar ontologies.
It should be noted that the global ontology in the invention may itself initially be a local ontology. That is, semantic similarity is calculated between local ontology 1 and local ontology 2 of this example (local ontology 2 serves as the initial global ontology, so the first data fusion takes place between two local ontologies) to obtain the similarity; the data are then mapped according to the similarity and the mapping rules from the local schema to the global schema, eliminating semantic conflicts and realizing heterogeneous data fusion.
Once this initial fusion between the two local ontologies has been performed, a global ontology is formed, and further local ontologies can then be fused with that global ontology.
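This incremental fusion can be sketched as a fold over the list of local schemas. The sketch makes several assumptions that are not in the text: the function name is hypothetical, the schema listed first seeds the global schema, the synonym pairs stand in for the outcome of the similarity step (including the pairing of add with address), and the third source db3 with its email field is invented purely to show a later fusion.

```python
from functools import reduce


def fuse_schemas(global_fields, local_fields, synonyms):
    """Fold one local schema into the global schema, first-appearing names winning."""
    merged = list(global_fields)
    for name in local_fields:
        canonical = synonyms.get(name, name)  # map a matched field to its first name
        if canonical not in merged:
            merged.append(canonical)
    return merged


db1_fields = ["name", "age", "address", "ipone"]
db2_fields = ["name", "age", "add", "iponenumber"]
db3_fields = ["name", "email"]  # hypothetical third source

# Assumed outcome of the similarity step (pairs scoring at least 0.7).
synonyms = {"add": "address", "iponenumber": "ipone"}

# The first schema seeds the global schema; later schemas are folded in one by one.
global_schema = reduce(
    lambda acc, local: fuse_schemas(acc, local, synonyms),
    [db2_fields, db3_fields],
    db1_fields,
)
```

Because each step only needs the current global schema and one local schema, a new data source can be added later without repeating the earlier fusions.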
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.