CN111858649B - Heterogeneous data fusion method based on ontology mapping - Google Patents

Heterogeneous data fusion method based on ontology mapping

Publication number
CN111858649B
Authority
CN
China
Prior art keywords
data
local
attribute
global
ontology
Legal status
Active
Application number
CN202010779077.9A
Other languages
Chinese (zh)
Other versions
CN111858649A (en)
Inventor
孙留倩
魏玉良
王佰玲
王巍
刘扬
辛国栋
Current Assignee
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Original Assignee
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/25 Fusion techniques

Abstract

The invention relates to a heterogeneous data fusion method based on ontology mapping and belongs to the technical field of data processing. The method first standardizes data fields by building a metadata dictionary. It then calculates similarity with a graph convolution network that learns automatically, which avoids the errors introduced by hand-crafted mathematical similarity calculations and gives higher accuracy. Finally, fields are mapped according to established mapping rules, which avoids inefficient manual screening, yields accurate mapping, and raises the matching degree of the fused data.

Description

Heterogeneous data fusion method based on ontology mapping
Technical Field
The invention relates to a heterogeneous data fusion method based on ontology mapping, and belongs to the technical field of data processing.
Background
With the continuous development of new technologies such as big data and cloud computing, the volume of data in every field has grown explosively. This data is widely distributed across complex network environments and may be stored in many different formats. How to make such data interoperate has therefore become crucial.
At the end of the 20th century the concept of the Semantic Web was proposed. Its main purpose is to process data so that semantic data can be mutually understood at the knowledge level and can be used and analyzed by computers. The ontology is a form of knowledge representation in the Semantic Web and a key technology for processing its data; it can effectively express the relations between pieces of knowledge. However, because data comes from different organizations, different ontologies are produced, and the heterogeneity between them hinders data fusion. Ontology mapping makes data sharing and information exchange between different data sources possible by creating semantic mapping relations between heterogeneous ontologies, which addresses the difficulty of fusing and sharing data across fields.
Fusing and sharing data generally requires all data semantics to be consistent. A semantic conflict is a semantic inconsistency caused by differences in description, structure, or content when two objects describe the same real-world thing. In practice, the databases of different data sources are independent of each other and highly autonomous, so they cannot be fully consistent in semantics, which greatly complicates data fusion. In the past, the schema matching required for data fusion was generally done by manual identification, manual judgment, and manual field matching, and semantic conflicts were commonly resolved by manually eliminating inconsistencies during schema integration. This approach is severely limited, insufficiently accurate, and wastes a large amount of human resources.
Manual schema matching is clearly tedious, time consuming, error prone, and costly. With the rapid development of information technology and the explosive growth of data sources, increasingly complex databases must be processed, their schemas are large, and the number of elements to match has grown greatly. A faster and less labor-intensive matching method is therefore needed, one that supports the schema matching process automatically.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a heterogeneous data fusion method based on ontology mapping. A metadata dictionary is built from the state of each database system and a local ontology model is derived from it. The similarity between each ontology in the local mode and the global ontology is then calculated, the fusion situation is judged from that similarity, and the data is mapped to achieve heterogeneous data fusion.
The main technical problems solved by the invention are: constructing local ontology models for different data sources, calculating the similarity of ontologies in the local mode, and mapping ontologies from the local mode to the global mode according to that similarity.
The invention adopts the following technical scheme:
A heterogeneous data fusion method based on ontology mapping comprises the following steps:
(1) establishing a metadata dictionary for data from different data sources, and then constructing a local ontology model;
(2) performing semantic similarity calculation on the local ontology and the global ontology to obtain a similarity;
(3) according to the similarity, mapping the data from the local mode to the global mode according to the mapping rules, eliminating semantic conflicts and achieving heterogeneous data fusion.
In step (1), each data source is preprocessed to determine the data fields that are finally needed, that is, the standard format of the data to be generated; the subsequent work then constructs a local ontology for the data source schema.
A data schema is the logical representation of data in a data source. In a relational database, the schema is the definition of a data table, including the attribute names of the table, the order of the attributes, the domains of the attributes, and the primary key and foreign key information. Data fusion is a process of schema integration, which merges different schemas into a uniform form. Because ontologies can conflict semantically during schema integration, a local ontology should be constructed for each data source schema before mapping.
In order to construct a local ontology, for databases of different sources, source data and connection information of various heterogeneous data sources need to be acquired.
Preferably, constructing the local ontology of the data source schema specifically includes:
for databases from different sources, obtaining the source data and connection information of each heterogeneous data source, including who provides the database, the name of the database, how many tables it contains, which fields each table contains, the relations between tables, the attributes of each field, the relations between attributes, the primary and foreign keys of each table, and other information.
In the invention, the required information is queried through built-in functions. For example, in the prior art, a SHOW DATABASES statement can be used to determine the database name, and a SHOW TABLES statement can be used to obtain the tables contained in the database and hence their number.
The metadata dictionary is stored as a dictionary of key-value pairs. For example, if the key for the database name is dbname, the corresponding value might be user_info; if the key for the number of tables in the database is dbnum, the value might be 5; and if the key for the relationship between tables is dbrelation, the value might be one_to_one. Different databases correspond to different dictionary records, which may share the same keys but hold different values.
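The key-value record described above can be sketched in Python. The text mentions MySQL-style SHOW DATABASES and SHOW TABLES statements; this sketch swaps in SQLite's `sqlite_master` catalog and `PRAGMA table_info` from the Python standard library so that it runs without a database server. The function name `build_metadata_dictionary`, the `fields` key, and the choice to pass `dbname` and `dbrelation` in by hand are assumptions made for illustration, not part of the patent.

```python
import sqlite3


def build_metadata_dictionary(conn, dbname, dbrelation):
    """Collect schema facts from an open SQLite connection into a key-value record.

    dbname and dbrelation are supplied by the caller (assumption): an in-memory
    SQLite database has no name to query, and the table relationship is not
    stored in the catalog.
    """
    cur = conn.cursor()
    # sqlite_master lists every table; it plays the role of SHOW TABLES here.
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cur.fetchall()]
    fields = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        cur.execute(f'PRAGMA table_info("{table}")')
        fields[table] = [(col[1], col[2]) for col in cur.fetchall()]
    return {
        "dbname": dbname,
        "dbnum": len(tables),
        "dbrelation": dbrelation,
        "fields": fields,  # hypothetical key, not named in the source
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE contacts (user_id INTEGER, ipone TEXT)")
record = build_metadata_dictionary(conn, "user_info", "one_to_one")
print(record["dbname"], record["dbnum"], record["dbrelation"])
```

Against a MySQL source the two catalog queries would be replaced by the SHOW statements named in the text; the shape of the resulting record stays the same.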
The construction of the metadata dictionary provides a basis for constructing the local ontology, and the local ontology is constructed only in order to clearly and canonically describe the concept of the data source.
Preferably, the local ontology model is constructed in two steps: the data source is analyzed first, and the local ontology is then defined:
the data sources here are the different source databases, and the attributes extracted from each of them form a metadata dictionary. The fields of the metadata dictionary comprise the data element class, the data table attribute class, the relations among tables, and the description of the data source. The data element class contains all fields in the tables and the attributes of those fields. The data table attribute class covers the relations among field attributes and the primary and foreign key information of the tables. The data source description class comprises the data source address, the source data type, the source data name, and the source data provider, together with the data element class and the data table attribute class in the metadata dictionary;
each metadata dictionary record is a local ontology.
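One way to picture a metadata dictionary record as a local ontology is a small set of typed containers, one per class named above. This is only a sketch: the class names, attribute names, and the sample address and provider values are hypothetical and chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class DataSourceDescription:
    """Data source description class: address, type, name, provider."""
    address: str
    data_type: str
    name: str
    provider: str


@dataclass
class DataElement:
    """Data element class: one field of one table and its attribute (type)."""
    table: str
    field_name: str
    field_type: str


@dataclass
class TableAttributes:
    """Data table attribute class: key information of one table."""
    table: str
    primary_key: list
    foreign_keys: dict  # column name -> "other_table.column"


@dataclass
class LocalOntology:
    """One metadata dictionary record, treated as one local ontology."""
    source: DataSourceDescription
    elements: list = field(default_factory=list)
    table_attributes: list = field(default_factory=list)
    table_relations: dict = field(default_factory=dict)

    def field_names(self):
        return [e.field_name for e in self.elements]


ontology = LocalOntology(
    source=DataSourceDescription("localhost/db1", "relational", "db1", "org_a"),
    elements=[
        DataElement("person", "name", "TEXT"),
        DataElement("person", "age", "INTEGER"),
    ],
    table_attributes=[TableAttributes("person", ["name"], {})],
    table_relations={"person": "one_to_one"},
)
print(ontology.field_names())
```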
After the ontology has been constructed, semantic conflicts are detected by calculating the semantic similarity between the local ontology and the global ontology. Semantic similarity describes how similar different terms are and is usually a real number between 0 and 1; once it is known, whether a semantic conflict occurs can be decided.
Preferably, step (2) further comprises:
calculating similarity with a graph convolution network, judging whether a semantic conflict occurs, and deciding accordingly whether the data is to be mapped. Specifically, all data element classes are first selected from the metadata dictionary to obtain the constituent elements of the local mode and the global mode. The constituent elements serve as nodes in the graph convolution network and the relations among them serve as edges; the elements and relations can be determined by selecting from the metadata dictionary the fields whose keys describe element relations;
here the different data sources are called local modes, and the target database generated after fusion is called the global mode;
during construction, a database corresponds to an entity in the local ontology, and the relationships between entities correspond to the relations in the local ontology model; these relationships are one-to-one, one-to-many, or many-to-many. The database table name, table attributes, field names, field types, and attribute value fields of an entity represent the attribute information of the data, while the primary key and foreign key constraints represent its structural information. The attribute information and structural information are combined as the labels of the graph convolution network, and these labels describe the characteristics of the local and global ontologies. Ontologies with close probabilities are regarded as similar (the graph convolution network learns and computes this automatically) and are mapped from local to global.
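The node-and-edge construction described above can be sketched as an adjacency matrix built from element names and their relations. The helper `build_graph` and the `db1.field` naming scheme are assumptions for illustration; the field names are borrowed from the db1 example used elsewhere in this document.

```python
def build_graph(elements, relations):
    """Return (index, adjacency) for an undirected graph.

    elements:  node names, e.g. "db1" or "db1.name"
    relations: (node_a, node_b) pairs taken from the metadata dictionary
    """
    index = {name: i for i, name in enumerate(elements)}
    n = len(elements)
    adjacency = [[0] * n for _ in range(n)]
    for a, b in relations:
        i, j = index[a], index[b]
        adjacency[i][j] = 1
        adjacency[j][i] = 1  # relations are treated as symmetric in this sketch
    return index, adjacency


# The database entity is linked to each of its fields.
local_elements = ["db1", "db1.name", "db1.age", "db1.address", "db1.ipone"]
relations = [("db1", f) for f in local_elements[1:]]
index, adjacency = build_graph(local_elements, relations)
print(adjacency[index["db1"]])  # [0, 1, 1, 1, 1]
```

The same helper can be called with the global mode's elements appended to the list, so that local and global nodes share one graph.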
Preferably, the input layer of the graph convolution network takes local ontology nodes and global ontology nodes, each of which carries structure information and attribute information, with each entity supplied as a vector. The embedding layer embeds the input vectors into low-dimensional dense vectors to obtain the structure vectors of the local and global ontology nodes; embedding the local ontology attribute information likewise yields an attribute vector that aggregates the attribute information. The embedded structure vector and attribute vector are concatenated to give the complete vector representation of each local and global ontology node. (Embedding into low-dimensional dense vectors and concatenating them into a complete node representation are basic principles of graph convolution networks and belong to the prior art; the network learns them automatically without manual operation.) The complete vectors of the local and global ontologies are then passed to the hidden layers for learning. After learning, the output layer outputs the similarity probability of the local and global ontology nodes of the different data through a sigmoid function. If the semantic similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred, and the entity with the semantic conflict is mapped.
The graph convolution network used by the invention is prior art and serves as a tool for calculating similarity: the semantic similarity is output after the input layer, the embedding layer, and n hidden layers. The input and output can be defined as required, and the remaining internal operations are completed automatically without manual participation. Because this prior art is mature, it is not described further here.
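The patent treats the graph convolution network as an off-the-shelf tool and gives no formulas, so the following is only a minimal forward-pass sketch under stated assumptions. It uses the widely used symmetric normalization D^-1/2 (A + I) D^-1/2 followed by ReLU for one hidden layer, random untrained weights, and a sigmoid over the dot product of two node vectors as the similarity score; the scoring function, the toy graph, and the feature values are all assumptions. Only the sigmoid output and the 0.7 threshold come from the text.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops added."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_norm, h, w):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def similarity(vec_a, vec_b):
    """Sigmoid of the dot product (assumed scoring function), a value in (0, 1)."""
    return sigmoid(sum(x * y for x, y in zip(vec_a, vec_b)))


THRESHOLD = 0.7  # similarity at or above this value is treated as a semantic conflict

# Toy graph: node 0 = local ontology node, node 1 = global ontology node,
# node 2 = a field both are connected to.
adj = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
# Each row concatenates a structure vector and an attribute vector (made-up values).
features = [[1.0, 0.0, 0.5, 0.5], [1.0, 0.0, 0.4, 0.6], [0.0, 1.0, 0.5, 0.5]]
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # untrained

hidden = gcn_layer(normalize_adjacency(adj), features, weights)
score = similarity(hidden[0], hidden[1])
print(score, score >= THRESHOLD)
```

Because the weights are random, the printed score only demonstrates the data flow; a usable model would train the weights on labeled pairs of matching and non-matching ontology nodes.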
Preferably, data mapping proceeds by calculating similarity in the relational databases (assume two different data sources, db1 and db2), judging whether semantic conflicts exist (between db1 and db2), and mapping the entities in the local mode whose similarity is greater than or equal to 0.7, that is, those with semantic conflicts, to the global mode according to the following rules:
1) when the field names corresponding to an entity attribute are synonyms (different names for the same thing), the conflict can be eliminated by renaming, and the attribute name that appears first is used as the corresponding attribute name in the global mode after mapping;
for example, to express the field "telephone number", the attribute name is ipone in db1 and iponenumber in db2; db1 appears earlier than db2, so "ipone" from db1 is used as the attribute name in the global mode.
2) when the field names corresponding to an entity attribute are homonyms (the same name with different meanings), the conflict can be eliminated by renaming or by building a homonym dictionary of attribute names, and the attribute name that appears first is used as the corresponding attribute name in the global mode after mapping;
for example, if an attribute called name appears in both db1 and db2, where name denotes a person's own name in db1 and the "great name" in db2, and db1 appears earlier than db2, then the meaning from db1 (the person's own name) is used as the meaning in the global mode.
3) when the same attribute is given different values in different databases and the conflict arises from different expressions, it can be resolved through the concept of semantic values: the various expressions of the same attribute are unified in the global data so that the form of expression is standardized;
for example, if the attribute age appears in both db1 and db2, with age 16 in db1 and age 20 in db2, and db1 appears before db2, then the value of age from db1, which appears first, is used as the value in the global mode.
Through these mapping rules, the semantic conflicts that occur in heterogeneous data can be eliminated by mapping similar ontologies.
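The three rules share one idea, first appearance wins, and can be sketched as a single merge function. The function `map_to_global`, the synonym table, the homonym meaning table, and the sample name and phone values are hypothetical; the attribute names ipone and iponenumber and the ages 16 and 20 are taken from the examples above.

```python
def map_to_global(sources, synonyms, homonym_meanings):
    """Merge records from several sources, earliest source first.

    sources:          list of (source_name, record) in order of appearance
    synonyms:         attribute name -> synonym-group id
    homonym_meanings: (source_name, attribute) -> meaning label
    """
    global_record = {}
    global_meaning = {}
    canonical = {}  # synonym-group id -> first attribute name seen
    for source_name, record in sources:
        for attr, value in record.items():
            group = synonyms.get(attr, attr)
            # Rule 1: synonyms collapse onto the name that appeared first.
            name = canonical.setdefault(group, attr)
            if name not in global_record:
                global_record[name] = value
                global_meaning[name] = homonym_meanings.get((source_name, attr))
            # Rules 2 and 3: a later meaning or value never overwrites the first.
    return global_record, global_meaning


db1 = {"name": "A", "age": 16, "ipone": "111"}  # name and phone values are made up
db2 = {"name": "B", "age": 20, "iponenumber": "222"}
synonyms = {"ipone": "phone", "iponenumber": "phone"}
meanings = {("db1", "name"): "own name", ("db2", "name"): "great name"}

merged, merged_meaning = map_to_global([("db1", db1), ("db2", db2)], synonyms, meanings)
print(merged)  # {'name': 'A', 'age': 16, 'ipone': '111'}
print(merged_meaning["name"])  # own name
```

Swapping the order of the two sources makes db2's names, meanings, and values win instead, which is the behavior the first-appearance rule implies.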
In the present invention, anything not described in detail can be carried out using the prior art.
The invention has the beneficial effects that:
the heterogeneous data fusion method based on ontology mapping can perform schema integration on heterogeneous data of multiple data types from different data sources and eliminates semantic conflicts through similarity calculation and a defined mapping algorithm, thereby fusing the heterogeneous data into standard data of a single type, which greatly facilitates statistical analysis of the data and sharing of information.
In the prior art, similarity is calculated by conventional mathematical formulas, which gives low accuracy and poor reliability.
The invention applies to fields that need integrated data analysis, such as medicine, education, finance, and security, and can be used wherever a large amount of heterogeneous data must be fused into standard data, so it has broad application prospects.
Drawings
FIG. 1 is a flow chart of a heterogeneous data fusion method based on ontology mapping according to the present invention;
FIG. 2 is a diagram illustrating the data source classes in the present invention;
FIG. 3 is a schematic diagram illustrating the relationship between information in a metadata dictionary according to the present invention;
FIG. 4 is a flow chart illustrating semantic conflict detection according to the present invention.
Detailed Description
To make the technical problems, technical solutions, and advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific examples. The invention is not limited to these examples, and anything not described in detail follows conventional techniques in the art.
Example 1:
A heterogeneous data fusion method based on ontology mapping, as shown in FIGS. 1-4, includes the following steps:
(1) establishing a metadata dictionary for data from different data sources, and then constructing a local ontology model;
(2) performing semantic similarity calculation on the local ontology and the global ontology to obtain a similarity;
(3) according to the similarity, mapping the data from the local mode to the global mode according to the mapping rules, eliminating semantic conflicts and achieving heterogeneous data fusion.
The overall flow of the method is shown in FIG. 1.
Example 2:
A heterogeneous data fusion method based on ontology mapping, differing from Example 1 in that heterogeneous data provided by different fields, organizations, and personnel is fused. In step (1), each data source is preprocessed to determine the data fields that are finally needed, that is, the standard format of the data to be generated; the subsequent work then constructs a local ontology for the data source schema.
A data schema is the logical representation of data in a data source. In a relational database, the schema is the definition of a data table, including the attribute names of the table, the order of the attributes, the domains of the attributes, and the primary key and foreign key information. Data fusion is a process of schema integration, which merges different schemas into a uniform form. Because ontologies can conflict semantically during schema integration, a local ontology should be constructed for each data source schema before mapping.
In order to construct a local ontology, for databases of different sources, source data and connection information of various heterogeneous data sources need to be acquired.
Example 3:
A heterogeneous data fusion method based on ontology mapping, differing from Example 2 in that the local ontology of the data source schema is constructed specifically as follows:
for databases from different sources, the source data and connection information of each heterogeneous data source are obtained, including who provides the database, the name of the database, how many tables it contains, which fields each table contains, the relations between tables, the attributes of each field, the relations between attributes, the primary and foreign keys of each table, and other information.
In the invention, the required information is queried through built-in functions. For example, a SHOW DATABASES statement can be used to determine the database name, and a SHOW TABLES statement can be used to obtain the tables contained in the database and hence their number.
The metadata dictionary is stored as a dictionary of key-value pairs. For example, if the key for the database name is dbname, the corresponding value might be user_info; if the key for the number of tables in the database is dbnum, the value might be 5; and if the key for the relationship between tables is dbrelation, the value might be one_to_one. Different databases correspond to different dictionary records, which may share the same keys but hold different values.
The construction of the metadata dictionary provides the basis for constructing the local ontology, whose only purpose is to describe the concepts of the data source clearly and canonically.
The local ontology model is constructed in two steps: the data source is analyzed first, and the local ontology is then defined:
the data sources here are the different source databases, and the attributes extracted from each of them form a metadata dictionary. The fields of the metadata dictionary comprise the data element class, the data table attribute class, the relations among tables, and the description of the data source. The data element class contains all fields in the tables and the attributes of those fields. The data table attribute class covers the relations among field attributes and the primary and foreign key information of the tables. The data source description class, shown in FIG. 2, comprises the data source address, the source data type, the source data name, and the source data provider. The relationship among the data element class, the data table attribute class, and the tables in the metadata dictionary is shown in FIG. 3.
Each metadata dictionary record is a local ontology.
Example 4:
A heterogeneous data fusion method based on ontology mapping, differing from Example 3 in that, after the ontology has been constructed, semantic conflicts are detected by calculating the semantic similarity between the local ontology and the global ontology. Semantic similarity describes how similar different terms are and is usually a real number between 0 and 1; once it is known, whether a semantic conflict occurs can be decided.
Step (2) further comprises:
calculating similarity with a graph convolution network, judging whether a semantic conflict occurs, and deciding accordingly whether the data is to be mapped; the semantic conflict detection model is shown in FIG. 4. Specifically, all data element classes are first selected from the metadata dictionary to obtain the constituent elements of the local mode and the global mode. The constituent elements serve as nodes in the graph convolution network and the relations among them serve as edges; the elements and relations can be determined by selecting from the metadata dictionary the fields whose keys describe element relations;
here the different data sources are called local modes, and the target database generated after fusion is called the global mode;
during construction, a database corresponds to an entity in the local ontology, and the relationships between entities correspond to the relations in the local ontology model; these relationships are one-to-one, one-to-many, or many-to-many. The database table name, table attributes, field names, field types, and attribute value fields of an entity represent the attribute information of the data, while the primary key and foreign key constraints represent its structural information. The attribute information and structural information are combined as the labels of the graph convolution network, and these labels describe the characteristics of the local and global ontologies. Ontologies with close probabilities are regarded as similar (the graph convolution network learns and computes this automatically) and are mapped from local to global.
The input layer of the graph convolution network takes local ontology nodes and global ontology nodes, each of which carries structure information and attribute information, with each entity supplied as a vector. The embedding layer embeds the input vectors into low-dimensional dense vectors to obtain the structure vectors of the local and global ontology nodes; embedding the local ontology attribute information likewise yields an attribute vector that aggregates the attribute information. The embedded structure vector and attribute vector are concatenated to give the complete vector representation of each local and global ontology node. (Embedding into low-dimensional dense vectors and concatenating them into a complete node representation are basic principles of graph convolution networks and belong to the prior art; the network learns them automatically without manual operation.) The complete vectors of the local and global ontologies are then passed to the hidden layers for learning. After learning, the output layer outputs the similarity probability of the local and global ontology nodes of the different data through a sigmoid function. If the semantic similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred, and the entity with the semantic conflict is mapped.
The graph convolution network used by the invention is prior art and serves as a tool for calculating similarity: the semantic similarity is output after the input layer, the embedding layer, and n hidden layers. The input and output can be defined as required, and the remaining internal operations are completed automatically without manual participation. Because this prior art is mature, it is not described further here.
Example 5:
A heterogeneous data fusion method based on ontology mapping, which differs from embodiment 4 in that data mapping proceeds as follows: the similarity is first calculated over the relational databases (assume two different data sources, db1 and db2), it is then judged whether a semantic conflict exists between db1 and db2, and entities in the local schema whose similarity is greater than or equal to 0.7 are mapped to the global schema. During mapping, the local schema is mapped to the global schema according to the following rules:
1) when the field names corresponding to entity attributes are synonyms (different names with the same meaning), the conflict can be eliminated by renaming, and the entity attribute name that appears first is used as the corresponding attribute name in the global schema after mapping;
for example, the field "telephone number" has the attribute name ipone in db1 and iponenumber in db2; because db1 appears earlier than db2, "ipone" from db1 is used as the attribute name in the global schema.
2) When the field names corresponding to entity attributes are homonyms (the same name with different meanings), the conflict can be eliminated by renaming or by establishing a homonym dictionary of attribute names, and the entity attribute name that appears first is used as the corresponding attribute name in the global schema after mapping;
for example, the attribute name "name" appears in both db1 and db2, but it denotes a person's own name in db1 and the "great name" in db2; because db1 appears earlier than db2, the meaning from db1 (a person's own name) is used as the meaning in the global schema.
3) When the same attribute is given different values in different databases and a conflict arises from the differing expressions, the conflict can be resolved through the concept of semantic values: the various expressions that represent the same attribute are unified in the global data into a standard form of expression;
for example, the attribute age appears in both db1 and db2, with the value 16 in db1 and 20 in db2; because db1 appears earlier than db2, the value of age from db1 is used as the value in the global schema.
Through these mapping rules, the semantic conflicts that occur in heterogeneous data can be eliminated by mapping similar ontologies.
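The three rules can be sketched in a few lines. This is a simplified illustration, not the patented procedure: the function name `map_to_global`, the `synonyms` table, and the sample name and telephone values are hypothetical, and homonym and value conflicts are both reduced to "the source that appears first wins".

```python
def map_to_global(sources, synonyms):
    """Fuse records into one global record.

    sources  -- list of (source_name, record) pairs, earliest source first
    synonyms -- maps a local field name to the name that appeared first
    """
    global_record = {}
    provenance = {}  # which source supplied each global attribute
    for source_name, record in sources:
        for field, value in record.items():
            canonical = synonyms.get(field, field)   # rule 1: rename synonyms
            if canonical not in global_record:       # rules 2 and 3: first appearance wins
                global_record[canonical] = value
                provenance[canonical] = source_name
    return global_record, provenance

# Ages 16 and 20 follow the example in the text; the other values are made up.
db1 = {"name": "Alice", "age": 16, "ipone": "555-0100"}
db2 = {"name": "A. Smith", "age": 20, "iponenumber": "555-0199"}
fused, origin = map_to_global([("db1", db1), ("db2", db2)], {"iponenumber": "ipone"})
```

Because db1 is listed first, `fused` keeps the attribute name ipone and the db1 values for name and age, while `origin` records that every global attribute came from db1.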
Example 6:
A heterogeneous data fusion method based on ontology mapping. Assume two different data sources, db1 and db2: db1 has the four fields name, age, address and ipone, and db2 has the four fields name, age, add and iponenumber.
First, a metadata dictionary is constructed. Taking db1 as an example, the four fields of the data source show that the database name is db1, the database relationship is one-to-one, the field names are name, age, address and ipone, and the number of fields is 4, so the resulting metadata dictionary record is dbname: db1, dbrelation: one_to_one, dbzd: name, age, address, ipone, dbzdcount: 4; the record for db2 is obtained in the same way. Each data source is a local ontology: db1 corresponds to local ontology 1, db2 corresponds to local ontology 2, and the relationship between the ontologies is one-to-one.
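The metadata dictionary records for the two example sources can be written out as key-value structures. The sketch below is illustrative only: the helper `build_metadata_record` is a hypothetical name, and the keys follow the record layout given in the text (with the database-name key assumed to be `dbname`).

```python
def build_metadata_record(db_name, relation, fields):
    """Build one metadata dictionary record (one local ontology) for a data source."""
    return {
        "dbname": db_name,          # database name
        "dbrelation": relation,     # relationship between databases
        "dbzd": list(fields),       # field names
        "dbzdcount": len(fields),   # number of fields
    }

db1_record = build_metadata_record("db1", "one_to_one", ["name", "age", "address", "ipone"])
db2_record = build_metadata_record("db2", "one_to_one", ["name", "age", "add", "iponenumber"])
```

In practice the field list would come from querying the database catalogue rather than being typed in by hand.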
Second, the similarity is calculated: the metadata dictionary is input into the graph convolution network, which learns the similarity between the ontologies automatically. When the similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred and field matching is performed.
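The threshold test itself is simple to express. In the sketch below the similarity scores are invented for illustration and are not output from a graph convolution network; `select_conflicting_pairs` is a hypothetical helper name.

```python
SIMILARITY_THRESHOLD = 0.7  # threshold stated in the text

def select_conflicting_pairs(scores, threshold=SIMILARITY_THRESHOLD):
    """Return the field pairs whose similarity meets the threshold, sorted for stable output."""
    return sorted(pair for pair, score in scores.items() if score >= threshold)

# Made-up scores for (db1 field, db2 field) pairs.
example_scores = {
    ("ipone", "iponenumber"): 0.91,
    ("address", "add"): 0.84,
    ("name", "age"): 0.12,
}
to_match = select_conflicting_pairs(example_scores)
```

Only the pairs returned in `to_match` would go on to the field matching and mapping step; a pair scoring exactly 0.7 is included because the condition is "greater than or equal to".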
Third, the mapping is carried out. The mapping follows the mapping rules, which are illustrated below for this embodiment:
if the field "phone number" has the attribute name ipone in db1 and iponenumber in db2, and db1 appears earlier than db2, then "ipone" from db1 is used as the attribute name in the global schema;
if the attribute name "name" appears in both db1 and db2 but denotes a person's own name in db1 and the "great name" in db2, and db1 appears earlier than db2, then the meaning from db1 (a person's own name) is used as the meaning in the global schema;
if the attribute age appears in both db1 and db2, with the value 16 in db1 and 20 in db2, and db1 appears earlier than db2, then the value of age from db1 is used as the value in the global schema.
Through these mapping rules, the semantic conflicts that occur in heterogeneous data can be eliminated by mapping similar ontologies.
It should be noted that the global ontology in the present invention may itself start out as a local ontology. In this embodiment, semantic similarity is calculated between local ontology 1 and local ontology 2 (that is, local ontology 2 serves as the initial global ontology, and data fusion is first performed between the two local ontologies) to obtain the similarity; the data is then mapped, according to the similarity and the mapping rules from the local schema to the global schema, to eliminate semantic conflicts and thereby achieve heterogeneous data fusion.
After this initial data fusion between the two local ontologies, a global ontology is formed, and other local ontologies can then be fused with that global ontology.
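The incremental process described here, in which one local ontology seeds the global ontology and the others are folded in one at a time, can be sketched as follows. The sketch makes several assumptions: attribute names are taken to be already aligned to their global names, the similarity check is omitted, conflicts are settled by keeping the value already in the global ontology, and the function names and sample values are hypothetical. Which local ontology seeds the global one is decided by the order of the input list.

```python
def fuse(global_ontology, local_ontology):
    """Keep existing global attributes and add attributes seen for the first time."""
    merged = dict(global_ontology)
    for attribute, value in local_ontology.items():
        merged.setdefault(attribute, value)
    return merged

def fuse_all(local_ontologies):
    """Seed the global ontology with the first local ontology, then fold in the rest."""
    if not local_ontologies:
        raise ValueError("at least one local ontology is required")
    global_ontology = dict(local_ontologies[0])
    for local in local_ontologies[1:]:
        global_ontology = fuse(global_ontology, local)
    return global_ontology

# Made-up attribute values for three local ontologies.
local1 = {"name": "Alice", "age": 16}
local2 = {"age": 20, "address": "Weihai"}
local3 = {"address": "Harbin", "email": "alice@example.com"}
global_result = fuse_all([local1, local2, local3])
```

Each call to `fuse` returns a new dictionary, so the input ontologies are left unchanged and the fusion can be repeated with a different ordering.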
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (2)

1. A heterogeneous data fusion method based on ontology mapping is characterized by comprising the following steps:
(1) establishing a metadata dictionary for data from different data sources, and then constructing a local ontology model;
(2) performing semantic similarity calculation on the local ontology and the global ontology to obtain similarity;
(3) according to the similarity, mapping the data according to the mapping rules from the local schema to the global schema, eliminating semantic conflicts and realizing heterogeneous data fusion;
in step (1), the data source is preprocessed to determine the data fields finally required, and a local ontology of the data source schema is then constructed;
constructing the local ontology of the data source schema specifically comprises the following steps:
for databases from different sources, acquiring the source data and connection information of the various heterogeneous data sources, wherein the source data and its connection information comprise who provides the database, what the name of the database is, what tables the database comprises, what fields the tables comprise, what connections exist among the tables, what attributes each field of the tables has, what connections exist among the attributes, and what the primary and foreign keys of the tables are; this information is obtained by built-in function queries, the queried data is presented in key-value form, and the construction of the metadata dictionary is thereby completed;
the construction of the local ontology model is divided into two steps, namely, first analyzing the data source and then defining the local ontology:
the field content of the metadata dictionary comprises a data element class, a data table attribute class, the relations between tables and a description of the data source, wherein the data element class comprises all fields contained in the tables and the attributes corresponding to the fields, and the data table attribute class refers to the relations among the field attributes and the primary and foreign key information of the tables;
each record of the metadata dictionary is a local ontology;
the step (2) is further as follows:
the similarity calculation is realized by establishing a graph convolution network, and whether a semantic conflict occurs is judged; specifically,
first, all data element classes are selected from the metadata dictionary and the component elements of the local schema and the global schema are obtained; the obtained component elements are used as nodes in the graph convolution network and the relations between the component elements are used as edges, wherein the different data sources are called local schemas and the target library generated after fusion is called the global schema;
in the construction process, a database corresponds to an entity in a local ontology, and the relationships among entities comprise one-to-one, one-to-many and many-to-many relationships; the database table name, table attributes, field name, field type and attribute value domain of the entity embody the attribute information of the data, and the primary key and foreign key constraints embody the structure information of the data; the attribute information and the structure information are combined as the label of the graph convolution network, the characteristics of the local ontology and the global ontology are described through the label information, ontologies with similar probabilities are regarded as similar, and the mapping from local to global is performed;
the input layer of the graph convolution network is composed of local ontology nodes and global ontology nodes, both of which comprise structure information and attribute information, and each entity is input as a vector; the embedding layer embeds the vectors from the input layer into low-dimensional dense vectors to obtain the structure information of the local ontology nodes and the global ontology nodes, and embedding the attribute information of the local ontology yields attribute vectors that aggregate the attribute information; the embedded structure vectors and attribute vectors are concatenated to obtain the complete vector representation of the local ontology nodes and the global ontology nodes; the complete local ontology and global ontology vectors are then fed into the hidden layers for learning; after learning, the output layer outputs, through a sigmoid function, the similarity probability between the local ontology nodes and the global ontology nodes of the different data; if the semantic similarity is greater than or equal to 0.7, a semantic conflict is considered to have occurred, and the entities with the semantic conflict are mapped.
2. The heterogeneous data fusion method based on ontology mapping according to claim 1, wherein during mapping, the local schema is mapped to the global schema according to the following rules:
1) when the field names corresponding to entity attributes are synonyms (different names with the same meaning), the conflict can be eliminated by renaming, and the entity attribute name that appears first is used as the corresponding attribute name in the global schema after mapping;
2) when the field names corresponding to entity attributes are homonyms (the same name with different meanings), the conflict can be eliminated by renaming or by establishing a homonym dictionary of attribute names, and the entity attribute name that appears first is used as the corresponding attribute name in the global schema after mapping;
3) when the same attribute is given different values in different databases and a conflict arises from the differing expressions, the conflict can be resolved through the concept of semantic values, and the various expressions that represent the same attribute are unified in the global data into a standard form of expression.
CN202010779077.9A 2020-08-05 2020-08-05 Heterogeneous data fusion method based on ontology mapping Active CN111858649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010779077.9A CN111858649B (en) 2020-08-05 2020-08-05 Heterogeneous data fusion method based on ontology mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010779077.9A CN111858649B (en) 2020-08-05 2020-08-05 Heterogeneous data fusion method based on ontology mapping

Publications (2)

Publication Number Publication Date
CN111858649A CN111858649A (en) 2020-10-30
CN111858649B true CN111858649B (en) 2022-06-17

Family

ID=72972498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010779077.9A Active CN111858649B (en) 2020-08-05 2020-08-05 Heterogeneous data fusion method based on ontology mapping

Country Status (1)

Country Link
CN (1) CN111858649B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908441A (en) * 2021-03-04 2021-06-04 文华学院 Data processing method and device for medical platform and processing equipment
CN113076366B (en) * 2021-04-09 2023-01-24 南京邮电大学 Intelligent lamp pole virtualization method
CN113360518B (en) * 2021-06-07 2023-03-21 哈尔滨工业大学 Hierarchical ontology construction method based on multi-source heterogeneous data
CN113554063B (en) * 2021-06-25 2024-04-23 西安电子科技大学 Industrial digital twin virtual-real data fusion method, system, equipment and terminal
CN115757655B (en) * 2022-11-14 2023-07-07 中国兵器工业计算机应用技术研究所 Metadata management-based data blood-edge analysis system and method
CN116303392B (en) * 2023-03-02 2023-09-01 重庆市规划和自然资源信息中心 Multi-source data table management method for real estate registration data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385635A (en) * 2011-12-14 2012-03-21 湖南科技大学 Heterogeneous data integration method based on ontology mode
CN104182454A (en) * 2014-07-04 2014-12-03 重庆科技学院 Multi-source heterogeneous data semantic integration model constructed based on domain ontology and method
CN109063114A (en) * 2018-07-27 2018-12-21 华南理工大学广州学院 Heterogeneous data integrating method, device, terminal and the storage medium of energy cloud platform

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9720984B2 (en) * 2012-10-22 2017-08-01 Bank Of America Corporation Visualization engine for a knowledge management system
CN109992784B (en) * 2019-04-08 2021-03-19 北京航空航天大学 Heterogeneous network construction and distance measurement method fusing multi-mode information
CN110781319B (en) * 2019-09-17 2022-06-21 北京邮电大学 Common semantic representation and search method and device for cross-media big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
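Research on the classification and diagnosis of depression based on graph convolutional neural networks (基于图卷积神经网络的抑郁症分类诊断的研究); Wu Yingzhen (吴颖真); China Master's Theses Full-text Database (《中国优秀硕士学位论文全文库》); 2020-06-15; main text, pp. 7-32 *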

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Wei Yuliang, Wang Bailing, Wang Wei, Liu Yang, Xin Guodong, Sun Liuqian
Inventor before: Sun Liuqian, Wei Yuliang, Wang Bailing, Wang Wei, Liu Yang, Xin Guodong