CN104699767B - Large-scale ontology mapping method for the Chinese language - Google Patents

Large-scale ontology mapping method for the Chinese language

Info

Publication number
CN104699767B
CN104699767B (Application CN201510082840.1A)
Authority
CN
China
Prior art keywords
concept
target
source
ontology
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510082840.1A
Other languages
Chinese (zh)
Other versions
CN104699767A (en)
Inventor
王汀
刘经纬
蔡万江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Original Assignee
CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS filed Critical CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority to CN201510082840.1A priority Critical patent/CN104699767B/en
Publication of CN104699767A publication Critical patent/CN104699767A/en
Application granted granted Critical
Publication of CN104699767B publication Critical patent/CN104699767B/en

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a mapping method for large-scale Chinese ontologies. The method includes: a concept initial association degree calculation method that fuses the synonym thesaurus (Tongyici Cilin) with an edit distance similarity algorithm; a pseudo-nuclear force field potential function, improved on the basis of the initial association degree to fuse concept similarity and dissimilarity, which is used to compress the scale of large-scale ontology mapping; and measurement of the similarity between compound concepts in the ontology text by introducing a global sequence alignment algorithm. Because Chinese words exhibit polysemy and word-order sensitivity, and the computational cost of large-scale ontology mapping is very high, the invention first improves the existing pseudo-nuclear force field potential function so that the measurement of similarity between concepts and the scale compression of the ontologies to be mapped become more reasonable. Second, global sequence alignment is used to map complex Chinese concepts, remedying defects of existing Chinese ontology mapping systems and ultimately improving the mapping efficiency, precision, and recall of the system.

Description

Large-scale ontology mapping method for Chinese language
Technical Field
The invention relates to the field of Chinese ontology mapping.
Background
The vision of the semantic Web is to build a Web of Data so that machines can understand semantic information on the network. The ontology, as a core element of the semantic Web, is a formal, standardized description of shared concepts in a specific domain and is the basis for realizing network knowledge sharing and semantic interoperation. At present, because of the heterogeneity between different ontologies, reuse and sharing among ontologies remain difficult.
The task of ontology mapping (Ontology Alignment) is to find the semantic associations between concepts of heterogeneous ontologies. However, for cultural and background reasons, a mature ontology mapping system for ontologies described in Chinese is still lacking. With the development of the semantic web, ontologies and knowledge bases described in Chinese on a large scale are increasingly being constructed and shared, while the construction of Chinese ontology mapping systems is still at an early stage. Therefore, the invention mainly addresses the problem of building a large-scale ontology mapping system for Chinese descriptions.
Researchers at home and abroad have proposed a variety of mapping methods and representative systems. Several typical element-level similarity calculation algorithms based on edit distance and on tokens are listed, and their performance evaluated, in [Cohen W, Ravikumar P, Fienberg S. A comparison of string distance metrics for name-matching tasks [C]. Proceedings of the IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico, 2003: 73-78]. Melnik S et al. [Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: A versatile graph matching algorithm and its application to schema matching [C]. Proceedings of the 18th International Conference on Data Engineering (ICDE). San Jose, California, 2002] proposed Similarity Flooding, which constructs a similarity propagation graph from the concept system of the ontology and propagates and corrects the similarity between concepts. Zhong Qian et al. [Zhong Q, Li H, Li J, Xie G, Tang J, Zhou L, Pan Y. A Gauss function based approach for unbalanced ontology matching [C]. Proceedings of the 28th International Conference on Management of Data (SIGMOD). Rhode Island, USA, 2009: 669-680] developed the RiMOM system, which is based on a multi-strategy mapping mode using features such as ontology instances, concept names and ontology structure and which, by introducing a universal field theory idea, is suitable for large-scale ontology mapping tasks; it nevertheless lacks optimization for the specific linguistic features of Chinese. Giunchiglia F et al. [Giunchiglia F, Yatskevich M. Element level semantic matching [D]. Italy: Dept. of Information and Communication Technology, University of Trento, 2004] proposed a linguistics-based method that introduces a shared knowledge dictionary (e.g., WordNet) and uses linguistic relations for semantic relation discovery. The literature [Isaac A, van der Meij L, Schlobach S, Wang S. An empirical study of instance-based ontology matching [C]. Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference (ISWC/ASWC). Busan, Korea, 2007: 253-266] proposes an instance-level ontology mapping algorithm that measures the similarity between concepts according to the number of common instances of the ontology concepts.
In recent years, research on building large-scale Chinese ontology bases and ontology mapping systems has gradually expanded. Li Jia et al. proposed an element-level concept similarity calculation method based on HowNet and implemented a Chinese ontology mapping system [Li Jia, et al. Research and implementation of Chinese ontology mapping [J]. Journal of Chinese Information Processing, 2007, 21(4): 27-33]. Tian Jiule et al. proposed a Chinese word semantic similarity calculation algorithm based on the synonym forest [Tian Jiule, Zhao Wei. Word similarity computation method based on Tongyici Cilin [J]. Journal of Jilin University (Information Science Edition), 2010, 28(6): 602-608], but the results have not been applied in a semantic web environment. Wang Zhichun et al. [Z. Wang, Z. Wang, J. Li et al. Knowledge extraction from Chinese wiki encyclopedias [J]. Journal of Zhejiang University - Science C, vol. 13, no. 4, pp. 268-280, 2012] proposed extracting the hierarchical relations among concepts from the classification systems of Chinese encyclopedias and obtaining concept attributes and encyclopedia instances from entry pages containing an Infobox; they finally built two large-scale Chinese ontology libraries based on Baidu Encyclopedia and Interactive Encyclopedia and established coreferences between their instances and DBpedia with a simple keyword matching strategy. Niu Xing et al. [Niu X, Sun X, Wang H, et al. Zhishi.me - weaving Chinese linking open data [C]. ISWC 2011. Springer Berlin Heidelberg, 2011: 205-220] semantically integrated Baidu Encyclopedia and Interactive Encyclopedia and developed a semantic web data query application system based on Chinese descriptions. Chen Yidong et al. [Chen Yidong, Chen Liwei, Xu Kun. Learning Chinese entity attributes from online encyclopedia [C]. APWeb 2012] proposed using the attribute-value pair information in Chinese encyclopedia Infoboxes to automatically extract well-formed training samples, then extracting massive knowledge triples from the unstructured texts of the encyclopedias based on a statistical learning model, and finally constructing an open-domain Chinese knowledge base.
The defects of the existing system and the main contributions of the invention are as follows:
1) A new overall framework for a Chinese-oriented large-scale ontology mapping model is provided.
Currently, there is little research on discovering equivalence relations between ontology concepts across semantic data sets in the Chinese environment. In a semantic web environment, as ontologies grow larger and larger, ensuring the efficiency of ontology mapping becomes a problem that urgently needs to be solved. This research therefore proposes a Chinese-oriented framework-level ontology mapping model. First, a multi-strategy fusion method combining edit distance and the synonym forest is adopted to calculate the initial concept similarity between the ontologies to be mapped. Second, the scale of the ontologies to be mapped is compressed based on data field theory, taking the initial concept similarity as input. Finally, according to the Chinese concepts and semantic features contained in encyclopedia knowledge bases, a novel deterministic mapping strategy for equivalence relations between Chinese ontology concepts is provided by introducing the sequence alignment idea from bioinformatics.
2) A new method for compression reduction of large-scale ontology mapping scale is proposed.
Traditional ontology mapping systems and methods usually pay attention only to the mapping result and neglect mapping efficiency, so they are impractical for large-scale ontology mapping tasks. Before deterministic mapping of equivalence relations is carried out on Chinese large-scale ontologies, a new data field potential function is provided to keep the time complexity within an acceptable range; on this basis, the mapping scale of the large-scale ontologies is first reduced by compression. Specifically, improving on the original pseudo-nuclear force field potential function, a new method for measuring the potential value of a data object by jointly computing the semantic similarity and dissimilarity between concepts is provided based on the synonym forest (Extended Version), and a new algorithm for reducing the mapping scale of large-scale ontologies is designed on this basis.
3) A new concept semantic similarity calculation method based on the global pairwise sequence alignment idea from bioinformatics is provided.
The research work of [Zhong Q, Li H, Li J, Xie G, Tang J, Zhou L, Pan Y. A Gauss function based approach for unbalanced ontology matching [C]. Proceedings of the 28th International Conference on Management of Data (SIGMOD). Rhode Island, USA, 2009: 669-680] is currently applicable only to ontologies described in English and their mapping tasks; it lacks support for multilingual ontologies and in particular is not optimized for the characteristics of Chinese ontologies. Meanwhile, the concept similarity calculation methods in traditional Chinese ontology mapping systems do not consider the influence of differences in atomic concept order and of polysemy within combined concepts on the quality of the mapping relation between two combined concepts; by neglecting the important Chinese characteristics of word-order sensitivity and polysemy, they inevitably cause errors in the mapping results. To solve these problems, the method abstracts the equivalence relation discovery of Chinese concepts as a global sequence alignment problem and, based on the idea of dynamic programming, introduces the Needleman-Wunsch global alignment algorithm from bioinformatics to calculate the semantic similarity between combined concepts. Experiments show that the concept global alignment similarity calculation method based on the Needleman-Wunsch algorithm can effectively avoid the erroneous mappings that the traditional methods may produce. Compared with the traditional methods, the new method is more advantageous and more reasonable for large-scale Chinese ontology mapping tasks.
In summary, the number of Chinese large-scale ontologies published on the web is still small and they exhibit great heterogeneity, and existing Chinese ontology mapping systems are inefficient and of low usability when facing large-scale ontology mapping tasks. Meanwhile, a related system designed for the Chinese language and suited to large-scale ontology mapping tasks in a semantic web environment is still lacking. Therefore, the invention designs and implements, based on the synonym forest, a Chinese-oriented large-scale ontology mapping system.
Disclosure of Invention
In the Chinese ontology mapping system, both simple lemmas and unknown words correspond to concepts in the ontologies to be mapped. Therefore, the invention refers to the concept corresponding to a simple lemma as an atomic concept (Atomic Concept, AC) and to the concept corresponding to an unknown word as a combined concept (Combined Concept, CC), and it is agreed that every combined concept is formed by a linear arrangement of several atomic concepts. First, a set of definitions and a formal description of the problem are given:
Definition 1 (Ontology mapping): Given two ontologies to be mapped, O_source and O_target, for a concept C_source in the source ontology O_source, find the concept C_target with the same or closest semantics in the target ontology O_target, thereby defining the mapping function map: O_source → O_target:
For any C_source ∈ O_source, if sim(C_source, C_target) > T, then map(C_source) = C_target.
Here sim(C_source, C_target) is the semantic similarity of the concepts C_source and C_target to be mapped, and T is a threshold: when the semantic similarity of C_source and C_target is greater than T, the pair <C_source, C_target> is taken as a discovered concept mapping pair. In this system any value T ∈ [0.8, 0.9] meets the requirement. The total number of concepts contained in the source ontology O_source is n_source, and the total number contained in the target ontology O_target is n_target.
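As an illustration only, a minimal sketch of this thresholded mapping decision is given below; the names build_mapping and sim are hypothetical, sim stands for whichever of the similarity measures defined later is used, and T is any value in [0.8, 0.9] as stated above.

```python
from typing import Callable, Dict, Iterable, Optional

def build_mapping(source_concepts: Iterable[str],
                  target_concepts: Iterable[str],
                  sim: Callable[[str, str], float],
                  threshold: float = 0.85) -> Dict[str, str]:
    """For each source concept, keep the most similar target concept
    whose semantic similarity exceeds the threshold T (Definition 1)."""
    targets = list(target_concepts)
    mapping: Dict[str, str] = {}
    for c_src in source_concepts:
        best: Optional[str] = None
        best_sim = 0.0
        for c_tgt in targets:
            s = sim(c_src, c_tgt)
            if s > best_sim:
                best, best_sim = c_tgt, s
        if best is not None and best_sim > threshold:
            mapping[c_src] = best
    return mapping
```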
Definition 2 (Semantic knowledge base): For the "synonym forest" Semantic Knowledge Base (SKB), the set SKB_TYCCL consists of atomic concepts, i.e. SKB_TYCCL = {AC_1, AC_2, …, AC_N}, where an element AC_i is an atomic concept of the set SKB_TYCCL and N is the number of lemmas contained in the knowledge base.
Definition 3 (Combined concept): A combined concept CC is composed of an ordered arrangement of a series of atomic concepts; that is, for any combined concept there is an ordered sequence CC = [AC_1, AC_2, …, AC_i, …], where i ≥ 1 is the position of the atomic concept AC_i in the ordered sequence CC. In particular, for every atomic concept AC_i we may write AC_i = [AC_i].
Definition 4: For concepts C_source and C_target in the ontologies O_source and O_target to be mapped, the corresponding ordered sequences CC_source and CC_target have lengths m and n respectively, with m, n ≥ 1.
A large-scale ontology mapping method for the Chinese language, characterized by comprising three steps: calculation of the initial association degree of concepts based on the combination of edit distance and the synonym forest; ontology compression; and deterministic mapping;
(1) Concept initial association degree calculation based on editing distance and synonym forest fusion
a) Edit distance similarity
Given the two ontologies to be mapped, O_source and O_target, for a concept C_source in the source ontology O_source, the concept C_target with the same or closest semantics is sought in the target ontology O_target; for the two concepts C_source and C_target, the edit distance value and the similarity value are characterized by equations (1) and (2):
where |Do(C_source, C_target)| is the number of editing operations between the concepts C_source and C_target to be mapped, i.e. the minimum number of operations required to transform the string C_source completely into the string C_target, with three types of operations: adding, deleting, or modifying a character; L(C_source) and L(C_target) are the character lengths of the concepts to be mapped;
and SIM_E(C_source, C_target) is the similarity of the concepts C_source and C_target to be mapped;
b) Synonym forest similarity
Similarity calculation formula based on synonym forest:
where F = {1, 2, 3, 4, 5}; for F_i ∈ F, F_i is the layer (the i-th layer) at which the sub-codes of the lemmas C_source and C_target differ; |F| denotes the number of elements of the set F and is always equal to 5 in this system; the weight coefficient of the concept similarity is α × (F_i / |F|); n_subTree is the total number of nodes contained under the branch at layer F_i where the sub-codes of the lemmas C_source and C_target differ; D is the coding distance between the lemmas C_source and C_target; any value α ∈ [0.4, 0.5] meets the requirement;
c) Multi-strategy fusion association degree algorithm
First, the similarity results of the two basic algorithms are compared and the maximum of the two results is taken; at the same time, the similarity and dissimilarity between the two concepts C_source and C_target are considered together and superimposed into the final association degree of each concept C_source, C_target. The method defines the maximum value obtained by the two similarity algorithms as ρ, and correspondingly the dissimilarity index is 1 − ρ; clearly ρ ∈ (0, 1], and formula (4) follows:
λ_st denotes the semantic correlation coefficient between the source concept and the target concept;
Finally, the initial association degree of a source ontology concept with respect to the target ontology O_target is obtained, expressed by formula (6);
Because the association degree calculation is symmetric, the initial association degree m_target of a concept C_target in the target ontology is obtained in the same way. When the final initial association value of a concept would otherwise be zero, an initial association factor m_source, m_target ∈ [0.01, 0.05] is used; any value in this interval meets the requirement. In this way the initial association degree sets Map_O_source and Map_O_target of all concepts in the ontologies to be mapped are obtained; an initial association set is uniformly expressed in the form of key-value pairs Map_O = <C, m>;
(2) Ontology compression algorithm
When a large-scale ontology mapping task is faced, the traditional algorithm is difficult to adapt in time or space complexity, so that a corresponding strategy is needed to compress the original ontology to be mapped;
for source body O source Concept set of (1)And a target body O target Concept set ofBy each conceptInitial relevance value ofTo characterize the extent to which the concept affects other concepts,has been given by equation (6); the modified field strength function is shown in equation (8):
taking δ =1, r =2; obtaining an ontology O to be mapped source Each concept inThe potential value function expression of (3), as shown in equation (9):
concepts in a target ontologyPotential value ofObtaining the product in the same way; finally obtaining potential value set PotentialMap _ O of all concepts in the ontology O to be mapped source And postertialmap _ O target (ii) a The set of potential values is collectively defined as a key-value pair:
the concept set in O is divided into two parts, which are called: a candidate area and a culling area;
specifically, for the output key-value pair set Map _ O obtained after executing the multi-strategy fusion association algorithm source And Map _ O target Respectively counting Map _ O according to the relevance value of each concept element source And Map _ O target The total number of concepts with middle relevance value greater than 0.05 is called Range _ Candidate _ O source And Range _ Candidate _ O target The variable is defined as the body O to be mapped source And O target Upper bound of candidate area interval of (a);
for potential value set PotentialMap _ O source And partertialmap _ O target The concept elements in (1) are sorted in a descending order according to the key valuesVariable for its rankingIdentifying; if it is Then the conceptWill be retained as candidate concepts; accordingly, ifThen the conceptWill be eliminated; for the target ontology O, from the symmetry existing between the source ontology and the target ontology target The candidate concept extraction rule of (2) is obtained by the same method;
(3) Deterministic mapping
For any two concepts C_source and C_target in the source ontology O_source and the target ontology O_target to be mapped, three cases occur when the semantic similarity of concepts is calculated:
① C_source and C_target are both atomic concepts, i.e. C_source ∈ SKB_TYCCL and C_target ∈ SKB_TYCCL;
② one of C_source and C_target is an atomic concept and the other is a combined concept, i.e. exactly one of them belongs to SKB_TYCCL;
③ C_source and C_target are both combined concepts, i.e. neither belongs to SKB_TYCCL;
For case ①, the semantic similarity of the two concepts is calculated using formula (3); for cases ② and ③, the system first represents the two word string sequences to be aligned in the form of a scoring matrix, the two sequences serving as the two dimensions of a dynamic programming matrix; for the concepts C_source and C_target in the ontologies O_source and O_target to be mapped, the i-th row of the scoring matrix M corresponds to the atomic concept at position i of the word string sequence CC_source and the j-th column corresponds to the atomic concept at position j of the word string sequence CC_target, where i ≤ m and j ≤ n; the element in row i and column j of the dynamic programming matrix M is denoted M_ij;
First, the penalty factor p = −0.05 of the sequence alignment algorithm is given, and the (m+1)-th row and the (n+1)-th column of the matrix are initialized;
Second, the remaining m × n elements of the matrix are solved recursively using the synonym-forest-based similarity calculation function SIM_T;
The definition of the scoring function f is given first, as shown in equation (11): f scores a pair of atomic concepts with SIM_T and scores any pairing with the gap character "−" with the penalty p;
The recursion rule is shown in equation (12);
Starting from the element M_mn of the matrix, the algorithm traces back until the element M_11 is reached, after which the optimal alignment path is obtained; if more than one optimal alignment path is obtained, one of them is selected;
Finally, gap characters "−" are inserted to obtain the correct global sequence alignment result;
The two combined concept entry sequences to be mapped, after the gap characters have been inserted, are denoted CC_source' and CC_target'; the two sequences then contain the same number of elements, denoted L_cc'; from the alignment result and the scoring function f, the similarity between the combined concepts is given by formula (13): SIM_NW(CC_source', CC_target') = (Σ_{k=1}^{L_cc'} f(CC_source'[k], CC_target'[k])) / L_cc'.
drawings
FIG. 1 is a flow chart of a large-scale Chinese ontology mapping method
FIG. 2(a) Incorrect matching result
FIG. 2(b) Correct alignment result
FIG. 3 example one scoring matrix
FIG. 4 example two scoring matrix
FIG. 5 sequence matching results of example two
Detailed Description
(1) Concept initial association degree calculation based on editing distance and synonym forest fusion
a) Edit distance similarity
When facing the mapping task of the large-scale ontology, the invention proposes to compress the ontology to be mapped first. Specifically, an edit distance algorithm is used to first perform an initial similarity calculation between the concept sets. This is because the efficiency of the algorithm is often considered when performing the initial association calculation, and the accuracy thereof is regarded as a secondary factor.
That is, when the initial association degree of the ontologies to be mapped is computed, the system obtains the literal similarity between concepts through an edit distance algorithm, while semantic relatedness is ignored at this stage. Specifically, for two concepts C_source and C_target, their edit distance value and similarity value can be characterized by equations (1) and (2):
where |Do(C_source, C_target)| is the number of editing operations between the concepts C_source and C_target to be mapped, i.e. the minimum number of operations required to transform the string C_source completely into the string C_target, with three types of operations: adding, deleting, or modifying a character; L(C_source) and L(C_target) are the character lengths of the concepts to be mapped.
SIM_E(C_source, C_target) is the similarity of the concepts C_source and C_target to be mapped.
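For illustration, a minimal sketch of this stage is shown below; the normalization by the longer character length is an assumed reading of equations (1) and (2), which only state that the operation count |Do| and the lengths L are used.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of add/delete/modify operations turning a into b
    (Levenshtein distance, computed by dynamic programming)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a character
                          d[i][j - 1] + 1,        # add a character
                          d[i - 1][j - 1] + cost) # modify or keep
    return d[m][n]

def sim_e(c_source: str, c_target: str) -> float:
    """Edit-distance similarity SIM_E; normalization by the longer
    length is an assumption, not the patented formula."""
    if not c_source and not c_target:
        return 1.0
    return 1.0 - edit_distance(c_source, c_target) / max(len(c_source), len(c_target))
```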
b) Synonym forest similarity
The synonym forest, Tongyici Cilin (TYCCL), is a Chinese synonym dictionary that encodes each word and organizes the entries in a tree structure according to their hierarchical relations; each node in the tree represents a concept. Chinese concept coreference recognition can be abstracted as a similarity calculation problem over Chinese synonyms, so the synonym forest is a natural choice. The system adopts the Tongyici Cilin (Extended Version) of the Harbin Institute of Technology as the common knowledge base for extracting Chinese ontology mapping relations.
The synonym forest organizes the lemmas into a hierarchical structure with 5 layers from top to bottom. Each layer has a corresponding coding mark, and the 5 layers of codes are sequentially arranged from left to right to form a word forest code of a word element. The semantic relatedness implicit between words also increases with the increase of the hierarchy.
The following explains the Cilin coding format, taking the lemma "substance" (Cilin code Ba01a02=) as an example, as shown in Table 1:
TABLE 1 Cilin coding example
According to the structural characteristics of the synonym forest, the Cilin codes of the concepts to be mapped are first parsed and the sub-codes of layers 1 through 5 are extracted; the sub-codes are then compared starting from layer 1. If the sub-codes differ, the mapping pair is given a similarity weight according to the layer at which the difference appears: the deeper the layer at which the sub-codes first differ, the higher the similarity weight, and vice versa. Meanwhile, the number of branch nodes in each layer also influences the similarity.
The similarity calculation formula based on the synonym forest is given as follows:
because the ontology mapping task focuses more on semantic similarity between concepts, the invention introduces adjustment parameters: the semantic relevance factor alpha adjusts the relation of semantic relevance and semantic similarity among concepts at different levels and controls the possible similarity degree among the lemmas at different levels, and obviously alpha belongs to (0, 1). The larger the value of alpha is, the higher the possibility of representing the similarity or equivalence of the lemmas among different levels is, and the influence of the semantic relevance of different levels on the similarity of the final concept is larger, and the influence of the semantic relevance of different levels on the similarity of the final concept is smaller.
Wherein F = {1,2,3,4,5}, forF i Is a word element C source And C target The number of layers represented by different occurrence of subcodes at the ith layer, | F | represents the number of elements in the set F, and is equal to 5 in the system. The weight coefficient of the conceptual similarity is alpha x (F) i /|F|)。n subTree Is a word element C source And C target In the occurrence of sub-codes differing by F i Total number of nodes contained under corresponding branches of the layer, D being the lemma C source And C target The coding distance of (1).
In particular, when the five-level codes of the concept pair to be mapped are equal and the last bit of the word forest code is = |, then the similarity function SIM T Has a return value of 1.0. Obviously, the function SIM T Has a value range of (0, 1)]. The system focuses on obtaining the equivalence relation between the word element concepts, particularly when a Chinese ontology mapping task is faced, the value of alpha is not too high due to the fact that semantic similarity among the concepts is more prominent, and a semantic relevance factor can be taken as alpha belonging to [0.4,0.5 ]]Any random number in between can meet the requirement.
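A simplified sketch of a Cilin-code comparison is shown below for illustration only; the code layout and the scoring are assumptions, and the exact weighting of formula (3) (the terms α × (F_i/|F|), n_subTree and D) is not reproduced.

```python
# Hypothetical Cilin code layout, e.g. "Ba01a02=" split into five layers
# plus a trailing flag; the slice boundaries below are assumptions.
LAYER_SLICES = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]

def first_diff_layer(code_a: str, code_b: str) -> int:
    """Return the 1-based layer at which two Cilin codes first differ,
    or 0 if all five layers coincide."""
    for layer, (lo, hi) in enumerate(LAYER_SLICES, start=1):
        if code_a[lo:hi] != code_b[lo:hi]:
            return layer
    return 0

def sim_t(code_a: str, code_b: str, alpha: float = 0.45) -> float:
    """Simplified stand-in for SIM_T: identical codes ending in '='
    score 1.0; otherwise a deeper first-difference layer scores higher,
    using the stated weight alpha * (F_i / |F|) as a placeholder score."""
    if code_a == code_b and code_a.endswith('='):
        return 1.0
    f_i = first_diff_layer(code_a, code_b)
    if f_i == 0:
        return 1.0
    return alpha * (f_i / 5.0)
```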
c) Multi-strategy fusion association degree algorithm
Because the edit distance similarity algorithm and the synonym forest similarity algorithm have certain complementarity, the similarity result values of the two algorithms are fused.
When data field theory is applied, each concept is regarded as a particle in a field. Previous work usually considers only the attractive force between particles in a physical field and ignores the objective fact that repulsive forces also generally exist between particles, and therefore does not introduce a factor for the influence of repulsion on the association degree. To remedy this drawback, the invention considers both the attraction and the repulsion between the concepts to be mapped: the similarity between concepts is regarded as the attractive force between particles in a field, and the dissimilarity as the repulsive force. Accordingly, in the first algorithm provided by the invention, the similarity results of the two basic algorithms are compared and the maximum of the two results is taken; at the same time, the similarity and dissimilarity between the two concepts C_source and C_target are considered together and superimposed into the final association degree of each concept C_source, C_target. The invention defines the maximum value obtained by the two similarity algorithms as ρ; clearly ρ ∈ (0, 1], and formula (4) follows:
Here λ_st is called the semantic correlation coefficient between the source concept and the target concept. In formula (5), 1 − ρ_st measures the degree of dissimilarity between the two concepts, and log(1 − ρ_st) is a base-10 logarithm; because the logarithm is a strictly monotonically increasing function, the trend of dissimilarity with respect to similarity smoothly reflects the causal relationship between the two. It can be seen that the larger the similarity ρ_st, the smaller the dissimilarity and the larger the value of the adjustment term −ρ_st × log(1 − ρ_st). The initial association degree of a concept obtained in this way takes both the similarity and the dissimilarity between the source and target ontology concepts into account, which makes the result more reasonable. To make equation (5) converge, it is specified that when ρ falls in the interval (0.9, 1], the semantic correlation factor λ_st between the two concepts is set to 1.
Finally, the initial association degree of a concept in the source ontology with respect to the target ontology O_target is obtained, expressed by equation (6).
Because the association computation is symmetric, the initial association degree m_target of a concept C_target in the target ontology is obtained in the same way; the two ontologies to be mapped are completely equal in status, so either one can be regarded as the source ontology, the other naturally being called the target ontology. The initial association value of a concept in the two ontologies is therefore uniquely determined by the interaction between the ontologies and is independent of the names the system gives them. The specific multi-strategy fusion initial association algorithm is described in Algorithm I. To keep the association transition continuous when the final initial association value of a concept would otherwise be zero, the system uses an initial association factor m_source, m_target ∈ [0.01, 0.05]; any value in this interval meets the requirement. In this way the initial association degree sets Map_O_source and Map_O_target of all concepts in the ontologies to be mapped are obtained. An initial association set is uniformly defined as key-value pairs Map_O = <C, m>.
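A sketch of the fusion step follows, for illustration. Since the exact forms of formulas (4)-(6) are not reproduced in the text, the combination of ρ with the adjustment term −ρ·log10(1−ρ) and the accumulation over target concepts are assumptions based on the surrounding description.

```python
import math

def semantic_correlation(sim_e_val: float, sim_t_val: float) -> float:
    """Fuse the two base similarities into the coefficient lambda_st.
    rho is the larger of the two results; the adjustment term
    -rho * log10(1 - rho) reflects the dissimilarity (assumed combination)."""
    rho = max(sim_e_val, sim_t_val)
    if rho > 0.9:           # saturation region specified in the text
        return 1.0
    return rho - rho * math.log10(1.0 - rho)

def initial_association(concept, target_concepts, sim_e, sim_t,
                        m_floor: float = 0.03) -> float:
    """Assumed reading of formula (6): accumulate the correlation of a
    source concept against every target concept; fall back to a small
    factor in [0.01, 0.05] when the sum would otherwise be zero."""
    total = sum(semantic_correlation(sim_e(concept, t), sim_t(concept, t))
                for t in target_concepts)
    return total if total > 0 else m_floor
```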
(2) Large-scale ontology compression algorithm
When a large-scale ontology mapping task is faced, the traditional algorithm is difficult to adapt in terms of time complexity and space complexity, and therefore a corresponding strategy is needed to compress the original ontology to be mapped.
Data field theory is derived from the field-theoretic ideas of physics: the interrelations between data in a domain space are abstracted as interactions between material particles and finally formalized as a field-theoretic description. The theory expresses the interaction between different data through a potential function, thereby reflecting the distribution characteristics of the data, and clusters and partitions the data set according to the equipotential-line structure of the data field. However, the short-range field potential function adopted in classical data fields usually considers only the influence of the path distance between data objects on the final potential value; applied to the ontology mapping problem, this means ignoring the semantic association factors that are ubiquitous between data objects. A typical example is the pseudo-nuclear force field potential function.
When a large-scale ontology is compressed, the semantic associations between the concepts in the ontology serve as the basis for compression. The invention treats the ubiquitous semantic associations between the concepts of the source ontology O_source and the target ontology O_target as the basis and premise of ontology compression, regards ontology concepts as data objects in a data field, regards the initial association degree between concepts as the mass of each object in the data field, and proposes a new method for measuring the potential value of a data object by jointly computing the semantic similarity and dissimilarity between concepts. By introducing a factor for the semantic association degree that is ubiquitous between concepts in an ontology, the shortcoming of the pseudo-nuclear force field potential function with respect to the ontology mapping problem is corrected, so that it macroscopically conforms to the characteristics of ontology mapping.
a) Definition of potential function
Because the short-range field can better reflect the interaction condition between data, a pseudo-nuclear force field potential function is adopted. The specific definition of the method in the ontology mapping problem is as follows:
In the ontology O to be mapped, the shortest path length between two concepts is denoted ||C_s − C_l||; owing to the short-range field characteristic, the path length R between concepts is restricted to be no greater than 2. Then, according to data field theory, the field strength function for the interaction between the concepts C_s and C_l is shown in formula (7):
where m_s represents the mass of each data point and is typically set to m_s = 1. This, however, only reflects the influence of the path distance between concepts on the final potential value and entirely omits the semantic association degree between the concepts. The invention therefore introduces the semantic similarity and dissimilarity between concepts into the potential value calculation and defines m_s as the initial association degree between concepts, which has already been given by equation (6). The field strength function in formula (7) is thus corrected and refined by jointly considering the similarity and dissimilarity between concepts. Equation (6) implies that the greater the initial association degree of a concept in the ontology to be mapped, the greater its mass in the data field.
That is, for the concept set of the source ontology O_source and the concept set of the target ontology O_target, the initial association value of each concept is used to characterize the extent to which that concept affects other concepts. The modified field strength function is shown in equation (8):
Here δ ∈ (0, +∞) reflects the granularity of the influence between concepts and is also called the scaling factor; δ = 1 and r = 2 are taken. In this way the potential value function expression for each concept in the ontology O_source to be mapped is obtained, as shown in equation (9):
Because the potential value calculation for data objects is symmetric between the source and target ontologies, the potential values of the concepts in the target ontology are obtained in the same way. Finally the potential value sets PotentialMap_O_source and PotentialMap_O_target of all concepts in the ontologies to be mapped are obtained. A set of potential values is uniformly defined as key-value pairs.
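The sketch below illustrates how a potential value could be accumulated from initial association degrees over short-range neighbours; the inverse-square attenuation with δ = 1 and r = 2 is an assumption consistent with the stated parameters, since formulas (7)-(9) themselves are not reproduced in the text.

```python
from typing import Dict, List, Tuple

def potential_value(concept: str,
                    neighbours: List[Tuple[str, int]],
                    init_assoc: Dict[str, float],
                    delta: float = 1.0, r: int = 2) -> float:
    """Assumed potential of one concept: sum of neighbour masses
    (their initial association degrees) attenuated by path distance,
    taken only over neighbours within the short-range limit R <= 2."""
    phi = 0.0
    for other, path_len in neighbours:
        if path_len == 0 or path_len > 2:   # short-range field restriction
            continue
        m = init_assoc.get(other, 0.0)
        phi += m / ((path_len / delta) ** r)
    return phi
```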
b) Extraction of candidate concepts
In order to compress an ontology O to be mapped, the system divides the concept set of O into two parts, called the candidate area and the elimination area.
Specifically, for the output key-value pair sets Map_O_source and Map_O_target obtained after executing Algorithm I, the total numbers of concepts in Map_O_source and Map_O_target whose association value is greater than 0.05 are counted and denoted Range_Candidate_O_source and Range_Candidate_O_target; these variables are defined as the upper bounds of the candidate area intervals of the ontologies O_source and O_target to be mapped.
The concept elements in the potential value sets PotentialMap_O_source and PotentialMap_O_target are sorted in descending order of potential value, and the rank of each concept is recorded in a ranking variable. If a concept's rank does not exceed the upper bound of the candidate area interval of its ontology, the concept is retained as a candidate concept; otherwise the concept is eliminated. Owing to the symmetry between the source and target ontologies, the candidate concept extraction rule for the target ontology O_target is obtained in the same way.
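A sketch of the candidate extraction rule described above (container names are hypothetical):

```python
from typing import Dict, List

def extract_candidates(potential_map: Dict[str, float],
                       assoc_map: Dict[str, float],
                       assoc_threshold: float = 0.05) -> List[str]:
    """Keep the top-ranked concepts by potential value; the number kept
    (Range_Candidate_O) is the count of concepts whose initial
    association degree exceeds 0.05, as described in the text."""
    range_candidate = sum(1 for m in assoc_map.values() if m > assoc_threshold)
    ranked = sorted(potential_map, key=potential_map.get, reverse=True)
    return ranked[:range_candidate]   # the remaining concepts are eliminated
```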
(3) Concept deterministic mapping based on Needleman-Wunsch algorithm
Commonly used semantic knowledge bases at present include the synonym forest (Tongyici Cilin), HowNet and WordNet. In particular, words already included in the synonym forest (Extended Version) are called simple lemmas, while words not included in the word forest are collectively called unknown words (Out of Vocabulary, OOV).
Using the definitions given above, the problem is discussed by cases. For any two concepts C_source and C_target in the source ontology O_source and the target ontology O_target to be mapped, three cases occur when the semantic similarity of concepts is calculated:
① C_source and C_target are both atomic concepts, i.e. C_source ∈ SKB_TYCCL and C_target ∈ SKB_TYCCL;
② one of C_source and C_target is an atomic concept and the other is a combined concept, i.e. exactly one of them belongs to SKB_TYCCL;
③ C_source and C_target are both combined concepts, i.e. neither belongs to SKB_TYCCL.
For case (1), the present invention will directly employ equation (3) to calculate the semantic similarity of the two concepts. The following focuses on the similarity degree calculation method for the case (2) and the case (3).
For the similarity calculation of Chinese combined concepts, traditional Chinese ontology mapping systems provide a processing scheme. For example, Li Jia et al. designed an element-level concept similarity calculation method based on HowNet and implemented a Chinese ontology mapping system [Li Jia, et al. Research and implementation of Chinese ontology mapping [J]. Journal of Chinese Information Processing, 2007, 21(4): 27-33]. When handling the similarity calculation problem for unknown words, that method traverses the atomic concept sequences corresponding to the two combined concepts, finds the atomic concept mapping pairs with the largest similarity, and finally computes the similarity value of the two combined concepts from the obtained maximal mapping pairs. The calculation formula is as follows:
where B_xy denotes an element of the similarity matrix whose rows and columns are the known words obtained by segmenting the two vocabulary items, and max_i(B_xy) denotes the i-th largest similarity value in the matrix.
However, because Chinese concepts ubiquitously place their semantic weight towards the end ("light head, heavy tail"), this earlier processing scheme inevitably introduces errors into the semantic similarity calculation. Consider, for example, two combined concepts to be mapped that appear in different ontologies, "historical theory" and "thought history"; after word segmentation, two ordered arrangements of atomic concepts are obtained: [history, theory] and [thought, history]. With the unknown-word processing method commonly used in previous work, the atomic concept mapping shown in Fig. 2(a) is obtained; the semantic similarity of each pair of atomic concepts is calculated with the synonym-forest-based formula (3), and formula (10) proposed in previous work is then applied for the overall calculation, yielding a concept element-level similarity of 1.0, which is an obviously unreasonable combined concept mapping pair and similarity result. The reason is that the method neglects the word-order sensitivity ubiquitous in Chinese natural language as well as its "light head, heavy tail" characteristic.
Therefore, the system adopts a concept semantic similarity calculation method based on global sequence comparison.
a) Overview of the sequence alignment (alignment) Algorithm
In bioinformatics, a two-sequence alignment refers to the alignment of two DNA, RNA or protein sequences together, indicating the similarity, where gaps can be inserted in the sequences, with the corresponding identical or similar symbols aligned on the same column. By comparing similar fragments and conserved sites between two sequences, the molecular evolutionary relationship which may exist is searched.
Overall, alignment models can be divided into 2 classes: one is global alignment (global alignment), which mainly examines the overall similarity between 2 sequences and scans and compares the sequences all the way. The other is a local alignment (local alignment) method, which focuses on some specific fragments in the sequence and compares the similarity between fragments in the sequence. Both can be solved by the idea of Dynamic Programming (DP).
The Needleman-Wunsch algorithm is a typical global alignment algorithm that is suitable for comparing 2 sequences that are more similar globally macroscopically. This algorithm was proposed in 1970 by Needleman and Wunsch and is a dynamic programming algorithm (DP) that aligns the similarity between two sequences. This algorithm is one of the basic algorithms of bioinformatics. The system mainly considers a global double-sequence comparison algorithm.
b) Constructing a dynamic programming scoring matrix
A sequence here is a character string composed of a series of symbols arranged according to a given rule. Specifically, for the ontology concept similarity calculation problem, a combined concept is regarded as a word string sequence whose elements are atomic concepts. A combined concept is first segmented to obtain its corresponding word string sequence; in the Chinese large-scale ontology mapping system, ICTCLAS 5.0 developed by the Chinese Academy of Sciences is used as the word segmentation tool. The alphabet is defined as the "synonym forest" semantic knowledge base SKB_TYCCL plus the gap symbol "−".
The system abstracts the concept similarity calculation of ontology mapping as an alignment process between two word string sequences: a gap penalty function decides where gap symbols are inserted into the word string sequences so that the two sequences have the same length, thereby establishing correspondences either between the atomic concepts of the sequences to be aligned or between an atomic concept and a gap symbol. The essence of the sequence alignment algorithm is to find the optimal global pairing of the two combined concept sequences through a scoring strategy.
In the system, the two word string sequences to be aligned are first represented in the form of a scoring matrix, the two sequences serving as the two dimensions of a dynamic programming matrix. For the concepts C_source and C_target in the ontologies O_source and O_target to be mapped, the i-th row of the scoring matrix M corresponds to the atomic concept at position i of the word string sequence CC_source and the j-th column corresponds to the atomic concept at position j of the word string sequence CC_target, where i ≤ m and j ≤ n. The element in row i and column j of the dynamic programming matrix M is denoted M_ij.
For example, after word segmentation the combined concepts "the Second Industrial Revolution" and "Second World War warrior" yield two word string sequences to be aligned: [second, times, industrial revolution] and [second, times, world war, warrior]. Following the dynamic programming idea, the two word string sequences are laid out as rows and columns: if the sequence CC_source has length m and the sequence CC_target has length n, a two-dimensional matrix of size (m + 1) × (n + 1) is formed with CC_source as the rows and CC_target as the columns, as shown in Fig. 4.
c) Optimized recursive solution algorithm
Based on the idea of dynamic programming, the optimal alignment path in the matrix M is solved recursively.
First, the penalty factor p = −0.05 of the sequence alignment algorithm is given, and the (m+1)-th row and the (n+1)-th column of the matrix are initialized.
Second, the remaining m × n elements of the matrix are solved recursively using the synonym-forest-based similarity calculation function SIM_T. The definition of the scoring function f is given first, as shown in equation (11): f scores a pair of atomic concepts with SIM_T and scores any pairing with the gap character "−" with the penalty p.
Considering the ubiquitous "light head, heavy tail" characteristic of Chinese concepts, the starting point of the recursion is chosen at the end of the two combined concepts, i.e. the element M_mn of the matrix. For SIM_T, see formula (3). Specifically, the recursion rule is shown in equation (12): each element takes the maximum of the diagonally adjacent value plus the score f of the corresponding atomic concept pair and the horizontally or vertically adjacent value plus the gap penalty p.
Finally, starting from the element M_mn of the matrix, the algorithm traces back until the element M_11 is reached, after which the optimal alignment path is obtained. It should be noted that if more than one optimal alignment path is obtained, one of them is selected.
A specific concept element level similarity calculation algorithm based on the global sequence comparison idea is shown in algorithm II.
Taking the combined concepts "the Second Industrial Revolution" and "Second World War warrior" as an example again, the matrix M'_(i)(j) containing the best matching path obtained by Algorithm II is shown in Fig. 4. An arrow represents a selectable advance direction during backtracking and is derived from formula (12); a bold arrow indicates the resulting best path. In particular, a bold diagonal arrow represents the pairing of the two atomic concepts at its tail; a bold horizontal arrow indicates that one gap character "−" is inserted in the word string sequence CC_source in front of the atomic concept position of the corresponding row; a bold vertical arrow indicates that one gap character "−" is inserted in the word string sequence CC_target in front of the atomic concept position of the corresponding column. For the scoring matrix given in Fig. 4, the optimal alignment result of the word string sequences CC_source and CC_target is shown in Fig. 5.
Based on Algorithm II, the calculation flow for each element of the scoring matrix M'_(i)(j) shown in Fig. 3 is given below; the scoring matrix shown in Fig. 4 is obtained in the same way.
Step (1): Initialize the matrix. Let the penalty factor p = −0.05; the combined concept CC_source has sequence length m and CC_target has sequence length n.
Step (2): Recursively calculate the score of each element in the scoring matrix.
The calculation starts from the element M_22 at the bottom-right corner of the region to be filled, i.e. with row index i = 2 and column index j = 2; then:
The remaining elements are obtained in the same way.
Step (3): Backtrack to obtain the optimal path from M_33 to M_11: M_33 → M_32 → M_21 → M_11.
Step (4): Finally, insert the gap symbol "−" to obtain the correct global sequence alignment shown in Fig. 2(b).
The system denotes the two combined concept entry sequences to be mapped, after the gap character "−" has been inserted, as CC_source' and CC_target'; the two sequences then contain the same number of elements, denoted L_cc'. From the alignment result and the scoring function f, the similarity between the combined concepts is given by formula (13): SIM_NW(CC_source', CC_target') = (Σ_{k=1}^{L_cc'} f(CC_source'[k], CC_target'[k])) / L_cc'.
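For illustration, a self-contained sketch of the global alignment step follows. It fills the dynamic programming matrix from the tail of the two sequences with the penalty p = −0.05, backtracks to recover the aligned pairs, and averages the scores as in formula (13); the recursion written here is an assumed reading of equations (11)-(12), and sim_t stands in for the synonym-forest similarity SIM_T.

```python
from typing import Callable, List, Tuple

GAP = "-"
P = -0.05  # gap penalty factor

def nw_similarity(cc_source: List[str], cc_target: List[str],
                  sim_t: Callable[[str, str], float]) -> float:
    """Needleman-Wunsch style global alignment of two atomic-concept
    sequences, filled from the tail (heavy-tail Chinese concepts),
    followed by the averaged score as in formula (13)."""
    m, n = len(cc_source), len(cc_target)
    # (m+1) x (n+1) matrix; row m and column n play the role of the
    # initialized extra row/column described in the text.
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        M[i][n] = M[i + 1][n] + P
    for j in range(n - 1, -1, -1):
        M[m][j] = M[m][j + 1] + P
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            M[i][j] = max(M[i + 1][j + 1] + sim_t(cc_source[i], cc_target[j]),
                          M[i + 1][j] + P,     # gap inserted in the target
                          M[i][j + 1] + P)     # gap inserted in the source
    # Backtrack from the top-left cell to recover one optimal alignment.
    aligned: List[Tuple[str, str]] = []
    i = j = 0
    while i < m or j < n:
        if (i < m and j < n and
                M[i][j] == M[i + 1][j + 1] + sim_t(cc_source[i], cc_target[j])):
            aligned.append((cc_source[i], cc_target[j])); i += 1; j += 1
        elif i < m and M[i][j] == M[i + 1][j] + P:
            aligned.append((cc_source[i], GAP)); i += 1
        else:
            aligned.append((GAP, cc_target[j])); j += 1
    # Formula (13): average the pair scores over the aligned length.
    score = sum(P if GAP in (a, b) else sim_t(a, b) for a, b in aligned)
    return score / len(aligned)
```

Under these assumptions (and with small similarities for the cross pairs), running the sketch on example two below with sim_t("second", "second") = sim_t("times", "times") = 1.0 and sim_t("industrial revolution", "world war") = 0.18 reproduces the 0.5325 value computed in the text.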
after the method for calculating the element-level similarity of the combined concept based on sequence alignment is explained, the two groups of similarity calculation examples mentioned before are reviewed.
Example one: CC_source = [thought, history], CC_target = [history, theory]. The combined concept similarity value obtained from equation (10) is Sim(CC_source, CC_target) = (1.0 + 1.0)/2 = 1.0. The alignment of the combined concept sequences produced by the sequence alignment algorithm is shown in Fig. 2(b), and the corresponding scoring matrix is shown in Fig. 3. The combined concept similarity value obtained from Algorithm II together with formulas (3), (11) and (13) is SIM_NW(CC_source', CC_target') = (f(thought, −) + f(history, history) + f(−, theory))/3 = (p + SIM_T(history, history) + p)/3 = (−0.05 + 1.0 − 0.05)/3 = 0.3.
Example two: CC_source = [second, times, industrial revolution], CC_target = [second, times, world war, warrior]. The combined concept similarity value calculated by formula (10) is Sim(CC_source, CC_target) = 1.0, because the atomic concept "times" is polysemous. In particular, the lemma "times" has multiple codes in the synonym forest (Extended Version), and its code Dn04B03 leads to the judgment that the two atomic lemmas "second" and "times" are equivalent lemmas. Formula (10) of the traditional method therefore yields four atomic concept mapping pairs whose value is 1.0, namely <second, second> = 1.0, <second, times> = 1.0, <times, second> = 1.0 and <times, times> = 1.0; substituting into equation (10) gives Sim(CC_source, CC_target) = (1.0 + 1.0 + 1.0 + 1.0)/4 = 1.0. The combined concept similarity value obtained from Algorithm II together with formulas (3), (11) and (13) is SIM_NW(CC_source', CC_target') = (f(second, second) + f(times, times) + f(industrial revolution, world war) + f(−, warrior))/4 = (SIM_T(second, second) + SIM_T(times, times) + SIM_T(industrial revolution, world war) + p)/4 = (1.0 + 1.0 + 0.18 − 0.05)/4 = 0.5325; the corresponding sequence matching result is shown in Fig. 5.
It can be seen that there is no equivalence between the two combined concepts in either example one or example two, yet the traditional method in both cases draws the wrong conclusion of extremely high similarity (1.0). By contrast, the similarity values obtained by Algorithm II are more reasonable. When the backward-shifted semantic centre of gravity and the polysemy that are common in Chinese concepts are taken into account, the Needleman-Wunsch based global alignment algorithm effectively avoids the mismappings that the conventional method represented by formula (10) may produce. Meanwhile, for combined concept mappings in which the semantic order of the atomic concepts in the corresponding word string sequences is essentially the same, the effect of Algorithm II is essentially consistent with that of the traditional method. In conclusion, the concept element-level similarity algorithm based on global sequence alignment is more advantageous and more reasonable than the traditional method for large-scale Chinese ontology mapping tasks.
(4) Experimental data preparation
Compared with internationally common ontologies and their mapping tasks, such as the multi-domain standard ontologies and mapping benchmark evaluation metrics issued by international organizations such as the OAEI (Ontology Alignment Evaluation Initiative), open-source Chinese large-scale ontologies are still scarce. Therefore, the invention adopts Chinese online open encyclopedia knowledge bases as the experimental data source. In addition to the DBpedia (Chinese version) knowledge base, the crawler toolkit HTMLParser is used to crawl and parse the open classification pages of Baidu Encyclopedia and Interactive Encyclopedia respectively. The invention not only analyses the classification systems of the Chinese online open encyclopedias but also extracts and parses the Infobox structured information contained in all entry pages, organizes it in the form of Chinese triples, and finally forms three large-scale Chinese ontology libraries to be mapped.
The open classification systems of the encyclopedias mainly form the concept system of an ontology, while the knowledge in an Infobox takes the name of the wiki page as the subject, the attribute name (Property) as the predicate, and the attribute value as the object. For example, if the factual statement "the nationality of Qian Xuesen is the People's Republic of China" appears in the Infobox of an entry page, it is expressed as "Qian Xuesen, nationality: People's Republic of China"; knowledge obtained from the information boxes can thus be expressed in the form of triples, namely <Qian Xuesen, nationality, People's Republic of China>. The system follows the method provided in [Z. Wang, Z. Wang, J. Li et al. Knowledge extraction from Chinese wiki encyclopedias [J]. Journal of Zhejiang University - Science C, vol. 13, no. 4, pp. 268-280, 2012].
TABLE 2 Chinese network encyclopedia knowledge base information
The system constructs a Baidu Encyclopedia ontology framework containing over 1300 concepts and an Interactive Encyclopedia ontology framework containing 29263 concepts. The top-level classification of the Baidu Encyclopedia classification system comprises 13 categories such as people, science, history, sports and education, while the Interactive Encyclopedia comprises 13 top-level classifications such as people, technology and hot topics. DBpedia (Chinese version) can be regarded as a semantic Wikipedia knowledge base; it contains 23 top-level taxonomies with a total of over 10 million concepts and can be obtained directly from the download links provided by Wikipedia. The relevant information of the three Chinese online encyclopedia knowledge bases is shown in Table 2.
The system adopts the Precision, Recall and F-measure of the mapping results as the final evaluation criteria, where:
Precision (P) = number of correct mapping pairs output / total number of mapping pairs output × 100%
Recall (R) = number of correct mapping pairs output / total number of mapping pairs in the standard result × 100%
F-measure(F1)=2×P×R/(P+R)×100%
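As a quick illustration, the three evaluation measures can be computed from the output mapping set and the reference mapping set as follows; this is only a sketch with illustrative variable names.

```python
def evaluate(output_pairs, reference_pairs):
    """Precision, recall and F1 for a set of output mapping pairs
    against the reference (standard) mapping pairs."""
    output, reference = set(output_pairs), set(reference_pairs)
    correct = len(output & reference)  # correctly output mapping pairs
    precision = correct / len(output) if output else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```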
For the large-scale Chinese ontology mapping tasks, the correct mapping pairs among the People, Science, Society, Geography and Art sub-classes under the top-level classifications of the three Chinese web open encyclopedia ontology concept sets (Baidu Baike, Hudong Baike and the Chinese Wikipedia, i.e. DBpedia 3.8 Chinese version) are used as the reference mappings for evaluating the efficiency of the algorithm; see Table 3.
TABLE 3 three Chinese encyclopedia mapping task ontology reference mapping statistics
(5) Results of the experiment
a) Experiment one: large-scale Chinese ontology compression
Based on the proposed comprehensive calculation of semantic similarity and dissimilarity between concepts, the nuclear-force-like data field potential function is applied to first compress the mapping scale of the large-scale Chinese ontologies. For the three ontology mapping tasks involved, the compression effect obtained by the ontologies to be mapped under different semantic environments is shown in Table 4, where:
Compression ratio (%) = (ontology size before compression - ontology size after compression) / ontology size before compression × 100%
TABLE 4 Large Scale ontology mapping Scale compression Effect
As can be seen from the results in Table 4, when the original scale ratio between the two ontologies to be mapped is large, the relatively smaller ontology obtains a smaller compression ratio while the larger ontology more easily obtains a higher compression ratio. That is, the larger the original scale ratio, the more likely the larger ontology to be mapped is to obtain a high compression ratio, and in this case the compression ratios obtained by the two ontologies to be mapped differ significantly. When the original scales of the ontologies to be mapped are small or close, the compression ratios obtained by the two ontologies tend to converge. Therefore, a good clustering effect can be obtained based on the modified nuclear-force-like field potential function, and the time and space complexity of the large-scale ontology mapping task can be effectively controlled and reduced before deterministic mapping is carried out.
b) Experiment two: large-scale Chinese ontology mapping result evaluation
The evaluation results of the three mapping tasks are shown in Table 5, which gives the precision (P value), recall (R value) and F1-measure obtained with three different typical similarity calculation algorithms. The first is a cross-language general-purpose similarity calculation method based on the edit distance algorithm [http://diene.cis.string.ac.uk/Prototype.html], hereinafter referred to as method one; the second is a typical Chinese word similarity calculation algorithm based on the synonym forest [Tian Jiule, Zhao Wei. Word similarity computing method based on Tongyici Cilin [J]. Journal of Jilin University (Information Science Edition), 2010, 28(6): 602-608], hereinafter referred to as method two; the third is the Chinese concept comprehensive similarity calculation method provided by the invention, hereinafter referred to as the system method.
The system is compared and analyzed against method one and method two. To ensure fairness, the similarity threshold for judging the concept equivalence relation is uniformly set to T = 0.9 for each algorithm.
TABLE 5 evaluation results of three typical similarity algorithms
As can be seen from Table 5, the precision of the system on the Baidu-Hudong mapping task is basically equal to that of the edit distance similarity algorithm. Meanwhile, the precision of the system is significantly higher than that of method two, because the ontology mapping problem focuses more on identifying the coreference relation between concepts, whereas method two focuses more on the semantic relatedness between words and therefore introduces larger errors when calculating word similarity. On the Hudong-DBpedia mapping task, the precision of the system is basically equal to that of method one and is on average about 9% higher than that of method two.
In terms of recall, first, because the synonym forest is introduced as a semantic knowledge base, the recall of the system is higher than that of method one. Second, as can be seen from the evaluation results of the three mapping tasks, after the data field potential function is introduced as the compression factor of the ontology mapping scale, it can also be regarded as a structure-level mapping between the concept sets. Therefore, according to the structure-level characteristics of concept elements that may appear in different encyclopedia sub-classifications, the system also gains a strong error-correction capability; that is, it can avoid errors that a purely element-level mapping strategy may introduce. Meanwhile, by introducing the combined-concept similarity calculation method based on bioinformatics sequence alignment, the system not only avoids the mismappings that a traditional algorithm may produce when calculating the similarity of out-of-vocabulary words, but also, compared with the Chinese word similarity calculation algorithm of method two, which does not consider the out-of-vocabulary word problem, is more likely to improve the recall of the different sub-mapping tasks according to the characteristics of the combined concepts contained in different sub-classifications.
Finally, from the perspective of overall performance (F1 value), on the Baidu-Hudong mapping task the system is on average approximately 11% and 20% higher than method one and method two respectively. On the Hudong-DBpedia mapping task, its overall performance is higher than that of the synonym forest similarity algorithm of method two and basically equal to that of method one. On the Baidu-DBpedia mapping task, the overall performance of the system is still about 21% and 8% higher than method two and method one respectively.

Claims (1)

1. A large-scale ontology mapping method for the Chinese language, characterized in that the method comprises three steps: calculation of the initial association degree of concepts based on the fusion of edit distance and synonym forest, ontology compression, and deterministic mapping;
(1) Concept initial association degree calculation based on editing distance and synonym forest fusion
a) Edit distance similarity
Let the two ontologies to be mapped be O_source and O_target; for a certain concept C_s of the source ontology O_source, corresponding concepts that are semantically identical or close need to be found in the target ontology O_target; the edit distance value and the similarity value of two concepts C_s and C_t are described by formula (1) and formula (2):
wherein ed(C_s, C_t) is the number of edit operations between the concepts to be mapped C_s and C_t, namely the minimum number of operation steps by which the string C_s is completely transformed into the string C_t; there are three operations: adding, deleting, or modifying a character; |C_s| and |C_t| are the character lengths of the concepts to be mapped;
wherein the value given by formula (2) is the similarity between the concepts to be mapped C_s and C_t;
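A minimal sketch of the edit-distance step follows; since formula (2) is not reproduced here, the normalization by the longer string length is an assumed common form, not necessarily the exact formula of the claim.

```python
def edit_distance(s, t):
    """Minimum number of add/delete/modify character operations turning s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a character
                          d[i][j - 1] + 1,         # add a character
                          d[i - 1][j - 1] + cost)  # modify a character
    return d[m][n]

def edit_similarity(s, t):
    # Assumed normalization: 1 - ed / max(|s|, |t|); formula (2) may differ.
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```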
b) Synonym forest similarity
Similarity calculation formula based on synonym forest:
wherein, for the two lemmas being compared, F_i indicates that their sub-codes first differ at the i-th layer; |F| denotes the number of elements in the set F and is always equal to 5; the weight coefficient of the concept similarity is α × (F_i / |F|); n_subTree is the total number of nodes contained under the branch at layer F_i where the sub-coding difference of the two lemmas occurs; D is the coding distance between the two lemmas; α ∈ [0.4, 0.5], and any random number in this interval satisfies the requirement;
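The sketch below only derives the quantities named above (the first differing layer F_i and the weight α × (F_i / |F|)) from two Tongyici Cilin codes; n_subTree and the coding distance D would additionally be looked up from the Cilin hierarchy, and the way all quantities are finally combined in formula (3) is not reproduced here, so the code is an illustrative assumption rather than the exact formula.

```python
import random

ALPHA = random.uniform(0.4, 0.5)   # alpha: any random number in [0.4, 0.5] works
NUM_LAYERS = 5                      # |F| is always 5

def split_code(code):
    """Split a Tongyici Cilin code such as 'Aa01A01' into its five layer
    sub-codes: 1 letter, 1 letter, 2 digits, 1 letter, 2 digits."""
    return [code[0], code[1], code[2:4], code[4], code[5:7]]

def first_differing_layer(code1, code2):
    """F_i: the first layer (1-based) at which the two sub-codes differ,
    or None if the full codes are identical."""
    for i, (a, b) in enumerate(zip(split_code(code1), split_code(code2)), start=1):
        if a != b:
            return i
    return None

def similarity_weight(code1, code2):
    """Weight coefficient alpha * (F_i / |F|) described above."""
    f_i = first_differing_layer(code1, code2)
    if f_i is None:
        return 1.0   # identical coding: treat the lemmas as maximally similar
    return ALPHA * (f_i / NUM_LAYERS)
```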
c) Multi-strategy fusion association degree algorithm
First, the similarity results of the two basic algorithms are compared and the maximum of the two results is taken; at the same time, both the similarity and the dissimilarity between the two concepts C_s and C_t are taken into account and superimposed into the final association degree of each concept; the maximum value obtained by the two similarity algorithms is defined as ρ, and correspondingly the dissimilarity index is 1 - ρ; obviously ρ ∈ (0, 1], and formula (4) then holds:
the association degree between the concepts C_s and C_t is denoted as λ_st;
finally, the initial association degree of the source ontology concept C_s with respect to the target ontology O_target is obtained, expressed by formula (6);
owing to the symmetry of the association degree calculation, the initial association degree m_target of a concept in the target ontology can be obtained in the same way; in the case that the final initial association degree value of a certain concept is zero, a certain random number m_source, m_target ∈ [0.01, 0.05] is taken as the initial association degree so that the solution is satisfied; thus, the initial association degree sets Map_O_source and Map_O_target of all concepts in the ontologies O to be mapped are obtained; the initial association degree sets are uniformly expressed in the form of key-value pairs: Map_O<C, m>;
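A hedged sketch of the fusion step follows; taking ρ as the maximum of the two base similarities and 1 - ρ as the dissimilarity is stated above, but the exact way they are superimposed into the final association degree (formulas (4) to (6)) is not reproduced here, so the aggregation shown is only one plausible reading.

```python
import random

def fused_rho(c_s, c_t, sim_edit, sim_cilin):
    """rho: the maximum of the two base similarity results for a concept pair;
    the corresponding dissimilarity index is 1 - rho."""
    return max(sim_edit(c_s, c_t), sim_cilin(c_s, c_t))

def initial_association(c_s, target_concepts, sim_edit, sim_cilin):
    """Assumed reading of formulas (4)-(6): superimpose similarity minus
    dissimilarity (rho - (1 - rho)) over the target ontology and keep the
    best value as the initial association degree of c_s."""
    values = [2 * fused_rho(c_s, c_t, sim_edit, sim_cilin) - 1
              for c_t in target_concepts]
    m = max(values) if values else 0.0
    if m <= 0:                           # zero (or negative, in this sketch) value:
        m = random.uniform(0.01, 0.05)   # fall back to a random number in [0.01, 0.05]
    return m

# Map_O_source as key-value pairs <C, m>:
# Map_O_source = {c: initial_association(c, target_concepts, sim_edit, sim_cilin)
#                 for c in source_concepts}
```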
(2) Ontology compression algorithm
for the concept set of the source ontology O_source and the concept set of the target ontology O_target, the initial association degree value of each concept, already given by formula (6), is used to characterize the extent to which that concept affects other concepts; the modified field strength function is shown in formula (8):
taking δ = 1 and r = 2, the potential value function expression of each concept in the ontology to be mapped O_source is obtained as shown in formula (9):
the potential value of each concept in the target ontology O_target can be obtained in the same way; finally the potential value sets PotentialMap_O_source and PotentialMap_O_target of all concepts in the ontologies O to be mapped are obtained; the potential value sets are uniformly defined in the form of key-value pairs: PotentialMap_O<C, potential value>;
The concept set in O is divided into two parts, which are called: a candidate area and a culling area;
specifically, for the output key-value pair sets Map_O_source and Map_O_target obtained after executing the multi-strategy fusion association degree algorithm, the total numbers of concepts in Map_O_source and Map_O_target whose association degree value is greater than 0.05 are counted according to the association degree value of each concept element, and are respectively called Range_Candidate_O_source and Range_Candidate_O_target; these variables are defined as the upper bounds of the candidate intervals of the ontologies to be mapped O_source and O_target;
the concept elements in the potential value sets PotentialMap_O_source and PotentialMap_O_target are sorted in descending order according to their values, and the position of each concept in this ranking is identified by a ranking variable; if the rank of a concept does not exceed the upper bound of the candidate interval, the concept is retained as a candidate concept; accordingly, if the rank exceeds the upper bound, the concept is culled; from the symmetry between the source ontology and the target ontology, the candidate concept extraction rule for the target ontology O_target is obtained in the same way;
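A sketch of the compression step is given below, under the assumption that the modified field strength of formula (8) takes the common data-field form m_j · exp(-(d/δ)^r) with δ = 1 and r = 2, and that the concept distance d is one minus a concept similarity; the exact formulas (8) and (9) may differ, but the ranking and candidate-interval logic follows the description above.

```python
import math

def potential_values(concepts, association, distance, delta=1.0, r=2):
    """Assumed potential of each concept: sum over the other concepts of
    m_j * exp(-(d_ij / delta) ** r)."""
    potentials = {}
    for c_i in concepts:
        potentials[c_i] = sum(
            association[c_j] * math.exp(-((distance(c_i, c_j) / delta) ** r))
            for c_j in concepts if c_j != c_i)
    return potentials

def compress(concepts, association, distance, threshold=0.05):
    """Keep the top-ranked concepts; the candidate-interval upper bound is the
    number of concepts whose association degree exceeds the threshold."""
    upper_bound = sum(1 for c in concepts if association[c] > threshold)  # Range_Candidate_O
    potentials = potential_values(concepts, association, distance)
    ranked = sorted(concepts, key=lambda c: potentials[c], reverse=True)
    candidates = ranked[:upper_bound]   # candidate region
    culled = ranked[upper_bound:]       # culled region
    return candidates, culled
```

The compression ratio reported in Table 4 then follows as (size before compression minus size after compression) divided by the size before compression.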
(3) Deterministic mapping
for any two concepts C_s and C_t of the source ontology O_source and the target ontology O_target to be mapped, the following three cases occur when the semantic similarity calculation of the concepts is performed:
andare all atomic concepts, i.e.:and is provided with
Andone of them is an atomic concept and the other is a combinatorial concept, i.e.:or
Andare all combinatorial concepts, i.e.:and is
for case (1), the semantic similarity of the two concepts is calculated with formula (3); for cases (2) and (3), the two word-string sequences to be aligned are first represented in the form of a scoring matrix, the two sequences serving as the two dimensions of a dynamic programming matrix; for the concepts C_s and C_t of the ontologies to be mapped O_source and O_target, the i-th row of the scoring matrix M corresponds to an atomic concept in the word-string sequence CC_source and the j-th column corresponds to an atomic concept in the word-string sequence CC_target, where i ≤ m and j ≤ n; the element in row i and column j of the dynamic programming matrix M is called M_ij;
Firstly, giving a penalty factor p = -0.05 of a sequence comparison algorithm, and respectively initializing the m +1 th row and the n +1 th column of a matrix;
second, based on the synonym forest similarity calculation function SIM_T, the remaining m × n elements in the matrix are solved recursively;
first, the definition of the score function f is given, as shown in equation (11):
the recursion rule is shown in equation (12):
starting from the element M_mn in the matrix and tracing back until the element M_11 is reached, an optimal alignment path is obtained; if more than one optimal alignment path is obtained, one of them is selected;
finally, the gap character "-" is inserted to obtain the correct global sequence alignment result;
the two combined-concept word-string sequences to be mapped after inserting the gap character "-" are called CC_source' and CC_target'; at this point the total numbers of elements contained in the two sequences are equal, collectively referred to as L_cc'; according to the alignment result and based on the scoring function f, the similarity calculation formula (13) between the combined concepts is obtained:
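To make the dynamic-programming step concrete, a sketch of Needleman-Wunsch global alignment over two atomic-concept sequences is given below; the scoring function simply plugs in an atomic-concept similarity (for example the synonym-forest-based SIM_T), and averaging the pairwise scores over L_cc' is an assumed reading of formula (13), not necessarily its exact form.

```python
import math

GAP = -0.05   # penalty factor p of the sequence alignment algorithm

def align_combined_concepts(seq_s, seq_t, sim):
    """Global (Needleman-Wunsch) alignment of two atomic-concept sequences;
    sim(a, b) is an atomic-concept similarity such as SIM_T."""
    m, n = len(seq_s), len(seq_t)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                      # initialize the extra row/column
        M[i][0] = i * GAP
    for j in range(1, n + 1):
        M[0][j] = j * GAP
    for i in range(1, m + 1):                      # recursive filling of the m x n cells
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + sim(seq_s[i - 1], seq_t[j - 1]),
                          M[i - 1][j] + GAP,
                          M[i][j - 1] + GAP)
    # Trace back one optimal alignment path, inserting the gap character "-".
    a_s, a_t, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and math.isclose(
                M[i][j], M[i - 1][j - 1] + sim(seq_s[i - 1], seq_t[j - 1])):
            a_s.append(seq_s[i - 1]); a_t.append(seq_t[j - 1]); i -= 1; j -= 1
        elif i > 0 and math.isclose(M[i][j], M[i - 1][j] + GAP):
            a_s.append(seq_s[i - 1]); a_t.append("-"); i -= 1
        else:
            a_s.append("-"); a_t.append(seq_t[j - 1]); j -= 1
    a_s.reverse(); a_t.reverse()
    l_cc = len(a_s)                                # L_cc': equal aligned lengths
    score = sum(GAP if "-" in (x, y) else sim(x, y) for x, y in zip(a_s, a_t))
    return a_s, a_t, (score / l_cc if l_cc else 0.0)
```

In this sketch a gap position simply contributes the penalty p to the final score, so atomic concepts left unmatched lower the combined-concept similarity rather than being silently ignored.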
CN201510082840.1A 2015-02-15 2015-02-15 A kind of extensive Ontology Mapping Method towards Chinese language Expired - Fee Related CN104699767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510082840.1A CN104699767B (en) 2015-02-15 2015-02-15 A kind of extensive Ontology Mapping Method towards Chinese language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510082840.1A CN104699767B (en) 2015-02-15 2015-02-15 A kind of extensive Ontology Mapping Method towards Chinese language

Publications (2)

Publication Number Publication Date
CN104699767A CN104699767A (en) 2015-06-10
CN104699767B true CN104699767B (en) 2018-02-02

Family

ID=53346888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510082840.1A Expired - Fee Related CN104699767B (en) 2015-02-15 2015-02-15 A kind of extensive Ontology Mapping Method towards Chinese language

Country Status (1)

Country Link
CN (1) CN104699767B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107978341A (en) * 2017-12-22 2018-05-01 南京昂特医信数据技术有限公司 Isomeric data adaptation method and its system under a kind of medicine semantic frame based on linguistic context
CN109635119B (en) * 2018-10-25 2023-08-04 同济大学 Industrial big data integration system based on ontology fusion
CN109582961A (en) * 2018-11-28 2019-04-05 重庆邮电大学 A kind of efficient robot data similarity calculation algorithm
CN109783650B (en) * 2019-01-10 2020-12-11 首都经济贸易大学 Chinese network encyclopedia knowledge denoising method, system and knowledge base
CN111353523A (en) * 2019-12-24 2020-06-30 中国国家铁路集团有限公司 Method for classifying railway customers
CN114519101B (en) * 2020-11-18 2023-06-06 易保网络技术(上海)有限公司 Data clustering method and system, data storage method and system and storage medium
CN112328915B (en) * 2020-11-25 2023-02-28 山东师范大学 Multi-source interest point fusion method and system based on spatial entity matching performance evaluation
CN112541056B (en) * 2020-12-18 2024-05-31 卫宁健康科技集团股份有限公司 Medical term standardization method, device, electronic equipment and storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7752032B2 (en) * 2005-04-26 2010-07-06 Kabushiki Kaisha Toshiba Apparatus and method for translating Japanese into Chinese using a thesaurus and similarity measurements, and computer program therefor
CN102955774A (en) * 2012-05-30 2013-03-06 华东师范大学 Control method and device for calculating Chinese word semantic similarity
CN103440314A (en) * 2013-08-27 2013-12-11 北京工业大学 Semantic retrieval method based on Ontology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A large-scale Chinese ontology mapping scheme based on the synonym thesaurus (Tongyici Cilin); Wang Ting et al.; Computer Science; 2014-05-31; Vol. 41, No. 5; full text *

Also Published As

Publication number Publication date
CN104699767A (en) 2015-06-10

Similar Documents

Publication Publication Date Title
CN104699767B (en) A kind of extensive Ontology Mapping Method towards Chinese language
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
RU2662688C1 (en) Extraction of information from sanitary blocks of documents using micromodels on basis of ontology
JP5904559B2 (en) Scenario generation device and computer program therefor
Zouaq et al. Evaluating the generation of domain ontologies in the knowledge puzzle project
RU2646380C1 (en) Using verified by user data for training models of confidence
CN111767325B (en) Multi-source data deep fusion method based on deep learning
WO2015093539A1 (en) Complex predicate template gathering device, and computer program therefor
CN103336852A (en) Cross-language ontology construction method and device
CN113312922B (en) Improved chapter-level triple information extraction method
CN102779119B (en) A kind of method of extracting keywords and device
CN104346382B (en) Use the text analysis system and method for language inquiry
Naser-Karajah et al. Current trends and approaches in synonyms extraction: Potential adaptation to arabic
Yang et al. Ontology generation for large email collections.
CN116244446A (en) Social media cognitive threat detection method and system
CN113239143B (en) Power transmission and transformation equipment fault processing method and system fusing power grid fault case base
Lakhanpal et al. Discover trending domains using fusion of supervised machine learning with natural language processing
Calegari et al. Object‐fuzzy concept network: An enrichment of ontologies in semantic information retrieval
RU2640718C1 (en) Verification of information object attributes
CN109783650B (en) Chinese network encyclopedia knowledge denoising method, system and knowledge base
Algosaibi et al. Using the semantics inherent in sitemaps to learn ontologies
Chen English translation template retrieval based on semantic distance ontology knowledge recognition algorithm
CN113326348A (en) Blog quality evaluation method and tool
El Idrissi et al. HCHIRSIMEX: An extended method for domain ontology learning based on conditional mutual information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180202

Termination date: 20190215