CN104699767A - Large-scale ontology mapping method for Chinese languages - Google Patents

Large-scale ontology mapping method for Chinese languages

Info

Publication number
CN104699767A
Authority
CN
China
Prior art keywords
source
concept
target
similarity
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510082840.1A
Other languages
Chinese (zh)
Other versions
CN104699767B (en)
Inventor
王汀
刘经纬
蔡万江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Original Assignee
CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority to CN201510082840.1A
Publication of CN104699767A
Application granted
Publication of CN104699767B
Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a mapping method for large-scale Chinese ontologies. The method comprises the following steps: computing an initial association degree for each concept by fusing a Chinese thesaurus (Tongyici Cilin) similarity algorithm with an edit-distance similarity algorithm; compressing the scale of the large-scale ontology mapping task with a pseudo-nuclear-force field potential function that is improved by the initial association degree and integrates both concept similarity and dissimilarity; and measuring the similarity of complex concepts in the Chinese ontologies by introducing a global sequence alignment algorithm. Chinese words exhibit polysemy and word-order sensitivity, and the computational cost of large-scale ontology mapping is high. To address this, the method first improves the existing pseudo-nuclear-force field potential function, so that both the similarity measurement between concepts and the scale compression of the ontologies to be mapped become more reasonable. Second, it adopts global sequence alignment to map complex Chinese concepts, which overcomes defects of traditional Chinese ontology mapping systems. As a result, the mapping efficiency of the system is improved, and precision and recall are increased.

Description

A large-scale ontology mapping method for the Chinese language
Technical field
The present invention relates to the field of Chinese ontology mapping.
Background
The vision of the Semantic Web is to build a "Web of Data" so that machines can understand the semantic information on the network. As the core element of the Semantic Web, an ontology is a formal, standardized specification of the shared concepts of a specific domain, and it is the basis for knowledge sharing and semantic interoperability on the network. At present, heterogeneity between different ontologies makes them difficult to reuse and share.
The task of ontology mapping (Ontology Alignment) is to discover the semantic associations between concepts in heterogeneous ontologies. However, for cultural and historical reasons, there is still no mature ontology mapping system for ontologies described in Chinese. With the development of the Semantic Web, more and more large-scale ontologies and knowledge bases described in Chinese are being built and shared, while the construction of Chinese ontology mapping systems is still at an early stage. The present invention therefore mainly addresses the construction of a large-scale ontology mapping system for Chinese.
Researchers in China and abroad have proposed a variety of mapping methods and representative systems. Cohen et al. [Cohen W, Ravikumar P, Fienberg S. A comparison of string distance metrics for name-matching tasks [C]. Proceedings of the IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico, 2003: 73-78] list several typical element-level similarity algorithms based on edit distance and on tokens, and evaluate their performance. Melnik et al. [Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: A versatile graph matching algorithm and its application to schema matching [C]. Proceedings of the 18th International Conference of Data Engineering (ICDE). San Jose, California, 2002: 117-128] propose a structure-level ontology mapping algorithm, Similarity Flooding, which builds a similarity propagation graph from the concept hierarchy of the ontology and uses it to propagate and revise the similarities between concepts. Zhong Qian et al. [Zhong Q, Li H, Li J, Xie G, Tang J, Zhou L, Pan Y. A gauss function based approach for unbalanced ontology matching [C]. Proceedings of the 28th International Conference on Management of Data (SIGMOD). Rhode Island, USA, 2009: 669-680] developed the RiMOM system, a multi-strategy mapping approach based on features such as ontology instances, concept names, and ontology structure; by introducing ideas from general field theory, it becomes applicable to large-scale ontology mapping tasks. However, it lacks optimization for the specific features of the Chinese language. Giunchiglia et al. [Giunchiglia F, Yatskevich M. Element level semantic matching [D]. Italy: Dept. of Information and Communication Technology, University of Trento, 2004] propose a linguistics-based method that introduces a shared lexical knowledge base (e.g., WordNet) and uses linguistic relations to discover semantic relations. Isaac et al. [Isaac A, Meij L, Schlobach S, Wang S. An empirical study of instance-based ontology matching [C]. Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference (ISWC/ASWC). Busan, Korea, 2007: 253-266] propose an instance-level ontology mapping algorithm that measures the similarity between concepts by the number of instances they share.
In recent years, research on building large-scale Chinese ontology libraries and ontology mapping systems has gradually got under way. Li Jia et al. propose an element-level concept similarity method based on HowNet and implement a Chinese ontology mapping system [Li Jia, Zhu Ming, Liu Chen, et al. Chinese ontology mapping research and implementation [J]. Journal of Chinese Information Processing, 2007, 21(4): 27-33]; the applicability of this system to large-scale ontology mapping tasks remains to be tested. Tian Jiule et al. propose a semantic similarity algorithm for Chinese words based on Tongyici Cilin [Tian Jiule, Zhao Wei. Measurement of word similarity based on Tongyici Cilin [J]. Journal of Jilin University, 2010, 28(6): 602-608], but their results have not been applied in the Semantic Web setting. Wang Zhichun et al. [Z. Wang, Z. Wang, J. Li et al. Knowledge extraction from Chinese wiki encyclopedias [J]. Journal of Zhejiang University-Science C, vol. 13, no. 4, pp. 268-280, 2012] propose extracting the hierarchical relations between concepts from the category systems of Chinese encyclopedias and obtaining concept attributes and entry instances from entry pages that contain an Infobox. They built two large-scale Chinese ontology libraries based on Baidu Baike and Hudong Baike, and established co-reference links between their instances and DBpedia using a simple keyword matching strategy. Niu Xing et al. [Niu X, Sun X, Wang H, et al. Zhishi.me - weaving Chinese linking open data [C]. ISWC 2011. Springer Berlin Heidelberg, 2011: 205-220] semantically integrated Baidu Baike, Hudong Baike, and Chinese Wikipedia, and developed a Semantic Web data query application for data described in Chinese. Yidong Chen et al. [Chen Yidong, Chen Liwei, Xu Kun. Learning Chinese entity attributes from online encyclopedia [C]. APWeb 2012: 179-186] propose using the attribute-value pairs in Chinese encyclopedia Infoboxes to automatically extract well-structured training samples, then extracting a massive number of knowledge triples from the unstructured encyclopedia text with a statistical learning model, and finally constructing an open-domain Chinese knowledge base.
The deficiencies of existing systems and the main contributions of the present invention are as follows:
1) A new overall framework for a large-scale Chinese ontology mapping model is proposed.
There is still little research on discovering equivalence relations between ontology concepts across semantic datasets in a Chinese-language setting. In the Semantic Web environment, as ontologies grow in scale, ensuring the efficiency of ontology mapping becomes an urgent problem. This work therefore proposes a framework-level ontology mapping model for Chinese. First, a multi-strategy fusion method combining edit distance and Tongyici Cilin is used to compute the initial similarity of concepts between the ontologies to be mapped. Second, based on data field theory and with the initial concept similarity as input, the scale of the ontologies to be mapped is compressed. Finally, according to the semantic features of Chinese concepts and the encyclopedic knowledge base, and by introducing the idea of sequence alignment from bioinformatics, a new deterministic mapping strategy for equivalence relations between Chinese ontology concepts is proposed.
2) A new method for compressing and reducing the scale of large-scale ontology mapping is proposed.
Traditional ontology mapping systems and methods often focus only on the mapping result and ignore mapping efficiency, so they are of limited practical use for large-scale ontology mapping tasks. Before performing deterministic mapping of equivalence relations on large-scale Chinese ontologies, and in order to keep the time complexity within an acceptable range, this work proposes a new data field potential function and uses it to first reduce and compress the mapping scale of the large-scale ontologies. Specifically, building on an improvement of the original pseudo-nuclear-force field potential function and on Tongyici Cilin (extended edition), it proposes a new method that measures the potential value of a data object by jointly computing the semantic similarity and dissimilarity values between concepts, and on this basis designs a new algorithm for reducing the mapping scale of large-scale ontologies.
3) A new method for computing concept semantic similarity, based on the idea of global pairwise sequence alignment from bioinformatics, is proposed.
The work in [Zhong Q, Li H, Li J, Xie G, Tang J, Zhou L, Pan Y. A gauss function based approach for unbalanced ontology matching [C]. Proceedings of the 28th International Conference on Management of Data (SIGMOD). Rhode Island, USA, 2009: 669-680] currently applies only to ontologies and mapping tasks described in English. It lacks support for multilingual ontologies and, in particular, is not optimized for the features of Chinese ontologies. Moreover, the concept similarity methods in traditional Chinese ontology mapping systems do not consider how differences in the order of atomic concepts within a combined concept, or polysemy, affect the quality of the mapping relation built between two combined concepts. They ignore two important features of Chinese concepts, "word-order sensitivity" and "polysemy", which inevitably introduces errors into the mapping result. To solve these problems, this work abstracts the discovery of equivalence relations between Chinese concepts as a global sequence alignment problem and, following the idea of dynamic programming, introduces the Needleman-Wunsch global alignment algorithm from bioinformatics to compute the semantic similarity between combined concepts. Experiments show that the concept similarity algorithm based on Needleman-Wunsch global alignment effectively avoids the incorrect mappings that traditional methods may produce. When facing large-scale Chinese ontology mapping tasks, the proposed method is more advantageous and more reasonable than traditional methods.
In summary, large-scale Chinese ontologies published on the web are still few and show considerable heterogeneity, and existing Chinese ontology mapping systems are inefficient and of limited usability when facing large-scale ontology mapping tasks. There is also still no system that targets ontologies described in Chinese and is suited to large-scale ontology mapping tasks in the Semantic Web environment. The present invention therefore designs and implements a large-scale ontology mapping system for Chinese based on Tongyici Cilin.
Summary of the invention
In a Chinese ontology mapping system, simple word entries and out-of-vocabulary words both correspond to concepts in the ontologies to be mapped. The present invention therefore calls the concept corresponding to a simple word entry an atomic concept (Atom Concept, AC) and the concept corresponding to an out-of-vocabulary word a combined concept (Component Concept, CC), and stipulates that every combined concept is formed by the linear arrangement of several atomic concepts. A set of definitions and formal descriptions of the problem is given first:
Definition 1 (ontology mapping): Given two ontologies to be mapped, $O^{source}$ and $O^{target}$, for a concept $C^{source}$ in the source ontology $O^{source}$, a corresponding concept $C^{target}$ with the same or similar semantics must be found in the target ontology $O^{target}$. The mapping function $map: O^{source} \rightarrow O^{target}$ is therefore defined as follows:
For any $C^{source} \in O^{source}$, if $sim(C^{source}, C^{target}) > T$, then $map(C^{source}) = C^{target}$.
Here $sim(C^{source}, C^{target})$ is the similarity of the concepts to be mapped, $C^{source}$ and $C^{target}$, and $T$ is a threshold: when the semantic similarity of $C^{source}$ and $C^{target}$ is greater than $T$, the pair $\langle C^{source}, C^{target} \rangle$ is taken as a discovered concept mapping pair. In this system, any threshold value $T \in [0.8, 0.9]$ meets the requirements. The total number of concepts in the source ontology $O^{source}$ is $n^{source}$, and the total number of concepts in the target ontology $O^{target}$ is $n^{target}$.
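Definition 1 can be illustrated with a minimal Python sketch. The function name `map_concept`, the default threshold of 0.85 (one value inside the stated interval [0.8, 0.9]), and the choice of returning the best-scoring target when several exceed the threshold are assumptions made here for illustration; the definition itself only requires that the similarity exceed T.

```python
from typing import Callable, Iterable, Optional


def map_concept(c_source: str,
                target_concepts: Iterable[str],
                sim: Callable[[str, str], float],
                threshold: float = 0.85) -> Optional[str]:
    """Return a target concept whose similarity to c_source exceeds the threshold.

    Assumption: if several targets qualify, the most similar one is returned.
    Returns None when no target concept exceeds the threshold.
    """
    best, best_score = None, threshold
    for c_target in target_concepts:
        score = sim(c_source, c_target)
        if score > best_score:
            best, best_score = c_target, score
    return best
```

Any similarity function with values in [0, 1] can be passed as `sim`, for example the edit-distance, thesaurus, or alignment similarities defined later in this description.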
Definition 2: For the Tongyici Cilin semantic knowledge base (Semantic Knowledge Base, SKB), the set $SKB_{TYCCL}$ consists of atomic concepts, that is, $SKB_{TYCCL} = \{AC_1, AC_2, \ldots, AC_N\}$, where an element $AC_i$ is an atomic concept in the set $SKB_{TYCCL}$ and $N$ is the number of word entries contained in the knowledge base.
Definition 3: A combined concept CC is formed by an ordered arrangement of a series of atomic concepts. That is, for any combined concept there is an ordered sequence $CC = [AC_1, AC_2, \ldots, AC_i, \ldots]$, where $i \geq 1$ and $i$ is the position of the atomic concept $AC_i$ in the ordered sequence CC. In particular, every atomic concept $AC_i$ satisfies $AC_i = [AC_i]$.
Definition 4: For concepts $C^{source}$ and $C^{target}$ in the ontologies to be mapped, $O^{source}$ and $O^{target}$, we have $C^{source} = CC^{source} = [AC_1^{source}, AC_2^{source}, \ldots, AC_i^{source}, \ldots, AC_m^{source}]$ and $C^{target} = CC^{target} = [AC_1^{target}, AC_2^{target}, \ldots, AC_j^{target}, \ldots, AC_n^{target}]$, where $m$ and $n$ are the lengths of the ordered sequences $CC^{source}$ and $CC^{target}$ corresponding to the combined concepts $C^{source}$ and $C^{target}$, with $m, n \geq 1$.
A large-scale ontology mapping method for the Chinese language, characterized in that it consists of three major steps: computing the initial association degree of concepts by fusing edit distance and Tongyici Cilin similarity; ontology compression; and deterministic mapping;
(1) Computing the initial association degree of concepts by fusing edit distance and Tongyici Cilin similarity
A) Edit-distance similarity
Given two ontologies to be mapped, $O^{source}$ and $O^{target}$, for a concept $C^{source}$ in the source ontology $O^{source}$, a corresponding concept $C^{target}$ with the same or similar semantics must be found in the target ontology $O^{target}$. For two concepts $C^{source}$ and $C^{target}$, the edit distance and the similarity are given by formula (1) and formula (2):
$$\mathrm{EditDistance}(C^{source}, C^{target}) = \frac{|Do(C^{source}, C^{target})|}{\max(L(C^{source}), L(C^{target}))} \tag{1}$$
where $|Do(C^{source}, C^{target})|$ is the number of edit operations between the concepts to be mapped, that is, the minimum number of operations needed to transform the string $C^{source}$ into the string $C^{target}$, with three kinds of operation allowed: adding, deleting, or modifying one character; $L(C^{source})$ and $L(C^{target})$ are the character lengths of the concepts to be mapped;
$$SIM_E(C^{source}, C^{target}) = \frac{1}{1 + \mathrm{EditDistance}(C^{source}, C^{target})} \tag{2}$$
where $SIM_E(C^{source}, C^{target})$ is the similarity of the concepts to be mapped, $C^{source}$ and $C^{target}$;
B) Tongyici Cilin similarity
The similarity formula based on Tongyici Cilin is:
$$SIM_T(C^{source}, C^{target}) = \alpha \times \frac{F_i}{|F|} \times \cos\left(n_{subTree} \times \frac{\pi}{180}\right) \times \left(\frac{n_{subTree} - D + 1}{n_{subTree}}\right) \tag{3}$$
For $F_i \in F$, $F_i$ is the layer number at which the sub-codes of the word entries $C^{source}$ and $C^{target}$ differ, the difference occurring at the $i$-th layer; $|F|$ is the number of elements in the set $F$ and is always 5 in this system; the concept similarity weight coefficient is $\alpha \times (F_i / |F|)$; $n_{subTree}$ is the total number of nodes under the corresponding branch of layer $F_i$, where the sub-codes of $C^{source}$ and $C^{target}$ differ; $D$ is the coding distance between $C^{source}$ and $C^{target}$; any value $\alpha \in [0.4, 0.5]$ meets the requirements;
C) Multi-strategy fusion association degree algorithm
First, the similarity results of the two basic algorithms are compared and the maximum of the two is taken; at the same time, both the similarity and the dissimilarity between two concepts $C^{source}$ and $C^{target}$ are considered and superimposed into the final association degree of each concept. The present invention denotes the maximum of the two similarity results by $\rho$ and, correspondingly, the dissimilarity index by $1 - \rho$; clearly $\rho \in (0, 1]$, which gives formula (4):
$$\rho_{st} = \max\left(SIM_E(C_s^{source}, C_t^{target}),\ SIM_T(C_s^{source}, C_t^{target})\right) \tag{4}$$
The semantic correlation coefficient between the concepts $C_s^{source}$ and $C_t^{target}$ is denoted $\lambda_{st}$.
Finally, the initial association degree $m_s^{source}$ of the source ontology concept $C_s^{source}$ with respect to the target ontology $O^{target}$ is expressed by formula (6);
$$m_s^{source} = \lambda_{s1} + \lambda_{s2} + \lambda_{s3} + \ldots + \lambda_{s n^{target}} = \sum_{t=1}^{n^{target}} \lambda_{st} \tag{6}$$
Because the association degree computation is symmetric, the initial association degree $m^{target}$ of a concept $C^{target}$ in the target ontology is obtained in the same way. When the final initial association value of a concept is zero, the initial association degree factor $m^{source}$ or $m^{target}$ is set to any value in $[0.01, 0.05]$, which meets the requirements. This yields the initial association degree sets $Map\_O^{source}$ and $Map\_O^{target}$ of all concepts in the ontologies to be mapped. The initial association degree sets are uniformly expressed as key-value pairs: $Map\_O\langle C, m \rangle$;
(2) Ontology compression algorithm
For large-scale ontology mapping tasks, traditional algorithms struggle with either time or space complexity, so a suitable strategy is needed to compress the ontologies to be mapped;
For the concept set of the source ontology $O^{source}$ and the concept set of the target ontology $O^{target}$, the initial association value of each concept, given by formula (6), is used to characterize the degree of influence of that concept on the other concepts. The corrected field-strength function is shown in formula (8).
Taking $\delta = 1$ and $R = 2$ gives the potential-value function of each concept in the ontology to be mapped $O^{source}$, as shown in formula (9).
The potential values of the concepts in the target ontology are obtained in the same way. This finally yields the potential-value sets $potentialMap\_O^{source}$ and $potentialMap\_O^{target}$ of all concepts in the ontologies to be mapped. The potential-value sets are uniformly defined as key-value pairs.
The concept set in each ontology $O$ is divided into two parts, called the candidate region and the elimination region;
Specifically, for the output key-value sets $Map\_O^{source}$ and $Map\_O^{target}$ obtained after running the multi-strategy fusion association degree algorithm, the number of concepts whose association value is greater than 0.05 is counted in each set. These counts are called $Range\_Candidate\_O^{source}$ and $Range\_Candidate\_O^{target}$ and are defined as the upper bounds of the candidate-region intervals of the ontologies to be mapped, $O^{source}$ and $O^{target}$;
The concept elements of the potential-value sets $potentialMap\_O^{source}$ and $potentialMap\_O^{target}$ are sorted in descending order of potential value. For every $C_s^{source} \in potentialMap\_O^{source}$, its rank is denoted by the variable $Rank_s^{source}$. If $Rank_s^{source} \in [1, Range\_Candidate\_O^{source}]$, the concept $C_s^{source}$ is retained as a candidate concept; correspondingly, if $Rank_s^{source} \in [Range\_Candidate\_O^{source} + 1, n^{source}]$, the concept $C_s^{source}$ is eliminated. By the symmetry between the source ontology and the target ontology, the candidate concept extraction rule for the target ontology $O^{target}$ is obtained in the same way;
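The partition into candidate region and elimination region can be sketched in Python as follows. The function name `split_candidates` and the use of plain dictionaries for the association-degree and potential-value sets are assumptions made for illustration; the potential values themselves (formulas (8) and (9)) are taken as given input rather than computed here.

```python
from typing import Dict, List, Tuple


def split_candidates(association: Dict[str, float],
                     potential: Dict[str, float],
                     cutoff: float = 0.05) -> Tuple[List[str], List[str]]:
    """Split the concepts of one ontology into candidate and eliminated lists.

    association: concept -> initial association degree (the Map_O set)
    potential:   concept -> potential value (the potentialMap_O set)
    """
    # Upper bound of the candidate region: number of concepts whose
    # association degree is greater than the cutoff (0.05 in the text).
    range_candidate = sum(1 for m in association.values() if m > cutoff)
    # Rank concepts by potential value, highest first.
    ranked = sorted(potential, key=potential.get, reverse=True)
    # Ranks 1..range_candidate are kept, the remaining ranks are eliminated.
    return ranked[:range_candidate], ranked[range_candidate:]
```

The same function would be applied once to the source ontology and once to the target ontology, since the rule is symmetric.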
(3) Deterministic mapping
For any two concepts $C^{source}$ and $C^{target}$ in the source ontology $O^{source}$ and the target ontology $O^{target}$, three cases arise when computing their semantic similarity:
1. $C^{source}$ and $C^{target}$ are both atomic concepts, that is, $C^{source} \in SKB_{TYCCL}$ and $C^{target} \in SKB_{TYCCL}$
2. One of $C^{source}$ and $C^{target}$ is an atomic concept and the other is a combined concept, that is, $C^{source} \notin SKB_{TYCCL}$ or $C^{target} \notin SKB_{TYCCL}$
3. $C^{source}$ and $C^{target}$ are both combined concepts, that is, $C^{source} \notin SKB_{TYCCL}$ and $C^{target} \notin SKB_{TYCCL}$
For case 1, formula (3) is used to compute the semantic similarity of the two concepts. For cases 2 and 3, this system first represents the two word sequences to be compared as a scoring matrix, with the two sequences forming the two dimensions of the dynamic programming matrix. For concepts $C^{source}$ and $C^{target}$ in the ontologies to be mapped, $O^{source}$ and $O^{target}$, the $i$-th row of the scoring matrix $M$ corresponds to the atomic concept $AC_i^{source}$ of the sequence $CC^{source}$, and the $j$-th column corresponds to the atomic concept $AC_j^{target}$ of the sequence $CC^{target}$, where $i \leq m$ and $j \leq n$. The element in row $i$ and column $j$ of the dynamic programming matrix $M$ is called $M_{ij}$;
First, the gap penalty factor of the sequence alignment algorithm is set to $p = -0.05$, and row $m+1$ and column $n+1$ of the matrix are initialized;
Second, based on the Tongyici Cilin similarity function $SIM_T$, the remaining $m \times n$ elements of the matrix are solved recursively;
The scoring function $f$ is defined in formula (11).
The recurrence rule is shown in formula (12):
$$M_{ij} = \max \left\{ \begin{array}{l} M_{(i+1)(j+1)} + f(AC_i^{source}, AC_j^{target}) \\ M_{(i)(j+1)} + p \\ M_{(i+1)(j)} + p \end{array} \right. \tag{12}$$
Starting from element $M_{mn}$ of the matrix and tracing back to element $M_{11}$, the optimal alignment path is obtained; if more than one optimal alignment path exists, any one of them may be chosen;
Finally, the gap symbol "-" is inserted to obtain the correct global sequence alignment result;
The two combined concept sequences to be mapped, after the gap symbol "-" has been inserted, are called $CC^{source\prime}$ and $CC^{target\prime}$. The two sequences now contain the same number of elements, denoted $L_{cc\prime}$. From the alignment result and the scoring function $f$, the similarity between combined concepts is given by formula (13):
$$SIM_{NW}(CC^{source\prime}, CC^{target\prime}) = \frac{\sum_{i=1}^{L_{cc\prime}} f(AC_i^{source\prime}, AC_i^{target\prime})}{L_{cc\prime}} \tag{13}$$
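The alignment step can be sketched in Python under several stated assumptions. The scoring function `f` is passed in as a parameter because formula (11) is not reproduced in this text. The boundary row and column are initialized with cumulative gap penalties, as in standard Needleman-Wunsch, since the initialization values are not spelled out. Because the recurrence (12) refers to $M_{(i+1)(j+1)}$, the sketch fills the matrix from the last row and column backwards and then walks the path from the first cell; this traversal order is a choice made here. Aligned positions that contain a gap are assumed to contribute 0 to formula (13).

```python
from typing import Callable, List, Sequence, Tuple

GAP = "-"  # gap symbol inserted into the aligned sequences


def global_align(source: Sequence[str], target: Sequence[str],
                 f: Callable[[str, str], float],
                 p: float = -0.05) -> Tuple[List[str], List[str]]:
    """Globally align two atomic-concept sequences using recurrence (12)."""
    m, n = len(source), len(target)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Boundary row and column (row m+1 and column n+1 in 1-based terms).
    # Assumption: cumulative gap penalties.
    for i in range(m - 1, -1, -1):
        M[i][n] = M[i + 1][n] + p
    for j in range(n - 1, -1, -1):
        M[m][j] = M[m][j + 1] + p
    # Fill the remaining m x n cells with recurrence (12).
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            M[i][j] = max(M[i + 1][j + 1] + f(source[i], target[j]),
                          M[i][j + 1] + p,
                          M[i + 1][j] + p)
    # Walk one optimal path, inserting the gap symbol where needed.
    aligned_s: List[str] = []
    aligned_t: List[str] = []
    i = j = 0
    while i < m or j < n:
        if i < m and j < n and M[i][j] == M[i + 1][j + 1] + f(source[i], target[j]):
            aligned_s.append(source[i]); aligned_t.append(target[j])
            i += 1; j += 1
        elif i < m and M[i][j] == M[i + 1][j] + p:
            aligned_s.append(source[i]); aligned_t.append(GAP)
            i += 1
        else:
            aligned_s.append(GAP); aligned_t.append(target[j])
            j += 1
    return aligned_s, aligned_t


def sim_nw(aligned_s: Sequence[str], aligned_t: Sequence[str],
           f: Callable[[str, str], float]) -> float:
    """Formula (13): mean score over the aligned positions (gaps score 0 here)."""
    if not aligned_s:
        return 0.0
    total = sum(f(a, b) for a, b in zip(aligned_s, aligned_t)
                if a != GAP and b != GAP)
    return total / len(aligned_s)
```

With a toy exact-match scoring function, aligning ["中国", "人民", "银行"] against ["中国", "银行"] places a gap opposite "人民" instead of forcing "人民" to pair with "银行", which is the kind of word-order error the alignment is meant to avoid.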
Brief description of the drawings
Fig. 1: Flowchart of the large-scale Chinese ontology mapping method
Fig. 2(a): An incorrect matching result
Fig. 2(b): The correct alignment result
Fig. 3: Scoring matrix of example one
Fig. 4: Scoring matrix of example two
Fig. 5: Sequence matching result of example two
Detailed description of the embodiments
(1) Computing the initial association degree of concepts by fusing edit distance and Tongyici Cilin similarity
A) Edit-distance similarity
For the mapping task of large-scale ontologies, the present invention proposes compressing the ontologies to be mapped first. Specifically, the edit-distance algorithm is first used to compute the initial similarity between the concept sets. This is because, when computing the initial association degree, the efficiency of the algorithm is the primary consideration and its accuracy is secondary.
That is, when obtaining the initial association degree of the ontologies to be mapped, this system uses the edit-distance algorithm to obtain the literal similarity between concepts and ignores their semantic relatedness. Specifically, for two concepts $C^{source}$ and $C^{target}$, the edit distance and the similarity are given by formula (1) and formula (2):
$$\mathrm{EditDistance}(C^{source}, C^{target}) = \frac{|Do(C^{source}, C^{target})|}{\max(L(C^{source}), L(C^{target}))} \tag{1}$$
where $|Do(C^{source}, C^{target})|$ is the number of edit operations between the concepts to be mapped, that is, the minimum number of operations needed to transform the string $C^{source}$ into the string $C^{target}$, with three kinds of operation allowed: adding, deleting, or modifying one character. $L(C^{source})$ and $L(C^{target})$ are the character lengths of the concepts to be mapped.
$$SIM_E(C^{source}, C^{target}) = \frac{1}{1 + \mathrm{EditDistance}(C^{source}, C^{target})} \tag{2}$$
where $SIM_E(C^{source}, C^{target})$ is the similarity of the concepts to be mapped, $C^{source}$ and $C^{target}$.
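Formulas (1) and (2) can be sketched in Python as follows. The function names are chosen here for illustration, and the operation count $|Do|$ is computed with the standard Levenshtein dynamic program, which matches the three operations named in the text; treating two empty strings as having distance 0 is an added assumption.

```python
def edit_operations(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> float:
    """Formula (1): operation count normalized by the longer string length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings are identical
    return edit_operations(a, b) / longest


def sim_e(a: str, b: str) -> float:
    """Formula (2): similarity derived from the normalized edit distance."""
    return 1.0 / (1.0 + edit_distance(a, b))
```

Because the normalized distance lies in [0, 1], formula (2) as written yields similarities in [0.5, 1]: identical strings score 1.0 and strings with no characters in common score 0.5.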
B) Tongyici Cilin similarity
Tongyici Cilin (TYCCL) is a Chinese thesaurus of synonyms. It encodes each word and organizes the words in a tree structure with hierarchical relations, in which each node represents a concept. The identification of co-referent Chinese concepts can in fact be abstracted as a similarity computation problem for Chinese synonyms, which makes Tongyici Cilin the most suitable choice. This system adopts the extended edition of Tongyici Cilin from Harbin Institute of Technology as the common-sense knowledge base for extracting Chinese ontology mapping relations.
Tongyici Cilin organizes word entries into a hierarchy with five layers from top to bottom. Each layer has a corresponding sub-code, and the five sub-codes arranged from left to right form the Cilin code of a word entry. The semantic relatedness implied between words increases with the depth of the layer.
Taking the word entry "material" (Cilin code: Ba01A02=) as an example, the format of the Cilin code is explained in Table 1:
Table 1: Example of a Cilin code
According to the structure of Tongyici Cilin, the Cilin codes of the concepts to be mapped are first parsed to extract the sub-codes of layers 1 to 5, and the sub-codes are then compared starting from layer 1. If the sub-codes differ, the mapping pair is given a similarity weight according to the layer at which the difference occurs. The deeper the layer at which the sub-codes differ, the higher the similarity weight; the shallower the layer, the lower the weight. The number of branch nodes at each layer also affects the similarity.
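The parsing and layer-by-layer comparison described above can be sketched in Python. The split positions (one character, one character, two digits, one character, two digits, then a final marker character) are an assumption based on the example code Ba01A02= and on the commonly described layout of the extended Tongyici Cilin; the function names are hypothetical.

```python
from typing import List, Optional


def split_cilin_code(code: str) -> List[str]:
    """Split an 8-character Cilin code such as 'Ba01A02=' into five sub-codes.

    Assumed layout: layer 1 = 1 char, layer 2 = 1 char, layer 3 = 2 digits,
    layer 4 = 1 char, layer 5 = 2 digits, followed by one marker character.
    """
    if len(code) != 8:
        raise ValueError("expected an 8-character Cilin code")
    return [code[0], code[1], code[2:4], code[4], code[5:7]]


def first_differing_layer(code_a: str, code_b: str) -> Optional[int]:
    """Return the 1-based layer at which two codes first differ, or None."""
    pairs = zip(split_cilin_code(code_a), split_cilin_code(code_b))
    for layer, (sub_a, sub_b) in enumerate(pairs, start=1):
        if sub_a != sub_b:
            return layer
    return None
```

The returned layer number plays the role of the layer at which the sub-codes differ in the similarity weight; a result of None means all five sub-codes are equal.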
The similarity formula based on Tongyici Cilin is given below:
$$SIM_T(C^{source}, C^{target}) = \alpha \times \frac{F_i}{|F|} \times \cos\left(n_{subTree} \times \frac{\pi}{180}\right) \times \left(\frac{n_{subTree} - D + 1}{n_{subTree}}\right) \tag{3}$$
Because the ontology mapping task is more concerned with the semantic similarity between concepts, the present invention introduces an adjustment parameter, the semantic relatedness factor $\alpha$. $\alpha$ regulates the relation between semantic relatedness and semantic similarity for concepts at different layers, and controls how similar word entries located in different branches can be; clearly $\alpha \in (0, 1)$. The larger the value of $\alpha$, the more likely word entries at different layers are to be similar or equivalent, and the greater the influence of the semantic relatedness of different layers on the final concept similarity; the smaller the value, the smaller that influence.
Here $F = \{1, 2, 3, 4, 5\}$. For $F_i \in F$, $F_i$ is the layer number at which the sub-codes of the word entries $C^{source}$ and $C^{target}$ differ, the difference occurring at the $i$-th layer; $|F|$ is the number of elements in the set $F$ and is always 5 in this system. The concept similarity weight coefficient is $\alpha \times (F_i / |F|)$. $n_{subTree}$ is the total number of nodes under the corresponding branch of layer $F_i$, where the sub-codes of $C^{source}$ and $C^{target}$ differ, and $D$ is the coding distance between $C^{source}$ and $C^{target}$.
In particular, when all five layers of the codes of the concept pair to be mapped are equal and the last character of the Cilin code is "=", the similarity function $SIM_T$ returns 1.0. Clearly the range of $SIM_T$ is $(0, 1]$. This system focuses on obtaining equivalence relations between word-entry concepts. For Chinese ontology mapping tasks in particular, the semantic similarity between concepts is to be emphasized, so the value of $\alpha$ should not be too high; any value of the semantic relatedness factor $\alpha \in [0.4, 0.5]$ meets the requirements.
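Formula (3) can be evaluated with a short Python sketch. The function name `sim_t` and its parameter names are assumptions; the layer of the first difference, the branch node count, and the coding distance are taken as already-computed inputs, and the special case of identical codes ending in "=" (which returns 1.0) is left to the caller.

```python
import math


def sim_t(level: int, n_subtree: int, distance: int,
          alpha: float = 0.45, levels: int = 5) -> float:
    """Formula (3) for two word entries whose sub-codes differ at `level`.

    level:     layer number F_i (1..5) where the sub-codes first differ
    n_subtree: number of nodes under the corresponding branch at that layer
    distance:  coding distance D between the two entries
    alpha:     semantic relatedness factor, any value in [0.4, 0.5]
    """
    weight = alpha * level / levels                     # alpha * F_i / |F|
    angle = math.cos(n_subtree * math.pi / 180.0)       # cos(n * pi / 180)
    spread = (n_subtree - distance + 1) / n_subtree     # (n - D + 1) / n
    return weight * angle * spread
```

For example, with a difference at layer 5, three nodes in the branch, a coding distance of 1, and $\alpha = 0.5$, the weight is 0.5, the cosine term is $\cos(3°)$, and the last factor is 1, so the similarity is just under 0.5. A difference at a shallower layer yields a lower value, in line with the weighting rule described above.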
C) how strategy merges algorithm of correlation degree
Because editing distance similarity algorithm and Chinese thesaurus similarity algorithm have certain complementarity, therefore the similarity result value of these two kinds of algorithms merges by the present invention.
When application data field theory, each concept is regarded as the particle in field, and the gravitation phenomenon that previous work often exists between particle in a consideration physical field, and exactly ignore the objective fact of going back ubiquity repulsion between particle, therefore introduce repulsion to the factor of influence of the degree of association.In order to improve this defect, the present invention considers gravitation between concept to be mapped and repulsion.That is, the similarity between concept is considered as the gravitation existed between particle in field, and distinctiveness ratio is considered as repulsion.Therefore, in the algorithm one of the present invention's proposition, first by comparing the similarity result of two kinds of rudimentary algorithms, the maximal value of two kinds of arithmetic result is got; Meanwhile, two concept C are considered sourceand C targetbetween similarity and distinctiveness ratio, and superposed and entered each concept C source, C targetthe final degree of association.It is ρ that the present invention defines the maximal value that two kinds of similarity algorithms obtain, obvious ρ ∈ (0,1], then there is formula (4):
$$\rho_{st} = \max\left(SIM_E(C_s^{source}, C_t^{target}),\ SIM_T(C_s^{source}, C_t^{target})\right) \tag{4}$$
Here, the semantic correlation coefficient between concepts C_s^source and C_t^target is denoted λ_st. In formula (5), 1-ρ_st measures the dissimilarity between the two concepts. The term -log(1-ρ_st) uses the base-10 logarithm and is a strictly monotonically increasing function of ρ_st, so that the dissimilarity reflects the trend of the similarity smoothly. The larger the similarity ρ_st, the smaller the dissimilarity and the larger the value of the adjustment function -ρ_st × log(1-ρ_st). The initial association degree obtained in this way takes into account both the similarity and the dissimilarity between source and target ontology concepts, which makes the result more reasonable. To make formula (5) converge, it is stipulated that when ρ falls in the interval (0.9, 1], the semantic correlation factor λ_st between the two concepts is 1.
Finally, the initial association degree between concept C_s^source in the source ontology and the target ontology O^target is expressed by formula (6).
$$m_s^{source} = \lambda_{s1} + \lambda_{s2} + \lambda_{s3} + \dots + \lambda_{s n^{target}} = \sum_{t=1}^{n^{target}} \lambda_{st} \tag{6}$$
Because the association degree computation is symmetric, the initial association degree m^target of a concept C^target in the target ontology is obtained in the same way. In other words, the two ontologies to be mapped are on an equal footing: either may be regarded as the source ontology, and the other is then the target ontology. The initial association degree of a concept in either ontology is therefore determined uniquely by the interaction between the ontologies and is independent of which one the system labels as source or target. The multi-strategy fusion initial association degree algorithm is described in Algorithm 1. If the final initial association degree of a concept is zero, then, to keep the transfer of association degree continuous, the system sets m^source or m^target to a random number in [0.01, 0.05]. This yields the initial association degree sets Map_O^source and Map_O^target of all concepts in the ontologies O to be mapped. The initial association degree set is uniformly defined as key-value pairs: Map_O<C, m>
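The fusion steps of formulas (4) and (6) can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: formula (5) is not reproduced in this text, so the mapping from ρ to λ below the 0.9 threshold is passed in as a hypothetical callable `lam_from_rho`, and the two similarity functions `sim_e` and `sim_t` are likewise assumed to be supplied by the caller.

```python
import random
from typing import Callable, Dict, Optional, Sequence

Similarity = Callable[[str, str], float]


def initial_association_degrees(
    source: Sequence[str],
    target: Sequence[str],
    sim_e: Similarity,
    sim_t: Similarity,
    lam_from_rho: Callable[[float], float],  # hypothetical stand-in for formula (5)
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Return Map_O<C, m> for the source side: concept -> initial association degree."""
    rng = rng or random.Random(0)
    degrees: Dict[str, float] = {}
    for cs in source:
        total = 0.0
        for ct in target:
            # Formula (4): take the larger of the two basic similarities.
            rho = max(sim_e(cs, ct), sim_t(cs, ct))
            # Convergence rule from the text: rho in (0.9, 1] gives lambda = 1.
            lam = 1.0 if rho > 0.9 else lam_from_rho(rho)
            # Formula (6): sum lambda over all target concepts.
            total += lam
        if total == 0.0:
            # Zero degrees are replaced by a random value in [0.01, 0.05].
            total = rng.uniform(0.01, 0.05)
        degrees[cs] = total
    return degrees
```

Because the computation is symmetric, calling the same function with the two concept lists swapped would give the target-side set.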
(2) Large-scale ontology compression algorithm
For large-scale ontology mapping tasks, traditional algorithms struggle in time or space complexity, so a strategy is needed to compress the ontologies to be mapped.
Data field theory is based on field theory in physics: it abstracts the relationships between data in a number-field space as interactions between material particles and formalizes them as a field-theoretic description. The theory expresses the interaction between data through a potential function, thereby reflecting the distribution of the data, and clusters the data set according to the equipotential-line structure of the data field. However, the short-range field potential functions used in classical data fields, such as the pseudo-nuclear-force-field potential function, usually consider only the effect of the path distance between data objects on the final potential value. In ontology mapping, this amounts to ignoring the semantic association that is ubiquitous between data objects.
When a large-scale ontology is compressed, the semantic association between its concepts serves as the basis for compression. The present invention treats the ubiquitous semantic association between concepts in the source ontology O^source and the target ontology O^target as the basis and premise of ontology compression, treats ontology concepts as data objects in a data field, and treats the initial association degree of each concept as the mass of the corresponding object. It proposes a new method that measures the potential value of a data object by jointly computing the semantic similarity and dissimilarity between concepts. Introducing the semantic association degree factor corrects the shortcoming of the pseudo-nuclear-force-field potential function in ontology mapping, so that it matches the macroscopic characteristics of ontology mapping.
A) Definition of the potential function
Because a short-range field better reflects the interaction between data, the pseudo-nuclear-force-field potential function is adopted. It is defined for the ontology mapping problem as follows:
In the ontology O to be mapped, the shortest path length between concepts is ||C_s - C_l||. Owing to the short-range nature of the field, the path length R between concepts is defined to be no greater than 2. By data field theory, the field strength function of the interaction between concepts C_s and C_l is given by formula (7):
Here m_s denotes the mass of each data point. It is usually set to m_s = 1, but that choice only reflects the effect of the path distance between concepts on the final potential value and drops the semantic association between concepts entirely. The present invention therefore introduces the semantic similarity and dissimilarity between concepts into the potential computation by defining m_s as the initial association degree of the concept, given by formula (6). Taking similarity and dissimilarity into account in this way revises and completes the field strength function of formula (7). Formula (6) implies that the larger the initial association degree of a concept in an ontology to be mapped, the larger its mass in the data field.
That is, for the concept set of the source ontology O^source and the concept set of the target ontology O^target, the initial association degree of each concept characterizes how strongly that concept influences the other concepts. The revised field strength function is given by formula (8):
δ ∈ (0, +∞) reflects the granularity of the influence between concepts and is also called the scaling factor; here δ = 1 and R = 2. This yields the potential function of each concept in the ontology O^source to be mapped, as given by formula (9):
Because the computation of potential values between the source ontology and the target ontology is symmetric, the potential value of a concept in the target ontology is obtained in the same way. This finally yields the potential value sets potentialMap_O^source and potentialMap_O^target of all concepts in the ontologies O to be mapped. The potential value set is uniformly defined as key-value pairs:
B) Extraction of candidate concepts
To compress the ontology O to be mapped, the system divides the concept set of O into two parts: the candidate region and the elimination region.
Specifically, for the key-value sets Map_O^source and Map_O^target output by Algorithm 1, the system counts the concepts in each set whose association degree is greater than 0.05. These counts are called Range_Candidate_O^source and Range_Candidate_O^target and are defined as the upper bounds of the candidate regions of the ontologies O^source and O^target.
The concept elements of the potential value sets potentialMap_O^source and potentialMap_O^target are sorted in descending order of their values. For every C_s^source ∈ potentialMap_O^source, its position in this ordering is recorded in a rank variable. If the rank falls within the candidate region bounded by Range_Candidate_O^source, the concept is retained as a candidate concept; otherwise it is eliminated. By the symmetry between the source and target ontologies, the candidate extraction rule for the target ontology O^target is obtained in the same way.
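The candidate extraction step can be sketched as below. The exact retention condition is not legible in this text; the sketch assumes a concept is retained when its rank in the descending potential ordering does not exceed Range_Candidate, which is one reading of "upper bound of the candidate region". The function name `split_candidates` is hypothetical.

```python
from typing import Dict, List, Tuple


def split_candidates(
    association: Dict[str, float],  # Map_O<C, m>: initial association degrees
    potential: Dict[str, float],    # potentialMap_O: potential value per concept
    threshold: float = 0.05,
) -> Tuple[List[str], List[str]]:
    """Split concepts into (candidate region, elimination region)."""
    # Range_Candidate: number of concepts whose association degree exceeds 0.05.
    range_candidate = sum(1 for m in association.values() if m > threshold)
    # Rank concepts by potential value, highest first.
    ranked = sorted(potential, key=potential.get, reverse=True)
    # Assumption: the first range_candidate ranks are kept, the rest eliminated.
    return ranked[:range_candidate], ranked[range_candidate:]


kept, dropped = split_candidates(
    {"a": 1.2, "b": 0.03, "c": 0.4},
    {"a": 0.9, "b": 0.1, "c": 0.5},
)
```

Only the retained concepts would then enter the more expensive deterministic mapping stage, which is where the reduction in time and space cost comes from.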
(3) Deterministic concept mapping based on the Needleman-Wunsch algorithm
Commonly used semantic knowledge bases currently include the Chinese Thesaurus (Tongyici Cilin), HowNet, and WordNet. In the present invention, a word included in the Chinese Thesaurus (extended edition) is called a simple word, and a word not yet included in it is called an out-of-vocabulary (OOV) word.
Based on these definitions, the problem is discussed case by case. For any two concepts C^source and C^target in the source ontology O^source and the target ontology O^target to be mapped, the semantic similarity computation falls into one of three cases:
① C^source and C^target are both atomic concepts, that is, C^source ∈ SKB_TYCCL and C^target ∈ SKB_TYCCL;
② one of C^source and C^target is an atomic concept and the other is a compound concept, that is, C^source ∉ SKB_TYCCL or C^target ∉ SKB_TYCCL;
③ C^source and C^target are both compound concepts, that is, C^source ∉ SKB_TYCCL and C^target ∉ SKB_TYCCL.
For case ①, the present invention directly uses formula (3) to compute the semantic similarity of the two concepts. The similarity computation for cases ② and ③ is discussed below.
For the similarity of Chinese compound concepts, traditional Chinese ontology mapping systems offer one processing scheme. For example, Li Jia et al. designed and implemented an element-level concept similarity method based on HowNet and built a Chinese ontology mapping system [Li Jia, Zhu Ming, Liu Chen, et al. Research and implementation of Chinese ontology mapping [J]. Journal of Chinese Information Processing, 2007, 21(4): 27-33]. To handle the similarity of out-of-vocabulary words, this method traverses the atomic concept sequences of the two compound concepts, finds the atomic concept pairs with the highest similarity, and derives the similarity of the two compound concepts from these relatively maximal pairs. Its formula is as follows:
$$Sim(A, B) = \frac{\sum_{i=1}^{\max(m,n)} \max\nolimits_i(B_{xy})}{\max(m,n)} \tag{10}$$
Here B_xy denotes an element of the similarity matrix whose rows and columns are the known words obtained by segmenting the two terms, and max_i(B_xy) denotes the similarity value ranked i-th in that matrix.
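For reference, formula (10) can be sketched as follows. The sketch reads max_i(B_xy) as the i-th largest entry of the full similarity matrix, an interpretation that reproduces the values quoted for this method in the two worked examples of this document. The toy similarity table stands in for the Chinese Thesaurus and is an assumption for illustration only, including the default score of 0.1 for unrelated words.

```python
from typing import Callable, Sequence


def traditional_similarity(
    a: Sequence[str], b: Sequence[str], sim: Callable[[str, str], float]
) -> float:
    """Formula (10): average of the max(m, n) largest pairwise similarities."""
    if not a or not b:
        return 0.0
    # All entries B_xy of the m x n similarity matrix, largest first.
    values = sorted((sim(x, y) for x in a for y in b), reverse=True)
    k = max(len(a), len(b))
    return sum(values[:k]) / k


# Toy stand-in for the thesaurus similarity (illustrative values only).
TOY = {("thought", "theory"): 1.0, ("second", "time"): 1.0}


def toy_sim(x: str, y: str) -> float:
    if x == y:
        return 1.0
    return TOY.get((x, y), TOY.get((y, x), 0.1))


score = traditional_similarity(["thought", "history"], ["history", "theory"], toy_sim)
```

Because the ranking ignores where each word sits in its sequence, word order has no effect on the score, which is the weakness the alignment-based method addresses.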
However, Chinese concepts generally follow a "light front, heavy rear" pattern, so the above approach inevitably introduces errors into the semantic similarity computation. Consider two compound concepts to be mapped that occur in different ontologies, "historical theory" and "intellectual history". Word segmentation yields two ordered sequences of atomic concepts: [history, theory] and [thought, history]. With the conventional method for out-of-vocabulary words, the atomic concept mapping shown in Fig. 2(a) is obtained. Computing the semantic similarity of each atomic concept pair with formula (3) on the Chinese Thesaurus (extended edition) and combining the results with formula (10) gives an element-level similarity of 1.0, which is a wholly unreasonable compound concept mapping pair and similarity value. The reason is that this method ignores the word-order sensitivity that is ubiquitous in Chinese, as well as its "light front, heavy rear" character.
Therefore, this system adopts a concept semantic similarity method based on global sequence alignment.
A) Overview of sequence alignment algorithms
In bioinformatics, pairwise alignment means lining up two DNA, RNA, or protein sequences to show their similarities; gap symbols may be inserted into a sequence so that identical or similar symbols fall in the same column. Comparing similar segments and conserved sites between the two sequences reveals the molecular evolutionary relationships that may exist between them.
Alignment models fall into two classes. Global alignment examines the overall similarity between two sequences by scanning and comparing them over their full length. Local alignment focuses on particular segments of the sequences and compares the similarity between segments. Both are solved by dynamic programming (DP).
The Needleman-Wunsch algorithm is a typical global alignment algorithm, suited to comparing two sequences that are broadly similar overall. Proposed by Needleman and Wunsch in 1970, it is a dynamic programming algorithm for comparing the similarity of two sequences and one of the basic algorithms of bioinformatics. This system mainly considers global pairwise alignment.
B) Constructing the dynamic programming scoring matrix
A sequence is a string of symbols arranged according to some rule. For the ontology concept similarity problem, the present invention treats a compound concept as a word sequence whose elements are atomic concepts. The compound concept is first segmented to obtain its word sequence; the large-scale Chinese ontology mapping system uses ICTCLAS50, developed by the Institute of Computing Technology of the Chinese Academy of Sciences, as the word segmentation tool. The alphabet is defined as the Chinese Thesaurus semantic knowledge base SKB_TYCCL together with the gap symbol "-".
This system abstracts the concept similarity computation of ontology mapping as the alignment of two word sequences. Guided by the gap penalty function, gap symbols are inserted at appropriate positions so that the two sequences have the same length, which establishes a correspondence between atomic concepts, or between an atomic concept and a gap. The essence of the sequence alignment algorithm is to find, through a scoring strategy, the best global pairing of the two compound concept sequences.
In this system, the two word sequences to be aligned are first represented as a scoring matrix, with the two sequences forming the two dimensions of the dynamic programming matrix. For concepts C^source and C^target in the ontologies O^source and O^target to be mapped, row i of the scoring matrix M corresponds to atomic concept AC_i^source of the word sequence CC^source, and column j corresponds to atomic concept AC_j^target of the word sequence CC^target, where i ≤ m and j ≤ n. The element in row i and column j of the dynamic programming matrix M is denoted M_ij.
For example, segmenting the compound concepts "second industrial revolution" and "World War II war criminal" yields two word sequences to be aligned: [second, time, industrial revolution] and [second, time, world war, war criminal]. Following the dynamic programming approach, the two sequences are laid out as rows and columns. If the sequence CC^source has length m and the sequence CC^target has length n, they form an (m+1) × (n+1) two-dimensional matrix with CC^source as the rows and CC^target as the columns, as shown in Fig. 4.
C) Recursive solution for the optimal alignment
Based on dynamic programming, the optimal alignment path in matrix M is solved recursively.
First, the gap penalty factor of the sequence alignment algorithm is set to p = -0.05, and row m+1 and column n+1 of the matrix are initialized.
Second, the remaining m × n elements of the matrix are solved recursively using the Chinese Thesaurus similarity function SIM_T. The scoring function f is defined by formula (11):
Because Chinese concepts are generally "light front, heavy rear", the recursion starts at the end of the two compound concepts, that is, at element M_mn of the matrix. SIM_T is described by formula (3). The recursion rule is given by formula (12):
$$M_{ij} = \max \begin{cases} M_{(i+1)(j+1)} + f(AC_i^{source}, AC_j^{target}) \\ M_{(i)(j+1)} + p \\ M_{(i+1)(j)} + p \end{cases} \tag{12}$$
Finally, backtracking starts from element M_mn of the matrix and ends at element M_11, which yields the optimal alignment path. If more than one optimal alignment path exists, any one of them may be chosen.
The concept element-level similarity algorithm based on global sequence alignment is given in Algorithm 2.
Taking again the compound concepts "second industrial revolution" and "World War II war criminal", the matrix M'_(i)(j) containing the optimal matching path obtained by Algorithm 2 is shown in Fig. 4. The arrows are obtained from formula (12) and indicate the directions available during backtracking; the bold arrows mark the optimal path. A bold diagonal arrow means that the two atomic concepts at its tail are matched. A bold horizontal arrow means that one gap symbol "-" is inserted in the word sequence CC^source before the atomic concept of its row. A bold vertical arrow means that one gap symbol "-" is inserted in the word sequence CC^target before the atomic concept of its column. From the scoring matrix of Fig. 4, the optimal alignment of the word sequences CC^source and CC^target is shown in Fig. 5:
Using Algorithm 2, the computation of each element of the scoring matrix M'_(i)(j) shown in Fig. 3 is given below; the scoring matrix shown in Fig. 4 is obtained in the same way.
Step ①: Matrix initialization. Set the penalty factor p = -0.05; let the sequence length of the compound concept CC^source be m and that of CC^target be n.
Step ②: Recursively compute the score of each element of the scoring matrix.
Computation starts from the last cell M_22 of the matrix, where the row index i = 2 and the column index j = 2, which gives:
The remaining elements are obtained in the same way:
Step ③: Backtrack to obtain the optimal path from M_33 to M_11: M_33 → M_32 → M_21 → M_11
Step ④: Finally, insert the gap symbol "-" to obtain the correct global sequence alignment shown in Fig. 2(b).
The two compound concept word sequences after insertion of the gap symbol "-" are denoted CC^source' and CC^target'. They now contain the same number of elements, denoted L_cc'. From the alignment result and the scoring function f, the similarity between compound concepts is given by formula (13):
$$SIM_{NW}(CC^{source'}, CC^{target'}) = \frac{\sum_{i=1}^{L_{cc'}} f(AC_i^{source'}, AC_i^{target'})}{L_{cc'}} \tag{13}$$
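The recursion of formula (12), the backtracking step, and formula (13) can be sketched together as follows. This is a minimal sketch under stated assumptions rather than Algorithm 2 itself: indices are 0-based, the border row and column are assumed to accumulate the gap penalty p, f is taken to be SIM_T for two atomic concepts and p when one side is a gap (as in the worked examples of this document), and ties between optimal paths are broken in a fixed order. The toy similarity table is illustrative only.

```python
from typing import Callable, List, Optional, Sequence, Tuple

GAP: Optional[str] = None  # stands for the gap symbol "-"
Pair = Tuple[Optional[str], Optional[str]]


def nw_similarity(
    source: Sequence[str],
    target: Sequence[str],
    sim_t: Callable[[str, str], float],
    p: float = -0.05,
) -> Tuple[float, List[Pair]]:
    """Return (SIM_NW, aligned pairs) for two atomic-concept sequences."""
    m, n = len(source), len(target)
    if m == 0 and n == 0:
        return 0.0, []
    # M[i][j]: best score for aligning source[i:] with target[j:].
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):  # assumed border initialization
        M[i][n] = M[i + 1][n] + p
    for j in range(n - 1, -1, -1):
        M[m][j] = M[m][j + 1] + p
    for i in range(m - 1, -1, -1):  # formula (12), filled from the end
        for j in range(n - 1, -1, -1):
            M[i][j] = max(
                M[i + 1][j + 1] + sim_t(source[i], target[j]),
                M[i][j + 1] + p,
                M[i + 1][j] + p,
            )
    aligned: List[Pair] = []
    i = j = 0
    while i < m or j < n:  # follow one optimal path through the matrix
        if i < m and j < n and abs(
            M[i][j] - (M[i + 1][j + 1] + sim_t(source[i], target[j]))
        ) < 1e-9:
            aligned.append((source[i], target[j]))
            i, j = i + 1, j + 1
        elif i < m and abs(M[i][j] - (M[i + 1][j] + p)) < 1e-9:
            aligned.append((source[i], GAP))
            i += 1
        else:
            aligned.append((GAP, target[j]))
            j += 1
    # Formula (13): mean of f over the gapped alignment.
    total = sum(p if a is GAP or b is GAP else sim_t(a, b) for a, b in aligned)
    return total / len(aligned), aligned


TOY = {("thought", "theory"): 1.0, ("second", "time"): 1.0,
       ("industrial revolution", "world war"): 0.18}


def toy_sim(x: str, y: str) -> float:
    return 1.0 if x == y else TOY.get((x, y), TOY.get((y, x), 0.1))


score, aligned = nw_similarity(["thought", "history"], ["history", "theory"], toy_sim)
```

With these toy values the first example aligns "history" with "history" and pads both ends with gaps, so the order-insensitive score of 1.0 from formula (10) drops to roughly 0.3.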
With the sequence-alignment-based element-level similarity method for compound concepts in place, the two similarity examples mentioned earlier are re-examined.
Example 1: CC^source = [thought, history], CC^target = [history, theory]. The compound concept similarity obtained with formula (10) is Sim(CC^source, CC^target) = (1.0 + 1.0)/2 = 1.0. The alignment obtained with the sequence alignment algorithm is shown in Fig. 2(b), and its scoring matrix in Fig. 3. The similarity computed by Algorithm 2 with formulas (3), (11), and (13) is SIM_NW(CC^source', CC^target') = (f(thought, -) + f(history, history) + f(-, theory))/3 = (p + SIM_T(history, history) + p)/3 = (-0.05 + 1.0 - 0.05)/3 = 0.3.
Example 2: CC^source = [second, time, industrial revolution], CC^target = [second, time, world war, war criminal]. The compound concept similarity computed with formula (10) is Sim(CC^source, CC^target) = 1.0, because the atomic concept "time" is polysemous. Specifically, the word "time" has several code entries in the Chinese Thesaurus (extended edition), and the entry "Dn04B03=" marks the two atomic words "second" and "time" as equivalent. Under formula (10) of the traditional method, four atomic concept pairs therefore score 1.0: <second, second> = 1.0, <second, time> = 1.0, <time, second> = 1.0, and <time, time> = 1.0. Substituting into formula (10) gives Sim(CC^source, CC^target) = (1.0 + 1.0 + 1.0 + 1.0)/4 = 1.0. By contrast, the similarity computed by Algorithm 2 with formulas (3), (11), and (13) is SIM_NW(CC^source', CC^target') = (f(second, second) + f(time, time) + f(industrial revolution, world war) + f(-, war criminal))/4 = (SIM_T(second, second) + SIM_T(time, time) + SIM_T(industrial revolution, world war) + p)/4 = (1.0 + 1.0 + 0.18 - 0.05)/4 = 0.5325; the corresponding sequence matching result is shown in Fig. 5.
As can be seen, the two compound concepts in Example 1 are not equivalent, and neither are those in Example 2, yet the traditional method wrongly concludes a similarity of 1.0 in both cases. The values obtained by Algorithm 2 are more reasonable. Thus, when the "rear-shifted center of gravity" and the polysemy of Chinese concepts are taken into account, a global alignment based on the Needleman-Wunsch algorithm effectively avoids the wrong mappings that the traditional method of formula (10) may produce. Moreover, when the atomic concepts in the word sequences of two compound concepts appear in essentially the same semantic order, Algorithm 2 should perform essentially the same as the traditional method. In summary, for large-scale Chinese ontology mapping tasks, the concept element-level similarity algorithm based on global sequence alignment is more advantageous and more reasonable than the traditional method.
(4) Preparation of experimental data
Compared with international ontologies and their mapping tasks, such as the multi-domain standard ontologies and mapping benchmarks published by the OAEI (Ontology Alignment Evaluation Initiative), open-source large-scale Chinese ontologies are still scarce. The present invention therefore uses open Chinese web encyclopedia knowledge bases as experimental data sources. In addition to the DBpedia (Chinese edition) knowledge base, the crawler toolkit HTMLParser is used to crawl and parse the open category pages of Baidu Baike and Hudong Baike. The invention parses not only the category hierarchy of these open encyclopedias but also extracts and parses the Infobox structured information contained in every entry page and organizes it as Chinese triples, finally forming three large-scale Chinese ontology bases to be mapped.
The open category hierarchy of an encyclopedia forms the concept system of the ontology. For knowledge in an Infobox, the name of the wiki page serves as the subject, the property name (Property) as the predicate, and the property value as the object. For example, if the fact "Qian Xuesen's nationality is the People's Republic of China" appears in the Infobox of the entry page, it is recorded there as "Qian Xuesen" and "nationality: People's Republic of China". The knowledge obtained from the Infobox can therefore be expressed as a triple, namely <Qian Xuesen, nationality, People's Republic of China>. For the construction of the Baidu Baike and Hudong Baike ontology bases, the system follows the method proposed in [Z. Wang, Z. Wang, J. Li et al. Knowledge extraction from Chinese wiki encyclopedias [J]. Journal of Zhejiang University-Science C, vol. 13, no. 4, pp. 268–280, 2012].
Table 2. Information on the Chinese web encyclopedia knowledge bases
The system built a Baidu Baike ontology framework containing more than 1300 concepts and a Hudong Baike ontology framework containing 29263 concepts. The top-level categories of the Baidu Baike hierarchy comprise 13 major classes, including person, science, history, sports, and education; Hudong Baike comprises 13 major top-level categories, including person, technology, and hot topics. DBpedia (Chinese edition) is regarded as the semantic version of the Wikipedia knowledge base; it comprises 23 top-level categories with more than 100,000 concepts in total and can be obtained from the download link provided by Wikipedia. Information on the three Chinese web encyclopedia knowledge bases is shown in Table 2.
The system uses the precision, recall, and F-measure of the mapping results as the final evaluation criteria, where:
Precision (P) = number of correct mapping pairs output / total number of mapping pairs output × 100%
Recall (R) = number of correct mapping pairs output / total number of mapping pairs in the reference result × 100%
F-measure (F1) = 2 × P × R / (P + R) × 100%
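The three measures can be computed from sets of mapping pairs as sketched below. The function name `evaluate` and the sample pairs are hypothetical, and the values are returned as fractions in [0, 1] rather than percentages.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def evaluate(output: Iterable[Pair], reference: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of output mappings against a reference."""
    out, ref = set(output), set(reference)
    correct = len(out & ref)  # correct mapping pairs in the output
    precision = correct / len(out) if out else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f1 = evaluate(
    [("a", "a1"), ("b", "b1"), ("c", "x")],
    [("a", "a1"), ("b", "b1"), ("c", "c1"), ("d", "d1")],
)
```

In this sample two of three output pairs are correct and two of four reference pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.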
For the large-scale Chinese ontology mapping task, the correct mapping pairs in the top-level categories person, science, society, geography, and art of the three open Chinese web encyclopedia ontology concept sets, namely Baidu Baike, Hudong Baike, and the Chinese edition of Wikipedia (DBpedia 3.8 Chinese edition), are selected as the reference mappings for evaluating the algorithm; see Table 3.
Table 3. Reference mapping statistics for the three Chinese encyclopedia mapping tasks
(5) Experimental results
A) Experiment 1: large-scale Chinese ontology compression
Using the proposed pseudo-nuclear-force-field data field potential function, which jointly computes the semantic similarity and dissimilarity between concepts, the mapping scale of the large-scale Chinese ontologies is first compressed and reduced. For the three ontology mapping tasks involved, the compression obtained by the ontologies to be mapped under different semantic environments is shown in Table 4, where:
Compression ratio (%) = (ontology size before compression - ontology size after compression) / ontology size before compression
Table 4. Compression of the large-scale ontology mapping scale
Table 4 shows that when the ratio of the original sizes of the two ontologies to be mapped is large, the smaller ontology obtains a lower compression ratio and the larger ontology a higher one. That is, the larger the original size ratio, the higher the compression ratio the larger ontology can obtain, and in this case the compression ratios of the two ontologies differ considerably. When the original size ratio is small or the sizes are close, the compression ratios of the two ontologies also converge. Thus the revised pseudo-nuclear-force-field potential function achieves a good clustering effect and, before deterministic mapping is carried out, effectively controls and reduces the time and space complexity of the large-scale ontology mapping task.
B) Experiment 2: evaluation of large-scale Chinese ontology mapping results
The evaluation results of the three mapping tasks are shown in Table 5, which lists the precision (P), recall (R), and F1-measure obtained with three typical similarity algorithms. The first is a cross-language, general-purpose similarity algorithm based on edit distance [Diogene Ontology Mapping Prototype, http://diogene.cis.strath.ac.uk/prototype.html], hereafter Method 1. The second is a typical Chinese word similarity algorithm based on the Chinese Thesaurus [Tian Jiule, Zhao Wei. Word similarity measurement based on the Chinese Thesaurus [J]. Journal of Jilin University, 2010, 28(6): 602-608], hereafter Method 2. The third is the Chinese compound concept similarity algorithm proposed by the present invention, hereafter the present method.
The system is compared directly with Method 1 and Method 2 below. To ensure fairness, the similarity threshold at which each algorithm judges two concepts to be equivalent is uniformly set to T = 0.9.
Table 5. Evaluation results of the three typical similarity algorithms
Table 5 shows that on the Baidu-Hudong mapping task the precision of the system is essentially level with that of the edit-distance similarity algorithm and clearly higher than that of Method 2. This is because ontology mapping is more concerned with identifying co-reference between concepts, whereas Method 2 focuses too much on the semantic relatedness between words and thereby introduces larger errors into the word similarity computation. On the Hudong-DBpedia mapping task, the precision is again essentially level with Method 1 and on average about 9% higher than that of Method 2.
In terms of recall, first, because the Chinese Thesaurus is introduced as a semantic knowledge base, recall is higher than that of Method 1. Second, the evaluation results of the three mapping tasks show that using the data field potential function as the compression factor of the ontology mapping scale can also be regarded as a structure-level mapping between concept sets. Depending on the structural features of the concept elements in different encyclopedia subcategories, it also gives the system stronger error-correcting ability; that is, it may avoid errors caused by a purely element-level mapping strategy. In addition, the compound concept similarity method based on sequence alignment from bioinformatics avoids the wrong mappings produced by traditional algorithms for out-of-vocabulary word similarity. Since the Chinese word similarity algorithm of Method 2 does not handle out-of-vocabulary words at all, the present method is more likely to improve the recall of the individual sub-mapping tasks, depending on the compound concepts that each subcategory contains.
Finally, in overall performance (F1), on the Baidu-Hudong mapping task the system exceeds Method 1 and Method 2 by about 11% and 20% on average. On the Hudong-DBpedia mapping task, the overall performance of the present invention is higher than that of the Chinese Thesaurus similarity algorithm of Method 2 and essentially level with Method 1. On the Baidu-DBpedia mapping task, the overall performance of the system is higher than that of Method 2 and Method 1 by about 21% and 8% respectively.

Claims (1)

1. A large-scale ontology mapping method for the Chinese language, characterized in that it consists of three main steps: calculation of the initial concept association degree by fusing edit distance and the Chinese thesaurus (Tongyici Cilin), ontology compression, and deterministic mapping;
(1) Calculation of the initial concept association degree by fusing edit distance and the Chinese thesaurus
A) Edit distance similarity
Given two ontologies to be mapped, O_source and O_target, for a concept C_source in the source ontology O_source, a corresponding concept C_target with identical or similar semantics must be found in the target ontology O_target; for two concepts C_source and C_target, their edit distance and their similarity are given by formula (1) and formula (2):
$$\mathrm{EditDistance}(C_{source}, C_{target}) = \frac{|Do(C_{source}, C_{target})|}{\max(L(C_{source}), L(C_{target}))} \tag{1}$$
where |Do(C_source, C_target)| is the number of edit operations between the concepts C_source and C_target to be mapped, that is, the minimum number of operations needed to transform the string C_source into the string C_target; three operations are allowed: inserting, deleting, or modifying one character; L(C_source) and L(C_target) are the character lengths of the concepts to be mapped;
$$SIM_E(C_{source}, C_{target}) = \frac{1}{1 + \mathrm{EditDistance}(C_{source}, C_{target})} \tag{2}$$
where SIM_E(C_source, C_target) is the similarity of the concepts C_source and C_target to be mapped;
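As a minimal, non-limiting sketch, formulas (1) and (2) can be computed as follows. The function names are hypothetical, and the count of edit operations is obtained here with the standard dynamic-programming (Levenshtein) procedure, which is an assumption about how |Do| is evaluated.

```python
def edit_operations(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def sim_edit(c_source: str, c_target: str) -> float:
    """Edit distance similarity following formulas (1) and (2)."""
    longest = max(len(c_source), len(c_target))
    if longest == 0:
        return 1.0  # two empty labels are treated as identical (assumption)
    distance = edit_operations(c_source, c_target) / longest  # formula (1)
    return 1.0 / (1.0 + distance)                              # formula (2)
```

Because the normalized distance in formula (1) lies in [0, 1], the similarity in formula (2) lies in [0.5, 1]; for example, "北京大学" and "南京大学" differ by one character out of four, giving 1 / (1 + 0.25) = 0.8.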
B) Chinese thesaurus similarity
The similarity formula based on the Chinese thesaurus is:
$$SIM_T(C_{source}, C_{target}) = \alpha \times \frac{F_i}{|F|} \times \cos\left(n_{subTree} \times \frac{\pi}{180}\right) \times \left(\frac{n_{subTree} - D + 1}{n_{subTree}}\right) \tag{3}$$
where F_i is the level number at which the sub-codes of the terms C_source and C_target differ, that is, the codes differ at the i-th level; |F| is the number of elements in the set F and is always equal to 5 in the present invention; the concept similarity weight coefficient is α × (F_i / |F|); n_subTree is the total number of nodes under the corresponding branch at level F_i, where the sub-codes of C_source and C_target differ; D is the coding distance between the terms C_source and C_target; any value α ∈ [0.4, 0.5] satisfies the requirement;
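Once F_i, n_subTree, and D have been read from the thesaurus codes of the two terms, formula (3) is a direct arithmetic evaluation. The sketch below is illustrative only: the function and parameter names are hypothetical, and the extraction of the three quantities from the thesaurus is not shown because it depends on the thesaurus data.

```python
import math


def sim_thesaurus(f_i: int, n_subtree: int, d: int,
                  alpha: float = 0.45, levels: int = 5) -> float:
    """Evaluate formula (3) from already-extracted thesaurus quantities.

    f_i       -- level at which the codes of the two terms differ
    n_subtree -- number of nodes under the branch at that level
    d         -- coding distance between the two terms
    alpha     -- any value in [0.4, 0.5]; 0.45 is an arbitrary choice
    levels    -- |F|, fixed at 5 in the text
    """
    if n_subtree <= 0:
        raise ValueError("n_subtree must be positive")
    weight = alpha * (f_i / levels)                   # alpha * (F_i / |F|)
    angle = math.cos(n_subtree * math.pi / 180)       # branch size read as degrees
    spread = (n_subtree - d + 1) / n_subtree          # closeness within the branch
    return weight * angle * spread
```

With this form, two terms that diverge only at the deepest level of a small branch receive a value close to α, while terms that diverge near the root or sit far apart in a large branch receive a much smaller value.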
C) Multi-strategy fused association degree algorithm
First, the similarity results of the two basic algorithms are compared and the maximum of the two is taken; at the same time, both the similarity and the dissimilarity between the two concepts C_source and C_target are considered and superposed into the final association degree of each concept C_source, C_target; the present invention defines the maximum obtained by the two similarity algorithms as ρ and, correspondingly, the dissimilarity index as 1-ρ; clearly ρ ∈ (0, 1], which gives formula (4):
$$\rho_{st} = \max\left(SIM_E(C_s^{source}, C_t^{target}),\ SIM_T(C_s^{source}, C_t^{target})\right) \tag{4}$$
Here the semantic correlation coefficient between the concepts C_s^source and C_t^target is denoted λ_st,
Finally, the initial association degree of the source ontology concept C_s^source with respect to the target ontology O_target is obtained and expressed by formula (6);
$$m_s^{source} = \lambda_{s1} + \lambda_{s2} + \lambda_{s3} + \dots + \lambda_{s n_{target}} = \sum_{t=1}^{n_{target}} \lambda_{st} \tag{6}$$
Because the association degree calculation is symmetric, the initial association degree m_target of a concept C_target in the target ontology is obtained in the same way; when the final initial association degree of a concept is zero, the initial association degree factor is set to any value m_source, m_target ∈ [0.01, 0.05]; this yields the initial association degree sets Map_O_source and Map_O_target of all concepts in the ontologies O to be mapped; the initial association degree sets are uniformly expressed as key-value pairs: Map_O<C, m>;
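The accumulation in formulas (4) and (6) can be sketched as follows. The expression that turns ρ_st into the correlation coefficient λ_st is not reproduced in this text, so the sketch takes it as a caller-supplied function `lam`; the identity default is only a placeholder assumption, as are the function and parameter names.

```python
import random


def initial_association(c_source, targets, sim_e, sim_t,
                        lam=lambda rho: rho, rng=None):
    """Initial association degree of one source concept, per formulas (4) and (6).

    sim_e, sim_t -- callables returning the two basic similarities
    lam          -- hypothetical stand-in mapping rho_st to lambda_st
    rng          -- random.Random instance used for the zero-value fallback
    """
    rng = rng or random.Random(0)
    total = 0.0
    for c_target in targets:
        rho = max(sim_e(c_source, c_target), sim_t(c_source, c_target))  # formula (4)
        total += lam(rho)                                                # formula (6)
    if total == 0.0:
        # A zero association degree is replaced by a value in [0.01, 0.05].
        total = rng.uniform(0.01, 0.05)
    return total


def association_map(sources, targets, sim_e, sim_t, **kwargs):
    """Build the key-value set Map_O<C, m> for every concept in `sources`."""
    return {c: initial_association(c, targets, sim_e, sim_t, **kwargs)
            for c in sources}
```

Swapping the roles of `sources` and `targets` gives the corresponding set for the target ontology, reflecting the symmetry of the calculation.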
(2) Ontology compression algorithm
When facing large-scale ontology mapping tasks, traditional algorithms struggle with both time and space complexity, so a corresponding strategy is needed to compress the ontologies to be mapped;
For the concept set of the source ontology O_source and the concept set of the target ontology O_target, the initial association degree of each concept, given by formula (6), is used to characterize the degree of influence of that concept on other concepts; the corrected field strength function is shown in formula (8):
Taking δ = 1 and R = 2, the potential value function of each concept in the ontology O_source to be mapped is obtained, as shown in formula (9):
The potential values of the concepts in the target ontology are obtained in the same way; finally, the potential value sets potentialMap_O_source and potentialMap_O_target of all concepts in the ontologies O to be mapped are obtained; the potential value sets are uniformly defined as key-value pairs: potentialMap_O<C,
The concept set in O is divided into two parts, called the candidate region and the elimination region;
Specifically, for the output key-value pair sets Map_O_source and Map_O_target obtained after executing the multi-strategy fused association degree algorithm, the number of concepts whose association degree is greater than 0.05 is counted in Map_O_source and Map_O_target respectively; these counts are called Range_Candidate_O_source and Range_Candidate_O_target and are defined as the upper bounds of the candidate region intervals of the ontologies O_source and O_target to be mapped;
The concept elements in the potential value sets potentialMap_O_source and potentialMap_O_target are sorted in descending order of potential value, and the rank of each concept is denoted by the variable Rank; if Rank_s^source ∈ [1, Range_Candidate_O_source], the concept C_s^source is retained as a candidate concept; correspondingly, if Rank_s^source ∈ [Range_Candidate_O_source + 1, n_source], the concept C_s^source is eliminated; by the symmetry between the source ontology and the target ontology, the candidate concept extraction rule for the target ontology O_target is obtained in the same way;
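The division into candidate region and elimination region can be sketched as follows, assuming the association degrees and potential values are already available as dictionaries keyed by concept. The function name and the example values are hypothetical.

```python
def split_candidates(assoc, potential, threshold=0.05):
    """Split concepts into (candidate region, elimination region).

    assoc     -- Map_O<C, m>: initial association degree per concept
    potential -- potentialMap_O: potential value per concept
    threshold -- association degree above which a concept is counted
    """
    # Upper bound of the candidate region: number of concepts whose
    # association degree exceeds the threshold.
    range_candidate = sum(1 for m in assoc.values() if m > threshold)
    # Rank concepts by potential value, highest first.
    ranked = sorted(potential, key=potential.get, reverse=True)
    return ranked[:range_candidate], ranked[range_candidate:]


assoc = {"a": 0.9, "b": 0.03, "c": 0.4, "d": 0.05}
potential = {"a": 1.2, "b": 2.0, "c": 0.1, "d": 0.7}
candidates, eliminated = split_candidates(assoc, potential)
print(candidates, eliminated)  # ['b', 'a'] ['d', 'c']
```

Note that the association degree only fixes how many concepts are kept; which concepts are kept is decided by the potential value ranking, so a concept with a low association degree but a high potential value (such as "b" above) can still be retained.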
(3) Deterministic mapping
For any two concepts C_source and C_target in the source ontology O_source and the target ontology O_target to be mapped, the following three cases arise when computing the semantic similarity of the concepts:
1. C_source and C_target are both atomic concepts, that is: C_source ∈ SKB_TYCCL and C_target ∈ SKB_TYCCL
2. One of C_source and C_target is an atomic concept and the other is a combined concept, that is: C_source ∉ SKB_TYCCL or C_target ∉ SKB_TYCCL
3. C_source and C_target are both combined concepts, that is: C_source ∉ SKB_TYCCL and C_target ∉ SKB_TYCCL
For case 1, formula (3) is used to calculate the semantic similarity of the two concepts; for case 2 and case 3, the present invention first represents the two word-string sequences to be compared in the form of a scoring matrix, with the two sequences serving as the two dimensions of the dynamic programming matrix; for the concepts C_source and C_target in the ontologies O_source and O_target to be mapped, the i-th row of the scoring matrix M corresponds to the atomic concept AC_i^source in the word-string sequence CC_source, and the j-th column corresponds to the atomic concept AC_j^target in the word-string sequence CC_target, where i ≤ m and j ≤ n; the element in row i and column j of the dynamic programming matrix M is called M_ij;
First, the penalty factor of the sequence alignment algorithm is set to p = -0.05, and row m+1 and column n+1 of the matrix are initialized;
Second, based on the Chinese thesaurus similarity function SIM_T, the remaining m × n elements of the matrix are solved recursively;
The scoring function f is first defined, as shown in formula (11):
The recursion rule is shown in formula (12):
$$M_{ij} = \max \begin{cases} M_{(i+1)(j+1)} + f(AC_i^{source}, AC_j^{target}) \\ M_{(i)(j+1)} + p \\ M_{(i+1)(j)} + p \end{cases} \tag{12}$$
Starting from element M_mn of the matrix and tracing back until element M_11, the optimal alignment path is obtained; if there is more than one optimal alignment path, any one of them is chosen;
Finally, the gap symbol "-" is inserted to obtain the correct global sequence alignment result;
The two combined concept term sequences to be mapped, after insertion of the gap symbol "-", are called CC_source' and CC_target'; the two sequences now contain the same number of elements, denoted L_cc'; from the alignment result and the scoring function f, the similarity formula (13) between combined concepts is obtained:
$$SIM_{NW}(CC_{source}', CC_{target}') = \frac{\sum_{i=1}^{L_{cc}'} f(AC_i^{source\prime}, AC_i^{target\prime})}{L_{cc}'} \tag{13}$$
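The global alignment of step (3) can be sketched as below. Several points are assumptions and not taken from the claim: the border row and column are initialized with cumulative gap penalties; the path is read off from the first cell towards the last, which matches the direction in which recurrence (12) fills the matrix; the scoring function f is passed in by the caller because formula (11) is not reproduced in this text; and positions aligned with a gap contribute `gap_score` (0 by default) to formula (13). All function names are hypothetical.

```python
GAP = "-"


def align(source, target, f, p=-0.05):
    """Globally align two lists of atomic concepts using recurrence (12)."""
    m, n = len(source), len(target)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    move = [[None] * (n + 1) for _ in range(m + 1)]
    # Border initialization (assumed): cumulative gap penalties.
    for i in range(m - 1, -1, -1):
        M[i][n], move[i][n] = M[i + 1][n] + p, "down"
    for j in range(n - 1, -1, -1):
        M[m][j], move[m][j] = M[m][j + 1] + p, "right"
    # Fill the remaining m x n cells from the bottom-right corner.
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            options = [
                (M[i + 1][j + 1] + f(source[i], target[j]), "diag"),
                (M[i][j + 1] + p, "right"),  # gap in source
                (M[i + 1][j] + p, "down"),   # gap in target
            ]
            M[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Follow the recorded moves; ties were resolved by taking the first option.
    aligned_s, aligned_t, i, j = [], [], 0, 0
    while i < m or j < n:
        step = move[i][j]
        if step == "diag":
            aligned_s.append(source[i]); aligned_t.append(target[j]); i += 1; j += 1
        elif step == "right":
            aligned_s.append(GAP); aligned_t.append(target[j]); j += 1
        else:
            aligned_s.append(source[i]); aligned_t.append(GAP); i += 1
    return aligned_s, aligned_t


def sim_nw(source, target, f, p=-0.05, gap_score=0.0):
    """Similarity of two combined concepts following formula (13)."""
    a, b = align(source, target, f, p)
    if not a:
        return 0.0  # nothing to compare
    total = sum(gap_score if GAP in (x, y) else f(x, y) for x, y in zip(a, b))
    return total / len(a)
```

As a usage example, with an exact-match stand-in `f = lambda x, y: 1.0 if x == y else 0.0`, aligning ["北京", "大学", "图书馆"] with ["北京", "图书馆"] inserts one gap opposite "大学", and `sim_nw` returns 2/3. In practice f would be based on the thesaurus similarity SIM_T for atomic concepts.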
CN201510082840.1A 2015-02-15 2015-02-15 Large-scale ontology mapping method for Chinese languages Expired - Fee Related CN104699767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510082840.1A CN104699767B (en) 2015-02-15 2015-02-15 Large-scale ontology mapping method for Chinese languages

Publications (2)

Publication Number Publication Date
CN104699767A true CN104699767A (en) 2015-06-10
CN104699767B CN104699767B (en) 2018-02-02

Family

ID=53346888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510082840.1A Expired - Fee Related CN104699767B (en) 2015-02-15 2015-02-15 Large-scale ontology mapping method for Chinese languages

Country Status (1)

Country Link
CN (1) CN104699767B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7752032B2 (en) * 2005-04-26 2010-07-06 Kabushiki Kaisha Toshiba Apparatus and method for translating Japanese into Chinese using a thesaurus and similarity measurements, and computer program therefor
CN102955774A (en) * 2012-05-30 2013-03-06 华东师范大学 Control method and device for calculating Chinese word semantic similarity
CN103440314A (en) * 2013-08-27 2013-12-11 北京工业大学 Semantic retrieval method based on Ontology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Ting et al.: "A large-scale Chinese ontology mapping scheme based on the Chinese thesaurus (Tongyici Cilin)", Computer Science *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107978341A (en) * 2017-12-22 2018-05-01 南京昂特医信数据技术有限公司 Isomeric data adaptation method and its system under a kind of medicine semantic frame based on linguistic context
CN109635119A (en) * 2018-10-25 2019-04-16 同济大学 A kind of industrial big data integrated system based on ontology fusion
CN109635119B (en) * 2018-10-25 2023-08-04 同济大学 Industrial big data integration system based on ontology fusion
CN109582961A (en) * 2018-11-28 2019-04-05 重庆邮电大学 A kind of efficient robot data similarity calculation algorithm
CN109783650A (en) * 2019-01-10 2019-05-21 首都经济贸易大学 Chinese network encyclopedia knowledge denoising method, system and knowledge base
CN109783650B (en) * 2019-01-10 2020-12-11 首都经济贸易大学 Chinese network encyclopedia knowledge denoising method, system and knowledge base
CN111353523A (en) * 2019-12-24 2020-06-30 中国国家铁路集团有限公司 Method for classifying railway customers
CN114519101A (en) * 2020-11-18 2022-05-20 易保网络技术(上海)有限公司 Data clustering method and system, data storage method and system and storage medium
CN114519101B (en) * 2020-11-18 2023-06-06 易保网络技术(上海)有限公司 Data clustering method and system, data storage method and system and storage medium
CN112328915A (en) * 2020-11-25 2021-02-05 山东师范大学 Multi-source interest point fusion method and system based on spatial entity matching performance evaluation

Also Published As

Publication number Publication date
CN104699767B (en) 2018-02-02

Similar Documents

Publication Publication Date Title
CN104699767A (en) Large-scale ontology mapping method for Chinese languages
US8463810B1 (en) Scoring concepts for contextual personalized information retrieval
Pezzoni et al. How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation
Rinaldi et al. A matching framework for multimedia data integration using semantics and ontologies
Moya et al. Integrating web feed opinions into a corporate data warehouse
Liu et al. Relation classification via BERT with piecewise convolution and focal loss
Sharaff et al. Analysing fuzzy based approach for extractive text summarization
Sun et al. Graph force learning
Khodizadeh-Nahari et al. A novel similarity measure for spatial entity resolution based on data granularity model: Managing inconsistencies in place descriptions
Zoupanos et al. Efficient comparison of sentence embeddings
Adek et al. Online Newspaper Clustering in Aceh using the Agglomerative Hierarchical Clustering Method
Song et al. Semi-automatic construction of a named entity dictionary for entity-based sentiment analysis in social media
Calegari et al. Object‐fuzzy concept network: An enrichment of ontologies in semantic information retrieval
CN109783650B (en) Chinese network encyclopedia knowledge denoising method, system and knowledge base
Kovács et al. Conceptualization with incremental bron-kerbosch algorithm in big data architecture
KR102454261B1 (en) Collaborative partner recommendation system and method based on user information
Wang et al. An ontology automation construction scheme for Chinese e‐government thesaurus optimizing
Qazvinian et al. Evolutionary coincidence‐based ontology mapping extraction
Iftikhar et al. Deep Learning-Based Correct Answer Prediction for Developer Forums
Pérez-Guadarramas et al. Analysis of OWA operators for automatic keyphrase extraction in a semantic context
Al-Mutairi et al. Predicting the Popularity of Trending Arabic Wikipedia Articles Based on External Stimulants Using Data/Text Mining Techniques
Yang Intelligent construction of English-Chinese bilingual context model based on CBR
Wang et al. Co-regularized PLSA for multi-modal learning
CN116702784B (en) Entity linking method, entity linking device, computer equipment and storage medium
Sharma et al. Fine-tuned Predictive Model for Verifying POI Data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180202

Termination date: 20190215