CN104699767A - Large-scale ontology mapping method for Chinese languages

Info

Publication number: CN104699767A
Application number: CN201510082840.1A
Authority: CN
Inventors: 王汀; 刘经纬; 蔡万江
Original assignee: CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Current assignee: CAPITAL UNIVERSITY OF ECONOMICS AND BUSINESS
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2015-06-10
Anticipated expiration: 2035-02-15
Also published as: CN104699767B

Abstract

The invention provides a mapping method for large-scale Chinese ontologies. The method comprises the following steps: computing an initial association degree between concepts by fusing the Chinese thesaurus (Tongyici Cilin) with an edit distance similarity algorithm; compressing the scale of the large-scale ontology mapping task with a pseudo-nuclear-force field potential function that is improved with the initial association degree and integrates both concept similarity and dissimilarity; and measuring the similarity of complex concepts in Chinese ontologies by introducing a global sequence alignment algorithm. Chinese words exhibit polysemy and sensitivity to word order, and the computational cost of large-scale ontology mapping is high. To address this, the method first improves the existing pseudo-nuclear-force field potential function, so that both the measurement of similarity among concepts and the compression of the ontologies to be mapped are more reasonable. Second, a global sequence alignment technique is adopted to map complex Chinese concepts, which overcomes further defects of traditional Chinese ontology mapping systems. As a result, the mapping efficiency of the system is improved, and precision and recall are increased.

Description

A large-scale ontology mapping method for the Chinese language

Technical field

The present invention relates to the field of Chinese ontology mapping.

Background art

The vision of the Semantic Web is to build a "Web of Data" so that machines can understand the semantic information on the network. An ontology, as the core element of the Semantic Web, is a formal, standardized specification of the shared concepts of a specific domain, and is the basis for knowledge sharing and semantic interoperability on the network. At present, heterogeneity between different ontologies makes it difficult to reuse and share them.

The task of ontology mapping (Ontology Alignment) is to find the semantic associations between concepts in heterogeneous ontologies. For cultural and historical reasons, however, there is still no mature ontology mapping system for ontologies described in Chinese. With the development of the Semantic Web, more and more large-scale ontologies and knowledge bases described in Chinese are being built and shared, while the construction of Chinese ontology mapping systems is still at an early stage. The present invention therefore mainly addresses the construction of a large-scale ontology mapping system for Chinese descriptions.

Researchers at home and abroad have proposed a variety of mapping methods and representative systems. The document [Cohen W, Ravikumar P, Fienberg S. A comparison of string distance metrics for name-matching tasks [C]. Proceedings of the IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico, 2003: 73-78] lists several typical element-level similarity algorithms based on edit distance and on tokens, and evaluates their performance. Melnik S et al. [Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: A versatile graph matching algorithm and its application to schema matching [C]. Proceedings of the 18th International Conference of Data Engineering (ICDE). San Jose, California, 2002: 117-128] proposed a structure-level ontology mapping algorithm, Similarity Flooding, which builds a similarity propagation graph from the concept hierarchy of the ontologies and uses it to propagate and revise the similarity between concepts. Zhong Qian et al. [Zhong Q, Li H, Li J, Xie G, Tang J, Zhou L, Pan Y. A gauss function based approach for unbalanced ontology matching [C]. Proceedings of the 28th International Conference on Management of Data (SIGMOD). Rhode Island, USA, 2009: 669-680] developed the RiMOM system, which uses a multi-strategy mapping approach based on features such as ontology instances, concept names and ontology structure, and introduces ideas from general field theory to make it applicable to large-scale ontology mapping tasks. However, it lacks optimization for the specific features of the Chinese language. Giunchiglia F et al. [Giunchiglia F, Yatskevich M. Element level semantic matching [D]. Italy: Dept. of Information and Communication Technology, University of Trento, 2004] proposed a linguistics-based method that introduces a shared knowledge dictionary (e.g. WordNet) and uses linguistic relations to discover semantic relations. The document [Isaac A, Meij L, Schlobach S, Wang S. An empirical study of instance-based ontology matching [C]. Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference (ISWC/ASWC). Busan, Korea, 2007: 253-266] proposes an instance-level ontology mapping algorithm, which measures the similarity between concepts according to the number of instances they share.

In recent years, research on building large-scale Chinese ontology libraries and ontology mapping systems has gradually begun. Li Jia et al. proposed a method for calculating element-level concept similarity based on HowNet and implemented a Chinese ontology mapping system [Li Jia, Zhu Ming, Liu Chen, et al. Chinese Ontology Mapping research and implementation [J]. Journal of Chinese Information Processing, 2007, 21(4): 27-33]; the applicability of this system to large-scale ontology mapping tasks remains to be tested. Tian Jiule et al. proposed a Chinese word semantic similarity algorithm based on the Chinese thesaurus [Tian Jiule, Zhao Wei. Measurement of word similarity based on the Chinese thesaurus [J]. Journal of Jilin University, 2010, 28(6): 602-608], but their results have not been applied in the Semantic Web. Wang Zhichun et al. [Z. Wang, Z. Wang, J. Li et al. Knowledge extraction from Chinese wiki encyclopedias [J]. Journal of Zhejiang University-Science C, vol. 13, no. 4, pp. 268-280, 2012] proposed extracting hierarchical relations between concepts from the category systems of Chinese encyclopedias and obtaining concept attributes and entry instances from entry pages that contain an Infobox; they built two large-scale Chinese ontology libraries based on Baidu Baike and Hudong Baike, and established co-reference relations between their instances and DBpedia with a simple keyword matching strategy. Niu Xing et al. [Niu X, Sun X, Wang H, et al. Zhishi.me - weaving Chinese linking open data [C]. ISWC 2011. Springer Berlin Heidelberg, 2011: 205-220] semantically integrated Baidu Baike, Hudong Baike and Chinese Wikipedia, and developed a Semantic Web data query application based on Chinese descriptions. Chen Yidong et al. [Chen Yidong, Chen Liwei, Xu Kun. Learning Chinese entity attributes from online encyclopedia [C]. APWeb 2012: 179-186] proposed using the attribute-value pairs in Chinese encyclopedia Infoboxes to automatically extract well-structured training samples, then extracting a massive number of knowledge triples from the unstructured encyclopedia text with a statistical learning model, and finally constructed an open-domain Chinese knowledge base.

The deficiencies of existing systems and the main contributions of the present invention are:

1) A new overall framework for a large-scale ontology mapping model for Chinese is proposed.

Research on discovering equivalence relations between ontology concepts across semantic data sets in a Chinese environment is still scarce. In the Semantic Web environment, as ontologies grow ever larger, ensuring the efficiency of ontology mapping becomes an urgent problem. This work therefore proposes a framework-level ontology mapping model for Chinese. First, a multi-strategy fusion method combining edit distance and the Chinese thesaurus is used to compute the initial similarity of concepts between the ontologies to be mapped. Second, based on data field theory and taking the initial concept similarity as input, the scale of the ontologies to be mapped is compressed. Finally, according to the semantic features of Chinese concepts and encyclopedic knowledge bases, and by introducing the sequence alignment idea from bioinformatics, a new deterministic mapping strategy for equivalence relations between Chinese ontology concepts is proposed.

2) A new method for compressing and reducing the scale of large-scale ontology mapping is proposed.

Traditional ontology mapping systems and methods often focus only on mapping results and ignore mapping efficiency, so they are of limited practical use for large-scale ontology mapping tasks. Before carrying out the deterministic mapping of equivalence relations on large-scale Chinese ontologies, and in order to keep the time complexity within an acceptable range, this work proposes a new data field potential function and on that basis first reduces and compresses the mapping scale of the large-scale ontologies. Specifically, by improving the original pseudo-nuclear-force field potential function and using the Chinese thesaurus (extended edition), a new method is proposed that measures the potential value of a data object by jointly computing the semantic similarity and dissimilarity values between concepts, and on this basis a new algorithm is designed for reducing the mapping scale of large-scale ontologies.

3) A new method for calculating concept semantic similarity, based on the idea of global pairwise alignment from bioinformatics, is proposed.

The work in the document [Zhong Q, Li H, Li J, Xie G, Tang J, Zhou L, Pan Y. A gauss function based approach for unbalanced ontology matching [C]. Proceedings of the 28th International Conference on Management of Data (SIGMOD). Rhode Island, USA, 2009: 669-680] is currently applicable only to ontologies and mapping tasks described in English. It lacks support for multilingual ontologies and, in particular, is not optimized for the features of Chinese ontologies. Meanwhile, the concept similarity methods in traditional Chinese ontology mapping systems do not consider how differences in the order of atomic concepts within a combined concept, or polysemy, affect the quality of the mapping relation established between two combined concepts. They ignore two key features of Chinese concepts, "word-order sensitivity" and "polysemy", which inevitably leads to errors in the mapping results. To solve these problems, the invention abstracts the discovery of equivalence relations between Chinese concepts as a global sequence alignment problem and, based on the idea of dynamic programming, introduces the Needleman-Wunsch global alignment algorithm from bioinformatics to calculate the semantic similarity between combined concepts. Experiments show that the global-alignment concept similarity algorithm based on Needleman-Wunsch effectively avoids the mapping errors that traditional methods may produce. For large-scale Chinese ontology mapping tasks, the proposed method is more advantageous and more reasonable than traditional methods.

In summary, large-scale Chinese ontologies distributed on the web are still few and highly heterogeneous, and existing Chinese ontology mapping systems are inefficient and of limited usability when facing large-scale ontology mapping tasks. There is also still a lack of systems that are designed for Chinese-language descriptions and suited to large-scale ontology mapping tasks in the Semantic Web environment. The present invention therefore designs and implements a large-scale ontology mapping system for Chinese based on the Chinese thesaurus.

Summary of the invention

In a Chinese ontology mapping system, simple word entries and unregistered words (words not recorded in the thesaurus) both correspond to concepts in the ontologies to be mapped. The present invention therefore calls the concept corresponding to a simple word entry an atomic concept (Atom Concept, AC) and the concept corresponding to an unregistered word a combined concept (Component Concept, CC), and stipulates that every combined concept is formed by the linear arrangement of several atomic concepts. A set of definitions and formal descriptions of the problem is given first:

Definition 1 (ontology mapping): given two ontologies to be mapped, O^{source} and O^{target}, for a concept C^{source} in the source ontology O^{source}, a corresponding concept C^{target} with the same or similar semantics must be found in the target ontology O^{target}. The mapping function map: O^{source} → O^{target} is therefore defined as:

For any C^{source} ∈ O^{source}, if sim(C^{source}, C^{target}) > T, then map(C^{source}) = C^{target}

where sim(C^{source}, C^{target}) is the similarity of the concepts to be mapped, C^{source} and C^{target}, and T is a threshold: when the semantic similarity of C^{source} and C^{target} is greater than T, <C^{source}, C^{target}> is taken as a discovered concept mapping pair. In this system, any threshold T ∈ [0.8, 0.9] meets the requirements. The total number of concepts in the source ontology O^{source} is n^{source}, and the total number of concepts in the target ontology O^{target} is n^{target}.
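Definition 1 can be illustrated with a minimal Python sketch. The names `build_mapping` and `toy_sim` are hypothetical and do not come from the patent, and keeping only the best-scoring target concept per source concept is an assumption added here; the definition itself only requires sim > T. The toy similarity is a placeholder, not one of the patent's similarity measures.

```python
from typing import Callable, Dict, Iterable, List


def build_mapping(source: Iterable[str],
                  target: List[str],
                  sim: Callable[[str, str], float],
                  threshold: float = 0.85) -> Dict[str, str]:
    """Return {C_source: C_target} for pairs whose similarity exceeds T."""
    mapping: Dict[str, str] = {}
    if not target:
        return mapping
    for c_s in source:
        # Assumption: keep only the best-scoring target concept.
        best = max(target, key=lambda c_t: sim(c_s, c_t))
        if sim(c_s, best) > threshold:
            mapping[c_s] = best
    return mapping


def toy_sim(a: str, b: str) -> float:
    """Hypothetical placeholder: Jaccard overlap of the characters used."""
    chars_a, chars_b = set(a), set(b)
    if not chars_a and not chars_b:
        return 1.0
    return len(chars_a & chars_b) / len(chars_a | chars_b)


if __name__ == "__main__":
    print(build_mapping(["计算机", "苹果"], ["计算机", "电脑"], toy_sim))
```

The default threshold of 0.85 is simply one value inside the interval [0.8, 0.9] stated above.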

Definition 2: for the semantic knowledge base (Semantic Knowledge Base, SKB) of the Chinese thesaurus, the set SKB_{TYCCL} consists of atomic concepts, i.e. SKB_{TYCCL} = {AC_1, AC_2, ..., AC_N}, where each element AC_i is an atomic concept in the set SKB_{TYCCL} and N is the number of word entries included in the knowledge base.

Definition 3: a combined concept CC is composed of an ordered arrangement of atomic concepts. That is, for any CC there is an ordered sequence CC = [AC_1, AC_2, ..., AC_i, ...], where i ≥ 1 is the position of atomic concept AC_i in the ordered sequence CC. In particular, every atomic concept AC_i can be written as AC_i = [AC_i].

Definition 4: for concepts C^{source} and C^{target} in the ontologies to be mapped, O^{source} and O^{target}, we have

C^{source} = CC^{source} = [AC_{1}^{source}, AC_{2}^{source}, \ldots, AC_{i}^{source}, \ldots, AC_{m}^{source}],

C^{target} = CC^{target} = [AC_{1}^{target}, AC_{2}^{target}, \ldots, AC_{j}^{target}, \ldots, AC_{n}^{target}]

where m and n are the lengths of the ordered sequences CC^{source} and CC^{target} corresponding to the combined concepts C^{source} and C^{target}, with m, n ≥ 1.

A large-scale ontology mapping method for the Chinese language, characterized in that it consists of three main steps: calculation of the initial association degree of concepts by fusing edit distance with the Chinese thesaurus, ontology compression, and deterministic mapping;

(1) Calculation of the initial association degree of concepts by fusing edit distance with the Chinese thesaurus

A) Edit distance similarity

Given two ontologies to be mapped, O^{source} and O^{target}, for a concept C^{source} in the source ontology O^{source}, a corresponding concept C^{target} with the same or similar semantics must be found in the target ontology O^{target}. For two concepts C^{source} and C^{target}, their edit distance and their similarity are given by formulas (1) and (2):

EditDistance(C^{source}, C^{target}) = \frac{|Do(C^{source}, C^{target})|}{\max(L(C^{source}), L(C^{target}))} \qquad (1)

where |Do(C^{source}, C^{target})| is the number of edit operations between the concepts to be mapped, i.e. the minimum number of operations needed to transform the string C^{source} into the string C^{target}; there are three kinds of operation: inserting, deleting or modifying one character. L(C^{source}) and L(C^{target}) are the character lengths of the concepts to be mapped;

SIM_{E}(C^{source}, C^{target}) = \frac{1}{1 + EditDistance(C^{source}, C^{target})} \qquad (2)

where SIM_{E}(C^{source}, C^{target}) is the similarity of the concepts to be mapped, C^{source} and C^{target};

B) Chinese thesaurus similarity

The similarity formula based on the Chinese thesaurus is:

SIM_{T}(C^{source}, C^{target}) = α \times \frac{F_{i}}{|F|} \times \cos\left(n^{subTree} \times \frac{π}{180}\right) \times \left(\frac{n^{subTree} - D + 1}{n^{subTree}}\right) \qquad (3)

For F_i ∈ F, F_i denotes the level number when the sub-codes of word entries C^{source} and C^{target} differ at the i-th level; |F| is the number of elements in the set F and is always 5 in this system. The concept similarity weight coefficient is α × (F_i / |F|); n^{subTree} is the total number of nodes under the branch at level F_i where the sub-codes of C^{source} and C^{target} differ; D is the coding distance between C^{source} and C^{target}. Any value α ∈ [0.4, 0.5] meets the requirements;

C) Multi-strategy fusion association degree algorithm

First, the similarity results of the two basic algorithms are compared and the maximum of the two is taken. At the same time, both the similarity and the dissimilarity between two concepts C^{source} and C^{target} are considered and accumulated into the final association degree of each concept C^{source}, C^{target}. The present invention denotes the maximum obtained from the two similarity algorithms by ρ; correspondingly, the dissimilarity index is 1-ρ. Clearly ρ ∈ (0, 1], which gives formula (4):

ρ_{st} = \max(SIM_{E}(C_{s}^{source}, C_{t}^{target}), SIM_{T}(C_{s}^{source}, C_{t}^{target})) \qquad (4)

Here the semantic correlation coefficient between concepts C_{s}^{source} and C_{t}^{target} is denoted λ_{st}, given by formula (5);

Finally, the initial association degree m_{s}^{source} of source ontology concept C_{s}^{source} with the target ontology O^{target} is expressed by formula (6);

m_{s}^{source} = λ_{s1} + λ_{s2} + λ_{s3} + \ldots + λ_{s n^{target}} = \sum_{t=1}^{n^{target}} λ_{st} \qquad (6)

Because the association degree calculation is symmetric, the initial association degree m^{target} of a concept C^{target} in the target ontology is obtained in the same way. When the final initial association degree value of a concept is zero, any value m^{source}, m^{target} ∈ [0.01, 0.05] may be taken as the initial association degree. This yields the initial association degree sets Map_O^{source} and Map_O^{target} of all concepts in the ontologies O to be mapped. The initial association degree sets are uniformly expressed as key-value pairs: Map_O<C, m>;

(2) Ontology compression algorithm

For large-scale ontology mapping tasks, traditional algorithms struggle with time or space complexity, so a strategy is needed to compress the original ontologies to be mapped;

For the concept set of the source ontology O^{source} and the concept set of the target ontology O^{target}, the initial association degree value of each concept, given by formula (6), is used to characterize the degree of influence of that concept on other concepts. The corrected field strength function is shown in formula (8):

Taking δ = 1 and R = 2, the potential value function of each concept in the ontology O^{source} to be mapped is obtained, as shown in formula (9):

The potential values of the concepts in the target ontology are obtained in the same way. This finally yields the potential value sets potentialMap_O^{source} and potentialMap_O^{target} of all concepts in the ontologies O to be mapped. The potential value sets are uniformly defined as key-value pairs:

The concept set in O is divided into two parts, called the candidate region and the eliminated region;

Specifically, for the key-value pair sets Map_O^{source} and Map_O^{target} output by the multi-strategy fusion association degree algorithm, the number of concepts whose association degree value is greater than 0.05 is counted in Map_O^{source} and in Map_O^{target} and is called Range_Candidate_O^{source} and Range_Candidate_O^{target} respectively. These variables are defined as the upper bounds of the candidate region intervals of the ontologies O^{source} and O^{target};

The concept elements in the potential value sets potentialMap_O^{source} and potentialMap_O^{target} are sorted in descending order of potential value. For

\forall C_{s}^{source} \in potentialMap\_O^{source},

its rank is denoted Rank_{s}^{source}. If Rank_{s}^{source} \in [1, Range\_Candidate\_O^{source}], concept C_{s}^{source} is retained as a candidate concept; correspondingly, if

Rank_{s}^{source} \in [Range\_Candidate\_O^{source} + 1, n^{source}],

concept C_{s}^{source} is eliminated. By the symmetry between the source ontology and the target ontology, the candidate concept selection rule for the target ontology O^{target} is obtained in the same way;
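The candidate/eliminated split described above can be sketched in Python as follows. This is only an illustration: the function name `split_candidates` and the sample numbers are hypothetical, and the potential values are passed in as given, because the text does not reproduce the potential function of formula (9).

```python
from typing import Dict, List, Tuple


def split_candidates(association: Dict[str, float],
                     potential: Dict[str, float],
                     cutoff: float = 0.05) -> Tuple[List[str], List[str]]:
    """Split concepts into (candidate region, eliminated region)."""
    # Range_Candidate_O: number of concepts whose association degree > 0.05.
    range_candidate = sum(1 for m in association.values() if m > cutoff)
    # Rank concepts by potential value, highest first.
    ranked = sorted(potential, key=lambda c: potential[c], reverse=True)
    # Ranks 1..Range_Candidate_O are kept; the remaining ranks are eliminated.
    return ranked[:range_candidate], ranked[range_candidate:]


if __name__ == "__main__":
    map_o = {"a": 0.30, "b": 0.02, "c": 0.10, "d": 0.05}        # made-up values
    potential_map_o = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 0.1}  # made-up values
    print(split_candidates(map_o, potential_map_o))
```

Note that the size of the candidate region comes from the association degrees, while membership is decided by the potential ranking, so a concept with a low association degree can still be retained if its potential value ranks high enough.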

(3) Deterministic mapping

For any two concepts C^{source} and C^{target} in the source ontology O^{source} and the target ontology O^{target} to be mapped, three cases arise when calculating the semantic similarity of the concepts:

1. C^{source} and C^{target} are both atomic concepts, that is: C^{source} ∈ SKB_{TYCCL} and C^{target} ∈ SKB_{TYCCL}

2. One of C^{source} and C^{target} is an atomic concept and the other is a combined concept, that is: C^{source} \notin SKB_{TYCCL} or C^{target} \notin SKB_{TYCCL} (but not both)

3. C^{source} and C^{target} are both combined concepts, that is: C^{source} \notin SKB_{TYCCL} and C^{target} \notin SKB_{TYCCL}

For case 1, formula (3) is used to calculate the semantic similarity of the two concepts. For cases 2 and 3, the two word sequences to be compared are first represented in the form of a scoring matrix, with the two sequences serving as the two dimensions of the dynamic programming matrix. For concepts C^{source} and C^{target} in the ontologies O^{source} and O^{target}, the i-th row of the scoring matrix M corresponds to atomic concept AC_{i}^{source} in the sequence CC^{source}, and the j-th column corresponds to atomic concept AC_{j}^{target} in the sequence CC^{target}, where i ≤ m and j ≤ n. The element in row i and column j of the dynamic programming matrix M is denoted M_{ij};

First, the gap penalty factor of the sequence alignment algorithm is set to p = -0.05, and row m+1 and column n+1 of the matrix are initialized;

Second, based on the Chinese thesaurus similarity function SIM_{T}, the remaining m × n elements of the matrix are solved recursively;

The scoring function f is defined as shown in formula (11):

The recurrence is shown in formula (12):

M_{ij} = \max \begin{cases} M_{(i+1)(j+1)} + f(AC_{i}^{source}, AC_{j}^{target}) \\ M_{i(j+1)} + p \\ M_{(i+1)j} + p \end{cases} \qquad (12)

Starting from element M_{mn} of the matrix and tracing back until element M_{11} is reached, the optimal alignment path is obtained; if there is more than one optimal alignment path, any one of them may be chosen;

Finally, the gap symbol "-" is inserted to obtain the correct global sequence alignment result;

The two combined-concept sequences to be mapped, after the gap symbol "-" has been inserted, are called CC^{source'} and CC^{target'}. The two sequences now contain the same number of elements, denoted L^{cc'}. From the alignment result and the scoring function f, the similarity between combined concepts is calculated by formula (13):

SIM_{NW}(CC^{source'}, CC^{target'}) = \sum_{i=1}^{L^{cc'}} \frac{f(AC_{i}^{source'}, AC_{i}^{target'})}{L^{cc'}} \qquad (13)
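The alignment step can be sketched in Python as below. Several points are assumptions rather than statements from the patent: formula (11) is not reproduced in the text, so the scoring function `f` here is assumed to return the atomic-concept similarity for two real concepts and the gap penalty p when either side is a gap; the border cells are assumed to hold accumulated gap penalties, as in the standard Needleman-Wunsch scheme; and the names `align`, `sim_nw` and `exact` are hypothetical. The sketch fills the matrix with recurrence (12) starting from the initialized last row and column, then reads the alignment off by walking from the first cell toward that border.

```python
from typing import Callable, List, Tuple

GAP = "-"
Sim = Callable[[str, str], float]


def f(a: str, b: str, sim: Sim, p: float) -> float:
    """Scoring function (ASSUMED form): similarity, or p if either side is a gap."""
    return p if GAP in (a, b) else sim(a, b)


def align(source: List[str], target: List[str], sim: Sim,
          p: float = -0.05) -> Tuple[List[str], List[str]]:
    """Global alignment of two atomic-concept sequences using recurrence (12)."""
    m, n = len(source), len(target)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):              # extra last column (column n+1 in the text)
        M[i][n] = (m - i) * p
    for j in range(n + 1):              # extra last row (row m+1 in the text)
        M[m][j] = (n - j) * p
    for i in range(m - 1, -1, -1):      # remaining m x n cells
        for j in range(n - 1, -1, -1):
            M[i][j] = max(M[i + 1][j + 1] + f(source[i], target[j], sim, p),
                          M[i][j + 1] + p,
                          M[i + 1][j] + p)
    out_s: List[str] = []
    out_t: List[str] = []
    i = j = 0
    while i < m or j < n:
        if i == m:                      # source exhausted: gap on the source side
            out_s.append(GAP); out_t.append(target[j]); j += 1
        elif j == n:                    # target exhausted: gap on the target side
            out_s.append(source[i]); out_t.append(GAP); i += 1
        elif M[i][j] == M[i + 1][j + 1] + f(source[i], target[j], sim, p):
            out_s.append(source[i]); out_t.append(target[j]); i += 1; j += 1
        elif M[i][j] == M[i][j + 1] + p:
            out_s.append(GAP); out_t.append(target[j]); j += 1
        else:
            out_s.append(source[i]); out_t.append(GAP); i += 1
    return out_s, out_t


def sim_nw(source: List[str], target: List[str], sim: Sim,
           p: float = -0.05) -> float:
    """Formula (13): mean score over the aligned positions (m, n >= 1)."""
    a, b = align(source, target, sim, p)
    return sum(f(x, y, sim, p) for x, y in zip(a, b)) / len(a)


def exact(x: str, y: str) -> float:
    """Toy stand-in for SIM_T: 1.0 for identical atomic concepts, else 0.0."""
    return 1.0 if x == y else 0.0


if __name__ == "__main__":
    print(align(["A", "B", "C"], ["A", "C"], exact))
    print(sim_nw(["A", "B", "C"], ["A", "C"], exact))
```

When several optimal paths exist, this sketch prefers the diagonal move; the text allows any optimal path to be chosen.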

Brief description of the drawings

Fig. 1: Flowchart of the large-scale Chinese ontology mapping method

Fig. 2(a): Incorrect matching result

Fig. 2(b): Correct alignment result

Fig. 3: Scoring matrix of example one

Fig. 4: Scoring matrix of example two

Fig. 5: Sequence matching result of example two

Embodiments

A) Edit distance similarity

For mapping tasks on large-scale ontologies, the present invention proposes first compressing the ontologies to be mapped. Specifically, an edit distance algorithm is first used to compute the initial similarity between the concept sets. This is because, when computing the initial association degree, the efficiency of the algorithm is the primary consideration and its accuracy is secondary.

That is, when obtaining the initial association degree of the ontologies to be mapped, the system uses the edit distance algorithm to obtain the literal similarity between concepts and ignores their semantic relatedness. Specifically, for two concepts C^{source} and C^{target}, their edit distance and their similarity are given by formulas (1) and (2):

EditDistance(C^{source}, C^{target}) = \frac{|Do(C^{source}, C^{target})|}{\max(L(C^{source}), L(C^{target}))} \qquad (1)

where |Do(C^{source}, C^{target})| is the number of edit operations between the concepts to be mapped, i.e. the minimum number of operations needed to transform the string C^{source} into the string C^{target}; there are three kinds of operation: inserting, deleting or modifying one character. L(C^{source}) and L(C^{target}) are the character lengths of the concepts to be mapped.

SIM_{E}(C^{source}, C^{target}) = \frac{1}{1 + EditDistance(C^{source}, C^{target})} \qquad (2)

where SIM_{E}(C^{source}, C^{target}) is the similarity of the concepts to be mapped, C^{source} and C^{target}.
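Formulas (1) and (2) can be illustrated with a short Python sketch. The function names are hypothetical, and the operation count |Do| is computed here with the standard Levenshtein dynamic program, which matches the three operations named above.

```python
def edit_operations(a: str, b: str) -> int:
    """|Do(a, b)|: minimum number of single-character inserts, deletes or edits."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # modify, or keep if equal
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> float:
    """Formula (1): operation count normalized by the longer string length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_operations(a, b) / longest


def sim_e(a: str, b: str) -> float:
    """Formula (2): SIM_E = 1 / (1 + EditDistance)."""
    return 1.0 / (1.0 + edit_distance(a, b))


if __name__ == "__main__":
    print(edit_operations("计算机", "计算器"))  # one character is modified
    print(sim_e("计算机", "计算器"))
```

Because the normalized distance of formula (1) lies in [0, 1], SIM_E as defined by formula (2) lies in [0.5, 1]; identical strings score 1.0 and completely different strings score 0.5.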

B) Chinese thesaurus similarity

The Chinese thesaurus (Tongyici Cilin, TYCCL) is a Chinese synonym dictionary. It encodes each word and organizes the words in a hierarchical tree structure in which each node represents a concept. Identifying co-referent Chinese concepts can in fact be abstracted as the problem of computing the similarity of Chinese synonyms, so the Chinese thesaurus is the best choice. The system adopts the extended edition of the Chinese thesaurus from Harbin Institute of Technology as the common-sense knowledge base for extracting Chinese ontology mapping relations.

The Chinese thesaurus organizes word entries into a hierarchy with five levels from top to bottom. Each level has its own code; the codes of the five levels are arranged from left to right to form the Cilin code of an entry. The implicit semantic relatedness between words increases as the level deepens.

The Cilin code format is explained below, taking the word entry "material" (Cilin code: Ba01A02=) as an example, as shown in Table 1:

Table 1. Example of a Cilin code

According to the structural features of the Chinese thesaurus, the Cilin codes of the concepts to be mapped are first parsed to extract the sub-codes of levels 1 to 5, which are then compared starting from level 1. If the sub-codes differ, the mapping pair is given a similarity weight according to the level at which the difference occurs. The deeper the level at which the sub-codes differ, the higher the similarity weight, and vice versa. The number of branch nodes at each level also affects the similarity.
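The parsing and level-by-level comparison can be sketched in Python as follows. The character layout used here (one letter, one letter, two digits, one letter, two digits, then a one-character end marker) is an assumption based on the example code Ba01A02= and on the commonly described layout of the extended edition; the function names and the second code Ba01B01= are hypothetical.

```python
from typing import List, Optional


def split_cilin_code(code: str) -> List[str]:
    """Split an 8-character Cilin code into its five level sub-codes.

    Assumed layout: level 1 = char 1, level 2 = char 2, level 3 = chars 3-4,
    level 4 = char 5, level 5 = chars 6-7, char 8 = end marker (e.g. "=").
    """
    if len(code) != 8:
        raise ValueError(f"expected an 8-character code, got {code!r}")
    return [code[0], code[1], code[2:4], code[4], code[5:7]]


def first_differing_level(code_a: str, code_b: str) -> Optional[int]:
    """Return the first level (1-5) whose sub-codes differ, or None if all match."""
    pairs = zip(split_cilin_code(code_a), split_cilin_code(code_b))
    for level, (sub_a, sub_b) in enumerate(pairs, start=1):
        if sub_a != sub_b:
            return level
    return None


if __name__ == "__main__":
    print(split_cilin_code("Ba01A02="))
    print(first_differing_level("Ba01A02=", "Ba01B01="))
```

The returned level plays the role of F_i in formula (3): a difference that only appears at a deeper level yields a larger weight.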

The similarity formula based on the Chinese thesaurus is given below:

SIM_{T}(C^{source}, C^{target}) = α \times \frac{F_{i}}{|F|} \times \cos\left(n^{subTree} \times \frac{π}{180}\right) \times \left(\frac{n^{subTree} - D + 1}{n^{subTree}}\right) \qquad (3)

Because the ontology mapping task is concerned more with the semantic similarity between concepts, the present invention introduces an adjustment parameter, the semantic relatedness factor α. α adjusts the relationship between semantic relatedness and semantic similarity for concepts at different levels and controls the degree to which word entries on branches at different levels may be similar; clearly α ∈ (0, 1). The larger α is, the more likely entries at different levels are to be similar or equivalent, and the greater the influence of semantic relatedness at different levels on the final concept similarity; the smaller α is, the smaller this influence.

Here F = {1, 2, 3, 4, 5}. For F_i ∈ F, F_i denotes the level number when the sub-codes of word entries C^{source} and C^{target} differ at the i-th level; |F| is the number of elements in the set F and is always 5 in this system. The concept similarity weight coefficient is α × (F_i / |F|). n^{subTree} is the total number of nodes under the branch at level F_i where the sub-codes of C^{source} and C^{target} differ, and D is the coding distance between C^{source} and C^{target}.

In particular, when all five levels of the codes of the concept pair to be mapped are equal and the last character of the Cilin code is "=", the similarity function SIM_{T} returns 1.0. Clearly, the range of SIM_{T} is (0, 1]. The system focuses on obtaining equivalence relations between word-entry concepts; for Chinese ontology mapping tasks in particular, the semantic similarity between concepts should be emphasized, so α should not be too high. Any value of the semantic relatedness factor α ∈ [0.4, 0.5] meets the requirements.
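Formula (3) can be written as a small Python function. This is a sketch under stated assumptions: the function name `sim_t` and its parameter names are hypothetical, the level F_i, the node count n^{subTree} and the coding distance D are taken as inputs rather than looked up in the thesaurus, and the sample arguments are made up for illustration.

```python
import math
from typing import Optional


def sim_t(level: Optional[int],
          n_subtree: int = 1,
          distance: int = 1,
          alpha: float = 0.45,
          marker: str = "=",
          num_levels: int = 5) -> float:
    """Formula (3). `level` is F_i, or None when all five sub-codes are equal."""
    if level is None:
        if marker == "=":
            return 1.0  # identical codes ending in "=": similarity is 1.0
        # The text only specifies the "=" case for identical codes.
        raise ValueError("identical codes with a marker other than '=' are unspecified")
    weight = alpha * level / num_levels               # α × F_i / |F|
    angle = math.cos(n_subtree * math.pi / 180.0)     # cos(n^subTree × π / 180)
    spread = (n_subtree - distance + 1) / n_subtree   # (n^subTree − D + 1) / n^subTree
    return weight * angle * spread


if __name__ == "__main__":
    # Hypothetical inputs: codes differ at level 5, 3 nodes under the branch, D = 1.
    print(sim_t(5, n_subtree=3, distance=1))
    # The same inputs with the difference at level 2 give a lower similarity.
    print(sim_t(2, n_subtree=3, distance=1))
```

The default α = 0.45 is simply one value inside the interval [0.4, 0.5] stated above.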

C) how strategy merges algorithm of correlation degree

Because editing distance similarity algorithm and Chinese thesaurus similarity algorithm have certain complementarity, therefore the similarity result value of these two kinds of algorithms merges by the present invention.

When application data field theory, each concept is regarded as the particle in field, and the gravitation phenomenon that previous work often exists between particle in a consideration physical field, and exactly ignore the objective fact of going back ubiquity repulsion between particle, therefore introduce repulsion to the factor of influence of the degree of association.In order to improve this defect, the present invention considers gravitation between concept to be mapped and repulsion.That is, the similarity between concept is considered as the gravitation existed between particle in field, and distinctiveness ratio is considered as repulsion.Therefore, in the algorithm one of the present invention's proposition, first by comparing the similarity result of two kinds of rudimentary algorithms, the maximal value of two kinds of arithmetic result is got; Meanwhile, two concept C are considered ^sourceand C ^targetbetween similarity and distinctiveness ratio, and superposed and entered each concept C ^source, C ^targetthe final degree of association.It is ρ that the present invention defines the maximal value that two kinds of similarity algorithms obtain, obvious ρ ∈ (0,1], then there is formula (4):

ρ_{st} = \max ({SIM}_{E} (C_{s}^{source}, C_{t}^{t \arg et}), {SIM}_{T} (C_{s}^{source}, C_{t}^{t \arg et})) - - - (4)

Here, concept is claimed with between semantic related coefficient be λ _st, the 1-ρ in formula (5) _stthe distinctiveness ratio be used between tolerance two concepts.-log (1-ρ _st) be denary logarithm function, the present invention is defined as strictly monotone increasing function, and distinctiveness ratio can be enable like this to reflect cause-effect relationship therebetween reposefully to the variation tendency of similarity.Can see, similarity ρ _stvalue larger, distinctiveness ratio is less, then adjustment function-ρ _st× log (1-ρ _st) value larger.The concept obtained like this initial association degree consider the similarity between source body and target Ontological concept and distinctiveness ratio, thus made result more reasonable.For making formula (5) restrain, the value of regulation ρ belong to interval (0.9,1] time, concept with between semantic correlation factor λ _stbe 1.

Finally, the initial association degree of a concept C_{s}^{source} in the source ontology with respect to the target ontology O^{target} is expressed by formula (6).

m_{s}^{source} = λ_{s1} + λ_{s2} + λ_{s3} + \dots + λ_{s n^{target}} = \sum_{t=1}^{n^{target}} λ_{st} \qquad (6)

Because the association-degree calculation is symmetric, the initial association degree m^{target} of a concept C^{target} in the target ontology is obtained in the same way. In other words, the two ontologies to be mapped are on an entirely equal footing: either may be regarded as the source ontology, and the other is then the target ontology. The initial association degree of a concept in either ontology is therefore determined solely by the interaction between the two ontologies and is independent of the labels the system assigns to them. The multi-strategy fusion algorithm for the initial association degree is described in Algorithm 1. When the final initial association degree of a concept is zero, then, to keep the transfer of association degree continuous, the present system sets m^{source} or m^{target} to a random number in [0.01, 0.05]; any value in this range meets the requirement. This yields the initial association degree sets Map_O^{source} and Map_O^{target} of all concepts in the ontologies O to be mapped. The initial association degree set is defined uniformly as a key-value pair: Map_O<C, m>
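The accumulation in formulas (4) and (6), together with the fallback for a zero association degree, can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: formula (5) is not reproduced in this text, so the coefficient function `lam` is passed in by the caller, and `placeholder_lambda` is a hypothetical stand-in that only honours the stated rule λ = 1 for ρ in (0.9, 1]. The names `initial_association`, `sim_e` and `sim_t` are likewise illustrative.

```python
import random
from typing import Callable, Dict, List, Optional


def placeholder_lambda(rho: float) -> float:
    """Hypothetical stand-in for formula (5); NOT the patent's definition.

    Only the rule stated in the text is honoured: lambda = 1 when rho > 0.9.
    """
    return 1.0 if rho > 0.9 else rho


def initial_association(
    source: List[str],
    target: List[str],
    sim_e: Callable[[str, str], float],
    sim_t: Callable[[str, str], float],
    lam: Callable[[float], float],
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Return Map_O^{source}: concept -> initial association degree."""
    rng = rng or random.Random()
    degrees: Dict[str, float] = {}
    for s in source:
        m = 0.0
        for t in target:
            rho = max(sim_e(s, t), sim_t(s, t))  # formula (4)
            m += lam(rho)                        # one term of formula (6)
        if m == 0.0:
            # Zero degrees are replaced by a random value in [0.01, 0.05].
            m = rng.uniform(0.01, 0.05)
        degrees[s] = m
    return degrees
```

Swapping the `source` and `target` arguments gives Map_O^{target}, which reflects the symmetry described above.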

(2) Large-scale ontology compression algorithm

When facing a large-scale ontology mapping task, traditional algorithms struggle with the time or space complexity, so a strategy is needed to compress the ontologies to be mapped first.

Data field theory draws on field theory in physics: it abstracts the relationships among data in a number-field space into interactions among material particles and formalises them in the descriptive framework of field theory. The theory expresses the interaction between different data through a potential function, thereby capturing the distribution characteristics of the data, and clusters the data set according to the equipotential-line structure of the data field. However, the short-range field potential functions adopted by classical data fields, such as the pseudo-nuclear-force field potential function, usually consider only the effect of the path distance between data objects on the final potential value; in an ontology mapping problem this means that the pervasive semantic association between data objects is ignored.

When compressing a large-scale ontology, the semantic association between concepts in the ontology serves as the basis for compression. The present invention treats the pervasive semantic association between concepts in the source ontology O^{source} and the target ontology O^{target} as the basis and premise of ontology compression, treats ontology concepts as data objects in a data field, and treats the initial association degree between concepts as the mass of each object in the data field. On this basis it proposes a new method that measures the potential value of a data object by jointly computing the semantic similarity and dissimilarity between concepts. Introducing the semantic association degree between concepts remedies the deficiency of the pseudo-nuclear-force field potential function in ontology mapping problems, so that it matches the characteristics of ontology mapping at the macroscopic level.

A) Definition of the potential function

Because a short-range field better reflects the interaction between data, the pseudo-nuclear-force field potential function is adopted. It is defined for the ontology mapping problem as follows:

In an ontology O to be mapped, the shortest path length between concepts is ||C_{s} - C_{l}||. Owing to the characteristics of a short-range field, the path R between concepts is defined to be no greater than 2. Data field theory then gives the field-strength function for the interaction between concepts C_{s} and C_{l}, as shown in formula (7):

Here m_{s} denotes the mass of each data point. It is usual to set m_{s} = 1, but this reflects only the effect of the path distance between concepts on the final potential value and leaves out the semantic association between concepts entirely. The present invention therefore introduces the semantic similarity and dissimilarity between concepts into the potential-value calculation by defining m_{s} as the initial association degree between concepts, given by formula (6). Taking both similarity and dissimilarity into account in this way revises and completes the field-strength function of formula (7). Formula (6) shows that the larger the initial association degree of a concept in an ontology to be mapped, the larger its mass in the data field.

That is, for the concept set of the source ontology O^{source} and the concept set of the target ontology O^{target}, the initial association degree of each concept is used to characterise the influence of that concept on the other concepts. The revised field-strength function is shown in formula (8):

δ ∈ (0, +∞) reflects the granularity of the influence between concepts and is also called the scaling factor; here δ = 1 and R = 2 are taken. This gives the potential-value function of each concept in the ontology O^{source} to be mapped, as shown in formula (9):

Because the potential-value calculation is symmetric between the source ontology and the target ontology, the potential value of a concept in the target ontology is obtained in the same way. Finally, the potential value sets potentialMap_O^{source} and potentialMap_O^{target} of all concepts in the ontologies O to be mapped are obtained. The potential value set is defined uniformly as a key-value pair:

B) Extraction of candidate concepts

To compress an ontology O to be mapped, the present system divides the concept set of O into two parts, called the candidate region and the elimination region.

Specifically, for the output key-value sets Map_O^{source} and Map_O^{target} obtained after executing Algorithm 1, the number of concepts in each set whose association degree is greater than 0.05 is counted. These counts are called Range_Candidate_O^{source} and Range_Candidate_O^{target}, and they define the upper bound of the candidate region of the ontologies O^{source} and O^{target} to be mapped.

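∀ C_{s}^{source} ∈ potentialMap_O^{source},

the concepts are sorted in descending order of potential value and the rank of C_{s}^{source} is recorded in the variable Rank_{s}^{source}. If Rank_{s}^{source} ∈ [1, Range_Candidate_O^{source}], the concept C_{s}^{source} is retained as a candidate concept. Correspondingly, if Rank_{s}^{source} ∈ [Range_Candidate_O^{source} + 1, n^{source}], the concept C_{s}^{source} is eliminated. By the symmetry between the source ontology and the target ontology, the candidate-extraction rule for the target ontology O^{target} is obtained in the same way.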
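The split into candidate region and elimination region can be sketched as below. The sketch assumes the association degrees and the potential values are already available as dictionaries keyed by concept name; the function name `extract_candidates` and the parameter names are illustrative and do not come from the original text.

```python
from typing import Dict, List, Tuple


def extract_candidates(
    association: Dict[str, float],
    potential: Dict[str, float],
    threshold: float = 0.05,
) -> Tuple[List[str], List[str]]:
    """Split concepts into (candidate region, elimination region)."""
    # Range_Candidate: number of concepts whose association degree exceeds 0.05.
    range_candidate = sum(1 for m in association.values() if m > threshold)
    # Rank concepts by potential value, highest first.
    ranked = sorted(potential, key=lambda c: potential[c], reverse=True)
    # Ranks 1..Range_Candidate are kept; the remaining ranks are eliminated.
    return ranked[:range_candidate], ranked[range_candidate:]
```

Because Python's sort is stable, concepts with equal potential values keep their insertion order; the text does not say how ties should be broken, so this is only one possible choice.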

(3) Deterministic concept mapping based on the Needleman-Wunsch algorithm

Common semantic knowledge bases currently include the Chinese Thesaurus, HowNet and WordNet. In the present invention, a word already included in the Chinese Thesaurus (extended edition) is called a simple lemma, and a word not yet included in the thesaurus is called an unregistered word (Out of Vocabulary, OOV).

Based on the definitions given, the problem is discussed case by case. For any two concepts C^{source} and C^{target} in the source ontology O^{source} and the target ontology O^{target} to be mapped, the following three cases arise when computing their semantic similarity:

① C^{source} and C^{target} are both atomic concepts, that is: C^{source} ∈ SKB_{TYCCL} and C^{target} ∈ SKB_{TYCCL}

② One of C^{source} and C^{target} is an atomic concept and the other is a combined concept, that is: C^{source} ∉ SKB_{TYCCL} or C^{target} ∉ SKB_{TYCCL}

③ C^{source} and C^{target} are both combined concepts, that is: C^{source} ∉ SKB_{TYCCL} and C^{target} ∉ SKB_{TYCCL}

For case ①, the present invention directly uses formula (3) to compute the semantic similarity of the two concepts. The similarity calculation for cases ② and ③ is discussed below.

For the similarity calculation of Chinese combined concepts, traditional Chinese ontology mapping systems offer one processing scheme. For example, Li Jia et al. designed and implemented an element-level concept similarity method based on HowNet and built a Chinese ontology mapping system [Li Jia, Zhu Ming, Liu Chen, et al. Research and implementation of Chinese ontology mapping [J]. Journal of Chinese Information Processing, 2007, 21(4): 27-33]. When handling the similarity of unregistered words, this method traverses the atomic-concept sequences corresponding to the two combined concepts, finds the atomic-concept mapping pairs with the greatest similarity, and derives the similarity of the two combined concepts from these relatively maximal mapping pairs. Its formula is as follows:

Sim (A, B) = \frac{\sum_{i=1}^{\max (m, n)} \max_{i} (B_{xy})}{\max (m, n)} \qquad (10)

Here B_{xy} denotes an element of the similarity matrix whose rows and columns are the known words obtained by segmenting the two terms, and \max_{i} (B_{xy}) denotes the similarity value ranked i-th in that matrix.
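One reading of formula (10) is sketched below: it sums the max(m, n) largest entries of the similarity matrix and divides by max(m, n). This reading is an assumption, chosen because it agrees with the second worked example later in this section, where four entries equal to 1.0 are summed; the method as originally published may select pairs differently. `toy_sim` is a hypothetical stand-in for the thesaurus similarity of formula (3).

```python
from typing import Callable, List


def toy_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for SIM_T; the values are illustrative only."""
    equivalent = {frozenset(("second", "secondary"))}
    if a == b or frozenset((a, b)) in equivalent:
        return 1.0
    return 0.0


def sim_traditional(
    a_tokens: List[str],
    b_tokens: List[str],
    sim: Callable[[str, str], float],
) -> float:
    """Formula (10), read as the mean of the max(m, n) largest matrix entries."""
    k = max(len(a_tokens), len(b_tokens))
    if k == 0:
        raise ValueError("both token sequences are empty")
    # Flatten the m x n similarity matrix and sort it in descending order.
    values = sorted((sim(x, y) for x in a_tokens for y in b_tokens), reverse=True)
    return sum(values[:k]) / k
```

With the token sequences [second, secondary, industrial revolution] and [second, secondary, world war, war criminal], this reading returns 1.0, which shows how word order and polysemy are lost under formula (10).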

However, Chinese concepts typically exhibit a "light head, heavy tail" pattern, in which the semantic weight falls at the end of the term, so the earlier approach above inevitably introduces errors into the semantic similarity calculation. For example, consider two combined concepts to be mapped that occur in different ontologies: "historical theory" and "intellectual history". After word segmentation, two ordered sequences of atomic concepts are obtained: [history, theory] and [thought, history]. With the usual earlier method for unregistered words, the atomic-concept mapping shown in Fig. 2(a) results: the semantic similarity of each pair of atomic concepts is computed by formula (3) on the basis of the Chinese Thesaurus (extended edition), and formula (10) from the earlier work is then applied. The resulting element-level concept similarity is 1.0, which is clearly an unreasonable mapping pair and similarity result for these combined concepts. The reason is that the method ignores both the word-order sensitivity that pervades Chinese natural language and its "light head, heavy tail" character.

The present system therefore adopts a concept semantic similarity method based on global sequence alignment.

A) Overview of sequence alignment algorithms

In bioinformatics, pairwise alignment places two DNA, RNA or protein sequences side by side to show where they are similar. Gap symbols may be inserted into the sequences so that identical or similar symbols fall in the same column. Comparing the similar segments and conserved sites of two sequences helps to find the molecular evolutionary relationship that may exist between them.

Broadly, alignment models fall into two classes. The first is global alignment, which examines the overall similarity between two sequences by scanning and comparing them end to end. The second is local alignment, which focuses on particular segments within the sequences and compares the similarity between those segments. Both can be solved by dynamic programming (DP).

The Needleman-Wunsch algorithm is a typical global alignment algorithm, suited to comparing two sequences whose overall similarity is high. It was proposed by Needleman and Wunsch in 1970 and is a dynamic programming algorithm for comparing the similarity of two sequences; it is one of the basic algorithms of bioinformatics. The present system mainly considers the global pairwise alignment algorithm.

B) Constructing the dynamic programming scoring matrix

A sequence here means a string composed of a series of symbols arranged according to some rule. For the similarity of ontology concepts specifically, the present invention treats a combined concept as a word-string sequence in which each element is an atomic concept. The combined concept is first segmented to obtain its word-string sequence; the Chinese large-scale ontology mapping system uses ICTCLAS50, developed by the Institute of Computing Technology, Chinese Academy of Sciences, as the word segmentation tool. The alphabet is defined as the Chinese Thesaurus semantic knowledge base SKB_{TYCCL}, with the gap symbol (-) added.

The present system abstracts the concept similarity calculation of ontology mapping as the alignment of two word-string sequences: a gap penalty function decides where gap symbols are inserted so that the two sequences have the same length, which establishes the correspondence between atomic concepts, or between atomic concepts and gaps, in the sequences being compared. In essence, the sequence alignment algorithm uses a scoring strategy to find the best global pairing of the two combined-concept sequences.

In the present system, the two word-string sequences to be compared are first represented as a scoring matrix, with the two sequences forming the two dimensions of the dynamic programming matrix. For concepts C^{source} and C^{target} in the ontologies O^{source} and O^{target} to be mapped, row i of the scoring matrix M corresponds to the atomic concept AC_{i}^{source} in the word-string sequence CC^{source}, and column j corresponds to the atomic concept AC_{j}^{target} in the word-string sequence CC^{target}, where i ≤ m and j ≤ n. The element in row i and column j of the dynamic programming matrix M is called M_{ij}.

For example, after word segmentation the combined concepts "second industrial revolution" and "World War II war criminal" yield two word-string sequences to be compared: [second, secondary, industrial revolution] and [second, secondary, world war, war criminal]. Following the dynamic programming approach, the two sequences are laid out as rows and columns. If the sequence CC^{source} has length m and the sequence CC^{target} has length n, they form a two-dimensional (m+1) × (n+1) matrix with CC^{source} as rows and CC^{target} as columns, as shown in Fig. 4.

C) Recursive solution for the optimal alignment

Following the idea of dynamic programming, the optimal alignment path in the matrix M is solved recursively.

First, the gap penalty factor of the sequence alignment algorithm is set to p = -0.05, and row m+1 and column n+1 of the matrix are initialised.

Next, the remaining m × n elements of the matrix are solved recursively using the Chinese Thesaurus similarity function SIM_{T}. The scoring function f is defined first, as shown in formula (11):

Because Chinese concepts typically place their semantic weight at the end, the recursion starts from the end of the two combined concepts, that is, from the element M_{mn} of the matrix. SIM_{T} is described in formula (3). The recursion rule is given in formula (12):

M_{ij} = \max \begin{cases} M_{(i+1)(j+1)} + f (AC_{i}^{source}, AC_{j}^{target}) \\ M_{(i)(j+1)} + p \\ M_{(i+1)(j)} + p \end{cases} \qquad (12)

Finally, the optimal alignment path is obtained by tracing back from element M_{mn} of the matrix to element M_{11}. Note that if more than one optimal alignment path exists, any one of them may be chosen.

The element-level concept similarity algorithm based on global sequence alignment is given in Algorithm 2.
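The recursion of formula (12) can be sketched as follows. Several points are assumptions rather than details taken from the text: the extra row and column are initialised with accumulated gap penalties, as in the standard Needleman-Wunsch algorithm; the scoring function f is taken to be SIM_{T} for two atomic concepts, which is how the worked examples in this section use it; ties are broken in favour of a match; and the path is read off from the first cell, which visits the same cells as walking it in the opposite direction. Indices are 0-based, and `toy_sim` and the function names are hypothetical.

```python
from typing import Callable, List, Tuple

GAP = "-"


def toy_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for SIM_T; the values are illustrative only."""
    table = {frozenset(("industrial revolution", "world war")): 0.18,
             frozenset(("second", "secondary")): 1.0}
    return 1.0 if a == b else table.get(frozenset((a, b)), 0.0)


def align_from_end(
    source: List[str],
    target: List[str],
    sim_t: Callable[[str, str], float],
    p: float = -0.05,
) -> Tuple[List[str], List[str]]:
    """Globally align two atomic-concept sequences, filling the matrix from the end."""
    m, n = len(source), len(target)
    # M[i][j]: best score for aligning source[i:] with target[j:].
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    step = [[""] * (n + 1) for _ in range(m + 1)]
    # Assumed initialisation of the extra row and column.
    for i in range(m - 1, -1, -1):
        M[i][n], step[i][n] = M[i + 1][n] + p, "gap_in_target"
    for j in range(n - 1, -1, -1):
        M[m][j], step[m][j] = M[m][j + 1] + p, "gap_in_source"
    # Formula (12): each cell depends on its lower and right neighbours.
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            options = [
                (M[i + 1][j + 1] + sim_t(source[i], target[j]), "match"),
                (M[i][j + 1] + p, "gap_in_source"),
                (M[i + 1][j] + p, "gap_in_target"),
            ]
            M[i][j], step[i][j] = max(options, key=lambda o: o[0])
    # Read the chosen path and insert gap symbols where a sequence is skipped.
    aligned_s, aligned_t = [], []
    i = j = 0
    while i < m or j < n:
        move = step[i][j]
        aligned_s.append(source[i] if move != "gap_in_source" else GAP)
        aligned_t.append(target[j] if move != "gap_in_target" else GAP)
        i += move != "gap_in_source"
        j += move != "gap_in_target"
    return aligned_s, aligned_t
```

With the toy similarity above, [thought, history] against [history, theory] aligns as (thought, -), (history, history), (-, theory), the pairing used in the first worked example of this section.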

Taking again the combined concepts "second industrial revolution" and "World War II war criminal", the matrix M'_{(i)(j)} containing the optimal matching path obtained by Algorithm 2 is shown in Fig. 4. Each arrow is obtained from formula (12) and indicates a direction that may be taken during traceback; the bold arrows mark the optimal path obtained. In particular, a bold diagonal arrow means that the two atomic concepts at its tail are matched; a bold horizontal arrow means that one gap symbol "-" is inserted in the word-string sequence CC^{source} before the position of the atomic concept of its row; and a bold vertical arrow means that one gap symbol "-" is inserted in the word-string sequence CC^{target} before the position of the atomic concept of its column. From the scoring matrix of Fig. 4, the optimal alignment of CC^{source} and CC^{target} is shown in Fig. 5:

The calculation of each element of the scoring matrix M'_{(i)(j)} shown in Fig. 3 by Algorithm 2 is given below; the scoring matrix shown in Fig. 4 is obtained in the same way.

Step ①: Matrix initialisation. Let the penalty factor be p = -0.05, let the sequence length of the combined concept CC^{source} be m, and let the sequence length of CC^{target} be n.

Step ②: Recursively compute the cost value of each element of the scoring matrix.

Starting from element M_{22}, the last cell of the matrix, with row index i = 2 and column index j = 2:

Similarly:

Step ③: Trace back to obtain the optimal path from M_{33} to M_{11}: M_{33} → M_{32} → M_{21} → M_{11}

Step ④: Finally, insert the gap symbol "-" to obtain the correct global sequence alignment shown in Fig. 2(b).

The two combined-concept entry sequences to be mapped, after gap symbols "-" have been inserted, are called CC^{source'} and CC^{target'}; the two sequences now contain the same number of elements, denoted L^{cc'}. From the alignment result and the scoring function f, the similarity between the combined concepts is given by formula (13):

SIM_{NW} (CC^{source'}, CC^{target'}) = \sum_{i=1}^{L^{cc'}} \frac{f (AC_{i}^{source'}, AC_{i}^{target'})}{L^{cc'}} \qquad (13)
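Formula (13) can be evaluated on two already-aligned sequences as sketched below. Formula (11) is not reproduced in this text, so the scoring function here is an assumption inferred from the worked examples that follow: a pairing with a gap scores the penalty p, and any other pairing scores SIM_{T}. The names `score`, `sim_nw` and `toy_sim` are illustrative.

```python
from typing import Callable, List

GAP = "-"


def toy_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for SIM_T; 0.18 is the value quoted in Example 2."""
    if a == b:
        return 1.0
    if {a, b} == {"industrial revolution", "world war"}:
        return 0.18
    return 0.0


def score(a: str, b: str, sim_t: Callable[[str, str], float], p: float) -> float:
    """Assumed form of the scoring function f: gap penalty or SIM_T."""
    return p if GAP in (a, b) else sim_t(a, b)


def sim_nw(
    aligned_source: List[str],
    aligned_target: List[str],
    sim_t: Callable[[str, str], float],
    p: float = -0.05,
) -> float:
    """Formula (13): mean score over the aligned positions."""
    length = len(aligned_source)
    if length == 0 or length != len(aligned_target):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    total = sum(score(a, b, sim_t, p) for a, b in zip(aligned_source, aligned_target))
    return total / length
```

Applied to the alignment (thought, -), (history, history), (-, theory) this gives (-0.05 + 1.0 - 0.05)/3, and applied to the four-position alignment of the second example it gives (1.0 + 1.0 + 0.18 - 0.05)/4.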

With the combined-concept element-level similarity method based on sequence alignment set out, the two similarity examples mentioned earlier are now re-examined.

Example 1: CC^{source} = [thought, history], CC^{target} = [history, theory]. The combined-concept similarity given by formula (10) is Sim (CC^{source}, CC^{target}) = (1.0 + 1.0)/2 = 1.0. The sequence pairing obtained with the sequence alignment algorithm is shown in Fig. 2(b), and the corresponding scoring matrix in Fig. 3. The combined-concept similarity computed from Algorithm 2 together with formulas (3), (11) and (13) is SIM_{NW} (CC^{source'}, CC^{target'}) = (f (thought, -) + f (history, history) + f (-, theory))/3 = (p + SIM_{T} (history, history) + p)/3 = (-0.05 + 1.0 - 0.05)/3 = 0.3.

Example 2: CC^{source} = [second, secondary, industrial revolution], CC^{target} = [second, secondary, world war, war criminal]. The combined-concept similarity computed by formula (10) is Sim (CC^{source}, CC^{target}) = 1.0, because the atomic concept "secondary" is polysemous. Specifically, the lemma "secondary" has several code items in the Chinese Thesaurus (extended edition), and the code item "Dn04B03=" states that the two atomic lemmas "second" and "secondary" are equivalent. Under formula (10) of the traditional method, four atomic-concept mapping results therefore equal 1.0: <second, second> = 1.0, <second, secondary> = 1.0, <secondary, second> = 1.0 and <secondary, secondary> = 1.0. Substituting into formula (10) gives Sim (CC^{source}, CC^{target}) = (1.0 + 1.0 + 1.0 + 1.0)/4 = 1.0. The combined-concept similarity computed from Algorithm 2 together with formulas (3), (11) and (13) is SIM_{NW} (CC^{source'}, CC^{target'}) = (f (second, second) + f (secondary, secondary) + f (industrial revolution, world war) + f (-, war criminal))/4 = (SIM_{T} (second, second) + SIM_{T} (secondary, secondary) + SIM_{T} (industrial revolution, world war) + p)/4 = (1.0 + 1.0 + 0.18 - 0.05)/4 = 0.5325; the corresponding sequence matching result is shown in Fig. 5.

It can be seen that the two combined concepts in Example 1 and in Example 2 are not equivalent, yet the traditional method wrongly concludes a high similarity of 1.0 in both cases, whereas the similarity values obtained by Algorithm 2 are more reasonable. Hence, when the "weight at the end" and polysemy phenomena common in Chinese concepts are taken into account, global alignment based on the Needleman-Wunsch algorithm effectively avoids the wrong mappings that the traditional method represented by formula (10) may produce. At the same time, when the atomic concepts in the word-string sequences of two combined concepts occur in essentially the same semantic order, the result of Algorithm 2 should be essentially consistent with the traditional method. In summary, for large-scale Chinese ontology mapping tasks, the element-level concept similarity algorithm based on global sequence alignment is more advantageous and more reasonable than the traditional method.

(4) Preparation of experimental data

Compared with international ontologies and their mapping tasks, such as the multi-domain standard ontologies and benchmark mapping evaluations issued by international initiatives such as OAEI (Ontology Alignment Evaluation Initiative), open-source large-scale Chinese ontologies are still scarce. The present invention therefore uses open Chinese web encyclopaedia knowledge bases as the experimental data source. In addition to the DBpedia (Chinese edition) knowledge base, the crawler toolkit HTMLParser is used to crawl and parse the open category pages of Baidupedia and of the Hudong interactive encyclopaedia. The present invention parses not only the category hierarchy of the open Chinese web encyclopaedias but also extracts and parses the Infobox structured information contained in the entry pages and organises it as Chinese triples, finally forming three large-scale Chinese ontology libraries to be mapped.

The open category hierarchy of each encyclopaedia mainly forms the concept system of the ontology. For the knowledge in an Infobox, the name of the wiki page serves as the subject, the property name (Property) as the predicate and the property value as the object. For example, if the factual statement "the nationality of Qian Xuesen is the People's Republic of China" appears in the Infobox of that entry page, it is expressed as "Qian Xuesen", "nationality: People's Republic of China". The knowledge obtained from the Infobox can therefore be expressed as a triple, i.e. <Qian Xuesen, nationality, People's Republic of China>. For the construction of the Baidupedia and Hudong ontology libraries, the present system follows the method proposed in [Z. Wang, Z. Wang, J. Li et al. Knowledge extraction from Chinese wiki encyclopedias [J]. Journal of Zhejiang University-Science C, vol. 13, no. 4, pp. 268-280, 2012].

Table 2 Information on the Chinese web encyclopaedia knowledge bases

The present system constructed a Baidupedia ontology framework containing more than 1300 concepts and a Hudong ontology framework containing 29263 concepts. The Baidupedia category hierarchy has 13 top-level categories, including person, science, history, sports and education; Hudong has 13 top-level categories, including person, technology and hot topics. DBpedia (Chinese edition) is regarded as the semantic version of the Wikipedia knowledge base; it has 23 top-level categories and more than 100,000 concepts in total, and can be obtained from the download link provided by Wikipedia. Information on the three Chinese web encyclopaedia knowledge bases is given in Table 2.

The present system uses the precision (Precision), recall (Recall) and F-measure of the identified mapping results as the final evaluation criteria, where:

Precision (P) = number of correct mapping pairs output / total number of mapping pairs output × 100%

Recall (R) = number of correct mapping pairs output / total number of mapping pairs in the reference result × 100%

F-measure (F1) = 2 × P × R / (P + R) × 100%
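The three measures can be computed from the set of output mapping pairs and the set of reference mapping pairs, as in the following sketch. The function name `evaluate` and the representation of a mapping as a (source concept, target concept) tuple are illustrative assumptions; the ratios are computed as fractions first and converted to percentages once at the end.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def evaluate(output_pairs: Iterable[Pair], reference_pairs: Iterable[Pair]):
    """Return (precision, recall, F1) in percent."""
    output, reference = set(output_pairs), set(reference_pairs)
    correct = len(output & reference)          # correct mapping pairs output
    p = correct / len(output) if output else 0.0
    r = correct / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p * 100, r * 100, f1 * 100
```

For instance, four output pairs of which two appear among three reference pairs give a precision of 50% and a recall of about 66.7%.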

For the large-scale Chinese ontology mapping tasks, the correct mapping pairs within the top-level categories person, science, society, geography and art of the three open Chinese web encyclopaedia ontology concept sets, namely Baidupedia, Hudong and the Chinese edition of Wikipedia (DBpedia 3.8 Chinese edition), are selected as the reference mappings for evaluating the algorithm; see Table 3.

Table 3 Reference mapping statistics for the three Chinese encyclopaedia ontology mapping tasks

(5) Experimental results

A) Experiment 1: large-scale Chinese ontology compression

Using the proposed pseudo-nuclear-force data-field potential function, which jointly computes the semantic similarity and dissimilarity between concepts, the mapping scale of the large-scale Chinese ontologies is first compressed and reduced. For the three ontology mapping tasks involved, the compression achieved for the ontologies to be mapped under different semantic environments is shown in Table 4, where:

Compression ratio (%) = (ontology size before compression - ontology size after compression) / ontology size before compression
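As a small worked illustration of the compression ratio, the sketch below applies the definition to concept counts. The function name `compression_ratio` and the figures in the usage note are illustrative and are not taken from Table 4.

```python
def compression_ratio(size_before: int, size_after: int) -> float:
    """Compression ratio in percent: (before - after) / before * 100."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return (size_before - size_after) / size_before * 100
```

An ontology reduced from 1000 concepts to 250 candidate concepts would have a compression ratio of 75%.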

Table 4 Compression of mapping scale for large-scale ontology mapping

The data in Table 4 show that when the ratio between the original sizes of the two ontologies to be mapped is large, the smaller ontology obtains a lower compression ratio and the larger ontology more readily obtains a higher one. That is, the larger the ratio of original sizes, the higher the compression ratio the larger ontology can reach, and the more the compression ratios of the two ontologies differ. When the ratio of original sizes is small or the sizes are close, the compression ratios of the two ontologies also tend to converge. The revised pseudo-nuclear-force field potential function thus achieves a good clustering effect and, before deterministic mapping is carried out, effectively controls and reduces the time and space complexity of the large-scale ontology mapping task.

B) Experiment 2: evaluation of large-scale Chinese ontology mapping results

The evaluation results for the three mapping tasks are shown in Table 5, which gives the precision (P), recall (R) and F1-measure obtained with three typical similarity algorithms. The first is a cross-language, general-purpose similarity algorithm based on edit distance [Diogene Ontology Mapping Prototype, http://diogene.cis.strath.ac.uk/prototype.html], hereafter method one. The second is a typical Chinese word similarity algorithm based on the Chinese Thesaurus [Tian Jiule, Zhao Wei. Word similarity measurement based on the Chinese Thesaurus [J]. Journal of Jilin University, 2010, 28(6): 602-608], hereafter method two. The third is the Chinese combined-concept similarity algorithm proposed by the present invention, hereafter the method of the present system.

The present system is compared directly with method one and method two below. To ensure fairness, the similarity threshold at which each algorithm judges two concepts to be equivalent is set uniformly to T = 0.9.

Table 5 Evaluation results of the three typical similarity algorithms

As Table 5 shows, the precision of the present system on the Baidu-Hudong mapping task is essentially level with the edit distance similarity algorithm. It is also clearly higher than that of method two, because ontology mapping is concerned more with identifying co-reference between concepts, whereas method two attends too much to the semantic relatedness between words, which introduces considerable error into the word similarity calculation. On the Hudong-DBpedia mapping task, the precision obtained is essentially level with method one and about 9% higher on average than method two.

As for recall, first, because the Chinese Thesaurus is introduced as a semantic knowledge base, the recall is also higher than that of method one. Second, the evaluation results of the three mapping tasks show that introducing the data-field potential function as the compression factor for ontology mapping scale can also be regarded as a structure-level mapping between concept sets. Depending on the structure-level features that concept elements may have in different encyclopaedia subcategories, it therefore also gives the present system stronger error-correcting ability, that is, it may avoid errors caused by a purely element-level mapping strategy. In addition, the combined-concept similarity method based on bioinformatics sequence alignment avoids the wrong mappings produced by traditional algorithms for unregistered-word similarity. Compared with the Chinese word similarity algorithm of method two, which does not consider unregistered words at all, it is also more likely to improve the recall of the individual sub-mapping tasks, depending on the combined concepts that the different subcategories contain.

Finally, in terms of overall performance (F1 value), on the Baidu-Hudong mapping task the present system exceeds method one and method two by about 11% and 20% on average, respectively. On the Hudong-DBpedia mapping task, the overall performance of the present invention is higher than that of the Chinese Thesaurus similarity algorithm of method two and essentially level with method one. On the Baidu-DBpedia mapping task, the overall performance of the present system is still higher than that of method two and method one, by about 21% and 8% respectively.

Claims

1. A large-scale ontology mapping method for the Chinese language, characterised in that it consists of three major steps, namely: calculation of the initial association degree of concepts based on the fusion of edit distance and the Chinese Thesaurus, ontology compression, and deterministic mapping;

A) Edit distance similarity

EditDistance (C^{source}, C^{target}) = \frac{| Do (C^{source}, C^{target}) |}{\max (L (C^{source}), L (C^{target}))} \qquad (1)

SIM_{E} (C^{source}, C^{target}) = \frac{1}{1 + EditDistance (C^{source}, C^{target})} \qquad (2)

B) Chinese thesaurus similarity

Calculating formula of similarity based on Chinese thesaurus:

$$SIM_{T}(C^{source}, C^{target}) = \alpha \times \frac{F_{i}}{|F|} \times \cos\left(n^{subTree} \times \frac{\pi}{180}\right) \times \left(\frac{n^{subTree} - D + 1}{n^{subTree}}\right) \tag{3}$$

Here $F_{i}$ is the level number at which the sub-codes of the entries $C^{source}$ and $C^{target}$ differ, namely the i-th layer; $|F|$ is the number of elements in the set $F$ and is always equal to 5 in the present invention; the concept similarity weight coefficient is $\alpha \times (F_{i} / |F|)$; $n^{subTree}$ is the total number of nodes contained under the corresponding branch at layer $F_{i}$, where the sub-codes of the entries $C^{source}$ and $C^{target}$ differ; $D$ is the coding distance between the entries $C^{source}$ and $C^{target}$; any value $\alpha \in [0.4, 0.5]$ meets the requirements;
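Once the thesaurus lookup has produced $F_{i}$, $n^{subTree}$ and $D$, formula (3) is plain arithmetic. The Python sketch below performs only that arithmetic. The function name `sim_t`, its parameter names, and the default `alpha=0.45` (one value picked from the claimed range [0.4, 0.5]) are illustrative assumptions; looking up the codes of two entries in the Chinese thesaurus is not shown.

```python
import math


def sim_t(f_i: int, n_subtree: int, d: int,
          alpha: float = 0.45, f_size: int = 5) -> float:
    """Evaluate formula (3) for already-extracted thesaurus features.

    f_i       -- level number at which the two codes first differ (F_i)
    n_subtree -- node count under the branch at that level (n^subTree)
    d         -- coding distance between the two entries (D)
    alpha     -- weight in [0.4, 0.5]; 0.45 is an assumed default
    f_size    -- |F|, fixed to 5 in the claims
    """
    if n_subtree <= 0:
        raise ValueError("n_subtree must be positive")
    weight = alpha * f_i / f_size                         # alpha * F_i / |F|
    angle_term = math.cos(n_subtree * math.pi / 180.0)    # degrees to radians
    distance_term = (n_subtree - d + 1) / n_subtree
    return weight * angle_term * distance_term
```

With the other inputs held fixed, a larger coding distance `d` shrinks the last factor and lowers the similarity, which matches the intent that entries far apart in the same branch are less similar.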

C) Multi-strategy fusion correlation degree algorithm

$$\rho_{st} = \max\left(SIM_{E}(C_{s}^{source}, C_{t}^{target}),\ SIM_{T}(C_{s}^{source}, C_{t}^{target})\right) \tag{4}$$

Here the semantic correlation coefficient between the two concepts is denoted $\lambda_{st}$,

$$m_{s}^{source} = \lambda_{s1} + \lambda_{s2} + \lambda_{s3} + \dots + \lambda_{s n^{target}} = \sum_{t = 1}^{n^{target}} \lambda_{st} \tag{6}$$
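Formula (6) sums the correlation coefficients of one source concept over all target concepts. A minimal Python sketch follows. It assumes the coefficients $\lambda_{st}$ have already been computed and are supplied as a nested list indexed `[s][t]`; how $\lambda_{st}$ is derived is not reproduced in this section, so it is taken as given. The function names are hypothetical, and the column-sum version for the target side is an assumed reading of "obtained in the same way".

```python
from typing import List


def source_potentials(coefficients: List[List[float]]) -> List[float]:
    """Formula (6): m_s = sum over t of lambda_st, one value per source concept."""
    return [sum(row) for row in coefficients]


def target_potentials(coefficients: List[List[float]]) -> List[float]:
    """Assumed mirror of formula (6): m_t = sum over s of lambda_st."""
    if not coefficients:
        return []
    n_target = len(coefficients[0])
    return [sum(row[t] for row in coefficients) for t in range(n_target)]
```

For example, with two source concepts and two target concepts, `source_potentials([[0.5, 1.0], [0.25, 0.25]])` gives one potential value per row of the coefficient matrix.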

(2) Ontology compression algorithm

The potential values of the concepts in the target ontology can be obtained in the same way; finally, the potential value sets $potentialMap\_O^{source}$ and $potentialMap\_O^{target}$ of all concepts in the ontologies O to be mapped are obtained; the potential value set is uniformly defined as key-value pairs: potentialMap_O<C,

The concept elements in the potential value sets $potentialMap\_O^{source}$ and $potentialMap\_O^{target}$ are sorted in descending order of potential value, and their ranks are recorded in a rank variable; if

$$Rank_{s}^{source} \in [1,\ Range\_Candidate\_O^{source}],$$

then the concept is retained as a candidate concept; correspondingly, if

$$Rank_{s}^{source} \in [Range\_Candidate\_O^{source} + 1,\ n^{source}],$$
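The ranking step above can be sketched in Python as follows. The sketch assumes the potential value set is a dictionary from concept label to potential value and that `range_candidate` plays the role of $Range\_Candidate\_O^{source}$; the function name `split_by_rank` is hypothetical. Because the text does not state here what happens to concepts ranked beyond the candidate range, the sketch only returns them as a separate list rather than deciding their fate.

```python
from typing import Dict, List, Tuple


def split_by_rank(potential_map: Dict[str, float],
                  range_candidate: int) -> Tuple[List[str], List[str]]:
    """Sort concepts by descending potential value and split at the rank threshold.

    Returns (candidates, remainder): concepts ranked 1..range_candidate,
    and concepts ranked range_candidate+1..n.
    """
    if range_candidate < 0:
        raise ValueError("range_candidate must be non-negative")
    # sorted() is stable, so concepts with equal potential keep insertion order.
    ranked = sorted(potential_map.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [concept for concept, _ in ranked[:range_candidate]]
    remainder = [concept for concept, _ in ranked[range_candidate:]]
    return candidates, remainder
```

Calling `split_by_rank` once on each of the source and target potential value sets yields the two reduced concept lists that the later mapping step would compare.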

(3) Deterministic mapping

Case 2: one of $C^{source}$ and $C^{target}$ is an atomic concept and the other is a combined concept, that is:

$$C^{source} \notin SKB_{TYCCL}$$

or

$$C^{target} \notin SKB_{TYCCL}$$

Case 3: $C^{source}$ and $C^{target}$ are both combined concepts, that is:

$$C^{source} \notin SKB_{TYCCL}$$

and

$$C^{target} \notin SKB_{TYCCL}$$

For case 1, formula (3) is used to calculate the semantic similarity of the two concepts. For cases 2 and 3, the present invention first represents the two word-string sequences to be compared in the form of a scoring matrix, with the two sequences serving as the two dimensions of the dynamic programming matrix. For the concepts $C^{source}$ and $C^{target}$ in the ontologies $O^{source}$ and $O^{target}$ to be mapped, the i-th row of the scoring matrix M corresponds to the atomic concept $AC_{i}^{source}$ in the word-string sequence $CC^{source}$, and the j-th column corresponds to the atomic concept $AC_{j}^{target}$ in the word-string sequence $CC^{target}$, where i ≤ m and j ≤ n; the element in row i and column j of the dynamic programming matrix M is denoted $M_{ij}$;

First, the scoring function f is defined as shown in formula (11):

The recursion rule is shown in formula (12):

$$M_{ij} = \max \begin{cases} M_{(i + 1)(j + 1)} + f(AC_{i}^{source}, AC_{j}^{target}) \\ M_{i(j + 1)} + p \\ M_{(i + 1)j} + p \end{cases} \tag{12}$$

$$SIM_{NW}(CC^{source'}, CC^{target'}) = \sum_{i = 1}^{L^{cc'}} \frac{f(AC_{i}^{source'}, AC_{i}^{target'})}{L^{cc'}} \tag{13}$$
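The alignment-based similarity of formulas (12) and (13) can be illustrated with a short Python sketch of global sequence alignment. Several points are assumptions rather than part of the claims: the sketch uses the textbook Needleman-Wunsch recurrence, filled from the top-left corner with indices i-1 and j-1, which may differ in index convention from formula (12) as printed; the scoring function f of formula (11) is not reproduced in this section, so it is passed in as the caller-supplied `score`; and positions aligned against a gap contribute `gap_value` (default 0.0) to the average in formula (13). The names `align`, `sim_nw` and `gap_value` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(source: Sequence[str], target: Sequence[str],
          score: Callable[[str, str], float], gap: float) -> List[Pair]:
    """Global alignment of two atomic-concept sequences; None marks a gap."""
    m, n = len(source), len(target)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(source[i - 1], target[j - 1]),
                          M[i - 1][j] + gap,     # gap in target
                          M[i][j - 1] + gap)     # gap in source
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                M[i][j] == M[i - 1][j - 1] + score(source[i - 1], target[j - 1])):
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or M[i][j] == M[i - 1][j] + gap):
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def sim_nw(source: Sequence[str], target: Sequence[str],
           score: Callable[[str, str], float], gap: float,
           gap_value: float = 0.0) -> float:
    """Formula (13): mean of f over the aligned positions (gaps count as gap_value)."""
    pairs = align(source, target, score, gap)
    if not pairs:
        return 0.0
    total = sum(gap_value if a is None or b is None else score(a, b)
                for a, b in pairs)
    return total / len(pairs)
```

A possible call, with an exact-match scorer standing in for f, is `sim_nw(["北京", "大学"], ["北京", "师范", "大学"], lambda a, b: 1.0 if a == b else 0.0, gap=-1.0)`. In practice `score` would be replaced by the atomic-concept similarity that formula (11) defines.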