CN109582961A

CN109582961A - An efficient robot data similarity calculation algorithm

Info

Publication number: CN109582961A
Application number: CN201811433367.7A
Authority: CN
Inventors: 罗志勇; 范志鹏; 赵杰; 王月; 韩冷; 于士杰; 郑焕平; 蔡婷
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Chongqing University of Posts and Telecommunications
Priority date: 2018-11-28
Filing date: 2018-11-28
Publication date: 2019-04-05

Abstract

The invention claims an efficient robot data similarity calculation algorithm. It addresses the situation in which multiple different ontologies exist within one field, which causes interoperability problems between heterogeneous ontologies. The method defines ontology preprocessing, mapping candidate generation, word similarity calculation for concept terms, multiple-mapping merging and mapping result output. Among the various mapping methods, concept similarity calculation based on word composition features needs essentially no corpus resources other than the words themselves, so the calculation is direct and fast. However, current related methods still have problems: similarity is hard to calculate for variant words that are semantically identical but not spelled identically and for synonyms, and the strategy for assigning weights to the constituent words of the concept terms to be matched is imperfect. To address these problems, the invention proposes an efficient robot data similarity calculation algorithm that improves the overall effect of ontology mapping.

Description

An efficient robot data similarity calculation algorithm

Technical field

The invention belongs to the field of computer information processing, and in particular relates to an efficient robot data similarity calculation algorithm.

Background art

Ontology mapping methods can generally be grouped into the following four kinds:

(1) Methods that compare the similarity between the objects to be mapped by means of concept similarity calculation, in order to find connections between heterogeneous ontologies. For example, Rodríguez et al. proposed a method for calculating concept similarity from concept definitions: a concept in an ontology is divided into three parts, namely the synonym set representing the concept, the feature set characterising the concept, and the set of semantic relations of the concept. Similarity is calculated from these three parts, and the mapping relations between concepts are finally determined.

(2) Methods that analyse the structural similarity between heterogeneous ontologies and find mapping relations by writing mapping rules. Sunna et al. proposed a method that uses the ontology structure graph as contextual information to realise ontology mapping. Besides the information of a node itself, this method also refers to multi-level information such as its parent, child and grandchild nodes.

(3) Methods that use the instances in an ontology, together with techniques such as machine learning, to find mapping relations between ontologies. A typical example is the GLUE system proposed by Doan et al. of the University of Washington. This method takes the various kinds of heterogeneity of ontologies into account: it classifies the instances of concepts by machine learning, uses the joint probability distribution of instances over concepts to calculate the similarity between concepts, and finally determines the mapping relations by combining domain constraints and heuristic knowledge.

(4) Ontology mapping methods that combine several approaches. Shailendra et al. developed a hybrid ontology mapping engine that integrates a syntax-based ontology mapping algorithm, a background-knowledge-based ontology mapping algorithm and a structure-based ontology mapping algorithm. Mapping from several angles exploits the advantages of each method while compensating for their individual shortcomings. Gao Ming et al. improved similarity calculation in four respects (ontology concept name, structure, instance and attribute) and proposed a fused similarity calculation method. Li Jia et al. proposed a method that uses HowNet in combination with several lexical similarity calculations, and implemented the ELOMC system for Chinese ontology mapping.

Among the above ontology mapping methods, those that find connections between ontologies through concept similarity calculation generally need only the words and phrases that express the ontology concepts as input, so their demands on computing conditions are low. In addition, such methods can perform ontology mapping on their own and can also be easily integrated with other types of methods, which makes them flexible and widely used. The core of this class of ontology mapping methods is word-based similarity calculation. Existing word-based similarity calculation methods can be divided into the following four categories:

(1) Cosine methods. The cosine similarity algorithm measures the degree of similarity between two n-dimensional vectors by calculating the cosine of the angle between them. It is widely applicable and simple to implement, and is one of the most widely used word similarity algorithms.
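As an illustration of the cosine method, the following minimal Python sketch compares two words through their character-bigram count vectors. The choice of character bigrams as vector dimensions and the names `char_bigrams` and `cosine_similarity` are assumptions made for this example; the text above does not specify how words are turned into vectors.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Count the character bigrams of a lower-cased word (assumed vectorisation)."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_similarity(char_bigrams("Mutation"), char_bigrams("Variation")))
```

The two example words share only the bigrams of the suffix "ation", so the score is well below 1 even though the words are near-synonyms, which is the weakness of purely character-based measures noted later in this section.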

(2) Edit distance methods. Among these, the Smith-Waterman algorithm proposed by Smith and Waterman in 1981 is the most representative. The algorithm does not directly calculate the overall difference between two sequences. Instead, it inserts gaps at different positions of the shorter sequence, evaluates the alternatives and takes the best one, so that the highest similarity value found for the two sequences is the final result. The Jaro-Winkler distance algorithm proposed by Winkler is another typical edit distance algorithm. It improves on the Jaro distance algorithm by taking into account not only the number of matching characters but also the positions at which the characters match.
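The local-alignment idea behind Smith-Waterman can be sketched as follows. This is a minimal version with a linear gap penalty; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the normalisation in `smith_waterman_similarity` are assumptions chosen for the example, not values given in the text.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def smith_waterman_similarity(a, b, match=2):
    """Score divided by the best score the shorter string could reach (assumed normalisation)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shorter)


if __name__ == "__main__":
    print(smith_waterman_similarity("gene", "genes"))
```

Because the alignment is local, a word that is fully contained in a longer one reaches the maximum normalised score.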

(3) Similarity calculation methods based on word roles. The words that make up an ontology concept name, namely the centre word (head) and the qualifiers (modifiers) of the name, contribute differently to similarity, and this difference can serve as an important indicator when measuring word similarity. In a term similarity task in the biomedical field, Nenadic et al. represented each term as a two-part word description structure consisting of the key word of the term and its modifying components, and then compared the word description structures of two terms with a weighted Dice-like coefficient. Terms that share longer common components obtain higher scores; if two terms have the same key word, an additional score is added to the similarity measure.

(4) Methods that combine several concept similarity calculations. Noessner et al. proposed the integrated word similarity method of CODI, which combines the cosine similarity, Levenshtein similarity, Jaro-Winkler distance similarity, Smith-Waterman similarity, overlap coefficient similarity and Jaccard similarity algorithms with specified weights, in order to improve concept term similarity calculation during ontology matching.
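The idea of integrating several measures with specified weights can be sketched as a weighted average. The example below uses only two of the measures named above (Jaccard and the overlap coefficient) on word sets, and the weights are hypothetical; the actual weights used by CODI are not given in the text.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Overlap coefficient: |A and B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def combined_similarity(a, b, measures):
    """Weighted average of several measures; `measures` is a list of (function, weight)."""
    total = sum(weight for _, weight in measures)
    return sum(weight * func(a, b) for func, weight in measures) / total


if __name__ == "__main__":
    t1 = ["gene", "mutation"]
    t2 = ["gene", "variation"]
    # Hypothetical equal weights, for illustration only.
    print(combined_similarity(t1, t2, [(jaccard, 0.5), (overlap_coefficient, 0.5)]))
```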

In experiments that applied the above concept similarity calculations to domain ontology mapping, existing methods were found to be deficient in two respects:

(1) For variant words such as "Mutation" and "Variation" (both meaning variation), and for synonyms and near-synonyms, existing word-based similarity calculation methods can hardly obtain correct mapping relations. In particular, when such words are the centre words of the terms that express ontology concepts, the calculation error causes the corresponding ontology concepts to be mismatched.

(2) Existing algorithms distribute the similarity weights of the qualifiers in a term evenly, which gives the similarity results poor discrimination.

The present invention proposes an efficient robot data similarity calculation algorithm. It introduces WordNet synonym and near-synonym retrieval together with edit distance calculation, to overcome the difficulty that related algorithms have with variant words, synonyms and near-synonyms. It also redesigns the method for distributing weights between the centre word and the qualifiers of a term so that the weight distribution is more reasonable, thereby effectively improving the overall effect of ontology concept mapping.

Summary of the invention

The present invention aims to solve the above problems of the prior art. It proposes an efficient robot data similarity calculation algorithm that requires few conditions, is quick and easy to deploy, and can effectively improve the calculation effect. The technical solution of the invention is as follows:

An efficient robot data similarity calculation algorithm, comprising the following steps:

a. importing the ontologies to be matched and preprocessing them;

b. mapping candidate generation: analysing factors including the similarity between the concept features and the concept instances of the preprocessed ontologies, and selecting candidate concept mapping pairs;

c. performing similarity calculation on the term description set pairs formed by pairwise combination between the ontologies, using a method that combines edit distance similarity calculation based on the constituent words of terms, WordNet synonym and near-synonym retrieval based on centre words, and automatic weight assignment based on the centre word and qualifiers of a term;

d. selecting, from the qualifying concept groups, the set of concept mapping pairs with the highest similarity, outputting it as the ontology mapping result, and formatting, outputting and storing the ontology mapping result.

Further, the ontology preprocessing in step a specifically comprises: importing the ontologies to be matched; parsing the terms in the ontologies, including class concepts, attributes, instances and attribute instances; and performing feature extraction and formatted storage, in preparation for the subsequent matching operations.

Further, the mapping candidate generation in step b draws on the mapping candidate generation method used by Huber et al. in the CODI ontology mapping method. Its main steps are: (1) analysing factors such as the similarity between concept features and concept instances across the ontologies; (2) selecting and combining the possible candidate concept mapping pairs.

Further, the edit distance similarity calculation based on the constituent words of terms specifically comprises: 3.1) similarity calculation at the level of the words that make up a term, which is built on one-to-one similarity calculation between the words of the terms. Therefore, the best matching correspondence between the words of the terms is found by constructing a word matrix for the two terms. To find the optimal matching relation when matching the word sets of two terms, the following formula is proposed:

sim(w_i, w_j) = d_ω(w_i, w_j), if d_ω(w_i, w_j) ≥ 0.8

where sim(w_i, w_j) denotes the similarity between any pair of words in the two terms, the similarity threshold is 0.8, and d_ω(w_i, w_j) denotes the edit distance between words w_i and w_j obtained from the edit distance formula.

Further, the similarity calculation between a word pair is divided into three cases:

1. if the d_ω value of the two words is less than 0.8, the words are considered not to match, and the similarity is 0;

2. if the d_ω value of the two words is greater than or equal to 0.8, the words are considered to match, and the similarity is d_ω;

3. if the two words are the centre words of their respective terms, WordNet is first searched to judge whether the two are synonyms, near-synonyms or antonyms of each other; if so, the return value is 1; otherwise the similarity is calculated according to the two cases above.
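The three cases can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the invention: d_ω is taken to be a normalised edit similarity, 1 minus the Levenshtein distance divided by the length of the longer word (the text thresholds d_ω at 0.8, which suggests a value in [0, 1]), and the WordNet retrieval is replaced by a small hypothetical lookup table `RELATED_WORDS`.

```python
THRESHOLD = 0.8  # similarity threshold stated in the text

# Hypothetical stand-in for a WordNet synonym / near-synonym lookup.
RELATED_WORDS = {frozenset({"mutation", "variation"})}


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed form of d_ω: 1 - distance / length of the longer word."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def are_related(a, b):
    """True if the words are identical or listed in the stand-in lookup table."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset({a, b}) in RELATED_WORDS


def word_similarity(a, b, both_centre_words=False):
    """Case 3 first (centre words found in the lookup), then cases 1 and 2."""
    if both_centre_words and are_related(a, b):
        return 1.0
    d = edit_similarity(a, b)
    return d if d >= THRESHOLD else 0.0


if __name__ == "__main__":
    print(word_similarity("Mutation", "Variation", both_centre_words=True))
    print(word_similarity("Mutation", "Variation"))
    print(word_similarity("colour", "colours"))
```

With the lookup, the centre words "Mutation" and "Variation" are matched even though their edit similarity is below the threshold; as qualifiers they would fall under case 1.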

Further, the word similarity weight distribution specifically comprises: parsing the term, finding the noun located deepest in the parse tree and taking that word as the centre word of the term, with the other words as qualifiers; after the centre word and the qualifiers of the term have been correctly distinguished, a similarity weight is assigned to each word using the following formula;

where w_t1i and w_t2j denote the i-th word of term t₁ and the j-th word of term t₂ respectively, d(w_t1i) denotes the distance between word w_t1i and the centre word of term t₁, and Weight(w_t1i, w_t2j) denotes the weight assigned to a matched word pair of the two terms when calculating similarity.
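The centre-word identification step can be sketched as follows. The parse tree is written by hand as nested tuples, since no particular parser is named in the text. The tie-break (the rightmost noun wins among equally deep nouns) and the use of positional distance for d(w) are assumptions of this sketch, and it does not attempt the weight formula itself.

```python
def flatten(tree, depth=0):
    """Yield (depth, pos_tag, word) for every leaf of a nested-tuple parse tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield depth, label, children[0]  # leaf of the form (POS tag, word)
    else:
        for child in children:
            yield from flatten(child, depth + 1)


def split_term(tree):
    """Return (centre_word, [(qualifier, distance_to_centre), ...])."""
    leaves = list(flatten(tree))
    nouns = [(depth, i) for i, (depth, tag, _) in enumerate(leaves)
             if tag.startswith("NN")]
    if not nouns:
        return None, []
    # Deepest noun; among equally deep nouns the rightmost one is chosen (assumption).
    _, centre_idx = max(nouns)
    qualifiers = [(word, abs(i - centre_idx))
                  for i, (_, _, word) in enumerate(leaves) if i != centre_idx]
    return leaves[centre_idx][2], qualifiers


if __name__ == "__main__":
    # Hand-built tree for the term "mobile robot arm".
    tree = ("NP", ("JJ", "mobile"), ("NN", "robot"), ("NN", "arm"))
    print(split_term(tree))
```

For the example term the sketch takes "arm" as the centre word and returns "mobile" and "robot" as qualifiers at distances 2 and 1.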

Further, the comprehensive word similarity calculation formula is:

where t₁ and t₂ are the pair of terms whose similarity is to be calculated, sim(w_t1i, w_t2j) denotes the similarity between the i-th word of term t₁ and the j-th word of term t₂, Weight(w_t1i, w_t2j) denotes the similarity weight assigned to w_t1i and w_t2j according to their distances from the centre words of their respective terms, and l in the denominator ranges over [0, Max(|t₁|, |t₂|) - 1], which ensures that the overall result always lies within [0, 1].

Further, the multiple-mapping merging and mapping result output in step d specifically comprises: setting an optimal confidence threshold according to analysis of experimental data, and screening out the concept pairs that exceed this threshold. Among the concept pairs of the ontologies to be matched that exceed the threshold, 1:n, m:1 or m:n mapping relations may form between concepts. To convert these into 1:1 mapping relations, so that they can be compared with the mapping result set manually annotated by domain experts, the set of concept mapping pairs with the highest similarity is selected from these qualifying concept groups and output as the ontology mapping result; the ontology mapping result is then formatted, output and stored.
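One simple way to reduce 1:n, m:1 and m:n candidates to 1:1 mappings is a greedy pass over the pairs in descending order of similarity. The greedy strategy, the function name `select_one_to_one` and the example scores are assumptions of this sketch; the text only states that the highest-similarity pairs are selected from those above the threshold.

```python
def select_one_to_one(scored_pairs, threshold):
    """Keep pairs above the threshold, then greedily pick the best pair per concept.

    scored_pairs: iterable of (concept_a, concept_b, similarity).
    """
    kept = sorted((p for p in scored_pairs if p[2] > threshold),
                  key=lambda p: p[2], reverse=True)
    used_a, used_b, result = set(), set(), []
    for a, b, score in kept:
        # Skip a pair if either concept is already mapped with a higher score.
        if a not in used_a and b not in used_b:
            result.append((a, b, score))
            used_a.add(a)
            used_b.add(b)
    return result


if __name__ == "__main__":
    candidates = [
        ("A1", "B1", 0.90), ("A1", "B2", 0.85),
        ("A2", "B1", 0.70), ("A2", "B2", 0.60),
        ("A3", "B3", 0.40),
    ]
    print(select_one_to_one(candidates, threshold=0.5))
```

A greedy pass does not always maximise the total similarity of the final mapping; an assignment algorithm could be substituted where that matters.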

The advantages and beneficial effects of the present invention are as follows:

The innovation of the invention is an efficient robot data similarity calculation algorithm that improves on the shortcomings of concept similarity calculation in ontology mapping. WordNet synonym and near-synonym retrieval and an edit distance algorithm are introduced into the similarity judgement between term centre words, and a new automatic weight distribution method integrates the similarity of the centre word with that of the qualifiers. The concrete steps are: ontology preprocessing, mapping candidate generation, word similarity calculation, multiple-mapping merging and mapping result output. In the word similarity calculation, WordNet synonym and near-synonym retrieval is performed on the centre words of the terms, and edit distance is calculated for the centre words that cannot be retrieved and for all qualifiers, so as to find the correct matching relations between the constituent words of two terms. The similarity weight distribution method then assigns a weight to each matching relation, and the similarity between the terms is finally obtained from the comprehensive calculation formula. The advantages of the method are that it requires few conditions, is quick and easy to deploy, and can effectively improve the calculation effect.

Brief description of the drawings

Fig. 1 is the ontology mapping process diagram based on word similarity calculation according to a preferred embodiment of the present invention;

Fig. 2 is the word similarity calculation flow chart of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the drawings of the embodiments. The described embodiments are only some of the embodiments of the present invention.

The technical solution by which the present invention solves the above technical problems is as follows:

The invention proposes an efficient robot data similarity calculation algorithm, characterised in that an edit distance algorithm is applied to matching the constituent words of ontology concept terms, WordNet synonym and near-synonym retrieval is applied to the similarity judgement between term centre words, and the method for distributing weights between the centre word and the qualifiers of a term is redesigned, so as to improve the overall effect of ontology mapping. The invention is described in further detail below with reference to the drawings and an example.

As shown in Fig. 1, the ontology mapping process based on word similarity calculation is implemented as follows:

1. Ontology preprocessing: the ontologies to be matched are imported, and the terms in the ontologies, such as class concepts, attributes, instances and attribute instances, are parsed, subjected to feature extraction and stored in a formatted way, in preparation for the subsequent matching operations.

2. Mapping candidate generation: factors such as the similarity between concept features and concept instances across the ontologies are analysed, and the possible candidate concept mapping pairs are selected and combined.

3. Word similarity calculation: similarity is calculated for the term description set pairs formed by pairwise combination between the ontologies, using a method that combines edit distance similarity calculation based on the constituent words of terms, WordNet synonym and near-synonym retrieval based on centre words, and automatic weight assignment based on the centre word and qualifiers of a term. The result serves as an important basis for judging concept mappings between the ontologies. The specific steps are as follows:

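3.1) Similarity calculation at the level of the words that make up a term should be built on one-to-one similarity calculation between the words of the terms. It is therefore necessary to find the best matching correspondence between the words of the terms by constructing a word matrix for the two terms. To find the optimal matching relation when matching the word sets of two terms, the following formula is proposed:

sim(w_i, w_j) = d_ω(w_i, w_j), if d_ω(w_i, w_j) ≥ 0.8

where sim(w_i, w_j) denotes the similarity between any pair of words in the two terms; experimental observation showed 0.8 to be a suitable similarity threshold. d_ω(w_i, w_j) denotes the edit distance between words w_i and w_j obtained from the edit distance formula. The similarity calculation between a word pair is divided into three cases: (1) if the d_ω value of the two words is less than 0.8, they are considered not to match and the similarity is 0; (2) if the d_ω value is greater than or equal to 0.8, they are considered to match and the similarity is d_ω; (3) if the two words are the centre words of their respective terms, WordNet is searched first, and if the two are synonyms, near-synonyms or antonyms of each other the return value is 1, otherwise the similarity is calculated according to the first two cases.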
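The word-matrix matching of step 3.1) can be sketched as follows. The word similarity used here, `difflib.SequenceMatcher.ratio()` from the Python standard library, is only a stand-in for d_ω, and the greedy selection of the best cell per row and column is an assumption about how the best matching correspondence is read off the matrix.

```python
from difflib import SequenceMatcher


def toy_sim(a, b):
    """Stand-in word similarity in [0, 1]; not the d_ω of the text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_matrix(term1, term2, sim=toy_sim):
    """Rows are the words of term1, columns the words of term2."""
    return [[sim(w1, w2) for w2 in term2] for w1 in term1]


def best_matches(term1, term2, sim=toy_sim, threshold=0.8):
    """Greedily pair words, best cell first, using each word at most once."""
    matrix = similarity_matrix(term1, term2, sim)
    cells = sorted(((matrix[i][j], i, j)
                    for i in range(len(term1)) for j in range(len(term2))),
                   reverse=True)
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # remaining cells are all below the threshold
        if i not in used_rows and j not in used_cols:
            matches.append((term1[i], term2[j], score))
            used_rows.add(i)
            used_cols.add(j)
    return matches


if __name__ == "__main__":
    print(best_matches(["robot", "arm", "controller"], ["arm", "robots"]))
```

In the example, "arm" pairs with "arm" and "robot" with "robots", while "controller" finds no partner above the 0.8 threshold and stays unmatched.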

3.2) Word similarity weight distribution: the term is parsed, the noun located deepest in the parse tree is found and taken as the centre word of the term, and the other words are taken as qualifiers. After the centre word and the qualifiers have been correctly distinguished, a similarity weight is assigned to each word using the following formula.

3.3) Comprehensive word similarity calculation.

4. Multiple-mapping merging and mapping result output: an optimal confidence threshold is set according to analysis of experimental data, and the concept pairs that exceed this threshold are screened out. Among the concept pairs of the ontologies to be matched that exceed the threshold, 1:n, m:1 or m:n mapping relations may form between concepts. To convert these into 1:1 mapping relations, so that they can be compared with the mapping result set manually annotated by domain experts, the set of concept mapping pairs with the highest similarity is selected from these qualifying concept groups and output as the ontology mapping result. The ontology mapping result is formatted, output and stored, to facilitate later use and the evaluation of the calculation effect.

As shown in Fig. 2, the word similarity calculation flow chart of the invention further refines the word similarity calculation step of Fig. 1. It is implemented as follows:

1. First, two words are obtained from the ontology library;

2. the two words obtained are checked to determine whether they are synonyms;

3. if the two words are synonyms, the return value is 1;

4. if the two words are not synonyms, the distance between them is calculated;

5. if the distance between the two words is less than 0.8, their similarity is 0;

6. if the distance between the two words is greater than or equal to 0.8, their similarity is d_ω (the distance value between the two words).

The above embodiments should be understood as merely illustrating the present invention rather than limiting its scope of protection. After reading the content recorded herein, a person skilled in the art can make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.

Claims

1. An efficient robot data similarity calculation algorithm, characterised by comprising the following steps:

a. importing the ontologies to be matched and preprocessing them;

b. mapping candidate generation: analysing factors including the similarity between the concept features and the concept instances of the preprocessed ontologies, and selecting candidate concept mapping pairs;

c. performing similarity calculation on the term description set pairs formed by pairwise combination between the ontologies, using a method that combines edit distance similarity calculation based on the constituent words of terms, WordNet synonym and near-synonym retrieval based on centre words, and automatic weight assignment based on the centre word and qualifiers of a term;

d. selecting, from the qualifying concept groups, the set of concept mapping pairs with the highest similarity, outputting it as the ontology mapping result, and formatting, outputting and storing the ontology mapping result.

2. The efficient robot data similarity calculation algorithm according to claim 1, characterised in that the ontology preprocessing in step a specifically comprises: importing the ontologies to be matched; parsing the terms in the ontologies, including class concepts, attributes, instances and attribute instances; and performing feature extraction and formatted storage, in preparation for the subsequent matching operations.

3. The efficient robot data similarity calculation algorithm according to claim 1, characterised in that the mapping candidate generation in step b draws on the mapping candidate generation method used by Huber et al. in the CODI ontology mapping method, its main steps being: (1) analysing factors such as the similarity between concept features and concept instances across the ontologies; (2) selecting and combining the possible candidate concept mapping pairs.

4. The efficient robot data similarity calculation algorithm according to claim 1, characterised in that the edit distance similarity calculation based on the constituent words of terms specifically comprises: 3.1) similarity calculation at the level of the words that make up a term, which is built on one-to-one similarity calculation between the words of the terms; therefore, the best matching correspondence between the words of the terms is found by constructing a word matrix for the two terms; to find the optimal matching relation when matching the word sets of two terms, the following formula is proposed:

sim(w_i, w_j) = d_ω(w_i, w_j), if d_ω(w_i, w_j) ≥ 0.8

where sim(w_i, w_j) denotes the similarity between any pair of words in the two terms, the similarity threshold is 0.8, and d_ω(w_i, w_j) denotes the edit distance between words w_i and w_j obtained from the edit distance formula.

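5. The efficient robot data similarity calculation algorithm according to claim 4, characterised in that the similarity calculation between a word pair is divided into three cases:

1. if the d_ω value of the two words is less than 0.8, the words are considered not to match, and the similarity is 0;

2. if the d_ω value of the two words is greater than or equal to 0.8, the words are considered to match, and the similarity is d_ω;

3. if the two words are the centre words of their respective terms, WordNet is first searched to judge whether the two are synonyms, near-synonyms or antonyms of each other; if so, the return value is 1; otherwise the similarity is calculated according to the two cases above.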

6. The efficient robot data similarity calculation algorithm according to claim 4, characterised in that the word similarity weight distribution specifically comprises: parsing the term, finding the noun located deepest in the parse tree and taking that word as the centre word of the term, with the other words as qualifiers; after the centre word and the qualifiers of the term have been correctly distinguished, a similarity weight is assigned to each word using the following formula;

where w_t1i and w_t2j denote the i-th word of term t₁ and the j-th word of term t₂ respectively, d(w_t1i) denotes the distance between word w_t1i and the centre word of term t₁, and Weight(w_t1i, w_t2j) denotes the weight assigned to a matched word pair of the two terms when calculating similarity.

7. The efficient robot data similarity calculation algorithm according to claim 6, characterised in that the comprehensive word similarity calculation formula is:

where t₁ and t₂ are the pair of terms whose similarity is to be calculated, sim(w_t1i, w_t2j) denotes the similarity between the i-th word of term t₁ and the j-th word of term t₂, Weight(w_t1i, w_t2j) denotes the similarity weight assigned to w_t1i and w_t2j according to their distances from the centre words of their respective terms, and l in the denominator ranges over [0, Max(|t₁|, |t₂|) - 1], which ensures that the overall result always lies within [0, 1].

8. The efficient robot data similarity calculation algorithm according to claim 7, characterised in that the multiple-mapping merging and mapping result output in step d specifically comprises: setting an optimal confidence threshold according to analysis of experimental data, and screening out the concept pairs that exceed this threshold; among the concept pairs of the ontologies to be matched that exceed the threshold, 1:n, m:1 or m:n mapping relations may form between concepts; to convert these into 1:1 mapping relations, so that they can be compared with the mapping result set manually annotated by domain experts, the set of concept mapping pairs with the highest similarity is selected from these qualifying concept groups and output as the ontology mapping result; and the ontology mapping result is formatted, output and stored.