CN101847141A

CN101847141A - Method for measuring semantic similarity of Chinese words

Info

Publication number: CN101847141A
Application number: CN 201010191677
Authority: CN
Inventors: 张玥杰; 彭琳; 金城; 薛向阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2010-06-03
Filing date: 2010-06-03
Publication date: 2010-09-29

Abstract

The invention belongs to the technical field of natural language processing and discloses a method for measuring the semantic similarity of Chinese words. The method comprises: first, extracting the rich semantic information of HowNet by means of HowNet's KDML language; second, calculating sememe similarity with an optimized sememe similarity formula; and finally, calculating the similarity between concepts with a maximum matching algorithm to obtain the semantic similarity of the Chinese words. Compared with traditional methods, the method discriminates semantic similarity better, and its results agree more closely with human subjective judgment.

Description

Method for measuring semantic similarity of Chinese words

Technical field

The invention belongs to the technical field of natural language processing and specifically relates to a method for measuring the semantic similarity of words.

Background technology

The words of a natural language are linked by very complex relations, such as synonymy, converse, antonymy, whole-part and hypernym-hyponym relations. In practical applications there is an urgent need to measure these complex relations accurately with one simple quantity, namely word semantic similarity (Word Lexical Semantic Similarity). Word semantic similarity is widely used in many fields, such as image retrieval, text classification, word sense disambiguation and machine translation. It uses a single numerical value to measure how semantically similar two words are, and it is precisely the complexity of the relations between words that makes accurate measurement so difficult. Dictionary-based semantic knowledge sources are created by hand and are "correct, unbiased and complete"; the corpora used for semantic computation, by contrast, suffer from domain imbalance and data sparseness, and no sufficiently large semantically annotated corpus is available. Statistics-based measures of word semantic similarity therefore have many limitations, while measures based on machine-readable dictionaries, which are intuitive, simple and not restricted to a particular word domain, show their advantages more and more as large-scale semantic dictionaries continue to improve.

Research strategies for measuring word semantic similarity, in China and abroad, fall into two broad classes: those based on machine-readable dictionaries and those based on statistics.

Measuring word semantic similarity with a machine-readable dictionary is a rationalist approach grounded in linguistics and artificial intelligence. A semantic dictionary is organized according to the hierarchical relations between concepts, and similarity is calculated from relations such as the hypernym-hyponym and coordinate relations between concepts in this kind of linguistic resource.

Measuring word semantic similarity with statistics is an empirical approach. It rests on the hypothesis that words with similar meanings should also have similar contexts. The approach gathers statistics from a large-scale corpus and uses the probability distribution of contextual information as the main basis for measuring word semantic similarity, so some of the literature calls it distributional similarity (Distributional Similarity). The vector space model is one of the more widely used statistics-based methods. The model selects a set of feature words in advance and then calculates, over a large real-world corpus, the correlation between this set of feature words and each word (generally measured by how often the feature words appear in the context of that word). Each word thus receives a feature vector of correlations, and the similarity between vectors (commonly the cosine) is taken as the similarity of the words. Most current methods, however, are rather slow. In addition, because of the limitations of the corpus or of the machine-readable dictionary, the accuracy of semantic similarity measurement still leaves much room for improvement.
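The vector space model described above can be illustrated with a minimal sketch. It is not part of the invention; the feature words and the co-occurrence counts are hypothetical and serve only to show how a cosine is taken between two context vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a word never seen near any feature word
    return dot / (norm_u * norm_v)

# Hypothetical counts of how often three feature words occur
# in the context of each target word.
doctor = [10, 2, 0]
nurse = [8, 3, 0]
tennis = [0, 1, 9]

print(cosine_similarity(doctor, nurse))
print(cosine_similarity(doctor, tennis))
```

With these made-up counts, "doctor" and "nurse" share most of their context and score close to 1, while "doctor" and "tennis" score close to 0.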

Summary of the invention

The objective of the invention is to provide a method for measuring the semantic similarity of Chinese words that is fast to compute and accurate.

The invention proposes a method for measuring the semantic similarity of Chinese words, namely a new word semantic similarity measure based on HowNet. Compared with earlier algorithms of the same kind, the method makes fuller use of HowNet's KDML (Knowledge Database Mark-up Language) to extract the rich semantic information of HowNet, combines layer-by-layer calculation with maximum matching, and optimizes the sememe similarity algorithm, so that the results are more discriminative and agree more closely with human subjective judgment.

Overview of the KDML language

A word may have several concepts in HowNet. Each concept is represented by one record of the following form:

NO.＝021739

W_C=beats

G_C＝V

E_C=～ball ,～tennis ,～basketball ,～shuttlecock ,～board ,～playing card ,～mahjong ,～swing ,～taijiquan, ball～get very well

W_E＝play

G_E＝V

E_E＝

DEF={exercise|exercise: domain={sport|physical culture}}

In the above record, NO. is the record number; W_C, G_C and E_C are the Chinese word, its part of speech and usage examples; W_E, G_E and E_E are the English word, its part of speech and usage examples; DEF is the semantic expression of the concept, whose form is standardized by the Knowledge Database Mark-up Language (KDML). KDML has the following four important components:

1) Sememe: the words used in KDML descriptions are called sememes. In the record above, "exercise|exercise" and "sport|physical culture" are two sememes, organized according to the KDML syntax rules. Sememes are unambiguous. They are the "most basic units of meaning that cannot easily be divided further", that is, the smallest units of description, and are extracted from Chinese characters (including single-morpheme words).

2) Primary sememe: the first sememe in a semantic expression is also called the primary sememe; in this example it is "exercise|exercise". The primary sememe must state the most basic meaning of the concept and can be regarded as having the strongest descriptive power for the concept.

3) Semantic expression: "DEF={...}" is the core of the whole record. It defines and describes the concept and is called the semantic expression. To ensure the complexity, consistency and accuracy of concept descriptions, KDML is used to standardize how semantic expressions are written.

4) Primary sememe frame: in brief, HowNet also gives most sememes a semantic expression definition, just as it does for words, as shown in the figure. For the sememe "thing|all things", for example, the primary sememe frame is "{entity|entity: {ExistAppear|exist or appear: existent={～}}}", and the description strictly follows the KDML syntax.

In a concept description built with KDML, sememes at different brace levels differ in how strongly they describe the meaning of the word. Sememes in the outer braces describe the concept more generally and have stronger descriptive power; sememes in the inner braces are specific explanations of the sememe one level above them, so they describe the concept only indirectly and their descriptive power is relatively weak. They should therefore be treated differently when word semantic similarity is measured.

Sememe similarity measurement

As the foundation of word similarity measurement, sememe similarity is calculated from the sememe hierarchy (that is, the hypernym-hyponym relation). Based on this tree-shaped hierarchy, the path length between nodes is considered together with the hierarchy depth of the nodes, which gives the sememe similarity formula:

Sim(S_1, S_2) = \frac{\alpha \times \min(Depth(S_1), Depth(S_2))}{\alpha \times \min(Depth(S_1), Depth(S_2)) + Dist(S_1, S_2)} \quad (1)

where S₁ and S₂ denote two sememes; Dist(S₁, S₂) is the path length between the two sememes; α is an adjustable parameter that denotes the path length at which the similarity equals 0.5; Depth(S₁) and Depth(S₂) are the hierarchy depths of S₁ and S₂; and min(Depth(S₁), Depth(S₂)) takes the smaller of the two depths. Sememes differ in how much semantic information they carry: nodes at lower levels carry richer semantic information, while nodes at higher levels are more abstract, so sememes at different levels should be treated differently.
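A minimal sketch of formula (1) follows. The small hypernym tree, the convention that the root has depth 1, and the default value of α are all assumptions made for illustration; they are not taken from HowNet or from the invention. Note that under formula (1) as written, the similarity equals 0.5 when Dist(S₁, S₂) = α × min(Depth(S₁), Depth(S₂)).

```python
# Toy hypernym tree (child -> parent); hypothetical, not real HowNet data.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "artifact": "thing",
}

def path_to_root(s):
    """Nodes from s up to the root, inclusive."""
    path = []
    while s is not None:
        path.append(s)
        s = PARENT[s]
    return path

def depth(s):
    """Hierarchy depth; the root is given depth 1 (an assumption)."""
    return len(path_to_root(s))

def dist(s1, s2):
    """Number of edges on the tree path between s1 and s2."""
    up1 = path_to_root(s1)
    up2 = path_to_root(s2)
    ancestors2 = set(up2)
    for steps1, node in enumerate(up1):
        if node in ancestors2:
            return steps1 + up2.index(node)
    raise ValueError("sememes are not in the same tree")

def sememe_sim(s1, s2, alpha=1.6):
    """Formula (1): depth-weighted path similarity (alpha is illustrative)."""
    scaled = alpha * min(depth(s1), depth(s2))
    return scaled / (scaled + dist(s1, s2))

print(sememe_sim("human", "animal"))   # siblings deep in the tree
print(sememe_sim("entity", "thing"))   # parent and child near the root
```

In this toy tree a parent-child pair near the root scores lower than a parent-child pair further down, which reflects the remark that lower-level nodes carry richer semantic information.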

Measurement of semantic similarity

Chinese words are often polysemous, so word semantic similarity should be calculated from the similarity between concepts: the semantic similarity of two words taken in isolation (not in any context) is the maximum of the similarities between all pairs of their concepts.

Sim(W_1, W_2) = \max_{i = 1 \ldots n,\ j = 1 \ldots m} Sim(C_{1i}, C_{2j}) \quad (2)

where W₁ denotes word 1, which has n concepts, and W₂ denotes word 2, which has m concepts; C_{1i} is the i-th concept of W₁ and C_{2j} is the j-th concept of W₂. Following the structure of KDML, the semantic similarity of two concepts is calculated in three parts:

Sim(C_1, C_2) = w_1 \times P_1 + w_2 \times P_2 + w_3 \times P_3 \quad (3)

where P₁ is the similarity between the primary sememes of the two concepts; P₂ is the similarity of the whole semantic expressions; P₃ is the similarity calculated over the primary sememe frames of the two DEFs; and w₁, w₂ and w₃ are the weights of the three parts, which must satisfy the constraints w₁ + w₂ + w₃ = 1, w₂ > w₁ and w₂ > w₃.
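Formulas (2) and (3) can be sketched as two small functions. The default weights below are hypothetical values chosen only to satisfy the stated constraints, and the pairwise concept similarity is passed in as a function so that the sketch makes no assumption about how P₁, P₂ and P₃ are obtained.

```python
def concept_sim(p1, p2, p3, w1=0.3, w2=0.5, w3=0.2):
    """Formula (3): weighted sum of the three partial similarities."""
    # Enforce the constraints w1 + w2 + w3 = 1, w2 > w1 and w2 > w3.
    if abs(w1 + w2 + w3 - 1.0) > 1e-9 or not (w2 > w1 and w2 > w3):
        raise ValueError("weights must sum to 1 with w2 the largest")
    return w1 * p1 + w2 * p2 + w3 * p3

def word_sim(concepts1, concepts2, pair_sim):
    """Formula (2): maximum similarity over all pairs of concepts."""
    return max(pair_sim(c1, c2) for c1 in concepts1 for c2 in concepts2)

# Hypothetical concept-pair similarities for a word with two concepts
# and a word with one concept.
table = {("a1", "b1"): 0.2, ("a2", "b1"): 0.9}
print(word_sim(["a1", "a2"], ["b1"], lambda x, y: table[(x, y)]))
print(concept_sim(1.0, 0.5, 0.8))
```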

The calculation of concept similarity is illustrated with the words "infant" and "pediatrician", whose semantic expressions are, respectively:

DEF={human|people: domain={medical|doctor}, modifier={child|juvenile}, {SufferFrom|suffers from: experience={～}}, {doctor|cures: patient={～}}}

DEF={human|people: HostOf={Occupation|position}, domain={medical|doctor}, {doctor|cures: agent={～}, patient={human|people: modifier={child|juvenile}}}}

In formula (3), P₁ is the similarity between the two primary sememes, that is, between the first sememes of the two semantic expressions, here "human|people" and "human|people"; it is calculated by formula (1). As explained above, the primary sememe describes the concept most directly, so it is well worth treating it as a separate part of the concept similarity.

For P₂, the semantic expression is a complete unit with its own syntax rules, so its similarity must be calculated as a whole with reference to the KDML rules. This is the most complex part of the whole measure and the one with the largest weight, because the entire semantic expression has to be considered. The calculation has two stages. First, the sememes in each semantic expression are divided into layers according to the KDML syntax rules (the braces "{ }" determine the layer a sememe belongs to), and ZeroRole is prefixed to every sememe that has no dynamic role, as shown in Table 1. Then the similarity of each layer is calculated by maximum matching.
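The first stage, dividing an expression into layers by brace depth, might be sketched as follows. This is an illustration under assumptions: the expression is written without the leading "DEF=", its braces are assumed to be balanced, and the helper name split_layers and the regular expression are hypothetical rather than part of KDML or of the HowNet interface.

```python
import re
from collections import defaultdict

# An opening brace, optionally preceded by "role=", followed by its head
# (the text up to the next ':', ',', '{' or '}').
OPEN = re.compile(r"(?:(\w+)=)?\{([^:,{}]*)")

def split_layers(expr):
    """Group (role, sememe) items of a brace-balanced expression by depth."""
    layers = defaultdict(list)
    depth = 0
    pos = 0
    while pos < len(expr):
        m = OPEN.match(expr, pos)
        if m:
            depth += 1
            # A sememe with no dynamic role in front of it gets ZeroRole.
            role = m.group(1) or "ZeroRole"
            layers[depth].append((role, m.group(2).strip()))
            pos = m.end()
        else:
            if expr[pos] == "}":
                depth -= 1
            pos += 1
    if depth != 0:
        raise ValueError("unbalanced braces")
    return dict(layers)

# The "pediatrician" expression, written with balanced braces (assumption).
PEDIATRICIAN = (
    "{human|people: HostOf={Occupation|position}, domain={medical|doctor}, "
    "{doctor|cures: agent={~}, patient={human|people: "
    "modifier={child|juvenile}}}}"
)

layers = split_layers(PEDIATRICIAN)
for level in sorted(layers):
    print(level, layers[level])
```

In this sketch the second layer contains the HostOf and domain items plus the "doctor|cures" sememe tagged with ZeroRole, which matches the second-layer items named in the maximum matching example.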

Table 1: Sememe hierarchy of "infant" and "pediatrician"

Take the second layer of Table 1 as an example of maximum matching. First, the similarity of every pair of sememes is calculated and the pair with the largest value is selected. In this example the pairs "domain={medical|doctor}, domain={medical|doctor}" and "ZeroRole=doctor|cures, ZeroRole=doctor|cures" both have similarity 1, so either may be chosen, for instance "domain={medical|doctor}, domain={medical|doctor}". Second, the pair with the largest similarity among the remaining sememes is selected, which is "ZeroRole=doctor|cures, ZeroRole=doctor|cures". Continuing in the same way, the third round selects "modifier={child|juvenile}, HostOf={Occupation|position}" and the fourth round selects "SufferFrom|suffers from, NULL". When the two concepts have different numbers of sememes in the same layer, some sememe is paired with an empty element; every such pair is assigned the same small value r (a preset parameter). Finally, the similarities of the selected pairs are summed and averaged, which gives the value of the P₂ part.
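The matching procedure for one layer can be sketched as a greedy loop. The pair similarity function, the value of r and the item strings below are hypothetical; in the method itself the pair similarities would come from formula (1).

```python
def layer_similarity(items_a, items_b, pair_sim, r=0.2):
    """Greedy maximum matching of two layers, averaged over all pairs."""
    if not items_a and not items_b:
        return 1.0  # two empty layers treated as identical (assumption)
    # All candidate pairs, best first; ties keep their original order.
    candidates = sorted(
        ((pair_sim(a, b), i, j)
         for i, a in enumerate(items_a)
         for j, b in enumerate(items_b)),
        key=lambda t: t[0],
        reverse=True,
    )
    used_a, used_b, scores = set(), set(), []
    for score, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            scores.append(score)
    # Items left without a partner are paired with an empty element
    # and receive the preset small value r.
    leftovers = abs(len(items_a) - len(items_b))
    scores.extend([r] * leftovers)
    return sum(scores) / len(scores)

# Second layer of the two example concepts, as plain strings.
infant = ["domain=medical", "modifier=child",
          "ZeroRole=SufferFrom", "ZeroRole=doctor"]
pediatrician = ["HostOf=Occupation", "domain=medical", "ZeroRole=doctor"]

# Toy pair similarity: 1.0 for identical items, 0.1 otherwise.
toy_sim = lambda a, b: 1.0 if a == b else 0.1
print(layer_similarity(infant, pediatrician, toy_sim))
```

With the toy similarity, the two identical pairs are matched first, then "modifier=child" with "HostOf=Occupation", and the leftover "ZeroRole=SufferFrom" is paired with an empty element, so the layer score is (1 + 1 + 0.1 + 0.2) / 4.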

P₃ is calculated in the same way as P₂. Measuring similarity over the primary sememe frames is in effect another way of calculating the similarity of the primary sememes, and it again emphasizes the direct descriptive power of the primary sememe for the concept.

Finally, from these three partial similarities, the semantic similarity of every pair of concepts is calculated by formula (3), and the maximum is then taken by formula (2) as the semantic similarity of the two words.

Description of drawings

Fig. 1 illustrates the definition of a semantic expression.

Fig. 2 is a flow chart of the algorithm.

Embodiment

The flow of the method is shown in Fig. 2. The steps are:

a) input two words;

b) obtain all records of the two words from HowNet;

c) extract the DEF expressions from the records;

d) calculate the similarity of the sememes in the expressions according to formula (1);

e) calculate the similarity of each pair of DEF expressions according to formula (3);

f) calculate the similarity of the two words according to formula (2).

For step b), the officially provided C/C++ development interface can be used. The HowNet system provides a bilingual Chinese-English knowledge dictionary and the libraries needed for development.

Looking up the records of a word in HowNet takes the following steps:

(1) First call HowNet_Initial(). This function initializes the data of the HowNet knowledge system and must be called before any other function. It requires the index file hownet.idx of the HowNet knowledge system; if this file does not exist, initialization fails.

(2) Then call HowNet_Search_Keyword(char* ApStr, S_SEARCH_MODE sHowNet_SearchMode) to search for the keyword. The ApStr variable is the keyword to search for, and the function returns the number of records found in the HowNet knowledge base for the specified search mode and keyword. To obtain the actual search results, this function is normally used together with HowNet_Get_SearchResult and HowNet_Get_Unit_Item.

(3) Next call HowNet_Get_SearchResult() to obtain the search results. The return value is an array of record numbers: the function places the record numbers of all records found by HowNet_Search_Keyword into an array and returns it.

Finally, using the record numbers obtained in step (3), call HowNet_Get_Unit_Item(const DWORD AdwUnitID, const BYTE AItemID, char* ApRlt) to obtain a specified part of a specified record. The AdwUnitID variable is the record number, and AItemID specifies which part of the record to retrieve; the value HOWNET_ITEM_ID_ALL denotes the complete content of the record.

For step d), it suffices to take the sememes of the form "" out of the DEF expression and apply the calculation to them.
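As a rough illustration of step d), the sketch below pulls sememe labels out of a DEF string. It assumes that each sememe is written as an English label followed by "|" and a gloss, as in the sample record given earlier; the pattern and the function name are hypothetical and are not part of the HowNet interface.

```python
import re

# An English sememe label is assumed to be a run of letters before '|'.
SEMEME = re.compile(r"([A-Za-z]+)\|")

def extract_sememes(def_expr):
    """Return the English sememe labels in order of appearance."""
    return SEMEME.findall(def_expr)

sample = "DEF={exercise|exercise: domain={sport|physical culture}}"
print(extract_sememes(sample))
```

Each extracted label can then be looked up in the sememe hierarchy and passed to formula (1).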

Claims

1. A method for calculating the semantic similarity of Chinese words, characterized in that the specific steps are: first, extracting the rich semantic information of HowNet by means of HowNet's KDML language; then calculating sememe similarity with a sememe similarity formula; and finally calculating the similarity between concepts with a maximum matching algorithm, thereby obtaining the semantic similarity of the Chinese words; wherein:

The sememe similarity formula is:

Sim(S_1, S_2) = \frac{\alpha \times \min(Depth(S_1), Depth(S_2))}{\alpha \times \min(Depth(S_1), Depth(S_2)) + Dist(S_1, S_2)} \quad (1)

where S₁ and S₂ denote two sememes; Dist(S₁, S₂) is the path length between the two sememes; α is an adjustable parameter that denotes the path length at which the similarity equals 0.5; Depth(S₁) and Depth(S₂) are the hierarchy depths of S₁ and S₂; and min(Depth(S₁), Depth(S₂)) takes the smaller of the two depths;

The formula of the maximum matching algorithm is:

Sim(W_1, W_2) = \max_{i = 1 \ldots n,\ j = 1 \ldots m} Sim(C_{1i}, C_{2j}) \quad (2)

where W₁ denotes word 1, which has n concepts, and W₂ denotes word 2, which has m concepts; C_{1i} is the i-th concept of W₁ and C_{2j} is the j-th concept of W₂; following the structure of KDML, the semantic similarity of two concepts is calculated in three parts:

Sim(C_1, C_2) = w_1 \times P_1 + w_2 \times P_2 + w_3 \times P_3 \quad (3)

where P₁ is the similarity between the primary sememes of the two concepts; P₂ is the similarity of the whole semantic expressions; P₃ is the similarity calculated over the primary sememe frames of the two DEFs; and w₁, w₂ and w₃ are the weights of the three parts, which satisfy the constraints w₁ + w₂ + w₃ = 1, w₂ > w₁ and w₂ > w₃.