CN106610948A - Improved lexical semantic similarity solution algorithm - Google Patents
- Publication number: CN106610948A (application CN201610838940.7A)
- Authority: CN (China)
- Prior art keywords: word, compared, maximum, similarity, context word
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention provides an improved algorithm for computing lexical semantic similarity. After a statistical-method module is initialized, the algorithm extracts a best-matching keyword by comparing the maximum-weight context words of the words to be compared and their similarity, and finally computes the relatedness between the extracted keyword and each word to obtain the similarity result. The computed semantic similarity is largely consistent with semantic similarity as judged by humans, better reflects objective reality, and better satisfies user needs.
Description
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to an improved algorithm for computing lexical semantic similarity.
Background technology
Since the beginning of the 21st century, the global internet industry has entered a new period of rapid development, and new technologies continually emerge. Natural language processing, an important bridge between computers and people, has likewise developed quickly. Traditional semantic relatedness computation methods fall roughly into two classes: semantic similarity methods based on a semantic dictionary and semantic similarity methods based on a corpus. The corpus-based methods are statistical, empiricist methods: the statistical study of semantic similarity is built on observable linguistic facts rather than relying solely on the intuition of linguists. It rests on the assumption that two words are semantically similar if and only if they occur in similar contexts; using a large-scale corpus, the contextual information of a word serves as the reference for measuring semantic similarity. Such methods can measure the semantic similarity between words fairly accurately and effectively, but they depend heavily on the training corpus, are computationally intensive and comparatively complex, and are strongly affected by data sparseness and noise, sometimes producing obvious errors. To meet the above needs, the invention provides an improved lexical semantic similarity algorithm.
Summary of the invention
Addressing the problem of similarity between words, the invention provides an improved lexical semantic similarity algorithm.
In order to solve the above problems, the present invention is achieved by the following technical solutions:
Step 1: Initialize the statistical-method module; the corpus here may be a word dictionary, the Cilin thesaurus, HowNet, Baidu Baike, or similar.
Step 2: Input the words to be compared (c1, c2) into the initialized statistical-method module.
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to (c1, c2).
Step 4: From the similarity between the maximum-weight context words (csx1, csx2) corresponding to (c1, c2), extract the maximum-similarity keyword csx.
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
The present invention has the following advantages:
1. The computed semantic similarity is largely consistent with semantic similarity as actually judged by humans.
2. Objective reality is better reflected.
3. User needs are better satisfied.
Description of the drawings
Fig. 1 is a flow chart of the improved lexical semantic similarity algorithm.
Detailed description of the embodiments
To solve the problem of semantic similarity between words (c1, c2), the present invention is described in detail below with reference to Fig. 1. The concrete implementation steps are as follows:
Step 1: Initialize the statistical-method module; the corpus here may be a word dictionary, the Cilin thesaurus, HowNet, Baidu Baike, or similar.
Step 2: Input the words to be compared (c1, c2) into the initialized statistical-method module.
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to (c1, c2).
Find the context words (csx1, csx2) with maximum weight for the words to be compared (c1, c2) in the corpus. The concrete calculation is as follows:
Context words are searched under constraints; for example, in Chinese, part-of-speech pairs with strong contextual constraint relations include adjective-noun, verb-noun, noun-verb, and adjective-verb.
weight_sx1,2 = p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1]
In the above formula, csx1,2 are the context words adjacent, under a given relation, to the words to be compared (c1, c2); p(c1,2/csx1,2) is the conditional probability that c1,2 and the context word csx1,2 exhibit that relation; and the 1 inside the logarithm is a smoothing coefficient.
p(c1,2/csx1,2) = n(c1,2/csx1,2) / n(c1,2, csx1,2)
where n(c1,2/csx1,2) is the number of co-occurrences in the corpus in which the word to be compared c1,2 and the context word csx1,2 exhibit the given relation, and n(c1,2, csx1,2) is the total number of co-occurrences of c1,2 with csx1,2.
In summary, the following formula is obtained:
MAXweight_sx1,2 = MAX{ p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1] }
By the above formula, the optimal context collocations (csx1, csx2) of the words to be compared (c1, c2) are found: csx1 is the optimal context collocation exhibiting the given relation with c1, and likewise csx2 is the optimal context collocation exhibiting the given relation with c2.
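The weight computation of step 3 can be sketched in Python; this is a minimal illustration, with hypothetical counts n_rel (co-occurrences exhibiting the constrained part-of-speech relation) and n_total (all co-occurrences) standing in for real corpus statistics.

```python
import math

def context_weight(n_rel: int, n_total: int) -> float:
    """weight = p * log2(p + 1), where p = n_rel / n_total is the
    conditional probability of the relation and the +1 inside the
    logarithm is the smoothing coefficient from the formula."""
    if n_total == 0:
        return 0.0
    p = n_rel / n_total
    return p * math.log2(p + 1)

def max_weight_context(candidates: dict) -> str:
    """Return the candidate context word with the maximum weight.
    candidates maps a context word to its (n_rel, n_total) counts."""
    return max(candidates, key=lambda w: context_weight(*candidates[w]))

# Toy counts for context words adjacent to one word to be compared.
counts = {"fruit": (8, 10), "run": (2, 10), "red": (6, 10)}
best = max_weight_context(counts)  # "fruit" has the highest weight
```

Since p · log2(p + 1) increases with p, the maximum-weight context word is the one whose relation-constrained co-occurrences dominate its total co-occurrences.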
Step 4: From the similarity between the maximum-weight context words (csx1, csx2) corresponding to the words to be compared (c1, c2), extract the maximum-similarity keyword csx.
From the total number of words shared by the two maximum-weight context entries (csx1, csx2), the keyword information csx can be found: in the similarity formula Sim(csx1, csx2), nf(csx1, csx2) is the total number of words shared by (csx1, csx2), and the entry lengths of csx1 and csx2 normalize the count.
From the above,
f(csx) = max[Sim(csx1, csx2)]
and the maximum of f(csx) identifies the best-matching keyword csx.
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
From the probability of co-occurrence of csx with (c1, c2) in the corpus and the semantic structure relation between csx and (c1, c2), the relatedness of csx to each of (c1, c2) can be obtained. The concrete calculation is as follows:
Step 5.1) The co-occurrence probability p(c1,2/csx) of csx with the words to be compared (c1, c2) in the corpus is obtained by the same principle as the probability calculation above.
Step 5.2) The semantic structure relation between csx and (c1, c2): its path vector can be derived from the hierarchical relations in HowNet.
Step 5.3) Combining the above, the relatedness of the keyword csx with each of the words to be compared (c1, c2) is obtained.
relativity(csx, c1) and relativity(csx, c2) are the relatedness of the keyword csx with c1 and c2 respectively; in the formula, one regulatory factor adjusts the semantic structure relation between csx and (c1, c2), and another adjusts the co-occurrence probability p(c1,2/csx).
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
The pseudo-code of the improved lexical semantic similarity algorithm is as follows:
Input: the initialized statistical module and the words to be compared (c1, c2).
Output: the semantic similarity between the words to be compared (c1, c2).
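Putting steps 3 through 6 together, the pseudo-code above can be sketched end to end. The corpus statistics are toy values; keeping the higher-weight candidate as the keyword and averaging the two relatedness scores into sim(c1, c2) are simplifying assumptions, since those combining rules are not spelled out in the text.

```python
import math

def weight(n_rel: int, n_total: int) -> float:
    # weight = p * log2(p + 1); the +1 is the smoothing coefficient
    p = n_rel / n_total if n_total else 0.0
    return p * math.log2(p + 1)

def lexical_similarity(c1, c2, c1_ctx, c2_ctx, relate):
    """Steps 3-6 of the algorithm.
    c1_ctx / c2_ctx map candidate context words to (n_rel, n_total);
    relate(keyword, word) is a relatedness oracle standing in for the
    step-5 computation."""
    # Step 3: maximum-weight context word for each word to be compared
    csx1 = max(c1_ctx, key=lambda w: weight(*c1_ctx[w]))
    csx2 = max(c2_ctx, key=lambda w: weight(*c2_ctx[w]))
    # Step 4: extract the keyword; the higher-weight candidate is kept
    csx = csx1 if weight(*c1_ctx[csx1]) >= weight(*c2_ctx[csx2]) else csx2
    # Steps 5-6: relatedness to each word, averaged into sim(c1, c2)
    return 0.5 * (relate(csx, c1) + relate(csx, c2))

sim_value = lexical_similarity(
    "apple", "pear",
    {"fruit": (8, 10), "eat": (3, 10)},
    {"fruit": (7, 10), "tree": (4, 10)},
    lambda k, c: 0.6,  # constant oracle for illustration
)
```

With these toy counts, "fruit" wins for both input words, and the constant oracle yields sim_value = 0.6.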
Claims (5)
1. An improved lexical semantic similarity algorithm, relating to the field of Semantic Web technology, characterized by comprising the following steps:
Step 1: Initialize the statistical-method module; the corpus here may be a word dictionary, the Cilin thesaurus, HowNet, Baidu Baike, or similar.
Step 2: Input the words to be compared (c1, c2) into the initialized statistical-method module.
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to (c1, c2).
Step 4: From the similarity between the maximum-weight context words (csx1, csx2), extract the maximum-similarity keyword csx.
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
2. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 3 is as follows:
Step 3: In the statistical module, find the context words (csx1, csx2) with maximum weight in the context adjacent to the words to be compared (c1, c2).
Find the context words with maximum weight for (c1, c2) in the corpus; the concrete calculation is as follows:
Context words are searched under constraints; for example, in Chinese, part-of-speech pairs with strong contextual constraint relations include adjective-noun, verb-noun, noun-verb, and adjective-verb.
weight_sx1,2 = p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1]
In the above formula, csx1,2 are the context words adjacent, under a given relation, to the words to be compared (c1, c2); p(c1,2/csx1,2) is the conditional probability that c1,2 and csx1,2 exhibit that relation; the 1 inside the logarithm is a smoothing coefficient.
p(c1,2/csx1,2) = n(c1,2/csx1,2) / n(c1,2, csx1,2)
where n(c1,2/csx1,2) is the number of co-occurrences in the corpus in which c1,2 and csx1,2 exhibit the given relation, and n(c1,2, csx1,2) is the total number of co-occurrences of c1,2 with csx1,2.
In summary, the following formula is obtained:
MAXweight_sx1,2 = MAX{ p(c1,2/csx1,2) · log2[p(c1,2/csx1,2) + 1] }
By the above formula, the optimal context collocations (csx1, csx2) of (c1, c2) are found: csx1 is the optimal context collocation exhibiting the given relation with c1, and likewise csx2 with c2.
3. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 4 is as follows:
Step 4: From the similarity between the maximum-weight context words (csx1, csx2) corresponding to the words to be compared (c1, c2), extract the maximum-similarity keyword csx.
From the total number of words shared by the two maximum-weight context entries (csx1, csx2), the keyword information csx can be found: in the similarity formula Sim(csx1, csx2), nf(csx1, csx2) is the total number of words shared by (csx1, csx2), and the entry lengths of csx1 and csx2 normalize the count.
From the above,
f(csx) = max[Sim(csx1, csx2)]
and the maximum of f(csx) identifies the best-matching keyword csx.
4. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 5 is as follows:
Step 5: Compute the relatedness between the maximum-similarity keyword csx and each of the words to be compared (c1, c2).
From the probability of co-occurrence of csx with (c1, c2) in the corpus and the semantic structure relation between csx and (c1, c2), the relatedness of csx to each of (c1, c2) can be obtained; the concrete calculation is as follows:
Step 5.1) The co-occurrence probability p(c1,2/csx) of csx with (c1, c2) in the corpus is obtained by the same principle as the probability calculation above.
Step 5.2) The semantic structure relation between csx and (c1, c2): its path vector can be derived from the hierarchical relations in HowNet.
Step 5.3) Combining the above, the relatedness values relativity(csx, c1) and relativity(csx, c2) of the keyword csx with the words to be compared c1 and c2 are obtained; in the formula, one regulatory factor adjusts the semantic structure relation between csx and (c1, c2), and another adjusts the co-occurrence probability p(c1,2/csx).
5. The improved lexical semantic similarity algorithm according to claim 1, characterized in that the concrete calculation in step 6 is as follows:
Step 6: Using the relatedness values obtained in step 5, derive the similarity sim(c1, c2) of the words to be compared (c1, c2).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN2016105759690 | 2016-07-20 | |
CN201610575969 | 2016-07-20 | |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106610948A true CN106610948A (en) | 2017-05-03 |
Family
ID=58615306
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201610838940.7A (Pending) | Improved lexical semantic similarity solution algorithm | 2016-07-20 | 2016-09-21
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106610948A (en) |
Non-Patent Citations (2)
Title |
---|
Cui Chunhua et al., "Improvement of ontology-based concept similarity calculation", World Sci-Tech R&D * |
Lu Song et al., "Quantitative description of the effective range of word context in natural language processing", Chinese Journal of Computers * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108052509A (en) * | 2018-01-31 | 2018-05-18 | 北京神州泰岳软件股份有限公司 | A kind of Text similarity computing method, apparatus and server |
CN108052509B (en) * | 2018-01-31 | 2019-06-28 | 北京神州泰岳软件股份有限公司 | A kind of Text similarity computing method, apparatus and server |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170503 |