CN104090918A - Sentence similarity calculation method based on information amount - Google Patents


Info

Publication number
CN104090918A (application CN201410268361.4A); granted as CN104090918B
Authority
CN
China
Prior art keywords
sentence
word
formula
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410268361.4A
Other languages
Chinese (zh)
Other versions
CN104090918B (en)
Inventor
吴昊 (Wu Hao)
黄河燕 (Huang Heyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201410268361.4A priority Critical patent/CN104090918B/en
Publication of CN104090918A publication Critical patent/CN104090918A/en
Application granted granted Critical
Publication of CN104090918B publication Critical patent/CN104090918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The invention relates to a sentence similarity calculation method based on information content. The method comprises the following steps: first, the sense of each word is determined by the concept with the maximum information content shared between the words of the two sentences; next, the information content of each word and the common information content of multiple words are computed from the hierarchical structure of a semantic network together with corpus statistics, and the total information content of multiple words is computed with the inclusion-exclusion principle from combinatorics, yielding the information content of each sentence as well as the joint information content of both; finally, the sentence similarity is defined and computed according to the Jaccard similarity principle. The method faithfully simulates human judgments of the degree of sentence similarity, requires neither corpus-trained parameters nor empirical parameters, does not depend on corpus scale or on other natural language processing techniques such as part-of-speech tagging, and has excellent time performance: for sentence pairs of ordinary length, near-real-time computational efficiency is obtained on a mainstream multi-core PC (Personal Computer).

Description

A sentence similarity calculation method based on information content
Technical field
The present invention relates to sentence similarity calculation methods, and specifically to a sentence similarity calculation method based on information content; it belongs to the field of natural language processing.
Background technology
Computing the similarity of sentences or short texts is an important research topic in natural language processing, and in recent years its role in applications such as information retrieval, machine translation, question answering and automatic summarization has grown steadily. Traditional approaches mostly reuse document similarity measures, treating the words of a sentence as mutually unrelated, meaningless symbols; for sentences containing only a few words this is not accurate enough. The hybrid methods in common use today usually need to train parameters on an associated data set, or rely on empirical parameters; their drawback is dependence on the training data, which limits their generality.
Summary of the invention
The purpose of the present invention is to address the above problems by providing a sentence similarity calculation method based on information content. By exploiting information content, an essential attribute of language, and applying the inclusion-exclusion principle to obtain an accurate total information content for multiple words, the method produces sentence similarity results closer to human subjective judgment.
The idea of the method is as follows: first, the sense of each word is fixed by the concept with the maximum information content among the subsumers of the two sentences' words; then the information content of individual words and the common information content of multiple words are computed from the hierarchical structure of a semantic network (such as WordNet) and a corpus (such as the BNC or Brown corpus); next, the inclusion-exclusion principle from combinatorics is applied to compute the total information content of multiple words, yielding the information content of each of the two sentences and their joint information content; finally, the similarity of the sentences is defined and computed according to the Jaccard similarity principle.
To achieve the above objective, the technical solution adopted by the present invention is:
A sentence similarity calculation method based on information content, comprising the following steps:
Step 1: input the two sentences $s_a$ and $s_b$ to be compared, written as

$$s_a = \{w_i^a \mid i = 1, 2, \ldots, n\}$$
$$s_b = \{w_i^b \mid i = 1, 2, \ldots, m\}$$

where $w_i^a$ and $w_i^b$ denote the $i$-th word of $s_a$ and $s_b$ respectively, and $n$ and $m$ are the word counts of $s_a$ and $s_b$;
Step 2: perform word sense selection for the words of the input sentences, as follows:
The sense $c_i^a$ of word $w_i^a$ is determined according to formula 1:

[formula 1]
$$c_i^a = \arg\max_{\substack{c \in \mathrm{subsum}(c_1, c_2) \\ c_1 \in \mathrm{concepts}(w_i^a) \\ c_2 \in \mathrm{concepts}(s_b)}} \{-\log P(c)\}$$

where $\mathrm{subsum}(c_1, c_2)$ is the set of all concepts in the semantic network that subsume both $c_1$ and $c_2$, $\mathrm{concepts}(w_i^a)$ is the set of concepts in the semantic network that contain the word $w_i^a$, $\mathrm{concepts}(s_b)$ is the set of concepts in the semantic network that contain any word of sentence $s_b$, and $P(c)$ is the frequency of concept $c$ in the corpus; in particular, if $P(c)$ is 0 then $\log P(c)$ is taken as 0. The value of $P(c)$ is determined according to formula 2:

[formula 2]
$$P(c) = \sum_{w \in \mathrm{words}(c)} \mathrm{count}(w) / N$$

where $\mathrm{words}(c)$ is the set of all words belonging to concept $c$ and to all sub-concepts of $c$ in the semantic network, $\mathrm{count}(w)$ is the frequency of word $w$ in the corpus, and $N$ is the sum of the frequencies of all concepts in the semantic network, the frequency of each concept being the sum of the corpus frequencies of all the words it contains;
Similarly, replacing $\mathrm{concepts}(w_i^a)$ in formula 1 by $\mathrm{concepts}(w_i^b)$ and $\mathrm{concepts}(s_b)$ by $\mathrm{concepts}(s_a)$ yields the sense $c_i^b$ of the $i$-th word of sentence $s_b$;
After sense determination, sentences $s_a$ and $s_b$ can be written as:

$$s_a = \{c_i^a \mid i = 1, 2, \ldots, n\}, \quad s_b = \{c_i^b \mid i = 1, 2, \ldots, m\}$$
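Formulas 1 and 2 can be illustrated concretely. The sketch below is a simplification that assumes a tiny hand-built IS-A hierarchy and invented word counts in place of WordNet and BNC statistics; every name in it (`PARENT`, `WORDS`, `select_sense`, and so on) is made up for the example. It computes $P(c)$, the information content $-\log P(c)$, the subsumer set, and the sense choice of formula 1:

```python
import math
from itertools import product

# Toy IS-A hierarchy (child -> parent) and corpus word counts: all invented
# for illustration; a real system would read WordNet and BNC statistics.
PARENT = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": None}
WORDS = {"dog": {"dog"}, "wolf": {"wolf"}, "canine": set(),
         "cat": {"cat"}, "feline": set(), "animal": set()}
COUNT = {"dog": 50, "wolf": 10, "cat": 40}
# concepts(w): concepts whose own word set contains w (a word may be ambiguous)
CONCEPTS = {w: {c for c, ws in WORDS.items() if w in ws} for w in COUNT}

def ancestors(c):
    """Concept c together with all of its ancestors (its subsumers)."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def words_of(c):
    """All words of c and of every concept below c (formula 2's words(c))."""
    below = {d for d in PARENT if c in ancestors(d)}
    return set().union(*(WORDS[d] for d in below))

# Formula 2's N: sum of the frequencies of all concepts, where a concept's
# frequency is the total corpus frequency of the words it contains.
N = sum(sum(COUNT[w] for w in words_of(c)) for c in PARENT)

def P(c):
    return sum(COUNT[w] for w in words_of(c)) / N  # formula 2

def ic(c):
    p = P(c)
    return -math.log(p) if p > 0 else 0.0  # log P(c) taken as 0 when P(c)=0

def subsum(*concepts):
    """Concepts subsuming every argument (their common ancestors)."""
    common = set(ancestors(concepts[0]))
    for c in concepts[1:]:
        common &= set(ancestors(c))
    return common

def select_sense(word, other_sentence):
    """Formula 1: pick the sense whose best subsumer with any sense of any
    word of the other sentence has maximal information content."""
    best, best_ic = None, -1.0
    for c1, w2 in product(CONCEPTS[word], other_sentence):
        for c2 in CONCEPTS[w2]:
            for c in subsum(c1, c2):
                if ic(c) > best_ic:
                    best, best_ic = c1, ic(c)
    return best
```

A real implementation would enumerate WordNet synsets and hypernym paths instead of this toy dictionary, but the control flow is the same.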
Step 3: with the senses determined in step 2, apply the inclusion-exclusion principle from combinatorics to compute the information content of $s_a$ and $s_b$ and their joint information content, as follows:
The information content $IC(s_a)$ of sentence $s_a$ is computed as shown in formula 3:

[formula 3]
$$IC(s_a) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$$

where $\mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$ denotes the common information content in the semantic information space built jointly from the hierarchical structure of the semantic network and the corpus statistics, computed according to formula 4:

[formula 4]
$$\mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a) = \max_{c \in \mathrm{subsum}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)} [-\log P(c)]$$

where $\mathrm{subsum}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$ is the set of all concepts in the semantic network that subsume the concepts $c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a$;
Similarly, replacing every letter $a$ in formulas 3 and 4 by $b$, and $n$ by $m$, yields the information content of sentence $s_b$;
Regarding the set of all distinct words of $s_a$ and $s_b$ as a new sentence, the joint information content $IC(s_a \cup s_b)$ is obtained from formula 5:

[formula 5]
$$IC(s_a \cup s_b) = \sum_{k=1}^{p} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le p} \mathrm{commonIC}(c_{i_1}, \ldots, c_{i_k})$$

where $p$ is the number of distinct words of $s_a$ and $s_b$ combined, and the indices range over the concepts of the combined sentence;
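The alternating sums of formulas 3 and 5 map directly onto subset enumeration. The following standalone sketch (the function names and the feature-set stand-in for formula 4 are inventions of this example, not the patent's implementation) computes the inclusion-exclusion total:

```python
from itertools import combinations

def sentence_ic(concepts, common_ic):
    """Formulas 3/5: total IC of a group of concepts by inclusion-exclusion.

    concepts  -- the (disambiguated) concepts of one sentence
    common_ic -- callable mapping a tuple of concepts to their common IC
                 (formula 4: max IC over their shared subsumers)
    """
    total = 0.0
    for k in range(1, len(concepts) + 1):          # subset size k
        sign = (-1) ** (k - 1)                     # alternating sign
        for subset in combinations(concepts, k):   # 1 <= i1 < ... < ik <= n
            total += sign * common_ic(subset)
    return total

# Toy stand-in for formula 4: a concept is a set of "semantic features" and
# the common IC of a group is the number of features they share.  With this
# stand-in, inclusion-exclusion recovers the size of the union of features.
def toy_common_ic(subset):
    return float(len(set.intersection(*subset)))

s = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]
print(sentence_ic(s, toy_common_ic))  # 5.0 -- features a, b, c, d, e
```

The same `sentence_ic` serves for formula 3 (one sentence) and formula 5 (the concepts of both sentences, duplicates removed).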
Step 4: define the common information content $\mathrm{COMMONIC}(s_a, s_b)$ of the two sentences through the relation between union and intersection, as shown in formula 6:

[formula 6]
$$\mathrm{COMMONIC}(s_a, s_b) = IC(s_a) + IC(s_b) - IC(s_a \cup s_b)$$

Step 5: following the Jaccard similarity principle, define the similarity $\mathrm{sim}(s_a, s_b)$ of sentences $s_a$ and $s_b$ as shown in formula 7:

[formula 7]
$$\mathrm{sim}(s_a, s_b) = \frac{\mathrm{COMMONIC}(s_a, s_b)}{IC(s_a \cup s_b)}$$
Step 6: output the similarity $\mathrm{sim}(s_a, s_b)$ of the two sentences.
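Steps 3 to 6 can be strung together in a short end-to-end sketch. Here toy "feature set" concepts again stand in for WordNet senses, so the numbers only exercise formulas 5-7, not the real semantic space; all identifiers are invented for the example:

```python
from itertools import combinations

def common_ic(subset):
    # Toy stand-in for formula 4: shared "features" of the concepts.
    return float(len(set.intersection(*subset)))

def ic(concepts):
    # Formulas 3/5: inclusion-exclusion over all non-empty subsets.
    total = 0.0
    for k in range(1, len(concepts) + 1):
        for sub in combinations(concepts, k):
            total += (-1) ** (k - 1) * common_ic(sub)
    return total

def similarity(sa, sb):
    """Formulas 6 and 7: common IC via union/intersection, then Jaccard."""
    union = sa + [c for c in sb if c not in sa]    # distinct concepts of both
    ic_union = ic(union)                           # IC(sa U sb), formula 5
    common = ic(sa) + ic(sb) - ic_union            # formula 6
    return common / ic_union if ic_union else 0.0  # formula 7

sa = [{"a", "b"}, {"b", "c"}]
sb = [{"a", "b"}, {"b", "d"}]
print(similarity(sa, sa))  # 1.0 -- identical sentences
print(similarity(sa, sb))  # 0.5
```

Because formula 7 divides the shared information by the joint information, the result lies in [0, 1] and equals 1 exactly for identical concept sets, mirroring the Jaccard index.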
Beneficial effect
Compared with the prior art, the method of the invention further improves the accuracy of hybrid methods and faithfully simulates human judgments of the degree of sentence similarity, while requiring neither corpus-trained parameters nor empirical parameters; it does not depend on corpus scale, needs no other natural language processing techniques such as part-of-speech tagging, and can be applied directly to sentence similarity calculation, giving it good generality. Its time performance is excellent: for sentence pairs of ordinary length, near-real-time computational efficiency is obtained on a mainstream multi-core PC.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 compares the method of the invention with the other methods.
Embodiment
The implementation of the invention is described in detail below with reference to the drawings.
As shown in Fig. 1, the method is divided into five main steps:
Step 1: input the two sentences to be compared, written as

$$s_a = \{w_i^a \mid i = 1, 2, \ldots, n\}$$
$$s_b = \{w_i^b \mid i = 1, 2, \ldots, m\}$$

where $w_i^a$ and $w_i^b$ denote the $i$-th word of $s_a$ and $s_b$ respectively, and $n$ and $m$ are the word counts of $s_a$ and $s_b$.
Step 2: because polysemy is ubiquitous, performing word sense selection for the words of the input sentences removes the semantic uncertainty of the sentences and prepares for the subsequent similarity calculation. The detailed procedure is as follows:
(1) form word pairs by choosing one word from each of the two input sentences;
(2) replace each word by its senses (concepts) in the semantic network (such as WordNet), so that each word pair is converted into several sense pairs;
(3) compute the common information content of each sense pair;
(4) take the maximum common information content over the sense pairs as the common information content of the word pair;
(5) sort all word pairs by common information content in descending order;
(6) define two arrays to record the sense of each word in each sentence, initializing all elements to "sense undetermined";
(7) process the word pairs in the order of (5): if the sense of a word is not yet determined, record in its array element the sense of the word corresponding to this common information content;
(8) when no array element remains undetermined, the senses of all words have been fixed.
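The greedy procedure above amounts to sorting word pairs by their best common information content and fixing each still-undetermined sense on first encounter. A minimal sketch, with the sense inventory and pair scores supplied as a hypothetical precomputed table (`pair_best` and all sense labels are invented for the example), might look like:

```python
def disambiguate(sent_a, sent_b, pair_best):
    """Greedy sense assignment over cross-sentence word pairs.

    sent_a, sent_b -- lists of words
    pair_best      -- dict mapping (word_a, word_b) to
                      (common_ic, sense_a, sense_b): the best sense pair of
                      that word pair and its common information content
    """
    # Sort all cross-sentence word pairs by common IC, descending.
    pairs = sorted(((wa, wb) for wa in sent_a for wb in sent_b),
                   key=lambda p: pair_best[p][0], reverse=True)
    # Every sense starts out undetermined.
    senses_a = {w: None for w in sent_a}
    senses_b = {w: None for w in sent_b}
    # Walk the pairs in order, fixing each still-undetermined sense.
    for wa, wb in pairs:
        _, ca, cb = pair_best[(wa, wb)]
        if senses_a[wa] is None:
            senses_a[wa] = ca
        if senses_b[wb] is None:
            senses_b[wb] = cb
    # Every word now has a sense (assuming both sentences are non-empty).
    return senses_a, senses_b

# Invented example: "bank" should take its financial sense because the
# ("bank", "money") pair carries the most common information.
table = {
    ("bank", "money"): (5.0, "bank#finance", "money#1"),
    ("bank", "river"): (4.0, "bank#shore", "river#1"),
    ("deposit", "money"): (3.0, "deposit#1", "money#1"),
    ("deposit", "river"): (0.5, "deposit#2", "river#1"),
}
a, b = disambiguate(["bank", "deposit"], ["money", "river"], table)
print(a["bank"])  # bank#finance
```

The highest-scoring pair wins ties implicitly: once a word's sense is recorded, later (lower-scoring) pairs cannot overwrite it.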
The sense (concept) $c_i^a$ of word $w_i^a$ is determined according to formula 1:

[formula 1]
$$c_i^a = \arg\max_{\substack{c \in \mathrm{subsum}(c_1, c_2) \\ c_1 \in \mathrm{concepts}(w_i^a) \\ c_2 \in \mathrm{concepts}(s_b)}} \{-\log P(c)\}$$

where

[formula 2]
$$P(c) = \sum_{w \in \mathrm{words}(c)} \mathrm{count}(w) / N$$

Here $\mathrm{subsum}(c_1, c_2)$ is the set of all concepts in the semantic network that subsume both $c_1$ and $c_2$, $\mathrm{concepts}(w_i^a)$ is the set of concepts in the semantic network that contain the word $w_i^a$, $\mathrm{concepts}(s_b)$ is the set of concepts that contain any word of sentence $s_b$, and $P(c)$ is the frequency of concept $c$ in a given corpus; in particular, if $P(c)$ is 0 then $\log P(c)$ is taken as 0.
$\mathrm{words}(c)$ is the set of all words belonging to concept $c$ and to all sub-concepts of $c$ in the semantic network, $\mathrm{count}(w)$ is the frequency of word $w$ in the corpus, and $N$ is the sum of the frequencies of all concepts in the semantic network, the frequency of each concept being the sum of the corpus frequencies of all the words it contains.
Similarly, the sense $c_i^b$ of the $i$-th word of sentence $s_b$ is obtained. After sense determination, sentences $s_a$ and $s_b$ can be written as:

$$s_a = \{c_i^a \mid i = 1, 2, \ldots, n\}$$
$$s_b = \{c_i^b \mid i = 1, 2, \ldots, m\}$$
Step 3: with the senses determined in step 2, apply the inclusion-exclusion principle from combinatorics to compute the information content of each sentence; that is, the information content of sentence $s_a$ is computed as shown in formula 3:

[formula 3]
$$IC(s_a) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$$

where

[formula 4]
$$\mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a) = \max_{c \in \mathrm{subsum}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)} [-\log P(c)]$$

Here $\mathrm{commonIC}$ denotes the information content of the intersection of the concepts in the semantic information space built jointly from the hierarchical structure of the semantic network and the corpus statistics, $\mathrm{subsum}(c_{i_1}^a, \ldots, c_{i_k}^a)$ is the set of all concepts in the semantic network that subsume the concepts $c_{i_1}^a, \ldots, c_{i_k}^a$, and for each $k$ between 1 and $n$, $(i_1, \ldots, i_k)$ ranges over the combinations of $k$ words of sentence $s_a$.
Similarly, replacing every letter $a$ in formulas 3 and 4 by $b$, and $n$ by $m$, yields the information content of sentence $s_b$.
Regarding the set of all distinct words of $s_a$ and $s_b$ as a new sentence, the joint information content of $s_a$ and $s_b$ is obtained from formula 5:

[formula 5]
$$IC(s_a \cup s_b) = \sum_{k=1}^{p} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le p} \mathrm{commonIC}(c_{i_1}, \ldots, c_{i_k})$$

where $p$ is the number of distinct words of the two sentences combined.
Step 4: from the per-sentence information contents and the joint information content obtained in step 3, the common information content $\mathrm{COMMONIC}(s_a, s_b)$ of the two sentences follows from the relation between union and intersection, as shown in formula 6:

[formula 6]
$$\mathrm{COMMONIC}(s_a, s_b) = IC(s_a) + IC(s_b) - IC(s_a \cup s_b)$$

Step 5: from the joint information content and the common information content obtained in steps 3 and 4, and following the Jaccard similarity principle, the similarity of the two sentences is computed by formula 7:

[formula 7]
$$\mathrm{sim}(s_a, s_b) = \frac{\mathrm{COMMONIC}(s_a, s_b)}{IC(s_a \cup s_b)}$$
As described above, the invention provides a sentence similarity calculation method based on information content. Given a pair of real sentences from the user, the system automatically computes a similarity result that faithfully reflects human judgment.
A comparison of the method with four existing sentence similarity calculation methods is given below. The experiments use the semantic network WordNet and the corpus BNC. Evaluation uses the Pearson correlation coefficient (PCC) for linear correlation, the Spearman rank correlation coefficient (SRCC) for general rank correlation, and the Kendall rank correlation coefficient (KRCC) for probabilistic rank correlation. Table 1 lists the human scores and the scores of each method for the sentence pairs.
Table 1: human scores and scores of the various methods for the sentence pairs
Fig. 2 is obtained from Table 1. As Fig. 2 shows, the present method outperforms the other four methods on PCC, SRCC and KRCC, indicating that the model's judgment of sentence similarity is closer to human subjective judgment. Moreover, on the manually scored data set, the average PCC between a single volunteer's scores and the mean of all volunteers' scores is 0.825, with a maximum of 0.921; the PCC of the present method lies above the single-scorer mean and below the single-scorer maximum, showing that the model's judgment of sentence similarity is above the average human level and that the results are credible.
The specific description above further explains the objectives, technical solution and beneficial effects of the invention. It should be understood that the foregoing is only a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (1)

1. A sentence similarity calculation method based on information content, characterized in that it comprises the following steps:
Step 1: input the two sentences $s_a$ and $s_b$ to be compared, written as

$$s_a = \{w_i^a \mid i = 1, 2, \ldots, n\}$$
$$s_b = \{w_i^b \mid i = 1, 2, \ldots, m\}$$

where $w_i^a$ and $w_i^b$ denote the $i$-th word of $s_a$ and $s_b$ respectively, and $n$ and $m$ are the word counts of $s_a$ and $s_b$;
Step 2: perform word sense selection for the words of the input sentences, as follows:
The sense $c_i^a$ of word $w_i^a$ is determined according to formula 1:

[formula 1]
$$c_i^a = \arg\max_{\substack{c \in \mathrm{subsum}(c_1, c_2) \\ c_1 \in \mathrm{concepts}(w_i^a) \\ c_2 \in \mathrm{concepts}(s_b)}} \{-\log P(c)\}$$

where $\mathrm{subsum}(c_1, c_2)$ is the set of all concepts in the semantic network that subsume both $c_1$ and $c_2$, $\mathrm{concepts}(w_i^a)$ is the set of concepts in the semantic network that contain the word $w_i^a$, $\mathrm{concepts}(s_b)$ is the set of concepts in the semantic network that contain any word of sentence $s_b$, and $P(c)$ is the frequency of concept $c$ in the corpus; in particular, if $P(c)$ is 0 then $\log P(c)$ is taken as 0; the value of $P(c)$ is determined according to formula 2:

[formula 2]
$$P(c) = \sum_{w \in \mathrm{words}(c)} \mathrm{count}(w) / N$$

where $\mathrm{words}(c)$ is the set of all words belonging to concept $c$ and to all sub-concepts of $c$ in the semantic network, $\mathrm{count}(w)$ is the frequency of word $w$ in the corpus, and $N$ is the sum of the frequencies of all concepts in the semantic network, the frequency of each concept being the sum of the corpus frequencies of all the words it contains;
Similarly, replacing $\mathrm{concepts}(w_i^a)$ in formula 1 by $\mathrm{concepts}(w_i^b)$ and $\mathrm{concepts}(s_b)$ by $\mathrm{concepts}(s_a)$ yields the sense $c_i^b$ of the $i$-th word of sentence $s_b$;
After sense determination, sentences $s_a$ and $s_b$ can be written as:

$$s_a = \{c_i^a \mid i = 1, 2, \ldots, n\}$$
$$s_b = \{c_i^b \mid i = 1, 2, \ldots, m\}$$

Step 3: with the senses determined in step 2, apply the inclusion-exclusion principle from combinatorics to compute the information content of $s_a$ and $s_b$ and their joint information content, as follows:
The information content $IC(s_a)$ of sentence $s_a$ is computed as shown in formula 3:

[formula 3]
$$IC(s_a) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$$

where $\mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$ denotes the common information content in the semantic information space built jointly from the hierarchical structure of the semantic network and the corpus statistics, computed according to formula 4:

[formula 4]
$$\mathrm{commonIC}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a) = \max_{c \in \mathrm{subsum}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)} [-\log P(c)]$$

where $\mathrm{subsum}(c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a)$ is the set of all concepts in the semantic network that subsume the concepts $c_{i_1}^a, c_{i_2}^a, \ldots, c_{i_k}^a$;
Similarly, replacing every letter $a$ in formulas 3 and 4 by $b$, and $n$ by $m$, yields the information content of sentence $s_b$;
Regarding the set of all distinct words of $s_a$ and $s_b$ as a new sentence, the joint information content $IC(s_a \cup s_b)$ is obtained from formula 5:

[formula 5]
$$IC(s_a \cup s_b) = \sum_{k=1}^{p} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le p} \mathrm{commonIC}(c_{i_1}, \ldots, c_{i_k})$$

where $p$ is the number of distinct words of $s_a$ and $s_b$ combined;
Step 4: define the common information content $\mathrm{COMMONIC}(s_a, s_b)$ of the two sentences through the relation between union and intersection, as shown in formula 6:

[formula 6]
$$\mathrm{COMMONIC}(s_a, s_b) = IC(s_a) + IC(s_b) - IC(s_a \cup s_b)$$

Step 5: following the Jaccard similarity principle, define the similarity $\mathrm{sim}(s_a, s_b)$ of sentences $s_a$ and $s_b$ as shown in formula 7:

[formula 7]
$$\mathrm{sim}(s_a, s_b) = \frac{\mathrm{COMMONIC}(s_a, s_b)}{IC(s_a \cup s_b)}$$

Step 6: output the similarity $\mathrm{sim}(s_a, s_b)$ of the two sentences.
CN201410268361.4A 2014-06-16 2014-06-16 Sentence similarity calculation method based on information amount Active CN104090918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410268361.4A CN104090918B (en) 2014-06-16 2014-06-16 Sentence similarity calculation method based on information amount

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410268361.4A CN104090918B (en) 2014-06-16 2014-06-16 Sentence similarity calculation method based on information amount

Publications (2)

Publication Number Publication Date
CN104090918A true CN104090918A (en) 2014-10-08
CN104090918B CN104090918B (en) 2017-02-22

Family

ID=51638634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410268361.4A Active CN104090918B (en) 2014-06-16 2014-06-16 Sentence similarity calculation method based on information amount

Country Status (1)

Country Link
CN (1) CN104090918B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354186A (en) * 2015-11-05 2016-02-24 同济大学 News event extraction method and system
CN106780061A (en) * 2016-11-30 2017-05-31 华南师范大学 Social network user analysis method and device based on comentropy
CN108628883A (en) * 2017-03-20 2018-10-09 北京搜狗科技发展有限公司 A kind of data processing method, device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030081811A1 (en) * 2000-10-02 2003-05-01 Hiroyuki Shimizu Apparatus and method for text segmentation based on coherent units
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN103136359A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Generation method of single document summaries
CN103699529A (en) * 2013-12-31 2014-04-02 哈尔滨理工大学 Method and device for fusing machine translation systems by aid of word sense disambiguation
US20140101171A1 (en) * 2012-10-10 2014-04-10 Abbyy Infopoisk Llc Similar Document Search

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030081811A1 (en) * 2000-10-02 2003-05-01 Hiroyuki Shimizu Apparatus and method for text segmentation based on coherent units
US20140101171A1 (en) * 2012-10-10 2014-04-10 Abbyy Infopoisk Llc Similar Document Search
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN103136359A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Generation method of single document summaries
CN103699529A (en) * 2013-12-31 2014-04-02 哈尔滨理工大学 Method and device for fusing machine translation systems by aid of word sense disambiguation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Haifang et al., "A query expansion method based on fuzzy synonyms", Computer Applications and Software *
Cheng Chuanpeng et al., "A sentence similarity calculation method based on HowNet", Computer Engineering and Science *
Jin Chunxia, "Applied research on multi-level structure sentence similarity calculation", Computer Applications and Software *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354186A (en) * 2015-11-05 2016-02-24 同济大学 News event extraction method and system
WO2017075912A1 (en) * 2015-11-05 2017-05-11 同济大学 News events extracting method and system
CN106780061A (en) * 2016-11-30 2017-05-31 华南师范大学 Social network user analysis method and device based on comentropy
CN108628883A (en) * 2017-03-20 2018-10-09 北京搜狗科技发展有限公司 A kind of data processing method, device and electronic equipment
CN108628883B (en) * 2017-03-20 2021-03-16 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN104090918B (en) 2017-02-22

Similar Documents

Publication Publication Date Title
CN105468605B (en) Entity information map generation method and device
CN103984681B (en) News event evolution analysis method based on time sequence distribution information and topic model
Li et al. Key word extraction for short text via word2vec, doc2vec, and textrank
CN106610955A (en) Dictionary-based multi-dimensional emotion analysis method
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN103473380B (en) A kind of computer version sensibility classification method
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN104268197A (en) Industry comment data fine grain sentiment analysis method
CN104679738B (en) Internet hot words mining method and device
CN107992542A (en) A kind of similar article based on topic model recommends method
CN104915446A (en) Automatic extracting method and system of event evolving relationship based on news
CN106970910A (en) A kind of keyword extracting method and device based on graph model
CN103455562A (en) Text orientation analysis method and product review orientation discriminator on basis of same
CN102033919A (en) Method and system for extracting text key words
CN106407280A (en) Query target matching method and device
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN106874258A (en) A kind of text similarity computational methods and system based on Hanzi attribute vector representation
CN108073571A (en) A kind of multi-language text method for evaluating quality and system, intelligent text processing system
CN110781681A (en) Translation model-based elementary mathematic application problem automatic solving method and system
CN106503256B (en) A kind of hot information method for digging based on social networks document
CN104346408A (en) Method and equipment for labeling network user
CN109086355A (en) Hot spot association relationship analysis method and system based on theme of news word
CN104881399A (en) Event identification method and system based on probability soft logic PSL

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant