CN103164394A - Text similarity calculation method based on universal gravitation

Info

Publication number: CN103164394A
Application number: CN2013100931085A
Authority: CN
Inventors: 陈雪; 吴超
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2012-07-16
Filing date: 2013-03-22
Publication date: 2013-06-19
Anticipated expiration: 2033-03-22
Also published as: CN103164394B

Abstract

The invention discloses a text similarity calculation method based on universal gravitation. The method comprises the following steps: (1) inputting any two texts from a domain corpus; (2) generating the text representations and their maximum common subgraph; (3) calculating the compactness of the maximum common subgraph of the texts based on universal gravitation; (4) calculating the similarity of the texts; (5) outputting the similarity of the texts. The method takes the maximum common subgraph of the texts as the reference object for similarity, derives a concept of compactness from the law of universal gravitation in physics to quantify that reference object, and uses the degree to which each text can be converted into the maximum common subgraph as the criterion of similarity. The method is simple, convenient to use, easy to operate, and effective.

Description

A text similarity calculation method based on universal gravitation

Technical field

The present invention relates to a method for calculating the similarity of texts, and in particular to a text similarity calculation method based on universal gravitation. The method takes the maximum common subgraph of the texts as the reference object for similarity, derives a concept of compactness from the physical law of universal gravitation to quantify the reference object, and uses the degree to which a text can be converted into the maximum common subgraph as the criterion of similarity.

Background technology

At present, the most widely used text similarity calculation method is the cosine method based on the vector space model. The vector space model represents a text as a weight vector whose components correspond to terms, with the weight of each term determined by the TF-IDF method. The cosine formula computes the cosine of the angle between the weight vectors of two texts and takes this value as the text similarity.

However, calculating text similarity with the cosine method based on the vector space model has the following deficiencies:

(1) The vector space model treats a text as a set of terms and regards the terms as independent of one another, so a large amount of text structure information is lost.

(2) The cosine formula considers neither the semantic correlation nor the structural correlation between the keywords in a text.
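For reference, the baseline criticised above can be sketched in a few lines of Python. This is a minimal illustration, not part of the invention: the function names `tfidf_vectors` and `cosine`, the term-frequency normalisation, and the unsmoothed logarithmic IDF are assumptions chosen for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document (assumed TF-IDF variant)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two documents with the same terms in a different order are indistinguishable.
docs = [["graph", "mining", "text"], ["text", "graph", "mining"], ["query", "index"]]
vecs = tfidf_vectors(docs)
same_terms_score = cosine(vecs[0], vecs[1])
```

Because the vectors record only which terms occur and how often, reordering the terms of a document leaves the score unchanged, which is the loss of structure information described in deficiency (1).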

Summary of the invention

The object of the invention is to address the deficiencies of the cosine method based on the vector space model by providing a text similarity calculation method based on universal gravitation. The method takes the maximum common subgraph of the texts as the reference object for similarity, derives a concept of compactness from the physical law of universal gravitation to quantify the reference object, and uses the degree to which a text can be converted into the maximum common subgraph as the criterion of similarity.

To achieve the above object, the concept of the invention is as follows: the maximum common subgraph of the texts serves as the reference object for similarity; a concept of compactness, derived from the physical law of universal gravitation, quantifies the reference object; and the degree to which a text can be converted into the maximum common subgraph serves as the criterion of similarity. The compactness expresses the correlation between keywords and depends on the weights of the keywords and of the keyword pairs.

Based on this concept, the invention adopts the following technical solution:

A text similarity calculation method based on universal gravitation, characterized in that its specific steps are as follows:

(1) inputting any two texts from a domain corpus;

(2) generating the text representations and their maximum common subgraph;

(3) calculating the compactness of the maximum common subgraph of the texts based on universal gravitation;

(4) calculating the similarity of the texts;

(5) outputting the similarity of the texts.
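Step (2) can be illustrated with a short Python sketch. The source does not specify the graph model or the subgraph algorithm, so everything here is an assumption made for illustration: keywords are treated as unique node labels, an edge joins two keywords that occur in the same sentence, and the common subgraph is obtained by intersecting node and edge sets (which is sufficient only because each keyword labels exactly one node). The names `text_graph` and `common_subgraph` are hypothetical.

```python
from itertools import combinations

def text_graph(sentences):
    """Build an undirected keyword graph from tokenised sentences.

    Nodes are keywords. An edge joins two keywords that share a sentence,
    and its value counts how many sentences they share (assumed model).
    """
    nodes = set()
    edges = {}
    for sentence in sentences:
        keywords = sorted(set(sentence))  # sorted so each pair has one canonical key
        nodes.update(keywords)
        for pair in combinations(keywords, 2):
            edges[pair] = edges.get(pair, 0) + 1
    return nodes, edges

def common_subgraph(graph_a, graph_b):
    """Keywords and keyword pairs present in both graphs."""
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    return nodes_a & nodes_b, set(edges_a) & set(edges_b)

graph_a = text_graph([["text", "graph", "similarity"], ["graph", "subgraph"]])
graph_b = text_graph([["graph", "similarity"], ["graph", "subgraph", "kernel"]])
shared_nodes, shared_edges = common_subgraph(graph_a, graph_b)
```

The sizes of `shared_nodes` and `shared_edges` correspond to the keyword count and keyword-pair count of the common subgraph, and the per-text counts can be read from each graph in the same way.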

The compactness of the maximum common subgraph of the texts is calculated by a formula in which n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the compactness constant, with G = 1.
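The compactness formula itself is not reproduced in this text, so the following Python sketch is only an analogy, not the patented formula. It assumes the form of Newton's law of universal gravitation, F = G·M·m/R², with the keyword weights in the role of the masses and the keyword-pair weight R in the role of distance. How the occurrence count n enters the calculation cannot be recovered from the text, so it is left out. The names `pair_attraction` and `subgraph_compactness` are hypothetical.

```python
G = 1.0  # compactness constant; the source sets G = 1

def pair_attraction(M, m, R, G=G):
    """Newton-style attraction G*M*m/R**2 for one keyword pair (assumed form)."""
    if R <= 0:
        raise ValueError("pair weight R must be positive")
    return G * M * m / (R * R)

def subgraph_compactness(keyword_weights, pair_weights):
    """Sum the assumed attraction over every keyword pair of a subgraph."""
    return sum(
        pair_attraction(keyword_weights[a], keyword_weights[b], R)
        for (a, b), R in pair_weights.items()
    )

# Illustrative weights only; they are not taken from the source.
keyword_weights = {"graph": 2.0, "text": 3.0, "similarity": 1.0}
pair_weights = {("graph", "text"): 1.0, ("graph", "similarity"): 2.0}
compactness = subgraph_compactness(keyword_weights, pair_weights)
```

Summing over pairs is itself a design assumption: it makes the compactness grow with both the number of keyword pairs and the weights of the keywords they connect.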

The similarity of the texts is calculated by a formula defined in terms of the following quantities: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively.

Compared with the prior art, the text similarity calculation method based on universal gravitation of the present invention has the following outstanding features and advantages. The maximum common subgraph serves as the reference object for similarity: the graph structure model of each text is converted into the maximum common subgraph model with a certain probability, and the similarity of the texts is then calculated from the degree of that conversion. Using a common third-party text model as the reference object for the calculation ensures comparability and objectivity. The physical law of universal gravitation is introduced to measure how closely the keyword pairs in the graph structure model of a text are connected, capturing both the semantic correlation and the structural correlation of the keyword pairs in the text.

Description of drawings

Fig. 1 is a flowchart of the text similarity calculation method based on universal gravitation of the present invention.

Embodiment

Embodiments of the invention are further described below with reference to the accompanying drawing.

Embodiment one: referring to Fig. 1, the text similarity calculation method based on universal gravitation is characterized in that the maximum common subgraph of the texts serves as the reference object for similarity, a concept of compactness derived from the physical law of universal gravitation quantifies the reference object, and the degree to which a text can be converted into the maximum common subgraph serves as the criterion of similarity.

The compactness of the maximum common subgraph of the texts expresses the correlation between keywords and depends on the weights of the keywords and of the keyword pairs.

The similarity of the texts is calculated by a formula defined in terms of the following quantities: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively.

Embodiment two: the text similarity calculation method based on universal gravitation is applied to 70 papers published in TKDE from 2011 to 2012. As shown in Fig. 1, the steps of the method in this embodiment are as follows:

S1. Input any two texts from the domain corpus, for example, two texts from volume 24, issue 1 (2011);

S2. Generate the text representations and their maximum common subgraph, for example, by representing each text with a graph structure representation model and generating the maximum common subgraph of the texts with a maximum common subgraph algorithm;

S3. Calculate the compactness of the maximum common subgraph of the texts based on universal gravitation. In the compactness formula, n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the compactness constant, with G = 1;

S4. Calculate the similarity of the texts. The similarity formula is defined in terms of the following quantities: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively;

S5. Output the similarity of the texts.

Claims

1. A text similarity calculation method based on universal gravitation, characterized in that: the maximum common subgraph of the texts serves as the reference object for similarity, a concept of compactness derived from the physical law of universal gravitation quantifies the reference object, and the degree to which a text can be converted into the maximum common subgraph serves as the criterion of similarity; the compactness expresses the correlation between keywords and depends on the weights of the keywords and of the keyword pairs; the specific steps are as follows:

(1) inputting any two texts from a domain corpus;

(2) generating the text representations and their maximum common subgraph;

(3) calculating the compactness of the maximum common subgraph of the texts based on universal gravitation;

(4) calculating the similarity of the texts;

(5) outputting the similarity of the texts.

2. The text similarity calculation method based on universal gravitation according to claim 1, characterized in that the compactness of the maximum common subgraph of the texts in step (3) is calculated by the following formula:

[Formula image: compactness formula]

3. The text similarity calculation method based on universal gravitation according to claim 1, characterized in that the similarity of the texts in step (4) is calculated by the following formula:

[Formula images: similarity formula]

The formula is defined in terms of the following quantities: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively.