CN103164394A - Text similarity calculation method based on universal gravitation - Google Patents

Text similarity calculation method based on universal gravitation

Publication number
CN103164394A
Authority
CN
China
Prior art keywords
text
similarity
keyword
common subgraph
tightness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100931085A
Other languages
Chinese (zh)
Other versions
CN103164394B (en)
Inventor
陈雪
吴超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201310093108.5A
Publication of CN103164394A
Application granted
Publication of CN103164394B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity calculation method based on universal gravitation. The method comprises the following steps: (1) any two texts in a domain corpus are input; (2) the text representations and the maximum common subgraph are generated; (3) the tightness of the maximum common subgraph of the texts is calculated based on universal gravitation; (4) the similarity of the texts is calculated; (5) the similarity of the texts is output. In this method, the maximum common subgraph of the texts serves as the reference object for similarity, a notion of tightness derived from the law of universal gravitation in physics quantifies that reference object, and the degree to which each text can be converted into the maximum common subgraph serves as the measure of similarity. The method is simple, convenient to use, easy to operate and effective.

Description

A text similarity calculation method based on universal gravitation
Technical field
The present invention relates to a method for calculating the similarity of texts, and in particular to a text similarity calculation method based on universal gravitation, in which the maximum common subgraph of the texts serves as the reference object for similarity, a notion of tightness derived from the physical law of universal gravitation quantifies that reference object, and the degree to which each text can be converted into the maximum common subgraph serves as the measure of similarity.
Background art
At present, the most widely used text similarity calculation method is cosine similarity based on the vector space model. The vector space model represents a text as a weight vector in which each component corresponds to a term, and the weight of each term is determined by the TF-IDF method. The cosine formula computes the cosine of the angle between the weight vectors of two texts and takes this value as the text similarity.
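The baseline described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the patent: the function names `tfidf_vectors` and `cosine`, the smoothed IDF variant, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
docs = [
    "graph based text similarity".split(),
    "text similarity with graph models".split(),
    "shallow water equations".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents that share terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents with no shared terms
```

Documents that share terms score higher than documents that share none, which is all the vector space model can express: it sees which terms occur, not how they relate to each other.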
However, calculating text similarity with cosine similarity based on the vector space model has the following shortcomings:
(1) The vector space model treats a text as a set of terms and regards the terms as mutually independent, so a large amount of text structure information is lost.
(2) The cosine formula considers neither the semantic correlation nor the structural correlation between the keywords in a text.
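A small example makes the loss of structure concrete. The sketch below is illustrative only: the helper `count_cosine` and the two sentences are assumptions for the example, and raw term counts stand in for TF-IDF weights.

```python
import math
from collections import Counter


def count_cosine(tokens_a, tokens_b):
    """Cosine similarity over raw term counts (bag of words)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


s1 = "the dog bites the man".split()
s2 = "the man bites the dog".split()

# Both sentences contain exactly the same terms with the same counts,
# so the bag-of-words score cannot tell them apart.
score = count_cosine(s1, s2)

# The adjacent word pairs, one simple form of structure, do differ.
adjacent_pairs_1 = set(zip(s1, s1[1:]))
adjacent_pairs_2 = set(zip(s2, s2[1:]))
```

The two sentences receive the maximum cosine score even though their meanings differ, while the sets of adjacent word pairs are not equal. Relations between terms carry information that a term-set model discards.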
Summary of the invention
The object of the present invention is to address the shortcomings of cosine similarity based on the vector space model by providing a text similarity calculation method based on universal gravitation. The method takes the maximum common subgraph of the texts as the reference object for similarity, quantifies that reference object with a notion of tightness derived from the physical law of universal gravitation, and takes the degree to which each text can be converted into the maximum common subgraph as the measure of similarity.
To achieve the above object, the concept of the present invention is as follows: the maximum common subgraph of the texts serves as the reference object for similarity; a notion of tightness derived from the physical law of universal gravitation quantifies that reference object; and the degree to which each text can be converted into the maximum common subgraph serves as the measure of similarity. The tightness is the correlation between keywords; it depends on the weights of the keywords and on the weight of the keyword pair.
In accordance with the above inventive concept, the present invention adopts the following technical solution:
A text similarity calculation method based on universal gravitation, characterized in that its specific steps are as follows:
(1) input any two texts from a domain corpus;
(2) generate the text representations and the maximum common subgraph;
(3) calculate the tightness of the maximum common subgraph of the texts based on universal gravitation;
(4) calculate the similarity of the texts;
(5) output the similarity of the texts.
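Step (2) can be pictured with a short Python sketch. The patent does not specify the graph construction or the maximum common subgraph algorithm, so everything here is an assumption made for illustration: keywords are nodes, two keywords are linked when they occur in the same sentence, and, because each keyword labels exactly one node, the common subgraph is taken as the shared nodes and shared edges. The names `build_keyword_graph` and `common_subgraph` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_keyword_graph(sentences):
    """Build (nodes, edges) from a list of per-sentence keyword lists.

    Assumption of this sketch: an edge links two keywords that occur in the
    same sentence, and its value counts how often the pair occurs.
    """
    nodes = set()
    edges = Counter()
    for keywords in sentences:
        unique = sorted(set(keywords))
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1  # pairs are stored in sorted order
    return nodes, edges


def common_subgraph(graph_a, graph_b):
    """Shared nodes and shared edges of two uniquely labeled graphs."""
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    return nodes_a & nodes_b, set(edges_a) & set(edges_b)


# Hypothetical keyword lists for two short texts.
text_a = [["graph", "similarity", "text"], ["keyword", "text"]]
text_b = [["similarity", "text"], ["keyword", "weight"]]

mcs_nodes, mcs_edges = common_subgraph(
    build_keyword_graph(text_a), build_keyword_graph(text_b)
)
```

With unique node labels the intersection is cheap to compute, which is why this simplification is common for keyword graphs. A general maximum common subgraph search over unlabeled graphs is far more expensive and is not what this sketch does.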
The tightness of the maximum common subgraph of the texts is calculated by the following formula:
[Tightness formula: image not reproduced in the source text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the tightness constant, G=1.
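Because the formula image is missing, the exact expression cannot be recovered from this text. For orientation only, Newton's law of universal gravitation has the form F = G·M·m / r², and the symbol definitions above suggest an analogy in which keyword weights play the role of masses and the keyword-pair weight plays the role of distance. The Python sketch below shows that Newtonian form under loudly stated assumptions: the function names are hypothetical, summing over keyword pairs is an assumption, and the occurrence count n is left out because the text does not say how it enters the formula.

```python
def pair_attraction(M, m, R, G=1.0):
    """Newtonian form G*M*m / R**2 applied to one keyword pair.

    Assumption of this sketch: M and m are the keyword weights and R is the
    keyword-pair weight, mirroring mass and distance in Newton's law.
    """
    if R <= 0:
        raise ValueError("pair weight R must be positive")
    return G * M * m / (R * R)


def subgraph_tightness(pairs, G=1.0):
    """Sum the pairwise terms over (M, m, R) triples.

    The aggregation over pairs is hypothetical; the patent's own formula
    is not available in the source text.
    """
    return sum(pair_attraction(M, m, R, G) for M, m, R in pairs)


# Hypothetical (M, m, R) values for two keyword pairs.
pairs = [(0.5, 0.4, 1.0), (0.3, 0.2, 0.5)]
tightness = subgraph_tightness(pairs)
```

In this form, heavier keywords and a smaller pair weight both raise the pairwise term, just as larger masses and a shorter distance raise gravitational attraction. Whether the patent's formula behaves this way cannot be confirmed from the surviving text.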
The similarity of the texts is calculated by the following formulas:
[Similarity formulas: images not reproduced in the source text]
The symbols in these formulas, which are likewise images missing from the source text, denote in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the tightness of the maximum common subgraph for text A and text B, respectively.
Compared with the prior art, the text similarity calculation method based on universal gravitation of the present invention has the following prominent features and advantages. The maximum common subgraph serves as the reference object for similarity: the graph structure model of each text is converted into the maximum common subgraph model with a certain probability, and the similarity of the texts is then calculated from the degree of similarity of these conversions. Using a common third-party text model as the reference object for the similarity calculation ensures comparability and objectivity. The physical law of universal gravitation is introduced to measure how tightly the keyword pairs in the graph structure model of a text are connected, which captures the semantic and structural correlation of the keyword pairs in the text.
Brief description of the drawings
Fig. 1 is a flowchart of the text similarity calculation method based on universal gravitation according to the present invention.
Embodiments
Embodiments of the present invention are further described below with reference to the accompanying drawing.
Embodiment one: referring to Fig. 1, this text similarity calculation method based on universal gravitation is characterized in that the maximum common subgraph of the texts serves as the reference object for similarity, a notion of tightness derived from the physical law of universal gravitation quantifies that reference object, and the degree to which each text can be converted into the maximum common subgraph serves as the measure of similarity.
The tightness of the maximum common subgraph of the texts is the correlation between keywords; it depends on the weights of the keywords and on the weight of the keyword pair, and is calculated by the following formula:
[Tightness formula: image not reproduced in the source text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the tightness constant, G=1.
The similarity of the texts is calculated by the following formulas:
[Similarity formulas: images not reproduced in the source text]
The symbols in these formulas, which are likewise images missing from the source text, denote in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the tightness of the maximum common subgraph for text A and text B, respectively.
Embodiment two: this text similarity calculation method based on universal gravitation is applied to calculate text similarity on 70 papers published in TKDE from 2011 to 2012. As shown in Fig. 1, the steps of the method in this embodiment are as follows:
S1. Input any two texts from the domain corpus, for example two texts from Volume 24, Issue 1 (2011);
S2. Generate the text representations and the maximum common subgraph; for example, represent each text with a graph structure representation model, and generate the maximum common subgraph of the texts with a maximum common subgraph algorithm;
S3. Calculate the tightness of the maximum common subgraph of the texts based on universal gravitation, using the following formula:
[Tightness formula: image not reproduced in the source text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the tightness constant, G=1;
S4. Calculate the similarity of the texts, using the following formulas:
[Similarity formulas: images not reproduced in the source text]
The symbols in these formulas, which are likewise images missing from the source text, denote in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the tightness of the maximum common subgraph for text A and text B, respectively;
S5. Output the similarity of the texts.

Claims (3)

1. A text similarity calculation method based on universal gravitation, characterized in that: the maximum common subgraph of the texts serves as the reference object for similarity; a notion of tightness derived from the physical law of universal gravitation quantifies that reference object; and the degree to which each text can be converted into the maximum common subgraph serves as the measure of similarity; the tightness is the correlation between keywords and depends on the weights of the keywords and on the weight of the keyword pair; the specific steps are as follows:
(1) input any two texts from a domain corpus;
(2) generate the text representations and the maximum common subgraph;
(3) calculate the tightness of the maximum common subgraph of the texts based on universal gravitation;
(4) calculate the similarity of the texts;
(5) output the similarity of the texts.
2. The text similarity calculation method based on universal gravitation according to claim 1, characterized in that the tightness of the maximum common subgraph of the texts in step (3) is calculated by the following formula:
[Tightness formula: image not reproduced in the source text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the tightness constant, G=1.
3. The text similarity calculation method based on universal gravitation according to claim 1, characterized in that the similarity of the texts in step (4) is calculated by the following formulas:
[Similarity formulas: images not reproduced in the source text]
The symbols in these formulas, which are likewise images missing from the source text, denote in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the tightness of the maximum common subgraph for text A and text B, respectively.
CN201310093108.5A 2012-07-16 2013-03-22 Text similarity calculation method based on universal gravitation Expired - Fee Related CN103164394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310093108.5A CN103164394B (en) 2012-07-16 2013-03-22 Text similarity calculation method based on universal gravitation

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN2012102438628 2012-07-16
CN201210243862 2012-07-16
CN201210243862.8 2012-07-16
CN201310093108.5A CN103164394B (en) 2012-07-16 2013-03-22 Text similarity calculation method based on universal gravitation

Publications (2)

Publication Number Publication Date
CN103164394A (en) 2013-06-19
CN103164394B (en) 2016-08-03

Family

ID=48587490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310093108.5A Expired - Fee Related CN103164394B (en) 2012-07-16 2013-03-22 Text similarity calculation method based on universal gravitation

Country Status (1)

Country Link
CN (1) CN103164394B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080126335A1 (en) * 2006-11-29 2008-05-29 Oracle International Corporation Efficient computation of document similarity
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN GUO et al.: "Word activation forces map word networks", 《NATURE》, vol. 23, no. 8, 31 December 2011 (2011-12-31), pages 1 - 16 *
吴江宁 et al.: "基于最大公共子图的文本相似度算法研究" [Research on a text similarity algorithm based on the maximum common subgraph], 《情报学报》, vol. 29, no. 5, 31 October 2010 (2010-10-31), pages 785 - 791 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628749B2 (en) 2015-11-17 2020-04-21 International Business Machines Corporation Automatically assessing question answering system performance across possible confidence values
US10282678B2 (en) 2015-11-18 2019-05-07 International Business Machines Corporation Automated similarity comparison of model answers versus question answering system output
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN106997382B (en) * 2017-03-22 2020-12-01 山东大学 Innovative creative tag automatic labeling method and system based on big data
CN109753600A (en) * 2018-12-20 2019-05-14 航天信息股份有限公司 Handle the method, apparatus asked questions and storage medium
CN114201598A (en) * 2022-02-18 2022-03-18 药渡经纬信息科技(北京)有限公司 Text recommendation method and text recommendation device
CN114201598B (en) * 2022-02-18 2022-05-17 药渡经纬信息科技(北京)有限公司 Text recommendation method and text recommendation device

Also Published As

Publication number Publication date
CN103164394B (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CN103207905B (en) Method for calculating text similarity based on a target text
CN103186574B (en) Method and apparatus for generating search results
Yang Research and realization of internet public opinion analysis based on improved TF-IDF algorithm
CN104239512A (en) Text recommendation method
CN103164394A (en) Text similarity calculation method based on universal gravitation
CN110909550B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN105260474A (en) Microblog user influence computing method based on information interaction network
CN103617157A (en) Text similarity calculation method based on semantics
WO2015032301A1 (en) Method for detecting the similarity of the patent documents on the basis of new kernel function luke kernel
CN108170650B (en) Text comparison method and text comparison device
CN105183770A (en) Chinese integrated entity linking method based on graph model
CN104216968A (en) Rearrangement method and system based on document similarity
CN110147425A (en) Keyword extraction method, device, computer equipment and storage medium
CN105095430A (en) Method and device for setting up word network and extracting keywords
CN103324664A (en) Document similarity distinguishing method based on Fourier transform
Na et al. Automatically generation and evaluation of stop words list for Chinese patents
Petcu et al. The one‐dimensional shallow water equations with transparent boundary conditions
WO2020052547A1 (en) Method and apparatus for identifying new words in spam message, and electronic device
CN103744918A (en) Vertical domain based micro blog searching ranking method and system
WO2023109143A1 (en) Real store verification method and apparatus, device, and storage medium
Chen et al. A backward Euler alternating direction implicit difference scheme for the three‐dimensional fractional evolution equation
CN103150388A (en) Method and device for extracting key words
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship
CN104331443A (en) Industry data source detection method
Tung et al. Improved linear programming for fuzzy weighted average

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803

Termination date: 20190322