CN103164394A - Text similarity calculation method based on universal gravitation
- Publication number
- CN103164394A
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- keyword
- common subgraph
- compactness
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity calculation method based on universal gravitation. The method comprises the following steps: (1) any two texts from a domain corpus are input; (2) the text representations and their maximum common subgraph are generated; (3) the compactness of the maximum common subgraph of the texts is calculated based on universal gravitation; (4) the similarity of the texts is calculated; (5) the similarity of the texts is output. The method takes the maximum common subgraph of the texts as the reference object for similarity, quantifies that reference object with a notion of compactness derived from the law of universal gravitation in physics, and uses the degree to which a text can be converted into the maximum common subgraph as the measure of similarity. The method is simple, convenient to use, easy to operate, and effective.
Description
Technical field
The present invention relates to a method for calculating text similarity. Specifically, it takes the maximum common subgraph of the texts as the reference object for similarity, quantifies that reference object with a notion of compactness derived from the physical law of universal gravitation, and uses the degree to which a text can be converted into the maximum common subgraph as the measure of similarity. It is a text similarity calculation method based on universal gravitation.
Background
At present, the most widely used text similarity calculation method is the cosine method based on the vector space model. The vector space model represents a text as a weight vector; each component of the vector corresponds to a term, and the weight of each term is determined by the TF-IDF method. The cosine formula computes the cosine of the angle between the weight vectors of two texts and takes this value as the text similarity.
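The cosine method described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the patent: the function names `tfidf_vectors` and `cosine`, the smoothed IDF variant, and the three sample documents are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch): log((1 + N) / (1 + df)) + 1,
        # which keeps every weight strictly positive.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, already tokenized by whitespace.
docs = [
    "graph based text similarity".split(),
    "text similarity with graph models".split(),
    "shallow water equations".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # the two documents share three terms
sim_unrelated = cosine(vecs[0], vecs[2])  # the two documents share no terms
print(sim_related, sim_unrelated)
```

Two documents with no term in common always score 0 under this model, regardless of how their terms relate to one another.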
However, calculating text similarity with the cosine method based on the vector space model has the following shortcomings:
(1) The vector space model treats a text as a set of terms and regards the terms as independent of one another, so a large amount of the text's structural information is lost.
(2) The cosine formula takes into account neither the semantic correlation nor the structural correlation between the keywords in a text.
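The loss of structural information can be seen in a small example. The sketch below is illustrative only; the two sentences and the helper names `bag_of_words` and `adjacent_pairs` are assumptions. Two sentences with opposite meanings produce identical term counts, so any similarity computed from those counts cannot tell them apart, while even a crude structural view (adjacent word pairs) does.

```python
from collections import Counter


def bag_of_words(text):
    """Term counts only: word order and adjacency are discarded."""
    return Counter(text.lower().split())


def adjacent_pairs(text):
    """A crude structural view: the set of adjacent word pairs."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


a = "the dog bites the man"
b = "the man bites the dog"

same_bag = bag_of_words(a) == bag_of_words(b)        # identical term vectors
same_pairs = adjacent_pairs(a) == adjacent_pairs(b)  # different structure
print(same_bag, same_pairs)
```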
Summary of the invention
In view of the shortcomings of the cosine method based on the vector space model, the object of the present invention is to provide a text similarity calculation method based on universal gravitation. The method takes the maximum common subgraph of the texts as the reference object for similarity, quantifies that reference object with a notion of compactness derived from the physical law of universal gravitation, and uses the degree to which a text can be converted into the maximum common subgraph as the measure of similarity.
To achieve this object, the concept of the invention is as follows: the maximum common subgraph of the texts is taken as the reference object for similarity; a notion of compactness derived from the physical law of universal gravitation quantifies the reference object; and the degree to which a text can be converted into the maximum common subgraph is the measure of similarity. The compactness measures the correlation between keywords and depends on the weights of the keywords and of the keyword pairs.
In accordance with the above concept, the present invention adopts the following technical solution:
A text similarity calculation method based on universal gravitation, characterized in that its concrete steps are as follows:
(1) input any two texts from a domain corpus;
(2) generate the text representations and the maximum common subgraph;
(3) calculate the compactness of the maximum common subgraph of the texts based on universal gravitation;
(4) calculate the similarity of the texts;
(5) output the similarity of the texts.
The compactness of the maximum common subgraph of the texts is calculated by the following formula:
[formula not reproduced in this text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the compactness constant, with G = 1.
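For reference, the physical law from which the notion of compactness is derived is Newton's law of universal gravitation, shown here in its standard form. The symbols $F$, $m_1$, $m_2$ and $r$ are the usual physics notation rather than the patent's. Reading the keyword weights $M$ and $m$ as the two masses and the pair weight $R$ as the distance is only an assumed analogy, and the role of $n$ is not covered by it.

```latex
% Newton's law of universal gravitation (standard physics form).
% F: attractive force, G: gravitational constant,
% m_1, m_2: the two masses, r: the distance between them.
F = G \, \frac{m_1 \, m_2}{r^{2}}
```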
The similarity of the texts is calculated by the following formula (the formula and its symbols are not reproduced in this text). Its quantities are, in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively.
Compared with the prior art, the text similarity calculation method based on universal gravitation of the present invention has the following outstanding features and advantages. It takes the maximum common subgraph as the reference object for similarity: the graph-structure model of each text is converted into the maximum common subgraph model with a certain probability, and the similarity of the texts is then calculated from the degree of that conversion. Using a common third-party text model as the reference object for similarity ensures comparability and objectivity. The physical law of universal gravitation is introduced to measure how tightly keyword pairs are connected in the graph-structure model of a text, capturing both the semantic correlation and the structural correlation of keyword pairs.
Description of the drawings
Fig. 1 is a flowchart of the text similarity calculation method based on universal gravitation of the present invention.
Embodiments
Embodiments of the present invention are further described below with reference to the accompanying drawing.
Embodiment one: Referring to Fig. 1, this text similarity calculation method based on universal gravitation is characterized in that the maximum common subgraph of the texts is taken as the reference object for similarity, a notion of compactness derived from the physical law of universal gravitation quantifies the reference object, and the degree to which a text can be converted into the maximum common subgraph is the measure of similarity.
The compactness of the maximum common subgraph of the texts measures the correlation between keywords and depends on the weights of the keywords and of the keyword pairs. It is calculated by the following formula:
[formula not reproduced in this text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the compactness constant, with G = 1.
The similarity of the texts is calculated by the following formula (the formula and its symbols are not reproduced in this text). Its quantities are, in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively.
Embodiment two: This text similarity calculation method based on universal gravitation is applied to calculate text similarity on 70 papers published in TKDE from 2011 to 2012. As shown in Fig. 1, the steps of the method in this embodiment are as follows:
S1. Input any two texts from the domain corpus, for example two texts from volume 24, issue 1 (2011);
S2. Generate the text representations and the maximum common subgraph: for example, represent each text with a graph-structure representation model, and use a maximum common subgraph algorithm to generate the maximum common subgraph of the texts;
S3. Calculate the compactness of the maximum common subgraph of the texts based on universal gravitation. The compactness is calculated by the following formula:
[formula not reproduced in this text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the compactness constant, with G = 1;
S4. Calculate the similarity of the texts. The similarity is calculated by the following formula (the formula and its symbols are not reproduced in this text). Its quantities are, in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively;
S5. Output the similarity of the texts.
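Step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's algorithm, since the text names neither the graph model nor the maximum common subgraph algorithm it uses. The function names `keyword_graph` and `common_subgraph`, the rule that two keywords are linked when they occur in the same sentence, and the sample sentences and keyword set are all hypothetical. The sketch also assumes every keyword appears as a single, uniquely labeled node; under that assumption the maximum common subgraph can be obtained by intersecting the node sets and the edge sets, whereas for general graphs the problem is much harder.

```python
from collections import Counter
from itertools import combinations


def keyword_graph(sentences, keywords):
    """Build (nodes, edges) for one text.

    nodes: the set of keywords that occur in the text.
    edges: Counter mapping a sorted keyword pair to the number of
           sentences in which both keywords occur (assumed edge rule).
    """
    nodes = set()
    edges = Counter()
    for sentence in sentences:
        present = sorted({tok for tok in sentence.lower().split() if tok in keywords})
        nodes.update(present)
        # Sorting makes each undirected pair canonical: (a, b) with a < b.
        for pair in combinations(present, 2):
            edges[pair] += 1
    return nodes, edges


def common_subgraph(graph_a, graph_b):
    """Common subgraph of two graphs with uniquely labeled nodes."""
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    return nodes_a & nodes_b, set(edges_a) & set(edges_b)


# Hypothetical inputs: two short texts and a shared keyword set.
keywords = {"text", "similarity", "graph", "model", "structure"}
text_a = ["text similarity uses a graph model", "the graph model keeps structure"]
text_b = ["a graph model represents text", "similarity is computed on structure"]

graph_a = keyword_graph(text_a, keywords)
graph_b = keyword_graph(text_b, keywords)
mcs_nodes, mcs_edges = common_subgraph(graph_a, graph_b)
print(sorted(mcs_nodes))
print(sorted(mcs_edges))
```

The edge counters kept by `keyword_graph` record how often each keyword pair occurs in each text, which is the kind of per-text quantity that the compactness in step S3 is described as depending on.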
Claims (3)
1. A text similarity calculation method based on universal gravitation, characterized in that: the maximum common subgraph of the texts is taken as the reference object for similarity; a notion of compactness derived from the physical law of universal gravitation quantifies the reference object; and the degree to which a text can be converted into the maximum common subgraph is the measure of similarity; the compactness measures the correlation between keywords and depends on the weights of the keywords and of the keyword pairs; the concrete steps are as follows:
(1) input any two texts from a domain corpus;
(2) generate the text representations and the maximum common subgraph;
(3) calculate the compactness of the maximum common subgraph of the texts based on universal gravitation;
(4) calculate the similarity of the texts;
(5) output the similarity of the texts.
2. The text similarity calculation method based on universal gravitation according to claim 1, characterized in that the compactness of the maximum common subgraph of the texts in step (3) is calculated by the following formula:
[formula not reproduced in this text]
where n is the number of occurrences of a keyword pair, R is the weight of the keyword pair, M and m are the weights of the two keywords in the pair, and G is the compactness constant, with G = 1.
3. The text similarity calculation method based on universal gravitation according to claim 1, characterized in that the similarity of the texts in step (4) is calculated by the following formula (the formula and its symbols are not reproduced in this text). Its quantities are, in order: the number of keywords in the maximum common subgraph; the number of keyword pairs in the maximum common subgraph; the numbers of keywords in text A and text B, respectively; the numbers of keyword pairs in text A and text B, respectively; and the compactness of the maximum common subgraph of text A and of text B, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310093108.5A CN103164394B (en) | 2012-07-16 | 2013-03-22 | Text similarity calculation method based on universal gravitation |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210243862.8 | 2012-07-16 | ||
CN201310093108.5A CN103164394B (en) | 2012-07-16 | 2013-03-22 | Text similarity calculation method based on universal gravitation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103164394A (en) | 2013-06-19 |
CN103164394B (en) | 2016-08-03 |
Family
ID=48587490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310093108.5A Expired - Fee Related CN103164394B (en) | Text similarity calculation method based on universal gravitation | 2012-07-16 | 2013-03-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103164394B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | 北京百问百答网络技术有限公司 | Text similarity, acceptation similarity calculating method and system and application system |
US20080126335A1 (en) * | 2006-11-29 | 2008-05-29 | Oracle International Corporation | Efficient computation of document similarity |
Non-Patent Citations (2)
Title |
---|
JUN GUO et al.: "Word activation forces map word networks", 《NATURE》, vol. 23, no. 8, 31 December 2011 (2011-12-31), pages 1 - 16 * |
吴江宁 (Wu Jiangning) et al.: "Research on a text similarity algorithm based on the maximum common subgraph", 《情报学报》, vol. 29, no. 5, 31 October 2010 (2010-10-31), pages 785 - 791 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10628749B2 (en) | 2015-11-17 | 2020-04-21 | International Business Machines Corporation | Automatically assessing question answering system performance across possible confidence values |
US10282678B2 (en) | 2015-11-18 | 2019-05-07 | International Business Machines Corporation | Automated similarity comparison of model answers versus question answering system output |
CN106997382A (en) * | 2017-03-22 | 2017-08-01 | 山东大学 | Innovation intention label automatic marking method and system based on big data |
CN106997382B (en) * | 2017-03-22 | 2020-12-01 | 山东大学 | Innovative creative tag automatic labeling method and system based on big data |
CN109753600A (en) * | 2018-12-20 | 2019-05-14 | 航天信息股份有限公司 | Handle the method, apparatus asked questions and storage medium |
CN114201598A (en) * | 2022-02-18 | 2022-03-18 | 药渡经纬信息科技(北京)有限公司 | Text recommendation method and text recommendation device |
CN114201598B (en) * | 2022-02-18 | 2022-05-17 | 药渡经纬信息科技(北京)有限公司 | Text recommendation method and text recommendation device |
Also Published As
Publication number | Publication date |
---|---|
CN103164394B (en) | 2016-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103207905B (en) | Method for calculating text similarity based on a target text | |
CN103186574B (en) | Method and apparatus for generating search results | |
Yang | Research and realization of internet public opinion analysis based on improved TF-IDF algorithm | |
CN104239512A (en) | Text recommendation method | |
CN103164394A (en) | Text similarity calculation method based on universal gravitation | |
CN110909550B (en) | Text processing method, text processing device, electronic equipment and readable storage medium | |
CN105260474A (en) | Microblog user influence computing method based on information interaction network | |
CN103617157A (en) | Text similarity calculation method based on semantics | |
WO2015032301A1 (en) | Method for detecting the similarity of the patent documents on the basis of new kernel function luke kernel | |
CN108170650B (en) | Text comparison method and text comparison device | |
CN105183770A (en) | Chinese integrated entity linking method based on graph model | |
CN104216968A (en) | Rearrangement method and system based on document similarity | |
CN110147425A (en) | Keyword extraction method, device, computer equipment and storage medium | |
CN105095430A (en) | Method and device for setting up word network and extracting keywords | |
CN103324664A (en) | Document similarity distinguishing method based on Fourier transform | |
Na et al. | Automatically generation and evaluation of stop words list for Chinese patents | |
Petcu et al. | The one‐dimensional shallow water equations with transparent boundary conditions | |
WO2020052547A1 (en) | Method and apparatus for identifying new words in spam message, and electronic device | |
CN103744918A (en) | Vertical domain based micro blog searching ranking method and system | |
WO2023109143A1 (en) | Real store verification method and apparatus, device, and storage medium | |
Chen et al. | A backward Euler alternating direction implicit difference scheme for the three‐dimensional fractional evolution equation | |
CN103150388A (en) | Method and device for extracting key words | |
CN113657116B (en) | Social media popularity prediction method and device based on visual semantic relationship | |
CN104331443A (en) | Industry data source detection method | |
Tung et al. | Improved linear programming for fuzzy weighted average |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160803 Termination date: 20190322 |