CN102567537A - Short text similarity computing method based on searched result quantity - Google Patents

Short text similarity computing method based on searched result quantity

Info

Publication number
CN102567537A
Authority
CN
China
Prior art keywords
short text
retrieval
result
corpus
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104583763A
Other languages
Chinese (zh)
Inventor
李琳
钟珞
袁景凌
夏红霞
刘东飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN2011104583763A
Publication of CN102567537A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for computing short text similarity based on the number of search results. The method comprises the following steps: (1) preprocessing the short texts; (2) submitting each single short text, and each pairwise combination of short texts, to a large-scale corpus as search queries; and (3) computing the similarity between each pair of short texts from the number of search results returned. The method does not depend on traditional text processing and obtains results quickly and effectively. When a short text is used as a query, the large-scale corpus returns search results that contain it. The content of those results includes textual interpretations of the short text, so the number of results can be regarded as a compressor that implies the semantic interpretation of the short text in the corpus.

Description

A short text similarity computing method based on the number of retrieval results
Technical field
The present invention relates to short text similarity computation, and in particular to a short text similarity computing method based on the number of retrieval results. It belongs to the field of text mining.
Background
Short text refers to text of short length. The term covers a wide range of forms, and a growing number of communication platforms use short text with increasing frequency, for example mobile phone short messages, instant messages, BBS titles, microblogs, online chat records, blogs and news comments. The volume of short text data is growing rapidly, and short text mining has broad application prospects in fields such as topic tracking and detection, popular word analysis, public opinion early warning and image retrieval.
However, because short texts are short, their sample features are very sparse, which hinders retrieval and analysis. In addition, short texts are concisely worded or use non-standard terms, and often go beyond the traditional or normal literal meaning. For example, the currently popular term "microblog" (微博, wēibó) is often referred to in Internet slang by its homophone "scarf" (围脖, wéibó). These distinctive linguistic features greatly reduce the accuracy of short text similarity computation, so effectively improving that accuracy is a key difficulty in short text mining.
To address this difficulty, we propose a short text similarity computing method based on the number of retrieval results, which exploits the broad coverage of a large-scale corpus to understand the meaning of short texts at the semantic level.
Summary of the invention
The object of the present invention is to provide a short text similarity computing method based on the number of retrieval results that overcomes the sparse sample features and non-standard wording of short texts and improves the accuracy of similarity computation through semantic analysis.
To achieve the above object, the present invention comprises the following steps:
(1) preprocess the short texts;
(2) submit each preprocessed single short text, and each pairwise combination of preprocessed short texts, to a corpus as retrieval queries;
(3) compute the pairwise similarity of the short texts from the number of retrieval results returned by the corpus.
In the above technical scheme, step (1) specifically comprises:
(1-1) filter the short text using a general stop word list, the general stop words being modal particles, adverbs, prepositions and conjunctions;
(1-2) strip the inflectional endings of each word in the short text, extract the word stems, and compute the term frequency of each stem.
In the above technical scheme, the corpus in step (2) is a Web search engine or Wikipedia.
In the above technical scheme, step (3) computes the similarity between short texts s1 and s2 using the following formula:
$$\mathrm{Similarity}(s_1, s_2) = \frac{\log f(s_1, s_2)}{\log f(s_1) + \log f(s_2) - \log f(s_1, s_2)} \tag{1}$$
where f(s1) is the number of retrieval results obtained by submitting short text s1 to the corpus as a query; f(s2) is the number of retrieval results obtained by submitting short text s2 to the corpus as a query; and f(s1, s2) is the number of retrieval results obtained by submitting the combination of s1 and s2 to the corpus as a query.
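Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `similarity` and the handling of zero counts and of a non-positive denominator are assumptions, since the text does not say what to do when a logarithm is undefined or the denominator vanishes.

```python
import math

def similarity(f1: int, f2: int, f12: int) -> float:
    """Equation (1): log f(s1,s2) / (log f(s1) + log f(s2) - log f(s1,s2)).

    f1, f2 are the result counts for each short text alone;
    f12 is the result count for the combined query.
    """
    # Assumption: the logarithm of a zero count is undefined, so treat
    # any missing evidence as zero similarity.
    if min(f1, f2, f12) <= 0:
        return 0.0
    numerator = math.log(f12)
    denominator = math.log(f1) + math.log(f2) - numerator
    # Assumption: a non-positive denominator (for example all counts equal
    # to 1, or inconsistent count estimates) is also mapped to zero.
    if denominator <= 0:
        return 0.0
    return numerator / denominator
```

Because the formula is a ratio of logarithms, the choice of logarithm base does not change the result.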
In the above technical scheme, the length of each short text is less than or equal to 200 characters.
Compared with the prior art, the method of the invention does not rely on traditional text processing and obtains results quickly and effectively. When a short text is used as a query, the corpus returns retrieval results that contain it. The content of those results includes textual interpretations of the short text, and their number can be regarded as a compressor that encapsulates the semantic interpretation of the short text in the corpus. The proposed method is an effective semantic-level improvement over conventional processing of the literal text itself, and it can promptly reflect how the meaning of a short text changes over time.
The present invention will become clearer from the following description taken in conjunction with the accompanying drawing, which illustrates an embodiment of the invention.
Description of drawings
Fig. 1 is the main flow chart of the short text similarity computing method based on the number of retrieval results according to the present invention.
Embodiment
An embodiment of the invention is now described with reference to the accompanying drawing. As shown in Fig. 1, taking two short texts s1 and s2 as an example, the short text similarity computing method based on the number of retrieval results of this embodiment comprises the following steps:
Step S1: preprocess the short texts, whose length is less than or equal to 200 characters. The specific steps are:
Step S1-1: filter the short text using a general stop word list; the general stop words are modal particles, adverbs, prepositions and conjunctions;
Step S1-2: strip the inflectional endings of each word in the short text, extract the word stems, and compute the term frequency of each stem.
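The preprocessing of step S1 might look roughly like the following Python sketch. The stop word list, the suffix list and the naive suffix-stripping stemmer are all hypothetical placeholders chosen for illustration; the text does not name a particular stop word list or stemming algorithm, and the whitespace tokenization only suits languages written with spaces.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list (prepositions, conjunctions,
# adverbs, particles); a real list would be far larger.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or", "but", "very"}

# Hypothetical suffix list for a naive stemmer (this is not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word: str) -> str:
    """Strip the first matching inflectional ending, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str, max_len: int = 200) -> Counter:
    """Step S1: length check, stop word filtering, stemming, term frequency."""
    if len(text) > max_len:
        raise ValueError("not a short text: longer than max_len characters")
    # Lowercase and trim surrounding punctuation from whitespace-separated tokens.
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    # Step S1-1: drop stop words. Step S1-2: stem what remains.
    stems = [stem(t) for t in tokens if t and t not in STOP_WORDS]
    # Term frequency of each stem.
    return Counter(stems)
```

Calling `preprocess("The cats jumped and the cat jumps")` yields a counter with two stems, `cat` and `jump`, each occurring twice.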
Step S2: submit each single short text, and each pairwise combination of short texts, to a large-scale corpus as retrieval queries. The corpus used is a Web search engine or Wikipedia.
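Step S2 can be sketched as follows. Since no search engine API is specified in the text, the sketch replaces the corpus with a hypothetical in-memory document list (`make_toy_counter`), and it assumes that the "combination" of two short texts means a conjunctive query in which both must appear in a result. All function names here are illustrative.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

def build_queries(texts: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return the single-text queries and every unordered pair of texts."""
    return list(texts), list(combinations(texts, 2))

def make_toy_counter(documents: List[str]) -> Callable[..., int]:
    """Hypothetical stand-in for a search engine or Wikipedia.

    The returned function counts the documents that contain every query term
    (case-insensitive substring match).
    """
    lowered = [d.lower() for d in documents]

    def count(*terms: str) -> int:
        return sum(all(t.lower() in d for t in terms) for d in lowered)

    return count

def collect_counts(
    texts: List[str], count: Callable[..., int]
) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int]]:
    """Submit single and pairwise queries and record the result counts."""
    singles, pairs = build_queries(texts)
    single_counts = {s: count(s) for s in singles}
    pair_counts = {(a, b): count(a, b) for a, b in pairs}
    return single_counts, pair_counts
```

In a real deployment `count` would wrap whatever hit-count interface the chosen corpus exposes; the rest of the pipeline only needs the two dictionaries of counts.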
Step S3: from the number of retrieval results returned, compute the similarity between short texts s1 and s2 using the following formula:
$$\mathrm{Similarity}(s_1, s_2) = \frac{\log f(s_1, s_2)}{\log f(s_1) + \log f(s_2) - \log f(s_1, s_2)} \tag{1}$$
where f(s1) is the number of retrieval results obtained by submitting short text s1 to the corpus as a query; f(s2) is the number of retrieval results obtained by submitting short text s2 to the corpus as a query; and f(s1, s2) is the number of retrieval results obtained by submitting the combination of s1 and s2 to the corpus as a query.
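As a worked illustration of Equation (1), suppose the corpus returns f(s1) = 1000, f(s2) = 100 and f(s1, s2) = 10. These counts are made up for the example and do not come from the source. Taking base-10 logarithms:

$$\mathrm{Similarity}(s_1, s_2) = \frac{\log_{10} 10}{\log_{10} 1000 + \log_{10} 100 - \log_{10} 10} = \frac{1}{3 + 2 - 1} = 0.25$$

If the two short texts always occurred together, so that f(s1) = f(s2) = f(s1, s2) = 100, the same formula would give 2 / (2 + 2 - 2) = 1.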
Finally, normalization is applied so that the computed similarity values fall within the range [0, 1].
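The text does not say which normalization is used. One common choice, shown here purely as an assumption, is min-max scaling over the set of pairwise scores, with clamping as a fallback when all scores are equal. The function name `normalize_scores` is hypothetical.

```python
from typing import Dict, Hashable

def normalize_scores(scores: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Map a set of similarity scores into [0, 1] by min-max scaling."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: identical scores cannot be rescaled, so clamp each
        # value into [0, 1] instead.
        return {k: min(1.0, max(0.0, v)) for k, v in scores.items()}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}
```

Min-max scaling makes scores comparable within one batch of short texts but not across batches; plain clamping would preserve absolute values at the cost of flattening outliers.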
In the above method, preprocessing refines the short texts and highlights their keywords. When a short text is used as a query, the large-scale corpus returns retrieval results that contain it. The content of those results includes textual interpretations of the short text, and their number can be regarded as a compressor that encapsulates the semantic interpretation of the short text in the corpus. The method of this embodiment does not rely on traditional text processing and obtains similarity results quickly and effectively. It is an effective semantic-level improvement over conventional processing of the literal text itself, and it can promptly reflect how the meaning of a short text changes over time.

Claims (5)

1. A short text similarity computing method based on the number of retrieval results, characterized in that it comprises the following steps:
(1) preprocessing the short texts;
(2) submitting each preprocessed single short text, and each pairwise combination of preprocessed short texts, to a corpus as retrieval queries;
(3) computing the pairwise similarity of the short texts from the number of retrieval results returned by the corpus.
2. The short text similarity computing method based on the number of retrieval results according to claim 1, characterized in that step (1) specifically comprises:
(1-1) filtering the short text using a general stop word list, the general stop words being modal particles, adverbs, prepositions and conjunctions;
(1-2) stripping the inflectional endings of each word in the short text, extracting the word stems, and computing the term frequency of each stem.
3. The short text similarity computing method based on the number of retrieval results according to claim 1, characterized in that the search engine used in step (2) is a Web search engine or Wikipedia.
4. The short text similarity computing method based on the number of retrieval results according to claim 1, characterized in that the similarity in step (3) is computed by the following formula:
$$\mathrm{Similarity}(s_1, s_2) = \frac{\log f(s_1, s_2)}{\log f(s_1) + \log f(s_2) - \log f(s_1, s_2)}$$
where f(s1) is the number of retrieval results obtained by submitting short text s1 to the corpus as a query; f(s2) is the number of retrieval results obtained by submitting short text s2 to the corpus as a query; and f(s1, s2) is the number of retrieval results obtained by submitting the combination of s1 and s2 to the corpus as a query.
5. The short text similarity computing method based on the number of retrieval results according to any one of claims 1 to 4, characterized in that the length of each short text is less than or equal to 200 characters.
CN2011104583763A 2011-12-31 2011-12-31 Short text similarity computing method based on searched result quantity Pending CN102567537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104583763A CN102567537A (en) 2011-12-31 2011-12-31 Short text similarity computing method based on searched result quantity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104583763A CN102567537A (en) 2011-12-31 2011-12-31 Short text similarity computing method based on searched result quantity

Publications (1)

Publication Number Publication Date
CN102567537A true CN102567537A (en) 2012-07-11

Family

ID=46412936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104583763A Pending CN102567537A (en) 2011-12-31 2011-12-31 Short text similarity computing method based on searched result quantity

Country Status (1)

Country Link
CN (1) CN102567537A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035992A (en) * 2014-06-10 2014-09-10 复旦大学 Method and system for processing text semantics by utilizing image processing technology and semantic vector space
CN104052654A (en) * 2014-06-25 2014-09-17 金硕澳门离岸商业服务有限公司 Method and system for achieving chatting online
CN106682174A (en) * 2016-12-28 2017-05-17 南华大学 Big data application based short text information searching system
CN109635275A (en) * 2018-11-06 2019-04-16 交控科技股份有限公司 Literature content retrieval and recognition methods and device
CN109871429A (en) * 2019-01-31 2019-06-11 郑州轻工业学院 Merge the short text search method of Wikipedia classification and explicit semantic feature
CN110891010A (en) * 2018-09-05 2020-03-17 百度在线网络技术(北京)有限公司 Method and apparatus for transmitting information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
CN102262663A (en) * 2011-07-25 2011-11-30 中国科学院软件研究所 Method for repairing software defect reports

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DANUSHKA BOLLEGALA et al.: "Measuring Semantic Similarity between Words using Web Search Engines", WWW '07: Proceedings of the 16th International Conference on World Wide Web, 31 December 2007 (2007-12-31), pages 3 - 2 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035992A (en) * 2014-06-10 2014-09-10 复旦大学 Method and system for processing text semantics by utilizing image processing technology and semantic vector space
CN104035992B (en) * 2014-06-10 2017-05-10 复旦大学 Method and system for processing text semantics by utilizing image processing technology and semantic vector space
CN104052654A (en) * 2014-06-25 2014-09-17 金硕澳门离岸商业服务有限公司 Method and system for achieving chatting online
CN106682174A (en) * 2016-12-28 2017-05-17 南华大学 Big data application based short text information searching system
CN106682174B (en) * 2016-12-28 2020-04-17 南华大学 Short text information retrieval system based on big data application
CN110891010A (en) * 2018-09-05 2020-03-17 百度在线网络技术(北京)有限公司 Method and apparatus for transmitting information
CN109635275A (en) * 2018-11-06 2019-04-16 交控科技股份有限公司 Literature content retrieval and recognition methods and device
CN109871429A (en) * 2019-01-31 2019-06-11 郑州轻工业学院 Merge the short text search method of Wikipedia classification and explicit semantic feature

Similar Documents

Publication Publication Date Title
CN106598944B (en) A kind of civil aviaton's security public sentiment sentiment analysis method
CN101510222B (en) Multilayer index voice document searching method
CN102708100B (en) Method and device for digging relation keyword of relevant entity word and application thereof
CN100458795C (en) Intelligent word input method and input method system and updating method thereof
CN102253982B (en) Query suggestion method based on query semantics and click-through data
CN102567537A (en) Short text similarity computing method based on searched result quantity
CN106503049A (en) A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM
CN105068991A (en) Big data based public sentiment discovery method
CN101464897A (en) Word matching and information query method and device
CN103186574A (en) Method and device for generating searching result
WO2008027503A3 (en) Semantic search engine
CN103049435A (en) Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device
CN102880723A (en) Searching method and system for identifying user retrieval intention
CN102253971B (en) PageRank method based on quick similarity
CN104991943A (en) Music searching method and apparatus
CN102043843A (en) Method and obtaining device for obtaining target entry based on target application
CN104965823A (en) Big data based opinion extraction method
US20190171713A1 (en) Semantic parsing method and apparatus
CN101968801A (en) Method for extracting key words of single text
CN103678435A (en) Drug specification data similarity matching method
CN101650729B (en) Dynamic construction method for Web service component library and service search method thereof
CN105095430A (en) Method and device for setting up word network and extracting keywords
CN103246644A (en) Method and device for processing Internet public opinion information
CN105183765A (en) Big data-based topic extraction method
CN102650986A (en) Synonym expansion method and device both used for text duplication detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120711