CN102567537A - Short text similarity computing method based on searched result quantity - Google Patents
Short text similarity computing method based on searched result quantity
- Publication number
- CN102567537A
- Authority
- CN
- China
- Prior art keywords
- short text
- retrieval
- result
- corpus
- short
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for computing short text similarity based on the number of search results, comprising the steps of: (1) preprocessing the short texts; (2) submitting each single short text and each pairwise combination of short texts to a large-scale corpus as search queries; and (3) computing the similarity between each pair of short texts from the number of search results returned. The method does not depend on traditional text processing and obtains results quickly and effectively. When a short text is used as a query, the large-scale corpus returns search results that contain it. The content of those results includes textual interpretations of the short text, and the result count can be regarded as a compressor that implicitly encodes the semantic interpretation of the short text in the corpus.
Description
Technical field
The present invention relates to computing the similarity of short texts, and in particular to a short text similarity computing method based on the number of search results. It belongs to the field of text mining.
Background art
Short text refers to text of short length. The notion is broad, and a growing number of communication platforms use short text ever more frequently, for example mobile phone short messages, instant messages, BBS titles, microblogs, online chat logs, blogs, and news comments. The volume of short text data grows daily, and text mining of short text has broad application prospects in fields such as topic tracking and detection, popular-word analysis, public opinion early warning, and image retrieval.
However, because short texts are short, their sample features are very sparse, which hinders retrieval and analysis. In addition, short texts are tersely worded or use non-standard language, and often go beyond the traditional or normal literal meaning. For example, the term "microblog", now popular online, is often replaced in Internet slang by its Chinese homophone "muffler" (scarf). These distinctive language features greatly reduce the precision of short text similarity computation, so improving that precision effectively is the central difficulty of short text mining.
To address this difficulty, we propose a short text similarity computing method based on the number of search results, which exploits the broad coverage of a large-scale corpus to understand the meaning of short texts at the semantic level.
Summary of the invention
The object of the invention is to provide a short text similarity computing method based on the number of search results that overcomes the sparse sample features and non-standard wording of short texts and improves the precision of similarity computation through semantic analysis.
To achieve the above object, the invention comprises the following steps:
(1) preprocessing the short texts;
(2) submitting each preprocessed single short text and each pairwise combination of preprocessed short texts to a corpus as search queries;
(3) computing the similarity between each pair of short texts from the number of search results returned by the corpus.
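As an illustration of step (2), the following minimal sketch enumerates the single and pairwise queries and counts hits for each. The function names `build_queries` and `count_hits` are hypothetical, and the three-document in-memory list is only a stand-in for the large-scale corpus; how two short texts are combined into one query is not specified in the text, so joining them with a space is an assumption.

```python
from itertools import combinations


def build_queries(texts):
    """Return the single queries and all pairwise combinations of short texts."""
    singles = list(texts)
    pairs = list(combinations(texts, 2))
    return singles, pairs


def count_hits(corpus_docs, query):
    """Toy stand-in for a corpus lookup: count documents containing every query term."""
    terms = set(query.split())
    return sum(1 for doc in corpus_docs if terms <= set(doc.split()))


# Tiny illustrative corpus (an assumption; a real system would query a search engine).
corpus_docs = [
    "microblog posts spread news fast",
    "a microblog is a short blog",
    "news comments and blog posts",
]
texts = ["microblog", "blog", "news"]

singles, pairs = build_queries(texts)
single_counts = {s: count_hits(corpus_docs, s) for s in singles}
# Assumption: a pair is combined by joining both short texts into one query.
pair_counts = {p: count_hits(corpus_docs, " ".join(p)) for p in pairs}
```

For n short texts this issues n single queries and n(n-1)/2 pairwise queries, which is worth keeping in mind when the corpus is a rate-limited search service.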
In the above technical solution, step (1) specifically comprises:
(1-1) filtering the short text with a general stop-word list, the general stop words being modal particles, adverbs, prepositions, and conjunctions;
(1-2) stripping the inflectional endings from each word of the short text, extracting the word stems, and computing the term frequency of each stem.
In the above technical solution, the corpus in step (2) is a Web search engine or Wikipedia.
In the above technical solution, step (3) computes the similarity between short texts s1 and s2 using the following formula:
[formula image not available]
where f(s1) denotes the number of search results obtained by submitting short text s1 to the corpus as a search query; f(s2) is the number of search results obtained by submitting short text s2 as a search query; and f(s1, s2) is the number of search results obtained by submitting the combination of s1 and s2 as a search query.
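The formula itself is not reproduced in this text, so the following sketch does not claim to be the patented formula. It only illustrates how a similarity can be computed from the three counts f(s1), f(s2), and f(s1, s2), using a swapped-in, well-known measure: the WebJaccard coefficient described by Bollegala et al. (listed under the non-patent citations), f(s1, s2) / (f(s1) + f(s2) - f(s1, s2)). The function name `web_jaccard` is an illustrative choice.

```python
def web_jaccard(f1, f2, f12):
    """WebJaccard coefficient from hit counts (a stand-in, not the patent's formula).

    f1  -- hit count for short text s1 alone
    f2  -- hit count for short text s2 alone
    f12 -- hit count for the combination of s1 and s2
    """
    union = f1 + f2 - f12
    if f12 <= 0 or union <= 0:
        return 0.0
    # Search engines report estimated counts, so clamp in case f12 is overestimated.
    return min(1.0, f12 / union)


example = web_jaccard(120, 80, 40)  # 40 / (120 + 80 - 40)
```

With these counts the result is 0.25. Two short texts that never co-occur score 0.0, and two that always co-occur score 1.0.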
In the above technical solution, the length of each short text is at most 200 characters.
Compared with the prior art, the method does not depend on traditional text processing and obtains results quickly and effectively. When a short text is used as the query, the corpus returns the search results that contain it. The content of these results includes textual interpretations of the short text, and their count can be regarded as a compressor that encodes the semantic interpretation of the short text in that corpus. The proposed method is an effective semantic improvement over conventional processing of the literal text itself, and it can promptly reflect how the meaning of a short text changes over time.
The invention will become clearer from the following description taken together with the accompanying drawing, which illustrates an embodiment of the invention.
Description of drawings
Fig. 1 is the main flow chart of the short text similarity computing method based on the number of search results according to the invention.
Embodiment
An embodiment of the invention is now described with reference to the drawing. As shown in Fig. 1, taking two short texts s1 and s2 as an example, the short text similarity computing method based on the number of search results in this embodiment comprises the following steps:
Step S1: preprocess the short texts, each of which is at most 200 characters long. The specific steps are:
Step S1-1: filter the short text with a general stop-word list, the general stop words being modal particles, adverbs, prepositions, and conjunctions;
Step S1-2: strip the inflectional endings from each word of the short text, extract the word stems, and compute the term frequency of each stem.
Step S2: submit each single short text and each pairwise combination of short texts to a large-scale corpus as search queries; the corpus used is a Web search engine or Wikipedia.
Step S3: from the number of search results returned, compute the similarity between short texts s1 and s2 using the following formula:
[formula image not available]
where f(s1) denotes the number of search results obtained by submitting short text s1 to the corpus as a search query; f(s2) is the number of search results obtained by submitting short text s2 as a search query; and f(s1, s2) is the number of search results obtained by submitting the combination of s1 and s2 as a search query.
Finally, normalization restricts the computed similarity values to the range [0, 1].
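The text does not say which normalization is applied. As one possible reading, the sketch below uses min-max normalization over a set of raw pairwise scores. Both the choice of min-max and the name `min_max_normalize` are assumptions made for illustration.

```python
def min_max_normalize(scores):
    """Rescale raw pairwise scores linearly so that they fall in [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so there is no spread to rescale.
        return {pair: 0.0 for pair in scores}
    return {pair: (value - lo) / (hi - lo) for pair, value in scores.items()}


# Hypothetical raw scores for three pairs of short texts.
raw = {("s1", "s2"): 2.0, ("s1", "s3"): 6.0, ("s2", "s3"): 4.0}
normalized = min_max_normalize(raw)
```

Min-max scaling depends on the batch of pairs being compared, so scores from different batches are not directly comparable. A measure that is already bounded by construction would not need this step.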
In the above method, preprocessing refines the short text and brings out its keywords. When a short text is used as the query, the large-scale corpus returns the search results that contain it. The content of these results includes textual interpretations of the short text, and their count can be regarded as a compressor that encodes the semantic interpretation of the short text in that corpus. The method of this embodiment does not depend on traditional text processing and obtains the similarity result quickly and effectively. It is an effective semantic improvement over conventional processing of the literal text itself, and it can promptly reflect how the meaning of a short text changes over time.
Claims (5)
1. A short text similarity computing method based on the number of search results, characterized in that it comprises the following steps:
(1) preprocessing the short texts;
(2) submitting each preprocessed single short text and each pairwise combination of preprocessed short texts to a corpus as search queries;
(3) computing the similarity between each pair of short texts from the number of search results returned by the corpus.
2. The short text similarity computing method based on the number of search results according to claim 1, characterized in that step (1) specifically comprises:
(1-1) filtering the short text with a general stop-word list, the general stop words being modal particles, adverbs, prepositions, and conjunctions;
(1-2) stripping the inflectional endings from each word of the short text, extracting the word stems, and computing the term frequency of each stem.
3. The short text similarity computing method based on the number of search results according to claim 1, characterized in that the search engine used in step (2) is a Web search engine or Wikipedia.
4. The short text similarity computing method based on the number of search results according to claim 1, characterized in that the similarity in step (3) is computed by the following formula:
[formula image not available]
where f(s1) is the number of search results obtained by submitting short text s1 to the corpus as a search query; f(s2) is the number of search results obtained by submitting short text s2 as a search query; and f(s1, s2) is the number of search results obtained by submitting the combination of s1 and s2 as a search query.
5. The short text similarity computing method based on the number of search results according to any one of claims 1 to 4, characterized in that the length of each short text is at most 200 characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104583763A CN102567537A (en) | 2011-12-31 | 2011-12-31 | Short text similarity computing method based on searched result quantity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104583763A CN102567537A (en) | 2011-12-31 | 2011-12-31 | Short text similarity computing method based on searched result quantity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102567537A (en) | 2012-07-11 |
Family
ID=46412936
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104583763A Pending CN102567537A (en) | 2011-12-31 | 2011-12-31 | Short text similarity computing method based on searched result quantity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102567537A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104035992A (en) * | 2014-06-10 | 2014-09-10 | 复旦大学 | Method and system for processing text semantics by utilizing image processing technology and semantic vector space |
CN104052654A (en) * | 2014-06-25 | 2014-09-17 | 金硕澳门离岸商业服务有限公司 | Method and system for achieving chatting online |
CN106682174A (en) * | 2016-12-28 | 2017-05-17 | 南华大学 | Big data application based short text information searching system |
CN109635275A (en) * | 2018-11-06 | 2019-04-16 | 交控科技股份有限公司 | Literature content retrieval and recognition methods and device |
CN109871429A (en) * | 2019-01-31 | 2019-06-11 | 郑州轻工业学院 | Merge the short text search method of Wikipedia classification and explicit semantic feature |
CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | 北京百问百答网络技术有限公司 | Text similarity, acceptation similarity calculating method and system and application system |
CN102262663A (en) * | 2011-07-25 | 2011-11-30 | 中国科学院软件研究所 | Method for repairing software defect reports |
- 2011-12-31 CN CN2011104583763A patent/CN102567537A/en active Pending
Non-Patent Citations (1)
Title |
---|
DANUSHKA BOLLEGALA et al.: "Measuring Semantic Similarity between Words using Web Search Engines", WWW '07: Proceedings of the 16th International Conference on World Wide Web, 31 December 2007 (2007-12-31), pages 3 - 2 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104035992A (en) * | 2014-06-10 | 2014-09-10 | 复旦大学 | Method and system for processing text semantics by utilizing image processing technology and semantic vector space |
CN104035992B (en) * | 2014-06-10 | 2017-05-10 | 复旦大学 | Method and system for processing text semantics by utilizing image processing technology and semantic vector space |
CN104052654A (en) * | 2014-06-25 | 2014-09-17 | 金硕澳门离岸商业服务有限公司 | Method and system for achieving chatting online |
CN106682174A (en) * | 2016-12-28 | 2017-05-17 | 南华大学 | Big data application based short text information searching system |
CN106682174B (en) * | 2016-12-28 | 2020-04-17 | 南华大学 | Short text information retrieval system based on big data application |
CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN109635275A (en) * | 2018-11-06 | 2019-04-16 | 交控科技股份有限公司 | Literature content retrieval and recognition methods and device |
CN109871429A (en) * | 2019-01-31 | 2019-06-11 | 郑州轻工业学院 | Merge the short text search method of Wikipedia classification and explicit semantic feature |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598944B (en) | Civil aviation security public opinion sentiment analysis method | |
CN101510222B (en) | Multilayer index voice document searching method | |
CN102708100B (en) | Method and device for digging relation keyword of relevant entity word and application thereof | |
CN100458795C (en) | Intelligent word input method and input method system and updating method thereof | |
CN102253982B (en) | Query suggestion method based on query semantics and click-through data | |
CN102567537A (en) | Short text similarity computing method based on searched result quantity | |
CN106503049A (en) | A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM | |
CN105068991A (en) | Big data based public sentiment discovery method | |
CN101464897A (en) | Word matching and information query method and device | |
CN103186574A (en) | Method and device for generating searching result | |
WO2008027503A3 (en) | Semantic search engine | |
CN103049435A (en) | Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device | |
CN102880723A (en) | Searching method and system for identifying user retrieval intention | |
CN102253971B (en) | PageRank method based on quick similarity | |
CN104991943A (en) | Music searching method and apparatus | |
CN102043843A (en) | Method and obtaining device for obtaining target entry based on target application | |
CN104965823A (en) | Big data based opinion extraction method | |
US20190171713A1 (en) | Semantic parsing method and apparatus | |
CN101968801A (en) | Method for extracting key words of single text | |
CN103678435A (en) | Drug specification data similarity matching method | |
CN101650729B (en) | Dynamic construction method for Web service component library and service search method thereof | |
CN105095430A (en) | Method and device for setting up word network and extracting keywords | |
CN103246644A (en) | Method and device for processing Internet public opinion information | |
CN105183765A (en) | Big data-based topic extraction method | |
CN102650986A (en) | Synonym expansion method and device both used for text duplication detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20120711 |