CN102567537A

CN102567537A - Short text similarity computing method based on searched result quantity

Info

Publication number: CN102567537A
Application number: CN2011104583763A
Authority: CN
Inventors: 李琳; 钟珞; 袁景凌; 夏红霞; 刘东飞
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2011-12-31
Filing date: 2011-12-31
Publication date: 2012-07-11

Abstract

The invention discloses a short text similarity calculation method based on the number of retrieval results, comprising the steps of: (1) preprocessing the short texts; (2) submitting each single short text, and each pairwise combination of short texts, to a large-scale corpus as search queries; and (3) computing the similarity between each pair of short texts from the number of retrieval results returned. The method does not depend on traditional text processing and obtains results quickly and effectively. When a short text is used as a search query, the large-scale corpus returns retrieval results that contain it. The content of those results includes textual interpretations of the short text, and the number of results can be regarded as a compressor that implicitly carries the semantic interpretation of the short text within the corpus.

Description

A short text similarity calculation method based on the number of retrieval results

Technical field

The present invention relates to short text similarity calculation, and in particular to a short text similarity calculation method based on the number of retrieval results. It belongs to the field of text mining.

Background art

Short text refers to text of short length. The notion is broad, and a growing number of communication platforms use short texts ever more frequently, for example mobile phone short messages, instant messages, BBS titles, microblogs, online chat logs, blogs and news comments. The volume of short text data is growing rapidly, and short text mining has broad application prospects in fields such as topic detection and tracking, trending word analysis, public opinion early warning, and image retrieval.

However, because short texts are short, their sample features are very sparse, which hinders retrieval and analysis. In addition, short texts are tersely worded or use non-standard language, and often go beyond the traditional or normal literal meaning. For example, the now popular internet term "microblog" is often referred to in online slang by its homophone "muffler". These distinctive linguistic features greatly reduce the precision of short text similarity calculation, so improving that precision effectively is a central difficulty in short text mining.

To address this difficulty, we propose a short text similarity calculation method based on the number of retrieval results, which exploits the broad coverage of large-scale corpora to understand the meaning of short texts at the semantic level.

Summary of the invention

The purpose of the present invention is to provide a short text similarity calculation method based on the number of retrieval results that overcomes the sparse sample features and non-standard wording of short texts and improves the precision of similarity calculation through semantic analysis.

To achieve the above purpose, the present invention comprises the following steps:

(1) preprocess the short texts;

(2) submit each preprocessed single short text, and each pairwise combination of preprocessed short texts, to a corpus as retrieval queries;

(3) calculate the similarity between each pair of short texts from the number of retrieval results returned by the corpus.

In the above technical solution, step (1) specifically comprises:

(1-1) filter the short text using a general stop-word list, the general stop words being modal particles, adverbs, prepositions and conjunctions;

(1-2) strip the inflectional endings of each word making up the short text, extract the word stems, and calculate the term frequency of each stem.
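Steps (1-1) and (1-2) can be sketched roughly as follows. This is a minimal illustration only: the stop-word list `STOP_WORDS`, the suffix list `SUFFIXES`, and the helpers `naive_stem` and `preprocess` are hypothetical names introduced here, the stemmer is a crude suffix stripper rather than any particular published algorithm, and English input is assumed.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the patent).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to",
              "and", "or", "but", "very", "so"}

# Hypothetical inflectional endings handled by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common ending, keeping at least three letters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(short_text: str) -> Counter:
    """Filter stop words (step 1-1), then stem and count stems (step 1-2)."""
    tokens = re.findall(r"[a-z]+", short_text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)
```

Calling `preprocess("The cats jumped and the cat jumps")` maps both "cats" and "cat" to the stem `cat` and both "jumped" and "jumps" to `jump`, each counted twice. A production system would likely substitute a full stop-word list and an established stemmer.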

In the above technical solution, the corpus in step (2) is a Web search engine or Wikipedia.

In the above technical solution, step (3) calculates the similarity between short texts s1 and s2 using the following formula:

\mathrm{Similarity}(s_1, s_2) = \frac{\log f(s_1, s_2)}{\log f(s_1) + \log f(s_2) - \log f(s_1, s_2)} \qquad (1)

where f(s1) is the number of retrieval results obtained when short text s1 is submitted to the corpus as a retrieval query; f(s2) is the number of retrieval results obtained when short text s2 is submitted to the corpus as a retrieval query; and f(s1, s2) is the number of retrieval results obtained when the combination of s1 and s2 is submitted to the corpus as a retrieval query.
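Formula (1) translates almost directly into code. The sketch below is an illustration under stated assumptions: the function name `similarity` is hypothetical, and the handling of zero counts and of a non-positive denominator (returning 0.0) is a choice made here, since the text does not say how such degenerate cases should be treated.

```python
import math


def similarity(f1: int, f2: int, f12: int) -> float:
    """Evaluate formula (1) from three result counts.

    f1  -- number of results for s1 alone
    f2  -- number of results for s2 alone
    f12 -- number of results for the combination of s1 and s2
    """
    # Assumption: with no results the logarithm is undefined, so report 0.
    if f1 <= 0 or f2 <= 0 or f12 <= 0:
        return 0.0
    numerator = math.log(f12)
    denominator = math.log(f1) + math.log(f2) - numerator
    # Assumption: a zero or negative denominator is treated as "no similarity".
    if denominator <= 0:
        return 0.0
    return numerator / denominator
```

When all three counts are equal and greater than one, the function returns 1, which matches the intuition that two texts always found together are maximally similar under this measure.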

In the above technical solution, the length of each short text is less than or equal to 200 characters.

Compared with the prior art, the method of the invention does not rely on traditional text processing and obtains results quickly and effectively. When a short text is used as a search term, the corpus returns retrieval results that contain it. The content of those results includes textual interpretations of the short text, and their number can be regarded as a compressor that carries the semantic interpretation of the short text within the corpus. The proposed method is therefore an effective semantic-level improvement over conventional approaches that process only the literal wording of the text itself, and it can promptly reflect how the meaning of a short text changes over time.

The present invention will become clearer from the following description taken together with the accompanying drawing, which illustrates an embodiment of the invention.

Description of drawings

Fig. 1 is the main flow chart of the short text similarity calculation method based on the number of retrieval results according to the present invention.

Embodiment

An embodiment of the invention is now described with reference to the accompanying drawing. As shown in Fig. 1, taking two short texts s1 and s2 as an example, the short text similarity calculation method based on the number of retrieval results of this embodiment comprises the following steps:

Step S1: preprocess the short texts, each of length less than or equal to 200 characters. The specific steps are:

Step S1-1: filter the short text using a general stop-word list (stop words list), the general stop words being modal particles, adverbs, prepositions and conjunctions;

Step S1-2: strip the inflectional endings of each word making up the short text, extract the word stems, and calculate the term frequency of each stem.

Step S2: submit each single short text, and each pairwise combination of short texts, to a large-scale corpus as retrieval queries; the corpus used is a Web search engine or Wikipedia.
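The query construction in Step S2 can be illustrated with a small sketch. Everything here is an assumption made for illustration: `build_queries` and `count_results` are hypothetical helpers, and a short in-memory list of documents stands in for the Web search engine or Wikipedia, whose actual query interfaces are not described in the text.

```python
from itertools import combinations


def build_queries(short_texts):
    """Return one query per single short text plus one per pairwise combination."""
    singles = [(text,) for text in short_texts]
    pairs = list(combinations(short_texts, 2))
    return singles + pairs


def count_results(query, corpus):
    """Toy stand-in for a search engine's result count.

    Counts the documents that contain every short text in the query
    (case-insensitive substring match).
    """
    return sum(
        all(text.lower() in doc.lower() for text in query)
        for doc in corpus
    )


# Hypothetical three-document corpus used only for demonstration.
TOY_CORPUS = [
    "Weibo is a microblog service",
    "A microblog post is a short text",
    "Short text mining is hard",
]
```

For three short texts, `build_queries` yields three single queries and three pairwise queries. The counts returned for each query are the f(s1), f(s2) and f(s1, s2) values consumed by the similarity formula.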

Step S3: from the numbers of retrieval results returned, calculate the similarity between short texts s1 and s2 using the following formula:

\mathrm{Similarity}(s_1, s_2) = \frac{\log f(s_1, s_2)}{\log f(s_1) + \log f(s_2) - \log f(s_1, s_2)} \qquad (1)
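As a purely illustrative check of formula (1), suppose the corpus returned the hypothetical counts f(s1) = 1000, f(s2) = 10000 and f(s1, s2) = 100 (invented numbers, not taken from the patent). Using base-10 logarithms for convenience:

```latex
\mathrm{Similarity}(s_1, s_2)
  = \frac{\log_{10} 100}{\log_{10} 1000 + \log_{10} 10000 - \log_{10} 100}
  = \frac{2}{3 + 4 - 2}
  = 0.4
```

Because the formula is a ratio of logarithms, changing the base rescales the numerator and denominator by the same factor and leaves the value unchanged.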

Finally, normalization is applied so that the calculated similarity values lie in the range [0, 1].
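The text does not say which normalization is used. One simple possibility, shown here strictly as an assumption, is to clamp each raw score into [0, 1]. This may matter in practice because result counts reported by search engines are often estimates, so a raw score can occasionally fall outside that range. The names `clamp_similarity` and `normalize_scores` are hypothetical.

```python
def clamp_similarity(value: float) -> float:
    """Clamp a raw similarity score into [0, 1] (one possible normalization)."""
    return min(1.0, max(0.0, value))


def normalize_scores(raw_scores: dict) -> dict:
    """Apply the clamp to every pairwise score, keeping the same keys."""
    return {pair: clamp_similarity(score) for pair, score in raw_scores.items()}
```

Other schemes, such as min-max scaling over all pairwise scores, would also keep the values in [0, 1]; the choice depends on whether scores need to be comparable across different sets of short texts.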

In the above method, preprocessing optimizes the short texts and gives prominence to their keywords. When a short text is used as a search term, the large-scale corpus returns retrieval results that contain it. The content of those results includes textual interpretations of the short text, and their number can be regarded as a compressor that carries the semantic interpretation of the short text within the corpus. The method of this embodiment does not rely on traditional text processing and obtains similarity results quickly and effectively. It is an effective semantic-level improvement over conventional approaches that process only the literal wording of the text itself, and it can promptly reflect how the meaning of a short text changes over time.

Claims

1. A short text similarity calculation method based on the number of retrieval results, characterized in that it comprises the following steps:

(1) preprocessing the short texts;

2. The short text similarity calculation method based on the number of retrieval results according to claim 1, characterized in that step (1) specifically comprises:

3. The short text similarity calculation method based on the number of retrieval results according to claim 1, characterized in that the search engine used in step (2) is a Web search engine or Wikipedia.

4. The short text similarity calculation method based on the number of retrieval results according to claim 1, characterized in that the similarity in step (3) is calculated by the following formula:

\mathrm{Similarity}(s_1, s_2) = \frac{\log f(s_1, s_2)}{\log f(s_1) + \log f(s_2) - \log f(s_1, s_2)}

where f(s1) is the number of retrieval results obtained when short text s1 is submitted to the corpus as a retrieval query; f(s2) is the number of retrieval results obtained when short text s2 is submitted to the corpus as a retrieval query; and f(s1, s2) is the number of retrieval results obtained when the combination of s1 and s2 is submitted to the corpus as a retrieval query.

5. The short text similarity calculation method based on the number of retrieval results according to any one of claims 1 to 4, characterized in that the length of each short text is less than or equal to 200 characters.