CN102693279B

CN102693279B - Method, device and system for fast calculating comment similarity

Info

Publication number: CN102693279B
Application number: CN201210132078.XA
Authority: CN
Inventors: 陈学文; 张宇峰; 姚键; 潘柏宇; 卢述奇
Original assignee: 1Verge Internet Technology Beijing Co Ltd
Current assignee: Alibaba China Co Ltd; Youku Network Technology Beijing Co Ltd
Priority date: 2012-04-28
Filing date: 2012-04-28
Publication date: 2014-09-03
Anticipated expiration: 2032-04-28
Also published as: CN102693279A

Abstract

The invention provides a method, a device and a system for quickly calculating comment similarity. The method comprises: extracting keywords from a new comment; looking up each extracted keyword in an inverted index and the stored text information to find the texts that share keywords with the new comment text; counting the keywords shared by the new comment text and each index text; calculating the similarity between the new text and each index text from that count; and taking the highest similarity score of the new text to find the text most similar to the new comment text. The method, device and system are particularly suited to similarity analysis of short texts such as film reviews: short-text similarity can be calculated quickly, and the program trades space for time to reduce CPU computation time.

Description

Method, device and system for quickly calculating comment similarity

Technical field

The invention belongs to the technical field of text similarity analysis, and in particular relates to a method, device and system for quickly calculating comment similarity.

Background art

In the interactive field of information networks, users often wish to comment on the information they receive. Because some comments are very similar to one another, analysing comment similarity plays an important role in the analysis and processing of comment data, for example in extracting elite comments and analysing spam comment content.

Existing approaches generally apply a similarity algorithm directly to every pair of comments, then take the highest similarity score to find the comments that are most similar. This approach must compare each new comment with the historical comments one by one, so the amount of computation is large. On the one hand, this slows the server's processing; on the other hand, it increases the number of accesses to the server database that stores the comment content.

Summary of the invention

In view of the problems in the prior art, the object of the present invention is to provide a method, device and system for quickly calculating comment similarity for internet information, in particular comments on and replies to internet information. For this kind of short text, a similarity calculation method suited to short text is adopted, which reduces the dependence on server CPU computation and the number of accesses to the server database that stores the comment content, thereby improving the processing efficiency of the server.

To achieve the above object, the invention provides a method for quickly calculating comment similarity, characterized by comprising the following steps:

S1, extract the keywords of the new comment, comprising:

S11, convert the original comment text into processable text;

S12, segment the processed comment text into words using a word segmentation program;

S13, extract the sentence trunk from the segmentation result;

S14, filter the feature keywords obtained in step S13 against a stop-word list to obtain the final useful keywords of the new comment;

S2, for each extracted keyword, look up the inverted index and text information to find the texts that share keywords with the new comment text;

S3, count the keywords shared by the new comment text and each index text;

S4, calculate the similarity between the new text and each index text from the number of keywords shared by the new comment text and the index text, comprising:

S41, calculate feature keyword weights using Boolean weighting;

S42, using the keyword weights obtained in step S41, calculate text similarity with the Dice coefficient, measuring the similarity of two texts by the number of keywords they share and the weight of each keyword;

S5, obtain the highest similarity score of the new text, thereby finding the text most similar to the new comment text;

S6, add the new comment text to the index to generate a new index, so that when the next comment is processed all known comments are already in the inverted index.

In addition, the invention provides a device for quickly calculating comment similarity, characterized by comprising the following modules:

A keyword extraction module, for extracting the keywords of a new comment, comprising:

a module for converting the original comment text into processable text;

a module for segmenting the processed comment text into words using a word segmentation program;

a module for extracting the sentence trunk from the segmentation result;

a module for filtering the resulting feature keywords against a stop-word list to obtain the final useful keywords of the new comment;

An inverted index module, for looking up the inverted index and text information for each extracted keyword to find the texts that share keywords with the new comment text;

A same-keyword calculation module, for counting the keywords shared by the new comment text and each index text;

A similarity calculation module, for calculating the similarity between the new text and each index text from the number of keywords shared by the new comment text and the index text, comprising:

a module for calculating feature keyword weights using Boolean weighting;

a module for calculating text similarity with the Dice coefficient from the keyword weights obtained, measuring the similarity of two texts by the number of keywords they share and the weight of each keyword;

A similar-text determination module, for obtaining the highest similarity score of the new text, thereby finding the text most similar to the new comment text;

An index adding module, for adding the new comment text to the index to generate a new index, so that when the next comment is processed all known comments are already in the inverted index.

In addition, the invention provides a system for quickly calculating comment similarity, characterized by comprising the following devices:

A keyword extraction device, for extracting the keywords of a new comment, comprising:

a module for converting the original comment text into processable text;

An inverted index device, for looking up the inverted index and text information for each extracted keyword to find the texts that share keywords with the new comment text;

A same-keyword calculation device, for counting the keywords shared by the new comment text and each index text;

A similarity calculation device, for calculating the similarity between the new text and each index text from the number of keywords shared by the new comment text and the index text, comprising:

A similar-text determination device, for obtaining the highest similarity score of the new text, thereby finding the text most similar to the new comment text;

An index adding device, for adding the new comment text to the index to generate a new index, so that when the next comment is processed all known comments are already in the inverted index.

The method, device and system of the invention for quickly calculating comment similarity can calculate short-text similarity quickly; the program trades space for time, reducing CPU computation time. In particular:

1. Text feature keywords are stored in an inverted index, which speeds up the search for similar strings; similarity does not have to be calculated against every text one by one, which reduces the amount of computation.

2. The intermediate values computed for each text during similarity calculation are retained and reused directly in later similarity calculations, so they do not need to be recomputed.

Brief description of the drawings

Fig. 1 is a flowchart of the method for quickly calculating comment similarity according to the invention;

Fig. 2 is a block diagram of the device for quickly calculating comment similarity according to the invention;

Fig. 3 is a block diagram of the system for quickly calculating comment similarity according to the invention.

Embodiments

To make the above objects, features and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flowchart of the method for quickly calculating comment similarity according to the invention. As shown in Fig. 1, the method is carried out as follows:

S1, extract the keywords of the new comment. The extraction process is as follows:

Step S11, convert the original comment text into processable text, for example by removing internal tags, emoticons and similar information.

The conversion can be performed by an in-house program. For example, for short texts such as microblog posts, the internal tags of the short text (Sina Weibo tags) can be removed: the user names that appear in forwards after "//@", topic tags of the form "##" and so on are all removed, so that only the content of the comment itself is extracted. In addition, emoticon tags of the form "[]", such as [praise], are stored in a database, so that emoticon tag information in the short text can also be removed.
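The cleaning in step S11 could be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions and the function name `clean_comment` are assumptions, and real platform markup may need different patterns (for instance, the text looks emoticon tags up in a database, whereas this sketch simply removes anything in square brackets).

```python
import re

# Hypothetical patterns for the markup named in the text; real platforms may differ.
FORWARD_MENTION = re.compile(r"//@[^:：\s]+[:：]?")  # forwarded user name, e.g. "//@name:"
TOPIC_TAG = re.compile(r"#[^#]*#")                   # topic tag, e.g. "#topic#"
EMOTICON_TAG = re.compile(r"\[[^\[\]]*\]")           # emoticon tag, e.g. "[praise]"


def clean_comment(raw: str) -> str:
    """Strip forward mentions, topic tags and emoticon tags, then squeeze whitespace."""
    text = FORWARD_MENTION.sub(" ", raw)
    text = TOPIC_TAG.sub(" ", text)
    text = EMOTICON_TAG.sub(" ", text)
    return " ".join(text.split())
```

For example, `clean_comment("great film [praise] #movie# //@tom: agreed")` returns `"great film agreed"`.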

Step S12, segment the processed comment text into words using a word segmentation program.

This step can be implemented by an in-house program or by a third-party Chinese word segmentation program. The dictionary is crawled from the internet, so the local segmentation dictionary can be enriched continually. The segmentation algorithm uses reverse maximum matching: the text is segmented according to the words in the dictionary.
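Reverse maximum matching can be sketched as below. The function name, the `max_len` parameter and the fallback of emitting a single character when no dictionary word matches are assumptions for illustration; the text does not specify how unknown characters are handled.

```python
from typing import List, Set


def reverse_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by scanning from the end and taking the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first, shrinking until a dictionary word is found.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Assumed fallback: a single character is always accepted.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words
```

With the hypothetical dictionary `{"研究", "研究生", "生命", "起源"}`, the input `"研究生命起源"` is segmented as `["研究", "生命", "起源"]`, because matching starts from the end of the string.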

Step S13, extract the sentence trunk, such as nouns and verbs, from the segmentation result.

Nouns, verbs, adjectives and so on are identified by part-of-speech tagging, which is performed by an external program.

For example, the sentence "Huang Xiaoming heartily plot development" is tagged as "Huang Xiaoming/nh heartily/o plot/n development/v".

For some complex sentence structures tagging errors may occur, which lead to extraction errors. In tests, the accuracy of part-of-speech tagging was above 95%. The small share of errors may affect the final similarity score, but because accuracy is high the effect on the score is small, so the sentence trunk can be extracted fairly accurately.
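Given tagger output, the trunk extraction of step S13 reduces to a filter on part-of-speech tags. The sketch below assumes the tagger returns (word, tag) pairs and that the tags kept are `n`, `nh`, `v` and `a`; the tags `n`, `nh`, `v` and `o` appear in the example above, while `a` for adjectives and the name `extract_trunk` are assumptions.

```python
from typing import Iterable, List, Tuple

# Assumed tag set for the sentence trunk: noun, person name, verb, adjective.
TRUNK_TAGS = {"n", "nh", "v", "a"}


def extract_trunk(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep only the words whose part-of-speech tag belongs to the trunk tag set."""
    return [word for word, tag in tagged if tag in TRUNK_TAGS]
```

Applied to the tagged example, `[("Huang Xiaoming", "nh"), ("heartily", "o"), ("plot", "n"), ("development", "v")]`, the function keeps `["Huang Xiaoming", "plot", "development"]`.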

Step S14, finally, filter the feature keywords obtained in step S13 against a stop-word list to obtain the final useful keywords of the new comment.

A word in the stop-word list has little effect on the meaning of the text and can be ignored. The stop-word list comes partly from the internet, and a small part is derived statistically. For example, if statistics over a large set of comments show that the keyword "sofa" scores very low, it can be added to the stop-word list. Other examples of stop words are "seem" and "certain".
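A minimal sketch of the stop-word filter in step S14 follows. The stop-word set contains only the examples named in the text, and removing duplicate keywords is an assumption that fits the Boolean weighting used later, where a keyword is either present or absent.

```python
from typing import Iterable, List, Set

# Illustrative stop-word list built from the examples in the text.
STOP_WORDS = {"sofa", "seem", "certain"}


def filter_keywords(candidates: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop stop words and duplicates while keeping the original order."""
    seen = set()
    keywords = []
    for word in candidates:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        keywords.append(word)
    return keywords
```

For example, `filter_keywords(["sofa", "film", "seem", "film", "plot"])` returns `["film", "plot"]`.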

S2, for each extracted keyword, look up the inverted index and text information to find the texts that share keywords with the new comment text. An index entry is built for each keyword; the index texts are the texts against which similarity is analysed. The purpose of the inverted index is to find texts and text information quickly.

An inverted index is a technique used in search engines. In essence, it builds a lookup mechanism from the keywords in the texts, so that texts can be found by keyword. Each entry in the index table contains an attribute value and the addresses of all records that have that attribute value. Because the attribute value is not determined by the record, but rather the position of the record is determined by the attribute value, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.

The invention builds the inverted index as follows:

Two tables, a and b, are defined. Each row of table a stores the text of a comment, the extracted feature keyword information and a unique id that identifies the text. Each row of table b stores a keyword and a sequence of ids, namely the ids of the texts in table a that contain the keyword. Table b is generated as follows: traverse all texts in table a; for each keyword that occurs in a text, append the text's id to the id sequence of that keyword in table b; if the keyword is not yet in table b, add a new entry for it.

The inverted index is used as follows. For example, to find the documents that contain the keyword "hello", the keyword "hello" is located quickly in table b and its id sequence is retrieved; the corresponding documents are then looked up in table a by id.
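The two tables and the lookup described above can be sketched with in-memory dictionaries. The representation of table a as `id -> (text, keywords)` and the function names are assumptions; the text does not say how the tables are stored.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Table a (assumed layout): text id -> (comment text, extracted feature keywords).
TableA = Dict[int, Tuple[str, List[str]]]


def build_table_b(table_a: TableA) -> Dict[str, List[int]]:
    """Build table b: keyword -> ids of the texts in table a that contain it."""
    table_b = defaultdict(list)
    for text_id, (_text, keywords) in table_a.items():
        for keyword in dict.fromkeys(keywords):  # de-duplicate, keep order
            table_b[keyword].append(text_id)
    return dict(table_b)


def find_texts(keyword: str, table_a: TableA, table_b: Dict[str, List[int]]) -> List[str]:
    """Locate the keyword in table b, then fetch the matching texts from table a."""
    return [table_a[text_id][0] for text_id in table_b.get(keyword, [])]
```

With `table_a = {1: ("hello world", ["hello", "world"]), 2: ("hello again", ["hello", "again"])}`, looking up `"hello"` yields the id sequence `[1, 2]` in table b and both texts from table a.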

S3, count the keywords shared by the new comment text and each index text. The detailed process is as follows:

Using the index texts found in step S2 that contain the same keywords as the new comment text, count the keywords that the new comment text shares with each of those texts. Because step S2 has already found the texts that share keywords with the new text, "all texts" in this step is a reduced set. The result is the number of keywords shared between texts; this number is the comm(s1, s2) value in the Dice similarity formula below.

The information counted is the features that each text shares with the new text. In the invention, text features are represented only by the keywords in the text, so only the feature keywords extracted in step S1 are used when calculating similarity. Other methods may treat further information, such as text length or symbol information, as text features, and such information can also be used as feature information in text analysis.

The comment feature information refers to the leng(s2) value in the formula, that is, the text information value calculated from the feature keywords. When the Dice method is used to calculate text similarity, as in the invention, this value is the number of feature keywords in the text. It can be stored in table a of step S2 so that it is readily available when similarity with other texts is calculated.
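Counting shared keywords through the index can be sketched as follows. Each keyword of the new comment contributes one count to every text in its id sequence, so texts with no keyword in common are never visited. The function name and the dictionary form of table b are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_shared_keywords(new_keywords: Iterable[str],
                          table_b: Dict[str, List[int]]) -> Counter:
    """Return comm(new, t) for every indexed text t sharing at least one keyword."""
    shared = Counter()
    for keyword in set(new_keywords):
        for text_id in table_b.get(keyword, ()):
            shared[text_id] += 1
    return shared
```

For instance, with `table_b = {"film": [1, 2], "plot": [2], "girl": [3]}` and new keywords `["film", "plot"]`, the result is `{1: 1, 2: 2}`; text 3 does not appear because it shares nothing with the new comment.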

S4, calculate the similarity between the new text and each index text from the number of keywords they share. This step is implemented as follows:

Step S41, calculate feature keyword weights using Boolean weighting. Commonly used feature weighting methods are Boolean weights, term frequency (tf) weights and tf-idf weights. Because comments are short texts that contain few feature words, Boolean weighting is used. Experiments showed that calculating feature weights with tf-idf increases the amount of computation without noticeably changing the quality of the calculated similarity.

Step S42, using the keyword weights obtained in step S41, calculate text similarity with the Dice coefficient, measuring the similarity of two texts by the number of keywords they share and the weight of each keyword.

The Dice coefficient formula is:

Dice(s1, s2) = 2 × comm(s1, s2) / (leng(s1) + leng(s2))

where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the strings s1 and s2. In the invention the units counted are feature keywords rather than characters, so comm(s1, s2) is the number of keywords the two texts share and leng(s) is the number of feature keywords in text s.
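Under Boolean weighting each keyword is either present or absent, so a text can be modelled as a set of keywords and the Dice coefficient follows directly from set sizes. The sketch below makes that reading explicit; returning 0.0 when both keyword sets are empty is an assumption, since the formula is undefined in that case.

```python
from typing import Iterable


def dice(keywords_1: Iterable[str], keywords_2: Iterable[str]) -> float:
    """Dice(s1, s2) = 2 * comm(s1, s2) / (leng(s1) + leng(s2)) over keyword sets."""
    s1, s2 = set(keywords_1), set(keywords_2)
    if not s1 and not s2:
        return 0.0  # assumed convention for two empty texts
    return 2 * len(s1 & s2) / (len(s1) + len(s2))
```

Identical keyword sets score 1.0, disjoint sets score 0.0, and `dice(["a", "b"], ["b", "c"])` gives 0.5.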

An example follows. After keyword extraction, the new text C1 is: film, Huang Xiaoming, plays, Xiao Ming.

The existing index texts are:

Index text C2: film, Huang Xiaoming, acting skills

Index text C3: Zhao Wei, plays, Xiao Wei

Index text C4: Xiao Wei, girl

First, the keywords "film", "Huang Xiaoming", "plays" and "Xiao Ming" are looked up in the inverted index, which finds the corresponding documents C2 and C3 (C2, C3 and C4 have already been added to the inverted index).

Then, the number of keywords that C1 shares with C2, and that C1 shares with C3, is counted; this is comm(s1, s2) in the formula.

Finally, the Dice similarity formula is used to calculate the similarity of C1 and C2 and the similarity of C1 and C3.
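The example can be worked through numerically as follows. The keyword lists are an assumed segmentation of the example texts (four keywords for C1, three each for C2 and C3, two for C4); the scores depend on that assumption.

```python
# Assumed keyword segmentation of the example texts.
C1 = ["film", "Huang Xiaoming", "plays", "Xiao Ming"]
INDEXED = {
    "C2": ["film", "Huang Xiaoming", "acting skills"],
    "C3": ["Zhao Wei", "plays", "Xiao Wei"],
    "C4": ["Xiao Wei", "girl"],
}

# Inverted index: keyword -> ids of the indexed texts that contain it.
index = {}
for text_id, keywords in INDEXED.items():
    for keyword in keywords:
        index.setdefault(keyword, []).append(text_id)

# comm(C1, Ci): only texts sharing at least one keyword with C1 are visited.
comm = {}
for keyword in C1:
    for text_id in index.get(keyword, []):
        comm[text_id] = comm.get(text_id, 0) + 1

# Dice score for each candidate; leng is the number of feature keywords.
scores = {
    text_id: 2 * shared / (len(C1) + len(INDEXED[text_id]))
    for text_id, shared in comm.items()
}
best = max(scores, key=scores.get)

print(comm)    # {'C2': 2, 'C3': 1}
print(scores)  # C2: 4/7 ≈ 0.571, C3: 2/7 ≈ 0.286
print(best)    # C2
```

C4 shares no keyword with C1, so it is never scored; C2 is the most similar text under these assumptions.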

S5, obtain the highest similarity score of the new text, thereby finding the text most similar to the new comment text. The result is a similarity score between 0 and 1, where 1 means the text contents are closest and 0 means they are least close. The purpose of obtaining the similarity score of a new comment is to judge whether the new comment is plagiarized. On this basis, the comment that has been copied most often can also be identified as an elite comment, while the elite-comment score of the plagiarizing comments is reduced.

S6, add the new comment text to the index to produce a new index, so that when the next comment is processed all known comments are already in the inverted index.
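Steps S2 to S6 can be combined into a small incremental structure: each new comment is scored against the index and then added to it. The class and method names below are hypothetical, and storing each text's keyword set in memory stands in for the retained leng value; this is a sketch under those assumptions, not the patented implementation.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


class CommentIndex:
    """In-memory sketch of tables a and b with incremental updates."""

    def __init__(self) -> None:
        self.texts: Dict[int, Tuple[str, Set[str]]] = {}  # table a
        self.postings: Dict[str, List[int]] = {}          # table b
        self._next_id = 1

    def most_similar(self, keywords: Iterable[str]) -> Tuple[Optional[int], float]:
        """S2 to S5: return (id, Dice score) of the best match, or (None, 0.0)."""
        new = set(keywords)
        comm: Dict[int, int] = {}
        for keyword in new:
            for text_id in self.postings.get(keyword, []):
                comm[text_id] = comm.get(text_id, 0) + 1
        best_id, best_score = None, 0.0
        for text_id, shared in comm.items():
            score = 2 * shared / (len(new) + len(self.texts[text_id][1]))
            if score > best_score:
                best_id, best_score = text_id, score
        return best_id, best_score

    def add(self, text: str, keywords: Iterable[str]) -> int:
        """S6: store the text and append its id to each keyword's id sequence."""
        text_id = self._next_id
        self._next_id += 1
        keyword_set = set(keywords)
        self.texts[text_id] = (text, keyword_set)
        for keyword in keyword_set:
            self.postings.setdefault(keyword, []).append(text_id)
        return text_id

    def process(self, text: str, keywords: Iterable[str]) -> Tuple[Optional[int], float]:
        """Score a new comment against the index, then add it to the index."""
        result = self.most_similar(keywords)
        self.add(text, keywords)
        return result
```

The first comment processed has nothing to match and yields `(None, 0.0)`; every later comment is compared only with the indexed texts that share at least one keyword with it.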

The technical solution of the invention can be implemented in a single device, yielding a physical device that carries out the solution. Fig. 2 is a block diagram of the device for quickly calculating comment similarity according to the invention, which comprises the following modules:

A keyword extraction module, for extracting the keywords of a new comment. Its operation is the same as that of method step S1.

An inverted index module, for looking up the inverted index and text information for each extracted keyword to find the texts that share keywords with the new comment text. Its operation is the same as that of method step S2.

A same-keyword calculation module, for counting the keywords shared by the new comment text and each index text. Its operation is the same as that of method step S3.

A similarity calculation module, for calculating the similarity between the new text and each index text from the number of keywords shared by the new comment text and the index text. Its operation is the same as that of method step S4.

A similar-text determination module, for obtaining the highest similarity score of the new text, thereby finding the text most similar to the new comment text. Its operation is the same as that of method step S5.

An index adding module, for adding the new comment text to the index to produce a new index, so that when the next comment is processed all known comments are already in the inverted index.

In addition, the invention can be carried out by separate devices working together, yielding a system that carries out the technical solution. Fig. 3 is a block diagram of the system for quickly calculating comment similarity according to the invention, which comprises the following devices:

A keyword extraction device, for extracting the keywords of a new comment. Its operation is the same as that of method step S1.

An inverted index device, for looking up the inverted index and text information for each extracted keyword to find the texts that share keywords with the new comment text. Its operation is the same as that of method step S2.

A same-keyword calculation device, for counting the keywords shared by the new comment text and each index text. Its operation is the same as that of method step S3.

A similarity calculation device, for calculating the similarity between the new text and each index text from the number of keywords shared by the new comment text and the index text. Its operation is the same as that of method step S4.

A similar-text determination device, for obtaining the highest similarity score of the new text, thereby finding the text most similar to the new comment text. Its operation is the same as that of method step S5.

An index adding device, for adding the new comment text to the index to produce a new index, so that when the next comment is processed all known comments are already in the inverted index.

In summary, in the method, device and system of the invention for quickly calculating comment similarity, because comments are short texts containing few feature words, feature weights are calculated with Boolean weighting and the similarity of two strings is calculated with the Dice coefficient, which reduces the complexity of the similarity calculation. The invention has the following advantages: short-text similarity can be calculated quickly; the program trades space for time, reducing CPU computation time; text feature keywords are stored in an inverted index, which speeds up the search for similar strings; and similarity does not have to be calculated against every text one by one, which reduces the amount of computation.

The above is a detailed description of preferred embodiments of the invention. Those of ordinary skill in the art will appreciate that, within the scope and spirit of the invention, various improvements, additions and substitutions are possible, such as adjusting the order of interface calls, changing message formats and contents, or implementing the invention in a different programming language (such as C, C++ or Java). All of these fall within the scope of protection defined by the claims of the invention.

Claims

1. A method for quickly calculating comment similarity, characterized by comprising the following steps:

S1, extract the keywords of the new comment;

S4, calculate the similarity between the new text and each index text from the number of keywords shared by the new comment text and the index text;

S5, obtain the highest similarity score of the new text, thereby finding the text most similar to the new comment text;

S6, add the new comment text to the index to produce a new index, so that when the next comment is processed all known comments are already in the inverted index, and retain the intermediate values computed in each text similarity calculation;

wherein step S1 comprises the following steps:

S11, convert the original comment text into processable text;

S13, extract the sentence trunk from the segmentation result;

wherein step S4 comprises:

S21, calculate feature keyword weights using Boolean weighting;

S22, using the keyword weights obtained in step S21, calculate text similarity with the Dice coefficient, measuring the similarity of two texts by the number of keywords they share and the weight of each keyword,

the Dice coefficient formula being:

Dice(s1, s2) = 2 × comm(s1, s2) / (leng(s1) + leng(s2))