CN101667194A

CN101667194A - Automatic abstracting method and system based on user comment text feature

Info

Publication number: CN101667194A
Application number: CN200910093409A
Authority: CN
Inventors: 张铭; 章彦星
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2009-09-29
Filing date: 2009-09-29
Publication date: 2010-03-10

Abstract

The invention provides an automatic summarization method and system based on user comment text features. The method comprises the following steps: crawling and parsing user comment webpages and preprocessing the user comments; identifying the features commented on by users; classifying the user comment sentences according to the commented features and filtering the features according to the classification result; and calculating the score of each comment sentence and extracting several summary sentences to generate a summary. The invention can accurately identify the features that interest users from a large number of user comments, classify the comment sentences by the features they comment on, and then automatically generate a concise and comprehensive summary using a text summarization method based on sentence extraction, thereby helping users markedly improve the efficiency and quality of knowledge acquisition. When used in electronic commerce, the invention can shorten the time users spend selecting commodities, increase shopping efficiency, and improve the shopping experience.

Description

Automatic summarization method and system based on user comment text features

Technical field

The present invention relates to an automatic summarization method and system that summarize text according to the text features of user comments, and belongs to the technical field of knowledge mining.

Background technology

Automatic summarization based on text features uses computer technology to automatically generate, for an electronic document, "a text that is shorter than the original and contains the important information of the original". With the continued development of the internet, the explosive growth of information has made text summarization ever more widely applied. Depending on the object processed, text summarization can be divided into two classes: single-document summarization and multi-document summarization.

Single-document summarization automatically generates a summary for a single document. It mainly uses sentence extraction: the score of each sentence is first calculated from factors such as word frequency, sentence position, syntactic structure, and document structure; the highest-scoring sentences are then chosen as summary sentences and arranged into a summary in their order of appearance in the original text. Alternatively, single-document summarization can use a generation method based on natural language understanding, which applies linguistic knowledge to analyze the deep linguistic structure of the text, applies domain knowledge to make semantic judgments and inferences, obtains a semantic representation of the document, and then generates the summary from that representation. By comparison, sentence extraction is simpler and more widely applicable, whereas generation based on natural language understanding is very complex, depends on a domain knowledge base, and is strictly limited to particular domains. Mainstream single-document summarization therefore still uses sentence extraction.

Multi-document summarization automatically generates a summary for multiple documents on the same topic and must account for redundancy and conflict among the contents of different documents. There are three main classes of methods: (1) use information extraction to extract the important information from each document, build a summary template manually or semi-automatically, and fill the template with the extracted information to generate the summary; (2) first generate a summary for each document using single-document summarization, then filter out redundant and conflicting content and organize the remaining content into a summary; (3) first classify or cluster all the sentences of the documents, then choose from each set the sentences that express its theme and organize them into a summary. A typical tool using the third method is MEAD; see Radev D R, Jing H, Stys M, et al. Centroid-based summarization of multiple documents. Information Processing and Management, 2004, 40: 919-938. MEAD is a multi-document summarization system based on document clustering and document set features. It first clusters the sentences of the documents, uses statistical methods to choose the highest-frequency words and phrases in each sentence set to form a pseudo-sentence that serves as the "centroid" of the set, then computes the similarity between each other sentence in the set and the centroid as that sentence's score, and finally chooses the highest-scoring sentences in each set as summary sentences and organizes them into the document summary.

With the development of Web 2.0, the internet has gradually become a platform on which people can freely express their views, and large amounts of text rich in subjective opinion, such as user comments, have begun to appear online. At present, text summarization research mainly targets scientific literature, news, and similar texts that have a rigorous document structure and a relatively uniform style and that state objective facts. User comments, by contrast, usually express subjective opinions about particular aspects of a thing, and they are loosely and flexibly structured and diverse in style. In view of these characteristics, the present invention adopts a feature-based classification method: it first analyzes a large number of comments, identifies all the features that users comment on, and then classifies each comment sentence according to the feature it evaluates. The field of sentiment analysis has already proposed several methods for identifying features from user comments, such as frequent itemset mining, methods based on probabilistic language models, pattern discovery and pattern matching, and unsupervised learning based on heuristic rules.

These subjective texts are enormous in number and relatively scattered, so obtaining the rich knowledge they contain often takes a great deal of time and effort. The present invention mainly adopts the third class of methods to generate summaries for user comments and proposes a feature identification and filtering algorithm; comparative experiments show that both the precision and the F1 value of feature identification are greatly improved.

Summary of the invention

To overcome the deficiencies of the prior art, the invention provides an automatic summarization method and system based on user comment text features. It can automatically generate a concise and comprehensive summary for a large number of user comments, helping people obtain knowledge from user comments faster and better. Both the precision and the F1 value of the feature identification of the present invention are markedly improved. The technical solution adopted by the present invention to solve the technical problem is:

An automatic summarization method based on user comment text features, comprising the following steps:

Step 1, user comment preprocessing: crawl and parse the user comment webpages to obtain the user comments, then preprocess the user comments to obtain preprocessed user comments;

Step 2, feature identification: analyze the preprocessed user comments to identify the features evaluated by users, then use a statistical method to identify candidate features from the features evaluated by users;

Step 3, comment sentence classification: classify the preprocessed user comment sentences by candidate feature, thereby obtaining the comment sentence class corresponding to each candidate feature;

Step 4, feature filtering: filter the candidate features according to the comment sentence classes, thereby obtaining the final features and their corresponding candidate comment sentence classes;

Step 5, summary generation: calculate the score of each sentence in the candidate comment sentence classes and extract several summary sentences to generate the summary.

In step 1 above, crawling and parsing the user comment webpages means crawling all user comment webpages for a chosen specific thing to obtain the crawled user comments, and then parsing the crawled user comments to obtain the user comment text.

In step 1 above, preprocessing the user comments means tagging the part of speech of every word in the user comments, removing the stop words, and stemming the remaining words, to obtain the preprocessed user comment text.

In step 2 above, a feature evaluated by users is a certain aspect, detail, attribute, or component that the user focuses on when evaluating a thing.

In step 2 above, using a statistical method to identify candidate features means: extract all nouns from the user comment sentences corresponding to the features evaluated by users, and calculate the frequency with which each single noun occurs and the frequency with which any two nouns co-occur; then choose the single nouns with the highest frequency of occurrence and the noun pairs with the highest co-occurrence frequency as candidate features.

In step 4 above, filtering the candidate features means filtering out meaningless and redundant candidate features according to the relative positions at which the nouns forming a feature occur in the comment sentences and according to the generalization and specialization relations in meaning between features.

Step 5 above further comprises: using a statistical method to calculate the keywords that express the theme of each comment sentence class; then calculating the score of each comment sentence from the degree of fit between the comment sentence content and the theme, the length of the comment sentence, and the position of the comment sentence in the whole comment; and then extracting the highest-scoring original comment sentences in each user comment sentence class and organizing them into the summary.

In the above method, using a statistical method to calculate the keywords that express the theme of each comment sentence class means: on the basis of the comment sentence classification, statistical methods are used to find the keywords of each class and to construct a pseudo-sentence, the centroid, that expresses the theme of the comment sentence class, and the calculation is based on the similarity between each comment sentence and the centroid. The degree of fit between comment sentence content and theme is the similarity between the comment sentence and the centroid.

An automatic summarization system based on user comment text features, comprising:

A user comment preprocessing module, which crawls and parses the user comments and then preprocesses them;

A feature identification module, which analyzes the preprocessed user comments to identify the features evaluated by users and then uses a statistical method to identify candidate features from the features evaluated by users;

A comment sentence classification module, which classifies the user comment sentences by candidate feature, thereby obtaining the comment sentence class corresponding to each candidate feature;

A feature filtering module, which further filters the candidate features according to the comment sentence classification result, thereby obtaining the candidate features of interest as the final features together with their corresponding candidate comment sentence classes;

A summary generation module, which calculates the score of the sentences in the candidate comment sentence classes and extracts several summary sentences to generate the summary.

The user comment preprocessing module sends its preprocessing result to the feature identification module, which identifies the candidate features. The user comment text preprocessed by the user comment preprocessing module and the candidate features identified by the feature identification module are sent to the comment sentence classification module for classification, yielding the comment sentence classes. The candidate features are then filtered to obtain the final features and their corresponding candidate comment sentence classes. The summary generation module takes the candidate comment sentence classes and the final features as input, performs statistical analysis, and generates the summary.

Beneficial effect of the present invention:

The present invention proposes an automatic summarization method based on user comment text. It applies text summarization for the first time to user comments rich in subjective information and, in view of the characteristics of user comments, proposes a feature-based classification method.

The method of the invention can generate concise and comprehensive user comment summaries, greatly shortening the time users spend reading comments to obtain useful information and improving knowledge utilization. This feature-based method suits the characteristics of user comments. The precision of the feature identification and feature filtering algorithm proposed by the invention can reach more than 81% and its recall can reach 52%; both precision and F1 value are greatly improved over the chosen comparison algorithm. Against the background of the explosive growth of information in the internet era, the user comment summarization method according to the present invention is of great significance. It can be widely used in many fields such as electronic commerce and can significantly improve the quality and efficiency of obtaining knowledge from massive information.

Description of drawings

Fig. 1 is the overall flowchart of the automatic summarization method based on user comment text features according to the present invention;

Fig. 2 is the flowchart of comment sentence classification according to the method of the invention;

Fig. 3 is the flowchart of summary generation according to the method of the invention.

Embodiment

The present invention is described in further detail below with reference to the drawings and specific embodiments:

Embodiment 1:

The specific embodiment of the present invention is described in detail below using an example of generating summaries for user comments in electronic commerce.

Electronic commerce is an important Web application on the internet. E-commerce websites usually allow users to comment on commodities. These comments, which contain users' subjective experience of purchasing and using commodities, often serve as a reference for other users choosing merchants and commodities, and also as a basis for merchants to improve their service. Popular commodities on large websites often have hundreds or thousands of user comments, which are very time-consuming to read. The present invention can automatically generate a concise and comprehensive summary for a large number of user comments, greatly improving the efficiency of knowledge acquisition.

As shown in Fig. 1, the feature-based user comment summarization method mainly comprises the following steps:

Step 1, user comment preprocessing: crawl and parse the user comments, then preprocess them.

To generate a summary for the user comments on a commodity in electronic commerce, all user comment webpages for that commodity must first be crawled from the e-commerce website. In the present embodiment, all user comment webpages for the commodity Apple iPod touch are crawled from www.amazon.com, and parsing the webpages yields 939 user comments.

Before summarization begins, the user comments undergo a series of preprocessing steps. The Stanford Part-of-Speech Tagger is used to tag the parts of speech in the user comments; it is a part-of-speech tagger that uses a maximum entropy model, and its accuracy can reach 96.86%. In addition, the stop words in the user comments are deleted, and the Porter Stemmer is used to extract the stems of the remaining words. The processed comment sentences are represented and stored using the vector space model.
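The preprocessing described above can be illustrated with a minimal Python sketch. It is not the tool chain named in the text: the stop-word list is a tiny hypothetical one, `naive_stem` is a deliberately crude suffix stripper standing in for the Porter Stemmer, and part-of-speech tagging is omitted. The function and variable names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a full list.
STOP_WORDS = {"the", "a", "an", "is", "are", "it", "and", "of", "to", "this"}


def naive_stem(word: str) -> str:
    """Crude stand-in for a real stemmer (this is NOT the Porter algorithm)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence: str) -> Counter:
    """Lowercase, tokenize, drop stop words, stem, and return term counts.

    The returned Counter is a sparse bag-of-words vector for the sentence.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


vector = preprocess("The batteries are lasting and the screen is amazing")
```

The resulting term-count mapping plays the role of the vector space representation of one comment sentence; any weighting scheme would be applied on top of it.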

Step 2, feature identification: analyze the large number of user comments to identify the features evaluated by users, then use a statistical method to identify candidate features from the features evaluated by users.

As stated above, a feature of a thing is a certain aspect, detail, attribute, or component that the user focuses on when evaluating it. In electronic commerce, what the user focuses on is usually an attribute or component of the commodity itself, or an aspect or detail of the shopping process; these are collectively called features. Such features are usually a noun or a phrase composed of two nouns. Because different users use the same words to denote a feature but often use different words to express their shopping and usage experience, the words that denote features occur more frequently than other words. On this basis, the present invention uses a statistical method based on frequent itemset mining to identify features, and it can adaptively identify the features of all kinds of commodities.

Frequent itemset mining can be stated as follows. D = <S_1, S_2, ..., S_N> is a collection of N itemsets, where S_i = <t_1, t_2, ..., t_{n_i}>, i = 1, 2, ..., N, is an n_i-itemset and each t_j, j = 1, 2, ..., n_i, is an item. Given a minimum support parameter minsupport, frequent itemset mining finds every itemset S for which D contains at least N*minsupport itemsets S_k such that S \subseteq S_k.

The Apriori algorithm is one of the classic algorithms for frequent itemset mining. It uses a breadth-first search strategy and exploits the Apriori property, namely that any m-itemset meeting the minimum support requirement must be a subset of the union of all n-itemsets that meet the requirement (m > n), which effectively narrows the search space.
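As a minimal sketch of the notion of support used here, the following Python fragment models each itemset as a `frozenset` and measures support as the fraction of itemsets in the database that contain a given itemset. The names `support`, `is_frequent`, and the toy `database` are illustrative assumptions, not part of the original method.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], database: List[FrozenSet[str]]) -> float:
    """Fraction of itemsets in the database that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for s in database if target <= s) / len(database)


def is_frequent(itemset: Iterable[str], database: List[FrozenSet[str]],
                minsupport: float) -> bool:
    """An itemset is frequent when its support is at least `minsupport`."""
    return support(itemset, database) >= minsupport


# Toy database of four itemsets (for example, the nouns of four comment sentences).
database = [frozenset(s) for s in (
    {"battery", "life"},
    {"battery", "life", "screen"},
    {"screen"},
    {"battery"},
)]

pair_support = support({"battery", "life"}, database)
```

In this toy database the pair {battery, life} can never have higher support than either of its members, which is the property Apriori-style algorithms rely on to prune candidates.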

Unlike the Apriori algorithm, the feature identification algorithm here identifies only single-word and two-word features, that is, 1-itemsets and 2-itemsets, and it specifies separate minimum supports for the two, minsupport1 and minsupport2. The reason is that the frequency with which the two words of a two-word feature co-occur is far lower than the frequency with which a single-word feature occurs. If both used the same minimum support, too large a value would prevent two-word features from being identified effectively, and too small a value would cause a large number of wrong single-word features to be identified. The steps of the algorithm are as follows:

1) Extract all nouns from the user comments to generate a transaction file, each line of which holds the nouns occurring in one comment sentence;

2) Traverse the transaction file and count the support of each noun; the total number of lines in the transaction file, that is, the total number of comment sentences, is N;

3) Select the nouns whose support is not less than minsupport1 as single-word features;

4) Take all nouns whose support is not less than minsupport2 as the candidate set for two-word features;

5) Traverse the transaction file, count the support of the phrase formed by any two candidate nouns, and select the phrases whose support is not less than minsupport2 as two-word features.

As step 4) shows, the candidate set for two-word features consists of the nouns whose support is not less than minsupport2 rather than all nouns; this uses the Apriori property to narrow the search space. The two parameters minsupport1 and minsupport2 were obtained through a series of experiments: the single-word feature support minsupport1 is 0.012 and the two-word feature support minsupport2 is 0.005.
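The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each transaction is modeled as a set of nouns, support is the fraction of transactions containing the noun or noun pair, and a two-word feature is represented as an unordered `frozenset` (word order is only considered later, during filtering). The function name `identify_features` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


def identify_features(transactions: List[Set[str]],
                      minsupport1: float = 0.012,
                      minsupport2: float = 0.005
                      ) -> Tuple[Set[str], Set[FrozenSet[str]]]:
    """Return (single-word features, two-word features) from noun transactions."""
    n = len(transactions)

    # Step 2: support of every single noun.
    noun_count = Counter(noun for t in transactions for noun in t)

    # Step 3: single-word features.
    single = {w for w, c in noun_count.items() if c / n >= minsupport1}

    # Step 4: candidate nouns for two-word features (pruning step).
    candidates = {w for w, c in noun_count.items() if c / n >= minsupport2}

    # Step 5: support of every pair of candidate nouns that co-occur.
    pair_count: Counter = Counter()
    for t in transactions:
        for pair in combinations(sorted(t & candidates), 2):
            pair_count[frozenset(pair)] += 1
    double = {p for p, c in pair_count.items() if c / n >= minsupport2}

    return single, double


# Toy transactions; thresholds are raised so the tiny example is meaningful.
transactions = [
    {"battery", "life"},
    {"battery", "life"},
    {"screen"},
    {"battery"},
    {"price", "screen"},
]
single, double = identify_features(transactions, minsupport1=0.5, minsupport2=0.4)
```

With the thresholds reported in the text (0.012 and 0.005) the same function would be applied to the transaction file built from all comment sentences.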

Step 3, comment sentence classification: classify the user comment sentences by candidate feature, thereby obtaining the comment sentence class corresponding to each candidate feature.

After all evaluated features of the commodity have been identified, the feature evaluated by each comment sentence is analyzed in turn, and the comment sentence is assigned to the comment sentence class corresponding to that feature. This yields a series of comment sentence classes, each of which corresponds to one feature and contains all comment sentences that evaluate that feature.

Step 4, feature filtering: filter the candidate features according to the comment sentence classification result, thereby obtaining the candidate features of interest and their corresponding candidate comment sentences.

After the comment sentence classification is finished, meaningless two-word features are filtered out according to the classification result, considering the positions at which the two words forming a two-word feature occur in the comment sentences and the number of times they occur. Redundant single-word features are then filtered out according to the conceptual inclusion relation between candidate single-word features and two-word features.

For two-word feature filtering, it is observed that the two words forming a two-word feature usually occur close together in a comment sentence and keep a consistent relative order. The notion of a valid two-word feature is defined for this purpose.

Definition 1. A valid two-word feature f = <w_1, w_2> satisfies the following conditions:

(1) w_1 and w_2 co-occur in a comment sentence s, with w_1 before w_2, and the distance between the positions at which they occur is less than a given threshold windowsize;

(2) the support of the two-word feature is updated to the number of comment sentences satisfying condition (1), and this support must be greater than a given threshold minsupp.

If the support of a two-word feature f = <w_1, w_2> is less than the given threshold, the two-word feature is meaningless.
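Condition (1) of Definition 1 can be sketched in Python as below. The sketch assumes that a comment sentence is a list of preprocessed tokens and that the position of a word is its index in that list; the function name `ordered_support` is hypothetical. It returns a raw sentence count, which a caller would compare against the threshold (after converting between counts and fractions if needed).

```python
from typing import List


def ordered_support(w1: str, w2: str, sentences: List[List[str]],
                    windowsize: int = 2) -> int:
    """Count sentences in which w1 occurs before w2 and their positions
    differ by less than `windowsize`."""
    count = 0
    for tokens in sentences:
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        if any(0 < j - i < windowsize for i in pos1 for j in pos2):
            count += 1
    return count


sentences = [
    ["battery", "life", "great"],   # adjacent, correct order: counted
    ["life", "battery", "short"],   # wrong order: not counted
    ["battery", "short", "life"],   # distance 2 is not < 2: not counted
    ["love", "battery", "life"],    # adjacent, correct order: counted
]
supp_battery_life = ordered_support("battery", "life", sentences)
```

A two-word feature whose ordered support falls below the threshold would then be discarded as meaningless.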

For single-word feature filtering, the notion of the pure support of a single-word feature is defined.

Definition 2. Given all two-word features f_1, f_2, ..., f_m, the pure support of a single-word feature w is the total number of comment sentences in which w occurs and f_1, f_2, ..., f_m do not occur.

A valid single-word feature is a single-word feature whose pure support is not less than a given threshold minpsupp; a single-word feature whose pure support is less than minpsupp is redundant.

For example, suppose battery life and life are both features identified by Algorithm 1, the support of battery life is 20, and the support of life is 30. The pure support of life is then 30 - 20 = 10. If minpsupp = 20, life is a redundant single-word feature.
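The battery life example can be reproduced with a short Python sketch. It assumes that each comment sentence has already been reduced to the set of features found in it, with a two-word feature represented as a tuple; `pure_support` and the toy data are illustrative names and values, not taken from the original.

```python
from typing import List, Set, Tuple, Union

Feature = Union[str, Tuple[str, str]]


def pure_support(word: str,
                 double_features: List[Tuple[str, str]],
                 sentence_features: List[Set[Feature]]) -> int:
    """Number of sentences that contain `word` but none of the two-word features."""
    count = 0
    for features in sentence_features:
        if word in features and not any(f in features for f in double_features):
            count += 1
    return count


# 20 sentences mention the two-word feature "battery life",
# 10 further sentences mention "life" on its own.
with_battery_life: List[Set[Feature]] = [
    {"battery", "life", ("battery", "life")} for _ in range(20)
]
life_only: List[Set[Feature]] = [{"life"} for _ in range(10)]
sentence_features = with_battery_life + life_only

psupp_life = pure_support("life", [("battery", "life")], sentence_features)
minpsupp = 20
life_is_redundant = psupp_life < minpsupp
```

Here "life" occurs in 30 sentences in total, but only 10 of them are free of the two-word feature, so it falls below the threshold of 20 and is treated as redundant.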

The comment sentence classification and feature filtering algorithm (Algorithm 2) is described as follows:

Input: the preprocessed user comments and the candidate features identified by Algorithm 1

Output: the filtered features and the comment sentence class corresponding to each feature

Procedure: Classifier(windowsize, minsupp, minpsupp)

2 while a comment sentence s_i is read in
3     for each word w_j in s_i
4         if w_j is a single-word feature identified by Algorithm 1 then
5             off_j = position at which w_j occurs in s_i
6             nouns = nouns ∪ (w_j, off_j)
7             assign comment sentence s_i to the comment sentence class c_j of single-word feature w_j
8     for each pair of nouns (w_j, off_j), (w_k, off_k) in nouns
9         if <w_j, w_k> is a two-word feature && off_k - off_j < windowsize then
10            assign s_i to the comment sentence class c_jk of two-word feature <w_j, w_k>
11        else if <w_k, w_j> is a two-word feature && off_j - off_k < windowsize then
12            assign s_i to the comment sentence class c_kj of two-word feature <w_k, w_j>
13 for each two-word feature <w_j, w_k>
14     update the support supp_jk of <w_j, w_k> according to Definition 1
15     if supp_jk < minsupp then
16         delete two-word feature <w_j, w_k>
17 for each noun w_j occurring in a two-word feature
18     compute the pure support psupp_j of w_j according to Definition 2
19     if psupp_j < minpsupp then
20         delete single-word feature w_j

Lines 1-12 of Algorithm 2 perform the comment sentence classification, as shown in Fig. 2. Given a comment sentence, the algorithm first judges whether each noun occurring in it is a single-word feature, then judges whether each pair of nouns that are single-word features forms a two-word feature, and finally assigns the comment sentence to the comment sentence class of the corresponding single-word or two-word feature. The detailed classification process is as follows:

(1) Read in a comment sentence s and record the nouns w_1, w_2, ..., w_t occurring in it. Judge whether w_i (i = 1, ..., t) is a single-word feature; if not, continue with the next noun w_{i+1} in s until all nouns in s have been processed. (2) If w_i is a single-word feature, assign s to the class c_i corresponding to w_i and add w_i to nouns. For each pair of nouns <w_j, w_k> in nouns, judge whether <w_j, w_k> is a two-word feature; if so, assign s to the class c_{jk} corresponding to <w_j, w_k>; otherwise, return to (1) and continue with the next noun in s.
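The classification process can be sketched in Python as below. The sketch makes several assumptions: a comment sentence is a list of preprocessed tokens, a word's position is its token index, two-word features are ordered tuples, the word order required by Definition 1 is enforced during assignment, and classes store sentence indices. The function name `classify` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def classify(sentences: List[List[str]],
             single_features: Set[str],
             double_features: Set[Tuple[str, str]],
             windowsize: int = 2) -> Dict[object, List[int]]:
    """Map each feature to the indices of the sentences assigned to its class."""
    classes: Dict[object, List[int]] = defaultdict(list)
    for idx, tokens in enumerate(sentences):
        # (word, offset) pairs for the single-word features found in this sentence.
        nouns: List[Tuple[str, int]] = []
        for off, word in enumerate(tokens):
            if word in single_features:
                nouns.append((word, off))
                if idx not in classes[word]:
                    classes[word].append(idx)
        # Check every ordered pair of recorded nouns against the two-word features.
        for wj, offj in nouns:
            for wk, offk in nouns:
                if (wj, wk) in double_features and 0 < offk - offj < windowsize:
                    if idx not in classes[(wj, wk)]:
                        classes[(wj, wk)].append(idx)
    return dict(classes)


sentences = [
    ["battery", "life", "great"],
    ["screen", "bright"],
    ["life", "short"],
]
classes = classify(
    sentences,
    single_features={"battery", "life", "screen"},
    double_features={("battery", "life")},
)
```

Iterating over ordered pairs covers both branches of the pair test in the algorithm, since each pair is visited once in each order.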

Lines 13-16 of Algorithm 2 filter two-word features according to Definition 1, and lines 17-20 filter single-word features according to Definition 2. The three parameters windowsize, minsupp, and minpsupp denote, respectively, the maximum distance between the positions at which the two nouns forming a two-word feature occur in a comment sentence, the minimum support of a two-word feature, and the minimum pure support of a single-word feature. After a series of experiments, windowsize is set to 2, and minsupp and minpsupp take the same values as minsupport2 and minsupport1 respectively, namely 0.005 and 0.012.

Step 5, summary generation: calculate the score of the candidate comment sentences and extract several summary sentences to generate the summary.

On the basis of the comment sentence classification, the present invention generates the summary by sentence extraction. Fig. 3 is the flowchart of summary generation. As shown in Fig. 3, for each comment sentence class, the weights of the words forming the comment sentences are first calculated, and the highest-weighted keywords are extracted to form the centroid vector that expresses the theme of the class. The score of each comment sentence is then calculated from its similarity to the centroid, its length, and its position in the whole comment, and the highest-scoring comment sentences are extracted, within the compression ratio, as the summary sentences of the class. Finally, the summary sentences of all comment sentence classes are arranged in a certain order to generate the summary.

d = <s_1, s_2, ..., s_N> is the comment sentence class for a certain feature of a product, and N is the number of comment sentences in d.

s_i = <v_{w_{i1}}, v_{w_{i2}}, ..., v_{w_{in}}>, i = 1, 2, ..., N, is the vector model representation of comment sentence s_i, where n is the total number of words occurring in the whole comment sentence class; in w_{ij}, i is the identifier of the comment sentence and j is the global identifier of the word.

v_{w_{ij}}, i = 1, 2, ..., N, j = 1, 2, ..., n, is the weight of word w_j. In particular, v_{w_{ij}} = 0 when w_j does not occur in s_i.

The centroid of comment sentence class d is a pseudo-sentence that reflects the theme of the class. It is likewise represented by a vector model, <v_{w_{1}}, v_{w_{2}}, ..., v_{w_{n}}>, where v_{w_{k}} is the weight of keyword w_k, calculated as:

v_{w_{k}} = \frac{v_{w_{k}}^{*}}{\sqrt{\sum_{j = 1}^{n} {v_{w_{j}}^{*}}^{2}}}, j = 1, 2, ..., n,

and

v_{w_{k}}^{*} = {tf}_{w_{k}} * {idf}_{w_{k}},

{tf}_{w_{k}} = \sum_{i = 1}^{N} {tf}_{w_{k}, s_{i}},
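The centroid weights can be sketched in Python as follows. Because the definition of idf is not reproduced in the text, the sketch takes the idf values as an input mapping rather than computing them; the function name `centroid` and the sample values are assumptions for illustration. The sketch keeps every word of the class, whereas a system following the text would keep only the highest-weighted keywords.

```python
import math
from collections import Counter
from typing import Dict, List


def centroid(sentences: List[List[str]], idf: Dict[str, float]) -> Dict[str, float]:
    """Normalized tf*idf weight of every word in a comment sentence class."""
    # tf of a word for the class is the sum of its counts over all sentences.
    tf = Counter(word for tokens in sentences for word in tokens)
    # Unnormalized weight v* = tf * idf (words without an idf value get 0).
    raw = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    # Divide by the Euclidean norm so the centroid has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:
        return {word: 0.0 for word in raw}
    return {word: v / norm for word, v in raw.items()}


sentences = [["battery", "life"], ["battery", "short"]]
idf = {"battery": 1.0, "life": 1.0, "short": 1.0}  # placeholder idf values
c = centroid(sentences, idf)
```

Normalizing the centroid to unit length keeps a dot product with a unit-length sentence vector inside the range of a cosine similarity.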

For each comment sentence, the following three scores are calculated:

(1) The centroid-based score is as follows:

{score}_{c} (s_{i}) = \sum_{k = 1}^{n} (v_{w_{ik}} * v_{w_{k}}), 0 \leq {score}_{c} (s_{i}) \leq 1

that is, the cosine similarity between the vector representing the comment sentence and the centroid vector. Because the centroid is a pseudo-sentence expressing the theme of the document set, a comment sentence more similar to the centroid better reflects that theme and therefore scores higher.
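A minimal Python sketch of the centroid-based score, written as an explicit cosine similarity over sparse vectors stored as dictionaries. The function name `centroid_score` and the sample vectors are illustrative assumptions; dividing by both norms makes the result correct even when the input vectors are not already normalized.

```python
import math
from typing import Dict


def centroid_score(sentence_vec: Dict[str, float],
                   centroid_vec: Dict[str, float]) -> float:
    """Cosine similarity between a sentence vector and the centroid vector."""
    dot = sum(v * centroid_vec.get(w, 0.0) for w, v in sentence_vec.items())
    norm_s = math.sqrt(sum(v * v for v in sentence_vec.values()))
    norm_c = math.sqrt(sum(v * v for v in centroid_vec.values()))
    if norm_s == 0.0 or norm_c == 0.0:
        return 0.0
    return dot / (norm_s * norm_c)


centroid_vec = {"battery": 0.8, "life": 0.6}
on_topic = centroid_score({"battery": 1.0, "life": 1.0}, centroid_vec)
off_topic = centroid_score({"screen": 1.0}, centroid_vec)
```

With non-negative word weights the score always lies between 0 and 1, matching the bound stated for the centroid-based score.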

(2) The score based on comment sentence length is as follows:

The shorter a sentence, the higher its score, so that a summary of the same length can contain more sentences and therefore richer information.

(3) The score based on the first sentence of a paragraph is as follows:

According to Baxendale's research, the position of a sentence in a document greatly affects the importance of the sentence, and the probability that the first sentence of a paragraph is the central sentence of that paragraph is 85%. Therefore, the first sentence of a paragraph receives a score of 1.

For a comment sentence s_i, the initial score is the linear combination of the centroid-based score, the length-based score, and the first-sentence score, namely

{score}_{0} (s_{i}) = α * {score}_{c} (s_{i}) + β * {score}_{l} (s_{i}) + γ * {score}_{f} (s_{i})

where α is the weight of the centroid-based score, β is the weight of the score based on comment sentence length, and γ is the weight of the score based on the first sentence of a paragraph, with 0 < α, β, γ < 1 and α + β + γ = 1. Considering, through a series of experiments, the quality of the generated summary and the needs of practical application, α = 0.5, β = 0.3, and γ = 0.2 are chosen.
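The weighted combination can be written as a one-line Python function. The three component scores are taken as inputs, since the formulas for the length-based and first-sentence scores are not reproduced in the text; the function name `initial_score` and the sample values are illustrative.

```python
def initial_score(score_c: float, score_l: float, score_f: float,
                  alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> float:
    """Linear combination of centroid, length, and first-sentence scores."""
    return alpha * score_c + beta * score_l + gamma * score_f


# A sentence with centroid score 0.8 and length score 0.5 that opens a paragraph.
s0 = initial_score(0.8, 0.5, 1.0)
```

For these sample values the combination is 0.5 * 0.8 + 0.3 * 0.5 + 0.2 * 1.0 = 0.75.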

After the initial scores of the comment sentences are obtained, the highest-scoring sentence is extracted from each class in turn and added to the summary. If the summary length has not reached the limit set by the compression ratio, then after each iteration the scores of the remaining comment sentences in each class are recalculated and the highest-scoring sentence is again extracted and added to the summary; the iteration ends when the summary length reaches the limit. At the (k+1)-th iteration, the score of a comment sentence s_i is calculated as:

{score}_{k + 1} (s_{i}) = {score}_{k} (s_{i}) - \frac{1}{N} {score}_{k} (s_{k}^{*})

where s_{k}^{*} is the highest-scoring comment sentence chosen after the k-th iteration. The purpose of recalculating the sentence scores after each iteration is to give higher scores to sentences whose content differs from the sentences already chosen, thereby reducing the redundancy of the generated summary.

The final relative order that generates between the digest sentence that to consider when making a summary from each comment sentence class, to choose.Here earlier with the descending sort of feature, choose a digest sentence successively in the comment sentence class of each feature correspondence and add summary by support.
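The extraction procedure described above can be sketched roughly as follows. This is an illustrative sketch under several assumptions, none of which are confirmed by the text: summary length is measured in whitespace-separated words, extraction stops as soon as the next sentence would exceed the budget, the rescoring is applied within each class, and `n` stands for the N of the update formula, which this section does not define. The names `extract_summary`, `classes`, `support` and `word_budget` are hypothetical.

```python
def extract_summary(classes, support, word_budget, n):
    """Pick summary sentences class by class, most supported feature first.

    classes: dict mapping feature -> list of (sentence, initial_score)
    support: dict mapping feature -> support of that feature
    word_budget: assumed length limit, in words
    n: the N of the update formula (undefined in the text; a parameter here)
    """
    # Mutable copy: feature -> {sentence: current score}
    remaining = {f: dict(pairs) for f, pairs in classes.items()}
    # Features in descending order of support
    order = sorted(remaining, key=lambda f: support[f], reverse=True)

    summary, used = [], 0
    while any(remaining.values()):
        for feature in order:
            scores = remaining[feature]
            if not scores:
                continue
            best = max(scores, key=scores.get)
            cost = len(best.split())
            if used + cost > word_budget:
                return summary  # assumed stopping rule
            best_score = scores.pop(best)
            summary.append(best)
            used += cost
            # Update as written: score_{k+1}(s) = score_k(s) - score_k(s*) / N
            for sentence in scores:
                scores[sentence] -= best_score / n
    return summary
```

Each pass of the outer loop corresponds to one iteration in the text: one sentence per class, visited in descending order of feature support.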

Performance evaluation

The feature-based user comment auto-abstracting method first analyzes the user comments to identify the features being evaluated, then classifies all comment sentences according to those features, and finally uses sentence extraction to select summary sentences from each comment sentence class and generate the summary. The quality of feature identification is therefore critical to the quality of the generated summary.

Three metrics are mainly used to evaluate the quality of feature identification:

Recall ratio (Recall)

Precision ratio (Precision)

F1 value (F1-measure)

In the user comment summarization application, some features are evaluated by only a few users, and when the summary length is limited, priority should be given to the features that users commonly care about. The precision of feature identification is therefore more important than its recall.
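The three metrics can be computed from the set of features an algorithm identifies and the set of manually labeled features. The sketch below uses the standard definitions of precision, recall and F1; the function name `evaluate_features` and the sample feature names in the usage note are hypothetical and are not taken from the experimental data.

```python
def evaluate_features(identified, annotated):
    """Return (precision, recall, f1) for one product.

    identified: features output by the algorithm
    annotated: features labeled manually
    """
    identified, annotated = set(identified), set(annotated)
    correct = len(identified & annotated)  # correctly identified features

    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if an algorithm outputs `{"battery", "screen", "price", "thing"}` and the manual labels are `{"battery", "screen", "price", "camera", "size", "weight"}`, three of the four identified features are correct, so precision is 0.75 and recall is 0.5.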

The baseline algorithm chosen for the experiment is the Apriori algorithm used by Hu & Liu in their research on the sentiment analysis system FBS (Hu Minqing, Liu Bing. Mining and Summarizing Customer Reviews. SIGKDD, 2004, 168-177). The experimental data are English user comments on 5 products collected from the e-commerce websites amazon, cnet and epinions, comprising 2 mobile phones, 1 notebook computer, 1 MP3 player and 1 digital camera. Each product has several hundred user comments.

First, an annotator reads all user comments and manually labels the features in them; column 2 of Table 1 gives the number of manually labeled features for each product. The features identified by each algorithm are then compared with the manually labeled features; columns 3 and 7 give the number of features identified by the respective algorithms. The number of correctly identified features is counted, and the precision, recall and F1 value are computed. The experimental results show that the feature identification and filtering algorithm adopted by the invention achieves a recall of 51.9%, a precision of 81.0% and an F1 value of 62.7%. Compared with the baseline algorithm, precision is improved by 24% and the F1 value by 6%.

Table 1. Quality assessment of feature identification

Provided that feature identification is accurate, and for a given compression ratio (1% in the experiment), the feature-based user comment auto-abstracting method can generate a summary that covers all identified features (recall 51.9%) and greatly shortens reading time (to 1%). It thereby significantly improves the efficiency with which users obtain useful information from massive numbers of user comments, which has great practical significance and application prospects in the Internet era of explosive information growth.

The above is merely a preferred embodiment of the invention, and the scope of protection of the invention is not limited thereto. The method of the invention is equally applicable to extended fields such as the sale of electronic products, e-books and mobile phones and the improvement of user relevance. In addition, any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims

1. An auto-abstracting method based on user comment text features, comprising the following steps:

Step 4, feature filtering: filtering the candidate features according to the comment sentence classes, thereby obtaining the final features and the corresponding candidate comment sentence classes;

2. The auto-abstracting method based on user comment text features according to claim 1, characterized in that: in step 1, crawling and parsing the user comment webpages means crawling all user comment webpages for a selected specific item to obtain the crawled user comments, and then parsing the crawled user comments to obtain the user comment text.

3. The auto-abstracting method based on user comment text features according to claim 1, characterized in that: in step 1, preprocessing the user comments means tagging the part of speech of all words in the user comments, removing the stop words, and stemming the remaining words to obtain the preprocessed user comments.

4. The auto-abstracting method based on user comment text features according to claim 1, characterized in that: the feature evaluated by the user in step 2 refers to a particular aspect, detail, attribute or component that the user focuses on when evaluating an item.

5. The auto-abstracting method based on user comment text features according to claim 1, characterized in that: using a statistical method to identify candidate features in step 2 means extracting all nouns in the user comment sentences corresponding to the features evaluated by the user, and computing the occurrence frequency of each single noun and the co-occurrence frequency of any two nouns; the single nouns with the highest occurrence frequency and the noun pairs with the highest co-occurrence frequency are selected as candidate features.

6. The auto-abstracting method based on user comment text features according to claim 1, characterized in that: filtering the candidate features in step 4 means filtering out meaningless and redundant candidate features according to the relative positions at which the nouns composing a feature occur in the comment sentences and the generalization and specialization relations in meaning between the features.

7. The auto-abstracting method based on user comment text features according to claim 1, characterized in that: computing the scores of the candidate comment sentences in step 5 means computing the score of each candidate comment sentence according to its length, position and content.

8. The auto-abstracting method based on user comment text features according to claim 1 or 7, characterized in that step 5 further comprises: using a statistical method to compute the keywords that express the theme of each comment sentence class; then computing the score of each comment sentence according to the degree of fit between the comment sentence content and the theme, the length of the comment sentence, and the position at which the comment sentence occurs in the whole comment; and then extracting the several highest-scoring comment sentences in the user comment sentence class and organizing them into the summary.

9. The auto-abstracting method based on user comment text features according to claim 8, characterized in that: using a statistical method to compute the keywords that express the theme of each comment sentence class means that, on the basis of the classification of the comment sentences, a statistical method is used to find the keywords of each class, a centroid is constructed as a pseudo-sentence representing the theme of that comment sentence class, and the centroid-based score is computed from the similarity between a comment sentence and the centroid; the degree of fit between the comment sentence content and the theme is the similarity between the comment sentence and the centroid.

10. An autoabstract system based on user comment text features, comprising:

A user comment preprocessing module, which crawls and parses the user comment webpages to obtain the user comments, and then preprocesses the user comments to obtain the preprocessed user comments;

A feature identification module, which analyzes the preprocessed user comments to identify the features evaluated by the user, and then uses a statistical method to identify candidate features from the features evaluated by the user;

A comment sentence classification module, which classifies the preprocessed user comment sentences by the candidate features, thereby obtaining the comment sentence class corresponding to each candidate feature;

A feature filtering module, which filters the candidate features according to the comment sentence classes, thereby obtaining the final features and the corresponding candidate comment sentence classes;

A summary generation module, which computes the score of each sentence in the candidate comment sentence classes and extracts several summary sentences to generate the summary,

Wherein the user comment preprocessing module sends its preprocessing result to the feature identification module, which identifies the candidate features; the user comment text preprocessed by the user comment preprocessing module and the candidate features identified by the feature identification module are sent to the comment sentence classification module for classification, yielding the comment sentence classes; the candidate features are filtered to obtain the final features and their corresponding candidate comment sentence classes; and the summary generation module takes the candidate comment sentence classes and the final features as input and performs statistical analysis to generate the summary.