CN102682120B - Method and device for acquiring essential article commented on network - Google Patents


Info

Publication number
CN102682120B
Authority
CN
China
Prior art keywords
comment
text
comment text
value
key word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210151075.0A
Other languages
Chinese (zh)
Other versions
CN102682120A (en)
Inventor
陈学文
张宇峰
姚健
潘柏宇
卢述奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Youku Network Technology Beijing Co Ltd
Original Assignee
1Verge Internet Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 1Verge Internet Technology Beijing Co Ltd
Priority to CN201210151075.0A
Publication of CN102682120A
Application granted
Publication of CN102682120B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for acquiring elite text from network comment text. The method comprises the following steps: S1, extracting the keywords in a comment text; S2, assigning values based on the meaning each keyword represents, and calculating the value of each extracted keyword in the comment text library by inverse document frequency (IDF); S3, calculating the value of the keyword under a given subject from the number of times the keyword appears under that subject and the library value obtained in step S2; S4, calculating the punctuation value of the comment text, on the principle that the more closely the punctuation in a comment text conforms to the norm, the higher the value of the comment text; S5, calculating the similarity value of the comment text, on the principle that the more similar a later-posted comment text is to historical comment texts, the lower its value; S6, multiplying the keyword value from step S3 by the punctuation value from step S4 and the text similarity value from step S5 to obtain the score of each comment text; and S7, after the scores of a plurality of comment texts are obtained, taking the comment texts whose scores exceed a certain threshold as elite comment texts. With the method and device, elite text is acquired automatically by a computer program and algorithm, network management cost is reduced, and the precision of text acquisition is improved.

Description

Method and device for acquiring elite text from network comments
Technical field
The invention belongs to the field of text analysis, and in particular relates to a method and a device for acquiring elite text from network comments.
Background art
With the development of Internet technology, Web 2.0 applications have become increasingly widespread. Individual users can publish text and express opinions on the Internet themselves, and the amount of information on the Internet has grown exponentially as a result. Much of this information may be junk. How to obtain useful, elite content from such a volume of information is therefore a matter of general concern. In the prior art, elite comment texts on a given subject are obtained mainly in the following three ways:
1. An administrator marks elite comment texts manually. This approach depends on manual intervention, so inevitably only some videos have their elite comment texts marked, and the marking is highly subjective. For videos that attract a sudden surge of comments, marking takes too long: for example, a TV series broadcast simultaneously may receive thousands of comment texts in a single day, and no quick response is possible. The approach also relies heavily on human resources, lacks objectivity, and has a high rate of errors and omissions. Management cost is therefore high and the practical result is poor.
2. The system counts replies and uses the reply count to determine elite comment texts. This approach depends on how users engage with a comment text. A comment text marked elite this way does not necessarily have elite quality, and the result is easily skewed by user behavior such as arguments or question-and-answer exchanges. The results are therefore often not objective, and the user experience is poor.
3. Voting such as "agree/disagree" or "up/down" is used to determine elite comment texts. This approach also depends on user participation. When a large number of comment texts pour in, users pay attention only to the first few pages and much less to earlier comment texts. The results are therefore often one-sided, fail to reflect the facts fully, and easily cause users to miss important information.
All of the above approaches have limitations and may leave elite comment texts unmarked.
Summary of the invention
In view of the problems in the prior art, the object of the present invention is to provide a method and a device for acquiring elite text from network comments, which use a computer program and algorithm to acquire elite text automatically, reduce network management cost, and improve the quality of the text acquired.
To achieve the above object, the invention provides a method for acquiring elite text from network comments, characterized by comprising the following steps:
S1, extracting the keywords in a comment text;
S2, assigning values according to the meaning each keyword represents, and calculating the value of each extracted keyword in the comment text library by inverse document frequency (IDF);
S3, calculating the value of the keyword under a given subject from the number of times the keyword appears under that subject and the value of the keyword in the comment text library obtained in step S2;
S4, using a statistical method to process the distribution of punctuation marks and calculate the punctuation value of the comment text, on the principle that the more closely the punctuation in a comment text conforms to the norm, the higher the value of the comment text;
S5, calculating the similarity value of the comment text with the Dice coefficient, on the principle that the more similar a later-posted comment text is to historical comment texts, the lower its value;
S6, multiplying the keyword value calculated in step S3 by the punctuation value obtained in step S4 and the text similarity value calculated in step S5 to obtain the score of each comment text;
S7, after the scores of a plurality of comment texts are obtained, taking the comment texts whose scores exceed a certain threshold as elite comment texts.
Further, in the method for acquiring elite text from network comments according to the invention, step S1 specifically comprises:
S11, performing word segmentation on the content of the comment text;
S12, after segmentation, removing stop words according to a stop-word list; the remaining words are the keywords of the comment text.
Further, in the method for acquiring elite text from network comments according to the invention, step S4 specifically comprises:
S41, compiling statistics on the distribution of punctuation marks in a large-scale corpus, setting the top score to 1 point, normalizing the distribution of the ratio of Chinese characters to symbols over all sentences, and calculating a symbol distribution score;
S42, processing the symbol distribution scores to form a distribution curve of Chinese characters to symbols;
S43, calculating the symbol factor score of the comment text from the distribution curve.
Further, step S5 specifically comprises calculating text similarity with the Dice coefficient, measuring the similarity between two texts by the number of keywords they share and the weight of each keyword, where the keyword weight is set to 1;
The Dice coefficient formula is:
Dice(s1,s2)=2×comm(s1,s2)/(leng(s1)+leng(s2))
where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the character strings s1 and s2.
Further, in the method for acquiring elite text from network comments according to the invention, a back-end management program can also be used to designate which comment texts are elite comment texts, and these are displayed first.
A device for acquiring elite text from network comments, characterized by comprising the following modules:
a keyword extraction module, for extracting the keywords in a comment text;
a comment keyword text library value acquisition module, for assigning values according to the meaning each keyword represents, and calculating the value of each extracted keyword in the comment text library by inverse document frequency (IDF);
a comment text keyword value calculation module, for calculating the value of the keyword under a given subject from the number of times the keyword appears under that subject and the value of the keyword in the comment text library obtained by the comment keyword text library value acquisition module;
a comment text punctuation value calculation module, for using a statistical method to process the distribution of punctuation marks and calculate the punctuation value of the comment text, on the principle that the more closely the punctuation in a comment text conforms to the norm, the higher the value of the comment text;
a comment text similarity calculation module, for calculating the similarity value of the comment text with the Dice coefficient, on the principle that the more similar a later-posted comment text is to historical comment texts, the lower its value;
a comment text score calculation module, for multiplying the keyword value calculated by the comment text keyword value calculation module by the punctuation value obtained by the comment text punctuation value calculation module and the text similarity value calculated by the comment text similarity calculation module to obtain the score of each comment text;
an elite comment text determination module, for taking, after the scores of a plurality of comment texts are obtained, the comment texts whose scores exceed a certain threshold as elite comment texts.
The method and system of the invention use a computer program to calculate the elite text among network comments and acquire elite comment texts automatically. The results are objective, the volume handled is large, and omissions are reduced. Comment texts can be sorted by score, which makes it convenient to filter comment texts and related information and reduces manual intervention and comment maintenance cost.
Brief description of the drawings
Fig. 1 is a flowchart of the method for acquiring elite text from network comments according to the invention;
Fig. 2 is a block diagram of the device for acquiring elite text from network comments according to the invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments:
There are many kinds of subjects on the Internet, such as posts, microblogs, pictures and videos. Internet users usually keep following up on these subjects with comments, which produces a large number of comment texts. Because comment texts take the same form for every subject, namely written content, the way elite comment texts are acquired can be shared across subjects. In the specific embodiment, we therefore use Internet users' comments on a video subject to describe how elite comment texts are acquired.
Fig. 1 is a flowchart of the method for acquiring elite text from network comments according to the invention.
As shown in Fig. 1, the method of the invention is carried out as follows:
S1, extracting the keywords in a comment text;
A given subject often has many comment texts; a video, for example, often has thousands of comment texts after it is broadcast. To acquire elite comment texts, the content of each comment text must be analyzed. For each comment text, word segmentation is first performed on its content; after segmentation, stop words are removed according to a stop-word list, and the remaining words are the keywords of the comment text. These keywords are extracted because they characterize the comment text. A word in the stop-word list has little effect on the meaning of the text and can be ignored. Part of the stop-word list comes from the Internet and a small part is derived statistically: for example, statistics over a large body of comment texts show that the keyword "sofa" scores very low, so it can be added to the stop-word list. Other stop words include, for example, "seem" and "certain".
The core idea of this step is to extract the backbone of the sentences in a comment text and find the main keywords that determine its content. These keywords are used to obtain the score of the comment text in the elite comment calculation.
For example, take the comment "just bored with article contrast":
After word segmentation: with, article, what, contrast, just, bored;
After removing stop words: article, contrast, bored.
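The stop-word filtering in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes word segmentation has already been done by some segmenter, and the stop-word set `STOP_WORDS` is a hypothetical list chosen only to reproduce the example above.

```python
def extract_keywords(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t and t not in stop_words]


# Hypothetical stop-word list (an assumption, not taken from the source).
STOP_WORDS = {"with", "what", "just"}

# Tokens of the example comment after word segmentation.
tokens = ["with", "article", "what", "contrast", "just", "bored"]

keywords = extract_keywords(tokens, STOP_WORDS)
print(keywords)
```

In practice the stop-word list would be loaded from a file and extended with statistically derived entries such as "sofa".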
S2, obtaining the value of each extracted keyword in the comment text library;
The comment text library here refers to the comment text data that a service provider builds for all of its videos, for example all videos on Yoqoo; comment text data means the comment texts users post after watching a video. From the comment text library, every keyword that occurs in the site's comment texts can be counted and the value of each keyword in the whole library calculated. As a concrete example: the keyword "sofa" may occur in a large number of comment texts, whereas the keyword "text" may occur in only a few, so "text" has more influence on a comment text (a higher value) than "sofa". Values can therefore be assigned in the comment text library according to the meaning the keywords represent. Specifically:
The value of a keyword in the comment text library (Term Value) is expressed by inverse document frequency (IDF). The principle of calculating the IDF of a keyword is that the more documents among all comment text documents contain the keyword, the lower the value of the keyword. If a word or phrase occurs with a high frequency TF in one text and rarely in other texts, it is considered to discriminate well between classes and to be suitable for classification. TFIDF is in fact TF*IDF, where TF is term frequency and IDF is inverse document frequency. TF is the frequency with which a term occurs in a document. The main idea of IDF is that the fewer documents contain a term t, that is, the smaller n is, the larger the IDF, which indicates that t discriminates well between classes. If m documents in a class C contain t and k documents in other classes contain t, then the total number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which indicates that t discriminates poorly between classes.
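The text refers to an IDF formula without stating it. A minimal sketch, assuming the common form idf(t) = log(N / n_t), where N is the number of comment texts in the library and n_t is the number of comment texts containing keyword t, might look like this. The function name and the tiny sample library are illustrative only.

```python
import math


def inverse_document_frequency(comments):
    """Map each keyword to log(N / n_t).

    comments: list of keyword lists, one list per comment text.
    N is the number of comments, n_t the number of comments containing t.
    """
    n_docs = len(comments)
    doc_freq = {}
    for keywords in comments:
        for kw in set(keywords):  # count each keyword once per comment
            doc_freq[kw] = doc_freq.get(kw, 0) + 1
    return {kw: math.log(n_docs / df) for kw, df in doc_freq.items()}


# Illustrative library: "sofa" appears everywhere, "text" only once.
library = [["sofa"], ["sofa", "text"], ["sofa"], ["sofa", "artistic skills"]]
idf = inverse_document_frequency(library)
print(idf["sofa"], idf["text"])
```

A keyword that appears in every comment text gets a value of zero under this form, which matches the observation that "sofa" carries little information.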
S3, calculating the value of the keyword under a given subject from the number of times the keyword appears under that subject and the value of the keyword in the comment text library obtained in step S2;
For a given video, the purpose of obtaining the value of a keyword under the video is to calculate the influence of the keyword within that video. This can be regarded as a second-pass calculation of the keyword value, giving the influence of the keyword on one particular video.
For example: step S2 calculates the value scores of all keywords in the comment text library; suppose "Huang Haibo" and "Lin Xinru" each score 2.0 points. In a particular video (such as the first episode of "the fine epoch of son's wife"), the values of these two keywords will differ. If the statistics show that "Huang Haibo" occurs 6 times and "Lin Xinru" occurs once, then "Huang Haibo" = 12 points and "Lin Xinru" = 2 points.
A comment text can be selected as an elite comment text only if it contains multiple keywords.
The "Video Term Value (keyword score)" in the elite comment text formula (step f) can be obtained by adding up the scores of the individual keywords.
Example: there are two comment texts.
The scores of "Huang Haibo", "Lin Xinru", "artistic skills" and "well" are 12, 2, 5 and 0.1 respectively.
C1 = "Huang Haibo's artistic skills are pretty good"
Keyword score: "Huang Haibo" + "artistic skills" + "well" = 12 + 5 + 0.1 = 17.1
C2 = "Lin Xinru's artistic skills are pretty good"
Keyword score: "Lin Xinru" + "artistic skills" + "well" = 2 + 5 + 0.1 = 7.1
The value of a keyword in a video (Video Term Value) reflects how frequently the keyword appears in the comment texts posted by different users under that video.
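The arithmetic of step S3 can be reproduced with a short sketch. The function names `video_term_value` and `keyword_score` are hypothetical; the numbers are the ones given in the example (library value 2.0, 6 and 1 occurrences, and the stated scores 5 and 0.1 for the other two keywords).

```python
def video_term_value(library_value, occurrences):
    """Value of each keyword under one video: library value x occurrence count."""
    return {kw: library_value[kw] * n for kw, n in occurrences.items()}


def keyword_score(keywords, video_value):
    """Keyword score of a comment: sum of the video values of its keywords."""
    return sum(video_value.get(kw, 0.0) for kw in keywords)


video_value = video_term_value(
    {"Huang Haibo": 2.0, "Lin Xinru": 2.0},
    {"Huang Haibo": 6, "Lin Xinru": 1},
)
# The text states the video values of these two keywords directly.
video_value.update({"artistic skills": 5.0, "well": 0.1})

c1 = keyword_score(["Huang Haibo", "artistic skills", "well"], video_value)
c2 = keyword_score(["Lin Xinru", "artistic skills", "well"], video_value)
print(round(c1, 1), round(c2, 1))
```

Keywords with no known value contribute zero here, which is an assumption of this sketch rather than something the text specifies.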
S4, calculating the punctuation value of the comment text;
The principle behind calculating the punctuation value (Sign Value) of a comment text is that the more closely the punctuation in a comment text conforms to the norm, the higher the value of the comment text. The calculation is as follows:
Compile statistics on the distribution of punctuation marks in a large-scale corpus; set the top score to 1 point, normalize the distribution of the ratio of Chinese characters to symbols over all sentences, and calculate a punctuation distribution score. The scores are then processed to form a distribution curve of Chinese characters to punctuation marks. Finally, the ratio of Chinese characters to symbols in the comment text is calculated, and the punctuation value score of the comment text is obtained from the symbol distribution results.
Example: take a Chinese corpus of 3 million sentences and count the ratio of Chinese characters to symbols in each sentence. The ratio shared by the largest number of sentences is given a symbol score of 1 point, and the scores for other ratios are calculated in proportion to their sentence counts.
Statistics:
Sentences with a Chinese-character-to-symbol ratio of 10:1 are the most numerous, at 300,000 sentences, so the symbol score for this ratio is set to 1 point.
There are 200,000 sentences with a ratio of 11:1, so the score for 11:1 is 1 × (20/30) ≈ 0.67 points.
There are 250,000 sentences with a ratio of 9:1, so the score for 9:1 is 1 × (25/30) ≈ 0.83 points.
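The normalization in this example can be sketched as a lookup table. This is a simplified illustration: it skips the curve-forming step and scores a comment by looking up its rounded ratio directly. The function names, the rounding of the ratio, and the score of 0 for unseen ratios or comments without symbols are all assumptions.

```python
def ratio_score_table(sentence_counts):
    """Normalize sentence counts per ratio so the most common ratio scores 1.0."""
    peak = max(sentence_counts.values())
    return {ratio: count / peak for ratio, count in sentence_counts.items()}


def symbol_score(num_chars, num_symbols, table):
    """Look up the symbol score for a comment's character-to-symbol ratio."""
    if num_symbols == 0:
        return 0.0  # assumption: the source does not cover this case
    ratio = round(num_chars / num_symbols)
    return table.get(ratio, 0.0)  # assumption: unseen ratios score 0


# Sentence counts per ratio, as given in the statistics above.
CORPUS_COUNTS = {10: 300_000, 11: 200_000, 9: 250_000}
table = ratio_score_table(CORPUS_COUNTS)

score = symbol_score(10, 1, table)  # a comment with a 10:1 ratio
print(table, score)
```

A production version would smooth the table into a curve so that ratios between or beyond the observed ones receive intermediate scores.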
For example, the symbol factor score of the comment text "Huang Haibo's artistic skills are really good!" is calculated as follows: the ratio of Chinese characters to punctuation marks in the comment text is first found to be 10:1, so the symbol factor score of the comment text is 1 point.
S5, calculating the similarity value of the comment text;
The similarity value (Similarity Value) compares a comment text under a video with the historical comment texts under that video. The principle is that the more similar a later-posted comment text is to historical comment texts, the lower its value.
Text similarity is calculated with the Dice coefficient: the similarity between two texts is measured by the number of keywords they share and the weight of each keyword, where the keyword weight is set to 1;
The Dice coefficient formula is:
Dice(s1,s2)=2×comm(s1,s2)/(leng(s1)+leng(s2))
where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the character strings s1 and s2.
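A minimal sketch of the Dice formula follows. It assumes comm(s1, s2) counts shared items with multiplicity (a multiset intersection), which the text does not specify, and it accepts either character strings or keyword lists. The helper `similarity_factor` turns the similarity into a score factor as 1 minus the highest similarity to any earlier comment, following the worked example later in the description (similarity 0.3 gives a factor of 0.7); both function names are hypothetical.

```python
from collections import Counter


def dice(s1, s2):
    """Dice coefficient 2 * comm(s1, s2) / (len(s1) + len(s2))."""
    if not s1 and not s2:
        return 0.0  # assumption: two empty inputs are treated as dissimilar
    common = sum((Counter(s1) & Counter(s2)).values())
    return 2 * common / (len(s1) + len(s2))


def similarity_factor(comment, history):
    """1 minus the highest Dice similarity to any earlier comment."""
    if not history:
        return 1.0
    return 1.0 - max(dice(comment, old) for old in history)


print(dice("abcd", "abef"))
print(similarity_factor("abcd", ["abef", "zzzz"]))
```

With keyword weight fixed at 1, passing lists of keywords instead of strings gives the keyword-level variant described in the text.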
S6, multiplying the keyword value calculated in step S3 by the punctuation value obtained in step S4 and the similarity value calculated in step S5 to obtain the score of each comment text;
Specifically, the score of a comment text under a video can be written as:
Comment text score = Video Term Value (keyword score) * Sign Value (symbol factor) * Similarity Value (similarity factor)
The formula can also be extended:
Comment text score = keyword factor * symbol factor * similarity factor * other factor 1 * other factor 2
The other factors capture the influence of information such as the title, the user and the video synopsis on the comment text score.
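The score formula, including the optional extension with other factors, can be sketched in a few lines. The function name `comment_score` and the `other_factors` parameter are hypothetical; the sample inputs are factor values used in the worked example in the description.

```python
import math


def comment_score(keyword_factor, symbol_factor, similarity_factor, other_factors=()):
    """Product of the three core factors and any additional factors."""
    return keyword_factor * symbol_factor * similarity_factor * math.prod(other_factors)


print(round(comment_score(19, 1, 0.7), 1))
print(round(comment_score(14, 0.3, 1), 1))
```

Because the factors are multiplied, a factor of zero (for instance a comment with no keywords) drives the whole score to zero regardless of the others.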
S7, after the scores of a plurality of comment texts are obtained, taking the comment texts whose scores exceed a certain threshold as elite comment texts.
After the scores of a plurality of comment texts are obtained, the comment texts are displayed on the video playback page in descending order of score, and those whose scores exceed a certain threshold are taken as elite comment texts.
In addition, elite comment texts can be set manually: a back-end management program is used to designate which comment texts are elite comment texts, and these are displayed first.
For example, if the elite comment texts in the example are C1 and C3, and C4 is manually set as an elite comment text,
then the final display order of the elite comment texts is: C4, C1, C3.
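Threshold selection with manual override can be sketched as below. The function name `select_elite` is hypothetical, and the scores are the final scores from the worked example in the description, used here only to reproduce the display order C4, C1, C3.

```python
def select_elite(scores, threshold, manual=()):
    """Manually designated comments first, then comments above the threshold
    in descending order of score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    automatic = [cid for cid, s in ranked if s > threshold and cid not in manual]
    return list(manual) + automatic


scores = {"C1": 17.1, "C2": 0.0, "C3": 13.3, "C4": 4.2, "C5": 5.2}
print(select_elite(scores, threshold=10, manual=["C4"]))
```

Placing manual picks ahead of all automatic ones follows the statement that designated comments are displayed first; how several manual picks are ordered among themselves is left to the caller in this sketch.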
The implementation of the invention is described in detail below through a concrete example of extracting elite comment texts for a specific video, so that those skilled in the art can follow the whole process:
The keyword values in the comment text library of step S2 are calculated before the elite comment calculation begins, by compiling statistics over all comment texts in the library.
Suppose Huang Haibo = 4 points, artistic skills = 2.5 points, good = 0.1 points.
Suppose a video has 6 comment texts (in practice, the elite comment calculation is run on at least 300 comment texts, and at most tens of thousands):
(user 1) C1 = Huang Haibo's artistic skills are pretty good.
(user 2) C2 = 。。。。。。。。。
(user 3) C3 = Huang Haibo in this TV series is a good person!
(user 4) C4 = The leading man above also has his own style.。。。。。。。。。。。。。。
(user 5) C5 = Huang Haibo's artistic skills are fine.
(user 1) C6 = Huang Haibo
After word segmentation and keyword extraction:
C1: Huang Haibo, artistic skills, good
C2: NULL (no keywords)
C3: TV series, inside, Huang Haibo, good person
C4: above, leading man, oneself, style
C5: Huang Haibo, artistic skills, fine
Keyword scores:
Huang Haibo = 4*3 = 12 (keyword value * number of occurrences in the video; each user's keyword is counted only once, so C6, for example, is not counted)
Artistic skills = 2.5*2 = 5
Good = 0.1*1 = 0.1
The preliminary score of each comment text is then:
C1 = "Huang Haibo" + "artistic skills" + "good" = 12 + 5 + 0.1 = 17.1
C2 = 0
C3 = 19
C4 = 14
C5 = 17.2
Symbol scores:
C1 = 1, C2 = 0, C3 = 1, C4 = 0.3, C5 = 1
Similarity scores:
C1 = 1
C2 = 1
C3 = 0.7 (most similar to C1, with a similarity of 0.3; final similarity score: 1 - 0.3 = 0.7)
C4 = 1
C5 = 0.3
Final scores:
C1 = 17.1*1*1 = 17.1
C2 = 0
C3 = 19*1*0.7 = 13.3
C4 = 14*0.3 = 4.2
C5 = 17.2*1*0.3 ≈ 5.2
Final ranking:
C1 > C3 > C5 > C4 > C2
With the threshold set to 10, the comment texts scoring above 10 are taken as elite comment texts.
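The worked example can be checked end to end with the sketch below. Only the keyword values for "Huang Haibo", "artistic skills" and "good" are derived in code, using the rule that each user's keyword counts once; the preliminary scores of C3, C4 and C5 and all symbol and similarity factors are copied from the text, because the values of their other keywords are not given. All names are hypothetical.

```python
LIBRARY_VALUE = {"Huang Haibo": 4.0, "artistic skills": 2.5, "good": 0.1}

# (user, keywords) per comment; only keywords with a stated library value.
COMMENTS = {
    "C1": ("user1", ["Huang Haibo", "artistic skills", "good"]),
    "C3": ("user3", ["Huang Haibo"]),
    "C5": ("user5", ["Huang Haibo", "artistic skills"]),
    "C6": ("user1", ["Huang Haibo"]),  # same user as C1, so not counted again
}


def video_values(library_value, comments):
    """Library value x number of distinct users who used the keyword."""
    users_per_keyword = {}
    for user, keywords in comments.values():
        for kw in keywords:
            users_per_keyword.setdefault(kw, set()).add(user)
    return {kw: library_value[kw] * len(u) for kw, u in users_per_keyword.items()}


values = video_values(LIBRARY_VALUE, COMMENTS)

# C1 is computed; the other preliminary scores are taken from the text.
keyword = {"C1": sum(values[k] for k in COMMENTS["C1"][1]),
           "C2": 0.0, "C3": 19.0, "C4": 14.0, "C5": 17.2}
symbol = {"C1": 1.0, "C2": 0.0, "C3": 1.0, "C4": 0.3, "C5": 1.0}
similarity = {"C1": 1.0, "C2": 1.0, "C3": 0.7, "C4": 1.0, "C5": 0.3}

final = {c: keyword[c] * symbol[c] * similarity[c] for c in keyword}
ranking = sorted(final, key=final.get, reverse=True)
elite = [c for c in ranking if final[c] > 10]
print(ranking, elite)
```

With the threshold of 10, the sketch selects C1 and C3, consistent with the elite comment texts named in the manual-override example earlier in the description.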
The technical solution of the invention can be implemented in a single device, which yields a physical device capable of carrying out the solution. Fig. 2 is a block diagram of the device for acquiring elite text from network comments according to the invention, which specifically comprises the following modules:
a keyword extraction module, for extracting the keywords in a comment text;
a comment keyword text library value acquisition module, for obtaining the value of each extracted keyword in the comment text library;
a comment text keyword value calculation module, for calculating the value of the keyword under a given subject from the number of times the keyword appears under that subject and the value of the keyword in the comment text library obtained by the comment keyword text library value acquisition module;
a comment text punctuation value calculation module, for calculating the punctuation value of the comment text;
a comment text similarity calculation module, for calculating the similarity value of the comment text;
a comment text score calculation module, for multiplying the keyword value calculated by the comment text keyword value calculation module by the punctuation value obtained by the comment text punctuation value calculation module and the similarity value calculated by the comment text similarity calculation module to obtain the score of each comment text;
an elite comment text determination module, for taking, after the scores of a plurality of comment texts are obtained, the comment texts whose scores exceed a certain threshold as elite comment texts.
In addition, the invention can also be carried out by separate devices working together, which yields a system capable of carrying out the solution. Fig. 3 is a block diagram of the system for acquiring elite text from network comments according to the invention, which specifically comprises the following devices:
a keyword extraction device, for extracting the keywords in a comment text;
a comment keyword text library value acquisition device, for obtaining the value of each extracted keyword in the comment text library;
a comment text keyword value calculation device, for calculating the value of the keyword under a given subject from the number of times the keyword appears under that subject and the value of the keyword in the comment text library obtained by the comment keyword text library value acquisition device;
a comment text punctuation value calculation device, for calculating the punctuation value of the comment text;
a comment text similarity calculation device, for calculating the similarity value of the comment text;
a comment text score calculation device, for multiplying the keyword value calculated by the comment text keyword value calculation device by the punctuation value obtained by the comment text punctuation value calculation device and the text similarity value calculated by the comment text similarity calculation device to obtain the score of each comment text;
an elite comment text determination device, for taking, after the scores of a plurality of comment texts are obtained, the comment texts whose scores exceed a certain threshold as elite comment texts.
In summary, the method and device for acquiring elite text from network comments provided by the invention adopt a new technical solution in which a program automatically analyzes all comment texts under a video and produces a list of scores reflecting how elite each comment text is. The elite comment calculation also guards against flooding by a single user and against multiple users posting similar content, so the resulting scores are reasonably fair. The solution is suitable for showing a number of outstanding comment texts in the comment area of a video playback page.
The above is a detailed description of the preferred embodiments of the invention. Those of ordinary skill in the art will appreciate that, within the scope and spirit of the invention, various improvements, additions and substitutions are possible, such as adjusting the order of interface calls, changing message formats and content, or implementing the solution in a different programming language (such as C, C++ or Java). All of these fall within the scope of protection defined by the claims of the invention.

Claims (6)

1. an acquisition methods for network comment elite text, is characterized in that comprising the steps:
S1, the key word extracted in comment text;
S2, carry out assignment in conjunction with the meaning that key word characterizes, and calculated by anti-document frequency (IDF) and obtain the key word extracted and be worth in comment text storehouse;
The value calculation key word value under this theme of the key word obtained in S3, the number of times occurred under a certain theme according to key word and step S2 in comment storehouse;
S4, Using statistics method the punctuation mark that the distribution of punctuation mark processes to calculate in comment text is worth, the principle of its foundation is that in comment text, punctuation mark more meets rule, and so this comment is worth higher;
S5, adopt the value of Dice coefficient calculations comment text similarity, more high value is lower for the comment that the principle of its foundation is delivered after being and historical review text similarity;
S6, the key word calculated is worth the text similarity being worth with the punctuation mark obtained in step S4 and calculating in step S5 is worth to be multiplied and calculates the score of each comment text in step S3;
S7, obtain many comment texts score after, obtain point comment exceeding certain threshold value as elite comment text.
2. the acquisition methods of network comment elite text according to claim 1, is characterized in that the detailed process of step S1 comprises:
S11, participle is carried out to comment text content;
Remove stop words according to vocabulary of stopping using after S22, participle, remaining is then the key word of comment text content.
3. the acquisition methods of network comment elite text according to claim 1, is characterized in that the detailed process of step S4 comprises:
The distribution of S41, statistics large-scale corpus punctuation mark, with top score is 1 point, by the distribution normalized of the Chinese character of all sentences with symbol ratio, calculates the distribution score of a symbol;
S42, to symbol distribution score process, form a Chinese character and symbol distribution curve;
S43, according to distribution curve calculate comment in symbol factor score.
4. The method for acquiring elite text from network comments according to claim 1 or 3, characterized in that step S5 specifically comprises calculating text similarity using the Dice coefficient, wherein the degree of similarity between two texts is measured by the number of keywords the two texts have in common and the weight of each keyword, the keyword weight being taken as 1;
The Dice coefficient formula is:
Dice(s1,s2)=2×comm(s1,s2)/(leng(s1)+leng(s2))
where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the character strings s1 and s2.
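The formula can be illustrated with a short Python sketch. Counting identical characters as a multiset intersection is one interpretation of comm(s1, s2). The function `similarity_value`, which turns the coefficient into a value that falls as similarity to earlier comments rises, uses 1 minus the highest Dice coefficient; that mapping is an assumption made here for illustration, as the claim states only the direction of the relationship.

```python
from collections import Counter


def dice(s1: str, s2: str) -> float:
    """Dice(s1, s2) = 2 * comm(s1, s2) / (leng(s1) + leng(s2))."""
    total = len(s1) + len(s2)
    if total == 0:
        return 0.0  # convention chosen here for two empty strings
    # comm: characters shared by both strings, counted with multiplicity
    common = sum((Counter(s1) & Counter(s2)).values())
    return 2 * common / total


def similarity_value(new_comment: str, history: list[str]) -> float:
    """Assumed mapping: the closer to any earlier comment, the lower the value."""
    if not history:
        return 1.0
    return 1.0 - max(dice(new_comment, old) for old in history)


d = dice("abcd", "abef")  # 2 shared characters, lengths 4 and 4
```

With "abcd" and "abef" the shared characters are "a" and "b", so the coefficient is 2×2/(4+4) = 0.5.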
5. The method for acquiring elite text from network comments according to claim 1, characterized in that a back-end management program can also be used to set which comment texts are elite comment texts, and those comment texts are displayed preferentially.
6. A device for acquiring elite articles from network comments, characterized by comprising the following modules:
A keyword extraction module, for extracting the keywords in the comment text;
A comment-library keyword value acquisition module, for assigning values according to the meaning that the keywords represent, and for obtaining the value of the extracted keywords in the comment library by inverse document frequency (IDF) calculation;
A comment text keyword value calculation module, for calculating the value of a keyword under a given topic from the number of times the keyword appears under that topic and the value of the keyword in the comment library obtained by the comment-library keyword value acquisition module;
A comment text punctuation value calculation module, for calculating the punctuation value of the comment by applying a statistical method to the distribution of punctuation marks, on the principle that the more closely the punctuation in a comment conforms to the regular pattern, the higher the value of that comment;
A comment text similarity calculation module, for calculating the similarity value of the comment text using the Dice coefficient, on the principle that the more similar a later-posted comment is to the historical comment texts, the lower its value;
A comment text score calculation module, for multiplying the keyword value calculated by the comment text keyword value calculation module by the punctuation value obtained by the comment text punctuation value calculation module and the text similarity value calculated by the comment text similarity calculation module to obtain the score of each comment;
An elite comment text determination module, for taking, after the scores of a plurality of comment texts have been obtained, the comment texts whose score exceeds a certain threshold as the elite comment texts.
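The two keyword value modules of the device can be sketched in Python as follows. The sketch makes several assumptions that the claim does not state: IDF is computed as log(N/df) over the comment library, the topic-level value is the product of the keyword's occurrence count under the topic and its IDF value, and the meaning-based assignment of values is omitted. The function names `idf_values` and `topic_keyword_values` are hypothetical.

```python
import math
from collections import Counter


def idf_values(library: list[list[str]]) -> dict[str, float]:
    """IDF of each keyword over the comment library (one keyword list per comment)."""
    n = len(library)
    doc_freq = Counter()
    for keywords in library:
        doc_freq.update(set(keywords))  # count each keyword once per comment
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def topic_keyword_values(
    topic_comments: list[list[str]], idf: dict[str, float]
) -> dict[str, float]:
    """Assumed combination: occurrences under the topic times the library IDF."""
    counts = Counter(k for keywords in topic_comments for k in keywords)
    return {k: counts[k] * idf.get(k, 0.0) for k in counts}


library = [["funny", "video"], ["video"], ["boring", "video"], ["video", "plot"]]
idf = idf_values(library)
values = topic_keyword_values([["funny", "video"], ["funny"]], idf)
```

In this example "video" appears in every comment of the library, so its IDF and its topic-level value are both 0, while "funny" appears in one comment out of four and receives a positive value.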
CN201210151075.0A 2012-05-15 2012-05-15 Method and device for acquiring essential article commented on network Expired - Fee Related CN102682120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210151075.0A CN102682120B (en) 2012-05-15 2012-05-15 Method and device for acquiring essential article commented on network

Publications (2)

Publication Number Publication Date
CN102682120A CN102682120A (en) 2012-09-19
CN102682120B true CN102682120B (en) 2015-06-03

Family

ID=46814045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210151075.0A Expired - Fee Related CN102682120B (en) 2012-05-15 2012-05-15 Method and device for acquiring essential article commented on network

Country Status (1)

Country Link
CN (1) CN102682120B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678416A (en) * 2012-09-26 2014-03-26 杨裴生 Network news and information reading interactive system
CN104714939B (en) * 2013-12-13 2017-09-29 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN104778171A (en) * 2014-01-10 2015-07-15 携程计算机技术(上海)有限公司 Character string matching system and method
CN105630793A (en) * 2014-10-28 2016-06-01 阿里巴巴集团控股有限公司 Information weight determination method and device
CN105446602B (en) * 2015-11-24 2019-04-16 努比亚技术有限公司 The device and method for positioning article keyword
CN107301200A (en) * 2017-05-23 2017-10-27 合肥智权信息科技有限公司 A kind of article appraisal procedure and system analyzed based on Sentiment orientation
CN107818173B (en) * 2017-11-15 2021-05-14 电子科技大学 Vector space model-based Chinese false comment filtering method
CN110276065A (en) * 2018-03-15 2019-09-24 北京京东尚科信息技术有限公司 A kind of method and apparatus handling goods review
CN109829165A (en) * 2019-02-11 2019-05-31 杭州乾博科技有限公司 One kind is from media article Valuation Method and system
CN110674256B (en) * 2019-09-25 2023-05-12 携程计算机技术(上海)有限公司 Method and system for detecting correlation degree of comment and reply of OTA hotel
CN111782761B (en) * 2020-05-12 2023-10-31 北京达佳互联信息技术有限公司 Comment information determining method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630321A (en) * 2009-08-26 2010-01-20 中山大学 On-line article screening method based on data mining (DM)
CN101739416A (en) * 2008-11-04 2010-06-16 未序网络科技(上海)有限公司 Method for sequencing multi-index comprehensive weight video
CN102081627A (en) * 2009-11-27 2011-06-01 北京金山软件有限公司 Method and system for determining contribution degree of word in text

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100153318A1 (en) * 2008-11-19 2010-06-17 Massachusetts Institute Of Technology Methods and systems for automatically summarizing semantic properties from documents with freeform textual annotations
CN101667194A (en) * 2009-09-29 2010-03-10 北京大学 Automatic abstracting method and system based on user comment text feature
CN102254038B (en) * 2011-08-11 2013-01-23 武汉安问科技发展有限责任公司 System and method for analyzing network comment relevance

Also Published As

Publication number Publication date
CN102682120A (en) 2012-09-19

Similar Documents

Publication Publication Date Title
CN102682120B (en) Method and device for acquiring essential article commented on network
Rangel et al. Overview of the 3rd Author Profiling Task at PAN 2015
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
Rangel Pardo et al. Overview of the 3rd Author Profiling Task at PAN 2015
CN103699626B (en) Method and system for analysing individual emotion tendency of microblog user
CN103198057B (en) One kind adds tagged method and apparatus to document automatically
Furlan et al. Semantic similarity of short texts in languages with a deficient natural language processing support
Shimada et al. Analyzing tourism information on twitter for a local city
CN103761239B (en) A kind of method utilizing emoticon that microblogging is carried out Sentiment orientation classification
CN106503049A (en) A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM
CN106980692A (en) A kind of influence power computational methods based on microblogging particular event
Kothari et al. Detecting comments on news articles in microblogs
Chatzakou et al. Harvesting opinions and emotions from social media textual resources
CN102096680A (en) Method and device for analyzing information validity
CN104268192B (en) A kind of webpage information extracting method, device and terminal
CN102033880A (en) Marking method and device based on structured data acquisition
Bora Summarizing public opinions in tweets
CN103544321A (en) Data processing method and device for micro-blog emotion information
CN104077417A (en) Figure tag recommendation method and system in social network
CN107203520A (en) The method for building up of hotel's sentiment dictionary, the sentiment analysis method and system of comment
CN104899335A (en) Method for performing sentiment classification on network public sentiment of information
CN108363748B (en) Topic portrait system and topic portrait method based on knowledge
CN103577405A (en) Interest analysis based micro-blogger community classification method
CN106227768A (en) A kind of short text opining mining method based on complementary language material
CN107688630A (en) A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee after: Youku network technology (Beijing) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: 1VERGE INTERNET TECHNOLOGY (BEIJING) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20200623

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: Youku network technology (Beijing) Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150603

Termination date: 20210515