Summary of the invention
In view of the problems of the prior art, the object of the present invention is to provide a method and device for acquiring elite text from network comments, which use a computer program and algorithm to acquire elite text automatically, reducing network administration cost and improving the quality of the acquired text.
To achieve the above object, the invention provides a method for acquiring elite text from network comments, characterized by comprising the following steps:
S1, extracting the keywords in a comment text;
S2, assigning values in combination with the meaning that each keyword represents, and calculating, by inverse document frequency (IDF), the value of each extracted keyword in the comment text library;
S3, calculating the value of each keyword under a given topic from the number of times the keyword occurs under that topic and the value of the keyword in the comment text library obtained in step S2;
S4, calculating the punctuation value of the comment text by statistically processing the distribution of punctuation marks, on the principle that the more closely the punctuation in a comment text follows the regular distribution, the higher the value of that comment text;
S5, calculating the similarity value of the comment text using the Dice coefficient, on the principle that the more similar a later-posted comment text is to historical comment texts, the lower its value;
S6, multiplying the keyword value calculated in step S3 by the punctuation value obtained in step S4 and the text similarity value calculated in step S5 to obtain the score of each comment text;
S7, after the scores of multiple comment texts have been obtained, taking the comment texts whose scores exceed a given threshold as elite comment texts.
Further, in the method for acquiring elite text from network comments according to the present invention, step S1 specifically comprises:
S11, performing word segmentation on the comment text content;
S12, after segmentation, removing stop words according to a stop-word list; the remaining words are the keywords of the comment text content.
Further, in the method for acquiring elite text from network comments according to the present invention, step S4 specifically comprises:
S41, counting the distribution of punctuation marks in a large-scale corpus, normalizing the distribution of the ratio of Chinese characters to symbols over all sentences so that the highest score is 1 point, and calculating a symbol distribution score;
S42, processing the symbol distribution scores to form a distribution curve of Chinese characters and symbols;
S43, calculating the symbol factor score of the comment text from the distribution curve.
Further, step S5 specifically comprises calculating text similarity using the Dice coefficient, measuring the degree of similarity between two texts by the number of keywords they share and the weight of each keyword, where each keyword weight is taken as 1;
Dice coefficient formulas is:
Dice(s1,s2)=2×comm(s1,s2)/(leng(s1)+leng(s2))
where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the strings s1 and s2.
Further, in the method for acquiring elite text from network comments according to the present invention, a back-end management program may also be used to set which comment texts are elite comment texts and to display them preferentially.
A device for acquiring elite text from network comments, characterized by comprising the following modules:
A keyword extraction module, for extracting the keywords in a comment text;
A keyword library value acquisition module, for assigning values in combination with the meaning that each keyword represents, and calculating, by inverse document frequency (IDF), the value of each extracted keyword in the comment text library;
A comment text keyword value calculation module, for calculating the value of each keyword under a given topic from the number of times the keyword occurs under that topic and the value of the keyword in the comment text library obtained by the keyword library value acquisition module;
A comment text punctuation value calculation module, for calculating the punctuation value of the comment text by statistically processing the distribution of punctuation marks, on the principle that the more closely the punctuation in a comment text follows the regular distribution, the higher the value of that comment text;
A comment text similarity calculation module, for calculating the similarity value of the comment text using the Dice coefficient, on the principle that the more similar a later-posted comment text is to historical comment texts, the lower its value;
A comment text score calculation module, for multiplying the keyword value calculated by the comment text keyword value calculation module by the punctuation value obtained by the comment text punctuation value calculation module and the text similarity value calculated by the comment text similarity calculation module to obtain the score of each comment text;
An elite comment text determination module, for taking, after the scores of multiple comment texts have been obtained, the comment texts whose scores exceed a given threshold as elite comment texts.
The method and system for acquiring elite text from network comments according to the present invention use a computer program to calculate the elite text among network comments and acquire elite comment texts automatically. The results are objective, the volume handled is large, and omissions are reduced. Comment texts are sorted by score, which makes it easy to screen comment texts and related information and reduces manual intervention and comment maintenance cost.
Embodiment
To make the above objects, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments:
The internet hosts many kinds of topics, such as posts, microblogs, pictures and videos. Users typically keep following up on these topics with comments, producing a large volume of comment texts. Because comment texts take the same form regardless of topic, namely written content, the way elite comment texts are acquired can be shared across topics. In this embodiment, we therefore describe how elite comment texts are acquired using the example of users commenting on a video.
Fig. 1 is a flow chart of the method for acquiring elite text from network comments according to the present invention.
As shown in Fig. 1, the method is carried out as follows:
S1, extracting the keywords in a comment text;
A given topic often has a great many comment texts; a video, for example, often attracts thousands of comment texts after release. To obtain the elite comment texts, the content of each comment text must be analyzed. To this end, each comment text is first segmented into words, and stop words are then removed according to a stop-word list; the remaining words are the keywords of the comment text content. These keywords are extracted because they characterize the comment text. Words in the stop-word list have little effect on the meaning of the text and can be ignored. Part of the stop-word list comes from the internet, and a small part is derived statistically: for example, if statistics over a large body of comment texts show that the keyword "sofa" scores very low, it can be added to the stop-word list. Further examples of stop words include "seem" and "certain".
The core idea of this keyword-extraction step is to extract the backbone of each sentence in the comment text and find the main keywords that affect its content. These keywords are used to obtain the comment text's score in the elite comment text calculation.
Example: "just bored with article contrast"
After word segmentation: with, article, what, contrast, just, bored;
After removing stop words: article, contrast, bored.
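The stop-word filtering in this example can be sketched as follows. This is a minimal illustration, not the implementation of the invention: it assumes the comment text has already been segmented into tokens, and the stop-word set `STOP_WORDS` and the function name `extract_keywords` are hypothetical, chosen only so that the example above is reproduced.

```python
# Hypothetical stop-word list for this one example; in the described method the
# list comes partly from the internet and partly from corpus statistics.
STOP_WORDS = {"with", "what", "just"}


def extract_keywords(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token and token not in stop_words]


# Tokens as produced by word segmentation in the example above.
tokens = ["with", "article", "what", "contrast", "just", "bored"]
keywords = extract_keywords(tokens)
print(keywords)
```

Word segmentation itself is left out of the sketch because the source does not name a segmentation tool.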
S2, obtaining the value of each extracted keyword in the comment text library;
Here the comment text library refers to the comment text data that a service provider has built up for all of its videos, for example all videos on Yoqoo; comment text data means the comment texts posted by users after watching videos. From the comment text library, all keywords occurring in the site's comment texts can be counted and the value of each keyword in the whole library calculated. As a concrete example, the keyword "sofa" may occur in a large number of comment texts, while the keyword "text" may occur in only a few; "text" therefore has more influence on a comment text (a higher value) than "sofa", so values can be assigned in the comment text library in combination with the meaning each keyword represents. Specifically:
The value of a keyword in the comment text library (Term Value) is expressed by its inverse document frequency (IDF). The principle behind the IDF of a keyword is that the more comment text documents in the library contain the keyword, the lower its value. If a word or phrase occurs with a high term frequency (TF) in one text and rarely in other texts, it is considered to have good class-discrimination ability and to be suitable for classification. TF-IDF is simply TF*IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a term occurs in a document. The main idea of IDF is that the fewer documents contain term t, that is, the smaller n is, the larger the IDF and the better t discriminates between classes. If m documents of a class C contain term t and k documents of the other classes contain t, then the total number of documents containing t is n = m + k; when m is large, n is also large, and the IDF given by the IDF formula is small, indicating that term t discriminates poorly between classes.
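The source refers to an IDF formula without stating it. The sketch below assumes the common textbook form idf(t) = log(N / n), where N is the number of comment texts in the library and n is the number that contain keyword t; the function name and the toy library are hypothetical, and the additional meaning-based value assignment mentioned above is not modeled.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each keyword to log(N / n).

    documents: list of keyword lists, one list per comment text.
    N is the number of comment texts, n the number containing the keyword.
    """
    total = len(documents)
    doc_freq = Counter()
    for keywords in documents:
        doc_freq.update(set(keywords))  # count each keyword once per comment
    return {term: math.log(total / n) for term, n in doc_freq.items()}


# Toy library: "sofa" occurs in every comment, "text" in only one.
library = [["sofa"], ["sofa", "text"], ["sofa"], ["sofa", "contrast"]]
idf = inverse_document_frequency(library)
print(idf["sofa"], idf["text"])
```

Under this assumed formula a keyword that occurs in every comment text gets a value of zero, which matches the principle that widely used keywords such as "sofa" are worth less than rare ones.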
S3, calculating the value of each keyword under a given topic from the number of times the keyword occurs under that topic and the value of the keyword in the comment text library obtained in step S2;
For a given video, the purpose of obtaining the value of a keyword under the video is to calculate the influence of the keyword within that video. It can be regarded as a second-stage calculation of keyword value: the influence of the keyword on one particular video.
For example, step S2 calculates value scores for all keywords in the comment text library; suppose "Huang Haibo" and "Lin Xinru" each score 2.0 points. Within a single video (such as the first episode of "the fine epoch of son's wife"), however, the values of these two keywords will differ. If "Huang Haibo" occurs 6 times and "Lin Xinru" occurs once, then "Huang Haibo" = 12 points and "Lin Xinru" = 2 points.
If a comment text is to be selected as an elite comment text, it must contain multiple keywords.
The "Video Term Value" (keyword score) in the elite comment text formula (step S6) can be obtained by adding up the scores of the individual keywords.
Example: consider two comment texts.
The scores of "Huang Haibo", "Lin Xinru", "artistic skills" and "good" are 12, 2, 5 and 0.1 respectively.
C1 = Huang Haibo's artistic skills are good
Keyword score = "Huang Haibo" + "artistic skills" + "good" = 12 + 5 + 0.1 = 17.1
C2 = Lin Xinru's artistic skills are good
Keyword score = "Lin Xinru" + "artistic skills" + "good" = 2 + 5 + 0.1 = 7.1
The value of a keyword in a video (Video Term Value) reflects how frequently the keyword occurs in the comment texts posted by different users under the video.
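The two calculations in step S3 can be sketched as below. The sketch assumes each comment is available as a (user, keywords) pair and that a keyword is counted once per user, a rule the source states only in its later worked example; the function names and user identifiers are hypothetical.

```python
def video_term_values(comments, library_values):
    """Keyword value under one video: library value * number of distinct users.

    comments: list of (user_id, keyword_list) pairs for the video.
    library_values: keyword -> value in the comment text library (step S2).
    """
    users_by_term = {}
    for user, keywords in comments:
        for term in set(keywords):
            users_by_term.setdefault(term, set()).add(user)
    return {term: library_values.get(term, 0.0) * len(users)
            for term, users in users_by_term.items()}


def keyword_score(keywords, video_values):
    """Keyword score of one comment: sum of its keywords' values under the video."""
    return sum(video_values.get(term, 0.0) for term in keywords)


# Six different users mention "Huang Haibo", one mentions "Lin Xinru".
library_values = {"Huang Haibo": 2.0, "Lin Xinru": 2.0}
comments = [(f"user{i}", ["Huang Haibo"]) for i in range(1, 7)]
comments.append(("user1", ["Lin Xinru"]))
comments.append(("user1", ["Huang Haibo"]))  # repeat by the same user, not counted again
values = video_term_values(comments, library_values)

# Keyword scores from the two-comment example, using the values given there.
example_values = {"Huang Haibo": 12, "Lin Xinru": 2, "artistic skills": 5, "good": 0.1}
c1 = keyword_score(["Huang Haibo", "artistic skills", "good"], example_values)
c2 = keyword_score(["Lin Xinru", "artistic skills", "good"], example_values)
print(values, round(c1, 1), round(c2, 1))
```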
S4, calculating the punctuation value of the comment text;
The principle behind the punctuation value (Sign Value) of a comment text is that the more closely its punctuation follows the regular distribution, the higher the value of the comment text. It is calculated as follows:
The distribution of punctuation marks in a large-scale corpus is counted, and the distribution of the ratio of Chinese characters to symbols over all sentences is normalized so that the highest score is 1 point, giving a punctuation distribution score. These scores are then further processed to form a distribution curve of Chinese characters and punctuation marks. Finally, the ratio of Chinese characters to symbols in the comment text is calculated, and the punctuation value score of the comment text is obtained from the computed symbol distribution.
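A rough sketch of this calculation is given below, using the sentence counts from the statistics example in this embodiment. Several points are assumptions rather than part of the source: the ratio is rounded to the nearest integer, the "further processing" into a smooth curve is replaced by a plain lookup table, ratios absent from the table (including comments with no punctuation) score 0, and Chinese characters are detected by the Unicode range U+4E00 to U+9FFF. All function names are hypothetical.

```python
import unicodedata


def build_symbol_scores(sentence_counts):
    """Normalise sentence counts per character:symbol ratio so the peak scores 1."""
    peak = max(sentence_counts.values())
    return {ratio: count / peak for ratio, count in sentence_counts.items()}


def char_symbol_ratio(text):
    """Rounded ratio of Chinese characters to punctuation marks, or None."""
    symbols = sum(1 for ch in text if unicodedata.category(ch).startswith("P"))
    chars = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    if symbols == 0:
        return None  # assumption: no punctuation means no usable ratio
    return round(chars / symbols)


def symbol_score(text, scores):
    """Look up the punctuation score of a comment; unknown ratios score 0."""
    return scores.get(char_symbol_ratio(text), 0.0)


# Sentence counts per ratio, as in the statistics example of this embodiment.
scores = build_symbol_scores({10: 300_000, 11: 200_000, 9: 250_000})
sample = "汉" * 10 + "。"  # hypothetical sentence: 10 characters, 1 mark
print(symbol_score(sample, scores), symbol_score("。。。。", scores))
```

A comment consisting only of punctuation has a ratio of 0, which is not in the table, so it scores 0 under these assumptions.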
Example: take a Chinese corpus of 3 million sentences and count the ratio of Chinese characters to symbols in each sentence. Sentences with the most common ratio are given a symbol score of 1 point, and sentences with other ratios are scored in proportion.
Statistics:
Sentences with a character-to-symbol ratio of 10:1 are the most numerous, at 300,000 sentences, so a comment sentence with this ratio is given a symbol score of 1 point.
There are 200,000 sentences with a ratio of 11:1, so a sentence with a ratio of 11:1 scores 1*(20/30), approximately 0.67 points.
There are 250,000 sentences with a ratio of 9:1, so a sentence with a ratio of 9:1 scores 1*(25/30), approximately 0.83 points.
To calculate the symbol factor score of the comment text "Huang Haibo's artistic skills are really good!": first the ratio of Chinese characters to punctuation marks in the comment text is found to be 10:1, so the symbol factor score of the comment text is 1 point.
S5, calculating the similarity value of the comment text;
The similarity value (Similarity Value) compares a comment text under a video with the historical comment texts under the same video. The principle is that the more similar a later-posted comment text is to historical comment texts, the lower its value.
Text similarity is calculated using the Dice coefficient, which measures the degree of similarity between two texts by the number of keywords they share and the weight of each keyword, where each keyword weight is taken as 1;
Dice coefficient formulas is:
Dice(s1,s2)=2×comm(s1,s2)/(leng(s1)+leng(s2))
where comm(s1, s2) is the number of identical characters in s1 and s2, and leng(s1) and leng(s2) are the lengths of the strings s1 and s2.
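The Dice formula above can be sketched as follows. The function works on any two sequences, so it covers both the character-based definition of comm and leng and the keyword-based description with weights of 1. Turning the similarity into a value as 1 minus the highest similarity to any earlier comment follows the later worked example; treating two empty sequences as having similarity 0 is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def dice(s1, s2):
    """Dice(s1, s2) = 2 * comm(s1, s2) / (leng(s1) + leng(s2))."""
    if not s1 and not s2:
        return 0.0  # assumption: two empty inputs are treated as dissimilar
    common = sum((Counter(s1) & Counter(s2)).values())  # shared items, weight 1 each
    return 2 * common / (len(s1) + len(s2))


def similarity_value(current, history):
    """1 minus the highest Dice similarity to any earlier comment."""
    if not history:
        return 1.0
    return 1.0 - max(dice(current, earlier) for earlier in history)


first = ["Huang Haibo", "artistic skills", "good"]
later = ["Huang Haibo", "artistic skills", "fine"]
print(dice(first, later), similarity_value(later, [first]))
```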
S6, multiplying the keyword value calculated in step S3 by the punctuation value obtained in step S4 and the similarity value calculated in step S5 to obtain the score of each comment text;
Specifically, the score of a comment text under a video can be written as:
Comment text score = Video Term Value (keyword score) * Sign Value (symbol factor) * Similarity Value (similarity factor)
The formula can also be extended:
Comment text score = keyword factor * symbol factor * similarity factor * other factor 1 * other factor 2
The other factors capture the influence on the comment text score of information such as the title, the user and the video synopsis.
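The score formula and its extended form can be written as a single function, sketched below. The parameter name `other_factors` is hypothetical; with no extra factors the product reduces to the three-factor formula.

```python
import math


def comment_score(keyword_value, sign_value, similarity_value, other_factors=()):
    """Score = keyword factor * symbol factor * similarity factor * other factors."""
    return keyword_value * sign_value * similarity_value * math.prod(other_factors)


# Three-factor form, then the extended form with two illustrative extra factors.
base = comment_score(19, 1, 0.7)
extended = comment_score(19, 1, 0.7, other_factors=(1.2, 0.5))
print(round(base, 1), round(extended, 2))
```

Because the factors are multiplied, a comment that scores 0 on any single factor, for example one with no keywords, receives a final score of 0.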
S7, after the scores of multiple comment texts have been obtained, taking the comment texts whose scores exceed a given threshold as elite comment texts.
After the scores of multiple comment texts have been obtained, the comment texts are displayed on the video playback page in descending order of score, and those whose scores exceed a given threshold are taken as elite comment texts.
In addition, elite comment texts can be adjusted manually: a back-end management program is used to set which comment texts are elite comment texts, and these are displayed preferentially.
For example, if the calculated elite comment texts are C1 and C3 and C4 is manually set as an elite comment text,
then the final display order of the elite comment texts is: C4, C1, C3
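The thresholding and manual override can be sketched as below. The function name `display_order` and the parameter `pinned` are hypothetical, and the scores and the threshold of 10 are taken from the worked example in this embodiment.

```python
def display_order(scores, threshold, pinned=()):
    """Manually pinned comments first, then comments above the threshold by score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    elite = [cid for cid, score in ranked if score > threshold and cid not in pinned]
    return list(pinned) + elite


scores = {"C1": 17.1, "C2": 0.0, "C3": 13.3, "C4": 4.2, "C5": 5.2}
print(display_order(scores, threshold=10, pinned=["C4"]))
```

Pinned comments are placed ahead of the computed ones regardless of their own score, which is one way to read "displayed preferentially".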
The implementation of the present invention is described in detail below through a concrete example of extracting elite comment texts for a particular video, so that those skilled in the art can follow the whole process:
The keyword values in the comment text library used in step S2 are calculated before the elite comment text calculation is run; they are obtained by counting over all comment text corpora in the library.
Suppose Huang Haibo = 4 points, artistic skills = 2.5 points, good = 0.1 points
Suppose a video has 6 comment texts (in a real elite comment text calculation there are at least 300 comment texts and up to tens of thousands)
(user 1) C1 = Huang Haibo's artistic skills are good.
(user 2) C2 = .。。。。。。。。
(user 3) C3 = Huang Haibo in this TV play is a good person!
(user 4) C4 = The leading man above also has his own style.。。。。。。。。。。。。。。
(user 5) C5 = Huang Haibo's artistic skills are fine.
(user 1) C6 = Huang Haibo
After word segmentation, the extracted keywords are:
C1: Huang Haibo artistic skills good
C2: NULL (no keywords)
C3: Huang Haibo good person inside TV play
C4: Leading man oneself style above
C5: Huang Haibo artistic skills fine
Keyword scores
Huang Haibo = 4*3 = 12 (keyword value * number of occurrences in the video; a keyword is counted only once per user, so the occurrence in C6 is not counted)
Artistic skills = 2.5*2 = 5
Good = 0.1*1 = 0.1
The preliminary score of each comment text is then
C1 = "Huang Haibo" + "artistic skills" + "good" = 12 + 5 + 0.1 = 17.1
C2=0
C3=19
C4=14
C5=17.2
Punctuation scores
C1=1,C2=0,C3=1,C4=0.3,C5=1
Similarity scores
C1=1
C2=1
C3=0.7 (most similar to C1, with a similarity of 0.3, so the similarity score is 1-0.3=0.7 points)
C4=1
C5=0.3
Final scores
C1=17.1*1*1=17.1
C2=0
C3=19*1*0.7=13.3
C4=14*0.3=4.2
C5=17.2*1*0.3=5.2
Final ranking
C1>C3>C5>C4>C2
With the threshold set (it is assumed here to be 10), the comment texts scoring more than 10 points, namely C1 and C3, are taken as elite comment texts.
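The arithmetic of this worked example can be checked with the short script below. The three factors per comment are copied from the example; rounding the final scores to one decimal place is an assumption made to match the figures shown.

```python
# (keyword score, punctuation score, similarity score) copied from the example.
factors = {
    "C1": (17.1, 1.0, 1.0),
    "C2": (0.0, 0.0, 1.0),
    "C3": (19.0, 1.0, 0.7),
    "C4": (14.0, 0.3, 1.0),
    "C5": (17.2, 1.0, 0.3),
}
THRESHOLD = 10  # threshold assumed in the example

# Final score = product of the three factors, rounded to one decimal place.
final = {cid: round(k * p * s, 1) for cid, (k, p, s) in factors.items()}
ranking = sorted(final, key=final.get, reverse=True)
elite = [cid for cid in ranking if final[cid] > THRESHOLD]

print(final)
print(ranking)
print(elite)
```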
The technical solution of the present invention can be implemented in a single stand-alone device, yielding a physical device capable of carrying out the solution. Fig. 2 is a block diagram of the device for acquiring elite text from network comments according to the present invention, which comprises the following modules:
A keyword extraction module, for extracting the keywords in a comment text;
A keyword library value acquisition module, for obtaining the value of each extracted keyword in the comment text library;
A comment text keyword value calculation module, for calculating the value of each keyword under a given topic from the number of times the keyword occurs under that topic and the value of the keyword in the comment text library obtained by the keyword library value acquisition module;
A comment text punctuation value calculation module, for calculating the punctuation value of the comment text;
A comment text similarity calculation module, for calculating the similarity value of the comment text;
A comment text score calculation module, for multiplying the keyword value calculated by the comment text keyword value calculation module by the punctuation value obtained by the comment text punctuation value calculation module and the similarity value calculated by the comment text similarity calculation module to obtain the score of each comment text;
An elite comment text determination module, for taking, after the scores of multiple comment texts have been obtained, the comment texts whose scores exceed a given threshold as elite comment texts.
In addition, the present invention can also be carried out by separate devices working together, yielding a system capable of carrying out the solution. Fig. 3 is a block diagram of the system for acquiring elite text from network comments according to the present invention, which comprises the following devices:
A keyword extraction device, for extracting the keywords in a comment text;
A keyword library value acquisition device, for obtaining the value of each extracted keyword in the comment text library;
A comment text keyword value calculation device, for calculating the value of each keyword under a given topic from the number of times the keyword occurs under that topic and the value of the keyword in the comment text library obtained by the keyword library value acquisition device;
A comment text punctuation value calculation device, for calculating the punctuation value of the comment text;
A comment text similarity calculation device, for calculating the similarity value of the comment text;
A comment text score calculation device, for multiplying the keyword value calculated by the comment text keyword value calculation device by the punctuation value obtained by the comment text punctuation value calculation device and the text similarity value calculated by the comment text similarity calculation device to obtain the score of each comment text;
An elite comment text determination device, for taking, after the scores of multiple comment texts have been obtained, the comment texts whose scores exceed a given threshold as elite comment texts.
In summary, the method and device for acquiring elite text from network comments provided by the invention adopt a new technical scheme in which a program automatically analyzes all comment texts under a video and produces a score list reflecting how elite each comment text is. The elite comment text calculation also guards against problems such as spamming by a single user or multiple users posting similar content, so the resulting scores have a degree of fairness. The scheme is suitable for showing the more outstanding comment texts in the comment area of a video playback page.
The above is a detailed description of the preferred embodiments of the present invention. Those of ordinary skill in the art will appreciate, however, that within the scope and spirit of the present invention, various improvements, additions and substitutions are possible, such as adjusting the order of interface calls, changing message formats and contents, or implementing the scheme in a different programming language (such as C, C++ or Java). All of these fall within the scope of protection defined by the claims of the present invention.