CN104572736A

CN104572736A - Keyword extraction method and device based on social networking services

Info

Publication number: CN104572736A
Application number: CN201310503897.5A
Authority: CN
Inventors: 赵立永; 于晓明; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2013-10-23
Filing date: 2013-10-23
Publication date: 2015-04-29

Abstract

The invention provides a keyword extraction method and device based on social networks. The method comprises: segmenting the texts to be extracted into words and counting, for each word, its word frequency and the number of texts in which it occurs; calculating a word weight from the word frequency and that number of texts, and selecting a preset number of words with the highest weights as candidate keywords; and extracting, from the candidate keywords, a preset number of candidate keywords that occur most frequently in the texts to be extracted as the keywords. By applying noise filtering, text deduplication, word segmentation and word weight calculation to the texts to be extracted, keywords are extracted according to word weight, and extraction speed is increased because large amounts of historical search information are not required.

Description

Keyword extraction method and device based on social networks

Technical field

The present invention relates to the field of keyword extraction technology, and in particular to a keyword extraction method and device based on social networks.

Background

Keywords, as the topic words that large numbers of social network users jointly follow and use, can carry a large amount of information. Extracting keyword information from massive volumes of social text not only reveals in a timely manner the topics that users jointly follow, but also helps users keep track of current trending information. Keyword extraction can therefore cope effectively with information overload and provide users with fast and convenient information services.

The prevailing keyword extraction method is to obtain the historical search information of a large number of users and to extract keywords according to the topic words that occur frequently in that historical search information and in web page content.

However, this method depends heavily on users' search information: a large amount of historical search information must be obtained before keywords can be extracted accurately, so extraction is slow.

Summary of the invention

(1) Technical problem to be solved

The technical problem solved by the present invention is how to avoid the need to obtain a large amount of historical search information when extracting keywords.

(2) Technical solution

To solve the above technical problem, the invention provides a keyword extraction method based on social networks, comprising:

performing word segmentation on the texts to be extracted, and counting the word frequency of each word and the number of texts corresponding to that word;

calculating a word weight according to the word frequency and the number of texts corresponding to the word, selecting a first preset number of words with the largest word weights as candidate keywords, and extracting, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted as keywords.

Preferably, before performing word segmentation on the texts to be extracted, the method further comprises: performing noise filtering on the texts to be extracted, and deduplicating the filtered texts;

and/or,

the step of performing word segmentation on the texts to be extracted further comprises: tagging each segmented word with a part of speech, the part of speech being either a first part of speech that meets the extraction rule or a second part of speech that does not; in that case, selecting the first preset number of words with the largest word weights as candidate keywords comprises: selecting, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.

Preferably, performing noise filtering on the texts to be extracted specifically comprises:

traversing the texts to be extracted and matching the characters in them against the set noise filtering rules; if characters in a text to be extracted match a noise filtering rule, the match succeeds and the matched characters are deleted;

and/or,

deduplicating the filtered texts specifically comprises:

mapping the current filtered text to fingerprint information and comparing it with a fingerprint information store; if the number of differing fingerprints in the comparison result is less than or equal to a third preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint information store.

Preferably, counting the word frequency of each word and the number of texts corresponding to that word further comprises:

assigning a word index number to each distinct word, and saving the word index number and the features of the corresponding word in a word index table;

assigning a text index number to each deduplicated text, and saving, in a text index table, the text index number of the deduplicated text together with the word index numbers of the words in that text, ordered by the positions of the words in the text;

wherein the features of a word comprise: the word frequency of the word, the number of texts corresponding to the word, its part of speech and its word weight.

Preferably, calculating the word weight specifically comprises:

calculating the word weight of a word according to the following formula:

weight(term) = b(term) * a(term) * tf(term) * \log \frac{|d|}{1 + df(term)},

where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
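As an illustrative calculation only, the formula can be evaluated for assumed values that do not come from the patent: tf(term) = 10, df(term) = 4, |d| = 100 and a(term) = b(term) = 1. The base of the logarithm is not stated in the text; the natural logarithm is assumed here.

```latex
\begin{aligned}
weight(term) &= b(term) \cdot a(term) \cdot tf(term) \cdot \ln \frac{|d|}{1 + df(term)} \\
             &= 1 \cdot 1 \cdot 10 \cdot \ln \frac{100}{1 + 4} \\
             &= 10 \ln 20 \approx 29.96
\end{aligned}
```

A word that occurs in nearly every text drives the fraction toward 1 and the weight toward 0, which is how the formula discounts words that are common across the whole collection.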

To solve the above technical problem, the present invention further provides a keyword extraction device based on social networks, comprising:

a word segmentation module, configured to perform word segmentation on the texts to be extracted and to transmit the segmented words to a statistics module;

the statistics module, configured to count the word frequency of each word and the number of texts corresponding to that word, and to transmit the statistics to a calculation module;

the calculation module, configured to calculate a word weight according to the word frequency and the number of texts corresponding to the word, and to transmit the calculation result to a selection module;

the selection module, configured to select a first preset number of words with the largest word weights as candidate keywords, and to transmit the selection result to an extraction module;

the extraction module, configured to extract, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted as keywords.

Preferably, the device further comprises:

a noise filtering module, configured to perform noise filtering on the texts to be extracted and to transmit the filtered texts to a text deduplication module;

the text deduplication module, configured to deduplicate the filtered texts;

and/or,

a part-of-speech tagging module, configured to tag each segmented word with a part of speech, the part of speech being either a first part of speech that meets the extraction rule or a second part of speech that does not, and to transmit the tagging result to the selection module;

the selection module being further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.

Preferably, the noise filtering module comprises:

a setting submodule, configured to set the noise filtering rules and to transmit the set noise filtering rules to a matching submodule;

a traversal submodule, configured to traverse the texts to be extracted and to transmit the traversal result to the matching submodule;

the matching submodule, configured to match the characters in the texts to be extracted against the set noise filtering rules, wherein if characters in a text to be extracted match a noise filtering rule, the match succeeds, and to transmit the matched characters to a first deletion submodule;

the first deletion submodule, configured to delete the matched characters;

and/or,

the text deduplication module comprises:

a mapping submodule, configured to map the current filtered text to fingerprint information and to transmit the mapping result to a comparison submodule;

the comparison submodule, configured to compare the current filtered text with a fingerprint information store, to transmit the current filtered text to a second deletion submodule when the number of differing fingerprints in the comparison result is less than or equal to the third preset value, and to transmit it to a saving submodule when that number is greater than the third preset value;

the second deletion submodule, configured to delete the current filtered text;

the saving submodule, configured to add the fingerprint information of the current filtered text to the fingerprint information store.

Preferably, the device further comprises:

an allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the allocation result to a saving module;

the saving module, configured to save the word index number and the features of the corresponding word in a word index table, and to save, in a text index table, the text index number of each deduplicated text together with the word index numbers of the words in that text, ordered by the positions of the words in the text.

Preferably, the calculation module is configured to calculate the word weight of a word according to the following formula:

weight(term) = b(term) * a(term) * tf(term) * \log \frac{|d|}{1 + df(term)}.

(3) Beneficial effects

By providing a keyword extraction method and device for social networks, the present invention does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted: the texts are segmented into words, word weights are calculated, and keywords are then extracted according to word weight. Because a large amount of historical search information is not needed, extraction speed is improved.

Brief description of the drawings

Fig. 1 is a flowchart of the method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the method provided by Embodiment 2 of the present invention;

Fig. 3 is a schematic diagram of the word index table provided by Embodiment 2 of the present invention;

Fig. 4 is a schematic diagram of the text index table provided by Embodiment 2 of the present invention;

Fig. 5 is a schematic structural diagram of the device provided by Embodiment 3 of the present invention;

Fig. 6 is a schematic structural diagram of the noise filtering module provided by Embodiment 3 of the present invention;

Fig. 7 is a schematic structural diagram of the text deduplication module provided by Embodiment 3 of the present invention.

Detailed description of the embodiments

In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or of the prior art are briefly introduced above. Evidently, these drawings show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Embodiment 1

To solve the prior-art problem that keyword extraction requires a large amount of historical data, this embodiment of the present invention provides a keyword extraction method based on social networks. As shown in Fig. 1, the method comprises:

Step 101: perform word segmentation on the texts to be extracted, and count the word frequency of each word and the number of texts corresponding to that word;

Step 102: calculate a word weight according to the word frequency and the number of texts corresponding to the word, select a first preset number of words with the largest word weights as candidate keywords, and extract, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted as keywords.

This embodiment does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted: the texts are segmented into words, word weights are calculated, and keywords are then extracted according to word weight. Because a large amount of historical search information is not needed, extraction speed is improved.

In this embodiment, noise filtering and deduplication are applied to the texts to be extracted in advance, which improves the fairness of keyword extraction. According to the tagged parts of speech, the first preset number of words with the largest word weights are selected as candidate keywords from the words of the first part of speech, which meets the extraction rule, thereby improving the accuracy of keyword extraction.

In this embodiment, noise filtering rules are set, the texts to be extracted are traversed, and their characters are matched against the rules; if characters in a text match a noise filtering rule, the match succeeds and the matched characters are deleted. Noise filtering thus improves processing efficiency and the accuracy of keyword extraction. Because the noise-filtered texts still contain some repeated texts, deduplication leaves only non-repeated texts, which reduces the complexity of the text collection and improves the accuracy of keyword extraction.

In this embodiment, building a word index table and a text index table facilitates the subsequent keyword extraction.

Embodiment 2

To solve the problems of the prior art, Embodiment 2 of the present invention provides a keyword extraction method based on social networks. As shown in Fig. 2, the method comprises:

Step 201: perform noise filtering on the texts to be extracted;

In this embodiment, the texts to be extracted contain a large amount of invalid information, which not only reduces processing efficiency but also degrades the result of keyword extraction. Therefore:

First, set the noise filtering rules;

In this embodiment, the noise filtering rules cover: a. emoticon noise (generally appearing in the form "[text]"); b. "HTML tag" noise; c. "@username" noise; d. "//@username" noise.

Second, traverse the texts to be extracted and match the characters in them against the above noise filtering rules;

Finally, if characters in a text to be extracted match one of the above noise filtering rules, the match succeeds and the matched characters are deleted.

The above steps yield the filtered texts; working on the filtered texts improves processing efficiency and the accuracy of keyword extraction.
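The four noise rules of step 201 could be sketched with regular expressions as follows. The patterns, the name `filter_noise` and the final whitespace clean-up are assumptions made for illustration; the text names the noise categories but does not give concrete patterns.

```python
import re

# Hypothetical patterns for the four noise categories named in the text.
# Order matters: "//@username" is removed before the plain "@username" rule.
NOISE_PATTERNS = [
    re.compile(r"//@[^\s:：]+[:：]?"),  # d. "//@username" forwarding markers
    re.compile(r"@[^\s:：]+"),          # c. "@username" mentions
    re.compile(r"<[^>]+>"),             # b. HTML tags
    re.compile(r"\[[^\[\]]+\]"),        # a. emoticons written as "[text]"
]

def filter_noise(text: str) -> str:
    """Delete every substring that matches one of the noise rules."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Assumed extra step: collapse the whitespace left behind by deletions.
    return re.sub(r"\s+", " ", text).strip()

print(filter_noise("Great match [smile] <b>today</b> @alice //@bob: wow"))
```

Real platforms may use other mention or emoticon syntaxes, so the patterns would need to be adapted to the data at hand.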

Step 202: deduplicate the filtered texts;

Because texts are forwarded from one another, the filtered texts contain a large amount of repetition. To reduce the unfairness that duplicate content introduces into the word weight calculation, the filtered texts need to be deduplicated.

The deduplication method comprises: mapping the current filtered text to fingerprint information and comparing it with a fingerprint information store; if the number of differing fingerprints in the comparison result is less than or equal to a preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint information store.

For example, the current filtered text is mapped to 6 fingerprints; if the number of differing fingerprints in the comparison result is less than or equal to 3, the text is deleted; otherwise, the fingerprint information of the current filtered text is added to the fingerprint information store.
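A minimal sketch of this fingerprint comparison is given below. The text does not say how a text is mapped to its 6 fingerprints, so the chunk-and-hash mapping in `fingerprints` is purely an assumption, as are the function names; only the counts 6 and 3 come from the example above.

```python
import hashlib

NUM_FINGERPRINTS = 6  # from the example in the text
MAX_DIFFERENT = 3     # the preset value used in the example

def fingerprints(text, n=NUM_FINGERPRINTS):
    """Assumed mapping: split the text into n chunks and hash each chunk."""
    size = max(1, -(-len(text) // n))  # ceiling division, at least 1
    chunks = [text[i * size:(i + 1) * size] for i in range(n)]
    return tuple(hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks)

def deduplicate(texts):
    """Keep a text only if it differs enough from every stored fingerprint set."""
    store = []  # the fingerprint information store
    kept = []
    for text in texts:
        fp = fingerprints(text)
        is_duplicate = any(
            sum(a != b for a, b in zip(fp, old)) <= MAX_DIFFERENT
            for old in store
        )
        if not is_duplicate:
            store.append(fp)
            kept.append(text)
    return kept
```

With this assumed mapping, very short texts share identical empty-chunk hashes and may be merged too eagerly; a production system would more likely use a locality-sensitive scheme.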

Step 203: perform word segmentation and part-of-speech tagging on the deduplicated texts, the part of speech being either a first part of speech that meets the extraction rule or a second part of speech that does not;

Step 204: count the word frequency of each word and the number of texts corresponding to that word;
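The counting in step 204 can be illustrated with two counters. This sketch assumes that the texts have already been segmented into word lists, that the word frequency is the total number of occurrences across all texts, and that the "number of texts corresponding to a word" is the number of texts containing it; `count_tf_df` is a hypothetical name.

```python
from collections import Counter

def count_tf_df(segmented_texts):
    """Return (tf, df): total occurrences and number of containing texts per word."""
    tf = Counter()
    df = Counter()
    for words in segmented_texts:
        tf.update(words)       # every occurrence counts toward word frequency
        df.update(set(words))  # each text counts at most once per word
    return tf, df

tf, df = count_tf_df([["stock", "market", "stock"], ["market", "rally"]])
print(tf["stock"], df["stock"], tf["market"], df["market"])
```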

Step 205: build a word index table and a text index table using the part of speech and word frequency of each word and the number of texts corresponding to it;

First, assign a word index number to each distinct word, and save the word index number and the features of the corresponding word in the word index table, as shown in Fig. 3;

The features of a word comprise: the word frequency of the word, the number of texts corresponding to the word, its part of speech and its weight. The weight is a value obtained in a subsequent step.

Second, assign a text index number to each deduplicated text, and save, in the text index table, the text index number of the deduplicated text together with the word index numbers of the words in that text, ordered by the positions of the words in the text, as shown in Fig. 4.
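The two tables of step 205 might be held in memory as plain dictionaries, as in the following sketch. The layout, the `WordFeatures` class and `build_index_tables` are assumptions for illustration and are not taken from Fig. 3 or Fig. 4; each word is assumed to carry a single part-of-speech tag.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WordFeatures:
    word: str
    pos: str                        # part-of-speech tag
    tf: int = 0                     # word frequency
    df: int = 0                     # number of texts containing the word
    weight: Optional[float] = None  # filled in by the later weighting step

def build_index_tables(tagged_texts):
    """tagged_texts: list of texts, each a list of (word, pos) pairs."""
    word_ids = {}    # word -> word index number
    word_index = {}  # word index number -> WordFeatures
    text_index = {}  # text index number -> word index numbers in text order
    for text_id, tagged in enumerate(tagged_texts):
        seen = set()
        ids = []
        for word, pos in tagged:
            wid = word_ids.setdefault(word, len(word_ids))
            feats = word_index.setdefault(wid, WordFeatures(word, pos))
            feats.tf += 1
            if wid not in seen:  # count each text once per word
                feats.df += 1
                seen.add(wid)
            ids.append(wid)
        text_index[text_id] = ids
    return word_index, text_index
```

Storing word index numbers in positional order keeps each text reconstructable from the text index table, which the later extraction step relies on.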

Step 206: calculate the word weight according to the word frequency and the number of texts corresponding to the word;

A user focus dictionary is built according to the different backgrounds of users.

For example: finance-related, sports-related or entertainment-related.

The word weight is calculated from the user focus dictionary, the word frequency and the number of texts corresponding to the word, using the following formula:

weight(term) = b(term) * a(term) * tf(term) * \log \frac{|d|}{1 + df(term)},

where weight(term) is the word weight, b(term) is an empirical correction value based on the user focus dictionary, a(term) is an empirical correction value based on the part of speech, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.

The value of a(term) is:

where nr denotes a person name and nt denotes an organization name;

The value of b(term) is: 1.5 when the word belongs to the user focus dictionary, and 1 when it does not.
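The weight formula of step 206 can be sketched as a small function. The values 1.5 and 1 for b(term) come from the text; the values of a(term) are not reproduced in this text, so `pos_factor` is a hypothetical caller-supplied table that defaults to 1.0, and the natural logarithm is assumed because the base is not stated.

```python
import math

def word_weight(word, pos, tf, df, total_texts, focus_words, pos_factor=None):
    """weight = b * a * tf * log(|d| / (1 + df)), natural log assumed."""
    # a(term): part-of-speech correction; the real values are not given here,
    # so any table passed in is an assumption (default 1.0 for every tag).
    a = (pos_factor or {}).get(pos, 1.0)
    # b(term): 1.5 if the word is in the user focus dictionary, otherwise 1.
    b = 1.5 if word in focus_words else 1.0
    return b * a * tf * math.log(total_texts / (1 + df))

print(word_weight("stock", "n", tf=10, df=4, total_texts=100, focus_words={"stock"}))
```

Note that the logarithm becomes negative when 1 + df(term) exceeds |d|, that is, when a word occurs in every text.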

Step 207: sort the words by word weight in descending order and, from the words whose part of speech is the first part of speech, select the first preset number of words with the largest word weights as candidate keywords;

The first part of speech comprises: /a adjective, /v verb, /j abbreviation, /ns place name, /nr person name, /nt organization name and /nz other proper noun.

The second part of speech covers all other parts of speech that do not belong to the first part of speech.
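Step 207 amounts to a filter followed by a descending sort and a cut-off, which the following sketch illustrates. The tag set is the one listed above; the function name `select_candidates` and the tuple layout are assumptions.

```python
# Part-of-speech tags of the "first part of speech" listed in the text.
FIRST_POS = {"a", "v", "j", "ns", "nr", "nt", "nz"}

def select_candidates(words, first_n):
    """words: iterable of (word, pos, weight); return the top first_n words."""
    eligible = [w for w in words if w[1] in FIRST_POS]
    eligible.sort(key=lambda w: w[2], reverse=True)  # largest weight first
    return [w[0] for w in eligible[:first_n]]

words = [("Beijing", "ns", 5.0), ("the", "u", 9.0), ("win", "v", 7.0), ("fast", "a", 1.0)]
print(select_candidates(words, 2))
```

In this example "the" has the largest weight but is excluded because its tag is not in the first part of speech.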

Step 208: extract, from the candidate keywords, the second preset number of candidate keywords that occur most frequently in the texts to be extracted as keywords.

This step can be performed using the text index table, that is, by extracting as keywords the second preset number of candidate keywords whose word index numbers occur most frequently in the text index table.

As shown in Fig. 4, the word with word index number 7 occurs frequently across multiple texts; if that word is a candidate keyword, it is taken as a keyword.
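The final extraction over the text index table can be sketched as a frequency count restricted to candidate word index numbers. Whether "frequency of occurrence" means total occurrences or the number of texts containing the word is not stated; this sketch counts total occurrences, and `extract_keywords` and the sample table are hypothetical.

```python
from collections import Counter

def extract_keywords(text_index, candidate_ids, second_n):
    """Return the second_n candidate word index numbers seen most often."""
    counts = Counter(
        wid
        for ids in text_index.values()
        for wid in ids
        if wid in candidate_ids
    )
    return [wid for wid, _ in counts.most_common(second_n)]

# Hypothetical text index table: text index number -> word index numbers.
text_index = {0: [7, 2, 7], 1: [7, 3], 2: [2, 5]}
print(extract_keywords(text_index, {7, 2, 5}, 2))
```

Here word index number 7 appears three times and is a candidate, so it is returned first, mirroring the Fig. 4 example.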

This embodiment is applicable to all social network platforms, such as microblogs and personal spaces.

By providing a keyword extraction method for social networks, this embodiment does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted: noise filtering, text deduplication, word segmentation and word weight calculation are applied to the texts, and keywords are then extracted according to word weight. Because a large amount of historical search information is not needed, extraction speed is improved.

Embodiment 3

This embodiment of the present invention further provides a keyword extraction device based on social networks which, as shown in Fig. 5, comprises:

a word segmentation module 501, configured to perform word segmentation on the texts to be extracted and to transmit the segmented words to a statistics module;

the statistics module 502, configured to count the word frequency of each word and the number of texts corresponding to that word, and to transmit the statistics to a calculation module;

the calculation module 503, configured to calculate a word weight according to the word frequency and the number of texts corresponding to the word, and to transmit the calculation result to a selection module;

the selection module 504, configured to select a first preset number of words with the largest word weights as candidate keywords, and to transmit the selection result to an extraction module;

the extraction module 505, configured to extract, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted as keywords.
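The data flow between modules 501 to 505 could be wired together as in the sketch below. It is an assumption-laden illustration, not the device itself: whitespace splitting stands in for a real word segmenter, a(term) and b(term) are fixed at 1, the natural logarithm is assumed, and the class and method names are hypothetical.

```python
import math
from collections import Counter

class KeywordExtractionDevice:
    def __init__(self, first_n, second_n, segment=str.split):
        self.first_n = first_n
        self.second_n = second_n
        self.segment = segment  # module 501 (stand-in segmenter)

    def count(self, segmented):  # module 502: statistics
        tf, df = Counter(), Counter()
        for words in segmented:
            tf.update(words)
            df.update(set(words))
        return tf, df

    def weigh(self, tf, df, total):  # module 503, with a = b = 1 assumed
        return {w: tf[w] * math.log(total / (1 + df[w])) for w in tf}

    def select(self, weights):  # module 504: top first_n by weight
        ranked = sorted(weights, key=weights.get, reverse=True)
        return ranked[: self.first_n]

    def extract(self, candidates, tf):  # module 505: top second_n by frequency
        ranked = sorted(candidates, key=lambda w: tf[w], reverse=True)
        return ranked[: self.second_n]

    def run(self, texts):
        segmented = [self.segment(t) for t in texts]
        tf, df = self.count(segmented)
        weights = self.weigh(tf, df, len(segmented))
        return self.extract(self.select(weights), tf)

device = KeywordExtractionDevice(first_n=2, second_n=1)
print(device.run(["rally rally rally stocks", "weather today", "weather cold", "news today"]))
```

Each method consumes the output of the previous one, matching the "transmit to the next module" wording of the description.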

Further, the device also comprises:

and/or,

Further, the noise filtering module, as shown in Fig. 6, comprises:

a setting submodule 601, configured to set the noise filtering rules and to transmit the set noise filtering rules to a matching submodule;

a traversal submodule 602, configured to traverse the texts to be extracted and to transmit the traversal result to the matching submodule;

the matching submodule 603, configured to match the characters in the texts to be extracted against the set noise filtering rules, wherein if characters in a text to be extracted match a noise filtering rule, the match succeeds, and to transmit the matched characters to a first deletion submodule;

the first deletion submodule 604, configured to delete the matched characters;

Further, the text deduplication module, as shown in Fig. 7, comprises:

a mapping submodule 701, configured to map the current filtered text to fingerprint information and to transmit the mapping result to a comparison submodule;

the comparison submodule 702, configured to compare the current filtered text with a fingerprint information store, to transmit the current filtered text to a second deletion submodule when the number of differing fingerprints in the comparison result is less than or equal to the third preset value, and to transmit it to a saving submodule when that number is greater than the third preset value;

the second deletion submodule 703, configured to delete the current filtered text;

the saving submodule 704, configured to add the fingerprint information of the current filtered text to the fingerprint information store.

Further, the device also comprises:

Further, the calculation module is configured to calculate the word weight of a word according to the following formula:

weight(term) = b(term) * a(term) * tf(term) * \log \frac{|d|}{1 + df(term)}.

By providing a keyword extraction device for social networks, the present invention does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted by means of the extraction module. The noise filtering module, text deduplication module, word segmentation module and calculation module respectively perform noise filtering, text deduplication, word segmentation and word weight calculation on the texts to be extracted, and keywords are then extracted according to word weight. Because a large amount of historical search information is not needed, extraction speed is improved.

The above embodiments are only intended to illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection shall be defined by the claims.

Claims

1. A keyword extraction method based on social networks, characterized by comprising:

2. The method of claim 1, characterized in that, before performing word segmentation on the texts to be extracted, the method further comprises: performing noise filtering on the texts to be extracted, and deduplicating the filtered texts;

and/or,

3. The method of claim 2, characterized in that performing noise filtering on the texts to be extracted specifically comprises:

and/or,

4. The method of claim 1, characterized in that counting the word frequency of each word and the number of texts corresponding to that word further comprises:

5. The method of claim 1, characterized in that calculating the word weight specifically comprises:

calculating the word weight of a word according to the following formula:

weight(term) = b(term) * a(term) * tf(term) * \log \frac{|d|}{1 + df(term)},

6. A keyword extraction device based on social networks, characterized by comprising:

7. The device of claim 6, characterized in that the device further comprises:

and/or,

8. The device of claim 7, characterized in that the noise filtering module comprises:

and/or,

the text deduplication module comprises:

9. The device of claim 6, characterized in that the device further comprises:

10. The device of claim 6, characterized in that the calculation module is configured to calculate the word weight of a word according to the following formula:

weight(term) = b(term) * a(term) * tf(term) * \log \frac{|d|}{1 + df(term)},