CN104572736A - Keyword extraction method and device based on social networking services - Google Patents

Publication number
CN104572736A
Authority
CN
China
Prior art keywords
word
text
term
module
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310503897.5A
Other languages
Chinese (zh)
Inventor
赵立永
于晓明
杨建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd filed Critical Peking University
Priority to CN201310503897.5A priority Critical patent/CN104572736A/en
Publication of CN104572736A publication Critical patent/CN104572736A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention provides a keyword extraction method and device based on social networks. The method comprises: segmenting the texts to be extracted into words, and counting the word frequency of each word and the number of texts in which it appears; calculating a weight for each word from its word frequency and text count, and selecting a preset number of words with the highest weights as candidate keywords; and extracting, from the candidate keywords, a preset number of candidates that occur most frequently in the texts to be extracted as the keywords. By applying noise filtering, text deduplication, word segmentation and word-weight calculation to the texts to be extracted, keywords are extracted according to word weight, and extraction speed is increased because no large volume of historical search information is required.

Description

Keyword extraction method and device based on social networks
Technical field
The present invention relates to the technical field of keyword extraction, and in particular to a keyword extraction method and device based on social networks.
Background art
Keywords are the topic words that large numbers of social network users jointly follow and use, and they can carry a great deal of information. Extracting keyword information from massive volumes of social text makes it possible to learn promptly which topics users collectively care about, and helps users keep up with current trending information. Keyword extraction can therefore effectively address information overload and provide users with fast, convenient information services.
A commonly used keyword extraction method is to obtain the historical search information of a large number of users and to extract keywords from the topic words that occur frequently in that search history and in web page content.
However, this method depends heavily on users' search information: a large amount of historical search information must be obtained before keywords can be extracted accurately, so extraction is slow.
Summary of the invention
(1) Technical problem to be solved
The technical problem solved by the present invention is how to avoid the need to obtain a large amount of historical search information when extracting keywords.
(2) Technical solution
To solve the above technical problem, the present invention provides a keyword extraction method based on social networks, comprising:
performing word segmentation on the texts to be extracted, and counting the word frequency of each word and the number of texts corresponding to that word;
calculating word weights according to the word frequency and the number of texts corresponding to each word, selecting a first preset number of words with the largest word weights as candidate keywords, and extracting from the candidate keywords, as the keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted.
Preferably, before the word segmentation is performed on the texts to be extracted, the method further comprises: performing noise filtering on the texts to be extracted, and deduplicating the filtered texts;
and/or,
the step of performing word segmentation on the texts to be extracted further comprises: performing part-of-speech tagging on the segmented words, wherein the part of speech is either a first part of speech that satisfies the extraction rule or a second part of speech that does not; in that case, selecting the first preset number of words with the largest word weights as candidate keywords comprises: selecting, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
Preferably, performing noise filtering on the texts to be extracted specifically comprises:
traversing the texts to be extracted and matching the characters in them against the set noise filtering rules; if characters in a text to be extracted fall under a noise filtering rule, the match succeeds and the matched characters are deleted;
and/or,
deduplicating the filtered texts specifically comprises:
mapping the current filtered text to fingerprint information and comparing it with a fingerprint library; if the number of differing fingerprints in the comparison result is less than or equal to a third preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint library.
Preferably, counting the word frequency of each word and the number of texts corresponding to that word further comprises:
assigning a word index number to each distinct word, and saving the word index number and the features of the corresponding word in a word index table;
assigning a text index number to each deduplicated text, and saving in a text index table the text index number of the deduplicated text together with the word index numbers of the words in that text, ordered by the position of the words in the text;
wherein the features of a word comprise: the word frequency of the word, the number of texts corresponding to the word, its part of speech, and its word weight.
Preferably, calculating the word weight specifically comprises:
calculating the word weight of a word according to the following formula:
$$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log\frac{|d|}{1 + df(term)},$$
where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
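As a worked illustration of the formula, consider hypothetical values that do not come from the patent: |d| = 1000 texts, tf(term) = 20, df(term) = 9, and both correction values equal to 1. The source does not state the base of the logarithm; the natural logarithm is assumed here.

```latex
\begin{aligned}
weight(term) &= b(term) \cdot a(term) \cdot tf(term) \cdot \ln\frac{|d|}{1 + df(term)} \\
             &= 1 \cdot 1 \cdot 20 \cdot \ln\frac{1000}{1 + 9} \\
             &= 20 \ln 100 \approx 92.1
\end{aligned}
```

Under the same assumptions, a word that appears in every text has df(term) = |d|, so the logarithm becomes ln(|d| / (1 + |d|)) < 0 and the weight is negative; words spread across almost all texts are therefore ranked last.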
To solve the above technical problem, the present invention further provides a keyword extraction device based on social networks, comprising:
a word segmentation module, configured to perform word segmentation on the texts to be extracted and to transmit the segmented words to a statistics module;
the statistics module, configured to count the word frequency of each word and the number of texts corresponding to that word, and to transmit the statistics to a calculation module;
the calculation module, configured to calculate word weights according to the word frequency and the number of texts corresponding to each word, and to transmit the calculation result to a selection module;
the selection module, configured to select a first preset number of words with the largest word weights as candidate keywords, and to transmit the selection result to an extraction module;
the extraction module, configured to extract from the candidate keywords, as the keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted.
Preferably, the device further comprises:
a noise filtering module, configured to perform noise filtering on the texts to be extracted and to transmit the filtered texts to a text deduplication module;
the text deduplication module, configured to deduplicate the filtered texts;
and/or,
a part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented words, wherein the part of speech is either a first part of speech that satisfies the extraction rule or a second part of speech that does not, and to transmit the tagging result to the selection module;
the selection module being further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
Preferably, the noise filtering module comprises:
a setting submodule, configured to set the noise filtering rules and to transmit the set noise filtering rules to a matching submodule;
a traversal submodule, configured to traverse the texts to be extracted and to transmit the traversal result to the matching submodule;
the matching submodule, configured to match the characters in the texts to be extracted against the set noise filtering rules; if characters in a text to be extracted fall under a noise filtering rule, the match succeeds and the matched characters are transmitted to a first deletion submodule;
the first deletion submodule, configured to delete the matched characters;
and/or,
the text deduplication module comprises:
a mapping submodule, configured to map the current filtered text to fingerprint information and to transmit the mapping result to a comparison submodule;
the comparison submodule, configured to compare the current filtered text with the fingerprint library, to transmit to a second deletion submodule any current filtered text for which the number of differing fingerprints in the comparison result is less than or equal to the third preset value, and to transmit to a saving submodule any current filtered text for which the number of differing fingerprints in the comparison result is greater than the third preset value;
the second deletion submodule, configured to delete the current filtered text;
the saving submodule, configured to add the fingerprint information of the current filtered text to the fingerprint library.
Preferably, the device further comprises:
an allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the allocation result to a saving module;
the saving module, configured to save the word index number and the features of the corresponding word in a word index table, and to save in a text index table the text index number of the deduplicated text together with the word index numbers of the words in that text, ordered by the position of the words in the text;
wherein the features of a word comprise: the word frequency of the word, the number of texts corresponding to the word, its part of speech, and its word weight.
Preferably, the calculation module is configured to calculate the word weight of a word according to the following formula:
$$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log\frac{|d|}{1 + df(term)},$$
where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
(3) Beneficial effects
The present invention provides a keyword extraction method and device for social networks that do not rely on a large amount of historical search information but extract keywords directly from the texts to be extracted. The texts are segmented into words, word weights are calculated, and keywords are extracted according to the word weights. Because no large volume of historical search information is needed, extraction speed is improved.
Brief description of the drawings
Fig. 1 is a flow chart of the method provided by Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the method provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the word index table provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of the text index table provided by Embodiment 2 of the present invention;
Fig. 5 is a schematic structural diagram of the device provided by Embodiment 3 of the present invention;
Fig. 6 is a schematic structural diagram of the noise filtering module provided by Embodiment 3 of the present invention;
Fig. 7 is a schematic structural diagram of the text deduplication module provided by Embodiment 3 of the present invention.
Detailed description of the embodiments
To illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced above. The drawings show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Embodiment 1
To solve the prior-art problem that keyword extraction requires a large amount of historical data, this embodiment of the present invention provides a keyword extraction method based on social networks. As shown in Fig. 1, the method comprises:
Step 101: performing word segmentation on the texts to be extracted, and counting the word frequency of each word and the number of texts corresponding to that word;
Step 102: calculating word weights according to the word frequency and the number of texts corresponding to each word, selecting a first preset number of words with the largest word weights as candidate keywords, and extracting from the candidate keywords, as the keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted.
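Steps 101 and 102 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_keywords`, the whitespace tokenizer (a real system would use a Chinese word segmenter), the natural logarithm, and the omission of the correction values a(term) and b(term) are all choices made for the example.

```python
import math
from collections import Counter

def extract_keywords(texts, n_candidates, n_keywords, tokenize=str.split):
    """Sketch of steps 101-102: segment, count, weight, select."""
    docs = [tokenize(t) for t in texts]
    # Word frequency over all texts, and number of texts containing each word.
    tf = Counter(w for doc in docs for w in doc)
    df = Counter(w for doc in docs for w in set(doc))
    total = len(docs)
    # Weight with a(term) = b(term) = 1 (assumption for this sketch).
    weight = {w: tf[w] * math.log(total / (1 + df[w])) for w in tf}
    # First preset number: highest-weight words become candidates
    # (ties broken alphabetically so the result is deterministic).
    candidates = sorted(weight, key=lambda w: (-weight[w], w))[:n_candidates]
    # Second preset number: the most frequent candidates become keywords.
    return sorted(candidates, key=lambda w: (-tf[w], w))[:n_keywords]
```

Here "frequency of occurrence in the texts to be extracted" is read as the overall word frequency; counting the number of texts containing the candidate would be an equally plausible reading of the step.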
This embodiment of the present invention does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted: the texts are segmented into words, word weights are calculated, and keywords are extracted according to the word weights. Because no large volume of historical search information is needed, extraction speed is improved.
In this embodiment, noise filtering and deduplication are applied to the texts to be extracted in advance, which improves the fairness of keyword extraction. According to the tagged parts of speech, the first preset number of words with the largest word weights are selected as candidate keywords from the words whose part of speech is the first part of speech, that is, the part of speech satisfying the extraction rule, which improves the accuracy of keyword extraction.
In this embodiment, noise filtering rules are set, the texts to be extracted are traversed, and the characters in them are matched against the rules; if characters in a text fall under a noise filtering rule, the match succeeds and the matched characters are deleted. Noise filtering thus improves both processing efficiency and the accuracy of keyword extraction. Because the noise-filtered texts still contain some repeated texts, deduplication is applied so that the remaining texts contain no repeats, which reduces the complexity of the text set and improves the accuracy of keyword extraction.
In this embodiment, a word index table and a text index table are built, which makes the subsequent keyword extraction more convenient.
Embodiment 2
To solve the problem of the prior art, Embodiment 2 of the present invention provides a keyword extraction method based on social networks. As shown in Fig. 2, the method comprises:
Step 201: performing noise filtering on the texts to be extracted;
In this embodiment, the texts to be extracted contain a large amount of invalid information, which both lowers processing efficiency and degrades the result of keyword extraction. Therefore:
First, noise filtering rules are set;
In this embodiment, the set noise filtering rules cover: a, emoticon noise (generally appearing in the form "[text]"); b, "html tag" noise; c, "@username" noise; d, "//@username" noise.
Second, the texts to be extracted are traversed, and the characters in them are matched against the above noise filtering rules;
Finally, if characters in a text to be extracted fall under one of the above rules, the match succeeds and the matched characters are deleted.
The above steps yield the filtered texts; working on the filtered texts improves processing efficiency and the accuracy of keyword extraction.
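The four noise rules of step 201 map naturally onto regular expressions. The sketch below is an illustration only: the patterns in `NOISE_PATTERNS`, the function name `filter_noise`, and the final whitespace clean-up are assumptions, since the patent names the noise categories but not the exact matching expressions.

```python
import re

# Hypothetical patterns for the four noise categories (assumed, not from the patent).
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),         # b: HTML tags
    re.compile(r"//@[^\s:]+:?"),    # d: forwarding marker "//@username"
    re.compile(r"@[^\s:]+"),        # c: mention "@username"
    re.compile(r"\[[^\[\]]+\]"),    # a: emoticon in "[text]" form
]

def filter_noise(text):
    """Delete every span that matches one of the noise rules."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse the whitespace left behind by deletions (a convenience, not in the patent).
    return " ".join(text.split())
```

The "//@username" pattern is applied before the "@username" pattern so that a forwarding marker is removed as a whole rather than leaving a stray "//" behind.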
Step 202: deduplicating the filtered texts;
Because texts are forwarded from one another, the filtered texts contain a large amount of repetition. To reduce the unfairness that duplicate content introduces into the word-weight calculation, the filtered texts need to be deduplicated.
The deduplication method comprises: mapping the current filtered text to fingerprint information and comparing it with a fingerprint library; if the number of differing fingerprints in the comparison result is less than or equal to a preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint library.
For example, each filtered text is mapped to 6 fingerprints; a text for which the number of differing fingerprints in the comparison result is less than or equal to 3 is deleted; otherwise the fingerprint information of the current filtered text is added to the fingerprint library.
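The deduplication rule can be sketched as follows. The patent does not say how a text is mapped to its fingerprints, so the scheme here (hashing six contiguous slices of the text) is purely an assumption chosen to make the example runnable; `fingerprints` and `deduplicate` are hypothetical names. Only the decision rule, deleting a text when at most three of its six fingerprints differ from a stored entry, follows the example in the text.

```python
import hashlib

def fingerprints(text, parts=6):
    """Map a text to `parts` fingerprints by hashing contiguous slices (assumed scheme)."""
    n = len(text)
    slices = [text[i * n // parts:(i + 1) * n // parts] for i in range(parts)]
    # For very short texts some slices are empty; fall back to the whole text
    # so that unrelated short texts do not share fingerprints.
    return tuple(
        hashlib.sha256((s or text).encode("utf-8")).hexdigest()[:16] for s in slices
    )

def deduplicate(texts, max_diff=3, parts=6):
    """Keep a text only if it differs from every stored entry in more than `max_diff` fingerprints."""
    store = []  # fingerprint library
    kept = []
    for text in texts:
        fp = fingerprints(text, parts)
        is_duplicate = any(
            sum(a != b for a, b in zip(fp, old)) <= max_diff for old in store
        )
        if not is_duplicate:
            store.append(fp)
            kept.append(text)
    return kept
```

A slice-based scheme like this tolerates local edits but not insertions that shift every later slice; a production system would more likely use a shift-tolerant fingerprint, which is outside what the source describes.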
Step 203: performing word segmentation and part-of-speech tagging on the deduplicated texts, wherein the part of speech is either a first part of speech that satisfies the extraction rule or a second part of speech that does not;
Step 204: counting the word frequency of each word and the number of texts corresponding to that word;
Step 205: building a word index table and a text index table from the part of speech, the word frequency, and the number of texts corresponding to each word;
First, a word index number is assigned to each distinct word, and the word index number and the features of the corresponding word are saved in the word index table, as shown in Fig. 3;
The features of a word comprise: the word frequency of the word, the number of texts corresponding to the word, its part of speech, and its weight, where the weight is a value obtained in a subsequent step.
Second, a text index number is assigned to each deduplicated text, and the text index number of the deduplicated text is saved in the text index table together with the word index numbers of the words in that text, ordered by the position of the words in the text, as shown in Fig. 4.
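The two tables of step 205 can be represented with plain dictionaries. The sketch below is one possible layout, not the one shown in Fig. 3 and Fig. 4 (which are not reproduced here): `WordFeatures` and `build_indexes` are hypothetical names, the input is assumed to be already segmented and tagged, and each word keeps the part of speech from its first occurrence.

```python
from dataclasses import dataclass

@dataclass
class WordFeatures:
    word: str
    pos: str             # part-of-speech tag
    tf: int = 0          # word frequency
    df: int = 0          # number of texts containing the word
    weight: float = 0.0  # filled in later by the weighting step

def build_indexes(tagged_texts):
    """tagged_texts: list of texts, each a list of (word, pos) pairs in positional order."""
    word_ids = {}    # word -> word index number
    word_index = {}  # word index number -> WordFeatures
    text_index = {}  # text index number -> word index numbers in positional order
    for text_id, tagged in enumerate(tagged_texts):
        ids = []
        for word, pos in tagged:
            wid = word_ids.setdefault(word, len(word_ids))
            feats = word_index.setdefault(wid, WordFeatures(word, pos))
            feats.tf += 1
            ids.append(wid)
        # Each text counts once towards the text count of every word it contains.
        for wid in set(ids):
            word_index[wid].df += 1
        text_index[text_id] = ids
    return word_index, text_index
```

Storing word index numbers rather than the words themselves keeps the text index table compact and lets later steps count occurrences with integer keys.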
Step 206: calculating word weights according to the word frequency and the number of texts corresponding to each word;
A user focus dictionary is established according to the different backgrounds of users, for example finance-related, sports-related, or entertainment-related.
The word weight is calculated from the user focus dictionary, the word frequency, and the number of texts corresponding to the word, using the following formula:
$$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log\frac{|d|}{1 + df(term)},$$
where weight(term) is the word weight, b(term) is an empirical correction value based on the user focus dictionary, a(term) is an empirical correction value based on the part of speech, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
The value of a(term) is:
where nr denotes a person name and nt denotes an organization name;
The value of b(term) is: 1.5 if the word belongs to the user focus dictionary, and 1 if it does not.
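The weighting of step 206 can be written as a small function. This is a sketch under assumptions: `word_weight` is a hypothetical name, the natural logarithm is assumed because the source does not give the base, and the part-of-speech correction a(term) is passed in as a caller-supplied mapping `pos_factor` because its actual values are not reproduced in this text. Only the b(term) values of 1.5 and 1 come from the description.

```python
import math

def word_weight(term, pos, tf, df, total_texts, focus_dict, pos_factor):
    """weight = b * a * tf * log(|d| / (1 + df)).

    focus_dict: set of words in the user focus dictionary.
    pos_factor: mapping from part-of-speech tag to a(term); values are
                supplied by the caller, with 1.0 assumed for missing tags.
    """
    b = 1.5 if term in focus_dict else 1.0
    a = pos_factor.get(pos, 1.0)
    return b * a * tf * math.log(total_texts / (1 + df))
```

For instance, with 1000 texts, a word frequency of 20 and a text count of 9, a word in the focus dictionary scores 1.5 times the weight of an otherwise identical word outside it.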
Step 207: sorting the word weights in descending order, and selecting, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords;
The first part of speech comprises: /a adjective, /v verb, /j abbreviation, /ns place name, /nr person name, /nt organization name, /nz proper noun.
The second part of speech covers words of all other parts of speech that do not belong to the first part of speech.
Step 208: extracting from the candidate keywords, as the keywords, the second preset number of candidate keywords that occur most frequently in the texts to be extracted.
This step can be performed using the text index table, that is, by extracting as keywords the preset number of candidate keywords whose word index numbers occur most frequently in the text index table.
As shown in Fig. 4, the word with word index number 7 occurs frequently across multiple texts; if the word with word index number 7 is a candidate keyword, it is taken as a keyword.
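Steps 207 and 208 can be sketched on top of the index tables. The function name `select_keywords` and the input layout (a mapping from word index number to a (part of speech, weight) pair, plus the text index table as a mapping from text index number to word index numbers) are assumptions for illustration; the set of first parts of speech is taken from the list above.

```python
from collections import Counter

# Tags of the first part of speech, as listed in the description.
FIRST_POS = {"a", "v", "j", "ns", "nr", "nt", "nz"}

def select_keywords(word_features, text_index, n_candidates, n_keywords):
    """word_features: word index number -> (pos, weight).
    text_index: text index number -> list of word index numbers."""
    # Step 207: keep first-part-of-speech words, highest weight first.
    eligible = [wid for wid, (pos, _) in word_features.items() if pos in FIRST_POS]
    eligible.sort(key=lambda wid: (-word_features[wid][1], wid))
    candidates = set(eligible[:n_candidates])
    # Step 208: count how often each candidate's index number occurs
    # in the text index table and keep the most frequent ones.
    counts = Counter(
        wid for ids in text_index.values() for wid in ids if wid in candidates
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [wid for wid, _ in ranked[:n_keywords]]
```

In the usage below, word 9 has the highest weight but is excluded because its tag is not a first part of speech, and word 7 wins because its index number appears most often in the text index table.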
This embodiment of the present invention is applicable to all social network platforms, such as microblogs and spaces.
This embodiment provides a keyword extraction method for social networks that does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted. Noise filtering, text deduplication and word segmentation are applied to the texts, word weights are calculated, and keywords are extracted according to the word weights. Because no large volume of historical search information is needed, extraction speed is improved.
Embodiment 3
This embodiment of the present invention further provides a keyword extraction device based on social networks which, as shown in Fig. 5, comprises:
a word segmentation module 501, configured to perform word segmentation on the texts to be extracted and to transmit the segmented words to a statistics module;
the statistics module 502, configured to count the word frequency of each word and the number of texts corresponding to that word, and to transmit the statistics to a calculation module;
the calculation module 503, configured to calculate word weights according to the word frequency and the number of texts corresponding to each word, and to transmit the calculation result to a selection module;
the selection module 504, configured to select a first preset number of words with the largest word weights as candidate keywords, and to transmit the selection result to an extraction module;
the extraction module 505, configured to extract from the candidate keywords, as the keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted.
Further, the device also comprises:
a noise filtering module, configured to perform noise filtering on the texts to be extracted and to transmit the filtered texts to a text deduplication module;
the text deduplication module, configured to deduplicate the filtered texts;
and/or,
a part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented words, wherein the part of speech is either a first part of speech that satisfies the extraction rule or a second part of speech that does not, and to transmit the tagging result to the selection module;
the selection module being further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
Further, the noise filtering module, as shown in Fig. 6, comprises:
a setting submodule 601, configured to set the noise filtering rules and to transmit the set noise filtering rules to a matching submodule;
a traversal submodule 602, configured to traverse the texts to be extracted and to transmit the traversal result to the matching submodule;
the matching submodule 603, configured to match the characters in the texts to be extracted against the set noise filtering rules; if characters in a text to be extracted fall under a noise filtering rule, the match succeeds and the matched characters are transmitted to a first deletion submodule;
the first deletion submodule 604, configured to delete the matched characters.
Further, the text deduplication module, as shown in Fig. 7, comprises:
a mapping submodule 701, configured to map the current filtered text to fingerprint information and to transmit the mapping result to a comparison submodule;
the comparison submodule 702, configured to compare the current filtered text with the fingerprint library, to transmit to a second deletion submodule any current filtered text for which the number of differing fingerprints in the comparison result is less than or equal to the third preset value, and to transmit to a saving submodule any current filtered text for which the number of differing fingerprints in the comparison result is greater than the third preset value;
the second deletion submodule 703, configured to delete the current filtered text;
the saving submodule 704, configured to add the fingerprint information of the current filtered text to the fingerprint library.
Further, the device also comprises:
an allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the allocation result to a saving module;
the saving module, configured to save the word index number and the features of the corresponding word in a word index table, and to save in a text index table the text index number of the deduplicated text together with the word index numbers of the words in that text, ordered by the position of the words in the text;
wherein the features of a word comprise: the word frequency of the word, the number of texts corresponding to the word, its part of speech, and its word weight.
Further, the calculation module is configured to calculate the word weight of a word according to the following formula:
$$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log\frac{|d|}{1 + df(term)},$$
where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
The present invention provides a keyword extraction device for social networks that does not rely on a large amount of historical search information but extracts keywords directly from the texts to be extracted by means of the extraction module. The noise filtering module, text deduplication module, word segmentation module and calculation module respectively perform noise filtering, text deduplication, word segmentation and word-weight calculation on the texts to be extracted, and keywords are then extracted according to the word weights. Because no large volume of historical search information is needed, extraction speed is improved.
The above embodiments serve only to illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (10)

1. A keyword extraction method based on social networks, characterized by comprising:
performing word segmentation on the texts to be extracted, and counting the word frequency of each word and the number of texts corresponding to that word;
calculating word weights according to the word frequency and the number of texts corresponding to each word, selecting a first preset number of words with the largest word weights as candidate keywords, and extracting from the candidate keywords, as the keywords, a second preset number of candidate keywords that occur most frequently in the texts to be extracted.
2. The method of claim 1, characterized in that, before the word segmentation is performed on the texts to be extracted, the method further comprises: performing noise filtering on the texts to be extracted, and deduplicating the filtered texts;
and/or,
the step of performing word segmentation on the texts to be extracted further comprises: performing part-of-speech tagging on the segmented words, wherein the part of speech is either a first part of speech that satisfies the extraction rule or a second part of speech that does not; in that case, selecting the first preset number of words with the largest word weights as candidate keywords comprises: selecting, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
3. The method of claim 2, characterized in that performing noise filtering on the texts to be extracted specifically comprises:
traversing the texts to be extracted and matching the characters in them against the set noise filtering rules; if characters in a text to be extracted fall under a noise filtering rule, the match succeeds and the matched characters are deleted;
and/or,
deduplicating the filtered texts specifically comprises:
mapping the current filtered text to fingerprint information and comparing it with a fingerprint library; if the number of differing fingerprints in the comparison result is less than or equal to a third preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint library.
4. the method for claim 1, is characterized in that, the word frequency of described statistics word and textual data corresponding to this word, comprise further:
For each unduplicated word distributes glossarial index number, and the feature of the call number of word and the word corresponding with call number is saved in glossarial index table;
For the text after duplicate removal distributes text index number, according to the position relationship of word in the text after duplicate removal, the glossarial index number of word in the text after the text index of the text after duplicate removal number and this duplicate removal is saved in text index table;
Wherein, the feature of institute's predicate comprises: the word frequency of word, textual data, part of speech and word weight that this word is corresponding.
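The two index tables can be pictured with the sketch below. It assumes in-memory dictionaries as the tables and sequential integers as index numbers; the class `WordFeatures` and the function `build_indexes` are hypothetical names, and the weight field is left at its default until it is computed separately.

```python
from dataclasses import dataclass


@dataclass
class WordFeatures:
    word: str
    tf: int = 0          # word frequency across all deduplicated texts
    df: int = 0          # number of texts that contain the word
    pos: str = ""        # part of speech, if tagged
    weight: float = 0.0  # word weight, filled in later


def build_indexes(tokenized_texts, pos_tags=None):
    """Build a word index table and a text index table.

    tokenized_texts -- list of token lists, one per deduplicated text
    pos_tags        -- optional dict mapping word -> part-of-speech tag
    """
    pos_tags = pos_tags or {}
    word_ids = {}    # word -> word index number
    word_index = {}  # word index number -> WordFeatures
    text_index = {}  # text index number -> word index numbers, in text order
    for text_id, tokens in enumerate(tokenized_texts):
        ids = []
        for token in tokens:
            if token not in word_ids:
                word_ids[token] = len(word_ids)
                word_index[word_ids[token]] = WordFeatures(
                    token, pos=pos_tags.get(token, ""))
            wid = word_ids[token]
            word_index[wid].tf += 1
            ids.append(wid)
        # Each text counts once toward the document frequency of its words.
        for wid in set(ids):
            word_index[wid].df += 1
        text_index[text_id] = ids
    return word_index, text_index
```

Storing word index numbers in text order preserves the positional relationship of the words, so the original token sequence of any text can be rebuilt from the two tables.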
5. the method for claim 1, is characterized in that, described calculating word weight specifically comprises:
Word weight according to following formulae discovery word:
weight ( term ) = b ( term ) * a ( term ) * tf ( term ) * log | d | 1 + df ( term ) ,
Wherein, weight (term) is word weight, and b (term) and a (term) is experiential modification value, the word frequency that tf (term) is word, and df (term) is textual data corresponding to word, | d| is text sum.
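The formula translates directly into code. In this sketch the function name `word_weight` and its parameter names are assumptions, the correction values default to 1.0 because the claim gives no concrete values for them, and the natural logarithm is used because the claim does not state a base.

```python
import math


def word_weight(tf, df, total_texts, a=1.0, b=1.0):
    """weight = b * a * tf * log(|d| / (1 + df)).

    tf          -- word frequency of the word
    df          -- number of texts corresponding to the word
    total_texts -- |d|, the total number of texts
    a, b        -- empirical correction values (defaults are placeholders)
    """
    return b * a * tf * math.log(total_texts / (1 + df))
```

Note that the weight is zero when 1 + df equals |d| and negative when a word appears in every text, so very common words sink to the bottom of the ranking.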
6. A keyword extraction device based on social networks, characterized by comprising:
A word segmentation module, configured to perform word segmentation on a text to be extracted, and to transmit the segmented words to a statistics module;
The statistics module, configured to count the word frequency of each word and the number of texts corresponding to that word, and to transmit the statistical result to a calculation module;
The calculation module, configured to calculate a word weight according to the word frequency and the number of texts corresponding to the word, and to transmit the calculation result to a selection module;
The selection module, configured to select a first preset number of words with the largest word weights as candidate keywords, and to transmit the selection result to an extraction module;
The extraction module, configured to extract, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords.
7. The device of claim 6, characterized in that the device further comprises:
A noise filtering module, configured to perform noise filtering on the text to be extracted, and to transmit the filtered text to a text deduplication module;
The text deduplication module, configured to deduplicate the filtered text;
And/or,
A part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented words, wherein the part of speech is either a first part of speech that satisfies the extraction rule or a second part of speech that does not satisfy the extraction rule, and to transmit the tagging result to the selection module;
The selection module, further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
8. The device of claim 7, characterized in that the noise filtering module comprises:
A setting submodule, configured to set a noise filtering rule, and to transmit the set noise filtering rule to a matching submodule;
A traversal submodule, configured to traverse the text to be extracted, and to transmit the traversal result to the matching submodule;
The matching submodule, configured to match the characters in the text to be extracted according to the set noise filtering rule, wherein, if a character in the text to be extracted belongs to the noise filtering rule, the match succeeds, and to transmit the successfully matched character to a first deletion submodule;
The first deletion submodule, configured to delete the successfully matched character;
And/or,
The text deduplication module comprises:
A mapping submodule, configured to map the currently filtered text to fingerprint information, and to transmit the mapping result to a comparison submodule;
The comparison submodule, configured to compare the currently filtered text against a fingerprint information library, to transmit the currently filtered text to a second deletion submodule when the number of differing fingerprints in the comparison result is less than or equal to a third preset value, and to transmit the currently filtered text to a saving submodule when the number of differing fingerprints in the comparison result is not less than the third preset value;
The second deletion submodule, configured to delete the currently filtered text;
The saving submodule, configured to add the fingerprint information of the currently filtered text to the fingerprint information library.
9. The device of claim 6, characterized in that the device further comprises:
An allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the allocation result to a saving module;
The saving module, configured to save the index number of the word and the features of the word corresponding to that index number in a word index table, and, according to the positional relationship of the words in the deduplicated text, to save the text index number of the deduplicated text and the word index numbers of the words in that text in a text index table;
Wherein the features of the word comprise: the word frequency of the word, the number of texts corresponding to the word, the part of speech, and the word weight.
10. The device of claim 6, characterized in that the calculation module is configured to calculate the word weight of a word according to the following formula:
$$\mathrm{weight}(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log\frac{|d|}{1 + df(term)},$$
Wherein weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
CN201310503897.5A 2013-10-23 2013-10-23 Keyword extraction method and device based on social networking services Pending CN104572736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310503897.5A CN104572736A (en) 2013-10-23 2013-10-23 Keyword extraction method and device based on social networking services

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310503897.5A CN104572736A (en) 2013-10-23 2013-10-23 Keyword extraction method and device based on social networking services

Publications (1)

Publication Number Publication Date
CN104572736A true CN104572736A (en) 2015-04-29

Family

ID=53088820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310503897.5A Pending CN104572736A (en) 2013-10-23 2013-10-23 Keyword extraction method and device based on social networking services

Country Status (1)

Country Link
CN (1) CN104572736A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608627A (en) * 2016-02-01 2016-05-25 广东欧珀移动通信有限公司 Information updating method and apparatus based on social network platform
CN106097113A (en) * 2016-06-21 2016-11-09 仲兆满 A kind of social network user sound interest digging method
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN107577671A (en) * 2017-09-19 2018-01-12 中央民族大学 A kind of key phrases extraction method based on multi-feature fusion
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN108984596A (en) * 2018-06-01 2018-12-11 阿里巴巴集团控股有限公司 A kind of keyword excavates and the method, device and equipment of risk feedback

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090300007A1 (en) * 2008-05-28 2009-12-03 Takuya Hiraoka Information processing apparatus, full text retrieval method, and computer-readable encoding medium recorded with a computer program thereof
CN101872363A (en) * 2010-06-24 2010-10-27 北京邮电大学 Method for extracting keywords
CN102033919A (en) * 2010-12-07 2011-04-27 北京新媒传信科技有限公司 Method and system for extracting text key words
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN105608627A (en) * 2016-02-01 2016-05-25 广东欧珀移动通信有限公司 Information updating method and apparatus based on social network platform
CN106097113A (en) * 2016-06-21 2016-11-09 仲兆满 A kind of social network user sound interest digging method
CN106097113B (en) * 2016-06-21 2020-11-27 江苏海洋大学 Social network user dynamic and static interest mining method
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN107577671A (en) * 2017-09-19 2018-01-12 中央民族大学 A kind of key phrases extraction method based on multi-feature fusion
CN107577671B (en) * 2017-09-19 2020-09-22 中央民族大学 Subject term extraction method based on multi-feature fusion
CN108984596A (en) * 2018-06-01 2018-12-11 阿里巴巴集团控股有限公司 A kind of keyword excavates and the method, device and equipment of risk feedback

Similar Documents

Publication Publication Date Title
CN107204184B (en) Audio recognition method and system
CN108287922B (en) Text data viewpoint abstract mining method fusing topic attributes and emotional information
CN104572736A (en) Keyword extraction method and device based on social networking services
CN103336766B (en) Short text garbage identification and modeling method and device
CN104850574B (en) A kind of filtering sensitive words method of text-oriented information
Zou et al. Automatic construction of Chinese stop word list
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN104102681A (en) Microblog key event acquiring method and device
CN102298587B (en) Satisfaction investigation method and system
CN103123624B (en) Determine method and device, searching method and the device of centre word
CN104615593A (en) Method and device for automatic detection of microblog hot topics
CN103226576A (en) Comment spam filtering method based on semantic similarity
CN105843796A (en) Microblog emotional tendency analysis method and device
CN102609427A (en) Public opinion vertical search analysis system and method
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN104239490A (en) Multi-account detection method and device for UGC (user generated content) website platform
CN107103067A (en) A kind of method of data synchronization and system based on search engine
US20150331953A1 (en) Method and device for providing search engine label
Dewani et al. Development of computational linguistic resources for automated detection of textual cyberbullying threats in Roman Urdu language
CN104915443A (en) Extraction method of Chinese Microblog evaluation object
CN109978020A (en) A kind of social networks account vest identity identification method based on multidimensional characteristic
CN104346382B (en) Use the text analysis system and method for language inquiry
Takuro et al. Codewords detection in microblogs focusing on differences in word use between two corpora
CN105159927B (en) Method and device for selecting subject term of target text and terminal
CN104166712B (en) Indexing of Scien. and Tech. Literature method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150429