CN104731797A - Keyword extracting method and keyword extracting device - Google Patents
- Publication number
- CN104731797A (application CN201310706212.7A)
- Authority
- CN
- China
- Prior art keywords
- participle
- text
- weight
- speech
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a keyword extracting method and a keyword extracting device, and belongs to the field of computers. The keyword extracting method includes: performing word segmentation on a text to obtain the segmented words included in the text; acquiring the part of speech of each segmented word included in the text and acquiring the position of each segmented word in the text; calculating a score for each segmented word according to its part of speech and its position in the text, wherein the score indicates the degree to which the segmented word reflects the main content of the text; and selecting a preset number of segmented words with the highest scores and determining the selected segmented words as keywords. The keyword extracting device comprises a dividing module, an acquiring module, a calculating module and a determining module. With the method and the device, the extracted keywords are not affected by other texts, keyword extracting efficiency is improved, and keyword extracting accuracy is increased.
Description
Technical field
The present invention relates to the field of computers, and in particular to a method and a device for extracting keywords.
Background technology
When a text is long, a user who wants to grasp its gist quickly needs to obtain the main content of the text. Keywords accurately reflect the main content of a text, so keywords need to be extracted from the text.
At present, a method for extracting keywords is provided, specifically: word segmentation is performed on a text to obtain the segmented words the text comprises; for each segmented word the text comprises, the number of times the word occurs in the text is counted, and the count is determined as the word frequency of the word; the number of texts containing the word in the text collection to which the text belongs is counted, and the inverse document frequency of the word is calculated from that count and the total number of texts in the collection; the TF-IDF (Term Frequency-Inverse Document Frequency) value of the word is then calculated from its word frequency and its inverse document frequency. The words are sorted by TF-IDF value, the preset number of words with the largest TF-IDF values are obtained, and the obtained words are determined as keywords.
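For illustration only (not part of this disclosure), the prior-art TF-IDF computation described above can be sketched as follows; the function and variable names are the editor's own, and practical systems usually smooth the inverse document frequency:

```python
import math

def tf_idf(word, text, corpus):
    """TF-IDF of `word` for `text` (a list of segmented words) within
    `corpus` (a list of such texts): word frequency in the text times the
    log of the number of texts over the number of texts containing the word."""
    tf = text.count(word)                             # word frequency in the text
    containing = sum(1 for t in corpus if word in t)  # texts containing the word
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf
```

Sorting all words of a text by this value and keeping the preset number of highest-valued ones reproduces the prior-art extraction step; note that the result changes whenever texts are added to or removed from the collection, which is exactly the drawback identified below.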
In the process of realizing the present invention, the inventors found that the prior art has at least the following problem:
Because the TF-IDF computation depends not only on the text itself but also on the other texts in the text collection to which the text belongs, when those other texts bear little relation to the text, the accuracy of extracting keywords by TF-IDF value is low; moreover, when the number of texts in the collection increases or decreases, the extracted keywords are strongly affected.
Summary of the invention
In order to solve the problems of the prior art, embodiments of the present invention provide a method and a device for extracting keywords. The technical scheme is as follows:
On the one hand, a method for extracting keywords is provided, the method comprising:
performing word segmentation on a text to obtain the segmented words the text comprises;
obtaining the part of speech of each segmented word the text comprises, and obtaining the position of each segmented word in the text;
calculating the score of each segmented word according to the part of speech of each segmented word and the position of each segmented word in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
selecting the preset number of segmented words with the highest scores, and determining the selected segmented words as keywords.
Wherein calculating the score of each segmented word according to the part of speech of each segmented word and the position of each segmented word in the text comprises:
calculating the information entropy of each segmented word the text comprises;
obtaining, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each segmented word according to the position of each segmented word in the text;
calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein calculating the information entropy of each segmented word the text comprises includes:
for each segmented word the text comprises, dividing the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
counting the first frequency with which the segmented word occurs in the first text, and counting the second frequency with which the segmented word occurs in the second text;
calculating the information entropy of the segmented word according to the first frequency and the second frequency.
Further, calculating the information entropy of the segmented word according to the first frequency and the second frequency includes:
calculating the information entropy of the segmented word, according to the first frequency and the second frequency, by the formula E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i);
wherein w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word includes:
calculating the score of each segmented word by the formula f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i);
wherein w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
On the other hand, a device for extracting keywords is provided, the device comprising:
a dividing module, configured to perform word segmentation on a text to obtain the segmented words the text comprises;
an acquisition module, configured to obtain the part of speech of each segmented word the text comprises, and to obtain the position of each segmented word in the text;
a computing module, configured to calculate the score of each segmented word according to its part of speech and its position in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
a determination module, configured to select the preset number of segmented words with the highest scores and determine the selected segmented words as keywords.
Wherein the computing module comprises:
a first computing unit, configured to calculate the information entropy of each segmented word the text comprises;
a first acquiring unit, configured to obtain, according to the part of speech of each segmented word, the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each segmented word according to its position in the text;
a second computing unit, configured to calculate the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein the first computing unit comprises:
a dividing subunit, configured, for each segmented word the text comprises, to divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
a statistics subunit, configured to count the first frequency with which the segmented word occurs in the first text, and to count the second frequency with which the segmented word occurs in the second text;
a computing subunit, configured to calculate the information entropy of the segmented word according to the first frequency and the second frequency.
Further, the computing subunit is specifically configured to:
calculate the information entropy of the segmented word, according to the first frequency and the second frequency, by the formula E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i);
wherein w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein the second computing unit is specifically configured to:
calculate the score of each segmented word, according to the information entropy, the part-of-speech weight and the position weight of each segmented word, by the formula f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i);
wherein w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
Accompanying drawing explanation
In order to illustrate the technical schemes in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting keywords provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a method for extracting keywords provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a device for extracting keywords provided by Embodiment 3 of the present invention.
Embodiment
To make the objects, technical schemes and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment one
An embodiment of the present invention provides a method for extracting keywords. Referring to Fig. 1, the method comprises:
Step 101: perform word segmentation on a text to obtain the segmented words the text comprises;
Step 102: obtain the part of speech of each segmented word the text comprises, and obtain the position of each segmented word in the text;
Step 103: calculate the score of each segmented word according to its part of speech and its position in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
Step 104: select the preset number of segmented words with the highest scores, and determine the selected segmented words as keywords.
Wherein calculating the score of each segmented word according to its part of speech and its position in the text comprises:
calculating the information entropy of each segmented word the text comprises;
obtaining, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each segmented word according to its position in the text;
calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein calculating the information entropy of each segmented word the text comprises includes:
for each segmented word the text comprises, dividing the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
counting the first frequency with which the segmented word occurs in the first text, and counting the second frequency with which the segmented word occurs in the second text;
calculating the information entropy of the segmented word according to its first frequency and second frequency.
Further, calculating the information entropy of the segmented word according to its first frequency and second frequency includes:
calculating the information entropy of the segmented word according to the following formula (1):
E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i)    (1)
wherein, in the above formula (1), w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word includes:
calculating the score of each segmented word according to the following formula (2):
f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i)    (2)
wherein, in the above formula (2), w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
Embodiment two
An embodiment of the present invention provides a method for extracting keywords. Referring to Fig. 2, the method comprises:
Step 201: perform word segmentation on a text to obtain the segmented words the text comprises;
Optionally, after word segmentation is performed on the text, each segmented word the text comprises may be stored in a segmented-word set in the order in which the segmented words appear in the text. The segmented-word set may contain repeated segmented words.
The embodiment of the present invention only performs word segmentation on the text and does not deduplicate the segmented words obtained by the division, so the segmented words obtained may include repetitions.
Word segmentation of a text is prior art and is not described in detail here.
For example, suppose the text is "Everbright Securities abnormal trading was fined". Word segmentation of the text yields the segmented words it comprises: "Everbright", "Securities", "abnormal", "trading", "was" and "fined".
Step 202: obtain the part of speech of each segmented word the text comprises, and obtain the position of each segmented word in the text;
Parts of speech include nouns, verbs, adjectives, prepositions, conjunctions and so on. Since nouns and verbs are more likely to be keywords of a text, and words of other parts of speech are less likely, the part of speech of a segmented word has a large influence on the extracted keywords, so the embodiment of the present invention needs to distinguish the part of speech of each segmented word.
The obtained position of each segmented word in the text may be the paragraph in which the segmented word appears, or the position of the segmented word within its paragraph.
A text comprises one or more paragraphs, and the paragraphs differ in importance; for example, the first and last paragraphs of a text generally reflect the main content of the text more strongly than the other paragraphs do. Likewise, different positions within a paragraph differ in importance; for example, the beginning of a paragraph reflects the main content of the text more strongly than the middle or the end of the paragraph does. The position of a segmented word in the text therefore has a considerable influence on the extracted keywords, and the position of each segmented word in the text needs to be obtained.
Step 203: calculate the information entropy of each segmented word the text comprises;
Specifically, this step can be realized by the following steps (1)-(3):
(1) For each segmented word the text comprises, divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it.
For example, if the segmented word is "abnormal", the text "Everbright Securities abnormal trading was fined" is divided according to the position of "abnormal" into the first text "Everbright Securities abnormal" and the second text "abnormal trading was fined".
(2) Count the first frequency with which the segmented word occurs in the first text, and count the second frequency with which the segmented word occurs in the second text.
Specifically, count the number of segmented words the first text comprises and the number of times the segmented word occurs in the first text, and divide the number of occurrences by the number of segmented words in the first text to obtain the first frequency; likewise, count the number of segmented words the second text comprises and the number of times the segmented word occurs in the second text, and divide the number of occurrences by the number of segmented words in the second text to obtain the second frequency.
Since both the first text and the second text contain the segmented word itself, the first frequency and the second frequency counted in this way are guaranteed to be nonzero.
(3) Calculate the information entropy of the segmented word according to its first frequency and second frequency.
Specifically, calculate the information entropy of the segmented word according to the following formula (1):
E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i)    (1)
wherein, in the above formula (1), w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, p_2(w_i) is the second frequency of the segmented word w_i, and log_2 is the base-2 logarithm.
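For illustration only, steps (1)-(3) and formula (1) can be sketched in Python as follows; the function and variable names are the editor's assumptions, not the patent's:

```python
import math

def split_word_entropy(words, i):
    """Information entropy of the i-th segmented word (0-based) of a text,
    per formula (1): the text is split at the word's position into a first
    text (the word and everything before it) and a second text (the word and
    everything after it); p1 and p2 are the word's relative frequencies in
    the two parts."""
    word = words[i]
    first, second = words[:i + 1], words[i:]   # both parts include the word itself
    p1 = first.count(word) / len(first)        # first frequency, never 0
    p2 = second.count(word) / len(second)      # second frequency, never 0
    return -p1 * math.log2(p1) - p2 * math.log2(p2)
```

For the example above, the word "abnormal" at index 2 of the six segmented words gets p_1 = 1/3 and p_2 = 1/4, and both frequencies are nonzero because each part contains the word itself.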
Step 204: according to the part of speech of each segmented word, obtain the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
A corresponding part-of-speech weight is set for each part of speech in advance, and each part of speech and its corresponding weight are stored in the correspondence between parts of speech and part-of-speech weights.
All the part-of-speech weights sum to 1.
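For illustration only, the lookup of step 204 may be realized with a stored table such as the following; the concrete weight values are hypothetical, the patent requiring only that the correspondence is set in advance and that all part-of-speech weights sum to 1:

```python
# Hypothetical part-of-speech weights (nouns and verbs weigh most, as they
# are the likeliest keywords); the values sum to 1 as the embodiment requires.
POS_WEIGHTS = {"noun": 0.40, "verb": 0.30, "adjective": 0.15,
               "preposition": 0.05, "conjunction": 0.05, "other": 0.05}

def pos_weight(part_of_speech):
    """Look up the part-of-speech weight, falling back to 'other'."""
    return POS_WEIGHTS.get(part_of_speech, POS_WEIGHTS["other"])
```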
Step 205: obtain the position weight of each segmented word according to its position in the text;
Specifically, according to the position of each segmented word in the text, determine the position range in which the segmented word falls, and obtain the corresponding position weight from the stored correspondence between position ranges and position weights.
The text is divided by position in advance to obtain multiple position ranges; a corresponding position weight is set for each position range, and each position range and its corresponding weight are stored in the correspondence between position ranges and position weights.
It should be added that, since texts fall into categories such as narrative, expository, argumentative, lyrical and practical writing, and the importance of a word's position differs between categories, different position weights can be set for each position range of each category of text, thereby improving the accuracy of extracting keywords from the text.
The embodiment of the present invention may also set different position ranges for texts of different categories.
The user may modify the position ranges of a text according to its category, and may modify the position weight of each position range.
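For illustration only, the position-weight lookup of step 205 can be sketched as follows; the position ranges and weight values are hypothetical, since the embodiment leaves them to be set in advance (and possibly per text category):

```python
# Hypothetical position ranges over the relative position of a word in the
# text, with the opening and closing weighted higher than the middle.
POSITION_WEIGHTS = [
    ((0.0, 0.1), 0.5),   # beginning of the text
    ((0.1, 0.9), 0.2),   # middle of the text
    ((0.9, 1.0), 0.3),   # end of the text
]

def position_weight(index, total_words):
    """Map the index of a segmented word to the weight of its position range."""
    rel = index / total_words
    for (low, high), weight in POSITION_WEIGHTS:
        if low <= rel < high or (high == 1.0 and rel == 1.0):
            return weight
    return 0.0
```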
Step 206: calculate the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word;
Specifically, calculate the score of each segmented word according to the following formula (2):
f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i)    (2)
wherein, in the above formula (2), w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
Since the score of a segmented word depends on its information entropy, part-of-speech weight and position weight, and these three quantities reflect the importance of the segmented word to different degrees, the embodiment of the present invention sets a weight for each of them, and the weight of the information entropy, the weight of the part-of-speech weight and the weight of the position weight sum to 1.
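For illustration only, formula (2) can be written directly in code; the values of c_1, c_2 and c_3 below are hypothetical, the embodiment requiring only that they sum to 1:

```python
# Hypothetical weights of the entropy, position and part-of-speech terms;
# they sum to 1 as step 206 requires.
C1, C2, C3 = 0.4, 0.3, 0.3

def score(entropy, position_w, speech_w):
    """Formula (2): f(w_i) = c1*E(w_i) + c2*pos(w_i) + c3*t(w_i)."""
    return C1 * entropy + C2 * position_w + C3 * speech_w
```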
Step 207: select the preset number of segmented words with the highest scores, and determine the selected segmented words as keywords.
Alternatively, segmented words whose scores are greater than a preset threshold may be selected and determined as keywords.
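For illustration only, step 207 and its threshold variant can be sketched as follows (the helper name is the editor's own):

```python
import heapq

def select_keywords(scores, preset_count=5, threshold=None):
    """scores maps each segmented word to its score. Take the preset number
    of highest-scoring words (step 207), or, alternatively, every word whose
    score exceeds the threshold."""
    if threshold is not None:
        return [word for word, s in scores.items() if s > threshold]
    return heapq.nlargest(preset_count, scores, key=scores.get)
```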
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
Embodiment three
Referring to Fig. 3, an embodiment of the present invention provides a device for extracting keywords, the device comprising:
a dividing module 301, configured to perform word segmentation on a text to obtain the segmented words the text comprises;
an acquisition module 302, configured to obtain the part of speech of each segmented word the text comprises, and to obtain the position of each segmented word in the text;
a computing module 303, configured to calculate the score of each segmented word according to its part of speech and its position in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
a determination module 304, configured to select the preset number of segmented words with the highest scores and determine the selected segmented words as keywords.
Wherein the computing module 303 comprises:
a first computing unit, configured to calculate the information entropy of each segmented word the text comprises;
a first acquiring unit, configured to obtain, according to the part of speech of each segmented word, the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each segmented word according to its position in the text;
a second computing unit, configured to calculate the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein the first computing unit comprises:
a dividing subunit, configured, for each segmented word the text comprises, to divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
a statistics subunit, configured to count the first frequency with which the segmented word occurs in the first text, and to count the second frequency with which the segmented word occurs in the second text;
a computing subunit, configured to calculate the information entropy of the segmented word according to its first frequency and second frequency.
Further, the computing subunit is specifically configured to:
calculate the information entropy of the segmented word, according to its first frequency and second frequency, by the following formula (1):
E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i)    (1)
wherein, in the above formula (1), w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein the second computing unit is specifically configured to:
calculate the score of each segmented word, according to the information entropy, the part-of-speech weight and the position weight of each segmented word, by the following formula (2):
f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i)    (2)
wherein, in the above formula (2), w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
It should be noted that: the device of the extraction keyword that above-described embodiment provides is when extracting keyword, only be illustrated with the division of above-mentioned each functional module, in practical application, can distribute as required and by above-mentioned functions and be completed by different functional modules, inner structure by device is divided into different functional modules, to complete all or part of function described above.In addition, the device of the extraction keyword that above-described embodiment provides belongs to same design with the embodiment of the method extracting keyword, and its specific implementation process refers to embodiment of the method, repeats no more here.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
One of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A method for extracting keywords, characterized in that the method comprises:
performing word segmentation on a text to obtain the participles the text comprises;
obtaining the part of speech of each participle the text comprises, and obtaining the position of each participle in the text;
calculating the score of each participle according to the part of speech of each participle and the position of each participle in the text, the score indicating the degree to which a participle reflects the main content of the text;
selecting a preset number of participles with the highest scores, and determining the selected participles as keywords.
2. the method for claim 1, is characterized in that, the described part of speech according to described each participle and the described position of each participle in described text, calculate the mark of described each participle, comprising:
Calculate the information entropy of each participle that described text comprises;
According to the part of speech of described each participle, obtain corresponding part of speech weight from the part of speech stored with the corresponding relation of part of speech weight;
According to the described position of each participle in described text, obtain the position weight of described each participle;
According to the position weight of the information entropy of described each participle, the part of speech weight of described each participle and described each participle, calculate the mark of described each participle.
3. The method of claim 2, characterized in that calculating the information entropy of each participle the text comprises comprises:
for each participle the text comprises, dividing the text into a first text and a second text according to the position of the participle in the text, the first text being the participle together with the text before the participle, and the second text being the participle together with the text after the participle;
counting a first frequency with which the participle occurs in the first text, and counting a second frequency with which the participle occurs in the second text;
calculating the information entropy of the participle according to the first frequency and the second frequency.
4. The method of claim 3, characterized in that calculating the information entropy of the participle according to the first frequency and the second frequency comprises:
calculating, according to the first frequency and the second frequency, the information entropy of the participle by the formula E(w_i) = -p_1(w_i) log_2 p_1(w_i) - p_2(w_i) log_2 p_2(w_i);
wherein w_i is the i-th participle the text comprises, E(w_i) is the information entropy of participle w_i, p_1(w_i) is the first frequency of participle w_i, and p_2(w_i) is the second frequency of participle w_i.
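The entropy computation of claims 3 and 4 can be sketched as below. One hedge: the claims say "frequency" without specifying normalization, so treating the first and second frequencies as relative frequencies within each part is an assumption of this sketch.

```python
import math

def participle_entropy(tokens, i):
    # Claims 3-4: split the token sequence at position i into a first text
    # (tokens up to and including the participle) and a second text (the
    # participle and everything after it), then combine the participle's
    # frequency in each part as E = -p1*log2(p1) - p2*log2(p2).
    # Treating "frequency" as relative frequency is an assumption here.
    w = tokens[i]
    first, second = tokens[:i + 1], tokens[i:]
    p1 = first.count(w) / len(first)
    p2 = second.count(w) / len(second)
    # p1 and p2 are always > 0 because w occurs at position i in both
    # parts, and log2(1) == 0, so no special cases are needed.
    return -p1 * math.log2(p1) - p2 * math.log2(p2)
```

For example, for the second "a" in the sequence a b a c, the first text is a b a (p1 = 2/3) and the second text is a c (p2 = 1/2).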
5. The method of claim 2, characterized in that calculating the score of each participle according to the information entropy, the part-of-speech weight, and the position weight of each participle comprises:
calculating, according to the information entropy, the part-of-speech weight, and the position weight of each participle, the score of each participle by the formula f(w_i) = c_1*E(w_i) + c_2*pos(w_i) + c_3*t(w_i);
wherein w_i is the i-th participle the text comprises, f(w_i) is the score of participle w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of participle w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of participle w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of participle w_i.
6. A device for extracting keywords, characterized in that the device comprises:
a dividing module, configured to perform word segmentation on a text to obtain the participles the text comprises;
an acquisition module, configured to obtain the part of speech of each participle the text comprises, and to obtain the position of each participle in the text;
a computing module, configured to calculate the score of each participle according to the part of speech of each participle and the position of each participle in the text, the score indicating the degree to which a participle reflects the main content of the text;
a determination module, configured to select a preset number of participles with the highest scores, and to determine the selected participles as keywords.
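The module decomposition of claim 6 can be sketched as a single class whose methods mirror the dividing, acquisition, computing, and determination modules. The tokenizer, tagger, entropy, and position-weight callables below are hypothetical stand-ins, since the patent does not prescribe concrete implementations for them.

```python
class KeywordExtractor:
    # Sketch of the device of claim 6; comments mark which module each
    # step corresponds to. Weights c1..c3 are illustrative.

    def __init__(self, pos_weights, c1=0.5, c2=0.3, c3=0.2):
        self.pos_weights = pos_weights  # stored part-of-speech -> weight map (claim 7)
        self.c1, self.c2, self.c3 = c1, c2, c3

    def divide(self, text):
        # Dividing module: whitespace splitting stands in for a real
        # Chinese word segmenter.
        return text.split()

    def extract(self, text, tag, entropy, position_weight, n):
        tokens = self.divide(text)                       # dividing module
        scores = {}
        for i, w in enumerate(tokens):
            t = self.pos_weights.get(tag(w), 0.0)        # acquisition module
            f = (self.c1 * entropy(tokens, i)
                 + self.c2 * position_weight(i, len(tokens))
                 + self.c3 * t)                          # computing module
            scores[w] = max(scores.get(w, f), f)
        # Determination module: the n highest-scoring participles.
        return sorted(scores, key=scores.get, reverse=True)[:n]
```

Keeping the four steps as separate methods and callables reflects the note in the description that the functions may be reassigned to different modules as needed.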
7. The device of claim 6, characterized in that the computing module comprises:
a first computing unit, configured to calculate the information entropy of each participle the text comprises;
a first acquiring unit, configured to obtain, according to the part of speech of each participle, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each participle according to the position of each participle in the text;
a second computing unit, configured to calculate the score of each participle according to the information entropy, the part-of-speech weight, and the position weight of each participle.
8. The device of claim 7, characterized in that the first computing unit comprises:
a dividing subunit, configured to divide, for each participle the text comprises, the text into a first text and a second text according to the position of the participle in the text, the first text being the participle together with the text before the participle, and the second text being the participle together with the text after the participle;
a statistics subunit, configured to count a first frequency with which the participle occurs in the first text, and to count a second frequency with which the participle occurs in the second text;
a computation subunit, configured to calculate the information entropy of the participle according to the first frequency and the second frequency.
9. The device of claim 8, characterized in that the computation subunit is specifically configured to:
calculate, according to the first frequency and the second frequency, the information entropy of the participle by the formula E(w_i) = -p_1(w_i) log_2 p_1(w_i) - p_2(w_i) log_2 p_2(w_i);
wherein w_i is the i-th participle the text comprises, E(w_i) is the information entropy of participle w_i, p_1(w_i) is the first frequency of participle w_i, and p_2(w_i) is the second frequency of participle w_i.
10. The device of claim 7, characterized in that the second computing unit is specifically configured to:
calculate, according to the information entropy, the part-of-speech weight, and the position weight of each participle, the score of each participle by the formula f(w_i) = c_1*E(w_i) + c_2*pos(w_i) + c_3*t(w_i);
wherein w_i is the i-th participle the text comprises, f(w_i) is the score of participle w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of participle w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of participle w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of participle w_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310706212.7A CN104731797B (en) | 2013-12-19 | 2013-12-19 | A kind of method and device of extraction keyword |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104731797A true CN104731797A (en) | 2015-06-24 |
CN104731797B CN104731797B (en) | 2018-09-18 |
Family
ID=53455694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310706212.7A Active CN104731797B (en) | 2013-12-19 | 2013-12-19 | A kind of method and device of extraction keyword |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104731797B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110264655A1 (en) * | 2010-04-22 | 2011-10-27 | Microsoft Corporation | Location context mining |
CN103186662A (en) * | 2012-12-28 | 2013-07-03 | 中联竞成(北京)科技有限公司 | System and method for extracting dynamic public sentiment keywords |
Non-Patent Citations (2)
Title |
---|
张红鹰: "Keyword Extraction Methods for Chinese Text", Computer Systems & Applications (《计算机系统应用》) * |
蒋效宇: "Automatic Summarization Algorithm Based on Keyword Extraction", Computer Engineering (《计算机工程》) * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933197A (en) * | 2015-07-13 | 2015-09-23 | 北京天天卓越科技有限公司 | Method and terminal equipment for determining keywords |
CN105373528A (en) * | 2015-08-18 | 2016-03-02 | 新华网股份有限公司 | Method and device for analyzing sensitivity of text contents |
CN105373528B (en) * | 2015-08-18 | 2019-03-12 | 新华网股份有限公司 | A kind of text content sensitive analysis method and device |
CN106557508A (en) * | 2015-09-28 | 2017-04-05 | 北京神州泰岳软件股份有限公司 | A kind of text key word extracting method and device |
WO2017084267A1 (en) * | 2015-11-18 | 2017-05-26 | 乐视控股(北京)有限公司 | Method and device for keyphrase extraction |
WO2018027463A1 (en) * | 2016-08-08 | 2018-02-15 | 深圳市博信诺达经贸咨询有限公司 | Application method and system for keyword analysis in big data |
CN106325688A (en) * | 2016-08-17 | 2017-01-11 | 北京锤子数码科技有限公司 | Text processing method and device |
CN106484266A (en) * | 2016-10-18 | 2017-03-08 | 北京锤子数码科技有限公司 | A kind of text handling method and device |
US10489047B2 (en) | 2016-10-18 | 2019-11-26 | Beijing Bytedance Network Technology Co Ltd. | Text processing method and device |
CN111381751A (en) * | 2016-10-18 | 2020-07-07 | 北京字节跳动网络技术有限公司 | Text processing method and device |
CN107665189A (en) * | 2017-06-16 | 2018-02-06 | 平安科技(深圳)有限公司 | A kind of method, terminal and equipment for extracting centre word |
CN107665189B (en) * | 2017-06-16 | 2019-12-13 | 平安科技(深圳)有限公司 | method, terminal and equipment for extracting central word |
CN107577713B (en) * | 2017-08-03 | 2018-09-11 | 国网信通亿力科技有限责任公司 | Text handling method based on electric power dictionary |
CN107577713A (en) * | 2017-08-03 | 2018-01-12 | 国网信通亿力科技有限责任公司 | Text handling method based on electric power dictionary |
CN108519970A (en) * | 2018-02-06 | 2018-09-11 | 平安科技(深圳)有限公司 | The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text |
CN108519970B (en) * | 2018-02-06 | 2021-08-31 | 平安科技(深圳)有限公司 | Method for identifying sensitive information in text, electronic device and readable storage medium |
CN108399165A (en) * | 2018-03-28 | 2018-08-14 | 广东技术师范学院 | A kind of keyword abstraction method based on position weighting |
CN108563636A (en) * | 2018-04-04 | 2018-09-21 | 广州杰赛科技股份有限公司 | Extract method, apparatus, equipment and the storage medium of text key word |
CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of processing method of text data, device and equipment |
CN110032730B (en) * | 2019-02-18 | 2023-09-05 | 创新先进技术有限公司 | Text data processing method, device and equipment |
WO2021051557A1 (en) * | 2019-09-18 | 2021-03-25 | 平安科技(深圳)有限公司 | Semantic recognition-based keyword determination method and apparatus, and storage medium |
CN112069232A (en) * | 2020-09-08 | 2020-12-11 | 中国移动通信集团河北有限公司 | Method and device for inquiring broadband service coverage area |
CN112069232B (en) * | 2020-09-08 | 2023-08-01 | 中国移动通信集团河北有限公司 | Broadband service coverage query method and device |
CN113282752A (en) * | 2021-06-09 | 2021-08-20 | 江苏联著实业股份有限公司 | Object classification method and system based on semantic mapping |
CN113515940A (en) * | 2021-07-14 | 2021-10-19 | 上海芯翌智能科技有限公司 | Method and equipment for text search |
Also Published As
Publication number | Publication date |
---|---|
CN104731797B (en) | 2018-09-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | ||
Address after: Room 810, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080 Patentee after: BEIJING D-MEDIA COMMUNICATION TECHNOLOGY Co.,Ltd. Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A room 602 Patentee before: BEIJING D-MEDIA COMMUNICATION TECHNOLOGY Co.,Ltd. |