CN104731797A - Keyword extracting method and keyword extracting device - Google Patents
- Publication number
- CN104731797A (application CN201310706212.7A)
- Authority
- CN
- China
- Prior art keywords
- participle
- text
- weight
- speech
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a keyword extracting method and a keyword extracting device, and belongs to the field of computers. The keyword extracting method includes: performing word segmentation on a text to obtain the segmented words included in the text; acquiring the part of speech of each segmented word included in the text and acquiring the position of each segmented word in the text; calculating a score for each segmented word according to its part of speech and its position in the text, wherein the score indicates the degree to which the segmented word reflects the main content of the text; and selecting a preset number of segmented words with the highest scores and determining the selected segmented words as keywords. The keyword extracting device comprises a dividing module, an acquiring module, a calculating module and a determining module. With the method and the device, the extracted keywords are not affected by other texts, keyword extracting efficiency is improved, and keyword extracting accuracy is increased.
Description
Technical field
The present invention relates to the field of computers, and in particular to a method and a device for extracting keywords.
Background technology
When a text is long, a user who wants to grasp its gist quickly needs to obtain the main content of the text. Keywords accurately reflect the main content of a text, so keywords need to be extracted from the text.
At present, a method for extracting keywords is provided, specifically: word segmentation is performed on a text to obtain the segmented words the text comprises; for each segmented word the text comprises, the number of times the word occurs in the text is counted, and the count is determined as the word frequency of the word; the number of texts containing the word in the text collection to which the text belongs is counted, and the inverse document frequency of the word is calculated from that count and the total number of texts in the collection; the TF-IDF (Term Frequency-Inverse Document Frequency) value of the word is then calculated from its word frequency and its inverse document frequency. The words are sorted by TF-IDF value, the preset number of words with the largest TF-IDF values are obtained, and the obtained words are determined as keywords.
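For illustration only (not part of this disclosure), the prior-art TF-IDF computation described above can be sketched as follows; the function and variable names are the editor's own, and practical systems usually smooth the inverse document frequency:

```python
import math

def tf_idf(word, text, corpus):
    """TF-IDF of `word` for `text` (a list of segmented words) within
    `corpus` (a list of such texts): word frequency in the text times the
    log of the number of texts over the number of texts containing the word."""
    tf = text.count(word)                             # word frequency in the text
    containing = sum(1 for t in corpus if word in t)  # texts containing the word
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf
```

Sorting all words of a text by this value and keeping the preset number of highest-valued ones reproduces the prior-art extraction step; note that the result changes whenever texts are added to or removed from the collection, which is exactly the drawback identified below.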
In the process of realizing the present invention, the inventors found that the prior art has at least the following problem:
Because the TF-IDF computation depends not only on the text itself but also on the other texts in the text collection to which the text belongs, when those other texts bear little relation to the text, the accuracy of extracting keywords by TF-IDF value is low; moreover, when the number of texts in the collection increases or decreases, the extracted keywords are strongly affected.
Summary of the invention
In order to solve the problems of the prior art, embodiments of the present invention provide a method and a device for extracting keywords. The technical scheme is as follows:
On the one hand, a method for extracting keywords is provided, the method comprising:
performing word segmentation on a text to obtain the segmented words the text comprises;
obtaining the part of speech of each segmented word the text comprises, and obtaining the position of each segmented word in the text;
calculating the score of each segmented word according to the part of speech of each segmented word and the position of each segmented word in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
selecting the preset number of segmented words with the highest scores, and determining the selected segmented words as keywords.
Wherein calculating the score of each segmented word according to the part of speech of each segmented word and the position of each segmented word in the text comprises:
calculating the information entropy of each segmented word the text comprises;
obtaining, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each segmented word according to the position of each segmented word in the text;
calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein calculating the information entropy of each segmented word the text comprises includes:
for each segmented word the text comprises, dividing the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
counting the first frequency with which the segmented word occurs in the first text, and counting the second frequency with which the segmented word occurs in the second text;
calculating the information entropy of the segmented word according to the first frequency and the second frequency.
Further, calculating the information entropy of the segmented word according to the first frequency and the second frequency includes:
calculating the information entropy of the segmented word, according to the first frequency and the second frequency, by the formula E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i);
wherein w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word includes:
calculating the score of each segmented word by the formula f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i);
wherein w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
On the other hand, a device for extracting keywords is provided, the device comprising:
a dividing module, configured to perform word segmentation on a text to obtain the segmented words the text comprises;
an acquisition module, configured to obtain the part of speech of each segmented word the text comprises, and to obtain the position of each segmented word in the text;
a computing module, configured to calculate the score of each segmented word according to its part of speech and its position in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
a determination module, configured to select the preset number of segmented words with the highest scores and determine the selected segmented words as keywords.
Wherein the computing module comprises:
a first computing unit, configured to calculate the information entropy of each segmented word the text comprises;
a first acquiring unit, configured to obtain, according to the part of speech of each segmented word, the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each segmented word according to its position in the text;
a second computing unit, configured to calculate the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein the first computing unit comprises:
a dividing subunit, configured, for each segmented word the text comprises, to divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
a statistics subunit, configured to count the first frequency with which the segmented word occurs in the first text, and to count the second frequency with which the segmented word occurs in the second text;
a computing subunit, configured to calculate the information entropy of the segmented word according to the first frequency and the second frequency.
Further, the computing subunit is specifically configured to:
calculate the information entropy of the segmented word, according to the first frequency and the second frequency, by the formula E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i);
wherein w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein the second computing unit is specifically configured to:
calculate the score of each segmented word, according to the information entropy, the part-of-speech weight and the position weight of each segmented word, by the formula f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i);
wherein w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
Accompanying drawing explanation
In order to illustrate the technical schemes in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting keywords provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a method for extracting keywords provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a device for extracting keywords provided by Embodiment 3 of the present invention.
Embodiment
To make the objects, technical schemes and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment one
An embodiment of the present invention provides a method for extracting keywords. Referring to Fig. 1, the method comprises:
Step 101: perform word segmentation on a text to obtain the segmented words the text comprises;
Step 102: obtain the part of speech of each segmented word the text comprises, and obtain the position of each segmented word in the text;
Step 103: calculate the score of each segmented word according to its part of speech and its position in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
Step 104: select the preset number of segmented words with the highest scores, and determine the selected segmented words as keywords.
Wherein calculating the score of each segmented word according to its part of speech and its position in the text comprises:
calculating the information entropy of each segmented word the text comprises;
obtaining, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each segmented word according to its position in the text;
calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein calculating the information entropy of each segmented word the text comprises includes:
for each segmented word the text comprises, dividing the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
counting the first frequency with which the segmented word occurs in the first text, and counting the second frequency with which the segmented word occurs in the second text;
calculating the information entropy of the segmented word according to its first frequency and second frequency.
Further, calculating the information entropy of the segmented word according to its first frequency and second frequency includes:
calculating the information entropy of the segmented word according to the following formula (1):
E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i)    (1)
wherein, in the above formula (1), w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein calculating the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word includes:
calculating the score of each segmented word according to the following formula (2):
f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i)    (2)
wherein, in the above formula (2), w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
Embodiment two
An embodiment of the present invention provides a method for extracting keywords. Referring to Fig. 2, the method comprises:
Step 201: perform word segmentation on a text to obtain the segmented words the text comprises;
Optionally, after word segmentation is performed on the text, each segmented word the text comprises may be stored in a segmented-word set in the order in which the segmented words appear in the text. The segmented-word set may contain repeated segmented words.
The embodiment of the present invention only performs word segmentation on the text and does not deduplicate the segmented words obtained by the division, so the segmented words obtained may include repetitions.
Word segmentation of a text is prior art and is not described in detail here.
For example, suppose the text is "Everbright Securities abnormal trading was fined". Word segmentation of the text yields the segmented words it comprises: "Everbright", "Securities", "abnormal", "trading", "was" and "fined".
Step 202: obtain the part of speech of each segmented word the text comprises, and obtain the position of each segmented word in the text;
Parts of speech include nouns, verbs, adjectives, prepositions, conjunctions and so on. Since nouns and verbs are more likely to be keywords of a text, and words of other parts of speech are less likely, the part of speech of a segmented word has a large influence on the extracted keywords, so the embodiment of the present invention needs to distinguish the part of speech of each segmented word.
The obtained position of each segmented word in the text may be the paragraph in which the segmented word appears, or the position of the segmented word within its paragraph.
A text comprises one or more paragraphs, and the paragraphs differ in importance; for example, the first and last paragraphs of a text generally reflect the main content of the text more strongly than the other paragraphs do. Likewise, different positions within a paragraph differ in importance; for example, the beginning of a paragraph reflects the main content of the text more strongly than the middle or the end of the paragraph does. The position of a segmented word in the text therefore has a considerable influence on the extracted keywords, and the position of each segmented word in the text needs to be obtained.
Step 203: calculate the information entropy of each segmented word the text comprises;
Specifically, this step can be realized by the following steps (1)-(3):
(1) For each segmented word the text comprises, divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it.
For example, if the segmented word is "abnormal", the text "Everbright Securities abnormal trading was fined" is divided according to the position of "abnormal" into the first text "Everbright Securities abnormal" and the second text "abnormal trading was fined".
(2) Count the first frequency with which the segmented word occurs in the first text, and count the second frequency with which the segmented word occurs in the second text.
Specifically, count the number of segmented words the first text comprises and the number of times the segmented word occurs in the first text, and divide the number of occurrences by the number of segmented words in the first text to obtain the first frequency; likewise, count the number of segmented words the second text comprises and the number of times the segmented word occurs in the second text, and divide the number of occurrences by the number of segmented words in the second text to obtain the second frequency.
Since both the first text and the second text contain the segmented word itself, the first frequency and the second frequency counted in this way are guaranteed to be nonzero.
(3) Calculate the information entropy of the segmented word according to its first frequency and second frequency.
Specifically, calculate the information entropy of the segmented word according to the following formula (1):
E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i)    (1)
wherein, in the above formula (1), w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, p_2(w_i) is the second frequency of the segmented word w_i, and log_2 is the base-2 logarithm.
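For illustration only, steps (1)-(3) and formula (1) can be sketched in Python as follows; the function and variable names are the editor's assumptions, not the patent's:

```python
import math

def split_word_entropy(words, i):
    """Information entropy of the i-th segmented word (0-based) of a text,
    per formula (1): the text is split at the word's position into a first
    text (the word and everything before it) and a second text (the word and
    everything after it); p1 and p2 are the word's relative frequencies in
    the two parts."""
    word = words[i]
    first, second = words[:i + 1], words[i:]   # both parts include the word itself
    p1 = first.count(word) / len(first)        # first frequency, never 0
    p2 = second.count(word) / len(second)      # second frequency, never 0
    return -p1 * math.log2(p1) - p2 * math.log2(p2)
```

For the example above, the word "abnormal" at index 2 of the six segmented words gets p_1 = 1/3 and p_2 = 1/4, and both frequencies are nonzero because each part contains the word itself.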
Step 204: according to the part of speech of each segmented word, obtain the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
A corresponding part-of-speech weight is set for each part of speech in advance, and each part of speech and its corresponding weight are stored in the correspondence between parts of speech and part-of-speech weights.
All the part-of-speech weights sum to 1.
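For illustration only, the lookup of step 204 may be realized with a stored table such as the following; the concrete weight values are hypothetical, the patent requiring only that the correspondence is set in advance and that all part-of-speech weights sum to 1:

```python
# Hypothetical part-of-speech weights (nouns and verbs weigh most, as they
# are the likeliest keywords); the values sum to 1 as the embodiment requires.
POS_WEIGHTS = {"noun": 0.40, "verb": 0.30, "adjective": 0.15,
               "preposition": 0.05, "conjunction": 0.05, "other": 0.05}

def pos_weight(part_of_speech):
    """Look up the part-of-speech weight, falling back to 'other'."""
    return POS_WEIGHTS.get(part_of_speech, POS_WEIGHTS["other"])
```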
Step 205: obtain the position weight of each segmented word according to its position in the text;
Specifically, according to the position of each segmented word in the text, determine the position range in which the segmented word falls, and obtain the corresponding position weight from the stored correspondence between position ranges and position weights.
The text is divided by position in advance to obtain multiple position ranges; a corresponding position weight is set for each position range, and each position range and its corresponding weight are stored in the correspondence between position ranges and position weights.
It should be added that, since texts fall into categories such as narrative, expository, argumentative, lyrical and practical writing, and the importance of a word's position differs between categories, different position weights can be set for each position range of each category of text, thereby improving the accuracy of extracting keywords from the text.
The embodiment of the present invention may also set different position ranges for texts of different categories.
The user may modify the position ranges of a text according to its category, and may modify the position weight of each position range.
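For illustration only, the position-weight lookup of step 205 can be sketched as follows; the position ranges and weight values are hypothetical, since the embodiment leaves them to be set in advance (and possibly per text category):

```python
# Hypothetical position ranges over the relative position of a word in the
# text, with the opening and closing weighted higher than the middle.
POSITION_WEIGHTS = [
    ((0.0, 0.1), 0.5),   # beginning of the text
    ((0.1, 0.9), 0.2),   # middle of the text
    ((0.9, 1.0), 0.3),   # end of the text
]

def position_weight(index, total_words):
    """Map the index of a segmented word to the weight of its position range."""
    rel = index / total_words
    for (low, high), weight in POSITION_WEIGHTS:
        if low <= rel < high or (high == 1.0 and rel == 1.0):
            return weight
    return 0.0
```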
Step 206: calculate the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word;
Specifically, calculate the score of each segmented word according to the following formula (2):
f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i)    (2)
wherein, in the above formula (2), w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
Since the score of a segmented word depends on its information entropy, part-of-speech weight and position weight, and these three quantities reflect the importance of the segmented word to different degrees, the embodiment of the present invention sets a weight for each of them, and the weight of the information entropy, the weight of the part-of-speech weight and the weight of the position weight sum to 1.
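For illustration only, formula (2) can be written directly in code; the values of c_1, c_2 and c_3 below are hypothetical, the embodiment requiring only that they sum to 1:

```python
# Hypothetical weights of the entropy, position and part-of-speech terms;
# they sum to 1 as step 206 requires.
C1, C2, C3 = 0.4, 0.3, 0.3

def score(entropy, position_w, speech_w):
    """Formula (2): f(w_i) = c1*E(w_i) + c2*pos(w_i) + c3*t(w_i)."""
    return C1 * entropy + C2 * position_w + C3 * speech_w
```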
Step 207: select the preset number of segmented words with the highest scores, and determine the selected segmented words as keywords.
Alternatively, segmented words whose scores are greater than a preset threshold may be selected and determined as keywords.
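For illustration only, step 207 and its threshold variant can be sketched as follows (the helper name is the editor's own):

```python
import heapq

def select_keywords(scores, preset_count=5, threshold=None):
    """scores maps each segmented word to its score. Take the preset number
    of highest-scoring words (step 207), or, alternatively, every word whose
    score exceeds the threshold."""
    if threshold is not None:
        return [word for word, s in scores.items() if s > threshold]
    return heapq.nlargest(preset_count, scores, key=scores.get)
```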
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
Embodiment three
Referring to Fig. 3, an embodiment of the present invention provides a device for extracting keywords, the device comprising:
a dividing module 301, configured to perform word segmentation on a text to obtain the segmented words the text comprises;
an acquisition module 302, configured to obtain the part of speech of each segmented word the text comprises, and to obtain the position of each segmented word in the text;
a computing module 303, configured to calculate the score of each segmented word according to its part of speech and its position in the text, the score indicating the degree to which a segmented word reflects the main content of the text;
a determination module 304, configured to select the preset number of segmented words with the highest scores and determine the selected segmented words as keywords.
Wherein the computing module 303 comprises:
a first computing unit, configured to calculate the information entropy of each segmented word the text comprises;
a first acquiring unit, configured to obtain, according to the part of speech of each segmented word, the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each segmented word according to its position in the text;
a second computing unit, configured to calculate the score of each segmented word according to the information entropy, the part-of-speech weight and the position weight of each segmented word.
Wherein the first computing unit comprises:
a dividing subunit, configured, for each segmented word the text comprises, to divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text being the segmented word and the text before it, and the second text being the segmented word and the text after it;
a statistics subunit, configured to count the first frequency with which the segmented word occurs in the first text, and to count the second frequency with which the segmented word occurs in the second text;
a computing subunit, configured to calculate the information entropy of the segmented word according to its first frequency and second frequency.
Further, the computing subunit is specifically configured to:
calculate the information entropy of the segmented word, according to its first frequency and second frequency, by the following formula (1):
E(w_i) = -p_1(w_i) * log_2 p_1(w_i) - p_2(w_i) * log_2 p_2(w_i)    (1)
wherein, in the above formula (1), w_i is the i-th segmented word the text comprises, E(w_i) is the information entropy of the segmented word w_i, p_1(w_i) is the first frequency of the segmented word w_i, and p_2(w_i) is the second frequency of the segmented word w_i.
Wherein the second computing unit is specifically configured to:
calculate the score of each segmented word, according to the information entropy, the part-of-speech weight and the position weight of each segmented word, by the following formula (2):
f(w_i) = c_1 * E(w_i) + c_2 * pos(w_i) + c_3 * t(w_i)    (2)
wherein, in the above formula (2), w_i is the i-th segmented word the text comprises, f(w_i) is the score of the segmented word w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of the segmented word w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of the segmented word w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of the segmented word w_i.
In the embodiments of the present invention, the part of speech of each segmented word the text comprises is obtained, together with the position of each segmented word in the text; the score of each segmented word is calculated according to its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected and determined as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the segmented word and the position of the segmented word in the text, and has nothing to do with other texts, the extracted keywords are not affected by other texts, which improves the effectiveness of keyword extraction and in turn the accuracy of the extracted keywords.
It should be noted that: the device of the extraction keyword that above-described embodiment provides is when extracting keyword, only be illustrated with the division of above-mentioned each functional module, in practical application, can distribute as required and by above-mentioned functions and be completed by different functional modules, inner structure by device is divided into different functional modules, to complete all or part of function described above.In addition, the device of the extraction keyword that above-described embodiment provides belongs to same design with the embodiment of the method extracting keyword, and its specific implementation process refers to embodiment of the method, repeats no more here.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
One of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A method for extracting keywords, characterized in that the method comprises:
performing word segmentation on a text to obtain the participles the text comprises;
obtaining the part of speech of each participle the text comprises, and obtaining the position of each participle in the text;
calculating the score of each participle according to the part of speech of each participle and the position of each participle in the text, the score indicating the degree to which a participle reflects the main content of the text;
selecting a preset number of participles with the highest scores, and determining the selected participles as keywords.
2. the method for claim 1, is characterized in that, the described part of speech according to described each participle and the described position of each participle in described text, calculate the mark of described each participle, comprising:
Calculate the information entropy of each participle that described text comprises;
According to the part of speech of described each participle, obtain corresponding part of speech weight from the part of speech stored with the corresponding relation of part of speech weight;
According to the described position of each participle in described text, obtain the position weight of described each participle;
According to the position weight of the information entropy of described each participle, the part of speech weight of described each participle and described each participle, calculate the mark of described each participle.
3. The method of claim 2, characterized in that calculating the information entropy of each participle the text comprises comprises:
for each participle the text comprises, dividing the text into a first text and a second text according to the position of the participle in the text, the first text being the participle together with the text before the participle, and the second text being the participle together with the text after the participle;
counting a first frequency with which the participle occurs in the first text, and counting a second frequency with which the participle occurs in the second text;
calculating the information entropy of the participle according to the first frequency and the second frequency.
4. The method of claim 3, characterized in that calculating the information entropy of the participle according to the first frequency and the second frequency comprises:
calculating, according to the first frequency and the second frequency, the information entropy of the participle by the formula E(w_i) = -p_1(w_i) log_2 p_1(w_i) - p_2(w_i) log_2 p_2(w_i);
wherein w_i is the i-th participle the text comprises, E(w_i) is the information entropy of participle w_i, p_1(w_i) is the first frequency of participle w_i, and p_2(w_i) is the second frequency of participle w_i.
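The entropy computation of claims 3 and 4 can be sketched as below. One hedge: the claims say "frequency" without specifying normalization, so treating the first and second frequencies as relative frequencies within each part is an assumption of this sketch.

```python
import math

def participle_entropy(tokens, i):
    # Claims 3-4: split the token sequence at position i into a first text
    # (tokens up to and including the participle) and a second text (the
    # participle and everything after it), then combine the participle's
    # frequency in each part as E = -p1*log2(p1) - p2*log2(p2).
    # Treating "frequency" as relative frequency is an assumption here.
    w = tokens[i]
    first, second = tokens[:i + 1], tokens[i:]
    p1 = first.count(w) / len(first)
    p2 = second.count(w) / len(second)
    # p1 and p2 are always > 0 because w occurs at position i in both
    # parts, and log2(1) == 0, so no special cases are needed.
    return -p1 * math.log2(p1) - p2 * math.log2(p2)
```

For example, for the second "a" in the sequence a b a c, the first text is a b a (p1 = 2/3) and the second text is a c (p2 = 1/2).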
5. The method of claim 2, characterized in that calculating the score of each participle according to the information entropy, the part-of-speech weight, and the position weight of each participle comprises:
calculating, according to the information entropy, the part-of-speech weight, and the position weight of each participle, the score of each participle by the formula f(w_i) = c_1*E(w_i) + c_2*pos(w_i) + c_3*t(w_i);
wherein w_i is the i-th participle the text comprises, f(w_i) is the score of participle w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of participle w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of participle w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of participle w_i.
6. A device for extracting keywords, characterized in that the device comprises:
a dividing module, configured to perform word segmentation on a text to obtain the participles the text comprises;
an acquisition module, configured to obtain the part of speech of each participle the text comprises, and to obtain the position of each participle in the text;
a computing module, configured to calculate the score of each participle according to the part of speech of each participle and the position of each participle in the text, the score indicating the degree to which a participle reflects the main content of the text;
a determination module, configured to select a preset number of participles with the highest scores, and to determine the selected participles as keywords.
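The module decomposition of claim 6 can be sketched as a single class whose methods mirror the dividing, acquisition, computing, and determination modules. The tokenizer, tagger, entropy, and position-weight callables below are hypothetical stand-ins, since the patent does not prescribe concrete implementations for them.

```python
class KeywordExtractor:
    # Sketch of the device of claim 6; comments mark which module each
    # step corresponds to. Weights c1..c3 are illustrative.

    def __init__(self, pos_weights, c1=0.5, c2=0.3, c3=0.2):
        self.pos_weights = pos_weights  # stored part-of-speech -> weight map (claim 7)
        self.c1, self.c2, self.c3 = c1, c2, c3

    def divide(self, text):
        # Dividing module: whitespace splitting stands in for a real
        # Chinese word segmenter.
        return text.split()

    def extract(self, text, tag, entropy, position_weight, n):
        tokens = self.divide(text)                       # dividing module
        scores = {}
        for i, w in enumerate(tokens):
            t = self.pos_weights.get(tag(w), 0.0)        # acquisition module
            f = (self.c1 * entropy(tokens, i)
                 + self.c2 * position_weight(i, len(tokens))
                 + self.c3 * t)                          # computing module
            scores[w] = max(scores.get(w, f), f)
        # Determination module: the n highest-scoring participles.
        return sorted(scores, key=scores.get, reverse=True)[:n]
```

Keeping the four steps as separate methods and callables reflects the note in the description that the functions may be reassigned to different modules as needed.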
7. The device of claim 6, characterized in that the computing module comprises:
a first computing unit, configured to calculate the information entropy of each participle the text comprises;
a first acquiring unit, configured to obtain, according to the part of speech of each participle, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each participle according to the position of each participle in the text;
a second computing unit, configured to calculate the score of each participle according to the information entropy, the part-of-speech weight, and the position weight of each participle.
8. The device of claim 7, characterized in that the first computing unit comprises:
a dividing subunit, configured to divide, for each participle the text comprises, the text into a first text and a second text according to the position of the participle in the text, the first text being the participle together with the text before the participle, and the second text being the participle together with the text after the participle;
a statistics subunit, configured to count a first frequency with which the participle occurs in the first text, and to count a second frequency with which the participle occurs in the second text;
a computation subunit, configured to calculate the information entropy of the participle according to the first frequency and the second frequency.
9. The device of claim 8, characterized in that the computation subunit is specifically configured to:
calculate, according to the first frequency and the second frequency, the information entropy of the participle by the formula E(w_i) = -p_1(w_i) log_2 p_1(w_i) - p_2(w_i) log_2 p_2(w_i);
wherein w_i is the i-th participle the text comprises, E(w_i) is the information entropy of participle w_i, p_1(w_i) is the first frequency of participle w_i, and p_2(w_i) is the second frequency of participle w_i.
10. The device of claim 7, characterized in that the second computing unit is specifically configured to:
calculate, according to the information entropy, the part-of-speech weight, and the position weight of each participle, the score of each participle by the formula f(w_i) = c_1*E(w_i) + c_2*pos(w_i) + c_3*t(w_i);
wherein w_i is the i-th participle the text comprises, f(w_i) is the score of participle w_i, c_1 is the weight of the information entropy, E(w_i) is the information entropy of participle w_i, c_2 is the weight of the position weight, pos(w_i) is the position weight of participle w_i, c_3 is the weight of the part-of-speech weight, and t(w_i) is the part-of-speech weight of participle w_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310706212.7A CN104731797B (en) | 2013-12-19 | 2013-12-19 | A kind of method and device of extraction keyword |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104731797A true CN104731797A (en) | 2015-06-24 |
CN104731797B CN104731797B (en) | 2018-09-18 |
Family
ID=53455694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310706212.7A Active CN104731797B (en) | 2013-12-19 | 2013-12-19 | A kind of method and device of extraction keyword |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104731797B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110264655A1 (en) * | 2010-04-22 | 2011-10-27 | Microsoft Corporation | Location context mining |
CN103186662A (en) * | 2012-12-28 | 2013-07-03 | 中联竞成(北京)科技有限公司 | System and method for extracting dynamic public sentiment keywords |
Non-Patent Citations (2)
Title |
---|
张红鹰: "Keyword Extraction Methods for Chinese Text", Computer Systems & Applications (《计算机系统应用》) * |
蒋效宇: "Automatic Summarization Algorithm Based on Keyword Extraction", Computer Engineering (《计算机工程》) * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933197A (en) * | 2015-07-13 | 2015-09-23 | 北京天天卓越科技有限公司 | Method and terminal equipment for determining keywords |
CN105373528A (en) * | 2015-08-18 | 2016-03-02 | 新华网股份有限公司 | Method and device for analyzing sensitivity of text contents |
CN105373528B (en) * | 2015-08-18 | 2019-03-12 | 新华网股份有限公司 | A kind of text content sensitive analysis method and device |
CN106557508A (en) * | 2015-09-28 | 2017-04-05 | 北京神州泰岳软件股份有限公司 | A kind of text key word extracting method and device |
WO2017084267A1 (en) * | 2015-11-18 | 2017-05-26 | 乐视控股(北京)有限公司 | Method and device for keyphrase extraction |
WO2018027463A1 (en) * | 2016-08-08 | 2018-02-15 | 深圳市博信诺达经贸咨询有限公司 | Application method and system for keyword analysis in big data |
CN106325688A (en) * | 2016-08-17 | 2017-01-11 | 北京锤子数码科技有限公司 | Text processing method and device |
CN106484266A (en) * | 2016-10-18 | 2017-03-08 | 北京锤子数码科技有限公司 | A kind of text handling method and device |
US10489047B2 (en) | 2016-10-18 | 2019-11-26 | Beijing Bytedance Network Technology Co Ltd. | Text processing method and device |
CN111381751A (en) * | 2016-10-18 | 2020-07-07 | 北京字节跳动网络技术有限公司 | Text processing method and device |
CN107665189A (en) * | 2017-06-16 | 2018-02-06 | 平安科技(深圳)有限公司 | A kind of method, terminal and equipment for extracting centre word |
CN107665189B (en) * | 2017-06-16 | 2019-12-13 | 平安科技(深圳)有限公司 | method, terminal and equipment for extracting central word |
CN107577713B (en) * | 2017-08-03 | 2018-09-11 | 国网信通亿力科技有限责任公司 | Text handling method based on electric power dictionary |
CN107577713A (en) * | 2017-08-03 | 2018-01-12 | 国网信通亿力科技有限责任公司 | Text handling method based on electric power dictionary |
CN108519970A (en) * | 2018-02-06 | 2018-09-11 | 平安科技(深圳)有限公司 | The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text |
CN108519970B (en) * | 2018-02-06 | 2021-08-31 | 平安科技(深圳)有限公司 | Method for identifying sensitive information in text, electronic device and readable storage medium |
CN108399165A (en) * | 2018-03-28 | 2018-08-14 | 广东技术师范学院 | A kind of keyword abstraction method based on position weighting |
CN108563636A (en) * | 2018-04-04 | 2018-09-21 | 广州杰赛科技股份有限公司 | Extract method, apparatus, equipment and the storage medium of text key word |
CN110032730A (en) * | 2019-02-18 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of processing method of text data, device and equipment |
CN110032730B (en) * | 2019-02-18 | 2023-09-05 | 创新先进技术有限公司 | Text data processing method, device and equipment |
WO2021051557A1 (en) * | 2019-09-18 | 2021-03-25 | 平安科技(深圳)有限公司 | Semantic recognition-based keyword determination method and apparatus, and storage medium |
CN112069232A (en) * | 2020-09-08 | 2020-12-11 | 中国移动通信集团河北有限公司 | Method and device for inquiring broadband service coverage area |
CN112069232B (en) * | 2020-09-08 | 2023-08-01 | 中国移动通信集团河北有限公司 | Broadband service coverage query method and device |
CN113282752A (en) * | 2021-06-09 | 2021-08-20 | 江苏联著实业股份有限公司 | Object classification method and system based on semantic mapping |
CN113515940A (en) * | 2021-07-14 | 2021-10-19 | 上海芯翌智能科技有限公司 | Method and equipment for text search |
Also Published As
Publication number | Publication date |
---|---|
CN104731797B (en) | 2018-09-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | ||
Address after: Room 810, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080 Patentee after: BEIJING D-MEDIA COMMUNICATION TECHNOLOGY Co.,Ltd. Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A room 602 Patentee before: BEIJING D-MEDIA COMMUNICATION TECHNOLOGY Co.,Ltd. |