CN104731797A - Keyword extracting method and keyword extracting device - Google Patents


Publication number
CN104731797A
Authority
CN
China
Prior art keywords
participle
text
weight
speech
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310706212.7A
Other languages
Chinese (zh)
Other versions
CN104731797B (en)
Inventor
吴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Feinno Communication Technology Co Ltd
Original Assignee
Beijing Feinno Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Feinno Communication Technology Co Ltd filed Critical Beijing Feinno Communication Technology Co Ltd
Priority to CN201310706212.7A priority Critical patent/CN104731797B/en
Publication of CN104731797A publication Critical patent/CN104731797A/en
Application granted granted Critical
Publication of CN104731797B publication Critical patent/CN104731797B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction method and a keyword extraction device, belonging to the field of computers. The method comprises: segmenting a text to obtain the segmented words it contains; obtaining the part of speech of each segmented word and its position in the text; calculating a score for each segmented word from its part of speech and its position in the text, the score indicating how strongly the segmented word reflects the main content of the text; and selecting a preset number of segmented words with the highest scores as keywords. The device comprises a dividing module, an acquiring module, a calculating module, and a determining module. Because the extracted keywords are not affected by other texts, the efficiency and accuracy of keyword extraction are improved.

Description

A method and device for extracting keywords
Technical field
The present invention relates to the field of computers, and in particular to a method and device for extracting keywords.
Background
When a text is long, a user who wants to grasp its gist quickly needs to obtain its main content. Keywords reflect the main content of a text, so keywords need to be extracted from the text.
One existing method of extracting keywords works as follows. The text is segmented to obtain the segmented words it contains. For each segmented word, the number of times the word occurs in the text is counted and taken as the word's term frequency. The number of texts that contain the word is then counted within the text collection to which the text belongs, and the word's inverse document frequency is calculated from that count and the total number of texts in the collection. From the term frequency and the inverse document frequency, the word's TF-IDF (Term Frequency-Inverse Document Frequency) value is calculated. The words are then sorted by TF-IDF value, and the preset number of words with the largest TF-IDF values are taken as keywords.
In the course of making the present invention, the inventor found that the prior art has at least the following problem:
The TF-IDF calculation depends not only on the text itself but also on the other texts in the collection to which it belongs. When those other texts have little relation to the text, keywords extracted by TF-IDF value are less accurate; and when the number of texts in the collection grows or shrinks, the extracted keywords can change considerably.
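The dependence on the surrounding collection can be illustrated with a small sketch. It assumes pre-segmented texts and the common IDF form log(N / df); the source does not state an IDF formula, so that form, the function name `tf_idf_keywords`, and the sample data are all illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf_keywords(doc, collection, top_n):
    """Rank the segmented words of `doc` by TF-IDF against `collection`.

    `doc` is a list of segmented words; `collection` is a list of such
    lists and is assumed to contain `doc` itself. The IDF form
    log(N / df) is an assumption, not taken from the source.
    """
    tf = Counter(doc)  # term frequency: occurrence count within the text
    n_docs = len(collection)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of texts in the collection containing the word.
        df = sum(1 for other in collection if word in other)
        scores[word] = count * math.log(n_docs / df)
    ranked = sorted(scores, key=lambda w: scores[w], reverse=True)
    return ranked[:top_n]


doc = ["stock", "fine", "stock", "market"]
related = [doc, ["stock", "market", "rally"]]    # collection on a similar topic
unrelated = [doc, ["recipe", "soup", "salt"]]    # collection on a different topic
print(tf_idf_keywords(doc, related, 1))
print(tf_idf_keywords(doc, unrelated, 1))
```

The same text yields a different top keyword depending only on which other texts share its collection, which is the instability described above.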
Summary of the invention
To solve the problem of the prior art, embodiments of the present invention provide a method and device for extracting keywords. The technical solution is as follows:
In one aspect, a method of extracting keywords is provided, the method comprising:
segmenting a text to obtain the segmented words contained in the text;
obtaining the part of speech of each segmented word contained in the text, and obtaining the position of each segmented word in the text;
calculating a score for each segmented word according to its part of speech and its position in the text, the score indicating how strongly the segmented word reflects the main content of the text;
selecting the preset number of segmented words with the highest scores, and determining the selected segmented words as keywords.
Calculating the score of each segmented word according to its part of speech and its position in the text comprises:
calculating the information entropy of each segmented word contained in the text;
obtaining, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each segmented word according to its position in the text;
calculating the score of each segmented word according to its information entropy, its part-of-speech weight, and its position weight.
Calculating the information entropy of each segmented word contained in the text comprises:
for each segmented word contained in the text, dividing the text, according to the position of the segmented word in the text, into a first text and a second text, the first text consisting of the segmented word and the text before it, and the second text consisting of the segmented word and the text after it;
counting a first frequency with which the segmented word occurs in the first text, and a second frequency with which the segmented word occurs in the second text;
calculating the information entropy of the segmented word according to the first frequency and the second frequency.
Further, calculating the information entropy of the segmented word according to the first frequency and the second frequency comprises:
calculating the information entropy of the segmented word from the first frequency and the second frequency according to the formula $E(w_i) = -p_1(w_i)\log_2 p_1(w_i) - p_2(w_i)\log_2 p_2(w_i)$;
where $w_i$ is the i-th segmented word contained in the text, $E(w_i)$ is the information entropy of $w_i$, $p_1(w_i)$ is the first frequency of $w_i$, and $p_2(w_i)$ is the second frequency of $w_i$.
Calculating the score of each segmented word according to its information entropy, its part-of-speech weight, and its position weight comprises:
calculating the score of each segmented word according to the formula $f(w_i) = c_1 E(w_i) + c_2\,\mathrm{pos}(w_i) + c_3\,t(w_i)$;
where $w_i$ is the i-th segmented word contained in the text, $f(w_i)$ is the score of $w_i$, $c_1$ is the weight of the information entropy, $E(w_i)$ is the information entropy of $w_i$, $c_2$ is the weight of the position weight, $\mathrm{pos}(w_i)$ is the position weight of $w_i$, $c_3$ is the weight of the part-of-speech weight, and $t(w_i)$ is the part-of-speech weight of $w_i$.
In another aspect, a device for extracting keywords is provided, the device comprising:
a dividing module, configured to segment a text to obtain the segmented words contained in the text;
an acquiring module, configured to obtain the part of speech of each segmented word contained in the text, and to obtain the position of each segmented word in the text;
a calculating module, configured to calculate a score for each segmented word according to its part of speech and its position in the text, the score indicating how strongly the segmented word reflects the main content of the text;
a determining module, configured to select the preset number of segmented words with the highest scores and determine the selected segmented words as keywords.
The calculating module comprises:
a first calculating unit, configured to calculate the information entropy of each segmented word contained in the text;
a first acquiring unit, configured to obtain, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each segmented word according to its position in the text;
a second calculating unit, configured to calculate the score of each segmented word according to its information entropy, its part-of-speech weight, and its position weight.
The first calculating unit comprises:
a dividing subunit, configured to divide the text, for each segmented word contained in the text and according to the position of the segmented word in the text, into a first text and a second text, the first text consisting of the segmented word and the text before it, and the second text consisting of the segmented word and the text after it;
a counting subunit, configured to count a first frequency with which the segmented word occurs in the first text, and a second frequency with which the segmented word occurs in the second text;
a calculating subunit, configured to calculate the information entropy of the segmented word according to the first frequency and the second frequency.
Further, the calculating subunit is specifically configured to:
calculate the information entropy of the segmented word from the first frequency and the second frequency according to the formula $E(w_i) = -p_1(w_i)\log_2 p_1(w_i) - p_2(w_i)\log_2 p_2(w_i)$;
where $w_i$ is the i-th segmented word contained in the text, $E(w_i)$ is the information entropy of $w_i$, $p_1(w_i)$ is the first frequency of $w_i$, and $p_2(w_i)$ is the second frequency of $w_i$.
The second calculating unit is specifically configured to:
calculate the score of each segmented word from its information entropy, its part-of-speech weight, and its position weight according to the formula $f(w_i) = c_1 E(w_i) + c_2\,\mathrm{pos}(w_i) + c_3\,t(w_i)$;
where $w_i$ is the i-th segmented word contained in the text, $f(w_i)$ is the score of $w_i$, $c_1$ is the weight of the information entropy, $E(w_i)$ is the information entropy of $w_i$, $c_2$ is the weight of the position weight, $\mathrm{pos}(w_i)$ is the position weight of $w_i$, $c_3$ is the weight of the part-of-speech weight, and $t(w_i)$ is the part-of-speech weight of $w_i$.
In the embodiments of the present invention, the part of speech of each segmented word contained in the text and its position in the text are obtained; a score is calculated for each segmented word from its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the word, and the position of the word in the text, and not on any other text, the extracted keywords are unaffected by other texts, which improves the efficiency of keyword extraction and, in turn, its accuracy.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method of extracting keywords provided by Embodiment One of the present invention;
Fig. 2 is a flowchart of a method of extracting keywords provided by Embodiment Two of the present invention;
Fig. 3 is a schematic structural diagram of a device for extracting keywords provided by Embodiment Three of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment One
This embodiment of the present invention provides a method of extracting keywords. Referring to Fig. 1, the method comprises:
Step 101: segment a text to obtain the segmented words contained in the text;
Step 102: obtain the part of speech of each segmented word contained in the text, and obtain the position of each segmented word in the text;
Step 103: calculate a score for each segmented word according to its part of speech and its position in the text, the score indicating how strongly the segmented word reflects the main content of the text;
Step 104: select the preset number of segmented words with the highest scores, and determine the selected segmented words as keywords.
Calculating the score of each segmented word according to its part of speech and its position in the text comprises:
calculating the information entropy of each segmented word contained in the text;
obtaining, according to the part of speech of each segmented word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each segmented word according to its position in the text;
calculating the score of each segmented word according to its information entropy, its part-of-speech weight, and its position weight.
Calculating the information entropy of each segmented word contained in the text comprises:
for each segmented word contained in the text, dividing the text, according to the position of the segmented word in the text, into a first text and a second text, the first text consisting of the segmented word and the text before it, and the second text consisting of the segmented word and the text after it;
counting a first frequency with which the segmented word occurs in the first text, and a second frequency with which the segmented word occurs in the second text;
calculating the information entropy of the segmented word according to its first frequency and its second frequency.
Further, calculating the information entropy of the segmented word according to its first frequency and its second frequency comprises:
calculating the information entropy of the segmented word from its first frequency and its second frequency according to the following formula (1);
$E(w_i) = -p_1(w_i)\log_2 p_1(w_i) - p_2(w_i)\log_2 p_2(w_i)$  (1)
In formula (1), $w_i$ is the i-th segmented word contained in the text, $E(w_i)$ is the information entropy of $w_i$, $p_1(w_i)$ is the first frequency of $w_i$, and $p_2(w_i)$ is the second frequency of $w_i$.
Calculating the score of each segmented word according to its information entropy, its part-of-speech weight, and its position weight comprises:
calculating the score of each segmented word from its information entropy, its part-of-speech weight, and its position weight according to the following formula (2);
$f(w_i) = c_1 E(w_i) + c_2\,\mathrm{pos}(w_i) + c_3\,t(w_i)$  (2)
In formula (2), $w_i$ is the i-th segmented word contained in the text, $f(w_i)$ is the score of $w_i$, $c_1$ is the weight of the information entropy, $E(w_i)$ is the information entropy of $w_i$, $c_2$ is the weight of the position weight, $\mathrm{pos}(w_i)$ is the position weight of $w_i$, $c_3$ is the weight of the part-of-speech weight, and $t(w_i)$ is the part-of-speech weight of $w_i$.
In this embodiment of the present invention, the part of speech of each segmented word contained in the text and its position in the text are obtained; a score is calculated for each segmented word from its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the word, and the position of the word in the text, and not on any other text, the extracted keywords are unaffected by other texts, which improves the efficiency of keyword extraction and, in turn, its accuracy.
Embodiment Two
This embodiment of the present invention provides a method of extracting keywords. Referring to Fig. 2, the method comprises:
Step 201: segment a text to obtain the segmented words contained in the text;
Optionally, after the text is segmented, the segmented words may be stored in a segmented-word set in the order in which they appear in the text. The set may contain repeated segmented words.
This embodiment only segments the text and does not deduplicate the resulting segmented words, so the segmented words obtained may include repeats.
Segmenting a text is prior art and is not described further here.
For example, suppose the text is "Everbright Securities fined for abnormal trading". Segmenting the text yields, in the original word order, the segmented words "Everbright", "Securities", "abnormal", "trading", "was" (the passive marker), "imposed", and "fine".
Step 202: obtain the part of speech of each segmented word contained in the text, and obtain the position of each segmented word in the text;
Parts of speech include noun, verb, adjective, preposition, conjunction, and so on. Nouns and verbs are more likely to be keywords of a text than words of other parts of speech, so the part of speech of a segmented word has a large influence on the keywords extracted, and this embodiment needs to distinguish the part of speech of each segmented word.
The position of a segmented word in the text may be the paragraph in which the segmented word is located, or the position of the segmented word within its paragraph.
A text contains one or more paragraphs, and the paragraphs differ in importance: for example, the first and last paragraphs of a text reflect its main content more strongly than its other paragraphs do. Positions within a paragraph also differ in importance: for example, the beginning of a paragraph reflects the main content of the text more strongly than the middle or end of the paragraph. The position of a segmented word in the text therefore has a considerable influence on the keywords extracted, so the position of each segmented word in the text needs to be obtained.
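Both notions of position described above (the paragraph a segmented word falls in, and its place within that paragraph) can be recorded in one pass. The following is a minimal sketch, assuming the text is already split into paragraphs and each paragraph into segmented words; the function name `segment_positions` and the tuple layout are hypothetical.

```python
def segment_positions(paragraphs):
    """Return (word, paragraph_index, index_in_paragraph) for every segmented word.

    `paragraphs` is a list of paragraphs, each given as a list of
    segmented words in reading order. Repeated words are kept, each
    with its own position.
    """
    positions = []
    for paragraph_index, paragraph in enumerate(paragraphs):
        for word_index, word in enumerate(paragraph):
            positions.append((word, paragraph_index, word_index))
    return positions


paragraphs = [["market", "opens"], ["market", "closes", "lower"]]
positions = segment_positions(paragraphs)
print(positions)
```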
Step 203: calculate the information entropy of each segmented word contained in the text;
Specifically, this step can be implemented through the following sub-steps (1) to (3):
(1) For each segmented word contained in the text, divide the text, according to the position of the segmented word in the text, into a first text and a second text, the first text consisting of the segmented word and the text before it, and the second text consisting of the segmented word and the text after it;
For example, if the segmented word is "abnormal", the text "Everbright Securities fined for abnormal trading" is divided at "abnormal" into a first text, "Everbright Securities abnormal", and a second text consisting of "abnormal" and the segmented words that follow it, "abnormal trading was imposed fine" (given in the original word order).
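The division in sub-step (1) amounts to two overlapping slices that both include the segmented word itself. A minimal sketch, assuming the text is held as a list of segmented words; the function name `split_at` and the English glosses of the example segments are illustrative assumptions.

```python
def split_at(segments, index):
    """Split `segments` around the segmented word at `index`.

    The first text is the word plus everything before it; the second
    text is the word plus everything after it, so the word appears in both.
    """
    first = segments[: index + 1]
    second = segments[index:]
    return first, second


# English glosses of the example segments (an assumption about the translation).
segments = ["Everbright", "Securities", "abnormal", "trading", "was", "imposed", "fine"]
first, second = split_at(segments, segments.index("abnormal"))
print(first)
print(second)
```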
(2) Count the first frequency with which the segmented word occurs in the first text, and the second frequency with which the segmented word occurs in the second text;
Specifically, count the number of segmented words contained in the first text and the number of times the segmented word occurs in the first text; dividing the number of occurrences by the number of segmented words in the first text gives the first frequency. Likewise, count the number of segmented words contained in the second text and the number of times the segmented word occurs in the second text; dividing the number of occurrences by the number of segmented words in the second text gives the second frequency.
Because both the first text and the second text contain the segmented word, neither the first frequency nor the second frequency can be 0.
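The two frequencies can be computed directly from the slices. This is a minimal sketch under the same list-of-segments assumption; the function name `split_frequencies` and the sample segments are hypothetical.

```python
def split_frequencies(segments, index):
    """Return (first_frequency, second_frequency) for the word at `index`.

    Each frequency is the number of occurrences of the word in the
    corresponding text divided by the number of segmented words in
    that text. Both texts contain the word, so neither result is 0.
    """
    word = segments[index]
    first = segments[: index + 1]
    second = segments[index:]
    p1 = first.count(word) / len(first)
    p2 = second.count(word) / len(second)
    return p1, p2


segments = ["Everbright", "Securities", "abnormal", "trading", "was", "imposed", "fine"]
p1, p2 = split_frequencies(segments, 2)  # the word "abnormal"
print(p1, p2)
```

Here the first text has three segmented words and the second has five, with one occurrence of "abnormal" in each, so the frequencies are 1/3 and 1/5.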
(3) Calculate the information entropy of the segmented word according to its first frequency and second frequency.
Specifically, calculate the information entropy of the segmented word from its first frequency and second frequency according to the following formula (1);
$E(w_i) = -p_1(w_i)\log_2 p_1(w_i) - p_2(w_i)\log_2 p_2(w_i)$  (1)
In formula (1), $w_i$ is the i-th segmented word contained in the text, $E(w_i)$ is the information entropy of $w_i$, $p_1(w_i)$ is the first frequency of $w_i$, $p_2(w_i)$ is the second frequency of $w_i$, and $\log_2$ denotes the base-2 logarithm.
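Formula (1) translates into a one-line function. The sketch below evaluates it for frequencies of 1/3 and 1/5, which is what a word occurring once in a three-word first text and once in a five-word second text would have; the function name `information_entropy` is hypothetical.

```python
import math


def information_entropy(p1, p2):
    """Formula (1): E = -p1*log2(p1) - p2*log2(p2).

    Both frequencies must be greater than 0, which the way the text
    is divided guarantees.
    """
    return -p1 * math.log2(p1) - p2 * math.log2(p2)


entropy = information_entropy(1 / 3, 1 / 5)
print(round(entropy, 4))  # 0.9927
```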
Step 204: obtain, according to the part of speech of each segmented word, the corresponding part-of-speech weight from the stored correspondence between parts of speech and part-of-speech weights;
A part-of-speech weight is set for each part of speech in advance, and each part of speech is stored together with its weight in the correspondence between parts of speech and part-of-speech weights.
The part-of-speech weights of all parts of speech sum to 1.
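The stored correspondence can be pictured as a lookup table. The sketch below is illustrative only: the weight values are hypothetical (the source gives none) and are chosen merely so that nouns and verbs weigh most and all weights sum to 1.

```python
import math

# Hypothetical part-of-speech weights; the source specifies only that
# such a table is stored in advance and that the weights sum to 1.
POS_WEIGHTS = {
    "noun": 0.40,
    "verb": 0.30,
    "adjective": 0.15,
    "preposition": 0.10,
    "conjunction": 0.05,
}
assert math.isclose(sum(POS_WEIGHTS.values()), 1.0)


def pos_weight(part_of_speech, table=POS_WEIGHTS):
    """Look up the weight for a part of speech; raises KeyError if absent."""
    return table[part_of_speech]


print(pos_weight("noun"), pos_weight("conjunction"))
```

How a part of speech missing from the table should be handled is not stated in the source; this sketch simply lets the lookup fail.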
Step 205: obtain the position weight of each segmented word according to its position in the text;
Specifically, according to the position of each segmented word in the text, determine the position range into which that position falls, and then obtain the corresponding position weight from the stored correspondence between position ranges and position weights.
The text is divided by position in advance to obtain multiple position ranges; a position weight is set for each position range, and each position range is stored together with its weight in the correspondence between position ranges and position weights.
It should be added that texts fall into categories such as narrative, expository, argumentative, lyrical, and practical writing, and the importance of a word's position differs between categories. Different position weights can therefore be set for each position range in each category of text, which improves the accuracy of keyword extraction.
This embodiment can also set different position ranges for different categories of text.
A user can modify the position ranges of a text according to its category, and can modify the position weight of each position range.
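A per-category table of position ranges and weights might look like the following. This is a sketch under stated assumptions: the three paragraph-level ranges, the category names used as keys, and every weight value are hypothetical, since the source leaves the concrete ranges and weights to configuration.

```python
# Hypothetical position weights per text category and position range.
POSITION_WEIGHTS = {
    "argumentative": {"first_paragraph": 0.5, "last_paragraph": 0.3, "middle": 0.2},
    "narrative": {"first_paragraph": 0.3, "last_paragraph": 0.3, "middle": 0.4},
}


def position_range(paragraph_index, paragraph_count):
    """Map a paragraph index to one of three illustrative position ranges."""
    if paragraph_index == 0:
        return "first_paragraph"
    if paragraph_index == paragraph_count - 1:
        return "last_paragraph"
    return "middle"


def position_weight(category, paragraph_index, paragraph_count):
    """Look up the weight of the range that the paragraph index falls in."""
    return POSITION_WEIGHTS[category][position_range(paragraph_index, paragraph_count)]


print(position_weight("argumentative", 0, 5))
print(position_weight("narrative", 2, 5))
```

Keeping the table as plain data makes it straightforward for a user to change ranges or weights per category, as the text allows.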
Step 206: calculate the score of each segmented word according to its information entropy, its part-of-speech weight, and its position weight;
Specifically, calculate the score of each segmented word from its information entropy, its part-of-speech weight, and its position weight according to the following formula (2);
$f(w_i) = c_1 E(w_i) + c_2\,\mathrm{pos}(w_i) + c_3\,t(w_i)$  (2)
In formula (2), $w_i$ is the i-th segmented word contained in the text, $f(w_i)$ is the score of $w_i$, $c_1$ is the weight of the information entropy, $E(w_i)$ is the information entropy of $w_i$, $c_2$ is the weight of the position weight, $\mathrm{pos}(w_i)$ is the position weight of $w_i$, $c_3$ is the weight of the part-of-speech weight, and $t(w_i)$ is the part-of-speech weight of $w_i$.
The score of a segmented word depends on its information entropy, part-of-speech weight, and position weight, and these three quantities reflect the importance of the word to different degrees. This embodiment therefore assigns a separate weight to each of them, and the weight of the information entropy, the weight of the part-of-speech weight, and the weight of the position weight sum to 1.
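Formula (2) is a weighted sum of three terms. A minimal sketch follows; the default coefficient values 0.4, 0.3, and 0.3 and the sample inputs are hypothetical, chosen only to satisfy the constraint that the three coefficients sum to 1.

```python
def score(entropy, position_weight, pos_weight, c1=0.4, c2=0.3, c3=0.3):
    """Formula (2): f = c1*E + c2*pos + c3*t.

    c1, c2 and c3 weight the information entropy, the position weight
    and the part-of-speech weight respectively, and must sum to 1.
    The default values are illustrative assumptions.
    """
    if abs(c1 + c2 + c3 - 1.0) > 1e-9:
        raise ValueError("c1 + c2 + c3 must equal 1")
    return c1 * entropy + c2 * position_weight + c3 * pos_weight


# Sample inputs: entropy 1.0, position weight 0.5, part-of-speech weight 0.4.
value = score(1.0, 0.5, 0.4)
print(round(value, 2))  # 0.67
```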
Step 207: select the preset number of segmented words with the highest scores, and determine the selected segmented words as keywords.
Alternatively, select the segmented words whose scores exceed a preset threshold, and determine the selected segmented words as keywords.
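The two selection rules in step 207 can be sketched side by side. The sketch assumes the scores are held in a dictionary keyed by segmented word; how repeated occurrences of a word are merged into one entry is not stated in the source, so that representation, the function names, and the sample scores are assumptions.

```python
def top_n(scores, n):
    """Return the n segmented words with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def above_threshold(scores, threshold):
    """Return every segmented word whose score exceeds the threshold."""
    return [word for word, value in scores.items() if value > threshold]


scores = {"fine": 0.70, "abnormal": 0.67, "was": 0.20}  # hypothetical scores
print(top_n(scores, 2))
print(above_threshold(scores, 0.5))
```

The first rule fixes the number of keywords in advance; the second lets the number vary with the text.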
In this embodiment of the present invention, the part of speech of each segmented word contained in the text and its position in the text are obtained; a score is calculated for each segmented word from its part of speech and its position in the text; and the preset number of segmented words with the highest scores are selected as keywords. Because the score of each segmented word depends only on the text itself, the part of speech of the word, and the position of the word in the text, and not on any other text, the extracted keywords are unaffected by other texts, which improves the efficiency of keyword extraction and, in turn, its accuracy.
Embodiment three
See Fig. 3, embodiments provide a kind of device extracting keyword, this device comprises:
Dividing module 301, for carrying out participle division to text, obtaining the participle that the text comprises;
Acquisition module 302, for obtaining the part of speech of each participle that the text comprises, and obtains the position of each participle in the text;
Computing module 303, for according to the part of speech of each participle and the position of each participle in the text, calculates the mark of each participle, and this mark is used to indicate the significance level of the main contents of a participle reflection text;
Determination module 304, for selecting the default value participle that mark is the highest, is defined as keyword by the participle of selection.
Wherein, computing module 303 comprises:
First computing unit, for calculating the information entropy of each participle that the text comprises;
First acquiring unit, for the part of speech according to each participle, obtains corresponding part of speech weight from the part of speech stored with the corresponding relation of part of speech weight;
Second acquisition unit, for according to the position of each participle in the text, obtains the position weight of each participle;
Second computing unit, for the part of speech weight of the information entropy according to each participle, each participle and the position weight of each participle, calculates the mark of each participle.
Wherein, the first computing unit comprises:
Divide subelement, for each participle comprised for the text, according to the position of this participle in the text, the text is divided into the first text and the second text, first text is this participle and the text before this participle, and the second text is this participle and the text after this participle;
Statistics subelement, for adding up the first frequency that in the first text, this participle occurs, and the second frequency that in statistics the second text, this participle occurs;
Computation subunit, for according to the first frequency of this participle and the second frequency of this participle, calculates the information entropy of this participle.
Further, computation subunit, specifically for:
According to the first frequency of this participle and the second frequency of this participle, calculate the information entropy of this participle according to following formula (1);
E(w i)=-p 1(w i)log 2p 1(w i)-p 2(w i)log 2p 2(w i) (1)
Wherein, in above-mentioned formula (1), w ifor i-th participle that the text comprises, E (w i) be participle w iinformation entropy, p 1(w i) be participle w ifirst frequency, p 1(w i) be participle w isecond frequency.
Wherein, the second computing unit, specifically for:
According to the position weight of the information entropy of each participle, the part of speech weight of each participle and each participle, calculate the mark of each participle according to following formula (2);
f(w i)=c 1*E(w i)+c 2*pos(w i)+c 3*t(w i) (2)
Wherein, in above-mentioned formula (2), w ifor i-th participle that the text comprises, f (w i) be participle w iscore, c 1for the weight of information entropy, E (w i) be participle w iinformation entropy, c 2for the weight of position weight, pos (w i) be participle w iposition weight, c 3for the weight of part of speech weight, t (w i) be participle w ipart of speech weight.
In the embodiments of the present invention, the part of speech of each partitioned word comprised in the text and the position of each partitioned word in the text are obtained; the score of each partitioned word is calculated according to its part of speech and its position in the text; and a preset number of partitioned words with the highest scores are selected and determined to be the keywords. Because the score of each partitioned word depends only on the text itself, the part of speech of the word, and the position of the word in the text, and is independent of any other text, the extracted keywords are not affected by other texts. This improves the efficiency of keyword extraction and, in turn, its accuracy.
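The final selection step, taking a preset number of the highest-scoring partitioned words, might look like the following Python sketch. The function name `top_keywords` and the representation of scores as a dictionary from word to score are assumptions; the source does not say how ties are handled.

```python
import heapq
from typing import Dict, List


def top_keywords(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n words with the highest scores, best first."""
    # Iterating a dict yields its keys; each key is ranked by its score.
    return heapq.nlargest(n, scores, key=scores.get)


scores = {"method": 0.2, "keyword": 0.9, "entropy": 0.5}
print(top_keywords(scores, 2))  # ['keyword', 'entropy']
```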
It should be noted that, when the keyword extracting device provided by the above embodiment extracts keywords, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the keyword extracting device provided by the above embodiment and the embodiment of the keyword extracting method belong to the same concept; for its specific implementation, refer to the method embodiment, which is not repeated here.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting keywords, characterized in that the method comprises:
performing word partition on a text to obtain the partitioned words comprised in the text;
obtaining the part of speech of each partitioned word comprised in the text, and obtaining the position of each partitioned word in the text;
calculating a score of each partitioned word according to the part of speech of each partitioned word and the position of each partitioned word in the text, where the score indicates the degree to which a partitioned word reflects the main content of the text;
selecting a preset number of partitioned words with the highest scores, and determining the selected partitioned words as keywords.
2. The method as claimed in claim 1, characterized in that calculating the score of each partitioned word according to the part of speech of each partitioned word and the position of each partitioned word in the text comprises:
calculating the information entropy of each partitioned word comprised in the text;
obtaining, according to the part of speech of each partitioned word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
obtaining the position weight of each partitioned word according to the position of each partitioned word in the text;
calculating the score of each partitioned word according to the information entropy, the part-of-speech weight, and the position weight of each partitioned word.
3. The method as claimed in claim 2, characterized in that calculating the information entropy of each partitioned word comprised in the text comprises:
for each partitioned word comprised in the text, dividing the text into a first text and a second text according to the position of the partitioned word in the text, where the first text consists of the partitioned word and the text before it, and the second text consists of the partitioned word and the text after it;
counting the first frequency with which the partitioned word occurs in the first text, and counting the second frequency with which the partitioned word occurs in the second text;
calculating the information entropy of the partitioned word according to the first frequency and the second frequency.
4. The method as claimed in claim 3, characterized in that calculating the information entropy of the partitioned word according to the first frequency and the second frequency comprises:
calculating the information entropy of the partitioned word from the first frequency and the second frequency according to the formula $E(w_i) = -p_1(w_i)\log_2 p_1(w_i) - p_2(w_i)\log_2 p_2(w_i)$;
where $w_i$ is the i-th partitioned word comprised in the text, $E(w_i)$ is the information entropy of $w_i$, $p_1(w_i)$ is the first frequency of $w_i$, and $p_2(w_i)$ is the second frequency of $w_i$.
5. The method as claimed in claim 2, characterized in that calculating the score of each partitioned word according to the information entropy, the part-of-speech weight, and the position weight of each partitioned word comprises:
calculating the score of each partitioned word from the information entropy, the part-of-speech weight, and the position weight of each partitioned word according to the formula $f(w_i) = c_1 \cdot E(w_i) + c_2 \cdot pos(w_i) + c_3 \cdot t(w_i)$;
where $w_i$ is the i-th partitioned word comprised in the text, $f(w_i)$ is the score of $w_i$, $c_1$ is the weight of the information entropy, $E(w_i)$ is the information entropy of $w_i$, $c_2$ is the weight of the position weight, $pos(w_i)$ is the position weight of $w_i$, $c_3$ is the weight of the part-of-speech weight, and $t(w_i)$ is the part-of-speech weight of $w_i$.
6. A device for extracting keywords, characterized in that the device comprises:
a dividing module, configured to perform word partition on a text to obtain the partitioned words comprised in the text;
an acquiring module, configured to obtain the part of speech of each partitioned word comprised in the text, and to obtain the position of each partitioned word in the text;
a calculating module, configured to calculate a score of each partitioned word according to the part of speech of each partitioned word and the position of each partitioned word in the text, where the score indicates the degree to which a partitioned word reflects the main content of the text;
a determining module, configured to select a preset number of partitioned words with the highest scores and to determine the selected partitioned words as keywords.
7. The device as claimed in claim 6, characterized in that the calculating module comprises:
a first computing unit, configured to calculate the information entropy of each partitioned word comprised in the text;
a first acquiring unit, configured to obtain, according to the part of speech of each partitioned word, the corresponding part-of-speech weight from a stored correspondence between parts of speech and part-of-speech weights;
a second acquiring unit, configured to obtain the position weight of each partitioned word according to the position of each partitioned word in the text;
a second computing unit, configured to calculate the score of each partitioned word according to the information entropy, the part-of-speech weight, and the position weight of each partitioned word.
8. The device as claimed in claim 7, characterized in that the first computing unit comprises:
a dividing subunit, configured to, for each partitioned word comprised in the text, divide the text into a first text and a second text according to the position of the partitioned word in the text, where the first text consists of the partitioned word and the text before it, and the second text consists of the partitioned word and the text after it;
a statistics subunit, configured to count the first frequency with which the partitioned word occurs in the first text, and to count the second frequency with which the partitioned word occurs in the second text;
a computation subunit, configured to calculate the information entropy of the partitioned word according to the first frequency and the second frequency.
9. The device as claimed in claim 8, characterized in that
the computation subunit is specifically configured to:
calculate the information entropy of the partitioned word from the first frequency and the second frequency according to the formula $E(w_i) = -p_1(w_i)\log_2 p_1(w_i) - p_2(w_i)\log_2 p_2(w_i)$;
where $w_i$ is the i-th partitioned word comprised in the text, $E(w_i)$ is the information entropy of $w_i$, $p_1(w_i)$ is the first frequency of $w_i$, and $p_2(w_i)$ is the second frequency of $w_i$.
10. The device as claimed in claim 7, characterized in that
the second computing unit is specifically configured to:
calculate the score of each partitioned word from the information entropy, the part-of-speech weight, and the position weight of each partitioned word according to the formula $f(w_i) = c_1 \cdot E(w_i) + c_2 \cdot pos(w_i) + c_3 \cdot t(w_i)$;
where $w_i$ is the i-th partitioned word comprised in the text, $f(w_i)$ is the score of $w_i$, $c_1$ is the weight of the information entropy, $E(w_i)$ is the information entropy of $w_i$, $c_2$ is the weight of the position weight, $pos(w_i)$ is the position weight of $w_i$, $c_3$ is the weight of the part-of-speech weight, and $t(w_i)$ is the part-of-speech weight of $w_i$.
CN201310706212.7A 2013-12-19 2013-12-19 A kind of method and device of extraction keyword Active CN104731797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310706212.7A CN104731797B (en) 2013-12-19 2013-12-19 A kind of method and device of extraction keyword

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310706212.7A CN104731797B (en) 2013-12-19 2013-12-19 A kind of method and device of extraction keyword

Publications (2)

Publication Number Publication Date
CN104731797A true CN104731797A (en) 2015-06-24
CN104731797B CN104731797B (en) 2018-09-18

Family

ID=53455694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310706212.7A Active CN104731797B (en) 2013-12-19 2013-12-19 A kind of method and device of extraction keyword

Country Status (1)

Country Link
CN (1) CN104731797B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933197A (en) * 2015-07-13 2015-09-23 北京天天卓越科技有限公司 Method and terminal equipment for determining keywords
CN105373528A (en) * 2015-08-18 2016-03-02 新华网股份有限公司 Method and device for analyzing sensitivity of text contents
CN106325688A (en) * 2016-08-17 2017-01-11 北京锤子数码科技有限公司 Text processing method and device
CN106484266A (en) * 2016-10-18 2017-03-08 北京锤子数码科技有限公司 A kind of text handling method and device
CN106557508A (en) * 2015-09-28 2017-04-05 北京神州泰岳软件股份有限公司 A kind of text key word extracting method and device
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN107577713A (en) * 2017-08-03 2018-01-12 国网信通亿力科技有限责任公司 Text handling method based on electric power dictionary
CN107665189A (en) * 2017-06-16 2018-02-06 平安科技(深圳)有限公司 A kind of method, terminal and equipment for extracting centre word
WO2018027463A1 (en) * 2016-08-08 2018-02-15 深圳市博信诺达经贸咨询有限公司 Application method and system for keyword analysis in big data
CN108399165A (en) * 2018-03-28 2018-08-14 广东技术师范学院 A kind of keyword abstraction method based on position weighting
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN110032730A (en) * 2019-02-18 2019-07-19 阿里巴巴集团控股有限公司 A kind of processing method of text data, device and equipment
CN112069232A (en) * 2020-09-08 2020-12-11 中国移动通信集团河北有限公司 Method and device for inquiring broadband service coverage area
WO2021051557A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Semantic recognition-based keyword determination method and apparatus, and storage medium
CN113282752A (en) * 2021-06-09 2021-08-20 江苏联著实业股份有限公司 Object classification method and system based on semantic mapping
CN113515940A (en) * 2021-07-14 2021-10-19 上海芯翌智能科技有限公司 Method and equipment for text search

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110264655A1 (en) * 2010-04-22 2011-10-27 Microsoft Corporation Location context mining
CN103186662A (en) * 2012-12-28 2013-07-03 中联竞成(北京)科技有限公司 System and method for extracting dynamic public sentiment keywords


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张红鹰: "中文文本关键词提取方法", 《计算机系统应用》 *
蒋效宇: "基于关键词抽取的自动文摘算法", 《计算机工程》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933197A (en) * 2015-07-13 2015-09-23 北京天天卓越科技有限公司 Method and terminal equipment for determining keywords
CN105373528A (en) * 2015-08-18 2016-03-02 新华网股份有限公司 Method and device for analyzing sensitivity of text contents
CN105373528B (en) * 2015-08-18 2019-03-12 新华网股份有限公司 A kind of text content sensitive analysis method and device
CN106557508A (en) * 2015-09-28 2017-04-05 北京神州泰岳软件股份有限公司 A kind of text key word extracting method and device
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
WO2018027463A1 (en) * 2016-08-08 2018-02-15 深圳市博信诺达经贸咨询有限公司 Application method and system for keyword analysis in big data
CN106325688A (en) * 2016-08-17 2017-01-11 北京锤子数码科技有限公司 Text processing method and device
CN106484266A (en) * 2016-10-18 2017-03-08 北京锤子数码科技有限公司 A kind of text handling method and device
US10489047B2 (en) 2016-10-18 2019-11-26 Beijing Bytedance Network Technology Co Ltd. Text processing method and device
CN111381751A (en) * 2016-10-18 2020-07-07 北京字节跳动网络技术有限公司 Text processing method and device
CN107665189A (en) * 2017-06-16 2018-02-06 平安科技(深圳)有限公司 A kind of method, terminal and equipment for extracting centre word
CN107665189B (en) * 2017-06-16 2019-12-13 平安科技(深圳)有限公司 method, terminal and equipment for extracting central word
CN107577713B (en) * 2017-08-03 2018-09-11 国网信通亿力科技有限责任公司 Text handling method based on electric power dictionary
CN107577713A (en) * 2017-08-03 2018-01-12 国网信通亿力科技有限责任公司 Text handling method based on electric power dictionary
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text
CN108519970B (en) * 2018-02-06 2021-08-31 平安科技(深圳)有限公司 Method for identifying sensitive information in text, electronic device and readable storage medium
CN108399165A (en) * 2018-03-28 2018-08-14 广东技术师范学院 A kind of keyword abstraction method based on position weighting
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN110032730A (en) * 2019-02-18 2019-07-19 阿里巴巴集团控股有限公司 A kind of processing method of text data, device and equipment
CN110032730B (en) * 2019-02-18 2023-09-05 创新先进技术有限公司 Text data processing method, device and equipment
WO2021051557A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Semantic recognition-based keyword determination method and apparatus, and storage medium
CN112069232A (en) * 2020-09-08 2020-12-11 中国移动通信集团河北有限公司 Method and device for inquiring broadband service coverage area
CN112069232B (en) * 2020-09-08 2023-08-01 中国移动通信集团河北有限公司 Broadband service coverage query method and device
CN113282752A (en) * 2021-06-09 2021-08-20 江苏联著实业股份有限公司 Object classification method and system based on semantic mapping
CN113515940A (en) * 2021-07-14 2021-10-19 上海芯翌智能科技有限公司 Method and equipment for text search

Also Published As

Publication number Publication date
CN104731797B (en) 2018-09-18

Similar Documents

Publication Publication Date Title
CN104731797A (en) Keyword extracting method and keyword extracting device
CN108647309B (en) Chat content auditing method and system based on sensitive words
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN103336766B (en) Short text garbage identification and modeling method and device
CN108536677A (en) A kind of patent text similarity calculating method
CN108549634A (en) A kind of Chinese patent text similarity calculating method
US20110072011A1 (en) Method and system for scoring texts
CN106599148A (en) Method and device for generating abstract
CN109471933A (en) A kind of generation method of text snippet, storage medium and server
CN104991891A (en) Short text feature extraction method
CN101520802A (en) Question-answer pair quality evaluation method and system
CN102033880A (en) Marking method and device based on structured data acquisition
CN103473380B (en) A kind of computer version sensibility classification method
CN105095179B (en) The method and device that user's evaluation is handled
CN106547924A (en) The sentiment analysis method and device of text message
CN105095430A (en) Method and device for setting up word network and extracting keywords
CN102081601A (en) Field word identification method and device
Pla et al. Sentiment analysis in Twitter for Spanish
CN104239285A (en) New article chapter detecting method and device
CN103744838A (en) Chinese emotional abstract system and Chinese emotional abstract method for measuring mainstream emotional information
CN103186647B (en) A kind of method and device according to contribution degree sequence
CN106997340A (en) The generation of dictionary and the Document Classification Method and device using dictionary
CN104035969A (en) Method and system for building feature word banks in social network
CN102375848B (en) Evaluation object clustering method and device
CN102622405B (en) Method for computing text distance between short texts based on language content unit number evaluation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 810, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Patentee after: BEIJING D-MEDIA COMMUNICATION TECHNOLOGY Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A room 602

Patentee before: BEIJING D-MEDIA COMMUNICATION TECHNOLOGY Co.,Ltd.