CN108170666A

CN108170666A - An improved keyword extraction method based on TF-IDF

Info

Publication number: CN108170666A
Application number: CN201711229728.1A
Authority: CN
Inventors: 向阳; 郑惺; 张默涵; 赵雨晴
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2017-11-29
Filing date: 2017-11-29
Publication date: 2018-06-15

Abstract

The present invention relates to an improved keyword extraction method based on TF-IDF (term frequency-inverse document frequency), comprising the following steps: counting the number of times each word occurs in each text of a text collection; computing the weight of each word using an improved TF-IDF formula; and sorting the words by weight in descending order, the sorted result serving as the basis for retrieving the keywords of a text. Compared with the prior art, the invention distinguishes words of different parts of speech and optimizes the keyword ranking by taking into account the keywords that actually represent the features of a text.

Description

An improved keyword extraction method based on TF-IDF

Technical field

The present invention relates to natural language processing methods, and in particular to an improved keyword extraction method based on TF-IDF.

Background art

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

Keyword extraction is a natural language processing technique for identifying meaningful and representative fragments or words. It is one of the foundations and core components of natural language processing: most techniques for processing unstructured text, such as text summarization, text classification, text clustering, and automatic question answering, rely on it to improve accuracy. As the Internet develops, online resources grow ever richer, yet most articles lack keyword tags, and labeling keywords by hand is time-consuming, laborious, and highly subjective. Automatic keyword extraction from text is therefore of great importance.

This method performs keyword extraction on short-text corpora. Because short texts are limited in length, each usually has only one or two keywords. For this reason the method uses unsupervised keyword extraction, drawing on statistics such as document frequency (DF) and term frequency (TF), and extracts keywords with an improved TF-IDF.

Summary of the invention

The object of the present invention is to overcome the above drawbacks of the prior art by providing an improved keyword extraction method based on TF-IDF.

The object of the present invention can be achieved through the following technical solution:

An improved keyword extraction method based on TF-IDF, the method comprising the following steps:

S1, count the number of times each word occurs in each text of the text collection;

S2, compute the weight of each word using the improved TF-IDF formula;

S3, sort the words by weight in descending order, and use the sorted result as the basis for retrieving the keywords of a text.
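Steps S1 and S3 can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the texts have already been split into words (the patent does not name a word segmentation tool), and the function names `count_words` and `rank_keywords` are hypothetical.

```python
from collections import Counter

def count_words(texts):
    """Step S1: count how often each word occurs in each (pre-tokenized) text."""
    return [Counter(tokens) for tokens in texts]

def rank_keywords(weights):
    """Step S3: return the words sorted by weight, largest first."""
    return sorted(weights, key=weights.get, reverse=True)

# Toy input, assumed to be tokenized already.
texts = [["tf", "idf", "tf"], ["idf", "keyword"]]
counts = count_words(texts)
print(counts[0]["tf"])  # 2

# The weights here are placeholders; step S2 would supply the real ones.
print(rank_keywords({"tf": 0.2, "idf": 0.9, "keyword": 0.5}))
# ['idf', 'keyword', 'tf']
```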

Preferably, the improved TF-IDF formula is:

TF-IDF_i = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)

where k_i1 denotes the part-of-speech importance coefficient of feature t_i in text d_k, k_i2 denotes the domain-term importance coefficient of feature t_i in text d_k, tf_ik denotes the number of times feature t_i occurs in text d_k, tf_i denotes the number of times feature t_i occurs in all texts, |D| denotes the number of texts, and df_i denotes the number of texts that contain feature t_i.
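A minimal Python sketch of this weight is given below. It reads the logarithm's argument as (|D| / df_i) * tf_i, which is the grouping used in the worked example of the embodiment, and uses base 10, the base chosen in that example. The function name `improved_tfidf` and the zero-weight guard for words that never occur are assumptions made for illustration.

```python
import math

def improved_tfidf(k1, k2, tf_ik, tf_i, df_i, n_docs):
    """(k1 + k2) * tf_ik * log10((n_docs / df_i) * tf_i)."""
    if df_i <= 0 or tf_i <= 0:
        return 0.0  # assumption: a word that never occurs gets weight 0
    return (k1 + k2) * tf_ik * math.log10(n_docs / df_i * tf_i)

# Hypothetical inputs: a domain term (k2 = 1) with k1 = 0.5 that occurs
# 3 times in this text, 12 times overall, in 4 of 8 texts.
w = improved_tfidf(k1=0.5, k2=1, tf_ik=3, tf_i=12, df_i=4, n_docs=8)
print(round(w, 3))
```

Because k_i1 and k_i2 are added before multiplying, a word with k_i1 = 0 that is not in the domain list receives a weight of zero regardless of its frequency.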

Preferably, D denotes the text set, D={d₁,d₂,…,d_k,…}, and d_k denotes the k-th text in D.

Preferably, t_i denotes the i-th feature word in T (i ∈ 1, 2, ..., |T|), |T| denotes the number of words, and T denotes the word set of the texts:

T={t₁,t₂,…,t_i,…}.

Preferably, df_i/|D| is the probability that a text selected from the text set contains the feature word t_i, and idf_i denotes the inverse document frequency derived from df_i:

idf_i = log((|D| / df_i) * tf_i).

Preferably, for a word i: if i is a noun, k_i1 = (number of nouns in all documents) / (number of all words); if i is a verb, k_i1 = (number of verbs in all documents) / (number of all words); if i is an adjective, k_i1 = (number of adjectives in all documents) / (number of all words); if i is a word other than a noun, verb, or adjective, k_i1 = 0.

Compared with the prior art, the present invention has the following advantages:

1. Words of different parts of speech are distinguished: the keyword extraction process introduces a part-of-speech importance coefficient and a domain-term importance coefficient, so the importance of words of different parts of speech can be told apart.

2. Keywords that actually represent the features of a text are taken into account, and the keyword ranking is optimized: the number of times feature t_i occurs in all texts is used to correct the keyword weight calculation. A word whose IDF value is low but which still represents the text well is moved toward the front of the ranking, making the extraction result more accurate.

Description of the drawings

Fig. 1 is a schematic flow chart of the method of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment:

As shown in Fig. 1, the improved keyword extraction method based on TF-IDF of the present invention comprises the following steps:

S1, count the number of times each word occurs in each document of the document collection;

S2, compute the weight of each word using the improved TF-IDF formula;

S3, sort the words by weight in descending order, and use the sorted result as the basis for retrieving the keywords of a text.

The method first recognizes that, in a short-text corpus, the different parts of a text differ in importance. In general, verbs, nouns, and adjectives are the main components of a sentence and matter most for keyword extraction; numerals, pronouns, and the like merely modify and complete the sentence and contribute almost nothing to classifying it. The method therefore assigns different importance coefficients k_i1 to words of different parts of speech, with nouns, verbs, and adjectives receiving larger coefficients than other parts of speech such as adverbs and pronouns.

Second, the method takes into account the important role that domain-specific keywords play in a text, and assigns important domain terms an importance coefficient k_i2.

The method also addresses a shortcoming of TF-IDF. The main idea of TF-IDF is that if a word or phrase occurs frequently in one article (high TF) and rarely in other articles (high IDF), it is considered to discriminate well between classes and is suitable as one of the keywords of the text. However, some words have low IDF values yet still represent the features of a text well. The method therefore makes the following improvement:

TF-IDF_i = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)

The TF-IDF retrieval model. The main idea of the TF-IDF model is: if a word w occurs frequently in a document d and rarely in other documents, w is considered to have good discriminating power and is suitable for distinguishing d from other articles. The formal definition of the model is given below.

T denotes the word set of the texts. A feature word can be a word or a phrase. |T| denotes the number of words. t_i denotes the i-th feature word in T (i ∈ 1, 2, ..., |T|).

T={ t₁,t₂,…,t_i,…}

D denotes the text set. |D| denotes the number of texts. d_k denotes the k-th text in D (k ∈ 1, 2, ..., |D|).

D={ d₁,d₂,…,d_k,…}

tf_ik denotes the number of times feature t_i occurs in text d_k. It is a local parameter specific to a single text.

tf_i denotes the number of times feature t_i occurs in all texts.

df_i denotes the number of texts that contain feature t_i.

idf_i denotes the inverse document frequency derived from df_i. df_i/|D| is the probability that a text selected from the text set contains the feature word t_i.

idf_i = log((|D| / df_i) * tf_i)
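The quantities defined above can all be derived from per-text word counts. The following Python sketch shows one way to do so; the helper names `corpus_statistics` and `modified_idf` are hypothetical, the texts are assumed to be tokenized already, and base 10 is used as in the worked example of this embodiment.

```python
import math
from collections import Counter

def corpus_statistics(texts):
    """Return tf_ik (per text), tf_i (whole corpus) and df_i (texts containing the word)."""
    per_text = [Counter(tokens) for tokens in texts]  # tf_ik
    total = Counter()                                 # tf_i
    doc_freq = Counter()                              # df_i
    for counts in per_text:
        total.update(counts)            # add this text's occurrence counts
        doc_freq.update(counts.keys())  # each text counts once per distinct word
    return per_text, total, doc_freq

def modified_idf(word, total, doc_freq, n_docs):
    """idf_i = log10((|D| / df_i) * tf_i); assumes the word occurs in the corpus."""
    return math.log10(n_docs / doc_freq[word] * total[word])

texts = [["a", "b", "a"], ["a", "c"], ["b"], ["c", "c"]]
per_text, total, doc_freq = corpus_statistics(texts)
print(total["a"], doc_freq["a"])  # 3 2
print(modified_idf("a", total, doc_freq, len(texts)))
```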

k_i1 denotes the part-of-speech importance coefficient of feature t_i in text d_k.

k_i2 denotes the domain-term importance coefficient of feature t_i in text d_k. It is a user-definable parameter.

Setting k_i1 raises the importance of nouns, verbs, and adjectives relative to words of other parts of speech, and k_i2 highlights the importance of domain-specific keywords.

The new formula differs from the traditional TF-IDF calculation in two ways. First, it introduces the part-of-speech importance coefficient and the domain-term importance coefficient, so that the importance of words of different parts of speech can be distinguished. Second, it changes the traditional IDF calculation by adding tf_i to correct the weights of some keywords.

In the text collection D, consider how each word is distributed across the documents of each class. If a word is evenly distributed across the documents of every class, it is not representative and cannot serve as a keyword for any particular class. Conversely, if a word appears in the documents of one class and rarely or almost never in the documents of other classes, it can serve as a keyword for that class. Traditional TF-IDF does not handle this practical problem well, so we illustrate the situation as follows.

We divide parts of speech into 4 classes: noun, verb, adjective, and other. For a word i belonging to one of the first 3 classes (noun, verb, adjective): if i is a noun, for example, its k_i1 = (number of nouns in all documents) / (number of all words of the 4 classes in all documents). For words of the last class (that is, words other than nouns, verbs, and adjectives, such as adverbs and pronouns), k_i1 = 0.

In addition, before the experiment we prepare a word list containing the proper nouns of the field. This list is used to calculate k_i2: for words in the list k_i2 = 1, otherwise k_i2 = 0.

Suppose we have a document collection D containing 8 documents, divided into 2 classes of 4 documents each, and that we have extracted 3 words as keyword candidates. Table 1 shows how often each candidate word occurs in each document and the class of each document. Table 2 gives the weight (candidate score) that traditional TF-IDF computes for each candidate word. Table 3 gives the weight (candidate score) that the improved TF-IDF computes for each candidate word.

Table 1

Table 2

Candidate word	Weight
T1	3.010
T2	0.986
T3	1.250

Table 3

Candidate word	Weight
T1	13.010
T2	15.351
T3	11.250

1. Count the number of occurrences of every word in each document of the collection D; the result is shown in Table 1. (In this example the 8 documents contain only the 3 words T1, T2, and T3; documents 1, 2, 3, and 4 belong to class 1, and the other documents belong to class 2.)

2. Compute the weight of each candidate word according to the traditional TF-IDF formula (here the keyword weight is computed over the entire document collection).

Note 1: TF-IDF = tf_ik * log(|D| / df_i)

Note 2: For T1, for example, TF-IDF = (2+3+2+3+0+0+0+0) * log(8/4), where 8 is the total number of documents and 4 is the number of documents containing T1.

Note 3: The logarithm base used in this sample is 10.
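The T1 calculation in Note 2 can be checked with a few lines of Python. The per-document counts are those given in Note 2; the variable names are chosen here for illustration.

```python
import math

t1_counts = [2, 3, 2, 3, 0, 0, 0, 0]     # occurrences of T1 in documents 1..8
n_docs = len(t1_counts)                  # |D| = 8
df = sum(1 for c in t1_counts if c > 0)  # documents containing T1 = 4
tf = sum(t1_counts)                      # occurrences over the whole collection = 10

weight = tf * math.log10(n_docs / df)    # base-10 logarithm, as in Note 3
print(f"{weight:.3f}")  # 3.010
```

The result agrees with the T1 entry in Table 2.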

3. Compute the weight of each candidate word according to the improved TF-IDF formula.

Note 1: TF-IDF = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)

Note 2: For T1, for example, TF-IDF = (2+3+2+3+0+0+0+0) * log((8/4) * (2+3+2+3+0+0+0+0)), where 8 is the total number of documents and 4 is the number of documents containing T1.

Note 3: The logarithm base used in this sample is 10.

Note 4: In this sample, k_i1 + k_i2 = 1.
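The same check for the improved formula is sketched below, again using the T1 counts from Note 2 and k_i1 + k_i2 = 1 from Note 4. Since log(a * b) = log(a) + log(b), the improved weight exceeds the traditional one by exactly (k_i1 + k_i2) * tf * log(tf) when the coefficient sum is 1; the last line verifies this identity numerically. Variable names are illustrative.

```python
import math

t1_counts = [2, 3, 2, 3, 0, 0, 0, 0]  # occurrences of T1 in documents 1..8
n_docs = len(t1_counts)               # 8
df = sum(c > 0 for c in t1_counts)    # 4
tf = sum(t1_counts)                   # 10
k = 1                                 # k_i1 + k_i2 = 1 in this sample

traditional = tf * math.log10(n_docs / df)
improved = k * tf * math.log10(n_docs / df * tf)
print(f"{improved:.3f}")  # 13.010

# The added tf_i factor contributes tf * log10(tf) on top of the traditional weight.
print(math.isclose(improved - traditional, tf * math.log10(tf)))  # True
```

The result agrees with the T1 entry in Table 3, and the decomposition shows why frequent words gain the most from the modification.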

Analysis of results:

Table 1 shows that T1 occurs only in class 1, so it can serve as a keyword representing class 1. T2 is evenly distributed across the two classes and is not suited to serve as a keyword. T3 occurs frequently in class 2 and only occasionally in class 1, so it has some ability to serve as a keyword; in short texts from a specialized field, some terms that are characteristic of one class may occasionally appear in other classes, yet they still represent their class well and can serve as its keywords.

Comparing Table 2 with Table 3 and sorting by weight in descending order, the traditional TF-IDF method yields T1 > T3 > T2, which does not fully meet the requirement, whereas the improved TF-IDF method yields T2 > T1 > T3, which both meets our requirement and achieves what the traditional method does.

The above is merely a specific embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any person skilled in the art can readily conceive of equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.
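The two rankings stated above can be reproduced directly from the weights listed in Table 2 and Table 3. The short Python sketch below only sorts those published values; the dictionary and function names are illustrative.

```python
traditional = {"T1": 3.010, "T2": 0.986, "T3": 1.250}    # Table 2
improved = {"T1": 13.010, "T2": 15.351, "T3": 11.250}    # Table 3

def ranking(weights):
    """Candidate words sorted by weight in descending order."""
    return sorted(weights, key=weights.get, reverse=True)

print(" > ".join(ranking(traditional)))  # T1 > T3 > T2
print(" > ".join(ranking(improved)))     # T2 > T1 > T3
```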

Claims

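1. An improved keyword extraction method based on TF-IDF, characterized in that the method comprises the following steps:

S1, count the number of times each word occurs in each text of the text collection;

S2, compute the weight of each word using the improved TF-IDF formula;

S3, sort the words by weight in descending order, and use the sorted result as the basis for retrieving the keywords of a text.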

2. The improved keyword extraction method based on TF-IDF according to claim 1, characterized in that the improved TF-IDF formula is:

TF-IDF_i = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)

where k_i1 denotes the part-of-speech importance coefficient of feature t_i in text d_k, k_i2 denotes the domain-term importance coefficient of feature t_i in text d_k, tf_ik denotes the number of times feature t_i occurs in text d_k, tf_i denotes the number of times feature t_i occurs in all texts, |D| denotes the number of texts, and df_i denotes the number of texts that contain feature t_i.

3. The improved keyword extraction method based on TF-IDF according to claim 2, characterized in that D denotes the text set, D={d₁,d₂,…,d_k,…}, and d_k denotes the k-th text in D.

4. The improved keyword extraction method based on TF-IDF according to claim 2, characterized in that t_i denotes the i-th feature word in T (i ∈ 1, 2, ..., |T|), |T| denotes the number of words, and T denotes the word set of the texts:

T={t₁,t₂,…,t_i,…}.

5. The improved keyword extraction method based on TF-IDF according to claim 1, characterized in that df_i/|D| is the probability that a text selected from the text set contains the feature word t_i, and idf_i denotes the inverse document frequency derived from df_i:

idf_i = log((|D| / df_i) * tf_i).

6. The improved keyword extraction method based on TF-IDF according to claim 1, characterized in that, for a word i: if i is a noun, k_i1 = (number of nouns in all documents) / (number of all words); if i is a verb, k_i1 = (number of verbs in all documents) / (number of all words); if i is an adjective, k_i1 = (number of adjectives in all documents) / (number of all words); if i is a word other than a noun, verb, or adjective, k_i1 = 0.