CN108446274A

CN108446274A - Keyword extraction method based on time-sensitive tf-idf

Info

Publication number: CN108446274A
Application number: CN201810214547.XA
Authority: CN
Inventors: 王晓慧; 覃京燕
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2018-03-15
Filing date: 2018-03-15
Publication date: 2018-08-24

Abstract

The present invention provides a keyword extraction method based on time-sensitive tf-idf, belonging to the technical field of natural language processing. The method first divides the text data into documents according to its temporal information, segments the corpus into words, and calculates the inverse document frequency (idf) of every word in the corpus. It then calculates the time-sensitive tf-idf score, using a time decay coefficient to regulate the speed of time decay. Finally, the words are sorted in descending order of time-sensitive tf-idf score and the top n words are output as keywords. By adjusting the value of the time decay coefficient, the method can obtain keywords with different degrees of time sensitivity. Unlike the classical tf-idf algorithm, the method does not require a specific document to be specified, nor a strictly bounded time period. In addition, the speed of time decay can be set, which makes the algorithm more flexible and the resulting keywords more diverse.

Description

Keyword extraction method based on time-sensitive tf-idf

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a keyword extraction method based on time-sensitive tf-idf.

Background art

tf-idf (Term Frequency-Inverse Document Frequency) (G. Salton and M. McGill, editors. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.) is a common text mining method. It rests on the assumption that the words most useful for distinguishing a document are those that occur frequently in that document and rarely in the other documents of the collection. For a given document, the term frequency (tf) is the frequency with which a given word occurs in that document. It is the term count normalized by document length, which prevents a bias toward long documents. The inverse document frequency (idf) measures the general importance of a word. The idf of a specific word is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. Finally, the tf-idf score of a word in a given document is the product of its term frequency (tf) and its inverse document frequency (idf).
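The classical definition described above can be illustrated with a minimal Python sketch. The function name `classic_tfidf` and the toy documents are hypothetical, and the documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter

def classic_tfidf(docs, doc_index):
    """Return {word: tf-idf score} for docs[doc_index].

    docs is a list of documents, each a list of already-segmented words.
    """
    counts = Counter(docs[doc_index])
    total = sum(counts.values())  # number of word occurrences in the document
    scores = {}
    for word, n in counts.items():
        tf = n / total  # term count normalized by document length
        df = sum(1 for d in docs if word in d)  # documents containing the word
        idf = math.log10(len(docs) / df)  # base-10 log of the quotient
        scores[word] = tf * idf
    return scores

# Hypothetical toy collection: "spring" occurs in every document.
docs = [
    ["spring", "festival", "travel", "travel"],
    ["spring", "weather"],
    ["stock", "market", "spring"],
]
scores = classic_tfidf(docs, 0)
```

Because "spring" occurs in every document, its idf is zero, whereas "travel" is frequent in the first document and absent elsewhere, so it receives the highest score there.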

The tf-idf algorithm is widely used in keyword extraction (B. Lott, Survey of Keyword Extraction Techniques, UNM Education, 2012.) and information retrieval (J. Ramos. Using TF-IDF to Determine Word Relevance in Document Queries. Technical report, Department of Computer Science, Rutgers University, 2003.). Some studies have modified the tf-idf algorithm to improve on the performance of the classical version. For example, Berger et al. proposed an adaptive tf-idf algorithm that combines gradient descent with tf-idf (Berger, A. et al. Bridging the Lexical Chasm: Statistical Approaches to Answer Finding. In Proc. Int. Conf. Research and Development in Information Retrieval, 192-199, 2000.), and Oren et al. combined a genetic algorithm with the tf-idf algorithm (Oren, Nir. Reexamining tf.idf based information retrieval with Genetic Programming. In Proceedings of SAICSIT, 1-10, 2002.).

Both the classical tf-idf algorithm and its improved variants require a specific document to be specified. Sometimes, however, we want the popular keywords around a certain time, such as the news hotspots around November 2017 or the trending microblog topics during the 2018 Spring Festival, where there is neither a strict time boundary nor a specific document. To meet this need, the present invention proposes a time-sensitive tf-idf algorithm for extracting the keywords around a certain time period when the time boundary is not clear-cut.

Summary of the invention

To address the limitations of the classical tf-idf algorithm, namely that it requires a specific document to be specified and does not take the time factor into account, the present invention provides a keyword extraction method based on time-sensitive tf-idf.

The method comprises the following steps:

(1) Time-based document division and idf calculation: the text data is divided into documents according to its temporal information, the corpus is segmented into words, and the inverse document frequency idf_i of each word w_i in the corpus is calculated;

(2) Time-sensitive tf-idf calculation: a time decay factor is added to the classical tf-idf algorithm. Specifically, a time window is applied when calculating the weight of a word, so that the farther a time point is from the current time point, the smaller its weight; at the same time, a time decay coefficient is set to regulate the speed of time decay. The time-sensitive tf-idf of word w_i in time period t_j is then calculated;

(3) Top-N keyword extraction: the words are sorted in descending order of time-sensitive tf-idf score, and the top n words are output as keywords.

In step (1), the unit of the temporal information is one of second, minute, hour, month, and year.

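In step (1), the inverse document frequency is calculated as

$$idf_i = \log \frac{|D|}{|\{n : w_i \in d_n\}|}$$

where |D| is the total number of documents in the corpus; d_n is the document composed of all texts in unit time t_n; and |{n : w_i ∈ d_n}| is the number of documents containing word w_i.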

In step (2), the time-sensitive tf-idf of word w_i in time period t_j, denoted tfidf'_{i,j}, is calculated as follows:

where λ is the time decay coefficient, m is the time range that influences time period t_j, and n_{i,j} is the number of times word w_i occurs in the texts of time period t_j.

By adjusting the value of the time decay coefficient, the method can obtain keywords with different degrees of time sensitivity.

The above technical solution of the present invention has the following beneficial effects:

The method can extract time-related keywords, such as time-sensitive keywords and popular vocabulary around a certain time. Unlike the classical tf-idf algorithm, it does not require a specific document to be specified, nor a strictly bounded time period. In addition, the speed of time decay can be set, which makes the algorithm more flexible and the resulting keywords more diverse.

Description of the drawings

Fig. 1 is the flow chart of the keyword extraction method based on time-sensitive tf-idf of the present invention.

Detailed description of the embodiments

To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawing and a specific embodiment.

The present invention provides a keyword extraction method based on time-sensitive tf-idf.

Fig. 1 shows the flow chart of the method. The input is text data carrying temporal information together with a time period; the output is the keywords around that time period, i.e. the top N words after sorting by time-sensitive tf-idf score in descending order. The steps of a specific implementation are as follows:

A. Time-based document division and idf calculation. First, the text data is divided into time-based documents according to its temporal information. The time unit can be second, minute, hour, month, year, etc. All texts of each unit time are regarded as one document; for example, the microblog texts of a given day form one document. These documents constitute a corpus. Then the corpus is segmented into words, and the inverse document frequency (idf) of every word in the corpus is calculated. Many classical tf-idf implementations compute the inverse document frequency from a ready-made corpus rather than from a corpus built from the analyzed documents, which is advisable when the corpus of analyzed documents is small. However, because the vocabulary differs across scenarios and domains, the resulting idf values vary widely, so here idf is calculated from the analyzed documents themselves, borrowing the definition of inverse document frequency from the classical tf-idf algorithm. Specifically, the corpus is segmented into words, and for each word w_i in the segmentation result the inverse document frequency idf_i is calculated:

$$idf_i = \log \frac{|D|}{|\{n : w_i \in d_n\}|}$$

where |D| is the total number of documents in the corpus; d_n is the document composed of all texts in unit time t_n; and |{n : w_i ∈ d_n}| is the number of documents containing word w_i.
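Step A can be sketched in Python as follows. This is only an illustration under stated assumptions: the texts are assumed to be already segmented into words, the unit time is assumed to be one day, the base-10 logarithm is borrowed from the classical definition in the background section, and the names `build_time_documents` and `inverse_document_frequency` as well as the sample records are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime

def build_time_documents(records, unit_format="%Y-%m-%d"):
    """Merge all texts of the same unit time into one document.

    records: iterable of (timestamp, list_of_words) pairs.
    unit_format: strftime pattern that defines the unit time (assumed: one day).
    """
    documents = defaultdict(list)
    for timestamp, words in records:
        documents[timestamp.strftime(unit_format)].extend(words)
    return dict(documents)

def inverse_document_frequency(documents):
    """idf_i = log10(|D| / number of time documents containing w_i)."""
    total = len(documents)
    doc_freq = defaultdict(int)
    for words in documents.values():
        for word in set(words):  # count each word once per document
            doc_freq[word] += 1
    return {word: math.log10(total / df) for word, df in doc_freq.items()}

# Hypothetical, already-segmented microblog texts with timestamps.
records = [
    (datetime(2018, 2, 15, 9, 0), ["spring", "festival", "gala"]),
    (datetime(2018, 2, 15, 21, 30), ["gala", "fireworks"]),
    (datetime(2018, 2, 16, 8, 0), ["spring", "festival", "travel"]),
    (datetime(2018, 2, 17, 12, 0), ["travel", "traffic"]),
]
documents = build_time_documents(records)
idf = inverse_document_frequency(documents)
```

Changing `unit_format` (for example to `"%Y-%m"`) would switch the unit time from day to month without touching the idf computation.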

B. Time-sensitive tf-idf calculation. First, borrowing the definition of term frequency from the classical tf-idf algorithm, the term frequency tf_{i,j} is calculated for each word w_i occurring in the document of time period t_j. Here t_j can be a single unit time or a combination of unit times; for example, if the unit time is a day, t_j can be a month or a year.

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where n_{i,j} is the number of times word w_i occurs in the texts of time period t_j, and the denominator is the total number of occurrences of all words in the texts of time period t_j.
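The term frequency of a time period can be sketched as below, including the case where t_j combines several unit times. The function name `term_frequency`, the day-keyed dictionary layout, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def term_frequency(documents, period_keys):
    """tf_{i,j} = n_{i,j} / sum_k n_{k,j} over the texts of time period t_j.

    documents: {unit_time_key: list_of_words}
    period_keys: the unit times that together make up time period t_j.
    """
    counts = Counter()
    for key in period_keys:
        counts.update(documents.get(key, []))
    total = sum(counts.values())
    if total == 0:
        return {}  # no text falls into this time period
    return {word: n / total for word, n in counts.items()}

# Hypothetical per-day documents.
documents = {
    "2018-02-15": ["spring", "festival", "gala", "gala"],
    "2018-02-16": ["spring", "travel"],
    "2018-02-17": ["travel", "traffic"],
}
tf_day = term_frequency(documents, ["2018-02-15"])
tf_two_days = term_frequency(documents, ["2018-02-15", "2018-02-16"])
```

Passing one key treats t_j as a single unit time; passing several keys pools their word counts before normalizing.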

The time-sensitive tf-idf of word w_i in time period t_j, denoted tfidf'_{i,j}, is then calculated as follows:

where λ is the time decay coefficient. The larger λ is, the slower the decay and the greater the influence of the texts near time period t_j; conversely, the smaller λ is, the faster the decay, and the keywords around time period t_j depend mainly on the texts of t_j itself. m is the time range that influences time period t_j.
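The exact weighting formula is not reproduced in the text, so the following Python sketch only illustrates the described behavior under explicit assumptions: the weight of a neighboring period at offset k is taken to be exp(-|k| / λ), the window is taken to be symmetric with radius m around t_j, and the weighted tf-idf values are summed. These choices are hypothetical and merely satisfy the stated properties (a larger λ gives slower decay; periods farther away weigh less). The function name and sample values are also illustrative.

```python
import math

def time_sensitive_tfidf(tf_by_period, idf, j, lam=1.0, m=5):
    """Score every word for the period at index j with a decayed time window.

    ASSUMPTION: the weight exp(-|k| / lam) and the symmetric window
    [j - m, j + m] are illustrative choices, not the patent's formula.
    tf_by_period: list of {word: tf} dicts, one per consecutive time period.
    idf: {word: idf} computed over the whole corpus.
    """
    scores = {}
    for k in range(-m, m + 1):
        idx = j + k
        if idx < 0 or idx >= len(tf_by_period):
            continue  # the window extends past the available data
        weight = math.exp(-abs(k) / lam)  # farther periods weigh less
        for word, tf in tf_by_period[idx].items():
            scores[word] = scores.get(word, 0.0) + weight * tf * idf.get(word, 0.0)
    return scores

# Hypothetical term frequencies for three consecutive periods and idf values.
tf_by_period = [
    {"travel": 0.5, "tickets": 0.5},
    {"gala": 0.6, "travel": 0.4},
    {"fireworks": 1.0},
]
idf = {"travel": 0.2, "tickets": 0.5, "gala": 0.5, "fireworks": 0.5}

fast = time_sensitive_tfidf(tf_by_period, idf, j=1, lam=0.1, m=1)
slow = time_sensitive_tfidf(tf_by_period, idf, j=1, lam=10.0, m=1)
```

With the small λ, the neighboring periods contribute almost nothing and the scores are dominated by the texts of t_j itself; with the large λ, words from the neighboring periods can outrank them.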

C. Top-N keyword extraction. The values of λ and m are set; typically λ is set to 1 and m to 5. The words are sorted in descending order of time-sensitive tf-idf score, and the top N words are output as keywords. The speed of time decay λ and the length of the time window m are adjusted according to the requirements. For example, when extracting trending topics during the Spring Festival, because the Spring Festival influences people over a long period, the time period t_j can be set to the holiday from Lunar New Year's Eve to the sixth day of the first lunar month, the time window length m can be set large, covering from the Little New Year to the Lantern Festival, and the time decay coefficient λ can be set large so that the decay is slow. Conversely, if an event's influence is short-lived, λ and m can be set smaller. The method can therefore adjust the values of λ and m to specific requirements, which makes the algorithm more flexible and the resulting keywords more diverse.
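The final ranking step is straightforward; a minimal sketch follows. The function name `top_n_keywords` and the score values are hypothetical, and breaking ties alphabetically is an added assumption that only serves to make the output deterministic.

```python
import heapq

def top_n_keywords(scores, n):
    """Return the n highest-scoring words, best first.

    Ties are broken alphabetically (an assumption for reproducibility).
    """
    ranked = heapq.nsmallest(
        n, scores.items(), key=lambda item: (-item[1], item[0])
    )
    return [word for word, _ in ranked]

# Hypothetical time-sensitive tf-idf scores.
scores = {"gala": 0.30, "fireworks": 0.45, "travel": 0.17, "tickets": 0.23}
keywords = top_n_keywords(scores, 3)
```

If n exceeds the vocabulary size, all words are returned in ranked order.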

The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. A keyword extraction method based on time-sensitive tf-idf, characterized by comprising the following steps:

(1) Time-based document division and idf calculation: the text data is divided into documents according to its temporal information, the corpus is segmented into words, and the inverse document frequency idf_i of each word w_i in the corpus is calculated;

(2) Time-sensitive tf-idf calculation: a time decay factor is added to the classical tf-idf algorithm, a time decay coefficient is set to regulate the speed of time decay, and the time-sensitive tf-idf of word w_i in time period t_j is calculated;

(3) Top-N keyword extraction: the words are sorted in descending order of time-sensitive tf-idf score, and the top n words are output as keywords.

2. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that: in step (1), the unit of the temporal information is one of second, minute, hour, month, and year.

3. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that: in step (1),

$$idf_i = \log \frac{|D|}{|\{n : w_i \in d_n\}|}$$

where |D| is the total number of documents in the corpus; d_n is the document composed of all texts in unit time t_n; and |{n : w_i ∈ d_n}| is the number of documents containing word w_i.

4. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that: in step (2), the time-sensitive tf-idf of word w_i in time period t_j, denoted tfidf'_{i,j}, is calculated as follows:

where λ is the time decay coefficient, m is the time range that influences time period t_j, and n_{i,j} is the number of times word w_i occurs in the texts of time period t_j.