CN108446274A - A keyword extraction method based on time-sensitive tf-idf - Google Patents

A keyword extraction method based on time-sensitive tf-idf

Info

Publication number
CN108446274A
Authority
CN
China
Prior art keywords
time
idf
sensitive
word
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810214547.XA
Other languages
Chinese (zh)
Inventor
王晓慧
覃京燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-03-15
Filing date
2018-03-15
Publication date
2018-08-24
Application filed by University of Science and Technology Beijing USTB
Priority to CN201810214547.XA
Publication of CN108446274A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a keyword extraction method based on time-sensitive tf-idf, belonging to the technical field of natural language processing research. The method first divides the text data into documents according to its time information, segments the corpus into words, and calculates the inverse document frequency (idf) of every word in the corpus. It then calculates a time-sensitive tf-idf score, using a time decay coefficient to adjust the speed of time decay. Finally, the words are sorted in descending order of time-sensitive tf-idf score, and the top n words are output as keywords. By adjusting the value of the time decay coefficient, the method can obtain keywords with different degrees of time sensitivity. Unlike the classical tf-idf algorithm, the method does not need a specific document to be specified, and it does not need a strictly delimited time period. In addition, the speed of time decay can be set, which makes the algorithm more flexible and the resulting keywords more diverse.

Description

A keyword extraction method based on time-sensitive tf-idf
Technical Field
The present invention relates to the technical field of natural language processing research, and in particular to a keyword extraction method based on time-sensitive tf-idf.
Background
tf-idf (Term Frequency-Inverse Document Frequency) (G. Salton and M. McGill, editors. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.) is a common text mining method. It rests on the following assumption: the words that best distinguish a document are those that occur frequently in that document and rarely in the other documents of the collection. For a given document, the term frequency (tf) is the frequency with which a given word occurs in that document. It is the term count normalized so that it is not biased toward long documents. The inverse document frequency (idf) measures the general importance of a word. The idf of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. Finally, the tf-idf score of a word in a given document is the product of its term frequency (tf) and its inverse document frequency (idf).
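The classical calculation described above can be illustrated with a minimal Python sketch. The function name `classic_tfidf` and the toy documents are hypothetical, and the texts are assumed to be already segmented into words.

```python
import math
from collections import Counter

def classic_tfidf(docs, doc_index):
    """Score every word of docs[doc_index] with classical tf-idf.

    docs: list of token lists (assumed to be already segmented into words).
    """
    target = Counter(docs[doc_index])
    total_terms = sum(target.values())
    scores = {}
    for word, count in target.items():
        tf = count / total_terms                  # term count normalized by document length
        df = sum(1 for d in docs if word in d)    # number of documents containing the word
        idf = math.log10(len(docs) / df)          # base-10 logarithm of the quotient
        scores[word] = tf * idf
    return scores

# Toy corpus of three documents (illustrative data only).
docs = [
    ["spring", "festival", "travel", "spring"],
    ["stock", "market", "travel"],
    ["stock", "market", "report"],
]
scores = classic_tfidf(docs, 0)
```

Note that the caller must pick one document (`doc_index`), which is the limitation the time-sensitive variant is meant to remove.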
The tf-idf algorithm is widely used in keyword extraction (B. Lott, Survey of Keyword Extraction Techniques, UNM Education, 2012.) and information retrieval (J. Ramos. Using TF-IDF to Determine Word Relevance in Document Queries. Technical report, Department of Computer Science, Rutgers University, 2003.). Some studies modify the tf-idf algorithm to improve on the performance of the classical algorithm. For example, Berger et al. proposed an algorithm called adaptive tf-idf, which combines gradient descent with the tf-idf algorithm (Berger, A. et al. Bridging the Lexical Chasm: Statistical Approaches to Answer Finding. In Proc. Int. Conf. Research and Development in Information Retrieval, 192-199, 2000.), and Oren et al. combined genetic algorithms with the tf-idf algorithm (Oren, Nir. Reexamining tf.idf based information retrieval with Genetic Programming. In Proceedings of SAICSIT, 1-10, 2002.).
Both the classical tf-idf algorithm and its improved variants require a specific document to be specified. Sometimes, however, we want the popular keywords around a certain time, such as the news hot spots around November 2017 or the trending microblog topics during the 2018 Spring Festival. In these cases there is neither a strict time boundary nor a specific document. To meet this need, the present invention proposes a time-sensitive tf-idf algorithm for extracting the keywords around a time period whose boundaries are not clearly defined.
Summary of the Invention
The classical tf-idf algorithm requires a specific document to be specified and does not take time into account. To address these limitations, the present invention provides a keyword extraction method based on time-sensitive tf-idf.
The method comprises the following steps:
(1) Time-based document division and idf calculation: divide the text data into documents according to its time information, segment the corpus into words, and calculate the inverse document frequency $\mathrm{idf}_i$ of each word $w_i$ in the corpus;
(2) Time-sensitive tf-idf calculation: add a time decay factor to the classical tf-idf algorithm. Specifically, a time window is applied when calculating word weights, so that the farther a text is from the current time point, the smaller its weight; a time decay coefficient is also set to adjust the speed of time decay. Calculate the time-sensitive tf-idf of word $w_i$ in time period $t_j$;
(3) Top-N keyword extraction: sort the words in descending order of time-sensitive tf-idf score and output the top n words as keywords.
The unit of time information in step (1) is one of second, minute, hour, month, or year.
In step (1), $\mathrm{idf}_i = \log \frac{|D|}{|\{n : w_i \in d_n\}|}$, where $|D|$ is the total number of documents in the corpus; $d_n$ is the document composed of all texts of unit time $t_n$; and $|\{n : w_i \in d_n\}|$ is the number of documents containing word $w_i$.
The time-sensitive tf-idf of word $w_i$ in time period $t_j$ in step (2), denoted $\mathrm{tfidf}'_{i,j}$, is calculated as follows:
where $\lambda$ is the time decay coefficient, $m$ is the time range that influences time period $t_j$, $\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, and $n_{i,j}$ is the number of times word $w_i$ occurs in the text of time period $t_j$.
By adjusting the value of the time decay coefficient, the method can obtain keywords with different degrees of time sensitivity.
The beneficial effects of the above technical solution of the present invention are as follows:
The method can extract time-sensitive keywords and other time-related keywords, such as the popular vocabulary around a certain time. Unlike the classical tf-idf algorithm, it does not need a specific document to be specified, and it does not need a strictly delimited time period. In addition, the speed of time decay can be set, which makes the algorithm more flexible and the resulting keywords more diverse.
Brief Description of the Drawings
Fig. 1 is a flow chart of the keyword extraction method based on time-sensitive tf-idf of the present invention.
Detailed Description
To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawing and specific embodiments.
The present invention provides a keyword extraction method based on time-sensitive tf-idf.
Fig. 1 shows the flow chart of the method. The input is text data with time information together with a time period. The output is the keywords around that period, that is, the top N words after sorting by time-sensitive tf-idf score in descending order. The steps of the specific implementation are as follows:
A. Time-based document division and idf calculation. First, the text data is divided into time-based documents according to its time information. The unit of time can be second, minute, hour, month, year, and so on. All texts within each unit of time are treated as one document; for example, the microblog texts of a given day form one document. These documents make up a corpus. Then the corpus is segmented into words, and the inverse document frequency (idf) of every word in the corpus is calculated. Many classical tf-idf implementations calculate the inverse document frequency from a ready-made corpus rather than from a corpus built from the documents being analyzed, which is advisable when the analyzed document collection is small. However, because vocabulary differs across scenarios and domains, the resulting idf values differ greatly, so here idf is calculated from the analyzed documents, following the definition of inverse document frequency in the classical tf-idf algorithm. Specifically, the corpus is segmented, and for each word $w_i$ in the segmentation result the inverse document frequency $\mathrm{idf}_i$ is calculated:
$$\mathrm{idf}_i = \log \frac{|D|}{|\{n : w_i \in d_n\}|}$$
where $|D|$ is the total number of documents in the corpus; $d_n$ is the document composed of all texts of unit time $t_n$; and $|\{n : w_i \in d_n\}|$ is the number of documents containing word $w_i$.
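Step A can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the day-level time keys, and the sample records are hypothetical, the texts are assumed to be already segmented, and the base-10 logarithm is assumed from the description of classical idf in the Background.

```python
import math
from collections import defaultdict

def build_time_documents(records):
    """Merge all texts that share a time unit into one document.

    records: iterable of (time_unit, tokens) pairs; tokens are already segmented.
    Returns {time_unit: list of tokens}.
    """
    documents = defaultdict(list)
    for time_unit, tokens in records:
        documents[time_unit].extend(tokens)
    return dict(documents)

def inverse_document_frequency(documents):
    """idf_i = log10(|D| / number of time documents that contain w_i)."""
    doc_count = len(documents)
    df = defaultdict(int)
    for tokens in documents.values():
        for word in set(tokens):   # count each word once per document
            df[word] += 1
    return {word: math.log10(doc_count / n) for word, n in df.items()}

# Illustrative microblog-style records keyed by day.
records = [
    ("2018-02-15", ["new", "year", "eve", "gala"]),
    ("2018-02-15", ["gala", "tickets"]),
    ("2018-02-16", ["new", "year", "greetings"]),
    ("2018-02-17", ["travel", "new", "traffic"]),
]
documents = build_time_documents(records)
idf = inverse_document_frequency(documents)
```

A word that appears in every time document, such as "new" here, receives an idf of zero, so it cannot become a keyword regardless of how often it occurs.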
B. Time-sensitive tf-idf calculation. First, following the definition of term frequency in the classical tf-idf algorithm, the term frequency $\mathrm{tf}_{i,j}$ is calculated for each word $w_i$ occurring in the document of time period $t_j$. Here $t_j$ can be a single unit of time or a combination of units; for example, if the unit of time is a day, $t_j$ can be a month or a year.
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $n_{i,j}$ is the number of times word $w_i$ occurs in the text of time period $t_j$, and the denominator is the total number of occurrences of all words in the text of time period $t_j$.
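The term frequency of a period that spans one or several time units can be sketched as below. The function and parameter names (`term_frequency`, `period_units`) and the sample data are hypothetical, and the documents are assumed to be token lists keyed by time unit.

```python
from collections import Counter

def term_frequency(documents, period_units):
    """tf_{i,j} = n_{i,j} / sum_k n_{k,j} for a period t_j made of one or more time units.

    documents: {time_unit: list of tokens}
    period_units: the time units that together form t_j.
    """
    counts = Counter()
    for unit in period_units:
        counts.update(documents.get(unit, []))
    total = sum(counts.values())
    if total == 0:
        return {}   # no text in this period
    return {word: n / total for word, n in counts.items()}

documents = {
    "2018-02-15": ["new", "year", "eve", "gala", "gala"],
    "2018-02-16": ["new", "year", "greetings"],
}
tf_day = term_frequency(documents, ["2018-02-15"])                  # t_j is one day
tf_both = term_frequency(documents, ["2018-02-15", "2018-02-16"])   # t_j combines two days
```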
The time-sensitive tf-idf of word $w_i$ in time period $t_j$, denoted $\mathrm{tfidf}'_{i,j}$, is then calculated as follows:
where $\lambda$ is the time decay coefficient. The larger $\lambda$ is, the slower the decay and the greater the influence of the texts near time period $t_j$; conversely, the smaller $\lambda$ is, the faster the decay, and the keywords around time period $t_j$ depend mainly on the text of $t_j$ itself. $m$ is the time range that influences time period $t_j$.
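The formula for the time-sensitive score is not reproduced in this text, so the following Python sketch substitutes an assumed form that matches the described behavior: it sums tf times idf over the periods within m steps of the target period and weights each neighbor by exp(-|k| / λ), so a larger λ decays more slowly. The exponential kernel, the function name `time_sensitive_tfidf`, and the toy values are illustrative assumptions, not the patent's definition.

```python
import math

def time_sensitive_tfidf(tf_by_period, idf, j, lam=1.0, m=5):
    """Assumed form: sum over k in [-m, m] of exp(-|k| / lam) * tf[j + k][w] * idf[w].

    tf_by_period: list of {word: tf} dicts, one per consecutive period.
    lam must be positive. The exponential kernel is an illustrative
    assumption, not the formula from the patent.
    """
    scores = {}
    for k in range(-m, m + 1):
        idx = j + k
        if idx < 0 or idx >= len(tf_by_period):
            continue   # the window extends past the available data
        weight = math.exp(-abs(k) / lam)   # weight 1 at k = 0, smaller farther away
        for word, tf in tf_by_period[idx].items():
            scores[word] = scores.get(word, 0.0) + weight * tf * idf.get(word, 0.0)
    return scores

# Toy term frequencies for three consecutive periods and toy idf values.
tf_by_period = [
    {"gala": 0.5, "travel": 0.5},
    {"gala": 1.0},
    {"travel": 0.5, "lantern": 0.5},
]
idf = {"gala": 0.2, "travel": 0.2, "lantern": 0.5}
fast = time_sensitive_tfidf(tf_by_period, idf, j=1, lam=0.5, m=1)   # fast decay
slow = time_sensitive_tfidf(tf_by_period, idf, j=1, lam=5.0, m=1)   # slow decay
```

With the small λ, the score is dominated by the target period's own text; with the large λ, words that occur only in neighboring periods (here "lantern") gain weight, which mirrors the effect of λ described above.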
C. Top-N keyword extraction. Set the values of $\lambda$ and $m$; typically $\lambda$ is set to 1 and $m$ to 5. Sort the words in descending order of time-sensitive tf-idf score and output the top N words as keywords. The time decay speed $\lambda$ and the time window length $m$ are adjusted to the task. For example, to extract hot topics during the Spring Festival, whose influence lasts a relatively long time, the time period $t_j$ can be set to the holiday from Lunar New Year's Eve to the sixth day of the first lunar month, the time window length $m$ can be set larger, spanning from the Little New Year to the Lantern Festival, and the time decay coefficient $\lambda$ can be set larger so that the time decay is slow. Conversely, if an event's influence is short-lived, $\lambda$ and $m$ can be set smaller. The method can therefore adjust the values of $\lambda$ and $m$ to specific needs, which makes the algorithm more flexible and the resulting keywords more diverse.
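The final ranking step is a plain descending sort. A minimal Python sketch follows, assuming the scores are held in a dictionary; the helper name `top_n_keywords` and the sample scores are hypothetical.

```python
import heapq

def top_n_keywords(scores, n):
    """Return the n highest-scoring (word, score) pairs, largest first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

# Illustrative time-sensitive tf-idf scores.
scores = {"gala": 0.28, "lantern": 0.20, "travel": 0.16, "report": 0.01}
keywords = [word for word, _ in top_n_keywords(scores, 3)]
```

`heapq.nlargest` avoids sorting the whole vocabulary when n is much smaller than the number of words; `sorted(..., reverse=True)[:n]` gives the same result.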
The above is the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A keyword extraction method based on time-sensitive tf-idf, characterized by comprising the following steps:
(1) Time-based document division and idf calculation: divide the text data into documents according to its time information, segment the corpus into words, and calculate the inverse document frequency $\mathrm{idf}_i$ of each word $w_i$ in the corpus;
(2) Time-sensitive tf-idf calculation: add a time decay factor to the classical tf-idf algorithm, set a time decay coefficient to adjust the speed of time decay, and calculate the time-sensitive tf-idf of word $w_i$ in time period $t_j$;
(3) Top-N keyword extraction: sort the words in descending order of time-sensitive tf-idf score and output the top n words as keywords.
2. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that the unit of time information in step (1) is one of second, minute, hour, month, or year.
3. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that, in step (1), $\mathrm{idf}_i = \log \frac{|D|}{|\{n : w_i \in d_n\}|}$, where $|D|$ is the total number of documents in the corpus; $d_n$ is the document composed of all texts of unit time $t_n$; and $|\{n : w_i \in d_n\}|$ is the number of documents containing word $w_i$.
4. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that the time-sensitive tf-idf of word $w_i$ in time period $t_j$ in step (2), denoted $\mathrm{tfidf}'_{i,j}$, is calculated as follows:
where $\lambda$ is the time decay coefficient, $m$ is the time range that influences time period $t_j$, $\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, and $n_{i,j}$ is the number of times word $w_i$ occurs in the text of time period $t_j$.
CN201810214547.XA 2018-03-15 2018-03-15 A keyword extraction method based on time-sensitive tf-idf Pending CN108446274A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810214547.XA CN108446274A (en) 2018-03-15 2018-03-15 A keyword extraction method based on time-sensitive tf-idf

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810214547.XA CN108446274A (en) 2018-03-15 2018-03-15 A keyword extraction method based on time-sensitive tf-idf

Publications (1)

Publication Number Publication Date
CN108446274A (en) 2018-08-24

Family

ID=63194558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810214547.XA Pending CN108446274A (en) 2018-03-15 2018-03-15 A keyword extraction method based on time-sensitive tf-idf

Country Status (1)

Country Link
CN (1) CN108446274A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831809A (en) * 2020-07-17 2020-10-27 北京首汽智行科技有限公司 Method for extracting keywords of question text
CN112287682A (en) * 2020-12-28 2021-01-29 北京智慧星光信息技术有限公司 Method, device and equipment for extracting subject term and storage medium
CN113468441A (en) * 2021-06-29 2021-10-01 平安信托有限责任公司 Search sorting method, device, equipment and storage medium based on weight adjustment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090299978A1 (en) * 2008-05-28 2009-12-03 Alexander Farfurnik Systems and methods for keyword and dynamic url search engine optimization
CN104615715A (en) * 2015-02-05 2015-05-13 北京航空航天大学 Social network event analyzing method and system based on geographic positions
CN104679738A (en) * 2013-11-27 2015-06-03 北京拓尔思信息技术股份有限公司 Method and device for mining Internet hot words
CN105488092A (en) * 2015-07-13 2016-04-13 中国科学院信息工程研究所 Time-sensitive self-adaptive on-line subtopic detecting method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090299978A1 (en) * 2008-05-28 2009-12-03 Alexander Farfurnik Systems and methods for keyword and dynamic url search engine optimization
CN104679738A (en) * 2013-11-27 2015-06-03 北京拓尔思信息技术股份有限公司 Method and device for mining Internet hot words
CN104615715A (en) * 2015-02-05 2015-05-13 北京航空航天大学 Social network event analyzing method and system based on geographic positions
CN105488092A (en) * 2015-07-13 2016-04-13 中国科学院信息工程研究所 Time-sensitive self-adaptive on-line subtopic detecting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
樊兆欣: "个性化新闻推荐系统关键技术研究与实现", 《中国优秀硕士学位论文全文数据库》 *

Similar Documents

Publication Publication Date Title
Christian et al. Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF)
CN109960724B (en) Text summarization method based on TF-IDF
Raulji et al. Stop-word removal algorithm and its implementation for Sanskrit language
Khreisat Arabic text classification using N-gram frequency statistics a comparative study
CN108763402B (en) Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary
CN102622338B (en) Computer-assisted computing method of semantic distance between short texts
Morabia et al. SEDTWik: segmentation-based event detection from tweets using Wikipedia
CN108363694B (en) Keyword extraction method and device
CN108170666A (en) A kind of improved method based on TF-IDF keyword extractions
CN108446274A (en) A kind of keyword extracting method based on time-sensitive tf-idf
Awajan Keyword extraction from Arabic documents using term equivalence classes
KR101377447B1 (en) Multi-document summarization method and system using semmantic analysis between tegs
Gunawan et al. Multi-document summarization by using textrank and maximal marginal relevance for text in Bahasa Indonesia
Fattah A novel statistical feature selection approach for text categorization
Yu et al. Towards high performance text mining: a TextRank-based method for automatic text summarization
Fodil et al. Theme classification of Arabic text: A statistical approach
Cai et al. Indonesian automatic text summarization based on a new clustering method in sentence level
Elrajubi An improved Arabic light stemmer
Pickard Comparing word2vec and GloVe for automatic measurement of MWE compositionality
Shim et al. A study on the effect of the document summarization technique on the fake news detection model
KR101120038B1 (en) Neologism selection apparatus and its method
Atwan et al. Impact of stemmer on arabic text retrieval
Long et al. Multi-document summarization by information distance
Yoon et al. On Temporally Sensitive Word Embeddings for News Information Retrieval.
Pramudita et al. Automatic Text Summarization of Madura Tourism Articles Using TF-IDF and K-Medoid Clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180824