CN108446274A - A keyword extraction method based on time-sensitive tf-idf - Google Patents
A keyword extraction method based on time-sensitive tf-idf
- Publication number
- CN108446274A (application number CN201810214547.XA)
- Authority
- CN
- China
- Prior art keywords
- time
- idf
- sensitive
- word
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a keyword extraction method based on time-sensitive tf-idf, belonging to the field of natural language processing research. The method first partitions the text data into documents according to its time information and performs word segmentation on the corpus. It computes the inverse document frequency of every word in the corpus and then computes a time-sensitive tf-idf score; a time decay coefficient is provided to adjust the speed of time decay. Finally, the words are sorted by time-sensitive tf-idf score in descending order and the top n words are output as keywords. By adjusting the time decay coefficient, the method can obtain keywords with different degrees of time sensitivity. Unlike the classic tf-idf algorithm, the method does not require a specific document to be designated, nor a strictly bounded time period. The adjustable decay speed also makes the algorithm more flexible and the resulting keywords more diverse.
Description
Technical field
The present invention relates to the field of natural language processing research, and in particular to a keyword extraction method based on time-sensitive tf-idf.
Background
tf-idf (Term Frequency-Inverse Document Frequency) (G. Salton and M. McGill, editors. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.) is a common text mining method. It rests on the assumption that the words most useful for distinguishing a document are those that occur frequently in that document but rarely in the other documents of the collection. For a given document, term frequency (tf) is the frequency with which a given word occurs in that document. It is the term count normalized so that it is not biased toward long documents. Inverse document frequency (idf) measures the general importance of a word. The idf of a word is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. Finally, the tf-idf score of a word in a given document is the product of its term frequency (tf) and its inverse document frequency (idf).
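The classic scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the patent: the function name `classic_tfidf` and the toy documents are hypothetical, and the texts are assumed to be already segmented into words.

```python
import math
from collections import Counter


def classic_tfidf(docs, target):
    """Score every word of docs[target] with classic tf-idf.

    docs   -- list of token lists (assumed already segmented)
    target -- index of the document to score
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    counts = Counter(docs[target])
    total = sum(counts.values())  # normalizes the raw term count
    return {
        w: (c / total) * math.log10(n_docs / df[w])
        for w, c in counts.items()
    }


# Hypothetical toy corpus of three documents.
docs = [
    ["spring", "festival", "travel"],
    ["spring", "rain"],
    ["stock", "market", "rain"],
]
scores = classic_tfidf(docs, 0)
```

Here "festival" occurs only in the first document, so it scores higher than "spring", which also occurs in the second. Note that a target document must be chosen, which is the limitation the method addresses.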
The tf-idf algorithm is widely used in keyword extraction (B. Lott, Survey of Keyword Extraction Techniques, UNM Education, 2012.) and information retrieval (J. Ramos. Using TF-IDF to Determine Word Relevance in Document Queries. Technical report, Department of Computer Science, Rutgers University, 2003.). Some studies modify the tf-idf algorithm to improve on the performance of the classic version. For example, Berger et al. proposed an adaptive tf-idf algorithm that combines gradient descent with tf-idf (Berger, A. et al. Bridging the Lexical Chasm: Statistical Approaches to Answer Finding. In Proc. Int. Conf. Research and Development in Information Retrieval, 192-199, 2000.), and Oren et al. combined genetic algorithms with tf-idf (Oren, Nir. Reexamining tf.idf based information retrieval with Genetic Programming. In Proceedings of SAICSIT, 1-10, 2002.).
Both the classic tf-idf algorithm and its improved variants require a specific document to be designated. Sometimes, however, we want the popular keywords around some point in time, such as the news hot spots around November 2017 or the trending microblog topics during the 2018 Spring Festival, where there is neither a strict time boundary nor a specific document. To meet this need, the present invention proposes a time-sensitive tf-idf algorithm for extracting the keywords around a time period whose boundaries are not clear-cut.
Summary of the invention
To address the limitations of the classic tf-idf algorithm, namely that it requires a specific document to be designated and does not take time into account, the present invention provides a keyword extraction method based on time-sensitive tf-idf.
The method comprises the following steps:
(1) Time-based document partitioning and idf calculation: partition the text data into documents according to its time information, perform word segmentation on the corpus, and compute the inverse document frequency $idf_i$ of each word $w_i$ in the corpus;
(2) Time-sensitive tf-idf calculation: add a time decay factor to the classic tf-idf algorithm. Specifically, a time window is applied when computing term weights, so that the farther a text is from the current time point, the smaller its weight; a time decay coefficient is also provided to adjust the speed of time decay. Compute the time-sensitive tf-idf of word $w_i$ in time period $t_j$;
(3) Top-N keyword extraction: sort the words by time-sensitive tf-idf score in descending order and output the top n words as keywords.
In step (1), the unit of the time information is one of second, minute, hour, month, or year.
In step (1),

$$idf_i = \log_{10} \frac{|D|}{|\{n : w_i \in d_n\}|}$$

where $|D|$ is the total number of documents in the corpus, $d_n$ is the document composed of all texts in unit time $t_n$, and $|\{n : w_i \in d_n\}|$ is the number of documents containing word $w_i$.
In step (2), the time-sensitive tf-idf of word $w_i$ in time period $t_j$, denoted $tfidf'_{i,j}$, is calculated as follows:
where $\lambda$ is the time decay coefficient, $m$ is the time range that influences time period $t_j$, $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, and $n_{i,j}$ is the number of times word $w_i$ occurs in the texts of time period $t_j$.
By adjusting the value of the time decay coefficient, the method can obtain keywords with different degrees of time sensitivity.
The beneficial effects of the above technical solution are as follows:
The method can extract time-related keywords, such as time-sensitive keywords and the popular vocabulary around some point in time. Unlike the classic tf-idf algorithm, it does not require a specific document to be designated, nor a strictly bounded time period. In addition, the speed of time decay can be set, which makes the algorithm more flexible and the resulting keywords more diverse.
Brief description of the drawings
Fig. 1 is the flow chart of the keyword extraction method based on time-sensitive tf-idf of the present invention.
Detailed description
To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the drawing and a specific embodiment.
The present invention provides a keyword extraction method based on time-sensitive tf-idf.
Fig. 1 shows the flow chart of the method. The input is text data with time information together with a time period; the output is the keywords around that period, i.e. the top n words after sorting by time-sensitive tf-idf score in descending order.
The steps of a specific implementation are as follows:
A. Time-based document partitioning and idf calculation. First, the text data is partitioned into time-based documents according to its time information. The time unit can be second, minute, hour, month, year, and so on. All texts within each unit of time are treated as one document; for example, the microblog texts of a given day form one document. These documents constitute a corpus. Then, word segmentation is performed on the corpus and the inverse document frequency (idf) of every word in the corpus is computed. Many classic tf-idf implementations compute idf from a ready-made corpus rather than from a corpus built from the analyzed documents, which is advisable when the analyzed document collection is small. Because the vocabulary used differs across scenarios and domains, the resulting idf values differ greatly, so here idf is computed from the analyzed documents themselves, borrowing the definition of inverse document frequency from the classic tf-idf algorithm. Specifically, word segmentation is performed on the corpus, and for each word $w_i$ in the segmentation result the inverse document frequency $idf_i$ is computed:

$$idf_i = \log_{10} \frac{|D|}{|\{n : w_i \in d_n\}|}$$

where $|D|$ is the total number of documents in the corpus, $d_n$ is the document composed of all texts in unit time $t_n$, and $|\{n : w_i \in d_n\}|$ is the number of documents containing word $w_i$.
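Step A can be sketched as follows, assuming a unit time of one day and texts that have already been segmented into words. The function names `bin_by_day` and `idf_per_word` and the sample records are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict
from datetime import datetime


def bin_by_day(records):
    """Merge all texts of the same calendar day into one document.

    records -- iterable of (datetime, token list) pairs
    Returns {date: token list}; each value plays the role of d_n.
    """
    docs = defaultdict(list)
    for timestamp, tokens in records:
        docs[timestamp.date()].extend(tokens)
    return dict(docs)


def idf_per_word(docs):
    """Compute idf_i = log10(|D| / |{n : w_i in d_n}|) for every word."""
    n_docs = len(docs)
    df = defaultdict(int)
    for tokens in docs.values():
        for word in set(tokens):
            df[word] += 1
    return {word: math.log10(n_docs / n) for word, n in df.items()}


# Hypothetical records: two texts on one day, one on the next.
records = [
    (datetime(2018, 2, 15, 9), ["spring", "festival"]),
    (datetime(2018, 2, 15, 20), ["gala", "festival"]),
    (datetime(2018, 2, 16, 8), ["festival", "travel"]),
]
docs = bin_by_day(records)
idf = idf_per_word(docs)
```

A word that appears in every daily document, such as "festival" here, gets an idf of zero, while a word confined to one day gets the maximum value for this corpus.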
B. Time-sensitive tf-idf calculation. First, borrowing the definition of term frequency from the classic tf-idf algorithm, the term frequency $tf_{i,j}$ is computed for each word $w_i$ that occurs in the document of time period $t_j$. Here $t_j$ can be a single unit of time or a combination of units; for example, if the unit of time is a day, $t_j$ can be a month or a year.

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of times word $w_i$ occurs in the texts of time period $t_j$, and the denominator is the total number of occurrences of all words in the texts of time period $t_j$.
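The term frequency of a period that spans several unit-time documents can be sketched like this. It is an illustrative sketch only: the function name `term_frequency` is hypothetical, and unit times are represented by plain integer keys for brevity.

```python
from collections import Counter


def term_frequency(docs, units):
    """Compute tf_{i,j} for a period t_j made of one or more unit times.

    docs  -- {unit time: token list}
    units -- the unit times that together form period t_j
    """
    counts = Counter()
    for unit in units:
        counts.update(docs.get(unit, []))
    total = sum(counts.values())  # occurrences of all words in t_j
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Hypothetical documents keyed by day number; t_j covers days 1 and 2.
docs = {1: ["a", "b", "a"], 2: ["a", "c"], 3: ["d"]}
tf = term_frequency(docs, [1, 2])
```

With these inputs "a" accounts for three of the five word occurrences in the period, so its term frequency is 0.6.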
The time-sensitive tf-idf of word $w_i$ in time period $t_j$, denoted $tfidf'_{i,j}$, is then calculated as follows:
where $\lambda$ is the time decay coefficient. The larger $\lambda$ is, the slower the decay and the more the texts near time period $t_j$ influence the result; conversely, the smaller $\lambda$ is, the faster the decay, and the keywords around time period $t_j$ depend mainly on the texts of $t_j$ itself. $m$ is the time range that influences time period $t_j$.
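The exact weighting formula is not recoverable from the text above, so the following sketch substitutes an assumed one: each unit time within $m$ steps of $t_j$ contributes its tf-idf weighted by $\exp(-|k|/\lambda)$, where $k$ is the offset from $t_j$. This kernel is an assumption chosen only because it matches the stated behavior (larger $\lambda$ gives slower decay); the function name and sample values are likewise hypothetical.

```python
import math
from collections import Counter


def time_sensitive_tfidf(docs, idf, j, lam=1.0, m=5):
    """Illustrative time-decayed tf-idf around unit time j.

    docs -- {integer unit time: token list}
    idf  -- {word: idf value}
    lam  -- time decay coefficient (larger means slower decay)
    m    -- number of unit times on each side of j that are considered

    ASSUMPTION: the weight exp(-|k| / lam) is a stand-in kernel,
    not the formula from the patent.
    """
    scores = Counter()
    for k in range(-m, m + 1):
        tokens = docs.get(j + k)
        if not tokens:
            continue
        weight = math.exp(-abs(k) / lam)
        counts = Counter(tokens)
        total = sum(counts.values())
        for word, n in counts.items():
            scores[word] += weight * (n / total) * idf.get(word, 0.0)
    return dict(scores)


# Hypothetical data: three consecutive unit times and made-up idf values.
docs = {0: ["a", "b"], 1: ["a", "c"], 2: ["c", "c"]}
idf = {"a": 0.2, "b": 0.5, "c": 0.2}
fast = time_sensitive_tfidf(docs, idf, j=1, lam=1.0, m=1)
slow = time_sensitive_tfidf(docs, idf, j=1, lam=5.0, m=1)
```

Under this assumed kernel, word "b" occurs only in a neighboring unit time, so its score is higher with the slower decay (`slow`) than with the faster one (`fast`), and it disappears entirely when `m=0`.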
C. Top-N keyword extraction. Set the values of $\lambda$ and $m$; usually $\lambda$ is set to 1 and $m$ to 5. Sort the words by time-sensitive tf-idf score in descending order and output the top n words as keywords. The decay speed $\lambda$ and the time window length $m$ are adjusted according to the requirements. For example, to extract the hot topics during the Spring Festival, since the Spring Festival influences people over a long span, time period $t_j$ can be set to the holiday from Lunar New Year's Eve to the sixth day of the first lunar month, the time window length $m$ can be set relatively large, covering the span from the Little New Year to the Lantern Festival, and the time decay coefficient $\lambda$ can be set relatively large so that the decay is slow. Conversely, if an event's influence is short-lived, $\lambda$ and $m$ can be set smaller. The method can therefore adjust the values of $\lambda$ and $m$ to specific requirements, which makes the algorithm more flexible and the resulting keywords more diverse.
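The final ranking step is a plain descending sort on the scores. A minimal sketch, assuming the scores are already available as a dictionary; the function name `top_n_keywords` and the score values are made up for illustration.

```python
import heapq


def top_n_keywords(scores, n):
    """Return the n words with the highest scores, best first."""
    best = heapq.nlargest(n, scores.items(), key=lambda item: item[1])
    return [word for word, _ in best]


# Hypothetical time-sensitive tf-idf scores.
scores = {"festival": 0.42, "travel": 0.31, "rain": 0.05, "gala": 0.27}
keywords = top_n_keywords(scores, 2)  # ["festival", "travel"]
```

`heapq.nlargest` avoids sorting the whole vocabulary when n is small; a full `sorted(..., reverse=True)` followed by slicing would give the same result.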
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (4)
1. A keyword extraction method based on time-sensitive tf-idf, characterized by comprising the following steps:
(1) Time-based document partitioning and idf calculation: partition the text data into documents according to its time information, perform word segmentation on the corpus, and compute the inverse document frequency $idf_i$ of each word $w_i$ in the corpus;
(2) Time-sensitive tf-idf calculation: add a time decay factor to the classic tf-idf algorithm, provide a time decay coefficient to adjust the speed of time decay, and compute the time-sensitive tf-idf of word $w_i$ in time period $t_j$;
(3) Top-N keyword extraction: sort the words by time-sensitive tf-idf score in descending order and output the top n words as keywords.
2. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that in step (1) the unit of the time information is one of second, minute, hour, month, or year.
3. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that in step (1)

$$idf_i = \log_{10} \frac{|D|}{|\{n : w_i \in d_n\}|}$$

where $|D|$ is the total number of documents in the corpus, $d_n$ is the document composed of all texts in unit time $t_n$, and $|\{n : w_i \in d_n\}|$ is the number of documents containing word $w_i$.
4. The keyword extraction method based on time-sensitive tf-idf according to claim 1, characterized in that in step (2) the time-sensitive tf-idf of word $w_i$ in time period $t_j$, denoted $tfidf'_{i,j}$, is calculated as follows:
where $\lambda$ is the time decay coefficient, $m$ is the time range that influences time period $t_j$, $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, and $n_{i,j}$ is the number of times word $w_i$ occurs in the texts of time period $t_j$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810214547.XA CN108446274A (en) | 2018-03-15 | 2018-03-15 | A kind of keyword extracting method based on time-sensitive tf-idf |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810214547.XA CN108446274A (en) | 2018-03-15 | 2018-03-15 | A kind of keyword extracting method based on time-sensitive tf-idf |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108446274A true CN108446274A (en) | 2018-08-24 |
Family
ID=63194558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810214547.XA Pending CN108446274A (en) | 2018-03-15 | 2018-03-15 | A kind of keyword extracting method based on time-sensitive tf-idf |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108446274A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111831809A (en) * | 2020-07-17 | 2020-10-27 | 北京首汽智行科技有限公司 | Method for extracting keywords of question text |
CN112287682A (en) * | 2020-12-28 | 2021-01-29 | 北京智慧星光信息技术有限公司 | Method, device and equipment for extracting subject term and storage medium |
CN113468441A (en) * | 2021-06-29 | 2021-10-01 | 平安信托有限责任公司 | Search sorting method, device, equipment and storage medium based on weight adjustment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090299978A1 (en) * | 2008-05-28 | 2009-12-03 | Alexander Farfurnik | Systems and methods for keyword and dynamic url search engine optimization |
CN104615715A (en) * | 2015-02-05 | 2015-05-13 | 北京航空航天大学 | Social network event analyzing method and system based on geographic positions |
CN104679738A (en) * | 2013-11-27 | 2015-06-03 | 北京拓尔思信息技术股份有限公司 | Method and device for mining Internet hot words |
CN105488092A (en) * | 2015-07-13 | 2016-04-13 | 中国科学院信息工程研究所 | Time-sensitive self-adaptive on-line subtopic detecting method and system |
Non-Patent Citations (1)
Title |
---|
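Fan Zhaoxin (樊兆欣): "Research and Implementation of Key Technologies for a Personalized News Recommendation System", China Master's Theses Full-text Database * |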
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Christian et al. | Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF) | |
CN109960724B (en) | Text summarization method based on TF-IDF | |
Raulji et al. | Stop-word removal algorithm and its implementation for Sanskrit language | |
Khreisat | Arabic text classification using N-gram frequency statistics a comparative study | |
CN108763402B (en) | Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary | |
CN102622338B (en) | Computer-assisted computing method of semantic distance between short texts | |
Morabia et al. | SEDTWik: segmentation-based event detection from tweets using Wikipedia | |
CN108363694B (en) | Keyword extraction method and device | |
CN108170666A (en) | A kind of improved method based on TF-IDF keyword extractions | |
CN108446274A (en) | A kind of keyword extracting method based on time-sensitive tf-idf | |
Awajan | Keyword extraction from Arabic documents using term equivalence classes | |
KR101377447B1 (en) | Multi-document summarization method and system using semmantic analysis between tegs | |
Gunawan et al. | Multi-document summarization by using textrank and maximal marginal relevance for text in Bahasa Indonesia | |
Fattah | A novel statistical feature selection approach for text categorization | |
Yu et al. | Towards high performance text mining: a TextRank-based method for automatic text summarization | |
Fodil et al. | Theme classification of Arabic text: A statistical approach | |
Cai et al. | Indonesian automatic text summarization based on a new clustering method in sentence level | |
Elrajubi | An improved Arabic light stemmer | |
Pickard | Comparing word2vec and GloVe for automatic measurement of MWE compositionality | |
Shim et al. | A study on the effect of the document summarization technique on the fake news detection model | |
KR101120038B1 (en) | Neologism selection apparatus and its method | |
Atwan et al. | Impact of stemmer on arabic text retrieval | |
Long et al. | Multi-document summarization by information distance | |
Yoon et al. | On Temporally Sensitive Word Embeddings for News Information Retrieval. | |
Pramudita et al. | Automatic Text Summarization of Madura Tourism Articles Using TF-IDF and K-Medoid Clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180824 |