CN108170666A - An improved keyword extraction method based on TF-IDF - Google Patents

An improved keyword extraction method based on TF-IDF

Info

Publication number
CN108170666A
Authority
CN
China
Prior art keywords
text
word
idf
represent
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711229728.1A
Other languages
Chinese (zh)
Inventor
向阳
郑惺
张默涵
赵雨晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201711229728.1A
Publication of CN108170666A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an improved keyword extraction method based on TF-IDF (term frequency-inverse document frequency), comprising the following steps: counting the number of times each word occurs in each text of a document set; computing the weight of each word using an improved TF-IDF formula; and sorting the words in descending order of weight, the ranking serving as the basis for retrieving the keywords of a text. Compared with the prior art, the present invention has the advantages of distinguishing between words of different parts of speech and of optimizing the keyword ranking in favor of keywords that actually represent the features of a text.

Description

An improved keyword extraction method based on TF-IDF
Technical field
The present invention relates to natural language processing methods, and more particularly to an improved keyword extraction method based on TF-IDF.
Background
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.
Keyword extraction is a natural language processing technique for identifying meaningful and representative segments or words. It is one of the foundations and core components of natural language processing: most techniques for processing unstructured text, such as text summarization, text classification, text clustering and automatic question answering, rely on it to improve precision. With the development of the Internet, online resources have become increasingly abundant, yet most articles lack keyword tags, and tagging keywords manually is time-consuming, laborious and highly subjective. Automatic keyword extraction from text is therefore of great importance.
This method performs keyword extraction on short-text corpora. Because short texts are limited in length, each usually has only one or two keywords. For this reason, the method takes an unsupervised approach: among unsupervised techniques such as those based on document frequency (DF) or term frequency (TF), it uses an improved TF-IDF to extract keywords.
Summary of the invention
The object of the present invention is to overcome the above drawbacks of the prior art by providing an improved keyword extraction method based on TF-IDF.
The object of the present invention can be achieved through the following technical solution:
An improved keyword extraction method based on TF-IDF, the method comprising the following steps:
S1: count the number of times each word occurs in each text of the text collection;
S2: compute the weight of each word using the improved TF-IDF formula;
S3: sort the words in descending order of weight, and use the ranking as the basis for retrieving the keywords of a text.
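Step S1 can be sketched in Python as below. This is a minimal illustration rather than the patented implementation: it assumes the texts have already been segmented into words (the source does not name a tokenizer), and the function name `count_terms` and the sample texts are hypothetical. The same pass also yields two corpus-level quantities: the total occurrences of each word and the number of texts that contain it.

```python
from collections import Counter

def count_terms(texts):
    """S1: count how often each word occurs in each tokenized text."""
    return [Counter(tokens) for tokens in texts]

# Hypothetical pre-tokenized short texts; word segmentation is assumed done.
texts = [["bank", "loan", "rate"], ["bank", "bank", "card"]]
per_text_counts = count_terms(texts)

total_counts = Counter()   # occurrences of each word across all texts
doc_counts = Counter()     # number of texts that contain each word
for counts in per_text_counts:
    total_counts.update(counts)
    doc_counts.update(counts.keys())

print(per_text_counts[1]["bank"], total_counts["bank"], doc_counts["bank"])  # 2 3 2
```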
Preferably, the improved TF-IDF formula is:
TF-IDF_i = (k_{i1} + k_{i2}) * tf_{ik} * log((|D| / df_i) * tf_i)
where k_{i1} is the part-of-speech importance coefficient of feature t_i in text d_k, k_{i2} is the domain-specific term importance coefficient of feature t_i in text d_k, tf_{ik} is the number of times feature t_i occurs in text d_k, tf_i is the number of times feature t_i occurs in all texts, |D| is the number of texts, and df_i is the number of texts that contain feature t_i.
Preferably, D denotes the text set, D = {d_1, d_2, …, d_k, …}, and d_k denotes the k-th text in D.
Preferably, t_i denotes the i-th feature word in T (i ∈ {1, 2, …, |T|}), |T| denotes the number of words, and T denotes the set of words in the texts:
T = {t_1, t_2, …, t_i, …}.
Preferably, df_i / |D| is the probability that a text selected from the text set contains feature word t_i, and idf_i denotes the inverse document frequency derived from df_i:
idf_i = log((|D| / df_i) * tf_i).
Preferably, for a word i: if i is a noun, k_{i1} = (number of nouns in all documents) / (number of words in all documents); if i is a verb, k_{i1} = (number of verbs in all documents) / (number of words in all documents); if i is an adjective, k_{i1} = (number of adjectives in all documents) / (number of words in all documents); if i is any word other than a noun, verb or adjective, k_{i1} = 0.
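The part-of-speech coefficient described above can be sketched as follows. The tag names (`noun`, `verb`, `adj`, `other`), the function name and the sample data are assumptions for illustration; the source does not specify a tagger, nor whether the counts are over word occurrences or distinct words (occurrences are assumed here).

```python
from collections import Counter

# Part-of-speech classes that receive a non-zero coefficient (assumed tag names).
CONTENT_POS = {"noun", "verb", "adj"}

def pos_coefficients(tagged_words):
    """Map each part-of-speech class to k_i1 = class count / total word count."""
    total = len(tagged_words)
    counts = Counter(pos for _, pos in tagged_words)
    return {pos: (counts[pos] / total if pos in CONTENT_POS else 0.0)
            for pos in counts}

# Hypothetical tagged words gathered from all documents.
tagged = [("bank", "noun"), ("lend", "verb"), ("loan", "noun"), ("very", "other")]
coeffs = pos_coefficients(tagged)
print(coeffs["noun"], coeffs["verb"], coeffs["other"])  # 0.5 0.25 0.0
```

Because the coefficient is a share of the corpus, it depends on the composition of the whole collection and has to be recomputed when documents are added.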
Compared with the prior art, the present invention has the following advantages:
1. It distinguishes between words of different parts of speech: the keyword extraction process introduces a part-of-speech importance coefficient and a domain-specific term importance coefficient, so that words of different parts of speech can be weighted according to their importance.
2. It takes into account keywords that actually represent the features of a text and optimizes the keyword ranking accordingly: the number of times feature t_i occurs in all texts is used to correct the keyword weight, so that a word which represents the features of the text well despite a low IDF value is moved toward the front of the ranking, making the keyword extraction result more accurate.
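The second advantage can be illustrated with a small sketch. Assuming base-10 logarithms and reading the logarithm argument as (|D| / df_i) * tf_i, a word that occurs in every text has a traditional IDF of zero, whereas the modified IDF stays positive when the word occurs often overall. The function names and the numbers are hypothetical.

```python
import math

def idf_traditional(n_docs, df):
    """Traditional IDF: log10(|D| / df_i)."""
    return math.log10(n_docs / df)

def idf_modified(n_docs, df, tf_total):
    """Modified IDF: log10((|D| / df_i) * tf_i)."""
    return math.log10(n_docs / df * tf_total)

# Hypothetical word found in all 8 texts, with 20 occurrences in total.
low = idf_traditional(8, 8)      # log10(1) = 0, so the word's weight vanishes
kept = idf_modified(8, 8, 20)    # log10(20), so the word keeps a positive factor
print(low, round(kept, 3))       # 0.0 1.301
```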
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawing. The described embodiments are evidently only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment:
As shown in Fig. 1, the improved keyword extraction method based on TF-IDF of the present invention comprises the following steps:
S1: count the number of times each word occurs in each document of the document set;
S2: compute the weight of each word using the improved TF-IDF formula;
S3: sort the words in descending order of weight, and use the ranking as the basis for keyword retrieval.
For short-text corpora, this method first takes into account that the different parts of a text differ in importance. In general, verbs, nouns and adjectives are the main components of a sentence and are also the ones that matter for keyword extraction; numerals, pronouns and the like merely modify and complete the sentence, and contribute almost nothing to classifying it. The method therefore assigns a different importance coefficient k_{i1} to words of different parts of speech, with the coefficients of nouns, verbs and adjectives greater than those of other parts of speech such as adverbs and pronouns.
Second, the method takes into account the important role that domain-specific keywords play in a text, and assigns an importance coefficient k_{i2} to important domain terms.
The method also addresses a shortcoming of the TF-IDF method. The main idea of TF-IDF is that if a word or phrase occurs frequently in one article (high TF) and rarely in other articles (high IDF), it is considered to have good power to discriminate between categories and is suitable as one of the keywords of that text. However, some words have a low IDF value and yet still represent the features of a text well. The method therefore makes the following improvement:
TF-IDF_i = (k_{i1} + k_{i2}) * tf_{ik} * log((|D| / df_i) * tf_i)
The TF-IDF retrieval model. The main idea of the TF-IDF model is that if a word w occurs frequently in a document d and rarely in other documents, then w is considered to have good discriminating power and is suitable for distinguishing d from other articles. The model is formally defined as follows.
T denotes the set of words in the texts. A feature word may be a word or a phrase. |T| denotes the number of words. t_i denotes the i-th feature word in T (i ∈ {1, 2, …, |T|}).
T = {t_1, t_2, …, t_i, …}
D denotes the text set. |D| denotes the number of texts. d_k denotes the k-th text in D (k ∈ {1, 2, …, |D|}).
D = {d_1, d_2, …, d_k, …}
tf_{ik} denotes the number of times feature t_i occurs in text d_k. It is a local parameter specific to a single text.
tf_i denotes the number of times feature t_i occurs in all texts.
df_i denotes the number of texts that contain feature t_i.
idf_i denotes the inverse document frequency derived from df_i. df_i / |D| is the probability that a text selected from the text set contains feature word t_i.
idf_i = log((|D| / df_i) * tf_i)
k_{i1} denotes the part-of-speech importance coefficient of feature t_i in text d_k.
k_{i2} denotes the domain-specific term importance coefficient of feature t_i in text d_k. It is a user-definable parameter.
Setting k_{i1} increases the importance of nouns, verbs and adjectives relative to words of other parts of speech, while k_{i2} highlights the importance of domain-specific keywords.
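Putting the definitions above together, the improved weight might be computed as in the following sketch. It assumes base-10 logarithms (the base used in the worked example of this document) and reads the logarithm argument as (|D| / df_i) * tf_i; the function name, parameter names and sample values are hypothetical.

```python
import math

def improved_tfidf(k_pos, k_domain, tf_in_text, tf_total, df, n_docs):
    """(k_i1 + k_i2) * tf_ik * log10((|D| / df_i) * tf_i)."""
    if df <= 0 or tf_total <= 0:
        raise ValueError("the word must occur in at least one text")
    return (k_pos + k_domain) * tf_in_text * math.log10(n_docs / df * tf_total)

# Hypothetical noun (k_i1 = 0.5) that is also a domain term (k_i2 = 1).
weight = improved_tfidf(k_pos=0.5, k_domain=1.0, tf_in_text=3,
                        tf_total=12, df=4, n_docs=8)
# A word with k_i1 + k_i2 = 0 (for example a pronoun) always gets weight 0.
zero = improved_tfidf(0.0, 0.0, 3, 12, 4, 8)
print(round(weight, 3), zero)  # 6.211 0.0
```

Because the two coefficients are added rather than multiplied, a domain term with k_i1 = 0 still receives a non-zero weight.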
The new formula differs from the traditional TF-IDF calculation in two ways. First, it introduces the part-of-speech importance coefficient and the domain-specific term importance coefficient, so that words of different parts of speech can be weighted according to their importance. Second, it changes the traditional IDF calculation by adding tf_i, which corrects the weights of certain keywords.
Within the text collection D, consider how each word is distributed across the documents of each category. If a word is evenly distributed across the documents of every category, it is not representative and cannot serve as a keyword for any one category. Conversely, if a word occurs only in the documents of one category and rarely or never in the documents of other categories, it can serve as a keyword for that category. Traditional TF-IDF does not handle this practical situation well, so the following example is provided to illustrate it.
Parts of speech are divided into four classes: noun, verb, adjective and other. For a word i belonging to one of the first three classes, k_{i1} is the share of that class among all words of the four classes in all documents; for example, if i is a noun, k_{i1} = (number of nouns in all documents) / (number of words of all four classes in all documents). For a word of the last class (any word other than a noun, verb or adjective, such as an adverb or pronoun), k_{i1} = 0.
In addition, a word list containing the proper nouns of the domain is prepared before the experiment and is used to compute k_{i2}: k_{i2} = 1 for a word in the list, and k_{i2} = 0 otherwise.
Suppose there is a document collection D containing 8 documents, divided into 2 categories of 4 documents each, from which 3 words have been extracted as keyword candidates. Table 1 shows the occurrences of each candidate word in each document, together with the category of each document. Table 2 shows the weight (candidate score) computed for each candidate word by traditional TF-IDF. Table 3 shows the weight (candidate score) computed for each candidate word by the improved TF-IDF.
Table 1
Table 2
Candidate word    Weight
T1                3.010
T2                0.986
T3                1.250
Table 3
Candidate word    Weight
T1                13.010
T2                15.351
T3                11.250
1. Count the number of occurrences of every word in each document of the document collection D; the result is shown in Table 1. (In this example the 8 documents contain only the 3 words T1, T2 and T3; documents 1, 2, 3 and 4 belong to category 1, and the other documents belong to category 2.)
2. Compute the weight of each candidate word according to the traditional TF-IDF formula (the keyword weights are computed over the entire document collection).
Remark 1: TF-IDF = tf_{ik} * log(|D| / df_i)
Remark 2: For T1, for example, TF-IDF = (2+3+2+3+0+0+0+0) * log(8/4), where 8 is the total number of documents and 4 is the number of documents containing T1.
Remark 3: The logarithm base used in this example is 10.
3. Compute the weight of each candidate word according to the improved TF-IDF formula.
Remark 1: TF-IDF = (k_{i1} + k_{i2}) * tf_{ik} * log((|D| / df_i) * tf_i)
Remark 2: For T1, for example, TF-IDF = (2+3+2+3+0+0+0+0) * log((8/4) * (2+3+2+3+0+0+0+0)), where 8 is the total number of documents and 4 is the number of documents containing T1.
Remark 3: The logarithm base used in this example is 10.
Remark 4: In this example, k_{i1} + k_{i2} = 1.
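The two calculations for T1 can be checked numerically. The sketch below uses only the numbers given in the remarks (per-document counts 2, 3, 2, 3, 0, 0, 0, 0; base-10 logarithm; coefficient sum of 1); the variable names are illustrative. As in the remarks, the term-frequency factor is the sum of the counts over all documents.

```python
import math

t1_counts = [2, 3, 2, 3, 0, 0, 0, 0]        # occurrences of T1 in documents 1..8
n_docs = len(t1_counts)                      # |D| = 8
df = sum(1 for c in t1_counts if c > 0)      # 4 documents contain T1
tf = sum(t1_counts)                          # 10 occurrences in total
k = 1                                        # k_i1 + k_i2 in this example

traditional = tf * math.log10(n_docs / df)
improved = k * tf * math.log10(n_docs / df * tf)

print(f"{traditional:.3f} {improved:.3f}")   # 3.010 13.010
```

These values agree with the T1 entries in Table 2 and Table 3. Since log(a * b) = log a + log b, the improved weight here equals the traditional weight plus tf * log10(tf), so the correction grows with a word's total frequency.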
Analysis of results:
As can be seen from Table 1, T1 occurs only in category 1 and can therefore serve as a keyword representing category 1. T2 is evenly distributed across the two categories and is not suited to being a keyword. T3 occurs frequently in category 2 and only occasionally in category 1, so it has some ability to serve as a keyword. In domain-specific short texts, a specialized term of one category may occasionally appear in other categories and yet remain a good representative of its own category, so it can still serve as a keyword of that category.
Comparing Table 2 with Table 3 and sorting by descending weight, the traditional TF-IDF method yields T1 > T3 > T2, which does not fully meet the requirement, whereas the improved TF-IDF method yields T2 > T1 > T3, which both meets the requirement and achieves what the traditional method achieves.
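The two orderings can be reproduced by sorting the weights listed in Table 2 and Table 3, which is all that step S3 requires. A minimal sketch; the dictionary and function names are illustrative.

```python
traditional = {"T1": 3.010, "T2": 0.986, "T3": 1.250}     # weights from Table 2
improved = {"T1": 13.010, "T2": 15.351, "T3": 11.250}     # weights from Table 3

def rank(weights):
    """S3: candidate words sorted by descending weight."""
    return sorted(weights, key=weights.get, reverse=True)

print(" > ".join(rank(traditional)))   # T1 > T3 > T2
print(" > ".join(rank(improved)))      # T2 > T1 > T3
```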
The above is merely a specific embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any person skilled in the art can readily conceive of equivalent modifications or substitutions within the technical scope disclosed herein, and such modifications or substitutions shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.

Claims (6)

1. An improved keyword extraction method based on TF-IDF, characterized in that the method comprises the following steps:
S1: counting the number of times each word occurs in each text of the text collection;
S2: computing the weight of each word using an improved TF-IDF formula;
S3: sorting the words in descending order of weight, and using the ranking as the basis for retrieving the keywords of a text.
2. The improved keyword extraction method based on TF-IDF according to claim 1, characterized in that the improved TF-IDF formula is:
TF-IDF_i = (k_{i1} + k_{i2}) * tf_{ik} * log((|D| / df_i) * tf_i)
where k_{i1} is the part-of-speech importance coefficient of feature t_i in text d_k, k_{i2} is the domain-specific term importance coefficient of feature t_i in text d_k, tf_{ik} is the number of times feature t_i occurs in text d_k, tf_i is the number of times feature t_i occurs in all texts, |D| is the number of texts, and df_i is the number of texts that contain feature t_i.
3. The improved keyword extraction method based on TF-IDF according to claim 2, characterized in that D denotes the text set, D = {d_1, d_2, …, d_k, …}, and d_k denotes the k-th text in D.
4. The improved keyword extraction method based on TF-IDF according to claim 2, characterized in that t_i denotes the i-th feature word in T (i ∈ {1, 2, …, |T|}), |T| denotes the number of words, and T denotes the set of words in the texts:
T = {t_1, t_2, …, t_i, …}.
5. The improved keyword extraction method based on TF-IDF according to claim 1, characterized in that df_i / |D| is the probability that a text selected from the text set contains feature word t_i, and idf_i denotes the inverse document frequency derived from df_i:
idf_i = log((|D| / df_i) * tf_i).
6. The improved keyword extraction method based on TF-IDF according to claim 1, characterized in that, for a word i: if i is a noun, k_{i1} = (number of nouns in all documents) / (number of words in all documents); if i is a verb, k_{i1} = (number of verbs in all documents) / (number of words in all documents); if i is an adjective, k_{i1} = (number of adjectives in all documents) / (number of words in all documents); if i is any word other than a noun, verb or adjective, k_{i1} = 0.
CN201711229728.1A 2017-11-29 2017-11-29 A kind of improved method based on TF-IDF keyword extractions Pending CN108170666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711229728.1A CN108170666A (en) 2017-11-29 2017-11-29 A kind of improved method based on TF-IDF keyword extractions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711229728.1A CN108170666A (en) 2017-11-29 2017-11-29 A kind of improved method based on TF-IDF keyword extractions

Publications (1)

Publication Number Publication Date
CN108170666A true CN108170666A (en) 2018-06-15

Family

ID=62524216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711229728.1A Pending CN108170666A (en) 2017-11-29 2017-11-29 A kind of improved method based on TF-IDF keyword extractions

Country Status (1)

Country Link
CN (1) CN108170666A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062895A (en) * 2018-07-23 2018-12-21 挖财网络技术有限公司 A kind of intelligent semantic processing method
CN109977397A (en) * 2019-02-18 2019-07-05 广州市诚毅科技软件开发有限公司 Hot news extracting method, system and storage medium based on part of speech combination
CN111209372A (en) * 2020-01-02 2020-05-29 北京字节跳动网络技术有限公司 Keyword determination method and device, electronic equipment and storage medium
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN112597760A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Method and device for extracting domain words in document
CN113270092A (en) * 2021-05-11 2021-08-17 云南电网有限责任公司 Scheduling voice keyword extraction method based on LDA algorithm
CN113392637A (en) * 2021-06-24 2021-09-14 青岛科技大学 TF-IDF-based subject term extraction method, device, equipment and storage medium
CN113434666A (en) * 2021-04-06 2021-09-24 西安理工大学 Keyword relevance analysis method
US11842160B2 (en) 2021-07-14 2023-12-12 International Business Machines Corporation Keyword extraction with frequency—inverse document frequency method for word embedding

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130185308A1 (en) * 2012-01-13 2013-07-18 International Business Machines Corporation System and method for extraction of off-topic part from conversation
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN107122382A (en) * 2017-02-16 2017-09-01 江苏大学 A kind of patent classification method based on specification
CN107145476A (en) * 2017-05-23 2017-09-08 福建师范大学 One kind is based on improvement TF IDF keyword extraction algorithms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐冬冬 等: "一种基于类别描述的TF-IDF特征选择方法的改进", 《现代图书情报技术》 *
金镇晟: "基于改进的TF-IDF算法的中文微博话题检测与研究", 《中国优秀硕士学位论文全文数据库》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062895B (en) * 2018-07-23 2022-06-24 挖财网络技术有限公司 Intelligent semantic processing method
CN109062895A (en) * 2018-07-23 2018-12-21 挖财网络技术有限公司 A kind of intelligent semantic processing method
CN109977397A (en) * 2019-02-18 2019-07-05 广州市诚毅科技软件开发有限公司 Hot news extracting method, system and storage medium based on part of speech combination
CN109977397B (en) * 2019-02-18 2022-11-15 广州市诚毅科技软件开发有限公司 News hotspot extracting method, system and storage medium based on part-of-speech combination
CN111209372A (en) * 2020-01-02 2020-05-29 北京字节跳动网络技术有限公司 Keyword determination method and device, electronic equipment and storage medium
CN111209372B (en) * 2020-01-02 2021-08-17 北京字节跳动网络技术有限公司 Keyword determination method and device, electronic equipment and storage medium
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN111753547B (en) * 2020-06-30 2024-02-27 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN112597760A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Method and device for extracting domain words in document
CN113434666A (en) * 2021-04-06 2021-09-24 西安理工大学 Keyword relevance analysis method
CN113270092A (en) * 2021-05-11 2021-08-17 云南电网有限责任公司 Scheduling voice keyword extraction method based on LDA algorithm
CN113392637A (en) * 2021-06-24 2021-09-14 青岛科技大学 TF-IDF-based subject term extraction method, device, equipment and storage medium
CN113392637B (en) * 2021-06-24 2023-02-07 青岛科技大学 TF-IDF-based subject term extraction method, device, equipment and storage medium
US11842160B2 (en) 2021-07-14 2023-12-12 International Business Machines Corporation Keyword extraction with frequency—inverse document frequency method for word embedding

Similar Documents

Publication Publication Date Title
CN108170666A (en) A kind of improved method based on TF-IDF keyword extractions
Christian et al. Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF)
CN108763402B (en) Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary
CN109582704B (en) Recruitment information and the matched method of job seeker resume
CN108763213A (en) Theme feature text key word extracting method
CN109960756B (en) News event information induction method
Agirre et al. Unsupervised WSD based on automatically retrieved examples: The importance of bias
CN110287309B (en) Method for quickly extracting text abstract
Rangel et al. Overview of the track on author profiling and deception detection in arabic
CN109522547B (en) Chinese synonym iteration extraction method based on pattern learning
Gupta et al. Text summarization of Hindi documents using rule based approach
Meena et al. Survey on graph and cluster based approaches in multi-document text summarization
Tandel et al. Multi-document text summarization-a survey
CN110705247A (en) Based on x2-C text similarity calculation method
CN110134847A (en) A kind of hot spot method for digging and system based on internet Financial Information
CN108399165A (en) A kind of keyword abstraction method based on position weighting
CN110399483A (en) A kind of subject classification method, apparatus, electronic equipment and readable storage medium storing program for executing
CN107526792A (en) A kind of Chinese question sentence keyword rapid extracting method
Gopan et al. Comparative study on different approaches in keyword extraction
Ayadi et al. A Survey of Arabic Text Representation and Classification Methods.
Haribhakta et al. Unsupervised topic detection model and its application in text categorization
CN106997345A (en) The keyword abstraction method of word-based vector sum word statistical information
CN108804422B (en) Scientific and technological paper text modeling method
CN110413985B (en) Related text segment searching method and device
Alam et al. Bangla news trend observation using lda based topic modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180615