CN108170666A - Improved method for TF-IDF-based keyword extraction
Improved method for TF-IDF-based keyword extraction
- Publication number
- CN108170666A (application CN201711229728.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- idf
- represent
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The present invention relates to an improved method for keyword extraction based on TF-IDF (term frequency-inverse document frequency), specifically comprising the following steps: count, for every word, the number of occurrences in each text of the document collection; compute word weights with the improved TF-IDF formula; sort the words by weight in descending order and use the ranking as the basis for text keyword retrieval. Compared with the prior art, the present invention distinguishes words of different parts of speech and optimizes the keyword ranking by taking into account the keywords that actually represent the text's features.
Description
Technical field
The present invention relates to natural language processing methods, and in particular to an improved method for TF-IDF-based keyword extraction.
Background technology
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.
Keyword extraction is a natural language processing technique that identifies meaningful and representative segments or words in a text. It is one of the foundations and core components of natural language processing: most techniques for processing unstructured text, such as text summarization, text classification, text clustering, and automatic question answering, rely on it to improve precision. With the development of the Internet, online resources grow increasingly abundant, yet most articles lack keyword tags, and manual keyword annotation is time-consuming, laborious, and rather subjective. The technique is therefore of great significance for extracting keywords from text.
The present method performs keyword extraction on short-text corpora. Because short texts are of limited length, they usually contain only one or two keywords. For this situation, the method adopts an unsupervised keyword extraction approach based on statistics such as document frequency DF (Document Frequency) and term frequency TF (Term Frequency), and performs the extraction with an improved TF-IDF formula.
Summary of the invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide an improved method for TF-IDF-based keyword extraction.
The object of the present invention can be achieved through the following technical solutions:
An improved method for TF-IDF-based keyword extraction, the method comprising the following steps:
S1. For every word, count the number of occurrences in each text of the text collection;
S2. Compute the weight of each word with the improved TF-IDF formula;
S3. Sort the words by weight in descending order and use the ranking as the basis for text keyword retrieval.
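As a minimal sketch (not the patented implementation itself), steps S1-S3 can be wired together as follows; the corpus is assumed to be pre-tokenized, and k1 and k2 are treated as a single constant per word, as in the worked example later in this document:

```python
import math
from collections import Counter

def extract_keywords(docs, k1=0.5, k2=0.5, top_n=2):
    """Steps S1-S3 on a pre-tokenized corpus (each doc is a list of words)."""
    # S1: occurrence counts
    tf = Counter(w for doc in docs for w in doc)        # tf_i: total occurrences
    df = Counter(w for doc in docs for w in set(doc))   # df_i: documents containing w
    n = len(docs)
    # S2: improved weight; tf_ik is summed over all documents here,
    # matching the worked example, and (k1 + k2) is one constant
    weights = {w: (k1 + k2) * tf[w] * math.log10(n / df[w] * tf[w]) for w in tf}
    # S3: descending sort by weight
    return sorted(weights, key=weights.get, reverse=True)[:top_n]
```

For instance, `extract_keywords([["a", "a"], ["a", "b"]])` ranks the concentrated, frequent word "a" ahead of "b".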
Preferably, the improved TF-IDF formula is:

TF-IDF_i = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)

where k_i1 is the part-of-speech importance coefficient of feature t_i in text d_k, k_i2 is the domain-term importance coefficient of feature t_i in text d_k, tf_ik is the number of occurrences of feature t_i in text d_k, tf_i is the number of occurrences of feature t_i in all texts, |D| is the number of texts, and df_i is the number of texts containing feature t_i.
Preferably, D denotes the text set, D = {d_1, d_2, …, d_k, …}, where d_k denotes the k-th text in D.
Preferably, t_i denotes the i-th feature word in T (i ∈ {1, 2, …, |T|}), |T| denotes the number of words, and T denotes the set of words in the texts:

T = {t_1, t_2, …, t_i, …}.
Preferably, df_i / |D| denotes the probability that a text picked from the text set contains the feature word t_i, and idf_i denotes the modified inverse document frequency derived from df_i:

idf_i = log((|D| / df_i) * tf_i).
Preferably, for a word i: if i is a noun, k_i1 = (number of nouns in all documents) / (total number of words in all documents); if i is a verb, k_i1 = (number of verbs in all documents) / (total number of words in all documents); if i is an adjective, k_i1 = (number of adjectives in all documents) / (total number of words in all documents); if i is any word other than a noun, verb, or adjective, k_i1 = 0.
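A direct transcription of the preferred formula (a sketch; the logarithm base is taken as 10, matching the embodiment below):

```python
import math

def improved_tfidf(tf_ik, tf_i, df_i, num_docs, k_i1, k_i2):
    """TF-IDF_i = (k_i1 + k_i2) * tf_ik * log10((|D| / df_i) * tf_i)."""
    return (k_i1 + k_i2) * tf_ik * math.log10(num_docs / df_i * tf_i)
```

With the figures from the embodiment below (tf_ik summed to 10, tf_i = 10, df_i = 4, |D| = 8, k_i1 + k_i2 = 1) this yields 13.010.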
Compared with the prior art, the present invention has the following advantages:
1. It distinguishes words of different parts of speech: the keyword extraction process introduces a part-of-speech importance coefficient and a domain-term importance coefficient, so the importance of words with different parts of speech can be differentiated.
2. It takes into account the keywords that actually represent the text's features and optimizes the keyword ranking: the number of occurrences of feature t_i in all texts is used to correct the keyword weight computation, so that keywords whose IDF values are low but which nevertheless represent the text's features well are moved toward the front of the ranking, making the extraction results more accurate.
Description of the drawings
Fig. 1 is a schematic flow diagram of the method of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawing. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
Embodiment:
The improved method for TF-IDF-based keyword extraction of the present invention specifically comprises the following steps, as shown in Fig. 1:
S1. For every word, count the number of occurrences in each document of the document collection;
S2. Compute the weight of each word with the improved TF-IDF formula;
S3. Sort the words by weight in descending order and use the ranking as the basis for keyword database retrieval.
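Step S1 is a plain occurrence count; with a pre-tokenized corpus (tokenization itself is outside the scope of this sketch, and the example documents are hypothetical) it can be done with `collections.Counter`:

```python
from collections import Counter

docs = [["power", "grid", "power"], ["grid", "fault"]]   # hypothetical tokenized documents

counts = [Counter(doc) for doc in docs]   # tf_ik: occurrences of each word per document
print(counts[0]["power"])                 # 2
```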
For short-text corpora, the method first considers that the various parts of a text differ in importance. In general, verbs, nouns, and adjectives are the main components of a sentence and therefore matter most for keyword extraction; numerals, pronouns, and the like merely modify and complete the sentence but contribute almost nothing to sentence classification. The method therefore assigns different importance coefficients k_i1 to words of different parts of speech, with nouns, verbs, and adjectives receiving larger coefficients than words of other parts of speech such as adverbs and pronouns.
Secondly, the method also considers the important role that domain-specific keywords play in a text, and assigns an importance coefficient k_i2 to important domain terms.
The method also addresses a shortcoming of the TF-IDF approach. The main idea of TF-IDF is that if a word or phrase occurs frequently in one article (high TF) and rarely in other articles (high IDF), it is considered to have good discriminative power and is suitable as one of the article's keywords. However, some words have low IDF values and yet still represent the text's features very well. The method therefore makes the following improvement:

TF-IDF_i = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)
The TF-IDF retrieval model. The main idea of the TF-IDF model is: if a word w occurs frequently in a document d and rarely in other documents, then w is considered to have good discriminative power and is suitable for distinguishing article d from other articles. The formal definitions of the model are as follows.
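For comparison, the traditional model just described can be sketched as a one-liner (base-10 logarithm, as used in the example later in this document):

```python
import math

def traditional_tfidf(tf_ik, df_i, num_docs):
    """Classic weight: tf_ik * log10(|D| / df_i)."""
    return tf_ik * math.log10(num_docs / df_i)
```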
T denotes the set of feature words of the texts. A feature word can be a word or a phrase. |T| denotes the number of words, and t_i denotes the i-th feature word in T (i ∈ {1, 2, …, |T|}):

T = {t_1, t_2, …, t_i, …}

D denotes the text set. |D| denotes the number of texts, and d_k denotes the k-th text in D (k ∈ {1, 2, …, |D|}):

D = {d_1, d_2, …, d_k, …}
tf_ik denotes the number of occurrences of feature t_i in text d_k; it is a local parameter of a single text.
tf_i denotes the number of occurrences of feature t_i in all texts.
df_i denotes the number of texts containing feature t_i.
idf_i denotes the modified inverse document frequency derived from df_i; df_i / |D| denotes the probability that a text picked from the text set contains the feature word t_i:

idf_i = log((|D| / df_i) * tf_i)

k_i1 denotes the part-of-speech importance coefficient of feature t_i in text d_k.
k_i2 denotes the domain-term importance coefficient of feature t_i in text d_k; it is a coefficient that can be set by the user.
Setting k_i1 helps to raise the importance of nouns, verbs, and adjectives relative to words of other parts of speech, while k_i2 helps to highlight the importance of domain-specific keywords.
Our new formula differs from the traditional TF-IDF computation in two respects: first, it introduces the part-of-speech importance coefficient and the domain-term importance coefficient, so the importance of words with different parts of speech can be differentiated; second, it changes the traditional IDF computation by adding the factor tf_i, which corrects the weights of certain keywords.
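The second difference, the added tf_i factor inside the logarithm, can be seen by comparing the two IDF variants directly (a sketch, base-10 logarithm assumed):

```python
import math

def idf_traditional(num_docs, df_i):
    return math.log10(num_docs / df_i)

def idf_modified(num_docs, df_i, tf_i):
    # the extra tf_i factor boosts frequent words whose plain IDF is low
    return math.log10(num_docs / df_i * tf_i)
```

For df_i = 4 and |D| = 8, a word occurring tf_i = 10 times in total moves from log10(2) ≈ 0.301 to log10(20) ≈ 1.301.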
In the text collection D, consider the distribution of each word over the documents of each category. If a word is evenly distributed across the documents of every category, it is not representative and cannot serve as a keyword of any category. Conversely, if a word occurs only in the documents of a certain category, and rarely or never in the documents of the other categories, it can serve as a keyword of that category. Traditional TF-IDF cannot handle this practical problem, so we illustrate the situation as follows.
We divide parts of speech into 4 classes: noun, verb, adjective, and other. For a word i belonging to one of the first 3 classes (noun, verb, adjective), if i is, for example, a noun, then k_i1 = (number of nouns in all documents) / (total number of words of all 4 classes in all documents); for a word of the last class (i.e., any word other than a noun, verb, or adjective, such as an adverb or a pronoun), k_i1 = 0.
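The k_i1 rule above can be sketched as follows, assuming the corpus has already been part-of-speech tagged (the tag names and sample pairs are placeholders):

```python
from collections import Counter

def make_k1(tagged_docs):
    """tagged_docs: list of documents, each a list of (word, pos) pairs."""
    pos_counts = Counter(pos for doc in tagged_docs for _, pos in doc)
    total = sum(pos_counts.values())   # number of words of all 4 classes
    def k1(pos):
        if pos in ("noun", "verb", "adjective"):
            return pos_counts[pos] / total
        return 0.0                     # adverbs, pronouns, etc.
    return k1
```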
Meanwhile we before the experiments there are one word list, the proper noun comprising the field, this list is based on
Calculate ki2, for the word k inside this listi2=1, otherwise ki2=0.
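The k_i2 lookup is then a set-membership test against the prepared list (the list contents here are hypothetical):

```python
domain_terms = {"transformer", "substation", "feeder"}   # hypothetical field-specific list

def k2(word):
    return 1.0 if word in domain_terms else 0.0
```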
Suppose we currently have a document collection D containing 8 documents, divided into 2 categories with 4 documents each, and that we have extracted 3 words as keyword candidates. Table 1 shows how the candidate words occur in each document and the category of each document. Table 2 shows the weight computed for each candidate word with traditional TF-IDF (i.e., the candidate word's score). Table 3 shows the weight computed for each candidate word with the improved TF-IDF (i.e., the candidate word's score).
Table 1
Table 2

Candidate word | Weight
---|---
T1 | 3.010
T2 | 0.986
T3 | 1.250

Table 3

Candidate word | Weight
---|---
T1 | 13.010
T2 | 15.351
T3 | 11.250
1. Count the occurrences of every word in each document of the document collection D; the results are shown in Table 1. (In this example the 8 documents contain only the 3 words T1, T2, and T3; documents 1, 2, 3, and 4 belong to category 1, and the other documents belong to category 2.)
2. Compute the weight of each candidate word with the traditional TF-IDF formula (a keyword weight computed over the entire document collection).
Note 1: TF-IDF = tf_ik * log(|D| / df_i)
Note 2: For example, for T1, TF-IDF = (2+3+2+3+0+0+0+0) * log(8/4), where 8 is the total number of documents and 4 is the number of documents containing T1.
Note 3: In this example the base of the logarithm is 10.
3. Compute the weight of each candidate word with the improved TF-IDF formula.
Note 1: TF-IDF = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)
Note 2: For example, for T1, TF-IDF = (2+3+2+3+0+0+0+0) * log(8/4 * (2+3+2+3+0+0+0+0)), where 8 is the total number of documents and 4 is the number of documents containing T1.
Note 3: In this example the base of the logarithm is 10.
Note 4: In this example k_i1 + k_i2 = 1.
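Likewise for the improved formula; with k_i1 + k_i2 = 1 this reproduces the 13.010 entry for T1 in Table 3:

```python
import math

tf_total = 2 + 3 + 2 + 3                               # tf_ik summed over all documents, and also tf_i, for T1
weight = 1 * tf_total * math.log10(8 / 4 * tf_total)   # (k_i1 + k_i2) = 1
print(round(weight, 3))                                # 13.01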
Analysis of the results:
As can be seen from Table 1, T1 occurs only in category 1 and can therefore serve as a keyword representing category 1. T2 is evenly distributed over the two categories and does not have the ability to act as a keyword. T3 occurs mostly in category 2 and only a little in category 1, and has some ability to act as a keyword: in the short texts of a specialized field, some category-specific proper nouns may occasionally appear in small numbers in other categories, yet they remain good representatives of their own category and can serve as its keywords.
Comparing Table 2 and Table 3, when the words are sorted by weight in descending order, the traditional TF-IDF method yields T1 > T3 > T2, a result that cannot fully satisfy the requirements, whereas the improved TF-IDF method yields T2 > T1 > T3, which both satisfies our requirements and achieves the purpose of the conventional method.
The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions shall be covered by the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the appended claims.
Claims (6)
1. An improved method for TF-IDF-based keyword extraction, characterized in that the method comprises the following steps:
S1. For every word, count the number of occurrences in each text of the text collection;
S2. Compute the weight of each word with the improved TF-IDF formula;
S3. Sort the words by weight in descending order and use the ranking as the basis for text keyword retrieval.
2. The improved method for TF-IDF-based keyword extraction according to claim 1, characterized in that the improved TF-IDF formula is:
TF-IDF_i = (k_i1 + k_i2) * tf_ik * log((|D| / df_i) * tf_i)
where k_i1 is the part-of-speech importance coefficient of feature t_i in text d_k, k_i2 is the domain-term importance coefficient of feature t_i in text d_k, tf_ik is the number of occurrences of feature t_i in text d_k, tf_i is the number of occurrences of feature t_i in all texts, |D| is the number of texts, and df_i is the number of texts containing feature t_i.
3. The improved method for TF-IDF-based keyword extraction according to claim 2, characterized in that D denotes the text set, D = {d_1, d_2, …, d_k, …}, where d_k denotes the k-th text in D.
4. The improved method for TF-IDF-based keyword extraction according to claim 2, characterized in that t_i denotes the i-th feature word in T (i ∈ {1, 2, …, |T|}), |T| denotes the number of words, and T denotes the set of words in the texts:
T = {t_1, t_2, …, t_i, …}.
5. The improved method for TF-IDF-based keyword extraction according to claim 1, characterized in that df_i / |D| denotes the probability that a text picked from the text set contains the feature word t_i, and idf_i denotes the modified inverse document frequency derived from df_i:
idf_i = log((|D| / df_i) * tf_i).
6. The improved method for TF-IDF-based keyword extraction according to claim 1, characterized in that, for a word i: if i is a noun, k_i1 = (number of nouns in all documents) / (total number of words in all documents); if i is a verb, k_i1 = (number of verbs in all documents) / (total number of words in all documents); if i is an adjective, k_i1 = (number of adjectives in all documents) / (total number of words in all documents); if i is any word other than a noun, verb, or adjective, k_i1 = 0.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201711229728.1A | 2017-11-29 | 2017-11-29 | Improved method for TF-IDF-based keyword extraction
Publications (1)
Publication Number | Publication Date
---|---
CN108170666A | 2018-06-15
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2018-06-15