CN102662976A - Text feature weighting method based on supervision - Google Patents
Text feature weighting method based on supervision Download PDFInfo
- Publication number
- CN102662976A CN102662976A CN2012100638795A CN201210063879A CN102662976A CN 102662976 A CN102662976 A CN 102662976A CN 2012100638795 A CN2012100638795 A CN 2012100638795A CN 201210063879 A CN201210063879 A CN 201210063879A CN 102662976 A CN102662976 A CN 102662976A
- Authority
- CN
- China
- Prior art keywords
- files
- text
- lexical item
- positive example
- text feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 35
- 238000005303 weighing Methods 0.000 claims description 7
- 239000000284 extract Substances 0.000 claims description 4
- 238000000605 extraction Methods 0.000 abstract description 4
- 230000008569 process Effects 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 230000007812 deficiency Effects 0.000 description 3
- 238000013461 design Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 241000486679 Antitype Species 0.000 description 2
- 241001269238 Data Species 0.000 description 1
- 238000013480 data collection Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 230000008570 general process Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
- 238000012706 support-vector machine Methods 0.000 description 1
Images
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a text feature weighting method based on supervision. Text feature extraction is carried out to obtain four types of numbers of documents, a, b, c, and d. The a represents the number of documents containing lexical item t and belonging to a positive example. The b represents the number of documents containing no lexical item t and belonging to a positive example. The c represents the number of documents containing lexical item t and not belonging to a positive example. The d represents the number of documents containing no lexical item t and not belonging to a positive example. The a, b, c, and d add up to N, namely, the total number of documents. The text feature weighting formula (I) is provided. With the text feature weighting method tf.ridf based on supervision put forward in the invention, a comprehensive consideration of words among total documents and the documents of each classification is realized, and classification performance is effectively improved.
Description
Technical field
The present invention relates to file classification method, especially a kind of text feature method of weighting.
Background technology
Along with popularizing of internet, applications, stored the text message of magnanimity in the network, people need excavate Useful Information urgently from text.Text mining just is meant from a large amount of text datas the process that extracts unknown in advance, intelligible, final information available or knowledge.Text classification is under given classification system, the process of classification under confirming automatically according to content of text.An important component part is a text representation in the text classification; In the text representation the most frequently used be vector space model (Vector Space Model, VSM), VSM is a statistical model of text representation; It is regarded document as and is made up of proper vector; And each proper vector can be a speech also can be phrase, and each proper vector all contains weight, thereby the vectorial matching problem that the classification problem of text is converted in the vector space is handled.After being converted into proper vector, just can use sorting algorithm commonly used to carry out text classification, like SVMs, k nearest neighbor etc.
A major issue in the text representation is exactly to select the text of representing which proper vector can be best, and will remove the proper vector that those do not represent value.Usually the general process of text classification is: participle, go stop words, index, statistics, feature extraction, characteristic weighing, classification based training and assessment.
Wherein the text feature weighting is a ring important in the entire process process.Based on the text feature weighting of statistics, mainly be divided into two big types: Feature Weighting Method (supervised term weighting method) and guideless Feature Weighting Method (unsupervised term weighting method) that guidance is promptly arranged.Present commonplace use be that (term frequency and inverse document frequency, tf.idf), it belongs to guideless Feature Weighting Method to the anti-document frequency method of weighting of word frequency.
At present, all proposed both at home and abroad much, included guidance and guideless Feature Weighting Method about the improving one's methods of characteristic weighing.Xue Xiaobing is with the densely distributed property of text speech, and the position that text size and speech occur for the first time is as the major consideration of characteristic weighing.Yet guideless Feature Weighting Method do not consider characteristic each document of training set and of all categories in difference on the distribution proportion to the influence of classification.There is the Feature Weighting Method of guidance to consider this point just on the contrary.Li Kaiqi has at first pointed out the deficiency of tf.idf method, on based on the method basis that guidance is arranged, combines tf.idf to improve the text feature weighting information gain.Man Lan has then considered the proportionate relationship of speech in positive type and anti-type, thinks that this speech just can be represented positive type more, just has high more weights as long as anti-type of ratio that accounts for of positive analogy is high more.
Summary of the invention
In order to overcome the relatively poor deficiency of classification performance of existing text feature method of weighting, the present invention provides a kind of text feature method of weighting based on guidance that promotes classification performance
For the technological means that solves the problems of the technologies described above employing is:
A kind of based on the text feature method of weighting that instructs, carry out obtaining a after text feature extracts, b, c, four types of number of files of d, representative is as follows respectively:
The number of files that a representes to contain lexical item t and belongs to positive example;
B representes not contain lexical item t but the number of files that belongs to positive example;
C representes to contain lexical item t but the number of files that do not belong to positive example;
The number of files that d representes not contain lexical item t and do not belong to positive example.
A, b, c, the summation of d is N, promptly total number of files;
Said text feature weighting formula is following:
Wherein K is expressed as:
Wherein, ridf combines to carry out characteristic weighing for relevant anti-document frequency method of weighting with word frequency, this formulate the significance level of a text between overall and classification.
Technical conceive of the present invention is: traditional Feature Weighting Method is following:
Wherein (t d) is the frequency of feature lexical item t in document d to tf, and high more this speech of tf value can be represented document more, and tf has embodied the relation of speech in document.N among the idf is the number of files that contains feature lexical item t, and N is total number of files.It is low more that idf thinks that the number of files that certain speech occurs accounts for the ratio of total number of files, and the document of this type can be represented more in this speech, i.e. the ability of this speech difference type is strong more, and idf has then embodied the relation of speech between document.Though the tfidf formula seems simply, comparing other complicated file classification method (like the text classification based on semanteme) usually has best expressive ability in some occasion.
Yet also there is weak point in this formula, and in six kinds of document distribution plans representing like Fig. 1, we can know:
The number of files that a representes to contain lexical item t and belongs to positive example;
B representes not contain lexical item t but the number of files that belongs to positive example;
C representes to contain lexical item t but the number of files that do not belong to positive example;
The number of files that d representes not contain lexical item t and do not belong to positive example.
A, b, c, the summation of d is N, promptly total number of files.
In these three kinds of situation of T1-T3, idf can give identical weights.Tangible then, the classification contribution ability of T1 should be maximum.
In order to overcome the deficiency of traditional guideless Feature Weighting Method, Chinese scholars has all proposed many new methods of weighting that guidance is arranged, and proposes like Xue Xiaobing:
Substitute idf with the rf formula, its basic thought is: the ratio that the ratio c that a accounts for accounts for is big more, and this speech positive example of capable more difference and counter-example so just should have high more weight; And when a=c, no matter the quantity of a and c is much, the ability that this speech difference is positive and negative type equates forever.
Yet this formula has been abandoned the thought of original idf again, and the ratio that the number of files that certain speech promptly occurs accounts for total number of files is low more, and this speech is the document of valuable more this type of representative just.Like T5 and T6 both of these case, significantly, the speech of T6 too spreads unchecked, and its representative ability is not as this situation of T5.
According to above-mentioned analysis, confirm design object of the present invention: the proportionate relationship of speech in each classification considered in (1), promptly utilizes the method that guidance is arranged, and makes full use of the training data of type of having label.(2) advantage of reservation traditional characteristic weighting is considered the situation that speech distributes in overall document.
Based on above two targets, design fundamentals are following:
Wherein the k value is:
Substitute idf with ridf, its advantage is following:
1. work as a and be not equal under the situation of c, ridf promptly will consider the factor of N/ (a+c), will consider also that simultaneously a/max (1, we can say and at this time promptly considered the relation of speech in classification, also considered the relation of speech between overall document by factor c).
2. under the situation of a=c, the size of K value can not influence final ridf value, meets above thought.
Description of drawings
Fig. 1 is six kinds of document distribution plans.
Embodiment
Below in conjunction with accompanying drawing the present invention is further specified.
With reference to Fig. 1, a kind of based on the text feature method of weighting that instructs, carry out obtaining a after text feature extracts, b, c, four types of number of files of d, representative is as follows respectively:
The number of files that a representes to contain lexical item t and belongs to positive example;
B representes not contain lexical item t but the number of files that belongs to positive example;
C representes to contain lexical item t but the number of files that do not belong to positive example;
The number of files that d representes not contain lexical item t and do not belong to positive example.
A, b, c, the summation of d is N, promptly total number of files;
Said text feature weighting formula is following:
Wherein K is expressed as:
Wherein, ridf combines to carry out characteristic weighing for relevant anti-document frequency method of weighting with word frequency, this formulate the significance level of a text between overall and classification.
In the present embodiment; Corpus is the Chinese corpus that contains 20 classifications that international data center center natural language processing group provides from Fudan University's computerized information with technology; Words partition system adopts the Chinese lexical analytic system ICTCLAS of Inst. of Computing Techn. Academia Sinica's development, and sorter adopts is the Libsvm of people's exploitation such as professor Lin Zhiren of Taiwan Univ..
1. at first original corpus is carried out Chinese word segmentation and part of speech mark.
2. carry out feature extraction, comprise the removal low-frequency word; Remove unnecessary part of speech, keep noun, verb and adjective; Adopt the feature extraction formula to calculate weights for each speech, preset threshold values is left out the characteristic speech that is lower than threshold values.
3. adopt the weighting formula of the present invention's design to carry out characteristic weighing, and the correlation data collection is set, other disposal routes are identical, and the weighting formula adopts tf.idf and tf.rf.
4. use the Libsvm training data, adopt linear kernel function, and adopt precision ratio, recall ratio and three indexs of F-measure to come comparing data.
5. experimental data shows from 1000 to the 5000 totally 10 groups of experiments respectively of text feature number, and the F-mesaure of tf.ridf is respectively 0.79,0.843,0.876,0.80,875,0.91,0.917,0.947,0.978,0.978.By contrast, the F-measure of tf.rf is 0.726,0.746,0.827,0.77,0.827,0.854,0.912,0.933,0.933,0.944.Can find out that classification performance of the present invention is better than tf.rf.
Claims (1)
1. one kind based on the text feature method of weighting that instructs, and it is characterized in that: carry out obtaining a after text feature extracts, and b, c, four types of number of files of d, representative is as follows respectively:
The number of files that a representes to contain lexical item t and belongs to positive example;
B representes not contain lexical item t but the number of files that belongs to positive example;
C representes to contain lexical item t but the number of files that do not belong to positive example;
The number of files that d representes not contain lexical item t and do not belong to positive example;
A, b, c, the summation of d is N, promptly total number of files;
Said text feature weighting formula is following:
Wherein K is expressed as:
Wherein, ridf combines to carry out characteristic weighing for relevant anti-document frequency method of weighting with word frequency, this formulate text totally between the text and the significance level between each classification text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100638795A CN102662976A (en) | 2012-03-12 | 2012-03-12 | Text feature weighting method based on supervision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100638795A CN102662976A (en) | 2012-03-12 | 2012-03-12 | Text feature weighting method based on supervision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102662976A true CN102662976A (en) | 2012-09-12 |
Family
ID=46772467
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100638795A Pending CN102662976A (en) | 2012-03-12 | 2012-03-12 | Text feature weighting method based on supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102662976A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103207893A (en) * | 2013-03-13 | 2013-07-17 | 北京工业大学 | Classification method of two types of texts on basis of vector group mapping |
CN105389307A (en) * | 2015-12-02 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Statement intention category identification method and apparatus |
CN105512104A (en) * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
CN106682411A (en) * | 2016-12-22 | 2017-05-17 | 浙江大学 | Method for converting physical examination diagnostic data into disease label |
-
2012
- 2012-03-12 CN CN2012100638795A patent/CN102662976A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103207893A (en) * | 2013-03-13 | 2013-07-17 | 北京工业大学 | Classification method of two types of texts on basis of vector group mapping |
CN103207893B (en) * | 2013-03-13 | 2016-05-25 | 北京工业大学 | The sorting technique of two class texts based on Vector Groups mapping |
CN105389307A (en) * | 2015-12-02 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Statement intention category identification method and apparatus |
CN105512104A (en) * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
CN106682411A (en) * | 2016-12-22 | 2017-05-17 | 浙江大学 | Method for converting physical examination diagnostic data into disease label |
CN106682411B (en) * | 2016-12-22 | 2019-04-16 | 浙江大学 | A method of disease label is converted by physical examination diagnostic data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103279478B (en) | A kind of based on distributed mutual information file characteristics extracting method | |
CN104199857B (en) | A kind of tax document hierarchy classification method based on multi-tag classification | |
CN103617157A (en) | Text similarity calculation method based on semantics | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
Al-diabat | Arabic text categorization using classification rule mining | |
CN108763484A (en) | A kind of law article recommendation method based on LDA topic models | |
CN103823896A (en) | Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm | |
CN103064969A (en) | Method for automatically creating keyword index table | |
CN103324628A (en) | Industry classification method and system for text publishing | |
Bates et al. | Counting clusters in twitter posts | |
CN104239512A (en) | Text recommendation method | |
Wang et al. | Text clustering based on the improved TFIDF by the iterative algorithm | |
CN102955857A (en) | Class center compression transformation-based text clustering method in search engine | |
CN105550170A (en) | Chinese word segmentation method and apparatus | |
Swamy et al. | Indian language text representation and categorization using supervised learning algorithm | |
CN102662976A (en) | Text feature weighting method based on supervision | |
CN101887415A (en) | Automatic extraction method for text document theme word meaning | |
Yan et al. | Tibetan sentence sentiment analysis based on the maximum entropy model | |
CN103116636A (en) | Method and device of excavation of subject of text big data based on characteristic space decomposition | |
Yang et al. | Research on Chinese text classification based on Word2vec | |
CN104866606A (en) | MapReduce parallel big data text classification method | |
Murthy et al. | A comparative study on term weighting methods for automated telugu text categorization with effective classifiers | |
CN105550292A (en) | Web page classification method based on von Mises-Fisher probability model | |
Choi et al. | N-gram feature selection for text classification based on symmetrical conditional probability and TF-IDF | |
Zhu et al. | An integrated method for micro-blog subjective sentence identification based on three-way decisions and naive Bayes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20120912 |