CN102662976A - Text feature weighting method based on supervision - Google Patents

Text feature weighting method based on supervision

Info

Publication number
CN102662976A
CN102662976A
Authority
CN
China
Prior art keywords
documents
text
term
positive class
text feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100638795A
Other languages
Chinese (zh)
Inventor
刘端阳
陆洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN2012100638795A priority Critical patent/CN102662976A/en
Publication of CN102662976A publication Critical patent/CN102662976A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a supervised text feature weighting method. Text feature extraction yields four document counts, a, b, c, and d: a is the number of documents that contain term t and belong to the positive class; b is the number of documents that do not contain term t but belong to the positive class; c is the number of documents that contain term t but do not belong to the positive class; and d is the number of documents that neither contain term t nor belong to the positive class. The counts a, b, c, and d sum to N, the total number of documents. A text feature weighting formula (I) is provided. The proposed supervised weighting method tf.ridf jointly considers how a term is distributed over the whole document collection and within each class, and effectively improves classification performance.

Description

A supervised text feature weighting method
Technical field
The present invention relates to text classification methods, and in particular to a text feature weighting method.
Background technology
With the spread of Internet applications, networks now store massive amounts of text, and people urgently need to mine useful information from it. Text mining is the process of extracting previously unknown, understandable, and ultimately usable information or knowledge from large volumes of text data. Text classification is the process of automatically assigning a text to a category under a given category system according to its content. An important component of text classification is text representation, and the most common representation is the vector space model (VSM). VSM is a statistical model that treats a document as a vector of features, where each feature can be a word or a phrase and carries a weight; the text classification problem is thereby converted into a vector matching problem in the vector space. Once texts are represented as feature vectors, common classification algorithms such as support vector machines and k-nearest neighbors can be applied.
A major issue in text representation is selecting the features that best represent a text and removing those with no representative value. The typical text classification pipeline is: word segmentation, stop-word removal, indexing, statistics, feature extraction, feature weighting, classifier training, and evaluation.
Feature weighting is an important step in this pipeline. Statistics-based feature weighting methods fall into two broad classes: supervised term weighting methods and unsupervised term weighting methods. The most widely used method at present is term frequency and inverse document frequency (tf.idf), which is an unsupervised method.
Many improvements to feature weighting, both supervised and unsupervised, have been proposed at home and abroad. Xue Xiaobing took the density of a word's distribution within a text, the text length, and the position where the word first appears as the main considerations for feature weighting. Unsupervised methods, however, ignore how differently a feature is distributed across the training documents of each category, and thus how it affects classification; supervised methods take exactly this into account. Li Kaiqi first pointed out the deficiencies of the tf.idf method and, building on a supervised basis, improved it by combining tf.idf with information gain. Man Lan then considered the proportion of a word between the positive and negative classes, arguing that the higher the positive-to-negative ratio, the better the word represents the positive class and the higher its weight should be.
Summary of the invention
In order to overcome the relatively poor classification performance of existing text feature weighting methods, the present invention provides a supervised text feature weighting method that improves classification performance.
The technical means adopted to solve the above technical problem are as follows:
A supervised text feature weighting method: after text feature extraction, four document counts a, b, c, and d are obtained, defined as follows:
a is the number of documents that contain term t and belong to the positive class;
b is the number of documents that do not contain term t but belong to the positive class;
c is the number of documents that contain term t but do not belong to the positive class;
d is the number of documents that neither contain term t nor belong to the positive class.
The counts a, b, c, and d sum to N, the total number of documents.
The text feature weighting formula is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right) \quad (1)
where K is defined as:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases} \quad (2)
Here ridf is the relevance-based inverse document frequency weight, which is combined with term frequency to perform feature weighting; the formula expresses how important a term is both over the whole collection and between classes.
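For concreteness, the following is a minimal Python sketch of equations (1) and (2). The function names and the document representation (each document as a set of terms, with a boolean class label) are our own illustrative choices, not part of the patent.

import math

def count_abcd(docs, labels, term):
    # docs: list of sets of terms (one set per document)
    # labels: parallel list of booleans, True = positive class
    # returns (a, b, c, d) as defined above; a + b + c + d == len(docs)
    a = b = c = d = 0
    for doc, positive in zip(docs, labels):
        if term in doc:
            if positive:
                a += 1
            else:
                c += 1
        else:
            if positive:
                b += 1
            else:
                d += 1
    return a, b, c, d

def ridf(a, c, n):
    # equations (1) and (2); assumes a + c > 0, i.e. the term occurs somewhere
    if a != c:
        # the exponent (a - c)/|a - c| is +1 when a > c and -1 when a < c
        k = (n / (a + c)) ** ((a - c) / abs(a - c))
    else:
        # when a == c the base below is 1, so the size of K is irrelevant
        k = n / (a + c)
    return math.log2(2 + (a / max(1, c)) ** k)

def tf_ridf(tf_td, a, c, n):
    # the combined tf.ridf weight of term t in document d
    return tf_td * ridf(a, c, n)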
The technical conception of the present invention is as follows. The traditional feature weighting method is:
tf \cdot idf = tf(t,\,d) \times \log_2\!\left(\frac{N}{n + 0.01}\right)
where tf(t, d) is the frequency of feature term t in document d: the higher the tf value, the better the word represents the document, so tf captures the relation between a word and a document. In the idf factor, n is the number of documents containing feature term t and N is the total number of documents. The idea behind idf is that the lower the fraction of all documents in which a word occurs, the better that word represents this kind of document, i.e., the stronger its ability to discriminate between classes; idf thus captures the relation of a word across documents. Although the tf.idf formula looks simple, in some settings it has better expressive ability than more complex text classification methods (such as semantics-based classification).
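As a reference point, a one-line sketch of this traditional weight (the variable names are ours):

import math

def tf_idf(tf_td, n_t, n):
    # tf(t, d) times log2(N / (n + 0.01)), where n_t documents contain t
    return tf_td * math.log2(n / (n_t + 0.01))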
However, the tf.idf formula also has shortcomings. Consider the six document distributions T1 to T6 shown in Fig. 1, where a, b, c, and d are the document counts defined above, summing to N.
In the three situations T1 to T3, idf assigns identical weights; clearly, however, T1 should contribute the most to classification.
To overcome this deficiency of traditional unsupervised feature weighting, scholars at home and abroad have proposed many new supervised weighting methods, such as the rf method proposed by Man Lan:
rf = \log_2\!\left(2 + \frac{a}{c}\right)
The rf formula substitutes for idf. Its basic idea is: the larger the ratio of a to c, the better the word distinguishes positive examples from counter-examples, so it should receive a higher weight; and when a = c, no matter how large a and c are, the word's ability to distinguish the positive and negative classes is always the same.
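A minimal sketch of rf follows; the max(1, c) guard against c = 0 is our addition so the sketch runs on any counts, and is a commonly used variant of the formula.

import math

def rf(a, c):
    # grows with the ratio a/c; equal nonzero a and c always give log2(3)
    return math.log2(2 + a / max(1, c))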
However, the rf formula abandons the original idea of idf, namely that the lower the fraction of all documents in which a word occurs, the more valuable that word is for representing its class. Compare situations T5 and T6: the word in T6 is clearly too widespread, so its representative ability is weaker than in T5.
From the above analysis, the design goals of the present invention are determined as: (1) consider the proportion of a word in each class, i.e., use a supervised method and make full use of the class-labelled training data; (2) retain the advantage of traditional feature weighting by also considering how the word is distributed over the whole document collection.
Based on these two goals, the design is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right)
where K is:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases}
Substituting ridf for idf has the following advantages:
1. When a is not equal to c, ridf takes into account both the factor N/(a+c) and the factor a/max(1, c); that is, it considers both the relation of the word within the classes and the relation of the word over the whole document collection.
2. When a = c, the base a/max(1, c) equals 1 (for nonzero counts), so the size of K cannot influence the final ridf value, which is consistent with the idea above.
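To illustrate both branches of K, here is a small worked example using the ridf sketch above; the counts are hypothetical and chosen only for illustration, not taken from Fig. 1.

# N = 100 documents in every case below; counts are purely illustrative
print(ridf(a=8, c=2, n=100))   # a > c: K = (100/10)**1  = 10,  ridf ≈ 20.00
print(ridf(a=2, c=8, n=100))   # a < c: K = (100/10)**-1 = 0.1, ridf ≈ 1.52
print(ridf(a=5, c=5, n=100))   # a = c: base is 1, ridf = log2(3) ≈ 1.58

Note how sharply the weight grows when a term is concentrated in the positive class: the exponent K amplifies the ratio a/max(1, c) by the overall rarity factor N/(a+c).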
Description of drawings
Fig. 1 is six kinds of document distribution plans.
Embodiment
The present invention is further described below with reference to the accompanying drawing.
Referring to Fig. 1, a supervised text feature weighting method: after text feature extraction, four document counts a, b, c, and d are obtained, defined as follows:
a is the number of documents that contain term t and belong to the positive class;
b is the number of documents that do not contain term t but belong to the positive class;
c is the number of documents that contain term t but do not belong to the positive class;
d is the number of documents that neither contain term t nor belong to the positive class.
The counts a, b, c, and d sum to N, the total number of documents.
The text feature weighting formula is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right) \quad (1)
where K is defined as:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases} \quad (2)
Here ridf is the relevance-based inverse document frequency weight, which is combined with term frequency to perform feature weighting; the formula expresses how important a term is both over the whole collection and between classes.
In the present embodiment, the corpus is the 20-category Chinese corpus provided by the Natural Language Processing Group of the International Database Center, Department of Computer Information and Technology, Fudan University; word segmentation uses ICTCLAS, the Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences; and the classifier is Libsvm, developed by Professor Chih-Jen Lin's group at National Taiwan University.
1. First, perform Chinese word segmentation and part-of-speech tagging on the raw corpus.
2. Perform feature extraction: remove low-frequency words; remove unneeded parts of speech, keeping nouns, verbs, and adjectives; compute a weight for each word with the feature extraction formula and, given a preset threshold, delete the feature words whose weight falls below it.
3. Apply the weighting formula designed by the present invention for feature weighting, and set up comparison data sets processed identically except that the weighting formula is tf.idf or tf.rf.
4. Train the data with Libsvm using a linear kernel, and compare the results using precision, recall, and F-measure (see the sketch after this list).
5. The experimental data cover 10 groups of experiments, with the number of text features ranging from 1000 to 5000. The F-measure of tf.ridf is, respectively, 0.79, 0.843, 0.876, 0.80, 0.875, 0.91, 0.917, 0.947, 0.978, and 0.978; by contrast, the F-measure of tf.rf is 0.726, 0.746, 0.827, 0.77, 0.827, 0.854, 0.912, 0.933, 0.933, and 0.944. The classification performance of the present invention is thus better than that of tf.rf.
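For reference, the F-measure used in steps 4 and 5 is the standard weighted combination of precision and recall; a minimal sketch (the default beta = 1 gives the usual F1):

def f_measure(precision, recall, beta=1.0):
    # weighted harmonic combination of precision ratio and recall ratio
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.9, 0.8))  # ≈ 0.847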

Claims (1)

1. A supervised text feature weighting method, characterized in that: after text feature extraction, four document counts a, b, c, and d are obtained, defined as follows:
a is the number of documents that contain term t and belong to the positive class;
b is the number of documents that do not contain term t but belong to the positive class;
c is the number of documents that contain term t but do not belong to the positive class;
d is the number of documents that neither contain term t nor belong to the positive class;
the counts a, b, c, and d sum to N, the total number of documents;
the text feature weighting formula is as follows:
ridf = \log_2\!\left(2 + \left(\frac{a}{\max(1,\,c)}\right)^{K}\right) \quad (1)
where K is defined as:
K = \begin{cases}\left(\frac{N}{a+c}\right)^{\frac{a-c}{|a-c|}} & (a \neq c)\\[1ex] \frac{N}{a+c} & (a = c)\end{cases} \quad (2)
where ridf is the relevance-based inverse document frequency weight, combined with term frequency to perform feature weighting; the formula expresses the importance of a term both over the whole text collection and between the texts of each class.
CN2012100638795A 2012-03-12 2012-03-12 Text feature weighting method based on supervision Pending CN102662976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100638795A CN102662976A (en) 2012-03-12 2012-03-12 Text feature weighting method based on supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100638795A CN102662976A (en) 2012-03-12 2012-03-12 Text feature weighting method based on supervision

Publications (1)

Publication Number Publication Date
CN102662976A true CN102662976A (en) 2012-09-12

Family

ID=46772467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100638795A Pending CN102662976A (en) 2012-03-12 2012-03-12 Text feature weighting method based on supervision

Country Status (1)

Country Link
CN (1) CN102662976A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207893A (en) * 2013-03-13 2013-07-17 北京工业大学 Classification method of two types of texts on basis of vector group mapping
CN103207893B * 2013-03-13 2016-05-25 北京工业大学 Classification method for two classes of texts based on vector group mapping
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device
CN106682411A (en) * 2016-12-22 2017-05-17 浙江大学 Method for converting physical examination diagnostic data into disease label
CN106682411B * 2016-12-22 2019-04-16 浙江大学 Method for converting physical examination diagnostic data into disease labels

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120912