CN103886108A - Feature selection and weight calculation method of imbalance text set - Google Patents

Feature selection and weight calculation method of imbalance text set

Info

Publication number
CN103886108A
CN103886108A (application CN201410149441.8A)
Authority
CN
China
Prior art keywords
feature
classification
text
chi
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410149441.8A
Other languages
Chinese (zh)
Other versions
CN103886108B (en)
Inventor
刘磊 (Liu Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goonie International Software (Beijing) Co.,Ltd.
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201410149441.8A priority Critical patent/CN103886108B/en
Publication of CN103886108A publication Critical patent/CN103886108A/en
Application granted granted Critical
Publication of CN103886108B publication Critical patent/CN103886108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Abstract

The invention provides a feature selection and weight calculation method for an imbalanced text set, belonging to the field of text information processing. To address the classification of imbalanced text data, a feature selection and weight calculation method and system are proposed. The category discrimination degree is combined with an average word frequency factor to improve the chi-square statistic for feature selection; a commonly used feature weight calculation method is also improved, and a TF-IDF weight calculation method is proposed on that basis. The method outperforms traditional feature selection on imbalanced data sets and effectively improves classification accuracy.

Description

Feature selection and weight calculation method for an unbalanced text set
Technical field
The invention belongs to the field of text information processing, and specifically relates to feature selection and weight calculation for unbalanced text sets.
Background technology
With the rapid development of information technology and the spread of the internet, text information resources have expanded rapidly. While these resources enrich people's knowledge and provide convenience, they also contain a large amount of junk information. As one of the major techniques of information retrieval, text classification has high practical value for improving the performance of information retrieval and filtering systems.
Under normal circumstances, texts come not only from web pages and mail but also from short messages, microblogs, forum posts, and so on. In text classification, once texts are represented as vectors, the training set may have tens of thousands of features. Many of them are irrelevant or redundant and need to be removed, as do the noise features that disturb classification accuracy. A huge feature space reduces the performance and generalization ability of the classifier, and processing high-dimensional vectors carries high time complexity. Feature selection, an important step in text classification, improves the efficiency and precision of the classifier by reducing the dimensionality of the features. Because category information is an important component of text classification, problems such as complex category relations, uneven distribution, and category uncertainty arise in text classification and pose many challenges to feature selection research.
Many traditional machine learning methods assume a balanced data set, but in real applications most data are unbalanced, and conventional machine learning methods usually handle unbalanced data sets poorly. How to process unbalanced data sets effectively is a research hotspot in data mining, with broad prospects and practical significance in fields such as medical diagnosis, financial credit management, and mail filtering. Approaches to the imbalance problem fall into two aspects: sampling and algorithms. The present invention addresses the feature selection aspect on unbalanced data sets.
By studying feature selection algorithms for unbalanced data sets, the inventor provides a feature selection and weight calculation method for unbalanced text sets that overcomes the limitations of traditional classification methods in the face of unbalanced data sets.
Summary of the invention
The object of the invention is to propose a feature selection and weight calculation method and system for the classification of unbalanced text data. Combining the category discrimination degree with an average word frequency factor, the present invention improves the chi-square statistic for feature selection; it also improves the commonly used feature weight calculation and, on that basis, proposes a TF-IDF weight calculation method. Experiments show that the improved methods outperform traditional feature selection on unbalanced data sets and are effective and feasible for improving classification accuracy.
The present invention is realized by the following technical means:
Step 1: preprocess the text set and extract semantic information, as follows:
Step 1.1: use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file set.
Step 1.2: filter out the stop words after segmentation, including modal auxiliaries, prepositions and adverbs.
Step 2: perform the feature selection calculation for the text set, as follows:
Each preprocessed text data set is processed as follows.
Step 2.1: compute the CHI statistic of feature t and category c. Four document counts are used:
the number of texts containing feature t and belonging to category $c_i$, denoted A;
the number of texts containing feature t but not belonging to $c_i$, denoted B;
the number of texts not containing feature t but belonging to $c_i$, denoted C;
the number of texts neither containing feature t nor belonging to $c_i$, denoted D.
The CHI statistic of feature t and category c is:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(1)$$
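A minimal sketch of this computation, assuming the four counts have already been tallied over the training set (function and variable names are illustrative, not from the patent):

```python
def chi_square(A, B, C, D):
    """One-sided CHI statistic of feature t and category c (formula 1)."""
    N = A + B + C + D                      # total number of texts
    if A * D - B * C <= 0:                 # negatively correlated features score 0
        return 0.0
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```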
Step 2.2: compute the inverse category frequency ICF, where M is the total number of categories in the text set and $m_t$ is the number of categories in the document set in which feature t occurs:

$$ICF_{t,C}=\ln\!\left(\frac{M}{m_t}+1\right),\qquad M>0,\ 0\le m_t\le M$$
Step 2.3: compute the improved chi-square statistic:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}\times ICF_{t,C}\times\dfrac{TC_i}{\overline{TC_i}}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(2)$$

where $TC_i$ is the average word frequency of feature t in the positive class and $\overline{TC_i}$ is its average word frequency in the negative class; their ratio measures the correlation between the feature and the category, and the larger the value, the stronger the correlation of t with the positive class. Here $\chi^2(t,c)$ ranges over $[0,+\infty)$.
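Continuing the sketch, formula (2) scales the basic statistic by ICF and by the average-word-frequency ratio; `chi_square` is the helper above, and the placement of the +1 in ICF and the zero-frequency guard are reconstructions the patent does not spell out:

```python
import math

def icf(M, m_t):
    """Inverse category frequency: M categories in total, m_t of them
    contain feature t; the +1 keeps the factor strictly positive."""
    return math.log(M / m_t + 1)

def improved_chi(A, B, C, D, M, m_t, tc_pos, tc_neg):
    """Improved chi-square of formula (2): the basic statistic scaled by
    ICF and by the ratio of the feature's average word frequency in the
    positive class (tc_pos) to that in the negative class (tc_neg)."""
    ratio = tc_pos / tc_neg if tc_neg > 0 else tc_pos  # assumed guard: t absent from negative class
    return chi_square(A, B, C, D) * icf(M, m_t) * ratio
```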
Step 3: compute term weights, as follows:
Weight calculation is performed for each feature word in each text.
Step 3.1: compute the lambda factor, as follows:

$$\lambda(t,c_i)=\frac{DF(t,c_i)}{D(c_i)}\qquad(3)$$

where $DF(t,c_i)$ is the number of texts in class $c_i$ that contain feature item t and $D(c_i)$ is the total number of texts in class $c_i$; λ is thus the fraction of texts in a given class that contain feature word t, and $\lambda(t,c_i)$ ranges over [0,1];
Step 3.2: compute the TF-IDF*λIG value

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG\right]^2}}\qquad(4)$$

Step 3.3: compute TF-IDF*λCHI

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI\right]^2}}\qquad(5)$$

In the formulas of steps 3.2 and 3.3, t denotes a feature item, N is the total number of texts in the text set, and $n_i$ is the number of texts in which feature $t_i$ occurs; $tf_{ij}$ is the number of times feature word $t_i$ occurs in a text $d_j$; $w(t_i,d_j)$ ranges over [0,1].
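A sketch of the step-3 weighting, assuming per-text term frequencies, document frequencies, and the λ·CHI products are precomputed; the TF-IDF*λIG variant of formula (4) differs only in using λ·IG and omitting L:

```python
import math

def lambda_factor(df_t_ci, d_ci):
    """Formula (3): fraction of texts in class c_i containing feature t."""
    return df_t_ci / d_ci

def weight_vector(tfs, ns, N, lam_chi, L=1.0):
    """Formula (5): cosine-normalized TF-IDF * lambda * CHI weights for
    the features of one text.

    tfs[i]     -- term frequency of feature t_i in the text
    ns[i]      -- number of texts containing t_i
    lam_chi[i] -- lambda(t_i, c) * improved chi of t_i
    N, L       -- total text count and the experimental constant L
    """
    raw = [tf * math.log(N / (n + L)) * lc for tf, n, lc in zip(tfs, ns, lam_chi)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw
```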
Step 4: output the classification results.
Compared with the prior art, the present invention has the following obvious advantages and beneficial effects:
The method of the invention considers the distribution of features across the positive and negative classes and can comprehensively select features with strong representativeness and discriminative power, avoiding the unsuitability of traditional feature selection methods on unbalanced data sets. The weight calculation method based on feature combination better solves the problems of high vector-space dimensionality and the extraction of related feature words, improving both the efficiency of the classification program and the precision of classification.
Brief description of the drawings
Fig. 1 is the flow chart of the feature selection and weight calculation method and system for unbalanced text data sets;
Fig. 2 is the broken-line graph of positive-class F1 values under different imbalance ratios;
Fig. 3 shows the experimental results of the improved TF-IDF weight calculation under information gain feature selection;
Fig. 4 shows the comparison of the improved TF-IDF weight calculation under chi-square feature selection.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. As shown in Fig. 1, the proposed method proceeds through the following steps:
Step 1: preprocess the unbalanced text set and extract the words carrying semantic information.
Step 1.1: use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file set.
The word segmentation in the experiments uses the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).
Step 1.2: filter out the stop words after segmentation, such as modal auxiliaries, prepositions and adverbs.
Stop words occurring in large numbers in a text introduce noise that interferes with its effective information. Deleting them achieves a rough dimensionality reduction, the aim being to improve both the efficiency of the classification program and the precision of classification.
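The patent itself uses ICTCLAS; purely as an illustration, the same preprocessing could be sketched with the open-source jieba tagger (the tag names are jieba's and are assumed roughly comparable to ICTCLAS for these word classes):

```python
import jieba.posseg as pseg  # open-source stand-in for ICTCLAS

# POS classes filtered out: u/y (auxiliaries and modal particles),
# p (prepositions), d (adverbs).
STOP_POS = ("u", "y", "p", "d")

def preprocess(text, stopwords=frozenset()):
    """Segment, POS-tag, and filter a Chinese text (steps 1.1-1.2)."""
    return [tok.word for tok in pseg.cut(text)
            if not tok.flag.startswith(STOP_POS) and tok.word not in stopwords]
```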
Step 2: perform the feature selection calculation for the text set
Each preprocessed unbalanced text data set is processed as follows:
Step 2.1: compute the CHI statistic of feature t and category c, where:
(t, c_i): the number of texts containing feature t and belonging to category $c_i$, denoted A;
the number of texts containing feature t but not belonging to $c_i$, denoted B;
the number of texts not containing feature t but belonging to $c_i$, denoted C;
the number of texts neither containing feature t nor belonging to $c_i$, denoted D.
A and D reflect a positive dependence between feature t and category $c_i$, while B and C reflect a negative dependence. In the CHI statistical feature selection method, the CHI statistic of feature t and category c is:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(1)$$
Step 2.2: compute the inverse category frequency ICF of the unbalanced text collection.
Different features discriminate between categories to different degrees; features prominent in the positive class clearly have good category discrimination. The inverse category frequency ICF (Inverse Category Frequency) is computed as:

$$ICF_{t,C}=\ln\!\left(\frac{M}{m_t}+1\right)\qquad(2)$$

where M is the total number of categories in text set C and $m_t$ is the number of categories in C in which feature t occurs. The added 1 prevents ICF from being 0.
Step 2.3: compute the improved chi-square statistic:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}\times ICF_{t,C}\times\dfrac{TC_i}{\overline{TC_i}}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(3)$$

where $TC_i$ is the average word frequency of feature t in the positive class and $\overline{TC_i}$ is its average word frequency in the negative class; their ratio measures the correlation between the feature and the category, and the larger the value, the stronger the correlation of t with the positive class.
Step 3: compute term weights over the unbalanced text set
Feature word weights are determined by the frequency with which a feature word occurs in a text and by the number of texts in which it occurs. The present invention uses the TF-IDF function to compute feature weights.
Term frequency, TF, is the number of times a feature word occurs in a text; the larger the TF value of a feature word, the stronger its ability to represent the category. Inverse document frequency, IDF, captures the idea that the fewer texts contain a feature word, the better that word distinguishes a class of texts, and the larger its weight.
The TF-IDF formula multiplies term frequency by inverse document frequency; the normalized TF-IDF function is:

$$TF_j\times IDF_j=\frac{tf_j\times\log\!\left(\frac{N}{n_j+L}\right)}{\sqrt{\sum_{t\in d_k}\left[tf_j\times\log\!\left(\frac{N}{n_j+L}\right)\right]^2}}\qquad(4)$$

where L is a constant determined experimentally, N is the total number of texts, and $n_j$ is the number of texts containing feature word $t_j$.
The inventors improve the term weight calculation in each text: building on TF-IDF, the ability of a feature word to discriminate text categories is added. TF-IDF captures how frequently a feature item occurs in a text, while the feature selection function captures the relation between the feature item and the text category.
Step 3.1: compute the lambda factor
When the data are unbalanced, even if the share of texts containing a feature word in the "large class" is small, their number may still exceed the number of texts containing that word in the "small class". The lambda factor is introduced as an adjustment:

$$\lambda(t,c_i)=\frac{DF(t,c_i)}{D(c_i)}\qquad(5)$$

where $DF(t,c_i)$ is the number of texts in class $c_i$ that contain feature item t, $D(c_i)$ is the total number of texts in class $c_i$, and λ is the fraction of texts in a given class that contain feature word t;
Step 3.2: add information gain and compute the TF-IDF*λIG value
Information gain (Information Gain) measures the amount of information a feature provides for classification; for each feature t, the larger the gain, the more important the feature is for classification. The information gain of feature t is:

$$IG(t)=-\sum_{i=1}^{n}P(c_i)\log P(c_i)+P(t)\sum_{i=1}^{n}P(c_i|t)\log P(c_i|t)+P(\bar{t})\sum_{i=1}^{n}P(c_i|\bar{t})\log P(c_i|\bar{t})\qquad(6)$$

where $P(c_i)$ is the probability that a text belongs to category $c_i$, $P(t)$ is the probability that feature t appears in the text set, $P(c_i|t)$ is the probability that a text containing feature t belongs to $c_i$, $P(\bar{t})$ is the probability that a text in the text set does not contain feature t, $P(c_i|\bar{t})$ is the probability that a text not containing feature t belongs to $c_i$, and n is the number of categories.
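A sketch of formula (6), assuming the probabilities have been estimated by counting over the training set (P(t̄) is taken as 1 − P(t)):

```python
import math

def information_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """Information gain of feature t per formula (6); the probability
    lists are indexed by category. Zero-probability terms are skipped,
    since p*log(p) tends to 0 as p tends to 0."""
    h = lambda ps: sum(p * math.log(p) for p in ps if p > 0)
    return -h(p_c) + p_t * h(p_c_given_t) + (1 - p_t) * h(p_c_given_not_t)
```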
TF-IDF first selects feature words that occur frequently in a single text but rarely in other texts. Information gain is then used to find words that may not occur in a given sample but express the text's meaning and contribute greatly to distinguishing text categories. Finally the lambda factor is introduced to combine them; the improved formula is:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG\right]^2}}\qquad(7)$$
Step 3.3: introduce the improved chi-square statistic and compute TF-IDF*λCHI
CHI captures the relation between feature words and categories; the lambda factor is introduced and the result is combined with TF-IDF. The improved algorithm favors feature words that occur frequently and carry a large amount of category information. The improved formula is:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI\right]^2}}\qquad(8)$$
Step 4: carry out classification comparison experiments using the improved feature selection and weight calculation methods.
To verify that the method of the present invention improves on traditional methods, the following experiments were carried out.
Step 4.1: feature selection experiment for unbalanced text classification
The experimental data come from the Fudan University Chinese corpus obtained from the scientific research data sharing platform website, and an open testing method is adopted. The corpus comprises 20 categories and is divided into a training set and a test set; the two parts have roughly equal, non-overlapping sample counts, and all texts are in txt format. The category distribution of the training and test sets is shown in Table 1:
Table 1: category distribution of the training set and test set
The category names correspond as follows:
C3-art, C4-literature, C5-education, C6-philosophy, C7-history, C11-space, C15-energy, C16-electronics, C17-communication, C19-computers, C23-mining, C29-transportation, C31-environment, C32-agriculture, C34-economy, C35-law, C36-medicine, C37-military, C38-politics, C39-sports.
In the text classification experiments, the two parts are merged and samples are chosen according to the practical application. C5 and C34, whose sample sizes differ considerably in the Fudan Chinese corpus, are chosen here as the unbalanced data set: 60 texts are drawn at random for the positive class C5, and six groups are drawn at random from the negative class C34 at specific ratios. The experimental data of the unbalanced data set are shown in Table 2:
Table 2: experimental data of the unbalanced data set
Here 3-fold cross-validation is used: the sample set chosen above is divided into 3 groups, 2 of which serve as the training set and 1 as the test set; the process is repeated three times and the mean of the three experimental results is taken.
Word segmentation again uses the Chinese lexical analysis system ICTCLAS, and 1000-dimensional features are selected. The classification algorithm is the support vector machine. Performance is evaluated with the F1 value, the combined measure of precision and recall:

$$F1=\frac{2\times precision\times recall}{precision+recall}\qquad(9)$$
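The evaluation protocol could be reproduced along these lines (a sketch with scikit-learn and random stand-in data; the patent does not name a library):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((120, 1000))        # stand-in for the 1000-dim feature vectors
y = np.array([1] * 60 + [0] * 60)  # positive class C5 vs. one negative group of C34

# 3-fold cross-validation scored by the positive-class F1 of formula (9);
# the reported figure is the mean over the three folds.
scores = cross_val_score(LinearSVC(), X, y, cv=3, scoring="f1")
print("mean positive-class F1: %.3f" % scores.mean())
```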
The experimental results of CHI, IG, and the improved CHI feature selection methods are compared below under different imbalance ratios; the weighting method in this experiment is TF-IDF. The experimental results are as follows:
Table 3: experimental results with TF-IDF feature weighting
Since the classification quality of the positive class in the unbalanced data set is of most concern, and to ease comparative analysis of the experimental data, the positive-class F1 values under different imbalance ratios are plotted as a broken-line graph, shown in Fig. 2. It can be observed that, as the imbalance ratio of the two classes keeps increasing, the negative-class F1 value grows slightly under all three feature selection methods, and the negative-class F1 of the improved CHI method is better than that of CHI and IG.
The positive-class F1 curves show that the variation of positive-class F1 differs considerably across feature selection methods. As the imbalance ratio keeps increasing, the improved CHI method obtains better positive-class F1 values than the other methods and reaches a fairly stable value beyond 1:10; without reducing the negative-class classification performance, the improved CHI method gives due attention to the positive-class samples and achieves satisfactory results.
The improved CHI method comprehensively considers the distribution of features across the positive and negative classes and can select features with strong representativeness and discriminative power. The experimental data also show that the improved method is affected very little by the degree of imbalance of the data set: under different imbalance ratios, the improved CHI method keeps the positive-class performance in a fairly ideal state without reducing the negative-class performance.
In summary, the improved CHI method avoids the unsuitability of traditional feature selection methods on unbalanced data sets and, without reducing negative-class performance, improves positive-class performance by a considerable margin.
Step 4.2: weight calculation experiment for unbalanced text classification
The experimental data again come from the Fudan University Chinese corpus obtained from the scientific research data sharing platform website, with an open testing method. The corpus comprises 20 categories divided into a training set and a test set with roughly equal, non-overlapping sample counts, all in txt format. Ten categories are chosen from it; the sample distribution for training and testing is shown in Table 4.
Table 4: sample distribution for training and testing
The KNN classification algorithm is chosen for model training. With the same feature selection function, classification performance is tested when the weight formula is TF-IDF versus TF-IDF*λ(feature selection function). K is set to 10.
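A sketch of this comparison with scikit-learn's KNN on random stand-in data (the real experiment would plug in the two weighting schemes):

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((400, 1000))    # stand-in for TF-IDF / TF-IDF*lambda vectors
y = rng.integers(0, 10, 400)   # ten categories, as in Table 4

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN with k = 10, as in the experiment; the report yields the per-class
# accuracy and the macro/micro averages compared in Tables 5 and 6.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```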
(1) With information gain IG as the feature selection method, the feature weights are computed with TF-IDF and TF-IDF*λIG respectively. The experimental results are given in Table 5, and the overall comparison is shown in Fig. 3.
Table 5: experimental results of the improved TF-IDF weight calculation under information gain feature selection
It can be seen that the improved TF-IDF*λIG method gives clear improvements in macro-average recall, macro-average precision, and micro-average precision. In terms of per-category accuracy, the improved method gains most on categories C7 and C11, where C7 is a category with relatively few samples; the remaining categories also improve, though to a limited extent.
(2) With the chi-square statistic CHI as the feature selection method, the feature weights are computed with TF-IDF and TF-IDF*λCHI respectively. The experimental results are given in Table 6, and the overall comparison is shown in Fig. 4.
Table 6: experimental results of the improved TF-IDF weight calculation under chi-square feature selection
It can be seen that, although the improved TF-IDF*λCHI method declines slightly in macro-average recall, it improves markedly in macro-average and micro-average precision. The accuracy of most categories improves to some extent, with C39 and C7 improving most noticeably.
The above experiments show that, with the KNN classification model and the weighting improvements based on feature combination, the improved TF-IDF achieves markedly better classification than the traditional TF-IDF method, and shows good classification performance even for individual categories with few samples. This weight calculation method based on feature combination better solves the problems of high vector-space dimensionality and the extraction of related feature words.
The experimental results show that the feature-combination weight improvements proposed by the present invention clearly improve on traditional methods.
Finally it should be noted that the above example merely illustrates, and does not restrict, the technical scheme of the invention; although this specification has described the invention in detail with reference to the above example, those of ordinary skill in the art should appreciate that the invention may still be modified or equivalently substituted, and all technical schemes and improvements that do not depart from the spirit and scope of the invention should be encompassed within the scope of the claims of the present invention.

Claims (1)

1. A feature selection and weight calculation method and system for an unbalanced text set, realized according to the following steps:
Step 1: preprocess the text set and extract semantic information, as follows:
Step 1.1: use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file set;
Step 1.2: filter out the stop words after segmentation: modal auxiliaries, prepositions, adverbs;
Step 2: perform the feature selection calculation for the text set, as follows:
Each preprocessed text data set is processed as follows
Step 2.1: compute the CHI statistic of feature t and category c
the number of texts containing feature t and belonging to category $c_i$ is denoted A;
the number of texts containing feature t but not belonging to $c_i$ is denoted B;
the number of texts not containing feature t but belonging to $c_i$ is denoted C;
the number of texts neither containing feature t nor belonging to $c_i$ is denoted D;
The CHI statistic of feature t and category c is:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}$$
Step 2.2: compute the inverse category frequency ICF, where M is the total number of categories in text set C and $m_t$ is the number of categories in C in which feature t occurs:

$$ICF_{t,C}=\ln\!\left(\frac{M}{m_t}+1\right)$$
Step 2.3: compute the improved chi-square statistic, as follows:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}\times ICF_{t,C}\times\dfrac{TC_i}{\overline{TC_i}}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}$$

where $TC_i$ is the average word frequency of feature t in the positive class and $\overline{TC_i}$ is its average word frequency in the negative class; their ratio measures the correlation between the feature and the category, and the larger the value, the stronger the correlation of t with the positive class;
Step 3: compute term weights
Weight calculation is performed for each feature word in each text
Step 3.1: compute the lambda factor, as follows:

$$\lambda(t,c_i)=\frac{DF(t,c_i)}{D(c_i)}$$

where $DF(t,c_i)$ is the number of texts in class $c_i$ that contain feature item t, $D(c_i)$ is the total number of texts in class $c_i$, and λ is the fraction of texts in a given class that contain feature word t;
Step 3.2: compute the TF-IDF*λIG value, as follows:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG\right]^2}}$$

Step 3.3: compute TF-IDF*λCHI, as follows:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI\right]^2}}$$
Step 4: output the classification results.
CN201410149441.8A 2014-04-13 2014-04-13 Feature selection and weight calculation method for an unbalanced text set Active CN103886108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410149441.8A CN103886108B (en) 2014-04-13 2014-04-13 Feature selection and weight calculation method for an unbalanced text set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410149441.8A CN103886108B (en) 2014-04-13 2014-04-13 Feature selection and weight calculation method for an unbalanced text set

Publications (2)

Publication Number Publication Date
CN103886108A true CN103886108A (en) 2014-06-25
CN103886108B CN103886108B (en) 2017-09-01

Family

ID=50955000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410149441.8A Active CN103886108B (en) Feature selection and weight calculation method for an unbalanced text set

Country Status (1)

Country Link
CN (1) CN103886108B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN105808718A (en) * 2016-03-07 2016-07-27 浙江工业大学 Text feature selection method based on unbalanced data sets
CN106502990A (en) * 2016-10-27 2017-03-15 广东工业大学 A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109492219A (en) * 2018-10-25 2019-03-19 山东省通信管理局 A kind of swindle website identification method analyzed based on tagsort and emotional semantic
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN110019654A (en) * 2017-07-20 2019-07-16 南方电网传媒有限公司 A kind of unbalance network text classification optimization system
CN110347833A (en) * 2019-07-09 2019-10-18 浙江工业大学 A kind of classification method of more wheel dialogues
CN110705247A (en) * 2019-08-30 2020-01-17 山东科技大学 Text similarity calculation method based on χ²-C

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
US9715493B2 (en) * 2012-09-28 2017-07-25 Semeon Analytics Inc. Method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model
CN103049435B (en) * 2013-01-04 2015-10-14 浙江工商大学 Text fine granularity sentiment analysis method and device
CN103218444B (en) * 2013-04-22 2016-12-28 中央民族大学 Based on semantic method of Tibetan language webpage text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊忠阳, 张鹏招, 张玉芳: "基于χ²统计的文本分类特征选择方法的研究" (Research on feature selection methods for text classification based on the χ² statistic), 《计算机应用》 (Journal of Computer Applications) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN105512311B (en) * 2015-12-14 2019-02-26 北京工业大学 A kind of adaptive features select method based on chi-square statistics
CN105808718B (en) * 2016-03-07 2019-02-01 浙江工业大学 A kind of text feature selection method based on unbalanced dataset
CN105808718A (en) * 2016-03-07 2016-07-27 浙江工业大学 Text feature selection method based on unbalanced data sets
CN106502990A (en) * 2016-10-27 2017-03-15 广东工业大学 A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN110019654A (en) * 2017-07-20 2019-07-16 南方电网传媒有限公司 A kind of unbalance network text classification optimization system
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN109492219A (en) * 2018-10-25 2019-03-19 山东省通信管理局 A kind of swindle website identification method analyzed based on tagsort and emotional semantic
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109471942B (en) * 2018-11-07 2021-09-07 合肥工业大学 Chinese comment emotion classification method and device based on evidence reasoning rule
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN110347833A (en) * 2019-07-09 2019-10-18 浙江工业大学 A kind of classification method of more wheel dialogues
CN110705247A (en) * 2019-08-30 2020-01-17 山东科技大学 Text similarity calculation method based on χ²-C

Also Published As

Publication number Publication date
CN103886108B (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN103886108A (en) Feature selection and weight calculation method of imbalance text set
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104391835A (en) Method and device for selecting feature words in texts
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN104239436A (en) Network hot event detection method based on text classification and clustering analysis
CN104298715B (en) A kind of more indexed results ordering by merging methods based on TF IDF
CN103390051A (en) Topic detection and tracking method based on microblog data
CN105912716A (en) Short text classification method and apparatus
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN101944099A (en) Method for automatically classifying text documents by utilizing body
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN102629272A (en) Clustering based optimization method for examination system database
CN104239512A (en) Text recommendation method
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN106021578A (en) Improved text classification algorithm based on integration of cluster and membership degree
CN108090178A (en) A kind of text data analysis method, device, server and storage medium
CN103914551A (en) Method for extending semantic information of microblogs and selecting features thereof
CN102999538B (en) Personage's searching method and equipment
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN102929977B (en) Event tracing method aiming at news website
Huang et al. Topic detection from microblog based on text clustering and topic model analysis
CN105224689A (en) A kind of Dongba document sorting technique
Yang et al. Research on Chinese text classification based on Word2vec

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200813

Address after: A5, block D, Xisanqi cultural science and Technology Park, yard 27, xixiaokou Road, Haidian District, Beijing 100085

Patentee after: Goonie International Software (Beijing) Co.,Ltd.

Address before: 100124 Chaoyang District, Beijing Ping Park, No. 100

Patentee before: Beijing University of Technology

TR01 Transfer of patent right