CN106844596A - A Chinese text classification method based on an improved SVM - Google Patents

A Chinese text classification method based on an improved SVM

Info

Publication number
CN106844596A
CN106844596A (application CN201710026144.8A)
Authority
CN
China
Prior art keywords
text
characteristic item
classification
vector
Chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710026144.8A
Other languages
Chinese (zh)
Inventor
邱志斌
向靓
涂高元
郭永兴
陆云燕
陈雅贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XIAMEN TIPRAY TECHNOLOGY Co Ltd
Original Assignee
XIAMEN TIPRAY TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XIAMEN TIPRAY TECHNOLOGY Co Ltd filed Critical XIAMEN TIPRAY TECHNOLOGY Co Ltd
Priority to CN201710026144.8A priority Critical patent/CN106844596A/en
Publication of CN106844596A publication Critical patent/CN106844596A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a Chinese text classification method based on an improved SVM, comprising the following steps. Step 1: preprocess the Chinese text to obtain a feature-item set. Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set. Step 3: compute a weight for each item in the reduced feature-item set. Step 4: build a text vector, taking each keyword of the text as one dimension of the vector space, with the value on that dimension being the keyword's weight. Step 5: build a classifier using a weighted support vector machine. Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result. This classification method can improve text classification accuracy.

Description

A Chinese text classification method based on an improved SVM
Technical field
The invention belongs to the field of data mining technology, and in particular relates to a Chinese text classification method based on an improved SVM.
Background art
A text classification method is a supervised classification technique: a classifier is trained on a text dataset whose categories have already been labeled (the training set), and texts of unknown category are then classified with the trained classifier. Existing classification methods and their defects are:
(1) Traditional machine learning methods such as the Bayes method and the K-nearest-neighbor algorithm are all built on empirical risk minimization, and their generalization performance is not ideal;
(2) The traditional support vector machine (SVM, Support Vector Machine) method is a newer pattern recognition method based on the structural risk minimization principle, characterized by small-sample learning, good generalization and a globally optimal solution. In practice, however, class-imbalanced classification problems are pervasive, and in such cases the traditional SVM method suffers a high error rate and needs improvement.
Summary of the invention
The purpose of the present invention is to provide a Chinese text classification method based on an improved SVM that can improve text classification accuracy.
To achieve the above purpose, the solution of the invention is:
A Chinese text classification method based on an improved SVM comprises the following steps:
Step 1: preprocess the Chinese text to obtain a feature-item set;
Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set;
Step 3: compute a weight for each item in the reduced feature-item set;
Step 4: build a text vector, taking each keyword of the text as one dimension of the vector space, with the value on that dimension being the keyword's weight;
Step 5: build a classifier using a weighted support vector machine;
Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result.
In step 1, preprocessing the Chinese text comprises two processes: Chinese word segmentation and stop-word removal.
The specific content of step 2 is: construct an evaluation function to score all feature items in the feature-item set, sort them by score in descending order, and select the top feature items according to a preset threshold or a preset number of feature items, yielding the reduced feature-item set.
The evaluation function uses the chi-square test. Assuming that feature item t and category C_i obey a χ² distribution with one degree of freedom, the statistic is computed as:
χ²(t, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number that contain t but do not belong to C_i, C is the number that do not contain t but belong to C_i, and D is the number that neither contain t nor belong to C_i;
the χ² statistics of all feature items t are then sorted from largest to smallest, and the top few are taken as the reduced feature-item set.
In step 3, weights are computed with the inverse document frequency; the weight IDF is computed as:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
The detailed content of step 5 is:
Let the training sample set be {(x̄_i, y_i)}, i = 1, 2, …, m, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x̄_i denotes the vector of the i-th text and y_i is its category label. The text classification model based on the weighted support vector machine is:
min (1/2)‖W̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i
s.t. y_i(W̄ᵀΦ(x̄_i) + b) ≥ 1 − ζ_i
where ζ_i ≥ 0, i = 1, 2, …, l, l is the number of samples, and Φ(·) is the kernel mapping; S_i > 0 is the sample-importance weight: 0 < S_i < 1 means sample x̄_i is unimportant, S_i = 1 means it is of ordinary importance, and S_i > 1 means it is very important; the class weight is σ ≥ 1, and samples belonging to the same category share the same class weight;
the Lagrangian function is constructed as:
L(w̄, b, α) = (1/2)‖w̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w̄ᵀΦ(x̄_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i
where α_i and β_i are Lagrange multipliers, i = 1, 2, …, l;
finally, the optimal classifier is obtained:
f(x̄_j) = sgn(Σ_{i=1}^{l} y_i α_i* K(x̄_i, x̄_j) + b*)
where K(x̄_i, x̄_j) is the radial basis kernel function.
With the above scheme, the present invention adds a weighting step to the traditional SVM-based text classification method, which effectively alleviates the sample-imbalance situation common in multi-class Chinese text classification. Applying the improved weighted-SVM text classification method to the daily document classification of enterprises and institutions improves classification accuracy and ensures that files of certain important categories (such as financial files) are not missed, safeguarding data security to a certain extent.
Brief description of the drawings
Fig. 1 is a flow chart of the training stage of the present invention;
Fig. 2 is a flow chart of the classification stage of the present invention.
Specific embodiments
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings.
The present invention provides a Chinese text classification method based on an improved SVM. Text classification is the process of matching a text document against a set of categories, and comprises two stages, training and classification; the flow chart of the training stage is shown in Fig. 1 and that of the classification stage in Fig. 2. The processing of the two stages is identical except for the last step: in the training stage, the last step feeds the input data to the classification algorithm to build the classifier, while in the classification stage it classifies with the trained classifier. The classification method comprises the following steps:
(1) Training stage
Step 1: Chinese text preprocessing, comprising two processes, Chinese word segmentation and stop-word removal.
Chinese word segmentation analyzes a sentence expressed in Chinese to find the meaningful words or phrases it contains and extracts them, so that the original Chinese sentence becomes a sequence of individual words.
Stop-word removal generally removes words that occur very frequently in text but carry little practical meaning, such as common function words like 和 ("and") and 则 ("then"), overly frequent words such as 我 ("I") and 如果 ("if"), and the various punctuation marks, so as to avoid excessive noise after segmentation.
This step can use the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) segmenter of the Chinese Academy of Sciences. ICTCLAS combines the advantages of both dictionary-matching and statistical-analysis segmentation methods: it retains the fast, efficient segmentation of dictionary matching while using statistical analysis of the context to recognize new words and resolve ambiguities.
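As an illustrative sketch only (not the patent's implementation), the stop-word and punctuation removal applied after segmentation can be written as follows. The stop-word list here is a tiny sample, and the token list stands in for the output of a segmenter such as ICTCLAS:

```python
import re

# Tiny illustrative stop-word list; a production system would load a
# full Chinese stop-word lexicon.
STOP_WORDS = {"和", "则", "我", "如果"}

def remove_stop_words(tokens):
    """Drop stop words and pure-punctuation tokens from a segmented text."""
    kept = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Keep only tokens containing at least one word character,
        # discarding punctuation marks left over from segmentation.
        if not re.search(r"\w", tok):
            continue
        kept.append(tok)
    return kept

# Tokens as they might come out of a segmenter such as ICTCLAS.
print(remove_stop_words(["我", "爱", "自然", "语言", "处理", "。"]))
# → ['爱', '自然', '语言', '处理']
```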
Step 2: Feature selection
After text preprocessing the text exists as a feature-item set, and the number of feature items in this set is very large, so dimensionality reduction, i.e. feature selection, must be applied to it. An evaluation function (this embodiment uses the chi-square test) scores all feature items in the set; the items are then sorted by score in descending order, and the top items are selected according to a preset threshold or a preset number of feature items.
Chi-square test: assume that feature item t and category C_i obey a χ² distribution with one degree of freedom. The higher the χ² statistic of feature item t for category C_i, the stronger the correlation between t and C_i and the greater its power to discriminate the category; conversely, the lower the statistic, the weaker the discrimination. It is computed as:
χ²(t, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number that contain t but do not belong to C_i, C is the number that do not contain t but belong to C_i, and D is the number that neither contain t nor belong to C_i.
The χ² statistics of all feature items t are then sorted from largest to smallest, and the top few are taken as the reduced feature-item set.
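This selection step can be sketched directly from the contingency counts above (function names are illustrative, not from the patent):

```python
def chi_square(A, B, C, D):
    """chi2(t, Ci) = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)).

    A: texts containing t and in Ci      B: texts containing t, not in Ci
    C: texts without t and in Ci         D: texts without t, not in Ci
    N = A + B + C + D is the total number of texts.
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:          # degenerate contingency table
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

def select_features(counts, k):
    """Keep the k feature items with the largest chi-square statistics.

    counts maps each feature item t to its (A, B, C, D) contingency counts.
    """
    ranked = sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)
    return ranked[:k]

# A class-correlated feature outscores a feature spread evenly over classes.
counts = {"股票": (40, 5, 10, 45), "今天": (25, 25, 25, 25)}
print(select_features(counts, 1))  # → ['股票']
```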
Step 3: Weight calculation
The present invention computes weights with the inverse document frequency (IDF). The IDF value of a particular word is a measure of the word's general importance: divide the total number of documents by the number of documents containing the word, then take the logarithm (log) of the quotient. The IDF value is computed as:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
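The IDF computation can be sketched as follows. The base of the logarithm is not fixed by the formula above; the natural log is used here, which only rescales all weights by a constant:

```python
import math

def idf(d_all, d_t):
    """IDF = log(D_all / D_t): total documents over documents containing the word."""
    return math.log(d_all / d_t)

def idf_table(documents):
    """Compute the IDF weight of every term in a corpus of token lists."""
    d_all = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):               # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: idf(d_all, d_t) for term, d_t in doc_freq.items()}

corpus = [["财务", "报表"], ["财务", "审计"], ["比赛", "报表"]]
weights = idf_table(corpus)
print(round(weights["审计"], 3))  # occurs in 1 of 3 documents → 1.099
```

A term that appears in every document gets IDF = log(1) = 0, matching the intuition that such a term carries no discriminative weight.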
Step 4: Text representation
To make texts easy for a computer to process, they are represented with the vector space model. In the text vector space, each keyword is one dimension of the vector space, and the value on that dimension is the keyword's weight, which expresses the keyword's degree of importance.
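Under this representation, a text maps to one vector component per selected keyword, holding the keyword's weight when it occurs in the text. A minimal sketch with illustrative names:

```python
def text_to_vector(tokens, vocabulary, weights):
    """Vector-space representation of one text.

    vocabulary: ordered list of feature items kept after feature selection
    weights:    feature item -> weight (e.g. its IDF value)
    The i-th component is the i-th keyword's weight if that keyword occurs
    in the text, and 0.0 otherwise.
    """
    present = set(tokens)
    return [weights[term] if term in present else 0.0 for term in vocabulary]

vocab = ["财务", "报表", "比赛"]
w = {"财务": 1.1, "报表": 0.41, "比赛": 1.1}
print(text_to_vector(["财务", "报表"], vocab, w))  # → [1.1, 0.41, 0.0]
```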
Step 5: Build the classifier
The present invention uses the improved support vector machine method — the weighted support vector machine — to build the classifier, in order to solve classification under sample imbalance. Besides large disparities in per-class sample counts, differences in category importance also cause imbalance. For example, when classifying the texts of a company, the importance of "financial files" is clearly higher than that of "sports-meeting files". While ensuring classification accuracy, misjudging important categories should be avoided as far as possible.
The steps are as follows:
1. The weighted support vector machine assigns class weights to the training samples to reflect the importance of different categories. Increasing the weight of an important document class effectively reduces the number of misclassified samples in that class.
2. Furthermore, since the importance of individual texts also varies — i.e. their contributions to classification differ — each text is given a sample weight, which raises every text's chance of being classified correctly and lowers the chance that an important text is misclassified, thereby improving classification accuracy.
The specific algorithm and derivation are as follows:
Let the training sample set be {(x̄_i, y_i)}, i = 1, 2, …, m, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x̄_i denotes the vector of the i-th text and y_i is its category label (in this embodiment the number of categories is 10; e.g. y_i = 1 means the i-th text belongs to the 2nd category). The text classification model based on the weighted support vector machine is:
min (1/2)‖W̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i
s.t. y_i(W̄ᵀΦ(x̄_i) + b) ≥ 1 − ζ_i
where ζ_i ≥ 0, i = 1, 2, …, l, l is the number of samples, and Φ(·) is the kernel mapping. S_i > 0 is the sample-importance weight: 0 < S_i < 1 means sample x̄_i is unimportant, S_i = 1 means it is of ordinary importance, and S_i > 1 means it is very important. The class weight is σ ≥ 1, and samples belonging to the same category share the same class weight. Compared with the standard support vector machine, the most prominent advantage of the weighted support vector machine is that it refines the penalty for misclassified samples: the slack variable of each sample is multiplied by the sample's importance weight and its class weight.
The Lagrangian function is constructed as:
L(w̄, b, α) = (1/2)‖w̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w̄ᵀΦ(x̄_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i
where α_i and β_i are Lagrange multipliers, i = 1, 2, …, l.
Finally, the optimal classifier is obtained:
f(x̄_j) = sgn(Σ_{i=1}^{l} y_i α_i* K(x̄_i, x̄_j) + b*)
where K(x̄_i, x̄_j) is the radial basis kernel function.
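The core idea of the model — scaling each sample's slack penalty by its importance weight S_i and class weight σ — can be illustrated with a simple Pegasos-style linear subgradient solver. This is a sketch under stated assumptions, not the patent's method: the patent solves the dual with an RBF kernel, whereas this toy solver is linear and all names are illustrative.

```python
import random

def train_weighted_linear_svm(X, y, sample_w, class_w,
                              lam=0.01, epochs=200, seed=0):
    """Stochastic subgradient training of a linear soft-margin SVM whose
    hinge penalty is scaled per sample (importance weight S_i) and per
    class (class weight sigma), mirroring the C*sigma*S_i*zeta_i term in
    the model above. Labels y are +1/-1; the bias b is folded in as an
    extra constant feature.
    """
    rng = random.Random(seed)
    dim = len(X[0]) + 1                      # +1 for the bias feature
    w = [0.0] * dim
    n = len(X)
    for t in range(1, epochs * n + 1):
        i = rng.randrange(n)
        xi = X[i] + [1.0]
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
        eta = 1.0 / (lam * t)                # decaying Pegasos step size
        cost = sample_w[i] * class_w[y[i]]   # S_i * sigma for this sample
        for j in range(dim):
            grad = lam * w[j]
            if margin < 1.0:                 # sample violates the margin
                grad -= cost * y[i] * xi[j]
            w[j] -= eta * grad
    return w

def predict(w, x):
    s = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1 if s >= 0 else -1

# Toy linearly separable data with uniform weights.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w = train_weighted_linear_svm(X, y, sample_w=[1.0] * 4,
                              class_w={1: 1.0, -1: 1.0})
print(predict(w, [2.5, 2.5]), predict(w, [-2.5, -2.5]))  # → 1 -1
```

Raising class_w for a label, or a text's entry in sample_w, increases the penalty for misclassifying it — the same mechanism the patent uses to protect important document categories such as financial files.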
(2) Classification stage
For a text to be classified, first process the text with steps 1-4 of the training stage to obtain its corresponding text vector X; then input X into the classifier f(·) built in step 5 to obtain the classification result f(X) corresponding to X, i.e. the category of the text.
The above embodiment only illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical scheme according to the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (6)

1. A Chinese text classification method based on an improved SVM, characterized by comprising the following steps:
Step 1: preprocess the Chinese text to obtain a feature-item set;
Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set;
Step 3: compute a weight for each item in the reduced feature-item set;
Step 4: build a text vector, taking each keyword of the text as one dimension of the vector space, with the value on that dimension being the keyword's weight;
Step 5: build a classifier using a weighted support vector machine;
Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result.
2. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: in step 1, preprocessing the Chinese text comprises two processes, Chinese word segmentation and stop-word removal.
3. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: the specific content of step 2 is: construct an evaluation function to score all feature items in the feature-item set, sort them by score in descending order, and select the top feature items according to a preset threshold or a preset number of feature items, yielding the reduced feature-item set.
4. The Chinese text classification method based on an improved SVM according to claim 3, characterized in that: the evaluation function uses the chi-square test, assuming that feature item t and category C_i obey a χ² distribution with one degree of freedom, computed as:
χ²(t, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number that contain t but do not belong to C_i, C is the number that do not contain t but belong to C_i, and D is the number that neither contain t nor belong to C_i;
the χ² statistics of all feature items t are then sorted from largest to smallest, and the top few are taken as the reduced feature-item set.
5. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: in step 3, weights are computed with the inverse document frequency; the weight IDF is computed as:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
6. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: the detailed content of step 5 is:
Let the training sample set be {(x̄_i, y_i)}, i = 1, 2, …, m, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x̄_i denotes the vector of the i-th text and y_i is its category label; the text classification model based on the weighted support vector machine is:
min (1/2)‖W̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i
s.t. y_i(W̄ᵀΦ(x̄_i) + b) ≥ 1 − ζ_i
where ζ_i ≥ 0, i = 1, 2, …, l, l is the number of samples, and Φ(·) is the kernel mapping; S_i > 0 is the sample-importance weight: 0 < S_i < 1 means sample x̄_i is unimportant, S_i = 1 means it is of ordinary importance, and S_i > 1 means it is very important; the class weight is σ ≥ 1, and samples belonging to the same category share the same class weight;
the Lagrangian function is constructed as:
L(w̄, b, α) = (1/2)‖w̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w̄ᵀΦ(x̄_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i
where α_i and β_i are Lagrange multipliers, i = 1, 2, …, l;
finally, the optimal classifier is obtained:
f(x̄_j) = sgn(Σ_{i=1}^{l} y_i α_i* K(x̄_i, x̄_j) + b*)
where K(x̄_i, x̄_j) is the radial basis kernel function.
CN201710026144.8A 2017-01-13 2017-01-13 A Chinese text classification method based on an improved SVM Pending CN106844596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710026144.8A CN106844596A (en) 2017-01-13 2017-01-13 A Chinese text classification method based on an improved SVM


Publications (1)

Publication Number Publication Date
CN106844596A true CN106844596A (en) 2017-06-13

Family

ID=59124204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710026144.8A Pending CN106844596A (en) 2017-01-13 2017-01-13 One kind is based on improved SVM Chinese Text Categorizations

Country Status (1)

Country Link
CN (1) CN106844596A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409537A * 2018-09-29 2019-03-01 深圳市元征科技股份有限公司 Maintenance case classification method and device
CN109815334A * 2019-01-25 2019-05-28 武汉斗鱼鱼乐网络科技有限公司 Bullet-screen text classification method, storage medium, device and system
CN109947941A * 2019-03-05 2019-06-28 永大电梯设备(中国)有限公司 Method and system for classifying elevator customer-service texts
CN110377734A * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 Text classification method based on support vector machines

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101127086A * 2007-09-12 2008-02-20 哈尔滨工程大学 Hyperspectral image repeated-selection weighted classification method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHUANG, DONG et al.: "Efficient Text Classification by Weighted Proximal SVM", Fifth IEEE International Conference on Data Mining *
姜嘉琪: "Class- and sample-weighted support vector machines and their application in intrusion detection", China Master's Theses Full-text Database, Information Science and Technology Series *
熊忠阳 et al.: "Research on χ²-statistic-based feature selection methods for text classification", Journal of Computer Applications *


Similar Documents

Publication Publication Date Title
Nguyen et al. Comparative study of sentiment analysis with product reviews using machine learning and lexicon-based approaches
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN105912716A (en) Short text classification method and apparatus
CN106844596A (en) One kind is based on improved SVM Chinese Text Categorizations
BaygIn Classification of text documents based on Naive Bayes using N-Gram features
US8560466B2 (en) Method and arrangement for automatic charset detection
CN104820703A (en) Text fine classification method
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN108090178A (en) A kind of text data analysis method, device, server and storage medium
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
Budhiraja et al. A supervised learning approach for heading detection
Uslu et al. Towards a DDC-based topic network model of wikipedia
CN110968693A (en) Multi-label text classification calculation method based on ensemble learning
Ferreira et al. Using a genetic algorithm approach to study the impact of imbalanced corpora in sentiment analysis
Zobeidi et al. Effective text classification using multi-level fuzzy neural network
CN114996446B (en) Text classification method, device and storage medium
Diwakar et al. Proposed machine learning classifier algorithm for sentiment analysis
CN115358340A (en) Credit credit collection short message distinguishing method, system, equipment and storage medium
Islam et al. Performance measurement of multiple supervised learning algorithms for Bengali news headline sentiment classification
Abdulla et al. Fake News Detection: A Graph Mining Approach
Hirsch et al. Evolving rules for document classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170613