CN106844596A - A Chinese text classification method based on an improved SVM - Google Patents
- Publication number
- CN106844596A CN106844596A CN201710026144.8A CN201710026144A CN106844596A CN 106844596 A CN106844596 A CN 106844596A CN 201710026144 A CN201710026144 A CN 201710026144A CN 106844596 A CN106844596 A CN 106844596A
- Authority
- CN
- China
- Prior art keywords
- text
- characteristic item
- classification
- vector
- chinese
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a Chinese text classification method based on an improved SVM, comprising the following steps. Step 1: preprocess the Chinese text to obtain a feature-item set. Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set. Step 3: compute weights for the reduced feature-item set. Step 4: build text vectors, taking each keyword in a text as one dimension of the vector space, with the value in that dimension being the keyword's weight. Step 5: build a classifier using a weighted support vector machine. Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result. This classification method can improve text classification precision.
Description
Technical field
The invention belongs to the field of data mining technology, and more particularly relates to a Chinese text classification method based on an improved SVM.
Background technology
A text classification method is a supervised classification technique: it trains a classifier with a set of text data whose categories have already been labeled (i.e., the training set), and then classifies texts of unknown category with the trained classifier. Existing classification methods and their defects are:
(1) Traditional machine learning methods such as the Bayes method and the K-nearest-neighbor algorithm are all realized on the basis of empirical risk minimization, and their generalization performance is not ideal;
(2) The traditional support vector machine (SVM) method is a newer pattern recognition method based on the structural risk minimization principle, with features such as suitability for small samples, good generalization performance and global optimality. In real operation, however, the class-imbalance problem is widespread, and in that case the traditional support vector machine method has a relatively high misclassification rate and leaves room for improvement.
The content of the invention
The purpose of the present invention is to provide a Chinese text classification method based on an improved SVM that can improve text classification precision.
In order to achieve the above purpose, the solution of the invention is:
A Chinese text classification method based on an improved SVM comprises the following steps:
Step 1: preprocess the Chinese text to obtain a feature-item set;
Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set;
Step 3: compute weights for the reduced feature-item set;
Step 4: build text vectors, taking each keyword in a text as one dimension of the vector space, with the value in that dimension being the keyword's weight;
Step 5: build a classifier using a weighted support vector machine;
Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result.
In the above step 1, preprocessing the Chinese text includes two processes: Chinese word segmentation and stop-word removal.
The particular content of the above step 2 is: construct an evaluation function to assess all feature items in the feature-item set, sort them by their evaluation values in descending order, and select the top feature items according to a preset threshold or a preset number of feature items, obtaining the reduced feature-item set.
The above evaluation function uses the chi-square test: assume the relation between feature item t and category C_i obeys the χ² distribution with one degree of freedom; its computing formula is as follows:

χ²(t, C_i) = N(AD − BC)² / ((A + B)(C + D)(A + C)(B + D))

where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number of texts that contain t but do not belong to C_i, C is the number of texts that do not contain t but belong to C_i, and D is the number of texts that neither contain t nor belong to C_i;
then, the χ² statistics of the feature items t are arranged in descending order, and the top several are taken as the reduced feature-item set.
In the above step 3, the weight is computed with the inverse document frequency; the computing formula of the weight IDF is:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
The detailed content of the above step 5 is:
Let the training sample set be expressed as {(x_i, y_i) | i = 1, 2, ..., m}, x_i ∈ R^n, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x_i represents the vector of the i-th text and y_i is its category label. The text classification model based on the weighted support vector machine is expressed as follows:

min_{w, b, ζ} (1/2)‖w‖² + C Σ_{i=1}^{l} σ_{y_i} S_i ζ_i
s.t. y_i(w·φ(x_i) + b) ≥ 1 − ζ_i, ζ_i ≥ 0, i = 1, 2, ..., l

where ζ_i ≥ 0, i = 1, 2, ..., l, l is the number of samples and φ(x_i) is the kernel mapping; S_i > 0 is the importance weight of sample x_i: 0 < S_i < 1 means x_i is unimportant, S_i = 1 means x_i is of ordinary importance, and S_i > 1 means x_i is very important; the category weight of a sample is σ ≥ 1, and samples belonging to the same category share the same category weight;
the Lagrangian function is constructed as follows:

L(w, b, ζ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^{l} σ_{y_i} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w·φ(x_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i

where α_i and β_i are the Lagrange multipliers, i = 1, 2, ..., l;
finally the optimal classifier is obtained:

f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )

where K(x_i, x) = exp(−‖x − x_i‖² / (2g²)) is the radial basis kernel function.
With the above scheme, the present invention adds a weighting step to the traditional support-vector-machine-based text classification method, which can effectively alleviate the sample imbalance that currently exists in multi-class Chinese text classification. Applying the improved weighted-support-vector-machine text classification method to the daily document classification of enterprises and institutions improves classification precision and ensures that files of certain important categories (such as finance files) are not missed, guaranteeing data security to a certain extent.
Brief description of the drawings
Fig. 1 is the flow chart of the training stage of the invention;
Fig. 2 is the flow chart of the classification stage of the invention.
Specific embodiment
The technical scheme of the invention is described in detail below with reference to the accompanying drawings.
The present invention provides a Chinese text classification method based on an improved SVM. Text classification is the process of matching a text document against a predefined set of categories; it includes two stages, training and classification. The flow chart of the training stage is shown in Fig. 1 and that of the classification stage in Fig. 2. The processing of the two stages is identical except for the final step: in the training stage, the final step uses the input data to build the classifier with the classification algorithm, whereas in the classification stage it classifies with the trained classifier. The classification method comprises the following steps:
(1) training stage
Step 1: Chinese text preprocessing, including two processes, Chinese word segmentation and stop-word removal.
Chinese word segmentation refers to analyzing a sentence expressed in Chinese for the meaningful words or phrases it contains and finally extracting those words from the sentence, so that the original Chinese sentence becomes a sequence of single words.
Stop-word removal generally refers to removing words that occur very frequently in the text yet carry little practical meaning, such as common function words like "and" (和) and "then" (然后), over-frequent words such as "I" (我) and "if" (如果), and the various punctuation marks, so as to avoid excessive interference after segmentation.
This step can use the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) word segmentation system of the Chinese Academy of Sciences. ICTCLAS makes full use of the advantages of both the dictionary-matching and the statistical-analysis segmentation methods: it retains the fast segmentation speed and high efficiency of dictionary matching, while using statistical analysis of the context to recognize new words and resolve ambiguities.
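As an illustration of this preprocessing step, the sketch below filters stop words from an already-segmented token list. The stop-word set is a small illustrative sample, not the patent's list, and segmentation itself is assumed to come from a system such as ICTCLAS:

```python
# A small illustrative sample of common Chinese stop words and punctuation;
# a production stop-word list would be far larger.
STOP_WORDS = {"的", "了", "和", "然后", "我", "如果", "，", "。", "！"}

def remove_stop_words(tokens):
    """Drop stop words and punctuation from an already-segmented token list.
    Segmentation is assumed to come from a system such as ICTCLAS."""
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `remove_stop_words(["我", "喜欢", "的", "书", "。"])` keeps only the content words "喜欢" and "书".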
Step 2: feature selection
After text preprocessing, the text exists in the form of a feature-item set, and the number of feature items in this set is very large, so dimensionality reduction, i.e. feature selection, must be performed on it. An evaluation function (the chi-square test in this embodiment) is constructed to assess all feature items in the feature-item set; the items are then sorted by their evaluation values in descending order, and the top items are selected according to a preset threshold or a preset number of feature items.
Chi-square test: assume the relation between feature item t and category C_i obeys the χ² distribution with one degree of freedom. The higher the χ² statistic of feature item t for category C_i, the stronger the correlation between t and C_i and the greater its power to discriminate that category; conversely, the smaller that power. The computing formula is as follows:

χ²(t, C_i) = N(AD − BC)² / ((A + B)(C + D)(A + C)(B + D))

where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number of texts that contain t but do not belong to C_i, C is the number of texts that do not contain t but belong to C_i, and D is the number of texts that neither contain t nor belong to C_i.
Then, the χ² statistics of the feature items t are arranged in descending order, and the top several are taken as the reduced feature-item set.
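A minimal sketch of this χ² feature selection, assuming each document is given as a set of feature items (function and variable names are illustrative, not from the patent):

```python
def chi_square(docs, labels, category):
    """Score every feature item t against one category C_i using
    chi2(t, C_i) = N*(A*D - B*C)^2 / ((A+B)*(C+D)*(A+C)*(B+D)),
    with A, B, C, D the four document counts defined in the text."""
    N = len(docs)
    vocab = {t for d in docs for t in d}
    scores = {}
    for t in vocab:
        A = B = C = D = 0
        for doc, y in zip(docs, labels):
            has_t, in_c = t in doc, y == category
            if has_t and in_c:
                A += 1
            elif has_t:
                B += 1
            elif in_c:
                C += 1
            else:
                D += 1
        denom = (A + B) * (C + D) * (A + C) * (B + D)
        scores[t] = N * (A * D - B * C) ** 2 / denom if denom else 0.0
    return scores

def select_features(docs, labels, category, k):
    """Keep the k feature items with the highest chi-square statistics."""
    scores = chi_square(docs, labels, category)
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

Note that a term strongly anti-correlated with the category also receives a high χ² score; that is standard behavior of the statistic.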
Step 3: weight computation
The present invention computes weights with the inverse document frequency (IDF). The IDF value of a particular word is a measure of the word's general importance: divide the total number of documents by the number of documents containing the word, then take the logarithm (log) of the quotient. The computing formula of the IDF value is:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
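The IDF computation can be sketched directly from the formula above (names are illustrative):

```python
import math

def idf_weights(docs):
    """IDF = log(D_all / D_t): D_all is the total number of documents,
    D_t the number of documents containing the word."""
    d_all = len(docs)
    df = {}
    for doc in docs:
        for word in set(doc):  # count each word at most once per document
            df[word] = df.get(word, 0) + 1
    return {w: math.log(d_all / d_t) for w, d_t in df.items()}
```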
Step 4: text representation
To make texts easy for a computer to process, the text is represented with the vector space model in a form convenient for computation. In the text vector space, each keyword is one dimension of the vector space, and the value in that dimension is the keyword's weight; the weight represents the importance of the keyword.
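A sketch of this vector-space representation, assuming a fixed, ordered vocabulary (the reduced feature-item set from step 2) and the IDF weights from step 3; the names are illustrative:

```python
def text_vector(tokens, vocabulary, idf):
    """Dimension j of the vector corresponds to vocabulary[j]; it carries
    that keyword's weight if the text contains the keyword, else 0."""
    present = set(tokens)
    return [idf.get(w, 0.0) if w in present else 0.0 for w in vocabulary]
```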
Step 5: building the classifier
The present invention uses the improved support vector machine method, the weighted support vector machine, as the method for building the classifier, in order to solve the classification problem under class imbalance. Besides great disparity in the sample sizes of the categories, differences in the importance of the categories also cause sample imbalance. For example, when classifying the texts of a company, the importance of a "finance file" is obviously higher than that of a "sports-meeting file". While ensuring classification precision, misjudgment of important categories should be avoided as far as possible.
The steps are as follows:
1. The weighted support vector machine assigns category weights to the training samples to reflect the importance of the different categories. Increasing the weight of an important document category effectively reduces the number of misclassified samples in that category.
2. Furthermore, considering that the importance of individual texts also differs, i.e. their contributions to classification differ, per-sample weights are applied to individual texts, which raises the probability that each text is classified correctly and reduces the probability that an important text is misclassified, thereby improving classification precision.
The specific algorithm and derivation are as follows:
Let the training sample set be expressed as {(x_i, y_i) | i = 1, 2, ..., m}, x_i ∈ R^n, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x_i represents the vector of the i-th text and y_i is its category label (the number of categories in this embodiment is 10); for example, y_i = 1 means the i-th text belongs to the 2nd category. The text classification model based on the weighted support vector machine is expressed as follows:

min_{w, b, ζ} (1/2)‖w‖² + C Σ_{i=1}^{l} σ_{y_i} S_i ζ_i
s.t. y_i(w·φ(x_i) + b) ≥ 1 − ζ_i, ζ_i ≥ 0, i = 1, 2, ..., l

where ζ_i ≥ 0, i = 1, 2, ..., l, l is the number of samples and φ(x_i) is the kernel mapping. S_i > 0 is the importance weight of sample x_i: 0 < S_i < 1 means x_i is unimportant, S_i = 1 means x_i is of ordinary importance, and S_i > 1 means x_i is very important. The category weight of a sample is σ ≥ 1, and samples belonging to the same category share the same category weight. Compared with the standard support vector machine, the most prominent advantage of the weighted support vector machine is that it refines the penalty for misclassified samples: the slack variable of each sample is multiplied by the sample's importance weight and its category weight.
The Lagrangian function is constructed as follows:

L(w, b, ζ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^{l} σ_{y_i} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w·φ(x_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i

where α_i and β_i are the Lagrange multipliers, i = 1, 2, ..., l.
Finally the optimal classifier is obtained:

f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )

where K(x_i, x) = exp(−‖x − x_i‖² / (2g²)) is the radial basis kernel function.
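To make the weighting concrete, the sketch below trains a linear weighted SVM on binary labels (±1) by subgradient descent on the objective (1/2)‖w‖² + C·Σ_i σ_{y_i}·S_i·ζ_i. It is a simplified stand-in for the patent's formulation (which uses an RBF kernel and ten categories, and solves the dual), and all names are illustrative:

```python
def train_weighted_svm(X, y, sample_w, class_w, C=1.0, epochs=500, lr=0.01):
    """Subgradient descent on
        (1/2)*||w||^2 + C * sum_i class_w[y_i] * sample_w[i] * hinge_i,
    with labels y_i in {-1, +1}; class_w plays the role of the category
    weight sigma, and sample_w that of the importance weight S_i."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi, si in zip(X, y, sample_w):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # margin violated: weighted hinge term is active
                scale = C * class_w[yi] * si
                w = [wj - lr * (wj - scale * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * scale * yi
            else:           # only the regularizer (1/2)||w||^2 contributes
                w = [wj * (1 - lr) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Raising `class_w` for an important category (e.g. finance files) penalizes its misclassified samples more heavily, mirroring the σ ≥ 1 category weight in the model above.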
(2) Classification stage
For a text to be classified, the text is first processed with steps 1-4 of the training stage, yielding a corresponding text vector X. X is then fed into the classifier f(·) built in step 5, which gives the classification result f(X) corresponding to X, i.e. the category of the text.
The above embodiment only illustrates the technical idea of the invention and does not thereby limit the protection scope of the invention; any change made on the basis of this technical scheme in accordance with the technical idea proposed by the present invention falls within the protection scope of the present invention.
Claims (6)
1. A Chinese text classification method based on an improved SVM, characterized by comprising the following steps:
Step 1: preprocess the Chinese text to obtain a feature-item set;
Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set;
Step 3: compute weights for the reduced feature-item set;
Step 4: build text vectors, taking each keyword in a text as one dimension of the vector space, with the value in that dimension being the keyword's weight;
Step 5: build a classifier using a weighted support vector machine;
Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result.
2. The Chinese text classification method based on an improved SVM as claimed in claim 1, characterized in that: in step 1, preprocessing the Chinese text includes two processes, Chinese word segmentation and stop-word removal.
3. The Chinese text classification method based on an improved SVM as claimed in claim 1, characterized in that: the particular content of step 2 is: construct an evaluation function to assess all feature items in the feature-item set, sort them by their evaluation values in descending order, and select the top feature items according to a preset threshold or a preset number of feature items, obtaining the reduced feature-item set.
4. The Chinese text classification method based on an improved SVM as claimed in claim 3, characterized in that: the evaluation function uses the chi-square test, assuming the relation between feature item t and category C_i obeys the χ² distribution with one degree of freedom; its computing formula is as follows:

χ²(t, C_i) = N(AD − BC)² / ((A + B)(C + D)(A + C)(B + D))

where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number of texts that contain t but do not belong to C_i, C is the number of texts that do not contain t but belong to C_i, and D is the number of texts that neither contain t nor belong to C_i;
then, the χ² statistics of the feature items t are arranged in descending order, and the top several are taken as the reduced feature-item set.
5. The Chinese text classification method based on an improved SVM as claimed in claim 1, characterized in that: in step 3, the weight is computed with the inverse document frequency; the computing formula of the weight IDF is:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
6. The Chinese text classification method based on an improved SVM as claimed in claim 1, characterized in that: the detailed content of step 5 is:
let the training sample set be expressed as {(x_i, y_i) | i = 1, 2, ..., m}, x_i ∈ R^n, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x_i represents the vector of the i-th text and y_i is its category label; the text classification model based on the weighted support vector machine is expressed as follows:

min_{w, b, ζ} (1/2)‖w‖² + C Σ_{i=1}^{l} σ_{y_i} S_i ζ_i
s.t. y_i(w·φ(x_i) + b) ≥ 1 − ζ_i, ζ_i ≥ 0, i = 1, 2, ..., l

where ζ_i ≥ 0, i = 1, 2, ..., l, l is the number of samples and φ(x_i) is the kernel mapping; S_i > 0 is the importance weight of sample x_i: 0 < S_i < 1 means x_i is unimportant, S_i = 1 means x_i is of ordinary importance, and S_i > 1 means x_i is very important; the category weight of a sample is σ ≥ 1, and samples belonging to the same category share the same category weight;
the Lagrangian function is constructed as follows:

L(w, b, ζ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^{l} σ_{y_i} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w·φ(x_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i

where α_i and β_i are the Lagrange multipliers, i = 1, 2, ..., l;
finally the optimal classifier is obtained:

f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )

where K(x_i, x) = exp(−‖x − x_i‖² / (2g²)) is the radial basis kernel function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710026144.8A CN106844596A (en) | 2017-01-13 | 2017-01-13 | A Chinese text classification method based on an improved SVM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106844596A true CN106844596A (en) | 2017-06-13 |
Family
ID=59124204
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409537A (en) * | 2018-09-29 | 2019-03-01 | 深圳市元征科技股份有限公司 | A kind of Maintenance Cases classification method and device |
CN109815334A (en) * | 2019-01-25 | 2019-05-28 | 武汉斗鱼鱼乐网络科技有限公司 | A kind of barrage file classification method, storage medium, equipment and system |
CN109947941A (en) * | 2019-03-05 | 2019-06-28 | 永大电梯设备(中国)有限公司 | A kind of method and system based on elevator customer service text classification |
CN110377734A (en) * | 2019-07-01 | 2019-10-25 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on support vector machines |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101127086A (en) * | 2007-09-12 | 2008-02-20 | 哈尔滨工程大学 | High spectrum image repeated selection weighing classification method |
Non-Patent Citations (3)
Title |
---|
ZHUANG, DONG et al.: "Efficient Text Classification by Weighted Proximal SVM", Fifth IEEE International Conference on Data Mining |
JIANG, Jiaqi: "Class- and Sample-Weighted Support Vector Machines and Their Application in Intrusion Detection", China Masters' Theses Full-text Database, Information Science and Technology series |
XIONG, Zhongyang et al.: "Research on a feature selection method for text classification based on the χ² statistic", Journal of Computer Applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20170613 |