CN101441663A - Chinese text classification characteristic dictionary generating method based on LZW compression algorithm - Google Patents

Chinese text classification characteristic dictionary generating method based on LZW compression algorithm Download PDF

Info

Publication number
CN101441663A
Authority
CN
China
Prior art keywords
str
dictionary
value
document frequency
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102325572A
Other languages
Chinese (zh)
Other versions
CN101441663B (en)
Inventor
郑庆华
刘均
吴朝晖
蒋路
常晓
林鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN2008102325572A priority Critical patent/CN101441663B/en
Publication of CN101441663A publication Critical patent/CN101441663A/en
Application granted granted Critical
Publication of CN101441663B publication Critical patent/CN101441663B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the field of text mining and knowledge acquisition in computer applications, and in particular to a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, which comprises the following steps: first, the text to be classified is assumed to have r categories, each category corresponding to one sample set, and a string table str_table_i is initialized for the i-th sample set, where i = 1, ..., r; then each document of the i-th sample set is fed to the LZW compression algorithm LZWencode(infile, str_table) to produce a corresponding compressed coding string, whose strings serve as candidate feature words that update the string table str_table_i; finally, the strings are passed through several rounds of filtering to form a feature dictionary for the r text categories.

Description

A method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm
Technical field
The present invention relates to the fields of text mining and knowledge acquisition in computer applications, and in particular to a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm.
Background technology
A feature dictionary is the set of all feature words used to represent texts in text classification. At present, Chinese text classification feature dictionaries are mainly generated by performing feature selection after Chinese word segmentation, so a Chinese word segmentation tool is usually indispensable, and the performance of the segmentation tool has a non-negligible influence on the final classification result. Feature selection usually adopts feature filtering, feature reconstruction, or latent semantic indexing. Feature filtering methods mainly include filters based on document frequency (DF), mutual information (MI), information gain (IG), and the χ² statistic: for every word obtained after segmentation, one of these measures is computed over the whole document collection and a threshold is set to filter the words. Feature reconstruction methods mainly include clustering and latent semantic indexing: clustering groups the words that make the same or similar contribution to a category into one class and replaces all such words in the feature space by the class center; latent semantic indexing uses the singular value decomposition of a matrix to reduce the dimensionality of the feature space.
The main Chinese word segmentation tools currently available include ICTCLAS of the Institute of Computing Technology, Chinese Academy of Sciences, the HyLanda intelligent word segmentation system, the statistical word segmentation system of Harbin Institute of Technology, the SEGTAG system of Tsinghua University, and the Peking University computational-linguistics word segmentation system. An excellent segmentation tool must not only achieve high segmentation and part-of-speech tagging accuracy, but also provide good ambiguity resolution and unknown-word recognition. For the text classification problem, however, only the segmentation function is needed, namely splitting the text into independent word items. When a segmentation tool is used, the feature item of text classification is therefore defined as the smallest semantic unit, the word; the candidate feature set then consists of all words appearing in the document collection after segmentation, and this candidate vocabulary is very large, which lowers the efficiency of feature extraction.
Summary of the invention
The object of the present invention is to provide a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, which can build the text classification feature dictionary without any third-party Chinese word segmentation tool and thereby improves the efficiency of feature extraction.
To achieve the above object, the present invention is realized by the following technical solution: a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, characterized in that it comprises the following steps:
Step 1: assume that the text to be classified has r categories, each category corresponding to one sample set; initialize a string table str_table_i for the i-th class sample set, where i = 1, ..., r; each entry of the string table str_table_i is a pair (str, TF) recording a character string str and the frequency TF with which it occurs in the i-th class sample set;
Step 2: feed each document of the i-th class sample set to the LZW compression algorithm LZWencode(infile, str_table) to produce the corresponding compressed coding string, whose strings serve as candidate feature words for updating the string table str_table_i; that is, if a character string str already exists in str_table_i, its frequency TF is incremented by 1; otherwise a new entry for the character string str is added and its frequency TF is set to 1;
Step 3: sort the entries of the string table str_table_i by their frequency TF, set a frequency threshold minTF_i, and delete every entry whose frequency TF is less than minTF_i;
Step 4: for every character string str in the string table str_table_i, count the document frequency DF with which it occurs in the i-th class sample set, i.e. the number of documents of the i-th class sample set in which the character string str appears; set up the class-i dictionary dic_i together with its minimum document frequency threshold minDF_i and maximum document frequency threshold maxDF_i, and add every character string str satisfying minDF_i ≤ DF ≤ maxDF_i, together with its document frequency DF, to the class-i dictionary dic_i;
Step 5: merge the class dictionaries dic_i into a total dictionary D, in which the document frequency DF of each character string str over the whole sample collection equals the sum of its DF values in the individual categories; sort the entries of the total dictionary D by document frequency DF, set a minimum document frequency threshold minDF and a maximum document frequency threshold maxDF for D, and delete every entry of D whose document frequency DF is less than minDF or greater than maxDF;
Step 6: for every character string str in the total dictionary D, compute its information gain IG over the whole sample collection,
$$ IG(W) = P(W)\sum_i P(C_i \mid W)\log\frac{P(C_i \mid W)}{P(C_i)} + P(\overline{W})\sum_i P(C_i \mid \overline{W})\log\frac{P(C_i \mid \overline{W})}{P(C_i)} , $$
where P(W) is the probability that the word W occurs, the word W being the character string str; P(C_i) is the probability of occurrence of the i-th category; and P(C_i | W) is the conditional probability that a document belongs to the i-th category given that the word W occurs; then sort the character strings str in the total dictionary D in descending order of their information gain IG, set the capacity M of the total dictionary D, and keep only the first M entries; the total dictionary D formed at this point is the feature dictionary for the r-category text classification.
Further features of the present invention are as follows:
The frequency threshold minTF_i is the 5th to 10th smallest frequency TF value in the string table str_table_i;
The minimum document frequency threshold minDF_i of the class-i dictionary dic_i is the 5th to 10th smallest document frequency DF value of the strings str in the string table str_table_i;
The maximum document frequency threshold maxDF_i of the class-i dictionary dic_i is the 5th to 10th largest document frequency DF value in the string table str_table_i;
The minimum document frequency threshold minDF of the total dictionary D is the 5th to 10th smallest document frequency DF value in the total dictionary D;
The maximum document frequency threshold maxDF of the total dictionary D is the 5th to 10th largest DF value in the total dictionary D.
The method proposed by the invention for generating a Chinese text classification feature dictionary based on the LZW compression algorithm can be applied effectively to building feature dictionaries for Chinese text classification problems. Unlike feature dictionary generation methods that rely on a segmentation tool, this method does not first obtain all independent words and only then compute their frequencies; instead, it extracts feature strings directly from the text and counts their frequencies at the same time. Feature filtering is performed on a string table that has already undergone preliminary screening, rather than on all words of the sample collection as in conventional methods, which reduces the amount of computation and improves the efficiency of feature extraction.
Embodiment
The content of the present invention is described in further detail below.
The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm comprises the following steps:
Step 1: assume that the text to be classified has r categories, each category corresponding to one sample set; initialize a string table str_table_i for the i-th class sample set, where i = 1, ..., r; each entry of the string table str_table_i is a pair (str, TF) recording a character string str and the frequency TF with which it occurs in the i-th class sample set.
Step 2: feed each document of the i-th class sample set to the LZW compression algorithm LZWencode(infile, str_table) to produce the corresponding compressed coding string, whose strings serve as candidate feature words for updating the string table str_table_i; that is, if a character string str already exists in str_table_i, its frequency TF is incremented by 1; otherwise a new entry for the character string str is added and its frequency TF is set to 1.
The LZW compression algorithm LZWencode is described as follows:
LZWencode(infile, str_table)
Step1: wbuf = convert_to_widestring(infile);
Step2: it = the first character of wbuf;
       index = 0, old_index = 0;
Step3: while (it is not the last character of wbuf)
           wstr1 = wstr;
           wstr1 = wstr1 + it;
           old_index = index;
           if (wstr1 is present in str_table)
               wstr = wstr1;
               index = the position of wstr1 in str_table;
           else
               if (str_table is empty) add (wstr1, 1) to str_table;
               else
                   add 1 to the TF value of the old_index-th entry of str_table;
                   if (wstr1 is less than the str of the index-th entry of str_table)
                       insert (wstr1, 1) before the index-th entry of str_table;
                   else
                       if (index does not point to the tail of str_table)
                           insert (wstr1, 1) before the (index+1)-th entry of str_table;
                       else
                           insert (wstr1, 1) at the tail of str_table;
               set wstr to empty;
               wstr1 = wstr1 + it;
           it = the next character of wbuf;
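For illustration only, the string-table update of Step 2 and the pseudocode above can be sketched in Python as follows. This is not the patented implementation: the names (lzw_update, str_table_i) are chosen here for clarity, and a dictionary mapping strings to their TF counts replaces the sorted (str, TF) table of the pseudocode, since the sorted insertion only serves lookup.

# Illustrative sketch, not the patented code: LZW-style string-table update
# for one document of class i. str_table maps each candidate string to its TF.
def lzw_update(text, str_table):
    wstr = ""                          # longest prefix currently matched in the table
    for ch in text:
        wstr1 = wstr + ch              # try to extend the current match by one character
        if wstr1 in str_table:
            wstr = wstr1               # still known: keep extending
        else:
            if wstr:                   # the longest known prefix occurred again: TF += 1
                str_table[wstr] = str_table.get(wstr, 0) + 1
            str_table[wstr1] = 1       # add the new candidate string with TF = 1
            wstr = ""                  # restart matching after the miss, as in the pseudocode

# Step 1 and Step 2 together for one class sample set (documents given as plain strings):
str_table_i = {}
for doc in ["...text of document 1...", "...text of document 2..."]:
    lzw_update(doc, str_table_i)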
Step 3: sort the entries of the string table str_table_i by their frequency TF, set a frequency threshold minTF_i, and delete every entry whose frequency TF is less than minTF_i; the frequency threshold minTF_i is the 5th to 10th smallest frequency TF value in the string table str_table_i.
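Continuing the sketch above, the Step 3 filtering can be illustrated as follows, under the assumption that minTF_i is taken as the k-th smallest TF value with k between 5 and 10 (duplicates counted); the function name filter_by_tf is illustrative:

def filter_by_tf(str_table, k=5):
    # minTF_i = k-th smallest TF value in the table (k chosen between 5 and 10)
    tf_sorted = sorted(str_table.values())
    min_tf = tf_sorted[k - 1] if len(tf_sorted) >= k else 0
    # delete entries whose TF is below the threshold
    return {s: tf for s, tf in str_table.items() if tf >= min_tf}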
Step 4: for every character string str in the string table str_table_i, count the document frequency DF with which it occurs in the i-th class sample set, i.e. the number of documents of the i-th class sample set in which the character string str appears; set up the class-i dictionary dic_i together with its minimum document frequency threshold minDF_i and maximum document frequency threshold maxDF_i, and add every character string str satisfying minDF_i ≤ DF ≤ maxDF_i, together with its document frequency DF, to the class-i dictionary dic_i.
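Step 4 can be sketched in the same spirit; docs_i stands for the documents of the i-th class sample set as plain strings, and the substring test "s in doc" is an illustrative stand-in for testing whether a candidate string occurs in a document:

def build_class_dict(str_table, docs_i, min_df, max_df):
    # dic_i keeps (str, DF) for strings whose document frequency lies in [min_df, max_df]
    dic_i = {}
    for s in str_table:
        df = sum(1 for doc in docs_i if s in doc)   # number of class-i documents containing s
        if min_df <= df <= max_df:
            dic_i[s] = df
    return dic_i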
Step 5: merge the class dictionaries dic_i into a total dictionary D, in which the document frequency DF of each character string str over the whole sample collection equals the sum of its DF values in the individual categories; sort the entries of the total dictionary D by document frequency DF, set a minimum document frequency threshold minDF and a maximum document frequency threshold maxDF for D, and delete every entry of D whose document frequency DF is less than minDF or greater than maxDF. The minimum document frequency threshold minDF_i of the class-i dictionary dic_i is the 5th to 10th smallest document frequency DF value of the strings str in the string table str_table_i; the maximum document frequency threshold maxDF_i of the class-i dictionary dic_i is the 5th to 10th largest document frequency DF value in str_table_i; the minimum document frequency threshold minDF of the total dictionary D is the 5th to 10th smallest document frequency DF value in D; and the maximum document frequency threshold maxDF of the total dictionary D is the 5th to 10th largest DF value in D.
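Step 5 merges the class dictionaries and filters again by overall document frequency; the following sketch assumes the thresholds minDF and maxDF have already been chosen by the 5th-to-10th rule stated above, with illustrative names:

def merge_class_dicts(class_dicts, min_df, max_df):
    # DF over the whole sample collection = sum of the per-class DF values
    D = {}
    for dic_i in class_dicts:
        for s, df in dic_i.items():
            D[s] = D.get(s, 0) + df
    # drop entries outside [min_df, max_df]
    return {s: df for s, df in D.items() if min_df <= df <= max_df}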
Step 6: for every character string str in the total dictionary D, compute its information gain IG over the whole sample collection,
$$ IG(W) = P(W)\sum_i P(C_i \mid W)\log\frac{P(C_i \mid W)}{P(C_i)} + P(\overline{W})\sum_i P(C_i \mid \overline{W})\log\frac{P(C_i \mid \overline{W})}{P(C_i)} , $$
where P(W) is the probability that the word W occurs, the word W being the character string str; P(C_i) is the probability of occurrence of the i-th category; and P(C_i | W) is the conditional probability that a document belongs to the i-th category given that the word W occurs; then sort the character strings str in the total dictionary D in descending order of their information gain IG, set the capacity M of the total dictionary D, and keep only the first M entries; the total dictionary D formed at this point is the feature dictionary for the r-category text classification.
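The information gain of Step 6 can be evaluated directly from per-class document counts. The sketch below follows the formula above, treats 0·log 0 as 0, and uses illustrative names: class_docs is a list of per-class document lists, D and M are the total dictionary and its capacity from the previous steps.

import math

def information_gain(W, class_docs):
    # class_docs: one list of documents (plain strings) per class
    n_total = sum(len(docs) for docs in class_docs)
    n_w = sum(sum(1 for d in docs if W in d) for docs in class_docs)   # documents containing W
    p_w = n_w / n_total
    ig = 0.0
    for docs in class_docs:
        p_c = len(docs) / n_total                        # P(C_i)
        k = sum(1 for d in docs if W in d)
        if n_w and k:
            p_c_w = k / n_w                              # P(C_i | W)
            ig += p_w * p_c_w * math.log(p_c_w / p_c)
        if n_total - n_w and len(docs) - k:
            p_c_nw = (len(docs) - k) / (n_total - n_w)   # P(C_i | not W)
            ig += (1 - p_w) * p_c_nw * math.log(p_c_nw / p_c)
    return ig

# Usage (D, class_docs and M as built in the previous steps):
# feature_dictionary = dict(sorted(D.items(),
#                                  key=lambda kv: information_gain(kv[0], class_docs),
#                                  reverse=True)[:M])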
The present invention is further illustrated below with a concrete experimental example. Six computer courses were selected for the classification test: system architecture, database, distributed operating system, information security, computer network, and operating system. The numbers of documents for the respective courses are 275 for system architecture, 247 for database, 281 for distributed operating system, 270 for information security, 278 for computer network, and 237 for operating system, 1588 documents in total.
Of these, 300 documents were randomly selected as the test set and the remaining 1288 were used as the training set; the present method and the segmentation-tool-based method were compared in classification experiments using the same feature representation and the same classification algorithm. The final feature dictionary size is 1038, the LTC weighting scheme is used for feature representation, and the support vector machine (SVM) algorithm is used as the classifier. The experimental results obtained with the ICTCLAS segmentation tool of the Chinese Academy of Sciences are as follows:
Course | Judged system architecture | Judged database | Judged distributed | Judged information security | Judged computer network | Judged operating system | Precision (%) | Recall (%) | F value (%)
System architecture | 49 | 0 | 2 | 0 | 0 | 0 | 96.1 | 96.1 | 96.1
Database | 0 | 46 | 0 | 0 | 0 | 1 | 97.9 | 97.9 | 97.9
Distributed | 0 | 0 | 52 | 0 | 1 | 0 | 96.3 | 98.1 | 97.2
Information security | 1 | 0 | 0 | 48 | 1 | 1 | 94.1 | 94.1 | 94.1
Computer network | 0 | 0 | 0 | 3 | 42 | 0 | 95.5 | 93.3 | 94.4
Operating system | 1 | 1 | 0 | 0 | 0 | 51 | 96.2 | 96.2 | 96.2
From this we obtain: classification accuracy = (49+46+52+48+42+51)/300 = 96%.
The classification test results obtained with the dictionary generated by the present method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm are as follows:
Course | Judged system architecture | Judged database | Judged distributed | Judged information security | Judged computer network | Judged operating system | Precision (%) | Recall (%) | F value (%)
System architecture | 50 | 0 | 2 | 0 | 0 | 0 | 96.2 | 98.0 | 97.1
Database | 0 | 47 | 0 | 0 | 0 | 1 | 100 | 100 | 100
Distributed | 0 | 0 | 52 | 0 | 1 | 0 | 96.3 | 98.1 | 97.2
Information security | 1 | 0 | 0 | 47 | 3 | 1 | 90.4 | 90.4 | 90.4
Computer network | 0 | 0 | 1 | 4 | 39 | 0 | 88.6 | 88.6 | 88.6
Operating system | 1 | 0 | 1 | 1 | 1 | 49 | 96.1 | 92.5 | 94.2
From this we obtain: classification accuracy = (50+47+52+47+39+49)/300 = 94.7%.
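For reference, the accuracy figure quoted after each table is the sum of the diagonal of the confusion matrix divided by the number of test documents, while per-class precision, recall and F value follow the usual column-sum and row-sum definitions. A small illustrative helper (names assumed, not part of the patent):

def report(confusion):
    # confusion[i][j] = number of documents of true class i judged as class j
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    for i in range(n):
        tp = confusion[i][i]
        recall = tp / sum(confusion[i])                          # over all documents of class i
        precision = tp / sum(confusion[r][i] for r in range(n))  # over all documents judged class i
        f_value = 2 * precision * recall / (precision + recall)
        print(f"class {i}: precision {precision:.1%}, recall {recall:.1%}, F {f_value:.1%}")
    print(f"accuracy = {correct}/{total} = {correct/total:.1%}")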
It can be seen that the method achieves a high classification accuracy, comparable to that obtained with the Chinese Academy of Sciences segmentation tool. Therefore, when no Chinese word segmentation tool is available, the method can be applied well to building feature dictionaries for text classification.

Claims (6)

1. A method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, characterized in that it comprises the following steps:
Step 1: assume that the text to be classified has r categories, each category corresponding to one sample set; initialize a string table str_table_i for the i-th class sample set, where i = 1, ..., r; each entry of the string table str_table_i is a pair (str, TF) recording a character string str and the frequency TF with which it occurs in the i-th class sample set;
Step 2: feed each document of the i-th class sample set to the LZW compression algorithm LZWencode(infile, str_table) to produce the corresponding compressed coding string, whose strings serve as candidate feature words for updating the string table str_table_i; that is, if a character string str already exists in str_table_i, its frequency TF is incremented by 1; otherwise a new entry for the character string str is added and its frequency TF is set to 1;
Step 3: sort the entries of the string table str_table_i by their frequency TF, set a frequency threshold minTF_i, and delete every entry whose frequency TF is less than minTF_i;
Step 4: for every character string str in the string table str_table_i, count the document frequency DF with which it occurs in the i-th class sample set, i.e. the number of documents of the i-th class sample set in which the character string str appears; set up the class-i dictionary dic_i together with its minimum document frequency threshold minDF_i and maximum document frequency threshold maxDF_i, and add every character string str satisfying minDF_i ≤ DF ≤ maxDF_i, together with its document frequency DF, to the class-i dictionary dic_i;
Step 5: merge the class dictionaries dic_i into a total dictionary D, in which the document frequency DF of each character string str over the whole sample collection equals the sum of its DF values in the individual categories; sort the entries of the total dictionary D by document frequency DF, set a minimum document frequency threshold minDF and a maximum document frequency threshold maxDF for D, and delete every entry of D whose document frequency DF is less than minDF or greater than maxDF;
Step 6: for every character string str in the total dictionary D, compute its information gain IG over the whole sample collection,
$$ IG(W) = P(W)\sum_i P(C_i \mid W)\log\frac{P(C_i \mid W)}{P(C_i)} + P(\overline{W})\sum_i P(C_i \mid \overline{W})\log\frac{P(C_i \mid \overline{W})}{P(C_i)} , $$
where P(W) is the probability that the word W occurs, the word W being the character string str; P(C_i) is the probability of occurrence of the i-th category; and P(C_i | W) is the conditional probability that a document belongs to the i-th category given that the word W occurs; then sort the character strings str in the total dictionary D in descending order of their information gain IG, set the capacity M of the total dictionary D, and keep only the first M entries; the total dictionary D formed at this point is the feature dictionary for the r-category text classification.
2. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that said frequency threshold minTF_i is the 5th to 10th smallest frequency TF value in the string table str_table_i.
3. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the minimum document frequency threshold minDF_i of said class-i dictionary dic_i is the 5th to 10th smallest document frequency DF value of the strings str in the string table str_table_i.
4. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the maximum document frequency threshold maxDF_i of said class-i dictionary dic_i is the 5th to 10th largest document frequency DF value in the string table str_table_i.
5. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the minimum document frequency threshold minDF of said total dictionary D is the 5th to 10th smallest document frequency DF value in the total dictionary D.
6. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the maximum document frequency threshold maxDF of said total dictionary D is the 5th to 10th largest DF value in the total dictionary D.
CN2008102325572A 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm Expired - Fee Related CN101441663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102325572A CN101441663B (en) 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102325572A CN101441663B (en) 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm

Publications (2)

Publication Number Publication Date
CN101441663A true CN101441663A (en) 2009-05-27
CN101441663B CN101441663B (en) 2010-06-23

Family

ID=40726097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102325572A Expired - Fee Related CN101441663B (en) 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm

Country Status (1)

Country Link
CN (1) CN101441663B (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377224A (en) * 2012-04-24 2013-10-30 北京百度网讯科技有限公司 Method and device for recognizing problem types and method and device for establishing recognition models
CN103377224B (en) * 2012-04-24 2016-08-17 北京百度网讯科技有限公司 Identify the method and device of problem types, set up the method and device identifying model
CN103219999A (en) * 2013-04-16 2013-07-24 大连理工大学 Reversible information hiding method based on Lempel-Ziv-Welch (LZW) compression algorithm
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
CN106294736A (en) * 2016-08-10 2017-01-04 成都轻车快马网络科技有限公司 Text feature based on key word frequency
CN106991127A (en) * 2017-03-06 2017-07-28 西安交通大学 A kind of knowledget opic short text hierarchy classification method extended based on topological characteristic
CN106991127B (en) * 2017-03-06 2020-01-10 西安交通大学 Knowledge subject short text hierarchical classification method based on topological feature expansion
CN108804404A (en) * 2018-05-29 2018-11-13 周宇 Character text processing method and processing device
CN108804404B (en) * 2018-05-29 2022-04-15 周宇 Character text processing method and device
CN110688835A (en) * 2019-09-03 2020-01-14 重庆邮电大学 Word feature value-based law-specific field word discovery method and device
CN117691746A (en) * 2023-12-11 2024-03-12 国网河北省电力有限公司正定县供电分公司 Power line monitoring system and monitoring method

Also Published As

Publication number Publication date
CN101441663B (en) 2010-06-23

Similar Documents

Publication Publication Date Title
CN101441663B (en) Chinese text classification characteristic dictionary generating method based on LZW compression algorithm
CN106383877B (en) Social media online short text clustering and topic detection method
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN102346829B (en) Virus detection method based on ensemble classification
CN109299480A (en) Terminology Translation method and device based on context of co-text
CN102955857B (en) Class center compression transformation-based text clustering method in search engine
CN107463548B (en) Phrase mining method and device
CN101021838A (en) Text handling method and system
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN104778209A (en) Opinion mining method for ten-million-scale news comments
CN102193936A (en) Data classification method and device
CN105975491A (en) Enterprise news analysis method and system
CN112527948B (en) Sentence-level index-based real-time data deduplication method and system
CN105022740A (en) Processing method and device of unstructured data
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN106611041A (en) New text similarity solution method
CN108153899B (en) Intelligent text classification method
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
CN109902173B (en) Chinese text classification method
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN109543023B (en) Document classification method and system based on trie and LCS algorithm
CN110110326A (en) A kind of text cutting method based on subject information
CN110874408A (en) Model training method, text recognition device and computing equipment
CN105447158A (en) Graph based automatic mining method for synonym set in patent search log

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100623

Termination date: 20141202

EXPY Termination of patent right or utility model