CN101441663A - Chinese text classification characteristic dictionary generating method based on LZW compression algorithm - Google Patents

Chinese text classification characteristic dictionary generating method based on LZW compression algorithm Download PDF

Info

Publication number
CN101441663A
Authority
CN
China
Prior art keywords
str
dictionary
value
document frequency
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102325572A
Other languages
Chinese (zh)
Other versions
CN101441663B (en)
Inventor
郑庆华
刘均
吴朝晖
蒋路
常晓
林鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN2008102325572A priority Critical patent/CN101441663B/en
Publication of CN101441663A publication Critical patent/CN101441663A/en
Application granted granted Critical
Publication of CN101441663B publication Critical patent/CN101441663B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the field of text mining and knowledge acquisition in computer applications, and in particular to a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, which comprises the following steps: first, the text to be classified is assumed to have r categories, each category corresponding to one sample set, and a string table str_table_i is initialized for the i-th sample set, where i = 1, ..., r; then each document of the i-th sample set is fed to the LZW compression algorithm LZWencode(infile, str_table) to produce a corresponding compressed coding string, whose strings serve as candidate feature words that update the string table str_table_i; finally, the strings are passed through several rounds of filtering to form a feature dictionary for the r text categories.

Description

A method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm
Technical field
The present invention relates to the fields of text mining and knowledge acquisition in computer applications, and in particular to a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm.
Background technology
A feature dictionary is the set of all feature words used to represent texts in text classification. At present, Chinese text classification feature dictionaries are mainly generated by performing feature selection after Chinese word segmentation, so a Chinese word segmentation tool is usually indispensable, and the performance of the segmentation tool has a non-negligible influence on the final classification result. Feature selection usually adopts feature filtering, feature reconstruction, or latent semantic indexing. Feature filtering methods mainly include filters based on document frequency (DF), mutual information (MI), information gain (IG), and the χ² statistic: for every word obtained after segmentation, one of these measures is computed over the whole document collection and a threshold is set to filter the words. Feature reconstruction methods mainly include clustering and latent semantic indexing: clustering groups the words that make the same or similar contribution to a category into one class and replaces all such words in the feature space by the class center; latent semantic indexing uses the singular value decomposition of a matrix to reduce the dimensionality of the feature space.
The main Chinese word segmentation tools currently available include ICTCLAS of the Institute of Computing Technology, Chinese Academy of Sciences, the HyLanda intelligent word segmentation system, the statistical word segmentation system of Harbin Institute of Technology, the SEGTAG system of Tsinghua University, and the Peking University computational-linguistics word segmentation system. An excellent segmentation tool must not only achieve high segmentation and part-of-speech tagging accuracy, but also provide good ambiguity resolution and unknown-word recognition. For the text classification problem, however, only the segmentation function is needed, namely splitting the text into independent word items. When a segmentation tool is used, the feature item of text classification is therefore defined as the smallest semantic unit, the word; the candidate feature set then consists of all words appearing in the document collection after segmentation, and this candidate vocabulary is very large, which lowers the efficiency of feature extraction.
Summary of the invention
The object of the present invention is to provide a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, which can build the text classification feature dictionary without any third-party Chinese word segmentation tool and thereby improves the efficiency of feature extraction.
To achieve the above object, the present invention is realized by the following technical solution: a method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, characterized in that it comprises the following steps:
Step 1: assume that the text to be classified has r categories, each category corresponding to one sample set; initialize a string table str_table_i for the i-th class sample set, where i = 1, ..., r; each entry of the string table str_table_i is a pair (str, TF) recording a character string str and the frequency TF with which it occurs in the i-th class sample set;
Step 2: feed each document of the i-th class sample set to the LZW compression algorithm LZWencode(infile, str_table) to produce the corresponding compressed coding string, whose strings serve as candidate feature words for updating the string table str_table_i; that is, if a character string str already exists in str_table_i, its frequency TF is incremented by 1; otherwise a new entry for the character string str is added and its frequency TF is set to 1;
Step 3: sort the entries of the string table str_table_i by their frequency TF, set a frequency threshold minTF_i, and delete every entry whose frequency TF is less than minTF_i;
Step 4: for every character string str in the string table str_table_i, count the document frequency DF with which it occurs in the i-th class sample set, i.e. the number of documents of the i-th class sample set in which the character string str appears; set up the class-i dictionary dic_i together with its minimum document frequency threshold minDF_i and maximum document frequency threshold maxDF_i, and add every character string str satisfying minDF_i ≤ DF ≤ maxDF_i, together with its document frequency DF, to the class-i dictionary dic_i;
Step 5: merge the class dictionaries dic_i into a total dictionary D, in which the document frequency DF of each character string str over the whole sample collection equals the sum of its DF values in the individual categories; sort the entries of the total dictionary D by document frequency DF, set a minimum document frequency threshold minDF and a maximum document frequency threshold maxDF for D, and delete every entry of D whose document frequency DF is less than minDF or greater than maxDF;
Step 6: for every character string str in the total dictionary D, compute its information gain IG over the whole sample collection,
$$ IG(W) = P(W)\sum_i P(C_i \mid W)\log\frac{P(C_i \mid W)}{P(C_i)} + P(\overline{W})\sum_i P(C_i \mid \overline{W})\log\frac{P(C_i \mid \overline{W})}{P(C_i)} , $$
where P(W) is the probability that the word W occurs, the word W being the character string str; P(C_i) is the probability of occurrence of the i-th category; and P(C_i | W) is the conditional probability that a document belongs to the i-th category given that the word W occurs; then sort the character strings str in the total dictionary D in descending order of their information gain IG, set the capacity M of the total dictionary D, and keep only the first M entries; the total dictionary D formed at this point is the feature dictionary for the r-category text classification.
Further features of the present invention are as follows:
The frequency threshold minTF_i is the 5th to 10th smallest frequency TF value in the string table str_table_i;
The minimum document frequency threshold minDF_i of the class-i dictionary dic_i is the 5th to 10th smallest document frequency DF value of the strings str in the string table str_table_i;
The maximum document frequency threshold maxDF_i of the class-i dictionary dic_i is the 5th to 10th largest document frequency DF value in the string table str_table_i;
The minimum document frequency threshold minDF of the total dictionary D is the 5th to 10th smallest document frequency DF value in the total dictionary D;
The maximum document frequency threshold maxDF of the total dictionary D is the 5th to 10th largest DF value in the total dictionary D.
The method proposed by the invention for generating a Chinese text classification feature dictionary based on the LZW compression algorithm can be applied effectively to building feature dictionaries for Chinese text classification problems. Unlike feature dictionary generation methods that rely on a segmentation tool, this method does not first obtain all independent words and only then compute their frequencies; instead, it extracts feature strings directly from the text and counts their frequencies at the same time. Feature filtering is performed on a string table that has already undergone preliminary screening, rather than on all words of the sample collection as in conventional methods, which reduces the amount of computation and improves the efficiency of feature extraction.
Embodiment
The content of the present invention is described in further detail below.
The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm comprises the following steps:
Step 1: assume that the text to be classified has r categories, each category corresponding to one sample set; initialize a string table str_table_i for the i-th class sample set, where i = 1, ..., r; each entry of the string table str_table_i is a pair (str, TF) recording a character string str and the frequency TF with which it occurs in the i-th class sample set.
Step 2: feed each document of the i-th class sample set to the LZW compression algorithm LZWencode(infile, str_table) to produce the corresponding compressed coding string, whose strings serve as candidate feature words for updating the string table str_table_i; that is, if a character string str already exists in str_table_i, its frequency TF is incremented by 1; otherwise a new entry for the character string str is added and its frequency TF is set to 1.
The LZW compression algorithm LZWencode is described as follows:
LZWencode(infile, str_table)
Step1: wbuf = convert_to_widestring(infile);
Step2: it = the first character of wbuf;
       index = 0, old_index = 0;
Step3: while (it is not the last character of wbuf)
           wstr1 = wstr;
           wstr1 = wstr1 + it;
           old_index = index;
           if (wstr1 is present in str_table)
               wstr = wstr1;
               index = the position of wstr1 in str_table;
           else
               if (str_table is empty) add (wstr1, 1) to str_table;
               else
                   add 1 to the TF value of the old_index-th entry of str_table;
                   if (wstr1 is less than the str of the index-th entry of str_table)
                       insert (wstr1, 1) before the index-th entry of str_table;
                   else
                       if (index does not point to the tail of str_table)
                           insert (wstr1, 1) before the (index+1)-th entry of str_table;
                       else
                           insert (wstr1, 1) at the tail of str_table;
               set wstr to empty;
               wstr1 = wstr1 + it;
           it = the next character of wbuf;
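For illustration only, the string-table update of Step 2 and the pseudocode above can be sketched in Python as follows. This is not the patented implementation: the names (lzw_update, str_table_i) are chosen here for clarity, and a dictionary mapping strings to their TF counts replaces the sorted (str, TF) table of the pseudocode, since the sorted insertion only serves lookup.

# Illustrative sketch, not the patented code: LZW-style string-table update
# for one document of class i. str_table maps each candidate string to its TF.
def lzw_update(text, str_table):
    wstr = ""                          # longest prefix currently matched in the table
    for ch in text:
        wstr1 = wstr + ch              # try to extend the current match by one character
        if wstr1 in str_table:
            wstr = wstr1               # still known: keep extending
        else:
            if wstr:                   # the longest known prefix occurred again: TF += 1
                str_table[wstr] = str_table.get(wstr, 0) + 1
            str_table[wstr1] = 1       # add the new candidate string with TF = 1
            wstr = ""                  # restart matching after the miss, as in the pseudocode

# Step 1 and Step 2 together for one class sample set (documents given as plain strings):
str_table_i = {}
for doc in ["...text of document 1...", "...text of document 2..."]:
    lzw_update(doc, str_table_i)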
Step 3: sort the entries of the string table str_table_i by their frequency TF, set a frequency threshold minTF_i, and delete every entry whose frequency TF is less than minTF_i; the frequency threshold minTF_i is the 5th to 10th smallest frequency TF value in the string table str_table_i.
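Continuing the sketch above, the Step 3 filtering can be illustrated as follows, under the assumption that minTF_i is taken as the k-th smallest TF value with k between 5 and 10 (duplicates counted); the function name filter_by_tf is illustrative:

def filter_by_tf(str_table, k=5):
    # minTF_i = k-th smallest TF value in the table (k chosen between 5 and 10)
    tf_sorted = sorted(str_table.values())
    min_tf = tf_sorted[k - 1] if len(tf_sorted) >= k else 0
    # delete entries whose TF is below the threshold
    return {s: tf for s, tf in str_table.items() if tf >= min_tf}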
Step 4: for every character string str in the string table str_table_i, count the document frequency DF with which it occurs in the i-th class sample set, i.e. the number of documents of the i-th class sample set in which the character string str appears; set up the class-i dictionary dic_i together with its minimum document frequency threshold minDF_i and maximum document frequency threshold maxDF_i, and add every character string str satisfying minDF_i ≤ DF ≤ maxDF_i, together with its document frequency DF, to the class-i dictionary dic_i.
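Step 4 can be sketched in the same spirit; docs_i stands for the documents of the i-th class sample set as plain strings, and the substring test "s in doc" is an illustrative stand-in for testing whether a candidate string occurs in a document:

def build_class_dict(str_table, docs_i, min_df, max_df):
    # dic_i keeps (str, DF) for strings whose document frequency lies in [min_df, max_df]
    dic_i = {}
    for s in str_table:
        df = sum(1 for doc in docs_i if s in doc)   # number of class-i documents containing s
        if min_df <= df <= max_df:
            dic_i[s] = df
    return dic_i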
Step 5: merge the class dictionaries dic_i into a total dictionary D, in which the document frequency DF of each character string str over the whole sample collection equals the sum of its DF values in the individual categories; sort the entries of the total dictionary D by document frequency DF, set a minimum document frequency threshold minDF and a maximum document frequency threshold maxDF for D, and delete every entry of D whose document frequency DF is less than minDF or greater than maxDF. The minimum document frequency threshold minDF_i of the class-i dictionary dic_i is the 5th to 10th smallest document frequency DF value of the strings str in the string table str_table_i; the maximum document frequency threshold maxDF_i of the class-i dictionary dic_i is the 5th to 10th largest document frequency DF value in str_table_i; the minimum document frequency threshold minDF of the total dictionary D is the 5th to 10th smallest document frequency DF value in D; and the maximum document frequency threshold maxDF of the total dictionary D is the 5th to 10th largest DF value in D.
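Step 5 merges the class dictionaries and filters again by overall document frequency; the following sketch assumes the thresholds minDF and maxDF have already been chosen by the 5th-to-10th rule stated above, with illustrative names:

def merge_class_dicts(class_dicts, min_df, max_df):
    # DF over the whole sample collection = sum of the per-class DF values
    D = {}
    for dic_i in class_dicts:
        for s, df in dic_i.items():
            D[s] = D.get(s, 0) + df
    # drop entries outside [min_df, max_df]
    return {s: df for s, df in D.items() if min_df <= df <= max_df}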
Step 6: for every character string str in the total dictionary D, compute its information gain IG over the whole sample collection,
$$ IG(W) = P(W)\sum_i P(C_i \mid W)\log\frac{P(C_i \mid W)}{P(C_i)} + P(\overline{W})\sum_i P(C_i \mid \overline{W})\log\frac{P(C_i \mid \overline{W})}{P(C_i)} , $$
where P(W) is the probability that the word W occurs, the word W being the character string str; P(C_i) is the probability of occurrence of the i-th category; and P(C_i | W) is the conditional probability that a document belongs to the i-th category given that the word W occurs; then sort the character strings str in the total dictionary D in descending order of their information gain IG, set the capacity M of the total dictionary D, and keep only the first M entries; the total dictionary D formed at this point is the feature dictionary for the r-category text classification.
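The information gain of Step 6 can be evaluated directly from per-class document counts. The sketch below follows the formula above, treats 0·log 0 as 0, and uses illustrative names: class_docs is a list of per-class document lists, D and M are the total dictionary and its capacity from the previous steps.

import math

def information_gain(W, class_docs):
    # class_docs: one list of documents (plain strings) per class
    n_total = sum(len(docs) for docs in class_docs)
    n_w = sum(sum(1 for d in docs if W in d) for docs in class_docs)   # documents containing W
    p_w = n_w / n_total
    ig = 0.0
    for docs in class_docs:
        p_c = len(docs) / n_total                        # P(C_i)
        k = sum(1 for d in docs if W in d)
        if n_w and k:
            p_c_w = k / n_w                              # P(C_i | W)
            ig += p_w * p_c_w * math.log(p_c_w / p_c)
        if n_total - n_w and len(docs) - k:
            p_c_nw = (len(docs) - k) / (n_total - n_w)   # P(C_i | not W)
            ig += (1 - p_w) * p_c_nw * math.log(p_c_nw / p_c)
    return ig

# Usage (D, class_docs and M as built in the previous steps):
# feature_dictionary = dict(sorted(D.items(),
#                                  key=lambda kv: information_gain(kv[0], class_docs),
#                                  reverse=True)[:M])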
The present invention is further illustrated below with a concrete experimental example. Six computer courses were selected for the classification test: system architecture, database, distributed operating system, information security, computer network, and operating system. The numbers of documents for the respective courses are 275 for system architecture, 247 for database, 281 for distributed operating system, 270 for information security, 278 for computer network, and 237 for operating system, 1588 documents in total.
Of these, 300 documents were randomly selected as the test set and the remaining 1288 were used as the training set; the present method and the segmentation-tool-based method were compared in classification experiments using the same feature representation and the same classification algorithm. The final feature dictionary size is 1038, the LTC weighting scheme is used for feature representation, and the support vector machine (SVM) algorithm is used as the classifier. The experimental results obtained with the ICTCLAS segmentation tool of the Chinese Academy of Sciences are as follows:
Course | Judged system architecture | Judged database | Judged distributed | Judged information security | Judged computer network | Judged operating system | Precision (%) | Recall (%) | F value (%)
System architecture | 49 | 0 | 2 | 0 | 0 | 0 | 96.1 | 96.1 | 96.1
Database | 0 | 46 | 0 | 0 | 0 | 1 | 97.9 | 97.9 | 97.9
Distributed | 0 | 0 | 52 | 0 | 1 | 0 | 96.3 | 98.1 | 97.2
Information security | 1 | 0 | 0 | 48 | 1 | 1 | 94.1 | 94.1 | 94.1
Computer network | 0 | 0 | 0 | 3 | 42 | 0 | 95.5 | 93.3 | 94.4
Operating system | 1 | 1 | 0 | 0 | 0 | 51 | 96.2 | 96.2 | 96.2
From this we obtain: classification accuracy = (49+46+52+48+42+51)/300 = 96%.
The classification test results obtained with the dictionary generated by the present method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm are as follows:
Course | Judged system architecture | Judged database | Judged distributed | Judged information security | Judged computer network | Judged operating system | Precision (%) | Recall (%) | F value (%)
System architecture | 50 | 0 | 2 | 0 | 0 | 0 | 96.2 | 98.0 | 97.1
Database | 0 | 47 | 0 | 0 | 0 | 1 | 100 | 100 | 100
Distributed | 0 | 0 | 52 | 0 | 1 | 0 | 96.3 | 98.1 | 97.2
Information security | 1 | 0 | 0 | 47 | 3 | 1 | 90.4 | 90.4 | 90.4
Computer network | 0 | 0 | 1 | 4 | 39 | 0 | 88.6 | 88.6 | 88.6
Operating system | 1 | 0 | 1 | 1 | 1 | 49 | 96.1 | 92.5 | 94.2
From this we obtain: classification accuracy = (50+47+52+47+39+49)/300 = 94.7%.
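For reference, the accuracy figure quoted after each table is the sum of the diagonal of the confusion matrix divided by the number of test documents, while per-class precision, recall and F value follow the usual column-sum and row-sum definitions. A small illustrative helper (names assumed, not part of the patent):

def report(confusion):
    # confusion[i][j] = number of documents of true class i judged as class j
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    for i in range(n):
        tp = confusion[i][i]
        recall = tp / sum(confusion[i])                          # over all documents of class i
        precision = tp / sum(confusion[r][i] for r in range(n))  # over all documents judged class i
        f_value = 2 * precision * recall / (precision + recall)
        print(f"class {i}: precision {precision:.1%}, recall {recall:.1%}, F {f_value:.1%}")
    print(f"accuracy = {correct}/{total} = {correct/total:.1%}")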
It can be seen that the method achieves a high classification accuracy, comparable to that obtained with the Chinese Academy of Sciences segmentation tool. Therefore, when no Chinese word segmentation tool is available, the method can be applied well to building feature dictionaries for text classification.

Claims (6)

1. A method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm, characterized in that it comprises the following steps:
Step 1: assume that the text to be classified has r categories, each category corresponding to one sample set; initialize a string table str_table_i for the i-th class sample set, where i = 1, ..., r; each entry of the string table str_table_i is a pair (str, TF) recording a character string str and the frequency TF with which it occurs in the i-th class sample set;
Step 2: feed each document of the i-th class sample set to the LZW compression algorithm LZWencode(infile, str_table) to produce the corresponding compressed coding string, whose strings serve as candidate feature words for updating the string table str_table_i; that is, if a character string str already exists in str_table_i, its frequency TF is incremented by 1; otherwise a new entry for the character string str is added and its frequency TF is set to 1;
Step 3: sort the entries of the string table str_table_i by their frequency TF, set a frequency threshold minTF_i, and delete every entry whose frequency TF is less than minTF_i;
Step 4: for every character string str in the string table str_table_i, count the document frequency DF with which it occurs in the i-th class sample set, i.e. the number of documents of the i-th class sample set in which the character string str appears; set up the class-i dictionary dic_i together with its minimum document frequency threshold minDF_i and maximum document frequency threshold maxDF_i, and add every character string str satisfying minDF_i ≤ DF ≤ maxDF_i, together with its document frequency DF, to the class-i dictionary dic_i;
Step 5: merge the class dictionaries dic_i into a total dictionary D, in which the document frequency DF of each character string str over the whole sample collection equals the sum of its DF values in the individual categories; sort the entries of the total dictionary D by document frequency DF, set a minimum document frequency threshold minDF and a maximum document frequency threshold maxDF for D, and delete every entry of D whose document frequency DF is less than minDF or greater than maxDF;
Step 6: for every character string str in the total dictionary D, compute its information gain IG over the whole sample collection,
$$ IG(W) = P(W)\sum_i P(C_i \mid W)\log\frac{P(C_i \mid W)}{P(C_i)} + P(\overline{W})\sum_i P(C_i \mid \overline{W})\log\frac{P(C_i \mid \overline{W})}{P(C_i)} , $$
where P(W) is the probability that the word W occurs, the word W being the character string str; P(C_i) is the probability of occurrence of the i-th category; and P(C_i | W) is the conditional probability that a document belongs to the i-th category given that the word W occurs; then sort the character strings str in the total dictionary D in descending order of their information gain IG, set the capacity M of the total dictionary D, and keep only the first M entries; the total dictionary D formed at this point is the feature dictionary for the r-category text classification.
2. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that said frequency threshold minTF_i is the 5th to 10th smallest frequency TF value in the string table str_table_i.
3. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the minimum document frequency threshold minDF_i of said class-i dictionary dic_i is the 5th to 10th smallest document frequency DF value of the strings str in the string table str_table_i.
4. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the maximum document frequency threshold maxDF_i of said class-i dictionary dic_i is the 5th to 10th largest document frequency DF value in the string table str_table_i.
5. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the minimum document frequency threshold minDF of said total dictionary D is the 5th to 10th smallest document frequency DF value in the total dictionary D.
6. The method for generating a Chinese text classification feature dictionary based on the LZW compression algorithm according to claim 1, characterized in that the maximum document frequency threshold maxDF of said total dictionary D is the 5th to 10th largest DF value in the total dictionary D.
CN2008102325572A 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm Expired - Fee Related CN101441663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102325572A CN101441663B (en) 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102325572A CN101441663B (en) 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm

Publications (2)

Publication Number Publication Date
CN101441663A true CN101441663A (en) 2009-05-27
CN101441663B CN101441663B (en) 2010-06-23

Family

ID=40726097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102325572A Expired - Fee Related CN101441663B (en) 2008-12-02 2008-12-02 Chinese text classification characteristic dictionary generating method based on LZW compression algorithm

Country Status (1)

Country Link
CN (1) CN101441663B (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377224A (en) * 2012-04-24 2013-10-30 北京百度网讯科技有限公司 Method and device for recognizing problem types and method and device for establishing recognition models
CN103377224B (en) * 2012-04-24 2016-08-17 北京百度网讯科技有限公司 Identify the method and device of problem types, set up the method and device identifying model
CN103219999A (en) * 2013-04-16 2013-07-24 大连理工大学 Reversible information hiding method based on Lempel-Ziv-Welch (LZW) compression algorithm
CN104008126A (en) * 2014-03-31 2014-08-27 北京奇虎科技有限公司 Method and device for segmentation on basis of webpage content classification
CN106294736A (en) * 2016-08-10 2017-01-04 成都轻车快马网络科技有限公司 Text feature based on key word frequency
CN106991127A (en) * 2017-03-06 2017-07-28 西安交通大学 A kind of knowledget opic short text hierarchy classification method extended based on topological characteristic
CN106991127B (en) * 2017-03-06 2020-01-10 西安交通大学 Knowledge subject short text hierarchical classification method based on topological feature expansion
CN108804404A (en) * 2018-05-29 2018-11-13 周宇 Character text processing method and processing device
CN108804404B (en) * 2018-05-29 2022-04-15 周宇 Character text processing method and device
CN110688835A (en) * 2019-09-03 2020-01-14 重庆邮电大学 Word feature value-based law-specific field word discovery method and device
CN117691746A (en) * 2023-12-11 2024-03-12 国网河北省电力有限公司正定县供电分公司 Power line monitoring system and monitoring method

Also Published As

Publication number Publication date
CN101441663B (en) 2010-06-23

Similar Documents

Publication Publication Date Title
CN101441663B (en) Chinese text classification characteristic dictionary generating method based on LZW compression algorithm
CN106383877B (en) Social media online short text clustering and topic detection method
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN102346829B (en) Virus detection method based on ensemble classification
CN109299480A (en) Terminology Translation method and device based on context of co-text
CN102955857B (en) Class center compression transformation-based text clustering method in search engine
CN107463548B (en) Phrase mining method and device
CN101021838A (en) Text handling method and system
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN104778209A (en) Opinion mining method for ten-million-scale news comments
CN102193936A (en) Data classification method and device
CN105975491A (en) Enterprise news analysis method and system
CN112527948B (en) Sentence-level index-based real-time data deduplication method and system
CN105022740A (en) Processing method and device of unstructured data
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN106611041A (en) New text similarity solution method
CN108153899B (en) Intelligent text classification method
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
CN109902173B (en) Chinese text classification method
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN109543023B (en) Document classification method and system based on trie and LCS algorithm
CN110110326A (en) A kind of text cutting method based on subject information
CN110874408A (en) Model training method, text recognition device and computing equipment
CN105447158A (en) Graph based automatic mining method for synonym set in patent search log

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100623

Termination date: 20141202

EXPY Termination of patent right or utility model