CN1936887A - Automatic text classification method based on classification concept space - Google Patents

Automatic text classification method based on classification concept space

Publication number
CN1936887A
Authority
CN
China
Prior art keywords
classification
word
vector
space
classifying documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510086462
Other languages
Chinese (zh)
Inventor
鲁松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN 200510086462 priority Critical patent/CN1936887A/en
Publication of CN1936887A publication Critical patent/CN1936887A/en
Pending legal-status Critical Current

Abstract

The method is divided into a training phase and a classification phase and comprises the following steps: (1) constructing the category-word matrix data; (2) building, from that matrix, the inverse category frequency table of each word; (3) building the effective word set from that table; (4) reconstructing the category-word matrix data from the effective word set; (5) building, from the reconstructed matrix, the inverse word frequency table of each category; (6) building the vector representation of each word in the category concept space from the category-word matrix and the inverse word frequency table; (7) constructing the vector of the document to be classified in the category concept space from the word frequencies in that document and the inverse category frequencies; (8) obtaining the category of the document from the magnitudes of the components of its vector. The invention is suitable for information classification, filtering, monitoring, and similar applications.

Description

Automatic text classification method based on the category concept space
Technical field
The invention belongs to the field of content and information analysis and processing, and in particular relates to an automatic text classification method based on the category concept space.
Background art
Automatic text classification (Auto Text Classification) is a technology that studies how a computer can automatically classify large numbers of documents under a given set of categories. The technology is based on the vector space model, in which the vector space is a high-dimensional space whose dimensions are words or transformed concepts; within this space, various classification techniques are applied to classify documents.
Research reported to date shows that a high-dimensional vector space built on the assumption that words are mutually orthogonal cannot describe text accurately, and that orthogonal concept spaces obtained by matrix transformation have problems of their own, such as the inability to measure or set the transformation threshold. Automatic text classification research therefore urgently needs a breakthrough in the text representation model.
Summary of the invention
The objective of the invention is to provide an automatic text classification method based on the category concept space.
The method achieves a formal representation of text based on category concepts and is intended to ensure both high efficiency and high accuracy in automatic text classification by computer.
The automatic text classification method based on the category concept space of the present invention maps words into the category concept space and obtains the classification result by accumulating the vectors of the weighted effective words in the text. The method is characterized in that it is divided into two phases, training and classification, and comprises the following steps:
Step 1) construct the category-word matrix data;
Step 2) build the inverse category frequency table of each word from the category-word matrix;
Step 3) build the effective word set from the inverse category frequency table;
Step 4) reconstruct the category-word matrix data from the effective word set;
Step 5) build the inverse word frequency table of each category from the reconstructed category-word matrix;
Step 6) build the word vector representation in the category concept space from the category-word matrix and the inverse word frequency table;
Step 7) construct the vector of the document to be classified in the category concept space from the word frequencies in that document and the inverse category frequencies;
Step 8) obtain the category of the document to be classified directly from the magnitudes of the components of its vector.
In step 6), building the word vector representation in the category concept space from the category-word matrix and the inverse word frequency table means that, for each word, its frequency in each category is multiplied by that category's inverse word frequency and the products are normalized, which gives the vector representation of the word in the category concept space.
In step 7), constructing the vector of the document to be classified means that the frequency of each word in the document is multiplied by the word's inverse category frequency and the products are normalized, which gives the weight of each word in the document; the word vectors are then summed according to these weights to obtain the vector representation of the document in the category concept space. In a knowledge representation system, a category is an aggregate of concepts, and the whole model is designed as a multidimensional space, hence the name category concept space.
In step 8), obtaining the category directly from the component magnitudes means that the category corresponding to the largest component of the document vector is the category of the document to be classified; that is, once the document has been mapped into the category concept space, the classification is complete.
By processing and transforming matrix data and building the inverse category frequency table of each word from it, the invention addresses the technical problem of text classification and achieves automatic text classification with high classification efficiency.
Compared with traditional automatic text classification, the invention is characterized by a new text representation method based on the category concept space and an automatic text classification method built on that representation. With this method, text is represented by concepts, which overcomes the defect that words are not mutually orthogonal and thereby improves classification accuracy. Because classifying a text is the same operation as mapping it into the concept space, classification is also efficient. The invention is applicable to fields such as high-efficiency information classification, information filtering, and information monitoring.
Brief description of the drawings
Fig. 1 is a flowchart of the automatic text classification method based on the category concept space according to the present invention.
Detailed description of the embodiments
Here $\bar{d} = \langle tf_1, tf_2, \ldots, tf_n \rangle$ denotes the word frequency vector of a document, where $tf_j$ is the frequency of the $j$-th word in document $d$; $\bar{C}_m = \langle tcf_1, tcf_2, \ldots, tcf_n \rangle$ denotes the word frequency vector of the $m$-th category, where $tcf_n$ is the frequency of the $n$-th word in the $m$-th category.
The steps of the method shown in Fig. 1 are as follows:
Training phase:
Step 1, construct the category-word matrix data:
The category-word matrix of the training set is constructed as $C_{q \times n} = [\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_q]^T$, where there are $q$ categories and $n$ words in total.
Here $m_{ij}$ denotes the number of occurrences of the $j$-th word in the $i$-th category.
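As a rough illustration of Step 1, the following Python sketch builds a category-word count matrix from a labeled training set. The function name `build_category_word_matrix`, the `(label, tokens)` input format, and the toy documents are assumptions made for illustration; the patent does not specify tokenization or data structures.

```python
from collections import Counter

def build_category_word_matrix(training_docs):
    """Hypothetical helper: training_docs is a list of (category, tokens) pairs."""
    categories = sorted({label for label, _ in training_docs})
    vocabulary = sorted({tok for _, tokens in training_docs for tok in tokens})
    cat_index = {c: i for i, c in enumerate(categories)}
    word_index = {w: j for j, w in enumerate(vocabulary)}
    # matrix[i][j] plays the role of m_ij: occurrences of word j in category i
    matrix = [[0] * len(vocabulary) for _ in categories]
    for label, tokens in training_docs:
        for tok, count in Counter(tokens).items():
            matrix[cat_index[label]][word_index[tok]] += count
    return categories, vocabulary, matrix

# Toy training set (invented for this example)
toy_docs = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "win"]),
    ("finance", ["stock", "market", "win"]),
]
categories, vocabulary, matrix = build_category_word_matrix(toy_docs)
```

Each row of `matrix` corresponds to one category vector and each column to one word, so the result has the $q \times n$ shape described in the text.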
Step 2, build the inverse category frequency table of each word from the category-word matrix:
From the category-word matrix, compute $ICF_i = \log\left(\frac{|C|}{|t_i|} + 0.01\right)$, which we call the inverse category frequency (Index Category Frequency, ICF); it measures the ability of the $i$-th word to discriminate between categories. Here $|C|$ is the total number of categories, whose value is $q$, and $|t_i|$ is the number of categories in which the $i$-th word occurs. The larger $ICF_i$ is, the better the $i$-th word discriminates between categories.
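The ICF computation can be sketched in a few lines of Python. This is a minimal sketch, assuming the formula reads $\log(|C|/|t_i| + 0.01)$ with a natural logarithm (the source does not state the base) and that every word occurs in at least one category; the function name and the toy matrix are hypothetical.

```python
import math

def inverse_category_frequency(matrix):
    """matrix[i][j] = count of word j in category i; returns one ICF value per word."""
    q = len(matrix)          # |C|, total number of categories
    n = len(matrix[0])       # number of words
    icf = []
    for j in range(n):
        # |t_i| in the text: number of categories in which the word occurs
        cats_with_word = sum(1 for i in range(q) if matrix[i][j] > 0)
        if cats_with_word == 0:
            raise ValueError("word %d occurs in no category" % j)
        icf.append(math.log(q / cats_with_word + 0.01))
    return icf

# Toy matrix: 2 categories, 3 words (invented for this example)
toy_matrix = [[3, 0, 1],
              [2, 5, 0]]
icf = inverse_category_frequency(toy_matrix)
```

In the toy matrix the first word occurs in both categories and receives the smallest ICF, while the other two words occur in a single category and receive a larger one.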
Step 3, build the effective word set from the inverse category frequency table:
After the words are sorted by ICF, the 80% of words with the largest values form the effective word set; the size of this set is denoted $p$.
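One possible way to select the effective word set is sketched below in Python. The rounding rule (integer floor of 80% of the vocabulary, with at least one word kept) and the tie handling are assumptions, since the source only states that the largest 80% is kept; `select_effective_words` is a hypothetical name.

```python
def select_effective_words(icf, keep_percent=80):
    """Return the column indices of the words with the largest ICF values."""
    n = len(icf)
    # Assumed rounding: floor, computed with integers to avoid float error
    p = max(1, n * keep_percent // 100)
    # Sort word indices by descending ICF; ties keep their original order
    ranked = sorted(range(n), key=lambda j: icf[j], reverse=True)
    # Return kept indices in ascending order so the column order is preserved
    return sorted(ranked[:p])

# Toy ICF values for five words (invented for this example)
icf_values = [0.01, 0.70, 0.70, 0.30, 1.10]
effective = select_effective_words(icf_values)
p = len(effective)
```

With five words and an 80% cut, four words are kept and the word with the lowest ICF is dropped.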
Step 4, reconstruct the category-word matrix data from the effective word set:
The category-word matrix is reconstructed over the effective word set as $C_{q \times p} = [\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_q]^T$, where each category vector now has $p$ components, one per effective word.
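Reconstructing the matrix amounts to keeping only the columns of the effective words. A minimal Python sketch, assuming the matrix is stored as a list of rows and the effective words are given as column indices (both are illustrative choices, not taken from the source):

```python
def restrict_to_effective_words(matrix, kept_columns):
    """Keep only the columns listed in kept_columns, in the given order."""
    return [[row[j] for j in kept_columns] for row in matrix]

# Toy data: 2 categories, 5 words, of which columns 1..4 are effective
full_matrix = [[3, 0, 1, 0, 2],
               [2, 5, 0, 1, 4]]
reduced_matrix = restrict_to_effective_words(full_matrix, [1, 2, 3, 4])
```

The number of rows (categories) is unchanged; only the number of columns shrinks from $n$ to $p$.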
Step 5, build the inverse word frequency table of each category from the reconstructed category-word matrix:
From the new category-word matrix $C_{q \times p}$, compute $ITF_i = \log\left(\frac{|T|}{|c_i|} + 0.01\right)$, which we call the inverse word frequency (Index Term Frequency, ITF); it measures the ability of the $i$-th category to discriminate between the words of the effective word set. Here $|T|$ is the size of the effective vocabulary, whose value is $p$, and $|c_i|$ is the number of effective words that occur in the $i$-th category. The larger $ITF_i$ is, the better the $i$-th category discriminates between words.
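ITF mirrors the ICF computation, but works row by row instead of column by column. The Python sketch below assumes the formula reads $\log(|T|/|c_i| + 0.01)$ with a natural logarithm and that every category contains at least one effective word; the function name and the toy matrix are hypothetical.

```python
import math

def inverse_word_frequency(matrix):
    """matrix[i][j] = count of effective word j in category i; returns one ITF per category."""
    p = len(matrix[0])       # |T|, size of the effective vocabulary
    itf = []
    for i, row in enumerate(matrix):
        # |c_i| in the text: number of effective words that occur in category i
        words_in_category = sum(1 for count in row if count > 0)
        if words_in_category == 0:
            raise ValueError("category %d contains no effective word" % i)
        itf.append(math.log(p / words_in_category + 0.01))
    return itf

# Toy matrix: 2 categories, 4 effective words (invented for this example)
toy_matrix = [[3, 0, 1, 0],
              [2, 5, 4, 1]]
itf = inverse_word_frequency(toy_matrix)
```

A category that uses only a few of the effective words gets a larger ITF than one that uses all of them. Since $|c_i| \le p$, every ITF value is at least $\log(1.01) > 0$.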
Step 6, build the word vector representation in the category concept space from the category-word matrix and the inverse word frequency table:
To map words into the category-based concept space, the invention multiplies the frequency of a word in each category by that category's inverse word frequency and normalizes the products, which gives the word's vector representation in the category concept space. The computation is:
$$\bar{t}_j = \left\langle \frac{tcf_1 \times ITF_1}{\sum_{i=1}^{q}(tcf_i \times ITF_i)}, \frac{tcf_2 \times ITF_2}{\sum_{i=1}^{q}(tcf_i \times ITF_i)}, \ldots, \frac{tcf_q \times ITF_q}{\sum_{i=1}^{q}(tcf_i \times ITF_i)} \right\rangle$$
Here $tcf_1, tcf_2, \ldots, tcf_q$ are the frequencies of the $j$-th word in category 1, category 2, ..., category $q$, and $ITF_1, ITF_2, \ldots, ITF_q$ are the inverse word frequencies of category 1, category 2, ..., category $q$. Each component is divided by $\sum_{i=1}^{q}(tcf_i \times ITF_i)$ for normalization. This completes the mapping of the effective word set into the category concept space, that is, a new way of representing words.
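The word vector computation of Step 6 can be sketched as follows in Python. The function name `word_concept_vectors`, the list-of-rows matrix layout, and the toy ITF values are assumptions for illustration; returning an all-zero vector for a word that occurs in no category is also an assumed guard that the source does not discuss.

```python
def word_concept_vectors(matrix, itf):
    """Return, for each word j, its q-dimensional vector in the category concept space."""
    q = len(matrix)
    p = len(matrix[0])
    vectors = []
    for j in range(p):
        # tcf_i * ITF_i for every category i
        weighted = [matrix[i][j] * itf[i] for i in range(q)]
        total = sum(weighted)
        if total == 0:
            vectors.append([0.0] * q)   # assumed fallback, not from the source
        else:
            vectors.append([w / total for w in weighted])
    return vectors

# Toy data: 2 categories, 2 effective words, made-up ITF values
toy_matrix = [[3, 0],
              [1, 2]]
toy_itf = [0.5, 1.0]
vectors = word_concept_vectors(toy_matrix, toy_itf)
```

For the first word the weighted values are 1.5 and 1.0, so its vector is approximately (0.6, 0.4); the second word occurs only in the second category, so its vector is (0, 1). Each non-zero vector sums to 1 because of the normalization.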
Classification phase:
Step 7, construct the vector of the document to be classified in the category concept space from the word frequencies in that document and the inverse category frequencies:
The document to be classified is formalized as $\bar{d} = \langle tf_1, tf_2, \ldots, tf_n \rangle$. Classifying it means mapping it into the category concept space according to the words it contains and their weights.
First, the ability of each word in the document to express the document, that is, the weight of each word, is computed as
$$\bar{P} = \left\langle \frac{tf_1 \times ICF_1}{\sum_{i=1}^{g}(tf_i \times ICF_i)}, \frac{tf_2 \times ICF_2}{\sum_{i=1}^{g}(tf_i \times ICF_i)}, \ldots, \frac{tf_g \times ICF_g}{\sum_{i=1}^{g}(tf_i \times ICF_i)} \right\rangle$$
where the components are the weights of word 1, word 2, ..., word $g$ respectively. Next, the document to be classified is itself mapped into the category concept space: the vector of each word in the category concept space is multiplied by that word's weight in the document, and the resulting vectors are summed. In detail:
$$\bar{d}_{concept} = \sum_{i=1}^{g}(\bar{t}_i \times P_i)$$
where $\bar{t}_i$ is the vector representation of word $t_i$ in the category concept space and $P_i$ is the weight of word $t_i$; the sum is the representation of the document to be classified in the category concept space.
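Step 7 can be illustrated with a short Python sketch that computes the word weights and the weighted sum of word vectors in one pass. The function name, the dictionary input mapping word index to frequency, and the toy values are assumptions for illustration only; words outside the effective set are assumed to have been dropped beforehand.

```python
def document_concept_vector(word_counts, icf, word_vectors):
    """word_counts: {word index: tf in the document}; returns the q-dimensional document vector."""
    q = len(word_vectors[0])
    # tf_i * ICF_i for every word of the document
    weighted = {j: tf * icf[j] for j, tf in word_counts.items()}
    total = sum(weighted.values())
    doc_vector = [0.0] * q
    if total == 0:
        return doc_vector               # assumed fallback for an empty document
    for j, w in weighted.items():
        weight = w / total              # P_i, the normalized weight of the word
        for i in range(q):
            doc_vector[i] += weight * word_vectors[j][i]
    return doc_vector

# Toy inputs: two words, two categories (values invented for this example)
toy_word_vectors = [[0.6, 0.4], [0.0, 1.0]]
toy_icf = [0.5, 1.0]
toy_counts = {0: 2, 1: 1}               # word 0 appears twice, word 1 once
doc_vector = document_concept_vector(toy_counts, toy_icf, toy_word_vectors)
```

Both words end up with weight 0.5 here (2 × 0.5 and 1 × 1.0), so the document vector is approximately (0.3, 0.7). Because the weights sum to 1 and each word vector sums to 1, the document vector also sums to 1.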
Step 8, obtain the category of the document to be classified directly from the magnitudes of the components of its vector:
Finally, the magnitude of each component of $\bar{d}_{concept}$ expresses the degree to which the document tends to belong to the corresponding category; the larger the value, the stronger the tendency. For single-label text classification, the category corresponding to the largest component is the classification result. Thus, once the document to be classified has been mapped into the category concept space, the whole classification is complete, which is why the invention is efficient.
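The final decision is an argmax over the components of the document vector. A minimal Python sketch, with a hypothetical function name and toy values; how ties are broken is not stated in the source, and this sketch simply returns the first maximal category.

```python
def classify(doc_vector, categories):
    """Return the category whose component in doc_vector is largest."""
    best = max(range(len(doc_vector)), key=lambda i: doc_vector[i])
    return categories[best]

# Toy document vector over two categories (invented for this example)
predicted = classify([0.3, 0.7], ["finance", "sports"])
```

Because the decision needs only one pass over $q$ components, its cost after the mapping is linear in the number of categories.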
With the method of the present invention, text is represented by concepts, which overcomes the defect that words are not mutually orthogonal and thereby improves classification accuracy. Because classifying a text is the same operation as mapping it into the concept space, classification is also efficient. The invention is applicable to fields such as high-efficiency information classification, information filtering, and information monitoring.

Claims (4)

1. An automatic text classification method based on the category concept space, which maps words into the category concept space and obtains the classification result by accumulating the vectors of the effective words in the text, characterized in that the method is divided into two phases, training and classification, and comprises the following steps:
Step 1) construct the category-word matrix data;
Step 2) build the inverse category frequency table of each word from the category-word matrix;
Step 3) build the effective word set from the inverse category frequency table;
Step 4) reconstruct the category-word matrix data from the effective word set;
Step 5) build the inverse word frequency table of each category from the reconstructed category-word matrix;
Step 6) build the word vector representation in the category concept space from the category-word matrix and the inverse word frequency table;
Step 7) construct the vector of the document to be classified in the category concept space from the word frequencies in that document and the inverse category frequencies;
Step 8) obtain the category of the document to be classified directly from the magnitudes of the components of its vector.
2. The automatic text classification method based on the category concept space according to claim 1, characterized in that in step 6), building the word vector representation in the category concept space from the category-word matrix and the inverse word frequency table means that, for each word, its frequency in each category is multiplied by that category's inverse word frequency and the products are normalized, which gives the vector representation of the word in the category concept space.
3. The automatic text classification method based on the category concept space according to claim 1, characterized in that in step 7), constructing the vector of the document to be classified means that the frequency of each word in the document is multiplied by the word's inverse category frequency and the products are normalized, which gives the weight of each word in the document; the word vectors are then summed according to these weights to obtain the vector representation of the document in the category concept space; in a knowledge representation system, a category is an aggregate of concepts, and the whole model is designed as a multidimensional space, hence the name category concept space.
4. The automatic text classification method based on the category concept space according to claim 1, characterized in that in step 8), obtaining the category directly from the component magnitudes means that the category corresponding to the largest component of the document vector is the category of the document to be classified; that is, once the document has been mapped into the category concept space, the classification is complete.
CN 200510086462 2005-09-22 2005-09-22 Automatic text classification method based on classification concept space Pending CN1936887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510086462 CN1936887A (en) 2005-09-22 2005-09-22 Automatic text classification method based on classification concept space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510086462 CN1936887A (en) 2005-09-22 2005-09-22 Automatic text classification method based on classification concept space

Publications (1)

Publication Number Publication Date
CN1936887A true CN1936887A (en) 2007-03-28

Family

ID=37954392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510086462 Pending CN1936887A (en) 2005-09-22 2005-09-22 Automatic text classification method based on classification concept space

Country Status (1)

Country Link
CN (1) CN1936887A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100462979C (en) * 2007-06-26 2009-02-18 腾讯科技(深圳)有限公司 Distributed indesx file searching method, searching system and searching server
CN101587493B (en) * 2009-06-29 2012-07-04 中国科学技术大学 Text classification method
CN102135961B (en) * 2010-01-22 2013-03-20 北京金山软件有限公司 Method and device for determining domain feature words
CN106528595A (en) * 2016-09-23 2017-03-22 中国农业科学院农业信息研究所 Website homepage content based field information collection and association method
CN110209805A (en) * 2018-04-26 2019-09-06 腾讯科技(深圳)有限公司 File classification method, device, storage medium and computer equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100462979C (en) * 2007-06-26 2009-02-18 腾讯科技(深圳)有限公司 Distributed indesx file searching method, searching system and searching server
CN101587493B (en) * 2009-06-29 2012-07-04 中国科学技术大学 Text classification method
CN102135961B (en) * 2010-01-22 2013-03-20 北京金山软件有限公司 Method and device for determining domain feature words
CN106528595A (en) * 2016-09-23 2017-03-22 中国农业科学院农业信息研究所 Website homepage content based field information collection and association method
CN106528595B (en) * 2016-09-23 2019-08-06 中国农业科学院农业信息研究所 Realm information based on website homepage content is collected and correlating method
CN110209805A (en) * 2018-04-26 2019-09-06 腾讯科技(深圳)有限公司 File classification method, device, storage medium and computer equipment
CN110209805B (en) * 2018-04-26 2023-11-28 腾讯科技(深圳)有限公司 Text classification method, apparatus, storage medium and computer device

Similar Documents

Publication Publication Date Title
Huang et al. A systematic method to create search strategies for emerging technologies based on the Web of Science: illustrated for ‘Big Data’
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN107844559A (en) A kind of file classifying method, device and electronic equipment
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN104598532A (en) Information processing method and device
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN101794311A (en) Fuzzy data mining based automatic classification method of Chinese web pages
CN103970666A (en) Method for detecting repeated software defect reports
CN108304509B (en) Junk comment filtering method based on text multi-directional expression mutual learning
CN102629272A (en) Clustering based optimization method for examination system database
CN104899230A (en) Public opinion hotspot automatic monitoring system
CN104484380A (en) Personalized search method and personalized search device
CN107844558A (en) The determination method and relevant apparatus of a kind of classification information
CN110288495A (en) Case statute of limitation intelligence checking method and device
CN101567069A (en) Processing method of evaluation data of legal risk and query system
CN1936887A (en) Automatic text classification method based on classification concept space
CN108228788A (en) Guide of action automatically extracts and associated method and electronic equipment
CN105117466A (en) Internet information screening system and method
CN110399432A (en) A kind of classification method of table, device, computer equipment and storage medium
Pintér International experience in establishing indicators for the circular economy and considerations for China
Xu et al. Combining text classification and hidden markov modeling techniques for structuring randomized clinical trial abstracts
CN105787004A (en) Text classification method and device
CN105224689A (en) A kind of Dongba document sorting technique
Lei et al. Automatically classify chinese judgment documents utilizing machine learning algorithms
CN104462552A (en) Question and answer page core word extracting method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication