CN1936887A

CN1936887A - Automatic text classification method based on classification concept space

Info

Publication number: CN1936887A
Application number: CN 200510086462
Authority: CN
Inventors: 鲁松
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2005-09-22
Filing date: 2005-09-22
Publication date: 2007-03-28

Abstract

The method is divided into a training phase and a classification phase and comprises the following steps: (1) constructing category-word matrix data; (2) building, from that matrix, an inverted category frequency table for each word; (3) building an effective word set from that table; (4) reconstructing the category-word matrix data from the effective word set; (5) building, from the reconstructed matrix, an inverted word frequency table for each category; (6) building vector representations of words in the class concept space from the category-word matrix and the inverted word frequency table; (7) constructing the vector of the document to be classified in the class concept vector space from the word frequencies in the document and the inverted category frequencies; (8) obtaining the category of the document from the magnitude of each component of its vector. The invention is suitable for information classification, information filtering, information monitoring, and similar applications.

Description

Automatic text classification method based on a class concept space

Technical field

The invention belongs to the field of content and information analysis and processing, and in particular relates to an automatic text classification method based on a class concept space.

Background technology

Automatic text classification (Auto Text Classification) is the technology by which a computer automatically classifies a large volume of documents under a given category scheme. The technology rests on the vector space model, in which the vector space is a high-dimensional space whose dimensions are words or concepts obtained by transformation; within this space, various classification techniques are applied to classify documents.

Research reports to date show that a high-dimensional vector space built on mutually orthogonal words cannot describe text accurately, and that orthogonal concept spaces obtained by matrix transformation suffer from problems such as the lack of a usable measure and the difficulty of setting the transformation threshold. Research on automatic text classification therefore urgently needs a breakthrough in the text representation model.

Summary of the invention

The object of the invention is to provide an automatic text classification method based on a class concept space.

The method effectively realizes a formal representation of text based on class concepts, and can ensure both high efficiency and high accuracy of computer-based automatic text classification.

The automatic text classification method based on a class concept space of the present invention maps words into the class concept space and obtains the classification result from the accumulated vectors of the weighted effective words in the text. The method is divided into a training stage and a classification stage and comprises the following steps:

Step 1) constructing category-word matrix data;

Step 2) building, from the category-word matrix, an inverted category frequency (ICF) table for each word;

Step 3) building an effective word set from the ICF table;

Step 4) reconstructing the category-word matrix data from the effective word set;

Step 5) building, from the reconstructed category-word matrix, an inverted word frequency (ITF) table for each category;

Step 6) building the word vector representation in the class concept space from the category-word matrix and the ITF table;

Step 7) constructing the vector data of the document to be classified in the class concept vector space from the word frequencies in the document and the ICF values;

Step 8) obtaining the category of the document to be classified directly from the magnitude of each component of its vector.

In step 6), building the word vector representation in the class concept space from the category-word matrix and the ITF table means multiplying the frequency of the word in each category by the ITF of that category and normalizing, which yields the vector representation of the word in the class concept space.

In step 7), constructing the vector data of the document to be classified means multiplying the frequency of each word in the document by the ICF of that word and normalizing, which yields the weight of each word in the document; the word vectors are then summed according to these weights to obtain the vector representation of the document in the class concept space. In a knowledge representation system, a category is an aggregate of concepts, and the whole model is designed as a multidimensional space, hence the name class concept space.

In step 8), obtaining the category directly from the magnitude of each component means that the category corresponding to the largest component of the document vector is the category of the document. In other words, once the document to be classified has been mapped into the class concept vector space, the classification work is complete.

The present invention applies matrix data in data processing and transformation and from it builds the ICF table of each word; this technical means solves the technical problem of text classification and achieves automatic text classification with high classification efficiency.

Compared with traditional automatic text classification, the invention is characterized by a novel text representation method based on the class concept space and an automatic text classification method built on that representation. With the method of the invention, a genuine conceptual representation of text can be realized and the defect caused by the non-orthogonality of words is overcome, which improves the accuracy of text classification. At the same time, because the classification process is simply the process of mapping the text into the concept space, classification efficiency is high. The invention is applicable to fields such as high-efficiency information classification, information filtering, and information monitoring.

Description of drawings

Fig. 1 is a flowchart of the automatic text classification method based on a class concept space of the present invention.

Embodiment

The following notation is used.

\bar{d} = \langle tf_{1}, tf_{2}, \dots, tf_{n} \rangle

denotes the word frequency vector of a document, where tf_j is the frequency of occurrence of the j-th word in document d.

\bar{C}_{m} = \langle tcf_{1}, tcf_{2}, \dots, tcf_{n} \rangle

denotes the word frequency vector of the m-th category, where tcf_n is the frequency of occurrence of the n-th word in the m-th category.
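As a minimal sketch of this notation, the following Python snippet builds a word frequency vector for one document over a fixed vocabulary. The function name `term_frequency_vector`, the toy vocabulary, and the choice of raw occurrence counts as "frequency" are assumptions made for illustration; the source does not specify tokenization or whether frequencies are raw or relative.

```python
from collections import Counter


def term_frequency_vector(tokens, vocabulary):
    """Return <tf_1, ..., tf_n>: the count of each vocabulary word in tokens.

    Assumption: "frequency" is taken as a raw occurrence count, and tokens
    outside the vocabulary are ignored.
    """
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["ball", "bank", "goal", "team"]
d = term_frequency_vector(["ball", "team", "ball", "referee"], vocabulary)
print(d)  # [2, 0, 0, 1]
```

A category vector can be formed the same way by pooling the tokens of all training documents that belong to the category.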

The steps of the method shown in Fig. 1 are as follows.

In the training stage:

Step 1, constructing the category-word matrix data:

The category-word matrix data of the training set can be constructed as

C_{q \times n} = [\bar{c}_{1}, \bar{c}_{2}, \dots, \bar{c}_{q}]^{T},

where there are q categories and n words in total.

The matrix element m_{ij} denotes the number of occurrences of the j-th word in the i-th category.
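The construction in Step 1 can be sketched in Python as below. This is only an illustration under stated assumptions: the training set is assumed to be a list of (category label, token list) pairs, and the helper name `build_category_word_matrix` and the toy data are hypothetical rather than taken from the patent.

```python
from collections import Counter


def build_category_word_matrix(training_docs):
    """Build the q x n category-word matrix from labeled, tokenized documents.

    Returns (categories, vocabulary, matrix), where matrix[i][j] is the number
    of occurrences of vocabulary[j] in all documents of categories[i].
    Assumption: rows and columns are ordered alphabetically for determinism.
    """
    counts = {}
    for label, tokens in training_docs:
        counts.setdefault(label, Counter()).update(tokens)
    categories = sorted(counts)
    vocabulary = sorted({word for c in counts.values() for word in c})
    matrix = [[counts[c][word] for word in vocabulary] for c in categories]
    return categories, vocabulary, matrix


# Hypothetical training set, for illustration only.
docs = [
    ("sports", ["ball", "team", "ball"]),
    ("finance", ["bank", "team"]),
    ("sports", ["goal"]),
]
categories, vocabulary, C = build_category_word_matrix(docs)
print(categories)  # ['finance', 'sports']
print(vocabulary)  # ['ball', 'bank', 'goal', 'team']
print(C)           # [[0, 1, 0, 1], [2, 0, 1, 1]]
```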

Step 2, building the inverted category frequency table of each word from the category-word matrix:

From the category-word matrix, the formula

ICF_{i} = \log (\frac{|C|}{|t_{i}|} + 0.01)

gives what we call the inverted category frequency (Index Category Frequency, ICF), which defines the ability of the i-th word to distinguish between categories. Here |C| is the total number of categories, with value q, and |t_i| is the number of categories in which the i-th word occurs. The larger ICF_i is, the stronger the ability of the i-th word to distinguish between categories.
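A short Python sketch of the ICF computation follows. The function name `index_category_frequency` is hypothetical, and two points are assumptions because the source leaves them open: the logarithm is taken as the natural logarithm, and every word is assumed to occur in at least one category (otherwise the division is undefined).

```python
import math


def index_category_frequency(matrix):
    """Compute ICF_i = log(|C| / |t_i| + 0.01) for every word (column).

    matrix[i][j] is the count of word j in category i.
    Assumptions: natural logarithm; each word occurs in at least one category.
    """
    q = len(matrix)  # |C|, the total number of categories
    icf = []
    for j in range(len(matrix[0])):
        categories_with_word = sum(1 for row in matrix if row[j] > 0)  # |t_j|
        icf.append(math.log(q / categories_with_word + 0.01))
    return icf


# Two categories, four words; the last word occurs in both categories.
C = [[0, 1, 0, 1],
     [2, 0, 1, 1]]
icf = index_category_frequency(C)
```

In this toy matrix the last word appears in every category, so it receives the smallest ICF, log(1.01), while words confined to a single category receive log(2.01).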

Step 3, building the effective word set from the ICF table:

The words are sorted by ICF value, and the 80% with the largest values form the effective word set; the size of this list is denoted p.
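The selection in Step 3 might be sketched as follows. The name `select_effective_words`, the rounding rule (truncating 80% of the vocabulary size and keeping at least one word), and the tie-breaking by original word order are all assumptions; the source states only that the largest 80% are kept.

```python
def select_effective_words(icf, keep_ratio=0.8):
    """Return the column indices of the words kept in the effective word set.

    Words are ranked by ICF in descending order and the top keep_ratio share
    is retained. Assumptions: p = int(n * keep_ratio), at least 1; ties keep
    their original relative order because Python's sort is stable.
    """
    ranked = sorted(range(len(icf)), key=lambda j: icf[j], reverse=True)
    p = max(1, int(len(icf) * keep_ratio))
    return sorted(ranked[:p])  # restore original column order


# Hypothetical ICF values for five words; word 1 is the least discriminative.
icf = [0.7, 0.01, 0.7, 0.3, 0.7]
kept = select_effective_words(icf)
print(kept)  # [0, 2, 3, 4]
```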

Step 4, reconstructing the category-word matrix data from the effective word set:

The category-word matrix data are reconstructed from the effective word set as

C_{q \times p} = [\bar{c}_{1}, \bar{c}_{2}, \dots, \bar{c}_{q}]^{T},

where each of the q category vectors now contains only the p effective words.
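One way to picture the reconstruction is as a column restriction of the original matrix, sketched below. The helper name `restrict_to_effective_words` and the toy data are hypothetical, and treating the reconstruction as simple column selection is an assumption consistent with the q x p dimensions given in the text.

```python
def restrict_to_effective_words(matrix, kept_indices):
    """Keep only the columns of the effective words, giving a q x p matrix."""
    return [[row[j] for j in kept_indices] for row in matrix]


# Hypothetical 2 x 5 matrix; the last word is assumed to have been dropped.
C = [[0, 1, 0, 1, 3],
     [2, 0, 1, 1, 3]]
kept = [0, 1, 2, 3]
C_eff = restrict_to_effective_words(C, kept)
print(C_eff)  # [[0, 1, 0, 1], [2, 0, 1, 1]]
```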

Step 5, building the inverted word frequency table of each category from the reconstructed category-word matrix:

From the new category-word matrix C_{q \times p}, the formula

ITF_{i} = \log (\frac{|T|}{|c_{i}|} + 0.01)

gives what we call the inverted word frequency (Index Term Frequency, ITF), which defines the ability of the i-th category to distinguish between the words of the effective word set. Here |T| is the size of the effective vocabulary, with value p, and |c_i| is the number of distinct words that occur in the i-th category. The larger ITF_i is, the stronger the ability of the i-th category to distinguish between words.
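The ITF computation mirrors the ICF computation but works over rows instead of columns. The sketch below uses a hypothetical function name, `index_term_frequency`, and assumes a natural logarithm and that every category contains at least one effective word; neither point is fixed by the source.

```python
import math


def index_term_frequency(matrix):
    """Compute ITF_i = log(|T| / |c_i| + 0.01) for every category (row).

    matrix is the reconstructed q x p category-word matrix.
    Assumptions: natural logarithm; each category has at least one word.
    """
    p = len(matrix[0])  # |T|, the size of the effective vocabulary
    itf = []
    for row in matrix:
        words_in_category = sum(1 for count in row if count > 0)  # |c_i|
        itf.append(math.log(p / words_in_category + 0.01))
    return itf


# Category 0 uses two of the four effective words, category 1 uses all four.
C_eff = [[0, 1, 0, 1],
         [2, 1, 1, 1]]
itf = index_term_frequency(C_eff)
```

A category that uses every effective word gets the minimum value log(1.01), so its counts contribute less to the word vectors than those of a category with a narrower vocabulary.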

Step 6, building the word vector representation in the class concept space from the category-word matrix and the ITF table:

To map words into the category-based concept space, the invention multiplies the frequency of a word in each category by the ITF of that category and normalizes the result, which gives the word vector representation in the class concept space. The computation is:

\bar{t}_{j} = \langle \frac{tcf_{1} \times ITF_{1}}{\sum_{i=1}^{q} (tcf_{i} \times ITF_{i})}, \frac{tcf_{2} \times ITF_{2}}{\sum_{i=1}^{q} (tcf_{i} \times ITF_{i})}, \dots, \frac{tcf_{q} \times ITF_{q}}{\sum_{i=1}^{q} (tcf_{i} \times ITF_{i})} \rangle

Here tcf_1, tcf_2, ..., tcf_q are the frequencies with which the j-th word occurs in category 1, category 2, ..., category q, and ITF_1, ITF_2, ..., ITF_q are the inverted word frequencies of category 1, category 2, ..., category q. Each component is divided by

\sum_{i=1}^{q} (tcf_{i} \times ITF_{i})

for normalization. This completes the mapping of the effective word set into the class concept space; that is, a novel word representation has been designed.
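The word-vector formula of Step 6 can be illustrated with the Python sketch below. The function name `word_concept_vectors` and the toy matrix are hypothetical. Returning an all-zero vector for a word with no occurrences is an added safeguard that the source does not describe, and the natural logarithm used to produce the example ITF values is likewise an assumption.

```python
import math


def word_concept_vectors(matrix, itf):
    """Map each effective word to a q-dimensional class concept vector.

    Component i of word j is tcf_i * ITF_i divided by the sum of
    tcf_k * ITF_k over all q categories, so each vector sums to 1.
    Assumption: a word with zero total weight gets an all-zero vector.
    """
    q = len(matrix)
    vectors = []
    for j in range(len(matrix[0])):
        weighted = [matrix[i][j] * itf[i] for i in range(q)]
        total = sum(weighted)
        vectors.append([w / total for w in weighted] if total else [0.0] * q)
    return vectors


C_eff = [[0, 1, 0, 1],
         [2, 1, 1, 1]]
itf = [math.log(2.01), math.log(1.01)]  # example ITF values for the two rows
t = word_concept_vectors(C_eff, itf)
print(t[0])  # [0.0, 1.0]: word 0 occurs only in category 1
```

Words that occur in both categories, such as word 1, lean toward category 0 here because that category has the larger ITF.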

In the classification stage:

Step 7, constructing the vector data of the document to be classified in the class concept vector space from the word frequencies in the document and the ICF values:

The document to be classified is formalized as

\bar{d} = \langle tf_{1}, tf_{2}, \dots, tf_{n} \rangle,

and classifying it consists of mapping it into the class concept space according to the words in the document and their weights.

First, a method is designed to compute the ability of each word in the document to express that document, that is, the weight of each word. The computation is:

\bar{P} = \langle \frac{tf_{1} \times ICF_{1}}{\sum_{i=1}^{g} (tf_{i} \times ICF_{i})}, \frac{tf_{2} \times ICF_{2}}{\sum_{i=1}^{g} (tf_{i} \times ICF_{i})}, \dots, \frac{tf_{g} \times ICF_{g}}{\sum_{i=1}^{g} (tf_{i} \times ICF_{i})} \rangle,

where the g components of \bar{P} are, respectively, the weights of word 1, word 2, ..., word g.

Next, a method is designed to map the document to be classified into the class concept space as well: the class concept vector of each word in the document is multiplied by the weight of that word in the document, and all the resulting word vectors are then summed. In detail:

\bar{d}_{concept} = \sum_{i=1}^{g} (\bar{t}_{i} \times P_{i}),

where \bar{t}_{i} is the vector representation of word t_i in the class concept space and P_i is the weight of word t_i; the sum is the representation of the document to be classified in the class concept space.
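The two formulas of Step 7 can be sketched together in Python as follows. The function names `document_weights` and `document_concept_vector` are hypothetical, as are the example ICF values and word vectors. The sketch assumes the document's frequency list is aligned with the effective word set and raises an error when the document contains no effective word, a case the source does not address.

```python
def document_weights(tf, icf):
    """Compute P: each word's tf * ICF divided by the sum over all words."""
    weighted = [f * w for f, w in zip(tf, icf)]
    total = sum(weighted)
    if total == 0:
        # Assumption: documents without any effective word are rejected.
        raise ValueError("document contains no effective words")
    return [w / total for w in weighted]


def document_concept_vector(tf, icf, word_vectors):
    """Compute d_concept as the weighted sum of the word concept vectors."""
    weights = document_weights(tf, icf)
    q = len(word_vectors[0])
    d_concept = [0.0] * q
    for weight, vector in zip(weights, word_vectors):
        for k in range(q):
            d_concept[k] += weight * vector[k]
    return d_concept


# Hypothetical values: three effective words, two categories.
icf = [0.7, 0.01, 0.7]
word_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
tf = [2, 1, 0]  # frequencies of the three effective words in the document
d_concept = document_concept_vector(tf, icf, word_vectors)
```

Because the weights sum to 1 and each word vector sums to 1, the components of the resulting document vector also sum to 1, which makes them directly comparable across categories.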

Step 8, obtaining the category of the document to be classified directly from the magnitude of each component of its vector:

Finally, the magnitude of each component of \bar{d}_{concept} expresses how strongly the document tends to belong to the corresponding category; the larger the value, the stronger the tendency. For a single-label text classification problem, the category corresponding to the largest component is the classification result. Thus, once the document to be classified has been mapped into the class concept space, the whole classification is complete, which is also why the invention performs efficiently.
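The decision rule of Step 8 reduces to picking the index of the largest component, as in this brief Python sketch. The function name `predict_category` and the example values are hypothetical, and returning the first category in case of a tie is an assumption, since the source does not say how ties are resolved.

```python
def predict_category(doc_vector, categories):
    """Return the category whose component in doc_vector is largest.

    Assumption: on ties, the category with the lowest index wins.
    """
    best = max(range(len(doc_vector)), key=lambda k: doc_vector[k])
    return categories[best]


# Hypothetical document vector in a three-category concept space.
categories = ["finance", "politics", "sports"]
label = predict_category([0.15, 0.25, 0.60], categories)
print(label)  # sports
```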

With the method of the invention, a genuine conceptual representation of text can be realized and the defect caused by the non-orthogonality of words is overcome, which improves the accuracy of text classification. At the same time, because the classification process is simply the process of mapping the text into the concept space, classification efficiency is high. The invention is applicable to fields such as high-efficiency information classification, information filtering, and information monitoring.

Claims

1. An automatic text classification method based on a class concept space, which maps words into the class concept space and obtains the classification result from the accumulated vectors of the effective words in the text, characterized in that the method is divided into a training stage and a classification stage and comprises the following steps:

Step 1) constructing category-word matrix data;

2. The automatic text classification method based on a class concept space according to claim 1, characterized in that, in step 6), building the word vector representation in the class concept space from the category-word matrix and the inverted word frequency table means multiplying the frequency of the word in each category by the inverted word frequency of that category and normalizing, which yields the vector representation of the word in the class concept space.

3. The automatic text classification method based on a class concept space according to claim 1, characterized in that, in step 7), constructing the vector data of the document to be classified in the class concept vector space from the word frequencies in the document and the inverted category frequencies means multiplying the frequency of each word in the document by the inverted category frequency of that word and normalizing, which yields the weight of each word in the document; the word vectors are then summed according to these weights to obtain the vector representation of the document in the class concept space. In a knowledge representation system, a category is an aggregate of concepts, and the whole model is designed as a multidimensional space, hence the name class concept space.

4. The automatic text classification method based on a class concept space according to claim 1, characterized in that, in step 8), obtaining the category directly from the magnitude of each component of the document vector means that the category corresponding to the largest component of the vector of the document to be classified is the category of the document; that is, once the document has been mapped into the class concept vector space, the classification work is complete.