CN1936887A - Automatic text classification method based on classification concept space - Google Patents
Automatic text classification method based on classification concept space
- Publication number
- CN1936887A
- Authority
- CN
- China
- Prior art keywords
- classification
- word
- vector
- space
- classifying documents
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The method is divided into a training phase and a classification phase and includes the following steps: (1) constructing the category-word matrix data; (2) building, from that matrix, the index category frequency (ICF) data table of each word; (3) building the effective word set from that table; (4) rebuilding the category-word matrix data from that set; (5) building, from the rebuilt matrix, the index term frequency (ITF) data table of each category; (6) building the vector representations of words in the classification concept space from the category-word matrix and the ITF data table; (7) constructing the vector data of the document to be classified in the classification concept vector space from the word frequencies and the ICFs; (8) obtaining the category of the document from the magnitude of each component of its vector. The invention is suitable for information classification, filtering, monitoring and similar tasks.
Description
Technical field
The invention belongs to the field of content and information analysis and processing, and in particular relates to an automatic text classification method based on the classification concept space.
Background art
Automatic text classification (Auto Text Classification) is the study of how a computer can automatically classify large volumes of documents under a given category scheme. The technology is built on the vector space model, in which the vector space is a high-dimensional space whose dimensions are words or concepts obtained by transformation; documents are classified in this space using various classification techniques.
Research reports to date show that a high-dimensional vector space built on the assumption that words are mutually orthogonal cannot describe text accurately, and that the orthogonal concept space obtained by matrix transformation has problems of its own, such as being impossible to measure and offering no way to set the transformation threshold. Research on automatic text classification therefore urgently needs a breakthrough in the text representation model.
Summary of the invention
The object of the present invention is to provide an automatic text classification method based on the classification concept space.
The method achieves a formal representation of text based on category concepts, and can ensure both high efficiency and high accuracy in automatic text classification by computer.
The automatic text classification method based on the classification concept space of the present invention maps words into the classification concept space and obtains the classification result by accumulating the weighted vectors of the effective words in a text. The method is divided into a training stage and a classification stage and comprises the following steps:
Step 1) construct the category-word matrix data;
Step 2) build the index category frequency (ICF) data table of each word from the category-word matrix;
Step 3) build the effective word set from the ICF data table;
Step 4) rebuild the category-word matrix data from the effective word set;
Step 5) build the index term frequency (ITF) data table of each category from the rebuilt category-word matrix;
Step 6) build the word vector representation in the classification concept space from the category-word matrix and the ITF data table;
Step 7) construct the vector data of the document to be classified in the classification concept vector space from the word frequencies in that document and the ICFs;
Step 8) obtain the category of the document to be classified directly from the magnitude of each component of its vector.
In step 6), the word vector representation in the classification concept space is built from the category-word matrix and the index term frequency (ITF) data table as follows: the frequency of the word in each category is multiplied by that category's ITF and the result is normalized, giving the vector representation of the word in the classification concept space.
In step 7), the vector data of the document to be classified is constructed from the word frequencies in the document and the index category frequencies (ICF) as follows: the frequency of each word in the document is multiplied by the word's ICF and normalized, giving the weight of the word in the document; the word vectors are then summed according to these weights, giving the vector representation of the document in the classification concept space. In a knowledge representation system a category is an aggregate of concepts, and the whole model is designed as a multidimensional space, which is why we call it the classification concept space.
In step 8), the category of the document is obtained directly from the magnitude of each component of its vector: the category corresponding to the largest component of the document vector is the category of the document. In other words, once the document has been mapped into the classification concept vector space, classification is complete.
By applying matrix data in data processing and transformation, and building from it the index category frequency data table of each word, the invention solves the technical problem of text classification and achieves automatic text classification with high classification efficiency.
Compared with traditional automatic text classification, the invention is characterized by a novel text representation method based on the classification concept space and an automatic text classification method built on that representation. With the method, a text can be given a genuine conceptual representation, which overcomes the defect that words are not mutually orthogonal and thereby improves classification accuracy. In addition, because classifying a text is the same process as mapping it into the concept space, classification is highly efficient. The invention is suitable for fields such as efficient information classification, information filtering and information monitoring.
Description of drawings
Fig. 1 is a flow chart of the automatic text classification method based on the classification concept space of the present invention.
Embodiment
This document uses $d = (tf_1, tf_2, \ldots, tf_n)$ to denote the word-frequency vector of a document, where $tf_j$ is the frequency of the $j$-th word in document $d$, and uses $c_m = (tcf_1, tcf_2, \ldots, tcf_n)$ to denote the word-frequency vector of the $m$-th category, where $tcf_n$ is the frequency of the $n$-th word in the $m$-th category.
The steps of the method shown in Fig. 1 are as follows:
In the training stage,
Step 1, construct the category-word matrix data:
The category-word matrix of the training set can be constructed as $C_{q \times n} = (m_{ij})$, with $q$ categories and $n$ words in total, where $m_{ij}$ is the number of occurrences of the $j$-th word in the $i$-th category.
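The construction in step 1 can be sketched as follows. This is a minimal illustration, assuming the training set is a list of (category label, token list) pairs; the function name `build_category_word_matrix` and the toy data are hypothetical and not part of the patent.

```python
from collections import Counter

def build_category_word_matrix(labeled_docs):
    """Return (categories, vocabulary, matrix) with matrix[i][j] = m_ij,
    the number of occurrences of word j in category i."""
    categories = sorted({label for label, _ in labeled_docs})
    vocabulary = sorted({w for _, words in labeled_docs for w in words})
    # Accumulate word counts per category over all training documents.
    counts = {c: Counter() for c in categories}
    for label, words in labeled_docs:
        counts[label].update(words)
    matrix = [[counts[c][w] for w in vocabulary] for c in categories]
    return categories, vocabulary, matrix

# Hypothetical toy training set: (category label, tokenized document).
training = [
    ("sports", ["ball", "team", "goal"]),
    ("sports", ["team", "match"]),
    ("finance", ["bank", "stock", "team"]),
]
categories, vocabulary, C = build_category_word_matrix(training)
```

Here the rows of `C` follow the sorted category labels and the columns follow the sorted vocabulary, so q = 2 and n = 6.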
Step 2, build the index category frequency data table of each word from the category-word matrix:
From the category-word matrix, a quantity we call the index category frequency (Index Category Frequency, ICF) is computed for each word from $|C|$ and $t_i$ (the formula is not reproduced in this text). $ICF_i$ defines the ability of the $i$-th word to distinguish categories. Here $|C|$ is the total number of categories, equal to $q$, and $t_i$ is the number of categories in which the $i$-th word occurs. The larger $ICF_i$ is, the stronger the ability of the $i$-th word to distinguish the categories.
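A minimal sketch of the ICF computation follows. The exact formula is not available in this text, so the form log(|C| / t_i), chosen by analogy with inverse document frequency, is an assumption; it is consistent with the stated property that words occurring in fewer categories get a larger value. The function name is hypothetical.

```python
import math

def index_category_frequency(matrix):
    """ICF per word, ASSUMING ICF_i = log(|C| / t_i).

    matrix[i][j] is the count of word j in category i;
    t_j is the number of categories in which word j occurs."""
    q = len(matrix)       # |C|, the number of categories
    n = len(matrix[0])    # number of words
    icf = []
    for j in range(n):
        t_j = sum(1 for i in range(q) if matrix[i][j] > 0)
        # A word that never occurs gets 0.0 rather than a division by zero.
        icf.append(math.log(q / t_j) if t_j else 0.0)
    return icf

# Toy category-by-word counts: 2 categories, 6 words.
C = [[0, 1, 0, 0, 1, 1],
     [1, 0, 1, 1, 0, 2]]
icf = index_category_frequency(C)
```

Under this assumed form, the last word occurs in both categories and therefore has no discriminating power, while every word confined to one category scores log 2.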
Step 3, build the effective word set from the index category frequency (ICF) data table:
After the words are sorted by ICF, the 80% with the largest values form the effective word set; the size of this list is denoted $p$.
Step 4, rebuild the category-word matrix data from the effective word set:
The category-word matrix is rebuilt over the effective word set, giving the new matrix $C_{q \times p}$.
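Steps 3 and 4 can be sketched together: rank words by ICF, keep the top 80%, and drop the other columns of the category-by-word matrix. The rounding rule (floor) and the function names are assumptions for illustration; the patent only states that the largest 80% are kept.

```python
def select_effective_words(icf, keep_ratio=0.8):
    """Return the column indices of the words kept in the effective word set."""
    n = len(icf)
    # Rank word indices by ICF, largest first (ties keep their original order).
    ranked = sorted(range(n), key=lambda j: icf[j], reverse=True)
    p = max(1, int(n * keep_ratio))  # assumed rounding rule: floor, at least 1
    return sorted(ranked[:p])        # restore the original column order

def rebuild_matrix(matrix, kept):
    """Keep only the columns of the effective words, giving a q x p matrix."""
    return [[row[j] for j in kept] for row in matrix]

# Hypothetical ICF values for 5 words and the matching 2 x 5 count matrix.
icf = [0.69, 0.69, 0.69, 0.69, 0.0]
C = [[0, 1, 0, 0, 1],
     [1, 0, 1, 1, 2]]
kept = select_effective_words(icf)
C_new = rebuild_matrix(C, kept)
```

In this toy case the word with zero ICF is the one dropped, leaving p = 4 columns.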
Step 5, build the index term frequency data table of each category from the rebuilt category-word matrix:
From the new category-word matrix $C_{q \times p}$, a quantity we call the index term frequency (Index Term Frequency, ITF) is computed for each category from $|T|$ and $|c_i|$ (the formula is not reproduced in this text). $ITF_i$ defines the ability of the $i$-th category to distinguish the words of the effective word set. Here $|T|$ is the size of the effective vocabulary, equal to $p$, and $|c_i|$ is the number of distinct words that appear in the $i$-th category. The larger $ITF_i$ is, the stronger the ability of the $i$-th category to distinguish words.
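A minimal sketch of the ITF computation, mirroring the ICF one. As with ICF, the formula itself is not available in this text; the form log(|T| / |c_i|) is an assumption that matches the stated inputs, and the function name is hypothetical.

```python
import math

def index_term_frequency(matrix):
    """ITF per category, ASSUMING ITF_i = log(|T| / |c_i|).

    matrix is the rebuilt q x p category-by-word count matrix;
    |c_i| is the number of distinct effective words seen in category i."""
    p = len(matrix[0])  # |T|, the size of the effective word set
    itf = []
    for row in matrix:
        size = sum(1 for count in row if count > 0)
        # A category with no effective words gets 0.0 instead of an error.
        itf.append(math.log(p / size) if size else 0.0)
    return itf

C_new = [[0, 1, 0, 0],
         [1, 0, 1, 1]]
itf = index_term_frequency(C_new)
```

One consequence of this assumed form is that a category containing every effective word would get ITF 0, so a smoothed variant may be preferable in practice.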
Step 6, build the word vector representation in the classification concept space from the category-word matrix and the index term frequency (ITF) data table:
To map words into the category-based concept space, the invention multiplies the frequency of a word in each category by that category's ITF and normalizes the result, giving the vector representation of the word in the classification concept space. For the $j$-th word, the components before normalization are

$$tcf_1 \times ITF_1,\; tcf_2 \times ITF_2,\; \ldots,\; tcf_q \times ITF_q$$

where $tcf_1, tcf_2, \ldots, tcf_q$ are the frequencies of the $j$-th word in category 1, category 2, ..., category $q$, and $ITF_1, ITF_2, \ldots, ITF_q$ are the index term frequencies of category 1, category 2, ..., category $q$. Each component is divided by a common term for normalization (the normalizing term is not reproduced in this text). This completes the mapping of the effective word set into the classification concept space, that is, a novel way of representing words.
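The word-vector construction of step 6 can be sketched as below. The multiplication of per-category word frequency by ITF follows the text; the normalizing term is not available here, so dividing by the sum of the components is an assumption, as are the function name and the toy ITF values.

```python
def word_vectors(matrix, itf):
    """Return one q-dimensional vector per effective word.

    Component i of word j is tcf_i * ITF_i, where tcf_i = matrix[i][j].
    ASSUMPTION: components are normalized to sum to 1."""
    q = len(matrix)
    p = len(matrix[0])
    vectors = []
    for j in range(p):
        raw = [matrix[i][j] * itf[i] for i in range(q)]
        total = sum(raw)
        # Leave an all-zero vector untouched to avoid dividing by zero.
        vectors.append([x / total for x in raw] if total else raw)
    return vectors

# Hypothetical rebuilt matrix (2 categories x 2 words) and ITF values.
C_new = [[2, 1],
         [2, 0]]
itf = [1.0, 3.0]
vectors = word_vectors(C_new, itf)
```

The first word occurs equally often in both categories, yet its vector leans toward the second category because that category has the larger ITF.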
In the classification stage,
Step 7, construct the vector data of the document to be classified in the classification concept vector space from the word frequencies in the document and the index category frequencies (ICF):
The document to be classified is first formalized as a document vector (the expression is not reproduced in this text). Classifying it consists of mapping it into the classification concept space according to the words it contains and their weights.
First, the ability of each word to represent the document, that is, the weight of each word, is computed: the frequency of the word in the document is multiplied by the word's ICF and the result is normalized. Here $p_1, p_2, \ldots, p_g$ are the weights of word 1, word 2, ..., word $g$.
Next, the document to be classified is itself mapped into the classification concept space: the vector of each word in the classification concept space is multiplied by the weight of that word in the document, and all the resulting vectors are summed:

$$\vec{d} = \sum_{i} p_i \, \vec{t_i}$$

where $\vec{t_i}$ is the vector representation of word $t_i$ in the classification concept space and $p_i$ is the weight of word $t_i$; the sum is the representation of the document to be classified in the classification concept space.
Step 8, obtain the category of the document to be classified directly from the magnitude of each component of its vector:
Finally, each component of the document vector measures how strongly the document tends toward the corresponding category; the larger the value, the stronger the tendency. For single-label text classification, the category corresponding to the largest component is the classification result. Thus, once the document to be classified has been mapped into the classification concept space, the whole classification is complete, which is also why the invention performs efficiently.
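The final decision is an argmax over the components of the document vector. A minimal sketch, with a hypothetical function name and toy values:

```python
def classify(doc_vector, categories):
    """Return the category whose component in doc_vector is largest."""
    best = max(range(len(doc_vector)), key=lambda k: doc_vector[k])
    return categories[best]

# Hypothetical document vector over two categories.
label = classify([0.25, 0.75], ["finance", "sports"])
```

If several components tie for the maximum, this sketch returns the first of them; the patent does not say how ties are resolved.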
Claims (4)
1. An automatic text classification method based on the classification concept space, which maps words into the classification concept space and obtains the classification result by accumulating the vectors of the effective words in a text, characterized in that the method is divided into a training stage and a classification stage and comprises the following steps:
Step 1) construct the category-word matrix data;
Step 2) build the index category frequency (ICF) data table of each word from the category-word matrix;
Step 3) build the effective word set from the ICF data table;
Step 4) rebuild the category-word matrix data from the effective word set;
Step 5) build the index term frequency (ITF) data table of each category from the rebuilt category-word matrix;
Step 6) build the word vector representation in the classification concept space from the category-word matrix and the ITF data table;
Step 7) construct the vector data of the document to be classified in the classification concept vector space from the word frequencies in that document and the ICFs;
Step 8) obtain the category of the document to be classified directly from the magnitude of each component of its vector.
2. The automatic text classification method based on the classification concept space according to claim 1, characterized in that in step 6) the word vector representation in the classification concept space is built from the category-word matrix and the ITF data table as follows: the frequency of the word in each category is multiplied by that category's ITF and the result is normalized, giving the vector representation of the word in the classification concept space.
3. The automatic text classification method based on the classification concept space according to claim 1, characterized in that in step 7) the vector data of the document to be classified is constructed from the word frequencies in the document and the ICFs as follows: the frequency of each word in the document is multiplied by the word's ICF and normalized, giving the weight of the word in the document; the word vectors are then summed according to these weights, giving the vector representation of the document in the classification concept space; in a knowledge representation system a category is an aggregate of concepts, and the whole model is designed as a multidimensional space, which is why it is called the classification concept space.
4. The automatic text classification method based on the classification concept space according to claim 1, characterized in that in step 8) the category of the document to be classified is obtained directly from the magnitude of each component of its vector: the category corresponding to the largest component of the document vector is the category of the document; that is, once the document has been mapped into the classification concept vector space, classification is complete.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510086462 CN1936887A (en) | 2005-09-22 | 2005-09-22 | Automatic text classification method based on classification concept space |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1936887A true CN1936887A (en) | 2007-03-28 |
Family
ID=37954392
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510086462 Pending CN1936887A (en) | 2005-09-22 | 2005-09-22 | Automatic text classification method based on classification concept space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1936887A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100462979C (en) * | 2007-06-26 | 2009-02-18 | 腾讯科技(深圳)有限公司 | Distributed indesx file searching method, searching system and searching server |
CN101587493B (en) * | 2009-06-29 | 2012-07-04 | 中国科学技术大学 | Text classification method |
CN102135961B (en) * | 2010-01-22 | 2013-03-20 | 北京金山软件有限公司 | Method and device for determining domain feature words |
CN106528595A (en) * | 2016-09-23 | 2017-03-22 | 中国农业科学院农业信息研究所 | Website homepage content based field information collection and association method |
CN106528595B (en) * | 2016-09-23 | 2019-08-06 | 中国农业科学院农业信息研究所 | Realm information based on website homepage content is collected and correlating method |
CN110209805A (en) * | 2018-04-26 | 2019-09-06 | 腾讯科技(深圳)有限公司 | File classification method, device, storage medium and computer equipment |
CN110209805B (en) * | 2018-04-26 | 2023-11-28 | 腾讯科技(深圳)有限公司 | Text classification method, apparatus, storage medium and computer device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |