CN103207893A - Classification method of two types of texts on basis of vector group mapping - Google Patents

Classification method of two types of texts on basis of vector group mapping

Info

Publication number
CN103207893A
CN103207893A
Authority
CN
China
Prior art keywords
text
class
vector
matrix
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100804554A
Other languages
Chinese (zh)
Other versions
CN103207893B (en)
Inventor
李玉鑑
王影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201310080455.4A priority Critical patent/CN103207893B/en
Publication of CN103207893A publication Critical patent/CN103207893A/en
Application granted granted Critical
Publication of CN103207893B publication Critical patent/CN103207893B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for classifying texts into two classes on the basis of vector group mapping. The method comprises: collecting a text data set and dividing it into a training set and a test set; preprocessing the data set; extracting an overall vocabulary and counting word frequencies; performing feature selection on the training sample set to obtain a feature vector table; representing each sample in the data set as a vector of feature weights; representing the training sample set and the test sample set each as a vector group; representing the positive-class samples and negative-class samples of the training set each as a complete matrix; mapping the positive-class text matrix and the negative-class text matrix of the training samples each to a vector; and judging the class of each test sample with a nearest neighbor algorithm. By using the tf*rf feature weighting scheme and representing the positive-class and negative-class texts as vector groups, the method extracts text features with strong adaptivity and good classification performance, represents text information comprehensively, and, through the mapping transformation of the vector groups, simplifies the classification procedure and increases classification speed.

Description

Classification method for two classes of texts based on vector group mapping
Technical field
The invention belongs to the field of electronic information technology, and specifically relates to a method for classifying texts into two classes based on vector group mapping.
Background technology
Text classification is the task of automatically assigning class labels to a set of texts with a computer according to some standard. It has important applications in fields such as information retrieval, text mining, and intelligence analysis, and involves key techniques such as text representation, feature selection, classification models, and evaluation methods. The process of text classification is shown in Figure 1. First, the texts are preprocessed and represented as feature vectors; then a classifier is constructed by training; finally, the classifier is used to classify new texts.
At present, commonly used text classifiers include naive Bayes, support vector machines (SVM), and K nearest neighbors (KNN). Among these, the KNN method is simple, classifies well, and works robustly across different data sets. The nearest neighbor method is a special case of KNN: its basic idea is to find the training sample nearest to the test sample and then assign the test sample the class of that nearest sample. The method has two main drawbacks. First, because it judges the class of a test sample only from the single nearest training sample, it amplifies the interference of noisy data and can reduce classification accuracy. Second, because the traditional nearest neighbor method has no training stage, all computation happens at classification time, so its real-time performance is poor. When the training set contains many documents the computational cost is huge, and as the training set grows classification becomes very slow or even infeasible. This is a major defect of the nearest neighbor method. Current work reduces the computational cost of the nearest neighbor method mainly in two ways: first, reducing the size of the training set by removing noisy data; second, improving the similarity computation and the search algorithm, lowering the complexity of the similarity calculation and replacing global search with local search. Although existing algorithms can effectively reduce the cost of nearest neighbor search, most of them cannot guarantee a globally optimal search and are not suitable for massive data and high-dimensional spaces.
Summary of the invention
To address the weak resistance to noisy-data interference and the large classification-time computational cost of nearest-neighbor text classification described above, the present invention proposes a method that judges the class of a test sample from the global feature information of the positive-class and negative-class samples, thereby reducing the dependence on individual samples at classification time and shortening the classification time.
Basic principle of the invention: features are extracted from each text, each text is represented as a feature vector, and each class of texts is then represented as a vector group. Through a mapping transformation, the vector group of each class is mapped to a corresponding class vector. By computing the Euclidean distances between vectors, the method determines whether the test sample is nearer to the positive-class column vector or to the negative-class column vector, and assigns the test sample the class of the nearest column vector.
A method for classifying texts into two classes based on vector group mapping, characterized by comprising the following steps:
Step 1: collect a data set, and divide the collected data set into a training sample set and a test sample set.
Step 2: preprocess the data set, as follows:
Unstructured data are converted into structured data, yielding initial data samples that contain field information such as each text attribute and can be used for model building or model application in classification. The structured data samples are then segmented into words and given rough processing consisting of lowercasing, stop word removal, punctuation deletion, and stemming, and the word frequencies of each test sample and training sample are counted.
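As an illustration only, a minimal preprocessing sketch in Python follows; the NLTK Porter stemmer and the stop word list here are assumptions, since the patent does not name any particular tools.

import re
from collections import Counter

from nltk.stem import PorterStemmer  # stemming library; an assumption, not named by the patent

# Illustrative stop word subset; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def preprocess(raw_text):
    """Rough processing from step 2: lowercase, delete punctuation,
    remove stop words, and reduce words to their stems."""
    stemmer = PorterStemmer()
    cleaned = re.sub(r"[^\w\s]", " ", raw_text.lower())  # drop punctuation
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

def word_frequencies(raw_text):
    """Word frequency count of one sample, used later as tf in step 5."""
    return Counter(preprocess(raw_text))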
Step 3: extract the overall vocabulary from the training sample set, as follows:
For each term in the overall vocabulary, count the number of positive-class samples and the number of negative-class samples that contain the term. Terms whose positive-class and negative-class document frequencies in the training set are both less than 3 are filtered out, yielding the document frequency table.
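A sketch of this filtering, under the assumption that each training sample is available as a (term list, label) pair; the data layout and names are illustrative, not prescribed by the patent.

from collections import Counter

def document_frequency_table(samples, min_df=3):
    """samples: iterable of (terms, label) pairs, label '+' or '-'.
    A term is kept only if its positive-class or negative-class
    document frequency reaches min_df, as required by step 3."""
    pos_df, neg_df = Counter(), Counter()
    for terms, label in samples:
        for term in set(terms):  # document frequency counts each text once
            (pos_df if label == "+" else neg_df)[term] += 1
    vocabulary = set(pos_df) | set(neg_df)
    return {t: (pos_df[t], neg_df[t]) for t in vocabulary
            if pos_df[t] >= min_df or neg_df[t] >= min_df}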
Step 4: perform feature selection on the training sample set to obtain the feature vector table, as follows:
For each term in the overall vocabulary obtained in step 3, compute the χ² statistic χ²(t, c_i) of term t for text category c_i. The higher the χ² statistic, the stronger the correlation between the term and the class, and the more class information the term carries. The formula is:
χ²(t, c_i) = N × (AD − CB)² / ((A + C) × (B + D) × (A + B) × (C + D))
where N is the total number of texts in the training set, A is the document frequency of texts that belong to class c_i and contain t, B is the document frequency of texts that do not belong to class c_i but contain t, C is the document frequency of texts that belong to class c_i but do not contain t, and D is the document frequency of texts that neither belong to class c_i nor contain t.
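A direct transcription of the formula as a sketch; A, B, C, and D come from the document frequency table of step 3.

def chi_square(A, B, C, D):
    """Chi-square statistic of term t for class c_i.
    A: texts in c_i containing t        B: texts outside c_i containing t
    C: texts in c_i not containing t    D: texts outside c_i not containing t"""
    N = A + B + C + D  # total number of training texts
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    # Zero-denominator guard added here for safety; not part of the formula.
    return 0.0 if denominator == 0 else N * (A * D - C * B) ** 2 / denominator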
Step 5: assign weights to the feature terms of all samples in the data set to obtain the vector representation of each sample. The invention adopts the term frequency times relevance frequency (tf.rf) weighting scheme, where tf is the term frequency and rf is the relevance frequency. For term t_k, let ω_k be the weight of text d with respect to t_k; the vector representation of text d is then d = (ω_1, ω_2, ..., ω_n). According to tf.rf, the weight ω_k is computed as:
ω_k = tf_k * rf_k
where tf_k is the frequency of term t_k in document d, obtained in step 2, and rf_k is computed as:
rf_k = log2(2 + a_k / max(1, c_k))
where a_k is the number of positive-class texts in the training document set that contain term t_k, b_k is the number of positive-class texts that do not contain t_k, c_k is the number of negative-class texts that contain t_k, and d_k is the number of negative-class texts that do not contain t_k.
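The weighting itself is a short function; a sketch:

import math

def tf_rf(tf_k, a_k, c_k):
    """tf.rf weight of term t_k in document d.
    tf_k: frequency of t_k in d (step 2); a_k / c_k: number of positive /
    negative training texts containing t_k (b_k and d_k do not appear
    in the formula)."""
    rf_k = math.log2(2 + a_k / max(1, c_k))
    return tf_k * rf_k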
Step 6: represent the training sample set as one vector group, and represent the test sample set as another vector group.
Step 7: divide the training sample vector group obtained in step 6 into two groups according to the positive/negative class labels, and represent the vector group of all positive-class texts and that of all negative-class texts each as a complete matrix, obtaining the matrix representations of the positive-class texts and the negative-class texts.
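A one-line sketch of this stacking with NumPy, placing the feature vectors of one class as the columns of a complete matrix; the column orientation is a choice made here for consistency with the step 8 sketch below, not something the patent fixes.

import numpy as np

def class_matrix(vector_group):
    """Stack the feature vectors of one class (a vector group)
    as the columns of one complete matrix."""
    return np.column_stack(vector_group)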
Step 8: map the positive-class text matrix and the negative-class text matrix of the training samples each to a vector. The matrix-to-vector mapping is as follows:
(1) Perform singular value decomposition (SVD) on each of the two matrices. For example, SVD decomposes a matrix M into a product of three matrices:
M = U * S * V
where, if M is m*n, then U is m*m, V is n*n, and S is m*n. The singular values lie on the diagonal of S and are non-negative and sorted in descending order.
(2) After performing SVD on the positive-class text matrix and the negative-class text matrix respectively, arrange the upper triangular elements of the decomposed left matrix U row by row into a column vector. These two column vectors are the required mapped column vectors, called the positive-class column vector and the negative-class column vector.
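A sketch of the matrix-to-vector mapping with NumPy; the upper triangle of U is read row by row, as the text specifies.

import numpy as np

def matrix_to_class_vector(M):
    """Map a class text matrix M (m x n) to its class column vector:
    full SVD, then the upper triangular elements of the left matrix U
    stacked row by row into one vector."""
    U, S, Vt = np.linalg.svd(M)          # U is m x m; singular values descending
    rows, cols = np.triu_indices(U.shape[0])
    return U[rows, cols]                 # row-major upper triangle of U

# positive_vector = matrix_to_class_vector(positive_matrix)
# negative_vector = matrix_to_class_vector(negative_matrix)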
Step 9: for each test sample, use the nearest neighbor algorithm to judge its class; that is, compute the Euclidean distances between vectors to determine whether the test sample is nearer to the positive-class column vector or to the negative-class column vector, and assign the test sample the class of the nearest column vector.
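The final decision then costs only two Euclidean distances per test sample. A sketch, under the assumption that the test sample has been brought into the same space as the two class column vectors; the patent leaves this alignment implicit, and one reading is to map each test sample through the same transformation as in step 8.

import numpy as np

def classify(test_vector, positive_vector, negative_vector):
    """Nearest neighbor decision of step 9: assign the class of whichever
    class column vector lies nearer in Euclidean distance."""
    d_pos = np.linalg.norm(test_vector - positive_vector)
    d_neg = np.linalg.norm(test_vector - negative_vector)
    return "+" if d_pos <= d_neg else "-"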
Compared with the prior art, the present invention has the following remarkable advantages and beneficial effects:
In the classification process, the present invention no longer needs to examine training samples one by one to find the nearest neighbor of a test sample and then take that neighbor's class; instead, all positive-class samples and all negative-class samples are each treated as a single sample, so finding the nearest sample of a test sample requires only two distance computations (in the two-class case), which greatly reduces the amount of computation and speeds up classification. At the same time, the invention avoids, to a certain extent, the chance effects of relying on the class of the single sample nearest to the test sample: by representing the information of the positive-class and negative-class texts with the mapped vectors of all positive-class and negative-class texts, it makes effective use of the global feature information of both classes and improves classification accuracy. Moreover, the idea of text classification through vector group mapping proposed in the invention applies not only to two-class text classification but also to multi-class text classification, so it extends well.
Description of drawings
Fig. 1 is a block diagram of the text classification process involved in the present invention;
Fig. 2 is a flow chart of the method of the present invention.
Embodiment
The invention is further described below in conjunction with the drawings and a specific embodiment.
The flow chart of the two-class text classification method based on vector group mapping is shown in Figure 2. The method comprises the following steps:
Step 1: collect a data set, and divide the collected data set into a training sample set and a test sample set.
Step 2: preprocess the data set.
Step 3: extract the overall vocabulary and count word frequencies to obtain the document frequency table.
Step 4: perform feature selection on the training sample set to obtain the feature vector table.
Step 5: assign weights to the feature terms of all samples in the data set to obtain the vector representation of each sample.
Step 6: represent the training sample set as one vector group and the test sample set as another vector group.
Step 7: represent the positive-class and negative-class samples of the training set each as a complete matrix.
Step 8: map the positive-class text matrix and the negative-class text matrix of the training samples each to a vector.
Step 9: for each test sample, use the nearest neighbor algorithm to judge its class.
An example of classifying texts with the present invention is given below.
The Reuters data set was collected from the UCI data set website, with 6,574 texts serving as the training set and the remaining 2,534 texts as the test set. The invention uses the 10 classes with the most texts in the Reuters data set: acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat. The details of each class are shown in Table 1:
Table 1: information for each class of texts
Because the present invention addresses the two-class text classification problem while the data set contains samples of 10 classes, one class is designated as the positive class in each experiment, and the samples of the remaining classes serve as negative-class samples.
For the Reuters data set, 10 groups of experiments were run, each designating a different class as the positive class, in order to compare the effectiveness of the classifiers. For example, when acq is the positive class, the remaining 9 classes all serve as the negative class. Each experiment uses 6,574 texts as training samples and 2,534 samples as test samples. The nearest neighbor classifier and the classifier based on vector group mapping were each evaluated with three indices: precision, recall, and the F1-measure. The experimental results are shown in Table 2.
Table 2: experimental results with precision, recall, and the F1-measure as evaluation indices
The experimental results show that, whichever class is selected as the positive class, the text classification algorithm based on vector group mapping is generally better than the nearest-neighbor-based algorithm on all three indices. When ship is the positive class, the imbalance of the data causes the nearest neighbor classifier to assign every sample to the negative class; the recall is then 0 and the F1-measure cannot be computed, so this group of data is excluded when computing the averages of the evaluation indices for the nearest neighbor classifier. On the same data, the classifier of the present invention still maintains stable classification performance. As the table shows, whether or not the data samples are balanced, the classifier of the invention keeps the average precision of classification above 93%, with an average recall of 84.1% and an average F1 value of 0.5888, which fully demonstrates the validity and superiority of the method.
The above embodiment serves only to illustrate the present invention and does not restrict the technical scheme described herein. Accordingly, all technical schemes and improvements thereof that do not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims (1)

1. A method for classifying texts into two classes based on vector group mapping, characterized by comprising the following steps:
Step 1: collect a data set, and divide the collected data set into a training sample set and a test sample set;
Step 2: preprocess the data set, as follows:
Unstructured data are converted into structured data, yielding initial data samples that contain field information such as each text attribute and can be used for model building or model application in classification; the structured data samples are segmented into words and given rough processing consisting of lowercasing, stop word removal, punctuation deletion and stemming, and the word frequencies of each test sample and training sample are counted;
Step 3: extract the overall vocabulary from the training sample set, as follows:
For each term in the overall vocabulary, count the number of positive-class samples and the number of negative-class samples that contain the term, filter out the terms whose positive-class and negative-class document frequencies in the training set are both less than 3, and obtain the document frequency table;
Step 4: perform feature selection on the training sample set to obtain the feature vector table, as follows:
For each term in the overall vocabulary obtained in step 3, compute the χ² statistic χ²(t, c_i) of term t for text category c_i; the higher the χ² statistic, the stronger the correlation between the term and the class, and the more class information the term carries; the formula is:
χ²(t, c_i) = N × (AD − CB)² / ((A + C) × (B + D) × (A + B) × (C + D))
where N is the total number of texts in the training set, A is the document frequency of texts that belong to class c_i and contain t, B is the document frequency of texts that do not belong to class c_i but contain t, C is the document frequency of texts that belong to class c_i but do not contain t, and D is the document frequency of texts that neither belong to class c_i nor contain t;
Step 5: assign weights to the feature terms of all samples in the data set to obtain the vector representation of each sample, as follows:
Adopt the term frequency times relevance frequency (tf.rf) weighting scheme, where tf is the term frequency and rf is the relevance frequency; for term t_k, let ω_k be the weight of text d with respect to t_k, generating the vector representation d = (ω_1, ω_2, ..., ω_n) of text d; according to tf.rf, the weight ω_k is computed as:
ω_k = tf_k * rf_k
where tf_k is the frequency of term t_k in document d, obtained in step 2, and rf_k is computed as:
rf_k = log2(2 + a_k / max(1, c_k))
where a_k is the number of positive-class texts in the training document set that contain term t_k, b_k is the number of positive-class texts that do not contain t_k, c_k is the number of negative-class texts that contain t_k, and d_k is the number of negative-class texts that do not contain t_k;
Step 6: represent the training sample set as one vector group, and represent the test sample set as another vector group;
Step 7: divide the training sample vector group obtained in step 6 into two groups according to the positive/negative class labels, and represent the vector group of all positive-class texts and that of all negative-class texts each as a complete matrix, obtaining the matrix representations of the positive-class texts and the negative-class texts;
Step 8: map the positive-class text matrix and the negative-class text matrix of the training samples each to a vector, as follows:
(1) Perform singular value decomposition (SVD) on each of the two matrices; for example, SVD decomposes a matrix M into a product of three matrices:
M = U * S * V
where, if M is m*n, then U is m*m, V is n*n, and S is m*n; the singular values lie on the diagonal of S and are non-negative and sorted in descending order;
(2) After performing SVD on the positive-class text matrix and the negative-class text matrix respectively, arrange the upper triangular elements of the decomposed left matrix U row by row into a column vector; these two column vectors are the required mapped column vectors, called the positive-class column vector and the negative-class column vector;
Step 9: for each test sample, use the nearest neighbor algorithm to judge its class, that is, compute the Euclidean distances between vectors to determine whether the test sample is nearer to the positive-class column vector or to the negative-class column vector, and assign the test sample the class of the nearest column vector.
CN201310080455.4A 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping Expired - Fee Related CN103207893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310080455.4A CN103207893B (en) 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310080455.4A CN103207893B (en) 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping

Publications (2)

Publication Number Publication Date
CN103207893A true CN103207893A (en) 2013-07-17
CN103207893B CN103207893B (en) 2016-05-25

Family

ID=48755115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310080455.4A Expired - Fee Related CN103207893B (en) 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping

Country Status (1)

Country Link
CN (1) CN103207893B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN105069476A (en) * 2015-08-10 2015-11-18 国网宁夏电力公司 Method for identifying abnormal wind power data based on two-stage integration learning
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305268A (en) * 2007-06-08 2008-12-18 Hitachi Ltd Document classification device and classification method
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN102662976A (en) * 2012-03-12 2012-09-12 浙江工业大学 Text feature weighting method based on supervision

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305268A (en) * 2007-06-08 2008-12-18 Hitachi Ltd Document classification device and classification method
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN102662976A (en) * 2012-03-12 2012-09-12 浙江工业大学 Text feature weighting method based on supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李玉鑑 et al.: "A joint feature extraction method based on DF and CHI and its application", Journal of Beijing University of Technology *
田东风 et al.: "Application of matrix singular value decomposition theory in Chinese text classification", Mathematics in Practice and Theory *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN103901888B (en) * 2014-03-18 2017-01-25 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN105069476A (en) * 2015-08-10 2015-11-18 国网宁夏电力公司 Method for identifying abnormal wind power data based on two-stage integration learning
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium
CN107273416B (en) * 2017-05-05 2021-05-04 深信服科技股份有限公司 Webpage hidden link detection method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN103207893B (en) 2016-05-25

Similar Documents

Publication Publication Date Title
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN104573046B (en) A kind of comment and analysis method and system based on term vector
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
Al Qadi et al. Arabic text classification of news articles using classical supervised classifiers
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN107944480A (en) A kind of enterprises ' industry sorting technique
CN101604322B (en) Decision level text automatic classified fusion method
CN103617429A (en) Sorting method and system for active learning
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN109960799A (en) A kind of Optimum Classification method towards short text
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN108363810A (en) A kind of file classification method and device
CN103324628A (en) Industry classification method and system for text publishing
CN105426426A (en) KNN text classification method based on improved K-Medoids
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN103617435A (en) Image sorting method and system for active learning
CN101876987A (en) Overlapped-between-clusters-oriented method for classifying two types of texts
CN105260437A (en) Text classification feature selection method and application thereof to biomedical text classification
CN106021578A (en) Improved text classification algorithm based on integration of cluster and membership degree
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN104346459A (en) Text classification feature selecting method based on term frequency and chi-square statistics

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160525

Termination date: 20200313