CN103207893A - Classification method of two types of texts on basis of vector group mapping - Google Patents

Classification method of two types of texts on basis of vector group mapping

Info

Publication number
CN103207893A
CN103207893A
Authority
CN
China
Prior art keywords
text
class
vector
matrix
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100804554A
Other languages
Chinese (zh)
Other versions
CN103207893B (en)
Inventor
李玉鑑
王影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201310080455.4A priority Critical patent/CN103207893B/en
Publication of CN103207893A publication Critical patent/CN103207893A/en
Application granted granted Critical
Publication of CN103207893B publication Critical patent/CN103207893B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for classifying texts into two classes on the basis of vector group mapping. The method comprises: collecting a text data set and dividing it into a training set and a test set; preprocessing the data set; extracting an overall vocabulary and counting word frequencies; performing feature selection on the training sample set to obtain a feature vector table; representing each sample in the data set as a vector of feature weights; representing the training sample set and the test sample set each as a vector group; representing the positive-class samples and negative-class samples of the training set each as a complete matrix; mapping the positive-class text matrix and the negative-class text matrix of the training samples each to a vector; and judging the class of each test sample with a nearest neighbor algorithm. By using the tf*rf feature weighting scheme and representing the positive-class and negative-class texts as vector groups, the method extracts text features with strong adaptivity and good classification performance, represents text information comprehensively, and, through the mapping transformation of the vector groups, simplifies the classification procedure and increases classification speed.

Description

Classification method for two classes of texts based on vector group mapping
Technical field
The invention belongs to the field of electronic information technology, and specifically relates to a method for classifying texts into two classes based on vector group mapping.
Background technology
Text classification is the task of automatically assigning class labels to a set of texts with a computer according to some standard. It has important applications in fields such as information retrieval, text mining, and intelligence analysis, and involves key techniques such as text representation, feature selection, classification models, and evaluation methods. The process of text classification is shown in Figure 1. First, the texts are preprocessed and represented as feature vectors; then a classifier is constructed by training; finally, the classifier is used to classify new texts.
At present, commonly used text classifiers include naive Bayes, support vector machines (SVM), and K nearest neighbors (KNN). Among these, the KNN method is simple, classifies well, and works robustly across different data sets. The nearest neighbor method is a special case of KNN: its basic idea is to find the training sample nearest to the test sample and then assign the test sample the class of that nearest sample. The method has two main drawbacks. First, because it judges the class of a test sample only from the single nearest training sample, it amplifies the interference of noisy data and can reduce classification accuracy. Second, because the traditional nearest neighbor method has no training stage, all computation happens at classification time, so its real-time performance is poor. When the training set contains many documents the computational cost is huge, and as the training set grows classification becomes very slow or even infeasible. This is a major defect of the nearest neighbor method. Current work reduces the computational cost of the nearest neighbor method mainly in two ways: first, reducing the size of the training set by removing noisy data; second, improving the similarity computation and the search algorithm, lowering the complexity of the similarity calculation and replacing global search with local search. Although existing algorithms can effectively reduce the cost of nearest neighbor search, most of them cannot guarantee a globally optimal search and are not suitable for massive data and high-dimensional spaces.
Summary of the invention
To address the weak resistance to noisy-data interference and the large classification-time computational cost of nearest-neighbor text classification described above, the present invention proposes a method that judges the class of a test sample from the global feature information of the positive-class and negative-class samples, thereby reducing the dependence on individual samples at classification time and shortening the classification time.
Basic principle of the invention: features are extracted from each text, each text is represented as a feature vector, and each class of texts is then represented as a vector group. Through a mapping transformation, the vector group of each class is mapped to a corresponding class vector. By computing the Euclidean distances between vectors, the method determines whether the test sample is nearer to the positive-class column vector or to the negative-class column vector, and assigns the test sample the class of the nearest column vector.
A method for classifying texts into two classes based on vector group mapping, characterized by comprising the following steps:
Step 1: collect a data set, and divide the collected data set into a training sample set and a test sample set.
Step 2: preprocess the data set, as follows:
Unstructured data are converted into structured data, yielding initial data samples that contain field information such as each text attribute and can be used for model building or model application in classification. The structured data samples are then segmented into words and given rough processing consisting of lowercasing, stop word removal, punctuation deletion, and stemming, and the word frequencies of each test sample and training sample are counted.
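As an illustration only, a minimal preprocessing sketch in Python follows; the NLTK Porter stemmer and the stop word list here are assumptions, since the patent does not name any particular tools.

import re
from collections import Counter

from nltk.stem import PorterStemmer  # stemming library; an assumption, not named by the patent

# Illustrative stop word subset; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def preprocess(raw_text):
    """Rough processing from step 2: lowercase, delete punctuation,
    remove stop words, and reduce words to their stems."""
    stemmer = PorterStemmer()
    cleaned = re.sub(r"[^\w\s]", " ", raw_text.lower())  # drop punctuation
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

def word_frequencies(raw_text):
    """Word frequency count of one sample, used later as tf in step 5."""
    return Counter(preprocess(raw_text))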
Step 3: extract the overall vocabulary from the training sample set, as follows:
For each term in the overall vocabulary, count the number of positive-class samples and the number of negative-class samples that contain the term. Terms whose positive-class and negative-class document frequencies in the training set are both less than 3 are filtered out, yielding the document frequency table.
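A sketch of this filtering, under the assumption that each training sample is available as a (term list, label) pair; the data layout and names are illustrative, not prescribed by the patent.

from collections import Counter

def document_frequency_table(samples, min_df=3):
    """samples: iterable of (terms, label) pairs, label '+' or '-'.
    A term is kept only if its positive-class or negative-class
    document frequency reaches min_df, as required by step 3."""
    pos_df, neg_df = Counter(), Counter()
    for terms, label in samples:
        for term in set(terms):  # document frequency counts each text once
            (pos_df if label == "+" else neg_df)[term] += 1
    vocabulary = set(pos_df) | set(neg_df)
    return {t: (pos_df[t], neg_df[t]) for t in vocabulary
            if pos_df[t] >= min_df or neg_df[t] >= min_df}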
Step 4: perform feature selection on the training sample set to obtain the feature vector table, as follows:
For each term in the overall vocabulary obtained in step 3, compute the χ² statistic χ²(t, c_i) of term t for text category c_i. The higher the χ² statistic, the stronger the correlation between the term and the class, and the more class information the term carries. The formula is:
χ²(t, c_i) = N × (AD − CB)² / ((A + C) × (B + D) × (A + B) × (C + D))
where N is the total number of texts in the training set, A is the document frequency of texts that belong to class c_i and contain t, B is the document frequency of texts that do not belong to class c_i but contain t, C is the document frequency of texts that belong to class c_i but do not contain t, and D is the document frequency of texts that neither belong to class c_i nor contain t.
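A direct transcription of the formula as a sketch; A, B, C, and D come from the document frequency table of step 3.

def chi_square(A, B, C, D):
    """Chi-square statistic of term t for class c_i.
    A: texts in c_i containing t        B: texts outside c_i containing t
    C: texts in c_i not containing t    D: texts outside c_i not containing t"""
    N = A + B + C + D  # total number of training texts
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    # Zero-denominator guard added here for safety; not part of the formula.
    return 0.0 if denominator == 0 else N * (A * D - C * B) ** 2 / denominator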
Step 5: assign weights to the feature terms of all samples in the data set to obtain the vector representation of each sample. The invention adopts the term frequency times relevance frequency (tf.rf) weighting scheme, where tf is the term frequency and rf is the relevance frequency. For term t_k, let ω_k be the weight of text d with respect to t_k; the vector representation of text d is then d = (ω_1, ω_2, ..., ω_n). According to tf.rf, the weight ω_k is computed as:
ω_k = tf_k * rf_k
where tf_k is the frequency of term t_k in document d, obtained in step 2, and rf_k is computed as:
rf_k = log2(2 + a_k / max(1, c_k))
where a_k is the number of positive-class texts in the training document set that contain term t_k, b_k is the number of positive-class texts that do not contain t_k, c_k is the number of negative-class texts that contain t_k, and d_k is the number of negative-class texts that do not contain t_k.
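The weighting itself is a short function; a sketch:

import math

def tf_rf(tf_k, a_k, c_k):
    """tf.rf weight of term t_k in document d.
    tf_k: frequency of t_k in d (step 2); a_k / c_k: number of positive /
    negative training texts containing t_k (b_k and d_k do not appear
    in the formula)."""
    rf_k = math.log2(2 + a_k / max(1, c_k))
    return tf_k * rf_k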
Step 6: represent the training sample set as one vector group, and represent the test sample set as another vector group.
Step 7: divide the training sample vector group obtained in step 6 into two groups according to the positive/negative class labels, and represent the vector group of all positive-class texts and that of all negative-class texts each as a complete matrix, obtaining the matrix representations of the positive-class texts and the negative-class texts.
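A one-line sketch of this stacking with NumPy, placing the feature vectors of one class as the columns of a complete matrix; the column orientation is a choice made here for consistency with the step 8 sketch below, not something the patent fixes.

import numpy as np

def class_matrix(vector_group):
    """Stack the feature vectors of one class (a vector group)
    as the columns of one complete matrix."""
    return np.column_stack(vector_group)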
Step 8: map the positive-class text matrix and the negative-class text matrix of the training samples each to a vector. The matrix-to-vector mapping is as follows:
(1) Perform singular value decomposition (SVD) on each of the two matrices. For example, SVD decomposes a matrix M into a product of three matrices:
M = U * S * V
where, if M is m*n, then U is m*m, V is n*n, and S is m*n. The singular values lie on the diagonal of S and are non-negative and sorted in descending order.
(2) After performing SVD on the positive-class text matrix and the negative-class text matrix respectively, arrange the upper triangular elements of the decomposed left matrix U row by row into a column vector. These two column vectors are the required mapped column vectors, called the positive-class column vector and the negative-class column vector.
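A sketch of the matrix-to-vector mapping with NumPy; the upper triangle of U is read row by row, as the text specifies.

import numpy as np

def matrix_to_class_vector(M):
    """Map a class text matrix M (m x n) to its class column vector:
    full SVD, then the upper triangular elements of the left matrix U
    stacked row by row into one vector."""
    U, S, Vt = np.linalg.svd(M)          # U is m x m; singular values descending
    rows, cols = np.triu_indices(U.shape[0])
    return U[rows, cols]                 # row-major upper triangle of U

# positive_vector = matrix_to_class_vector(positive_matrix)
# negative_vector = matrix_to_class_vector(negative_matrix)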
Step 9: for each test sample, use the nearest neighbor algorithm to judge its class; that is, compute the Euclidean distances between vectors to determine whether the test sample is nearer to the positive-class column vector or to the negative-class column vector, and assign the test sample the class of the nearest column vector.
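The final decision then costs only two Euclidean distances per test sample. A sketch, under the assumption that the test sample has been brought into the same space as the two class column vectors; the patent leaves this alignment implicit, and one reading is to map each test sample through the same transformation as in step 8.

import numpy as np

def classify(test_vector, positive_vector, negative_vector):
    """Nearest neighbor decision of step 9: assign the class of whichever
    class column vector lies nearer in Euclidean distance."""
    d_pos = np.linalg.norm(test_vector - positive_vector)
    d_neg = np.linalg.norm(test_vector - negative_vector)
    return "+" if d_pos <= d_neg else "-"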
Compared with the prior art, the present invention has the following remarkable advantages and beneficial effects:
In the classification process, the present invention no longer needs to examine training samples one by one to find the nearest neighbor of a test sample and then take that neighbor's class; instead, all positive-class samples and all negative-class samples are each treated as a single sample, so finding the nearest sample of a test sample requires only two distance computations (in the two-class case), which greatly reduces the amount of computation and speeds up classification. At the same time, the invention avoids, to a certain extent, the chance effects of relying on the class of the single sample nearest to the test sample: by representing the information of the positive-class and negative-class texts with the mapped vectors of all positive-class and negative-class texts, it makes effective use of the global feature information of both classes and improves classification accuracy. Moreover, the idea of text classification through vector group mapping proposed in the invention applies not only to two-class text classification but also to multi-class text classification, so it extends well.
Description of drawings
Fig. 1 is a block diagram of the text classification process involved in the present invention;
Fig. 2 is a flow chart of the method of the present invention.
Embodiment
The invention is further described below in conjunction with the drawings and a specific embodiment.
The flow chart of the two-class text classification method based on vector group mapping is shown in Figure 2. The method comprises the following steps:
Step 1: collect a data set, and divide the collected data set into a training sample set and a test sample set.
Step 2: preprocess the data set.
Step 3: extract the overall vocabulary and count word frequencies to obtain the document frequency table.
Step 4: perform feature selection on the training sample set to obtain the feature vector table.
Step 5: assign weights to the feature terms of all samples in the data set to obtain the vector representation of each sample.
Step 6: represent the training sample set as one vector group and the test sample set as another vector group.
Step 7: represent the positive-class and negative-class samples of the training set each as a complete matrix.
Step 8: map the positive-class text matrix and the negative-class text matrix of the training samples each to a vector.
Step 9: for each test sample, use the nearest neighbor algorithm to judge its class.
An example of classifying texts with the present invention is given below.
The Reuters data set was collected from the UCI data set website, with 6,574 texts serving as the training set and the remaining 2,534 texts as the test set. The invention uses the 10 classes with the most texts in the Reuters data set: acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat. The details of each class are shown in Table 1:
Table 1: information for each class of texts
Because the present invention addresses the two-class text classification problem while the data set contains samples of 10 classes, one class is designated as the positive class in each experiment, and the samples of the remaining classes serve as negative-class samples.
For the Reuters data set, 10 groups of experiments were run, each designating a different class as the positive class, in order to compare the effectiveness of the classifiers. For example, when acq is the positive class, the remaining 9 classes all serve as the negative class. Each experiment uses 6,574 texts as training samples and 2,534 samples as test samples. The nearest neighbor classifier and the classifier based on vector group mapping were each evaluated with three indices: precision, recall, and the F1-measure. The experimental results are shown in Table 2.
Table 2: experimental results with precision, recall, and the F1-measure as evaluation indices
The experimental results show that, whichever class is selected as the positive class, the text classification algorithm based on vector group mapping is generally better than the nearest-neighbor-based algorithm on all three indices. When ship is the positive class, the imbalance of the data causes the nearest neighbor classifier to assign every sample to the negative class; the recall is then 0 and the F1-measure cannot be computed, so this group of data is excluded when computing the averages of the evaluation indices for the nearest neighbor classifier. On the same data, the classifier of the present invention still maintains stable classification performance. As the table shows, whether or not the data samples are balanced, the classifier of the invention keeps the average precision of classification above 93%, with an average recall of 84.1% and an average F1 value of 0.5888, which fully demonstrates the validity and superiority of the method.
The above embodiment serves only to illustrate the present invention and does not restrict the technical scheme described herein. Accordingly, all technical schemes and improvements thereof that do not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims (1)

1. A method for classifying texts into two classes based on vector group mapping, characterized by comprising the following steps:
Step 1: collect a data set, and divide the collected data set into a training sample set and a test sample set;
Step 2: preprocess the data set, as follows:
Unstructured data are converted into structured data, yielding initial data samples that contain field information such as each text attribute and can be used for model building or model application in classification; the structured data samples are segmented into words and given rough processing consisting of lowercasing, stop word removal, punctuation deletion and stemming, and the word frequencies of each test sample and training sample are counted;
Step 3: extract the overall vocabulary from the training sample set, as follows:
For each term in the overall vocabulary, count the number of positive-class samples and the number of negative-class samples that contain the term, filter out the terms whose positive-class and negative-class document frequencies in the training set are both less than 3, and obtain the document frequency table;
Step 4: perform feature selection on the training sample set to obtain the feature vector table, as follows:
For each term in the overall vocabulary obtained in step 3, compute the χ² statistic χ²(t, c_i) of term t for text category c_i; the higher the χ² statistic, the stronger the correlation between the term and the class, and the more class information the term carries; the formula is:
χ²(t, c_i) = N × (AD − CB)² / ((A + C) × (B + D) × (A + B) × (C + D))
where N is the total number of texts in the training set, A is the document frequency of texts that belong to class c_i and contain t, B is the document frequency of texts that do not belong to class c_i but contain t, C is the document frequency of texts that belong to class c_i but do not contain t, and D is the document frequency of texts that neither belong to class c_i nor contain t;
Step 5: assign weights to the feature terms of all samples in the data set to obtain the vector representation of each sample, as follows:
Adopt the term frequency times relevance frequency (tf.rf) weighting scheme, where tf is the term frequency and rf is the relevance frequency; for term t_k, let ω_k be the weight of text d with respect to t_k, generating the vector representation d = (ω_1, ω_2, ..., ω_n) of text d; according to tf.rf, the weight ω_k is computed as:
ω_k = tf_k * rf_k
where tf_k is the frequency of term t_k in document d, obtained in step 2, and rf_k is computed as:
rf_k = log2(2 + a_k / max(1, c_k))
where a_k is the number of positive-class texts in the training document set that contain term t_k, b_k is the number of positive-class texts that do not contain t_k, c_k is the number of negative-class texts that contain t_k, and d_k is the number of negative-class texts that do not contain t_k;
Step 6: represent the training sample set as one vector group, and represent the test sample set as another vector group;
Step 7: divide the training sample vector group obtained in step 6 into two groups according to the positive/negative class labels, and represent the vector group of all positive-class texts and that of all negative-class texts each as a complete matrix, obtaining the matrix representations of the positive-class texts and the negative-class texts;
Step 8: map the positive-class text matrix and the negative-class text matrix of the training samples each to a vector, as follows:
(1) Perform singular value decomposition (SVD) on each of the two matrices; for example, SVD decomposes a matrix M into a product of three matrices:
M = U * S * V
where, if M is m*n, then U is m*m, V is n*n, and S is m*n; the singular values lie on the diagonal of S and are non-negative and sorted in descending order;
(2) After performing SVD on the positive-class text matrix and the negative-class text matrix respectively, arrange the upper triangular elements of the decomposed left matrix U row by row into a column vector; these two column vectors are the required mapped column vectors, called the positive-class column vector and the negative-class column vector;
Step 9: for each test sample, use the nearest neighbor algorithm to judge its class, that is, compute the Euclidean distances between vectors to determine whether the test sample is nearer to the positive-class column vector or to the negative-class column vector, and assign the test sample the class of the nearest column vector.
CN201310080455.4A 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping Expired - Fee Related CN103207893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310080455.4A CN103207893B (en) 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310080455.4A CN103207893B (en) 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping

Publications (2)

Publication Number Publication Date
CN103207893A true CN103207893A (en) 2013-07-17
CN103207893B CN103207893B (en) 2016-05-25

Family

ID=48755115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310080455.4A Expired - Fee Related CN103207893B (en) 2013-03-13 2013-03-13 Classification method of two types of texts on basis of vector group mapping

Country Status (1)

Country Link
CN (1) CN103207893B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN105069476A (en) * 2015-08-10 2015-11-18 国网宁夏电力公司 Method for identifying abnormal wind power data based on two-stage integration learning
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305268A (en) * 2007-06-08 2008-12-18 Hitachi Ltd Document classification device and classification method
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN102662976A (en) * 2012-03-12 2012-09-12 浙江工业大学 Text feature weighting method based on supervision

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305268A (en) * 2007-06-08 2008-12-18 Hitachi Ltd Document classification device and classification method
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN102662976A (en) * 2012-03-12 2012-09-12 浙江工业大学 Text feature weighting method based on supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李玉鑑 et al.: "A joint feature extraction method based on DF and CHI and its application", Journal of Beijing University of Technology *
田东风 et al.: "Application of matrix singular value decomposition theory in Chinese text classification", Mathematics in Practice and Theory *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN103901888B (en) * 2014-03-18 2017-01-25 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN105069476A (en) * 2015-08-10 2015-11-18 国网宁夏电力公司 Method for identifying abnormal wind power data based on two-stage integration learning
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium
CN107273416B (en) * 2017-05-05 2021-05-04 深信服科技股份有限公司 Webpage hidden link detection method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN103207893B (en) 2016-05-25

Similar Documents

Publication Publication Date Title
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN104573046B (en) A kind of comment and analysis method and system based on term vector
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
Al Qadi et al. Arabic text classification of news articles using classical supervised classifiers
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN107944480A (en) A kind of enterprises ' industry sorting technique
CN101604322B (en) Decision level text automatic classified fusion method
CN103617429A (en) Sorting method and system for active learning
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN109960799A (en) A kind of Optimum Classification method towards short text
CN103886108B (en) The feature selecting and weighing computation method of a kind of unbalanced text set
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN108363810A (en) A kind of file classification method and device
CN103324628A (en) Industry classification method and system for text publishing
CN105426426A (en) KNN text classification method based on improved K-Medoids
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN103617435A (en) Image sorting method and system for active learning
CN101876987A (en) Overlapped-between-clusters-oriented method for classifying two types of texts
CN105260437A (en) Text classification feature selection method and application thereof to biomedical text classification
CN106021578A (en) Improved text classification algorithm based on integration of cluster and membership degree
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN104346459A (en) Text classification feature selecting method based on term frequency and chi-square statistics

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160525

Termination date: 20200313