CN110837853A - Rapid classification model construction method - Google Patents

Rapid classification model construction method

Info

Publication number
CN110837853A
Authority
CN
China
Prior art keywords
hash
sample
sample points
training
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911037562.2A
Other languages
Chinese (zh)
Inventor
甘涛 (Gan Tao)
王志阳 (Wang Zhiyang)
何艳敏 (He Yanmin)
罗瑜 (Luo Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911037562.2A priority Critical patent/CN110837853A/en
Publication of CN110837853A publication Critical patent/CN110837853A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/214 — Pattern recognition; Analysing; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F16/9014 — Information retrieval; Indexing; Data structures therefor; Storage structures; hash tables
    • G06F16/906 — Details of database functions; Clustering; Classification
    • G06F18/2411 — Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a rapid classification model construction method. A locality-sensitive hashing method maps the training samples to hash values with a moderate number of bits, from which likely boundary sample points are screened out and used for training; this markedly reduces the number of training samples, lowering the computational complexity and speeding up model construction. Potential boundary points are screened out iteratively according to the probability that a sample point is a boundary point, which raises speed while preserving the model's classification accuracy.

Description

Rapid classification model construction method
Technical Field
The invention belongs to the technical field of pattern recognition, and relates to a rapid classification model construction method.
Background
The classification problem consists of training a classification model on existing data and then using the model to predict the class of unknown data. For example, a personal income-grade classification model can be built from census data such as age, education, marital status, occupation, and income; based on this model, a person's income grade can be predicted from his or her non-income attributes. The design and construction of the classification model is the core of the classification problem and a key research topic in computer pattern recognition.
Classical classification models include K-Nearest Neighbors (KNN), the Support Vector Machine (SVM), and neural networks. KNN assigns a sample to the class of its nearest sample or samples; it is conceptually simple and easy to implement, but its computation is dominated by calculating the distance from each sample to be classified to all known samples. The SVM, one of the most robust and accurate classification models, seeks a hyperplane that separates the classes with maximum margin and is widely applied in face recognition, machine fault detection, time-series prediction, bioengineering, and other fields. However, the SVM is solved by quadratic programming, which involves operations on high-order matrices, so model construction is slow when the data volume is large. Neural network models, currently popular, offer good classification performance but still suffer from slow convergence, heavy computation, and long construction times. In short, all the classical classification models face high computational complexity, a problem that grows more acute as data volume and dimensionality keep increasing in the big-data era.
Methods for reducing computational complexity generally fall into two categories: reducing dimensionality and reducing the number of samples. For dimensionality reduction, common techniques include principal component analysis and linear discriminant analysis: the data are projected to a lower dimension and training is performed on the reduced data. The drawback is that dimensionality reduction can alter the data's characteristics and thus lower the model's classification accuracy. For reducing the number of samples, the usual approach is to cluster the samples, find representative points of each cluster, and train only on those representatives, which speeds up model construction. But the clustering computation itself carries a large time overhead, and because the representative points cannot reproduce the distribution of the original data, classification accuracy tends to drop. In short, for large-scale data sets, a model construction method that preserves classification accuracy while processing quickly still needs to be found.
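The clustering-based sample reduction just described can be sketched as follows; the function name and per-class cluster count are illustrative assumptions, not part of the invention:

```python
# Sketch of clustering-based sample reduction: train only on per-class
# cluster centers instead of the full training set.
import numpy as np
from sklearn.cluster import KMeans

def representative_points(X, y, k_per_class=10, seed=0):
    """Replace each class by the centers of k clusters fitted to it."""
    reps_X, reps_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        k = min(k_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        reps_X.append(km.cluster_centers_)
        reps_y.append(np.full(k, label))
    return np.vstack(reps_X), np.concatenate(reps_y)
```

As the paragraph above notes, the cluster centers do not preserve the original data distribution, which is why this kind of reduction can cost classification accuracy.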
Disclosure of Invention
Addressing the shortcomings of the prior art, the rapid classification model construction method provided by the invention offers fast training and low computational complexity on large-scale data sets while preserving classification accuracy.
To achieve this purpose, the invention adopts the following technical scheme: a rapid classification model construction method comprising the following steps:
S1, locality-sensitive hash mapping:
S11, let the total number of training samples be N, the samples comprising positive sample points and negative sample points;
S12, map each training sample point to a hash value of K bits using a standard locality-sensitive hashing method; after the mapping, H distinct hash values are obtained, and with each distinct hash value defining a hash bucket, H hash buckets are obtained;
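Step S12 only calls for "a standard locality sensitive Hash method"; the sketch below assumes a random-hyperplane (sign-based) LSH family, one common choice, with illustrative names:

```python
# Sketch of step S12: map each sample to a K-bit LSH code and group
# samples into buckets by hash value. Sign-based LSH is an assumption.
import numpy as np

def lsh_hash(X, K, seed=0):
    """Map each row of X to a K-bit random-hyperplane LSH code."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], K))  # K random hyperplanes
    bits = (X @ planes >= 0).astype(np.uint8)      # sign of each projection
    # pack the K bits of each sample into one integer hash value
    return np.array([int("".join(map(str, row)), 2) for row in bits])

def to_buckets(hashes):
    """Group sample indices by hash value: one bucket per distinct hash."""
    buckets = {}
    for idx, h in enumerate(hashes):
        buckets.setdefault(h, []).append(idx)
    return buckets
```

The number of distinct hash values produced this way is the bucket count H of the text; nearby points tend to share buckets, which is what makes the later boundary screening cheap.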
S2, count the sample points:
S21, count the number of positive sample points N_i^+ and the number of negative sample points N_i^- in each hash bucket, where i is the index of the hash bucket and 1 ≤ i ≤ H;
S22, compare N_i^+ and N_i^- in each hash bucket and take the smaller of the two, m_i = min(N_i^+, N_i^-);
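Steps S21 and S22 count each bucket's positive and negative sample points and keep the smaller of the two counts; a minimal sketch (the bucket representation and function name are illustrative):

```python
# Sketch of steps S21-S22: per-bucket class counts and their minimum.
# `buckets` maps hash value -> list of sample indices; labels are +1/-1.
from collections import Counter

def bucket_min_counts(buckets, y):
    """Return m[i] = min(N_i^+, N_i^-) for each hash bucket i."""
    m = {}
    for h, idxs in buckets.items():
        counts = Counter(y[i] for i in idxs)
        m[h] = min(counts.get(+1, 0), counts.get(-1, 0))
    return m
```

A bucket whose minimum count is 0 contains only one class and is therefore unlikely to straddle the decision boundary.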
S3, initialization:
S31, define a support vector set S, a boundary sample set B, and a simplified training sample set T, and initialize all three to empty;
S32, define an iteration counter n and set its initial value to 1;
S33, define a sample-point-count threshold t and set its initial value to t_0.
S4, screen boundary samples:
S41, judge whether n equals 1; if yes, screen out all hash buckets whose smaller class count m_i = min(N_i^+, N_i^-) satisfies m_i ≥ t; if not, screen out all hash buckets satisfying m_i = t;
S42, empty the boundary sample set B;
S43, add all sample points in the screened hash buckets to the boundary sample set B;
s51, emptying the simplified training sample set T;
s52, adding all sample points in the boundary sample set B and the support vector set S into a simplified training sample set T;
s6, training a classification model: performing SVM training on the simplified training sample set T to obtain a support vector set S 'and a classification model M';
S7, judge the termination condition:
S71, calculate the support-vector count change Δw from ‖S‖ and ‖S′‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
S72, decrement the sample-point-count threshold t by 1 and increment the iteration counter n by 1;
S73, with w_th a preset threshold on the support-vector count change, if Δw < w_th or t = 0, end the construction process and output the classification model M′; otherwise update S = S′ and jump to step S41.
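The iteration of steps S3–S7 can be sketched end-to-end as below. This is a sketch under stated assumptions, not the patent's exact procedure: the two screening tests (m ≥ t on the first pass, m = t afterwards) and the relative-change form of Δw are plausible readings of steps S41 and S71, whose exact formulas appear only as images in the original, and scikit-learn's SVC stands in for the SVM trainer:

```python
# Sketch of the S3-S7 loop. Screening inequalities and the Delta-w
# formula are assumptions; `buckets` maps hash -> sample indices and
# `m` maps hash -> min(N+, N-) per bucket.
import numpy as np
from sklearn.svm import SVC

def fast_svm_train(X, y, buckets, m, t0, w_th, max_iter=50):
    S = np.empty(0, dtype=int)          # S: indices of current support vectors
    t, n, model = t0, 1, None           # S32/S33: counter and threshold
    while True:
        # S41: screen boundary-candidate buckets
        if n == 1:
            B = [i for h, idxs in buckets.items() if m[h] >= t for i in idxs]
        else:
            B = [i for h, idxs in buckets.items() if m[h] == t for i in idxs]
        # S5: simplified training set T = B union S
        T = np.unique(np.concatenate([np.asarray(B, dtype=int), S]))
        # S6: SVM training on the reduced set
        model = SVC(C=1, kernel="rbf", gamma="auto").fit(X[T], y[T])
        S_new = T[model.support_]
        # S71 (assumed form): relative support-vector count change
        dw = abs(len(S_new) - len(S)) / max(len(S), 1)
        S, t, n = S_new, t - 1, n + 1   # S72, and the update S = S'
        if dw < w_th or t == 0 or n > max_iter:   # S73: termination
            return model, S
```

With t_0 = 2 this performs at most two passes before t reaches 0, which matches the two iterations reported for a9a in the embodiment.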
Further, the number of bits K of the hash value in step S12 takes values in the range
log2 N ≤ K ≤ log2 N + 10.
Further, the initial value t_0 of the sample-point-count threshold t in step S33 is taken as the larger of a rounded-down quantity and 1, i.e. t_0 = max(⌊·⌋, 1), where ⌊·⌋ is the floor function.
Further, the support-vector count change threshold w_th in step S73 takes values in a preset range.
the invention has the beneficial effects that: the method adopts a local sensitive Hash method to map the training samples into Hash values with moderate digits, thereby screening out possible boundary sample points and training the boundary sample points, obviously reducing the number of the training samples, reducing the computational complexity and improving the speed of model construction; and (3) according to the probability that the sample points are boundary points, the potential boundary points are screened out in an iterative manner, so that the speed is improved, and the accuracy of model classification is ensured.
Drawings
FIG. 1 is a flow chart of a method for constructing a rapid classification model.
Detailed Description
The following description of embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments; to those skilled in the art, any changes within the spirit and scope of the invention as defined by the appended claims are apparent, and everything produced using the inventive concept falls under the invention's protection.
In this embodiment, the experiments use two data sets, a6a and a9a, which are widely used subsets of the UCI Adult data set; their configuration is shown in Table 1. The UCI Adult data set, also known as the "Census Income" data set, consists of attributes derived from census data, such as age, education, marital status, and occupation, with the task of predicting whether a person's income exceeds $50K/year.
Table 1 data set configuration table
Sample set    Training samples    Test samples    Dimension
a6a           11220               21341           123
a9a           32561               16281           123
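The a6a/a9a splits are 123-dimensional and commonly distributed in sparse LIBSVM text format. A minimal sketch of reading that format with scikit-learn's load_svmlight_file follows; the two-row in-memory buffer stands in for the downloaded files, and for the real experiment one would pass the file paths instead:

```python
# Reading LIBSVM-format data such as a6a/a9a with scikit-learn.
# The distributed a9a files use one-based feature indices, hence
# zero_based=False; the two toy rows below only illustrate the format.
import io
from sklearn.datasets import load_svmlight_file

libsvm_rows = b"+1 3:1 11:1\n-1 5:1 123:1\n"
X, y = load_svmlight_file(io.BytesIO(libsvm_rows),
                          n_features=123, zero_based=False)
```

X comes back as a sparse matrix with the 123 feature columns of Table 1, and y holds the +1/-1 class labels used throughout the method.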
Taking the a9a data set as an example, and as shown in FIG. 1, the rapid classification model construction method comprises the following steps:
S1, locality-sensitive hash mapping:
S11, let the total number of training samples be N, the samples comprising positive sample points and negative sample points; in this embodiment N = 32561;
S12, map each training sample point to a hash value of K bits using a standard locality-sensitive hashing method, obtaining H distinct hash values after the mapping; with each distinct hash value defining a hash bucket, the number of hash buckets is H. The locality-sensitive hash mapping can be regarded as a process of putting sample points into hash buckets: all sample points mapped to the same hash value are put into the corresponding bucket. In this embodiment H = 7090.
The number of bits K of the hash value in step S12 takes values in the range log2 N ≤ K ≤ log2 N + 10; in this embodiment K = 17.
S2, count the sample points:
S21, count the number of positive sample points N_i^+ and the number of negative sample points N_i^- in each hash bucket, where i is the index of the hash bucket and 1 ≤ i ≤ H;
S22, compare N_i^+ and N_i^- in each hash bucket and take the smaller of the two, m_i = min(N_i^+, N_i^-);
S3, initialization:
S31, define a support vector set S, a boundary sample set B, and a simplified training sample set T, and initialize all three to empty;
S32, define an iteration counter n and set its initial value to 1;
S33, define a sample-point-count threshold t and set its initial value to t_0.
The initial value t_0 in step S33 is taken as the larger of a rounded-down quantity and 1, i.e. t_0 = max(⌊·⌋, 1), where ⌊·⌋ is the floor function; in this embodiment t_0 = 2.
S4, screen boundary samples:
S41, judge whether n equals 1; if yes, screen out all hash buckets whose smaller class count m_i = min(N_i^+, N_i^-) satisfies m_i ≥ t; if not, screen out all hash buckets satisfying m_i = t;
S42, empty the boundary sample set B;
S43, add all sample points in the screened hash buckets to the boundary sample set B; when n = 1, 15344 sample points are screened out;
S5, construct the training sample set:
S51, empty the simplified training sample set T;
S52, add all sample points in the boundary sample set B and the support vector set S to the simplified training sample set T;
S6, train the classification model: perform SVM training on the simplified training sample set T to obtain a support vector set S′ and a classification model M′;
S7, judge the termination condition:
S71, calculate the support-vector count change Δw from ‖S‖ and ‖S′‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set; when n = 1, the support-vector count change is Δw = 7.475 × 10^7;
S72, decrement the sample-point-count threshold t by 1 and increment the iteration counter n by 1;
S73, with w_th a preset threshold on the support-vector count change, if Δw < w_th or t = 0, end the construction process and output the classification model M′; otherwise update S = S′ and jump to step S41.
The support-vector count change threshold w_th in step S73 takes values in a preset range. In this embodiment a value of w_th is set; when n = 1, the change Δw = 7.475 × 10^7 satisfies Δw > w_th and t = 1, so S is updated to S′ and the process jumps to step S41; when n = 2, the change Δw = 0.162 and t = 0, so the construction process ends and the classification model M′ is output.
The performance of the method of the invention is evaluated below.
The method is compared against the SVM implementation in the scikit-learn library (hereinafter "sklearn SVM") under the latest Python 3.7, with the sklearn SVM model parameters set to C=1, kernel='rbf', gamma='auto'.
(1) Comparison of the number of training samples
On the a6a and a9a data sets, the method of the invention runs through two training iterations; the number of samples participating in each iteration is shown in Table 2. In the first iteration, the samples participating in training average 46.36% of the total; in the second, 31.15%. Summed over the two iterations, the method trains on 22.49% fewer samples on average than the sklearn SVM.
TABLE 2 Comparison of the number of training samples
(2) Comparison of classification performance
The model construction time and classification accuracy measured for the method of the invention are shown in Table 3. In construction speed, the method holds a clear advantage over the sklearn SVM, cutting construction time by 36.70% on average. Training time does not grow linearly with the number of samples; it rises sharply as the sample count increases. In the test, training runs through two iterations, each involving relatively few samples, so the total training time is markedly shortened. In classification accuracy, the method performs on a par with the sklearn SVM, with slightly improved accuracy. In summary, the method of the invention speeds up model construction while preserving the model's classification accuracy.
TABLE 3 Comparison of model construction time and classification accuracy

Claims (4)

1. A rapid classification model construction method, characterized by comprising the following steps:
S1, locality-sensitive hash mapping:
S11, let the total number of training samples be N, the samples comprising positive sample points and negative sample points;
S12, map each training sample point to a hash value of K bits using a standard locality-sensitive hashing method; after the mapping, H distinct hash values are obtained, and with each distinct hash value defining a hash bucket, H hash buckets are obtained;
S2, count the sample points:
S21, count the number of positive sample points N_i^+ and the number of negative sample points N_i^- in each hash bucket, where i is the index of the hash bucket and 1 ≤ i ≤ H;
S22, compare N_i^+ and N_i^- in each hash bucket and take the smaller of the two, m_i = min(N_i^+, N_i^-);
S3, initialization:
S31, define a support vector set S, a boundary sample set B, and a simplified training sample set T, and initialize all three to empty;
S32, define an iteration counter n and set its initial value to 1;
S33, define a sample-point-count threshold t and set its initial value to t_0.
S4, screen boundary samples:
S41, judge whether n equals 1; if yes, screen out all hash buckets whose smaller class count m_i = min(N_i^+, N_i^-) satisfies m_i ≥ t; if not, screen out all hash buckets satisfying m_i = t;
S42, empty the boundary sample set B;
S43, add all sample points in the screened hash buckets to the boundary sample set B;
S5, construct the training sample set:
S51, empty the simplified training sample set T;
S52, add all sample points in the boundary sample set B and the support vector set S to the simplified training sample set T;
S6, train the classification model: perform SVM training on the simplified training sample set T to obtain a support vector set S′ and a classification model M′;
S7, judge the termination condition:
S71, calculate the support-vector count change Δw from ‖S‖ and ‖S′‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
S72, decrement the sample-point-count threshold t by 1 and increment the iteration counter n by 1;
S73, with w_th a preset threshold on the support-vector count change, if Δw < w_th or t = 0, end the construction process and output the classification model M′; otherwise update S = S′ and jump to step S41.
2. The rapid classification model construction method according to claim 1, characterized in that the number of bits K of the hash value in step S12 takes values in the range log2 N ≤ K ≤ log2 N + 10.
3. The rapid classification model construction method according to claim 1, characterized in that the initial value t_0 of the sample-point-count threshold t in step S33 is taken as the larger of a rounded-down quantity and 1, i.e. t_0 = max(⌊·⌋, 1), where ⌊·⌋ is the floor function.
4. The rapid classification model construction method according to claim 1, characterized in that the support-vector count change threshold w_th in step S73 takes values in a preset range.
CN201911037562.2A 2019-10-29 2019-10-29 Rapid classification model construction method Pending CN110837853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911037562.2A CN110837853A (en) 2019-10-29 2019-10-29 Rapid classification model construction method


Publications (1)

Publication Number Publication Date
CN110837853A true CN110837853A (en) 2020-02-25

Family

ID=69575755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911037562.2A Pending CN110837853A (en) 2019-10-29 2019-10-29 Rapid classification model construction method

Country Status (1)

Country Link
CN (1) CN110837853A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522906A (en) * 2020-04-22 2020-08-11 电子科技大学 Financial event main body extraction method based on question-answering mode
CN111522906B (en) * 2020-04-22 2023-03-28 电子科技大学 Financial event main body extraction method based on question-answering mode
CN111638427A (en) * 2020-06-03 2020-09-08 西南交通大学 Transformer fault detection method based on nuclear capsule neuron coverage


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200225)