CN110837853A - Rapid classification model construction method - Google Patents

Rapid classification model construction method

Info

Publication number
CN110837853A
Authority
CN
China
Prior art keywords
hash
sample
sample points
training
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911037562.2A
Other languages
Chinese (zh)
Inventor
甘涛 (Gan Tao)
王志阳 (Wang Zhiyang)
何艳敏 (He Yanmin)
罗瑜 (Luo Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911037562.2A priority Critical patent/CN110837853A/en
Publication of CN110837853A publication Critical patent/CN110837853A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/214 — Pattern recognition; Analysing; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F16/9014 — Information retrieval; Indexing; Data structures therefor; Storage structures; hash tables
    • G06F16/906 — Details of database functions; Clustering; Classification
    • G06F18/2411 — Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a rapid classification model construction method. A locality-sensitive hashing method maps the training samples to hash values with a moderate number of bits, from which likely boundary sample points are screened out and used for training; this markedly reduces the number of training samples, lowering the computational complexity and speeding up model construction. Potential boundary points are screened out iteratively according to the probability that a sample point is a boundary point, which raises speed while preserving the model's classification accuracy.

Description

Rapid classification model construction method
Technical Field
The invention belongs to the technical field of pattern recognition, and relates to a rapid classification model construction method.
Background
The classification problem consists of training a classification model on existing data and then using the model to predict the class of unknown data. For example, a personal income-grade classification model can be built from census data such as age, education, marital status, occupation, and income; based on this model, a person's income grade can be predicted from his or her non-income attributes. The design and construction of the classification model is the core of the classification problem and a key research topic in computer pattern recognition.
Classical classification models include K-Nearest Neighbors (KNN), the Support Vector Machine (SVM), and neural networks. KNN assigns a sample to the class of its nearest sample or samples; it is conceptually simple and easy to implement, but its computation is dominated by calculating the distance from each sample to be classified to all known samples. The SVM, one of the most robust and accurate classification models, seeks a hyperplane that separates the classes with maximum margin and is widely applied in face recognition, machine fault detection, time-series prediction, bioengineering, and other fields. However, the SVM is solved by quadratic programming, which involves operations on high-order matrices, so model construction is slow when the data volume is large. Neural network models, currently popular, offer good classification performance but still suffer from slow convergence, heavy computation, and long construction times. In short, all the classical classification models face high computational complexity, a problem that grows more acute as data volume and dimensionality keep increasing in the big-data era.
Methods for reducing computational complexity generally fall into two categories: reducing dimensionality and reducing the number of samples. For dimensionality reduction, common techniques include principal component analysis and linear discriminant analysis: the data are projected to a lower dimension and training is performed on the reduced data. The drawback is that dimensionality reduction can alter the data's characteristics and thus lower the model's classification accuracy. For reducing the number of samples, the usual approach is to cluster the samples, find representative points of each cluster, and train only on those representatives, which speeds up model construction. But the clustering computation itself carries a large time overhead, and because the representative points cannot reproduce the distribution of the original data, classification accuracy tends to drop. In short, for large-scale data sets, a model construction method that preserves classification accuracy while processing quickly still needs to be found.
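The clustering-based sample reduction just described can be sketched as follows; the function name and per-class cluster count are illustrative assumptions, not part of the invention:

```python
# Sketch of clustering-based sample reduction: train only on per-class
# cluster centers instead of the full training set.
import numpy as np
from sklearn.cluster import KMeans

def representative_points(X, y, k_per_class=10, seed=0):
    """Replace each class by the centers of k clusters fitted to it."""
    reps_X, reps_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        k = min(k_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        reps_X.append(km.cluster_centers_)
        reps_y.append(np.full(k, label))
    return np.vstack(reps_X), np.concatenate(reps_y)
```

As the paragraph above notes, the cluster centers do not preserve the original data distribution, which is why this kind of reduction can cost classification accuracy.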
Disclosure of Invention
Addressing the shortcomings of the prior art, the rapid classification model construction method provided by the invention offers fast training and low computational complexity on large-scale data sets while preserving classification accuracy.
To achieve this purpose, the invention adopts the following technical scheme: a rapid classification model construction method comprising the following steps:
S1, locality-sensitive hash mapping:
S11, let the total number of training samples be N, the samples comprising positive sample points and negative sample points;
S12, map each training sample point to a hash value of K bits using a standard locality-sensitive hashing method; after the mapping, H distinct hash values are obtained, and with each distinct hash value defining a hash bucket, H hash buckets are obtained;
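Step S12 only calls for "a standard locality sensitive Hash method"; the sketch below assumes a random-hyperplane (sign-based) LSH family, one common choice, with illustrative names:

```python
# Sketch of step S12: map each sample to a K-bit LSH code and group
# samples into buckets by hash value. Sign-based LSH is an assumption.
import numpy as np

def lsh_hash(X, K, seed=0):
    """Map each row of X to a K-bit random-hyperplane LSH code."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], K))  # K random hyperplanes
    bits = (X @ planes >= 0).astype(np.uint8)      # sign of each projection
    # pack the K bits of each sample into one integer hash value
    return np.array([int("".join(map(str, row)), 2) for row in bits])

def to_buckets(hashes):
    """Group sample indices by hash value: one bucket per distinct hash."""
    buckets = {}
    for idx, h in enumerate(hashes):
        buckets.setdefault(h, []).append(idx)
    return buckets
```

The number of distinct hash values produced this way is the bucket count H of the text; nearby points tend to share buckets, which is what makes the later boundary screening cheap.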
S2, count the sample points:
S21, count the number of positive sample points N_i^+ and the number of negative sample points N_i^- in each hash bucket, where i is the index of the hash bucket and 1 ≤ i ≤ H;
S22, compare N_i^+ and N_i^- in each hash bucket and take the smaller of the two, m_i = min(N_i^+, N_i^-);
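Steps S21 and S22 count each bucket's positive and negative sample points and keep the smaller of the two counts; a minimal sketch (the bucket representation and function name are illustrative):

```python
# Sketch of steps S21-S22: per-bucket class counts and their minimum.
# `buckets` maps hash value -> list of sample indices; labels are +1/-1.
from collections import Counter

def bucket_min_counts(buckets, y):
    """Return m[i] = min(N_i^+, N_i^-) for each hash bucket i."""
    m = {}
    for h, idxs in buckets.items():
        counts = Counter(y[i] for i in idxs)
        m[h] = min(counts.get(+1, 0), counts.get(-1, 0))
    return m
```

A bucket whose minimum count is 0 contains only one class and is therefore unlikely to straddle the decision boundary.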
S3, initialization:
S31, define a support vector set S, a boundary sample set B, and a simplified training sample set T, and initialize all three to empty;
S32, define an iteration counter n and set its initial value to 1;
S33, define a sample-point-count threshold t and set its initial value to t_0.
S4, screen boundary samples:
S41, judge whether n equals 1; if yes, screen out all hash buckets whose smaller class count m_i = min(N_i^+, N_i^-) satisfies m_i ≥ t; if not, screen out all hash buckets satisfying m_i = t;
S42, empty the boundary sample set B;
S43, add all sample points in the screened hash buckets to the boundary sample set B;
s51, emptying the simplified training sample set T;
s52, adding all sample points in the boundary sample set B and the support vector set S into a simplified training sample set T;
s6, training a classification model: performing SVM training on the simplified training sample set T to obtain a support vector set S 'and a classification model M';
S7, judge the termination condition:
S71, calculate the support-vector count change Δw from ‖S‖ and ‖S′‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
S72, decrement the sample-point-count threshold t by 1 and increment the iteration counter n by 1;
S73, with w_th a preset threshold on the support-vector count change, if Δw < w_th or t = 0, end the construction process and output the classification model M′; otherwise update S = S′ and jump to step S41.
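The iteration of steps S3–S7 can be sketched end-to-end as below. This is a sketch under stated assumptions, not the patent's exact procedure: the two screening tests (m ≥ t on the first pass, m = t afterwards) and the relative-change form of Δw are plausible readings of steps S41 and S71, whose exact formulas appear only as images in the original, and scikit-learn's SVC stands in for the SVM trainer:

```python
# Sketch of the S3-S7 loop. Screening inequalities and the Delta-w
# formula are assumptions; `buckets` maps hash -> sample indices and
# `m` maps hash -> min(N+, N-) per bucket.
import numpy as np
from sklearn.svm import SVC

def fast_svm_train(X, y, buckets, m, t0, w_th, max_iter=50):
    S = np.empty(0, dtype=int)          # S: indices of current support vectors
    t, n, model = t0, 1, None           # S32/S33: counter and threshold
    while True:
        # S41: screen boundary-candidate buckets
        if n == 1:
            B = [i for h, idxs in buckets.items() if m[h] >= t for i in idxs]
        else:
            B = [i for h, idxs in buckets.items() if m[h] == t for i in idxs]
        # S5: simplified training set T = B union S
        T = np.unique(np.concatenate([np.asarray(B, dtype=int), S]))
        # S6: SVM training on the reduced set
        model = SVC(C=1, kernel="rbf", gamma="auto").fit(X[T], y[T])
        S_new = T[model.support_]
        # S71 (assumed form): relative support-vector count change
        dw = abs(len(S_new) - len(S)) / max(len(S), 1)
        S, t, n = S_new, t - 1, n + 1   # S72, and the update S = S'
        if dw < w_th or t == 0 or n > max_iter:   # S73: termination
            return model, S
```

With t_0 = 2 this performs at most two passes before t reaches 0, which matches the two iterations reported for a9a in the embodiment.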
Further, the number of bits K of the hash value in step S12 takes values in the range
log2 N ≤ K ≤ log2 N + 10.
Further, the initial value t_0 of the sample-point-count threshold t in step S33 is taken as the larger of a rounded-down quantity and 1, i.e. t_0 = max(⌊·⌋, 1), where ⌊·⌋ is the floor function.
Further, the support-vector count change threshold w_th in step S73 takes values in a preset range.
the invention has the beneficial effects that: the method adopts a local sensitive Hash method to map the training samples into Hash values with moderate digits, thereby screening out possible boundary sample points and training the boundary sample points, obviously reducing the number of the training samples, reducing the computational complexity and improving the speed of model construction; and (3) according to the probability that the sample points are boundary points, the potential boundary points are screened out in an iterative manner, so that the speed is improved, and the accuracy of model classification is ensured.
Drawings
FIG. 1 is a flow chart of a method for constructing a rapid classification model.
Detailed Description
The following description of embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments; to those skilled in the art, any changes within the spirit and scope of the invention as defined by the appended claims are apparent, and everything produced using the inventive concept falls under the invention's protection.
In this embodiment, the experiments use two data sets, a6a and a9a, which are widely used subsets of the UCI Adult data set; their configuration is shown in Table 1. The UCI Adult data set, also known as the "Census Income" data set, consists of attributes derived from census data, such as age, education, marital status, and occupation, with the task of predicting whether a person's income exceeds $50K/year.
Table 1 data set configuration table
Sample set    Training samples    Test samples    Dimension
a6a           11220               21341           123
a9a           32561               16281           123
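The a6a/a9a splits are 123-dimensional and commonly distributed in sparse LIBSVM text format. A minimal sketch of reading that format with scikit-learn's load_svmlight_file follows; the two-row in-memory buffer stands in for the downloaded files, and for the real experiment one would pass the file paths instead:

```python
# Reading LIBSVM-format data such as a6a/a9a with scikit-learn.
# The distributed a9a files use one-based feature indices, hence
# zero_based=False; the two toy rows below only illustrate the format.
import io
from sklearn.datasets import load_svmlight_file

libsvm_rows = b"+1 3:1 11:1\n-1 5:1 123:1\n"
X, y = load_svmlight_file(io.BytesIO(libsvm_rows),
                          n_features=123, zero_based=False)
```

X comes back as a sparse matrix with the 123 feature columns of Table 1, and y holds the +1/-1 class labels used throughout the method.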
Taking the a9a data set as an example, and as shown in FIG. 1, the rapid classification model construction method comprises the following steps:
S1, locality-sensitive hash mapping:
S11, let the total number of training samples be N, the samples comprising positive sample points and negative sample points; in this embodiment N = 32561;
S12, map each training sample point to a hash value of K bits using a standard locality-sensitive hashing method, obtaining H distinct hash values after the mapping; with each distinct hash value defining a hash bucket, the number of hash buckets is H. The locality-sensitive hash mapping can be regarded as a process of putting sample points into hash buckets: all sample points mapped to the same hash value are put into the corresponding bucket. In this embodiment H = 7090.
The number of bits K of the hash value in step S12 takes values in the range log2 N ≤ K ≤ log2 N + 10; in this embodiment K = 17.
S2, count the sample points:
S21, count the number of positive sample points N_i^+ and the number of negative sample points N_i^- in each hash bucket, where i is the index of the hash bucket and 1 ≤ i ≤ H;
S22, compare N_i^+ and N_i^- in each hash bucket and take the smaller of the two, m_i = min(N_i^+, N_i^-);
S3, initialization:
S31, define a support vector set S, a boundary sample set B, and a simplified training sample set T, and initialize all three to empty;
S32, define an iteration counter n and set its initial value to 1;
S33, define a sample-point-count threshold t and set its initial value to t_0.
The initial value t_0 in step S33 is taken as the larger of a rounded-down quantity and 1, i.e. t_0 = max(⌊·⌋, 1), where ⌊·⌋ is the floor function; in this embodiment t_0 = 2.
S4, screen boundary samples:
S41, judge whether n equals 1; if yes, screen out all hash buckets whose smaller class count m_i = min(N_i^+, N_i^-) satisfies m_i ≥ t; if not, screen out all hash buckets satisfying m_i = t;
S42, empty the boundary sample set B;
S43, add all sample points in the screened hash buckets to the boundary sample set B; when n = 1, 15344 sample points are screened out;
S5, construct the training sample set:
S51, empty the simplified training sample set T;
S52, add all sample points in the boundary sample set B and the support vector set S to the simplified training sample set T;
S6, train the classification model: perform SVM training on the simplified training sample set T to obtain a support vector set S′ and a classification model M′;
S7, judge the termination condition:
S71, calculate the support-vector count change Δw from ‖S‖ and ‖S′‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set; when n = 1, the support-vector count change is Δw = 7.475 × 10^7;
S72, decrement the sample-point-count threshold t by 1 and increment the iteration counter n by 1;
S73, with w_th a preset threshold on the support-vector count change, if Δw < w_th or t = 0, end the construction process and output the classification model M′; otherwise update S = S′ and jump to step S41.
The support-vector count change threshold w_th in step S73 takes values in a preset range. In this embodiment a value of w_th is set; when n = 1, the change Δw = 7.475 × 10^7 satisfies Δw > w_th and t = 1, so S is updated to S′ and the process jumps to step S41; when n = 2, the change Δw = 0.162 and t = 0, so the construction process ends and the classification model M′ is output.
The performance of the method of the invention is evaluated below.
The method is compared against the SVM implementation in the scikit-learn library (hereinafter "sklearn SVM") under the latest Python 3.7, with the sklearn SVM model parameters set to C=1, kernel='rbf', gamma='auto'.
(1) Comparison of the number of training samples
On the a6a and a9a data sets, the method of the invention runs through two training iterations; the number of samples participating in each iteration is shown in Table 2. In the first iteration, the samples participating in training average 46.36% of the total; in the second, 31.15%. Summed over the two iterations, the method trains on 22.49% fewer samples on average than the sklearn SVM.
TABLE 2 Comparison of the number of training samples
(2) Comparison of classification performance
The model construction time and classification accuracy measured for the method of the invention are shown in Table 3. In construction speed, the method holds a clear advantage over the sklearn SVM, cutting construction time by 36.70% on average. Training time does not grow linearly with the number of samples; it rises sharply as the sample count increases. In the test, training runs through two iterations, each involving relatively few samples, so the total training time is markedly shortened. In classification accuracy, the method performs on a par with the sklearn SVM, with slightly improved accuracy. In summary, the method of the invention speeds up model construction while preserving the model's classification accuracy.
TABLE 3 Comparison of model construction time and classification accuracy

Claims (4)

1. A rapid classification model construction method, characterized by comprising the following steps:
S1, locality-sensitive hash mapping:
S11, let the total number of training samples be N, the samples comprising positive sample points and negative sample points;
S12, map each training sample point to a hash value of K bits using a standard locality-sensitive hashing method; after the mapping, H distinct hash values are obtained, and with each distinct hash value defining a hash bucket, H hash buckets are obtained;
S2, count the sample points:
S21, count the number of positive sample points N_i^+ and the number of negative sample points N_i^- in each hash bucket, where i is the index of the hash bucket and 1 ≤ i ≤ H;
S22, compare N_i^+ and N_i^- in each hash bucket and take the smaller of the two, m_i = min(N_i^+, N_i^-);
S3, initialization:
S31, define a support vector set S, a boundary sample set B, and a simplified training sample set T, and initialize all three to empty;
S32, define an iteration counter n and set its initial value to 1;
S33, define a sample-point-count threshold t and set its initial value to t_0.
S4, screen boundary samples:
S41, judge whether n equals 1; if yes, screen out all hash buckets whose smaller class count m_i = min(N_i^+, N_i^-) satisfies m_i ≥ t; if not, screen out all hash buckets satisfying m_i = t;
S42, empty the boundary sample set B;
S43, add all sample points in the screened hash buckets to the boundary sample set B;
S5, construct the training sample set:
S51, empty the simplified training sample set T;
S52, add all sample points in the boundary sample set B and the support vector set S to the simplified training sample set T;
S6, train the classification model: perform SVM training on the simplified training sample set T to obtain a support vector set S′ and a classification model M′;
S7, judge the termination condition:
S71, calculate the support-vector count change Δw from ‖S‖ and ‖S′‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
S72, decrement the sample-point-count threshold t by 1 and increment the iteration counter n by 1;
S73, with w_th a preset threshold on the support-vector count change, if Δw < w_th or t = 0, end the construction process and output the classification model M′; otherwise update S = S′ and jump to step S41.
2. The rapid classification model construction method according to claim 1, characterized in that the number of bits K of the hash value in step S12 takes values in the range log2 N ≤ K ≤ log2 N + 10.
3. The rapid classification model construction method according to claim 1, characterized in that the initial value t_0 of the sample-point-count threshold t in step S33 is taken as the larger of a rounded-down quantity and 1, i.e. t_0 = max(⌊·⌋, 1), where ⌊·⌋ is the floor function.
4. The rapid classification model construction method according to claim 1, characterized in that the support-vector count change threshold w_th in step S73 takes values in a preset range.
CN201911037562.2A 2019-10-29 2019-10-29 Rapid classification model construction method Pending CN110837853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911037562.2A CN110837853A (en) 2019-10-29 2019-10-29 Rapid classification model construction method


Publications (1)

Publication Number Publication Date
CN110837853A true CN110837853A (en) 2020-02-25

Family

ID=69575755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911037562.2A Pending CN110837853A (en) 2019-10-29 2019-10-29 Rapid classification model construction method

Country Status (1)

Country Link
CN (1) CN110837853A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522906A (en) * 2020-04-22 2020-08-11 电子科技大学 Financial event main body extraction method based on question-answering mode
CN111522906B (en) * 2020-04-22 2023-03-28 电子科技大学 Financial event main body extraction method based on question-answering mode
CN111638427A (en) * 2020-06-03 2020-09-08 西南交通大学 Transformer fault detection method based on nuclear capsule neuron coverage


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200225)