CN110837853A - Rapid classification model construction method - Google Patents
- Publication number
- CN110837853A (application CN201911037562.2A)
- Authority
- CN
- China
- Prior art keywords
- hash
- sample
- sample points
- training
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9014—Indexing; Data structures therefor; Storage structures hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a rapid classification model construction method. The method uses locality-sensitive hashing to map training samples to hash values with a moderate number of bits, screens out likely boundary sample points, and trains only on those points, which markedly reduces the number of training samples, lowers the computational complexity, and speeds up model construction. Potential boundary points are screened iteratively according to the probability that a sample point lies on the boundary, so classification accuracy is preserved while speed is improved.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and relates to a rapid classification model construction method.
Background
The classification problem is to train a classification model on existing data and then use the model to predict the class of unknown data. For example, a personal income-grade classification model can be built from census attributes such as age, education level, marital status, occupation, and income; with this model, an individual's income grade can be predicted from the non-income attributes alone. The design and construction of classification models is the core of the classification problem and a key research topic in computer pattern recognition.
Classical classification models mainly include K-Nearest Neighbor (KNN), the Support Vector Machine (SVM), and neural networks. KNN assigns a sample to the class of its nearest sample or samples; it is conceptually simple and easy to implement, but its computation is dominated by calculating the distance from every sample to be classified to all known samples. The SVM is among the most robust and accurate classification models; it seeks the hyperplane separating the classes with the largest margin and is widely applied in face recognition, machine fault detection, time-series prediction, bioengineering, and other fields. However, the SVM is solved via quadratic programming, which involves operations on high-order matrices, so model construction is slow when the data volume is large. Neural network models, currently the most popular, achieve good classification performance, but still generally suffer from slow convergence, heavy computation, and long construction time. In summary, all the classical classification models face high computational complexity, a problem that becomes more acute as data volume and dimensionality keep growing in the big-data era.
Methods to reduce computational complexity typically fall into two categories: dimensionality reduction and sample-count reduction. For dimensionality reduction, common techniques include principal component analysis and linear discriminant analysis; the data are projected to a lower dimension and training is performed on the reduced data. The drawback is that dimensionality reduction can alter the data characteristics and thereby lower the model's classification accuracy. For sample-count reduction, the usual approach is to cluster the samples, find representative points for each cluster, and train only on those representatives, which speeds up model building. But clustering itself requires substantial time, and because representative points cannot reproduce the distribution of the original data, classification accuracy tends to drop. In short, for large-scale data sets, a construction method is needed that is fast while still guaranteeing classification accuracy.
Disclosure of Invention
To address the defects of the prior art, the rapid classification model construction method provided by the invention offers high training speed and low computational complexity on large-scale data sets while preserving classification accuracy.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a method for constructing a rapid classification model comprises the following steps:
s1, local sensitive Hash mapping:
s11, setting the total number of training samples as N, wherein the samples comprise positive sample points and negative sample points;
s12, mapping each training sample point into a Hash value with the digit of K by adopting a standard locality sensitive Hash method, obtaining H different Hash values after mapping is finished, and obtaining H Hash buckets according to the condition that each different Hash value is a Hash bucket;
s2, counting the number of sample points:
s21, counting, in each hash bucket, the number n_i^+ of positive sample points and the number n_i^- of negative sample points, where i is the hash bucket index and 1 ≤ i ≤ H;
s22, comparing, for each hash bucket, n_i^+ and n_i^-, and recording the smaller of the two as m_i = min(n_i^+, n_i^-);
s3, initialization:
s31, defining a support vector set S, a boundary sample set B and a simplified training sample set T, and initializing the support vector set S, the boundary sample set B and the simplified training sample set T to be empty;
s32, defining an iteration counter n, and setting the initial value of the iteration counter n to be 1;
s33, defining a sample point number threshold t, and setting the initial value of the sample point number threshold t as t0;
S4, screening boundary samples:
s41, judging whether n is equal to 1: if so, screening out all hash buckets satisfying m_i ≥ t; if not, screening out all hash buckets satisfying m_i = t;
s42, emptying the boundary sample set B;
s43, adding all sample points in all hash buckets obtained through screening into a boundary sample set B;
s5, constructing a training sample set:
s51, emptying the simplified training sample set T;
s52, adding all sample points in the boundary sample set B and the support vector set S into a simplified training sample set T;
s6, training a classification model: performing SVM training on the simplified training sample set T to obtain a support vector set S 'and a classification model M';
s7, determination termination condition:
s71, calculating the support-vector-count change value Δw = |‖S'‖ - ‖S‖| / ‖S‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
s72, self-subtracting 1 from the threshold t of the number of the sample points, and self-adding 1 to the iteration counter n;
s73, setting w_th as a preset support-vector-count change threshold; if Δw < w_th or t = 0 is satisfied, the construction process ends and the classification model M' is output; otherwise S is updated to S' and the process jumps to step S41.
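The screening steps S1 through S5 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the random-hyperplane hash family, the function names, and the data layout are choices of the sketch, m_i here denotes the smaller of the positive and negative counts in bucket i, and the SVM training of step S6 would be supplied by any standard solver.

```python
import random
from collections import defaultdict

def lsh_keys(points, k, seed=0):
    """Step S1: map each point to a K-bit hash key with random-hyperplane
    LSH; bit j is 1 iff the point projects positively on hyperplane j."""
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    return [tuple(1 if sum(p * v for p, v in zip(plane, x)) > 0 else 0
                  for plane in planes)
            for x in points]

def screen_boundary(points, labels, k, t):
    """Steps S2 and S4: bucket the samples by hash key, count positive and
    negative points per bucket, and keep every point of any bucket whose
    smaller class count m_i = min(pos, neg) reaches the threshold t."""
    buckets = defaultdict(list)
    for idx, key in enumerate(lsh_keys(points, k)):
        buckets[key].append(idx)
    selected = []
    for members in buckets.values():
        pos = sum(1 for i in members if labels[i] > 0)
        neg = len(members) - pos
        if min(pos, neg) >= t:  # likely boundary bucket
            selected.extend(members)
    return selected
```

With t = 0 every bucket qualifies and all samples are kept; raising t restricts training to buckets where both classes mix, i.e. the likely decision boundary (step S5 then unions this set with the previous support vectors).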
Further: the value range of the bit count K of the hash value in step S12 is:
log2 N ≤ K ≤ log2 N + 10.
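The admissible integer bit counts under this bound can be computed directly; the helper name below is illustrative, not from the patent.

```python
import math

def k_range(n):
    """Integer hash bit-widths K allowed by log2(N) <= K <= log2(N) + 10."""
    return math.ceil(math.log2(n)), math.floor(math.log2(n) + 10)

# For N = 32561 (the a9a embodiment) this gives 15 <= K <= 24,
# consistent with the embodiment's choice K = 17.
lo, hi = k_range(32561)
```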
Further: the initial value t0 of the sample point number threshold t in step S33 takes its value in a range defined with the floor function ⌊·⌋ and the operation max(·, 1), which returns the larger of its argument and 1.
Further: support vector number change threshold w in step S73thThe value range is as follows:
the invention has the beneficial effects that: the method adopts a local sensitive Hash method to map the training samples into Hash values with moderate digits, thereby screening out possible boundary sample points and training the boundary sample points, obviously reducing the number of the training samples, reducing the computational complexity and improving the speed of model construction; and (3) according to the probability that the sample points are boundary points, the potential boundary points are screened out in an iterative manner, so that the speed is improved, and the accuracy of model classification is ensured.
Drawings
FIG. 1 is a flow chart of a method for constructing a rapid classification model.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of the embodiments. To those skilled in the art, various changes remaining within the spirit and scope of the invention as defined by the appended claims are apparent, and all matter produced using the inventive concept is protected.
In this example, two data sets, a6a and a9a, drawn from the widely used UCI Adult data set were used in the experiments; their configuration is shown in Table 1. The UCI Adult data set, also known as the "Census Income" data set, contains attributes derived from census data, such as age, education, marital status, and occupation, and the task is to predict whether an individual's income exceeds $50K/year.
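The a6a and a9a splits are commonly distributed in the sparse LIBSVM text format (`label index:value ...`). A minimal pure-Python loader is sketched below; the function name and the assumption of 1-based feature indices are choices of the sketch.

```python
def load_libsvm(lines, dim):
    """Parse LIBSVM-format lines into dense feature vectors and labels.

    Each line looks like '+1 3:1 11:1 ...': a label followed by
    1-based index:value pairs for the non-zero features.
    """
    xs, ys = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        ys.append(int(float(parts[0])))
        x = [0.0] * dim
        for pair in parts[1:]:
            idx, val = pair.split(":")
            x[int(idx) - 1] = float(val)
        xs.append(x)
    return xs, ys
```

Usage on a local copy of a9a (123 features) would be, e.g., `xs, ys = load_libsvm(open("a9a"), 123)`.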
Table 1 Data set configuration

Sample set | Number of training samples | Number of test samples | Dimension
---|---|---|---
a6a | 11220 | 21341 | 123
a9a | 32561 | 16281 | 123
Taking an a9a data set as an example, as shown in fig. 1, a method for constructing a rapid classification model includes the following steps:
s1, local sensitive Hash mapping:
s11, setting the total number of training samples to be N, where the samples include positive sample points and negative sample points, where N is 32561 in this embodiment;
s12, mapping each training sample point to a hash value of K bits using a standard locality-sensitive hashing method, and obtaining H distinct hash values after the mapping is complete; treating each distinct hash value as a hash bucket gives H hash buckets. The mapping can be regarded as a process of placing sample points into hash buckets: all sample points mapped to the same hash value are placed in the corresponding bucket. In this embodiment H = 7090;
The value range of the bit count K of the hash value in step S12 is:
log2 N ≤ K ≤ log2 N + 10; in this embodiment K = 17.
S2, counting the number of sample points:
s21, counting, in each hash bucket, the number n_i^+ of positive sample points and the number n_i^- of negative sample points, where i is the hash bucket index and 1 ≤ i ≤ H;
s22, comparing, for each hash bucket, n_i^+ and n_i^-, and recording the smaller of the two as m_i = min(n_i^+, n_i^-);
s3, initialization:
s31, defining a support vector set S, a boundary sample set B and a simplified training sample set T, and initializing the support vector set S, the boundary sample set B and the simplified training sample set T to be empty;
s32, defining an iteration counter n, and setting the initial value of the iteration counter n to be 1;
s33, defining a sample point number threshold t, and setting the initial value of the sample point number threshold t as t0;
The initial value t0 of the sample point number threshold t in step S33 takes its value in a range defined with the floor function ⌊·⌋ and the operation max(·, 1), which returns the larger of its argument and 1; in this embodiment t0 = 2;
S4, screening boundary samples:
s41, judging whether n is equal to 1: if so, screening out all hash buckets satisfying m_i ≥ t; if not, screening out all hash buckets satisfying m_i = t;
s42, emptying the boundary sample set B;
s43, adding all sample points of all screened hash buckets to the boundary sample set B; when n = 1, 15344 sample points are screened;
s5, constructing a training sample set:
s51, emptying the simplified training sample set T;
s52, adding all sample points in the boundary sample set B and the support vector set S into a simplified training sample set T;
s6, training a classification model: performing SVM training on the simplified training sample set T to obtain a support vector set S 'and a classification model M';
s7, determination termination condition:
s71, calculating the support-vector-count change value Δw = |‖S'‖ - ‖S‖| / ‖S‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
When n = 1, the support-vector-count change value Δw = 7.475 × 10^7;
S72, self-subtracting 1 from the threshold t of the number of the sample points, and self-adding 1 to the iteration counter n;
s73, setting w_th as a preset support-vector-count change threshold; if Δw < w_th or t = 0 is satisfied, ending the construction process and outputting the classification model M'; otherwise updating S = S' and jumping to step S41;
The support-vector-count change threshold w_th in step S73 takes its value in a preset range.
In this embodiment, when n = 1 the support-vector-count change value Δw = 7.475 × 10^7; since Δw > w_th and t = 1, S is updated to S' and the process jumps to step S41. When n = 2, Δw = 0.162; since Δw > w_th but t = 0, the construction process ends and the classification model M' is output.
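The termination test of step S7 can be written down directly. The relative-change formula below is a reconstruction of the expression garbled in S71 (the handling of an initially empty S is an assumption of the sketch); under it, the n = 2 value 0.162 would correspond to, e.g., a support-vector count moving from 1000 to 1162.

```python
def support_vector_change(prev_count, new_count):
    """Relative change in support-vector count between iterations:
    dw = | |S'| - |S| | / |S|. An initially empty S is treated as an
    infinite change so that training never stops after iteration 1."""
    if prev_count == 0:
        return float("inf")
    return abs(new_count - prev_count) / prev_count

def should_stop(dw, w_th, t):
    """Termination rule of step S73: stop when dw < w_th or t == 0."""
    return dw < w_th or t == 0
```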
The following gives an evaluation of the performance of the method of the invention:
comparing the method with an SVM method (hereinafter abbreviated as skearn SVM) in a skearn library of the latest python version 3.7, and setting parameters of a skearn SVM model as C ═ 1, kernel ═ rbf ', and gamma ═ auto'.
(1) Training sample number comparison
For the a6a and a9a data sets, the method of the invention performs two training iterations; the number of samples participating in each iteration is shown in Table 2. In the first iteration, the samples involved in training average 46.36% of the total; in the second iteration, 31.15%. Summed over both iterations, the method uses 22.49% fewer training samples than the sklearn SVM on average.
TABLE 2 training sample number comparison
(2) Classification Performance comparison
The model construction time and classification accuracy of the method of the invention are shown in Table 3. In construction speed, the method has a clear advantage over the sklearn SVM, reducing construction time by 36.70% on average: training time grows faster than linearly in the number of samples, and since the training here runs for two iterations with relatively few samples in each, the total training time is markedly shortened. In classification accuracy, the method performs on par with the sklearn SVM, with slightly improved accuracy. In conclusion, the method improves model construction speed while preserving classification accuracy.
TABLE 3 comparison of model build time to Classification accuracy
Claims (4)
1. A method for constructing a rapid classification model is characterized by comprising the following steps:
s1, local sensitive Hash mapping:
s11, setting the total number of training samples as N, wherein the samples comprise positive sample points and negative sample points;
s12, mapping each training sample point into a Hash value with the digit of K by adopting a standard locality sensitive Hash method, obtaining H different Hash values after mapping is finished, and obtaining H Hash buckets according to the condition that each different Hash value is a Hash bucket;
s2, counting the number of sample points:
s21, counting, in each hash bucket, the number n_i^+ of positive sample points and the number n_i^- of negative sample points, where i is the hash bucket index and 1 ≤ i ≤ H;
s22, comparing, for each hash bucket, n_i^+ and n_i^-, and recording the smaller of the two as m_i = min(n_i^+, n_i^-);
s3, initialization:
s31, defining a support vector set S, a boundary sample set B and a simplified training sample set T, and initializing the support vector set S, the boundary sample set B and the simplified training sample set T to be empty;
s32, defining an iteration counter n, and setting the initial value of the iteration counter n to be 1;
s33, defining a sample point number threshold t, and setting the initial value of the sample point number threshold t as t0;
S4, screening boundary samples:
s41, judging whether n is equal to 1: if so, screening out all hash buckets satisfying m_i ≥ t; if not, screening out all hash buckets satisfying m_i = t;
s42, emptying the boundary sample set B;
s43, adding all sample points in all hash buckets obtained through screening into a boundary sample set B;
s5, constructing a training sample set:
s51, emptying the simplified training sample set T;
s52, adding all sample points in the boundary sample set B and the support vector set S into a simplified training sample set T;
s6, training a classification model: performing SVM training on the simplified training sample set T to obtain a support vector set S 'and a classification model M';
s7, determination termination condition:
s71, calculating the support-vector-count change value Δw = |‖S'‖ - ‖S‖| / ‖S‖, where |·| denotes the absolute-value operation and ‖·‖ denotes the number of elements in a set;
s72, self-subtracting 1 from the threshold t of the number of the sample points, and self-adding 1 to the iteration counter n;
s73, setting w_th as a preset support-vector-count change threshold; if Δw < w_th or t = 0 is satisfied, the construction process ends and the classification model M' is output; otherwise S is updated to S' and the process jumps to step S41.
2. The method for constructing a rapid classification model according to claim 1, wherein the value range of the bit count K of the hash value in step S12 is:
log2 N ≤ K ≤ log2 N + 10.
3. The method for constructing a rapid classification model according to claim 1, wherein the initial value t0 of the sample point number threshold t in step S33 takes its value in a range defined with the floor function ⌊·⌋ and the operation max(·, 1), which returns the larger of its argument and 1.
4. The method for constructing a rapid classification model according to claim 1, wherein the support-vector-count change threshold w_th in step S73 takes its value in a preset range.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911037562.2A CN110837853A (en) | 2019-10-29 | 2019-10-29 | Rapid classification model construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110837853A (en) | 2020-02-25
Family
ID=69575755
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911037562.2A Pending CN110837853A (en) | 2019-10-29 | 2019-10-29 | Rapid classification model construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110837853A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111522906A (en) * | 2020-04-22 | 2020-08-11 | 电子科技大学 | Financial event main body extraction method based on question-answering mode |
CN111522906B (en) * | 2020-04-22 | 2023-03-28 | 电子科技大学 | Financial event main body extraction method based on question-answering mode |
CN111638427A (en) * | 2020-06-03 | 2020-09-08 | 西南交通大学 | Transformer fault detection method based on nuclear capsule neuron coverage |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110070141B (en) | Network intrusion detection method | |
CN114241779B (en) | Short-time prediction method, computer and storage medium for urban expressway traffic flow | |
CN111062425B (en) | Unbalanced data set processing method based on C-K-SMOTE algorithm | |
CN110298024B (en) | Method and device for detecting confidential documents and storage medium | |
CN113128671B (en) | Service demand dynamic prediction method and system based on multi-mode machine learning | |
Mohammadi et al. | Improving linear discriminant analysis with artificial immune system-based evolutionary algorithms | |
WO2024036709A1 (en) | Anomalous data detection method and apparatus | |
CN111539444A (en) | Gaussian mixture model method for modified mode recognition and statistical modeling | |
CN110634060A (en) | User credit risk assessment method, system, device and storage medium | |
CN111539451A (en) | Sample data optimization method, device, equipment and storage medium | |
CN110837853A (en) | Rapid classification model construction method | |
CN116612307A (en) | Solanaceae disease grade identification method based on transfer learning | |
CN117521063A (en) | Malicious software detection method and device based on residual neural network and combined with transfer learning | |
CN111797979A (en) | Vibration transmission system based on LSTM model | |
Parker et al. | Nonlinear time series classification using bispectrum‐based deep convolutional neural networks | |
Gu et al. | A distance-type-insensitive clustering approach | |
CN113539479B (en) | Similarity constraint-based miRNA-disease association prediction method and system | |
CN111461199B (en) | Safety attribute selection method based on distributed junk mail classified data | |
CN115470834A (en) | Multi-label learning algorithm for correcting inaccurate labels of label confidence degree based on label propagation | |
CN113656707A (en) | Financing product recommendation method, system, storage medium and equipment | |
CN114095268A (en) | Method, terminal and storage medium for network intrusion detection | |
KR102212310B1 (en) | System and method for detecting of Incorrect Triple | |
CN110827919A (en) | Dimension reduction method applied to gene expression profile data | |
Wang et al. | Cosine kernel based density peaks clustering algorithm | |
CN113688229B (en) | Text recommendation method, system, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200225 |