CN107688831A - An imbalanced-data classification method based on cluster-based undersampling - Google Patents

An imbalanced-data classification method based on cluster-based undersampling Download PDF

Info

Publication number
CN107688831A
CN107688831A (Application CN201710784810.4A)
Authority
CN
China
Prior art keywords
sample
training set
cluster
majority class
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710784810.4A
Other languages
Chinese (zh)
Inventor
曹路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuyi University
Original Assignee
Wuyi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuyi University filed Critical Wuyi University
Priority to CN201710784810.4A priority Critical patent/CN107688831A/en
Publication of CN107688831A publication Critical patent/CN107688831A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses an imbalanced-data classification method based on cluster-based undersampling, comprising the following steps: cluster the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters; combine each cluster of majority-class samples with the minority-class samples of the training set to form a new sample set, classify it with a support vector machine, and obtain the support vectors of the majority-class samples of the training set; gather the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set; train a support vector machine on the new training set and evaluate its performance with the cross-validation set. The invention not only shortens the classifier's training time but also improves the recognition rate of the minority-class samples without harming the recognition rate of the majority class, improving the overall performance of the classifier.

Description

An imbalanced-data classification method based on cluster-based undersampling
Technical field
The present invention relates to the field of pattern recognition, and in particular to an imbalanced-data classification method based on cluster-based undersampling.
Background technology
Classification is a central research topic in pattern recognition, machine learning, and related fields, with very broad practical applications such as handwritten digit recognition in banking systems, face recognition in security and surveillance systems, and intrusion detection in network security. A number of relatively mature classification methods already exist, including decision trees, k-nearest neighbors, neural networks, and support vector machines; among these, the support vector machine has attracted particular attention for its complete theoretical foundation and strong empirical results. These traditional methods, however, are all built on the assumption that the class distribution is balanced: their main goal is to maximize overall classification performance, and they work well on evenly distributed data sets. Data collected in practice often exhibit imbalanced sample sizes between classes as well as noise, on which traditional classifiers fall short of the expected results.
Imbalanced data sets are ubiquitous in practice, for example in defective-product detection on production lines, credit-card fraud detection, and medical diagnosis. In such data sets the class with more samples is called the majority class and the class with fewer samples the minority class; the majority class can be far larger than the minority class. In imbalanced classification problems, recognizing the minority-class samples is usually the emphasis of the classification: among the products on a production line, most are qualified and only a small fraction are defective, and a traditional classifier whose recognition rate on defective products is very low cannot truly fulfil the purpose of detecting them. How to improve classifier performance on imbalanced problems, raising the minority-class recognition rate without harming majority-class classification accuracy, is therefore an urgent problem to be solved.
Research on classifying imbalanced data sets falls into two broad lines. The first works at the algorithm level, modifying existing algorithms so that the classification is biased towards the minority class; a typical example is the cost-sensitive support vector machine, which improves the minority class's classification accuracy by assigning minority-class samples a higher weight. The second works at the data level, preprocessing the imbalanced data set with sampling techniques so that the training set contains roughly equal numbers of minority- and majority-class samples.
Sampling techniques divide into oversampling and undersampling. Oversampling increases the number of minority-class samples by simple duplication or by heuristic methods; typical examples are random oversampling and the SMOTE (Synthetic Minority Over-sampling Technique) algorithm. SMOTE constructs new sample points by random interpolation between a given minority-class sample point and its K nearest neighbors, which improves imbalanced-classification performance to some extent. But neither random oversampling nor SMOTE follows the underlying distribution of the data itself: when the generated samples are inconsistent with the original distribution, noise is inevitably introduced, which not only invites overfitting but also increases algorithmic complexity, making these methods ill-suited to the current trend towards big data.
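As a concrete illustration of the interpolation idea behind SMOTE (a minimal NumPy sketch of the interpolation step only, not the full published algorithm; the function name and parameters are illustrative):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (the core step of SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # indices of the k nearest neighbours of each minority point
    nn = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                      # random minority point
        j = nn[i, rng.integers(min(k, n - 1))]   # one of its neighbours
        gap = rng.random()                       # interpolation factor in [0, 1]
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because each synthetic point lies on the segment between two real minority samples, it can fall in regions where the true minority distribution has no support, which is the noise-introduction problem the passage describes.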
Undersampling reduces the number of majority-class samples by deleting some majority-class points; typical examples are random undersampling and the OSS (One Side Selection) algorithm. OSS divides the majority-class samples into noise, boundary, redundant, and safe samples, and removes the noise and boundary points using the Tomek-links technique to reduce the majority-class sample count. Because it removes sample points, undersampling lowers algorithmic complexity and shortens training time; however, deleting majority-class samples risks losing representative majority-class information and shifting the classification surface.
The content of the invention
The main object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an imbalanced-data classification method based on cluster-based undersampling that improves the recognition rate of the minority-class samples while preserving majority-class classification accuracy, thereby improving classification performance on imbalanced data sets.
The principle of the invention is as follows. The support vector machine is a classifier that depends heavily on its support vectors; exploiting this key property, the invention proposes an imbalanced-data classification method based on cluster-based undersampling. First, the majority class is divided into different clusters by the clustering-by-fast-search-and-find-of-density-peaks algorithm. Then each majority-class cluster is combined with the minority-class sample points to build a training set; a support vector machine is trained on it to obtain that cluster's support vectors. All support vectors of all clusters are retained and the non-support-vectors are deleted, yielding a new set of majority-class sample points and thus a relatively balanced data set. Finally a support vector machine is used to classify the newly obtained data set.
The present invention uses following technical scheme:
An imbalanced-data classification method based on cluster-based undersampling, comprising the following steps:
(1) Divide the imbalanced data set into a training set and a cross-validation set;
(2) Extract the majority-class samples and the minority-class samples from the training set;
(3) Cluster the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters;
(4) Combine each cluster of majority-class samples in the training set with the minority-class samples of the training set to form a new sample set, classify it with a support vector machine, and obtain the support vectors of the majority-class samples in the training set;
(5) Gather the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set;
(6) Train a support vector machine on the new training set and evaluate performance with the cross-validation set.
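The six steps above can be sketched end to end as follows (a minimal illustration using scikit-learn; k-means stands in for the density-peaks clustering of step (3), labels are assumed to be 0 for the majority class and 1 for the minority class, and the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans   # stand-in for density-peaks clustering
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def cluster_undersample_svm(X, y, n_clusters=3, random_state=0):
    """Steps (1)-(6): split, cluster the majority class, keep only the
    majority-class support vectors of each (cluster + minority) SVM, and
    train the final classifier on the reduced training set.
    Assumes label 0 = majority class, label 1 = minority class."""
    # (1) split into training and cross-validation parts
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=random_state)
    # (2) separate majority and minority samples
    X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
    # (3) cluster the majority class into N clusters
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X_maj)
    # (4) per cluster: train an SVM on cluster + minority samples and
    #     keep only the majority-class support vectors
    kept = []
    for c in range(n_clusters):
        Xc = X_maj[labels == c]
        Xs = np.vstack([Xc, X_min])
        ys = np.r_[np.zeros(len(Xc)), np.ones(len(X_min))]
        svm = SVC(kernel="rbf", C=1.0).fit(Xs, ys)
        sv = svm.support_[svm.support_ < len(Xc)]  # indices into Xc only
        kept.append(Xc[sv])
    # (5) new training set: all retained support vectors + minority samples
    X_new = np.vstack(kept + [X_min])
    y_new = np.r_[np.zeros(sum(len(k) for k in kept)), np.ones(len(X_min))]
    # (6) train the final SVM; evaluate on the held-out part
    clf = SVC(kernel="rbf", C=1.0).fit(X_new, y_new)
    return clf, (X_te, y_te)
```

The reduced training set contains only the majority points that influenced some cluster's decision surface, which is the mechanism the steps above describe.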
Further, in step (1), the ratio of the training set to the cross-validation set can be allocated as needed; ten-fold cross-validation is typical, i.e. the data set is divided into ten parts, of which nine are used as the training set and one as the test set.
Further, in step (3), the clustering algorithm is implemented as follows: 1) compute the local density ρ_i of each majority-class sample point x_i from its definition; 2) sort the points by ρ_i in descending order; 3) for each point in that order, compute its distance δ_i from the nearest-higher-density-point distance formula; 4) draw the decision graph of ρ_i against δ_i and select the cluster centers, taken as the sample points for which both ρ_i and δ_i are large; 5) once the cluster centers are obtained, assign the remaining points to the clusters of their centers. The local density is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance from majority-class sample point x_i to the other points, and d_c is a distance threshold. The nearest-higher-density-point distance is defined as δ_i = min_{j: ρ_j > ρ_i} d_ij, i.e. the distance from x_i to the closest sample point whose density is higher than that of x_i.
Further, in step (4), when obtaining the support vectors of each cluster of majority-class samples in the training set, the number of support vectors can be controlled by adjusting the penalty parameter C and the kernel parameter of the support vector machine. The support vectors play the decisive role in a support vector machine's classification and carry the important information of the majority class; retaining each cluster's support vectors therefore retains the most informative majority-class samples, while the majority-class points that are not support vectors are discarded, achieving the goal of reducing the number of majority-class sample points.
Further, in step (5), preferably, the union of the support vectors of all clusters should be close in size to the number of minority-class samples in the training set.
Further, in step (6), preferably, classification performance can be assessed with the geometric mean accuracy G-mean and with the F-measure, the harmonic mean of the minority class's precision and recall.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The complexity of the support vector machine is O(N³), where N is the number of training samples. The undersampling method used by the invention reduces the training-sample scale; compared with conventional oversampling methods (such as random oversampling and the SMOTE algorithm), it shortens training time and better suits the current trend towards big data.
(2) The technical scheme provided by the invention exploits the decisive role that support vectors play in a support vector machine classifier: the majority class is divided into several clusters by a clustering algorithm, and from each cluster the support vectors, the samples decisive for classification, are extracted and retained as the informative samples. Compared with other undersampling methods (such as random undersampling and the OSS algorithm), this better preserves the information of the majority-class samples.
(3) In the invention, all of the training set's relatively few minority-class samples take part in the classification training process, which guarantees the minority class's role in classification, raises the contribution of the minority-class samples, and strengthens the classifier's overall performance.
Brief description of the drawings
Fig. 1 is a block diagram of the implementation of the method of the invention;
Fig. 2 is a schematic diagram of the relations among the original support-vector classification surface, the ideal classification surface, the two classes of sample points, and the support vectors in the invention;
Fig. 3 is a schematic diagram of the three-way relation among the support-vector classification surface after cluster-based undersampling, the two classes of sample points, and the support vectors in the invention;
Fig. 4 compares the F-measure values of the data sets under different methods;
Fig. 5 compares the G-mean values of the data sets under different methods.
Embodiment
The present invention is described in further detail below with reference to embodiments and the accompanying drawings, but the embodiments of the invention are not limited thereto. To avoid obscuring the invention, common techniques such as support vector machine theory are not described in detail.
The specific implementation steps of the imbalanced-data classification method based on cluster-based undersampling provided by the invention are as follows:
(1) Divide the imbalanced data set into a training set and a cross-validation set, written D = Tr ∪ Te, where D is the imbalanced data set, Tr is the training set, and Te denotes the cross-validation set. The ratio of the training set to the cross-validation set can be allocated as needed; ten-fold cross-validation is typical, i.e. the data set is divided into ten parts, of which nine are used as the training set and one as the test set.
(2) Extract the majority-class samples Ma and the minority-class samples Mi from the training set Tr.
(3) Cluster the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters.
The clustering-by-fast-search-and-find-of-density-peaks algorithm rests on two assumptions: 1) a cluster center is surrounded by neighbor points with lower local density, and 2) it lies at a relatively large distance from any point with higher density.
The local density is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance from majority-class sample point x_i to the other points, and d_c is a distance threshold; in this example, d_c is taken as 0.01.
The nearest-higher-density-point distance is defined as δ_i = min_{j: ρ_j > ρ_i} d_ij, i.e. the distance from sample point x_i to the closest sample point whose density is higher than that of x_i.
The clustering algorithm in the invention is implemented as follows:
1) compute the local density ρ_i of every point from its definition;
2) sort the points by ρ_i in descending order;
3) for each point in that order, compute its distance δ_i from the nearest-higher-density-point distance formula;
4) draw the decision graph of ρ_i against δ_i and select the cluster centers;
5) once the cluster centers are obtained, assign the remaining points to the clusters of their centers.
The clustering algorithm involved in the invention only needs to compute the distances once; no iterative computation is required.
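A minimal NumPy sketch of steps 1)-5), under the assumption that the cluster centers are taken automatically as the points with the largest ρ_i·δ_i product rather than read off a decision graph by hand:

```python
import numpy as np

def density_peaks(X, d_c, n_clusters):
    """Density-peaks clustering: one distance computation, no iteration."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # 1) local density: rho_i = sum_j chi(d_ij - d_c), chi(x) = 1 if x < 0
    rho = (d < d_c).sum(axis=1) - 1          # "- 1" excludes the point itself
    # 2) sort by density, descending
    order = np.argsort(-rho)
    # 3) delta_i: distance to the nearest point of higher (or equal, earlier-
    #    ranked) density; the densest point gets the maximum distance
    delta = np.full(n, d.max())
    nn_higher = np.arange(n)
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]                # all points denser than point i
        j = higher[np.argmin(d[i, higher])]
        delta[i], nn_higher[i] = d[i, j], j
    # 4) cluster centers: the points with the largest rho * delta
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # 5) assign the rest, in decreasing density, to the cluster of their
    #    nearest denser neighbour
    for i in order:
        if labels[i] < 0:
            labels[i] = labels[nn_higher[i]]
    return labels
```

The single pairwise-distance computation up front and the absence of any convergence loop are what the sentence above refers to.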
(4) Combine each cluster of majority-class samples in the training set with the minority-class samples of the training set to form a new sample set, classify it with a support vector machine, and obtain the support vectors of the majority-class samples in the training set. The support vectors play the decisive role in a support vector machine's classification and carry the important information of the majority class; retaining each cluster's support vectors retains the most informative majority-class samples, while the majority-class points that are not support vectors are weeded out, reducing the number of majority-class sample points. When obtaining the support vectors of each cluster of majority-class samples in the training set, the number of support vectors can be controlled by adjusting the penalty parameter C and the kernel parameter of the support vector machine.
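The claim that the number of support vectors can be steered through the penalty parameter C (and the kernel parameter) can be checked directly with scikit-learn's `SVC` (synthetic two-class data; the specific parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping classes: majority around 0, minority around 2
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(2, 1, (20, 2))])
y = np.r_[np.zeros(80), np.ones(20)]

# count support vectors for increasing penalty C: a small C allows a wide,
# soft margin that keeps many training points as (bounded) support vectors,
# while a large C narrows the margin and keeps fewer
n_sv = {C: len(SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y).support_)
        for C in (0.01, 1.0, 100.0)}
```

A smaller C thus retains more majority points after the undersampling step, while a larger C prunes the clusters more aggressively.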
(5) Gather the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set. By the nature of the support vector machine, the number of support vectors is smaller than the number of samples contained in each cluster. Preferably, the union of the support vectors of all clusters should be close in size to the number of minority-class samples in the training set.
(6) Train a support vector machine on the new training set and evaluate performance with the cross-validation set. Preferably, classification performance is assessed with the geometric mean accuracy G-mean and with the F-measure, the harmonic mean of the minority class's precision and recall. Both G-mean and F-measure are built on the confusion matrix. G-mean takes the accuracies of both classes into account and can be used to evaluate the overall classification performance of the system: the larger the G-mean value, the better the system classifies overall. In an imbalanced system, the F-measure is used to evaluate the classification performance on the minority-class samples: the larger the F-measure value, the better the minority-class samples are classified.
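The two evaluation measures can be computed from the confusion matrix as follows (a plain-Python sketch with label 1 taken as the minority, i.e. positive, class):

```python
def gmean_fmeasure(y_true, y_pred):
    """G-mean and minority-class F-measure from the confusion matrix,
    with label 1 as the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority-class accuracy
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority-class accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = (recall * specificity) ** 0.5             # geometric mean
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)       # harmonic mean
    return g_mean, f_measure
```

G-mean balances the per-class accuracies, so a classifier that ignores the minority class scores near zero even when its overall accuracy is high.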
The embodiment is illustrated below through a practical scenario.
The embodiment tests two data sets with very different imbalance ratios, both taken from the UCI machine learning repository of the University of California, Irvine. The first is the Haberman's Survival Data Set, which contains the University of Chicago hospital's judgements on the survival or death of breast-cancer patients operated on between 1958 and 1970, i.e. a two-class problem; it has 306 samples, each described by 3 attributes: the patient's age at the time of the operation, the year of the patient's operation, and the number of positive axillary nodes detected. The second data set is Letter Recognition, in which each sample is one of the 26 letters of the English alphabet rendered in black-and-white pixels, i.e. the number of classes is 26; it has 20,000 samples in total, and each letter is converted into 16 numerical features, so the attributes are 16-dimensional. In this embodiment, the details of the two data sets are shown in Table 1, where the imbalance ratio is the ratio of the majority class to the minority class.
Table 1. Data sets

Data set    Samples   Attributes   Majority/Minority   Imbalance ratio
Haberman    306       3            225/81              2.78
Letter      20000     16           19266/734           26.25
It should be noted that, to simplify the experiment, the Letter data set is converted into a two-class problem: the 734 samples of the letter Z are taken as the minority class, and the remaining letters are merged into the majority class.
In the embodiment, the data set is divided randomly while ensuring that the imbalance ratio does not change in the partitioning; the training set is 90% of the full sample set and the test set is 10%.
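A random split that keeps the imbalance ratio unchanged can be done by drawing the test fraction from each class separately (a minimal NumPy sketch; the function name is illustrative):

```python
import numpy as np

def stratified_split(X, y, test_frac=0.1, rng=None):
    """Random 90/10 split that preserves the imbalance ratio by sampling
    the test fraction from each class separately."""
    rng = np.random.default_rng(rng)
    test_idx = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        # take the same fraction of every class into the test set
        test_idx.extend(idx[:max(1, int(round(test_frac * len(idx))))])
    mask = np.zeros(len(y), bool)
    mask[test_idx] = True
    return X[~mask], y[~mask], X[mask], y[mask]
```

Because each class contributes the same fraction, the majority/minority ratio in both resulting parts matches the full data set's ratio.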
The embodiment compares the method proposed by the invention with direct support-vector-machine classification (SVM) and with support-vector-machine classification after random undersampling (RUS+SVM). The experimental results are as follows:
Table 2. F-measure values of the data sets under different methods

Data set    SVM     RUS+SVM   Proposed method
Haberman    0.627   0.612     0.635
Letter      0.576   0.581     0.594
Table 3. G-mean values of the data sets under different methods

Data set    SVM     RUS+SVM   Proposed method
Haberman    0.683   0.677     0.691
Letter      0.607   0.615     0.627
The results show that the proposed method holds an advantage on data sets of different imbalance ratios: the improvements in F-measure and G-mean indicate that the proposed method not only improves the overall classification performance on imbalanced data but also improves the classification accuracy of the minority class.

Claims (6)

1. An imbalanced-data classification method based on cluster-based undersampling, characterized by comprising the following steps:
(1) dividing the imbalanced data set into a training set and a cross-validation set;
(2) extracting the majority-class samples and the minority-class samples from the training set;
(3) clustering the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters;
(4) combining each cluster of majority-class samples in the training set with the minority-class samples of the training set to form a new sample set, classifying it with a support vector machine, and obtaining the support vectors of the majority-class samples in the training set;
(5) gathering the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set;
(6) training a support vector machine on the new training set and evaluating performance with the cross-validation set.
2. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (1) the ratio of the training set to the cross-validation set is allocated as needed, using ten-fold cross-validation, i.e. the data set is divided into ten parts, of which nine are used as the training set and one as the test set.
3. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (3) the clustering algorithm is implemented as follows: 1) compute the local density ρ_i of each majority-class sample point x_i from its definition; 2) sort the points by ρ_i in descending order; 3) for each point in that order, compute its distance δ_i from the nearest-higher-density-point distance formula; 4) draw the decision graph of ρ_i against δ_i and select the cluster centers, taken as the sample points with larger ρ_i and δ_i values; 5) assign the remaining sample points to the clusters of their centers; the local density is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance from majority-class sample point x_i to the other points, and d_c is a distance threshold; the nearest-higher-density-point distance is defined as δ_i = min_{j: ρ_j > ρ_i} d_ij.
4. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (4), when obtaining the support vectors of each cluster of majority-class samples in the training set, the number of support vectors is controlled by adjusting the penalty parameter C and the kernel parameter of the support vector machine; the support vectors play the decisive role in the support vector machine's classification and carry the important information of the majority class; retaining the support vectors of each cluster retains the most informative majority-class samples, while the majority-class points that are not support vectors are weeded out, achieving the goal of reducing the number of majority-class sample points.
5. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (5) the union of the support vectors of all clusters should be close in size to the number of minority-class samples in the training set.
6. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (6) the standard for assessing classification performance is the geometric mean accuracy G-mean together with the F-measure, the harmonic mean of the minority class's precision and recall.
CN201710784810.4A 2017-09-04 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling Pending CN107688831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710784810.4A CN107688831A (en) 2017-09-04 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710784810.4A CN107688831A (en) 2017-09-04 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling

Publications (1)

Publication Number Publication Date
CN107688831A true CN107688831A (en) 2018-02-13

Family

ID=61155779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710784810.4A Pending CN107688831A (en) 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling

Country Status (1)

Country Link
CN (1) CN107688831A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875365A (en) * 2018-04-22 2018-11-23 北京光宇之勋科技有限公司 A kind of intrusion detection method and intrusion detection detection device
CN108875365B (en) * 2018-04-22 2023-04-07 湖南省金盾信息安全等级保护评估中心有限公司 Intrusion detection method and intrusion detection device
CN108629633A (en) * 2018-05-09 2018-10-09 浪潮软件股份有限公司 A kind of method and system for establishing user's portrait based on big data
US20210158078A1 (en) * 2018-09-03 2021-05-27 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
US11941087B2 (en) * 2018-09-03 2024-03-26 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
CN109360206A (en) * 2018-09-08 2019-02-19 华中农业大学 Crop field spike of rice dividing method based on deep learning
CN109490704A (en) * 2018-10-16 2019-03-19 河海大学 A kind of Fault Section Location of Distribution Network based on random forests algorithm
CN109783586B (en) * 2019-01-21 2022-10-21 福州大学 Water army comment detection method based on clustering resampling
CN109783586A (en) * 2019-01-21 2019-05-21 福州大学 Waterborne troops's comment detection system and method based on cluster resampling
CN109871901A (en) * 2019-03-07 2019-06-11 中南大学 A kind of unbalanced data classification method based on mixing sampling and machine learning
US11954685B2 (en) 2019-03-07 2024-04-09 Sony Corporation Method, apparatus and computer program for selecting a subset of training transactions from a plurality of training transactions
CN111080442A (en) * 2019-12-21 2020-04-28 湖南大学 Credit scoring model construction method, device, equipment and storage medium
CN113936185A (en) * 2021-09-23 2022-01-14 杭州电子科技大学 Software defect data self-adaptive oversampling method based on local density information

Similar Documents

Publication Publication Date Title
CN107688831A (en) An imbalanced-data classification method based on cluster-based undersampling
Li et al. Adaptive multi-objective swarm fusion for imbalanced data classification
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
CN104866572B (en) A kind of network short text clustering method
CN108304884A (en) A kind of cost-sensitive stacking integrated study frame of feature based inverse mapping
CN105389480B (en) Multiclass imbalance genomics data iteration Ensemble feature selection method and system
CN105760889A (en) Efficient imbalanced data set classification method
CN108509982A (en) A method of the uneven medical data of two classification of processing
CN106537422A (en) Systems and methods for capture of relationships within information
CN105069470A (en) Classification model training method and device
CN105184316A (en) Support vector machine power grid business classification method based on feature weight learning
CN109284626A (en) Random forests algorithm towards difference secret protection
CN107122382A (en) A kind of patent classification method based on specification
CN107832412B (en) Publication clustering method based on literature citation relation
CN110991653A (en) Method for classifying unbalanced data sets
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN106156372A (en) The sorting technique of a kind of internet site and device
Dubey et al. A systematic review on k-means clustering techniques
CN105893876A (en) Chip hardware Trojan horse detection method and system
CN106228554A (en) Fuzzy coarse central coal dust image partition methods based on many attribute reductions
CN105938523A (en) Feature selection method and application based on feature identification degree and independence
CN110533116A (en) Based on the adaptive set of Euclidean distance at unbalanced data classification method
CN105389471A (en) Method for reducing training set of machine learning
CN102629272A (en) Clustering based optimization method for examination system database
Devi et al. A relative evaluation of the performance of ensemble learning in credit scoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180213