CN113469251A - Method for classifying unbalanced data - Google Patents

Method for classifying unbalanced data

Info

Publication number
CN113469251A
CN113469251A
Authority
CN
China
Prior art keywords: data, marked, sample, samples, classifying
Prior art date
Legal status
Pending
Application number
CN202110748670.1A
Other languages
Chinese (zh)
Inventor
赵正旦
章韵
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2021-10-01
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110748670.1A
Publication of CN113469251A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a method for classifying unbalanced data, belonging to the technical field of machine learning, that combines an active learning method with an oversampling method. The unbalanced data comprise marked data and unmarked data, and the method specifically comprises the following steps: preprocessing the marked data and calculating distance features to obtain an initial training set; training on the initial training set to obtain an initial classifier; calculating the uncertainty of the unmarked data with the initial classifier; sorting the unmarked data by uncertainty and manually marking them to obtain a marked data set; performing probability oversampling on the marked data set to obtain a balanced data set; and training on the balanced data set to obtain a classifier for classifying the unbalanced data. By combining active learning with oversampling, the method reduces the number of samples participating in training while ensuring that the classifier attains high classification precision on both majority-class and minority-class data.

Description

Method for classifying unbalanced data
Technical Field
The invention relates to a classification method of unbalanced data, and belongs to the field of machine learning.
Background
At present, research on the data imbalance problem proceeds mainly at three levels, data preprocessing, features, and the classification algorithm, with the shared goal of ensuring that the classifier attains high classification precision on both majority-class and minority-class data. At the data preprocessing level, imbalance is reduced or eliminated by changing the sample distribution of the training set, using a range of undersampling and oversampling techniques. At the feature level, imbalance in the sample count distribution is usually accompanied by imbalance in the distribution of feature attributes, so feature selection methods are used to pick discriminative features and thereby improve classification precision on the minority classes. At the classification algorithm level, the algorithm is improved to raise the recognition rate of minority-class samples, guided by the algorithm's shortcomings on imbalance problems and the characteristics of unbalanced data; typical methods include ensemble learning, cost-sensitive learning, and one-class learning.
The main idea of active learning is to introduce interactive capability into the training process: in each iteration, the best candidate sample is actively selected and added to the training set, reducing the number of samples participating in training and saving computation. By actively choosing which samples to learn from as learning progresses, it breaks with the traditional approach of passively learning from a sample set whose labels are already known. Such an algorithm can effectively reduce the number of samples that must be evaluated, improve the prediction accuracy of the initial classifier, and actively screen useful samples while preserving most of the useful information. Active learning avoids a large amount of manual marking work and alleviates the slow learning and large memory footprint caused by very large training sets.
Active learning sample selection strategies fall mainly into stream-based and pool-based sample selection. Pool-based selection criteria chiefly include the uncertainty criterion, the version-space reduction criterion, and the generalization-error reduction criterion. Under the uncertainty criterion, the degree of uncertainty is represented mainly by probability or by distance. Sample selection based on version-space reduction aims to choose samples that shrink the version space as much as possible, where the version space refers to the set of candidate reference classifiers of different types; query-by-committee is a typical algorithm based on this criterion. The generalization error of a classifier is a common index of its robustness, and sample selection based on the generalization-error reduction criterion ultimately aims to reduce the classifier's generalization error.
In machine learning, the sample imbalance problem refers to an unbalanced class distribution. When a conventional algorithm is applied to such data, the classification result is often biased toward the majority class, so the minority class cannot be correctly identified: most conventional algorithms train the classifier to maximize overall accuracy, which downplays the influence of minority-class samples and causes them to be misclassified. Yet in many practical problems the minority class carries more information and is of greater value than the majority class. The unbalanced data classification problem arises widely in biomedicine, finance, information security, industry, computer vision, and other fields.
Disclosure of Invention
The invention aims to provide a method for classifying unbalanced data that reduces the number of training samples, lowers the misclassification rate on the minority class, and improves classification precision.
In order to achieve the above object, the present invention provides a method for classifying unbalanced data, comprising an active learning method and an oversampling method, where the unbalanced data includes a first type of data and a second type of data, and the first type of data and/or the second type of data includes marked data and unmarked data. The specific steps are:
step 1, preprocessing marked data, and calculating distance features to obtain an initial training set;
step 2, training the initial training set to obtain an initial classifier;
step 3, calculating the uncertainty of the unmarked data by using the initial classifier;
step 4, sorting the unmarked data according to the uncertainty, and manually marking to obtain a marked data set;
step 5, carrying out probability oversampling on the marked data set by using an oversampling method to obtain a balanced data set;
step 6, training the balanced data set to obtain a classifier for classifying the unbalanced data.
As a further improvement of the present invention, the active learning method is a sample selection mode based on an uncertainty sampling strategy. The oversampling method is specifically as follows: the features of the samples comprise discrete features and continuous features; the continuous features are fitted with the EM algorithm, using the Akaike Information Criterion (AIC) to select the model, to obtain a Gaussian mixture distribution model P, the conditional distribution function of each feature given the other features is calculated, and new samples are obtained by Gibbs sampling; for the discrete features, the frequency with which each discrete feature value occurs in the first type of data is first counted, and new samples are then generated randomly according to the corresponding frequencies.
As a further improvement of the present invention, the preprocessing in step 1 is: calculating an internal distance between the marked data and the unmarked data, where the internal distance is calculated according to the following formula:

$$Dis_{inner}(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

where $n$ is the dimension of the data, and $x_i$ and $z_i$ respectively denote the $i$-th dimension feature value of the unmarked data and the marked data.
As a further improvement of the present invention, the minimum value of the internal distance is the distance feature. All distance features are calculated over all samples of the unmarked data and the marked data and arranged in ascending order; the first t samples with the smallest distance features, together with the marked data, form the initial training set. The calculation formula of the distance feature is:

$$feature\_dis(x) = \min_{z \in B} Dis_{inner}(x, z), \quad x \in A$$

where $z$ ranges over all samples of the marked data.
As a further improvement of the present invention, step 2 is specifically: training the initial training set with a support vector machine to obtain the initial classifier.
As a further improvement of the present invention, step 3 is specifically: classifying the unmarked data with the initial classifier to obtain the probability $p(y_i|x_i)$ that sample $x_i$ belongs to category $y_i$; from $p(y_i|x_i)$, the information entropy is calculated, and the information entropy is the uncertainty. The calculation formula of the information entropy is:

$$ENT = \arg\max_{x_i \in U} \left( -\sum_{y_i \in Y} p(y_i|x_i) \log p(y_i|x_i) \right)$$
as a further improvement of the invention, the sample x is judged according to the optimal label and suboptimal label criterioniThe calculation formula of the optimal label and suboptimal label criterion is as follows:
Figure BDA0003145285990000042
wherein, p (y)best|xi) And p (y)second_best|xi) Are respectively a sample xiThe optimal classification probability and the suboptimal classification probability.
As a further improvement of the present invention, step 4 is specifically: arranging the unmarked data in descending order of uncertainty, manually marking the sample with the greatest uncertainty, adding the marked sample to the initial training set to retrain the initial classifier, and stopping training once the initial classifier reaches a performance threshold, thereby obtaining the marked data set.
As a further improvement of the present invention, step 5 is specifically: representing the true distribution of the marked data set with a Gaussian mixture model and performing probability oversampling to obtain the balanced data set, where the distribution probability density of the Gaussian mixture model is:

$$P(x) = \sum_{l=1}^{L} \omega_l \, N(x \mid \mu_l, \Sigma_l)$$

where $\omega_l$, $l = 1, 2, \ldots, L$, are the weighting coefficients and satisfy $\sum_{l=1}^{L} \omega_l = 1$; $\mu_l$ is the mean of the $l$-th Gaussian component; $\Sigma_l$ is its variance; and $N(x \mid \mu_l, \Sigma_l)$ is the $l$-th Gaussian probability distribution, with expression:

$$N(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{n/2} |\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_l)^{\mathsf{T}} \Sigma_l^{-1} (x - \mu_l) \right)$$
as a further improvement of the present invention, the probability oversampling specifically includes: and circularly using the oversampling method for the samples in the marked data set until s new samples are generated, and balancing the first type data and the second type data to obtain the balanced data set.
The invention has the beneficial effects that: the unbalanced data classification method combines active learning with an oversampling method. Active learning is realized through a BvSB-based uncertainty sample selection method, which reduces the number of training samples and saves computation. An oversampling method based on probability selection then balances the unbalanced data set, reducing the misclassification rate on the minority class and ensuring that the classifier attains high classification precision on both majority-class and minority-class data.
Drawings
FIG. 1 is a flow chart of an unbalanced data classification method of the present invention.
FIG. 2 is a flow chart of the active learning method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, the present invention provides a classification method for unbalanced data, where the classification method includes an active learning method and an oversampling method, the unbalanced data includes a first type of data and a second type of data, and the first type of data and/or the second type of data includes marked data and unmarked data.
Unbalanced data means data whose classes are unbalanced, that is, the number of first type data samples differs greatly from the number of second type data samples. In this embodiment, the first type data account for the smaller proportion of the unbalanced data and the second type data for the larger proportion; that is, the first type data is the minority class and the second type data is the majority class.
The active learning method comprises an initial training set selection strategy adopted for the current unbalanced data and is a sample selection mode based on an uncertainty sampling strategy, which reduces the number of training samples and saves computation.
The oversampling method is a probability-based oversampling method, specifically as follows: the features of the samples comprise discrete features and continuous features. The continuous features are fitted with the EM algorithm, using the Akaike Information Criterion (AIC) to select the model, to obtain a Gaussian mixture distribution model P; the conditional distribution function of each feature given the other features is calculated, and new samples are obtained by Gibbs sampling. For the discrete features, the frequency with which each discrete feature value occurs in the minority class is first counted, and new samples are then generated randomly according to the corresponding frequencies. This ensures that the classifier attains high classification precision on both majority-class and minority-class data.
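To illustrate the discrete-feature branch, the following is a minimal sketch, not the patented implementation (the function name and the `n_new` parameter are illustrative assumptions): it counts the empirical frequency of each discrete feature value within the minority class and draws new values accordingly.

```python
import numpy as np

def sample_discrete_features(X_minority, n_new, rng=None):
    """Generate n_new synthetic rows for discrete features by sampling each
    column independently from its empirical value frequencies in the minority class."""
    rng = np.random.default_rng(rng)
    n_features = X_minority.shape[1]
    new_rows = np.empty((n_new, n_features), dtype=X_minority.dtype)
    for j in range(n_features):
        values, counts = np.unique(X_minority[:, j], return_counts=True)
        freqs = counts / counts.sum()                  # frequency of each value
        new_rows[:, j] = rng.choice(values, size=n_new, p=freqs)
    return new_rows
```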
Referring to fig. 2, the active learning method is specifically: first, an initial classifier S is trained on the marked sample set L. Using the initial classifier S and a query algorithm q that evaluates the information content of samples, the information of each sample in the unmarked sample set U is evaluated; the sample carrying the largest amount of information is selected from U and submitted to a human annotator T for manual marking, and the manually marked sample is then added to the marked sample set L to optimize the initial classifier S.
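This loop can be sketched as follows; it is a minimal sketch assuming a scikit-learn `SVC` as the classifier S and an `oracle_label` callback standing in for the human annotator T (both illustrative assumptions), with predictive entropy playing the role of the query algorithm q:

```python
import numpy as np
from sklearn.svm import SVC

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label, n_queries=50):
    """Iteratively query the most informative pool sample, have it labeled,
    and retrain: a sketch of the loop in Fig. 2."""
    X_lab, y_lab = X_labeled.copy(), list(y_labeled)
    pool = list(range(len(X_pool)))                    # indices into the unmarked set U
    clf = SVC(probability=True).fit(X_lab, y_lab)      # initial classifier S
    for _ in range(n_queries):
        if not pool:
            break
        proba = clf.predict_proba(X_pool[pool])        # p(y|x) for remaining pool
        entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
        best = pool.pop(int(np.argmax(entropy)))       # sample with most information
        X_lab = np.vstack([X_lab, X_pool[best]])
        y_lab.append(oracle_label(X_pool[best]))       # manual marking by T
        clf = SVC(probability=True).fit(X_lab, y_lab)  # optimize S with enlarged L
    return clf
```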
The unbalanced data classification method of the invention comprises the following steps:
step 1, preprocessing marked data, and calculating distance features to obtain an initial training set.
Calculate the internal distance between the marked data and the unmarked data according to the following formula:

$$Dis_{inner}(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

where $n$ is the dimension of the data, and $x_i$ and $z_i$ are respectively the $i$-th dimension feature values of the unmarked data (A) and the marked data (B).
For all samples in the unmarked data and the marked data, calculate the distance feature of each sample x, arrange all samples in ascending order of distance feature, and select the first t samples with the smallest distance features together with the marked data to form the initial training set.

Among all the non-homogeneous point pairs formed by the samples, the minimum internal distance is the distance feature; when x belongs to A, its calculation formula is:

$$feature\_dis(x) = \min_{z \in B} Dis_{inner}(x, z)$$

where $z$ ranges over all samples of the marked data (B).
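A minimal sketch of this selection, assuming the Euclidean form of the internal distance given above (array names and the parameter t are illustrative):

```python
import numpy as np

def select_initial_samples(X_unlabeled, X_labeled, t):
    """Distance feature of each unmarked sample x in A: the minimum internal
    distance to any marked sample z in B. Returns the indices of the t unmarked
    samples with the smallest distance features; together with the marked data
    they form the initial training set."""
    diffs = X_unlabeled[:, None, :] - X_labeled[None, :, :]   # shape (|A|, |B|, n)
    dists = np.sqrt((diffs ** 2).sum(axis=2))                 # Dis_inner(x, z)
    feature_dis = dists.min(axis=1)                           # min over z in B
    return np.argsort(feature_dis)[:t]                        # ascending order, first t
```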
Step 2, training the initial training set to obtain an initial classifier.
Train on the initial training set obtained in step 1 with a support vector machine (SVM) to obtain the initial classifier $f_{first}$, which is used for the subsequent active learning sample selection.
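Assuming scikit-learn's `SVC` as the support vector machine (the text does not name a library, so this is an illustrative choice), a minimal sketch of this step with a toy stand-in for the initial training set:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for the initial training set produced in step 1 (illustrative).
rng = np.random.default_rng(0)
X_initial = rng.normal(size=(40, 5))
y_initial = np.array([0] * 30 + [1] * 10)            # unbalanced labels

# probability=True enables predict_proba, which the entropy and BvSB
# computations in step 3 require (Platt scaling is fitted internally).
f_first = SVC(kernel="rbf", probability=True).fit(X_initial, y_initial)
print(f_first.predict_proba(X_initial[:3]))          # p(y_i | x_i) per class
```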
Step 3, calculating the uncertainty of the unmarked data by using the initial classifier.
Classify all the unmarked data samples with the initial classifier obtained in step 2, obtaining the probability that sample $x_i$ belongs to category $y_i$, denoted $p(y_i|x_i)$. From $p(y_i|x_i)$, the information entropy (namely the uncertainty) is calculated as:

$$ENT = \arg\max_{x_i \in U} \left( -\sum_{y_i \in Y} p(y_i|x_i) \log p(y_i|x_i) \right)$$
where $\arg\max_{x_i \in U}$ denotes taking the maximum over the unmarked pool U. The higher a sample's information entropy, the fuzzier its class attribute, and the greater the value and amount of information it brings to the model, which helps improve the accuracy of the classifier.
In the multi-classification problem, the sample $x_i$ is judged according to the best-versus-second-best (BvSB) criterion, which considers only the two classes with the highest classification probability for the sample and ignores the other classification results. The calculation formula of the BvSB criterion is:

$$BvSB = \arg\min_{x_i \in U} \left( p(y_{best}|x_i) - p(y_{second\_best}|x_i) \right)$$

where $p(y_{best}|x_i)$ and $p(y_{second\_best}|x_i)$ are respectively the optimal and suboptimal classification probabilities of sample $x_i$.
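A corresponding sketch of the BvSB margin; the sample with the smallest margin between the best and second-best class probabilities is the one queried:

```python
import numpy as np

def bvsb_query(clf, X_pool):
    """BvSB margin per pooled sample: p(y_best|x_i) - p(y_second_best|x_i);
    the argmin identifies the sample to submit for manual marking."""
    proba = clf.predict_proba(X_pool)
    ordered = np.sort(proba, axis=1)              # ascending per row
    margin = ordered[:, -1] - ordered[:, -2]      # best minus second-best
    return margin, int(np.argmin(margin))         # argmin over x_i in U
```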
Step 4, sorting the unmarked data according to the uncertainty, and manually marking to obtain a marked data set.
Manually mark the sample with the largest amount of information (namely the sample with the largest information entropy) selected in step 3. Specifically, arrange the unmarked data in descending order of uncertainty and manually mark the sample with the greatest uncertainty. Add the marked sample to the initial training set and retrain the initial classifier on the updated training set, stopping once the initial classifier reaches the performance threshold; all samples then in the training set are the required training samples, namely the marked data set.
Step 5, performing probability oversampling on the marked data set by using the oversampling method to obtain a balanced data set.
Represent the true distribution of the marked data set obtained in step 4 with a Gaussian mixture model and perform probability oversampling to obtain the balanced data set. The Gaussian mixture model is an extension of the single Gaussian density function and can approximate a probability density of any shape; its parameters are obtained by weighting L single Gaussian models, and its distribution probability density expression is:

$$P(x) = \sum_{l=1}^{L} \omega_l \, N(x \mid \mu_l, \Sigma_l)$$

where $\omega_l$, $l = 1, 2, \ldots, L$, are the weighting coefficients and satisfy $\sum_{l=1}^{L} \omega_l = 1$; $\mu_l$ is the mean of the $l$-th Gaussian component; $\Sigma_l$ is its variance; and $N(x \mid \mu_l, \Sigma_l)$ is the $l$-th Gaussian probability distribution, with expression:

$$N(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{n/2} |\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_l)^{\mathsf{T}} \Sigma_l^{-1} (x - \mu_l) \right)$$
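A sketch of the mixture fit using scikit-learn's `GaussianMixture`, whose `fit` runs the EM algorithm and which exposes an `aic` score; the candidate component range is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_aic(X_marked, max_components=10, seed=0):
    """Fit Gaussian mixtures by EM for each candidate L and keep the model
    with the lowest Akaike Information Criterion."""
    best_model, best_aic = None, np.inf
    for L in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=L, covariance_type="full",
                              random_state=seed).fit(X_marked)   # EM fit
        aic = gmm.aic(X_marked)
        if aic < best_aic:
            best_model, best_aic = gmm, aic
    return best_model
```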
the probability oversampling specifically includes: and (4) circularly using the oversampling method for the samples in the marked data set until s new samples are generated, and balancing the first class data (minority class) and the second class data (majority class) to obtain a balanced data set.
Step 6, training the balanced data set to obtain a classifier for classifying the unbalanced data.
Train on the labeled balanced data set obtained in step 5 to obtain the final classifier $f_{final}$.
In summary, the invention provides a method for classifying unbalanced data that introduces interactive capability into the training process through active learning and selects samples by the uncertainty of the BvSB criterion, thereby reducing the number of training samples and saving computation. Meanwhile, an oversampling method added to the training process balances the unbalanced data set, reducing the misclassification rate on the minority class and ensuring that the classifier attains high classification precision on both majority-class and minority-class data.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (10)

1. A method for classifying unbalanced data, comprising: the method comprises an active learning method and an oversampling method, wherein the unbalanced data comprises first class data and second class data, and the first class data and/or the second class data comprise marked data and unmarked data, and the specific steps are as follows:
step 1, preprocessing marked data, and calculating distance features to obtain an initial training set;
step 2, training the initial training set to obtain an initial classifier;
step 3, calculating the uncertainty of the unmarked data by using the initial classifier;
step 4, sorting the unmarked data according to the uncertainty, and manually marking to obtain a marked data set;
step 5, carrying out probability oversampling on the marked data set by using an oversampling method to obtain a balanced data set;
step 6, training the balanced data set to obtain a classifier for classifying the unbalanced data.
2. The method of classifying unbalanced data according to claim 1, wherein: the active learning method is a sample selection mode based on an uncertainty sampling strategy; the oversampling method is specifically as follows: the features of the samples comprise discrete features and continuous features; the continuous features are fitted with the EM algorithm, using the Akaike Information Criterion (AIC) to select the model, to obtain a Gaussian mixture distribution model P, the conditional distribution function of each feature given the other features is calculated, and new samples are obtained by Gibbs sampling; for the discrete features, the frequency with which each discrete feature value occurs in the first type of data is first counted, and new samples are then generated randomly according to the corresponding frequencies.
3. The method for classifying unbalanced data according to claim 1, wherein the preprocessing in step 1 is: calculating an internal distance between the marked data and the unmarked data, where the internal distance is calculated according to the following formula:

$$Dis_{inner}(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$

where $n$ is the dimension of the data, and $x_i$ and $z_i$ respectively denote the $i$-th dimension feature value of the unmarked data and the marked data.
4. A method of classifying imbalance data according to claim 3, characterized in that: the minimum value of the internal distance is a distance feature, all distance features of each sample x are calculated for all samples of unmarked data and marked data, the distance features are arranged in a sequence from small to large, the first t samples with the minimum distance features and the marked data are selected to form the initial training set, and the calculation formula of the distance features is as follows:
$$feature\_dis(x) = \min_{z \in B} Dis_{inner}(x, z), \quad x \in A$$
where z is all samples with labeled data.
5. The method for classifying unbalanced data according to claim 1, wherein step 2 is specifically: training the initial training set with a support vector machine to obtain the initial classifier.
6. The method for classifying unbalanced data according to claim 1, wherein step 3 is specifically: classifying the unmarked data with the initial classifier to obtain the probability $p(y_i|x_i)$ that sample $x_i$ belongs to category $y_i$; from $p(y_i|x_i)$, the information entropy is calculated, and the information entropy is the uncertainty. The calculation formula of the information entropy is:

$$ENT = \arg\max_{x_i \in U} \left( -\sum_{y_i \in Y} p(y_i|x_i) \log p(y_i|x_i) \right)$$

where $\arg\max_{x_i \in U}$ denotes taking the maximum over the unmarked pool U.
7. The method of classifying unbalanced data according to claim 6, wherein: the sample $x_i$ is judged according to the best-versus-second-best (BvSB) label criterion, whose calculation formula is:

$$BvSB = \arg\min_{x_i \in U} \left( p(y_{best}|x_i) - p(y_{second\_best}|x_i) \right)$$

where $p(y_{best}|x_i)$ and $p(y_{second\_best}|x_i)$ are respectively the optimal and suboptimal classification probabilities of sample $x_i$.
8. The method for classifying unbalanced data according to claim 1, wherein step 4 is specifically: arranging the unmarked data according to the sequence of the uncertainty from large to small, manually marking the sample with the maximum uncertainty, adding the marked sample into the initial training set to train the initial classifier, and stopping training until the initial classifier reaches a threshold value to obtain a marked data set.
9. The method for classifying unbalanced data according to claim 2, wherein step 5 specifically comprises: representing the real distribution of the marked data set by using a mixed Gaussian model, and performing probability oversampling to obtain a balanced data set, wherein the distribution probability density expression of the mixed Gaussian model is as follows:
$$P(x) = \sum_{l=1}^{L} \omega_l \, N(x \mid \mu_l, \Sigma_l)$$

where $\omega_l$, $l = 1, 2, \ldots, L$, are the weighting coefficients and satisfy $\sum_{l=1}^{L} \omega_l = 1$; $\mu_l$ is the mean of the $l$-th Gaussian component; $\Sigma_l$ is its variance; and $N(x \mid \mu_l, \Sigma_l)$ is the $l$-th Gaussian probability distribution, with expression:

$$N(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{n/2} |\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_l)^{\mathsf{T}} \Sigma_l^{-1} (x - \mu_l) \right)$$
10. The method of classifying unbalanced data according to claim 9, wherein the probability oversampling is specifically: applying the oversampling method cyclically to the samples in the marked data set until s new samples are generated and the first type data and the second type data are balanced, obtaining the balanced data set.
CN202110748670.1A (filed 2021-07-02): Method for classifying unbalanced data, status Pending, published as CN113469251A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110748670.1A 2021-07-02 2021-07-02 Method for classifying unbalanced data (published as CN113469251A)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110748670.1A 2021-07-02 2021-07-02 Method for classifying unbalanced data (published as CN113469251A)

Publications (1)

Publication Number Publication Date
CN113469251A true CN113469251A (en) 2021-10-01

Family

ID=77877340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110748670.1A Pending CN113469251A (en) 2021-07-02 2021-07-02 Method for classifying unbalanced data

Country Status (1)

Country Link
CN (1) CN113469251A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150069424A (en) * 2013-12-13 2015-06-23 건국대학교 산학협력단 System and method for large unbalanced data classification based on hadoop for prediction of traffic accidents
CN104239516A (en) * 2014-09-17 2014-12-24 南京大学 Unbalanced data classification method
CN108154178A (en) * 2017-12-25 2018-06-12 北京工业大学 Semi-supervised support attack detection method based on improved SVM-KNN algorithms
AU2018101315A4 (en) * 2018-09-07 2018-10-11 Liu, Ruiqi Mr A Solution For Data Imbalance Classification Problem In Model Construction In Banking Industry
CN109492776A (en) * 2018-11-21 2019-03-19 哈尔滨工程大学 Microblogging Popularity prediction method based on Active Learning
CN111461855A (en) * 2019-01-18 2020-07-28 同济大学 Credit card fraud detection method and system based on undersampling, medium, and device
CN110222785A (en) * 2019-06-13 2019-09-10 重庆大学 Self-adapting confidence degree Active Learning Method for gas sensor drift correction
CN110443281A (en) * 2019-07-05 2019-11-12 重庆信科设计有限公司 Adaptive oversampler method based on HDBSCAN cluster
CN110569982A (en) * 2019-08-07 2019-12-13 南京智谷人工智能研究院有限公司 Active sampling method based on meta-learning
CN110516722A (en) * 2019-08-15 2019-11-29 南京航空航天大学 The automatic generation method of traceability between a kind of demand and code based on Active Learning
CN111368924A (en) * 2020-03-05 2020-07-03 南京理工大学 Unbalanced data classification method based on active learning
CN112069310A (en) * 2020-06-18 2020-12-11 中国科学院计算技术研究所 Text classification method and system based on active learning strategy
CN112508092A (en) * 2020-12-03 2021-03-16 上海云从企业发展有限公司 Sample screening method, system, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination