CN106372655A - Synthetic method for minority class samples in non-balanced IPTV data set

Info

Publication number: CN106372655A
Application number: CN201610753263.9A
Authority: CN
Inventors: 魏昕; 李智林; 周亮; 黄若尘; 刘榕华
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2016-08-26
Filing date: 2016-08-26
Publication date: 2017-02-01

Abstract

The invention discloses a method for synthesizing minority class samples in an imbalanced IPTV data set. It addresses a defect of existing minority class synthesis methods, which generate new samples directly without first analyzing the minority samples and thereby degrade the performance of the subsequent classification and prediction model. The method first finds the neighbor set of each minority class sample, then assigns each minority class sample to a noise set, a safe set, or a danger set according to the proportion of classes among its neighbors. Samples in the noise set are not processed. The ratio between the sizes of the safe set and the danger set is calculated and converted into a selection probability; the safe set or the danger set is selected according to that probability; and new minority class samples are generated from the samples in the selected set. The method removes the influence of minority class samples that harm classification, makes better use of minority class samples near the classification boundary, and yields new minority class samples that better improve the performance of the subsequent classification and prediction model.

Description

A method for synthesizing minority class samples in an imbalanced IPTV data set

Technical field

The present invention relates to the field of imbalanced data processing, and in particular to a method for synthesizing minority class samples in an imbalanced IPTV data set.

Background art

With the business transformation of domestic fixed-network operators, Internet-based value-added services have become an important part of operators' new revenue growth, and the Internet Protocol television (IPTV) service in particular has grown rapidly. IPTV has the following features: (1) users obtain high-quality digital media services; (2) users can freely select video programs over a broadband IP network; (3) it opens a broad emerging market for operators. In recent years, operators and researchers have worked to improve the experience and satisfaction of IPTV users by studying the key factors that affect user quality of experience (QoE).

In existing solutions, the QoE of a user is predicted with machine learning models and related techniques, based on status data collected from IPTV set-top boxes and on users' fault-report data. In the IPTV service, however, the network is in good order most of the time, the user experience is good, and no fault is reported; only in a limited number of cases is the user experience poor and a fault reported. The data collected by the set-top boxes is therefore imbalanced. There are two classes, users who report a fault and users who do not, and the number of samples in the fault-report class is far smaller than the number in the no-report class. In this problem, the fault-report class is the minority class and the no-report class is the majority class.

To handle imbalanced data, it is often necessary to synthesize additional minority class samples from the characteristics of the existing data so that the two classes reach a balanced size. Among existing methods, the synthetic minority oversampling technique (SMOTE) is an oversampling technique frequently used to synthesize minority class samples. Although the SMOTE algorithm has many advantages, it still has defects, including overfitting and data variability. In particular, SMOTE generates the same number of synthetic samples for every minority sample without taking neighboring samples into account, which increases the probability that samples overlap within the minority class. In addition, some minority class samples lie near the classification boundary and play a key role for the subsequent classifier, while others lie inside the majority class and amount to noise; generating minority class samples from the latter works against classification. The existing SMOTE algorithm does not consider these problems. The present invention therefore addresses these technical defects of SMOTE in order to better solve the data imbalance problem in IPTV user QoE prediction.

Summary of the invention

The technical problem to be solved by the present invention is to provide, in view of the deficiencies of the background art, a method for synthesizing minority class samples in an imbalanced IPTV data set.

To solve the above technical problem, the present invention adopts the following technical solution:

A method for synthesizing minority class samples in an imbalanced IPTV data set, comprising the following steps:

Step 1: for each sample point x_i in the minority class sample set x_minor, find its k-nearest-neighbor set s_i, where k is a natural number, i = 1, ..., n, and x_i ∈ x_minor; the k-nearest-neighbor set is the set formed by the k samples closest to x_i;

Step 2: analyze the characteristics of each minority class sample using the k-nearest-neighbor sets obtained in step 1, and assign each sample to one of three sets: a noise set, a safe set, or a danger set;

Step 3: leave the samples in the noise set unprocessed, and calculate the ratio t between the number of samples in the safe set and the number of samples in the danger set;

Step 4: generate a random number b uniformly distributed on the interval [0,1]; if b ∈ [0, t/(t+1)], select all samples in the danger set as input to the standard SMOTE algorithm to generate new minority class samples; otherwise, select all samples in the safe set as input to the standard SMOTE algorithm to generate new minority class samples;

Step 5: merge the original minority class samples with the newly generated minority class samples to form the new minority class set.

As a further preferred embodiment of the method for synthesizing minority class samples in an imbalanced IPTV data set according to the present invention, step 2 specifically comprises the following steps:

Step 2.1: count the number of samples in s_i that belong to the majority class x_major, denoted |s_i ∩ x_major|, i.e. the number of samples in the intersection of the majority class sample set x_major and s_i.

Step 2.2: determine the interval in which |s_i ∩ x_major| falls; there are three cases:

If |s_i ∩ x_major| = k, the current sample x_i lies inside the majority class and is regarded as noise for the classification problem; all samples in x_minor that satisfy this condition form the noise set;

If 0 ≤ |s_i ∩ x_major| < 0.5k, the current sample x_i is at little risk of being misclassified; all samples in x_minor that satisfy this condition form the safe set;

If 0.5k ≤ |s_i ∩ x_major| < k, the current sample x_i is at risk of being misclassified; all samples in x_minor that satisfy this condition form the danger set.

As a further preferred embodiment of the method for synthesizing minority class samples in an imbalanced IPTV data set according to the present invention, the SMOTE algorithm in step 4 is computed as follows: let the current sample be x_i; randomly select a sample x_j from the k-nearest-neighbor set s_i of this sample and generate a random number δ uniformly distributed on the interval [0,1]; the newly generated minority class sample is then x_new = x_i + δ × (x_j − x_i).

Compared with the prior art, the present invention, by adopting the above technical solution, has the following technical effects:

1. By generating minority class samples, the present invention can address problems such as classification and prediction on imbalanced data;

2. The present invention classifies the minority class samples and does not use minority class samples that are submerged among majority class samples to generate new samples, which avoids the performance degradation that noise causes in subsequent classification. In addition, because the samples in the danger set lie near the boundary between the two classes, generating new minority class samples from this set as far as possible helps to greatly improve the performance of subsequent classification and prediction methods;

3. The present invention can avoid the data overlap problem that arises when the traditional SMOTE algorithm generates minority class samples.

Brief description of the drawings

Fig. 1 is a flow chart of the method for synthesizing minority class samples in an imbalanced IPTV data set according to the present invention;

Fig. 2 compares the G-mean obtained under the KNN classifier when the imbalanced IPTV data set is processed with each of three methods;

Fig. 3 compares the G-mean obtained under the C4.5 classifier when the imbalanced IPTV data set is processed with each of three methods;

Fig. 4 compares the G-mean obtained when the minority class data generated by the standard SMOTE method and by the method proposed by the present invention are used as the test set.

Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings:

As shown in Fig. 1, a method for synthesizing minority class samples in an imbalanced IPTV data set comprises the following steps:

Step 1: find the k-nearest-neighbor set s_i of each minority class sample point, where k is a natural number and i is a positive integer;

The detailed procedure of each step is as follows:

Step 1: let the data collected by the IPTV set-top boxes comprise status data x = {x_i}, i = 1, ..., n, and user fault-report data y = {y_i}, i = 1, ..., n, which correspond one to one. Each vector x_i has dimension p and reflects IPTV network conditions (delay, packet loss, stalling, etc.); y_i is a scalar label indicating whether the user reported a fault: y_i = 1 if the user reported a fault, and y_i = 0 otherwise. The minority class sample set x_minor is then defined as all x_i with y_i = 1, i = 1, ..., n, and the majority class sample set x_major as all x_i with y_i = 0, i = 1, ..., n, i.e. x_major = x \ x_minor. For each minority class sample x_i ∈ x_minor, compute its Euclidean distance to all samples in x and select the k nearest samples to form the k-nearest-neighbor set s_i of x_i.
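The neighbor search in step 1 can be sketched as follows. This is a minimal brute-force illustration, not the patent's implementation: the function names and toy data are hypothetical, and it assumes that a sample is not counted as its own neighbor, which the text does not state explicitly.

```python
import math

def k_nearest_neighbors(x, i, k):
    """Indices of the k samples in x closest to x[i] by Euclidean distance.

    Assumption: x[i] itself is excluded from its own neighbor set.
    """
    dists = [(math.dist(x[i], x[j]), j) for j in range(len(x)) if j != i]
    dists.sort()  # ascending distance, ties broken by index
    return [j for _, j in dists[:k]]

def neighbor_sets(x, y, k):
    """Map each minority index i (y[i] == 1) to its neighbor index list s_i."""
    return {i: k_nearest_neighbors(x, i, k) for i in range(len(x)) if y[i] == 1}

# Toy data: five 2-dimensional samples, label 1 = fault reported (minority).
x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.0, 0.2)]
y = [1, 0, 0, 0, 1]
s = neighbor_sets(x, y, k=2)
print(s)  # {0: [1, 4], 4: [0, 1]}
```

The search above compares every minority sample with every other sample, so for a data set with hundreds of thousands of records a spatial index or an approximate neighbor search would likely be preferable.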

Step 2: analyze the characteristics of each minority class sample through its k-nearest-neighbor set and further classify the minority class samples, as follows:

(2-1) Count the number of the k samples in s_i that belong to the majority class x_major, i.e. obtain |s_i ∩ x_major|. This can be done by counting the class labels y of the samples in s_i.

(2-2) Determine the interval in which |s_i ∩ x_major| falls; there are three cases:

Case 1: if |s_i ∩ x_major| = k, the current sample x_i lies inside the majority class and is regarded as noise for the classification problem. The set of all samples in x_minor that satisfy this condition is defined as the "noise set";

Case 2: if 0 ≤ |s_i ∩ x_major| < 0.5k, the current sample x_i is at little risk of being misclassified. The set of all samples in x_minor that satisfy this condition is defined as the "safe set";

Case 3: if 0.5k ≤ |s_i ∩ x_major| < k, the current sample x_i is at risk of being misclassified. The set of all samples in x_minor that satisfy this condition is defined as the "danger set";

(2-3) The sample points in the noise set, which satisfy case 1, receive no further processing, i.e. they are not used to generate new minority class samples. The sample points in the safe set (case 2) and in the danger set (case 3) proceed to the next step.
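The three cases of step 2 amount to a threshold test on the number of majority class neighbors. The sketch below illustrates this with hand-written neighbor sets; the function name and toy data are hypothetical, and labels follow the convention above (0 = majority, 1 = minority).

```python
def partition_minority(s, y, k):
    """Split minority sample indices into noise, safe, and danger lists.

    s maps a minority index i to the indices of its k nearest neighbors;
    y holds the class label of every sample (0 = majority, 1 = minority).
    """
    noise, safe, danger = [], [], []
    for i, neighbors in s.items():
        m = sum(1 for j in neighbors if y[j] == 0)  # |s_i ∩ x_major|
        if m == k:            # case 1: every neighbor is a majority sample
            noise.append(i)
        elif m < 0.5 * k:     # case 2: fewer than half are majority samples
            safe.append(i)
        else:                 # case 3: 0.5k <= m < k
            danger.append(i)
    return noise, safe, danger

# Toy labels and neighbor sets for three minority samples, k = 4.
y = [1, 1, 1, 0, 0, 0, 0, 1, 1]
s = {0: [3, 4, 5, 6], 1: [2, 7, 8, 3], 2: [1, 3, 4, 7]}
noise, safe, danger = partition_minority(s, y, k=4)
print(noise, safe, danger)  # [0] [1] [2]
```

Sample 2 has exactly k/2 majority neighbors and therefore falls into the danger set, since case 3 includes the lower bound 0.5k.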

Step 3: calculate the ratio between the number of samples in the safe set and the number of samples in the danger set obtained in the previous step, denoted t.

Step 4: generate a random number uniformly distributed on the interval [0,1], denoted b. If b ∈ [0, t/(t+1)], use all samples in the danger set as input to the standard SMOTE algorithm to generate new minority class samples; otherwise, use all samples in the safe set as input to the standard SMOTE algorithm to generate new minority class samples. The original minority class samples and the newly generated minority class samples are merged to form the new minority class set.
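Steps 3 and 4 can be illustrated with a short sketch. Note that t/(t+1) equals n_safe/(n_safe + n_danger), so the danger set is chosen with a probability equal to the share of safe samples. The function name is hypothetical, and raising an error for an empty danger set is an assumption, since the text does not cover that case.

```python
import random

def choose_input_set(safe, danger, rng=random):
    """Pick the danger set or the safe set as SMOTE input (steps 3 and 4).

    Assumption: an empty danger set is rejected, because t would be undefined.
    """
    if not danger:
        raise ValueError("danger set is empty; ratio t is undefined")
    t = len(safe) / len(danger)   # step 3: ratio of set sizes
    threshold = t / (t + 1)       # equals n_safe / (n_safe + n_danger)
    b = rng.random()              # uniform draw (half-open interval [0, 1))
    chosen = danger if b <= threshold else safe
    return chosen, b, threshold

# Example: 3 safe samples and 1 danger sample give t = 3 and threshold 0.75.
chosen, b, threshold = choose_input_set([10, 11, 12], [20], random.Random(0))
print(threshold)  # 0.75
```

Passing a seeded `random.Random` instance, as in the example, keeps the selection reproducible across runs.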

The standard SMOTE algorithm in this step is as follows: let the current sample be x_i; randomly select a sample x_j from the k-nearest-neighbor set s_i of this sample and generate a random number δ uniformly distributed on the interval [0,1]; the newly generated minority class sample is then x_new = x_i + δ × (x_j − x_i).
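The interpolation formula places the new sample on the line segment between x_i and the chosen neighbor x_j. A minimal sketch, with a hypothetical function name and toy vectors:

```python
import random

def smote_sample(x_i, neighbors, rng=random):
    """Generate one synthetic sample from x_i and its neighbor vectors.

    neighbors is the list of feature vectors in the neighbor set s_i.
    """
    x_j = rng.choice(neighbors)   # random neighbor from s_i
    delta = rng.random()          # uniform draw (half-open interval [0, 1))
    # x_new = x_i + delta * (x_j - x_i), applied per dimension
    return [a + delta * (b - a) for a, b in zip(x_i, x_j)]

rng = random.Random(0)
x_new = smote_sample([0.0, 0.0], [[1.0, 2.0], [2.0, 0.0]], rng)
print(x_new)
```

With δ = 0 the new sample coincides with x_i, and as δ approaches 1 it approaches x_j.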

Note that the number of new samples to generate is determined by the difference between the number of majority class samples and the number of original minority class samples. Let the numbers of majority class and minority class samples be |x_major| and |x_minor| respectively; then (|x_major| − |x_minor|) new minority class samples need to be generated. If the safe set (containing n_safe samples) is used as the input of the standard SMOTE algorithm in this step, the standard SMOTE algorithm must be run (|x_major| − |x_minor|)/n_safe times for each sample in the safe set. Likewise, if the danger set (containing n_danger samples) is used as the input, the standard SMOTE algorithm must be run (|x_major| − |x_minor|)/n_danger times for each sample in the danger set.
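The sample count described above is the gap between the two class sizes, spread over the samples of the selected set. The sketch below works through the arithmetic. The function name is hypothetical, the selected-set size of 1000 is an invented illustration, and giving the remainder to the first few samples is an assumption, since the text does not say how a non-integer quotient is handled.

```python
def runs_per_sample(n_major, n_minor, n_selected):
    """Return (total new samples, base runs per sample, samples with one extra run).

    Assumption: when the total is not divisible by n_selected, the remainder
    is covered by giving one extra run to that many samples.
    """
    n_new = n_major - n_minor                 # samples needed to balance classes
    base, extra = divmod(n_new, n_selected)
    return n_new, base, extra

# Class sizes from the cleaned data set; n_selected = 1000 is hypothetical.
n_new, base, extra = runs_per_sample(434179, 4871, 1000)
print(n_new, base, extra)  # 429308 429 308
```

In this illustration, 308 samples of the selected set would be run 430 times and the remaining 692 samples 429 times, giving 429308 new samples in total.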

Embodiment and performance evaluation

To better illustrate the advantages of the minority class sample synthesis method designed in the present invention for imbalanced IPTV data sets, the method is applied to predicting user fault reports in an IPTV system. The two original data sets both come from Jiangsu Telecom. Data set 1 (i.e. x) contains IPTV key performance indicator (KPI) data from April 1 to April 10. Data set 2 (i.e. y) contains user fault-report data (fault reports received by telephone).

After the raw data sets are collected, they must be cleaned to remove duplicate records, erroneous records, and records with missing attribute values. The data in data set 1 is then matched one to one with the data in data set 2, and the data in data set 1 is labeled according to the fault-report labels in data set 2 for use in training the subsequent prediction model. After data cleaning, data set x contains 439050 records (samples) in total, of which 4871 belong to the minority class and 434179 belong to the majority class, and each record has dimension p = 11. The meaning of each dimension is shown in Table 1.

Table 1. Meaning of each data dimension

After data cleaning, to balance the majority and minority classes, the minority class sample synthesis method designed in the present invention is used to generate minority class samples so that the total number of minority class samples is 40 times the number of minority class samples in the original data set.

With 150000 majority class samples and 150000 minority class samples (new) as the training data set, the k-nearest-neighbor (KNN) classification algorithm and the C4.5 decision tree algorithm are used to train models, and the trained models classify the remaining data in the data set. Fig. 2 compares the G-mean under the KNN classifier for three methods: classifying directly without generating new minority class samples, classifying after generating new minority class samples with the standard SMOTE method only, and classifying after generating new minority class samples with the method proposed by the present invention.
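The text does not define the G-mean. The sketch below assumes the definition commonly used for imbalanced binary classification, the geometric mean of the true positive rate and the true negative rate; the function name and toy labels are hypothetical.

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for labels in {0, 1}.

    Assumption: both classes occur in y_true, so neither rate divides by zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)   # recall on the minority (fault) class
    specificity = tn / (tn + fp)   # recall on the majority class
    return math.sqrt(sensitivity * specificity)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
score = g_mean(y_true, y_pred)   # sqrt(0.75 * 5/6), about 0.79
print(round(score, 4))
```

Unlike plain accuracy, this measure drops to zero when a classifier ignores the minority class entirely, which is why it suits imbalanced data.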

In Fig. 2 and Fig. 3, the ratios of minority class to majority class in the data set being classified are 1:20 (6000:12000), 1:25 (6000:15000), and 1:30 (6000:18000) respectively. Under these three test ratios, for both the KNN classifier and the C4.5 classifier, the G-mean of the minority class sample synthesis method designed in the present invention is higher than that of the other two methods. Comparing Fig. 2 with Fig. 3 shows that the KNN classifier combined with the proposed minority class sample synthesis method achieves better classification performance than the C4.5 classifier.

In addition, the minority class data generated by the standard SMOTE method and by the method proposed by the present invention are used as test sets to compare the G-mean of the two methods. In Fig. 4, the numbers on the horizontal axis are the numbers of minority class samples generated by the standard SMOTE method and by the proposed method respectively. In the test data, the ratio of minority class to majority class is 1:20. In all three cases, the G-mean of the proposed method is higher than that of the standard SMOTE method.

The experimental results show that the minority class sample synthesis method designed in the present invention significantly improves classification and prediction performance on existing imbalanced IPTV data sets.

Claims

1. A method for synthesizing minority class samples in an imbalanced IPTV data set, characterized by comprising the following steps:

Step 1: for each sample point x_i in the minority class sample set x_minor, find its k-nearest-neighbor set s_i, where k is a natural number, i = 1, ..., n, and x_i ∈ x_minor; the k-nearest-neighbor set is the set formed by the k samples closest to x_i;

Step 2: analyze the characteristics of each minority class sample using the k-nearest-neighbor sets obtained in step 1, and assign each sample to one of three sets: a noise set, a safe set, or a danger set;

Step 3: leave the samples in the noise set unprocessed, and calculate the ratio t between the number of samples in the safe set and the number of samples in the danger set;

Step 4: generate a random number b uniformly distributed on the interval [0,1]; if b ∈ [0, t/(t+1)], select all samples in the danger set as input to the standard SMOTE algorithm to generate new minority class samples; otherwise, select all samples in the safe set as input to the standard SMOTE algorithm to generate new minority class samples;

Step 5: merge the original minority class samples with the newly generated minority class samples to form the new minority class set.

2. The method for synthesizing minority class samples in an imbalanced IPTV data set according to claim 1, characterized in that step 2 specifically comprises the following steps:

Step 2.1: count the number of samples in s_i that belong to the majority class x_major, denoted |s_i ∩ x_major|, i.e. the number of samples in the intersection of the majority class sample set x_major and s_i;

If |s_i ∩ x_major| = k, the current sample x_i lies inside the majority class and is regarded as noise for the classification problem; all samples in x_minor that satisfy this condition form the noise set;

If 0 ≤ |s_i ∩ x_major| < 0.5k, the current sample x_i is at little risk of being misclassified; all samples in x_minor that satisfy this condition form the safe set;

If 0.5k ≤ |s_i ∩ x_major| < k, the current sample x_i is at risk of being misclassified; all samples in x_minor that satisfy this condition form the danger set.

3. The method for synthesizing minority class samples in an imbalanced IPTV data set according to claim 1, characterized in that, in step 4, the SMOTE algorithm is computed as follows: let the current sample be x_i; randomly select a sample x_j from the k-nearest-neighbor set s_i of this sample and generate a random number δ uniformly distributed on the interval [0,1]; the newly generated minority class sample is then x_new = x_i + δ × (x_j − x_i).