CN102521656A - Integrated transfer learning method for classification of unbalance samples - Google Patents


Info

Publication number
CN102521656A
Authority
CN
China
Prior art keywords: sample, training, data, classification, samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201110452050XA
Other languages
Chinese (zh)
Other versions
CN102521656B (en)
Inventor
于重重
谭励
田蕊
刘宇
吴子珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN201110452050.XA priority Critical patent/CN102521656B/en
Publication of CN102521656A publication Critical patent/CN102521656A/en
Application granted granted Critical
Publication of CN102521656B publication Critical patent/CN102521656B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an integrated transfer learning method for classifying unbalanced samples, comprising the following steps. During initialization, positive and negative samples are given different weights, so that the negative samples, which make up a small share of the total but carry a large amount of information, receive large initial weights. In each training round, a fixed proportion of samples is drawn as a training subset; after training, the classifier with the smallest error among several simple classifiers is selected as the weak classifier, and the training data set is adjusted by a dynamic redundant-data elimination algorithm. After T rounds of iteration a sequence of weak classifiers is obtained, and the weak classifiers are combined (boosted) into a strong classifier. By effectively exploiting the classification regularities of old data, the method discovers the classification regularities of new data with a similar distribution; in particular, it provides a new approach to classifying class-unbalanced data, preserves the influence of the few negative samples during classification training, effectively raises their contribution rate, and improves both the efficiency and the accuracy of classification.

Description

Integrated transfer learning method for unbalanced sample classification
Technical field
The invention belongs to the field of machine learning. For supplementary training data with a large amount of redundancy and a strong imbalance between positive and negative samples, it proposes an improved integrated transfer learning algorithm that transfers knowledge from this supplementary training data to help classify the target data.
Background technology
Transfer learning has been a hot research topic in machine learning in recent years. It addresses tasks where labelled data is scarce by reusing outdated data in the new task: although a large amount of outdated data differs from the problem domain to be solved, some of it is bound to be helpful to the new classification problem. To find this useful data, a small amount of already-labelled new data is used to mine the valuable information in the old data. A more effective classification model is then trained from the useful information in both parts, realising knowledge transfer from the old data to the new data.
At present, there are multiple solutions for different transfer learning tasks:
Q. Yang et al. generalised the Naive Bayes classifier into one that supports cross-domain text classification, realising knowledge transfer between texts of different domains. (W. Dai, G.-R. Xue, Q. Yang, and Y. Yu. Transferring naive Bayes classifiers for text classification. The Twenty-Second National Conference on Artificial Intelligence, 2007, 540-545.)
Dai et al. applied ensemble learning to transfer learning: using the boosting technique to "lift" a weak learning algorithm into a strong one yields the TrAdaboost algorithm. TrAdaboost directly combines the auxiliary data and the target data into one mixed data set, uses it as the training set, and then trains a classification model on it with the TrAdaboost algorithm. (Y. Liu and P. Stone. Value-function-based transfer for reinforcement learning using structure mapping. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 2006, 877-882.)
Applying ensemble learning to transfer learning can, without changing the classification precision of the weak classifiers, boost the weak learning algorithm into a strong one and thereby effectively improve the transfer learning effect. However, this approach still has some problems:
The TrAdaboost algorithm is designed for symmetric two-class problems and treats positive and negative samples equally. In the real world, however, the distributions of the two classes can be extremely unbalanced, and their importance can differ greatly.
In addition, the auxiliary data often contains a large amount of redundancy. Such data may be very dissimilar to the target data set, and its presence not only slows model training but also degrades classification precision.
Summary of the invention
The purpose of the invention is to provide a new method which, by optimising the weight assignment and adjustment strategy, raises the contribution rate of the class of samples (the negative samples) that is small in volume but rich in information; which dynamically rejects "irrelevant" data during training and, according to a configured lower weight threshold, eliminates the data whose weights have become too small; and under which, over T rounds of iterative training, the supplementary training data is progressively optimised.
The principle of the invention is as follows. Transfer is used to classify data whose positive and negative samples are unbalanced. First, the feature attribute vectors extracted from the supplementary training data and the target data are mixed into a training set, and a weak learning algorithm is applied to each feature dimension of this set. At initialisation, positive and negative samples are given different weights, so that the negative samples, which form a small share of the total but carry much information, start with large weights. In each training round a proportion of the samples is drawn as a training subset; after training finishes, the classifier with the smallest error among several simple classifiers is selected as the weak classifier h, and the training data set is adjusted according to the dynamic redundant-data elimination algorithm. In this way, after T rounds of iteration a weak-classifier sequence (h_1, h_2, ..., h_T) is obtained; the final classification function f(x) is produced by voting, i.e. the weak classifiers are stacked (boosted) into a single strong classifier. The method flow is shown in Fig. 5.
The technical scheme provided by the invention is as follows:
An integrated transfer learning method for unbalanced sample classification, characterised by comprising the following steps:
1) mix the auxiliary data set A and the target data set O proportionally into a training data set C;
2) initialise the sample weights;
3) normalise the sample weights;
with T the total number of iterations, each of the T training rounds performs steps 4)-9) in turn:
4) randomly draw a training subset D;
5) if D contains both positive and negative samples, go to step 6); otherwise draw some samples of the missing class and insert them into D, so that D is guaranteed to contain both classes;
6) on D, train a base classifier for each feature dimension with the weak learning algorithm P and sum them into a weak classifier;
7) compute the training error rate of the weak classifier h_t on the target training data, where t is the iteration index;
8) adjust the sample weights according to the classification error rate;
9) dynamically reject redundant data;
10) obtain the final integrated classifier and output positive and negative samples.
In step 1), part of the data is extracted proportionally from the auxiliary data set A and the target data set O and mixed into the training data set C = {(X_1, Y_1), (X_2, Y_2), ..., (X_N, Y_N)}, where (X_i, Y_i) is a training sample composed of a feature attribute vector and a class label, i = 1, 2, ..., N. The first n samples of C come from A and the remaining m samples from O, with n + m = N. X_i ∈ X, where X is the input sample data; X_i is the q-dimensional feature attribute vector of the sample, and Y_i ∈ {0, +1} is its class label.
In step 3), the normalised sample weights are computed by dividing each sample's initial weight by the sum of all sample weights.
In step 4), the drawn training subset D contains half of the samples in C.
In step 6), the weak learning algorithm is a decision tree, an artificial neural network, or an SVM.
In step 7), the training error rate ε_t of the weak classifier h_t on the target training data is computed as

ε_t = Σ_{i=n+1}^{n+m} ω_i^t |h_t(x_i) − y_i| / Σ_{i=n+1}^{n+m} ω_i^t

where ω_i^t is the weight of the i-th sample at the t-th iteration, h_t(x_i) is the output for the i-th sample at the t-th iteration, and y_i is the true class label of the i-th sample. If the computed ε_t exceeds 1/2, its value is set to 1/2, i.e. ε_t is at most 1/2.
In step 9), after each round of training, any training sample whose weight has fallen below the configured lower threshold is regarded as redundant data and deleted from the training samples.
In step 9), when the total number of training samples is less than or equal to the configured minimum, training stops, i.e. the iteration of steps 4)-9) ends.
In step 10), the final integrated classifier outputs

h_f(x) = 1 if Σ_{t=1}^{T} h_t(x) ≥ T/2, and 0 otherwise,

i.e. the majority vote of the weak classifiers is taken as the final classification result, where the value 1 represents the majority class of the unbalanced data and the value 0 the minority class; x is a training sample, T is the total number of iterations, t is the iteration index, and h_t(x) is the output for sample x at the t-th iteration.
The integrated transfer learning method described above is applied to bridge monitoring. For the final integrated classifier output value Z: Z = 1 indicates the sample is positive and the bridge is in a healthy state; Z = 0 indicates the sample is negative and the bridge is in a damaged state.
Beneficial effects of the invention: with the provided technical scheme, the classification regularities of existing old data can be used effectively to discover the classification regularities of new data with a similar distribution. In particular, the scheme provides a new method for classifying class-unbalanced data, preserves the role of the few negative samples in classification training, effectively raises their contribution rate, and improves both the efficiency and the precision of classification.
Description of drawings
Fig. 1 Block diagram of the embodiment steps
Fig. 2 Error rate of the invention on training data and test data
Fig. 3 Relative error of TrAdaboost as the training data grows
Fig. 4 Relative error of the invention as the training data grows
Fig. 5 Flow chart of the inventive method
Table 1 Input data
Table 2 Composition of the training data
Table 3 Test results of the TrAdaboost algorithm and the inventive algorithm
Embodiment
The integrated transfer learning method for unbalanced sample classification provided by the invention (referred to as UBITLA) proceeds as follows (see Fig. 1):
1. Input: the input data comes from two parts, the auxiliary data set A and the target data set O. Part of each is extracted proportionally and mixed into the training data set C = {(X_1, Y_1), (X_2, Y_2), ..., (X_N, Y_N)}, where (X_i, Y_i) is a training sample composed of a feature attribute vector and a class label, i = 1, 2, ..., N. The first n samples of C come from A and the remaining m samples from O (n + m = N). The preset number of iterations is T. X_i ∈ X, where X is the input sample data; X_i is the q-dimensional feature attribute vector of the sample, and Y_i ∈ {0, +1} is its class label.
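A minimal sketch of this mixing step; the extraction ratios and the function name are hypothetical, the only property taken from the text is that the first n samples of C come from A and the remaining m from O:

```python
import random

def mix_training_set(A, O, ratio_a=0.8, ratio_o=0.8, seed=0):
    """Step-1 sketch: draw a proportion of each set (ratios are hypothetical)
    and concatenate, auxiliary samples first, so C[0:n] comes from A and
    C[n:N] from O with n + m = N."""
    rng = random.Random(seed)
    part_a = rng.sample(A, int(len(A) * ratio_a))
    part_o = rng.sample(O, int(len(O) * ratio_o))
    return part_a + part_o, len(part_a)
```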
2. Initialise the sample weights. The initial weight of the i-th sample is denoted ω_i^{0,1}, where the superscript 1 marks the initial state and the superscript 0 marks a weight that has not yet been normalised, i = 1, 2, ..., N; d and l are the numbers of positive samples in A and O respectively. [The initialisation formula appears only as an image in the source.]
3. Normalise the sample weights:

ω_i^1 = ω_i^{0,1} / Σ_{i=1}^{N} ω_i^{0,1}

where ω_i^1 is the normalised weight of the i-th sample: each sample's initial weight is divided by the total sample weight to obtain the normalised weight.
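The normalisation step is a plain division by the total; a minimal sketch (function name assumed):

```python
def normalise_weights(weights):
    """Step 3: divide each initial weight by the total so the weights sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]
```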
With T the total number of iterations, each of the T training rounds performs steps 4-9 in turn:
4. Randomly draw the training subset D from C; D contains half of the samples in C.
5. Check whether D contains both positive and negative samples. If it contains both classes, go to step 6; if it contains only one class, draw some samples of the other class directly and insert them into D, guaranteeing that the training subset contains both positive and negative samples.
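Steps 4-5 can be sketched as follows; the top-up size for a missing class is an arbitrary choice here, since the patent only requires that both classes end up present:

```python
import random

def draw_subset(C, seed=0):
    """Steps 4-5 sketch: draw half of C at random; if a class is absent from
    the draw, top the subset up with samples of the missing class."""
    rng = random.Random(seed)
    D = rng.sample(C, len(C) // 2)
    present = {y for _, y in D}
    for missing in {0, 1} - present:
        extras = [s for s in C if s[1] == missing]
        if extras:
            D.extend(rng.sample(extras, max(1, len(extras) // 4)))
    return D
```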
6. On the training subset D, use the weak learning algorithm P (a basic classification algorithm such as a decision tree, an artificial neural network, or an SVM) to train one base classifier h_t^j per feature dimension, j = 1, 2, ..., q, where h denotes a base classifier built by the basic classification algorithm, q is the dimension of the feature attribute vector, and t denotes the t-th iteration round. The weak classifier is obtained by summing the base classifiers:

h_t = Σ_{j=1}^{q} h_t^j
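The sum of q binary base classifiers is an integer vote count rather than a 0/1 label; one plausible reading, used in this sketch, is to threshold the summed votes at q/2 (the thresholding is an assumption, the text only states the sum):

```python
def weak_from_base(base_classifiers):
    """Step-6 sketch: build h_t from per-feature base classifiers h_t^j.
    Each h_t^j sees only feature j; their summed votes are thresholded at q/2."""
    q = len(base_classifiers)
    def h_t(x):
        votes = sum(h_j(x[j]) for j, h_j in enumerate(base_classifiers))
        return 1 if 2 * votes >= q else 0
    return h_t
```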
7. Compute the training error rate of the weak classifier h_t on the target training data:

ε_t = Σ_{i=n+1}^{n+m} ω_i^t |h_t(x_i) − y_i| / Σ_{i=n+1}^{n+m} ω_i^t

where ω_i^t is the weight of the i-th sample at the t-th iteration, h_t(x_i) is the output for the i-th sample at the t-th iteration, and y_i is the true class label of the i-th sample. If the computed ε_t exceeds 1/2, its value is set to 1/2, i.e. ε_t is at most 1/2.
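A direct transcription of this error rate, including the clipping at 1/2 (function name assumed; indices follow the convention that the m target samples occupy positions n..N-1 of C):

```python
def target_error(h, C, w, n):
    """Step 7: weighted error rate of h_t on the target samples, clipped at 1/2."""
    idx = range(n, len(C))
    num = sum(w[i] * abs(h(C[i][0]) - C[i][1]) for i in idx)
    den = sum(w[i] for i in idx)
    return min(num / den, 0.5)
```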
8. Adjust the sample weights.

If y_i = 0 and h_t(x_i) ≠ y_i, where 1 ≤ i ≤ n, then

ω_i^{t+1} = ω_i^t β^{|h_t(x_i) − y_i|} + dr × ω_i^t  (0 ≤ dr ≤ 1)

otherwise

ω_i^{t+1} = ω_i^t β^{|h_t(x_i) − y_i|} for 1 ≤ i ≤ n,
ω_i^{t+1} = ω_i^t β_t^{−|h_t(x_i) − y_i|} for n+1 ≤ i ≤ m+n,

with β = 1/(1 + √(2 ln n / T)) and β_t = ε_t/(1 − ε_t), as in TrAdaboost. That is, if the i-th sample is a negative sample and disagrees with the output of the t-th classifier, its weight is adjusted by the first formula; otherwise it is adjusted by the second. Here dr is a decay factor whose role is to give the weight adjustment of misclassified negative samples a memory effect, ensuring that their weights do not shrink too quickly.
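A sketch of this update. The β and β_t factors follow classic TrAdaboost, which is an assumption here because the source shows them only as images; auxiliary mistakes are down-weighted by β, target mistakes up-weighted by 1/β_t, and a misclassified auxiliary negative keeps dr times its old weight as memory:

```python
import math

def update_weights(w, preds, ys, n, eps_t, T, dr=0.5):
    """Step-8 sketch of the weight adjustment (TrAdaboost-style factors assumed)."""
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(max(n, 2)) / T))
    beta_t = eps_t / (1.0 - eps_t)
    out = []
    for i, (wi, p, y) in enumerate(zip(w, preds, ys)):
        d = abs(p - y)
        if i < n:                         # auxiliary sample (first n of C)
            nw = wi * beta ** d
            if y == 0 and d:              # misclassified negative: decay memory
                nw += dr * wi
        else:                             # target sample
            nw = wi * beta_t ** (-d) if d else wi
        out.append(nw)
    return out
```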
9. Dynamically reject redundant data. After each round of training, any training sample whose weight is below the configured lower threshold r is regarded as redundant data and deleted from the training samples. Training stops when the total number of training samples falls below the configured minimum.
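A sketch of the rejection step; the stop-before-pruning behaviour when the pool would shrink to the minimum is a design choice here, the text only states both conditions:

```python
def prune_redundant(C, w, r, min_samples):
    """Step 9: delete samples whose weight fell below the threshold r; report
    that training should stop once min_samples or fewer would remain."""
    keep = [i for i, wi in enumerate(w) if wi >= r]
    if len(keep) <= min_samples:
        return C, w, True                 # stop training, keep the pool as-is
    return [C[i] for i in keep], [w[i] for i in keep], False
```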
10. Output: the final integrated classifier outputs

h_f(x) = 1 if Σ_{t=1}^{T} h_t(x) ≥ T/2, and 0 otherwise,

i.e. the majority vote of the weak classifiers is the final classification result, where 1 represents the majority class of the unbalanced data and 0 the minority class.
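The final vote is a one-liner; breaking ties toward the majority class 1 is an assumption consistent with the Σ h_t(x) ≥ T/2 reading above:

```python
def strong_classify(weak, x):
    """Step 10: majority vote of the T weak classifiers; 1 is read as the
    majority (healthy) class, 0 as the minority (damage) class."""
    votes = sum(h(x) for h in weak)
    return 1 if 2 * votes >= len(weak) else 0
```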
The embodiment uses an existing sea-crossing bridge monitoring data set with two years of history (DataS1) and the monitoring data set of a newly completed highway bridge (DataS2) as its research objects. From the measured strain data of the two bridges, fixed proportions of data are extracted for the morning peak, the evening peak, and the morning and afternoon off-peak periods to serve as the auxiliary training data set and the target data set; the composition of the data is given in Table 2. The ratio of positive to negative samples is 5:1. The static strain data of 14 key monitoring points on the deck of each bridge are used as the 14-dimensional input data. Output data: 1 represents normal, 0 represents damage. In each training round, half of the data is randomly drawn as the training subset, and a part of the target data is likewise randomly drawn as the test subset.
In step 1 above, the actual input data is shown in Table 1. The ratio of auxiliary training data to target data is about 5:1, the ratio of the positive to the negative class is about 5:1, and the total number of samples is 6000. In step 10, an output of 1 denotes a positive sample, indicating that the bridge is in a healthy state; an output of 0 denotes a negative sample, indicating that the bridge is in a damaged state.
This embodiment applies the integrated transfer learning method for unbalanced sample classification to the classification of real bridge monitoring data. Given that the damage data in real bridge monitoring is small in volume and the positive and negative samples are unbalanced, making reasonable use of outdated data to help classify new data can effectively raise the contribution rate of the small but information-rich bridge damage data and thereby improve the recognition rate of both classes. This can effectively guide personnel to monitor bridge structures that produce damage data more closely and to take corresponding maintenance measures in time.
Fig. 2 illustrates the effectiveness of building the classification model with the help of the transferred data. Figs. 3 and 4 show that the precision of the final classifier can be improved by optimising the auxiliary training sample set. The figures show that the invention can optimise the auxiliary training data set and thereby achieve gains in both efficiency and precision, improving the transfer learning effect.
            1     2     3     4     5     6     7     8     9     10    11    12    13    Output
Sample 1    44    53.7  20.8  24.8  26.8  30.8  33.7  31.9  30    29.2  71.3  43.8  89.6  0
Sample 2    23    24.3  43.8  55.7  20.2  25    27.1  30.6  33.3  31.7  20.7  29.9  70.9  1
Sample 3    42.6  89.4  26.3  21.8  67.9  48.8  54.8  22.1  30.8  70.6  42.5  89.4  26.5  1
Sample 4    20.8  23.8  42.8  58.9  19    25.8  28.5  30.6  32.1  30.7  31    31.5  69.6  1
Sample 5    32.9  46.8  20.1  14.7  39.4  50.7  14.3  18.1  21.9  28.9  28.5  28.1  40.7  1
Sample 6    44.9  53.6  20.1  22.1  41.8  51.7  19    23.5  29.1  27.9  31.7  30.1  19.5  1
Sample 7    20.5  17.5  58.5  32.2  46.6  21.1  14.2  41.2  48.4  15.3  16.8  20.7  28.9  0
Sample 8    29.9  70.9  42.6  89.4  26.3  21.8  67.9  48.8  54.8  22.1  30.8  70.6  42.5  1
Sample 9    31.5  69.6  41.3  88.6  26.5  20.7  66.1  47.2  53.6  20.8  23.5  42.5  59.7  0
Sample 10   32.6  79.1  55.2  95.2  32.3  27.3  76.6  54.1  59.1  18.3  33.9  78.7  54.4  1
Table 1 Input data
Table 2 Composition of the training data (the table appears only as an image in the source)
Table 3 Test results of the TrAdaboost algorithm and the inventive algorithm (the table appears only as an image in the source)

Claims (10)

1. An integrated transfer learning method for unbalanced sample classification, characterised by comprising the steps of:
1) mixing the auxiliary data set A and the target data set O proportionally into a training data set C;
2) initialising the sample weights;
3) normalising the sample weights;
with T the total number of iterations, each of the T training rounds performing steps 4)-9) in turn:
4) randomly drawing a training subset D;
5) if D contains both positive and negative samples, going to step 6); otherwise drawing some samples of the missing class and inserting them into D, so that D is guaranteed to contain both classes;
6) on D, training a base classifier for each feature dimension with the weak learning algorithm P and summing them into a weak classifier;
7) computing the training error rate of the weak classifier h_t on the target training data, where t is the iteration index;
8) adjusting the sample weights according to the classification error rate;
9) dynamically rejecting redundant data;
10) obtaining the final integrated classifier and outputting positive and negative samples.
2. The integrated transfer learning method of claim 1, wherein in step 1) part of the data is extracted proportionally from the auxiliary data set A and the target data set O and mixed into the training data set C = {(X_1, Y_1), (X_2, Y_2), ..., (X_N, Y_N)}, where (X_i, Y_i) is a training sample composed of a feature attribute vector and a class label, i = 1, 2, ..., N; the first n samples of C come from A and the remaining m samples from O, n + m = N; X_i ∈ X, where X is the input sample data, X_i is the q-dimensional feature attribute vector of the sample, and Y_i ∈ {0, +1} is its class label.
3. The integrated transfer learning method of claim 1, wherein in step 3) the normalised sample weights are computed by dividing each sample's initial weight by the sum of all sample weights.
4. The integrated transfer learning method of claim 1, wherein in step 4) the drawn training subset D contains half of the samples in C.
5. The integrated transfer learning method of claim 1, wherein in step 6) the weak learning algorithm is a decision tree, an artificial neural network, or an SVM.
6. The integrated transfer learning method of claim 2, wherein in step 7) the training error rate ε_t of the weak classifier h_t on the target training data is computed as

ε_t = Σ_{i=n+1}^{n+m} ω_i^t |h_t(x_i) − y_i| / Σ_{i=n+1}^{n+m} ω_i^t

where ω_i^t is the weight of the i-th sample at the t-th iteration, h_t(x_i) is the output for the i-th sample at the t-th iteration, and y_i is its true class label; if the computed ε_t exceeds 1/2, its value is set to 1/2, i.e. ε_t is at most 1/2.
7. The integrated transfer learning method of claim 1, wherein in step 9), after each round of training, any training sample whose weight is below the configured lower threshold is regarded as redundant data and deleted from the training samples.
8. The integrated transfer learning method of claim 1, wherein in step 9) training stops when the total number of training samples is less than or equal to the configured minimum, i.e. the iteration of steps 4)-9) ends.
9. The integrated transfer learning method of claim 6, wherein in step 10) the final integrated classifier outputs

h_f(x) = 1 if Σ_{t=1}^{T} h_t(x) ≥ T/2, and 0 otherwise,

i.e. the majority vote of the weak classifiers is taken as the final classification result, where the value 1 represents the majority class of the unbalanced data and the value 0 the minority class; x is a training sample, T is the total number of iterations, t is the iteration index, and h_t(x) is the output for sample x at the t-th iteration.
10. The integrated transfer learning method of any of claims 1 to 9, applied to bridge monitoring, wherein for the final integrated classifier output value Z, Z = 1 indicates the sample is positive and the bridge is in a healthy state, and Z = 0 indicates the sample is negative and the bridge is in a damaged state.
CN201110452050.XA 2011-12-29 2011-12-29 Integrated transfer learning method for classification of unbalance samples Expired - Fee Related CN102521656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110452050.XA CN102521656B (en) 2011-12-29 2011-12-29 Integrated transfer learning method for classification of unbalance samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110452050.XA CN102521656B (en) 2011-12-29 2011-12-29 Integrated transfer learning method for classification of unbalance samples

Publications (2)

Publication Number Publication Date
CN102521656A true CN102521656A (en) 2012-06-27
CN102521656B CN102521656B (en) 2014-02-26

Family

ID=46292567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110452050.XA Expired - Fee Related CN102521656B (en) 2011-12-29 2011-12-29 Integrated transfer learning method for classification of unbalance samples

Country Status (1)

Country Link
CN (1) CN102521656B (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573708A (en) * 2014-12-19 2015-04-29 天津大学 Ensemble-of-under-sampled extreme learning machine
CN104751234A (en) * 2013-12-31 2015-07-01 华为技术有限公司 User asset predicting method and device
CN104951809A (en) * 2015-07-14 2015-09-30 西安电子科技大学 Unbalanced data classification method based on unbalanced classification indexes and integrated learning
CN105243394A (en) * 2015-11-03 2016-01-13 中国矿业大学 Evaluation method for performance influence degree of classification models by class imbalance
CN105589037A (en) * 2016-03-16 2016-05-18 合肥工业大学 Ensemble learning-based electric power electronic switch device network fault diagnosis method
CN106570164A (en) * 2016-11-07 2017-04-19 中国农业大学 Integrated foodstuff safety text classification method based on deep learning
CN106909981A (en) * 2015-12-23 2017-06-30 阿里巴巴集团控股有限公司 Model training, sample balance method and device and personal credit points-scoring system
CN106934462A (en) * 2017-02-09 2017-07-07 华南理工大学 Defence under antagonism environment based on migration poisons the learning method of attack
CN107291739A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Evaluation method, system and the equipment of network user's health status
CN107316061A (en) * 2017-06-22 2017-11-03 华南理工大学 A kind of uneven classification ensemble method of depth migration study
CN107644057A (en) * 2017-08-09 2018-01-30 天津大学 A kind of absolute uneven file classification method based on transfer learning
CN107728476A (en) * 2017-09-20 2018-02-23 浙江大学 A kind of method from non-equilibrium class extracting data sensitive data based on SVM forest
CN107944874A (en) * 2017-12-13 2018-04-20 阿里巴巴集团控股有限公司 Air control method, apparatus and system based on transfer learning
CN108154237A (en) * 2016-12-06 2018-06-12 华为技术有限公司 A kind of data processing system and method
CN108520220A (en) * 2018-03-30 2018-09-11 百度在线网络技术(北京)有限公司 model generating method and device
CN108629419A (en) * 2017-03-21 2018-10-09 发那科株式会社 Machine learning device and thermal displacement correction device
CN109143199A (en) * 2018-11-09 2019-01-04 大连东软信息学院 Sea clutter small target detecting method based on transfer learning
CN109272056A (en) * 2018-10-30 2019-01-25 成都信息工程大学 The method of data balancing method and raising data classification performance based on pseudo- negative sample
CN109462610A (en) * 2018-12-24 2019-03-12 哈尔滨工程大学 A kind of network inbreak detection method based on Active Learning and transfer learning
CN109508457A (en) * 2018-10-31 2019-03-22 浙江大学 A kind of transfer learning method reading series model based on machine
CN109523018A (en) * 2019-01-08 2019-03-26 重庆邮电大学 A kind of picture classification method based on depth migration study
CN109800807A (en) * 2019-01-18 2019-05-24 北京市商汤科技开发有限公司 The training method and classification method and device of sorter network, electronic equipment
CN109886303A (en) * 2019-01-21 2019-06-14 武汉大学 A kind of TrAdaboost sample migration aviation image classification method based on particle group optimizing
CN109948478A (en) * 2019-03-06 2019-06-28 中国科学院自动化研究所 The face identification method of extensive lack of balance data neural network based, system
CN110245232A (en) * 2019-06-03 2019-09-17 网易传媒科技(北京)有限公司 File classification method, device, medium and calculating equipment
CN110688983A (en) * 2019-08-22 2020-01-14 中国矿业大学 Microseismic signal identification method based on multi-mode optimization and ensemble learning
CN110998648A (en) * 2018-08-09 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for distributing orders
CN111046924A (en) * 2019-11-26 2020-04-21 成都旷视金智科技有限公司 Data processing method, device and system and storage medium
CN111291818A (en) * 2020-02-18 2020-06-16 浙江工业大学 Non-uniform class sample equalization method for cloud mask
CN111881289A (en) * 2020-06-10 2020-11-03 北京启明星辰信息安全技术有限公司 Training method of classification model, and detection method and device of data risk category
CN112465152A (en) * 2020-12-03 2021-03-09 中国科学院大学宁波华美医院 Online migration learning method suitable for emotional brain-computer interface
CN112819076A (en) * 2021-02-03 2021-05-18 中南大学 Deep transfer learning-based medical image classification model training method and device
US11049007B2 (en) 2016-05-06 2021-06-29 Fujitsu Limited Recognition apparatus based on deep neural network, training apparatus and methods thereof
CN113421122A (en) * 2021-06-25 2021-09-21 创络(上海)数据科技有限公司 Refined churn prediction method for first-purchase users under an improved transfer learning framework
CN113657428A (en) * 2021-06-30 2021-11-16 北京邮电大学 Method and device for extracting network traffic data
CN114765772A (en) * 2021-01-04 2022-07-19 中国移动通信有限公司研究院 Method and device for outputting terminal information and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794396A (en) * 2010-03-25 2010-08-04 西安电子科技大学 System and method for recognizing remote sensing image targets based on transfer network learning
CN101840569A (en) * 2010-03-19 2010-09-22 西安电子科技大学 Projection pursuit hyperspectral image segmentation method based on transfer learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840569A (en) * 2010-03-19 2010-09-22 西安电子科技大学 Projection pursuit hyperspectral image segmentation method based on transfer learning
CN101794396A (en) * 2010-03-25 2010-08-04 西安电子科技大学 System and method for recognizing remote sensing image targets based on transfer network learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Wei, et al.: "Ensemble transfer learning algorithm for unbalanced sample classification", Computer Engineering and Applications *
Zhang Yanfeng, et al.: "An improved AdaBoost algorithm: M-AsyAdaBoost", Transactions of Beijing Institute of Technology *

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104751234B (en) * 2013-12-31 2018-10-19 华为技术有限公司 User asset prediction method and device
CN104751234A (en) * 2013-12-31 2015-07-01 华为技术有限公司 User asset predicting method and device
CN104573708A (en) * 2014-12-19 2015-04-29 天津大学 Ensemble-of-under-sampled extreme learning machine
CN104951809A (en) * 2015-07-14 2015-09-30 西安电子科技大学 Unbalanced data classification method based on unbalanced classification indexes and integrated learning
CN105243394A (en) * 2015-11-03 2016-01-13 中国矿业大学 Evaluation method for performance influence degree of classification models by class imbalance
CN105243394B (en) * 2015-11-03 2019-03-19 中国矿业大学 Evaluation method for the influence degree of class imbalance on classification model performance
CN106909981B (en) * 2015-12-23 2020-08-25 阿里巴巴集团控股有限公司 Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system
CN106909981A (en) * 2015-12-23 2017-06-30 阿里巴巴集团控股有限公司 Model training method, sample balancing method and device, and personal credit scoring system
CN105589037A (en) * 2016-03-16 2016-05-18 合肥工业大学 Ensemble learning-based electric power electronic switch device network fault diagnosis method
CN107291739A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Evaluation method, system and the equipment of network user's health status
US11049007B2 (en) 2016-05-06 2021-06-29 Fujitsu Limited Recognition apparatus based on deep neural network, training apparatus and methods thereof
CN106570164A (en) * 2016-11-07 2017-04-19 中国农业大学 Integrated foodstuff safety text classification method based on deep learning
CN108154237A (en) * 2016-12-06 2018-06-12 华为技术有限公司 A kind of data processing system and method
CN108154237B (en) * 2016-12-06 2022-04-05 华为技术有限公司 Data processing system and method
CN106934462A (en) * 2017-02-09 2017-07-07 华南理工大学 Transfer-based learning method for defending against poisoning attacks in adversarial environments
CN108629419B (en) * 2017-03-21 2023-07-14 发那科株式会社 Machine learning device and thermal displacement correction device
CN108629419A (en) * 2017-03-21 2018-10-09 发那科株式会社 Machine learning device and thermal displacement correction device
CN107316061B (en) * 2017-06-22 2020-09-22 华南理工大学 Deep transfer learning ensemble method for unbalanced classification
CN107316061A (en) * 2017-06-22 2017-11-03 华南理工大学 Deep transfer learning ensemble method for unbalanced classification
CN107644057B (en) * 2017-08-09 2020-03-03 天津大学 Absolute imbalance text classification method based on transfer learning
CN107644057A (en) * 2017-08-09 2018-01-30 天津大学 Absolutely unbalanced text classification method based on transfer learning
CN107728476A (en) * 2017-09-20 2018-02-23 浙江大学 SVM-forest based method for extracting sensitive data from unbalanced data
CN107728476B (en) * 2017-09-20 2020-05-22 浙江大学 SVM-forest based method for extracting sensitive data from unbalanced data
CN107944874A (en) * 2017-12-13 2018-04-20 阿里巴巴集团控股有限公司 Risk control method, device and system based on transfer learning
CN107944874B (en) * 2017-12-13 2021-07-20 创新先进技术有限公司 Risk control method, device and system based on transfer learning
CN108520220B (en) * 2018-03-30 2021-07-09 百度在线网络技术(北京)有限公司 Model generation method and device
CN108520220A (en) * 2018-03-30 2018-09-11 百度在线网络技术(北京)有限公司 Model generation method and device
CN110998648A (en) * 2018-08-09 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for distributing orders
CN109272056B (en) * 2018-10-30 2021-09-21 成都信息工程大学 Data balancing method based on pseudo negative sample and method for improving data classification performance
CN109272056A (en) * 2018-10-30 2019-01-25 成都信息工程大学 Data balancing method based on pseudo negative samples and method for improving data classification performance
CN109508457A (en) * 2018-10-31 2019-03-22 浙江大学 Transfer learning method based on a machine reading sequence model
CN109143199A (en) * 2018-11-09 2019-01-04 大连东软信息学院 Sea clutter small target detecting method based on transfer learning
CN109462610A (en) * 2018-12-24 2019-03-12 哈尔滨工程大学 A kind of network inbreak detection method based on Active Learning and transfer learning
CN109523018A (en) * 2019-01-08 2019-03-26 重庆邮电大学 Image classification method based on deep transfer learning
CN109523018B (en) * 2019-01-08 2022-10-18 重庆邮电大学 Image classification method based on deep transfer learning
CN109800807A (en) * 2019-01-18 2019-05-24 北京市商汤科技开发有限公司 Training method and classification method and device of classification network, and electronic equipment
CN109800807B (en) * 2019-01-18 2021-08-31 北京市商汤科技开发有限公司 Training method and classification method and device of classification network, and electronic equipment
CN109886303A (en) * 2019-01-21 2019-06-14 武汉大学 TrAdaboost sample transfer aerial image classification method based on particle swarm optimization
CN109948478B (en) * 2019-03-06 2021-05-11 中国科学院自动化研究所 Large-scale unbalanced data face recognition method and system based on neural network
CN109948478A (en) * 2019-03-06 2019-06-28 中国科学院自动化研究所 Large-scale unbalanced data face recognition method and system based on neural network
CN110245232A (en) * 2019-06-03 2019-09-17 网易传媒科技(北京)有限公司 Text classification method, device, medium and computing equipment
CN110245232B (en) * 2019-06-03 2022-02-18 网易传媒科技(北京)有限公司 Text classification method, device, medium and computing equipment
CN110688983A (en) * 2019-08-22 2020-01-14 中国矿业大学 Microseismic signal identification method based on multi-mode optimization and ensemble learning
CN111046924B (en) * 2019-11-26 2023-12-19 成都旷视金智科技有限公司 Data processing method, device, system and storage medium
CN111046924A (en) * 2019-11-26 2020-04-21 成都旷视金智科技有限公司 Data processing method, device and system and storage medium
CN111291818A (en) * 2020-02-18 2020-06-16 浙江工业大学 Non-uniform class sample equalization method for cloud mask
CN111881289B (en) * 2020-06-10 2023-09-08 北京启明星辰信息安全技术有限公司 Training method of classification model, and detection method and device of data risk class
CN111881289A (en) * 2020-06-10 2020-11-03 北京启明星辰信息安全技术有限公司 Training method of classification model, and detection method and device of data risk category
CN112465152B (en) * 2020-12-03 2022-11-29 中国科学院大学宁波华美医院 Online transfer learning method suitable for emotional brain-computer interfaces
CN112465152A (en) * 2020-12-03 2021-03-09 中国科学院大学宁波华美医院 Online transfer learning method suitable for emotional brain-computer interfaces
CN114765772A (en) * 2021-01-04 2022-07-19 中国移动通信有限公司研究院 Method and device for outputting terminal information and readable storage medium
CN112819076B (en) * 2021-02-03 2022-06-17 中南大学 Deep transfer learning-based medical image classification model training method and device
CN112819076A (en) * 2021-02-03 2021-05-18 中南大学 Deep transfer learning-based medical image classification model training method and device
CN113421122A (en) * 2021-06-25 2021-09-21 创络(上海)数据科技有限公司 Refined churn prediction method for first-purchase users under an improved transfer learning framework
CN113657428A (en) * 2021-06-30 2021-11-16 北京邮电大学 Method and device for extracting network traffic data

Also Published As

Publication number Publication date
CN102521656B (en) 2014-02-26

Similar Documents

Publication Publication Date Title
CN102521656B (en) Integrated transfer learning method for classification of unbalance samples
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN105487526B (en) Fast RVM fault diagnosis method for sewage treatment
CN106709754A (en) Power user grouping method based on text mining
CN104966105A (en) Robust machine error retrieving method and system
CN103886330A (en) Classification method based on semi-supervised SVM ensemble learning
CN104657718A (en) Face recognition method based on face image feature extreme learning machine
CN103020122A (en) Transfer learning method based on semi-supervised clustering
CN106203534A (en) A kind of cost-sensitive Software Defects Predict Methods based on Boosting
CN103177265B (en) High-definition image classification method based on kernel function and sparse coding
CN104598920B (en) Scene classification method based on Gist feature and extreme learning machine
CN110188047A (en) A kind of repeated defects report detection method based on binary channels convolutional neural networks
CN112925908A (en) Attention-based text classification method and system for graph Attention network
CN106022954A (en) Multiple BP neural network load prediction method based on grey correlation degree
CN106681305A (en) Online fault diagnosing method for Fast RVM (relevance vector machine) sewage treatment
CN106156805A (en) Classifier training method for data with missing sample labels
CN108664633A (en) A method of carrying out text classification using diversified text feature
CN110348608A (en) Improved LSTM prediction method based on fuzzy clustering algorithm
CN104951987B (en) Crop Breeding evaluation method based on decision tree
CN105975611A (en) Self-adaptive combined downsampling reinforcing learning machine
CN110738232A (en) grid voltage out-of-limit cause diagnosis method based on data mining technology
CN106127333A (en) Movie attendance forecasting method and system
CN102629272A (en) Clustering based optimization method for examination system database
CN109784488A (en) Construction method of binarized convolutional neural networks suitable for embedded platforms
CN104751175A (en) Multi-label scene classification method of SAR (Synthetic Aperture Radar) image based on incremental support vector machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Tan Li

Inventor after: Su Weijun

Inventor after: Yu Chongchong

Inventor after: Tian Rui

Inventor after: Liu Yu

Inventor after: Wu Zijun

Inventor after: Ma Meng

Inventor before: Yu Chongchong

Inventor before: Tan Li

Inventor before: Tian Rui

Inventor before: Liu Yu

Inventor before: Wu Zijun

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: YU CHONGCHONG TAN LI TIAN RUI LIU YU WU ZIJUN TO: TAN LI SU WEIJUN YU CHONGCHONG TIAN RUI LIU YU WU ZIJUN MA MENG

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140226

Termination date: 20161229