CN102521656B

CN102521656B - Integrated transfer learning method for classification of unbalance samples

Info

Publication number: CN102521656B
Application number: CN201110452050.XA
Authority: CN
Inventors: 谭励; 苏维均; 于重重; 田蕊; 刘宇; 吴子珺; 马萌
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2011-12-29
Filing date: 2011-12-29
Publication date: 2014-02-26
Anticipated expiration: 2031-12-29
Also published as: CN102521656A

Abstract

The invention relates to an integrated transfer learning method for classification of unbalance samples, which comprises the following steps of: in the initializing process, giving different weights to positive and negative samples to ensure that the negative samples which account a small ratio for the total samples and have a large amount of information have large initial weights; in the training process in each round, extracting part of samples according to a certain ratio and using the selected samples as a training subset to carry out training, and after finishing the training, selecting the classifier with the smallest error from a plurality of simple classifiers as a weak classifier and regulating the training dataset according to a redundant data dynamic eliminating algorithm; and obtaining a weak classifier sequence after T rounds of iteration and overlaying and combining a plurality of weak classifiers into a strong classifier. According to the invention, the classification law of novel data which is distributed similarly with old data is found by effectively utilizing the classification law of the old data; particularly, a novel method is provided for solving the problem of classification of the data which is classified in an unbalance mode; the effect of a small amount of the negative samples in the classification process in the classification training process is ensured; the contribution rate of the negative samples is effectively improved; and the classification efficiency and accuracy are improved.

Description

The integrated transfer learning method of non-equilibrium sample classification

Technical field

The invention belongs to machine learning field, the supplemental training data of and positive and negative sample imbalance large for redundant data amount, have proposed the integrated transfer learning algorithm of a kind of improvement, utilize these supplemental training Data Migrations to help target datas to classify.

Background technology

Transfer learning is the hot topic of machine learning area research in recent years, it is for the little feature of flag data amount in new task, proposition effectively utilizes out-of-date Data Migration and is applied in new task: although a large amount of outdated datas and problem domain to be solved difference to some extent is wherein certain to exist some data helpful to new classification problem.In order to find these useful data, utilize the new data being classified on a small quantity, excavate the valuable information in legacy data.Finally according to useful informations all in two parts data, train the disaggregated model of a more efficient, realize legacy data to the knowledge migration of new data.

At present, for different transfer learning tasks, there is multiple solution:

The people such as Q.Yang propose Naive Bayes Classifier (Naive Bayes Classifier) to promote to a sorter of supporting cross-domain texts classification, have realized the migration of knowledge between different field text.(WDai,G.-R.Xue,QYang,and?YYu.Transferring?naive?bayes?classifiers?for?text?classification[A].The?Twenty-Second?National?Conference?on?Artificial?Intelligence[C],2007.540-545.)

The people such as Dai propose integrated study to be applied in transfer learning, by boosting technology, by weak learning algorithm " lifting ", be the algorithm-TrAdaboost of strong study, this algorithm is directly combined migration auxiliary data and this two parts data set of target data, form a mixed data set as training set, then on this data set, utilize TrAdaboost Algorithm for Training disaggregated model.(YLiu?and?PStone.Value-function-based?transfer?for?reinforcement?learning?using?structure?mapping[A].In?Proceedings?of?theTwenty-First?National?Conference?on?Artificial?Intelligence[C],2006.877-882.)

Ensemble Learning Algorithms is applied in transfer learning, can be in the situation that not changing Weak Classifier nicety of grading, by integrated by weak learning algorithm " lifting ", be strong learning algorithm, thereby effectively promote transfer learning effect, yet also there are some problems in the method:

TrAdaboost algorithm is applicable to solve two classification problems based on symmetry, and positive negative data is treated on an equal basis.Yet it is likely extremely unbalanced characterizing in real world in the different classes of sample distribution of two classes, and importance also exists very large difference.

In addition, often have mass of redundancy data in auxiliary data, these data may be very dissimilar with target data set, and their existence not only can affect the training speed of model, also can cause the decline of nicety of grading.

Summary of the invention

The object of this invention is to provide a kind of new method, by optimizing sample weights, distribute and adjust strategy, improve that data volume is little, the contribution rate of the class sample (negative sample) that contains much information; And in training process, dynamically reject " uncorrelated " data, and according to the sample threshold lower limit setting, eliminate the too small part data of weighted value, through the iteration training of T wheel, supplemental training data will constantly be tending towards optimizing.

Principle of the present invention is: the method for utilizing migration, data for positive and negative sample imbalance are classified, first characteristic attribute vector supplemental training data and target data being extracted is mixed into training set, then the every dimensional feature attribute on this training set is applied respectively to weak learning algorithm.When initialization, by positive negative sample, give different weights, guarantee to account for total sample proportion negative sample initial weight little but that contain much information large.Every wheel extracted in proportion part sample in training process and trained as training subset, after training finishes, that of Select Error minimum from several simple classification devices, as a Weak Classifier h, and dynamically rejects algorithm according to redundant data and adjusts training dataset.Like this, after T wheel iteration, just can obtain a Weak Classifier sequence (h ₁, h ₂..., h _t), final classification function f (x) adopts a kind of ballot mode to produce, and is about to a plurality of Weak Classifiers and gets up to be combined into a strong classifier by certain method stack (boost).Method flow as shown in Figure 5.

Technical scheme provided by the invention is as follows:

A kind of bridge monitoring methods based on transfer learning, it is characterized in that, from the actual monitoring data of bridge according to early, in, evening peak period and ebb period morning and afternoon are extracted respectively a certain proportion of data as migration auxiliary data collection A and target data set O, then carry out following steps:

1) migration auxiliary data collection A and target data set O are mixed into training dataset C in proportion;

2) initialization sample weight;

3) obtain normalization sample weights;

If iteration total degree is T, from 1 to T every take turns iteration training complete successively below 4)～9) step:

4) randomly draw training subset D;

5) if contain positive and negative two class samples in training subset D, carry out the 6th) step; Otherwise, at the another kind of middle extraction part sample not comprising, insert training subset D, to guarantee having positive and negative two class samples in training subset D;

6), on training subset D, with weak learning algorithm P, training base sorter integrated summation obtain Weak Classifier;

7) calculate Weak Classifier h _ttraining error rate on target training data, wherein t is iteration factor;

8) according to classification error rate, adjust sample weights;

9) dynamic eliminate redundancy data;

10) obtain final integrated classifier and export positive sample and negative sample;

For final integrated classifier output valve Z, Z is that this sample of 1 expression is positive sample, characterizes bridge health in order; Z is that this sample of 0 expression is duplicate sample basis, and sign bridge is faulted condition.

In step 1), in migration auxiliary data collection A and target data set O, extracting part divided data is mixed into training dataset C, C={ (X in proportion respectively ₁, Y ₁), (X ₂, Y ₂) ..., (X _n, Y _n), (X wherein _i, Y _i) be the training sample being combined into by sample characteristics attribute vector and sample class, i=1,2 ..., N; Before in training dataset C, n sample is data in A, and a remaining m sample is the data in O, n+m=N; X wherein _i∈ X, X is input sample data, X _ibe the characteristic attribute vector of sample, dimension is q, Y _i∈ { 0 ,+1 } is the class label of sample.

In step 3), the computing method of normalization sample weights are: the initializes weights of each sample, divided by total sample weights, is obtained to the sample weights after normalization.

In step 4), in the training subset D of extraction, contained sample size is half in C.

In step 6), described weak learning algorithm is decision tree, artificial neural network, SVM.

In step 7), Weak Classifier h _ttraining error rate ε on target training data _tfollowing calculating:

ϵ_{t} = Σ_{i = n + 1}^{n + m} \frac{ω_{i}^{t} | h_{t} (x_{i}) - y_{i} |}{Σ_{i = n + 1}^{n + m} ω_{i}^{t}}

Wherein

the weight of i sample when the t time iteration, h _t(x _i) be the output of i sample when the t time iteration, y _iit is the true class label of i sample; If the ε calculating _tbe greater than 1/2, its value is set to 1/2, i.e. ε _tbe not more than 1/2.

In step 9), every take turns training and finish after when the weighted value of training sample is lower than being regarded as redundant data under the threshold value setting in limited time, from training sample, delete this partial data.

In step 9), deconditioning when training sample sum is less than or equal to the minimum number of samples setting, stops step 4)～9) iteration.

In step 10), final integrated classifier is output as

Using the voting results of most Weak Classifiers as final classification results; The class sample that wherein value 1 representative occupies the majority in unbalanced data, the another kind of sample that 0 representative of value occupies the minority; X is training sample, and T is total iterations, and t is iteration factor, h _t(x) be the output of sample x when the t time iteration,

Beneficial effect of the present invention: utilize technical scheme provided by the invention, can effectively utilize the classification rule of existing legacy data to find out the classification rule of the new data of APPROXIMATE DISTRIBUTION, especially the classification problem for classification unbalanced data provides new method, guaranteed the effect of the negative sample that in classification, quantity is few in classification based training, effectively improve the contribution rate of negative sample, improved efficiency and the precision of classification.

Accompanying drawing explanation

Fig. 1 embodiment step block diagram

The error rate of Fig. 2 the present invention in training data and test data

Fig. 3 TrAdaboost increases relative error with training data

Fig. 4 the present invention increases relative error with training data

Fig. 5 the inventive method process flow diagram

Table 1 input data

Table 2 training data constituent

The test result of table 3TrAdaboost algorithm and algorithm of the present invention

Embodiment

The integrated transfer learning method (being referred to as UBITLA) of non-equilibrium sample classification provided by the invention, step following (ginseng Fig. 1):

1. input: input data come from two parts: migration auxiliary data collection A, target data set O.In these two parts data, extracting part divided data is mixed into training dataset C={ (X in proportion respectively ₁, Y ₁), (X ₂, Y ₂) ..., (X _n, Y _n), (X wherein _i, Y _i) be the training sample being combined into by sample characteristics attribute vector and sample class.i＝1，2，…，N。Before in C, n sample is data in A, and in C, m remaining sample is the data (n+m=N) in O.Predetermined iterations is T.X wherein _i∈ X, X is input sample data, X _ibe the characteristic attribute vector of sample, dimension is q, Y _i∈ { 0 ,+1 } is the class label of sample.

2. initialization sample weight:

Wherein,

be the initializes weights of i sample, 1 these weights of expression are wherein init states, i ₀represent that this initializes weights value is without normalized, i=1 wherein, 2 ..., N.D, l is respectively positive sample size in A and O.

3. obtain normalization sample weights:

ω_{i}^{1} = \frac{ω_{i_{0}}^{1}}{Σ_{i = 1}^{N} ω_{i_{0}}^{1}}

be i sample through normalized sample weights, respectively by the initializes weights of each sample divided by total sample weights, obtain the sample weights after normalization.

If iteration total degree is T, from 1 to T, often takes turns iteration training and complete successively 4-9 steps:

4. randomly draw training subset D(wherein in D contained sample number be half in C, D randomly draws from C.）

5. in training of judgement subset D, whether contain positive and negative two class samples.If contain two class samples, carry out the 6th step, if only containing a class sample wherein, directly at another kind of middle extraction part sample and insert training subset D.Guarantee to exist in training subset positive and negative two class samples.

6. on training subset D, with weak learning algorithm P(as decision tree, artificial neural network, the basic classification algorithms such as SVM), train the base sorter of every one-dimensional characteristic attribute

j=1,2 ..., q, wherein h represents the base sorter of being constructed by basic classification Algorithm for Training, the dimension of q representation feature attribute vector, t represents t wheel iteration.By the summation of base sorter, obtain Weak Classifier:

h_{t} = Σ_{j = 1}^{q} h_{t}^{j}

7. calculate Weak Classifier h _ttraining error rate on target training data:

ϵ_{t} = Σ_{i = n + 1}^{n + m} \frac{ω_{i}^{t} | h_{t} (x_{i}) - y_{i} |}{Σ_{i = n + 1}^{n + m} ω_{i}^{t}}

Wherein

the weight of i sample when the t time iteration, h _t(x _i) be the output of i sample when the t time iteration, y _iit is the true class label of i sample.

If the ε calculating _tbe greater than 1/2, need its value to be set to 1/2.Be ε _tbe not more than 1/2.

8. adjust sample weights:

If y _i=0 and h _t(x) ≠ y _i, 1≤i≤n wherein,

ω_{i}^{t + 1} = ω_{i}^{t} β^{| h_{t} (x_{i}) - y_{i} |} + dr \times ω_{i}^{t} (0 \leq dr \leq 1)

Otherwise

ω_{i}^{t + 1} = \{\begin{matrix} ω_{i}^{t} β^{| h_{t} (x_{i}) - y_{i} |}, & 1 \leq i \leq n \\ ω_{i}^{t} {β_{t}}^{- | h_{t} (x_{i}) - y_{i} |}, & n + 1 \leq i \leq m + n \end{matrix}

Order

if i sample is negative sample, and, according to above-mentioned first formula, adjusts weight, otherwise adjust according to second formula not etc. time with the output of t subseries device.Wherein, dr is a decay factor, and effect is to make to be had memory function by the negative sample weights adjustment of misclassification, guarantees that it unlikelyly diminishes rapidly.

9. dynamic eliminate redundancy data:

Every take turns training and finish after when the weighted value of training sample lower than set threshold value lower limit r time be regarded as redundant data, from training sample, delete this partial data.When training sample sum lower than set minimum number of samples time deconditioning.

10. output: final integrated classifier is output as:

The voting results of most Weak Classifiers of take are final classification results, the class sample that wherein 1 representative occupies the majority in unbalanced data, 0 another kind of sample that representative occupies the minority.

Embodiment adopts the historical bridge spanning the sea Monitoring Data collection (DataS1) of monitoring in existing 2 years and a new highway bridge Monitoring Data collection (DataS2) as research object, from the actual monitoring strain data of two bridges according to early, in, evening peak period and ebb period morning and afternoon are extracted respectively a certain proportion of data as migration supplemental training data set and target data set, and data constituent is as shown in table 2.Wherein positive and negative samples ratio is 5:1, adopts respectively 14 crucial monitoring point static strain data on the bridge plate of two bridges, as 14 dimensions input data.Output data: 1 representative is normal, 0 representative damage.In every training process of taking turns, randomly draw 1/2 data as training data subset, randomly draw equally a part of target data as test data subset.

In above-mentioned the 1st step, actual importation data are in Table 1, and wherein, the ratio of supplemental training data and target data is about 5:1, and positive and negative classification ratio is about 5:1,6000 of total sample number.The 10th step, is output as the positive sample of 1 expression, characterizes bridge health in order; Be output as 0 expression negative sample, sign bridge is faulted condition.

The present embodiment is applied to the integrated transfer learning method of non-equilibrium sample classification in bridge actual monitoring Data classification, little for damage data amount in bridge actual monitoring data, the feature of positive and negative sample imbalance, rationally utilize out-of-date data to help new data to classify, can effectively improve the little but contribution rate of the bridge damnification data that contain much information of data volume, thereby improve the discrimination of positive negative sample, can effectively instruct related personnel more closely to monitor producing the bridge structure of damage data, take in time corresponding maintenance measure.

Fig. 2 illustrates the auxiliary validity of setting up disaggregated model of migration data.Fig. 3, Fig. 4 explanation can improve final sorter precision by optimizing supplemental training sample set.There is the bright the present invention of this illustration can optimize supplemental training data set, thereby reach the doulbe-sides' victory of efficiency and precision, promoted transfer learning effect.

Table 1 input data

Table 2 training data constituent

Claims

1. the bridge monitoring methods based on transfer learning, it is characterized in that, from the actual monitoring data of bridge according to early, in, evening peak period and ebb period morning and afternoon are extracted respectively a certain proportion of data as migration auxiliary data collection A and target data set O, then carry out following steps:

2) initialization sample weight;

3) obtain normalization sample weights;

4) randomly draw training subset D;

8) according to classification error rate, adjust sample weights;

9) dynamic eliminate redundancy data;

2. bridge monitoring methods as claimed in claim 1, is characterized in that, in step 1), in migration auxiliary data collection A and target data set O, extracting part divided data is mixed into training dataset C, C={ (X in proportion respectively ₁, Y ₁), (X ₂, Y ₂) ..., (X _n, Y _n), (X wherein _i, Y _i) be the training sample being combined into by sample characteristics attribute vector and sample class, i=1,2 ..., N; Before in training dataset C, n sample is data in A, and a remaining m sample is the data in O, n+m=N; X wherein _i∈ X, X is input sample data, X _ibe the characteristic attribute vector of sample, dimension is q, Y _i∈ { 0 ,+1 } is the class label of sample.

3. bridge monitoring methods as claimed in claim 1, is characterized in that, in step 3), the computing method of normalization sample weights are: the initializes weights of each sample, divided by total sample weights, is obtained to the sample weights after normalization.

4. bridge monitoring methods as claimed in claim 1, is characterized in that, in step 4), in the training subset D of extraction, contained sample size is half in C.

5. bridge monitoring methods as claimed in claim 1, is characterized in that, in step 6), described weak learning algorithm is decision tree, artificial neural network, SVM.

6. bridge monitoring methods as claimed in claim 2, is characterized in that, in step 7), and Weak Classifier h _ttraining error rate ε on target training data _tfollowing calculating:

ϵ_{t} = Σ_{i = n + 1}^{n + m} \frac{ω_{i}^{t} | h_{t} (x_{i}) - y_{i} |}{Σ_{i = n + 1}^{n + m} ω_{i}^{t}}

Wherein

7. bridge monitoring methods as claimed in claim 1, is characterized in that, in step 9), every take turns training and finish after when the weighted value of training sample is lower than being regarded as redundant data under the threshold value setting in limited time, from training sample, delete this partial data.

8. bridge monitoring methods as claimed in claim 1, is characterized in that, in step 9), deconditioning when training sample sum is less than or equal to the minimum number of samples setting, stops step 4)～9) iteration.

9. bridge monitoring methods as claimed in claim 6, is characterized in that, in step 10), final integrated classifier is output as