CN104331716A - SVM active learning classification algorithm for large-scale training data - Google Patents

SVM active learning classification algorithm for large-scale training data

Info

Publication number
CN104331716A
Authority
CN
China
Prior art keywords
sample
classification
cluster
svm classifier
training
Prior art date
Legal status
Pending
Application number
CN201410665206.6A
Other languages
Chinese (zh)
Inventor
刘福江
林伟华
徐战亚
郭艳
黄彩春
郭振辉
Current Assignee
Wuhan Tu Ge Infotech Ltd
Original Assignee
Wuhan Tu Ge Infotech Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Tu Ge Infotech Ltd
Priority to CN201410665206.6A
Publication of CN104331716A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the intersection of remote sensing classification and image information processing technology, and in particular to an SVM active learning classification algorithm for large-scale training data. The algorithm is based on clustering and an uncertainty evaluation method: from a large number of samples it selects boundary samples that are far from their cluster centroid and close to the interface between two classes, and it optimizes the classifier iteratively by introducing active learning. The selection of boundary samples is therefore not blind but principled; through the iterative learning system the uncertainty information of a sample is compared with its distribution information, and the compressed set is automatically controlled and adjusted according to the result. An optimal training sample set is derived in this way, completing the automatic classification of remote sensing images and improving classification quality.

Description

SVM active learning classification algorithm for large-scale training data
Technical field
The present invention relates to the intersection of remote sensing image classification and image information processing technology, and in particular to an SVM active learning classification algorithm for large-scale training data.
Background art
Remote sensing images objectively record the strength of the electromagnetic radiation of surface objects and are one form of expression of remote sensing ground-object information. Land-cover classification based on remote sensing images has important applications in fields such as urban monitoring, agricultural monitoring, soil survey and forestry monitoring. Existing classification methods mainly use the spectral information of image pixels (sometimes supplemented by spatial information such as texture) and apply clustering criteria such as distance, angle or probability, or methods such as support vector machines and neural networks. When a supervised remote sensing classification system is built, sample data must be collected as training data in order to train the classification model. The training data is a key factor affecting the classification accuracy of a supervised system (Zhang Hua, 2012). With the development of remote sensing science and technology, remote sensing data are becoming increasingly high-dimensional and massive, and how to collect training data for a classification system from such large-scale remote sensing data has become a research problem for remote sensing classification methods (Gong Peng, 2009).
Traditional remote sensing classification systems usually collect training data by manual labelling, which is time-consuming, labour-intensive and costly, and manual interpretation is difficult. Therefore, in global or large-scale remote sensing image processing, the training sample library needs to be built automatically. For many years researchers in China and abroad have been seeking automatic and efficient interpretation methods for remote sensing images. A widely discussed direction is to bring remote sensing domain knowledge into the machine learning process, incorporating the knowledge used by experts during visual interpretation into automatic computer interpretation so as to raise the level of automation of the whole process. For example, the Global Forest Cover Change project of the team of Prof. John Townshend and Chengquan Huang at the University of Maryland, College Park incorporated object spectral knowledge into the computer interpretation algorithm and investigated the automatic acquisition of training samples. With this algorithm, the number of forest/non-forest samples generated automatically from a single Landsat ETM+ scene approaches ten million (C. Huang 2008, 2009; J. R. Townshend 2012; J. O. Sexton 2013).
At present, sample selection from large training sample sets is usually done by simple stratified equal-interval sampling, but because this ignores any information in the data, the method is blind. Selecting a good training sample set is a trial-and-error exercise, and trial and error is an iterative process: sample selection, classification, result evaluation and sample set update must be repeated until a satisfactory result is reached, which is very time-consuming. It is therefore necessary to introduce sample selection optimization methods from machine learning to solve the problem of automatically optimizing sample selection from large remote sensing training sets.
Summary of the invention
In order to overcome the above shortcomings, the present invention proposes an SVM active learning classification algorithm for large-scale training data. The method combines sample selection optimization techniques from machine learning, analyses the influence of different training samples on classification, uses a clustering method together with the uncertainty-based sampling strategy of active learning to choose boundary samples, and studies the optimization of the remote sensing image classifier when it is trained on boundary samples, improving classification accuracy and working efficiency.
The technical solution adopted by the present invention to solve the above technical problem is as follows: an SVM active learning classification algorithm for large-scale training data, characterized in that, first, a clustering method is applied to the mass of machine-labelled samples to select an initial compressed set and a training sample set; then the SVM classifier trained on the initial compressed set classifies a training sample subset, the classification accuracy is computed by comparison with the machine labels, and the misclassified samples are extracted; according to the classification model F, the class of each sample in the misclassified set is predicted, the samples with the smallest difference between the probability of the best label and the probability of the second-best label are selected as boundary samples and added to the initial compressed set to retrain the SVM classifier; the classifier is optimized iteratively on the training sample set, and the mean and variance of the classification accuracy over the last three iterations are computed; if the change in the mean falls below a small threshold and the variance approaches 0, iteration stops and the optimized SVM classifier is output; otherwise iteration continues.
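Written out, with a_j denoting the classification accuracy obtained in the j-th iteration, μ_t and σ_t² the mean and variance over the last three iterations, and ε a small tolerance (its value is an implementation choice, not fixed here), the stopping rule reads:

```latex
\mu_t = \tfrac{1}{3}\sum_{j=t-2}^{t} a_j, \qquad
\sigma_t^{2} = \tfrac{1}{3}\sum_{j=t-2}^{t} \bigl(a_j - \mu_t\bigr)^{2}, \qquad
\text{stop when } \lvert \mu_t - \mu_{t-1} \rvert < \varepsilon \ \text{and}\ \sigma_t^{2} \to 0 .
```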
Preferably, the method comprises the following steps: step 1), apply a clustering method based on the nearest-neighbour rule to the original machine-labelled samples to obtain the cluster centres of each class of samples, extract the cluster centres of the cluster subsets class by class, and use the cluster centres as the initial compressed set A;
Step 2), compute the cluster radius r of each cluster centroid, the within-cluster dispersion, and the distance d from each sample to its cluster centroid; let the within-cluster variance threshold be T; the samples whose distance d satisfies the boundary condition defined by the boundary parameters α and β (given below) form the training sample set B, and B is randomly divided into n equally sized subsets {b1, b2, b3, b4, b5, ..., bn};
Step 3), train the SVM classifier with the initial compressed set A to obtain the first classification model F;
Step 4), classify a training sample subset with the classification model F;
Step 5), evaluate the classification accuracy of this round and extract the misclassified samples from subset b1 to form a misclassified sample set;
Step 6), according to the classification model F, predict the class of each sample in the misclassified set, obtain its probability p(yi|x) of belonging to each possible class, compute the difference between the probability of the best label and the probability of the second-best label, and add the samples with the smallest difference to the boundary sample set G;
Step 7), add the boundary sample set G to the initial compressed set A to form the new initial compressed set;
Step 8), iterate steps 3-7, computing after each iteration the mean and variance of the classification accuracy over the last three iterations; if the change in the mean falls below a small threshold and the variance approaches 0, stop iterating and output the optimized SVM classifier; otherwise continue iterating. A minimal sketch of these steps is given below.
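The sketch below outlines steps 1)-8) using scikit-learn as a stand-in implementation; the invention does not prescribe a particular library, and values such as the number of clusters per class, the boundary parameters alpha and beta, the number of near-centroid samples kept per cluster, k_boundary and eps are illustrative assumptions.

```python
# Sketch of steps 1)-8); parameter values and KMeans as the clustering step are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def select_initial_sets(X, y, alpha=1.0, beta=0.6, n_clusters_per_class=1, n_core=20):
    """Steps 1)-2): cluster each class; samples nearest the centroids form the initial
    compressed set A, samples in the band near the cluster radius form set B."""
    A_idx, B_idx = [], []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        km = KMeans(n_clusters=n_clusters_per_class, n_init=10).fit(X[idx])
        d = np.linalg.norm(X[idx] - km.cluster_centers_[km.labels_], axis=1)
        r = d.max()                                   # cluster radius (assumed definition)
        A_idx.extend(idx[np.argsort(d)[:n_core]])     # near-centroid samples -> A
        band = (d >= beta * r) & (d <= alpha * r)     # boundary band near the radius
        B_idx.extend(idx[band])                       # boundary-band samples -> B
    return np.asarray(A_idx), np.asarray(B_idx)

def train_active_svm(X, y, n_subsets=5, k_boundary=50, eps=1e-3, max_iter=20):
    A_idx, B_idx = select_initial_sets(X, y)
    subsets = np.array_split(np.random.permutation(B_idx), n_subsets)  # {b1, ..., bn}
    clf, accs = SVC(kernel="rbf", probability=True), []
    for it in range(max_iter):
        clf.fit(X[A_idx], y[A_idx])                   # steps 3)/7): (re)train on A
        bi = subsets[it % n_subsets]
        pred = clf.predict(X[bi])                     # step 4): classify subset bi
        accs.append(np.mean(pred == y[bi]))           # step 5): accuracy vs machine labels
        wrong = bi[pred != y[bi]]                     # misclassified samples
        if len(wrong) == 0:
            break
        proba = clf.predict_proba(X[wrong])           # step 6): p(yi|x) for each class
        part = np.sort(proba, axis=1)
        margin = part[:, -1] - part[:, -2]            # best minus second-best probability
        G = wrong[np.argsort(margin)[:k_boundary]]    # smallest margins -> boundary set G
        A_idx = np.union1d(A_idx, G)                  # step 7): enlarge the compressed set
        # step 8): stop once the last three subset accuracies have stabilised
        if len(accs) >= 3 and np.ptp(accs[-3:]) < eps and np.var(accs[-3:]) < eps:
            break
    return clf
```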
The beneficial effects of the invention are as follows: the method is based on clustering and uncertainty assessment and selects from a large number of samples the boundary samples that are far from the cluster centroid yet close to the interface between two classes, and introduces active learning to optimize the classifier iteratively. The selection of boundary samples is not blind but principled: the iterative learning system continually compares the uncertainty information of a sample with its distribution information and automatically controls and adjusts the compressed set according to the comparison result, so that the optimal training sample set is derived, the automatic classification of remote sensing images is completed, and the classification quality is improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the improved SVM classification method that selects samples through active learning.
Fig. 2 is a schematic diagram of fitting the optimal classification surface with boundary samples.
Fig. 3 shows the distribution characteristics of boundary samples after the clustering analysis based on the nearest-neighbour rule.
Fig. 4 is a schematic diagram of the behaviour of boundary samples under the uncertainty-based probability analysis.
Detailed description of the embodiments
To realise the above technical solution, the invention addresses the following particular problems: the design of the initial compressed set, the decomposition strategy for the large training sample set, the generation of the training sample set, the design of the sample selection strategy and the determination of the stopping condition during iterative learning, the method for choosing the boundary sample set, and the computation of the dispersion of the sample set distribution.
Fig. 1 is a schematic diagram of the improved SVM classification method that selects samples through active learning. A clustering method based on the nearest-neighbour rule is applied to the original machine-labelled samples, and part of the samples near the class centroids are chosen as the initial compressed set A; the distance of each remaining sample to its cluster centroid, the cluster radius and the dispersion of each cluster are computed; the training-set selection parameters, such as the cluster dispersion threshold, are set, and the training sample set is selected from the remaining samples. The initial compressed set A is fed into the initial SVM classifier as training data to obtain the classification model F, and F is used to classify a training sample subset. The classification result is analysed and the accuracy of this round is computed; if the accuracy exceeds the expected threshold, the optimized image classifier is output; if it is below the expected threshold, boundary samples are further selected from the misclassified samples and added to the initial compressed set A, and optimization of the image classifier continues.
Support vector machine classification shows distinctive advantages in small-sample, non-linear and high-dimensional pattern recognition problems. The design of the initial compressed set depends on the similarity measure between samples; it determines the quality of the initial classification hyperplane and strongly affects the learning time and the stability of the final classifier in the subsequent active learning process, and the key to choosing the initial compressed set is to choose samples that are representative of their class. The present invention applies a clustering method based on the nearest-neighbour rule to the mass of machine-labelled original samples; because samples near a cluster centre represent the characteristics of that class well, part of the samples near the cluster centres are chosen as the initial compressed set.
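As an illustration of this selection of the initial compressed set, the following sketch uses a simple leader-style pass as a stand-in for the nearest-neighbour-rule clustering; the distance threshold and the fraction of near-centroid samples kept per cluster are assumptions made for the example, not values fixed by the invention.

```python
# Leader-style clustering as a stand-in for the nearest-neighbour rule, then per-class
# selection of the samples closest to their cluster centre as the compressed set A.
import numpy as np

def leader_clustering(X, threshold):
    centers, members = [], []
    for i, x in enumerate(X):
        if centers:
            dists = np.linalg.norm(np.asarray(centers) - x, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                members[j].append(i)
                centers[j] = X[members[j]].mean(axis=0)   # update the absorbing centre
                continue
        centers.append(x.copy())                          # start a new cluster at x
        members.append([i])
    return np.asarray(centers), members

def initial_compressed_set(X, y, threshold, frac=0.1):
    """Per class, keep the fraction of samples closest to their cluster centre."""
    A_idx = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centers, members = leader_clustering(X[idx], threshold)
        for j, m in enumerate(members):
            m = np.asarray(m)
            d = np.linalg.norm(X[idx][m] - centers[j], axis=1)
            keep = max(1, int(frac * len(m)))
            A_idx.extend(idx[m[np.argsort(d)[:keep]]])    # nearest-to-centre samples
    return np.asarray(A_idx)
```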
The quality of the training sample set is the deciding factor for the classification accuracy and convergence speed of the iterative system, and choosing the training set depends not only on the decomposition strategy for the machine-labelled sample set but also on the similarity between neighbouring samples. The massive original sample set is too large to be used directly as the training set of a support vector machine, so the invention screens the boundary samples located near the separating hyperplane and uses them to optimize the classifier. The reason for optimizing the classifier with boundary samples is that correctly identified samples in the training set make the learned class regions more compact and the separation between different classes larger; however, too many correctly identified samples easily make the learned class regions too narrow, which increases the risk that boundary samples and misclassified samples are assigned to the wrong class; conversely, misclassified samples in the training set tend to enlarge the learned class regions, making different classes overlap more easily and increasing the classification error.
Fig. 2 shows the distribution characteristics of boundary samples after the clustering analysis based on the nearest-neighbour rule. Choosing boundary samples requires analysing their characteristics: boundary samples lie near the classification surface in the hyperplane space, carry features of two classes at the same time, are ambiguous to discriminate, and their class features are not very pronounced. In the result of the clustering analysis based on the nearest-neighbour rule, the distribution of boundary samples shows that most boundary samples lie near the cluster radius, as shown by the hollow sample points in Fig. 2.
The invention chooses as the training sample set B the samples whose distance d to the cluster centroid in the clustering analysis lies within the boundary band defined by the boundary parameters.
α is the upper boundary parameter and β is the lower boundary parameter; the selection rule is written out below.
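With d_i the distance from sample x_i to its cluster centroid μ_{k(i)} and r_k the radius of cluster C_k (α and β are here taken as scale factors on the cluster radius, an assumed reading of the boundary parameters), the selection rule for the training sample set B can be written as:

```latex
d_i = \lVert x_i - \mu_{k(i)} \rVert, \qquad
r_k = \max_{i \in C_k} d_i, \qquad
x_i \in B \;\Longleftrightarrow\; \beta\, r_{k(i)} \le d_i \le \alpha\, r_{k(i)},
\qquad 0 < \beta \le \alpha \le 1 .
```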
After the training sample set B is determined, the initial compressed set A is used to train the initial support vector machine classifier, yielding the classification model F; F is then used to classify a training sample subset bi, and the classification accuracy of this round is evaluated. If the accuracy exceeds the expected threshold T, this SVM classifier is output; if it is below the expected threshold T, boundary samples are further screened from the misclassified samples and added to the initial compressed set A to retrain the SVM classifier.
Fig. 3 is a schematic diagram of fitting the optimal classification surface with boundary samples; the triangular and circular objects represent machine-labelled samples of different classes. In subfigure (a), the initial compressed set A is used to train the SVM classifier and the separating hyperplane F is obtained; the samples marked in red are the misclassified samples.
The samples in subfigure (b) are a subset bi of the training sample set B; the separating hyperplane is used to classify the subset bi, and the classification result is shown in subfigure (c).
Boundary samples lie near the separating hyperplane and are easily misclassified, so boundary samples can be chosen from the misclassified samples and then used to improve the separating hyperplane further. The samples in subfigure (c) whose classification result differs from their machine label are selected as the misclassified sample set, shown as the red-marked objects in the figure. The misclassified set does not consist entirely of boundary samples: in subfigure (c), the sample set H is the boundary sample set, whereas the samples in G lie far from the separating hyperplane and were nevertheless misclassified; the reason is that their label attribute was predicted wrongly during machine labelling, so the classification result disagrees with the machine-predicted label and they are counted as misclassified.
Because boundary samples lie near the classification surface in the hyperplane space and carry features of two classes at the same time, the invention selects the boundary samples distributed near the separating hyperplane by introducing an uncertainty-based sample selection method.
Fig. 4 is a schematic diagram of the behaviour of boundary samples under the uncertainty-based probability analysis. The uncertainty-based sample selection method works as follows: according to the current classification model F, the class of each sample in the misclassified set is predicted, its probability p(yi|x) of belonging to each possible class is obtained, and the difference between the probability of the best label and the probability of the second-best label is computed. The uncertainty judgement is then: samples whose probability difference is above the threshold are samples of high certainty and are discarded; samples whose difference is below the threshold are samples of high uncertainty and are added to the boundary sample set.
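A small sketch of this uncertainty filter follows; it assumes a classifier exposing predict_proba (e.g. an sklearn SVC trained with probability=True), and the threshold value 0.2 is illustrative rather than prescribed.

```python
# Keep only misclassified samples whose best-versus-second-best probability gap
# falls below a threshold; low gap means high uncertainty, i.e. a boundary sample.
import numpy as np

def boundary_by_margin(clf, X_misclassified, margin_threshold=0.2):
    proba = clf.predict_proba(X_misclassified)      # p(yi|x) for every candidate class
    part = np.sort(proba, axis=1)
    margin = part[:, -1] - part[:, -2]              # best-label minus second-best probability
    return np.where(margin < margin_threshold)[0]   # indices of the uncertain samples
```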
Once the screening of the boundary sample set is complete, the boundary samples are added to the initial compressed set A, which serves as the new compressed set to retrain the SVM classifier; steps 4-7 are iterated until the classification accuracy of the classifier exceeds the expected threshold.
The present invention proposes an SVM active learning classification algorithm for large-scale training data. It combines the sample selection optimization methods of machine learning, actively selecting the samples to learn from and thereby effectively reducing the sample complexity of the learning algorithm; it analyses the influence of different training samples on classification and selects the training set under the premise of achieving the same or better learning performance, effectively reducing the cost of manual sample labelling. It then uses a clustering method and an uncertainty-based sampling strategy to choose boundary samples and studies the optimization of the remote sensing image classifier trained on boundary samples, effectively addressing practical problems brought by the explosive growth of remote sensing data, such as sample selection and degraded classification accuracy, rather than taking an improvement in classification accuracy as the sole criterion for judging classifier quality.

Claims (4)

1. An improved SVM classification method that selects samples based on active learning, characterized in that the method comprises the following steps:
Step a), first perform cluster analysis on the mass of machine-labelled samples, choose part of the samples of each class near the cluster centres as the initial compressed set A, compute each sample's distance d to its cluster centroid, the cluster radius r and the within-cluster variance, and choose the cluster-ambiguous samples as the training sample set B{b1, b2, b3, b4, b5, ..., bn};
Step b), train the SVM classifier with the initial compressed set, classify a training sample subset bi (i = 1, 2, ..., n) with this classifier, compute the classification accuracy, pick out the misclassified samples in the classification result, predict the class of each sample with the current classification model, and then use the uncertainty-based sample selection criterion to further pick out, from the misclassified samples, the boundary samples near the separating hyperplane;
Step c), add the boundary samples to the initial compressed set A and iterate step b); when the classification accuracy remains at a high level, stop iterating and output the optimized SVM classifier.
2. The improved SVM classification method that selects samples based on active learning as claimed in claim 1, characterized in that step a) comprises the following specific steps:
Step a1), perform cluster analysis on the mass of machine-labelled samples to obtain the cluster centre of each class, and select part of the samples near the cluster centres of each class to form the initial compressed set A;
Step a2), compute each sample's distance d to its cluster centroid, the cluster radius r and the within-cluster variance; let the within-cluster variance threshold be T; the samples whose distance d satisfies the boundary condition defined by the boundary parameters form the training sample set B, and B is randomly divided into n equally sized subsets {b1, b2, b3, b4, b5, ..., bn}.
3. The improved SVM classification method that selects samples based on active learning as claimed in claim 1, characterized in that step b) comprises the following specific steps:
Step b1), train the SVM classifier with the initial compressed set A to obtain the first classification model F, and then use F to classify a subset;
Step b2), evaluate the classification accuracy of this round and extract the misclassified samples from subset b1 to form a misclassified sample set;
Step b3), according to the classification model F, predict the class of each sample in the misclassified set, obtain its probability p(yi|x) of belonging to each possible class, compute the difference between the probability of the best label and the probability of the second-best label, and select the samples with the smallest difference; these samples form the boundary sample set G.
4. The improved SVM classification method that selects samples based on active learning as claimed in claim 1, characterized in that step c) comprises the following specific steps:
Step c1), add the boundary sample set G to the initial compressed set A, then iterate step b), training the SVM classifier with the new compressed set and evaluating the classification accuracy; compute the mean and variance of the classification accuracy over the last 3 iterations; if the change in the mean falls below a small threshold and the variance approaches 0, stop iterating and output the optimized SVM classifier; otherwise continue iterating step b).
CN201410665206.6A 2014-11-20 2014-11-20 SVM active learning classification algorithm for large-scale training data Pending CN104331716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410665206.6A CN104331716A (en) 2014-11-20 2014-11-20 SVM active learning classification algorithm for large-scale training data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410665206.6A CN104331716A (en) 2014-11-20 2014-11-20 SVM active learning classification algorithm for large-scale training data

Publications (1)

Publication Number Publication Date
CN104331716A true CN104331716A (en) 2015-02-04

Family

ID=52406437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410665206.6A Pending CN104331716A (en) 2014-11-20 2014-11-20 SVM active learning classification algorithm for large-scale training data

Country Status (1)

Country Link
CN (1) CN104331716A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228674A (en) * 2016-12-22 2018-06-29 上海谦问万答吧云计算科技有限公司 A kind of information processing method and device based on DKT
CN108280021A (en) * 2018-01-25 2018-07-13 郑州云海信息技术有限公司 A kind of logging level analysis method based on machine learning
CN108287816A (en) * 2017-01-10 2018-07-17 腾讯科技(深圳)有限公司 Point of interest on-line checking, Machine learning classifiers training method and device
CN108805173A (en) * 2018-05-16 2018-11-13 苏州迈为科技股份有限公司 Solar battery sheet aberration method for separating
CN108805944A (en) * 2018-05-29 2018-11-13 东华大学 A kind of online image set compression method sorted out precision and kept
CN108848252A (en) * 2018-05-15 2018-11-20 广东工业大学 A method of UPS data monitoring is realized based on cell phone application application
CN109034188A (en) * 2018-06-15 2018-12-18 北京金山云网络技术有限公司 Acquisition methods, acquisition device, equipment and the storage medium of machine learning model
CN109086422A (en) * 2018-08-08 2018-12-25 武汉斗鱼网络科技有限公司 A kind of recognition methods, device, server and the storage medium of machine barrage user
CN109726641A (en) * 2019-01-24 2019-05-07 常州大学 A kind of remote sensing image cyclic sort method based on training sample Automatic Optimal
CN109977994A (en) * 2019-02-02 2019-07-05 浙江工业大学 A kind of presentation graphics choosing method based on more example Active Learnings
CN111612062A (en) * 2020-05-20 2020-09-01 清华大学 Method and system for constructing semi-supervised learning model of graph
CN112733932A (en) * 2021-01-08 2021-04-30 北京匠数科技有限公司 Model accelerated training method and device based on training data similarity aggregation
CN113095397A (en) * 2021-04-03 2021-07-09 国家计算机网络与信息安全管理中心 Image data compression method based on hierarchical clustering method
CN115984559A (en) * 2022-12-27 2023-04-18 二十一世纪空间技术应用股份有限公司 Intelligent sample selection method and related device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699678A (en) * 2013-12-31 2014-04-02 苏州大学 Hierarchical clustering method and system based on multistage layered sampling
US8837839B1 (en) * 2010-11-03 2014-09-16 Hrl Laboratories, Llc Method for recognition and pose estimation of multiple occurrences of multiple objects in visual images
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8837839B1 (en) * 2010-11-03 2014-09-16 Hrl Laboratories, Llc Method for recognition and pose estimation of multiple occurrences of multiple objects in visual images
CN103699678A (en) * 2013-12-31 2014-04-02 苏州大学 Hierarchical clustering method and system based on multistage layered sampling
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zhang Li et al.: "Training sample selection method based on boundary samples", Journal of Beijing University of Posts and Telecommunications *
Hu Zhengping et al.: "SVM training sample reduction and selection algorithm based on improved weighted condensed nearest neighbour and nearest boundary rules", Journal of Yanshan University *
Jin Liang et al.: "Multi-class image classification based on HS sample selection and BvSB feedback", Journal of Guizhou Normal University (Natural Sciences) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228674B (en) * 2016-12-22 2020-06-26 北京字节跳动网络技术有限公司 DKT-based information processing method and device
CN108228674A (en) * 2016-12-22 2018-06-29 上海谦问万答吧云计算科技有限公司 A kind of information processing method and device based on DKT
CN108287816B (en) * 2017-01-10 2021-06-04 腾讯科技(深圳)有限公司 Interest point online detection and machine learning classifier training method and device
CN108287816A (en) * 2017-01-10 2018-07-17 腾讯科技(深圳)有限公司 Point of interest on-line checking, Machine learning classifiers training method and device
CN108280021A (en) * 2018-01-25 2018-07-13 郑州云海信息技术有限公司 A kind of logging level analysis method based on machine learning
CN108848252A (en) * 2018-05-15 2018-11-20 广东工业大学 A method of UPS data monitoring is realized based on cell phone application application
CN108805173A (en) * 2018-05-16 2018-11-13 苏州迈为科技股份有限公司 Solar battery sheet aberration method for separating
CN108805944A (en) * 2018-05-29 2018-11-13 东华大学 A kind of online image set compression method sorted out precision and kept
CN108805944B (en) * 2018-05-29 2022-05-06 东华大学 Online image set compression method with maintained classification precision
CN109034188A (en) * 2018-06-15 2018-12-18 北京金山云网络技术有限公司 Acquisition methods, acquisition device, equipment and the storage medium of machine learning model
CN109086422A (en) * 2018-08-08 2018-12-25 武汉斗鱼网络科技有限公司 A kind of recognition methods, device, server and the storage medium of machine barrage user
CN109086422B (en) * 2018-08-08 2021-02-02 武汉斗鱼网络科技有限公司 Machine bullet screen user identification method, device, server and storage medium
CN109726641A (en) * 2019-01-24 2019-05-07 常州大学 A kind of remote sensing image cyclic sort method based on training sample Automatic Optimal
CN109726641B (en) * 2019-01-24 2023-02-28 常州大学 Remote sensing image cyclic classification method based on automatic optimization of training samples
CN109977994B (en) * 2019-02-02 2021-04-09 浙江工业大学 Representative image selection method based on multi-example active learning
CN109977994A (en) * 2019-02-02 2019-07-05 浙江工业大学 A kind of presentation graphics choosing method based on more example Active Learnings
CN111612062A (en) * 2020-05-20 2020-09-01 清华大学 Method and system for constructing semi-supervised learning model of graph
CN111612062B (en) * 2020-05-20 2023-11-24 清华大学 Method and system for constructing graph semi-supervised learning model
CN112733932A (en) * 2021-01-08 2021-04-30 北京匠数科技有限公司 Model accelerated training method and device based on training data similarity aggregation
CN113095397A (en) * 2021-04-03 2021-07-09 国家计算机网络与信息安全管理中心 Image data compression method based on hierarchical clustering method
CN115984559A (en) * 2022-12-27 2023-04-18 二十一世纪空间技术应用股份有限公司 Intelligent sample selection method and related device
CN115984559B (en) * 2022-12-27 2024-01-12 二十一世纪空间技术应用股份有限公司 Intelligent sample selection method and related device

Similar Documents

Publication Publication Date Title
CN104331716A (en) SVM active learning classification algorithm for large-scale training data
CN106909902B (en) Remote sensing target detection method based on improved hierarchical significant model
CN109902806A (en) Method is determined based on the noise image object boundary frame of convolutional neural networks
CN110110802A (en) Airborne laser point cloud classification method based on high-order condition random field
Zheng et al. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval.
Zhao et al. Automatic recognition of loess landforms using Random Forest method
CN107341517A (en) The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN106599827A (en) Small target rapid detection method based on deep convolution neural network
CN102279929B (en) Remote-sensing artificial ground object identifying method based on semantic tree model of object
CN101893704A (en) Rough set-based radar radiation source signal identification method
CN104346620A (en) Inputted image pixel classification method and device, and image processing system
Li et al. Pushing the “Speed Limit”: high-accuracy US traffic sign recognition with convolutional neural networks
CN107247956A (en) A kind of fast target detection method judged based on grid
CN106408011A (en) Laser scanning three-dimensional point cloud tree automatic classifying method based on deep learning
CN111753985A (en) Image deep learning model testing method and device based on neuron coverage rate
CN108537102A (en) High Resolution SAR image classification method based on sparse features and condition random field
CN105427309A (en) Multiscale hierarchical processing method for extracting object-oriented high-spatial resolution remote sensing information
CN107256017B (en) Route planning method and system
CN109325502A (en) Shared bicycle based on the progressive extracted region of video parks detection method and system
CN101556650A (en) Distributed self-adapting pulmonary nodule computer detection method and system thereof
CN105825233B (en) A kind of pedestrian detection method based on on-line study random fern classifier
CN104268552B (en) One kind is based on the polygonal fine classification sorting technique of part
CN105718866A (en) Visual target detection and identification method
CN104156945A (en) Method for segmenting gray scale image based on multi-objective particle swarm optimization algorithm
Gleason et al. A fusion approach for tree crown delineation from lidar data.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150204