CN104156438A - Unlabeled sample selection method based on confidence coefficients and clustering - Google Patents

Unlabeled sample selection method based on confidence coefficients and clustering Download PDF

Info

Publication number
CN104156438A
CN104156438A CN201410395794.6A CN201410395794A
Authority
CN
China
Prior art keywords
sample
cluster
confidence
clustering
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410395794.6A
Other languages
Chinese (zh)
Inventor
王荣燕 (Wang Rongyan)
谢延红 (Xie Yanhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dezhou University
Original Assignee
Dezhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dezhou University filed Critical Dezhou University
Priority to CN201410395794.6A priority Critical patent/CN104156438A/en
Publication of CN104156438A publication Critical patent/CN104156438A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unlabeled sample selection method based on confidence coefficients and clustering. The method first clusters all unlabeled samples and selects those close to the cluster boundaries. After the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize the selected boundary samples, and unlabeled samples in different confidence intervals are selected for TSVM training. Once the confidence of each sample has been obtained, a threshold lambda is defined, and the unlabeled samples whose confidence exceeds lambda are passed into the semi-supervised learning of the next layer, so that, conditioned on the classifier of the previous layer, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier. The selected unlabeled samples thus both represent the boundary distribution of the samples and maximize, at every layer, the probability that the chosen samples belong to the classes of the next layer's classifier.

Description

Method for selecting unlabeled samples based on confidence and clustering
Technical field
The present invention relates to the fields of machine learning and pattern recognition, and specifically to a method for selecting unlabeled samples based on confidence and clustering.
Background technology
At present, a common problem in learning supervised classification models is the shortage of labeled samples. As digital content acquisition technology matures and mass storage becomes cheap, audio information on the network grows rapidly; obtaining large numbers of unlabeled audio samples is therefore very easy, while manual labeling remains too expensive. As a result, in many audio data sets the number of unlabeled samples far exceeds the number of labeled samples. If only the small set of labeled samples is used, the classification model obtained by supervised learning can hardly generalize well, and at the same time the information carried by the large number of unlabeled samples cannot be fully exploited, causing a waste of information. Against this background, semi-supervised learning (Semi-supervised Learning), which studies how to exploit a large number of unlabeled samples to improve learning performance when only a few labeled samples are available, has attracted wide attention and become one of the important research fields of machine learning and pattern recognition.
Semi-supervised learning has a wide range of applications in practical problems; its research results have been applied to fields such as speech recognition, image recognition and image retrieval, video annotation, natural language processing, and biometric recognition. Since large numbers of unlabeled audio files exist on the network, applying semi-supervised learning to complex audio classification is a natural step.
At present, research on semi-supervised learning focuses mostly on how to use unlabeled samples, and much less on which unlabeled samples actually help. For example, in the TSVM learning method proposed by Thorsten Joachims, it is proved that exchanging the labels pre-assigned to two unlabeled samples satisfying certain conditions can further optimize the objective function of the support vector machine, and the experiments in that work show the classification performance of the semi-supervised learner improving steadily as the number of unlabeled samples grows. In many experiments, however, the classification performance of a semi-supervised audio classifier does not keep improving as unlabeled samples are added. This shows that, when labeled samples are limited, not every unlabeled sample is helpful to semi-supervised learning; the performance of the semi-supervised learner depends on which unlabeled samples are added. Aarti Singh et al. likewise point out in the literature that not every unlabeled sample helps semi-supervised learning. For this specific problem, the present invention proposes an algorithm for selecting unlabeled samples based on confidence and clustering; the algorithm makes better use of unlabeled samples to improve the performance of audio classifiers, and provides a reference for unlabeled sample selection when semi-supervised learning is applied to other fields.
Summary of the invention
The purpose of the present invention is to overcome the above deficiencies by providing a method for selecting unlabeled samples based on confidence and clustering, suited to processing the audio stream files of movie and television programs. In such programs the dominant audio type over time is speech, while the time occupied by all other audio types together is comparatively short, so the data suffer from an equally serious class-imbalance problem. To weaken the effect of this imbalance on classification performance, the present invention adopts a hierarchical TSVM algorithm and proposes an algorithm for selecting unlabeled samples in semi-supervised learning based on confidence and clustering.
The technical solution adopted by the present invention is a method for selecting unlabeled samples based on confidence and clustering. A hierarchical TSVM classifier is used, together with a confidence-and-clustering-based unlabeled sample selection algorithm that improves the performance of the TSVM algorithm. The unlabeled samples used for TSVM learning are selected from a large pool of unlabeled samples and must satisfy specific conditions before they can improve the performance of the semi-supervised learner. First, all unlabeled samples are clustered and the samples closer to the cluster boundaries are selected. After the boundary samples have been selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are chosen for TSVM training. Once the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer; the goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier. The selected unlabeled samples thus both represent the boundary distribution of the samples and maximize, at every layer, the probability that the chosen samples belong to the classes of the next layer's classifier.
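By way of orientation, here is a minimal Python sketch of one layer of this pipeline, under simplifying assumptions: KMeans in the input space stands in for the kernel-space clustering detailed later, scikit-learn's SVC with probability outputs plays the supervised SVM in place of eq. (5-5), and the function name and the parameters band and lam are illustrative choices rather than part of the invention. The returned samples would then be handed to a TSVM trainer, which scikit-learn does not provide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def select_unlabeled_for_layer(X_labeled, y_labeled, X_unlabeled,
                               n_clusters=10, band=0.8, lam=0.7):
    # 1. Cluster the unlabeled pool and keep the samples near cluster
    #    boundaries: points farther from their centre than `band` times
    #    the cluster radius.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_unlabeled)
    d = np.linalg.norm(X_unlabeled - km.cluster_centers_[km.labels_], axis=1)
    radius = np.array([d[km.labels_ == k].max() for k in range(n_clusters)])
    boundary = X_unlabeled[d > band * radius[km.labels_]]

    # 2. Train a supervised SVM on the labeled samples and score the
    #    selected boundary samples with class-membership probabilities.
    svm = SVC(probability=True).fit(X_labeled, y_labeled)
    conf = svm.predict_proba(boundary).max(axis=1)

    # 3. Keep the samples whose confidence exceeds the threshold lambda;
    #    these enter the next layer's semi-supervised (TSVM) training.
    return boundary[conf > lam]
```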
The conditions that the unlabeled samples must satisfy and the selection method are as follows:
(1) The present invention adopts a hierarchical TSVM classifier; two TSVM classifiers are trained at every layer, and the samples added to semi-supervised learning should belong respectively to the classes corresponding to each classifier.
At the first layer, in the semi-supervised learning of silence versus non-silence, all unlabeled samples belong to these two classes; therefore all unlabeled samples satisfy the above condition.
After the first layer's classifier is trained, two branches are obtained; suppose the left branch is the positive class and the right branch the negative class, and let each branch select the samples belonging to it. The selection method is as follows. Define Con(x) as the confidence that sample x belongs to a certain class, and suppose the classification surface obtained by training the first layer is f(x); this surface is used to recognize all the unlabeled samples again. The confidence that x belongs to a given class with respect to the classification surface f(x) can be represented by the probability that x belongs to that class; the present invention adopts the improvement by Lin et al. of Platt's probabilistic output to estimate this probability, namely:
$$\mathrm{Con}(x) = \mathrm{prob}(y \mid x) = \begin{cases} \dfrac{\exp(Af(x)+B)}{1+\exp(Af(x)+B)}, & \text{if } f(x) > 0 \\ \dfrac{1}{1+\exp(Af(x)+B)}, & \text{otherwise} \end{cases} \quad (5\text{-}5)$$
where A and B are jointly determined by the training data and the classification results f(x).
After the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer. The goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier.
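As an illustration, eq. (5-5) and the threshold-λ selection might be computed as follows; this is a sketch assuming the sigmoid parameters A and B have already been fitted from the training data and the decision values f(x), as the text states.

```python
import numpy as np

def confidence(f, A, B):
    """Eq. (5-5): confidence Con(x) from SVM decision values f(x)."""
    s = np.exp(np.clip(A * f + B, -30.0, 30.0))  # clip to avoid overflow
    return np.where(f > 0, s / (1.0 + s), 1.0 / (1.0 + s))

def select_above_threshold(X_unlabeled, f, A, B, lam=0.7):
    """Keep the unlabeled samples whose confidence exceeds lambda."""
    return X_unlabeled[confidence(f, A, B) > lam]
```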
(2) The TSVM classifier is in essence still an SVM classifier and therefore must likewise find the support-vector samples. Hence, when selecting unlabeled samples, samples near the support vectors are more helpful to classifier training. Here too, preferring boundary samples among the unlabeled samples is realized by clustering.
A discriminative model models the differences between the distributions of samples of different classes; for this kind of model, the samples located at class boundaries are the most useful. Here, clustering can likewise be used to extract the spatial distribution of the unlabeled sample set. The information that contributes to a discriminative model lies mainly at the cluster boundaries, i.e., the regions where the samples are sparsely distributed, because the classification surface of a discriminative model is usually located in these sparse regions.
As for the choice of discriminative classifier: in recent years support vector machines have shown great superiority in classification problems, so the discriminative model used in the present invention is the support vector machine. Unless stated otherwise, the discriminative models mentioned below all refer to support vector machines.
A crucial technique in support vector machines is the kernel function, which maps the sample feature space to a kernel space in which the model is trained; problems that are inseparable in a low-dimensional feature space can become separable through the kernel function. To keep distances consistent, clustering must also be carried out after transforming the feature space to the kernel space in which the support vector machine operates.
If the support vector machine uses a linear kernel, the distance metric for clustering is simply the Euclidean distance in the original feature space. If it uses some other non-linear kernel, then for the affinity propagation clustering algorithm the similarity must be set to the negative of the sample distance in kernel space. The distance between samples in kernel space is computed as follows. Let the sample set in the original sample feature space be S = {x_1, x_2, ..., x_n}, x_i ∈ R^d, mapped to a higher-dimensional space F by the mapping Φ(x), i.e., x = (x_1, ..., x_d) → Φ(x) = (Φ_1(x), ..., Φ_n(x)). In the higher-dimensional space F, the Euclidean distance between samples x_i and x_j is:
$$d_F(i,j) = \|\Phi(x_i)-\Phi(x_j)\|^2 = [\Phi(x_i)-\Phi(x_j)]^T[\Phi(x_i)-\Phi(x_j)] = \Phi^T(x_i)\Phi(x_i) + \Phi^T(x_j)\Phi(x_j) - 2\Phi^T(x_i)\Phi(x_j) = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j) \quad (24)$$
where K(·,·) = Φ^T(·)·Φ(·) is the kernel function; the computation in the higher-dimensional space F is thereby converted into a computation in kernel space. In the affinity propagation algorithm, the present invention defines the similarity between samples x_i and x_j as the negative of their distance, that is:
$$s_F(i,j) = -d_F(i,j) \quad (25)$$
s_F(i,j) denotes the similarity between sample points in kernel space; this similarity is the input to the affinity propagation algorithm, and the remainder of the algorithm is identical to the affinity propagation algorithm introduced in the previous section.
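A possible realization of eqs. (24)-(25) with a precomputed kernel matrix is sketched below. The RBF kernel is an illustrative stand-in for whatever kernel the SVM actually uses; scikit-learn's AffinityPropagation accepts the resulting similarity matrix via affinity="precomputed".

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import rbf_kernel

def kernel_space_similarity(X, kernel=rbf_kernel):
    K = kernel(X)                                # K[i, j] = K(x_i, x_j)
    diag = np.diag(K)
    d = diag[:, None] + diag[None, :] - 2.0 * K  # distances in F, eq. (24)
    return -d                                    # similarities s_F, eq. (25)

# Affinity propagation on the precomputed kernel-space similarities:
# ap = AffinityPropagation(affinity="precomputed").fit(kernel_space_similarity(X))
```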
The K-means clustering algorithm in kernel space is more complex than affinity propagation in kernel space; the complexity lies mainly in how to find the cluster centers in the space F. The K-means clustering algorithm in kernel space is introduced below.
In the space F, let
$$m_{Fk} = \frac{1}{N_k}\sum_{i=1}^{N_k}\Phi(x_i) \quad (26)$$
m_{Fk} is the mean of the k-th cluster in the space F, and the objective function of the K-means clustering algorithm in F becomes:
$$J_F = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\left\|\Phi(x_i)-m_{Fk}\right\|^2 = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\left\|\Phi(x_i)-\frac{1}{N_k}\sum_{j=1}^{N_k}\Phi(x_j)\right\|^2 = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\left[K(x_i,x_i)+\frac{1}{N_k^2}\sum_{j=1}^{N_k}\sum_{l=1}^{N_k}K(x_j,x_l)-\frac{2}{N_k}\sum_{j=1}^{N_k}K(x_i,x_j)\right] \quad (27)$$
The above formula converts the objective function in the space F into a kernel-space computation. When the cluster centers are updated, the expression of the function Φ(·) is unknown, so the centers cannot be computed directly from formula (26). In the papers [20][22], an approximation is obtained by taking each sample point inside a cluster in turn as a candidate center: for the k-th cluster and each of its interior points x_i, i ∈ [1, N_k], the sum of the distances in F, computed by formula (24), between x_i and every other point x_j, j ∈ [1, N_k], j ≠ i, in the cluster is evaluated. The point that minimizes this sum of distances to all other points in the cluster is taken as the point m_k in the original feature space corresponding to the cluster center in F, i.e., m_{Fk} = Φ(m_k), that is:
$$m_k = \arg\min_{x_i,\ i\in[1,N_k]} \sum_{j=1}^{N_k}\left[K(x_i,x_i)+K(x_j,x_j)-2K(x_i,x_j)\right] \quad (28)$$
Once the cluster centers have been obtained, the sample points can be redistributed; however, centers computed this way deviate from the true cluster centers. For this problem, the present invention proposes another point-assignment algorithm that does not require computing the cluster centers at every iteration. The algorithm is as follows:
Suppose the clustering result at iteration n−1 is $CLT^{(n-1)} = \{Clt_1^{(n-1)}, Clt_2^{(n-1)}, \ldots, Clt_K^{(n-1)}\}$, where K is the number of clusters obtained and $Clt_k^{(n-1)}$ denotes the set of points in the k-th cluster, $Clt_k^{(n-1)} = \{clt\_s_{k1}^{(n-1)}, clt\_s_{k2}^{(n-1)}, \ldots, clt\_s_{kN_k}^{(n-1)}\}$, with $N_k$ the number of samples in the k-th cluster. When the samples are redistributed at iteration n, the following quantity is computed for each sample $x_i$:
$$d_F\!\left(i,\,Clt_k^{(n-1)}\right) = \sum_{j=1}^{N_k^{(n-1)}}\left[K(x_i,x_i)+K\!\left(clt\_s_{kj}^{(n-1)},\,clt\_s_{kj}^{(n-1)}\right)-2K\!\left(x_i,\,clt\_s_{kj}^{(n-1)}\right)\right] \quad (29)$$
At iteration n, the cluster $k^*$ to which the sample point $x_i$ belongs is:
$$k^* = \arg\min_{k\in[1,K]} d_F\!\left(i,\,Clt_k^{(n-1)}\right) \quad (30)$$
Each sample point is assigned the label of the cluster it falls into. When the points are assigned iteratively by the above method, the cluster centers need not be computed at each step; only when the iteration stops are the final cluster centers computed by formula (28).
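The reassignment scheme of eqs. (29)-(30), with the medoid centers of eq. (28) computed only once after convergence, might be sketched as follows; the function name, the initialization by a given label vector, and the skipping of emptied clusters are illustrative assumptions.

```python
import numpy as np

def kernel_kmeans(K, labels, n_iters=50):
    """K: precomputed kernel matrix (n x n); labels: initial assignment (n,)."""
    diag = np.diag(K)
    # Pairwise squared distances in the feature space F, eq. (24).
    D = diag[:, None] + diag[None, :] - 2.0 * K
    n_clusters = labels.max() + 1
    for _ in range(n_iters):
        # eq. (29): summed distance of every sample to each cluster's members.
        dist = np.stack([D[:, labels == k].sum(axis=1)
                         for k in range(n_clusters)], axis=1)
        new_labels = dist.argmin(axis=1)          # eq. (30)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # eq. (28): after convergence, take each cluster's medoid as its centre.
    medoids = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            continue  # an emptied cluster has no medoid
        within = D[np.ix_(members, members)].sum(axis=1)
        medoids.append(members[within.argmin()])
    return labels, np.array(medoids)
```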
The adjustments introduced above keep the clustering space consistent with the classification space. With the clustering result obtained under this consistent space, the useful-sample selection algorithm for discriminative classifiers in the present invention is introduced below.
As introduced above, when the clustering result is good (each cluster is pure, with most samples belonging to the same class), the samples at a cluster's edge are likely to lie on the class boundary, and such samples are very useful for learning a support vector machine model. Under ordinary circumstances, however, the unlabeled audio segments cut from raw audio streams are rather messy: the numbers of segments of the various audio types differ greatly, and segments of generic audio types in particular last a long time. Even with some preprocessing, the numbers of segments of the various audio types cannot be guaranteed to be comparable, so the result obtained after clustering is often not ideal. In this case the samples at the cluster boundary are not necessarily samples on the true class boundary: there may be outliers among them, as well as samples easily confused between two or even more classes. If only such samples were selected for labeling, they could not represent the true information of the class boundary; with a finite selection of samples, a support vector machine classifier trained on them would yield a classification surface distorted by these confusable boundary points, giving low classification accuracy. In view of this, the useful samples chosen by the present invention are those slightly inside the cluster boundary: among such samples there are fewer outliers, and most belong to the same class. At the same time, the chosen samples are not too close to the cluster center either, because samples near the center carry little information for a support vector machine classifier; most of them lie in regions far from the separating surface and help very little in finding the support vectors.
As for the true boundary samples, after part of the samples have been labeled according to this chapter, active learning can be carried out to obtain further useful samples; selecting the samples to be labeled by this cluster-based approach lays the foundation for choosing higher-quality samples for active learning. The method of active learning itself is not the focus of the present invention. The method for selecting the samples to be labeled for a discriminative classifier is introduced below. Similarly to the generative-model classifier, let d_F(i,k) denote the distance in the space F from a sample point to its cluster center. In the k-th cluster of F, the distance from sample x_i to the cluster center is:
$$d_F(i,k) = K(clt\_s_{ki},\,clt\_s_{ki}) + K(c_k,c_k) - 2K(clt\_s_{ki},\,c_k) \quad (31)$$
where $c_k$ denotes the center of the k-th cluster in the D-dimensional space.
The distances from all points in a cluster to its center are collected, and the maximum distance is taken as the cluster radius $r_k$. Given a real number λ, λ ∈ [0,1], the cluster is divided at radius $\lambda r_k$, and $M_k$ points are sampled from the region near the cluster boundary (at distance between $\lambda r_k$ and $r_k$ from the center) as the useful points for the discriminative classifier; $M_k$ is determined in the same way as for the generative model. The procedure for sample selection under the discriminative model is as follows:
Compute the kernel-space radii: for each cluster k, k ∈ [1, N], compute the distance $d_F(i,k)$ between each sample $clt\_s_{ki}$, i ∈ [1, N_k], and the cluster center, and find the maximum radius $r_k$.
Divide the clusters: given a real number λ, λ ∈ [0,1], divide the k-th cluster, k ∈ [1, N], at radius $\lambda r_k$.
Choose the sample points: for each cluster, randomly sample $M_k$ samples from the region within the cluster between $\lambda r_k$ and $r_k$ from the center, where, according to formula (23), $M_k = w_k r N_k$ and $N_k$ is the total number of samples in the cluster.
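A sketch of this boundary-band sampling is given below; the rate parameter stands in for the w_k·r factor of formula (23), which is defined earlier in the source thesis and not reproduced here, and the default values are illustrative.

```python
import numpy as np

def select_boundary_band(D, labels, medoids, lam=0.7, rate=0.2, seed=0):
    """D: pairwise kernel-space distances; medoids: per-cluster medoid indices."""
    rng = np.random.default_rng(seed)
    chosen = []
    for k, m in enumerate(medoids):
        members = np.flatnonzero(labels == k)
        d = D[members, m]                       # distance to centre, eq. (31)
        band = members[d > lam * d.max()]       # the ring near the boundary
        m_k = max(1, int(rate * members.size))  # stands in for M_k of eq. (23)
        pick = rng.choice(band, size=min(m_k, band.size), replace=False)
        chosen.extend(pick.tolist())
    return np.asarray(chosen)
```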
When the first-layer classifier selects unlabeled samples, all unlabeled samples are first clustered and, following the method for selecting samples to be labeled for an SVM classifier, the samples closer to the cluster boundaries are selected; likewise, to guarantee that the space in which the samples are chosen is consistent with the space used when training the SVM, the clustering algorithm is run in the corresponding kernel space. After the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training. According to the confidence interval used, the unlabeled sample selection algorithm can be divided into high-confidence selection, low-confidence selection, and middle-confidence selection: high-confidence selection picks the unlabeled samples above a certain confidence threshold; low-confidence selection picks those below a certain confidence threshold; middle-confidence selection picks those whose confidence lies between two thresholds.
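The three interval-based selections might be expressed as follows, assuming per-sample confidences conf (for example from eq. (5-5)) and two user-chosen thresholds t_low < t_high; the threshold values shown are placeholders.

```python
import numpy as np

def split_by_confidence(X, conf, t_low=0.3, t_high=0.7):
    high = X[conf > t_high]                      # high-confidence selection
    low = X[conf < t_low]                        # low-confidence selection
    mid = X[(conf >= t_low) & (conf <= t_high)]  # middle-confidence selection
    return high, mid, low
```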
When any layer other than the first selects unlabeled samples, only the unlabeled samples already selected in (1) are clustered: the high-confidence samples selected by the previous layer's TSVM classifier are clustered, the samples close to the cluster boundaries are selected, and then a supervised SVM classifier is trained on the current layer's labeled samples and used to recognize the selected boundary samples; the unlabeled samples in different confidence intervals are then selected for TSVM training.
In the original TSVM problem the test samples are given and cannot be chosen; but when TSVM is used to train a classifier, the aim is to exploit the information contained in unlabeled samples to train a classifier with strong generalization ability, and a large pool of unlabeled samples is available to choose from. Experiments show that more unlabeled samples do not necessarily make the resulting classifier generalize better, which illustrates that, with finite samples, not every unlabeled sample is useful; yet the general TSVM algorithm imposes no qualifying conditions on the unlabeled samples. The three standard assumptions of semi-supervised learning all suppose a specific relation between the distribution of the samples (unlabeled and labeled alike) and the classes: only when the chosen unlabeled samples are consistent with the true sample distribution can the performance of the semi-supervised learner improve. Meanwhile, since the present invention adopts a hierarchical TSVM classifier, the confidence-and-clustering-based unlabeled sample selection algorithm is used to improve the performance of the TSVM algorithm.
Brief description of the drawings
Fig. 1 is the flowchart of the method, provided by the invention, for selecting unlabeled samples in semi-supervised learning based on confidence and clustering;
Fig. 2 is the flowchart of the multi-class hierarchical TSVM classification provided by the invention.
Embodiment
To further explain the content, characteristics, and effects of the present invention, the following embodiment is given and described in detail with reference to the accompanying drawings. Refer to Fig. 1. The invention is further described below in conjunction with the drawings and a specific embodiment.
The present invention is a method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering, characterized in that a hierarchical TSVM classifier is adopted together with a confidence-and-clustering-based unlabeled sample selection algorithm that improves the performance of the TSVM algorithm. The unlabeled samples used for TSVM learning are selected from a large pool of unlabeled samples and must satisfy specific conditions before they can improve the performance of the semi-supervised learner. First, all unlabeled samples are clustered and the samples closer to the cluster boundaries are selected; after the boundary samples have been selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training. Once the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer; the goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier. The selected unlabeled samples thus both represent the boundary distribution of the samples and maximize, at every layer, the probability that the chosen samples belong to the classes of the next layer's classifier. The multi-class hierarchical TSVM classification flow is shown in Fig. 2.
The conditions that the unlabeled samples must satisfy and the selection method are as follows:
(1) The present invention adopts a hierarchical TSVM classifier; two TSVM classifiers are trained at every layer, and the samples added to semi-supervised learning should belong respectively to the classes corresponding to each classifier.
At the first layer, in the semi-supervised learning of silence versus non-silence, all unlabeled samples belong to these two classes; therefore all unlabeled samples satisfy the above condition.
After the first layer's classifier is trained, two branches are obtained; suppose the left branch is the positive class and the right branch the negative class, and let each branch select the samples belonging to it. The selection method is as follows. Define Con(x) as the confidence that sample x belongs to a certain class, and suppose the classification surface obtained by training the first layer is f(x); this surface is used to recognize all the unlabeled samples again. The confidence that x belongs to a given class with respect to the classification surface f(x) can be represented by the probability that x belongs to that class; the present invention adopts the improvement by Lin et al. of Platt's probabilistic output to estimate this probability, namely:
$$\mathrm{Con}(x) = \mathrm{prob}(y \mid x) = \begin{cases} \dfrac{\exp(Af(x)+B)}{1+\exp(Af(x)+B)}, & \text{if } f(x) > 0 \\ \dfrac{1}{1+\exp(Af(x)+B)}, & \text{otherwise} \end{cases} \quad (5\text{-}5)$$
where A and B are jointly determined by the training data and the classification results f(x).
After the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer. The goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier.
(2) The TSVM classifier is in essence still an SVM classifier and therefore must likewise find the support-vector samples; when selecting unlabeled samples, samples near the support vectors are more helpful to classifier training. Here too, preferring boundary samples among the unlabeled samples is realized by clustering.
When the first-layer classifier selects unlabeled samples, all unlabeled samples are first clustered and, following the method for selecting samples to be labeled for an SVM classifier, the samples closer to the cluster boundaries are selected; likewise, to guarantee that the space in which the samples are chosen is consistent with the space used when training the SVM, the clustering algorithm is run in the corresponding kernel space. After the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training. According to the confidence interval used, the unlabeled sample selection algorithm can be divided into high-confidence selection, low-confidence selection, and middle-confidence selection: high-confidence selection picks the unlabeled samples above a certain confidence threshold; low-confidence selection picks those below a certain confidence threshold; middle-confidence selection picks those whose confidence lies between two thresholds.
The specific clustering algorithm adopted is K-means clustering (K-means Clustering), an unsupervised real-time clustering algorithm proposed by MacQueen in 1967. It is a popular approximate algorithm that divides the data into a predetermined number K of parts according to the similarity of the data and finds the central point of each part. To distinguish them from the "classes" of classification, these K parts are called K clusters, and the central point of each part is called a cluster center.
The goal of the K-means clustering algorithm is to maximize the sum of the similarities between all sample points within each cluster. The similarity measure is usually based on a distance metric; when the Euclidean distance is adopted, the clustering objective function based on the sum-of-squared-errors criterion is as follows:
$$J = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\|x_i - m_k\|^2$$
where:
K denotes the number of clusters;
N_k denotes the number of samples in the k-th cluster, satisfying $\sum_{k=1}^{K} N_k = N$, where N is the number of samples participating in clustering;
x_i denotes the feature vector of a sample;
m_k denotes the center of the k-th cluster, i.e., the mean of the k-th cluster, computed as:
$$m_k = \frac{1}{N_k}\sum_{i=1}^{N_k} x_i$$
As the formula shows, the objective function depends on the initially specified number of clusters K. The aim of clustering is to find the assignment that minimizes the sum of the distances between the samples in each cluster and their cluster center; the general steps for minimizing this objective function are:
Step 1, initialization: input the sample set S to be clustered and the number of clusters K, and randomly choose K samples from S as the initial cluster centers; set the iteration stopping criterion, generally a maximum number of iterations or a convergence error threshold.
Step 2, assignment: for each sample x_i, find the nearest cluster center according to the similarity criterion and assign the sample to that cluster.
Step 3, update: for each cluster, recompute the mean vector of all samples in the cluster as the new cluster center.
Step 4: repeat Steps 2 and 3 until the stopping criterion is met.
As can be seen from this flow, the K-means algorithm is simple in principle and convenient for processing large-scale data. However, the number of clusters K must be set in advance, and choosing K is often difficult: in many situations one cannot know in advance how many clusters a data set should be divided into. In addition, K-means must choose some samples as initial cluster centers and partitions the samples iteratively starting from them; the algorithm depends strongly on these initial centers, and their choice tends to influence the final clustering result. If the initial centers are chosen badly, the clustering can fall into a local optimum. Improved algorithms exist for this problem, but none solves it at the root.
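For reference, a brief scikit-learn sketch of this K-means step on stand-in feature vectors; the n_init restarts are one standard mitigation of the initialization sensitivity just noted.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 12))   # stand-in features
km = KMeans(n_clusters=5, n_init=10, max_iter=300).fit(X)
centers, labels = km.cluster_centers_, km.labels_
```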
When any layer other than the first selects unlabeled samples, only the unlabeled samples already selected in (1) are clustered: the high-confidence samples selected by the previous layer's TSVM classifier are clustered, the samples close to the cluster boundaries are selected, and then a supervised SVM classifier is trained on the current layer's labeled samples and used to recognize the selected boundary samples; the unlabeled samples in different confidence intervals are then selected for TSVM training.
Working principle of the present invention:
The principle and basis of the sample selection are as follows. TSVM is a discriminative model. A discriminative model (also called a distinguishing model) is indifferent to the distribution of the training samples themselves and instead models the distributional differences between classes; in other words, it selects a model structure shaped jointly by the two classes of samples to tell them apart, with the two classes "wrestling" to form an impartial classification surface. Therefore, for a discriminative classifier, the useful training samples are those that describe the distribution of the boundary between classes. If the boundary samples can be found and labeled, a great deal of manual labeling effort can be saved.
According to the modeling characteristics of discriminative models, a training sample set that is useful to a discriminative classifier has the following features:
1) Sparsity: the useful samples lie in sparsely populated regions of the distribution. For example, for an SVM classifier, the support-vector samples are the most useful samples.
2) Diversity: this property is similar to the criterion by which generative models choose representative samples. When selecting samples for manual labeling, every class should be covered as much as possible; in other words, the sample set should span every class present in the massive pool of unlabeled audio data. Although no supervision information is available, the spatial distribution of the samples can be mined, for example which regions are dense and which are sparse, and this information serves as a basis for selecting samples. Clustering, a typical unsupervised learning algorithm, extracts cluster information from the unlabeled samples and thereby reveals their distribution. The hope is that the audio segments in each resulting cluster come from the same semantic type, so that a useful sample set can be selected from the clusters for labeling.
As stated above, different classifier models differ in how the samples to be labeled are selected. This patent adopts TSVM, a variant of SVM. For an SVM classifier, clustering minimizes the within-cluster scatter and maximizes the between-cluster scatter, so the boundaries of the clusters obtained are exactly the sparse regions of the sample distribution; hence the samples near the cluster boundaries are useful for the SVM classifier. Generally speaking, the decision surface between different classes tends to lie in the sparse regions of the sample distribution, so for a discriminative classifier it is optimal to find and label the samples on the class boundaries. The clustering algorithm partitions the clusters precisely according to the sparsity of the sample distribution; to pick out as many boundary samples as possible, the samples near the cluster boundaries are chosen rather than the samples at the cluster centers, which differs from the way a generative model selects representative samples.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (8)

1. A method for selecting unlabeled samples based on confidence and clustering, characterized in that a hierarchical TSVM classifier and a confidence-and-clustering-based unlabeled sample selection method are adopted, specifically comprising:
first clustering all unlabeled samples and selecting the samples closer to the cluster boundaries; after the boundary samples are selected, training a supervised SVM classifier on the labeled samples and using it to recognize the selected boundary samples, and selecting unlabeled samples in different confidence intervals for TSVM training; after the confidence of each sample is obtained, defining a threshold λ and selecting the unlabeled samples whose confidence exceeds λ to enter the semi-supervised learning of the next layer.
2. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 1, characterized in that the conditions the unlabeled samples must satisfy and the selection method are as follows:
a hierarchical TSVM classifier is adopted, two TSVM classifiers are trained at every layer, and the samples added to semi-supervised learning belong respectively to the classes corresponding to each classifier;
at the first layer, in the semi-supervised learning of silence versus non-silence, all unlabeled samples belong to these two classes, and therefore all unlabeled samples satisfy the above condition;
after the first layer's classifier is trained, two branches are obtained, the left branch being the positive class and the right branch the negative class, and each branch selects the samples belonging to it; the selection method is as follows: define Con(x) as the confidence that sample x belongs to a certain class, let f(x) be the classification surface obtained by training the first layer, and use this surface to recognize all unlabeled samples again; the confidence that x belongs to a given class with respect to the classification surface f(x) is represented by the probability that x belongs to that class, estimated with the improvement by Lin et al. of Platt's probabilistic output, namely:
$$\mathrm{Con}(x) = \mathrm{prob}(y \mid x) = \begin{cases} \dfrac{\exp(Af(x)+B)}{1+\exp(Af(x)+B)}, & \text{if } f(x) > 0 \\ \dfrac{1}{1+\exp(Af(x)+B)}, & \text{otherwise} \end{cases} \quad (5\text{-}5)$$
wherein A and B are jointly determined by the training data and the classification results f(x);
after the confidence of each sample is obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer;
the TSVM classifier is in essence still an SVM classifier; when selecting unlabeled samples, selecting samples near the support vectors is more helpful to classifier training, and here too, preferring boundary samples among the unlabeled samples is realized by clustering.
3. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 2, characterized in that, when any layer other than the first selects unlabeled samples, only the unlabeled samples already selected by the hierarchical TSVM classifier are clustered: the high-confidence samples selected by the previous layer's TSVM classifier are clustered, the samples close to the cluster boundaries are selected, a supervised SVM classifier is then trained on the current layer's labeled samples and used to recognize the selected boundary samples, and the unlabeled samples in different confidence intervals are selected for TSVM training.
4. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 2, characterized in that the point-assignment algorithm does not require computing the cluster centers at every iteration; the algorithm is as follows:
at iteration n−1 the clustering result is $CLT^{(n-1)} = \{Clt_1^{(n-1)}, Clt_2^{(n-1)}, \ldots, Clt_K^{(n-1)}\}$, where K is the number of clusters obtained and each cluster is the set $Clt_k^{(n-1)} = \{clt\_s_{k1}^{(n-1)}, clt\_s_{k2}^{(n-1)}, \ldots, clt\_s_{kN_k}^{(n-1)}\}$, with $N_k$ the number of samples in the k-th cluster; when the samples are redistributed at iteration n, the following is computed for each sample $x_i$:
$$d_F\!\left(i,\,Clt_k^{(n-1)}\right) = \sum_{j=1}^{N_k^{(n-1)}}\left[K(x_i,x_i)+K\!\left(clt\_s_{kj}^{(n-1)},\,clt\_s_{kj}^{(n-1)}\right)-2K\!\left(x_i,\,clt\_s_{kj}^{(n-1)}\right)\right]$$
at iteration n, the cluster $k^*$ to which the sample point $x_i$ belongs is
$$k^* = \arg\min_{k\in[1,K]} d_F\!\left(i,\,Clt_k^{(n-1)}\right)$$
and each sample point is assigned the label of that cluster; when the points are assigned iteratively by this method, the cluster centers need not be computed at each step, and only when the iteration stops are the final cluster centers computed by the formula
$$m_k = \arg\min_{x_i,\ i\in[1,N_k]} \sum_{j=1}^{N_k}\left[K(x_i,x_i)+K(x_j,x_j)-2K(x_i,x_j)\right].$$
5. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 4, characterized by the following useful-sample selection algorithm for the discriminative classifier:
similarly to the generative-model classifier, let $d_F(i,k)$ denote the distance in the space F from a sample point to its cluster center; in the k-th cluster of F, the distance from sample $x_i$ to the cluster center is:
$$d_F(i,k) = K(clt\_s_{ki},\,clt\_s_{ki}) + K(c_k,c_k) - 2K(clt\_s_{ki},\,c_k) \quad (31)$$
wherein $c_k$ denotes the center of the k-th cluster in the D-dimensional space;
the distances from all points in a cluster to its center are collected and the maximum distance is taken as the cluster radius $r_k$; given a real number λ, λ ∈ [0,1], the cluster is divided at radius $\lambda r_k$ and $M_k$ points are sampled from the region near the cluster boundary as the useful points for the discriminative classifier, $M_k$ being determined in the same way as for the generative model; the procedure for sample selection under the discriminative model is as follows:
compute the kernel-space radii: for each cluster k, k ∈ [1, N], compute the distance $d_F(i,k)$ between each sample $clt\_s_{ki}$, i ∈ [1, N_k], and the cluster center, and find the maximum radius $r_k$;
divide the clusters: given a real number λ, λ ∈ [0,1], divide the k-th cluster, k ∈ [1, N], at radius $\lambda r_k$;
choose the sample points: for each cluster, randomly sample $M_k$ samples from the region within the cluster between $\lambda r_k$ and $r_k$ from the center, where, according to formula (23), $M_k = w_k r N_k$ and $N_k$ is the total number of samples in the cluster.
6. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 1, characterized in that, when the first-layer classifier selects unlabeled samples, all unlabeled samples are first clustered and, following the method for selecting samples to be labeled for an SVM classifier, the samples closer to the cluster boundaries are selected; after the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training; according to the confidence interval used, the unlabeled sample selection algorithm is divided into high-confidence selection, low-confidence selection, and middle-confidence selection; high-confidence selection picks the unlabeled samples above a certain confidence threshold; low-confidence selection picks those below a certain confidence threshold; middle-confidence selection picks those whose confidence lies between two thresholds.
7. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 1, characterized in that the clustering algorithm adopts K-means clustering, which divides the data into a predetermined number K of parts according to the similarity of the data and finds the central point of each part; to distinguish them from the "classes" of classification, these K parts are called K clusters and the central point of each part is called a cluster center;
the sum of the similarities between all sample points within each cluster is maximized; when the Euclidean distance is adopted, the clustering objective function based on the sum-of-squared-errors criterion is as follows:
$$J = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\|x_i - m_k\|^2$$
wherein:
K denotes the number of clusters;
N_k denotes the number of samples in the k-th cluster, satisfying $\sum_{k=1}^{K} N_k = N$, where N is the number of samples participating in clustering;
x_i denotes the feature vector of a sample;
m_k denotes the center of the k-th cluster, i.e., the mean of the k-th cluster, computed as:
$$m_k = \frac{1}{N_k}\sum_{i=1}^{N_k} x_i.$$
8. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 7, characterized in that the object of clustering is to find the assignment that minimizes the sum of the distances between the samples in each cluster and their cluster center, the objective function being solved by the steps of:
Step 1, initialization: inputting the sample set S to be clustered and the number of clusters K, randomly choosing K samples from S as the initial cluster centers, and setting the iteration stopping criterion, a maximum number of iterations or a convergence error threshold;
Step 2, assignment: for each sample x_i, finding the nearest cluster center according to the similarity criterion and assigning the sample to that cluster;
Step 3, update: for each cluster, recomputing the mean vector of all samples in the cluster as the new cluster center;
Step 4: repeating Steps 2 and 3 until the stopping criterion is met.
CN201410395794.6A 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering Pending CN104156438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410395794.6A CN104156438A (en) 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410395794.6A CN104156438A (en) 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering

Publications (1)

Publication Number Publication Date
CN104156438A true CN104156438A (en) 2014-11-19

Family

ID=51881936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410395794.6A Pending CN104156438A (en) 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering

Country Status (1)

Country Link
CN (1) CN104156438A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331716A (en) * 2014-11-20 2015-02-04 武汉图歌信息技术有限责任公司 SVM active learning classification algorithm for large-scale training data
CN105069474A (en) * 2015-08-05 2015-11-18 山东师范大学 Semi-supervised learning high confidence sample excavating method for audio event classification
CN106778585A (en) * 2016-12-08 2017-05-31 腾讯科技(上海)有限公司 A kind of face key point-tracking method and device
CN107067025A (en) * 2017-02-15 2017-08-18 重庆邮电大学 A kind of data automatic marking method based on Active Learning
CN107229944A (en) * 2017-05-04 2017-10-03 青岛科技大学 Semi-supervised active identification method based on cognitive information particle
CN107293308A (en) * 2016-04-01 2017-10-24 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN107516101A (en) * 2016-06-16 2017-12-26 阿里巴巴集团控股有限公司 A kind of data boundary division methods and equipment
CN107735804A (en) * 2015-07-06 2018-02-23 微软技术许可有限责任公司 The shift learning technology of different tag sets
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A kind of file classification method and system based on expansion marker samples
CN108334943A (en) * 2018-01-03 2018-07-27 浙江大学 The semi-supervised soft-measuring modeling method of industrial process based on Active Learning neural network model
CN108351971A (en) * 2015-10-12 2018-07-31 北京市商汤科技开发有限公司 The method and system that the object that label has is clustered
CN108805173A (en) * 2018-05-16 2018-11-13 苏州迈为科技股份有限公司 Solar battery sheet aberration method for separating
CN108876270A (en) * 2018-09-19 2018-11-23 惠龙易通国际物流股份有限公司 Automatic source of goods auditing system and method
CN108885700A (en) * 2015-10-02 2018-11-23 川科德博有限公司 Data set semi-automatic labelling
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
CN109645993A (en) * 2018-11-13 2019-04-19 天津大学 A kind of methods of actively studying of the raising across individual brain-computer interface recognition performance
CN109873774A (en) * 2019-01-15 2019-06-11 北京邮电大学 A kind of network flow identification method and device
CN109933619A (en) * 2019-03-13 2019-06-25 西南交通大学 A kind of semisupervised classification prediction technique
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
CN110782876A (en) * 2019-10-21 2020-02-11 华中科技大学 Unsupervised active learning method for voice emotion calculation
WO2020038353A1 (en) * 2018-08-21 2020-02-27 瀚思安信(北京)软件技术有限公司 Abnormal behavior detection method and system
CN110933102A (en) * 2019-12-11 2020-03-27 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification
WO2020207179A1 (en) * 2019-04-09 2020-10-15 山东科技大学 Method for extracting concept word from video caption
CN111898704A (en) * 2020-08-17 2020-11-06 腾讯科技(深圳)有限公司 Method and device for clustering content samples
CN112651431A (en) * 2020-12-16 2021-04-13 北方工业大学 Clustering sorting method for retired power batteries
CN113095442A (en) * 2021-06-04 2021-07-09 成都信息工程大学 Hail identification method based on semi-supervised learning under multi-dimensional radar data
WO2022011892A1 (en) * 2020-07-15 2022-01-20 北京市商汤科技开发有限公司 Network training method and apparatus, target detection method and apparatus, and electronic device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1656472A (en) * 2001-11-16 2005-08-17 陈垣洋 Plausible neural network with supervised and unsupervised cluster analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1656472A (en) * 2001-11-16 2005-08-17 陈垣洋 Plausible neural network with supervised and unsupervised cluster analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王荣燕 (Wang Rongyan): "Research on Key Issues in Complex Audio Classification", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331716A (en) * 2014-11-20 2015-02-04 武汉图歌信息技术有限责任公司 SVM active learning classification algorithm for large-scale training data
CN107735804B (en) * 2015-07-06 2021-10-26 微软技术许可有限责任公司 System and method for transfer learning techniques for different sets of labels
CN107735804A (en) * 2015-07-06 2018-02-23 微软技术许可有限责任公司 The shift learning technology of different tag sets
US11062228B2 (en) 2015-07-06 2021-07-13 Microsoft Technoiogy Licensing, LLC Transfer learning techniques for disparate label sets
CN105069474A (en) * 2015-08-05 2015-11-18 山东师范大学 Semi-supervised learning high confidence sample excavating method for audio event classification
CN105069474B (en) * 2015-08-05 2019-02-12 山东师范大学 Semi-supervised learning high confidence level sample method for digging for audio event classification
CN108885700A (en) * 2015-10-02 2018-11-23 川科德博有限公司 Data set semi-automatic labelling
CN108351971A (en) * 2015-10-12 2018-07-31 北京市商汤科技开发有限公司 The method and system that the object that label has is clustered
CN108351971B (en) * 2015-10-12 2022-04-22 北京市商汤科技开发有限公司 Method and system for clustering objects marked with attributes
CN107293308B (en) * 2016-04-01 2019-06-07 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN107293308A (en) * 2016-04-01 2017-10-24 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN107516101B (en) * 2016-06-16 2021-07-06 阿里巴巴集团控股有限公司 Boundary data dividing method and device
CN107516101A (en) * 2016-06-16 2017-12-26 阿里巴巴集团控股有限公司 A kind of data boundary division methods and equipment
WO2018103525A1 (en) * 2016-12-08 2018-06-14 腾讯科技(深圳)有限公司 Method and device for tracking facial key point, and storage medium
US10817708B2 (en) 2016-12-08 2020-10-27 Tencent Technology (Shenzhen) Company Limited Facial tracking method and apparatus, and storage medium
CN106778585B (en) * 2016-12-08 2019-04-16 腾讯科技(上海)有限公司 A kind of face key point-tracking method and device
CN106778585A (en) * 2016-12-08 2017-05-31 腾讯科技(上海)有限公司 A kind of face key point-tracking method and device
CN107067025B (en) * 2017-02-15 2020-12-22 重庆邮电大学 Text data automatic labeling method based on active learning
CN107067025A (en) * 2017-02-15 2017-08-18 重庆邮电大学 A kind of data automatic marking method based on Active Learning
CN107229944B (en) * 2017-05-04 2021-05-07 青岛科技大学 Semi-supervised active identification method based on cognitive information particles
CN107229944A (en) * 2017-05-04 2017-10-03 青岛科技大学 Semi-supervised active identification method based on cognitive information particle
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A kind of file classification method and system based on expansion marker samples
CN108334943A (en) * 2018-01-03 2018-07-27 浙江大学 The semi-supervised soft-measuring modeling method of industrial process based on Active Learning neural network model
CN108805173A (en) * 2018-05-16 2018-11-13 苏州迈为科技股份有限公司 Solar battery sheet aberration method for separating
CN109165309B (en) * 2018-08-06 2020-10-16 北京邮电大学 Negative example training sample acquisition method and device and model training method and device
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
WO2020038353A1 (en) * 2018-08-21 2020-02-27 瀚思安信(北京)软件技术有限公司 Abnormal behavior detection method and system
CN108876270B (en) * 2018-09-19 2022-08-12 惠龙易通国际物流股份有限公司 Automatic goods source auditing system and method
CN108876270A (en) * 2018-09-19 2018-11-23 惠龙易通国际物流股份有限公司 Automatic source of goods auditing system and method
CN109645993A (en) * 2018-11-13 2019-04-19 天津大学 A kind of methods of actively studying of the raising across individual brain-computer interface recognition performance
CN109873774A (en) * 2019-01-15 2019-06-11 北京邮电大学 A kind of network flow identification method and device
CN109873774B (en) * 2019-01-15 2021-01-01 北京邮电大学 Network traffic identification method and device
CN109933619A (en) * 2019-03-13 2019-06-25 西南交通大学 A kind of semisupervised classification prediction technique
WO2020207179A1 (en) * 2019-04-09 2020-10-15 山东科技大学 Method for extracting concept word from video caption
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
CN110782876A (en) * 2019-10-21 2020-02-11 华中科技大学 Unsupervised active learning method for voice emotion calculation
CN110933102B (en) * 2019-12-11 2021-10-26 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN114039794A (en) * 2019-12-11 2022-02-11 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN110933102A (en) * 2019-12-11 2020-03-27 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification
WO2022011892A1 (en) * 2020-07-15 2022-01-20 北京市商汤科技开发有限公司 Network training method and apparatus, target detection method and apparatus, and electronic device
CN111898704A (en) * 2020-08-17 2020-11-06 腾讯科技(深圳)有限公司 Method and device for clustering content samples
CN111898704B (en) * 2020-08-17 2024-05-10 腾讯科技(深圳)有限公司 Method and device for clustering content samples
CN112651431A (en) * 2020-12-16 2021-04-13 北方工业大学 Clustering sorting method for retired power batteries
CN112651431B (en) * 2020-12-16 2023-07-07 北方工业大学 Clustering and sorting method for retired power batteries
CN113095442A (en) * 2021-06-04 2021-07-09 成都信息工程大学 Hail identification method based on semi-supervised learning under multi-dimensional radar data
CN113095442B (en) * 2021-06-04 2021-09-10 成都信息工程大学 Hail identification method based on semi-supervised learning under multi-dimensional radar data

Similar Documents

Publication Publication Date Title
CN104156438A (en) Unlabeled sample selection method based on confidence coefficients and clustering
CN106599029B (en) Chinese short text clustering method
CN106339416B (en) Educational data clustering method based on grid fast searching density peaks
Zhu et al. Video synopsis by heterogeneous multi-source correlation
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
Chang et al. Columbia University/VIREO-CityU/IRIT TRECVID2008 high-level feature extraction and interactive video search
CN101620615B (en) Automatic image annotation and translation method based on decision tree learning
CN103207879A (en) Method and equipment for generating image index
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN104156945A (en) Method for segmenting gray scale image based on multi-objective particle swarm optimization algorithm
CN104268507A (en) Manual alphabet identification method based on RGB-D image
CN103631769A (en) Method and device for judging consistency between file content and title
CN103473275A (en) Automatic image labeling method and automatic image labeling system by means of multi-feature fusion
CN104077408B (en) Extensive across media data distributed semi content of supervision method for identifying and classifying and device
CN105069474B (en) Semi-supervised learning high confidence level sample method for digging for audio event classification
CN107609570B (en) Micro video popularity prediction method based on attribute classification and multi-view feature fusion
CN102426598A (en) Method for clustering Chinese texts for safety management of network content
CN105447142B (en) A kind of double mode agricultural science and technology achievement classification method and system
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
Qian et al. ISABoost: A weak classifier inner structure adjusting based AdaBoost algorithm—ISABoost based application in scene categorization
Xie et al. K-means clustering based on density for scene image classification
Shinde et al. A systematic study of text mining techniques
Gao et al. Image classification based on support vector machine and the fusion of complementary features
CN102637205B (en) Document classification method based on Hadoop
CN102945370B (en) Based on the sorting technique of many label two visual angles support vector machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141119