CN104156438A - Unlabeled sample selection method based on confidence coefficients and clustering - Google Patents

Unlabeled sample selection method based on confidence coefficients and clustering Download PDF

Info

Publication number
CN104156438A
CN104156438A CN201410395794.6A CN201410395794A
Authority
CN
China
Prior art keywords
sample
cluster
confidence
clustering
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410395794.6A
Other languages
Chinese (zh)
Inventor
王荣燕 (Wang Rongyan)
谢延红 (Xie Yanhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dezhou University
Original Assignee
Dezhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dezhou University filed Critical Dezhou University
Priority to CN201410395794.6A priority Critical patent/CN104156438A/en
Publication of CN104156438A publication Critical patent/CN104156438A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unlabeled sample selection method based on confidence coefficients and clustering. The method first clusters all unlabeled samples and selects those close to the cluster boundaries. After the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize the selected boundary samples, and unlabeled samples in different confidence intervals are selected for TSVM training. Once the confidence of each sample has been obtained, a threshold lambda is defined, and the unlabeled samples whose confidence exceeds lambda are passed into the semi-supervised learning of the next layer, so that, conditioned on the classifier of the previous layer, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier. The selected unlabeled samples thus both represent the boundary distribution of the samples and maximize, at every layer, the probability that the chosen samples belong to the classes of the next layer's classifier.

Description

Method for selecting unlabeled samples based on confidence and clustering
Technical field
The present invention relates to the fields of machine learning and pattern recognition, and specifically to a method for selecting unlabeled samples based on confidence and clustering.
Background technology
At present, a common problem in learning supervised classification models is the shortage of labeled samples. As digital content acquisition technology matures and mass storage becomes cheap, audio information on the network grows rapidly; obtaining large numbers of unlabeled audio samples is therefore very easy, while manual labeling remains too expensive. As a result, in many audio data sets the number of unlabeled samples far exceeds the number of labeled samples. If only the small set of labeled samples is used, the classification model obtained by supervised learning can hardly generalize well, and at the same time the information carried by the large number of unlabeled samples cannot be fully exploited, causing a waste of information. Against this background, semi-supervised learning (Semi-supervised Learning), which studies how to exploit a large number of unlabeled samples to improve learning performance when only a few labeled samples are available, has attracted wide attention and become one of the important research fields of machine learning and pattern recognition.
Semi-supervised learning has a wide range of applications in practical problems; its research results have been applied to fields such as speech recognition, image recognition and image retrieval, video annotation, natural language processing, and biometric recognition. Since large numbers of unlabeled audio files exist on the network, applying semi-supervised learning to complex audio classification is a natural step.
At present, research on semi-supervised learning focuses mostly on how to use unlabeled samples, and much less on which unlabeled samples actually help. For example, in the TSVM learning method proposed by Thorsten Joachims, it is proved that exchanging the labels pre-assigned to two unlabeled samples satisfying certain conditions can further optimize the objective function of the support vector machine, and the experiments in that work show the classification performance of the semi-supervised learner improving steadily as the number of unlabeled samples grows. In many experiments, however, the classification performance of a semi-supervised audio classifier does not keep improving as unlabeled samples are added. This shows that, when labeled samples are limited, not every unlabeled sample is helpful to semi-supervised learning; the performance of the semi-supervised learner depends on which unlabeled samples are added. Aarti Singh et al. likewise point out in the literature that not every unlabeled sample helps semi-supervised learning. For this specific problem, the present invention proposes an algorithm for selecting unlabeled samples based on confidence and clustering; the algorithm makes better use of unlabeled samples to improve the performance of audio classifiers, and provides a reference for unlabeled sample selection when semi-supervised learning is applied to other fields.
Summary of the invention
The purpose of the present invention is to overcome the above deficiencies by providing a method for selecting unlabeled samples based on confidence and clustering, suited to processing the audio stream files of movie and television programs. In such programs the dominant audio type over time is speech, while the time occupied by all other audio types together is comparatively short, so the data suffer from an equally serious class-imbalance problem. To weaken the effect of this imbalance on classification performance, the present invention adopts a hierarchical TSVM algorithm and proposes an algorithm for selecting unlabeled samples in semi-supervised learning based on confidence and clustering.
The technical solution adopted by the present invention is a method for selecting unlabeled samples based on confidence and clustering. A hierarchical TSVM classifier is used, together with a confidence-and-clustering-based unlabeled sample selection algorithm that improves the performance of the TSVM algorithm. The unlabeled samples used for TSVM learning are selected from a large pool of unlabeled samples and must satisfy specific conditions before they can improve the performance of the semi-supervised learner. First, all unlabeled samples are clustered and the samples closer to the cluster boundaries are selected. After the boundary samples have been selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are chosen for TSVM training. Once the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer; the goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier. The selected unlabeled samples thus both represent the boundary distribution of the samples and maximize, at every layer, the probability that the chosen samples belong to the classes of the next layer's classifier.
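By way of orientation, here is a minimal Python sketch of one layer of this pipeline, under simplifying assumptions: KMeans in the input space stands in for the kernel-space clustering detailed later, scikit-learn's SVC with probability outputs plays the supervised SVM in place of eq. (5-5), and the function name and the parameters band and lam are illustrative choices rather than part of the invention. The returned samples would then be handed to a TSVM trainer, which scikit-learn does not provide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def select_unlabeled_for_layer(X_labeled, y_labeled, X_unlabeled,
                               n_clusters=10, band=0.8, lam=0.7):
    # 1. Cluster the unlabeled pool and keep the samples near cluster
    #    boundaries: points farther from their centre than `band` times
    #    the cluster radius.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_unlabeled)
    d = np.linalg.norm(X_unlabeled - km.cluster_centers_[km.labels_], axis=1)
    radius = np.array([d[km.labels_ == k].max() for k in range(n_clusters)])
    boundary = X_unlabeled[d > band * radius[km.labels_]]

    # 2. Train a supervised SVM on the labeled samples and score the
    #    selected boundary samples with class-membership probabilities.
    svm = SVC(probability=True).fit(X_labeled, y_labeled)
    conf = svm.predict_proba(boundary).max(axis=1)

    # 3. Keep the samples whose confidence exceeds the threshold lambda;
    #    these enter the next layer's semi-supervised (TSVM) training.
    return boundary[conf > lam]
```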
The conditions that the unlabeled samples must satisfy and the selection method are as follows:
(1) The present invention adopts a hierarchical TSVM classifier; two TSVM classifiers are trained at every layer, and the samples added to semi-supervised learning should belong respectively to the classes corresponding to each classifier.
At the first layer, in the semi-supervised learning of silence versus non-silence, all unlabeled samples belong to these two classes; therefore all unlabeled samples satisfy the above condition.
After the first layer's classifier is trained, two branches are obtained; suppose the left branch is the positive class and the right branch the negative class, and let each branch select the samples belonging to it. The selection method is as follows. Define Con(x) as the confidence that sample x belongs to a certain class, and suppose the classification surface obtained by training the first layer is f(x); this surface is used to recognize all the unlabeled samples again. The confidence that x belongs to a given class with respect to the classification surface f(x) can be represented by the probability that x belongs to that class; the present invention adopts the improvement by Lin et al. of Platt's probabilistic output to estimate this probability, namely:
$$\mathrm{Con}(x) = \mathrm{prob}(y \mid x) = \begin{cases} \dfrac{\exp(Af(x)+B)}{1+\exp(Af(x)+B)}, & \text{if } f(x) > 0 \\ \dfrac{1}{1+\exp(Af(x)+B)}, & \text{otherwise} \end{cases} \quad (5\text{-}5)$$
where A and B are jointly determined by the training data and the classification results f(x).
After the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer. The goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier.
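As an illustration, eq. (5-5) and the threshold-λ selection might be computed as follows; this is a sketch assuming the sigmoid parameters A and B have already been fitted from the training data and the decision values f(x), as the text states.

```python
import numpy as np

def confidence(f, A, B):
    """Eq. (5-5): confidence Con(x) from SVM decision values f(x)."""
    s = np.exp(np.clip(A * f + B, -30.0, 30.0))  # clip to avoid overflow
    return np.where(f > 0, s / (1.0 + s), 1.0 / (1.0 + s))

def select_above_threshold(X_unlabeled, f, A, B, lam=0.7):
    """Keep the unlabeled samples whose confidence exceeds lambda."""
    return X_unlabeled[confidence(f, A, B) > lam]
```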
(2) The TSVM classifier is in essence still an SVM classifier and therefore must likewise find the support-vector samples. Hence, when selecting unlabeled samples, samples near the support vectors are more helpful to classifier training. Here too, preferring boundary samples among the unlabeled samples is realized by clustering.
A discriminative model models the differences between the distributions of samples of different classes; for this kind of model, the samples located at class boundaries are the most useful. Here, clustering can likewise be used to extract the spatial distribution of the unlabeled sample set. The information that contributes to a discriminative model lies mainly at the cluster boundaries, i.e., the regions where the samples are sparsely distributed, because the classification surface of a discriminative model is usually located in these sparse regions.
As for the choice of discriminative classifier: in recent years support vector machines have shown great superiority in classification problems, so the discriminative model used in the present invention is the support vector machine. Unless stated otherwise, the discriminative models mentioned below all refer to support vector machines.
A crucial technique in support vector machines is the kernel function, which maps the sample feature space to a kernel space in which the model is trained; problems that are inseparable in a low-dimensional feature space can become separable through the kernel function. To keep distances consistent, clustering must also be carried out after transforming the feature space to the kernel space in which the support vector machine operates.
If the support vector machine uses a linear kernel, the distance metric for clustering is simply the Euclidean distance in the original feature space. If it uses some other non-linear kernel, then for the affinity propagation clustering algorithm the similarity must be set to the negative of the sample distance in kernel space. The distance between samples in kernel space is computed as follows. Let the sample set in the original sample feature space be S = {x_1, x_2, ..., x_n}, x_i ∈ R^d, mapped to a higher-dimensional space F by the mapping Φ(x), i.e., x = (x_1, ..., x_d) → Φ(x) = (Φ_1(x), ..., Φ_n(x)). In the higher-dimensional space F, the Euclidean distance between samples x_i and x_j is:
$$d_F(i,j) = \|\Phi(x_i)-\Phi(x_j)\|^2 = [\Phi(x_i)-\Phi(x_j)]^T[\Phi(x_i)-\Phi(x_j)] = \Phi^T(x_i)\Phi(x_i) + \Phi^T(x_j)\Phi(x_j) - 2\Phi^T(x_i)\Phi(x_j) = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j) \quad (24)$$
where K(·,·) = Φ^T(·)·Φ(·) is the kernel function; the computation in the higher-dimensional space F is thereby converted into a computation in kernel space. In the affinity propagation algorithm, the present invention defines the similarity between samples x_i and x_j as the negative of their distance, that is:
$$s_F(i,j) = -d_F(i,j) \quad (25)$$
s_F(i,j) denotes the similarity between sample points in kernel space; this similarity is the input to the affinity propagation algorithm, and the remainder of the algorithm is identical to the affinity propagation algorithm introduced in the previous section.
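A possible realization of eqs. (24)-(25) with a precomputed kernel matrix is sketched below. The RBF kernel is an illustrative stand-in for whatever kernel the SVM actually uses; scikit-learn's AffinityPropagation accepts the resulting similarity matrix via affinity="precomputed".

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import rbf_kernel

def kernel_space_similarity(X, kernel=rbf_kernel):
    K = kernel(X)                                # K[i, j] = K(x_i, x_j)
    diag = np.diag(K)
    d = diag[:, None] + diag[None, :] - 2.0 * K  # distances in F, eq. (24)
    return -d                                    # similarities s_F, eq. (25)

# Affinity propagation on the precomputed kernel-space similarities:
# ap = AffinityPropagation(affinity="precomputed").fit(kernel_space_similarity(X))
```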
The K-means clustering algorithm in kernel space is more complex than affinity propagation in kernel space; the complexity lies mainly in how to find the cluster centers in the space F. The K-means clustering algorithm in kernel space is introduced below.
In the space F, let
$$m_{Fk} = \frac{1}{N_k}\sum_{i=1}^{N_k}\Phi(x_i) \quad (26)$$
m_{Fk} is the mean of the k-th cluster in the space F, and the objective function of the K-means clustering algorithm in F becomes:
$$J_F = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\left\|\Phi(x_i)-m_{Fk}\right\|^2 = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\left\|\Phi(x_i)-\frac{1}{N_k}\sum_{j=1}^{N_k}\Phi(x_j)\right\|^2 = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\left[K(x_i,x_i)+\frac{1}{N_k^2}\sum_{j=1}^{N_k}\sum_{l=1}^{N_k}K(x_j,x_l)-\frac{2}{N_k}\sum_{j=1}^{N_k}K(x_i,x_j)\right] \quad (27)$$
The above formula converts the objective function in the space F into a kernel-space computation. When the cluster centers are updated, the expression of the function Φ(·) is unknown, so the centers cannot be computed directly from formula (26). In the papers [20][22], an approximation is obtained by taking each sample point inside a cluster in turn as a candidate center: for the k-th cluster and each of its interior points x_i, i ∈ [1, N_k], the sum of the distances in F, computed by formula (24), between x_i and every other point x_j, j ∈ [1, N_k], j ≠ i, in the cluster is evaluated. The point that minimizes this sum of distances to all other points in the cluster is taken as the point m_k in the original feature space corresponding to the cluster center in F, i.e., m_{Fk} = Φ(m_k), that is:
$$m_k = \arg\min_{x_i,\ i\in[1,N_k]} \sum_{j=1}^{N_k}\left[K(x_i,x_i)+K(x_j,x_j)-2K(x_i,x_j)\right] \quad (28)$$
Once the cluster centers have been obtained, the sample points can be redistributed; however, centers computed this way deviate from the true cluster centers. For this problem, the present invention proposes another point-assignment algorithm that does not require computing the cluster centers at every iteration. The algorithm is as follows:
Suppose the clustering result at iteration n−1 is $CLT^{(n-1)} = \{Clt_1^{(n-1)}, Clt_2^{(n-1)}, \ldots, Clt_K^{(n-1)}\}$, where K is the number of clusters obtained and $Clt_k^{(n-1)}$ denotes the set of points in the k-th cluster, $Clt_k^{(n-1)} = \{clt\_s_{k1}^{(n-1)}, clt\_s_{k2}^{(n-1)}, \ldots, clt\_s_{kN_k}^{(n-1)}\}$, with $N_k$ the number of samples in the k-th cluster. When the samples are redistributed at iteration n, the following quantity is computed for each sample $x_i$:
$$d_F\!\left(i,\,Clt_k^{(n-1)}\right) = \sum_{j=1}^{N_k^{(n-1)}}\left[K(x_i,x_i)+K\!\left(clt\_s_{kj}^{(n-1)},\,clt\_s_{kj}^{(n-1)}\right)-2K\!\left(x_i,\,clt\_s_{kj}^{(n-1)}\right)\right] \quad (29)$$
At iteration n, the cluster $k^*$ to which the sample point $x_i$ belongs is:
$$k^* = \arg\min_{k\in[1,K]} d_F\!\left(i,\,Clt_k^{(n-1)}\right) \quad (30)$$
Each sample point is assigned the label of the cluster it falls into. When the points are assigned iteratively by the above method, the cluster centers need not be computed at each step; only when the iteration stops are the final cluster centers computed by formula (28).
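The reassignment scheme of eqs. (29)-(30), with the medoid centers of eq. (28) computed only once after convergence, might be sketched as follows; the function name, the initialization by a given label vector, and the skipping of emptied clusters are illustrative assumptions.

```python
import numpy as np

def kernel_kmeans(K, labels, n_iters=50):
    """K: precomputed kernel matrix (n x n); labels: initial assignment (n,)."""
    diag = np.diag(K)
    # Pairwise squared distances in the feature space F, eq. (24).
    D = diag[:, None] + diag[None, :] - 2.0 * K
    n_clusters = labels.max() + 1
    for _ in range(n_iters):
        # eq. (29): summed distance of every sample to each cluster's members.
        dist = np.stack([D[:, labels == k].sum(axis=1)
                         for k in range(n_clusters)], axis=1)
        new_labels = dist.argmin(axis=1)          # eq. (30)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # eq. (28): after convergence, take each cluster's medoid as its centre.
    medoids = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            continue  # an emptied cluster has no medoid
        within = D[np.ix_(members, members)].sum(axis=1)
        medoids.append(members[within.argmin()])
    return labels, np.array(medoids)
```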
The adjustments introduced above keep the clustering space consistent with the classification space. With the clustering result obtained under this consistent space, the useful-sample selection algorithm for discriminative classifiers in the present invention is introduced below.
As introduced above, when the clustering result is good (each cluster is pure, with most samples belonging to the same class), the samples at a cluster's edge are likely to lie on the class boundary, and such samples are very useful for learning a support vector machine model. Under ordinary circumstances, however, the unlabeled audio segments cut from raw audio streams are rather messy: the numbers of segments of the various audio types differ greatly, and segments of generic audio types in particular last a long time. Even with some preprocessing, the numbers of segments of the various audio types cannot be guaranteed to be comparable, so the result obtained after clustering is often not ideal. In this case the samples at the cluster boundary are not necessarily samples on the true class boundary: there may be outliers among them, as well as samples easily confused between two or even more classes. If only such samples were selected for labeling, they could not represent the true information of the class boundary; with a finite selection of samples, a support vector machine classifier trained on them would yield a classification surface distorted by these confusable boundary points, giving low classification accuracy. In view of this, the useful samples chosen by the present invention are those slightly inside the cluster boundary: among such samples there are fewer outliers, and most belong to the same class. At the same time, the chosen samples are not too close to the cluster center either, because samples near the center carry little information for a support vector machine classifier; most of them lie in regions far from the separating surface and help very little in finding the support vectors.
As for the true boundary samples, after part of the samples have been labeled according to this chapter, active learning can be carried out to obtain further useful samples; selecting the samples to be labeled by this cluster-based approach lays the foundation for choosing higher-quality samples for active learning. The method of active learning itself is not the focus of the present invention. The method for selecting the samples to be labeled for a discriminative classifier is introduced below. Similarly to the generative-model classifier, let d_F(i,k) denote the distance in the space F from a sample point to its cluster center. In the k-th cluster of F, the distance from sample x_i to the cluster center is:
$$d_F(i,k) = K(clt\_s_{ki},\,clt\_s_{ki}) + K(c_k,c_k) - 2K(clt\_s_{ki},\,c_k) \quad (31)$$
where $c_k$ denotes the center of the k-th cluster in the D-dimensional space.
The distances from all points in a cluster to its center are collected, and the maximum distance is taken as the cluster radius $r_k$. Given a real number λ, λ ∈ [0,1], the cluster is divided at radius $\lambda r_k$, and $M_k$ points are sampled from the region near the cluster boundary (at distance between $\lambda r_k$ and $r_k$ from the center) as the useful points for the discriminative classifier; $M_k$ is determined in the same way as for the generative model. The procedure for sample selection under the discriminative model is as follows:
Compute the kernel-space radii: for each cluster k, k ∈ [1, N], compute the distance $d_F(i,k)$ between each sample $clt\_s_{ki}$, i ∈ [1, N_k], and the cluster center, and find the maximum radius $r_k$.
Divide the clusters: given a real number λ, λ ∈ [0,1], divide the k-th cluster, k ∈ [1, N], at radius $\lambda r_k$.
Choose the sample points: for each cluster, randomly sample $M_k$ samples from the region within the cluster between $\lambda r_k$ and $r_k$ from the center, where, according to formula (23), $M_k = w_k r N_k$ and $N_k$ is the total number of samples in the cluster.
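A sketch of this boundary-band sampling is given below; the rate parameter stands in for the w_k·r factor of formula (23), which is defined earlier in the source thesis and not reproduced here, and the default values are illustrative.

```python
import numpy as np

def select_boundary_band(D, labels, medoids, lam=0.7, rate=0.2, seed=0):
    """D: pairwise kernel-space distances; medoids: per-cluster medoid indices."""
    rng = np.random.default_rng(seed)
    chosen = []
    for k, m in enumerate(medoids):
        members = np.flatnonzero(labels == k)
        d = D[members, m]                       # distance to centre, eq. (31)
        band = members[d > lam * d.max()]       # the ring near the boundary
        m_k = max(1, int(rate * members.size))  # stands in for M_k of eq. (23)
        pick = rng.choice(band, size=min(m_k, band.size), replace=False)
        chosen.extend(pick.tolist())
    return np.asarray(chosen)
```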
When the first-layer classifier selects unlabeled samples, all unlabeled samples are first clustered and, following the method for selecting samples to be labeled for an SVM classifier, the samples closer to the cluster boundaries are selected; likewise, to guarantee that the space in which the samples are chosen is consistent with the space used when training the SVM, the clustering algorithm is run in the corresponding kernel space. After the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training. According to the confidence interval used, the unlabeled sample selection algorithm can be divided into high-confidence selection, low-confidence selection, and middle-confidence selection: high-confidence selection picks the unlabeled samples above a certain confidence threshold; low-confidence selection picks those below a certain confidence threshold; middle-confidence selection picks those whose confidence lies between two thresholds.
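The three interval-based selections might be expressed as follows, assuming per-sample confidences conf (for example from eq. (5-5)) and two user-chosen thresholds t_low < t_high; the threshold values shown are placeholders.

```python
import numpy as np

def split_by_confidence(X, conf, t_low=0.3, t_high=0.7):
    high = X[conf > t_high]                      # high-confidence selection
    low = X[conf < t_low]                        # low-confidence selection
    mid = X[(conf >= t_low) & (conf <= t_high)]  # middle-confidence selection
    return high, mid, low
```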
When any layer other than the first selects unlabeled samples, only the unlabeled samples already selected in (1) are clustered: the high-confidence samples selected by the previous layer's TSVM classifier are clustered, the samples close to the cluster boundaries are selected, and then a supervised SVM classifier is trained on the current layer's labeled samples and used to recognize the selected boundary samples; the unlabeled samples in different confidence intervals are then selected for TSVM training.
In the original TSVM problem the test samples are given and cannot be chosen; but when TSVM is used to train a classifier, the aim is to exploit the information contained in unlabeled samples to train a classifier with strong generalization ability, and a large pool of unlabeled samples is available to choose from. Experiments show that more unlabeled samples do not necessarily make the resulting classifier generalize better, which illustrates that, with finite samples, not every unlabeled sample is useful; yet the general TSVM algorithm imposes no qualifying conditions on the unlabeled samples. The three standard assumptions of semi-supervised learning all suppose a specific relation between the distribution of the samples (unlabeled and labeled alike) and the classes: only when the chosen unlabeled samples are consistent with the true sample distribution can the performance of the semi-supervised learner improve. Meanwhile, since the present invention adopts a hierarchical TSVM classifier, the confidence-and-clustering-based unlabeled sample selection algorithm is used to improve the performance of the TSVM algorithm.
Brief description of the drawings
Fig. 1 is the flowchart of the method, provided by the invention, for selecting unlabeled samples in semi-supervised learning based on confidence and clustering;
Fig. 2 is the flowchart of the multi-class hierarchical TSVM classification provided by the invention.
Embodiment
To further explain the content, characteristics, and effects of the present invention, the following embodiment is given and described in detail with reference to the accompanying drawings. Refer to Fig. 1. The invention is further described below in conjunction with the drawings and a specific embodiment.
The present invention is a method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering, characterized in that a hierarchical TSVM classifier is adopted together with a confidence-and-clustering-based unlabeled sample selection algorithm that improves the performance of the TSVM algorithm. The unlabeled samples used for TSVM learning are selected from a large pool of unlabeled samples and must satisfy specific conditions before they can improve the performance of the semi-supervised learner. First, all unlabeled samples are clustered and the samples closer to the cluster boundaries are selected; after the boundary samples have been selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training. Once the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer; the goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier. The selected unlabeled samples thus both represent the boundary distribution of the samples and maximize, at every layer, the probability that the chosen samples belong to the classes of the next layer's classifier. The multi-class hierarchical TSVM classification flow is shown in Fig. 2.
The conditions that the unlabeled samples must satisfy and the selection method are as follows:
(1) The present invention adopts a hierarchical TSVM classifier; two TSVM classifiers are trained at every layer, and the samples added to semi-supervised learning should belong respectively to the classes corresponding to each classifier.
At the first layer, in the semi-supervised learning of silence versus non-silence, all unlabeled samples belong to these two classes; therefore all unlabeled samples satisfy the above condition.
After the first layer's classifier is trained, two branches are obtained; suppose the left branch is the positive class and the right branch the negative class, and let each branch select the samples belonging to it. The selection method is as follows. Define Con(x) as the confidence that sample x belongs to a certain class, and suppose the classification surface obtained by training the first layer is f(x); this surface is used to recognize all the unlabeled samples again. The confidence that x belongs to a given class with respect to the classification surface f(x) can be represented by the probability that x belongs to that class; the present invention adopts the improvement by Lin et al. of Platt's probabilistic output to estimate this probability, namely:
$$\mathrm{Con}(x) = \mathrm{prob}(y \mid x) = \begin{cases} \dfrac{\exp(Af(x)+B)}{1+\exp(Af(x)+B)}, & \text{if } f(x) > 0 \\ \dfrac{1}{1+\exp(Af(x)+B)}, & \text{otherwise} \end{cases} \quad (5\text{-}5)$$
where A and B are jointly determined by the training data and the classification results f(x).
After the confidence of each sample has been obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer. The goal is that, conditioned on the previous layer's classifier, the selected samples have the largest probability of belonging to the classes handled by the next layer's classifier.
(2) The TSVM classifier is in essence still an SVM classifier and therefore must likewise find the support-vector samples; when selecting unlabeled samples, samples near the support vectors are more helpful to classifier training. Here too, preferring boundary samples among the unlabeled samples is realized by clustering.
When the first-layer classifier selects unlabeled samples, all unlabeled samples are first clustered and, following the method for selecting samples to be labeled for an SVM classifier, the samples closer to the cluster boundaries are selected; likewise, to guarantee that the space in which the samples are chosen is consistent with the space used when training the SVM, the clustering algorithm is run in the corresponding kernel space. After the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training. According to the confidence interval used, the unlabeled sample selection algorithm can be divided into high-confidence selection, low-confidence selection, and middle-confidence selection: high-confidence selection picks the unlabeled samples above a certain confidence threshold; low-confidence selection picks those below a certain confidence threshold; middle-confidence selection picks those whose confidence lies between two thresholds.
The specific clustering algorithm adopted is K-means clustering (K-means Clustering), an unsupervised real-time clustering algorithm proposed by MacQueen in 1967. It is a popular approximate algorithm that divides the data into a predetermined number K of parts according to the similarity of the data and finds the central point of each part. To distinguish them from the "classes" of classification, these K parts are called K clusters, and the central point of each part is called a cluster center.
The goal of the K-means clustering algorithm is to maximize the sum of the similarities between all sample points within each cluster. The similarity measure is usually based on a distance metric; when the Euclidean distance is adopted, the clustering objective function based on the sum-of-squared-errors criterion is as follows:
$$J = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\|x_i - m_k\|^2$$
where:
K denotes the number of clusters;
N_k denotes the number of samples in the k-th cluster, satisfying $\sum_{k=1}^{K} N_k = N$, where N is the number of samples participating in clustering;
x_i denotes the feature vector of a sample;
m_k denotes the center of the k-th cluster, i.e., the mean of the k-th cluster, computed as:
$$m_k = \frac{1}{N_k}\sum_{i=1}^{N_k} x_i$$
As the formula shows, the objective function depends on the initially specified number of clusters K. The aim of clustering is to find the assignment that minimizes the sum of the distances between the samples in each cluster and their cluster center; the general steps for minimizing this objective function are:
Step 1, initialization: input the sample set S to be clustered and the number of clusters K, and randomly choose K samples from S as the initial cluster centers; set the iteration stopping criterion, generally a maximum number of iterations or a convergence error threshold.
Step 2, assignment: for each sample x_i, find the nearest cluster center according to the similarity criterion and assign the sample to that cluster.
Step 3, update: for each cluster, recompute the mean vector of all samples in the cluster as the new cluster center.
Step 4: repeat Steps 2 and 3 until the stopping criterion is met.
As can be seen from this flow, the K-means algorithm is simple in principle and convenient for processing large-scale data. However, the number of clusters K must be set in advance, and choosing K is often difficult: in many situations one cannot know in advance how many clusters a data set should be divided into. In addition, K-means must choose some samples as initial cluster centers and partitions the samples iteratively starting from them; the algorithm depends strongly on these initial centers, and their choice tends to influence the final clustering result. If the initial centers are chosen badly, the clustering can fall into a local optimum. Improved algorithms exist for this problem, but none solves it at the root.
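For reference, a brief scikit-learn sketch of this K-means step on stand-in feature vectors; the n_init restarts are one standard mitigation of the initialization sensitivity just noted.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 12))   # stand-in features
km = KMeans(n_clusters=5, n_init=10, max_iter=300).fit(X)
centers, labels = km.cluster_centers_, km.labels_
```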
When any layer other than the first selects unlabeled samples, only the unlabeled samples already selected in (1) are clustered: the high-confidence samples selected by the previous layer's TSVM classifier are clustered, the samples close to the cluster boundaries are selected, and then a supervised SVM classifier is trained on the current layer's labeled samples and used to recognize the selected boundary samples; the unlabeled samples in different confidence intervals are then selected for TSVM training.
Working principle of the present invention:
The principle and basis of the sample selection are as follows. TSVM is a discriminative model. A discriminative model (also called a distinguishing model) is indifferent to the distribution of the training samples themselves and instead models the distributional differences between classes; in other words, it selects a model structure shaped jointly by the two classes of samples to tell them apart, with the two classes "wrestling" to form an impartial classification surface. Therefore, for a discriminative classifier, the useful training samples are those that describe the distribution of the boundary between classes. If the boundary samples can be found and labeled, a great deal of manual labeling effort can be saved.
According to the modeling characteristics of discriminative models, a training sample set that is useful to a discriminative classifier has the following features:
1) Sparsity: the useful samples lie in sparsely populated regions of the distribution. For example, for an SVM classifier, the support-vector samples are the most useful samples.
2) Diversity: this property is similar to the criterion by which generative models choose representative samples. When selecting samples for manual labeling, every class should be covered as much as possible; in other words, the sample set should span every class present in the massive pool of unlabeled audio data. Although no supervision information is available, the spatial distribution of the samples can be mined, for example which regions are dense and which are sparse, and this information serves as a basis for selecting samples. Clustering, a typical unsupervised learning algorithm, extracts cluster information from the unlabeled samples and thereby reveals their distribution. The hope is that the audio segments in each resulting cluster come from the same semantic type, so that a useful sample set can be selected from the clusters for labeling.
As stated above, different classifier models differ in how the samples to be labeled are selected. This patent adopts TSVM, a variant of SVM. For an SVM classifier, clustering minimizes the within-cluster scatter and maximizes the between-cluster scatter, so the boundaries of the clusters obtained are exactly the sparse regions of the sample distribution; hence the samples near the cluster boundaries are useful for the SVM classifier. Generally speaking, the decision surface between different classes tends to lie in the sparse regions of the sample distribution, so for a discriminative classifier it is optimal to find and label the samples on the class boundaries. The clustering algorithm partitions the clusters precisely according to the sparsity of the sample distribution; to pick out as many boundary samples as possible, the samples near the cluster boundaries are chosen rather than the samples at the cluster centers, which differs from the way a generative model selects representative samples.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (8)

1. A method for selecting unlabeled samples based on confidence and clustering, characterized in that a hierarchical TSVM classifier and a confidence-and-clustering-based unlabeled sample selection method are adopted, specifically comprising:
first clustering all unlabeled samples and selecting the samples closer to the cluster boundaries; after the boundary samples are selected, training a supervised SVM classifier on the labeled samples and using it to recognize the selected boundary samples, and selecting unlabeled samples in different confidence intervals for TSVM training; after the confidence of each sample is obtained, defining a threshold λ and selecting the unlabeled samples whose confidence exceeds λ to enter the semi-supervised learning of the next layer.
2. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 1, characterized in that the conditions the unlabeled samples must satisfy and the selection method are as follows:
a hierarchical TSVM classifier is adopted, two TSVM classifiers are trained at every layer, and the samples added to semi-supervised learning belong respectively to the classes corresponding to each classifier;
at the first layer, in the semi-supervised learning of silence versus non-silence, all unlabeled samples belong to these two classes, and therefore all unlabeled samples satisfy the above condition;
after the first layer's classifier is trained, two branches are obtained, the left branch being the positive class and the right branch the negative class, and each branch selects the samples belonging to it; the selection method is as follows: define Con(x) as the confidence that sample x belongs to a certain class, let f(x) be the classification surface obtained by training the first layer, and use this surface to recognize all unlabeled samples again; the confidence that x belongs to a given class with respect to the classification surface f(x) is represented by the probability that x belongs to that class, estimated with the improvement by Lin et al. of Platt's probabilistic output, namely:
$$\mathrm{Con}(x) = \mathrm{prob}(y \mid x) = \begin{cases} \dfrac{\exp(Af(x)+B)}{1+\exp(Af(x)+B)}, & \text{if } f(x) > 0 \\ \dfrac{1}{1+\exp(Af(x)+B)}, & \text{otherwise} \end{cases} \quad (5\text{-}5)$$
wherein A and B are jointly determined by the training data and the classification results f(x);
after the confidence of each sample is obtained, a threshold λ is defined, and the unlabeled samples whose confidence exceeds λ enter the semi-supervised learning of the next layer;
the TSVM classifier is in essence still an SVM classifier; when selecting unlabeled samples, selecting samples near the support vectors is more helpful to classifier training, and here too, preferring boundary samples among the unlabeled samples is realized by clustering.
3. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 2, characterized in that, when any layer other than the first selects unlabeled samples, only the unlabeled samples already selected by the hierarchical TSVM classifier are clustered: the high-confidence samples selected by the previous layer's TSVM classifier are clustered, the samples close to the cluster boundaries are selected, a supervised SVM classifier is then trained on the current layer's labeled samples and used to recognize the selected boundary samples, and the unlabeled samples in different confidence intervals are selected for TSVM training.
4. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 2, characterized in that the point-assignment algorithm does not require computing the cluster centers at every iteration; the algorithm is as follows:
at iteration n−1 the clustering result is $CLT^{(n-1)} = \{Clt_1^{(n-1)}, Clt_2^{(n-1)}, \ldots, Clt_K^{(n-1)}\}$, where K is the number of clusters obtained and each cluster is the set $Clt_k^{(n-1)} = \{clt\_s_{k1}^{(n-1)}, clt\_s_{k2}^{(n-1)}, \ldots, clt\_s_{kN_k}^{(n-1)}\}$, with $N_k$ the number of samples in the k-th cluster; when the samples are redistributed at iteration n, the following is computed for each sample $x_i$:
$$d_F\!\left(i,\,Clt_k^{(n-1)}\right) = \sum_{j=1}^{N_k^{(n-1)}}\left[K(x_i,x_i)+K\!\left(clt\_s_{kj}^{(n-1)},\,clt\_s_{kj}^{(n-1)}\right)-2K\!\left(x_i,\,clt\_s_{kj}^{(n-1)}\right)\right]$$
at iteration n, the cluster $k^*$ to which the sample point $x_i$ belongs is
$$k^* = \arg\min_{k\in[1,K]} d_F\!\left(i,\,Clt_k^{(n-1)}\right)$$
and each sample point is assigned the label of that cluster; when the points are assigned iteratively by this method, the cluster centers need not be computed at each step, and only when the iteration stops are the final cluster centers computed by the formula
$$m_k = \arg\min_{x_i,\ i\in[1,N_k]} \sum_{j=1}^{N_k}\left[K(x_i,x_i)+K(x_j,x_j)-2K(x_i,x_j)\right].$$
5. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 4, characterized by the following useful-sample selection algorithm for the discriminative classifier:
similarly to the generative-model classifier, let $d_F(i,k)$ denote the distance in the space F from a sample point to its cluster center; in the k-th cluster of F, the distance from sample $x_i$ to the cluster center is:
$$d_F(i,k) = K(clt\_s_{ki},\,clt\_s_{ki}) + K(c_k,c_k) - 2K(clt\_s_{ki},\,c_k) \quad (31)$$
wherein $c_k$ denotes the center of the k-th cluster in the D-dimensional space;
the distances from all points in a cluster to its center are collected and the maximum distance is taken as the cluster radius $r_k$; given a real number λ, λ ∈ [0,1], the cluster is divided at radius $\lambda r_k$ and $M_k$ points are sampled from the region near the cluster boundary as the useful points for the discriminative classifier, $M_k$ being determined in the same way as for the generative model; the procedure for sample selection under the discriminative model is as follows:
compute the kernel-space radii: for each cluster k, k ∈ [1, N], compute the distance $d_F(i,k)$ between each sample $clt\_s_{ki}$, i ∈ [1, N_k], and the cluster center, and find the maximum radius $r_k$;
divide the clusters: given a real number λ, λ ∈ [0,1], divide the k-th cluster, k ∈ [1, N], at radius $\lambda r_k$;
choose the sample points: for each cluster, randomly sample $M_k$ samples from the region within the cluster between $\lambda r_k$ and $r_k$ from the center, where, according to formula (23), $M_k = w_k r N_k$ and $N_k$ is the total number of samples in the cluster.
6. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 1, characterized in that, when the first-layer classifier selects unlabeled samples, all unlabeled samples are first clustered and, following the method for selecting samples to be labeled for an SVM classifier, the samples closer to the cluster boundaries are selected; after the boundary samples are selected, a supervised SVM classifier is trained on the labeled samples and used to recognize them, and unlabeled samples in different confidence intervals are selected for TSVM training; according to the confidence interval used, the unlabeled sample selection algorithm is divided into high-confidence selection, low-confidence selection, and middle-confidence selection; high-confidence selection picks the unlabeled samples above a certain confidence threshold; low-confidence selection picks those below a certain confidence threshold; middle-confidence selection picks those whose confidence lies between two thresholds.
7. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 1, characterized in that the clustering algorithm adopts K-means clustering, which divides the data into a predetermined number K of parts according to the similarity of the data and finds the central point of each part; to distinguish them from the "classes" of classification, these K parts are called K clusters and the central point of each part is called a cluster center;
the sum of the similarities between all sample points within each cluster is maximized; when the Euclidean distance is adopted, the clustering objective function based on the sum-of-squared-errors criterion is as follows:
$$J = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\|x_i - m_k\|^2$$
wherein:
K denotes the number of clusters;
N_k denotes the number of samples in the k-th cluster, satisfying $\sum_{k=1}^{K} N_k = N$, where N is the number of samples participating in clustering;
x_i denotes the feature vector of a sample;
m_k denotes the center of the k-th cluster, i.e., the mean of the k-th cluster, computed as:
$$m_k = \frac{1}{N_k}\sum_{i=1}^{N_k} x_i.$$
8. The method for selecting unlabeled samples in semi-supervised learning based on confidence and clustering according to claim 7, characterized in that the object of clustering is to find the assignment that minimizes the sum of the distances between the samples in each cluster and their cluster center, the objective function being solved by the steps of:
Step 1, initialization: inputting the sample set S to be clustered and the number of clusters K, randomly choosing K samples from S as the initial cluster centers, and setting the iteration stopping criterion, a maximum number of iterations or a convergence error threshold;
Step 2, assignment: for each sample x_i, finding the nearest cluster center according to the similarity criterion and assigning the sample to that cluster;
Step 3, update: for each cluster, recomputing the mean vector of all samples in the cluster as the new cluster center;
Step 4: repeating Steps 2 and 3 until the stopping criterion is met.
CN201410395794.6A 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering Pending CN104156438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410395794.6A CN104156438A (en) 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410395794.6A CN104156438A (en) 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering

Publications (1)

Publication Number Publication Date
CN104156438A true CN104156438A (en) 2014-11-19

Family

ID=51881936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410395794.6A Pending CN104156438A (en) 2014-08-12 2014-08-12 Unlabeled sample selection method based on confidence coefficients and clustering

Country Status (1)

Country Link
CN (1) CN104156438A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331716A (en) * 2014-11-20 2015-02-04 武汉图歌信息技术有限责任公司 SVM active learning classification algorithm for large-scale training data
CN105069474A (en) * 2015-08-05 2015-11-18 山东师范大学 Semi-supervised learning high confidence sample excavating method for audio event classification
CN106778585A (en) * 2016-12-08 2017-05-31 腾讯科技(上海)有限公司 A kind of face key point-tracking method and device
CN107067025A (en) * 2017-02-15 2017-08-18 重庆邮电大学 A kind of data automatic marking method based on Active Learning
CN107229944A (en) * 2017-05-04 2017-10-03 青岛科技大学 Semi-supervised active identification method based on cognitive information particle
CN107293308A (en) * 2016-04-01 2017-10-24 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN107516101A (en) * 2016-06-16 2017-12-26 阿里巴巴集团控股有限公司 A kind of data boundary division methods and equipment
CN107735804A (en) * 2015-07-06 2018-02-23 微软技术许可有限责任公司 The shift learning technology of different tag sets
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A kind of file classification method and system based on expansion marker samples
CN108334943A (en) * 2018-01-03 2018-07-27 浙江大学 The semi-supervised soft-measuring modeling method of industrial process based on Active Learning neural network model
CN108351971A (en) * 2015-10-12 2018-07-31 北京市商汤科技开发有限公司 The method and system that the object that label has is clustered
CN108805173A (en) * 2018-05-16 2018-11-13 苏州迈为科技股份有限公司 Solar battery sheet aberration method for separating
CN108876270A (en) * 2018-09-19 2018-11-23 惠龙易通国际物流股份有限公司 Automatic source of goods auditing system and method
CN108885700A (en) * 2015-10-02 2018-11-23 川科德博有限公司 Data set semi-automatic labelling
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
CN109645993A (en) * 2018-11-13 2019-04-19 天津大学 A kind of methods of actively studying of the raising across individual brain-computer interface recognition performance
CN109873774A (en) * 2019-01-15 2019-06-11 北京邮电大学 A kind of network flow identification method and device
CN109933619A (en) * 2019-03-13 2019-06-25 西南交通大学 A kind of semisupervised classification prediction technique
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
CN110782876A (en) * 2019-10-21 2020-02-11 华中科技大学 Unsupervised active learning method for voice emotion calculation
WO2020038353A1 (en) * 2018-08-21 2020-02-27 瀚思安信(北京)软件技术有限公司 Abnormal behavior detection method and system
CN110933102A (en) * 2019-12-11 2020-03-27 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification
WO2020207179A1 (en) * 2019-04-09 2020-10-15 山东科技大学 Method for extracting concept word from video caption
CN111898704A (en) * 2020-08-17 2020-11-06 腾讯科技(深圳)有限公司 Method and device for clustering content samples
CN112651431A (en) * 2020-12-16 2021-04-13 北方工业大学 Clustering sorting method for retired power batteries
CN113095442A (en) * 2021-06-04 2021-07-09 成都信息工程大学 Hail identification method based on semi-supervised learning under multi-dimensional radar data
WO2022011892A1 (en) * 2020-07-15 2022-01-20 北京市商汤科技开发有限公司 Network training method and apparatus, target detection method and apparatus, and electronic device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1656472A (en) * 2001-11-16 2005-08-17 陈垣洋 Plausible neural network with supervised and unsupervised cluster analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1656472A (en) * 2001-11-16 2005-08-17 陈垣洋 Plausible neural network with supervised and unsupervised cluster analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王荣燕 (Wang Rongyan): "Research on Key Issues in Complex Audio Classification", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331716A (en) * 2014-11-20 2015-02-04 武汉图歌信息技术有限责任公司 SVM active learning classification algorithm for large-scale training data
CN107735804B (en) * 2015-07-06 2021-10-26 微软技术许可有限责任公司 System and method for transfer learning techniques for different sets of labels
CN107735804A (en) * 2015-07-06 2018-02-23 微软技术许可有限责任公司 The shift learning technology of different tag sets
US11062228B2 (en) 2015-07-06 2021-07-13 Microsoft Technoiogy Licensing, LLC Transfer learning techniques for disparate label sets
CN105069474A (en) * 2015-08-05 2015-11-18 山东师范大学 Semi-supervised learning high confidence sample excavating method for audio event classification
CN105069474B (en) * 2015-08-05 2019-02-12 山东师范大学 Semi-supervised learning high confidence level sample method for digging for audio event classification
CN108885700A (en) * 2015-10-02 2018-11-23 川科德博有限公司 Data set semi-automatic labelling
CN108351971A (en) * 2015-10-12 2018-07-31 北京市商汤科技开发有限公司 The method and system that the object that label has is clustered
CN108351971B (en) * 2015-10-12 2022-04-22 北京市商汤科技开发有限公司 Method and system for clustering objects marked with attributes
CN107293308B (en) * 2016-04-01 2019-06-07 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN107293308A (en) * 2016-04-01 2017-10-24 腾讯科技(深圳)有限公司 A kind of audio-frequency processing method and device
CN107516101B (en) * 2016-06-16 2021-07-06 阿里巴巴集团控股有限公司 Boundary data dividing method and device
CN107516101A (en) * 2016-06-16 2017-12-26 阿里巴巴集团控股有限公司 A kind of data boundary division methods and equipment
WO2018103525A1 (en) * 2016-12-08 2018-06-14 腾讯科技(深圳)有限公司 Method and device for tracking facial key point, and storage medium
US10817708B2 (en) 2016-12-08 2020-10-27 Tencent Technology (Shenzhen) Company Limited Facial tracking method and apparatus, and storage medium
CN106778585B (en) * 2016-12-08 2019-04-16 腾讯科技(上海)有限公司 A kind of face key point-tracking method and device
CN106778585A (en) * 2016-12-08 2017-05-31 腾讯科技(上海)有限公司 A kind of face key point-tracking method and device
CN107067025B (en) * 2017-02-15 2020-12-22 重庆邮电大学 Text data automatic labeling method based on active learning
CN107067025A (en) * 2017-02-15 2017-08-18 重庆邮电大学 A kind of data automatic marking method based on Active Learning
CN107229944B (en) * 2017-05-04 2021-05-07 青岛科技大学 Semi-supervised active identification method based on cognitive information particles
CN107229944A (en) * 2017-05-04 2017-10-03 青岛科技大学 Semi-supervised active identification method based on cognitive information particle
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A kind of file classification method and system based on expansion marker samples
CN108334943A (en) * 2018-01-03 2018-07-27 浙江大学 The semi-supervised soft-measuring modeling method of industrial process based on Active Learning neural network model
CN108805173A (en) * 2018-05-16 2018-11-13 苏州迈为科技股份有限公司 Solar battery sheet aberration method for separating
CN109165309B (en) * 2018-08-06 2020-10-16 北京邮电大学 Negative example training sample acquisition method and device and model training method and device
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
WO2020038353A1 (en) * 2018-08-21 2020-02-27 瀚思安信(北京)软件技术有限公司 Abnormal behavior detection method and system
CN108876270B (en) * 2018-09-19 2022-08-12 惠龙易通国际物流股份有限公司 Automatic goods source auditing system and method
CN108876270A (en) * 2018-09-19 2018-11-23 惠龙易通国际物流股份有限公司 Automatic source of goods auditing system and method
CN109645993A (en) * 2018-11-13 2019-04-19 天津大学 A kind of methods of actively studying of the raising across individual brain-computer interface recognition performance
CN109873774A (en) * 2019-01-15 2019-06-11 北京邮电大学 A kind of network flow identification method and device
CN109873774B (en) * 2019-01-15 2021-01-01 北京邮电大学 Network traffic identification method and device
CN109933619A (en) * 2019-03-13 2019-06-25 西南交通大学 A kind of semisupervised classification prediction technique
WO2020207179A1 (en) * 2019-04-09 2020-10-15 山东科技大学 Method for extracting concept word from video caption
CN110263804A (en) * 2019-05-06 2019-09-20 杭州电子科技大学 A kind of medical image dividing method based on safe semi-supervised clustering
CN110782876A (en) * 2019-10-21 2020-02-11 华中科技大学 Unsupervised active learning method for voice emotion calculation
CN110933102B (en) * 2019-12-11 2021-10-26 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN114039794A (en) * 2019-12-11 2022-02-11 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN110933102A (en) * 2019-12-11 2020-03-27 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification
WO2022011892A1 (en) * 2020-07-15 2022-01-20 北京市商汤科技开发有限公司 Network training method and apparatus, target detection method and apparatus, and electronic device
CN111898704A (en) * 2020-08-17 2020-11-06 腾讯科技(深圳)有限公司 Method and device for clustering content samples
CN111898704B (en) * 2020-08-17 2024-05-10 腾讯科技(深圳)有限公司 Method and device for clustering content samples
CN112651431A (en) * 2020-12-16 2021-04-13 北方工业大学 Clustering sorting method for retired power batteries
CN112651431B (en) * 2020-12-16 2023-07-07 北方工业大学 Clustering and sorting method for retired power batteries
CN113095442A (en) * 2021-06-04 2021-07-09 成都信息工程大学 Hail identification method based on semi-supervised learning under multi-dimensional radar data
CN113095442B (en) * 2021-06-04 2021-09-10 成都信息工程大学 Hail identification method based on semi-supervised learning under multi-dimensional radar data

Similar Documents

Publication Publication Date Title
CN104156438A (en) Unlabeled sample selection method based on confidence coefficients and clustering
CN106599029B (en) Chinese short text clustering method
CN106339416B (en) Educational data clustering method based on grid fast searching density peaks
Zhu et al. Video synopsis by heterogeneous multi-source correlation
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
Chang et al. Columbia University/VIREO-CityU/IRIT TRECVID2008 high-level feature extraction and interactive video search
CN101620615B (en) Automatic image annotation and translation method based on decision tree learning
CN103207879A (en) Method and equipment for generating image index
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN104156945A (en) Method for segmenting gray scale image based on multi-objective particle swarm optimization algorithm
CN104268507A (en) Manual alphabet identification method based on RGB-D image
CN103631769A (en) Method and device for judging consistency between file content and title
CN103473275A (en) Automatic image labeling method and automatic image labeling system by means of multi-feature fusion
CN104077408B (en) Extensive across media data distributed semi content of supervision method for identifying and classifying and device
CN105069474B (en) Semi-supervised learning high confidence level sample method for digging for audio event classification
CN107609570B (en) Micro video popularity prediction method based on attribute classification and multi-view feature fusion
CN102426598A (en) Method for clustering Chinese texts for safety management of network content
CN105447142B (en) A kind of double mode agricultural science and technology achievement classification method and system
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
Qian et al. ISABoost: A weak classifier inner structure adjusting based AdaBoost algorithm—ISABoost based application in scene categorization
Xie et al. K-means clustering based on density for scene image classification
Shinde et al. A systematic study of text mining techniques
Gao et al. Image classification based on support vector machine and the fusion of complementary features
CN102637205B (en) Document classification method based on Hadoop
CN102945370B (en) Based on the sorting technique of many label two visual angles support vector machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141119