CN116662832A - Training sample selection method based on clustering and active learning - Google Patents
- Publication number
- CN116662832A (application CN202310493334.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- clustering
- value
- threshold
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application discloses a training sample selection method and device based on clustering and active learning. The method first uses consistency regularization to divide the samples in a data pool into high-confidence samples above a threshold and low-confidence samples below the threshold; it then clusters the data with a density peak clustering method and divides the samples of each cluster into inner and outer regions; samples above the threshold that belong to an inner region are given pseudo labels, while samples below the threshold that belong to an outer region are added to an active learning task; finally, a multi-index fusion method screens out training samples that are both uncertain and diverse and submits them to an expert for labeling. The application effectively reduces the cost of expert labeling of training samples while the model still reaches its preset performance.
Description
Technical Field
The application relates to a training sample selection method and device based on clustering and active learning, and belongs to the technical field of Internet and artificial intelligence.
Background
In machine learning, model training relies on a large number of labeled samples, and model performance depends largely on the number and quality of the labeled samples provided: the more numerous and the higher in quality the samples, the more likely a high-precision model can be obtained. In practical applications such as object detection, face recognition and text classification, large numbers of unlabeled samples typically arise; labeling them requires substantial labor and financial resources, and samples in some specialized fields must be labeled by experts with the relevant domain background to guarantee label quality.
To address the excessive cost of sample labeling, researchers have proposed Active Learning (AL) and Semi-Supervised Learning (SSL). AL selects, over multiple iterations and according to corresponding strategies, the samples with higher uncertainty and poorer stability for an expert to label, so that the model is trained with as few manually labeled high-quality samples as possible. SSL trains the model on a large number of unlabeled samples and fine-tunes it with a small number of labeled samples, reducing manual labeling cost by having the model generate pseudo labels for unlabeled samples with high confidence.
However, existing semi-supervised active learning methods still have several problems. The SSL module usually sets a high, fixed threshold to select high-confidence samples, generate pseudo labels for them, and send them into the labeled data pool. A high threshold effectively reduces bias and filters noisy data, but the model changes dynamically across training iterations: its precision is low at the start of training, so only a few samples exceed the high threshold, a large number of samples are wasted, and model convergence is slow. In addition, the quality of the data selected by SSL depends heavily on the quality of the pseudo labels generated in earlier iterations; once the pseudo labels produced at some iteration are poor, wrong labels are incorporated into training, the error is amplified as the number of iterations grows, and the later model easily learns in the wrong direction. Finally, the AL module extracts samples based on a single index such as uncertainty or instability, which easily ignores the representativeness of the samples, so the samples extracted in each round are somewhat redundant and model performance suffers.
The application therefore provides a training sample selection method based on clustering and active learning: first, samples are distinguished with the consistency regularization method of semi-supervised learning and divided into high-confidence samples above a threshold and low-confidence samples below the threshold; second, a density peak clustering algorithm divides the samples of each cluster into inner and outer regions; finally, samples above the threshold that belong to an inner region are given pseudo labels, samples below the threshold that belong to an outer region are sent to an active learning module, and a multi-index fusion algorithm selects the samples to be labeled by labeling experts.
Disclosure of Invention
Aiming at the problems and shortcomings of the prior art, the application provides a training sample selection method and device based on clustering and active learning, which reduces expert labeling cost as much as possible while the model still reaches its preset performance.
To achieve the above object, the technical scheme of the present application is as follows. The training sample selection method and device based on clustering and active learning mainly comprises the processes of dividing sample confidence with consistency regularization, dividing clusters into inner and outer regions with density peak clustering, and a multi-index fusion selection strategy, and reduces expert labeling cost as far as possible while the model reaches its preset performance. The method comprises three steps, as follows:
Step 1: a dynamic adaptive threshold setting method is used. In the initial stage of training the overall accuracy of the model is low, so the threshold criterion is relaxed and the threshold is raised gradually from 0, giving all samples an opportunity to be learned; once the model has acquired a stable recognition capability, a fixed high threshold is restored.
Step 2: the clusters are divided into inner and outer regions with a density peak clustering method. A Density Peak Clustering (DPC) module is added between the SSL module and the AL module; DPC captures the overall underlying structure of the samples, and the spatial distribution of the data set is used to further narrow the data selection range and improve sample screening quality, so that selection is decoupled to some extent from the current precision of the model.
Step 3: a multi-index fusion selection strategy. An uncertainty index based on information entropy and a representativeness index based on Euclidean distance are used to select samples that contribute most to improving model performance while reducing the redundancy of the selected samples, thereby accelerating model convergence.
Compared with the prior art, the application has the following beneficial effects:
1. The adaptive threshold effectively solves the problem that the fixed threshold in the SSL module wastes a large number of samples.
2. A Density Peak Clustering (DPC) module is added between the SSL module and the AL module; DPC captures the overall underlying structure of the samples, the spatial distribution of the data set is used to further narrow the data selection range, and sample screening quality is improved, so that selection is decoupled to some extent from the current precision of the model.
3. The multi-index fusion selection strategy considers the representativeness of the samples while extracting high-value samples, reducing the redundancy of the selected samples, accelerating model convergence, and lowering the user's labeling cost while the model reaches its preset performance.
Drawings
FIG. 1 is a general block diagram of a method according to an embodiment of the present application.
FIG. 2 is a flow chart of a method according to an embodiment of the application.
Detailed Description
The application is further illustrated below in conjunction with specific examples in order to enhance the understanding and appreciation of the application.
Examples: the overall framework and the specific flow of the application are shown in Fig. 1 and Fig. 2 respectively. The sample selection method based on clustering and semi-supervised active learning is implemented through the following steps.
Step 1: a dynamic adaptive threshold setting method. ResNet50 is selected as the task model T. First, the L labeled samples in the data pool are subjected to a flip-based weak enhancement operation and fed through T to obtain the prediction distribution P_i; the cross entropy between the prediction distribution and the original labels gives the loss function L_s of the labeled samples in the data pool, as shown in formula (1):
L_s = (1/L) · Σ_{i=1..L} S_ce( y_i, P_m(y | A_w(x_i)) )    (1)
where y_i is the original label, P_m(y|·) denotes the posterior probability distribution of the task model, i.e. the confidence of the predicted label, S_ce(·,·) denotes the cross entropy between two probability distributions, and A_w(x_i) is the weak enhancement function.
In the SSL module, the network is trained mainly with the idea of consistency regularization, which rests on a basic assumption: when the same picture is perturbed by different enhancements, the network should output the same prediction. Weak enhancement and strong enhancement are applied to the unlabeled data set respectively, where weak enhancement follows a flip-and-shift strategy and strong enhancement follows the CTAugment strategy; a pseudo label is obtained from the weakly enhanced view, the pseudo label is then used to supervise the output of the strongly enhanced view, and the loss between the two is computed to supervise the training of the network.
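The following is a minimal sketch of this consistency loss under stated assumptions: a PyTorch image classifier, RandAugment standing in for CTAugment (which torchvision does not provide), and a fixed threshold in place of the adaptive threshold introduced below.

```python
# A minimal sketch of the consistency loss, assuming a PyTorch image classifier.
# RandAugment stands in for CTAugment (not provided by torchvision), and a fixed
# threshold tau is used here in place of the adaptive threshold described below.
import torch
import torch.nn.functional as F
from torchvision import transforms

# Applied per image (PIL) in the unlabeled dataset, giving two views of each sample.
weak_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),   # "shift" via a padded random crop
    transforms.ToTensor(),
])
strong_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.RandAugment(),               # stand-in for CTAugment
    transforms.ToTensor(),
])

def consistency_loss(model, x_weak, x_strong, tau=0.95):
    """x_weak / x_strong: the same unlabeled batch under weak and strong enhancement."""
    with torch.no_grad():
        probs = torch.softmax(model(x_weak), dim=1)
        conf, pseudo = probs.max(dim=1)       # pseudo labels from the weak view
        mask = (conf > tau).float()           # keep only confident pseudo labels
    per_sample = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    return (per_sample * mask).mean()         # unlabeled loss in the spirit of formula (4)
```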
In each iteration the learning ability of the network changes dynamically and is gradually enhanced, so an adaptive dynamic threshold is used to filter the pseudo labels. The adaptive threshold τ_t used when the network has been trained to step t is defined in formula (2),
where T_max denotes the maximum number of iterations of threshold adaptation. When t ≥ T_max the threshold is fixed at α; that is, once the network has been trained to a certain degree its representation learning is stable and the conventional fixed threshold can be used directly. When the network representation still fluctuates strongly, i.e. t < T_max, the adaptive threshold is used. The number of pseudo labels that can meet the manual threshold at the current learning stage is counted as in formula (3).
If this count does not grow during the learning iterations, i.e. the model does not produce more pseudo labels above the threshold, the threshold is multiplied by a coefficient smaller than 1 so that the selection criterion is lowered and more samples meeting it take part in training; conversely, if the count grows, the parameters are left unchanged and the basic selection threshold is maintained. The left factor of the product in formula (2) involves S_U, the number of unlabeled samples at the current iteration stage; in the initial stage of iteration S_U is far greater than the count of formula (3), so the threshold increases gradually from 0, ensuring that as many samples as possible participate in the training process.
The network is run on the weakly enhanced samples to obtain a probability distribution; labels whose predicted probability exceeds the dynamic threshold set in formula (2) are selected as sample pseudo labels, the pseudo labels are then used to supervise the output on the strongly enhanced samples, and the loss function of the unlabeled data is computed to supervise the training of the network. The loss function of the unlabeled data is shown in formula (4):
L_u = (1/S_U) · Σ_{i=1..S_U} 1( max_y P_m(y | A_w(x_i)) > τ_t ) · S_ce( ŷ_i, P_m(y | A_s(x_i)) ),  with ŷ_i = arg max_y P_m(y | A_w(x_i))    (4)
where P_m(y|·) denotes the posterior probability distribution of the task model, i.e. the confidence of the predicted label, S_ce(·,·) denotes the cross entropy between two probability distributions, A_w(x_i) is the weak enhancement function, and A_s(x_i) is the strong enhancement function. In the SSL module, according to the adaptive threshold τ_t, the unlabeled data set is partitioned into a small pseudo-label sample set D_l whose confidence exceeds the threshold τ_t and a larger unlabeled sample set D_u whose confidence falls below the threshold τ_t.
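A sketch of how the adaptive threshold and the confidence split can be realized is given below. Because the exact form of formulas (2) and (3) is not reproduced above, the update rule shown (threshold proportional to the share of confident samples, shrunk when that count stops growing, fixed at α after T_max steps) is only an assumption that mirrors the behaviour described.

```python
# Sketch of the adaptive threshold and the confidence split. The exact form of
# formulas (2)-(3) is not reproduced above, so this update rule is an assumption
# that only mirrors the behaviour described in the text.
import numpy as np

def adaptive_threshold(count_t, count_prev, n_unlabeled, t, t_max,
                       alpha=0.95, shrink=0.9):
    if t >= t_max:
        return alpha                                  # representation is stable: fixed threshold
    tau = alpha * count_t / max(n_unlabeled, 1)       # grows from ~0 early in training
    if count_t <= count_prev:                         # no new confident pseudo labels
        tau *= shrink                                 # coefficient < 1 relaxes the threshold
    return tau

def split_by_confidence(probs, tau):
    """probs: (N, C) softmax outputs on the weakly enhanced unlabeled samples.
    Returns index arrays for the high-confidence set D_l and low-confidence set D_u."""
    conf = probs.max(axis=1)
    return np.where(conf > tau)[0], np.where(conf <= tau)[0]
```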
Step 2: the inner and outer regions of each cluster are obtained with a density peak clustering method; this step is divided into 4 sub-steps.
Sub-step 2-1: for each unlabeled sample point x_i, two quantities are calculated: the local density ρ_i of the sample point and the minimum distance δ_i from the point to any point of higher local density, as in formulas (5) and (6):
ρ_i = Σ_{j≠i} χ(d_ij − d_c),  where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise    (5)
δ_i = min_{j: ρ_j > ρ_i} d_ij    (6)
where d_ij is the Euclidean distance between sample x_i and sample x_j, and d_c is the cutoff distance (a distance set by the user). The application considers ρ_i and δ_i jointly and takes their product as the cluster-center weight:
y_i = ρ_i · δ_i    (7)
where the larger the value of y_i, the more likely x_i is to be a cluster center, and the smaller the value of y_i, the more likely x_i is a non-center point. The values y_i are sorted from large to small, the first K are selected as cluster centers, and the samples are grouped into K clusters {C_1, C_2, C_3, …, C_k}.
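The sketch below illustrates sub-step 2-1 under the assumption of the classical cutoff-kernel form of density peak clustering; X is an (N, d) feature matrix, and d_c and K are user-chosen.

```python
# Sketch of sub-step 2-1 with the cutoff-kernel form of density peak clustering:
# local density rho_i, distance delta_i to the nearest higher-density point, and
# the center weight y_i = rho_i * delta_i of formula (7).
import numpy as np
from scipy.spatial.distance import cdist

def dpc_center_weights(X, d_c):
    D = cdist(X, X)                              # pairwise Euclidean distances d_ij
    rho = (D < d_c).sum(axis=1) - 1              # local density, excluding the point itself
    delta = np.zeros(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]       # points with strictly higher density
        delta[i] = D[i].max() if len(higher) == 0 else D[i, higher].min()
    return rho, delta, rho * delta               # the K largest weights give the cluster centers

# The remaining samples are then assigned to the cluster of their nearest center,
# yielding the K clusters {C_1, ..., C_k}.
```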
Sub-step 2-2: the silhouette coefficient requires no actual category information of the samples and characterizes the intra-cluster cohesion and inter-cluster separation; it is computed as in formula (8):
S(x_i) = ( b(x_i) − a(x_i) ) / max( a(x_i), b(x_i) )    (8)
where a(x_i) is the average distance from sample x_i to the samples in the same cluster and b(x_i) is the average distance from x_i to the samples outside that cluster. The application defines the silhouette coefficient of a single cluster as the average of the silhouette coefficients of the samples it contains and evaluates the clustering quality of each cluster through it: the higher the silhouette coefficient, the better the clustering effect. The silhouette coefficients {S_1, S_2, S_3, …, S_k} of the clusters {C_1, C_2, C_3, …, C_k} are computed and sorted from large to small; the sorted result is assumed to be S_1 > S_2 > … > S_k.
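A sketch of this per-cluster evaluation is given below; it follows the definition above, in which b(x_i) is the average distance to all samples outside the cluster rather than to the nearest other cluster only, as in the standard silhouette.

```python
# Sketch of the per-cluster silhouette evaluation of formula (8), following the
# document's definition of b(x_i). `labels` holds the cluster index of each row of X.
import numpy as np
from scipy.spatial.distance import cdist

def cluster_silhouettes(X, labels):
    D = cdist(X, X)
    labels = np.asarray(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                                  # exclude the point itself
        a = D[i, same].mean() if same.any() else 0.0     # average intra-cluster distance
        b = D[i, labels != labels[i]].mean()             # average distance to other clusters
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    # silhouette of a cluster = average silhouette of the samples it contains
    return {int(k): float(s[labels == k].mean()) for k in np.unique(labels)}
```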
Sub-step 2-3: for each of the K clusters, k ∈ {1, 2, …, K}, compute the average distance d_k from all points other than the cluster center u_k to the cluster center, as shown in formula (9):
d_k = ( 1 / (n_k − 1) ) · Σ_{x_i^k ∈ C_k, x_i^k ≠ u_k} || x_i^k − u_k ||    (9)
In formula (9), n_k denotes the number of samples in the k-th cluster, C_k denotes the set of all samples of the k-th cluster, and x_i^k denotes the i-th sample of C_k other than the cluster center. Within each cluster, the samples whose distance to the cluster center is larger than d_k constitute the outer-region set, and those whose distance is smaller than d_k constitute the inner-region set.
Sub-step 2-4: the application considers the silhouette coefficient and the membership degree jointly, taking the inner-region points of clusters with high silhouette coefficients to be of higher sample quality and the outer-region points of clusters with low silhouette coefficients to be of poorer quality. The silhouette coefficients are sorted from large to small, assumed to be S_1 > S_2 > S_3 > … > S_k; starting from S_1, the intersection of the inner-region sample points with D_l is selected in turn as the finally generated pseudo-label data set D_cl and added to the labeled data pool. Similarly, the silhouette coefficients are sorted from small to large, assumed to be S_k < S_{k-1} < … < S_1; starting from S_k, the intersection of the outer-region sample points with D_u is selected in turn as the data set D_cu finally fed into the active learning module.
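The following sketch covers sub-steps 2-3 and 2-4. The number of clusters taken from each end of the silhouette ranking is an assumed parameter, since the text only states that the clusters are traversed in order; centers, silhouettes, D_l and D_u come from the earlier sketches.

```python
# Sketch of sub-steps 2-3 and 2-4. `centers[k]` is the index of cluster k's center,
# `silhouettes[k]` its silhouette coefficient, and D_l / D_u are the high- and
# low-confidence index sets from the SSL split. n_top / n_bottom are assumed
# parameters controlling how many clusters are taken from each end of the ranking.
import numpy as np

def inner_outer_split(X, labels, centers):
    labels = np.asarray(labels)
    inner, outer = {}, {}
    for k, c in centers.items():
        members = np.where(labels == k)[0]
        members = members[members != c]                   # exclude the center itself
        dist = np.linalg.norm(X[members] - X[c], axis=1)
        d_k = dist.mean()                                 # formula (9)
        inner[k] = set(members[dist <= d_k].tolist())
        outer[k] = set(members[dist > d_k].tolist())
    return inner, outer

def build_candidate_sets(inner, outer, silhouettes, D_l, D_u, n_top, n_bottom):
    order = sorted(silhouettes, key=silhouettes.get, reverse=True)
    D_cl = set().union(*(inner[k] & set(D_l) for k in order[:n_top]))           # pseudo-label set
    D_cu = set().union(*(outer[k] & set(D_u) for k in order[::-1][:n_bottom]))  # sent to AL module
    return D_cl, D_cu
```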
Step 3: the selection strategy based on multi-index fusion, divided into the following 3 sub-steps.
Sub-step 3-1: the uncertainty index based on information entropy. The application adopts information entropy to measure uncertainty: the larger the entropy, the harder it is for the current classifier to distinguish the category of the sample. The entropy of a sample is defined in formula (10):
f(x_i) = − Σ_{y ∈ Y} P(y | x_i, θ_L) · log P(y | x_i, θ_L)    (10)
where Y is the set of all possible class values (for a binary classification problem, Y takes the values 0 and 1), θ_L is the classifier trained on the current labeled sample set, and P(y | x_i, θ_L) is the probability the current classifier assigns to unlabeled sample x_i being predicted as 0 or 1. The sample with the greatest entropy is selected, as shown in formula (11):
i* = arg max_i f(x_i)    (11)
where arg max returns the index corresponding to the maximum value, and i* denotes the sample selected in each round of iteration.
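A short sketch of this uncertainty index follows; probs is assumed to hold the classifier probabilities P(y | x_i, θ_L) for the candidate samples.

```python
# Sketch of the entropy-based uncertainty index of formulas (10)-(11). `probs`
# is an (N, C) array holding P(y | x_i, theta_L) for the candidate samples under
# the classifier trained on the current labeled set.
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)    # f(x_i) of formula (10)

def most_uncertain(probs, n_select):
    f = entropy_uncertainty(probs)
    return np.argsort(-f)[:n_select]                     # the samples hardest to classify
```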
Sub-step 3-2: during the iterative process, the Euclidean distance between the current labeled samples and the unlabeled samples is computed; the smaller the distance, the closer and more similar the samples. For an unlabeled sample x_i, the minimum Euclidean distance d(x_i) between it and all labeled samples measures the representativeness of the sample, as shown in formula (12):
d(x_i) = min_j || x_i − x_j ||,  where x_j ranges over the current labeled samples    (12)
The larger d(x_i) is, the farther the sample lies from the current labeled samples and the higher its representativeness index.
Sub-step 3-3: the multi-index fusion method. To reduce the high redundancy of samples extracted under a single index, the application fuses the uncertainty and representativeness indexes in weighted form and uses the resulting fusion index as the final sample selection criterion, as shown in formula (13):
g_β = β · f(x_i) + (1 − β) · d(x_i)    (13)
where β ∈ [0,1] is the weight coefficient: when 0.5 < β ≤ 1 uncertainty carries more weight, when 0 ≤ β < 0.5 representativeness carries more weight, and when β = 0.5 uncertainty and diversity are weighted equally. Because the classifier parameters keep changing in each iteration, a fixed β can hardly guarantee that the extracted samples remain optimal. The application therefore sets the value set of β to {0.1, 0.2, …, 0.9}; under these 9 different values of β, 9 different sets of K samples are selected and sent to the SSL module for training, the loss of each of the 9 data sets is computed according to formula (4), and a larger loss means the current fusion index is more valuable. The β* with the largest loss is chosen as shown in formula (14):
β* = arg max_{β ∈ {0.1, …, 0.9}} L_u(β)    (14)
where L_u(β) denotes the loss of formula (4) computed on the sample set selected with weight β.
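A sketch of the representativeness index of formula (12) and the fused score of formula (13) is given below; the min-max normalization of the two indices before weighting is an assumption made so the weighted sum is scale-balanced, and is not specified in the text.

```python
# Sketch of the representativeness index of formula (12) and the fused score of
# formula (13). Min-max normalizing the two indices before weighting is an
# assumption; the text does not specify a normalization.
import numpy as np
from scipy.spatial.distance import cdist

def representativeness(X_unlabeled, X_labeled):
    return cdist(X_unlabeled, X_labeled).min(axis=1)     # d(x_i): distance to nearest labeled sample

def fusion_scores(probs, X_unlabeled, X_labeled, beta):
    f = entropy_uncertainty(probs)                       # f(x_i) from the previous sketch
    d = representativeness(X_unlabeled, X_labeled)
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    return beta * norm(f) + (1.0 - beta) * norm(d)       # g_beta of formula (13)

def select_top_k(probs, X_unlabeled, X_labeled, beta, k):
    return np.argsort(-fusion_scores(probs, X_unlabeled, X_labeled, beta))[:k]
```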
β * the corresponding first K values are { A } 1 ,A 2 ,A 3 …A k And extracting the data set and sending the data set to an expert for marking, sending the marked data set to a marked data pool after marking by the expert, and retraining the marked data set by the network on the basis of the existing marked data set at the SSL module. The whole CSSAL algorithm process is iterated continuously until the model reaches the preset precision.
In conclusion, the CSSAL framework effectively reduces the cost of manual labeling while the model reaches its preset performance. First, pseudo labels are generated for samples whose predicted probability exceeds the set threshold; second, the DPC clustering module divides the data set into high-confidence and low-confidence regions, so that pseudo labels are kept only for samples above the threshold and inside the high-confidence region, while samples below the threshold and inside the low-confidence region are sent to the third part, the AL module. In the AL module, samples that are both uncertain and representative are selected through the multi-index fusion strategy and sent to an expert for labeling; the labeled data are then added to the labeled data pool, the task model is retrained on the latest labels, and the whole process iterates until the model reaches the preset precision.
Based on the same inventive concept, the sample selection device based on clustering and semi-supervised active learning disclosed by the embodiment of the application comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the sample selection method based on clustering and semi-supervised active learning is realized when the computer program is loaded into the processor.
It is to be understood that the above-described embodiments are provided to illustrate the present application and not to limit its scope, and that various equivalent modifications made after reading the present application will fall within the scope of the appended claims.
Claims (4)
1. A training sample selection method based on clustering and active learning, the method comprising the steps of:
step 1: using a dynamic adaptive threshold setting method;
step 2: dividing the clusters into inner and outer areas by using a density peak clustering method, marking samples with high confidence and belonging to the inner areas with pseudo labels, and sending samples with low confidence and belonging to the outer areas into an active learning module;
step 3: selecting samples to be labeled by an expert using a multi-index fusion method.
2. The training sample selection method based on clustering and active learning as claimed in claim 1, wherein in step 1 a dynamic adaptive threshold setting method is used, the method performs weak enhancement and strong enhancement on the unlabeled data set respectively, the weak enhancement being based on a flip-and-shift enhancement strategy and the strong enhancement on the CTAugment strategy, a pseudo label is obtained through the weak enhancement, the output of the strong enhancement is then supervised with the pseudo label, and the loss between the two is computed so that the network can be trained under supervision, wherein y_i is the original label, P_m(y|·) denotes the posterior probability distribution of the task model, i.e. the confidence of the predicted label, S_ce(·,·) denotes the cross entropy between two probability distributions, and A_w(x_i) is the weak enhancement function,
in each iteration the learning ability of the network changes dynamically and is gradually enhanced, so the application provides the adaptive dynamic threshold for filtering the pseudo labels, the adaptive threshold τ_t when the network has been trained to step t being given by formula (2),
wherein T_max denotes the maximum number of iterations of threshold adaptation, when t ≥ T_max the threshold is fixed at α, that is, once the network has been trained to a certain degree its representation learning is stable and the conventional fixed threshold is used directly, and when the network representation still fluctuates strongly, i.e. t < T_max, the adaptive threshold is used, wherein the function defined in formula (3) represents the number of pseudo labels capable of satisfying the manual threshold at the current learning stage,
according to formula (2), if this count does not grow during the learning iterations, i.e. the model does not produce more pseudo labels above the threshold, it is multiplied by a coefficient smaller than 1 so that the selection threshold is lowered and more samples meeting the threshold criterion take part in training; conversely, if the count grows, the parameters are left unchanged and the basic selection threshold is maintained, wherein the left factor of the product in formula (2) involves S_U, the number of unlabeled samples at the current iteration stage, and at the initial iteration stage S_U is far greater than this count, so the threshold increases gradually from 0, ensuring that as many samples as possible participate in the training process; the network is run on the weakly enhanced samples to obtain a probability distribution, labels whose predicted probability exceeds the dynamic threshold set in formula (2) are selected as sample pseudo labels, the pseudo labels are then used to supervise the output on the strongly enhanced samples, and the loss function of the unlabeled data is computed to supervise the training of the network, the loss function of the unlabeled data being shown in formula (4),
wherein P_m(y|·) denotes the posterior probability distribution of the task model, i.e. the confidence of the predicted label, S_ce(·,·) denotes the cross entropy between two probability distributions, A_w(x_i) is the weak enhancement function and A_s(x_i) is the strong enhancement function, and in the SSL module, according to the adaptive threshold τ_t, the unlabeled data set is partitioned into a small pseudo-label sample set D_l whose confidence exceeds the threshold τ_t and a larger unlabeled sample set D_u whose confidence falls below the threshold τ_t.
3. The training sample selection method based on clustering and active learning according to claim 1, wherein the step 2 of dividing the clusters into inner and outer areas using a density peak clustering method is performed by the sub-steps of:
sub-step 2-1: for each unlabeled sample point x_i, two quantities are calculated, the local density ρ_i of the sample point and the minimum distance δ_i from the point to any point of higher local density, as defined in formula (5) and formula (6),
wherein d_ij is the Euclidean distance between sample x_i and sample x_j and d_c is the cutoff distance (a distance set by the user), and the present application considers ρ_i and δ_i jointly and takes their product as the cluster-center weight, defined in formula (7):
y_i = ρ_i · δ_i    (7)
wherein the larger the value of y_i, the more likely x_i is to be a cluster center, and the smaller the value of y_i, the more likely x_i is a non-center point, the values y_i are sorted from large to small, the first K are selected as cluster centers, and the samples are grouped into K clusters {C_1, C_2, C_3, …, C_k},
sub-step 2-2: the silhouette coefficient is calculated as the clustering evaluation criterion, wherein the silhouette coefficient requires no actual category information of the samples and characterizes the intra-cluster cohesion and inter-cluster separation, computed as in formula (8),
wherein a(x_i) is the average distance from sample x_i to the samples in the same cluster and b(x_i) is the average distance from x_i to the samples outside that cluster, the application defines the silhouette coefficient of a single cluster as the average of the silhouette coefficients of the samples in the cluster and evaluates the clustering quality of each cluster through it, the higher the silhouette coefficient the better the clustering effect, and the silhouette coefficients {S_1, S_2, S_3, …, S_k} of the clusters {C_1, C_2, C_3, …, C_k} are computed and sorted from large to small, the sorted result being assumed to be S_1 > S_2 > … > S_k,
sub-step 2-3: the membership degree of the samples is calculated to divide the inner and outer regions, wherein for each of the K clusters, k ∈ {1, 2, …, K}, the average distance d_k from all points other than the cluster center u_k to the cluster center is computed as in formula (9),
wherein in formula (9) n_k denotes the number of samples in the k-th cluster, C_k denotes the set of all samples of the k-th cluster, and x_i^k denotes the i-th sample of C_k other than the cluster center, and within each cluster the samples whose distance to the cluster center is larger than d_k constitute the outer-region set and those whose distance is smaller than d_k constitute the inner-region set,
sub-step 2-4: the threshold and the clustering are combined to perform sample screening, wherein the application considers the silhouette coefficient and the membership degree jointly, taking the inner-region points of clusters with high silhouette coefficients to be of higher sample quality and the outer-region points of clusters with low silhouette coefficients to be of poorer quality, the silhouette coefficients are sorted from large to small, assumed to be S_1 > S_2 > S_3 > … > S_k, and starting from S_1 the intersection of the inner-region sample points with D_l is selected in turn as the finally generated pseudo-label data set D_cl and added to the labeled data pool, and similarly the silhouette coefficients are sorted from small to large, assumed to be S_k < S_{k-1} < … < S_1, and starting from S_k the intersection of the outer-region sample points with D_u is selected in turn as the data set D_cu finally fed into the active learning module.
4. The training sample selection method based on clustering and active learning according to claim 1, wherein in step 3, samples are selected for expert labeling by using a multi-index fusion method, and the implementation of the step is divided into the following sub-steps:
in the substep 3-1, the information entropy is used for measuring the uncertainty index, the larger the entropy value is, the harder the current classifier is to distinguish the category to which the sample belongs, and the entropy value definition of the sample is shown in a formula (10):
wherein Y is a set of all possible classification values, taking the classification problem as an example, and the values of Y are 0 and 1; θ L Training the obtained classifier on the current marked sample set; p (y|x) i ,θ L ) For the current classifier to unlabeled sample x i A probability value predicted to be 0 or 1, and for each sample, a sample with the maximum information entropy is selected, as shown in formula (11):
wherein, the value of the max function is the index value corresponding to the maximum value; i.e * The number of samples selected for each round of iteration,
sub-step 3-2, calculating Euclidean distance between the current marked sample and the unmarked sample in the iterative process by using a representative index based on the distance, wherein the smaller the distance is, the closer the distance is to the sample, the higher the similarity of the sample is, and the sample x is to a certain unmarked sample i With the minimum value d (x) of Euclidean distance between the sample and all the marked samples i ) The representativeness of the sample tag is measured as shown in equation (12):
d(x i ) The larger the sample, the farther the sample is from the current marked sample, the higher the representative index of the sample,
and 3-3, fusing uncertainty and the representative 2 indexes in a weighted form, and taking the obtained fusion index as a selection standard of a final sample, wherein a fusion index formula is shown in (13):
g β =β*f(x i )+(1-β)d(x i ) (13)
wherein, beta is E [0,1 ]]As the weight coefficient, when 0.5<When beta is less than or equal to 1, the uncertainty occupies more weight; when 0 is less than or equal to beta<At 0.5, the representative weight is greater; when beta=0.5, uncertainty is the same as the weight of diversity, because the parameters of the classifier are always changed in each iteration process, a fixed beta value is difficult to ensure that samples extracted in the change process are optimal, the application sets the value set of beta to {0.1,0.2 … 0.9}, when beta takes 9 different values, 9 different K data sets can be selected, the data sets are sent to an SSL module for training, loss functions corresponding to the 9 data sets are calculated respectively according to a formula (4), the larger the loss is more valuable to the current fusion index, and the beta with the largest loss is set * The value of (2) is shown in formula (14):
β * the corresponding first K values are { A } 1, A 2 ,A 3 …A k Extracting the data set and sending the data set to an expert for marking, sending the labeled data set to a marked data pool after marking by the expert, and retraining the data set by the network on the basis of the existing marked data set in the SSL module, wherein the whole process is iterated continuously until the model reaches the preset precision.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310493334.6A CN116662832A (en) | 2023-04-28 | 2023-04-28 | Training sample selection method based on clustering and active learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310493334.6A CN116662832A (en) | 2023-04-28 | 2023-04-28 | Training sample selection method based on clustering and active learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116662832A true CN116662832A (en) | 2023-08-29 |
Family
ID=87719682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310493334.6A Pending CN116662832A (en) | 2023-04-28 | 2023-04-28 | Training sample selection method based on clustering and active learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116662832A (en) |
-
2023
- 2023-04-28 CN CN202310493334.6A patent/CN116662832A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117112871A (en) * | 2023-10-19 | 2023-11-24 | 南京华飞数据技术有限公司 | Data real-time efficient fusion processing method based on FCM clustering algorithm model |
CN117112871B (en) * | 2023-10-19 | 2024-01-05 | 南京华飞数据技术有限公司 | Data real-time efficient fusion processing method based on FCM clustering algorithm model |
CN118520304A (en) * | 2024-07-19 | 2024-08-20 | 国网山西省电力公司营销服务中心 | Deep learning multilayer active power distribution network situation awareness and assessment method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication |