US20220335085A1 - Data selection method, data selection apparatus and program - Google Patents
- Publication number: US20220335085A1 (application US 17/631,396)
- Authority: United States
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
Definitions
- In the sample selection process (S107), the sampling process unit 12 performs sampling on each of the S clusters for sample selection. For example, for each cluster for sample selection, the sampling process unit 12 selects as the sample the data piece b whose feature information has the minimum distance (the distance between vectors) from the center u_y of the cluster.
- The data piece at the center of the cluster is considered to be the data piece that most strongly expresses the common feature of the cluster. Noise reduction can also be expected, because such a data piece can be regarded as the average of the data pieces belonging to the cluster.
- The sampling process unit 12 outputs each sample (each data piece b) selected from the S clusters as a data piece to be labeled.
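The nearest-to-center rule for step S107 can be sketched as follows. This is a minimal illustration assuming feature information is held as NumPy row vectors and cluster assignments are already known; the function name is illustrative, not the patent's.

```python
import numpy as np

def sample_nearest_to_center(features, assignments, cluster):
    """Pick the data piece whose feature vector is closest to the center
    u_y of the given cluster, i.e. the piece that most strongly expresses
    the cluster's common feature."""
    idx = np.flatnonzero(np.asarray(assignments) == cluster)
    members = features[idx]
    u_y = members.mean(axis=0)                     # cluster center
    dists = np.linalg.norm(members - u_y, axis=1)  # distance to center
    return int(idx[dists.argmin()])                # index into `features`
```
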
- As described above, the feature information of each data piece is obtained (extracted) by using a feature extractor that has been trained with pseudo labels, which can be generated automatically from only the data set A of labeled data pieces and the data set B of unlabeled data pieces. Sampling is therefore performed in a feature space that is effective for the target task, and as a result it is possible to select data pieces to be labeled that are effective for the target task from among data sets of unlabeled data pieces.
- In addition, a cluster including only unlabeled data pieces is selected in sampling, and the data piece closest to the center of that cluster is selected. It therefore becomes possible to select data pieces in a range of the feature space that the labeled data pieces cannot cover, and thereby to reduce the cost of the labeling operation.
- FIG. 4 is a diagram showing an example of a functional configuration of a data selection device 10 in the second embodiment.
- In the second embodiment, the method by which the feature extractor generation unit 11 generates pseudo labels differs from that of the first embodiment. Because of this difference, the input of the number S of samples to be additionally labeled to the feature extractor generation unit 11 may not be required in the second embodiment.
- the feature extractor generation unit 11 may use the rotation direction of the image of each data piece as a pseudo label for each data piece.
- the feature extractor generation unit 11 may use the correct permutations of the patches of each data piece as a pseudo label for each data piece.
- the feature extractor generation unit 11 may generate the pseudo label for each data piece by other known methods.
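The rotation-based pseudo-labeling mentioned above can be sketched as follows. Rotation prediction is a common self-supervised pretext task; the function name and the choice of four 90-degree rotations are assumptions for illustration, not the patent's exact recipe.

```python
import numpy as np

def rotation_pseudo_dataset(images):
    """Expand each image into its 0/90/180/270-degree rotations and use
    the rotation index as the pseudo label for self-supervised learning."""
    data, pseudo_labels = [], []
    for img in images:
        for r in range(4):                 # pseudo label = rotation index
            data.append(np.rot90(img, k=r))
            pseudo_labels.append(r)
    return data, pseudo_labels
```
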
- the feature extractor generation unit 11 is an example of a generation unit.
- the sampling process unit 12 is an example of an obtaining unit, a classification unit, and a selection unit.
- the data piece a is an example of a first data piece.
- the data piece b is an example of a second data piece.
Description
- The present invention relates to a data selection method, a data selection device and a program.
- In supervised learning, labeled data pieces corresponding to a target task are required, so there is a demand for reducing the cost of labeling a large number of collected image data pieces. One technology aimed at reducing this cost is active learning, which performs efficient learning by selecting and labeling only a small number of data pieces (samples) from the entire data set, rather than labeling all the image data pieces.
- In the active learning, samples with high contribution to performance improvement are selected (sampled) by an algorithm from unlabeled data pieces using a small number of labeled data pieces and presented to an operator. The operator labels the samples, then the learning is performed, and thereby learning performance can be improved as compared to the case of random sampling.
- Since conventional active learning obtains only one sample per sampling, a technique has been proposed for obtaining multiple samples in a single sampling, adapted to the batch learning of convolutional neural networks (CNN) (Non-Patent Literature 1).
- In Non-Patent Literature 1, the data pieces are mapped into the feature space, and sampling is performed by using an approximation solution algorithm for the k-center problem. Since multiple samples are subsets that inherit the characteristics of the entire data structure in the feature space, the learning close to that in using all data pieces can be performed even when a small number of data pieces are used.
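The core-set sampling of Non-Patent Literature 1 is typically realized with a greedy farthest-point rule for the k-center problem. A minimal sketch under the assumption that features are NumPy row vectors; this is the standard greedy 2-approximation, not the paper's exact code.

```python
import numpy as np

def kcenter_greedy(features, labeled_idx, budget):
    """Greedy farthest-point selection for the k-center problem.

    Starting from the already-labeled points, repeatedly pick the point
    farthest from the current set of centers, so the chosen subset covers
    the whole data structure in the feature space."""
    centers = features[list(labeled_idx)]
    # distance of every point to its nearest current center
    dists = np.min(
        np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2),
        axis=1,
    )
    picked = []
    for _ in range(budget):
        i = int(np.argmax(dists))          # farthest point from all centers
        picked.append(i)
        new_d = np.linalg.norm(features - features[i], axis=1)
        dists = np.minimum(dists, new_d)   # update nearest-center distances
    return picked
```
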
- Non-Patent Literature 1: O. Sener and S. Savarese, "Active Learning for Convolutional Neural Networks: A Core-Set Approach," International Conference on Learning Representations (ICLR), 2018.
- Non-Patent Literature 2: Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze, "Deep Clustering for Unsupervised Learning of Visual Features," Proc. ECCV, 2018.
- However, the above technique has the following problems.
- When data pieces are mapped into a feature space, it is necessary to prepare a feature extractor, and in many cases a learned model prepared in a deep learning framework is used as the feature extractor. As the dataset for preparing such a learned model, the 1000-class classification of the ImageNet dataset or the like is used.
- Therefore, if classification of the data pieces for the target task differs from the classification contents of the dataset used in the prepared learned model, it is impossible to extract a feature that is effective for the target task.
- In the technique of Non-Patent Literature 1, it is important for sampling to select a feature extractor performing mapping into a feature space that is effective for the target task, because the data is referenced in the feature space during sampling; however, it is difficult to evaluate beforehand the feature extractor that is effective for the unlabeled data pieces handled by the active learning.
- The present invention has been made in view of the above points, and an object of the present invention is to make it possible to select data pieces to be labeled, which are effective for a target task, from among data sets of unlabeled data pieces.
- To solve the above problems, a data selection method selects, based on a set of labeled first data pieces and a set of unlabeled second data pieces, a target to be labeled from the set of the second data pieces. The method includes: a classification procedure classifying data pieces belonging to the set of the first data pieces and data pieces belonging to the set of the second data pieces into clusters of the number at least one more than the number of types of the labels; and a selection procedure selecting the second data piece to be labeled from a cluster, from among the clusters, that does not include the first data piece, each of the procedures being performed by a computer.
- It is possible to select data pieces to be labeled, which are effective for a target task, from among data sets of unlabeled data pieces.
- FIG. 1 is a diagram showing an example of a hardware configuration of a data selection device 10 in a first embodiment.
- FIG. 2 is a diagram showing an example of a functional configuration of the data selection device 10 in the first embodiment.
- FIG. 3 is a flowchart for illustrating an example of processing procedures executed by the data selection device 10 in the first embodiment.
- FIG. 4 is a diagram showing an example of a functional configuration of a data selection device 10 in a second embodiment.
- Hereinafter, embodiments of the present invention will be described based on the drawings.
- In the embodiment, based on the input of data sets of labeled data pieces and data sets of unlabeled data pieces that are candidates for labeling, a feature extractor is generated by use of a framework of unsupervised feature expression learning. Note that the number of data pieces in the data set of the labeled data pieces is smaller than the number of data pieces in the data set of the unlabeled data pieces. The unsupervised feature expression learning is to generate a feature extractor effective for a target task by self-supervised learning, which uses supervisory signals that can be automatically generated for input data. The unsupervised feature expression learning includes techniques such as Deep Clustering (Non-Patent Literature 2).
- For the feature of each labeled data piece and each unlabeled data piece used in the unsupervised feature expression learning, which is obtained by using the feature extractor obtained by the unsupervised feature expression learning, clustering is performed.
- For each cluster obtained as a result of the clustering, classification is performed into two types of clusters: clusters including labeled data pieces and clusters including no labeled data pieces among the data pieces belonging thereto.
- Of the two types of classification described above, sampling is performed from each cluster including no labeled data pieces, and the data pieces to be labeled are outputted.
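The classification into the two types of clusters above can be sketched as follows, assuming feature vectors have already been clustered by some k-means routine into integer assignments; the helper name and array layout are illustrative.

```python
import numpy as np

def clusters_without_labeled(assignments, n_labeled):
    """Split the clusters into the two types described above and return
    the ones containing no labeled data piece (candidates for sampling).
    Assumes the first `n_labeled` entries of `assignments` correspond to
    the labeled data pieces."""
    assignments = np.asarray(assignments)
    with_labeled = set(assignments[:n_labeled].tolist())
    all_clusters = set(assignments.tolist())
    return sorted(all_clusters - with_labeled)
```
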
- Let us consider an unlabeled data piece that is effective for a target task, and should be labeled. For example, even if an unlabeled data piece having similar characteristics to a labeled data piece is labeled, it is difficult to say that the data piece is effective for the target task. On the other hand, if it is possible to label an unlabeled data piece having characteristics different from those of a labeled data piece, it is assumed that, by performing learning by using the data piece and the label, it becomes possible to learn so that identification taking into account the characteristics can be performed.
- The embodiment aims to select such unlabeled data pieces having characteristics different from those of the labeled data pieces. As described above, the number of unlabeled data pieces is larger than the number of labeled data pieces; in practice, the unlabeled data pieces are expected to form the large majority, because labeling data pieces for use as actual learning data requires a large amount of work. The embodiment aims to extract, from such a large number of data pieces, those that should be labeled, and can thereby increase, for example, the accuracy of estimation after labeling.
- According to the above method, it is possible to use any feature expression learning technique, and it is possible to eliminate the need for preparation of the learned model.
FIG. 1 is a diagram showing an example of a hardware configuration of a data selection device 10 in the first embodiment. The data selection device 10 in FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a CPU 104, and an interface device 105 that are connected to one another via a bus B.
- The programs implementing the processes in the data selection device 10 are provided by a recording medium 101, such as a CD-ROM. When the recording medium 101 storing the programs is set in the drive device 100, the programs are installed from the recording medium 101 to the auxiliary storage device 102 via the drive device 100. However, the programs do not necessarily have to be installed from the recording medium 101, and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed programs, as well as necessary files, data, and the like.
- When an instruction to start the programs is provided, the memory device 103 reads the programs from the auxiliary storage device 102 and stores them. The CPU 104 executes the functions related to the data selection device 10 in accordance with the programs stored in the memory device 103. The interface device 105 is used as an interface to connect to the network.
- FIG. 2 is a diagram showing an example of a functional configuration of the data selection device 10 in the first embodiment. In FIG. 2, the data selection device 10 includes a feature extractor generation unit 11 and a sampling process unit 12. Each of these units is implemented by processes that one or more programs installed in the data selection device 10 cause the CPU 104 to execute.
- The feature extractor generation unit 11 outputs a feature extractor with a data set A of labeled data pieces, a data set B of unlabeled data pieces, and the number S of samples to be additionally labeled as input. The data set A of the labeled data pieces refers to a set of labeled image data pieces. The data set B of the unlabeled data pieces refers to a set of unlabeled image data pieces.
- The sampling process unit 12 selects data pieces to be labeled with the data set A of the labeled data pieces, the data set B of the unlabeled data pieces, the number S of samples to be additionally labeled, and the feature extractor as input.
- Hereinafter, the processing procedures executed by the data selection device 10 will be described. FIG. 3 is a flowchart for illustrating an example of processing procedures executed by the data selection device 10 in the first embodiment.
- In
step S101, the feature extractor generation unit 11 executes a pseudo label generation process with the data set A of the labeled data pieces, the data set B of the unlabeled data pieces, and the number S of samples to be additionally labeled as input. In the pseudo label generation process, the feature extractor generation unit 11 provides pseudo labels to each data piece a∈A and each data piece b∈B, and outputs a data set A in which each data piece a is provided with a pseudo label and a data set B in which each data piece b is provided with a pseudo label as pseudo datasets. On this occasion, the feature extractor generation unit 11 performs k-means clustering on the data set A and the data set B, based on the respective intermediate features obtained when each data piece a and each data piece b is inputted into the convolutional neural network (CNN) that is the source of the feature extractor, and thereby provides each data piece with identification information corresponding to the cluster to which it belongs as its pseudo label. Note that the feature extractor generation unit 11 randomly initializes the CNN parameters at first, and uses the sum of the number of data pieces in the data set A and S as the number of clusters in k-means clustering.
- Subsequently, the feature extractor generation unit 11 performs a CNN learning process (S102). In the CNN learning process, the feature extractor generation unit 11 trains the CNN using the pseudo datasets as input. On this occasion, the learning using the pseudo labels is also performed on the data pieces a, which are labeled.
- Steps S101 and S102 are repeated until a learning end condition is satisfied (S103). Whether or not the learning end condition has been satisfied may be determined, for example, by whether the number of repetitions of steps S101 and S102 has reached a predefined number, or by the changes in the error function. When the learning end condition is satisfied, the feature extractor generation unit 11 regards the CNN at that time as the feature extractor and outputs it.
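The k-means pseudo-labeling of step S101 can be sketched as follows. This is a minimal Lloyd's algorithm over stand-in feature arrays; in the embodiment the inputs would be intermediate features of the randomly initialized CNN and k would be the number of data pieces in A plus S. A library k-means would serve equally well.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=20, seed=0):
    """Assign each data piece a pseudo label by k-means on its features:
    the pseudo label is the index of the nearest cluster center."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)           # nearest center = pseudo label
        for c in range(k):
            if np.any(labels == c):         # leave empty clusters as-is
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```
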
- Subsequently, the
sampling process unit 12 performs a feature extraction process (S104). In the feature extraction process, the sampling process unit 12 uses the feature extractor to obtain (extract) the respective feature information (image feature) of each data piece a in the data set A and each data piece b in the data set B. In other words, the sampling process unit 12 inputs each data piece a in the data set A and each data piece b in the data set B into the feature extractor in turn, thereby obtaining the feature information outputted from the feature extractor for each data piece a and each data piece b. Note that the feature information is data expressed in a vector form.
- Subsequently, the
sampling process unit 12 performs a clustering process (S105). In the clustering process, with the feature information of each data piece a, the feature information of each data piece b, and the number S of samples to be additionally labeled as input, the sampling process unit 12 performs k-means clustering on the feature information group, and outputs cluster information (information including the feature information of each data piece and the result of classifying each data piece into a cluster). On this occasion, the sum of the number of data pieces in the data set A and the number S of samples is used as the number of clusters of k-means. In other words, each data piece a and each data piece b is classified into clusters whose number is at least one more than the number of data pieces in the data set A.
- Subsequently, the
sampling process unit 12 performs a cluster selection process (S106). In the cluster selection process, with the data set A, the data set B, and the cluster information as input, the sampling process unit 12 outputs S clusters for sample selection. - The clusters generated by k-means clustering in the clustering process of step S105 fall into two kinds: clusters including both data pieces a∈A and data pieces b∈B, and clusters including only data pieces b.
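- A minimal sketch of the feature extraction (S104) and clustering (S105) steps described above. The learned CNN is replaced by a hypothetical stand-in (a fixed random projection) so the example stays self-contained, and k-means is written out as plain Lloyd iterations rather than a library call:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (cluster labels, cluster centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each feature vector to its nearest center, then recompute.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
extractor = lambda piece: piece @ W      # stand-in for the learned CNN extractor

data_set_A = rng.normal(size=(5, 16))    # labeled data pieces a (toy data)
data_set_B = rng.normal(size=(40, 16))   # unlabeled data pieces b (toy data)
S = 3                                    # samples to be additionally labeled

# S104: obtain the feature information of every data piece in turn.
feats = np.stack([extractor(p) for p in np.vstack([data_set_A, data_set_B])])
# S105: number of clusters = |A| + S.
labels, centers = kmeans(feats, k=len(data_set_A) + S)
```

- With |A| = 5 and S = 3, all 45 feature vectors are partitioned into k = 8 clusters, guaranteeing at least S clusters free of labeled pieces.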
- Since a cluster including the data pieces a is considered to be a set of data pieces that can be identified by learning using the data pieces a, selecting a data piece b included in such a cluster as a sample is considered to yield only a low learning effect.
- On the other hand, the cluster including only the data pieces b without including the data pieces a is considered to include the data pieces that are difficult to identify by learning using the data pieces a; therefore, the data piece b included in the cluster is considered to have high learning effectiveness as a sample.
- Thus, one sample is to be selected from each cluster including only the data pieces b (that is, each cluster that does not include any data piece a); however, since the number of clusters including only the data pieces b is not less than S, the
sampling process unit 12 selects the S clusters to be sampled in this cluster selection process (S106). - Specifically, given the feature information {x_i}_{i=1}^n and the cluster labels {y_i | y_i ∈ {1, . . . , k}}_{i=1}^n, and with the center of the cluster y (the center of the feature information group of the data pieces belonging to the cluster y) defined as u_y = (1/n_y) Σ_{i: y_i = y} x_i (where n_y is the number of data pieces belonging to the cluster y), the
sampling process unit 12 calculates a score value t of the cluster by the following expression:
- t = (1/n_y) Σ_{i: y_i = y} ||x_i − u_y||²
- The
sampling process unit 12 selects, as the clusters for sample selection, the S clusters with the smallest score values t, in ascending order of t. Since the score value t is a variance, a cluster having a low score value t is a cluster with a small variance. As described above, the group of data pieces b belonging to such a small-variance cluster is assumed to be a set of data pieces that either does not exist in the labeled data group or has features that are only weakly represented there. Therefore, the data pieces b selected from such a cluster are considered to be highly influential. - Subsequently, the
sampling process unit 12 performs a sample selection process (S107). In the sample selection process, the sampling process unit 12 performs sampling on each of the S clusters for sample selection. For example, for each cluster for sample selection, the sampling process unit 12 selects, as the sample, the data piece b whose feature information has the minimum distance (the distance between vectors) from the center u_y of the cluster. The data piece at the center of the cluster is considered to be the data piece that most strongly expresses the common feature of the cluster. In addition, a noise reduction effect can also be expected, because such a data piece can be regarded as close to the average of the data pieces belonging to the cluster. The sampling process unit 12 outputs each sample (each data piece b) selected from each of the S clusters as a data piece to be labeled. - As described above, according to the first embodiment, the feature information of each data piece is obtained (extracted) by use of a feature extractor trained with pseudo labels that can be generated automatically from only the data set A of labeled data pieces and the data set B of unlabeled data pieces. Therefore, sampling is performed on a feature space effective for the target task, and as a result, it is possible to select data pieces that are to be labeled and are effective for the target task from among the unlabeled data pieces.
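- The cluster selection (S106) and sample selection (S107) described above can be sketched as follows. The feature matrix, the cluster labels, and the convention that the first n_A rows correspond to the labeled data pieces a are assumptions of this toy example:

```python
import numpy as np

def select_samples(X, labels, n_A, S):
    """S106: among clusters containing no labeled piece a, pick the S with the
    smallest score t = (1/n_y) * sum ||x_i - u_y||^2 (intra-cluster variance).
    S107: from each such cluster, return the index of the data piece whose
    feature vector is closest to the cluster center u_y."""
    is_labeled = np.arange(len(X)) < n_A       # first n_A rows are data pieces a
    scores, centers = {}, {}
    for y in np.unique(labels):
        members = labels == y
        if is_labeled[members].any():          # cluster contains a piece a: skip
            continue
        u_y = X[members].mean(axis=0)
        centers[y] = u_y
        scores[y] = float(np.mean(np.sum((X[members] - u_y) ** 2, axis=1)))
    picks = []
    for y in sorted(scores, key=scores.get)[:S]:   # ascending score value t
        idx = np.flatnonzero(labels == y)
        d = np.linalg.norm(X[idx] - centers[y], axis=1)
        picks.append(int(idx[d.argmin()]))
    return picks                               # indices of data pieces b to label
```

- For instance, with one labeled piece and two pure-unlabeled clusters, the tighter (smaller-variance) cluster is chosen and its most central member is returned as the data piece to be labeled.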
- In addition, since it becomes possible to perform sampling on the feature space without preparing a learned model beforehand, there is no need to select a feature extractor adapted to the target task for the given image data, and the cost of labeling can be reduced because high learning performance can be obtained by providing labels to only a small number of samples.
- Moreover, in the embodiment, a cluster including only unlabeled data pieces is selected in sampling, and the data piece closest to the center of that cluster is selected from it. Therefore, in the embodiment, it becomes possible to select data pieces in a range of the feature space that is not covered by the labeled data pieces, thereby enabling efficient labeling operations at reduced cost.
- Next, a second embodiment will be described. In the second embodiment, description will be given to the points that are different from the first embodiment. Points not specifically mentioned in the second embodiment may be similar to those of the first embodiment.
-
FIG. 4 is a diagram showing an example of a functional configuration of a data selection device 10 in the second embodiment. In the second embodiment, the generation method of the pseudo labels by the feature extractor generation unit 11 is different from the first embodiment. Due to this difference, in the second embodiment, the number S of samples to be additionally labeled need not be inputted to the feature extractor generation unit 11. - For example, in the case where the image of each data piece to be inputted (each data piece a and each data piece b) is randomly rotated, the feature
extractor generation unit 11 may use the rotation direction of the image of each data piece as the pseudo label for that data piece. Alternatively, in the case where the image of each data piece (each data piece a and each data piece b) is divided into patches that are inputted in random order, the feature extractor generation unit 11 may use the correct permutation of the patches of each data piece as the pseudo label for that data piece. Still alternatively, the feature extractor generation unit 11 may generate the pseudo label for each data piece by other known methods. - Note that, in each of the above embodiments, the feature
extractor generation unit 11 is an example of a generation unit. The sampling process unit 12 is an example of an obtaining unit, a classification unit, and a selection unit. The data piece a is an example of a first data piece. The data piece b is an example of a second data piece. - The embodiments of the present invention have been described, but it is not intended to limit the present invention to these specific embodiments. Various modifications and variations can be made within the scope of the gist of the present invention as defined by the following claims.
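- The rotation-based pseudo-labeling of the second embodiment can be sketched as follows (toy numpy images; restricting the rotations to multiples of 90 degrees is an assumption of this sketch, since the description only says the images are randomly rotated):

```python
import numpy as np

def make_rotation_pseudo_dataset(images, seed=0):
    """Rotate each image by a random multiple of 90 degrees and use the
    rotation index (0..3) as its pseudo label."""
    rng = np.random.default_rng(seed)
    rotated, pseudo_labels = [], []
    for img in images:
        r = int(rng.integers(0, 4))          # 0, 90, 180 or 270 degrees
        rotated.append(np.rot90(img, k=r))
        pseudo_labels.append(r)
    return rotated, pseudo_labels
```

- The feature extractor is then trained to predict the pseudo label from the rotated image, which requires no manual labeling of either data set.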
-
-
- 10 Data selection device
- 11 Feature extractor generation unit
- 12 Sampling process unit
- 100 Drive device
- 101 Recording medium
- 102 Auxiliary storage device
- 103 Memory device
- 104 CPU
- 105 Interface device
- B Bus
Claims (20)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2019/029807 WO2021019681A1 (en) | 2019-07-30 | 2019-07-30 | Data selection method, data selection device, and program |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220335085A1 true US20220335085A1 (en) | 2022-10-20 |
Family
ID=74229395
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/631,396 Pending US20220335085A1 (en) | 2019-07-30 | 2019-07-30 | Data selection method, data selection apparatus and program |
Country Status (3)
Country | Link |
---|---|
US (1) | US20220335085A1 (en) |
JP (1) | JP7222429B2 (en) |
WO (1) | WO2021019681A1 (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160232444A1 (en) * | 2015-02-05 | 2016-08-11 | International Business Machines Corporation | Scoring type coercion for question answering |
US20170061625A1 (en) * | 2015-08-26 | 2017-03-02 | Digitalglobe, Inc. | Synthesizing training data for broad area geospatial object detection |
US9911033B1 (en) * | 2016-09-05 | 2018-03-06 | International Business Machines Corporation | Semi-supervised price tag detection |
US20190042879A1 (en) * | 2018-06-26 | 2019-02-07 | Intel Corporation | Entropic clustering of objects |
US20190332667A1 (en) * | 2018-04-26 | 2019-10-31 | Microsoft Technology Licensing, Llc | Automatically cross-linking application programming interfaces |
US20200027002A1 (en) * | 2018-07-20 | 2020-01-23 | Google Llc | Category learning neural networks |
US20200285906A1 (en) * | 2017-09-08 | 2020-09-10 | The General Hospital Corporation | A system and method for automated labeling and annotating unstructured medical datasets |
US20200285939A1 (en) * | 2017-09-28 | 2020-09-10 | D5Ai Llc | Aggressive development with cooperative generators |
US20200334567A1 (en) * | 2019-04-17 | 2020-10-22 | International Business Machines Corporation | Peer assisted distributed architecture for training machine learning models |
US20210056412A1 (en) * | 2019-08-20 | 2021-02-25 | Lg Electronics Inc. | Generating training and validation data for machine learning |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8498950B2 (en) * | 2010-10-15 | 2013-07-30 | Yahoo! Inc. | System for training classifiers in multiple categories through active learning |
US10102481B2 (en) * | 2015-03-16 | 2018-10-16 | Conduent Business Services, Llc | Hybrid active learning for non-stationary streaming data with asynchronous labeling |
US20180032901A1 (en) * | 2016-07-27 | 2018-02-01 | International Business Machines Corporation | Greedy Active Learning for Reducing User Interaction |
CN109844777A (en) * | 2016-10-26 | 2019-06-04 | 索尼公司 | Information processing unit and information processing method |
WO2018116921A1 (en) * | 2016-12-21 | 2018-06-28 | 日本電気株式会社 | Dictionary learning device, dictionary learning method, data recognition method, and program storage medium |
- 2019-07-30 JP: application JP2021536513A, patent JP7222429B2/en, active
- 2019-07-30 WO: application PCT/JP2019/029807, patent WO2021019681A1/en, application filing
- 2019-07-30 US: application US17/631,396, patent US20220335085A1/en, pending
Non-Patent Citations (12)
Title |
---|
Ahn, Euijoon, et al., "Unsupervised Feature Learning with K-means and An Ensemble of Deep Convolutional Neural Networks for Medical Image Classification", arXiv, Cornell University, Document arXiv:1906.03359 [cs.CV], June 7, 2019, 8 pages. * |
Cheng, Yanhua, et al., "Semi-Supervised Learning for RGB-D Object Recognition", ICPR 2014, Stockholm, Sweden, August 24-28, 2014, pp. 2377-2382. * |
Gowda, Harsha S., et al., "Semi-supervised Text Categorization Using Recursive K-means Clustering", arXiv, Cornell University, Document - arXiv:1706.07913v1 [cs.LG], 24 Jun 2017, 10 pages. * |
Kansal, Tushar, et al., "Customer Segmentation using K-means Clustering", CTEMS 2018, Belgaum, India, December 21-22, 2018, pp. 135-139. * |
Knyazev, Boris, et al., "Autoconvolution for Unsupervised Feature Learning", arXiv, Cornell University, Document - arXiv:1606.00611v1 [cs.CV] 2 Jun 2016, 15 pages. * |
Ling, Zhigang, et al., "Semi-Supervised Learning via Convolutional Neural Network for Hyperspectral Image Classification", ICPR 2018, Beijing, China, August 20-24, 2018, pp. 1900-1905. * |
Ngoc, Minh Tran, et al., "Centroid Neural Network with Pairwise Constraints for Semi-supervised Learning", Neural Processing Letters, Vol. 18, February 8, 2018, pp. 1721-1747. * |
Nie, Feiping, et al., "Multi-view Clustering and Semi-Supervised Classification with Adaptive Neighbors", International Journal of Engineering & Technology, Vol. 7 (1.8), February 2018, pp. 81-85. * |
Reddy, Y C A Padmanabha, et al., "Semi-supervised learning: a brief review", International Journal of Engineering & Technology, Vol. 7 (1.8), February 2018, pp. 81-85. * |
Tang, Jian, et al., "PTE: Predictive Text Embedding through Large-Scale Heterogeneous Text Networks", KDD ‘15, Sydney, NSW, Australia, pp. 1165-1174. * |
Wagner, Raimer, et al., "Learning Convolutional Neural Networks From Few Samples", IJCNN 2013, Dallas, TX, Aug. 4-9, 2013, 7 pages. * |
Wang, Xin, et al., "Semi-supervised K-Means Clustering by Optimizing Initial Cluster Centers", WISM 2011, Taiyuan, China, September 24-25, 2011, Proceedings, Part II, LNCS 6988, pp. 178-187. * |
Also Published As
Publication number | Publication date |
---|---|
JP7222429B2 (en) | 2023-02-15 |
WO2021019681A1 (en) | 2021-02-04 |
JPWO2021019681A1 (en) | 2021-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11586988B2 (en) | Method of knowledge transferring, information processing apparatus and storage medium | |
CN108416370B (en) | Image classification method and device based on semi-supervised deep learning and storage medium | |
EP3355244A1 (en) | Data fusion and classification with imbalanced datasets | |
EP2991003B1 (en) | Method and apparatus for classification | |
US20170161645A1 (en) | Method and apparatus for labeling training samples | |
US20170032276A1 (en) | Data fusion and classification with imbalanced datasets | |
CN108491302B (en) | Method for detecting spark cluster node state | |
WO2016205286A1 (en) | Automatic entity resolution with rules detection and generation system | |
US20210117802A1 (en) | Training a Neural Network Using Small Training Datasets | |
CN113449099A (en) | Text classification method and text classification device | |
US20220092407A1 (en) | Transfer learning with machine learning systems | |
US10565478B2 (en) | Differential classification using multiple neural networks | |
WO2021034394A1 (en) | Semi supervised animated character recognition in video | |
US20200117574A1 (en) | Automatic bug verification | |
WO2023134402A1 (en) | Calligraphy character recognition method based on siamese convolutional neural network | |
US11164043B2 (en) | Creating device, creating program, and creating method | |
CN109902731B (en) | Performance fault detection method and device based on support vector machine | |
CN115147632A (en) | Image category automatic labeling method and device based on density peak value clustering algorithm | |
Aljundi et al. | Identifying wrongly predicted samples: A method for active learning | |
US11875114B2 (en) | Method and system for extracting information from a document | |
US20100239168A1 (en) | Semi-tied covariance modelling for handwriting recognition | |
US20220335085A1 (en) | Data selection method, data selection apparatus and program | |
Boillet et al. | Confidence estimation for object detection in document images | |
US20220405299A1 (en) | Visualizing feature variation effects on computer model prediction | |
Xue et al. | Incremental zero-shot learning based on attributes for image classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NIPPON TELEGRAPH AND TELEPHONE CORPORATION, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TSUKATANI, SHUNSUKE;MURASAKI, KAZUHIKO;ANDO, SHINGO;AND OTHERS;SIGNING DATES FROM 20211015 TO 20211029;REEL/FRAME:058816/0452 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |