US20220335085A1 - Data selection method, data selection apparatus and program - Google Patents

Data selection method, data selection apparatus and program

Info

Publication number
US20220335085A1
US20220335085A1 (Application No. US17/631,396)
Authority
US
United States
Prior art keywords
data
data pieces
pieces
cluster
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/631,396
Inventor
Shunsuke TSUKATANI
Kazuhiko MURASAKI
Shingo Ando
Atsushi Sagata
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: TSUKATANI, Shunsuke; MURASAKI, Kazuhiko; ANDO, Shingo; SAGATA, Atsushi
Publication of US20220335085A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass


Abstract

A data selection method selects, based on a set of labeled first data pieces and a set of unlabeled second data pieces, a target to be labeled from the set of the second data pieces. The method includes: a classification procedure that classifies data pieces belonging to the set of the first data pieces and data pieces belonging to the set of the second data pieces into a number of clusters that is at least one more than the number of types of the labels; and a selection procedure that selects the second data piece to be labeled from a cluster, among the clusters, that does not include any first data piece, each of the procedures being performed by a computer. This makes it possible to select data pieces to be labeled, which are effective for a target task, from among data sets of unlabeled data pieces.

Description

    TECHNICAL FIELD
  • The present invention relates to a data selection method, a data selection device and a program.
  • BACKGROUND ART
  • In supervised learning, since labeled data pieces corresponding to a target task are required, there is a demand for reducing the cost of labeling a large number of collected image data pieces. One of the technologies aimed at reducing the cost of the labeling operation is active learning, which performs efficient learning by selecting a small number of data pieces (samples) from the entire set and labeling only those, rather than labeling all the image data pieces.
  • In active learning, an algorithm uses a small number of labeled data pieces to select (sample), from the unlabeled data pieces, samples that contribute strongly to performance improvement, and presents them to an operator. The operator labels the samples, learning is then performed, and learning performance can thereby be improved compared with random sampling.
  • Since conventional active learning obtains only one sample per sampling, a technique has been proposed that obtains multiple samples in a single sampling and is thus suited to batch learning of convolutional neural networks (CNNs) (Non-Patent Literature 1).
  • In Non-Patent Literature 1, the data pieces are mapped into the feature space, and sampling is performed by using an approximation algorithm for the k-center problem. Since the multiple samples form a subset that inherits the characteristics of the entire data structure in the feature space, learning close to that achieved with all data pieces can be performed even when only a small number of data pieces are used.
  • CITATION LIST
  • Non-Patent Literature
    • Non-Patent Literature 1: Active Learning for Convolutional Neural Networks: A Core-Set Approach, O. Sener, S. Savarese, International Conference on Learning Representations (ICLR), 2018.
    • Non-Patent Literature 2: Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze, "Deep Clustering for Unsupervised Learning of Visual Features", Proc. ECCV (2018).
    SUMMARY OF THE INVENTION
    Technical Problem
  • However, the above technique has the following problems.
  • When data pieces are mapped into a feature space, a feature extractor must be prepared, and in many cases a learned model provided in a deep learning framework is used as the feature extractor. As the dataset for such a prepared learned model, the 1000-class classification task of the ImageNet dataset or the like is used.
  • Therefore, if classification of the data pieces for the target task differs from the classification contents of the dataset used in the prepared learned model, it is impossible to extract a feature that is effective for the target task.
  • In the technique of Non-Patent Literature 1, because the data is referenced in the feature space during sampling, it is important to select a feature extractor that maps into a feature space effective for the target task; however, it is difficult to evaluate beforehand which feature extractor is effective for the unlabeled data pieces handled in active learning.
  • The present invention has been made in view of the above points, and an object of the present invention is to make it possible to select data pieces to be labeled, which are effective for a target task, from among data sets of unlabeled data pieces.
  • Means for Solving the Problem
  • To solve the above problems, a data selection method selects, based on a set of labeled first data pieces and a set of unlabeled second data pieces, a target to be labeled from the set of the second data pieces. The method includes: a classification procedure that classifies data pieces belonging to the set of the first data pieces and data pieces belonging to the set of the second data pieces into a number of clusters that is at least one more than the number of types of the labels; and a selection procedure that selects the second data piece to be labeled from a cluster, among the clusters, that does not include any first data piece, each of the procedures being performed by a computer.
  • Effects of the Invention
  • It is possible to select data pieces to be labeled, which are effective for a target task, from among data sets of unlabeled data pieces.
  • BRIEF DESCRIPTION OF DRAWINGS
  • FIG. 1 is a diagram showing an example of a hardware configuration of a data selection device 10 in a first embodiment.
  • FIG. 2 is a diagram showing an example of a functional configuration of the data selection device 10 in the first embodiment.
  • FIG. 3 is a flowchart for illustrating an example of processing procedures executed by the data selection device 10 in the first embodiment.
  • FIG. 4 is a diagram showing an example of a functional configuration of a data selection device 10 in a second embodiment.
  • DESCRIPTION OF EMBODIMENTS
  • Hereinafter, embodiments of the present invention will be described based on the drawings.
  • In the embodiment, based on the input of data sets of labeled data pieces and data sets of unlabeled data pieces that are candidates for labeling, a feature extractor is generated by use of a framework of unsupervised feature expression learning. Note that the number of data pieces in the data set of the labeled data pieces is smaller than the number of data pieces in the data set of the unlabeled data pieces. The unsupervised feature expression learning is to generate a feature extractor effective for a target task by self-supervised learning, which uses supervisory signals that can be automatically generated for input data. The unsupervised feature expression learning includes techniques such as Deep Clustering (Non-Patent Literature 2).
  • Clustering is then performed on the features of each labeled data piece and each unlabeled data piece used in the unsupervised feature expression learning, the features being obtained with the feature extractor produced by that learning.
  • The clusters obtained as a result of the clustering are classified into two types: clusters that include labeled data pieces among the data pieces belonging to them, and clusters that include no labeled data pieces.
  • Of the two types of classification described above, sampling is performed from each cluster including no labeled data pieces, and the data pieces to be labeled are outputted.
  • Let us consider an unlabeled data piece that is effective for a target task and should therefore be labeled. For example, even if an unlabeled data piece having characteristics similar to a labeled data piece is labeled, it is difficult to say that the data piece is effective for the target task. On the other hand, if an unlabeled data piece having characteristics different from those of the labeled data pieces can be labeled, then learning with that data piece and its label is expected to yield identification that takes those characteristics into account.
  • The embodiment aims to select such unlabeled data pieces having characteristics different from those of the labeled data pieces. As noted above, the number of unlabeled data pieces is larger than the number of labeled data pieces; in practice, the unlabeled data pieces are expected to form the large majority, because labeling data pieces for use as actual learning data requires a large amount of work. The embodiment aims to extract, from such a large number of data pieces, the data pieces that should be labeled, and can thereby increase, for example, the accuracy of estimation achieved through labeling.
  • According to the above method, it is possible to use any feature expression learning technique, and it is possible to eliminate the need for preparation of the learned model.
  • FIG. 1 is a diagram showing an example of a hardware configuration of a data selection device 10 in the first embodiment. The data selection device 10 in FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a CPU 104, and an interface device 105 that are connected to one another via a bus B.
  • The programs implementing the processes in the data selection device 10 are provided by a recording medium 101, such as a CD-ROM. When the recording medium 101 storing the programs is set in the drive device 100, the programs are installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the programs do not necessarily have to be installed from the recording medium 101, and may instead be downloaded from other computers via a network. The auxiliary storage device 102 stores the installed programs, as well as necessary files, data, and the like.
  • When an instruction to start the programs is provided, the memory device 103 reads the programs from the auxiliary storage device 102 and stores them. The CPU 104 executes the functions related to the data selection device 10 in accordance with the programs stored in the memory device 103. The interface device 105 is used as an interface to connect to the network.
  • FIG. 2 is a diagram showing an example of a functional configuration of the data selection device 10 in the first embodiment. In FIG. 2, the data selection device 10 includes a feature extractor generation unit 11 and a sampling process unit 12. Each of these units is implemented by processes that the CPU 104 executes as directed by one or more programs installed in the data selection device 10.
  • The feature extractor generation unit 11 outputs a feature extractor with a data set A of labeled data pieces, a data set B of unlabeled data pieces, and the number S of samples to be additionally labeled as input. The data set A of the labeled data pieces refers to a set of labeled image data pieces. The data set B of the unlabeled data pieces refers to a set of unlabeled image data pieces.
  • The sampling process unit 12 selects data pieces to be labeled with the data set A of the labeled data pieces, the data set B of the unlabeled data pieces, the number S of samples to be additionally labeled, and the feature extractor as input.
  • Hereinafter, the processing procedures executed by the data selection device 10 will be described. FIG. 3 is a flowchart for illustrating an example of processing procedures executed by the data selection device 10 in the first embodiment.
  • In step S101, the feature extractor generation unit 11 executes a pseudo label generation process with the data set A of the labeled data pieces, the data set B of the unlabeled data pieces, and the number S of samples to be additionally labeled as input. In the pseudo label generation process, the feature extractor generation unit 11 provides a pseudo label to each data piece a∈A and each data piece b∈B, and outputs, as pseudo datasets, a data set A in which each data piece a is provided with a pseudo label and a data set B in which each data piece b is provided with a pseudo label. On this occasion, the feature extractor generation unit 11 performs k-means clustering on the data sets A and B based on the intermediate features obtained when each data piece a and each data piece b is input into the convolutional neural network (CNN) that is the source of the feature extractor, and provides each data piece with identification information corresponding to the cluster to which it belongs as its pseudo label. Note that the feature extractor generation unit 11 randomly initializes the CNN parameters at first, and uses the sum of the number of data pieces in the data set A and S as the number of clusters in the k-means clustering.
  • Subsequently, the feature extractor generation unit 11 performs a CNN learning process (S102). In the CNN learning process, the feature extractor generation unit 11 trains the CNN with the pseudo datasets as input. On this occasion, the learning with the pseudo labels is also performed on the data pieces a, even though they are already labeled.
  • Steps S101 and S102 are repeated until a learning end condition is satisfied (S103). Whether the learning end condition has been satisfied may be determined, for example, by whether the number of repetitions of steps S101 and S102 has reached a predefined count, or by the changes in the error function. When the learning end condition is satisfied, the feature extractor generation unit 11 regards the CNN at that time as the feature extractor and outputs it.
  • Thus, in steps S101 to S103, unsupervised feature expression learning is used to generate (learn) the feature extractor. However, the method of generating the pseudo labels is not limited to the method as described above.
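  • As an illustration only, the following Python sketch shows one way the loop of steps S101 to S103 could be realized with PyTorch and scikit-learn, in the spirit of Deep Clustering (Non-Patent Literature 2). The toy CNN, the name train_feature_extractor, and all hyperparameters are assumptions made for exposition, not the patent's implementation.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def train_feature_extractor(data, num_clusters, rounds=5, epochs=3):
    """data: float tensor (N, 3, H, W) holding data sets A and B.
    num_clusters: |A| + S, the patent's choice for k."""
    backbone = nn.Sequential(                       # randomly initialized CNN
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten())      # intermediate feature (N, 256)
    classifier = nn.Linear(16 * 4 * 4, num_clusters)
    opt = torch.optim.SGD(
        list(backbone.parameters()) + list(classifier.parameters()), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(rounds):                         # repeat S101-S102 until S103
        with torch.no_grad():                       # S101: pseudo labels via k-means
            feats = backbone(data).numpy()          #       on intermediate features
        pseudo = torch.tensor(
            KMeans(n_clusters=num_clusters, n_init=10).fit_predict(feats),
            dtype=torch.long)
        for _ in range(epochs):                     # S102: train CNN on pseudo labels
            opt.zero_grad()
            loss_fn(classifier(backbone(data)), pseudo).backward()
            opt.step()
    return backbone                                 # the learned feature extractor
```

  • In this sketch the end condition of step S103 is simplified to a fixed number of repetitions; monitoring the change in the error function, as the text notes, would serve equally.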
  • Subsequently, the sampling process unit 12 performs a feature extraction process (S104). In the feature extraction process, the sampling process unit 12 uses the feature extractor to obtain (extract) respective feature information (image feature) of each data piece a in the data set A and each data piece b in the data set B. In other words, the sampling process unit 12 inputs each data piece a in the data set A and each data piece b in the data set B into the feature extractor in turn, to thereby obtain the feature information, which is outputted from the feature extractor, for each data piece a and each data piece b. Note that the feature information is data expressed in a vector form.
  • Subsequently, the sampling process unit 12 performs a clustering process (S105). In the clustering process, with the feature information of each data piece a, the feature information of each data piece b, and the number S of samples to be additionally labeled as input, the sampling process unit 12 performs k-means clustering on the feature information group, and outputs cluster information (information including the feature information of each data piece and the result of classifying each data piece into a cluster). On this occasion, the sum of the number of data pieces in the data set A and the number S of samples is used as the number of clusters for k-means. In other words, each data piece a and each data piece b are classified into a number of clusters that is at least one more than the number of data pieces in the data set A.
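  • The following is a minimal sketch of the feature extraction and clustering processes (S104 and S105) under the same illustrative assumptions; extractor stands for any callable that maps a data piece to a feature vector (for example, the CNN learned above), and the names cluster_features, data_a and data_b are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(extractor, data_a, data_b, num_samples_s):
    """S104: obtain a feature vector per data piece; S105: k-means clustering
    with k = |A| + S. Data pieces of A come first, then those of B."""
    feats = np.stack([extractor(x) for x in list(data_a) + list(data_b)])
    k = len(data_a) + num_samples_s       # at least one more cluster than |A|
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    return feats, labels                  # the "cluster information"
```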
  • Subsequently, the sampling process unit 12 performs a cluster selection process (S106). In the cluster selection process, with the data set A, the data set B, and the cluster information as input, the sampling process unit 12 outputs S clusters for sample selection.
  • The clusters generated by k-means clustering in the clustering process of step S105 are classified into clusters including both data pieces a∈A and data pieces b∈B, and clusters including only the data pieces b.
  • Since the cluster including the data pieces a is considered to be a set of data pieces that can be identified by learning using the data pieces a, it is considered that selection of a data piece b included in the cluster as a sample will result in a low learning effect.
  • On the other hand, the cluster including only the data pieces b without including the data pieces a is considered to include the data pieces that are difficult to identify by learning using the data pieces a; therefore, the data piece b included in the cluster is considered to have high learning effectiveness as a sample.
  • Thus, one sample is to be selected from each cluster including only the data pieces b (that is, each cluster that does not include the data pieces a); however, since the number of clusters including only the data pieces b is not less than S, the sampling process unit 12 selects, from the clusters formed in step S105, the S clusters to be sampled.
  • Specifically, given the feature information $\{x_i\}_{i=1}^{n}$ and the cluster labels $\{y_i \mid y_i \in \{1, \dots, k\}\}_{i=1}^{n}$, let the center of cluster $y$ (the center of the feature information group of the data pieces belonging to cluster $y$) be $u_y = \frac{1}{n_y} \sum_{i: y_i = y} x_i$, where $n_y$ is the number of data pieces belonging to cluster $y$. The sampling process unit 12 calculates a score value $t$ for the cluster by the following expression:
  • $t = \dfrac{1}{n_y} \sum_{i: y_i = y} \left\| x_i - u_y \right\|^2$ [Math. 1]
  • The sampling process unit 12 selects S clusters as the clusters for sample selection, in ascending order of the score value t, starting with the smallest. Since the score value t is a variance, a cluster having a low score value t is a cluster with a small variance. As described above, the group of data pieces b belonging to such a small-variance cluster is assumed to be a set of data pieces that either is not present in the labeled data group or whose features appear there only weakly. Therefore, the data pieces b selected from such a cluster are considered to be highly influential.
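  • A sketch of the cluster selection process (S106), continuing the conventions above (rows of the data set A come first in feats), follows: clusters containing any labeled data piece a are skipped, the score value t is computed per Math. 1, and the S clusters with the smallest t are kept. The function name is hypothetical.

```python
import numpy as np

def select_clusters(feats, labels, num_labeled, num_samples_s):
    """feats[i] and labels[i] are the feature vector and cluster of data
    piece i; the first num_labeled rows are the labeled data pieces a."""
    scores = {}
    for y in np.unique(labels):
        members = np.where(labels == y)[0]
        if (members < num_labeled).any():     # cluster contains some a: skip
            continue
        u_y = feats[members].mean(axis=0)     # cluster center u_y
        # score t = (1 / n_y) * sum_{i: y_i = y} ||x_i - u_y||^2  (Math. 1)
        scores[y] = np.mean(np.sum((feats[members] - u_y) ** 2, axis=1))
    return sorted(scores, key=scores.get)[:num_samples_s]  # S smallest t
```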
  • Subsequently, the sampling process unit 12 performs a sample selection process (S107). In the sample selection process, the sampling process unit 12 performs sampling on each of the S clusters for sample selection. For example, for each cluster for sample selection, the sampling process unit 12 selects as the sample the data piece b whose feature information has the minimum distance (the distance between vectors) from the center $u_y$ of the cluster. The data piece at the center of the cluster is considered to be the data piece that most strongly expresses the common feature of the cluster. In addition, noise reduction can also be expected because such a data piece can be regarded as close to the average of the data pieces belonging to the cluster. The sampling process unit 12 outputs each sample (each data piece b) selected from each of the S clusters as a data piece to be labeled.
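  • A corresponding sketch of the sample selection process (S107): from each selected cluster, the data piece b whose feature is nearest the cluster center is returned. The names again carry over from the previous illustrative sketches.

```python
import numpy as np

def select_samples(feats, labels, chosen_clusters):
    """Return the index of one data piece b per selected cluster."""
    picks = []
    for y in chosen_clusters:
        members = np.where(labels == y)[0]
        u_y = feats[members].mean(axis=0)               # center u_y
        dists = np.linalg.norm(feats[members] - u_y, axis=1)
        picks.append(int(members[np.argmin(dists)]))    # closest to the center
    return picks                         # indices of the data pieces to label
```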
  • As described above, according to the first embodiment, the feature information of each data piece is obtained (extracted) by using a feature extractor trained with pseudo labels that can be generated automatically from only the data set A of the labeled data pieces and the data set B of the unlabeled data pieces. Sampling is therefore based on a feature space effective for the target task, and as a result, data pieces to be labeled that are effective for the target task can be selected from among the data sets of unlabeled data pieces.
  • In addition, since sampling in the feature space becomes possible without preparing a learned model beforehand, there is no need to select a feature extractor adapted to the target task for the given image data, and the cost of labeling is reduced because high learning performance can be obtained by labeling only a small number of samples.
  • Moreover, in the embodiment, a cluster including only unlabeled data pieces is selected in sampling, and the technique of selecting the data piece closest to the center of the cluster is applied. Therefore, in the embodiment, it becomes possible to select data pieces in a range of the feature space that the labeled data pieces cannot cover, and thereby to carry out the labeling operation efficiently at reduced cost.
  • Next, a second embodiment will be described. In the second embodiment, description will be given to the points that are different from the first embodiment. Points not specifically mentioned in the second embodiment may be similar to those of the first embodiment.
  • FIG. 4 is a diagram showing an example of a functional configuration of a data selection device 10 in the second embodiment. In the second embodiment, a generation method of the pseudo label by the feature extractor generation unit 11 is different from the first embodiment. Due to the differences in the generation method, in the second embodiment, input of the number S of samples to be additionally labeled to the feature extractor generation unit 11 may not be required.
  • For example, in the case where the image of each input data piece (each data piece a and each data piece b) is randomly rotated, the feature extractor generation unit 11 may use the rotation direction of the image as the pseudo label for that data piece. Alternatively, in the case where the image of each data piece (each data piece a and each data piece b) is divided into patches that are input in random order, the feature extractor generation unit 11 may use the correct permutation of the patches as the pseudo label for that data piece. Still alternatively, the feature extractor generation unit 11 may generate the pseudo label for each data piece by other known methods.
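  • As an illustration of the rotation-based variant only, the following sketch rotates each image by a random multiple of 90 degrees and uses the rotation index as the pseudo label, on which the CNN would then be trained; the function name and the four-way rotation scheme are assumptions, since the text leaves these details open.

```python
import torch

def rotation_pseudo_labels(images):
    """images: float tensor (N, C, H, W). Returns rotated images and labels."""
    labels = torch.randint(0, 4, (images.shape[0],))   # 0, 90, 180, 270 degrees
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels               # pseudo label = rotation index
```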
  • Note that, in each of the above embodiments, the feature extractor generation unit 11 is an example of a generation unit. The sampling process unit 12 is an example of an obtaining unit, a classification unit, and a selection unit. The data piece a is an example of a first data piece. The data piece b is an example of a second data piece.
  • The embodiments of the present invention have been described, but it is not intended to limit the present invention to these specific embodiments. Various kinds of modifications and variations can be made within the scope of the gist of the present invention defined by the following claims.
  • REFERENCE SIGNS LIST
      • 10 Data selection device
      • 11 Feature extractor generation unit
      • 12 Sampling process unit
      • 100 Drive device
      • 101 Recording medium
      • 102 Auxiliary storage device
      • 103 Memory device
      • 104 CPU
      • 105 Interface device
      • B Bus

Claims (20)

1. A computer-implemented method for selecting target data for labeling using a set of labeled first data pieces and a set of unlabeled second data pieces, the method comprising:
classifying data pieces belonging to the set of the first data pieces and data pieces belonging to the set of the second data pieces into a number of clusters, the number of clusters being at least one more than the number of types of the labels; and
selecting a data piece from the set of the second data pieces to be labeled from a cluster, from among the clusters, that does not include one of the first data pieces.
2. The computer-implemented method according to claim 1, the method further comprising:
generating a feature extractor by using unsupervised feature expression learning based on the set of the first data pieces and the set of the second data pieces; and
obtaining feature information for each of the first data pieces and each of the second data pieces by using the feature extractor, wherein
the classifying classifies the set of the first data pieces and the set of the second data pieces into the clusters based on the feature information.
3. The computer-implemented method according to claim 2, wherein the selecting selects the data piece from the set of the second data pieces to be labeled from a cluster having a relatively small variance of the feature information.
4. The computer-implemented method according to claim 2, wherein the selecting selects, in the cluster that does not include the first data pieces, the data piece from the set of the second data pieces related to the feature information with a minimum distance from a center of the cluster.
5. A data selection device comprising a processor configured to execute a method for selecting, based on a set of labeled first data pieces and a set of unlabeled second data pieces, a target to be labeled from the set of the second data pieces, comprising:
classifying data pieces belonging to the set of the first data pieces and data pieces belonging to the set of the second data pieces into a number of clusters, the number of clusters being at least one more than the number of types of the labels; and
selecting a data piece from the set of the second data pieces to be labeled from a cluster, from among the clusters, that does not include one of the first data pieces.
6. The data selection device according to claim 5, the processor further configured to execute a method comprising:
generating a feature extractor by using unsupervised feature expression learning based on the set of the first data pieces and the set of the second data pieces; and
obtaining feature information for each of the first data pieces and each of the second data pieces by using the feature extractor, wherein
the classifying classifies the set of the first data pieces and the set of the second data pieces into the clusters based on the feature information.
7. The data selection device according to claim 6, wherein the selecting selects the data piece from the set of the second data pieces to be labeled from a cluster having a relatively small variance of the feature information.
8. A computer-readable non-transitory recording medium storing computer-executable program instructions that when executed by a processor cause a computer system to execute a data selection method comprising:
classifying data pieces belonging to a set of labeled first data pieces and data pieces belonging to a set of unlabeled second data pieces into a number of clusters, the number of clusters being at least one more than a number of types of labels; and
selecting a data piece from the set of the second data pieces to be labeled from a cluster, from among the clusters, that does not include one of the first data pieces.
9. The computer-implemented method according to claim 2, wherein the classifying uses a convolutional neural network.
10. The computer-implemented method according to claim 2, wherein the obtaining feature information is based on k-means clustering on the set of the first data pieces and the set of the second data pieces.
11. The computer-implemented method according to claim 3, wherein the selecting selects, in the cluster that does not include the first data pieces, the data piece from the set of the second data pieces related to the feature information with a minimum distance from a center of the cluster.
12. The computer-readable non-transitory recording medium according to claim 8, the computer-executable program instructions when executed further causing the computer system to execute a data selection method comprising:
generating a feature extractor by using unsupervised feature expression learning based on the set of the first data pieces and the set of the second data pieces; and
obtaining feature information for each of the first data pieces and each of the second data pieces by using the feature extractor, wherein
the classifying classifies the set of the first data pieces and the set of the second data pieces into the clusters based on the feature information.
13. The data selection device according to claim 6, wherein the selecting selects the data piece from the set of the second data pieces to be labeled from a cluster having a relatively small variance of the feature information.
14. The data selection device according to claim 6, wherein the selecting selects, in the cluster that does not include the first data pieces, the data piece from the set of the second data pieces related to the feature information with a minimum distance from a center of the cluster.
15. The data selection device according to claim 6, wherein the classifying uses a convolutional neural network.
16. The data selection device according to claim 6, wherein the obtaining feature information is based on k-means clustering on the set of the first data pieces and the set of the second data pieces.
17. The computer-readable non-transitory recording medium according to claim 12, wherein the selecting selects the data piece from the set of the second data pieces to be labeled from a cluster having a relatively small variance of the feature information.
18. The computer-readable non-transitory recording medium according to claim 12, wherein the selecting selects, in the cluster that does not include the first data pieces, the data piece from the set of the second data pieces related to the feature information with a minimum distance from a center of the cluster.
19. The computer-readable non-transitory recording medium according to claim 12, wherein the classifying uses a convolutional neural network.
20. The computer-readable non-transitory recording medium according to claim 12, wherein the obtaining feature information is based on k-means clustering on the set of the first data pieces and the set of the second data pieces.
US17/631,396 2019-07-30 2019-07-30 Data selection method, data selection apparatus and program Pending US20220335085A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/029807 WO2021019681A1 (en) 2019-07-30 2019-07-30 Data selection method, data selection device, and program

Publications (1)

Publication Number Publication Date
US20220335085A1 (en)

Family

ID=74229395

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/631,396 Pending US20220335085A1 (en) 2019-07-30 2019-07-30 Data selection method, data selection apparatus and program

Country Status (3)

Country Link
US (1) US20220335085A1 (en)
JP (1) JP7222429B2 (en)
WO (1) WO2021019681A1 (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8498950B2 (en) * 2010-10-15 2013-07-30 Yahoo! Inc. System for training classifiers in multiple categories through active learning
US10102481B2 (en) * 2015-03-16 2018-10-16 Conduent Business Services, Llc Hybrid active learning for non-stationary streaming data with asynchronous labeling
US20180032901A1 (en) * 2016-07-27 2018-02-01 International Business Machines Corporation Greedy Active Learning for Reducing User Interaction
CN109844777A (en) * 2016-10-26 2019-06-04 索尼公司 Information processing unit and information processing method
WO2018116921A1 (en) * 2016-12-21 2018-06-28 日本電気株式会社 Dictionary learning device, dictionary learning method, data recognition method, and program storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160232444A1 (en) * 2015-02-05 2016-08-11 International Business Machines Corporation Scoring type coercion for question answering
US20170061625A1 (en) * 2015-08-26 2017-03-02 Digitalglobe, Inc. Synthesizing training data for broad area geospatial object detection
US9911033B1 (en) * 2016-09-05 2018-03-06 International Business Machines Corporation Semi-supervised price tag detection
US20200285906A1 (en) * 2017-09-08 2020-09-10 The General Hospital Corporation A system and method for automated labeling and annotating unstructured medical datasets
US20200285939A1 (en) * 2017-09-28 2020-09-10 D5Ai Llc Aggressive development with cooperative generators
US20190332667A1 (en) * 2018-04-26 2019-10-31 Microsoft Technology Licensing, Llc Automatically cross-linking application programming interfaces
US20190042879A1 (en) * 2018-06-26 2019-02-07 Intel Corporation Entropic clustering of objects
US20200027002A1 (en) * 2018-07-20 2020-01-23 Google Llc Category learning neural networks
US20200334567A1 (en) * 2019-04-17 2020-10-22 International Business Machines Corporation Peer assisted distributed architecture for training machine learning models
US20210056412A1 (en) * 2019-08-20 2021-02-25 Lg Electronics Inc. Generating training and validation data for machine learning

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
Ahn, Euijoon, et al., "Unsupervised Feature Learning with K-means and An Ensemble of Deep Convolutional Neural Networks for Medical Image Classification", arXiv, Cornell University, Document arXiv:1906.03359 [cs.CV], June 7, 2019, 8 pages. *
Cheng, Yanhua, et al., "Semi-Supervised Learning for RGB-D Object Recognition", ICPR 2014, Stockholm, Sweden, August 24-28, 2014, pp. 2377-2382. *
Gowda, Harsha S., et al., "Semi-supervised Text Categorization Using Recursive K-means Clustering", arXiv, Cornell University, Document - arXiv:1706.07913v1 [cs.LG], 24 Jun 2017, 10 pages. *
Kansal, Tushar, et al., "Customer Segmentation using K-means Clustering", CTEMS 2018, Belgaum, India, December 21-22, 2018, pp. 135-139. *
Knyazev, Boris, et al., "Autoconvolution for Unsupervised Feature Learning", arXiv, Cornell University, Document - arXiv:1606.00611v1 [cs.CV] 2 Jun 2016, 15 pages. *
Ling, Zhigang, et al., "Semi-Supervised Learning via Convolutional Neural Network for Hyperspectral Image Classification", ICPR 2018, Beijing, China, August 20-24, 2018, pp. 1900-1905. *
Ngoc, Minh Tran, et al., "Centroid Neural Network with Pairwise Constraints for Semi-supervised Learning", Neural Processing Letters, Vol. 18, February 8, 2018, pp. 1721-1747. *
Nie, Feiping, et al., "Multi-view Clustering and Semi-Supervised Classification with Adaptive Neighbors", International Journal of Engineering & Technology, Vol. 7 (1.8), February 2018, pp. 81-85. *
Reddy, Y C A Padmanabha, et al., "Semi-supervised learning: a brief review", International Journal of Engineering & Technology, Vol. 7 (1.8), February 2018, pp. 81-85. *
Tang, Jian, et al., "PTE: Predictive Text Embedding through Large-Scale Heterogeneous Text Networks", KDD ‘15, Sydney, NSW, Australia, pp. 1165-1174. *
Wagner, Raimar, et al., "Learning Convolutional Neural Networks From Few Samples", IJCNN 2013, Dallas, TX, Aug. 4-9, 2013, 7 pages. *
Wang, Xin, et al., "Semi-supervised K-Means Clustering by Optimizing Initial Cluster Centers", WISM 2011, Taiyuan, China, September 24-25, 2011, Proceedings, Part II, LNCS 6988, pp. 178-187. *

Also Published As

Publication number Publication date
JP7222429B2 (en) 2023-02-15
WO2021019681A1 (en) 2021-02-04
JPWO2021019681A1 (en) 2021-02-04

Similar Documents

Publication Publication Date Title
US11586988B2 (en) Method of knowledge transferring, information processing apparatus and storage medium
CN108416370B (en) Image classification method and device based on semi-supervised deep learning and storage medium
EP3355244A1 (en) Data fusion and classification with imbalanced datasets
EP2991003B1 (en) Method and apparatus for classification
US20170161645A1 (en) Method and apparatus for labeling training samples
US20170032276A1 (en) Data fusion and classification with imbalanced datasets
CN108491302B (en) Method for detecting spark cluster node state
WO2016205286A1 (en) Automatic entity resolution with rules detection and generation system
US20210117802A1 (en) Training a Neural Network Using Small Training Datasets
CN113449099A (en) Text classification method and text classification device
US20220092407A1 (en) Transfer learning with machine learning systems
US10565478B2 (en) Differential classification using multiple neural networks
WO2021034394A1 (en) Semi supervised animated character recognition in video
US20200117574A1 (en) Automatic bug verification
WO2023134402A1 (en) Calligraphy character recognition method based on siamese convolutional neural network
US11164043B2 (en) Creating device, creating program, and creating method
CN109902731B (en) Performance fault detection method and device based on support vector machine
CN115147632A (en) Image category automatic labeling method and device based on density peak value clustering algorithm
Aljundi et al. Identifying wrongly predicted samples: A method for active learning
US11875114B2 (en) Method and system for extracting information from a document
US20100239168A1 (en) Semi-tied covariance modelling for handwriting recognition
US20220335085A1 (en) Data selection method, data selection apparatus and program
Boillet et al. Confidence estimation for object detection in document images
US20220405299A1 (en) Visualizing feature variation effects on computer model prediction
Xue et al. Incremental zero-shot learning based on attributes for image classification

Legal Events

Date Code Title Description
AS Assignment

Owner name: NIPPON TELEGRAPH AND TELEPHONE CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TSUKATANI, SHUNSUKE;MURASAKI, KAZUHIKO;ANDO, SHINGO;AND OTHERS;SIGNING DATES FROM 20211015 TO 20211029;REEL/FRAME:058816/0452

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED