WO2022249415A1

WO2022249415A1 - Information provision device, information provision method, and information provision program

Info

Publication number: WO2022249415A1
Application number: PCT/JP2021/020296
Authority: WO
Inventors: 真弥山口; 哲哉塩田; 滉平山口; 基貴湯原
Original assignee: 日本電信電話株式会社
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-12-01
Also published as: JPWO2022249415A1

Abstract

A feature extraction unit (131) according to an embodiment extracts a plurality of features by inputting a plurality of data sets into a model that uses a data set to output a feature in a dimension lower than that of the data set. A similarity degree calculation unit (132) calculates the degree of similarity between the plurality of features extracted by the feature extraction unit (131).

Description

Information providing device, information providing method and information providing program

The present invention relates to an information providing device, an information providing method, and an information providing program.

A deep neural network (DNN) is capable of highly accurate prediction in image processing and natural language processing. On the other hand, learning a DNN is costly.

For example, the cost of training a DNN includes the cost of collecting data sets that include correct labels (annotations), the computational cost of improving accuracy, and the tuning cost of searching over multiple hyperparameters for each case.

Transfer learning has been proposed as a method to reduce such costs when introducing a DNN into business use.

Transfer learning is a technology that uses a dataset (source dataset) different from the target dataset or a trained model to perform learning with less data or less computation time.

In addition, transfer learning includes methods such as fine tuning and domain adaptation.

Fine-tuning is a method of pre-training a model with a transfer source dataset and using the learned parameters as initial values for training on the target dataset.

Domain adaptation is a method in which the same model learns both the source data set and the target data set at the same time, and uses the knowledge of the source data set to solve the task of the target data set.

However, the conventional technology has the problem that it may not be possible to implement transfer learning efficiently. Conventional transfer learning largely relies on the intuition and experience of the developer, requiring manual work such as selection of transfer source datasets and tuning of parameters.

Which datasets are effective for transfer learning is not obvious, and the results of transfer learning vary greatly depending on the relationship (similarity) between the target dataset and the transfer source dataset. For example, depending on the target dataset, a model pre-trained with ImageNet (a large-scale dataset) may be inferior to a model that has not been pre-trained (see Non-Patent Document 1, for example).

On the other hand, the degree of similarity between datasets is generally unknown, and no de facto standard index has yet emerged.

Also, in transfer learning, it is necessary to select hyperparameters that match the target dataset and the transfer source dataset. Deep learning models have many hyperparameters, and tuning is essential even during transfer learning.

In order to solve the above-described problems and achieve the object, an information providing device is characterized by having a feature extraction unit that extracts a plurality of feature quantities by inputting a plurality of data sets into a model that, given a data set, outputs a feature quantity of lower dimension than the data set, and a similarity calculation unit that calculates similarities between the plurality of feature quantities extracted by the feature extraction unit.

According to the present invention, transfer learning can be efficiently implemented.

FIG. 1 is a diagram showing a configuration example of an information providing device according to the first embodiment.
FIG. 2 is a diagram for explaining a similarity measuring method.
FIG. 3 is a diagram for explaining a model learning method.
FIG. 4 is a diagram for explaining information providing processing.
FIG. 5 is a flowchart showing the flow of learning processing.
FIG. 6 is a flowchart showing the flow of similarity measurement processing.
FIG. 7 is a flowchart showing the flow of information providing processing.
FIG. 8 is a diagram showing the results of the experiment.
FIG. 9 is a diagram showing experimental results.
FIG. 10 is a diagram showing an example of a computer that executes an information providing program.

Embodiments of an information providing device, an information providing method, and an information providing program according to the present application will be described in detail below based on the drawings. Note that the present invention is not limited by the embodiments described below.

[Configuration of the first embodiment]
FIG. 1 is a diagram showing a configuration example of an information providing device according to the first embodiment. The information providing device 10 calculates the degree of similarity between data sets and provides information based on the calculated degree of similarity. For example, the information providing device 10 provides information for identifying a transfer source data set similar to a target data set in transfer learning.

In addition, the information providing device 10 performs model learning processing for calculating the degree of similarity. The information providing device 10 may calculate the degree of similarity using the learned model, or may provide the learned model to another device or the like.

As shown in FIG. 1, the information providing device 10 has an input/output unit 11, a storage unit 12 and a control unit 13.

The input/output unit 11 is an interface for inputting/outputting data. For example, the input/output unit 11 may be a communication interface such as a NIC (Network Interface Card) for performing data communication with other devices via a network. Also, the input/output unit 11 may be an interface for connecting input devices such as a mouse and a keyboard, and output devices such as a display.

The storage unit 12 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disc. Note that the storage unit 12 may be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 12 stores an OS (Operating System) and various programs executed by the information providing device 10. The storage unit 12 also stores model information 121.

The model information 121 is information such as parameters for constructing a model, and is updated as appropriate during the learning process. The updated model information 121 may also be output to another device or the like via the input/output unit 11.

The control unit 13 controls the information providing device 10 as a whole. The control unit 13 is, for example, an electronic circuit such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or a GPU (Graphics Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 13 also has an internal memory for storing programs defining various processing procedures and control data, and executes each process using the internal memory. Further, the control unit 13 functions as various processing units by running various programs. For example, the control unit 13 has a feature extraction unit 131, a similarity calculation unit 132, a loss function calculation unit 133, an update unit 134, a candidate extraction unit 135, and a provision unit 136.

The feature extraction unit 131 extracts a plurality of feature amounts by inputting a plurality of data sets into a model that outputs a feature amount of lower dimension than the data set.

The similarity calculation unit 132 calculates similarities between the feature quantities extracted by the feature extraction unit 131.

A similarity measuring method by the feature extraction unit 131 and the similarity calculation unit 132 will be described with reference to FIG. 2. FIG. 2 is a diagram for explaining a similarity measuring method.

As shown in FIG. 2, the feature extraction unit 131 uses model F to extract feature amounts from data set A and data set B. Model F is a deep neural network for feature extraction.

A dataset contains multiple data samples. Also, the feature extraction unit 131 extracts a feature amount for each data sample.

In the example of FIG. 2, data set A includes I data samples x_A^i (where i is an integer from 0 to I). Then, the feature extraction unit 131 extracts, from data set A, feature amounts f_A^i corresponding to the I data samples.

Here, since the data sets used in a DNN are high-dimensional, it is difficult to directly measure the similarity between data sets. Therefore, the feature extraction unit 131 extracts a feature quantity obtained by reducing the dimension of the data set, as shown in FIG. 2. For example, f_A^i is lower dimensional than x_A^i.

Furthermore, the feature extraction unit 131 aggregates the extracted feature amounts. In the example of FIG. 2, the feature extraction unit 131 aggregates the feature amounts f_A^i corresponding to the I data samples into one feature amount f'_A.

In this way, the feature extraction unit 131 can aggregate the feature amounts output by the model for the individual data samples included in a data set into a feature amount the size of a single data sample. For example, the feature extraction unit 131 can use statistics such as the mean and variance of each element over the plurality of data samples as the feature amount after aggregation.

Then, the similarity calculation unit 132 calculates the similarity between the feature quantities aggregated by the feature extraction unit 131. For example, if the aggregated feature amounts f'_A and f'_B are vectors, the similarity calculation unit 132 calculates the distance d_AB between the vectors f'_A and f'_B as the similarity. The similarity calculation unit 132 may calculate the 2-Wasserstein distance as the distance between the vectors.
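The aggregation and distance calculation described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes that the aggregated feature amount consists of the per-element mean and variance, and that the 2-Wasserstein distance is taken between Gaussians with diagonal covariance built from those statistics. The function names `aggregate` and `wasserstein2_diag` are hypothetical.

```python
import numpy as np


def aggregate(features):
    """Aggregate per-sample features of shape (n_samples, dim) into one
    feature amount: the mean and variance of each element.
    (Assumed choice of statistics.)"""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), features.var(axis=0)


def wasserstein2_diag(stats_a, stats_b):
    """2-Wasserstein distance between N(mean_a, diag(var_a)) and
    N(mean_b, diag(var_b)). For diagonal covariances the squared distance is
    ||mean_a - mean_b||^2 + ||sqrt(var_a) - sqrt(var_b)||^2.
    (The diagonal-Gaussian model is an assumption of this sketch.)"""
    (mean_a, var_a), (mean_b, var_b) = stats_a, stats_b
    squared = np.sum((mean_a - mean_b) ** 2)
    squared += np.sum((np.sqrt(var_a) - np.sqrt(var_b)) ** 2)
    return float(np.sqrt(squared))


# Usage with synthetic stand-ins for the model outputs f_A^i and f_B^i.
rng = np.random.default_rng(0)
f_a = rng.standard_normal((100, 4))
f_b = f_a + 3.0  # same spread, every element shifted by 3
d_ab = wasserstein2_diag(aggregate(f_a), aggregate(f_b))
```

Because the distance is computed on the aggregated statistics rather than on individual samples, its cost does not depend on the number of samples once aggregation is done.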

In addition, the feature extraction unit 131 extracts feature amounts from a model that has been trained by self-supervised learning using the transfer source data set in transfer learning. Then, the similarity calculation unit 132 calculates the similarity between the feature amount of the transfer source data set and the feature amount of the target data set in the transfer learning.

The loss function calculator 133 calculates a loss function for model learning. Also, the updating unit 134 updates the parameters of the model so that the loss function is optimized.

It should be noted that the parameters of model F are stored in the storage unit 12 as model information 121. The updating unit 134 updates the model information 121.

The learning method of model F will be explained using FIG. 3. FIG. 3 is a diagram for explaining a model learning method.

Model F is used to measure the degree of similarity between target data and a plurality of transfer source data when specifying transfer source data similar to target data in transfer learning.

At that time, the information providing device 10 is assumed to train the model F in advance on an arbitrary task, such as classification, using the group of transfer source data sets.

In the embodiment, the information providing device 10 trains the model F using self-supervised learning. Since self-supervised learning does not require annotation, it is easy to handle multiple data sets together.

The information providing device 10 uses MoCo as the self-supervised learning method (Reference: He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.).

As shown in FIG. 3, the loss function calculation unit 133 calculates the contrastive loss function L_q based on the feature amounts obtained by inputting a plurality of data sets (D_0 to D_N) into the model F.

Here, the contrastive loss is the loss of the following task: a query image and a correct key image are generated from an input image by two different image transformations, and the DNN must correctly match the query with its correct key among a set of keys that also includes keys obtained from other images.

In the loss function in FIG. 3, q on the right side is the output of model F obtained from the query image. k_+ is the output of model F obtained from the correct key, which is a differently augmented version of the same image as the query. K is the total number of key images including the correct key. Also, τ is a temperature coefficient.
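The loss function itself appears only in FIG. 3. Assuming it takes the standard contrastive (InfoNCE) form used in the cited MoCo paper, it can be written with the symbols defined above as follows. The key index i and its range 1 to K are notational assumptions of this sketch, with the correct key k_+ being one of the k_i.

```latex
L_q = -\log \frac{\exp\left(q \cdot k_{+} / \tau\right)}
                 {\sum_{i=1}^{K} \exp\left(q \cdot k_{i} / \tau\right)}
```

Under this form, the loss approaches 0 when q is much more similar to k_+ than to any other key, and equals log K when q is equally similar to all K keys.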

The candidate extraction unit 135 and the provision unit 136 support actual transfer learning by providing information specifying transfer source data similar to target data. The information providing process will be described with reference to FIG. FIG. 4 is a diagram for explaining information providing processing.

Data set D_T in FIG. 4 is the target data set. Data sets D_0 to D_N are a plurality of transfer source data sets.

The similarity calculator shown in FIG. 4 corresponds to the feature extraction unit 131 and the similarity calculation unit 132, which use the trained model F.

At this time, the similarity calculation unit 132 calculates the similarity between the feature quantities of the one target data set (D_T) and each of the plurality of transfer source data sets (D_0 to D_N).

Then, based on the calculated similarity, the candidate extraction unit 135 extracts, as a candidate, a transfer source data set whose feature amount similarity to the target data set is at or above a predetermined rank.

Further, the providing unit 136 provides the user with information for specifying the transfer source data set extracted as a candidate among the transfer source data sets.

In the example of FIG. 4, the candidate extraction unit 135 creates a ranking by arranging the calculated similarities d_0T, d_1T, ..., d_NT in order. Then, for example, the transfer source data sets D_N, D_1, and D_0 corresponding to the top three similarities d_NT, d_1T, and d_0T are extracted.

The providing unit 136 provides the extracted transfer source data sets D_N, D_1, and D_0 to the user together with the corresponding hyperparameters H_N, H_1, and H_0. It is assumed that the optimal hyperparameters for each transfer source data set have already been determined, by a method such as grid search, when models were built in the past.

In this way, the multiple combinations of transfer source data and hyperparameters that are provided may be used together in transfer learning.

[Processing of the first embodiment]
The flow of processing by the information providing device 10 will be described using the flowcharts shown in FIGS. 5, 6 and 7.

FIG. 5 is a flowchart showing the flow of learning processing. As shown in FIG. 5, first, the information providing device 10 reads learning data from the transfer source data set group (step S101).

Next, the information providing device 10 extracts features from the learning data using the DNN model F (step S102).

Here, the information providing device 10 calculates the loss function of the pre-learning task on the feature space (step S103). Then, the information providing device 10 updates the parameters of the model F by the back propagation method of the loss function (step S104).

At this time, if the number of learning steps performed so far is less than the maximum number of learning steps (step S105, True), the information providing device 10 returns to step S101 and repeats the process. On the other hand, if the number of learning steps has reached the maximum number of learning steps (step S105, False), the information providing device 10 terminates the process.
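Steps S101 to S105 can be sketched as the loop below. This is a simplified illustration under several assumptions that are not in the embodiment: model F is replaced by a single linear map `W`, image transformations are replaced by small additive noise, the keys are computed with the same `W` and treated as constants (instead of MoCo's momentum encoder and queue), and the update is plain gradient descent with an analytically derived gradient. All names and default values are hypothetical.

```python
import numpy as np


def info_nce(q, keys, tau):
    """Contrastive loss for one query; keys[0] is the correct key k_+.
    Returns the loss and the softmax probabilities over the keys."""
    logits = keys @ q / tau
    logits = logits - logits.max()  # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[0], np.exp(log_prob)


def train_feature_extractor(datasets, dim_out=8, max_steps=200, lr=1e-3,
                            tau=0.2, n_keys=16, seed=0):
    rng = np.random.default_rng(seed)
    pool = np.concatenate(datasets)  # transfer source samples, no labels needed
    W = 0.1 * rng.standard_normal((dim_out, pool.shape[1]))  # stand-in for model F
    losses = []
    step = 0
    while step < max_steps:  # S105: continue while steps remain
        # S101: read learning data (one query sample plus samples for other keys)
        idx = rng.choice(len(pool), size=n_keys, replace=False)
        x = pool[idx[0]]
        x_query = x + 0.05 * rng.standard_normal(x.shape)  # transformation 1
        x_key = x + 0.05 * rng.standard_normal(x.shape)    # transformation 2
        # S102: extract features with the model
        q = W @ x_query
        keys = np.vstack([x_key, pool[idx[1:]]]) @ W.T  # treated as constants
        # S103: loss of the pre-training task on the feature space
        loss, probs = info_nce(q, keys, tau)
        # S104: update parameters; dL/dq = (sum_i p_i k_i - k_+) / tau
        grad_q = (probs @ keys - keys[0]) / tau
        W -= lr * np.outer(grad_q, x_query)
        losses.append(float(loss))
        step += 1
    return W, losses
```

A real implementation would use a deep network and automatic differentiation; the sketch only makes the order of the steps and the stopping condition concrete.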

FIG. 6 is a flowchart showing the flow of similarity measurement processing. As shown in FIG. 6, first, the information providing device 10 reads data samples from the transfer source data set (step S201).

Next, the information providing device 10 extracts features from the transfer source data samples using the DNN model F (step S202). Furthermore, the information providing device 10 aggregates the feature vectors of the individual transfer source data samples into a single feature vector (for example, the mean or variance) (step S203).

The information providing device 10 reads data samples from the target data set (step S204).

Then, the information providing device 10 extracts the features of the target data sample using the DNN model F (step S205). Furthermore, the information providing apparatus 10 aggregates the feature vectors for each target data sample into a single feature vector, similarly to the transfer source data set (step S206).

The information providing device 10 calculates the degree of similarity between the feature vectors of the aggregated target data set and the transfer source data set, for example, using the 2-Wasserstein distance (step S207).

FIG. 7 is a flowchart showing the flow of information provision processing. First, the information providing device 10 calculates the similarity between the target data set and the N transfer source data sets (step S301).

Next, the information providing device 10 sorts the transfer source data sets by the data set similarity {d_iT}_i^N (ascending order for a distance, descending order for a score) (step S302). Then, the information providing device 10 extracts the ids of the Top-K transfer source data sets from the ranking obtained by sorting (K is an arbitrary integer with K ≦ N) (step S303).

Here, the information providing device 10 reads the data sets and hyperparameters associated with the K transfer source data set ids (step S304). The information providing device 10 then issues a URI (Uniform Resource Identifier) from which the user can download them, and outputs the data sets and hyperparameters (step S305).
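Steps S302 to S304 can be sketched as follows. This is a minimal illustration that assumes the similarities and the previously determined hyperparameters are held in dictionaries keyed by transfer source data set id. The function name `recommend_sources`, the ids, and all numeric values are hypothetical, and the URI issuance of step S305 is omitted.

```python
def recommend_sources(similarities, hyperparameters, k, is_distance=True):
    """Return the Top-K transfer source data set ids with their hyperparameters.

    similarities:    dict mapping data set id -> d_iT
    hyperparameters: dict mapping data set id -> stored hyperparameters
    is_distance:     True if smaller values mean more similar (ascending sort),
                     False if larger values mean more similar (descending sort)
    """
    if not 0 < k <= len(similarities):
        raise ValueError("k must satisfy 0 < k <= number of data sets")
    # S302: sort the transfer source data sets by similarity
    ranked = sorted(similarities, key=similarities.get, reverse=not is_distance)
    # S303: extract the Top-K ids from the ranking
    top_ids = ranked[:k]
    # S304: read the hyperparameters associated with those ids
    return [(ds_id, hyperparameters[ds_id]) for ds_id in top_ids]


# Usage with made-up distances and hyperparameters (illustrative values only).
distances = {"D_0": 0.9, "D_1": 0.4, "D_N": 0.1}
stored = {"D_0": {"lr": 0.1}, "D_1": {"lr": 0.01}, "D_N": {"lr": 0.001}}
candidates = recommend_sources(distances, stored, k=2)
```

The `is_distance` flag mirrors the note in step S302 that the sort direction depends on whether the similarity is expressed as a distance or as a score.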

[Effects of the first embodiment]
As described above, the feature extraction unit 131 extracts a plurality of feature amounts by inputting a plurality of data sets into a model that outputs a feature amount of lower dimension than the data set. The similarity calculation unit 132 calculates the similarity between the feature quantities extracted by the feature extraction unit 131.

In this way, the information providing device 10 can automatically calculate the degree of similarity between datasets. As a result, according to the present embodiment, similar data sets can be specified, so that transfer learning can be efficiently performed.

The feature extraction unit 131 aggregates the feature amounts output by the model, which are feature amounts for each data sample included in the data set, into one data sample feature amount. The similarity calculation unit 132 calculates the similarity between feature amounts aggregated by the feature extraction unit 131.

As a result, according to this embodiment, it becomes possible to easily calculate the distance between the feature quantities.

The feature extraction unit 131 extracts feature quantities from a model that has been trained by self-supervised learning using a transfer source data set in transfer learning. The similarity calculation unit 132 calculates the similarity between the feature amount of the transfer source data set and the feature amount of the target data set in transfer learning.

In this way, in this embodiment, self-supervised learning that does not require annotation enables efficient learning of a model that measures similarity.

The similarity calculation unit 132 calculates the similarity between feature quantities for each of one target data set and a plurality of transfer source data sets. The providing unit 136 provides the user with information for specifying, among the transfer source data sets, those transfer source data sets whose similarity in feature quantity with the target data set is equal to or higher than a predetermined rank.

As a result, the information providing device 10 can recommend a transfer source dataset similar to the target dataset to the user. Therefore, according to this embodiment, transfer learning can be performed efficiently.

[Experiment]
An experiment conducted by actually implementing the above embodiment will be described. In the experiment, using the above embodiment, the transfer source dataset and the hyperparameter (architecture) were selected according to the similarity of the dataset.

The experimental setup is as follows.
・Dataset
Target dataset: Oxford Pets
Transfer source dataset: Subset group of ImageNet divided into 11 classes
・Neural network architecture
Experiment 1: ResNet-50
Experiment 2: ResNet-50, ResNet-101, ResNext-50-32x4d, ResNext-101-32-4d, Wide-ResNet-50, Wide-ResNet-101

(Experiment 1)
FIG. 8 shows the results of Experiment 1 in which the transfer source data set was selected according to the similarity of the data sets. FIG. 8 is a diagram showing the results of the experiment.

In the example of FIG. 8, the feature extractor (model F, the self-supervised learning model MoCo) was trained using all the data of the transfer source data sets. Then, the feature extractor was used to measure the data set similarity between the target data set and each transfer source data set.

Furthermore, a model trained with each subset as the transfer source data set was fine-tuned on Oxford Pets, and the test accuracy was measured. FIG. 8 is a diagram visualizing the correlation between data set similarity (Similarity) and test accuracy (ACC@1).

From the correlation shown in FIG. 8, it can be said that the embodiment is effective in selecting an effective transfer source data set.

(Experiment 2)
FIG. 9 shows the results of Experiment 2, in which the hyperparameters (architecture) were selected according to the similarity of the datasets. FIG. 9 is a diagram showing experimental results.

In the example of FIG. 9, class classification was learned for each architecture using the target dataset and the transfer source dataset, and the test accuracy was measured. Then, a feature extractor was used to measure the dataset similarity between the target dataset and the source dataset.

In addition, we created an architecture ranking (in descending order of test accuracy) for each dataset by test accuracy, and measured the Mean Average Precision (MAP) of the ranking between the target dataset and the transfer source dataset. FIG. 9 is a diagram visualizing the correlation between data set similarity (Similarity) and MAP.

From the correlation shown in FIG. 9, it can be said that the embodiment is effective in selecting effective hyperparameters.

[System configuration, etc.]
Also, each component of each illustrated device is functionally conceptual and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated. Furthermore, all or any part of each processing function performed by each device can be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or can be realized as hardware by wired logic. Note that the program may be executed not only by the CPU but also by other processors such as a GPU.

Further, among the processes described in the present embodiment, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, information including processing procedures, control procedures, specific names, and various data and parameters shown in the above documents and drawings can be arbitrarily changed unless otherwise specified.

[Program]
As one embodiment, the information providing apparatus 10 can be implemented by installing an information providing program that executes the above processing as package software or online software on a desired computer. For example, the information processing device can function as the information providing device 10 by causing the information processing device to execute the information providing program. The information processing apparatus referred to here includes a desktop or notebook personal computer. In addition, information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).

The information providing device 10 can also be implemented as a server device that uses a terminal device used by a user as a client and provides the client with services related to the above processing. For example, the server device is implemented as a server device that provides a similarity measurement service that takes a target data set and a plurality of transfer source data sets as input and outputs the similarity between the target data set and each transfer source data set. In this case, the server device may be implemented as a web server, or may be implemented as a cloud service that provides services related to the above processing by outsourcing.

FIG. 10 is a diagram showing an example of a computer that executes an information providing program. The computer 1000 has a memory 1010 and a CPU 1020, for example. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.

The hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program that defines each process of the information providing device 10 is implemented as a program module 1093 in which computer-executable code is described. The program modules 1093 are stored, for example, on the hard disk drive 1090. For example, the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the information providing device 10. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

Also, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads the program modules 1093 and program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary, and executes the processes of the above-described embodiments.

The program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.

10 information providing device
11 input/output unit
12 storage unit
121 model information
13 control unit
131 feature extraction unit
132 similarity calculation unit
133 loss function calculation unit
134 update unit
135 candidate extraction unit
136 provision unit

Claims

1. An information providing device characterized by comprising:
a feature extraction unit that extracts a plurality of feature quantities by inputting a plurality of data sets into a model that outputs a feature quantity of lower dimension than the data set; and
a similarity calculation unit that calculates the similarity between the plurality of feature quantities extracted by the feature extraction unit.

2. The information providing device according to claim 1, wherein
the feature extraction unit aggregates the feature amounts output by the model, which are the feature amounts for each data sample included in the data set, into a feature amount of one data sample, and
the similarity calculation unit calculates the similarity between the feature quantities aggregated by the feature extraction unit.

3. The information providing device according to claim 1, wherein
the feature extraction unit extracts a feature amount from a model trained by self-supervised learning using a transfer source data set in transfer learning, and
the similarity calculation unit calculates a similarity between the feature amount of the transfer source data set and the feature amount of the target data set in the transfer learning.

4. The information providing device according to any one of claims 1 to 3, further comprising a providing unit that provides information to a user, wherein
the similarity calculation unit calculates the similarity between feature quantities for each of one target data set and a plurality of transfer source data sets, and
the providing unit provides the user with information for specifying, among the transfer source data sets, a transfer source data set whose feature amount similarity with the target data set is of a predetermined rank or higher.

5. An information providing method executed by an information providing device, characterized by comprising:
a feature extraction step of extracting a plurality of feature amounts by inputting a plurality of data sets into a model that outputs a feature amount of lower dimension than the data set; and
a similarity calculation step of calculating a similarity between the plurality of feature quantities extracted by the feature extraction step.

6. An information providing program for causing a computer to function as the information providing device according to any one of claims 1 to 4.