WO2021250774A1

WO2021250774A1 - Learning device, prediction device, learning method, and program

Info

Publication number: WO2021250774A1
Application number: PCT/JP2020/022672
Authority: WO
Inventors: 悠三鼓; 豪入江; 大貴伊神
Original assignee: 日本電信電話株式会社
Priority date: 2020-06-09
Filing date: 2020-06-09
Publication date: 2021-12-16
Also published as: JP7440798B2; JPWO2021250774A1

Abstract

Provided is a learning device provided with: a feature extraction device for outputting a feature amount of input data; a plurality of identification devices for acquiring, on the basis of the feature amount, attribution probabilities to a known class and an unknown class for the data; an unknown class identification device for determining, on the basis of the attribution probabilities acquired by the identification devices, whether the data is in the unknown class or not; an identification inconsistency evaluation unit for outputting a value of an identification inconsistency degree indicating a difference of the attribution probabilities acquired by the plurality of identification devices for the data; and a learning unit for performing repetitive learning of parameters of the feature extraction device and the plurality of identification devices in such a manner that, by using data which is not in the unknown class and to which a training label is not given, the value of the identification inconsistency degree is reduced for the feature extraction device, and the value of the identification inconsistency degree is increased for the plurality of identification devices.

Description

Learning equipment, prediction equipment, learning methods and programs

The present invention relates to a learning device, a prediction device, a learning method, and a program technology.

For predictive model learning using machine learning, a framework generally called supervised learning is used. Supervised learning is a framework in which a large number of pairs of data and correct class labels for the data are prepared, and the relationship is learned from the pairs of data and class labels.

In order to realize supervised learning, it is necessary to prepare a pair of a large amount of data and a class label, but creating this is basically expensive. Therefore, a method may be adopted in which a model learned in an area where supervised data already exists (hereinafter referred to as "domain") is utilized in a target domain. For example, when recognizing handwritten characters, after learning the classifier using digital font data for which supervised data can be obtained relatively easily, the classifier is used for handwritten character data with little (or no) supervised data. Methods such as retraining may be taken.

However, the original domain from which learning was performed (hereinafter referred to as "original domain": digital font data in the case of the previous example) and the target domain (hereinafter referred to as "target domain": handwritten character data in the case of the previous example). And, the data generation distribution may be different. FIG. 6 is a diagram illustrating an outline of such a problem. In FIG. 6, the region surrounded by the solid line is the original domain 10, the region surrounded by the broken line is the target domain 20, and the line shown by the straight line is the identification boundary 30. For example, even if the same character "A" is used, the shape of the digital font and the handwritten character may differ greatly. When the generation distribution is different, the discrimination boundary 30 learned in the original domain 10 as shown in FIG. 6 may not be reliable with respect to the target domain 20. In such a case, there arises a problem that the trained model cannot achieve the discrimination accuracy expected in the target domain 20. In this way, the learning problem when there is a difference between domains is called a domain adaptation problem.

Conventionally, in order to solve such a domain adaptation problem, the following known techniques exist. In the technique disclosed in Patent Document 1, the original domain to the target domain that minimizes the value of MMD, which is the distribution feeling distance between the sample generation distribution in the original domain and the sample generation distribution in the target domain. The conversion rule to is learned. Then, the data of the original domain is converted using the learned conversion rule, and the model is learned by supervised learning using the converted data of the original domain.

In Non-Patent Document 1, the data of the original domain and the data of the target domain are classified into the feature extractor that projects the data of the original domain and the data of the target domain into the feature space that makes it difficult to identify the domain, and the data of the original domain and the data in the feature space. The relationship with the given class label is learned at the same time. Making it difficult to distinguish between the data of the original domain and the data of the target domain in the feature space means that the generation distributions of both are brought closer to each other in the feature space. Such processing may mean, for example, changing from the state of FIG. 6 to the state of FIG. 7. This improves the prediction accuracy of the target domain data for the model obtained by supervised learning with the original domain data.

In Non-Patent Document 2, the common feature space learned in Non-Patent Document 1 is learned using a feature extractor and two classifiers connected to the feature extractor. It is known that the model learned in Non-Patent Document 2 has higher prediction accuracy to the target domain data than the model learned in Non-Patent Document 1.

Various incidental problems may occur depending on the difference between the original domain and the target domain. As one of the incidental problems, there is a problem that occurs when data other than the class given to the original domain exists in the target domain. Taking the case of handwritten character recognition as an example, although there are only "a", "i", and "u" in the digital font data, "e" and "o" are in the handwritten character data. Such a problem arises when it is included. Classes labeled by the original domain are called known classes (in the previous example, "a", "i", "u"), and other classes are called unknown classes (in the previous example, "". E "," O "). Normally, a discriminator that has undergone supervised learning predicts that it belongs to one of the known classes even if data belonging to an unknown class is input. Such an operation may cause a problem that the accuracy of character recognition is lowered.

There are also the following problems as another problem. Normally, it is assumed that the original domain and the target domain each consist of a single domain. However, both the original domain and the target domain may be formed by multiple domains. For example, if the handwritten character data is written by a plurality of different individuals, or if it is written using a different writing instrument, the original domain or the target domain may be formed by the plurality of domains. In this case, since the generation distribution changes, it can be considered that a plurality of domains are inherent in the target domain. When the domain is formed by a plurality of domains, there arises a problem that the expected prediction accuracy cannot be achieved by the method as in Non-Patent Document 1.

There is a non-patent document 4 as a document relating to a technique corresponding to a problem in which a plurality of domains are inherent in the original domain. In the technique disclosed in Non-Patent Document 4, features that make it difficult to distinguish between each domain inherent in the original domain and the target domain are learned. On the contrary, there is a non-patent document 5 as a document relating to a technique for dealing with a problem in which a plurality of domains are inherent in a target domain. In the technique disclosed in Non-Patent Document 5, a feature that makes it difficult to identify a domain among the regions of a plurality of domains inherent in the target domain is learned.

Japanese Unexamined Patent Publication No. 2019-101789

In order to solve the domain adaptation problem in which the generation distribution differs between the original domain and the target domain and various incidental problems that occur with it, such as Non-Patent Document 3, Non-Patent Document 4, and Non-Patent Document 5, respectively. Technology has been proposed. However, although each technique exhibits good performance for the incidental problems it considers, it is not effective for other incidental problems.

For example, even if the technique of Non-Patent Document 4, which is a technique for dealing with a problem in which a plurality of domains are inherent in the original domain, is applied to a problem in which a plurality of domains are inherent in the target domain, sufficient performance can be obtained. No.

In general, it is rare that it is possible to know in advance what kind of problem exists in the processing target. Therefore, it is difficult to determine what kind of problem the technology should be applied to. Further, in the case where a plurality of the above-mentioned problems are mixed, there is a problem that it becomes difficult to apply the technology.

In view of the above circumstances, the present invention has been made in view of such problems, and an object of the present invention is to provide a technique for achieving good performance for a wider range of problems related to domains.

One aspect of the present invention includes a feature extractor that outputs a feature amount of input data, and a plurality of classifiers that acquire the attribution probability of the data to a known class and an unknown class based on the feature amount. An unknown class classifier that determines whether or not the data is an unknown class based on the attribution probability acquired by the classifier, and each attribution obtained by the plurality of classifiers with respect to the data. Using the identification mismatch evaluation unit that outputs the value of the discrimination mismatch degree indicating the difference in probability and the data that is not the unknown class and is not given the teacher label, the feature extractor has the discrimination mismatch degree. Learning including a learning unit for iteratively learning the parameters of the feature extractor and the plurality of the classifiers so as to reduce the value and to increase the value of the discriminant mismatch degree for the plurality of the classifiers. It is a device.

One aspect of the present invention includes a feature extractor that outputs a feature amount of input data based on the parameters obtained by the above learning device, and the parameters and the feature amount obtained by the above learning device. Based on this, it is a prediction device including a classifier that acquires the attribution probability of the data to a known class and an unknown class.

One aspect of the present invention is a feature extraction step that outputs a feature amount of input data using a feature extractor, and a known class and a known class for the data based on the feature amount using a plurality of classifiers. An identification step for acquiring the attribution probability to the unknown class, an unknown class identification step for determining whether or not the data is an unknown class based on the acquired attribution probability, and a plurality of the data. Using the discrimination mismatch evaluation step that outputs the value of the discrimination mismatch degree indicating the difference in the respective attribution probabilities obtained by the classifier, and the data that is not the unknown class and is not given the teacher label. Iterating the parameters of the feature extractor and the plurality of discriminators so as to reduce the value of the discriminant discriminant for the feature extractor and to increase the value of the discriminant discrepancy for the plurality of discriminators. It is a learning method having a learning step for learning.

One aspect of the present invention is a program for operating a computer as the above-mentioned learning device.

The present invention has been made in view of such problems, and it is possible to achieve good performance for a wider range of problems related to domains.

It is a figure which shows the outline of this embodiment. It is a figure which shows the outline of this embodiment. It is a functional block diagram which shows an example of the learning apparatus 100 which concerns on this embodiment. It is a functional block diagram which shows an example of the prediction apparatus 200 which concerns on this embodiment. It is a flowchart which shows the operation example of the learning apparatus 100. It is a figure which shows the example of the prior art. It is a figure which shows the example of the prior art.

<Summary>
First, the outline of the present embodiment will be described. This embodiment operates appropriately even when there is a problem in which an unknown class exists (hereinafter referred to as "first problem"). Further, in this embodiment, the problem that the data of each domain is only partially labeled (hereinafter referred to as "second problem") and the problem that the domain attribution information of the data is unknown (hereinafter referred to as "" Even if there is a "third problem"), it may be configured to operate properly. Moreover, even if a plurality of problems among these three incidental problems are inherent, they may be configured to operate appropriately.

More specifically, it is as follows. 1 and 2 are diagrams showing an outline of the present embodiment. In FIGS. 1 and 2, the region surrounded by the solid line is the original domain 10, the region surrounded by the broken line is the target domain 20, and the line shown by the straight line is the identification boundary 30. The line segment 40 shows the information specified as the boundary between the known class and the unknown class. Arrow 50 indicates that it constitutes domain adaptation.

In this embodiment, the first problem that an unknown class exists is dealt with by identifying and identifying the data belonging to the unknown class from the data to which the teacher label is not given. In this embodiment, for the second and third problems, the labeled data is regarded as the original domain, and the data not given the teacher label and belonging to the known class is regarded as the target domain. Address by configuring adaptation.

<Sample configuration of learning device>
Next, the configuration of the learning device according to the present embodiment will be described. FIG. 3 is a functional block diagram showing an example of the learning device 100 according to the present embodiment. The learning device 100 is configured by using an information processing device such as a personal computer or a server device. The learning device 100 includes a control unit 90, an unknown class information storage unit 130, and a learning result storage unit 140. The control unit 90 is configured by using a processor such as a CPU (Central Processing Unit) and a memory. The control unit 90 includes a feature extractor 101, a first classifier 102, a second classifier 103, a discrimination loss evaluation unit 104, an unknown class classifier 105, a discrimination mismatch evaluation unit 106, and a learning unit by executing a program by the processor. It functions as a unit 107. All or part of each function of the control unit 90 may be realized by using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The above program may be recorded on a computer-readable recording medium. Computer-readable recording media include, for example, flexible disks, magneto-optical disks, ROMs, CD-ROMs, portable media such as semiconductor storage devices (eg SSD: Solid State Drive), hard disks built into computer systems, and semiconductor storage. It is a storage device such as a device. The above program may be transmitted over a telecommunication line.

The learning device 100 operates by acquiring data from the supervised data storage unit 110 and the unsupervised data storage unit 120. The supervised data storage unit 110 is configured by using a device or medium capable of storing data such as a storage device such as a magnetic hard disk device or a semiconductor storage device, a recording medium such as a CD-ROM, or the like. The supervised data storage unit 110 stores a supervised data set. A supervised data set is a set of data with the desired class label. The unsupervised data storage unit 120 is configured by using a device or medium capable of storing data such as a storage device such as a magnetic hard disk device or a semiconductor storage device, a recording medium such as a CD-ROM, or the like. The unsupervised data storage unit 120 stores an unsupervised data set. An unsupervised data set is a set of data that does not have the desired class label.

The feature extractor 101 receives a supervised data set and an unsupervised data set as inputs, and extracts a feature vector from each data. The feature extractor 101 outputs the extracted feature vector to the first classifier 102 and the second classifier 103. The feature extractor 101 operates based on a function having parameters capable of extracting such a feature vector. The feature vector is, for example, a numerical vector representing the features of the data. In other words, the feature vector is a vector representing the features of the required data with n-dimensional elements. n is an arbitrary integer value, and may be, for example, n = 512. The feature vector will be described as having a vector form for convenience, but the form is irrelevant to the main point of the present invention and can take any form. Each time the feature extractor 101 outputs the feature vector, the feature extractor 101 reads the parameter stored in the learning result storage unit 140 and outputs the feature vector.

The first classifier 102 receives the feature vector output by the feature extractor 101 as an input. The first classifier 102 outputs an estimated value (hereinafter referred to as "estimated attribution probability") of the attribution probability to each class and the unknown class with respect to the original data of the input feature vector. The estimated attribution probability is the probability that the data is likely to belong to each known and unknown class. The first classifier 102 operates on the basis of a function having a parameter capable of outputting such an estimated attribution probability. Each time the first classifier 102 outputs the estimated attribution probability, the first classifier 102 reads the parameter stored in the learning result storage unit 140 and outputs the estimated attribution probability.

The second classifier 103 receives the feature vector output by the feature extractor 101 as an input. The second classifier 103 outputs an estimated value (estimated attribution probability) of the attribution probability to each class and the unknown class with respect to the original data of the input feature vector. The second classifier 103 operates based on a function having a parameter capable of outputting such an estimated attribution probability. Each time the second classifier 103 outputs the estimated attribution probability, the second classifier 103 reads the parameter stored in the learning result storage unit 140 and outputs the estimated attribution probability. The same feature vector is input to the first classifier 102 and the second classifier 103.

As the function applied to the feature extractor 101, the first classifier 102 and the second classifier 103, any function can be used as long as it is differentiable with respect to the parameter. In this embodiment, CNN (Convolutional Neural Network) is used. However, CNN is only an example, and it is not necessary to be limited to this.

The discrimination loss evaluation unit 104 determines the data to be processed, the information indicating whether or not the data to be processed is an unknown class, and the first classifier 102 and the second classifier 103 to the data to be processed. The output estimated attribution probability and the desired attribution probability for the data to be processed (hereinafter referred to as "teacher attribution probability") are received as inputs. The discrimination loss evaluation unit 104 obtains the value of the discrimination loss function (hereinafter referred to as “discrimination loss evaluation value”), which is the first loss function representing these differences. The teacher attribution probability is the attribution probability according to the class label that is the correct answer during learning.

The unknown class classifier 105 receives the data to be processed and the estimated attribution probability output by the first classifier 102 and the second classifier 103 with respect to the data to be processed as inputs. The unknown class classifier 105 identifies whether or not the data to be processed is an unknown class. The unknown class classifier 105 records information indicating the discrimination result (hereinafter referred to as “unknown class information”) in the unknown class information storage unit 130. The information recorded in the unknown class information storage unit 130 is used by the identification loss evaluation unit 104 and the identification mismatch evaluation unit 106.

The discrimination mismatch evaluation unit 106 receives the data to be processed and the estimated attribution probability output by the first classifier 102 and the second classifier 103 to the data to be processed as inputs. The discrimination mismatch evaluation unit 106 acquires a value indicating the degree of mismatch of the estimated attribution probabilities of the first classifier 102 and the second classifier 103 (hereinafter referred to as “discrimination mismatch evaluation value”).

The learning unit 107 receives the discrimination loss function obtained by the discrimination loss evaluation unit 104 and the discrimination mismatch evaluation value obtained by the discrimination mismatch evaluation unit 106 as inputs. The learning unit 107 performs iterative learning of the parameters of the feature extractor 101, the first classifier 102, and the second classifier 103 using the input values. The learning unit 107 records the parameters of the feature extractor 101, the first classifier 102, and the second classifier 103 obtained by iterative learning in the learning result storage unit 140. The iterative learning for the feature extractor 101 is performed so that both the discrimination loss evaluation value and the discrimination mismatch evaluation value become small. The iterative learning for the first classifier 102 and the second classifier 103 is performed so that the discrimination loss evaluation value becomes small and the discrimination mismatch evaluation value becomes large.

<Configuration example of prediction device>
Next, the configuration of the prediction device according to the present embodiment will be described. FIG. 4 is a functional block diagram showing an example of the prediction device 200 according to the present embodiment. The prediction device 200 is configured by using an information processing device such as a personal computer or a server device. The prediction device 200 includes a control unit 91 and a storage unit 230. The control unit 91 is configured by using a processor such as a CPU and a memory. The control unit 91 functions as a feature extractor 201 and a classifier 202 when the processor executes a program. In addition, all or a part of each function of the control unit 91 may be realized by using hardware such as ASIC, PLD and FPGA. The above program may be recorded on a computer-readable recording medium. Computer-readable recording media include, for example, flexible disks, magneto-optical disks, ROMs, CD-ROMs, portable media such as semiconductor storage devices (for example, SSDs), and storage of hard disks and semiconductor storage devices built in computer systems. It is a device. The above program may be transmitted over a telecommunication line.

The storage unit 230 is configured by using a storage device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 230 stores the parameters as the learning result obtained by the iterative learning performed by the learning unit 107 of the learning device 100.

When the feature extractor 201 receives the data to be processed (data to be predicted) 240, it reads out the parameters from the storage unit 230 and operates based on the parameters. The feature extractor 201 outputs a feature vector for the data 240 to be processed. The classifier 202 reads a parameter from the storage unit 230 and operates based on the parameter. The classifier 202 obtains an estimated attribution probability for the data 240 to be processed based on the feature vector obtained by the feature extractor 201. The output of the classifier 202 may be the estimated attribution probability itself for each class of the data 240 to be processed, or may be information indicating the prediction result of which class it belongs to.

<Operation example of learning device>
FIG. 5 is a flowchart showing an operation example of the learning device 100. Next, an operation example of the learning device 100 will be described. The learning device 100 receives the supervised data set 110 and the unsupervised data set 120, and executes the learning processing routine shown in FIG.

First, the control unit 90 of the learning device 100 reads one or more supervised data sets 110 and unsupervised data sets 120 (step S101). Next, the control unit 90 makes a branch determination as to whether or not the number of learning iterations is equal to or less than a predetermined number of scheduled times (step S102). If the number of iterations is less than or equal to the planned number, the process of step S103 is executed. On the other hand, if the number of iterations is larger than the planned number, the process of step S104 is executed.

Here, the significance of the branch processing in step S102 will be described. This branching process changes the method of identifying unknown classes. The first classifier 102 and the second classifier 103 are trained so as to be able to discriminate (K + 1) classes in which K known classes and unknown classes are combined. However, for a known class, a pair of data and its teacher attribution probability can be obtained as supervised data, whereas for an unknown class, it is unknown which data is the unknown class. Therefore, when the number of iterations is less than the planned number, the unknown class is identified for the unsupervised data, and the result is recorded as the identification history. On the other hand, when the number of iterations is larger than the planned number, the first classifier 102 and the second classifier 103 are learned so that (K + 1) classes can be identified, and the identification result of the unknown class is recorded as the identification history.

When the number of iterations is less than the planned rank, the teacher attribution probability of the unknown class can be estimated by identifying the unknown class, but an error is also included. Therefore, by learning the identification of (K + 1) classes while recording the identification history, it is possible to identify unknown classes with few errors.

In step S103, the feature extractor 101, the first classifier 102, the second classifier 103, and the unknown class classifier 105 are applied to the supervised data set 110 and the unsupervised data set 120, and the discrimination loss evaluation value is determined. The discrimination mismatch evaluation value and the judgment of whether or not it is an unknown class can be obtained.

In step S104, the unknown class identification history is read for the unsupervised data set 120. Then, in step S105, the feature extractor 101, the first classifier 102, the second classifier 103, and the unknown class classifier 105 are applied to the supervised data set 110, the unsupervised data set 120, and the unknown class identification history. Then, the discrimination loss evaluation value, the discrimination mismatch evaluation value, and the determination of whether or not the class is unknown can be obtained.

When the process of step S103 or step S105 is completed, the learning unit 107 sets the parameter values of the feature extractor 101, the first classifier 102, and the second classifier 103 based on the discrimination loss evaluation value and the discrimination mismatch evaluation value. (Values recorded in the learning result storage unit 140) are updated (step S106).

The parameters of the feature extractor 101, the first classifier 102, and the second classifier 103 are stored in the learning result storage unit 140. Next, the unknown class classifier 105 records in the unknown class information storage unit 130 the discrimination result of whether or not the unknown class data is obtained in step S103 or S105 (step S107).

Then, the control unit 90 determines whether the end condition is satisfied (step S108). If the end condition is satisfied (step S108-YES), the control unit 90 ends the process. If the end condition is not satisfied (step S108-NO), the control unit 90 returns to step S101 and repeats the process.

By the iterative learning described above, the parameters of the feature extractor 101, the first classifier 102, and the second classifier 103 are learned. With respect to the feature extractor 101, learning is performed using the discrimination loss evaluation value and the discrimination mismatch evaluation value so that the discrimination loss evaluation value and the discrimination mismatch evaluation value become smaller. The discrimination loss function outputs a value smaller as the degree of similarity between the estimated attribution probability of the data output by the first classifier 102 and the second classifier 103 and the given teacher attribution probability of the data is higher. The discriminant mismatch evaluation value indicates the difference between the discriminators regarding the estimated attribution probability of the data output by the first discriminator 102 and the second discriminator 103. Further, with respect to the first classifier 101 and the second classifier 102, learning is performed so that the discrimination loss evaluation value is small and the discrimination mismatch evaluation value is large.

[Details of each process]
Next, the details of the processing of each processing unit of the learning device 100 will be described.
[When the number of iterations is less than the planned number]
In step 102, each process of the discrimination loss evaluation unit 104, the unknown class classifier 105, and the discrimination mismatch evaluation unit 106 when the number of iterations is less than or equal to the planned number of times will be described.

The discrimination loss function has a similarity between the estimated attribution probability of the data output by the first classifier 102 and the second classifier 103 and the given teacher attribution probability of the data by inputting the feature vector output by the feature extractor 101. The higher the value, the smaller the value is output. The discrimination loss function corresponds to Equations 2 and 3 described later. Further, the value corresponds to the identification loss evaluation value.

[Processing of identification loss evaluation unit]
The feature extractor 101 is realized by using a function F that takes data x as an input, outputs a feature vector f, and has a parameter φ. The first classifier 102 can be expressed as a function having a parameter θ1 that outputs an estimated attribution probability y1 with the feature vector f as an input. The second classifier 103 can be expressed as a function having a parameter θ2 that outputs an estimated attribution probability y2 with the feature vector f as an input. The function that realizes the first classifier 102 and the second classifier 103 can be expressed as a probability function as the following equation 1 by using the function F that realizes the feature extractor 101. Note that i is used as a subscript to distinguish between the two classifiers.

The formula is the probability that yi will appear given φ, θi, and x. Desirable feature extractors 101, first classifier 102, and second classifier 103 are such that when data s is given from a supervised data set, a teacher attribution probability t to each class appears. That is, the feature extractor 101, the first discriminator 102, and the second discriminator 103 for which the attribution probability that can identify the correct class is obtained. Assuming that the probability of appearance of the teacher attribution probability t corresponding to the data s is p (s, t), learning should be able to determine the parameters φ and θi so that the following equation 2 becomes small.

Eb [a] is the expected value for the probability b of a. In the case of this embodiment, since the supervised data is acquired from the supervised data set, the expected value is approximately replaced in the form of sum as shown in the following equation 3.

Note that S and T are a set of one or more data and the corresponding teacher attribution probabilities. Equation 3 is the discrimination loss function in one example of the present embodiment, and the value evaluated for any S and T is the discrimination loss evaluation value.

By reducing Equation 3 with respect to φ, θ1, and θ2, it is possible to obtain a desirable feature extractor 101, a first discriminator 102, and a second discriminator 103 that can output t with respect to s. There are various methods for obtaining such φ, θ1, and θ2. Simply, if the probability functions representing the function F that realizes the feature extractor and the first classifier 102 and the second classifier 103 are differentiable for the respective parameters φ, θ1, and θ2, they are local. It is known that it can be minimized. Therefore, in one example of the present embodiment, the feature extractor 101 is a function that outputs the feature vector f of the data under the input of the data x, is differentiable with respect to φ, and is the first classifier. As the 102 and the second classifier 103, a function that satisfies the conditions that the feature vector f is input and the estimated attribution probabilities y1 and y2 are output and that they are differentiable with respect to θ1 and θ2, respectively, may be selected. ..

[Processing of identification mismatch evaluation unit]
The discrimination mismatch evaluation values of certain estimated attribution probabilities p1 and p2 are expressed by the following equation 4 when p1k and p2k represent the attribution probabilities of the estimated attribution probabilities p1 and p2 for the class k, respectively. Here, K represents the number of known classes to be identified, and K + 1 represents an unknown class that does not correspond to any of the known classes.

The discrimination mismatch evaluation unit 106 evaluates the degree of mismatch of the estimated attribution probabilities y1 and y2 output by the first classifier 102 and the second classifier 103 with respect to the data u of the unsupervised data set 120. That is, the discrimination mismatch evaluation unit 106 uses the discrimination mismatch evaluation value of the estimated attribution probability of the formula 4 as the first classifier for the appearance probability p (u) of the data u of the unsupervised data set shown in the following formula 5. The discrimination mismatch evaluation value Ladv of the estimated attribution probability of 102 and the second classifier 103 is output.

Eb [a] is the expected value for the probability b of a. In the case of this embodiment, since the unsupervised data is obtained from the unsupervised data set, the expected value is approximately replaced in the form of sum as shown in Equation 6 below.

U is one or more data. Equation 6 is the discriminant mismatch degree in one example of the present embodiment, and the value evaluated for any U is the discriminant mismatch evaluation value.

[Unknown class classifier processing]
The estimated attribution probabilities y1 and y2 output by the first classifier 102 and the second classifier 103 for the data x can be expressed by using the above equation 1. The information entropy H (y | x) indicating the ambiguity of the attribution probability y output for the average estimated attribution probabilities y of the estimated attribution probabilities y1 and y2 output by the first classifier 102 and the second classifier 103 is given by the following equation 7. It is expressed as.

The determination of whether or not the untrained data u of the unsupervised data set is unknown class data is determined by whether or not the value of the information entropy shown in Equation 4 is larger than the predetermined threshold σ. _{That is, the identification y u, e} of whether or not the unsupervised data u in the number of iterations e is unknown class data is expressed by the following equation 8.

[When the number of iterations is larger than a certain number]
The process of dividing the unsupervised data set into the known class data set U _I and the unknown class data set U _{O in step S105 will be described.} The identification result as to whether or not the data u of the unsupervised data set is unknown class data when the number of iterations t is t is stored in the unknown class information storage unit 130 as _{y u and t in step S107 described later.} .. In step S104, the identification result of the past T times is read from the unknown class information storage unit 130, and the data u of the unsupervised data set identified as the unknown class data of the past T / 2 times or more is the unknown class data set U. _{Those belonging to O} and other data belong to the known class data set U _I. That is, when the unsupervised data set is U in the number of iterations e, U is divided into a _{known class data set U I} and an unknown class data set U _{O according to the following equations 9 and 10.}

Next, the evaluation process according to step S105 will be described. Regarding the processing of the discrimination loss evaluation unit 104 and the discrimination mismatch evaluation unit 106, substantially the same processing as in step S103, which is the processing when the number of iterations is a certain number or less, is performed.

[Processing of identification loss evaluation unit]
Identifying loss evaluation unit 104, by taking supervised data and the set of teacher membership probability (S, T) and the total sum for the union of the unknown class data set U _O, we obtain the identification loss evaluation value. That is, the evaluation value of the discrimination loss evaluation unit 104 is expressed in the form of the following equation 11.

[Processing of identification mismatch evaluation unit]
The process of identifying discrepancies evaluation unit 106, to the data of the known class data set U _I, by performing the same processing as the evaluation process of the formula 6 of the identification mismatch evaluation unit 106 in step S103, the identification discrepancy evaluation value Ask for. That is, the discrimination mismatch evaluation value output by the discrimination mismatch evaluation unit 106 according to step S105 is obtained by the following equation 12.

[Processing of unknown class classifier]
_{The estimated attribution probabilities y 1} and y ₂ output by the first classifier 102 and the second classifier 103 for the data x can be expressed by using the above equation 1. The average estimated attribution probability y can be obtained from _{the estimated attribution probabilities y 1} and y ₂ output by the first classifier 102 and the second classifier 103. For the determination of whether the unsupervised data u of the unsupervised data set is unknown class data, the attribution probability to the unknown class K + 1 class is the most important among the attribution probabilities for each discrimination class for the average estimated attribution probability y. If it is high data, it is judged as unknown class data, and if not, it is judged as not unknown class data. _{That is, the identification y u, e} of whether or not the unsupervised data u in the number of iterations e is unknown class data is expressed by the following equation 13.

[Learning process]
The learning process of the learning unit 107 according to step S106 will be described. For the feature extractor 101, learning processing is performed so that the values of the discrimination loss evaluation value L _s and the discrimination mismatch evaluation value L _{ad v become smaller.} For the first classifier 102 and the second classifier 103 _{, learning processing is performed so that the discrimination loss evaluation value L s} is small and the discrimination mismatch evaluation value is large. Specifically, the problems shown in Equation 14, Equation 15, and Equation 16 are sequentially optimized.

Here, the functions of the feature extractor 101, the first classifier 102, and the second classifier 103 so that the discrimination loss evaluation value Ls and the discrimination mismatch evaluation value L _adv are differentiable with respect to the parameters θ ₁ , θ _{2, and φ.} Since we chose, it is possible to learn by the error gradient effect method.

The expected effects of the above learning will be explained. First, minimizing the parameters θ ₁ , θ ₂ _{, and φ for L s} has the effect of improving recognition accuracy based on supervised data, similar to general discriminant learning.

For L _adv , learning is performed so that the values of parameters θ ₁ and θ ₂ become large, and the parameters φ are minimized. Details regarding the effect of this learning are as described in Non-Patent Document 4. The distribution of supervised data and the distribution of unsupervised data in the space of features output by the feature extractor 101 come close to each other. By approaching the distribution in the feature space, it becomes possible to recognize the unsupervised data with high accuracy by the discriminator learned from the supervised data.

However, if the distribution of supervised data and the distribution of unsupervised data are brought closer to each other on the feature space by the same learning as in Non-Patent Document 4, the unknown class data among the unsupervised data will also be brought closer. At this time, the unsupervised unknown class data approaches the supervised data and is identified as one of the originally inappropriate classes of known class data. In the present embodiment, the unknown class data is detected, and the unknown class data and the detected data are not used for the evaluation of _{La dv in step S105.} This prevents the above-mentioned inappropriate supervised data distribution and the unknown class data distribution from coming close to each other, and makes it possible to learn to detect that the unknown class data is unknown class data.

[Parameter storage process]
After learning the parameters, the parameters θ ₁ , θ ₂ , and φ are stored in the learning result storage unit 140 in the process according to step S107.

The process of saving the identification result as to whether the unsupervised data is unknown class data in the process of step S108 will be described. As for the identification history of whether the data u of the unsupervised data set in the number of iterations e is unknown class data, y _{u and e} are obtained by the process of step S103 when the number of iterations e is equal to or less than a certain value. Further, as for the identification history of whether the data u of the unsupervised data set in the number of iterations e is unknown class data, y _{u and e} are obtained by the process of step S305 when the number of iterations e is larger than a certain value. .. _{In step S108, the identification results y u and e} are stored in the unknown class information storage unit 130 for each of the data u of the unsupervised data set.

The learning process from steps S101 to S108 may be repeated until the end condition is satisfied.

Arbitrary information may be used for the end condition. For example, "until the value of the objective function is not converted more than a certain amount", "until the accuracy of the evaluation data prepared separately from the training data does not change more than a certain amount", etc. good.

(Modification example)
One or both of the supervised data storage unit 110 and the unsupervised data storage unit 120 may be provided in the learning device 100. Either or both of the unknown class information storage unit 130 and the learning result storage unit 140 may be provided outside the learning device 100. When it is provided externally, data may be acquired by performing communication such as TCP / IP.

The learning device 100 may be mounted using one information processing device, or may be distributed and mounted in a plurality of information processing devices.

As described above, the embodiment of the present invention has been described in detail with reference to the drawings, but the specific configuration is not limited to this embodiment, and the design and the like within a range not deviating from the gist of the present invention are also included.

The present invention is applicable to a learning device.

100 ... Learning device, 101 ... Feature extractor, 102 ... First classifier, 103 ... Second classifier, 104 ... Discrimination loss evaluation unit, 105 ... Unknown class classifier, 106 ... Discrimination mismatch evaluation unit, 107 ... Learning unit , 200 ... Predictor

Claims

A feature extractor that outputs the features of the input data, and
A plurality of classifiers that acquire the attribution probabilities of the data to known and unknown classes based on the features, and
An unknown class classifier that determines whether or not the data is an unknown class based on the attribution probability acquired by the classifier.
A discriminant mismatch evaluation unit that outputs a value of the discriminant discrepancy degree indicating the difference in the attribution probabilities obtained by the plurality of discriminators with respect to the data.
Using data that is not the unknown class and is not given the teacher label, the value of the discriminant discrepancy is reduced for the feature extractor, and the discriminant discriminator for the plurality of discriminators is used. A learning unit that iteratively learns the parameters of the feature extractor and a plurality of the classifiers so as to increase the value.
A learning device equipped with.
Further provided with an identification loss evaluation unit that outputs a value of an identification loss function indicating a value indicating a smaller value as the degree of similarity between the attribution probability and a given teacher attribution probability of the data is higher with respect to the data.
The learning unit uses the data to which the teacher label is attached and the data which is an unknown class and is not given the teacher label to discriminate between the feature extractor and the plurality of discriminators. The learning device according to claim 1, further performing iterative learning of the parameters so as to reduce the value of the loss function.
The learning device according to claim 1 or 2, wherein the unknown class classifier determines based on the attribution probability when the number of times of iterative learning in the learning unit is less than or equal to a predetermined number of times.
The learning device according to any one of claims 1 to 3, wherein the unknown class classifier makes a judgment based on a past judgment result when the number of times of iterative learning in the learning unit is more than a predetermined number of times. ..
A feature extractor that outputs the feature amount of the input data based on the parameter obtained by the learning device according to any one of claims 1 to 4.
A classifier that acquires the probability of attribution of the data to a known class and an unknown class based on the parameter obtained by the learning device according to any one of claims 1 to 4 and the feature amount.
A predictor equipped with.
A feature extraction step that outputs the feature amount of the input data using the feature extractor,
An identification step of acquiring the attribution probabilities of the data to a known class and an unknown class based on the feature quantity using a plurality of classifiers, respectively.
An unknown class identification step for determining whether or not the data is an unknown class based on the acquired attribution probability, and
A discriminant discrepancy evaluation step that outputs a discriminant discrepancy degree value indicating a difference in the attribution probabilities obtained by the plurality of discriminators with respect to the data.
Using data that is not the unknown class and is not given the teacher label, the value of the discriminant discrepancy is reduced for the feature extractor, and the discriminant discriminator for the plurality of discriminators is used. A learning step in which the parameters of the feature extractor and the plurality of classifiers are iteratively learned so as to increase the value.
Learning method with.
A program for operating a computer as the learning device according to any one of claims 1 to 4.