WO2023166688A1 - Machine learning program, information processing device, and machine learning method

Info

Publication number: WO2023166688A1
Application number: PCT/JP2022/009300
Authority: WO
Inventors: 正之廣本; 章中川
Original assignee: 富士通株式会社
Priority date: 2022-03-04
Filing date: 2022-03-04
Publication date: 2023-09-07

Abstract

This machine learning program causes a computer to execute processing for: calculating the probabilities (p(c1), p(c2)) that each of a plurality of latent variables (z), obtained when input data is encoded by an autoencoder, belongs to each of a plurality of distributions (pω(z|ck)); generating a probability distribution (q(z)) for each of the plurality of latent variables (z) on the basis of the plurality of distributions (pω(z|ck)) and the calculated probabilities (p(c1), p(c2)); and optimizing training parameters on the basis of the generated probability distributions (q(z)) such that the encoded information of the plurality of distributions (pω(z|ck)), the encoded information of the plurality of latent variables (z), and the restoration error of the autoencoder are reduced.

Description

Machine learning program, information processing device and machine learning method

The present invention relates to a machine learning program, an information processing device, and a machine learning method.

Representation learning is a machine learning technique that acquires low-dimensional latent representations that represent the features of input data.

FIG. 1 is a block diagram explaining representation learning.

In representation learning, images, voices, texts, and the like are input to the encoder 6, and an output for each task is obtained from the resulting latent representation 601 through the classifier 7 and the like. If a good latent representation 601 is obtained by representation learning, various artificial intelligence (AI) tasks can be realized with high accuracy.

Therefore, the goal of representation learning is to obtain a low-dimensional latent representation 601 that accurately expresses the properties of the input data.

FIG. 2 is a diagram explaining representation learning using a generative model.

In FIG. 2, the data generation process is modeled, and an intermediate latent representation 601 is obtained by training a pair consisting of the encoder 6a and the decoder 6b. The encoder 6a and the decoder 6b are trained so that the error between the input and the output (in other words, the restoration error 602) becomes small.

FIG. 3 is a diagram explaining the generative model deep learning technology.

Generative model-based deep learning technology makes it possible to acquire low-dimensional latent representations while preserving the features of the input data x. By optimizing the following equation, the encoder 61 (f_φ), the decoder 62 (g_θ), and the probability distribution p_ψ of the latent variable z are learned.

Restoration error

At the same time, "rate-distortion optimization" that reduces the encoded information amount R = -log(p_ψ(Z)) is performed to obtain an efficient representation. Note that noise ε~N(0, (β/2)I) is added between the encoder 61 and the decoder 62.

Patent literature: WO2021/059348

FIG. 4 is a diagram explaining an example of using a complex distribution model in the generative model deep learning technology shown in FIG. 3. FIG. 5 is a diagram explaining an example of using a simple distribution model in the generative model deep learning technology shown in FIG. 3.

In the generative model deep learning technology shown in FIG. 3, the latent variable z is assumed to follow a single prior distribution p_ψ. Consequently, when the input data has multiple different contexts, that single distribution has to express a complicated shape.

Here, the context indicates "the situation in which the input data is placed". Context may represent data types, labels, categories, clusters, and groups.

In the examples shown in FIGS. 4 and 5, distribution models for image classification of dogs (see symbol A1 in FIG. 4 and symbol B1 in FIG. 5) and cats (see symbol A2 in FIG. 4 and symbol B2 in FIG. 5) are illustrated, and the type of image (in other words, the attached label) can be regarded as the context. Data belonging to the respective contexts, such as dog and cat, have different properties.

In the example shown in FIG. 4, a complicated distribution model p_ψ(Z) is used, so the number of learning parameters increases. In the example shown in FIG. 5, on the other hand, a simple distribution model p_ψ(Z) is used, which reduces the accuracy of image classification.

In this way, in the generative model deep learning technology described above, either the distribution model used in machine learning becomes more complex or the accuracy of the inference results from machine learning decreases, and the expressive power of representation learning may decrease.

In one aspect, an object is to improve the expressive power of representation learning.

In one aspect, the machine learning program causes a computer to execute a process of: calculating the probability that each of a plurality of latent variables, obtained when input data is encoded by an autoencoder, belongs to each of a plurality of distributions; generating a probability distribution for each of the plurality of latent variables based on the plurality of distributions and the calculated probabilities; and optimizing learning parameters, based on the generated probability distributions, so as to reduce the coded information of the plurality of distributions, the coded information of the plurality of latent variables, and the restoration error by the autoencoder.

In one aspect, it is possible to improve the expressive power of representation learning.

FIG. 1 is a block diagram explaining representation learning.
FIG. 2 is a diagram explaining representation learning using a generative model.
FIG. 3 is a diagram explaining the generative model deep learning technology.
FIG. 4 is a diagram explaining an example of using a complex distribution model in the generative model deep learning technology shown in FIG. 3.
FIG. 5 is a diagram explaining an example of using a simple distribution model in the generative model deep learning technology shown in FIG. 3.
FIG. 6 is a diagram explaining the distribution model used in machine learning in the embodiment.
FIG. 7 is a block diagram explaining machine learning in the embodiment.
FIG. 8 is a block diagram schematically showing a hardware configuration example of the information processing device according to the embodiment.
FIG. 9 is a block diagram schematically showing a software configuration example of the information processing device shown in FIG. 8.
FIG. 10 is a flowchart explaining machine learning in the embodiment.
FIG. 11 is a block diagram explaining machine learning in a modification.

[A] Embodiment

An embodiment will be described below with reference to the drawings. However, the embodiment shown below is merely an example, and is not intended to exclude the application of various modifications and techniques not explicitly described in the embodiment. In other words, the present embodiment can be modified in various ways without departing from its spirit. Also, each drawing is not limited to the constituent elements shown in it, and may include other functions and the like.

[A-1] Configuration Example

FIG. 6 is a diagram illustrating a distribution model used in machine learning in the embodiment.

Machine learning in the embodiment introduces a prior distribution p_ω(z|c_k) (k is a natural number) for each context. The prior probability q(z) of the latent variable is represented by the sum of the prior distributions p_ω(z|c_k) of the contexts, each weighted by the probability p(c_k) of that context.

In the example shown in FIG. 6, the product of the prior distribution p_ω(z|c₁) of the dog context #1 (see symbol C1) and the probability p(c₁) of context #1 is calculated, as is the product of the prior distribution p_ω(z|c₂) of the cat context #2 (see symbol C2) and the probability p(c₂) of context #2. The sum of the product for context #1 and the product for context #2 is the prior probability q(z) = p_ω(z|c₁)×p(c₁) + p_ω(z|c₂)×p(c₂).

As a result, an appropriate prior distribution can be obtained for each context. Then, a simple model can be used for the distribution within the context, and the number of learning parameters can be reduced. In addition, since the overall distribution changes dynamically according to the input, it is possible to improve the accuracy of the inference results.
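The weighted sum of per-context priors can be illustrated with a minimal sketch. It assumes that each prior p_ω(z|c_k) is a one-dimensional Gaussian, which the text does not specify; the function names, means, standard deviations, and context probabilities are hypothetical values chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(z, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_prior(z, context_probs, means, stds):
    """q(z) = sum_k p(c_k) * p_w(z | c_k), with Gaussian components (assumed)."""
    z = np.asarray(z, dtype=float)
    components = np.stack([gaussian_pdf(z, m, s) for m, s in zip(means, stds)])
    return np.tensordot(np.asarray(context_probs, dtype=float), components, axes=1)

# Two contexts (for example "dog" and "cat") with illustrative parameters.
context_probs = [0.7, 0.3]   # p(c_1), p(c_2)
means = [-2.0, 2.0]          # hypothetical component means
stds = [1.0, 1.0]            # hypothetical component standard deviations

z_grid = np.linspace(-10.0, 10.0, 2001)
q = mixture_prior(z_grid, context_probs, means, stds)
```

Each component stays simple (two parameters here), while the shape of q(z) changes with the context probabilities.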

FIG. 7 is a block diagram explaining machine learning in the embodiment.

The information processing device 1 functions as a context classifier 2, an encoder 3, an adder 4 and a decoder 5.

When input data x is input, the encoder 3 (f_φ(x)) outputs data y to the adder 4 and the context classifier 2.

When the data y is input, the context classifier 2 (softmax(h_ψ(y))) outputs the coded information amount H(c|y) of the context, as indicated by symbol D1, and the probabilities of the contexts {p(c₁|y), p(c₂|y), ..., p(c_k|y)}. The coded information amount of the context is represented by the following equation.

From the probabilities of the contexts {p(c₁|y), p(c₂|y), ..., p(c_k|y)} and the prior distributions p_ω(z|c_k), the weighted sum over the contexts is calculated as indicated by symbol D2. The sum over the contexts is the following value: q(z|y) = Σ_k p(c_k|y) × p_ω(z|c_k).
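The outputs of the context classifier can be sketched as follows. The sketch assumes that the coded information amount of the context takes the usual Shannon-entropy form H(c|y) = -Σ_k p(c_k|y) log p(c_k|y); this form, the helper names, and the logit values are assumptions for illustration and are not taken from the text.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def context_entropy(probs, eps=1e-12):
    """Assumed form of H(c|y): -sum_k p(c_k|y) * log p(c_k|y), in nats."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

# Hypothetical classifier output h_psi(y) for one sample and three contexts.
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)              # {p(c_1|y), p(c_2|y), p(c_3|y)}
h_c_given_y = context_entropy(probs)
```

Under this assumption the quantity is small when the classifier assigns y confidently to one context and is largest, log K, when all K contexts are equally likely.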

The adder 4 adds noise ε~N(0,(β/2)I) to the data y output from the encoder 3, and outputs the latent variable z.

As indicated by symbol D4, the coded information amount D_KL(p(z|x)||q(z|y)) is calculated.
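One way to evaluate this quantity numerically is a Monte Carlo estimate. Because z = y + ε with ε~N(0, (β/2)I), p(z|x) is a Gaussian centered on y; the sketch below additionally assumes isotropic Gaussian priors for each context, which the text does not state. All function names and numeric values are hypothetical.

```python
import numpy as np

def log_gaussian(z, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I); z has shape (n, d)."""
    d = z.shape[-1]
    sq = np.sum((z - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi * var) + sq / var)

def log_mixture(z, weights, means, variances):
    """log q(z|y) = log sum_k p(c_k|y) * N(z; mean_k, var_k * I), via log-sum-exp."""
    terms = np.stack([np.log(w) + log_gaussian(z, m, v)
                      for w, m, v in zip(weights, means, variances)])
    top = np.max(terms, axis=0)
    return top + np.log(np.sum(np.exp(terms - top), axis=0))

def kl_monte_carlo(y, beta, weights, means, variances, n_samples=20000, seed=0):
    """Estimate D_KL(p(z|x) || q(z|y)) with p(z|x) = N(y, (beta/2) * I)."""
    rng = np.random.default_rng(seed)
    var = beta / 2.0
    z = y + rng.normal(0.0, np.sqrt(var), size=(n_samples, y.shape[0]))
    return np.mean(log_gaussian(z, y, var) - log_mixture(z, weights, means, variances))

y = np.array([1.8, -0.1])                                # hypothetical encoder output
means = [np.array([2.0, 0.0]), np.array([-2.0, 0.0])]    # hypothetical prior means
kl = kl_monte_carlo(y, beta=0.5, weights=[0.9, 0.1],
                    means=means, variances=[1.0, 1.0])
```

The estimate is an average of log p(z|x) - log q(z|y) over samples drawn from p(z|x); a closed form is not available for a mixture prior, which is why sampling is used in this sketch.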

When the latent variable z is input, the decoder 5 (g_θ(z)) generates and outputs the output data.

Then, as indicated by symbol D5, the restoration error is calculated based on the input data and the output data.

The information processing device 1 learns the encoder 3 (f_φ), the decoder 5 (g_θ), the context classifier 2 (h_ψ), and the prior distribution p_ω for each context by optimizing the following two formulas. By reducing the restoration error, the information amount of L_{θ,ψ,ω}(x) is reduced.

FIG. 8 is a block diagram schematically showing a hardware configuration example of the information processing device 1 according to the embodiment.

As shown in FIG. 8, the information processing device 1 includes a CPU 11, a memory unit 12, a display control unit 13, a storage device 14, an input interface (IF) 15, an external recording medium processing unit 16, and a communication IF 17.

The memory unit 12 is an example of a storage unit, and is exemplified by Read Only Memory (ROM) and Random Access Memory (RAM). A program such as a Basic Input/Output System (BIOS) may be written in the ROM of the memory unit 12. The software programs in the memory unit 12 may be read into the CPU 11 and executed as appropriate. The RAM of the memory unit 12 may be used as a temporary recording memory or a working memory.

The display control unit 13 is connected to the display device 131 and controls the display device 131. The display device 131 is a liquid crystal display, an organic light-emitting diode (OLED) display, a cathode ray tube (CRT), an electronic paper display, or the like, and displays various information for an operator or the like. The display device 131 may be combined with an input device, as in a touch panel.

The storage device 14 is a storage device with high IO performance, and may be, for example, Dynamic Random Access Memory (DRAM), SSD, Storage Class Memory (SCM), or HDD.

The input IF 15 may be connected to input devices such as the mouse 151 and the keyboard 152 and may control those input devices. The mouse 151 and the keyboard 152 are examples of input devices, and the operator performs various input operations via these input devices.

The external recording medium processing unit 16 is configured so that the recording medium 160 can be attached. The external recording medium processing unit 16 is configured to be able to read information recorded on the recording medium 160 when the recording medium 160 is attached. In this example, the recording medium 160 has portability. For example, the recording medium 160 is a flexible disk, optical disk, magnetic disk, magneto-optical disk, or semiconductor memory.

The communication IF 17 is an interface for enabling communication with external devices.

The CPU 11 is an example of a processor, and is a processing device that performs various controls and calculations. The CPU 11 implements various functions by executing an operating system (OS) and programs read into the memory unit 12. Note that the CPU 11 may be a multiprocessor including a plurality of CPUs, a multicore processor having a plurality of CPU cores, or a configuration having a plurality of multicore processors.

A device for controlling the operation of the entire information processing device 1 is not limited to the CPU 11, and may be, for example, any one of MPU, DSP, ASIC, PLD, and FPGA. Also, the device for controlling the operation of the entire information processing device 1 may be a combination of two or more of CPU, MPU, DSP, ASIC, PLD and FPGA. Note that MPU is an abbreviation for Micro Processing Unit, DSP is an abbreviation for Digital Signal Processor, and ASIC is an abbreviation for Application Specific Integrated Circuit. PLD is an abbreviation for Programmable Logic Device, and FPGA is an abbreviation for Field Programmable Gate Array.

FIG. 9 is a block diagram schematically showing a software configuration example of the information processing device 1 shown in FIG. 8.

The CPU 11 of the information processing apparatus 1 shown in FIG. 8 functions as a probability calculation unit 111, a distribution generation unit 112, and a learning processing unit 113.

The probability calculation unit 111 calculates, for each of the plurality of latent variables obtained when the input data is encoded by the autoencoder, the probability that the latent variable belongs to each of the plurality of distributions (in other words, contexts).

The autoencoder may be a combination of an encoder 3 and a decoder 5 that transform between the space to which the input and output data belong and the space to which the latent variables belong.

The distribution generation unit 112 generates a probability distribution for each of the plurality of latent variables based on the plurality of distributions and the probabilities calculated by the probability calculation unit 111.

The probability distribution may be the sum of the plurality of distributions, each weighted by its probability.

Based on the probability distribution generated by the distribution generation unit 112, the learning processing unit 113 optimizes the learning parameters so that the coded information amount of the plurality of distributions (in other words, contexts), the coded information amount of the latent variables, and the restoration error by the autoencoder are reduced.

The coded information amount of the plurality of distributions may be learned by the context classifier 2 for classifying the context of the input data. The learning parameters may be the parameters of the encoder 3, the parameters of the decoder 5, the parameters of the context classifier 2, and/or the parameters of the prior distributions.

[A-2] Operation Example

Machine learning in the embodiment will be described according to the flowchart (steps S1 to S12) shown in FIG. 10.

Input data x is converted into latent variable y by encoder 3 (step S1).

The noise ε is added to the latent variable y by the adder 4 to obtain the latent variable z = y + ε (step S2).

The latent variable z is converted into output data by the decoder 5 (step S3).

The restoration error is calculated (step S4).

The loss function L_{θ,φ,ψ,ω}(x) is calculated (step S5).

The parameters θ, φ, ψ, and ω are updated so that the loss function L_{θ,φ,ψ,ω}(x) becomes smaller (step S6).

It is determined whether learning has converged (step S7).

If learning has not converged (see NO route in step S7), the process returns to step S1.

On the other hand, when learning converges (see YES route in step S7), machine learning ends.

The processing in steps S8 to S12 below may be executed in parallel with the processing in steps S2 to S4 described above.

When the latent variable y is determined in step S1, the probability p(c_k|y) of each context to which the latent variable y may belong is calculated by the context classifier 2 (step S8).

The coded information amount H(c|y) of the context is calculated from the probabilities p(c_k|y) of the contexts (step S9). The coded information amount H(c|y) of the context is then used for the processing in step S5.

Also, the prior distribution p_ω(z|c_k) for each context is weighted by the probability p(c_k|y) of that context (step S10).

The weighted prior distributions of the contexts are superimposed (summed) to generate the prior distribution q(z|y) (step S11).

Based on the latent variable z obtained in step S2 and the prior distribution q(z|y) generated in step S11, the coded information amount D_KL of the latent variable is calculated (step S12). The coded information amount D_KL of the latent variable is then used for the processing in step S5.
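Steps S1 to S5 and S8 to S12 can be traced in a single forward pass, as in the following minimal sketch. The linear encoder, decoder, and classifier, the unit-variance Gaussian priors, the squared-error restoration error, the entropy form of H(c|y), and the way the three terms are combined with a weight lam are all assumptions made for illustration; the text does not give these details. The parameter update of step S6 is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT, K, BETA = 8, 2, 3, 0.5   # illustrative sizes and noise scale

# Hypothetical linear stand-ins for the learned networks and priors.
W_enc = rng.normal(size=(D_LAT, D_IN)) * 0.1   # encoder 3 (f_phi)
W_dec = rng.normal(size=(D_IN, D_LAT)) * 0.1   # decoder 5 (g_theta)
W_ctx = rng.normal(size=(K, D_LAT))            # context classifier 2 (h_psi)
prior_means = rng.normal(size=(K, D_LAT))      # p_omega(z|c_k) = N(mean_k, I), assumed

def log_normal(z, mean, var):
    """Log density of N(mean, var * I) at a single vector z."""
    return -0.5 * (z.size * np.log(2 * np.pi * var) + np.sum((z - mean) ** 2) / var)

def forward_loss(x, lam=1.0):
    """One forward pass returning a scalar loss for input vector x."""
    y = W_enc @ x                                               # S1: encode
    z = y + rng.normal(0.0, np.sqrt(BETA / 2), size=y.shape)    # S2: add noise
    x_hat = W_dec @ z                                           # S3: decode
    distortion = np.sum((x - x_hat) ** 2)                       # S4: squared error (assumed)
    logits = W_ctx @ y
    p_ctx = np.exp(logits - logits.max())
    p_ctx /= p_ctx.sum()                                        # S8: p(c_k|y)
    h_ctx = -np.sum(p_ctx * np.log(p_ctx + 1e-12))              # S9: H(c|y) (assumed form)
    comp = np.array([log_normal(z, m, 1.0) for m in prior_means])
    log_q = np.log(np.sum(p_ctx * np.exp(comp)))                # S10-S11: log q(z|y)
    kl = log_normal(z, y, BETA / 2) - log_q                     # S12: one-sample estimate
    return h_ctx + kl + lam * distortion                        # S5: combination assumed

loss = forward_loss(rng.normal(size=D_IN))
```

In practice an automatic-differentiation framework would compute the gradients of this loss with respect to θ, φ, ψ, and ω for step S6, and the loop would repeat until the convergence check of step S7 succeeds.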

[B] Modification

FIG. 11 is a block diagram illustrating machine learning in a modification.

The information processing device 1a in the modification executes supervised class classification processing. The information processing device 1a functions as an encoder 3a (f_φ(x)), an encoder 3b (f_φ(x)), and a class classifier 3c.

The encoder 3a converts the input data x into a latent variable y and inputs it to the class classifier 3c.

The encoder 3b converts the training data 141 into learning data and inputs it to the class classifier 3c.

The class classifier 3c outputs a class estimation result based on the latent variable y from the encoder 3a and the learning data from the encoder 3b.

Thus, in the modified example, input data is converted into an embedded representation by the encoder 3b that has learned in the same manner as in the above-described embodiment, and the class classifier 3c learns from the embedded representation. Then, the class of unknown data is estimated by the learned class classifier 3c.

As a result, the accuracy of the class estimation result is improved compared to the case of directly learning the class classifier 3c for the input.
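The idea of training a classifier on embedded representations rather than on raw inputs can be sketched as below. The fixed averaging matrix standing in for the trained encoder, the least-squares linear classifier, and the synthetic two-class data are all hypothetical choices for illustration; the text does not specify the form of the class classifier 3c.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_enc):
    """Stand-in for the trained encoder f_phi; a fixed linear map is assumed."""
    return x @ W_enc.T

def add_bias(emb):
    """Append a constant column so the linear classifier has a bias term."""
    return np.hstack([emb, np.ones((emb.shape[0], 1))])

def fit_linear_classifier(emb, labels, n_classes):
    """Least-squares fit of one-hot targets on the embeddings."""
    targets = np.eye(n_classes)[labels]
    weights, *_ = np.linalg.lstsq(add_bias(emb), targets, rcond=None)
    return weights

def predict(emb, weights):
    """Pick the class with the largest linear score."""
    return np.argmax(add_bias(emb) @ weights, axis=1)

# Hypothetical frozen encoder: averages two groups of three input features.
W_enc = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float) / 3.0

# Synthetic two-class data standing in for the training data 141.
x_train = np.vstack([rng.normal(loc=-2.0, size=(50, 6)),
                     rng.normal(loc=2.0, size=(50, 6))])
y_train = np.array([0] * 50 + [1] * 50)

emb_train = encode(x_train, W_enc)
weights = fit_linear_classifier(emb_train, y_train, n_classes=2)
accuracy = np.mean(predict(emb_train, weights) == y_train)
```

The encoder is kept fixed while only the classifier weights are fitted, which mirrors the separation between the encoders 3a and 3b and the class classifier 3c.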

[C] Effects

According to the machine learning program, the information processing device 1, and the machine learning method of the above-described embodiment, the following effects can be obtained, for example.

The probability calculation unit 111 calculates, for each of the plurality of latent variables obtained when the input data is encoded by the autoencoder, the probability that the latent variable belongs to each of the plurality of distributions. The distribution generation unit 112 generates a probability distribution for each of the plurality of latent variables based on the plurality of distributions and the probabilities calculated by the probability calculation unit 111. Based on the probability distribution generated by the distribution generation unit 112, the learning processing unit 113 optimizes the learning parameters so that the coded information amount of the plurality of distributions, the coded information amount of the latent variables, and the restoration error by the autoencoder are reduced.

As a result, it is possible to improve the expressive power of representation learning. For example, accuracy in image classification tasks can be improved. When latent representations of images are acquired by unsupervised learning on the MNIST dataset, and a linear classifier is then trained on those latent representations by supervised learning and evaluated, the accuracy of the inference results can be improved compared to existing methods (e.g., GMM).

[D] Others

The technology disclosed herein is not limited to the above-described embodiment, and various modifications can be made without departing from the spirit of the embodiment. Each configuration and each process of this embodiment can be selected or discarded as necessary, or may be combined as appropriate.

1, 1a: Information processing device
2: Context classifier
3, 3a, 3b, 6, 6a, 61: Encoder
3c: Class classifier
4: Adder
5, 6b, 62: Decoder
7: Classifier
11: CPU
12: Memory unit
13: Display control unit
14: Storage device
15: Input IF
16: External recording medium processing unit
17: Communication IF
111: Probability calculation unit
112: Distribution generation unit
113: Learning processing unit
131: Display device
141: Training data
151: Mouse
152: Keyboard
160: Recording medium
601: Latent representation

Claims

1. A machine learning program that causes a computer to execute a process comprising:
calculating, for each of a plurality of latent variables obtained when input data is encoded by an autoencoder, a probability that the latent variable belongs to each of a plurality of distributions;
generating a probability distribution for each of the plurality of latent variables based on the plurality of distributions and the calculated probabilities; and
optimizing learning parameters, based on the generated probability distributions, so as to reduce the coded information of the plurality of distributions, the coded information of the plurality of latent variables, and the restoration error by the autoencoder.
2. The machine learning program according to claim 1, wherein the autoencoder is a combination of an encoder and a decoder that transform between the space to which the input data and the output data of the autoencoder belong and the space to which the plurality of latent variables belong.
3. The machine learning program according to claim 1 or 2, wherein the coded information of the plurality of distributions is learned by a context classifier for classifying the context of the input data.
4. The machine learning program according to claim 3, wherein the learning parameters are at least one of parameters of the autoencoder, parameters of the context classifier, and parameters of the probability distribution.
5. The machine learning program according to any one of claims 1 to 4, wherein the probability distribution is calculated by summing the plurality of distributions, each weighted by the probability.
6. The machine learning program according to any one of claims 1 to 5, further causing the computer to execute a process comprising:
converting the input data into an embedded representation by an encoder trained with the learning parameters; and
training a classifier for classifying unknown data with the converted embedded representation.
7. An information processing device comprising a processor configured to:
calculate, for each of a plurality of latent variables obtained when input data is encoded by an autoencoder, a probability that the latent variable belongs to each of a plurality of distributions;
generate a probability distribution for each of the plurality of latent variables based on the plurality of distributions and the calculated probabilities; and
optimize learning parameters, based on the generated probability distributions, so as to reduce the coded information of the plurality of distributions, the coded information of the plurality of latent variables, and the restoration error by the autoencoder.
8. The information processing device according to claim 7, wherein the autoencoder is a combination of an encoder and a decoder that transform between the space to which the input data and the output data of the autoencoder belong and the space to which the plurality of latent variables belong.
9. The information processing device according to claim 7 or 8, wherein the coded information of the plurality of distributions is learned by a context classifier for classifying the context of the input data.
10. The information processing device according to claim 9, wherein the learning parameters are at least one of parameters of the autoencoder, parameters of the context classifier, and parameters of the probability distribution.
11. The information processing device according to any one of claims 7 to 10, wherein the probability distribution is calculated by summing the plurality of distributions, each weighted by the probability.
12. The information processing device according to any one of claims 7 to 11, wherein the processor is further configured to:
convert the input data into an embedded representation by an encoder trained with the learning parameters; and
train a classifier for classifying unknown data with the converted embedded representation.
13. A machine learning method in which a computer executes a process comprising:
calculating, for each of a plurality of latent variables obtained when input data is encoded by an autoencoder, a probability that the latent variable belongs to each of a plurality of distributions;
generating a probability distribution for each of the plurality of latent variables based on the plurality of distributions and the calculated probabilities; and
optimizing learning parameters, based on the generated probability distributions, so as to reduce the coded information of the plurality of distributions, the coded information of the plurality of latent variables, and the restoration error by the autoencoder.
14. The machine learning method according to claim 13, wherein the autoencoder is a combination of an encoder and a decoder that transform between the space to which the input data and the output data of the autoencoder belong and the space to which the plurality of latent variables belong.
15. The machine learning method according to claim 13 or 14, wherein the coded information of the plurality of distributions is learned by a context classifier for classifying the context of the input data.
16. The machine learning method according to claim 15, wherein the learning parameters are at least one of parameters of the autoencoder, parameters of the context classifier, and parameters of the probability distribution.
17. The machine learning method according to any one of claims 13 to 16, wherein the probability distribution is calculated by summing the plurality of distributions, each weighted by the probability.
18. The machine learning method according to any one of claims 13 to 17, wherein the computer further executes a process comprising:
converting the input data into an embedded representation by an encoder trained with the learning parameters; and
training a classifier for classifying unknown data with the converted embedded representation.