WO2019194128A1 - Model learning device, model learning method, and program - Google Patents

Model learning device, model learning method, and program Download PDF

Info

Publication number
WO2019194128A1
WO2019194128A1 (PCT/JP2019/014476)
Authority
WO
WIPO (PCT)
Prior art keywords
model
model parameter
output
learning
model learning
Prior art date
Application number
PCT/JP2019/014476
Other languages
French (fr)
Japanese (ja)
Inventor
崇史 森谷
山口 義和
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Publication of WO2019194128A1 publication Critical patent/WO2019194128A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present invention relates to a model learning technique using a neural network.
  • Non-Patent Document 1 discloses a method of learning an acoustic model used for speech recognition using a neural network. In particular, the details are disclosed in Section II, “TRAINING DEEP NEURAL NETWORKS”, of Non-Patent Document 1.
  • FIG. 5 is a block diagram illustrating a configuration of the model learning apparatus 900.
  • FIG. 6 is a flowchart showing the operation of the model learning apparatus 900.
  • the model learning apparatus 900 includes a feature amount processing unit 920, a model learning unit 930, and a recording unit 990.
  • the recording unit 990 is a component that appropriately records information necessary for processing of the model learning device 900.
  • the initial value of the model parameter ⁇ is recorded in advance.
  • the model parameter ⁇ generated in the learning process is recorded as appropriate.
  • the initial value of the model parameter ⁇ may be generated using a random number, or a model parameter generated using data different from the data used for the current learning may be used.
  • the feature quantity processing unit 920 includes an intermediate feature quantity calculation unit 921 and an output probability distribution calculation unit 922.
  • before learning starts, feature quantities are extracted from the input data serving as learning data (speech data in Non-Patent Document 1) and prepared.
  • each feature quantity is expressed as a real-valued vector.
  • when the input data is speech data, an example of the feature quantity is the FBANK (filter-bank log power) extracted for each frame (usually about 20 ms to 40 ms) into which the speech data is divided.
  • a correct output number, i.e. a number identifying the correct output corresponding to the feature quantity, is also prepared.
  • a pair of a feature quantity and its correct output number is an input to the model learning apparatus 900.
  • such a pair of a feature quantity and a correct output number is called training data.
  • the number of output types corresponding to a feature quantity is M (M is an integer of 1 or more); each output type is assigned a number (hereinafter, an output number) from 1 to M, and an output is identified using the output number m (1 ≤ m ≤ M, i.e. m is an index representing the output number).
  • the model learning device 900 learns the model parameter Ω from the training data (that is, pairs of a feature quantity and a correct output number).
  • when a deep neural network (DNN) is used, the model parameter Ω consists of the weights and biases of each layer.
  • the intermediate feature amount calculation unit 921 is a configuration unit that executes calculation in each layer from the input layer to the final hidden layer.
  • the output probability distribution calculation unit 922 is a component that executes output calculation in the output layer. Therefore, in this case, the model parameter ⁇ learned by the model learning apparatus 900 is a DNN model parameter that characterizes the intermediate feature amount calculation unit 921 and the output probability distribution calculation unit 922.
  • before learning starts, the model learning apparatus 900 sets the initial value of the model parameter Ω recorded in the recording unit 990 in the intermediate feature amount calculation unit 921 and the output probability distribution calculation unit 922. During learning, each time the model learning unit 930 performs the optimization calculation (that is, updates the model parameter Ω so as to optimize it), the model learning apparatus 900 sets the calculated model parameter Ω in these two units. The next training data is thus processed using the intermediate feature amount calculation unit 921 and the output probability distribution calculation unit 922 characterized by the newly calculated model parameter Ω.
  • using the model parameter Ω, the feature quantity processing unit 920 calculates, from the feature quantity extracted from the input data, the output probability distribution p = (p1, ..., pM), where pm is the probability that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M) (S920).
  • the intermediate feature amount calculation unit 921 calculates an intermediate feature amount from the input feature amount (S921).
  • the processing here corresponds to the calculation of Equation (1) in Non-Patent Document 1.
  • the intermediate feature amount corresponds to the output feature amount of the final hidden layer of the DNN being learned.
  • the output probability distribution calculation unit 922 calculates the output probability distribution p from the intermediate feature amount calculated in S921 (S922).
  • the processing here corresponds to the calculation of Equation (2) in Non-Patent Document 1.
  • the output probability distribution p corresponds to the output feature amount of the output layer of the DNN being learned.
  • the model learning unit 930 learns the model parameter Ω using the output probability distribution p calculated in S920 and the correct output number, i.e. the number identifying the correct output corresponding to the feature quantity input in S920 (S930). For example, the model parameter Ω is optimized so as to decrease the value of a loss function C defined from the output probability distribution p and the correct probability distribution d.
  • the processing here corresponds to the calculation of Equation (3) or Equation (4) in Non-Patent Document 1.
  • d = (d1, ..., dM) is the correct probability distribution, whose element dm is 1 when m equals the correct output number and 0 otherwise.
  • the model learning apparatus 900 repeats the processes of S920 to S930 for the number of training data (generally a very large number of tens of millions to hundreds of millions).
  • the model learning device 900 outputs the model parameter ⁇ at the time when this repetition is completed.
  • Non-Patent Document 2 discloses a learning method that can reduce the model size (number of model parameters) in a neural network.
  • the model learning apparatus 901 corresponding to the model learning of Non-Patent Document 2 will be described below with reference to FIGS. 5 and 6.
  • FIG. 5 is a block diagram illustrating a configuration of the model learning device 901.
  • FIG. 6 is a flowchart showing the operation of the model learning device 901.
  • the model learning device 901 includes a feature amount processing unit 920, a model learning unit 931, and a recording unit 990.
  • model learning device 901 differs from the model learning device 900 only in that it includes a model learning unit 931 instead of the model learning unit 930.
  • the model learning unit 931 learns the model parameter ⁇ using the output probability distribution p calculated in S920 and the correct output number that is a number for identifying the correct output corresponding to the feature quantity that is the input in S920. (S931).
  • the model parameter ⁇ is optimized using a loss function L ( ⁇ ) defined by the following equation.
  • E ( ⁇ ) is an error term indicating an error between the output probability distribution calculated from the feature value using the model parameter ⁇ and the correct output, and is a term corresponding to the above-described loss function C.
  • R ( ⁇ ) is a regular parameter
  • the real number ⁇ is a hyperparameter for adjusting the influence of the regularization term R ( ⁇ ).
  • the model learning unit 931 learns the model parameter ⁇ by using the loss function L ( ⁇ ) obtained by adding the regularization term R ( ⁇ ) (scalar multiple) to the error term E ( ⁇ ), so that the model parameter ⁇ Learning is performed so that the values of some elements are close to 0 (the model becomes sparse).
  • model parameter ⁇ when a part of the elements of the model parameter ⁇ is 0 or a value close to 0, the model parameter ⁇ is said to have sparsity.
  • the model learning unit 931 learns the model parameter ⁇ having sparsity using the loss function L ( ⁇ ) including the regularization term R ( ⁇ ).
  • in Non-Patent Document 2, regularization terms called Ridge (L2) and Group Lasso are used.
  • for example, when only the weight parameter Wl in layer l (l is an integer identifying a layer of the neural network) is updated, the Ridge (L2) regularization term RL2(Wl) is the sum of the squares of all elements of the weight parameter between the l-th layer and the (l-1)-th layer, and the Group Lasso regularization term Rgroup(Wl) represents the sum of (the absolute values of) the weights connecting one unit of the l-th layer with all the units (j = 1, ..., Nl-1) of the (l-1)-th layer.
  • when Group Lasso is used, the model parameter Ω can be grouped arbitrarily for learning. In Non-Patent Document 2, when the model parameter Ω is expressed as a matrix, the rows or columns of the matrix are used as the grouping units (groups). Furthermore, by learning with the rows of the matrix as the grouping unit and deleting, from the model parameter Ω at the end of learning, the elements of every group whose row-wise norm is smaller than a predetermined threshold, the model size is reduced.
  • a regularization term is originally used to avoid overfitting, but, depending on the purpose, various regularization terms other than the regularization term RL2(Wl) and the regularization term Rgroup(Wl) of Non-Patent Document 2 can also be defined and used.
  • the learning method of Non-Patent Document 1 assumes that the model is learned in a single domain (for example, in the case of speech recognition, learning is performed using speech data collected on the premise that conditions such as background noise, recording equipment, and speaking style are the same).
  • therefore, if a model learned using data of a certain domain (domain 1) is taken as the initial model and additional learning is performed using data of another domain (domain 2), and the resulting model is then used to perform recognition on domain-1 data, the accuracy may deteriorate significantly.
  • this property of neural network learning is called catastrophic forgetting. In general, to prevent catastrophic forgetting (i.e., to learn additionally without impairing the performance of a learned model corresponding to existing knowledge), the model must be re-learned using the data of both domain 1 and domain 2, so the cost in terms of learning time is very high.
  • an object of the present invention is therefore to provide a model learning technique that can additionally learn using data of another domain without impairing the performance of a model learned using data of a certain domain.
  • One aspect of the present invention includes: a setup unit that generates a mask from a learned model parameter that is the initial value of a model parameter Ω to be learned; a feature quantity processing unit that uses the model parameter Ω to calculate, from a feature quantity extracted from input data in a domain different from the domain used for learning the learned model parameter, an output probability distribution, i.e. the distribution of the probability pm that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M); and a model learning unit that learns the model parameter Ω using the mask, the output probability distribution, and a correct output number, i.e. a number identifying the correct output corresponding to the feature quantity.
  • let L(Ω) be the loss function used when learning the model parameter Ω and μ be a real number. The setup unit calculates the mask element γ corresponding to an element ω of the model parameter Ω from a threshold θ as γ = 1 if |ω(0)| < θ and γ = 0 otherwise, where ω(0) is the initial value of the element ω.
  • the model learning unit calculates the update difference δ(ω) = -μ · γ · ∂L(Ω)/∂ω of the element ω of the model parameter Ω and updates the element ω as ω ← ω + δ(ω).
  • FIG. 3 is a diagram illustrating an example of the configuration of the setup unit 110.
  • FIG. 5 is a diagram illustrating an example of the configuration of the model learning apparatus 900/901.
  • FIG. 1 is a block diagram illustrating a configuration of the model learning device 100.
  • FIG. 2 is a flowchart showing the operation of the model learning device 100.
  • the model learning device 100 includes a setup unit 110, a feature amount processing unit 920, a model learning unit 130, and a recording unit 990.
  • the recording unit 990 is a component that appropriately records information necessary for processing of the model learning device 100.
  • the initial value of the model parameter ⁇ is recorded in advance.
  • the initial value of the model parameter Ω is a learned model parameter, for example one learned by the model learning device 900 or the model learning device 901, using training data consisting of pairs of a feature quantity extracted from input data in a certain domain (hereinafter, domain 1) and the correct output number, i.e. the number identifying the correct output corresponding to that feature quantity. Therefore, when a learned model parameter learned by the model learning device 901 is used, the learned model parameter has sparsity.
  • hereinafter, the learned model parameter is denoted Ω(0) and its elements ω(0).
  • the model learning device 100 learns the model parameter Ω from training data consisting of pairs of a feature quantity extracted from input data in a domain (hereinafter, domain 2) different from the domain used for learning the learned model parameter (that is, domain 1) and the correct output number, i.e. the number identifying the correct output corresponding to that feature quantity.
  • before learning starts, the model learning device 100 sets the initial value of the model parameter Ω (that is, the learned model parameter) recorded in the recording unit 990 in the feature amount processing unit 920 (the intermediate feature amount calculation unit 921 and the output probability distribution calculation unit 922). During learning, each time the model learning unit 130 performs the optimization calculation (that is, updates the model parameter Ω so as to optimize it), the model learning device 100 sets the calculated model parameter Ω in the feature amount processing unit 920.
  • the setup unit 110 generates a mask from the learned model parameter that is the initial value of the model parameter ⁇ to be learned, which is recorded in the recording unit 990 (S110).
  • the setup unit 110 will be described below with reference to FIGS. 3 and 4.
  • FIG. 3 is a block diagram illustrating a configuration of the setup unit 110.
  • FIG. 4 is a flowchart showing the operation of the setup unit 110.
  • the setup unit 110 includes a threshold value determination unit 111 and a mask generation unit 112. The operation of the setup unit 110 will be described with reference to FIG.
  • the threshold determination unit 111 determines the threshold θ from the learned model parameter (S111). Any determination method may be used as long as it determines the threshold θ so that a predetermined number of elements whose absolute values are close to 0 are extracted from the elements of the learned model parameter. For example, a frequency distribution of the values of the learned model parameter elements is created, and the threshold θ is determined so that the proportion of model parameter elements whose absolute value is close to 0 becomes, say, 25% (hereinafter, determination method 1). Alternatively, a frequency distribution of values calculated for each group into which the learned model parameter elements are grouped is created, and a value between two of those values (for example, the average of the two values) is determined as the threshold θ (hereinafter, determination method 2).
  • for example, when the learned model parameter is expressed as a matrix, the rows (or columns) of the matrix are taken as groups, a frequency distribution of the norms of the row vectors (or column vectors) of the groups is created, and a value between two of the norm values can be determined as the threshold θ. That is, determination method 1 determines the threshold θ based on the frequency distribution of the values of the learned model parameter elements, while determination method 2 determines the threshold θ based on the frequency distribution of values calculated for each group into which the learned model parameter elements are grouped.
  • the mask generation unit 112 generates a mask Γ from the learned model parameter using the threshold θ determined in S111 (S112). The mask Γ is generated as follows.
  • the element γ of the mask Γ corresponding to an element ω of the model parameter Ω is set to 1 when the absolute value of the learned model parameter element ω(0) is smaller than the threshold θ (at most the threshold θ), and to 0 otherwise. That is, the mask element γ corresponding to the element ω of the model parameter Ω is computed from the threshold θ as γ = 1 if |ω(0)| < θ and γ = 0 otherwise.
  • when the model parameter Ω is expressed as a matrix, the mask Γ is expressed as a matrix of the same size as the matrix representing the model parameter Ω, in which every element is 0 or 1.
  • using the model parameter Ω, the feature quantity processing unit 920 calculates, from the feature quantity extracted from the input data in domain 2, the output probability distribution p = (p1, ..., pM), where pm is the probability that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M) (S920).
  • the model learning unit 130 learns the model parameter Ω using the mask Γ generated in S110, the output probability distribution p calculated in S920, and the correct output number, i.e. the number identifying the correct output corresponding to the feature quantity input in S920 (S130).
  • the model parameter ⁇ is optimized using the loss function L ( ⁇ ) defined by the formula (1) or the formula (2).
  • the update difference ⁇ ( ⁇ ) of the element ⁇ of the model parameter ⁇ is calculated by the equation (3), and the element ⁇ is updated by the equation (4).
  • μ is a (positive) real number representing the learning rate, a parameter that adjusts the degree to which the model parameter is updated.
  • ∂L(Ω)/∂ω denotes the gradient of the loss function L(Ω) with respect to the element ω. Note that the gradient ∂L(Ω)/∂ω is also used for learning in the model learning device 900 and the model learning device 901.
  • model parameter ⁇ and the mask ⁇ are represented by a matrix
  • the optimization calculation of the model parameter ⁇ is performed as follows using a Hadamard product. Can be represented.
  • each element of the model parameter ⁇ can be effectively set to a value close to 0.
  • the model learning device 100 repeats the processing of S920 to S130 as many times as the number of training data, and outputs the finally calculated model parameter ⁇ .
  • the apparatus of the present invention has, for example, as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit; it may include a cache memory, registers, and the like), a RAM and a ROM as memories, an external storage device such as a hard disk, and a bus connecting these input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them.
  • if necessary, the hardware entity may also be provided with a device (drive) capable of reading from and writing to a recording medium such as a CD-ROM.
  • a physical entity having such hardware resources is, for example, a general-purpose computer.
  • the external storage device of the hardware entity stores the programs necessary for realizing the functions described above and the data necessary for processing these programs (the storage is not limited to the external storage device; for example, the programs may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.
  • in the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data necessary for processing each program are read into memory as necessary and interpreted, executed, and processed by the CPU as appropriate.
  • as a result, the CPU realizes predetermined functions (the constituent elements expressed above as ...unit, ...means, and the like).
  • when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiment are realized by a computer, the processing contents of the functions that the hardware entity should have are described by a program, and by executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
  • the program describing the processing contents can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.
  • as a magnetic recording device, a hard disk device, a flexible disk, a magnetic tape, or the like can be used; as an optical disc, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like; as a magneto-optical recording medium, an MO (Magneto-Optical disc) or the like; and as a semiconductor memory, an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) or the like.
  • this program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.
  • a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own storage device and executes processing according to the read program.
  • as other execution forms of the program, the computer may read the program directly from the portable recording medium and execute processing according to the program, or, each time a program is transferred from the server computer to the computer, processing according to the received program may be executed sequentially.
  • alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer.
  • the program in the present embodiment includes information that is used for processing by an electronic computer and that conforms to a program (such as data that is not a direct command to the computer but has a property that defines the processing of the computer).
  • in this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least a part of the processing contents may be realized as hardware.

Abstract

Provided is a model learning technique with which it is possible, without impairing the performance of a model learned using the data of a given domain, to additionally learn using the data of another domain. The present invention includes: a setup unit for generating a mask from a learned model parameter that is the initial value of a model parameter Ω; a feature quantity processing unit for calculating an output probability distribution, that is, the distribution of the probability that an output corresponding to a feature quantity extracted from input data in a domain different from the domain used in the learning of the learned model parameter is the output of an output number m; and a model learning unit for learning the model parameter Ω using the mask, the output probability distribution, and a correct output number, that is, a number for identifying the correct output that corresponds to the feature quantity. The model learning unit calculates an update difference δ(ω) for an element ω of the model parameter Ω by a prescribed expression that uses a loss function L(Ω) and the mask element γ corresponding to the element ω of the model parameter Ω, and updates the element ω.

Description

Model learning device, model learning method, and program

The present invention relates to a model learning technique using a neural network.

A conventional method of learning a model (model parameters) using a neural network will first be described. Non-Patent Document 1 discloses a method of learning an acoustic model used for speech recognition using a neural network. In particular, the details are disclosed in Section II, “TRAINING DEEP NEURAL NETWORKS”, of Non-Patent Document 1.
A model learning apparatus 900 corresponding to the model learning of Non-Patent Document 1 will be described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the model learning apparatus 900, and FIG. 6 is a flowchart showing its operation. As shown in FIG. 5, the model learning apparatus 900 includes a feature quantity processing unit 920, a model learning unit 930, and a recording unit 990.

The recording unit 990 records, as appropriate, the information necessary for the processing of the model learning apparatus 900. For example, the initial value of the model parameter Ω is recorded in advance, and the model parameters Ω generated during the learning process are recorded as appropriate. The initial value of the model parameter Ω may be generated using random numbers, or a model parameter generated using data different from the data used for the current learning may be used.

As shown in FIG. 7, the feature quantity processing unit 920 includes an intermediate feature quantity calculation unit 921 and an output probability distribution calculation unit 922.
Before learning starts, feature quantities are extracted from the input data serving as learning data (speech data in Non-Patent Document 1) and prepared. Each feature quantity is expressed as a real-valued vector. When the input data is speech data, an example of the feature quantity is the FBANK (filter-bank log power) extracted for each frame (usually about 20 ms to 40 ms) into which the speech data is divided. A correct output number, i.e. a number identifying the correct output corresponding to the feature quantity, is also prepared. A pair of a feature quantity and its correct output number is an input to the model learning apparatus 900; such a pair is called training data.

In the following, the number of output types corresponding to a feature quantity is M (M is an integer of 1 or more), each output type is assigned a number (hereinafter, an output number) from 1 to M, and an output is identified using the output number m (1 ≤ m ≤ M, i.e. m is an index representing the output number).

The model learning apparatus 900 learns the model parameter Ω from the training data (that is, pairs of a feature quantity and a correct output number). When a deep neural network (DNN) is used, the model parameter Ω consists of the weights and biases of each layer.

Each constituent unit will be described taking the case of a DNN as an example. The intermediate feature quantity calculation unit 921 executes the computation of each layer from the input layer to the final hidden layer, and the output probability distribution calculation unit 922 executes the computation of the output in the output layer. In this case, therefore, the model parameter Ω learned by the model learning apparatus 900 is the DNN model parameter that characterizes the intermediate feature quantity calculation unit 921 and the output probability distribution calculation unit 922.

Before learning starts, the model learning apparatus 900 sets the initial value of the model parameter Ω recorded in the recording unit 990 in the intermediate feature quantity calculation unit 921 and the output probability distribution calculation unit 922. During learning, each time the model learning unit 930 performs the optimization calculation (that is, updates the model parameter Ω so as to optimize it), the model learning apparatus 900 sets the calculated model parameter Ω in these two units. The next training data is thus processed using the intermediate feature quantity calculation unit 921 and the output probability distribution calculation unit 922 characterized by the newly calculated model parameter Ω.
The operation of the model learning apparatus 900 will be described with reference to FIG. 6. Using the model parameter Ω, the feature quantity processing unit 920 calculates, from the feature quantity extracted from the input data, the output probability distribution p = (p1, ..., pM), where pm is the probability that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M) (S920). The operation of the feature quantity processing unit 920 is as follows (see FIG. 8). The intermediate feature quantity calculation unit 921 calculates an intermediate feature quantity from the input feature quantity (S921); the intermediate feature quantity is the feature quantity used to calculate the output probability distribution p. This processing corresponds to the calculation of Equation (1) in Non-Patent Document 1. When a DNN is used, the intermediate feature quantity corresponds to the output feature quantity of the final hidden layer of the DNN being learned.

The output probability distribution calculation unit 922 calculates the output probability distribution p from the intermediate feature quantity calculated in S921 (S922). This processing corresponds to the calculation of Equation (2) in Non-Patent Document 1. When a DNN is used, the output probability distribution p corresponds to the output feature quantity of the output layer of the DNN being learned.

The model learning unit 930 learns the model parameter Ω using the output probability distribution p calculated in S920 and the correct output number, i.e. the number identifying the correct output corresponding to the feature quantity input in S920 (S930). For example, the optimization calculation of the model parameter Ω is performed so as to decrease the value of the loss function C defined by the following equation. This processing corresponds to the calculation of Equation (3) or Equation (4) in Non-Patent Document 1.
    C = -Σm dm log pm    (sum over m = 1, ..., M)
Here, d = (d1, ..., dM) is the correct probability distribution defined by the following equation.
    dm = 1 (when m is the correct output number),    dm = 0 (otherwise)
The model learning apparatus 900 repeats the processing of S920 to S930 as many times as the number of training data (generally a very large number, on the order of tens of millions to hundreds of millions), and outputs the model parameter Ω obtained when this repetition is completed.
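As a concrete illustration of S920 to S930, the following is a minimal Python/NumPy sketch of one training step. The two-layer network, the tanh activation, the layer sizes, and the plain gradient-descent update are illustrative assumptions rather than the configuration of Non-Patent Document 1; the loss is taken to be the cross-entropy between the one-hot correct probability distribution d and the output probability distribution p, as in the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): feature dimension 40 (e.g. FBANK), hidden 256, M = 100 outputs.
D, H, M = 40, 256, 100
params = {
    "W1": rng.normal(0, 0.1, (H, D)), "b1": np.zeros(H),  # input layer -> final hidden layer
    "W2": rng.normal(0, 0.1, (M, H)), "b2": np.zeros(M),  # final hidden layer -> output layer
}

def forward(params, x):
    """S921/S922: intermediate feature quantity and output probability distribution p."""
    h = np.tanh(params["W1"] @ x + params["b1"])           # intermediate feature quantity
    z = params["W2"] @ h + params["b2"]
    p = np.exp(z - z.max()); p /= p.sum()                  # softmax -> p = (p1, ..., pM)
    return h, p

def train_step(params, x, correct_m, lr=0.1):
    """S930: decrease C = -sum_m d_m log p_m for one training sample by gradient descent."""
    h, p = forward(params, x)
    d = np.zeros(M); d[correct_m] = 1.0                    # correct probability distribution (one-hot)
    loss = -np.sum(d * np.log(p + 1e-12))
    dz = p - d                                             # gradient of C w.r.t. the pre-softmax output
    dh = params["W2"].T @ dz
    grads = {
        "W2": np.outer(dz, h), "b2": dz,
        "W1": np.outer(dh * (1 - h**2), x), "b1": dh * (1 - h**2),
    }
    for k in params:                                       # plain gradient-descent update of the model parameter
        params[k] -= lr * grads[k]
    return loss

print(train_step(params, rng.normal(size=D), correct_m=3))
```

In practice this step is simply repeated for every training sample (or mini-batch), which is the loop described above.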
Non-Patent Document 2 discloses a learning method that can reduce the model size (the number of model parameters) of a neural network. A model learning apparatus 901 corresponding to the model learning of Non-Patent Document 2 will be described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the model learning apparatus 901, and FIG. 6 is a flowchart showing its operation. As shown in FIG. 5, the model learning apparatus 901 includes a feature quantity processing unit 920, a model learning unit 931, and a recording unit 990.

That is, the model learning apparatus 901 differs from the model learning apparatus 900 only in that it includes a model learning unit 931 instead of the model learning unit 930. The operation of the model learning unit 931 is therefore described below (see FIG. 6). The model learning unit 931 learns the model parameter Ω using the output probability distribution p calculated in S920 and the correct output number, i.e. the number identifying the correct output corresponding to the feature quantity input in S920 (S931). For example, the model parameter Ω is optimized using the loss function L(Ω) defined by the following equation.
    L(Ω) = E(Ω) + λR(Ω)
Here, E(Ω) is an error term indicating the error between the output probability distribution calculated from the feature quantity using the model parameter Ω and the correct output, and corresponds to the loss function C described above. R(Ω) is a regularization term, and the real number λ is a hyperparameter for adjusting the influence of the regularization term R(Ω).

By learning the model parameter Ω with the loss function L(Ω) obtained by adding (a scalar multiple of) the regularization term R(Ω) to the error term E(Ω), the model learning unit 931 performs learning such that the values of some elements of the model parameter Ω become close to 0 (the model becomes sparse). Here, when some of the elements of the model parameter Ω are 0 or values close to 0, the model parameter Ω is said to have sparsity. The model learning unit 931 thus learns a model parameter Ω having sparsity using the loss function L(Ω) including the regularization term R(Ω).

In Non-Patent Document 2, regularization terms called Ridge (L2) and Group Lasso are used. For example, when only the weight parameter Wl in layer l (l is an integer identifying a layer of the neural network) is updated, the Ridge (L2) regularization term RL2(Wl) and the Group Lasso regularization term Rgroup(Wl) are given by the following equations.
    [Equations defining the Ridge (L2) regularization term RL2(Wl) and the Group Lasso regularization term Rgroup(Wl); see Non-Patent Document 2]
That is, RL2(Wl) is the sum of the squares of all the elements of the weight parameter between the l-th layer and the (l-1)-th layer, and Rgroup(Wl) represents the sum of (the absolute values of) the weights connecting one unit of the l-th layer with all the units (j = 1, ..., Nl-1) of the (l-1)-th layer.

When Group Lasso is used as the regularization term, the model parameter Ω can be grouped arbitrarily for learning. For example, in Non-Patent Document 2, when the model parameter Ω is expressed as a matrix, the rows or columns of the matrix are used as the grouping units (groups). Furthermore, by learning with the rows of the matrix as the grouping unit and deleting, from the model parameter Ω at the end of learning, the elements of every group whose row-wise norm is smaller than a predetermined threshold, the model size is reduced.

A regularization term is originally used to avoid overfitting, but, depending on the purpose, various regularization terms other than the regularization term RL2(Wl) and the regularization term Rgroup(Wl) of Non-Patent Document 2 can also be defined and used.
The learning method of Non-Patent Document 1 assumes that the model is learned in a single domain (for example, in the case of speech recognition, learning is performed using speech data collected on the premise that conditions such as background noise, recording equipment, and speaking style are the same). Therefore, if a model learned using data of a certain domain (domain 1) is taken as the initial model and additional learning is performed using data of another domain (domain 2), and the resulting model is then used to perform recognition on domain-1 data, the accuracy may deteriorate significantly. This property of neural network learning is called catastrophic forgetting. In general, to prevent catastrophic forgetting (that is, to learn additionally without impairing the performance of a learned model corresponding to existing knowledge), the model must be re-learned using the data of both domain 1 and domain 2, so there is a problem that the cost in terms of learning time is very high.

An object of the present invention is therefore to provide a model learning technique that can additionally learn using data of another domain without impairing the performance of a model learned using data of a certain domain.

One aspect of the present invention includes: a setup unit that generates a mask from a learned model parameter that is the initial value of a model parameter Ω to be learned; a feature quantity processing unit that uses the model parameter Ω to calculate, from a feature quantity extracted from input data in a domain different from the domain used for learning the learned model parameter, an output probability distribution, i.e. the distribution of the probability pm that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M); and a model learning unit that learns the model parameter Ω using the mask, the output probability distribution, and a correct output number, i.e. a number identifying the correct output corresponding to the feature quantity. Let L(Ω) be the loss function used when learning the model parameter Ω and μ be a real number. The setup unit calculates the mask element γ corresponding to an element ω of the model parameter Ω from a threshold θ by the following equation,
    γ = 1 (if |ω(0)| < θ),    γ = 0 (otherwise)
(where ω(0) is the initial value of the element ω), and the model learning unit calculates the update difference δ(ω) of the element ω of the model parameter Ω by the following equation and updates the element ω:
    δ(ω) = -μ · γ · ∂L(Ω)/∂ω,    ω ← ω + δ(ω)
(where ∂L(Ω)/∂ω is the gradient of the loss function L(Ω) with respect to the element ω).

According to the present invention, it is possible to learn additionally using data of another domain without impairing the performance of a model learned using data of a certain domain.
Brief description of the drawings: FIG. 1 shows an example of the configuration of the model learning device 100; FIG. 2 shows an example of its operation; FIG. 3 shows an example of the configuration of the setup unit 110; FIG. 4 shows an example of its operation; FIG. 5 shows an example of the configuration of the model learning apparatus 900/901; FIG. 6 shows an example of its operation; FIG. 7 shows an example of the configuration of the feature quantity processing unit 920; and FIG. 8 shows an example of its operation.
Embodiments of the present invention will now be described in detail. Constituent units having the same function are given the same reference numeral, and duplicate description is omitted.

<First embodiment>

The model learning device 100 will be described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the model learning device 100, and FIG. 2 is a flowchart showing its operation. As shown in FIG. 1, the model learning device 100 includes a setup unit 110, a feature quantity processing unit 920, a model learning unit 130, and a recording unit 990.
The recording unit 990 records, as appropriate, the information necessary for the processing of the model learning device 100. For example, the initial value of the model parameter Ω is recorded in advance. This initial value of the model parameter Ω is a learned model parameter, for example one learned by the model learning device 900 or the model learning device 901, using training data consisting of pairs of a feature quantity extracted from input data in a certain domain (hereinafter, domain 1) and the correct output number, i.e. the number identifying the correct output corresponding to that feature quantity. Therefore, when a learned model parameter learned by the model learning device 901 is used, the learned model parameter has sparsity. Hereinafter, the learned model parameter is denoted Ω(0) and its elements ω(0).

The model learning device 100 learns the model parameter Ω from training data consisting of pairs of a feature quantity extracted from input data in a domain (hereinafter, domain 2) different from the domain used for learning the learned model parameter (that is, domain 1) and the correct output number, i.e. the number identifying the correct output corresponding to that feature quantity.

Before learning starts, the model learning device 100 sets the initial value of the model parameter Ω (that is, the learned model parameter) recorded in the recording unit 990 in the feature quantity processing unit 920 (the intermediate feature quantity calculation unit 921 and the output probability distribution calculation unit 922). During learning, each time the model learning unit 130 performs the optimization calculation (that is, updates the model parameter Ω so as to optimize it), the model learning device 100 sets the calculated model parameter Ω in the feature quantity processing unit 920.
The operation of the model learning device 100 will be described with reference to FIG. 2. The setup unit 110 generates a mask from the learned model parameter, recorded in the recording unit 990, that is the initial value of the model parameter Ω to be learned (S110). The setup unit 110 is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the setup unit 110, and FIG. 4 is a flowchart showing its operation. As shown in FIG. 3, the setup unit 110 includes a threshold determination unit 111 and a mask generation unit 112; its operation follows FIG. 4.

The threshold determination unit 111 determines the threshold θ from the learned model parameter (S111). Any determination method may be used as long as it determines the threshold θ so that a predetermined number of elements whose absolute values are close to 0 are extracted from the elements of the learned model parameter. For example, a frequency distribution of the values of the learned model parameter elements is created, and the threshold θ is determined so that the proportion of model parameter elements whose absolute value is close to 0 becomes, say, 25% (hereinafter, determination method 1). Alternatively, a frequency distribution of values calculated for each group into which the learned model parameter elements are grouped is created, and a value between two of those values (for example, the average of the two values) is determined as the threshold θ (hereinafter, determination method 2). For example, when the learned model parameter is expressed as a matrix, the rows (or columns) of the matrix are taken as groups, a frequency distribution of the norms of the row vectors (or column vectors) of the groups is created, and a value between two of the norm values can be determined as the threshold θ. That is, determination method 1 determines the threshold θ based on the frequency distribution of the values of the learned model parameter elements, while determination method 2 determines the threshold θ based on the frequency distribution of values calculated for each group into which the learned model parameter elements are grouped.
The mask generation unit 112 generates the mask Γ from the learned model parameter using the threshold θ determined in S111 (S112). The mask Γ is generated as follows: the element γ of the mask Γ corresponding to an element ω of the model parameter Ω is set to 1 when the absolute value of the learned model parameter element ω(0) is smaller than the threshold θ (at most the threshold θ), and to 0 otherwise. That is, the mask element γ corresponding to the element ω of the model parameter Ω is calculated from the threshold θ by the following equation,
    γ = 1 (if |ω(0)| < θ),    γ = 0 (otherwise)
(where ω(0) is the initial value of the element ω). When the model parameter Ω is expressed as a matrix, the mask Γ is expressed as a matrix of the same size as the matrix representing the model parameter Ω, in which every element is 0 or 1.
Using the model parameter Ω, the feature quantity processing unit 920 calculates, from the feature quantity extracted from the input data in domain 2, the output probability distribution p = (p1, ..., pM), where pm is the probability that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M) (S920).

The model learning unit 130 learns the model parameter Ω using the mask Γ generated in S110, the output probability distribution p calculated in S920, and the correct output number, i.e. the number identifying the correct output corresponding to the feature quantity input in S920 (S130). For example, the model parameter Ω is optimized using the loss function L(Ω) defined by equation (1) or equation (2):
    L(Ω) = E(Ω)    ...(1)
    L(Ω) = E(Ω) + λR(Ω)    ...(2)
Specifically, the update difference δ(ω) of the element ω of the model parameter Ω is calculated by equation (3), and the element ω is updated by equation (4):
    δ(ω) = -μ · γ · ∂L(Ω)/∂ω    ...(3)
    ω ← ω + δ(ω)    ...(4)
Here, μ is a (positive) real number representing the learning rate, a parameter that adjusts the degree to which the model parameter is updated, and ∂L(Ω)/∂ω denotes the gradient of the loss function L(Ω) with respect to the element ω. Note that the gradient ∂L(Ω)/∂ω is also used for learning in the model learning device 900 and the model learning device 901.

With this update difference, only the elements of the model parameter Ω that are to be learned, that is, the elements whose initial absolute value is smaller than the threshold θ (at most the threshold θ), are updated selectively.

When the model parameter Ω and the mask Γ are represented by matrices as described above (denoting these matrices themselves by Ω and Γ), the optimization calculation of the model parameter Ω can be expressed using the Hadamard product as follows:
    Ω ← Ω - μ (Γ ∘ ∂L(Ω)/∂Ω)    (∘ denotes the Hadamard product)
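The masked update can then be sketched as follows, assuming it takes the masked gradient-descent form of equations (3) and (4) and the Hadamard form above; the gradient is supplied by the caller, and the function name and sample values are illustrative.

```python
import numpy as np

def masked_update(W, Gamma, grad_L, mu=0.01):
    """One optimization step: Omega <- Omega - mu * (Gamma (Hadamard product) dL/dOmega).
    Only elements whose mask value is 1 (initial magnitude below theta) change."""
    return W - mu * (Gamma * grad_L)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))
Gamma = (np.abs(W) < 0.5).astype(W.dtype)
grad_L = rng.normal(size=W.shape)          # stand-in for dL(Omega)/dOmega from back-propagation
W_new = masked_update(W, Gamma, grad_L)
assert np.allclose(W_new[Gamma == 0], W[Gamma == 0])   # masked-out (large) elements are untouched
```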
Note that when the loss function L(Ω) including the regularization term R(Ω) (equation (2)) is used, each element of the model parameter Ω can be driven efficiently toward a value close to 0.

The model learning device 100 repeats the processing of S920 to S130 as many times as the number of training data, and outputs the finally calculated model parameter Ω.
According to the invention of this embodiment, it is possible to learn additionally using data of another domain without impairing the performance of a model learned using data of a certain domain. A model that processes the input data of both domain 1 and domain 2 with good accuracy can thus be learned using only the input data of domain 2, with the learned model trained on the input data of domain 1 as the initial model, so the cost in terms of learning time can be reduced.
<Supplementary Note>
The apparatus of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), a RAM and a ROM as memories, an external storage device which is a hard disk, and a bus that connects the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage device so that data can be exchanged among them. If necessary, the hardware entity may be provided with a device (drive) capable of reading from and writing to a recording medium such as a CD-ROM. A physical entity having such hardware resources includes a general-purpose computer.
The external storage device of the hardware entity stores the programs necessary for realizing the above functions and the data necessary for processing these programs (not limited to the external storage device; for example, the programs may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.
In the hardware entity, each program stored in the external storage device (or the ROM, etc.) and the data necessary for processing each program are read into the memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components described above as units, means, and so on).
The present invention is not limited to the above-described embodiment, and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiment may be executed not only in time series in the order described, but also in parallel or individually according to the processing capability of the apparatus that executes the processes or as necessary.
As described above, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiment are realized by a computer, the processing contents of the functions that the hardware entity should have are described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute the processing according to it, or the computer may sequentially execute the processing according to the received program each time the program is transferred from the server computer. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property that defines the processing of the computer).
In this embodiment, the hardware entity is configured by executing a predetermined program on the computer; however, at least a part of these processing contents may be realized by hardware.

Claims (8)

  1.  A model learning device comprising:
     a setup unit that generates a mask from a trained model parameter, which is the initial value of a model parameter Ω to be learned;
     a feature quantity processing unit that, using the model parameter Ω, computes, from a feature quantity extracted from input data in a domain different from the domain used to learn the trained model parameter, an output probability distribution, which is a distribution of the probability pm that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M, where M denotes the number of types of output corresponding to the feature quantity); and
     a model learning unit that learns the model parameter Ω using the mask, the output probability distribution, and a correct output number, which is a number for identifying the correct output corresponding to the feature quantity,
     wherein, with L(Ω) being a loss function used when learning the model parameter Ω and μ being a real number,
     the setup unit computes the mask element γ corresponding to an element ω of the model parameter Ω, using the threshold θ, by the following formula
       γ = 1 if |ω(0)| < θ, and γ = 0 otherwise
     (where ω(0) is the initial value of the element ω), and
     the model learning unit computes the update difference δ(ω) of the element ω of the model parameter Ω by the following formula and updates the element ω
       δ(ω) = -μγ ∂L(Ω)/∂ω,  ω ← ω + δ(ω)
     (where ∂L(Ω)/∂ω is the gradient of the loss function L(Ω) with respect to the element ω).
  2.  The model learning device according to claim 1, wherein the trained model parameter has sparsity.
  3.  The model learning device according to claim 2, wherein the trained model parameter has been learned using a loss function L(Ω) given by the following formula
       L(Ω) = E(Ω) + λR(Ω)
     (where E(Ω) is an error term indicating the error between the correct output and the output probability distribution computed from the feature quantity using the model parameter Ω, R(Ω) is a regularization term, and λ is a real number).
  4.  The model learning device according to any one of claims 1 to 3, wherein the threshold θ is determined based on a frequency distribution of the values of the elements of the trained model parameter.
  5.  The model learning device according to any one of claims 1 to 3, wherein the threshold θ is determined based on a frequency distribution of values computed for each group obtained by grouping the elements of the trained model parameter.
  6.  A model learning method comprising:
     a setup step in which a model learning device generates a mask from a trained model parameter, which is the initial value of a model parameter Ω to be learned;
     a feature quantity processing step in which the model learning device, using the model parameter Ω, computes, from a feature quantity extracted from input data in a domain different from the domain used to learn the trained model parameter, an output probability distribution, which is a distribution of the probability pm that the output corresponding to the feature quantity is the output with output number m (1 ≤ m ≤ M); and
     a model learning step in which the model learning device learns the model parameter Ω using the mask, the output probability distribution, and a correct output number, which is a number for identifying the correct output corresponding to the feature quantity,
     wherein, with L(Ω) being a loss function used when learning the model parameter Ω and μ being a real number,
     in the setup step, the mask element γ corresponding to an element ω of the model parameter Ω is computed, using the threshold θ, by the following formula
       γ = 1 if |ω(0)| < θ, and γ = 0 otherwise
     (where ω(0) is the initial value of the element ω), and
     in the model learning step, the update difference δ(ω) of the element ω of the model parameter Ω is computed by the following formula and the element ω is updated
       δ(ω) = -μγ ∂L(Ω)/∂ω,  ω ← ω + δ(ω)
     (where ∂L(Ω)/∂ω is the gradient of the loss function L(Ω) with respect to the element ω).
  7.  The model learning method according to claim 6, wherein the trained model parameter has sparsity.
  8.  A program for causing a computer to function as the model learning device according to any one of claims 1 to 5.
PCT/JP2019/014476 2018-04-04 2019-04-01 Model learning device, model learning method, and program WO2019194128A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-072225 2018-04-04
JP2018072225A JP2019185207A (en) 2018-04-04 2018-04-04 Model learning device, model learning method and program

Publications (1)

Publication Number Publication Date
WO2019194128A1 true WO2019194128A1 (en) 2019-10-10

Family

ID=68100579

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/014476 WO2019194128A1 (en) 2018-04-04 2019-04-01 Model learning device, model learning method, and program

Country Status (2)

Country Link
JP (1) JP2019185207A (en)
WO (1) WO2019194128A1 (en)



Also Published As

Publication number Publication date
JP2019185207A (en) 2019-10-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19780519; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19780519; Country of ref document: EP; Kind code of ref document: A1)