JP2020129209A

JP2020129209A - Optimization method, optimization program, inference method, and inference program

Info

Publication number: JP2020129209A
Application number: JP2019020873A
Authority: JP
Inventors: Raj Dabre (ラジ ダブレ); Atsushi Fujita (藤田 篤)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2019-02-07
Filing date: 2019-02-07
Publication date: 2020-08-27
Anticipated expiration: 2039-02-07
Also published as: JP7297286B2

Abstract

PROBLEM TO BE SOLVED: To provide a neural network that can adapt to changes in usage and required specifications, and an optimization method for it. SOLUTION: An optimization method includes: a step of preparing training data in which an input signal is associated with a correct output signal; a step of inputting the input signal to a neural network, calculating an output signal output from the deepest layer of the neural network, and calculating an output signal output from each of one or more layers including the deepest layer; a step of calculating the error of each calculated output signal with respect to the correct output signal associated with the input signal; and a step of optimizing the parameters of each layer of the neural network on the basis of the calculated errors. SELECTED DRAWING: Figure 3

Description

The present technology relates to an optimization method for deep learning and to a method of using an optimized model obtained by that optimization method.

In every field of artificial intelligence, including natural language processing, methods based on deep learning are increasingly outperforming other machine learning methods.

Deep learning assumes a neural network (hereinafter also simply called a “model”) that obtains an output signal by applying multiple nonlinear transformations to an input signal. The nonlinear transformations in the neural network (that is, the coefficients of the linear transformation matrices and the values of the bias terms) are optimized based on the error between the output signal of the model and a correct output signal given in advance. Such an optimization method yields an optimized model suited to the task. End-to-end optimization, which optimizes a model based only on pairs of input signals and correct output signals provided by humans, has been used for many tasks in recent years because it removes the need to break a complex process of the kind humans perform into fine-grained steps and implement each one.

In general, increasing the number of nonlinear transformations (that is, the number of layers in the neural network) makes it possible to represent more complex functions, which raises the likelihood of solving problems with complex input-output relationships. It has been reported that increasing the number of layers improves performance on a variety of tasks.

Optimization in deep learning can suffer from a problem called vanishing gradients. A technique called a skip structure (residual connection) is often used to address it (see, for example, Non-Patent Document 1). Although this technique imposes the constraint that the input signal and the output signal must have the same number of dimensions, it allows the neural network to be optimized stably.

As the number of layers increases, so do the space complexity (that is, the number of parameters) and the time complexity (that is, the number of matrix multiplications and the like). As a result, the process of calculating the model's output signal for an input signal (inference) requires more memory and runs more slowly. To reduce the required memory, an approach has been proposed in which the parameters of the same layer are used recursively in the neural network (see Non-Patent Document 2).

Another approach to reducing space and time complexity is knowledge distillation (see Non-Patent Document 3). In distillation, a complex model is optimized first, and a relatively simple model is then optimized with reference to the output signals of the optimized complex model. Distillation has been reported to reduce the memory footprint of models in, for example, neural machine translation (see Non-Patent Document 4).

Further approaches to reducing space and time complexity have been proposed, including reducing the precision with which real numbers are represented (16-bit instead of 32-bit representation) (see Non-Patent Document 5), optimizing the model so that most of its parameters can be restricted to the two values +1 and -1 (see Non-Patent Document 6), and binary-encoding part of the target vocabulary (see Non-Patent Document 7).

Optimization in deep learning is performed based only on the information from the deepest layer of the given neural network. Consequently, if the output signal is calculated using only some of the layers of an optimized model, performance can degrade severely. In other words, the layers or network structure used at inference time cannot differ from those of the model that was optimized, so this cannot serve as a way of reducing space and time complexity.

Therefore, when the usage or required specifications of a model change, a new model must be optimized all over again. This holds for every one of the approaches to reducing space and time complexity described above.

An object of the present technology is to provide a neural network optimization method that can accommodate changes in usage and required specifications.

According to one aspect of the present technology, an optimization method for optimizing the parameters of a neural network having a plurality of identical or different layers is provided. The optimization method includes: a step of preparing training data in which an input signal is associated with a correct output signal; a step of inputting the input signal to the neural network, calculating an output signal output from the deepest layer of the neural network, and calculating an output signal output from each of one or more layers including the deepest layer; a step of calculating the error of each calculated output signal with respect to the correct output signal associated with the input signal; and a step of optimizing the parameters of each layer of the neural network based on the calculated errors.

The optimizing step may include a step of integrating the calculated errors.

The step of integrating the errors may include a step of integrating the calculated errors to calculate error information to be backpropagated from the deepest layer.

The step of calculating the error information may include a step of calculating the average of the calculated errors as the error information to be backpropagated from the deepest layer.

The optimizing step may include a step of optimizing the parameters of a target layer based on the error information given to that layer by backpropagation and the error calculated for the output signal of that layer.

The neural network may include an encoder that outputs characteristic information contained in the input signal, and a decoder that determines an output signal from the previously output output signal and the characteristic information contained in the input signal.

According to another aspect of the present technology, an optimization program for causing a computer to execute the above optimization method is provided.

According to yet another aspect of the present technology, an inference method using an optimized model consisting of a neural network having a plurality of identical or different layers is provided. The inference method includes: a step of inputting an arbitrary input signal to the optimized model; a step of calculating output signals in order toward the deepest layer of the optimized model; and a step of outputting, as the inference result, the output signal of a layer that is selected based on a request from among the plurality of identical or different layers of the optimized model, the deepest layer included. The optimized model is generated by optimizing the parameters based on the respective errors between the output signals output from each of one or more layers including the deepest layer, calculated when an input signal contained in training data is input to the neural network, and the correct output signal associated with that input signal in the training data.

The layer whose output signal is output as the inference result may be determined based on a requirement on at least one of the inference performance of the output signal and the time required until the output signal is output.

According to yet another aspect of the present technology, an inference program for causing a computer to execute the above optimization method is provided.

According to the present technology, a neural network optimization method that can accommodate changes in usage and required specifications can be provided.

FIG. 1 is a schematic diagram for explaining general deep learning.
FIG. 2 is a schematic diagram for explaining deep learning according to the present embodiment.
FIG. 3 is a flowchart showing the main part of the processing procedure according to the present embodiment.
FIG. 4 is a schematic diagram showing an example of a Transformer model that implements neural machine translation.
FIG. 5 is a schematic diagram for explaining the optimization process according to the first embodiment.
FIG. 6 is a flowchart showing the main processing procedure of the optimization process according to the first embodiment.
FIG. 7 is a graph showing evaluation results for an English-Japanese translation task using a multilingual parallel corpus for speech translation.
FIG. 8 is a graph showing evaluation results for an English-German translation task using parallel data from the news domain.
FIG. 9 is a schematic diagram for explaining the optimization process according to the second embodiment.
FIG. 10 is a schematic diagram for explaining the optimization process according to the second embodiment.
FIG. 11 is a flowchart showing the main processing procedure of the optimization process according to the second embodiment.
FIG. 12 is a graph showing evaluation results for the English-Japanese translation task in the second embodiment.
FIG. 13 is a graph showing evaluation results for the English-German translation task in the second embodiment.
FIG. 14 is a schematic diagram for explaining the optimization process according to the third embodiment.
FIG. 15 is a flowchart showing the main processing procedure of the optimization process according to the third embodiment.
FIG. 16 is a graph showing evaluation results for the English-Japanese translation task in the third embodiment.
FIG. 17 is a schematic diagram showing an example of a hardware configuration that implements the optimization process and the inference process according to the present embodiment.

Embodiments of the present invention will be described in detail with reference to the drawings. Identical or corresponding parts in the drawings are given the same reference numerals, and their description is not repeated.

[A. Related Technology]
First, general deep learning will be described.

FIG. 1 is a schematic diagram for explaining general deep learning. Referring to FIG. 1, deep learning assumes a neural network 10 that obtains an output signal by applying multiple nonlinear transformations to an input signal. Typically, the neural network 10 consists of an input layer 2, one or more hidden layers 4, and an output layer 6. Each of the input layer 2, the hidden layers 4, and the output layer 6 includes a vector representing its state and an activation function. Adjacent layers are connected through an affine transformation or the like.

For example, a model that obtains an output signal Y by passing an input signal X through six nonlinear transformations L_1, L_2, L_3, L_4, L_5, and L_6 can be expressed as equation (1) below.

Y = L_6(L_5(L_4(L_3(L_2(L_1(X)))))) ... (1)

Usually, the input signal X is a real-valued vector v (∈ R^n) of finite, fixed dimension. Each nonlinear transformation L_i can be represented by a linear transformation matrix and a bias term. The coefficients of the linear transformation matrices and the values of the bias terms are collectively called “parameters”. Parameter optimization means adjusting the value of each parameter so that the model performs better on the target task.
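Equation (1) can be illustrated with a minimal Python sketch of the nested composition. The tanh activation, the 2x2 parameter values, the reuse of one parameter set for all six layers, and the names `make_layer` and `forward` are assumptions made only for illustration; the source does not specify them.

```python
import math

def make_layer(weights, bias):
    """Build one nonlinear transformation L_i(v) = tanh(W v + b).

    tanh is an assumed activation; the source only says each layer has a
    linear transformation matrix, a bias term, and an activation function.
    """
    def layer(v):
        return [
            math.tanh(sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(weights, bias)
        ]
    return layer

def forward(layers, x):
    """Apply L_1 first and the last layer last: Y = L_n(...L_2(L_1(X))...)."""
    for layer in layers:
        x = layer(x)
    return x

# Illustrative parameters only. Here all six layers reuse the same values;
# normally each layer would have its own matrix and bias.
W = [[0.5, -0.3], [0.2, 0.8]]
b = [0.1, -0.1]
layers = [make_layer(W, b) for _ in range(6)]

Y = forward(layers, [1.0, 2.0])  # Y = L_6(L_5(L_4(L_3(L_2(L_1(X))))))
```

Writing the model as a loop over a list of layers, rather than as a fixed nested expression, makes it easy to read off the intermediate result after any layer.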

Optimization of a neural network assumes that the most useful information is obtained at the deepest layer. Under this assumption, the parameters are optimized by repeating the following three processes (steps S1 to S3).

(1) Calculate the output signal Y for the input signal X (step S1).
(2) Calculate the error e between the output signal Y and the correct output signal (step S2).
(3) Update the parameters in order from the deepest layer toward the shallower layers based on the error e (error backpropagation) (step S3).

In step S2, the error e is calculated by a method suited to the problem.

More specifically, (i) for a regression problem (where a real-valued vector v (∈ R^n) within a certain range is wanted as the output signal Y), the output signal Y is normalized using a sigmoid function or the like, and the error e with respect to the real-valued vector v' (∈ R^n) that is the correct output signal is calculated.

(ii) For a classification problem (where a discrete value c (∈ C) is wanted as the output signal Y), the output signal Y is converted into a probability distribution P(c) over c using a softmax function or the like, and the error e with respect to the discrete value c' (∈ C) that is the correct output signal is calculated as the cross entropy.
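The two error calculations of step S2 can be sketched in Python as follows. The text names the sigmoid, softmax, and cross entropy but does not say which error measure is used for regression, so the mean squared error below is an assumption, as are the function names.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def regression_error(raw_output, target):
    """Case (i): normalize with a sigmoid, then compare with the target v'.

    Mean squared error is an assumed choice of error measure.
    """
    y = [sigmoid(z) for z in raw_output]
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target)) / len(y)

def softmax(logits):
    """Convert raw outputs into a probability distribution P(c)."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_error(logits, correct_class):
    """Case (ii): cross entropy, i.e. -log P(c') for the correct class c'."""
    return -math.log(softmax(logits)[correct_class])
```

For instance, `classification_error([0.0, 0.0], 0)` evaluates to log 2, the cross entropy of a uniform guess over two classes.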

End-to-end optimization based only on pairs of input signals and correct output signals, as shown in FIG. 1, has been used for many tasks in recent years because it removes the need to break a complex process of the kind humans perform into fine-grained steps and implement each one.

Increasing the number of nonlinear transformations (that is, the number of layers in the neural network) makes it possible to represent more complex functions and raises the likelihood of better performance. On the other hand, as the number of layers increases, the space complexity (that is, the number of parameters) and the time complexity (that is, the number of matrix multiplications and the like) also increase.

[B. Overview]
Next, an overview of the neural network according to the present embodiment will be described.

FIG. 2 is a schematic diagram for explaining deep learning according to the present embodiment. FIG. 2 shows a neural network 1 similar to that of FIG. 1. Like the neural network 10 shown in FIG. 1, the neural network 1 applies multiple nonlinear transformations to an input signal and outputs an output signal.

Like the neural network 10 shown in FIG. 1, the neural network 1 typically consists of an input layer 2, one or more identical or different hidden layers 4, and an output layer 6. Each of the input layer 2, the hidden layers 4, and the output layer 6 includes a vector representing its state and an activation function. Adjacent layers are connected through an affine transformation or the like.

In the neural network 10 shown in FIG. 1, optimization is performed based only on the information from the deepest layer (that is, the output signal Y of the output layer 6). In the neural network 1 according to the present embodiment, by contrast, optimization uses the information from other layers in addition to that of the deepest layer. The neural network 1 adopts a network structure suited to such optimization.

That is, the neural network 1 according to the present embodiment allows output signals to be taken from layers other than the output layer 6. As a typical example, in the neural network 1 shown in FIG. 2(A), output signals Y_1, Y_2, ..., Y_{n-1}, Y_n can be output from the input layer 2, the hidden layers 4, and the output layer 6, respectively.

Optimization of the neural network 1 uses the information from multiple layers including the deepest layer. Whereas general deep learning uses only the error between the output signal Y of the deepest layer and the correct output signal, in the present embodiment the error with respect to the correct output signal is calculated for the output signal of each layer, and the parameters are optimized after the calculated errors are integrated. More specifically, the parameters are optimized by repeating the following processes (steps S11 to S13).

(1) Calculate the output signals Y_1, Y_2, ..., Y_{n-1}, Y_n of the layers for the input signal X (step S11).
(2) Calculate the errors e_1, e_2, ..., e_{n-1}, e_n between each of the output signals Y_1, Y_2, ..., Y_{n-1}, Y_n and the correct output signal (step S12).
(3) Update the parameters in order from the deepest layer toward the shallower layers based on the errors e_n, e_{n-1}, ..., e_2, e_1 (error backpropagation) (step S13).

In the error backpropagation of step S13, the error e_i of each layer is taken into account in turn. That is, the k-th layer (1 ≤ k < n) directly receives the error e_k calculated at the k-th layer in addition to the error information (gradient) backpropagated from the (k+1)-th layer, and the parameters of the k-th layer are updated taking both into account.

That is, the process of optimizing the parameters includes optimizing the parameters of a target layer based on the error information given to that layer by backpropagation and the error calculated for the output signal of that layer.

In this way, the parameters of each layer may be updated by backpropagating the error information from the deepest layer toward the shallower layers while taking the error of each layer into account in turn.
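The update rule of step S13 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, a chain of scalar layers h_k = tanh(w_k h_{k-1} + b_k) whose outputs are used directly as Y_k, a squared error e_k = (Y_k - t)^2 / 2 for every layer, and plain gradient descent. None of these choices, nor the names `forward`, `layer_errors`, `backward`, and `sgd_step`, come from the source.

```python
import math

def forward(ws, bs, x):
    """Return [h_0, h_1, ..., h_n] with h_0 = x and h_k = tanh(w_k * h_{k-1} + b_k)."""
    hs = [x]
    for w, b in zip(ws, bs):
        hs.append(math.tanh(w * hs[-1] + b))
    return hs

def layer_errors(hs, target):
    """Errors e_1..e_n of every layer output against the same correct output."""
    return [0.5 * (h - target) ** 2 for h in hs[1:]]

def backward(ws, hs, target):
    """Walk from the deepest layer to the shallowest one.

    Each layer combines the gradient backpropagated from the next deeper
    layer with the gradient of its own error e_k.
    """
    n = len(ws)
    grad_w, grad_b = [0.0] * n, [0.0] * n
    from_deeper = 0.0  # nothing is backpropagated into the deepest layer
    for k in range(n - 1, -1, -1):
        h_out, h_in = hs[k + 1], hs[k]
        direct = h_out - target                    # gradient of this layer's own error
        grad_out = direct + from_deeper            # both sources are taken into account
        grad_pre = grad_out * (1.0 - h_out ** 2)   # derivative of tanh
        grad_w[k] = grad_pre * h_in
        grad_b[k] = grad_pre
        from_deeper = grad_pre * ws[k]             # passed on to the shallower layer
    return grad_w, grad_b

def sgd_step(ws, bs, x, target, lr=0.1):
    """One pass of steps S11 to S13 with an assumed gradient-descent update."""
    hs = forward(ws, bs, x)
    gw, gb = backward(ws, hs, target)
    new_ws = [w - lr * g for w, g in zip(ws, gw)]
    new_bs = [b - lr * g for b, g in zip(bs, gb)]
    return new_ws, new_bs
```

In this sketch the gradients returned by `backward` equal the gradients of the sum e_1 + ... + e_n, which is one way to read "taking both into account"; other weightings of the two contributions are equally compatible with the text.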

The technical idea of the present embodiment is not limited to a specific type of neural network and can be applied to various types, for example a CNN (Convolutional Neural Network), a stacked RNN (Recurrent Neural Network), or a Transformer (a form of neural machine translation).

As described above, for a regression problem the output signal is a signal normalized using a sigmoid function or the like, and for a classification problem it is a signal converted into a probability distribution using a softmax function or the like. The optimization method described above is applicable to output signals of either form.

In the neural network 1 according to the present embodiment, the parameters are optimized so that every one of the output signals Y_1, Y_2, ..., Y_{n-1}, Y_n of the layers has a small error with respect to the correct output signal. Therefore, in inference, sufficient performance is likely to be obtained not only with the information from the deepest layer (that is, the output signal Y_n of the output layer 6) but also with the information from other layers (that is, the output signals Y_1, Y_2, ..., Y_{n-1}). As a result, the output signal of any one of the layers can be used as the inference result, depending on the required processing speed and performance.
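Choosing a layer according to the required speed and performance could look like the following Python sketch. The per-layer profile table, its field names, the selection rule, and the function `select_exit_layer` are all hypothetical, and the numbers are made up for illustration; they are not results from the source.

```python
# Hypothetical per-layer measurements, e.g. taken on development data.
# The numbers are invented for illustration only.
EXAMPLE_PROFILES = [
    {"layer": 1, "score": 10.0, "latency_ms": 5.0},
    {"layer": 2, "score": 14.0, "latency_ms": 9.0},
    {"layer": 3, "score": 15.0, "latency_ms": 13.0},
]

def select_exit_layer(profiles, min_score=None, max_latency_ms=None):
    """Pick the layer whose output is used as the inference result.

    Assumed rule: keep the layers that satisfy every given requirement;
    if a performance target is given, take the shallowest such layer,
    otherwise take the best-scoring one. Returns None if no layer fits.
    """
    feasible = [
        p for p in profiles
        if (min_score is None or p["score"] >= min_score)
        and (max_latency_ms is None or p["latency_ms"] <= max_latency_ms)
    ]
    if not feasible:
        return None
    if min_score is not None:
        return min(feasible, key=lambda p: p["layer"])["layer"]
    return max(feasible, key=lambda p: p["score"])["layer"]
```

With the example table, a performance target of 13.0 and a latency limit of 10 ms both select layer 2, while a call without requirements falls back to the deepest layer.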

FIG. 2(A) shows an example in which the output signal of each layer is compared with the correct output signal and an error is calculated at each layer; as shown in FIG. 2(B), the errors calculated at the layers may instead be integrated. Averaging is described below as one way of integrating the errors, but any method may be used.

In the optimization method shown in FIG. 2(B), the parameters are optimized by repeating the following processes (steps S11, S12, S14, and S15).

(1) Calculate the output signals Y_1, Y_2, ..., Y_{n-1}, Y_n of the layers for the input signal X (step S11).
(2) Calculate the errors e_1, e_2, ..., e_{n-1}, e_n between each of the output signals Y_1, Y_2, ..., Y_{n-1}, Y_n and the correct output signal (step S12).
(3) Average the errors e_1, e_2, ..., e_{n-1}, e_n of the layers to calculate the average error e_unf (step S14).
(4) Update the parameters in order from the deepest layer toward the shallower layers based on the average error e_unf (error backpropagation) (step S15).

In this way, the process of optimizing the parameters may integrate the calculated errors. Integrating the errors includes integrating the calculated errors to calculate error information to be backpropagated from the deepest layer. Typically, the average of the calculated errors may be calculated as the error information to be backpropagated from the deepest layer.
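Written out, the averaging of step S14 and its effect on the parameters of a single layer are as follows. The symbol \theta_k for the parameters of the k-th layer is introduced here for illustration, and the second identity assumes that layers do not share parameters and that e_j depends only on layers 1 through j.

```latex
e_{\mathrm{unf}} = \frac{1}{n} \sum_{k=1}^{n} e_k ,
\qquad
\frac{\partial e_{\mathrm{unf}}}{\partial \theta_k}
  = \frac{1}{n} \sum_{j=k}^{n} \frac{\partial e_j}{\partial \theta_k}
```

Under these assumptions, a shallow layer receives gradient contributions from its own error and from the errors of all deeper layers, whereas the deepest layer is influenced only by e_n.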

This averaging (step S14) integrates the errors for the output signals of the individual layers. Alternatively, the output signals of the layers may first be integrated by any method, after which the error information is calculated and the parameters are optimized.

With the neural network and optimization method of the present embodiment, the space complexity increases slightly in the optimization phase (training) but can be reduced on demand in the inference phase (use). The time complexity increases in the optimization phase (training) but can likewise be reduced on demand in the inference phase (use).

FIG. 3 is a flowchart showing the main part of the processing procedure according to the present embodiment. FIG. 3(A) shows the procedure of the optimization process, and FIG. 3(B) shows the procedure of the inference process.

FIG. 3(A) shows the procedure of an optimization method for optimizing the parameters of a neural network (model) having a plurality of layers. The main steps shown in FIG. 3(A) are typically realized by a processor executing an optimization program.

Referring to FIG. 3(A), training data for the optimization process, in which input signals are associated with correct output signals, is first prepared (step S50).

Next, an input signal contained in the training data is input to the neural network, and the output signal output from each of one or more layers including the deepest layer of the neural network is calculated (step S52).

Then, the error of each calculated output signal with respect to the correct output signal associated with the input signal in the training data is calculated (step S54). Based on the calculated errors, the parameters of each layer of the neural network are optimized (step S56). The process of optimizing the parameters of each layer may include integrating the errors calculated in step S54. As shown in FIG. 2(A) and FIG. 2(B), the errors are integrated in the course of error backpropagation. In the case shown in FIG. 2(B), the errors are also integrated before error backpropagation.

通常、予め設定された回数、または、訓練データとは別に用意された検証用データ（開発データ）に対する精度が収束するまで、ステップＳ５２〜Ｓ５６の処理が繰り返される。 Usually, the processes of steps S52 to S56 are repeated a preset number of times or until the accuracy of the verification data (development data) prepared separately from the training data converges.
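
The stopping rule above can be sketched as follows. This is a minimal sketch, not the patent's implementation: `step_fn` stands for one pass of steps S52 to S56, `eval_fn` stands for scoring on the development data, and the convergence test (`patience`, `tol`) is an assumption, since the text does not say how convergence is judged.

```python
from typing import Callable, Tuple


def train_until_converged(
    step_fn: Callable[[], None],
    eval_fn: Callable[[], float],
    max_iters: int = 1000,
    patience: int = 3,
    tol: float = 1e-4,
) -> Tuple[int, float]:
    """Repeat S52-S56 for at most max_iters passes, or until the development
    score has not improved by more than tol for `patience` passes in a row."""
    best = float("-inf")
    stale = 0
    done = 0
    for done in range(1, max_iters + 1):
        step_fn()          # S52-S56: forward pass, per-layer errors, update
        score = eval_fn()  # accuracy on held-out development data
        if score > best + tol:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:  # assumed convergence criterion
                break
    return done, best
```

Setting `patience` larger than `max_iters` reduces this to the fixed-count variant.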

図３（Ｂ）には、複数の層を有するニューラルネットワークからなる最適化済モデルを用いた推論方法の処理手順を示す。図３（Ｂ）に示す主要なステップは、典型的には、プロセッサが推論プログラムを実行することで実現される。 FIG. 3B shows a processing procedure of an inference method using an optimized model including a neural network having a plurality of layers. The main steps shown in FIG. 3B are typically realized by a processor executing an inference program.

ここで、最適化済モデルは、図３（Ａ）に示す最適化方法の処理手順に従って生成される。すなわち、最適化済モデルは、訓練データに含まれる入力信号をニューラルネットワークに入力したときに算出される、最深層を含む１つ以上の層の各々から出力される出力信号と、訓練データに含まれる入力信号に対応付けられた正解出力信号とのそれぞれの誤差に基づいて、パラメタを最適化することで生成される。 Here, the optimized model is generated in accordance with the processing procedure of the optimization method shown in FIG. 3(A). That is, the optimized model is generated by optimizing the parameters based on the respective errors between the output signals output from each of one or more layers including the deepest layer, which are calculated when the input signal included in the training data is input to the neural network, and the correct output signal associated with that input signal.

図３（Ｂ）を参照して、任意の入力信号を最適化済モデルに入力する（ステップＳ６０）。そして、最適化済モデルの最深層に向かって順番に出力信号を算出する（ステップＳ６２）。すなわち、入力信号に対して、各層に規定される非線形変換が順番に実行される。 Referring to FIG. 3(B), an arbitrary input signal is input to the optimized model (step S60). Then, the output signal is calculated in order toward the deepest layer of the optimized model (step S62). That is, the nonlinear transformation defined in each layer is sequentially performed on the input signal.

最終的に、最適化済モデルに含まれる複数の層のうち最深層を含む任意の層の出力信号を推論結果として出力する（ステップＳ６４）。そして、推論処理は終了する。 Finally, the output signal of an arbitrary layer including the deepest layer among the plurality of layers included in the optimized model is output as an inference result (step S64). Then, the inference process ends.
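
Steps S60 to S64 can be sketched as a forward pass that stops at a chosen layer. This is a toy sketch under stated assumptions: each layer is modeled as a plain Python callable on a list of floats, and the names `infer_with_exit` and `exit_layer` are illustrative rather than taken from the patent.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def infer_with_exit(layers: Sequence[Layer], x: Vector, exit_layer: int) -> Vector:
    """Apply layers 1..exit_layer in order and return that layer's output."""
    if not 1 <= exit_layer <= len(layers):
        raise ValueError("exit_layer must be between 1 and the number of layers")
    h = x                               # S60: input signal
    for layer in layers[:exit_layer]:   # S62: transform layer by layer
        h = layer(h)
    return h                            # S64: output of the chosen layer
```

Choosing `exit_layer == len(layers)` gives the usual deepest-layer output; smaller values skip the remaining layers entirely.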

なお、推論結果として出力信号が出力される層は、出力信号の推論性能、および、出力信号が出力されるまでに要する時間の少なくとも一方の要求に基づいて決定されてもよい。この点については、後述の実施の形態１〜３において具体例を挙げて説明する。 The layer in which the output signal is output as the inference result may be determined based on at least one of the inference performance of the output signal and the time required until the output signal is output. This point will be described with reference to specific examples in Embodiments 1 to 3 described later.
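
One way to pick the output layer from such requirements is sketched below. The selection policy is an assumption, since the text only says the choice may depend on inference performance, latency, or both; `LayerProfile` and `choose_exit_layer` are hypothetical names, and the per-layer scores and times would have to be measured beforehand.

```python
from typing import List, NamedTuple, Optional


class LayerProfile(NamedTuple):
    layer: int      # number of layers used at inference
    score: float    # measured quality, e.g. a BLEU score
    seconds: float  # measured processing time


def choose_exit_layer(
    profiles: List[LayerProfile],
    min_score: Optional[float] = None,
    max_seconds: Optional[float] = None,
) -> Optional[int]:
    """Return a layer meeting the given requirements, or None if none does."""
    candidates = [
        p for p in profiles
        if (min_score is None or p.score >= min_score)
        and (max_seconds is None or p.seconds <= max_seconds)
    ]
    if not candidates:
        return None
    if min_score is not None:
        # A quality floor is given: take the fastest layer that reaches it.
        return min(candidates, key=lambda p: p.seconds).layer
    # Only a time budget (or nothing) is given: take the best-scoring layer.
    return max(candidates, key=lambda p: p.score).layer
```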

本実施の形態に従うニューラルネットワークおよびその最適化手法によれば、すべての層に対して誤差情報をより直接的に反映したパラメタの更新が可能となるため、モデルの頑健性（ロバスト性）を高めることができる。このため、最適化済モデルの使用時（推論フェーズ）において、訓練時（最適化フェーズ）よりも少ない任意の数の層のみを用いた場合でも、性能が極端に劣化することを防止でき、ひいては処理速度の改善が可能となる。このように、Ｎ層のニューラルネットワークを最適化した場合は、使用時（推論フェーズ）において、１〜Ｎ層のＮ段階の柔軟性を実現できる。 According to the neural network and its optimization method according to the present embodiment, the parameters of all layers can be updated in a way that reflects the error information more directly, so the robustness of the model can be improved. Therefore, when the optimized model is used (inference phase), extreme degradation in performance can be prevented even when only an arbitrary number of layers smaller than that used during training (optimization phase) is used, which in turn makes it possible to improve the processing speed. In this way, when an N-layer neural network is optimized, N levels of flexibility, from 1 to N layers, are available at the time of use (inference phase).

［Ｃ．アプリケーション例］ [C. Application example]
次に、本実施の形態に従うニューラルネットワークおよびその最適化手法を適用したアプリケーション例について説明する。 Next, an application example to which the neural network according to the present embodiment and its optimization method are applied will be described.

上述したように、本実施の形態に従うニューラルネットワークおよびその最適化手法は、ニューラルネットワーク全般に適用可能である。本明細書においては、アプリケーションの一例として、系列変換モデル、特にニューラル機械翻訳を想定する。 As described above, the neural network according to the present embodiment and its optimization method are applicable to neural networks in general. In this specification, a sequence-to-sequence model, in particular neural machine translation, is assumed as an example application.

具体的には、後述の実施の形態１および２においては、非特許文献８に示されるような６層のＴｒａｎｓｆｏｒｍｅｒモデルを採用し、後述の実施の形態３においては、非特許文献２に示されるような６層のＲｅｃｕｒｒｅｎｔｌｙＳｔａｃｋｅｄＴｒａｎｓｆｏｒｍｅｒ（ＲＳ−Ｔｒａｎｓｆｏｒｍｅｒ）モデルを採用した。 Specifically, Embodiments 1 and 2 described later adopt a 6-layer Transformer model as shown in Non-Patent Document 8, and Embodiment 3 described later adopts a 6-layer Recurrently Stacked Transformer (RS-Transformer) model as shown in Non-Patent Document 2.

図４は、ニューラル機械翻訳を実現するＴｒａｎｓｆｏｒｍｅｒモデルの一例を示す模式図である。図４を参照して、Ｔｒａｎｓｆｏｒｍｅｒモデル２０においては、入力信号を第１言語のシーケンスとし、出力信号を第２言語のシーケンスとすることで、ニューラル機械翻訳を実現する。なお、ニューラル機械翻訳は、分類問題として捉えることができる。より具体的には、Ｔｒａｎｓｆｏｒｍｅｒモデル２０は、エンコーダ３０と、デコーダ４０とを含む。 FIG. 4 is a schematic diagram showing an example of a Transformer model that realizes neural machine translation. Referring to FIG. 4, in the Transformer model 20, the input signal is a sequence in the first language and the output signal is a sequence in the second language, thereby realizing neural machine translation. Note that neural machine translation can be regarded as a classification problem. More specifically, the Transformer model 20 includes an encoder 30 and a decoder 40.

エンコーダ３０は、入力信号に含まれる特徴的な情報を出力する。エンコーダ３０は、入力信号に含まれる特徴的な情報を抽出するためのＮ層の隠れ層３２を有している。エンコーダ３０の前段には、入力信号であるシーケンス（自然言語）中の各語を固定次元のベクトルに変換するための入力層３６が配置されている。 The encoder 30 outputs characteristic information included in the input signal. The encoder 30 has N hidden layers 32 for extracting characteristic information contained in the input signal. An input layer 36 for converting each word in the sequence (natural language) which is the input signal into a fixed-dimensional vector is arranged in the preceding stage of the encoder 30.

デコーダ４０は、先に出力した出力信号（既出力）および入力信号に含まれる特徴的な情報の入力を受けて、出力信号を決定する。デコーダ４０は、Ｍ層の隠れ層４２を有している。デコーダ４０の前段には、既出力であるシーケンス（自然言語）中の各語を固定次元のベクトルに変換するための入力層４６が配置されている。 The decoder 40 receives the previously output output signal (already output) and the characteristic information included in the input signal, and determines the output signal. The decoder 40 has M hidden layers 42. An input layer 46 for converting each word in the already output sequence (natural language) into a fixed-dimensional vector is arranged in the preceding stage of the decoder 40.

実施の形態１〜３においては、図４に示すようなＴｒａｎｓｆｏｒｍｅｒモデルを用いたニューラル機械翻訳の性能を評価した。 In the first to third embodiments, the performance of neural machine translation using the Transformer model as shown in FIG. 4 was evaluated.

［Ｄ．実施の形態１］ [D. Embodiment 1]
図５は、実施の形態１に従う最適化処理を説明するための模式図である。図５を参照して、実施の形態１においては、デコーダ４０のＭ層の隠れ層４２（一例として、６層）からのそれぞれの出力信号を用いて誤差情報を生成する。経路５０に沿って、デコーダ４０の最深層から浅い層に向かって、および、エンコーダ３０の最深層から浅い層に向かって、誤差情報が順番に逆伝搬する。 FIG. 5 is a schematic diagram for explaining the optimization process according to the first embodiment. Referring to FIG. 5, in the first embodiment, error information is generated using the respective output signals from the M hidden layers 42 (6 layers as an example) of the decoder 40. The error information back-propagates in order along the path 50, from the deepest layer toward the shallow layers of the decoder 40 and from the deepest layer toward the shallow layers of the encoder 30.

図６は、実施の形態１に従う最適化処理の主要な処理手順を示すフローチャートである。図６を参照して、まず、最適化処理に用いられる訓練データを用意する（ステップＳ１００）。 FIG. 6 is a flowchart showing a main processing procedure of the optimization processing according to the first embodiment. Referring to FIG. 6, first, training data used for the optimization process is prepared (step S100).

続いて、訓練データに含まれる入力信号に基づいて、Ｔｒａｎｓｆｏｒｍｅｒモデル２０のエンコーダ３０の入力信号ｅｎｃ_０として入力するテンソルＸを算出する（ｅｎｃ_０＝Ｘ）（ステップＳ１０２）。また、誤差情報ｌｏｓｓをゼロに初期化する（ｌｏｓｓ＝０）（ステップＳ１０４）。 Then, based on the input signal included in the training data, the tensor X to be input as the input signal enc_0 of the encoder 30 of the Transformer model 20 is calculated (enc_0 = X) (step S102). In addition, the error information loss is initialized to zero (loss = 0) (step S104).

続いて、エンコーダ３０の各層の出力信号を算出する。すなわち、エンコーダ３０に含まれる隠れ層３２の層位置を示すインデックスｉ（１≦ｉ≦Ｎ）について、出力信号ｅｎｃ_ｉ＝Ｌ_ｉ ^ｅｎｃ（ｅｎｃ_ｉ−１）をそれぞれ算出する（ステップＳ１１０）。ここで、Ｌ_ｉ ^ｅｎｃは、エンコーダ３０に含まれるｉ番目の隠れ層３２の非線形変換を示す。エンコーダ３０の最深層の出力である出力信号ｅｎｃ_Ｎ（入力信号に含まれる特徴的な情報）がデコーダ４０へ与えられることになる。 Then, the output signal of each layer of the encoder 30 is calculated. That is, the output signal enc_i = L_i^enc(enc_{i-1}) is calculated for each index i (1 ≤ i ≤ N) indicating the layer position of the hidden layer 32 included in the encoder 30 (step S110). Here, L_i^enc represents the nonlinear transformation of the i-th hidden layer 32 included in the encoder 30. The output signal enc_N (characteristic information included in the input signal), which is the output of the deepest layer of the encoder 30, is given to the decoder 40.

続いて、デコーダ４０の各層の出力信号および誤差を算出する。すなわち、デコーダ４０に含まれる隠れ層４２の層位置を示すインデックスｊ（１≦ｊ≦Ｍ）について、エンコーダの最深層の出力信号ｅｎｃ_Ｎを参照しつつ、出力信号ｄｅｃ_ｊ＝Ｌ_ｊ ^ｄｅｃ（ｄｅｃ_ｊ−１，ｅｎｃ_Ｎ）をそれぞれ算出する（ステップＳ１２０）。ここで、Ｌ_ｊ ^ｄｅｃは、デコーダ４０に含まれるｊ番目の隠れ層４２の非線形変換を示す。 Then, the output signal and error of each layer of the decoder 40 are calculated. That is, for each index j (1 ≤ j ≤ M) indicating the layer position of the hidden layer 42 included in the decoder 40, the output signal dec_j = L_j^dec(dec_{j-1}, enc_N) is calculated with reference to the output signal enc_N of the deepest layer of the encoder (step S120). Here, L_j^dec represents the nonlinear transformation of the j-th hidden layer 42 included in the decoder 40.

デコーダ４０の各層において、確率分布としての出力信号Ｙ＾_ｊ＝ｓｏｆｔｍａｘ（ｄｅｃ_ｊ）を算出する（ステップＳ１２２）。なお、電子出願システムの制約上、ハット記号「＾」を対象の文字に続けて記載している（以下、同様である。）。 In each layer of the decoder 40, the output signal Y^_j = softmax(dec_j) is calculated as a probability distribution (step S122). Note that, due to restrictions of the electronic filing system, the hat symbol “^” is written after the character it applies to (the same applies hereinafter).

さらに、確率分布としての出力信号Ｙ＾_ｊと離散値としての正解出力信号Ｙとの誤差を交差エントロピーとして算出し、誤差情報ｌｏｓｓに加算する（ステップＳ１２４）。すなわち、誤差情報ｌｏｓｓ＝ｌｏｓｓ＋ｃｒｏｓｓ＿ｅｎｔｒｏｐｙ（Ｙ＾_ｊ，Ｙ）が算出される。 Further, the error between the output signal Y^_j as a probability distribution and the correct output signal Y as a discrete value is calculated as cross entropy and added to the error information loss (step S124). That is, the error information loss = loss + cross_entropy(Y^_j, Y) is calculated.

最終的に、デコーダ４０の各層において算出された誤差の平均値が、パラメタの最適化に用いられる誤差情報として決定される（ステップＳ１３０）。すなわち、誤差情報ｌｏｓｓ＝ｌｏｓｓ／Ｍが算出される。そして、算出された誤差情報ｌｏｓｓに基づいて、デコーダ４０の最深層から浅い層に向かって順番にパラメタが更新され、続いて、エンコーダ３０の最深層から浅い層に向かって順番にパラメタが更新される（ステップＳ１３２）。 Finally, the average value of the errors calculated in each layer of the decoder 40 is determined as error information used for parameter optimization (step S130). That is, the error information loss=loss/M is calculated. Then, based on the calculated error information loss, the parameters are updated in order from the deepest layer to the shallow layer of the decoder 40, and subsequently, the parameters are updated in order from the deepest layer to the shallow layer of the encoder 30. (Step S132).
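
Steps S102 to S130 can be sketched for a single target token as follows. This is a toy sketch under stated assumptions: layers are plain callables on lists of floats, each dec_j is treated directly as a vector of class logits, and `dec_0` (the representation of the previously output tokens) is passed in by the caller. The function names are illustrative, and the parameter update of step S132 is left to whatever optimizer is used.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Vector, target: int) -> float:
    return -math.log(probs[target])


def decoder_exit_loss(
    enc_layers: Sequence[Callable[[Vector], Vector]],
    dec_layers: Sequence[Callable[[Vector, Vector], Vector]],
    x: Vector,
    dec_0: Vector,
    target: int,
) -> float:
    enc = x                                   # S102: enc_0 = X
    loss = 0.0                                # S104: loss = 0
    for enc_layer in enc_layers:              # S110: enc_i = L_i^enc(enc_{i-1})
        enc = enc_layer(enc)
    dec = dec_0
    for dec_layer in dec_layers:              # S120: dec_j = L_j^dec(dec_{j-1}, enc_N)
        dec = dec_layer(dec, enc)
        y_hat = softmax(dec)                  # S122: Y^_j = softmax(dec_j)
        loss += cross_entropy(y_hat, target)  # S124: accumulate per-layer error
    return loss / len(dec_layers)             # S130: loss = loss / M
```

Because every decoder layer contributes a term, the gradient of the averaged loss reaches each layer directly rather than only through the deepest layer.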

通常は、上述したステップＳ１０２以下の処理が複数回に亘って繰り返される。 Usually, the processing from step S102 onward described above is repeated multiple times.
なお、説明の便宜上、図５および図６においては記載を省略しているが、実際には、バッチノーマライゼーション（Batch Normalization）やドロップアウト（dropout）などの過学習を回避するための処理を適宜配置してもよい。また、最適化処理を高速化するための任意の処理を適宜配置してもよい（例えば、非特許文献５〜７など参照）。 Although omitted from FIGS. 5 and 6 for convenience of description, in practice, processing for avoiding overfitting, such as batch normalization and dropout, may be arranged as appropriate. Further, any processing for speeding up the optimization process may be arranged as appropriate (see, for example, Non-Patent Documents 5 to 7).

上述した最適化処理による性能を以下の２つの翻訳タスクについて評価した。 The performance of the optimization process described above was evaluated on the following two translation tasks.
１番目の翻訳タスクとして、情報通信研究機構（ＮＩＣＴ）により開発された音声翻訳向け多言語対訳コーパス（非特許文献９および非特許文献１０参照）を用いた英日翻訳タスクを設定した。多言語対訳コーパスから、訓練データとして約４０万文対を設定し、評価用データとして約２０００文対を設定した。 As the first translation task, an English-Japanese translation task was set up using a multilingual parallel corpus for speech translation developed by the National Institute of Information and Communications Technology (NICT) (see Non-Patent Documents 9 and 10). From the multilingual parallel corpus, about 400,000 sentence pairs were set as training data and about 2,000 sentence pairs as evaluation data.

２番目の翻訳タスクとして、ニュース分野の対訳データ（非特許文献１１および非特許文献１２参照）を用いた英独翻訳タスクを設定した。ニュース分野の対訳データから、訓練データとして約５６０万文対を設定し、評価用データとして約３０００文対を設定した。 As the second translation task, an English-German translation task using parallel translation data in the news field (see Non-Patent Document 11 and Non-Patent Document 12) was set. From the bilingual data in the news field, about 5.6 million sentence pairs were set as training data, and about 3000 sentence pairs were set as evaluation data.

それぞれの翻訳タスクについて、翻訳性能をＢＬＥＵスコア（非特許文献１３参照）および翻訳の速度で評価した。 For each translation task, translation performance was evaluated by BLEU score (see Non-Patent Document 13) and translation speed.

図７は、音声翻訳向け多言語対訳コーパスを用いた英日翻訳タスクについての評価結果を示すグラフである。図８は、ニュース分野の対訳データを用いた英独翻訳タスクについての評価結果を示すグラフである。 FIG. 7 is a graph showing an evaluation result of an English-Japanese translation task using a multilingual parallel translation corpus for speech translation. FIG. 8 is a graph showing an evaluation result of an English-German translation task using parallel translation data in the news field.

図７および図８に示す「ＢＬＵＥ：１×６」は、エンコーダ３０の層数を６とし、デコーダ４０の層数を６とした上で、図６に示す手順に従う最適化処理により得られた最適化済モデルの翻訳性能を示す。「ＢＬＵＥ：６−ｋ」は、エンコーダ３０の層数を６とし、デコーダ４０の層数を１〜６と異ならせた６種類のモデルについて関連技術に従う最適化処理（最深層の誤差情報のみに基づく最適化処理）により得られたそれぞれの最適化済モデル（６種類）の翻訳性能を示す。「ＢＬＵＥ：６−６」は、エンコーダ３０の層数を６とし、デコーダ４０の層数を６とした上で、関連技術に従う最適化処理により得られた最適化済モデルの翻訳性能を示す。「ＢＬＵＥ：６−６」に示す翻訳性能は、同一の最適化済モデルについて、推論フェーズにおいて使用するデコーダ４０の層数を１〜６にそれぞれ異ならせたものである。 “BLUE: 1×6” shown in FIGS. 7 and 8 indicates the translation performance of the optimized model obtained by the optimization process according to the procedure shown in FIG. 6, with the number of layers of the encoder 30 set to 6 and the number of layers of the decoder 40 set to 6. “BLUE: 6-k” indicates the translation performance of each of the six optimized models obtained by the optimization process according to the related art (optimization based only on the error information of the deepest layer) for six types of models in which the number of layers of the encoder 30 is 6 and the number of layers of the decoder 40 is varied from 1 to 6. “BLUE: 6-6” indicates the translation performance of the optimized model obtained by the optimization process according to the related art, with the number of layers of the encoder 30 set to 6 and the number of layers of the decoder 40 set to 6. The translation performance shown as “BLUE: 6-6” is for the same optimized model, with the number of layers of the decoder 40 used in the inference phase varied from 1 to 6.

なお、図６に示す手順に従う最適化処理の実行には、関連技術に従う最適化処理の実行に要した時間の約２．０倍の時間を要した。 It should be noted that execution of the optimization process according to the procedure shown in FIG. 6 took about 2.0 times the time taken to execute the optimization process according to the related art.

図７および図８の横軸「１」〜「６」は、推論フェーズ（使用時）において使用するデコーダ４０の層を示す。例えば、横軸が「３」の位置においては、デコーダ４０の３番目の層からの出力信号が推論結果として使用された場合の性能を示す。 The horizontal axes “1” to “6” in FIGS. 7 and 8 indicate the layers of the decoder 40 used in the inference phase (when used). For example, in the position where the horizontal axis is "3", performance is shown when the output signal from the third layer of the decoder 40 is used as the inference result.

図７および図８に示すように、関連技術に従う最適化処理により得られた最適化済モデル（ＢＬＵＥ：６−６）においては、使用するデコーダ４０の層数を最適化フェーズ（訓練時）よりも減らした場合には、翻訳性能（ＢＬＥＵスコア）が極端に劣化していることが分かる。 As shown in FIGS. 7 and 8, in the optimized model obtained by the optimization process according to the related art (BLUE: 6-6), the translation performance (BLEU score) degrades severely when the number of layers of the decoder 40 used is reduced below that of the optimization phase (training).

また、推論フェーズで使用するのと同じ層数のデコーダ４０を有するモデルについて、関連技術に従う最適化処理により得られた最適化済モデル（ＢＬＵＥ：６−ｋ）においては、デコーダ４０の層数が２〜６の間では、ＢＬＵＥ：６−６と概ね同等の翻訳性能であることが分かる。 Further, for the optimized models (BLUE: 6-k) obtained by the optimization process according to the related art for models having the same number of decoder 40 layers as used in the inference phase, the translation performance is roughly equivalent to that of BLUE: 6-6 when the number of layers of the decoder 40 is between 2 and 6.

これに対して、実施の形態１に従う最適化処理によれば、デコーダ４０のすべての層の出力信号に基づいて、６−１〜６−６の６個のモデルを同時に最適化しており（ＢＬＵＥ：１×６）、使用するデコーダ４０の層数を最適化フェーズ（訓練時）よりも減らした場合であっても、翻訳性能の劣化はわずかであることが分かる。 In contrast, the optimization process according to the first embodiment optimizes the six models 6-1 to 6-6 simultaneously based on the output signals of all layers of the decoder 40 (BLUE: 1×6), and the degradation in translation performance is slight even when the number of layers of the decoder 40 used is reduced below that of the optimization phase (training).

また、図７および図８に示す「翻訳時間［ｓｅｃ］」は、実施の形態１に従う最適化処理により得られた最適化済モデル（ＢＬＵＥ：１×６）を用いて、評価用データ（英日翻訳タスクについては約２０００文、英独翻訳タスクについては約３０００文）の翻訳に要した処理時間（モデルのロード時間および入力文のエンコードに要する時間を含む）を表す。この処理時間のグラフによれば、層数を減らすことで、大幅な高速化を実現できることが分かる。 Further, “translation time [sec]” shown in FIGS. 7 and 8 represents the processing time (including the model loading time and the time required to encode the input sentences) required to translate the evaluation data (about 2,000 sentences for the English-Japanese translation task and about 3,000 sentences for the English-German translation task) using the optimized model (BLUE: 1×6) obtained by the optimization process according to the first embodiment. The processing time graph shows that a significant speedup can be achieved by reducing the number of layers.

具体的には、１番目の翻訳タスク（英日翻訳タスク）においては、デコーダ４０の２つの層を用いることで、処理時間を約４０％低減でき、デコーダ４０の３つの層を用いることで、処理時間を約３０％低減できることが分かる。また、２番目の翻訳タスク（英独翻訳タスク）においては、デコーダ４０の２つの層を用いることで、処理時間を約５７％低減でき、デコーダ４０の３つの層を用いることで、処理時間を約３６％低減できることが分かる。 Specifically, in the first translation task (English-Japanese translation task), the processing time can be reduced by about 40% by using the two layers of the decoder 40, and by using the three layers of the decoder 40, It can be seen that the processing time can be reduced by about 30%. Further, in the second translation task (English-German translation task), the processing time can be reduced by about 57% by using the two layers of the decoder 40, and the processing time can be reduced by using the three layers of the decoder 40. It can be seen that the reduction can be about 36%.

［Ｅ．実施の形態２］ [E. Embodiment 2]
実施の形態１においては、デコーダ４０の各層の出力信号に基づいて算出される誤差情報を用いる最適化処理について説明したが、実施の形態２においては、エンコーダ３０およびデコーダ４０の各層の出力信号に基づいて算出される誤差情報を用いる最適化処理について説明する。 The first embodiment described an optimization process that uses error information calculated based on the output signal of each layer of the decoder 40. The second embodiment describes an optimization process that uses error information calculated based on the output signal of each layer of both the encoder 30 and the decoder 40.

図９および図１０は、実施の形態２に従う最適化処理を説明するための模式図である。図９および図１０を参照して、実施の形態２においては、エンコーダ３０のＮ層の隠れ層３２（一例として、６層）の各々からの出力信号、および、デコーダ４０のＭ層の隠れ層４２（一例として、６層）からのそれぞれの出力信号を用いて、誤差情報を生成する。生成された誤差情報は、経路５０に沿って、デコーダ４０の最深層から浅い層に向かって、および、エンコーダ３０の最深層から浅い層に向かって、順番に逆伝搬する。 FIGS. 9 and 10 are schematic diagrams for explaining the optimization process according to the second embodiment. Referring to FIGS. 9 and 10, in the second embodiment, error information is generated using the output signals from each of the N hidden layers 32 (6 layers as an example) of the encoder 30 and the respective output signals from the M hidden layers 42 (6 layers as an example) of the decoder 40. The generated error information back-propagates in order along the path 50, from the deepest layer toward the shallow layers of the decoder 40 and from the deepest layer toward the shallow layers of the encoder 30.

図９には、一例として、エンコーダ３０の最深層（Ｎ番目の隠れ層３２）からの出力信号ｅｎｃ_Ｎがデコーダ４０に入力される場合を示し、図１０には、一例として、エンコーダ３０のｉ番目の層（ｉ番目の隠れ層３２）からの出力信号ｅｎｃ_ｉがデコーダ４０に入力される場合を示す。図９および図１０に示すように、実施の形態２においては、エンコーダ３０の各層からのＮ通りの出力信号と、デコーダ４０の各層からのＭ通りの出力信号との組み合わせ（Ｎ×Ｍ）のそれぞれについて誤差情報が存在し得る。 FIG. 9 shows, as an example, the case where the output signal enc_N from the deepest layer (N-th hidden layer 32) of the encoder 30 is input to the decoder 40, and FIG. 10 shows, as an example, the case where the output signal enc_i from the i-th layer (i-th hidden layer 32) of the encoder 30 is input to the decoder 40. As shown in FIGS. 9 and 10, in the second embodiment, error information can exist for each of the N×M combinations of the N output signals from the layers of the encoder 30 and the M output signals from the layers of the decoder 40.

図１１は、実施の形態２に従う最適化処理の主要な処理手順を示すフローチャートである。図１１を参照して、まず、最適化処理に用いられる訓練データを用意する（ステップＳ２００）。 FIG. 11 is a flowchart showing the main processing procedure of the optimization processing according to the second embodiment. Referring to FIG. 11, first, training data used for the optimization processing is prepared (step S200).

続いて、訓練データに含まれる入力信号に基づいて、Ｔｒａｎｓｆｏｒｍｅｒモデル２０のエンコーダ３０の入力信号ｅｎｃ_０として入力するテンソルＸを算出する（ｅｎｃ_０＝Ｘ）（ステップＳ２０２）。また、誤差情報ｌｏｓｓをゼロに初期化する（ｌｏｓｓ＝０）（ステップＳ２０４）。 Then, based on the input signal included in the training data, the tensor X to be input as the input signal enc_0 of the encoder 30 of the Transformer model 20 is calculated (enc_0 = X) (step S202). In addition, the error information loss is initialized to zero (loss = 0) (step S204).

続いて、エンコーダ３０の各層の出力信号および誤差を算出する。すなわち、エンコーダ３０に含まれる隠れ層３２の層位置を示すインデックスｉ（１≦ｉ≦Ｎ）について、ステップＳ２１０〜Ｓ２１６の処理が繰り返される。 Then, the output signal and error of each layer of the encoder 30 are calculated. That is, the processes of steps S210 to S216 are repeated for the index i (1≦i≦N) indicating the layer position of the hidden layer 32 included in the encoder 30.

より具体的には、出力信号ｅｎｃ_ｉ＝Ｌ_ｉ ^ｅｎｃ（ｅｎｃ_ｉ−１）をそれぞれ算出する（ステップＳ２１０）。ここで、Ｌ_ｉ ^ｅｎｃは、エンコーダ３０に含まれる隠れ層３２の非線形変換を示す。この時点で、エンコーダ３０が出力する出力信号ｅｎｃ_ｉがデコーダ４０へ与えられることになる。 More specifically, the output signal enc_i = L_i^enc(enc_{i-1}) is calculated (step S210). Here, L_i^enc represents the nonlinear transformation of the hidden layer 32 included in the encoder 30. At this point, the output signal enc_i output from the encoder 30 is given to the decoder 40.

さらに、インデックスｉの各々について、デコーダ４０の各層の出力信号および誤差を算出する。すなわち、デコーダ４０に含まれる隠れ層４２の層位置を示すインデックスｊ（１≦ｊ≦Ｍ）について、ステップＳ２１２〜Ｓ２１６の処理が繰り返される。 Further, for each index i, the output signal and error of each layer of the decoder 40 are calculated. That is, the processes of steps S212 to S216 are repeated for the index j (1≦j≦M) indicating the layer position of the hidden layer 42 included in the decoder 40.

より具体的には、出力信号ｄｅｃ_ｊ＝Ｌ_ｊ ^ｄｅｃ（ｄｅｃ_ｊ−１，ｅｎｃ_ｉ）をそれぞれ算出する（ステップＳ２１２）。ここで、Ｌ_ｊ ^ｄｅｃは、デコーダ４０に含まれるｊ番目の隠れ層４２の非線形変換を示す。 More specifically, the output signal dec_j = L_j^dec(dec_{j-1}, enc_i) is calculated (step S212). Here, L_j^dec represents the nonlinear transformation of the j-th hidden layer 42 included in the decoder 40.

そして、確率分布としての出力信号Ｙ＾_ｉ，ｊ＝ｓｏｆｔｍａｘ（ｄｅｃ_ｊ）を算出する（ステップＳ２１４）。さらに、確率分布としての出力信号Ｙ＾_ｉ，ｊと離散値としての正解出力信号Ｙとの誤差を交差エントロピーとして算出し、誤差情報ｌｏｓｓに加算する（ステップＳ２１６）。すなわち、誤差情報ｌｏｓｓ＝ｌｏｓｓ＋ｃｒｏｓｓ＿ｅｎｔｒｏｐｙ（Ｙ＾_ｉ，ｊ，Ｙ）が算出される。 Then, the output signal Y^_{i,j} = softmax(dec_j) is calculated as a probability distribution (step S214). Furthermore, the error between the output signal Y^_{i,j} as a probability distribution and the correct output signal Y as a discrete value is calculated as cross entropy and added to the error information loss (step S216). That is, the error information loss = loss + cross_entropy(Y^_{i,j}, Y) is calculated.

最終的に、エンコーダ３０の各層（Ｎ層）とデコーダ４０の各層（Ｍ層）との組み合わせに関して算出された誤差の平均値が、パラメタの最適化に用いられる誤差情報として決定される（ステップＳ２２０）。すなわち、誤差情報ｌｏｓｓ＝ｌｏｓｓ／（Ｎ×Ｍ）が算出される。そして、算出された誤差情報ｌｏｓｓに基づいて、デコーダ４０の最深層から浅い層に向かって順番にパラメタが更新され、続いて、エンコーダ３０の最深層から浅い層に向かって順番にパラメタが更新される（ステップＳ２２２）。 Finally, the average of the errors calculated for the combinations of each layer (N layers) of the encoder 30 and each layer (M layers) of the decoder 40 is determined as the error information used for parameter optimization (step S220). That is, the error information loss = loss / (N×M) is calculated. Then, based on the calculated error information loss, the parameters are updated in order from the deepest layer toward the shallow layers of the decoder 40, and subsequently in order from the deepest layer toward the shallow layers of the encoder 30 (step S222).
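
The nested loop of steps S202 to S220 can be sketched for a single target token as follows. As a toy sketch, it assumes layers are plain callables on lists of floats, each dec_j is treated directly as a vector of class logits, and `dec_0` is supplied by the caller; the function names are illustrative, and the parameter update of step S222 is omitted.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Vector, target: int) -> float:
    return -math.log(probs[target])


def encoder_decoder_exit_loss(
    enc_layers: Sequence[Callable[[Vector], Vector]],
    dec_layers: Sequence[Callable[[Vector, Vector], Vector]],
    x: Vector,
    dec_0: Vector,
    target: int,
) -> float:
    enc = x                                       # S202: enc_0 = X
    loss = 0.0                                    # S204: loss = 0
    for enc_layer in enc_layers:                  # index i
        enc = enc_layer(enc)                      # S210: enc_i = L_i^enc(enc_{i-1})
        dec = dec_0                               # restart the decoder for this enc_i
        for dec_layer in dec_layers:              # index j
            dec = dec_layer(dec, enc)             # S212: dec_j = L_j^dec(dec_{j-1}, enc_i)
            y_hat = softmax(dec)                  # S214: Y^_{i,j} = softmax(dec_j)
            loss += cross_entropy(y_hat, target)  # S216: accumulate error
    return loss / (len(enc_layers) * len(dec_layers))  # S220: loss / (N x M)
```

In this sketch the full decoder stack runs once per encoder layer, so a training step performs N decoder passes instead of one.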

通常は、上述したステップＳ２０２以下の処理が複数回に亘って繰り返される。 Usually, the processing from step S202 onward described above is repeated multiple times.
なお、説明の便宜上、図９〜図１１においては記載を省略しているが、実際には、バッチノーマライゼーションやドロップアウトなどの過学習を回避するための処理を適宜配置してもよい。また、最適化処理を高速化するための任意の処理を適宜配置してもよい（例えば、非特許文献５〜７など参照）。 Although omitted from FIGS. 9 to 11 for convenience of description, in practice, processing for avoiding overfitting, such as batch normalization and dropout, may be arranged as appropriate. Further, any processing for speeding up the optimization process may be arranged as appropriate (see, for example, Non-Patent Documents 5 to 7).

上述した最適化処理による性能を１番目の翻訳タスク（実施の形態１において説明した英日翻訳タスクと同じ）および２番目の翻訳タスク（実施の形態１において説明した英独翻訳タスクと同じ）について評価した。実施の形態１と同様に、翻訳性能をＢＬＥＵスコアおよび翻訳の速度でそれぞれ評価した。 The performance of the optimization process described above was evaluated on the first translation task (the same as the English-Japanese translation task described in the first embodiment) and the second translation task (the same as the English-German translation task described in the first embodiment). As in the first embodiment, translation performance was evaluated by BLEU score and translation speed.

図１２は、実施の形態２における英日翻訳タスクについての評価結果を示すグラフである。図１２（Ａ）には、図１１に示す手順に従う最適化処理により得られた最適化済モデルについてのＢＬＥＵスコアの評価結果を示し、図１２（Ｂ）には、図１１に示す手順に従う最適化処理により得られた最適化済モデルについての処理時間の評価結果を示す。処理時間は、評価用データ（約２０００文）の翻訳に要した処理時間（モデルのロード時間および入力文のエンコードに要する時間を含む）を表す。 FIG. 12 is a graph showing the evaluation results for the English-Japanese translation task in the second embodiment. FIG. 12(A) shows the BLEU score evaluation results for the optimized model obtained by the optimization process according to the procedure shown in FIG. 11, and FIG. 12(B) shows the processing time evaluation results for the same optimized model. The processing time is the time required to translate the evaluation data (about 2,000 sentences), including the model loading time and the time required to encode the input sentences.

図１３は、実施の形態２における英独翻訳タスクについての評価結果を示すグラフである。図１３（Ａ）には、図１１に示す手順に従う最適化処理により得られた最適化済モデルについてのＢＬＥＵスコアの評価結果を示し、図１３（Ｂ）には、図１１に示す手順に従う最適化処理により得られた最適化済モデルについての処理時間の評価結果を示す。処理時間は、評価用データ（約３０００文）の翻訳に要した処理時間（モデルのロード時間および入力文のエンコードに要する時間を含む）を表す。 FIG. 13 is a graph showing the evaluation results for the English-German translation task in the second embodiment. FIG. 13(A) shows the BLEU score evaluation results for the optimized model obtained by the optimization process according to the procedure shown in FIG. 11, and FIG. 13(B) shows the processing time evaluation results for the same optimized model. The processing time is the time required to translate the evaluation data (about 3,000 sentences), including the model loading time and the time required to encode the input sentences.

図１２（Ａ）、図１２（Ｂ）、図１３（Ａ）および図１３（Ｂ）において、横軸「１」〜「６」は、推論フェーズ（使用時）において使用するエンコーダ３０の層数を示す。また、縦軸「１」〜「６」は、推論フェーズ（使用時）において使用するデコーダ４０の層数を示す。例えば、横軸が「３」および縦軸が「３」の位置においては、エンコーダ３０の３番目の層からの出力信号（入力信号に含まれる特徴的な情報）がデコーダ４０に入力され、デコーダ４０の３番目の層からの出力信号が推論結果として使用された場合の性能を示す。 In FIGS. 12(A), 12(B), 13(A), and 13(B), the horizontal axis values “1” to “6” indicate the number of layers of the encoder 30 used in the inference phase (at the time of use), and the vertical axis values “1” to “6” indicate the number of layers of the decoder 40 used in the inference phase. For example, the position where the horizontal axis is “3” and the vertical axis is “3” shows the performance when the output signal from the third layer of the encoder 30 (characteristic information included in the input signal) is input to the decoder 40 and the output signal from the third layer of the decoder 40 is used as the inference result.

なお、図１１に示す手順に従う最適化処理の実行には、関連技術に従う最適化処理の実行に要した時間の約９．５倍の時間を要した。 It should be noted that execution of the optimization process according to the procedure shown in FIG. 11 took about 9.5 times the time taken to execute the optimization process according to the related art.

図１２（Ａ）に示すように、エンコーダ３０の４つ以上の層およびデコーダ４０の３つ以上の層を用いることで、関連技術に従う最適化処理により得られた最適化済モデルによる翻訳性能（２７．０９ポイント）と同等の翻訳性能（２６．０９〜２６．５３ポイント）を発揮できることが分かる。 As shown in FIG. 12(A), using four or more layers of the encoder 30 and three or more layers of the decoder 40 yields translation performance (26.09 to 26.53 points) comparable to that of the optimized model obtained by the optimization process according to the related art (27.09 points).

図１２（Ｂ）に示すように、エンコーダ３０の４つ以上の層およびデコーダ４０の３つの層を用いた場合には、関連技術に従う最適化処理により得られた最適化済モデルを用いた場合の処理時間（８５．２０単位時間）に比較して、処理時間を約３０％低減できることが分かる（５７．２５〜５９．４９単位時間）。 As shown in FIG. 12(B), when four or more layers of the encoder 30 and three layers of the decoder 40 are used, the processing time (57.25 to 59.49 time units) is about 30% lower than the processing time when the optimized model obtained by the optimization process according to the related art is used (85.20 time units).

また、図１３（Ａ）に示すように、エンコーダ３０の４つの層およびデコーダ４０の４つの層を用いることで、関連技術に従う最適化処理により得られた最適化済モデルによる翻訳性能（３２．３１ポイント）と同等の翻訳性能（３１．４４〜３２．０３ポイント）を発揮できることが分かる。 Further, as shown in FIG. 13(A), using four layers of the encoder 30 and four layers of the decoder 40 yields translation performance (31.44 to 32.03 points) comparable to that of the optimized model obtained by the optimization process according to the related art (32.31 points).

図１３（Ｂ）に示すように、エンコーダ３０の４つ以上の層およびデコーダ４０の４つの層を用いた場合には、関連技術に従う最適化処理により得られた最適化済モデルを用いた場合の処理時間（２５１．６９単位時間）に比較して、処理時間を約３５％低減できることが分かる（１３７．８８〜１６１．４４単位時間）。 As shown in FIG. 13(B), when four or more layers of the encoder 30 and four layers of the decoder 40 are used, the processing time (137.88 to 161.44 time units) is about 35% lower than the processing time when the optimized model obtained by the optimization process according to the related art is used (251.69 time units).
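
The percentage reductions can be recomputed from the reported time units as a quick check. The numbers below are the ones quoted for FIG. 12(B) and FIG. 13(B); the helper name `reduction_percent` is illustrative.

```python
def reduction_percent(baseline: float, reduced: float) -> float:
    """Relative saving of `reduced` against `baseline`, in percent."""
    return 100.0 * (1.0 - reduced / baseline)


# English-Japanese task, FIG. 12(B): baseline 85.20, reduced 57.25 to 59.49
en_ja = [reduction_percent(85.20, t) for t in (57.25, 59.49)]

# English-German task, FIG. 13(B): baseline 251.69, reduced 137.88 to 161.44
en_de = [reduction_percent(251.69, t) for t in (137.88, 161.44)]

print([round(v, 1) for v in en_ja])
print([round(v, 1) for v in en_de])
```

With these figures the English-Japanese range works out to roughly 30% to 33%, and the English-German range to roughly 36% to 45%, so the quoted "about 35%" corresponds to the slower end of the reported range.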

なお、図１２（Ｂ）および図１３（Ｂ）の評価結果によれば、使用するエンコーダ３０の層数を減らすことは、処理の高速化にはあまり有効ではなく、一方、使用するデコーダ４０の層数を減らすことは、処理の高速化にはより有効であることが分かる。 According to the evaluation results in FIGS. 12(B) and 13(B), reducing the number of layers of the encoder 30 used is not very effective for speeding up processing, whereas reducing the number of layers of the decoder 40 used is more effective.

［Ｆ．実施の形態３］ [F. Embodiment 3]
実施の形態１および２においては、エンコーダ３０およびデコーダ４０が複数の異なる層を有するモデルを例示した。このような複数の異なる層を有するエンコーダ３０およびデコーダ４０に代えて、同じ層を再帰的に使用することで、メモリの使用量を抑制しつつ、複数の層と同等の非線形変換を実現できる（非特許文献２など参照）。実施の形態３においては、同じ層を再帰的に使用するモデルに対する最適化処理について説明する。 Embodiments 1 and 2 illustrated models in which the encoder 30 and the decoder 40 each have a plurality of different layers. Instead of an encoder 30 and a decoder 40 having a plurality of different layers, the same layer can be used recursively to realize a nonlinear transformation equivalent to that of a plurality of layers while suppressing memory usage (see, for example, Non-Patent Document 2). The third embodiment describes an optimization process for a model that uses the same layer recursively.

図１４は、実施の形態３に従う最適化処理を説明するための模式図である。図１４を参照して、ＲＳ−Ｔｒａｎｓｆｏｒｍｅｒモデル２０Ａは、隠れ層３２を再帰的に使用可能に結合されたエンコーダ３０Ａと、隠れ層４２を再帰的に使用可能に結合されたデコーダ４０Ａとを含む。エンコーダ３０Ａの出力信号（入力信号に含まれる特徴的な情報）は、デコーダ４０Ａへ出力される。 FIG. 14 is a schematic diagram for explaining the optimization process according to the third embodiment. Referring to FIG. 14, the RS-Transformer model 20A includes an encoder 30A in which the hidden layer 32 is connected so that it can be used recursively, and a decoder 40A in which the hidden layer 42 is connected so that it can be used recursively. The output signal of the encoder 30A (characteristic information included in the input signal) is output to the decoder 40A.

Using hidden layer 32 recursively N times realizes a nonlinear transformation equivalent to N layers, and using hidden layer 42 recursively M times realizes a nonlinear transformation equivalent to M layers. At the same time, because hidden layer 32 and hidden layer 42 each exist as only a single layer, the number of parameters defining RS-Transformer model 20A can be made smaller than that of Transformer model 20.
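The effect of recursive reuse on the parameter count can be illustrated with a toy model. The sketch below is not the RS-Transformer itself: the layer is a hypothetical stand-in (a square weight matrix, a bias and tanh), and names such as `make_layer` and `count_params` are assumptions introduced for illustration.

```python
import math
import random

def make_layer(dim, rng):
    """Create a toy layer: a dim x dim weight matrix and a bias vector."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {"w": w, "b": [0.0] * dim}

def apply_layer(layer, x):
    """Toy nonlinear transformation tanh(W x + b)."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(layer["w"], layer["b"])]

def forward(layers, x):
    """Apply the layers in order; a repeated entry means recursive reuse."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x

def count_params(layers):
    """Count parameters, counting each distinct layer object only once."""
    unique = {id(layer): layer for layer in layers}.values()
    return sum(len(l["w"]) * len(l["w"][0]) + len(l["b"]) for l in unique)

rng = random.Random(0)
dim, n = 4, 6
stacked = [make_layer(dim, rng) for _ in range(n)]   # N different layers
shared = make_layer(dim, rng)
recurrent = [shared] * n                             # one layer used N times

stacked_params = count_params(stacked)
shared_params = count_params(recurrent)
output = forward(recurrent, [0.1] * dim)
```

In this toy setting the shared variant holds one sixth of the layer parameters while still applying six transformations. A full model also has parameters outside the hidden layers, so the overall saving is smaller than this ratio suggests.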

In the third embodiment, for simplicity and as in the first embodiment, encoder 30A uses an error signal calculated only from the output signal of its deepest layer, as in the related art, whereas the optimization process uses error information calculated from the output signal of each layer of decoder 40A (that is, the output signal of each recursive pass). However, as in the second embodiment, an optimization process may instead be adopted that uses both an error signal calculated from the output signal of each layer of encoder 30A and error information calculated from the output signal of each layer of decoder 40A (that is, the output signal of each recursive pass).

In the optimization process for RS-Transformer model 20A shown in FIG. 14, the error information propagates back in order along path 50, from the deepest layer of decoder 40A toward its shallower layers and then from the deepest layer of encoder 30A toward its shallower layers. The recursive processing is executed a predetermined number of times in this backpropagation as well. That is, the parameters are optimized by backpropagating the error information through the same hidden layer multiple times.

図１５は、実施の形態３に従う最適化処理の主要な処理手順を示すフローチャートである。図１５を参照して、まず、最適化処理に用いられる訓練データを用意する（ステップＳ３００）。 FIG. 15 is a flowchart showing the main processing procedure of the optimization processing according to the third embodiment. Referring to FIG. 15, first, training data used for the optimization process is prepared (step S300).

Next, based on the input signal included in the training data, the tensor X to be input as input signal enc_0 of encoder 30A of RS-Transformer model 20A is calculated (enc_0 = X) (step S302). The error information loss is initialized to zero (loss = 0) (step S304).

Next, the output signal of each layer of encoder 30A is calculated. That is, for each index i (1 ≤ i ≤ N) indicating the number of recursive passes through hidden layer 32 of encoder 30A, the output signal enc_i = L^enc(enc_{i-1}) is calculated (step S310). Here, L^enc denotes the nonlinear transformation of hidden layer 32 included in encoder 30A. The output signal enc_N obtained after N recursive passes, which is the output of encoder 30A (corresponding to the output signal of the deepest layer), is given to decoder 40A.

Next, the output signal and error of each layer of decoder 40A are calculated. That is, for each index j (1 ≤ j ≤ M) indicating the number of recursive passes through hidden layer 42 of decoder 40A, the output signal dec_j = L^dec(dec_{j-1}, enc_N) is calculated (step S320). Then, the output signal Ŷ_j = softmax(dec_j) is calculated as a probability distribution (step S322). Further, the error between the output signal Ŷ_j (a probability distribution) and the correct output signal Y (a discrete value) is calculated as the cross entropy and added to the error information loss (step S324). That is, loss = loss + cross_entropy(Ŷ_j, Y) is calculated.

Finally, the average of the errors calculated at each layer of decoder 40A is determined as the error information used for parameter optimization (step S330). That is, loss = loss/N is calculated, which is the average over the M decoder passes when N = M, as in the evaluation described later. Then, based on the calculated error information loss, the parameters are updated in order from the deepest layer of decoder 40A toward its shallower layers, and subsequently in order from the deepest layer of encoder 30A toward its shallower layers (step S332).
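Steps S302 to S330 can be sketched as a forward computation. This is a minimal sketch under stated assumptions, not the patented implementation: `enc_layer` and `dec_layer` are toy stand-ins for L^enc and L^dec, the initial decoder state `dec_0` and all parameter names are hypothetical, and the sum is divided by the number of decoder passes `m` (the text writes loss/N, which is the same value when N = M). The parameter update of step S332 is omitted; in practice it would be left to an automatic differentiation framework.

```python
import math

def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def enc_layer(params, h):
    """Toy stand-in for L^enc: tanh(W h + b)."""
    return [math.tanh(a + b) for a, b in zip(matvec(params["w"], h), params["b"])]

def dec_layer(params, d, enc_out):
    """Toy stand-in for L^dec: tanh(W_d d + W_e enc_N + b)."""
    a = matvec(params["w_d"], d)
    c = matvec(params["w_e"], enc_out)
    return [math.tanh(ai + ci + bi) for ai, ci, bi in zip(a, c, params["b"])]

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def cross_entropy(probs, target_index):
    """Cross entropy against a discrete correct label."""
    return -math.log(probs[target_index])

def multi_depth_loss(x, dec_0, y, enc_params, dec_params, n, m):
    enc = x                                    # S302: enc_0 = X
    loss = 0.0                                 # S304: loss = 0
    for _ in range(n):                         # S310: enc_i = L^enc(enc_{i-1})
        enc = enc_layer(enc_params, enc)
    dec = dec_0
    per_depth = []
    for _ in range(m):                         # S320: dec_j = L^dec(dec_{j-1}, enc_N)
        dec = dec_layer(dec_params, dec, enc)
        y_hat = softmax(dec)                   # S322
        error = cross_entropy(y_hat, y)        # S324
        per_depth.append(error)
        loss += error
    return loss / m, per_depth                 # S330: average over decoder passes

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
enc_params = {"w": identity, "b": [0.0, 0.0, 0.0]}
dec_params = {"w_d": identity, "w_e": identity, "b": [0.0, 0.0, 0.0]}
loss, per_depth = multi_depth_loss([0.5, -0.2, 0.1], [0.0, 0.0, 0.0], 0,
                                   enc_params, dec_params, n=2, m=3)
```

Because every decoder pass contributes its own cross-entropy term, the gradient of this averaged loss pushes each intermediate depth toward the correct output, not only the deepest one.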

Normally, the processing from step S302 onward described above is repeated a plurality of times.
Although omitted from FIGS. 14 and 15 for convenience of explanation, processing for avoiding overfitting, such as batch normalization and dropout, may in practice be inserted as appropriate. Any processing for speeding up the optimization process may also be inserted as appropriate (see, for example, Non-Patent Documents 5 to 7).

上述した最適化処理による性能を１番目の翻訳タスク（実施の形態１において説明した英日翻訳タスクと同じ）について評価した。実施の形態１と同様に、翻訳性能をＢＬＥＵスコアで評価した。なお、エンコーダ３０Ａおよびデコーダ４０Ａの再帰処理の回数は同数（すなわち、図１５において、Ｎ＝Ｍ）とした。 The performance of the optimization process described above was evaluated for the first translation task (same as the English-Japanese translation task described in the first embodiment). Similar to the first embodiment, the translation performance was evaluated by the BLEU score. In addition, the number of times of recursive processing of the encoder 30A and the decoder 40A is the same (that is, N=M in FIG. 15).

図１４に示すＲＳ−Ｔｒａｎｓｆｏｒｍｅｒモデル２０Ａの最適化済モデルは、実施の形態１に従うＴｒａｎｓｆｏｒｍｅｒモデル２０の最適化済モデルに比較して、データサイズが４７％まで低減された。 The optimized model of the RS-Transformer model 20A shown in FIG. 14 has a data size reduced to 47% as compared with the optimized model of the Transformer model 20 according to the first embodiment.

FIG. 16 is a graph showing the evaluation results for the English-Japanese translation task in the third embodiment. FIG. 16(A) shows the BLEU scores of the optimized model obtained by the optimization process according to the related art described in Non-Patent Document 2, and FIG. 16(B) shows the BLEU scores of the optimized model obtained by the optimization process following the procedure shown in FIG. 15.

図１６（Ａ）および図１６（Ｂ）において、横軸「１」〜「６」は、最適化フェーズ（訓練時）における再帰処理の回数を示す。また、縦軸「１」〜「６」は、推論フェーズ（使用時）において使用するデコーダ４０Ａの再帰処理の回数を示す。推論フェーズにおいて使用するエンコーダ３０Ａの再帰処理の回数は、最適化フェーズと同じである。 In FIG. 16A and FIG. 16B, the horizontal axes “1” to “6” indicate the number of times of recursive processing in the optimization phase (during training). Further, the vertical axes “1” to “6” indicate the number of times of recursive processing of the decoder 40A used in the inference phase (when used). The number of recursive processes of the encoder 30A used in the inference phase is the same as that in the optimization phase.

In the graphs shown in FIGS. 16(A) and 16(B), the values on the diagonal from the upper left to the lower right are the results obtained when the inference phase (use) executes the same number of recursive passes as the optimization phase (training). The lower-left part of each graph corresponds to increasing the number of recursive passes of decoder 40A beyond that of the optimization phase (training). Such processing is possible in principle, but because it runs counter to the original purpose of speeding up processing, it was not actually evaluated (these cells show the value "0.00").

As shown in FIGS. 16(A) and 16(B), translation performance improves as the number of recursive passes increases (toward the right of the figure). However, as shown in FIG. 16(A), with the optimized model obtained by the optimization process according to the related art, translation performance (BLEU score) degrades drastically when the number of recursive passes of decoder 40A is reduced below that of the optimization phase (training).

In contrast, as shown in FIG. 16(B), with the optimized model obtained by the optimization process following the procedure shown in FIG. 15, the degradation in translation performance is slight (at most 0.5 BLEU points) even when the number of recursive passes of decoder 40A is reduced below that of the optimization phase (training), and translation performance equivalent to that of the optimized model described in Non-Patent Document 2 is maintained.

[G. Hardware Configuration]
Next, an example of a hardware configuration for implementing the optimization process and the inference process according to the present embodiment will be described.

図１７は、本実施の形態に従う最適化処理および推論処理を実現するハードウェア構成の一例を示す模式図である。本実施の形態に従う最適化処理および推論処理は、典型的には、コンピュータの一例である情報処理装置１００を用いて実現される。 FIG. 17 is a schematic diagram showing an example of a hardware configuration for realizing the optimization process and the inference process according to the present embodiment. The optimization process and the inference process according to the present embodiment are typically realized using information processing device 100 which is an example of a computer.

Referring to FIG. 17, information processing apparatus 100 includes, as its main hardware components, a CPU (central processing unit) 102, a GPU (graphics processing unit) 104, a main memory 106, a display 108, a network interface (I/F) 110, a secondary storage device 112, an input device 122, and an optical drive 124. These components are connected to one another via an internal bus 128.

ＣＰＵ１０２および／またはＧＰＵ１０４は、後述するような各種プログラムを実行することで、本実施の形態に従う最適化処理および推論処理を実現するプロセッサである。ＣＰＵ１０２およびＧＰＵ１０４は、複数個配置されてもよいし、複数のコアを有していてもよい。 CPU 102 and/or GPU 104 is a processor that realizes the optimization process and the inference process according to the present embodiment by executing various programs to be described later. A plurality of CPUs 102 and GPUs 104 may be arranged, or a plurality of cores may be provided.

Main memory 106 is a storage area that temporarily stores (or caches) program code, work data, and the like while the processor (CPU 102 and/or GPU 104) executes processing. It is composed of, for example, a volatile memory device such as DRAM (dynamic random access memory) or SRAM (static random access memory).

ディスプレイ１０８は、処理に係るユーザインターフェイスや処理結果などを出力する表示部であり、例えば、ＬＣＤ（liquid crystal display）や有機ＥＬ（electroluminescence）ディスプレイなどで構成される。 The display 108 is a display unit that outputs a user interface related to processing, a processing result, and the like, and is configured by, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.

ネットワークインターフェイス１１０は、インターネット上またはイントラネット上の任意の情報処理装置などとの間でデータを遣り取りする。ネットワークインターフェイス１１０としては、例えば、イーサネット（登録商標）、無線ＬＡＮ（local area network）、Ｂｌｕｅｔｏｏｔｈ（登録商標）などの任意の通信方式を採用できる。 The network interface 110 exchanges data with an arbitrary information processing device on the Internet or an intranet. As the network interface 110, for example, an arbitrary communication method such as Ethernet (registered trademark), wireless LAN (local area network), and Bluetooth (registered trademark) can be adopted.

Input device 122 is a device that receives instructions and operations from the user, and is composed of, for example, a keyboard, a mouse, a touch panel, or a pen. Input device 122 may also include a sound collecting device for collecting the audio signals needed for learning and decoding, or an interface for receiving audio signals collected by a sound collecting device.

Optical drive 124 reads information stored on an optical disc 126, such as a CD-ROM (compact disc read only memory) or DVD (digital versatile disc), and outputs it to other components via internal bus 128. Optical disc 126 is an example of a non-transitory recording medium and is distributed with an arbitrary program stored on it in a nonvolatile manner. When optical drive 124 reads the program from optical disc 126 and installs it in secondary storage device 112 or the like, the computer comes to function as information processing apparatus 100. Accordingly, the subject matter of the present invention may also be the program itself installed in secondary storage device 112 or the like, or a recording medium such as optical disc 126 that stores a program for implementing the functions and processes according to the present embodiment.

FIG. 17 shows an optical recording medium such as optical disc 126 as an example of a non-transitory recording medium, but the medium is not limited to this. A semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may be used instead.

二次記憶装置１１２は、コンピュータを情報処理装置１００として機能させるために必要なプログラムおよびデータを格納する。例えば、ハードディスク、ＳＳＤ（solid state drive）などの不揮発性記憶装置で構成される。 The secondary storage device 112 stores programs and data necessary for causing a computer to function as the information processing device 100. For example, it is configured by a non-volatile storage device such as a hard disk or SSD (solid state drive).

More specifically, in addition to an OS (operating system), which is not shown, secondary storage device 112 typically stores an optimization program 114 for implementing the optimization process, an inference program 116 for implementing the inference process, parameters 118 that define the optimized model, and training data 120.

Optimization program 114, when executed by the processor (CPU 102 and/or GPU 104), implements the parameter optimization process shown in FIG. 3(A). Inference program 116, when executed by the processor (CPU 102 and/or GPU 104), implements the inference process shown in FIG. 3(B).

プロセッサ（ＣＰＵ１０２および／またはＧＰＵ１０４）がプログラムを実行する際に必要となるライブラリや機能モジュールの一部を、ＯＳが標準で提供するライブラリまたは機能モジュールにより代替してもよい。この場合には、プログラム単体では、対応する機能を実現するために必要なプログラムモジュールのすべてを含むものにはならないが、ＯＳの実行環境下にインストールされることで、目的の処理を実現できる。このような一部のライブラリまたは機能モジュールを含まないプログラムであっても、本発明の技術的範囲に含まれ得る。 A part of the library or function module required when the processor (CPU 102 and/or GPU 104) executes the program may be replaced by the library or function module provided as standard by the OS. In this case, the program alone does not include all the program modules required to realize the corresponding functions, but the target processing can be realized by installing the program module under the OS execution environment. Even a program that does not include such a part of the library or the functional module can be included in the technical scope of the present invention.

These programs may be distributed not only by being stored on any of the recording media described above, but also by being downloaded from a server device or the like via the Internet or an intranet.

図１７には、単一のコンピュータを用いて情報処理装置１００を構成する例を示すが、これに限らず、コンピュータネットワークを介して接続された複数のコンピュータが明示的または黙示的に連携して、情報処理装置１００および情報処理装置１００を含むシステムを実現するようにしてもよい。 FIG. 17 shows an example in which the information processing apparatus 100 is configured using a single computer, but the present invention is not limited to this, and a plurality of computers connected via a computer network cooperate explicitly or implicitly. The information processing apparatus 100 and a system including the information processing apparatus 100 may be realized.

プロセッサ（ＣＰＵ１０２および／またはＧＰＵ１０４）がプログラムを実行することで実現される機能の全部または一部を、集積回路などのハードワイヤード回路（hard-wired circuit）を用いて実現してもよい。例えば、ＡＳＩＣ（application specific integrated circuit）やＦＰＧＡ（field-programmable gate array）などを用いて実現してもよい。 All or part of the functions realized by the processor (CPU 102 and/or GPU 104) executing the program may be realized using a hard-wired circuit such as an integrated circuit. For example, it may be realized by using an ASIC (application specific integrated circuit) or an FPGA (field-programmable gate array).

当業者であれば、本発明が実施される時代に応じた技術を適宜用いて、本実施の形態に従う情報処理装置１００を実現できるであろう。 Those skilled in the art will be able to realize the information processing apparatus 100 according to the present embodiment by appropriately using the technology according to the time when the present invention is implemented.

For convenience of explanation, an example has been shown in which the same information processing apparatus 100 executes both the optimization process and the inference process, but the two processes may be implemented on different hardware.

[H. Summary]
According to the optimization method of the present embodiment, the parameters of the neural network are optimized based on error information obtained by comparing the output signals of a plurality of layers, including the deepest layer of the neural network, each with the correct output signal. As a result, even when an output signal produced internally by a hidden layer of the neural network is used, a drastic drop in performance relative to the output signal of the deepest layer can be avoided.

その結果、推論処理においては、最深層の出力信号を推論結果としなくても、最深層より浅い層の出力信号を推論結果として用いることも実用上可能となる。 As a result, in the inference process, it is practically possible to use the output signal of the layer shallower than the deepest layer as the inference result, without using the output signal of the deepest layer as the inference result.

According to the optimization method of the present embodiment, an optimized model can be generated from which each layer yields an output signal of relatively high performance. The output signal of any layer can therefore be used as the inference result according to the required specifications (for example, the inference performance of the output signal or the time required until the output signal is produced), which improves flexibility in addition to speeding up processing.
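Selecting the output layer according to a requirement can be sketched as follows. This is only an illustration under assumptions: the time-budget rule in `choose_depth`, the per-pass cost, and the toy `shared_decoder_pass` are all hypothetical and do not come from the source, which leaves the selection criterion open.

```python
import math

def shared_decoder_pass(state, enc_out):
    """Hypothetical stand-in for one pass through the shared decoder layer."""
    return [math.tanh(s + e) for s, e in zip(state, enc_out)]

def choose_depth(time_budget, time_per_pass, max_depth):
    """Largest number of passes that fits the budget, clamped to [1, max_depth]."""
    affordable = int(time_budget // time_per_pass)
    return max(1, min(max_depth, affordable))

def infer(enc_out, dec_0, depth):
    """Run `depth` decoder passes and return the index of the largest output.

    Softmax is monotonic, so the argmax of the raw output equals the argmax
    of the corresponding probability distribution.
    """
    state = dec_0
    for _ in range(depth):
        state = shared_decoder_pass(state, enc_out)
    return max(range(len(state)), key=lambda k: state[k])

depth = choose_depth(time_budget=100.0, time_per_pass=30.0, max_depth=6)
prediction = infer([0.2, 0.9, -0.4], [0.0, 0.0, 0.0], depth)
```

A tighter budget simply yields a smaller `depth`; this is usable only because the optimization described above keeps the shallower outputs close in quality to the deepest one.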

今回開示された実施の形態は、すべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は、上記した実施の形態の説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。 The embodiments disclosed this time are to be considered as illustrative in all points and not restrictive. The scope of the present invention is shown not by the above description of the embodiments but by the claims, and is intended to include meanings equivalent to the claims and all modifications within the scope.

１，１０ニューラルネットワーク、２，３６，４６入力層、４，３２，４２隠れ層、６出力層、２０，２０Ａモデル、３０，３０Ａエンコーダ、４０，４０Ａデコーダ、５０経路、１００情報処理装置、１０２ＣＰＵ、１０４ＧＰＵ、１０６主メモリ、１０８ディスプレイ、１１０ネットワークインターフェイス、１１２二次記憶装置、１１４最適化プログラム、１１６推論プログラム、１１８パラメタ、１２０訓練データ、１２２入力デバイス、１２４光学ドライブ、１２６光学ディスク、１２８内部バス。 1,10 Neural network, 2,36,46 input layer, 4,32,42 hidden layer, 6 output layer, 20,20A model, 30,30A encoder, 40,40A decoder, 50 path, 100 information processing device, 102 CPU, 104 GPU, 106 main memory, 108 display, 110 network interface, 112 secondary storage device, 114 optimization program, 116 inference program, 118 parameters, 120 training data, 122 input device, 124 optical drive, 126 optical disk, 128 internal bus.

Claims

An optimization method for optimizing parameters of a neural network having a plurality of same or different layers, comprising:
Preparing training data in which the input signal and the correct output signal are associated with each other, and
Inputting the input signal to the neural network, calculating an output signal output from the deepest layer included in the neural network, and calculating an output signal output from each of one or more layers including the deepest layer,
Calculating the error of each of the calculated output signals with respect to the correct output signal associated with the input signal,
Optimizing the parameters of each layer included in the neural network based on the calculated respective errors.

The optimization method according to claim 1, wherein the optimizing step includes a step of integrating the calculated respective errors.

The optimization method according to claim 1, wherein the optimizing step includes a step of optimizing the parameters of a target layer based on error information given to the target layer by backpropagation and on the error calculated for the output signal of the target layer.

An optimization program for causing a computer to execute the optimization method according to claim 1.

An inference method using an optimized model consisting of a neural network having a plurality of the same or different layers,
Inputting any input signal into the optimized model;
Calculating the output signal in order towards the deepest layer of the optimized model,
Outputting, as an inference result, the output signal of any layer, including the deepest layer, that is determined based on a request from among the plurality of the same or different layers included in the optimized model,
Wherein the optimized model is generated by optimizing parameters based on the respective errors between output signals output from each of one or more layers including the deepest layer, which are calculated when an input signal included in training data is input to the neural network, and the correct output signal associated with that input signal in the training data.

An inference program for causing a computer to execute the inference method according to claim 5.