JP7422535B2

JP7422535B2 - Conversion device and program

Info

Publication number: JP7422535B2
Application number: JP2019231754A
Authority: JP
Inventors: 岳士梶山; 伶遠藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2024-01-26
Anticipated expiration: 2039-12-23
Also published as: JP2021099713A

Description

The present invention relates to a conversion device and a program.

Technology that automatically recognizes the content shown in video is expected to be used as a means of assisting human communication. As one example, technology that captures sign language with a camera or the like and automatically recognizes the resulting video (images) is expected to be used for communication between deaf or hard-of-hearing people and hearing people.

Non-Patent Document 1 describes research on automatically recognizing German Sign Language, one of the sign languages, and converting it into German. For example, Figure 2 of Non-Patent Document 1 shows a schematic configuration of a sign language translator for translating a sign language into a spoken language. The sign language translator shown in that Figure 2 includes an encoder and a decoder, each of which uses a recurrent neural network (RNN). The encoder receives a sequence of frame images and generates a feature vector. The decoder receives the feature vector generated by the encoder and generates a sequence of words.

Non-Patent Document 1: Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, Richard Bowden, "Neural Sign Language Translation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Technology for recognizing the content of video captured with a camera (for example, human gestures) has been put into practical use in application areas where a contactless human-machine interface is desired. Such areas are those where hygiene must be taken into account, for example food factories and medical sites. However, technology that automatically recognizes continuous, complex human movements, such as those of a sign language, and converts them into another language has not yet reached a practical level.

Likewise, no practical examples have been reported of automatic recognition of Japanese Sign Language, one of the sign languages used in Japan.

Furthermore, when the input sign language video has not been divided into word units in advance, it is even more difficult to automatically segment the video into sign language words and recognize those words.

The present invention has been made in recognition of the above problems. It aims to provide a conversion device and a program that can receive input data (for example, video, that is, a sequence of frame images, that is not divided into predetermined units such as the boundaries of the words to be output) and output a symbol string (for example, a word string in a predetermined language representation) corresponding to that input data.

[1] To solve the above problems, a conversion device according to one aspect of the present invention includes: an encoder unit that generates state data based on input data; a decoder unit that generates output data based on the state data; a learning data supply unit that supplies pairs of learning input data, which is input to the encoder unit, and correct data, which is the correct answer for the output data corresponding to the learning input data; a loss calculation unit that calculates a loss representing the difference between learning output data, which the decoder unit generates based on the state data that the encoder unit generates based on the learning input data, and the correct data that the learning data supply unit supplies in correspondence with the learning input data; a second encoder unit that generates estimated state data, which is state data estimated based on the correct data; a second loss calculation unit that calculates a second loss representing the difference between the state data that the encoder unit generates based on the learning input data and the estimated state data that the second encoder unit generates based on the correct data that the learning data supply unit supplies in correspondence with the learning input data; and a control unit that performs control so as to switch as appropriate among a first learning mode, a second learning mode, and a conversion execution mode. In the first learning mode, the internal parameters of the encoder unit and the decoder unit are adjusted based on the loss that the loss calculation unit calculates from the learning input data and the correct data supplied by the learning data supply unit. In the second learning mode, the internal parameters of the encoder unit and the second encoder unit are adjusted based on the second loss that the second loss calculation unit calculates from the learning input data and the correct data supplied by the learning data supply unit. In the conversion execution mode, the encoder unit generates state data based on input data, and the decoder unit generates output data based on the state data generated by the encoder unit.

[2] In one aspect of the present invention, in the above conversion device, each of the encoder unit, the decoder unit, and the second encoder unit includes a neural network. In the first learning mode, the internal parameters of the encoder unit and the decoder unit are adjusted by performing error backpropagation through their respective neural networks based on the loss. In the second learning mode, the internal parameters of the encoder unit and the second encoder unit are adjusted by performing error backpropagation through their respective neural networks based on the second loss.

[3] In one aspect of the present invention, in the above conversion device, the control unit performs control during the learning process so that the first learning mode and the second learning mode are executed repeatedly for each pair of learning input data and correct data supplied by the learning data supply unit.

[4] In one aspect of the present invention, in the above conversion device, the input data is a sequence of images and the output data is a sequence of predetermined symbols.

[5] In one aspect of the present invention, in the above conversion device, the sequence of images is a sequence of images representing sign language, and the sequence of symbols is a string of words in gloss notation corresponding to that sign language.

[6] One aspect of the present invention is a program that causes a computer to function as a conversion device including: an encoder unit that generates state data based on input data; a decoder unit that generates output data based on the state data; a learning data supply unit that supplies pairs of learning input data, which is input to the encoder unit, and correct data, which is the correct answer for the output data corresponding to the learning input data; a loss calculation unit that calculates a loss representing the difference between learning output data, which the decoder unit generates based on the state data that the encoder unit generates based on the learning input data, and the correct data that the learning data supply unit supplies in correspondence with the learning input data; a second encoder unit that generates estimated state data, which is state data estimated based on the correct data; a second loss calculation unit that calculates a second loss representing the difference between the state data that the encoder unit generates based on the learning input data and the estimated state data that the second encoder unit generates based on the correct data that the learning data supply unit supplies in correspondence with the learning input data; and a control unit that performs control so as to switch as appropriate among a first learning mode, a second learning mode, and a conversion execution mode. In the first learning mode, the internal parameters of the encoder unit and the decoder unit are adjusted based on the loss that the loss calculation unit calculates from the learning input data and the correct data supplied by the learning data supply unit. In the second learning mode, the internal parameters of the encoder unit and the second encoder unit are adjusted based on the second loss that the second loss calculation unit calculates from the learning input data and the correct data supplied by the learning data supply unit. In the conversion execution mode, the encoder unit generates state data based on input data, and the decoder unit generates output data based on the state data generated by the encoder unit.

According to the present invention, recognition accuracy can be improved in processing that automatically recognizes data (video) and outputs a symbol string corresponding to that video.

FIG. 1 is a block diagram showing a schematic functional configuration of a conversion device according to an embodiment of the present invention.

FIG. 2 is a schematic diagram showing the flow of data through the processing of the encoder unit and the decoder unit in the conversion device according to the embodiment.

FIG. 3 is a schematic diagram showing the flow of data in learning processing that uses the second encoder unit, an additional network, in the conversion device according to the embodiment.

FIG. 4 is a block diagram showing a more detailed configuration example of the encoder unit in the conversion device according to the embodiment.

FIG. 5 is a block diagram showing a more detailed configuration example of the decoder unit in the conversion device according to the embodiment.

FIG. 6 is a block diagram showing a more detailed configuration example of the second encoder unit in the conversion device according to the embodiment.

FIG. 7 is a flowchart showing the processing procedure when the conversion device according to the embodiment performs machine learning processing.

FIG. 8 is a graph showing the results of an evaluation experiment, comparing the conversion error rate of the conversion device of the embodiment with that of a conversion device according to the conventional technology.

Next, an embodiment of the present invention will be described.

The conversion device according to this embodiment receives a sign language video and converts the sign language it shows into an intermediate representation called gloss notation. For a sign language, which has no writing system, gloss notation is a symbol string obtained by dividing the series of movements that make up a signed phrase or sentence into short segments corresponding to sign language words and transcribing them in written characters. In gloss notation for Japanese Sign Language, Japanese words close in meaning to the sign language words are used as labels. In other words, the conversion device according to this embodiment receives a sign language video, performs automatic recognition processing on it, and outputs the label string (symbol string) corresponding to that video.

Note that the sign language video input to the conversion device is not divided in advance into units such as words, nor is it accompanied by meta-information indicating the segment boundaries.

FIG. 1 is a block diagram showing a schematic functional configuration of the conversion device according to this embodiment. As illustrated, the conversion device 1 includes an input unit 10, an encoder unit 20, a decoder unit 30, an output unit 40, a loss calculation unit 50, a second encoder unit 60, a second loss calculation unit 70, a learning data supply unit 80, and a control unit 90. Each of these functional units can be realized by, for example, a computer and a program. Each functional unit has storage means as necessary. The storage means is, for example, a program variable or memory allocated when the program runs. Nonvolatile storage means such as a magnetic hard disk device or a solid state drive (SSD) may also be used as necessary. At least some of the functions of each functional unit may be realized as dedicated electronic circuits instead of programs. The function of each functional unit is described next.

The input unit 10 acquires input data and supplies it to the encoder unit 20. This input data may be a sequence of images, and that sequence of images (video) may be one that represents sign language. The frame rate of the video is arbitrary and may be, for example, about 30 frames per second (fps).

The encoder unit 20 generates a state vector (state data) based on input data. The encoder unit 20 has a function of performing machine learning processing to update its internal model (adjust its parameters). In this embodiment, the encoder unit 20 contains a neural network and can update its internal parameters through error backpropagation. The error backpropagation method itself can be implemented using existing technology.

The decoder unit 30 generates output data based on the state vector (state data). The decoder unit 30 has a function of performing machine learning processing to update its internal model (adjust its parameters). In this embodiment, the decoder unit 30 contains a neural network and can update its internal parameters through error backpropagation.

The output unit 40 outputs the output data (estimated symbol string) generated by the decoder unit 30. The output data may be, for example, a word string (symbol string) in gloss notation corresponding to a sign language video.

The loss calculation unit 50 calculates a loss representing the difference between the output data (estimated word string, learning output data) that the encoder unit 20 and the decoder unit 30 generate from the learning input data input to the encoder unit 20, and the correct data that the learning data supply unit supplies in correspondence with that learning input data. If the learning output data generated by the encoder unit 20 and the decoder unit 30 and the correct data are each regarded as a vector corresponding to a symbol string, the loss calculated by the loss calculation unit 50 is, for example, the norm between those two vectors.

The second encoder unit 60 generates estimated state data, which is an estimated state vector (state data), based on the correct data supplied by the learning data supply unit 80 (the correct answer for the output data that the decoder unit 30 outputs). The second encoder unit 60 has a function of performing machine learning processing to update its internal model (adjust its parameters). In this embodiment, the second encoder unit 60 contains a neural network and can update its internal parameters through error backpropagation.

The second loss calculation unit 70 calculates a second loss representing the difference between the state data that the encoder unit 20 generates based on the learning input data and the estimated state data that the second encoder unit 60 generates based on the correct data that the learning data supply unit 80 supplies in correspondence with that learning input data. If the state data and the estimated state data are both regarded as vectors, the second loss calculated by the second loss calculation unit 70 is, for example, the norm between those two vectors.
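As a minimal sketch of the norm-based loss mentioned for the loss calculation unit 50 and the second loss calculation unit 70, the following computes the norm of the difference between two vectors. The text says only "norm"; the choice of the Euclidean (L2) norm, the function name `norm_loss`, and the example values are assumptions made for illustration.

```python
import math


def norm_loss(a, b):
    """Euclidean (L2) norm of the difference between two equal-length vectors.

    Assumption: the embodiment only says "the norm between the two vectors";
    L2 is chosen here purely for illustration.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical state vector (encoder unit 20) and estimated state vector
# (second encoder unit 60); the values are made up for the example.
state = [0.0, 3.0, 1.0]
estimated_state = [4.0, 0.0, 1.0]
second_loss = norm_loss(state, estimated_state)
```

The same function could stand in for the first loss if the estimated word string and the correct word string are both represented as vectors of equal length.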

The learning data supply unit 80 supplies learning data with which the encoder unit 20, the decoder unit 30, and the second encoder unit 60 perform machine learning. Specifically, the learning data supply unit 80 supplies pairs of learning input data, which is input to the encoder unit 20, and correct data, which is the correct answer for the output data corresponding to that learning input data. The above learning input data is also used as input to the second encoder unit 60. The learning data supply unit 80 supplies many such pairs of learning input data and correct data.

The control unit 90 controls the operation of the conversion device 1 as a whole. At a minimum, the control unit 90 performs control based on the operating mode of the conversion device 1. As a specific example, the control unit 90 controls each unit of the conversion device 1 so that it operates while switching as appropriate among the first learning mode, the second learning mode, and the conversion execution mode. The operation of the conversion device 1 in each mode is as follows. In the first learning mode, the internal parameters of the encoder unit 20 and the decoder unit 30 are adjusted based on the loss that the loss calculation unit 50 calculates from the learning input data and the correct data supplied by the learning data supply unit 80. In the second learning mode, the internal parameters of the encoder unit 20 and the second encoder unit 60 are adjusted based on the second loss that the second loss calculation unit calculates from the learning input data and the correct data supplied by the learning data supply unit 80. Adjusting the internal parameters in the first and second learning modes means, for example, updating the internal parameters of each unit by error backpropagation in the neural network. The internal parameters of a neural network are the vectors of weight values applied to the input values when the output value at each node is calculated. In the conversion execution mode, the encoder unit 20 generates a state vector (state data) based on the input data, and the decoder unit 30 generates output data (the estimated conversion result corresponding to the input data) based on the state vector (state data) generated by the encoder unit 20.
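The three operating modes can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: each unit is reduced to a single scalar weight (encoder `a`, decoder `b`, second encoder `c`), the losses are squared errors rather than norms, and the gradients are written out by hand in place of neural network backpropagation. What the sketch is meant to show is only which parameters each mode updates.

```python
from enum import Enum


class Mode(Enum):
    FIRST_LEARNING = 1
    SECOND_LEARNING = 2
    CONVERSION = 3


class ToyConverter:
    """Scalar stand-in: encoder s = a*x, decoder y = b*s, second encoder s2 = c*t."""

    def __init__(self, a=0.5, b=0.5, c=0.5, lr=0.01):
        self.a, self.b, self.c, self.lr = a, b, c, lr

    def run(self, mode, x, target=None):
        s = self.a * x                       # state data from the encoder
        if mode is Mode.CONVERSION:
            return self.b * s                # decoder output; no parameter update
        if mode is Mode.FIRST_LEARNING:
            err = self.b * s - target        # loss = err ** 2
            grad_a = 2 * err * self.b * x
            grad_b = 2 * err * s
            self.a -= self.lr * grad_a       # encoder and decoder are updated
            self.b -= self.lr * grad_b
            return err ** 2
        if mode is Mode.SECOND_LEARNING:
            err = s - self.c * target        # second loss = err ** 2
            grad_a = 2 * err * x
            grad_c = -2 * err * target
            self.a -= self.lr * grad_a       # encoder and second encoder are updated
            self.c -= self.lr * grad_c
            return err ** 2
        raise ValueError(mode)


# Alternate the two learning modes for each (input, correct answer) pair.
model = ToyConverter()
for x, t in [(1.0, 2.0), (2.0, 4.0)]:
    model.run(Mode.FIRST_LEARNING, x, t)
    model.run(Mode.SECOND_LEARNING, x, t)
```

In this toy, the first learning mode leaves `c` untouched and the second learning mode leaves `b` untouched, mirroring the division of parameter updates between the two modes.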

The procedure by which the control unit 90 switches between the first learning mode and the second learning mode is described further below with reference to FIG. 7 (flowchart).

FIG. 2 is a schematic diagram showing the flow of data in the operation of the encoder unit 20 and the decoder unit 30 in the conversion device 1 of this embodiment. As explained below, the encoder unit 20 and the decoder unit 30 operate in a learning mode and in a conversion mode.

The encoder unit 20 contains a neural network 201. The sequence of frame images frame_1, frame_2, ..., frame_r of the input video is input to the neural network 201. The neural network 201 outputs a state vector calculated from the sequence of frame images frame_1, frame_2, ..., frame_r. The encoder unit 20 passes the state vector generated from the input video to the decoder unit 30.

The decoder unit 30 contains a neural network 301. The state vector generated by the neural network 201 of the encoder unit 20 is input to the neural network 301. The neural network 301 outputs a word string word_1, word_2, ..., word_u-1, word_u calculated from the input state vector. Each of these words is a symbol in the gloss notation described above. The neural network 301 also outputs the special symbol <eos> at the end of the word string. <eos> is a symbol representing the end of the sentence. The word string output by the neural network 301 is also called the estimated word string.
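The role of <eos> as a terminator can be sketched as follows. This is an illustrative assumption, not the structure of the neural network 301: it supposes that words are produced one at a time by a hypothetical `step` function that returns the next word together with an updated state, and it adds a `max_len` guard that does not appear in the text.

```python
EOS = "<eos>"


def decode(state, step, max_len=50):
    """Collect words one at a time until <eos> is produced or max_len is reached."""
    words = []
    for _ in range(max_len):
        word, state = step(state)
        if word == EOS:
            break
        words.append(word)
    return words


def make_toy_step(sequence):
    """Hypothetical stand-in for a trained decoder: replays a fixed sequence.

    Here the "state" is simply the position in the sequence.
    """
    def step(position):
        return sequence[position], position + 1
    return step


# Made-up gloss labels, used only to exercise the loop.
glosses = decode(0, make_toy_step(["I", "SCHOOL", "GO", EOS]))
```

The returned list contains only the gloss words; the terminating <eos> is consumed by the loop and not included in the result.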

When operating in the learning mode, each of the neural networks 201 and 301 adjusts its internal parameters by performing machine learning processing based on the learning data. When operating in the conversion mode, each of the neural networks 201 and 301 calculates its output using the internal parameters already adjusted by the machine learning processing. When the encoder unit 20 and the decoder unit 30 operate in the conversion mode, the estimated word string output by the neural network 301 is the conversion result corresponding to the input video.

The machine learning processing is now described in more detail. The estimated word string output by the neural network 301 can be compared with the correct word string, which is the correct data. The correct word string is supplied by the learning data supply unit 80 in correspondence with the input video. The loss calculation unit 50 calculates the loss from the estimated word string output by the neural network 301 and the correct word string supplied by the learning data supply unit 80. Based on the loss calculated by the loss calculation unit 50, the neural networks 201 and 301 perform error backpropagation and update their internal parameters.

FIG. 3 is a schematic diagram showing the flow of learning processing that uses the second encoder unit 60, which is an additional network. The second encoder unit 60 is used solely to assist the learning processing of the encoder unit 20. In other words, the second encoder unit 60 is used only in the learning mode and is not used in the conversion mode.

As already explained, when the conversion device 1 operates in the learning mode, the encoder unit 20 receives a sequence of frame images and outputs a state vector. In the learning mode, in addition, the correct word string is input to the second encoder unit 60. This correct word string is supplied by the learning data supply unit 80. As illustrated, the correct word string is a string of words such as word_1, word_2, ..., word_u-1, word_u. The special symbol <bos> (beginning of sentence) is added to the head of the correct word string, and the special symbol <eos> (end of sentence) is added to its end. The second encoder unit 60 contains a neural network 601. The neural network 601 calculates and outputs a state vector based on the input correct word string and its internal parameters at that point in time.
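The framing of the correct word string with the special symbols can be sketched in a few lines. The function name `wrap_correct_words` and the example gloss labels are hypothetical; only the placement of <bos> and <eos> comes from the description above.

```python
BOS, EOS = "<bos>", "<eos>"


def wrap_correct_words(words):
    """Return the correct word string with <bos> prepended and <eos> appended."""
    return [BOS, *words, EOS]


# Made-up gloss labels for illustration.
second_encoder_input = wrap_correct_words(["I", "SCHOOL", "GO"])
```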

The second loss calculation unit 70 obtains the state vector output by the encoder unit 20 and the state vector output by the second encoder unit 60, and calculates a loss (the second loss) from these two vectors. Each of the neural networks 201 and 601 performs error backpropagation based on the loss calculated by the second loss calculation unit 70 and updates its internal parameters.

In other words, the encoder unit 20 adjusts its internal parameters by performing error backpropagation based on the loss calculated by the second loss calculation unit 70. This learning processing enables the encoder unit 20 to output a good state vector from the input video. As also shown in FIG. 2, the encoder unit 20 additionally performs error backpropagation based on the loss calculated by the loss calculation unit 50. However, the backpropagation path is relatively long when it is based on the loss calculated by the loss calculation unit 50, and relatively short when it is based on the loss calculated by the second loss calculation unit 70. Therefore, even when error backpropagation based only on the loss calculated by the loss calculation unit 50 fails to produce a sufficient machine learning effect because its path is too long, the encoder unit 20 can learn better by also using error backpropagation based on the loss calculated by the second loss calculation unit 70.

つまり、本実施形態に特有の構成である第２エンコーダー部６０を用いることにより、エンコーダー部２０の学習効果を改善することができる。 That is, by using the second encoder section 60, which has a configuration unique to this embodiment, the learning effect of the encoder section 20 can be improved.

図４は、エンコーダー部２０のより詳細な構成例を示すブロック図である。図示するように、エンコーダー部２０は、内部に再帰型ニューラルネットワーク（ＲＮＮ，recurrent neural network）を含むように構成される。図ではＲＮＮの時間的な再帰構造を左から右方向に展開して表現している。図示する構成例では、エンコーダー部２０は、入力されるフレーム画像列ｆｒａｍｅ_１，ｆｒａｍｅ_２，・・・，ｆｒａｍｅ_ｒの各フレームに対応して、第１層から第Ｎ層までのＲＮＮを持つ。Ｎは、正整数である。例えば、Ｎを２以上且つ４以下程度の値としてよい。しかし、Ｎは、ここに例示した範囲に限定されるものではない。エンコーダー部２０を構成するため、時間の進行につれて（フレーム画像の進行につれて）、Ｎ層のＲＮＮの回路を順次再利用する。第１層のＲＮＮには、フレーム画像が入力される。第1層のＲＮＮには直接フレーム画像を入力するのではなく、事前にフレーム画像を図示していないＣＮＮ（畳み込みニューラルネットワーク）などの特徴を抽出する回路に入力し、その出力である特徴ベクトルを第1層のＲＮＮに入力しても良い。第１層のＲＮＮからの出力は、同じフレーム画像に対応する第２層のＲＮＮと、次のフレーム画像に対応する第１層のＲＮＮとに、渡される。また、第ｉ層（１＜ｉ＜Ｎ）のＲＮＮは、同じフレーム画像に対応する第（ｉ－１）層のＲＮＮからの出力と、前のフレーム画像に対応する第ｉ層のＲＮＮからの出力とを受け取る。そして、その第ｉ層のＲＮＮからの出力は、同じフレーム画像に対応する第（ｉ＋１）層のＲＮＮと、次のフレーム画像に対応する第ｉ層のＲＮＮとに、渡される。また、第Ｎ層のＲＮＮは、同じフレーム画像に対応する第（Ｎ－１）層のＲＮＮからの出力と、前のフレーム画像に対応する第Ｎ層のＲＮＮからの出力とを受け取る。そして、その第Ｎ層のＲＮＮからの出力は、次のフレーム画像に対応する第Ｎ層のＲＮＮに渡される。最後のフレーム画像（図４においては、ｆｒａｍｅ_ｒ）に対応するＲＮＮからの出力は、状態ベクトルである。エンコーダー部２０は、生成した状態ベクトルを、デコーダー部３０や第２ロス算出部７０に渡す。 FIG. 4 is a block diagram showing a more detailed configuration example of the encoder unit 20. As illustrated, the encoder unit 20 is configured to include a recurrent neural network (RNN). In the figure, the temporal recurrence of the RNN is unrolled from left to right. In the illustrated configuration example, the encoder unit 20 has RNNs from the first layer to the N-th layer for each frame of the input frame image sequence frame_1, frame_2, ..., frame_r. N is a positive integer. For example, N may be a value of about 2 to 4, but N is not limited to this range. To form the encoder unit 20, the N-layer RNN circuit is reused in turn as time advances (that is, as the frame images advance). A frame image is input to the first-layer RNN. Instead of inputting the frame image directly to the first-layer RNN, the frame image may first be input to a feature-extraction circuit such as a CNN (convolutional neural network), not shown, and the resulting feature vector may be input to the first-layer RNN. The output of the first-layer RNN is passed to the second-layer RNN for the same frame image and to the first-layer RNN for the next frame image. The i-th layer RNN (1<i<N) receives the output of the (i-1)-th layer RNN for the same frame image and the output of the i-th layer RNN for the previous frame image. The output of the i-th layer RNN is then passed to the (i+1)-th layer RNN for the same frame image and to the i-th layer RNN for the next frame image. The N-th layer RNN receives the output of the (N-1)-th layer RNN for the same frame image and the output of the N-th layer RNN for the previous frame image. The output of the N-th layer RNN is then passed to the N-th layer RNN for the next frame image. The output of the RNN corresponding to the last frame image (frame_r in FIG. 4) is the state vector. The encoder unit 20 passes the generated state vector to the decoder unit 30 and the second loss calculation unit 70.

図４を参照して説明したように、エンコーダー部２０は、論理的には、Ｎ行ｒ列のマトリクス状に配置されたＲＮＮを用いて構成される。ただし、Ｎは層の数であり、ｒは入力される画像の系列の長さである。 As described with reference to FIG. 4, the encoder unit 20 is logically configured using RNNs arranged in a matrix of N rows and r columns. Here, N is the number of layers, and r is the length of the sequence of input images.
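
The N-row by r-column arrangement described for FIG. 4 can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the cell is a simple tanh RNN rather than the cells used in the embodiment, the weights are random, the dimensions are arbitrary, and the state vector is taken to be the N-th layer output at the last frame. The names `make_layer`, `cell`, and `encode` are hypothetical and do not appear in the patent.

```python
import math
import random


def make_layer(in_dim, hid_dim, rng):
    """Random weights for one simple tanh RNN layer (assumed cell type)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W_in": mat(hid_dim, in_dim), "W_rec": mat(hid_dim, hid_dim)}


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def cell(layer, below, prev):
    """One RNN cell: combines the output of the layer below (same frame)
    with this layer's own output for the previous frame."""
    a = matvec(layer["W_in"], below)
    b = matvec(layer["W_rec"], prev)
    return [math.tanh(x + y) for x, y in zip(a, b)]


def encode(frames, layers, hid_dim):
    """Walk the logical N x r grid column by column.
    The same N layers (same weights) are reused for every frame."""
    state = [[0.0] * hid_dim for _ in layers]  # one hidden vector per layer
    for feat in frames:                        # columns: frame_1 .. frame_r
        below = feat
        for i, layer in enumerate(layers):     # rows: layer 1 .. layer N
            state[i] = cell(layer, below, state[i])
            below = state[i]
    # Assumption: the state vector is the N-th layer output at the last frame.
    return state[-1]


rng = random.Random(0)
N, r, feat_dim, hid_dim = 3, 5, 4, 6  # arbitrary illustrative sizes
layers = [make_layer(feat_dim if i == 0 else hid_dim, hid_dim, rng) for i in range(N)]
frames = [[rng.random() for _ in range(feat_dim)] for _ in range(r)]  # stand-in CNN features
state_vector = encode(frames, layers, hid_dim)
```

The inner loop corresponds to moving up one column of the matrix, and the outer loop corresponds to moving from one frame to the next while reusing the same layer weights.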

図５は、デコーダー部３０のより詳細な構成例を示すブロック図である。図示するように、デコーダー部３０は、内部にＲＮＮを含んで構成される。図ではＲＮＮの時間的な再帰構造を左から右方向に展開して表現している。図示する構成例では、デコーダー部３０は、出力する語列（推定語列）ｗｏｒｄ_１，ｗｏｒｄ_２，・・・，ｗｏｒｄ_ｕ－１,ｗｏｒｄ_ｕ，および＜ｅｏｓ＞の各記号に対応して、第１層から第Ｎ層までのＲＮＮを持つ。ここでのＮの値は、エンコーダー部２０（図４参照）のＮの値に合わせる。つまり、デコーダー部３０は、論理的には、エンコーダー部２０の内部構成と同様の、Ｎ行（ｕ＋１）列のマトリクス状に配置されたＲＮＮを用いて構成される。デコーダー部３０におけるＲＮＮのマトリクス内での、データの受け渡しの流れも、エンコーダー部２０のＲＮＮのマトリクス内におけるそれと同様である。ここで、（ｕ＋１）は、出力系列の長さである。ただし、この出力系列の長さは、＜ｅｏｓ＞等の特殊記号を含む長さであってもよい。 FIG. 5 is a block diagram showing a more detailed configuration example of the decoder unit 30. As illustrated, the decoder unit 30 includes an RNN. In the figure, the temporal recurrence of the RNN is unrolled from left to right. In the illustrated configuration example, the decoder unit 30 has RNNs from the first layer to the N-th layer for each symbol of the output word string (estimated word string) word_1, word_2, ..., word_u-1, word_u, and <eos>. The value of N here matches the value of N in the encoder unit 20 (see FIG. 4). That is, like the encoder unit 20, the decoder unit 30 is logically configured using RNNs arranged in a matrix of N rows and (u+1) columns. The flow of data within the RNN matrix of the decoder unit 30 is also the same as in the RNN matrix of the encoder unit 20. Here, (u+1) is the length of the output sequence. This length may include special symbols such as <eos>.

デコーダー部３０は、エンコーダー部２０が生成した状態ベクトルを、入力データとして取得する。また、デコーダー部３０の第Ｎ層のＲＮＮは、順次、推定語列（ｗｏｒｄ_１，ｗｏｒｄ_２，・・・，ｗｏｒｄ_ｕ－１,ｗｏｒｄ_ｕ，および＜ｅｏｓ＞）を出力する。デコーダー部３０は、生成した推定語列を、出力部４０やロス算出部５０に渡す。 The decoder unit 30 obtains the state vector generated by the encoder unit 20 as input data. The N-th layer RNN of the decoder unit 30 then outputs the estimated word string (word_1, word_2, ..., word_u-1, word_u, and <eos>) one symbol at a time. The decoder unit 30 passes the generated estimated word string to the output unit 40 and the loss calculation unit 50.

図６は、第２エンコーダー部６０のより詳細な構成例を示すブロック図である。図示するように、第２エンコーダー部６０は、内部にＲＮＮを含んで構成される。図ではＲＮＮの時間的な再帰構造を左から右方向に展開して表現している。図示する構成例では、第２エンコーダー部６０は、入力される正解語列＜ｂｏｓ＞，ｗｏｒｄ_１，ｗｏｒｄ_２，・・・，ｗｏｒｄ_ｕ－１,ｗｏｒｄ_ｕ，および＜ｅｏｓ＞に対応して、第１層から第Ｎ層までのＲＮＮを持つ。ここでのＮの値は、エンコーダー部２０（図４参照）のＮの値に合わせる。つまり、第２エンコーダー部６０は、論理的には、エンコーダー部２０の内部構成と同様の、Ｎ行（ｕ＋２）列のマトリクス状に配置されたＲＮＮを用いて構成される。第２エンコーダー部６０におけるＲＮＮのマトリクス内での、データの受け渡しの流れも、エンコーダー部２０のＲＮＮのマトリクス内におけるそれと同様である。ここで、（ｕ＋２）は、出力系列の長さである。ただし、この出力系列の長さは、＜ｂｏｓ＞や＜ｅｏｓ＞等の特殊記号を含む長さであってもよい。 FIG. 6 is a block diagram showing a more detailed configuration example of the second encoder unit 60. As illustrated, the second encoder unit 60 includes an RNN. In the figure, the temporal recurrence of the RNN is unrolled from left to right. In the illustrated configuration example, the second encoder unit 60 has RNNs from the first layer to the N-th layer for each symbol of the input correct word string <bos>, word_1, word_2, ..., word_u-1, word_u, and <eos>. The value of N here matches the value of N in the encoder unit 20 (see FIG. 4). That is, like the encoder unit 20, the second encoder unit 60 is logically configured using RNNs arranged in a matrix of N rows and (u+2) columns. The flow of data within the RNN matrix of the second encoder unit 60 is also the same as in the RNN matrix of the encoder unit 20. Here, (u+2) is the length of the output sequence (that is, of the correct word string supplied to the second encoder unit 60). This length may include special symbols such as <bos> and <eos>.

第２エンコーダー部６０は、学習データ供給部８０から渡される正解語列のデータを入力として取得する。第２エンコーダー部６０の第１層のＲＮＮは、順次、正解語列（＜ｂｏｓ＞，ｗｏｒｄ_１，ｗｏｒｄ_２，・・・，ｗｏｒｄ_ｕ－１,ｗｏｒｄ_ｕ，および＜ｅｏｓ＞）を入力する。第２エンコーダー部６０は、上記の正解語列を基に生成した状態ベクトルを、第２ロス算出部７０に渡す。 The second encoder unit 60 obtains as input the correct word string data passed from the learning data supply unit 80. The first-layer RNN of the second encoder unit 60 receives the correct word string (<bos>, word_1, word_2, ..., word_u-1, word_u, and <eos>) one symbol at a time. The second encoder unit 60 passes the state vector generated from this correct word string to the second loss calculation unit 70.
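
The sequence lengths (u+1) and (u+2) mentioned for FIG. 5 and FIG. 6 follow directly from which special symbols are attached to a word string of u words. The following minimal sketch illustrates this; the helper names and the example words are hypothetical and serve only as an illustration.

```python
BOS, EOS = "<bos>", "<eos>"


def second_encoder_input(words):
    """Correct word string as fed to the second encoder: <bos>, words, <eos>."""
    return [BOS, *words, EOS]


def decoder_output_symbols(words):
    """Symbols the decoder is expected to emit: the words followed by <eos>."""
    return [*words, EOS]


words = ["TOMORROW", "RAIN", "FALL"]  # hypothetical gloss words, u = 3
u = len(words)
enc2_columns = len(second_encoder_input(words))   # u + 2 columns in FIG. 6
dec_columns = len(decoder_output_symbols(words))  # u + 1 columns in FIG. 5
```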

図７は、変換装置１が機械学習処理を行う際の手順の一例を示すフローチャートである。以下では、このフローチャートを参照しながら、学習処理の手順について説明する。 FIG. 7 is a flowchart illustrating an example of a procedure when the conversion device 1 performs machine learning processing. Below, the procedure of the learning process will be explained with reference to this flowchart.

ステップＳ１０１において、学習データ供給部８０は、学習用データとして、１対の入出力データを供給する。入力データは、映像データである。学習データ供給部８０は、入力データを、フレーム画像データの系列として、エンコーダー部２０に渡す。出力データは、正解語列データである。学習データ供給部８０は、出力データである正解語列を、第２エンコーダー部６０およびロス算出部５０に渡す。 In step S101, the learning data supply unit 80 supplies a pair of input and output data as learning data. The input data is video data. The learning data supply unit 80 passes the input data to the encoder unit 20 as a series of frame image data. The output data is correct word string data. The learning data supply unit 80 passes the correct word string, which is output data, to the second encoder unit 60 and the loss calculation unit 50.

次に、ステップＳ１０２において、エンコーダー部２０は、ステップＳ１０１で渡されたフレーム画像データの系列を基に、順伝播を行う。エンコーダー部２０は、順伝播の結果として、状態ベクトルを出力する。 Next, in step S102, the encoder unit 20 performs forward propagation based on the series of frame image data passed in step S101. The encoder unit 20 outputs a state vector as a result of forward propagation.

次に、ステップＳ１０３において、第２エンコーダー部６０は、ステップＳ１０１で渡された正解語列のデータを基に、順伝播を行う。第２エンコーダー部６０は、順伝播の結果として、状態ベクトルを出力する。 Next, in step S103, the second encoder unit 60 performs forward propagation based on the data of the correct word string passed in step S101. The second encoder unit 60 outputs a state vector as a result of forward propagation.

次に、ステップＳ１０４において、第２ロス算出部７０は、エンコーダー部２０から出力された状態ベクトル（ステップＳ１０２）と、第２エンコーダー部６０から出力された状態ベクトル（ステップＳ１０３）とを基に、ロスを算出する。 Next, in step S104, the second loss calculation unit 70 calculates a loss based on the state vector output from the encoder unit 20 (step S102) and the state vector output from the second encoder unit 60 (step S103).
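
As a rough illustration of step S104, the second loss can be written as a distance between the two state vectors. The text only says that the loss is calculated from both vectors, so the mean squared difference used below is an assumption, and the function names are hypothetical.

```python
def second_loss(enc_state, enc2_state):
    """Mean squared difference between the two state vectors (assumed metric)."""
    if len(enc_state) != len(enc2_state):
        raise ValueError("state vectors must have the same length")
    n = len(enc_state)
    return sum((a - b) ** 2 for a, b in zip(enc_state, enc2_state)) / n


def second_loss_grad(enc_state, enc2_state):
    """Gradient of the loss with respect to enc_state.
    The gradient with respect to enc2_state is the negative of this."""
    n = len(enc_state)
    return [2.0 * (a - b) / n for a, b in zip(enc_state, enc2_state)]


loss_value = second_loss([1.0, 2.0], [1.0, 4.0])
grad_value = second_loss_grad([1.0, 2.0], [1.0, 4.0])
```

Because this loss is attached directly to the encoder output, its gradient reaches the encoder without passing through the decoder, which is the short backpropagation path discussed above.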

次に、ステップＳ１０５において、エンコーダー部２０は、ステップＳ１０４において第２ロス算出部７０が算出したロスに基づいて、誤差逆伝播を行う。この誤差逆伝播により、エンコーダー部２０は、内部のパラメーターを更新する。 Next, in step S105, the encoder unit 20 performs error backpropagation based on the loss calculated by the second loss calculation unit 70 in step S104. Through this error backpropagation, the encoder unit 20 updates internal parameters.

次に、ステップＳ１０６において、第２エンコーダー部６０は、ステップＳ１０４において第２ロス算出部７０が算出したロスに基づいて、誤差逆伝播を行う。この誤差逆伝播により、第２エンコーダー部６０は、内部のパラメーターを更新する。 Next, in step S106, the second encoder unit 60 performs error backpropagation based on the loss calculated by the second loss calculation unit 70 in step S104. Through this error backpropagation, the second encoder section 60 updates internal parameters.

以上、ステップＳ１０２からＳ１０６までの一連の処理は、エンコーダー部２０の出力と第２エンコーダー部６０の出力との差分に基づき、エンコーダー部２０および第２エンコーダー部６０の各々が内部に持つニューラルネットワークのパラメーターを調整する処理である。つまり、前述の、第２学習モードの処理である。 The series of processes from step S102 to step S106 described above adjusts the parameters of the neural networks inside the encoder unit 20 and the second encoder unit 60, based on the difference between the output of the encoder unit 20 and the output of the second encoder unit 60. In other words, it is the processing of the second learning mode described earlier.

次に、ステップＳ１０７において、エンコーダー部２０は、ステップＳ１０１で渡されたフレーム画像データの系列を基に、順伝播を行う。エンコーダー部２０は、順伝播の結果として、状態ベクトルを出力する。本ステップで生成した状態ベクトルを、エンコーダー部２０は、デコーダー部３０に渡す。 Next, in step S107, the encoder unit 20 performs forward propagation based on the series of frame image data passed in step S101. The encoder unit 20 outputs a state vector as a result of forward propagation. The encoder unit 20 passes the state vector generated in this step to the decoder unit 30.

次に、ステップＳ１０８において、デコーダー部３０は、ステップＳ１０７においてエンコーダー部２０が出力した状態ベクトルに基づいて、順伝播を行う。その結果として、デコーダー部３０は、語の列（推定語列）を出力する。この推定語列は、＜ｅｏｓ＞等の特殊記号を含んでもよい。 Next, in step S108, the decoder unit 30 performs forward propagation based on the state vector output by the encoder unit 20 in step S107. As a result, the decoder unit 30 outputs a word string (estimated word string). This estimated word string may include special symbols such as <eos>.

次に、ステップＳ１０９において、ロス算出部５０は、ステップＳ１０１で渡された正解語列のデータと、ステップＳ１０８において求められた推定語列のデータとを基に、ロスを算出する。 Next, in step S109, the loss calculation unit 50 calculates a loss based on the correct word string data passed in step S101 and the estimated word string data obtained in step S108.

次に、ステップＳ１１０において、デコーダー部３０は、ステップＳ１０９において算出されたロスに基づいて、誤差逆伝播を行う。この誤差逆伝播により、デコーダー部３０は、内部のパラメーターを更新する。 Next, in step S110, the decoder unit 30 performs error backpropagation based on the loss calculated in step S109. Through this error backpropagation, the decoder unit 30 updates internal parameters.

次に、ステップＳ１１１において、エンコーダー部２０は、ステップＳ１１０におけるデコーダー部３０の誤差逆伝播の処理の延長として、エンコーダー部２０が持つニューラルネットワークの誤差逆伝播を行う。この誤差逆伝播により、エンコーダー部２０は、内部のパラメーターを更新する。 Next, in step S111, the encoder unit 20 performs error backpropagation of the neural network included in the encoder unit 20, as an extension of the error backpropagation process of the decoder unit 30 in step S110. Through this error backpropagation, the encoder unit 20 updates internal parameters.

以上、ステップＳ１０７からＳ１１１までの一連の処理は、エンコーダー部２０およびデコーダー部３０の順伝播処理によって得られた推定語列と、学習データ供給部８０から与えられた正解語列との差分に基づき、エンコーダー部２０およびデコーダー部３０の各々が内部に持つニューラルネットワークのパラメーターを調整する処理である。つまり、前述の、第１学習モードの処理である。 The series of processes from steps S107 to S111 described above are based on the difference between the estimated word string obtained by the forward propagation process of the encoder section 20 and decoder section 30 and the correct word string given from the learning data supply section 80. , is a process of adjusting the parameters of the neural network that each of the encoder section 20 and the decoder section 30 has internally. That is, this is the process of the first learning mode described above.

ステップＳ１１２において、制御部９０は、全ての学習データを用いた機械学習処理を完了したか否かを判定する。全ての学習データを処理済みである場合（ステップＳ１１２：ＹＥＳ）には、次のステップＳ１１３に進む。まだ学習データ（入出力データ対）が残っている場合（ステップＳ１１２：ＮＯ）には、次のデータを処理するためにステップＳ１０１に戻る。 In step S112, the control unit 90 determines whether the machine learning process using all the learning data has been completed. If all the learning data has been processed (step S112: YES), the process advances to the next step S113. If there is still learning data (input/output data pair) remaining (step S112: NO), the process returns to step S101 to process the next data.

ステップＳ１１３に進んだ場合には、制御部９０は、現在の学習データの集合を用いた学習処理の所定回数の繰り返しが完了したか否かを判定する。なお、この回数は、例えば、予め定めておくものとする。所定回数の処理が完了した場合（ステップＳ１１３：ＹＥＳ）には、本フローチャート全体の処理を終了する。所定回数の処理が完了していない場合（ステップＳ１１３：ＮＯ）には、次の回の処理を行うためにステップＳ１０１に戻る。なお、本ステップにおいて、予め定めておいた回数に基づいて全体の処理を終了するか否かの判断を行う代わりに、他の判断基準に基づいた判断を行うようにしてもよい。一例として、更新対象であるニューラルネットワークのパラメーター集合の値の収束状況（十分に収束しているか否か）に基づいて、全体の処理を終了するか否かの判断を行うようにしてもよい。 If the process proceeds to step S113, the control unit 90 determines whether or not the learning process using the current set of learning data has been repeated a predetermined number of times. Note that this number of times is, for example, predetermined. If the predetermined number of processes have been completed (step S113: YES), the entire process of this flowchart is ended. If the predetermined number of processes have not been completed (step S113: NO), the process returns to step S101 to perform the next process. Note that in this step, instead of determining whether or not to end the entire process based on a predetermined number of times, the determination may be made based on another determination criterion. As an example, it may be determined whether or not to end the entire process based on the convergence status (whether or not the values of the parameter set of the neural network to be updated are sufficiently converged).
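
The control flow of FIG. 7 (steps S101 to S113) can be summarized as two nested loops. The sketch below shows only the ordering of the steps: the two step functions are hypothetical placeholders passed in by the caller, and a fixed repetition count is used as the termination criterion, which is one of the options the text mentions.

```python
def train(pairs, repetitions, second_mode_step, first_mode_step):
    """pairs: list of (frames, correct_words) learning data pairs."""
    for _ in range(repetitions):            # S113: repeat a preset number of times
        for frames, correct in pairs:       # S101 and S112: one pair at a time
            second_mode_step(frames, correct)  # S102 to S106 (second learning mode)
            first_mode_step(frames, correct)   # S107 to S111 (first learning mode)


# Usage with placeholder steps that only record the order of execution.
log = []
train(
    pairs=[("video_a", ["A"]), ("video_b", ["B"])],
    repetitions=2,
    second_mode_step=lambda f, c: log.append(("mode2", f)),
    first_mode_step=lambda f, c: log.append(("mode1", f)),
)
```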

以上の処理の手順により、エンコーダー部２０およびデコーダー部３０の学習が進む。学習により、エンコーダー部２０およびデコーダー部３０のそれぞれの内部のパラメーターが調整されるため、エンコーダー部２０およびデコーダー部３０は、より精度良く、入力データ（具体例としては、画像の系列。さらに具体的な例としては、手話を表す映像。）に対応する出力データ（具体例としては、記号の列。さらに具体的な例としては、手話に対応するグロス表記の単語列。）を生成するようになる。 Through the above procedure, the learning of the encoder unit 20 and the decoder unit 30 progresses. Because learning adjusts the internal parameters of the encoder unit 20 and the decoder unit 30, they come to generate, with higher accuracy, the output data (for example, a sequence of symbols; more specifically, a word string in gloss notation corresponding to sign language) that corresponds to the input data (for example, a sequence of images; more specifically, a video showing sign language).

以上、説明した手順では、ロス算出部５０が算出したロスに基づいてエンコーダー部２０のパラメーターを更新するだけでなく、第２ロス算出部７０が算出したロスにも基づいてエンコーダー部２０のパラメーターを更新する。第２ロス算出部７０は、エンコーダー部２０と第２エンコーダー部６０とがそれぞれ算出する状態ベクトルの差をロスとして産出する。この手法により、エンコーダー部２０の学習をより良好に行うことができる。つまり、エンコーダー部２０が生成する状態ベクトルは、入力映像と正解語列との関係をより良く表現するものとなる。したがって、変換装置１は、入力映像に対応して、精度の高い推定語列を生成することが期待される。 In the procedure described above, the parameters of the encoder unit 20 are updated not only based on the loss calculated by the loss calculation unit 50 but also based on the loss calculated by the second loss calculation unit 70. The second loss calculation unit 70 calculates, as the loss, the difference between the state vectors calculated by the encoder unit 20 and by the second encoder unit 60. This method allows the encoder unit 20 to learn more effectively. That is, the state vector generated by the encoder unit 20 better represents the relationship between the input video and the correct word string. The conversion device 1 is therefore expected to generate a highly accurate estimated word string for the input video.

図７に示した手順では、第２ロス算出部７０が算出したロスに基づく学習（ステップＳ１０２からＳ１０６までの、エンコーダー部２０および第２エンコーダー部６０の学習、第２学習モード）と、ロス算出部５０が算出したロスに基づく学習（ステップＳ１０７からＳ１１１までの、エンコーダー部２０およびデコーダー部３０の学習、第１学習モード）とを、個別且つ交互に実施している。これは、前述の制御部９０によるモードの切り替えの例である。つまり、制御部９０は、学習処理の際に、学習データ供給部８０が供給する学習用入力データと正解データとの対ごとに、第１学習モードと第２学習モードとを繰り返して実行するよう制御する。しかしながら、これら両者の学習を計算グラフ上で同時に行うようにしてもよい。 In the procedure shown in FIG. 7, learning based on the loss calculated by the second loss calculation unit 70 (steps S102 to S106: learning of the encoder unit 20 and the second encoder unit 60; the second learning mode) and learning based on the loss calculated by the loss calculation unit 50 (steps S107 to S111: learning of the encoder unit 20 and the decoder unit 30; the first learning mode) are performed separately and alternately. This is an example of the mode switching by the control unit 90 described above. That is, during the learning process, the control unit 90 performs control so that the first learning mode and the second learning mode are executed repeatedly for each pair of learning input data and correct answer data supplied by the learning data supply unit 80. However, both kinds of learning may instead be performed simultaneously on a computation graph.
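
One possible reading of performing both kinds of learning simultaneously is to backpropagate a single combined objective. The text does not say how the two losses would be combined, so the weighted sum below, the weight parameter, and the function name are all assumptions made purely for illustration.

```python
def joint_loss(loss, second_loss, second_weight=1.0):
    """Combined objective: decoder-side loss plus a weighted second loss.
    The weighted-sum form and second_weight are assumptions, not from the text."""
    return loss + second_weight * second_loss


equal_weight = joint_loss(0.5, 0.25)            # both losses weighted equally
heavier_second = joint_loss(0.5, 0.25, 2.0)     # second loss emphasized
```

With such an objective, one backward pass would deliver to the encoder both the gradient arriving through the decoder and the gradient arriving directly from the second loss.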

［評価実験の例］ [Example of evaluation experiment]
上記の実施形態による変換装置の評価実験を行った。その結果を次に記す。 An evaluation experiment was conducted on the conversion device according to the above embodiment. The results are described below.

実験の条件は次のとおりである。ＣＮＮ（畳み込みニューラルネットワーク）として、ＡｌｅｘＮｅｔを用いてＩｍａｇｅＮｅｔのデータセットで学習したパラメーターを初期値とした。また、ＲＮＮとして、エンコーダー部２０とデコーダー部３０と第２エンコーダー部６０のすべてに、４層１０００ユニットのＲｅｓｉｄｕａｌＧＲＵ（Gated Recurrent Unit）を採用した。 The conditions of the experiment were as follows. As a CNN (convolutional neural network), parameters learned using the ImageNet data set using AlexNet were used as initial values. Furthermore, as the RNN, a 4-layer, 1000-unit residual GRU (Gated Recurrent Unit) was adopted for all of the encoder section 20, decoder section 30, and second encoder section 60.

なお、比較対象の変換装置は、従来技術の手法を用いた変換装置である。言い換えれば、比較対象の変換装置は、第２ロス算出部７０を持たず、第２ロス算出部７０が算出したロスに基づく誤差逆伝播を行わない。また、したがって、比較対象の変換装置は、第２エンコーダー部６０を持たない。 Note that the conversion device to be compared is a conversion device using a conventional technique. In other words, the conversion device to be compared does not have the second loss calculation unit 70 and does not perform error backpropagation based on the loss calculated by the second loss calculation unit 70. Further, therefore, the conversion device to be compared does not have the second encoder unit 60.

評価実験で用いた学習データとして、入力映像（手話映像）と、グロス表現による正解語列との対、１６０００対を用いた。評価用データとしては、同様に入力映像とグロス表現による正解語列との対、１０００対を用いた。学習データとして用いた入力映像は、日本放送協会の手話ニュース（２００９年から２０１５年までの放送分）の映像である。対象ドメインは、気象情報および気象関連の話題である。入力映像の長さは、最大で、１文あたり１０秒である。入力映像のフレームレートは、２９．９７フレーム毎秒である。画像のサイズは２５６画素×２５６画素であり、画像は、上半身と右手と左手とを含む。語彙数は、６６９５語である。 The learning data used in the evaluation experiment consisted of 16,000 pairs of an input video (sign language video) and a correct word string in gloss notation. The evaluation data likewise consisted of 1,000 pairs of an input video and a correct word string in gloss notation. The input videos used as learning data are from the Japan Broadcasting Corporation's sign language news (broadcasts from 2009 to 2015). The target domain is weather information and weather-related topics. The input videos are at most 10 seconds long per sentence. The frame rate of the input video is 29.97 frames per second. The image size is 256 × 256 pixels, and each image includes the upper body, right hand, and left hand. The vocabulary size is 6695 words.

図８は、評価実験の結果を示すグラフである。同図のグラフは、従来技術の手法と、本実施形態の手法とのそれぞれについて、学習データ数と単語誤り率（変換誤り率）との対応関係を示す。グラフの横軸は延べ学習データ数である（単位は、文数）。グラフの縦軸は、単語誤り率である。両手法とも、概ね、学習データ数を大きくするほど、単語誤り率が下がる傾向が出ている。しかし、学習データ数のすべての領域において、従来技術の手法を用いる場合よりも、本実施形態の手法を用いる場合のほうが、単語誤り率は低い。従来技術の手法を用いる場合の単語誤り率の最小値は、０．４２４であった。一方、本実施形態の手法を用いる場合の単語誤り率の最小値は、０．４０３であった。つまり、本実施形態を用いることにより、より低い単語誤り率を実現できることが、確認できた。 FIG. 8 is a graph showing the results of the evaluation experiment. The graph in the figure shows the correspondence between the number of learning data and the word error rate (conversion error rate) for each of the conventional technique and the technique of this embodiment. The horizontal axis of the graph is the total number of learning data (unit: number of sentences). The vertical axis of the graph is the word error rate. In both methods, the word error rate tends to decrease as the number of training data increases. However, in all areas of the number of training data, the word error rate is lower when using the method of this embodiment than when using the conventional technique. The minimum word error rate using the prior art method was 0.424. On the other hand, the minimum word error rate when using the method of this embodiment was 0.403. In other words, it was confirmed that by using this embodiment, a lower word error rate could be achieved.
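
The text does not define how the word error rate was computed. A common definition, assumed here, is the word-level edit distance (substitutions, insertions, and deletions) between the estimated and correct word strings, divided by the length of the correct word string. The sketch below implements that assumed definition and also derives the relative improvement implied by the two reported minimum values.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length
    (assumed definition; the text does not specify the formula)."""
    if not reference:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(reference)


# One substitution and one deletion against a four-word reference.
example_wer = word_error_rate(["a", "b", "c", "d"], ["a", "x", "c"])

# Relative reduction implied by the reported minima (0.424 and 0.403).
relative_reduction = (0.424 - 0.403) / 0.424
```

Under this reading, the reported minima correspond to a relative reduction in word error rate of roughly 5 percent.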

なお、上述した実施形態における変換装置の少なくとも一部の機能をコンピューターで実現することができる。その場合、この機能を実現するためのプログラムをコンピューター読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピューターシステムに読み込ませ、実行することによって実現しても良い。なお、ここでいう「コンピューターシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピューター読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、ＵＳＢメモリー等の可搬媒体、コンピューターシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピューター読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、一時的に、動的にプログラムを保持するもの、その場合のサーバーやクライアントとなるコンピューターシステム内部の揮発性メモリーのように、一定時間プログラムを保持しているものも含んでも良い。また上記プログラムは、前述した機能の一部を実現するためのものであっても良く、さらに前述した機能をコンピューターシステムにすでに記録されているプログラムとの組み合わせで実現できるものであっても良い。 Note that at least some of the functions of the conversion device in the embodiment described above can be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on that medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROM, CD-ROM, DVD-ROM, and USB memory, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include something that holds a program dynamically for a short time, such as a communication line used when transmitting the program via a network such as the Internet or a communication line such as a telephone line, and something that holds the program for a certain period, such as volatile memory inside the computer system serving as the server or client in that case. The program may realize only some of the functions described above, or may realize them in combination with a program already recorded in the computer system.

以上、実施形態を説明したが、本発明はさらに次のような変形例でも実施することが可能である。例えば、入力映像に映る内容は、手話以外でもよい。手話に限らず、任意のジェスチャー等の動き（人あるいは生物の動きには限定されない）の映像を基に、記号列を出力する変換装置を実施してもよい。また、出力する単語列は、グロス表記には限定されない。出力する単語列は、任意の言語表現や、より一般的な記号列等であってもよい。また、入力データは、映像に限定されない。例えば、任意の系列データであってもよい。また、エンコーダー部２０やデコーダー部３０や第２エンコーダー部６０が用いる、機械学習のための手法は、ニューラルネットワークに限られるものではない。つまり、ニューラルネットワークの代わりに、学習データに基づいて機械学習を行うことのできる任意の手段を用いてもよい。 Although the embodiments have been described above, the present invention can be further implemented in the following modifications. For example, the content shown in the input video may be other than sign language. A conversion device that outputs symbol strings based on images of movements such as arbitrary gestures (not limited to movements of people or living things), not limited to sign language, may be implemented. Furthermore, the word string to be output is not limited to gloss notation. The word string to be output may be an arbitrary linguistic expression or a more general symbol string. Furthermore, input data is not limited to video. For example, it may be any series data. Furthermore, the method for machine learning used by the encoder section 20, decoder section 30, and second encoder section 60 is not limited to neural networks. That is, instead of a neural network, any means that can perform machine learning based on learning data may be used.

以上、説明したように、本実施形態（変形例を含む）によれば、変換装置１は、ニューラルネットワークを用いて構成され入力される正解記号列を基に状態ベクトルを生成するする第２エンコーダー部６０と、エンコーダー部２０が生成する状態ベクトルと第２エンコーダー部６０が生成する状態ベクトルとの差を算出する第２ロス算出部７０とを備える。そして、第２ロス算出部７０が算出したロスに基づいて、エンコーダー部２０および第２エンコーダー部６０の誤差逆伝播を行うことができる。つまり、相対的に短い誤差逆伝播経路を用いて、エンコーダー部２０の機械学習処理を行うことができる。これにより、学習効果が良く表れ、変換装置１の変換精度が向上する。あるいは、変換装置１の学習コストを下げることができる。 As described above, according to the present embodiment (including modified examples), the conversion device 1 includes a second encoder that is configured using a neural network and generates a state vector based on the input correct symbol string. section 60, and a second loss calculation section 70 that calculates the difference between the state vector generated by the encoder section 20 and the state vector generated by the second encoder section 60. Then, based on the loss calculated by the second loss calculation unit 70, backpropagation of errors in the encoder unit 20 and the second encoder unit 60 can be performed. In other words, the machine learning process of the encoder unit 20 can be performed using a relatively short error backpropagation path. Thereby, the learning effect is clearly expressed, and the conversion accuracy of the conversion device 1 is improved. Alternatively, the learning cost of the conversion device 1 can be reduced.

以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although the embodiments of the present invention have been described above in detail with reference to the drawings, the specific configuration is not limited to these embodiments, and includes designs within the scope of the gist of the present invention.

本発明は、例えば、映像を基に記号列を生成するあらゆる適用領域（一例として、映像理解等）に利用することができる。特に手話映像を対象とした処理を行う場合には、聴覚障害者と健聴者のコミュニケーションに利用したり、手話学習者の教育に利用したり、することができる。但し、本発明の利用範囲はここに例示したものには限られない。 The present invention can be used, for example, in any application area that generates a symbol string from video (video understanding is one example). In particular, when sign language videos are processed, it can be used for communication between hearing-impaired people and hearing people, and for the education of sign language learners. However, the scope of use of the present invention is not limited to the examples given here.

１変換装置 1 Conversion device
１０入力部 10 Input section
２０エンコーダー部 20 Encoder section
３０デコーダー部 30 Decoder section
４０出力部 40 Output section
５０ロス算出部 50 Loss calculation section
６０第２エンコーダー部 60 Second encoder section
７０第２ロス算出部 70 Second loss calculation section
８０学習データ供給部 80 Learning data supply section
９０制御部 90 Control section
２０１，３０１，６０１ニューラルネットワーク 201, 301, 601 Neural network

Claims

an encoder section that generates state data based on input data;
a decoder unit that generates output data based on the state data;
a learning data supply unit that supplies a pair of learning input data that is input to the encoder unit and correct answer data that is a correct answer of the output data that corresponds to the learning input data;
a loss calculation unit that calculates a loss representing the difference between learning output data, which the decoder unit generates from the state data that the encoder unit generates based on the learning input data, and the correct answer data supplied by the learning data supply unit in correspondence with the learning input data;
a second encoder unit that generates estimated state data that is state data estimated based on the correct data;
a second loss calculation unit that calculates a second loss representing the difference between the state data that the encoder unit generates based on the learning input data and the estimated state data that the second encoder unit generates based on the correct answer data supplied by the learning data supply unit in correspondence with the learning input data;
a control unit that controls the operation by appropriately switching between a first learning mode, a second learning mode, and a conversion execution mode;
Equipped with the above, wherein:
In the first learning mode, the internal parameters of the encoder section and the decoder section are adjusted based on the loss that the loss calculation unit calculates from the learning input data and the correct answer data supplied by the learning data supply unit,
In the second learning mode, the internal parameters of the encoder section and the second encoder section are adjusted based on the second loss that the second loss calculation unit calculates from the learning input data and the correct answer data supplied by the learning data supply unit,
In the conversion execution mode, the encoder section generates state data based on input data, and the decoder section generates output data based on the state data generated by the encoder section.
conversion device.

Each of the encoder section, the decoder section, and the second encoder section includes a neural network therein,
In the first learning mode, the internal parameters of the encoder section and the decoder section are adjusted by performing error backpropagation of the respective neural networks of the encoder section and the decoder section based on the loss;
In the second learning mode, the internal parameters of the encoder section and the second encoder section are adjusted by performing error backpropagation of the respective neural networks of the encoder section and the second encoder section based on the second loss,
The conversion device according to claim 1.

The control unit performs control such that, during the learning process, the first learning mode and the second learning mode are executed repeatedly for each pair of the learning input data and the correct answer data supplied by the learning data supply unit,
The conversion device according to claim 1 or 2.

The input data is a series of images;
the output data is a series of predetermined symbols;
Conversion device according to any one of claims 1 to 3.

The series of images is a series of images representing sign language,
the series of symbols is a series of words in gloss notation corresponding to the sign language;
The conversion device according to claim 4.

an encoder section that generates state data based on input data;
a decoder unit that generates output data based on the state data;
a learning data supply unit that supplies a pair of learning input data that is input to the encoder unit and correct answer data that is a correct answer of the output data that corresponds to the learning input data;
a loss calculation unit that calculates a loss representing the difference between learning output data, which the decoder unit generates from the state data that the encoder unit generates based on the learning input data, and the correct answer data supplied by the learning data supply unit in correspondence with the learning input data;
a second encoder unit that generates estimated state data that is state data estimated based on the correct data;
a second loss calculation unit that calculates a second loss representing the difference between the state data that the encoder unit generates based on the learning input data and the estimated state data that the second encoder unit generates based on the correct answer data supplied by the learning data supply unit in correspondence with the learning input data;
a control unit that controls the operation by appropriately switching between a first learning mode, a second learning mode, and a conversion execution mode;
Equipped with the above, wherein:
In the first learning mode, the internal parameters of the encoder section and the decoder section are adjusted based on the loss that the loss calculation unit calculates from the learning input data and the correct answer data supplied by the learning data supply unit,
In the second learning mode, the internal parameters of the encoder section and the second encoder section are adjusted based on the second loss that the second loss calculation unit calculates from the learning input data and the correct answer data supplied by the learning data supply unit,
In the conversion execution mode, the encoder section generates state data based on input data, and the decoder section generates output data based on the state data generated by the encoder section.
A program that causes a computer to function as a conversion device.