JP2023139731A - Learning method and learning apparatus for gesture generation model

Info

Publication number: JP2023139731A
Application number: JP2022045417A
Authority: JP
Inventors: Bowen Wu; Chaoran Liu; Carlos Toshinori Ishi
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2023-10-04

Abstract

To provide a learning method and a learning apparatus for a gesture generation model that make it possible to naturally choose gestures suited to the content of speech. SOLUTION: A learning method for a gesture generation model includes the steps of: preparing training data subdivided into multiple segments; preparing a generator 118 configured to receive, as input, a voice matrix included in each segment, a noise matrix, and a seed pose vector 184 from a previous segment, and to generate a gesture matrix 126 for the input voice matrix 128; preparing a discriminator 120 configured to output an evaluation value 130 of an input gesture matrix 200 with respect to the input voice matrix 128; and training the generator 118 and the discriminator 120 by adversarial training that evaluates a gesture matrix with respect to a voice matrix while switching between the gesture matrix 126 obtained from the generator 118 and a gesture matrix 124 of the training data. SELECTED DRAWING: Figure 3

Description

An application for the exception under Article 30, Paragraph 2 of the Patent Act has been filed. The invention was presented in a paper titled "Probabilistic Human-like Gesture Synthesis from Speech using GRU-based WGAN" at the GENEA Workshop 2021 of the 23rd ACM International Conference on Multimodal Interaction, published on the web on July 6, 2021.

The present invention relates to a learning method and a learning apparatus for a gesture generation model that generates gestures accompanying speech from speech audio, and to the gesture generation model itself.

When a person interacts with a humanoid robot, a computer graphics (CG) avatar, or the like (hereinafter simply a "robot"), it is desirable that the person feel as little unnaturalness as possible. In such situations, the robot's gestures during speech matter a great deal, in addition to the content and manner of the speech itself.

Generating robot gestures that match the content of speech has long been studied. A typical conventional approach maps speech content to gestures one-to-one. With such an approach, however, identical speech content always yields identical gestures. Real people rarely make the same gesture even when saying the same thing. Adopting such a mapping therefore makes the robot's gestures feel unnatural to the other party.

Patent Document 1, by contrast, discloses applying grammatical rules prepared in advance to the speech content to select several gesture candidates. In Patent Document 1, weight distributions assigned in advance to these gesture candidates are used, and one of the candidates is selected probabilistically by comparing the values obtained from those weight distributions for a specified parameter. Furthermore, when the robot is operated according to the selected gesture candidate, a value based on a random number is added to the trajectories of the control points to vary the robot's motion.

According to Patent Document 1, the underlying gesture candidate is selected probabilistically, so different gestures may be selected even for the same speech content. In addition, because the robot's motion is finally varied based on random numbers, the robot's gestures are said to look natural to humans.

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2014-504959

According to the technique disclosed in Patent Document 1, the robot's gestures differ even when the speech content is the same. However, Patent Document 1 claims only that the gestures feel natural because they differ for the same speech content; it does not disclose whether the gestures themselves are natural when compared with human gestures. Giving gestures randomness with respect to speech content is meaningless if the gestures themselves are unnatural.

An object of the present invention is therefore to provide a learning method and a learning apparatus for a gesture generation model that make it possible to naturally use different gestures suited to the content of speech.

A learning method for a gesture generation model according to a first aspect of the present invention is a method of training a gesture generation model for generating gestures accompanying speech from speech audio. The method includes a step in which a computer prepares training data divided into a plurality of segments, each having a predetermined time length and a predetermined overlap period. Each of the plurality of segments includes a plurality of frames of a fixed length, and each segment includes a speech matrix whose components are predetermined acoustic features obtained from a person's speech in each frame of the segment, and a gesture matrix made up of pose vectors representing the person's pose in each frame of the segment. The method further includes: a step in which the computer prepares a first neural network that, for each segment, receives as input the speech matrix included in the segment, a noise matrix whose components are noise sampled from a predetermined distribution, and a seed pose vector made up of a predetermined number of pose vectors forming part of the gesture matrix of the immediately preceding segment, and that generates and outputs, based on the speech matrix, the noise matrix, and the seed pose vector, a gesture matrix corresponding to the input speech matrix; a step in which the computer prepares a second neural network that has a first input and a second input, receives at the first input the speech matrix input to the first neural network, and outputs an evaluation value of the gesture matrix received at the second input as a gesture matrix corresponding to the input speech matrix; and a step in which the computer trains the first neural network and the second neural network by adversarial training in which the gesture matrix output by the first neural network and the gesture matrix of the training data are switched and evaluated as the gesture matrix for the speech matrix.

Preferably, the loss function in the adversarial training includes: a first loss function representing the difference, for the input speech matrix, between the conditional distribution of the output of the first neural network and the conditional distribution of the ground-truth data according to the second neural network; a gradient penalty on the loss function in the second neural network; and a third loss function representing the difference between a predetermined number of pose vectors at the end of the gesture matrix of the immediately preceding training data and a predetermined number of pose vectors at the beginning of the gesture matrix output by the first neural network.
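One way to write these three terms out, as a hedged sketch rather than the specification's own notation: let $s$ be the speech matrix, $z$ the noise matrix, $p$ the seed pose vector, $x$ the gesture matrix of the training data, $G$ the first neural network, and $D$ the second. The symbols, the Wasserstein form of the first term (suggested by the gradient penalty and by the "GRU-based WGAN" title of the cited paper), the unspecified norm in the third term, and the weights $\lambda_{gp}$ and $\lambda_{c}$ are all assumptions introduced here for illustration.

```latex
\begin{align}
\hat{y} &= G(s, z, p) \\
L_{1} &= \mathbb{E}\big[D(s, \hat{y})\big] - \mathbb{E}\big[D(s, x)\big] \\
L_{gp} &= \mathbb{E}\Big[\big(\lVert \nabla_{\tilde{x}} D(s, \tilde{x}) \rVert_{2} - 1\big)^{2}\Big],
  \qquad \tilde{x} = \epsilon x + (1 - \epsilon)\hat{y},\ \epsilon \sim U[0, 1] \\
L_{3} &= \frac{1}{k} \sum_{i=1}^{k} \big\lVert x^{\mathrm{prev}}_{T-k+i} - \hat{y}_{i} \big\rVert \\
L_{D} &= L_{1} + \lambda_{gp} L_{gp}, \qquad
L_{G} = -\mathbb{E}\big[D(s, \hat{y})\big] + \lambda_{c} L_{3}
\end{align}
```

Here $T$ is the number of frames per segment, $k$ is the predetermined number of pose vectors, and $x^{\mathrm{prev}}$ is the gesture matrix of the immediately preceding training data.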

More preferably, the predetermined time length is selected from a range of 1 second to 2 seconds inclusive.

Still more preferably, the predetermined time length is selected from a range of 1.3 seconds to 1.7 seconds inclusive.

Preferably, the predetermined overlap period is selected from a range of 10 milliseconds to 30 milliseconds inclusive.

More preferably, the predetermined overlap period is selected from a range of 15 milliseconds to 25 milliseconds inclusive.

Still more preferably, the predetermined acoustic features include F0, power, or both.

Preferably, the adversarial training is performed using an unrolled GAN (Generative Adversarial Network).

More preferably, the second neural network includes a convolutional neural network that receives as input the gesture matrix received at the second input and the speech matrix input to the first neural network, and outputs a scalar representing an evaluation value of the gesture matrix received at the second input as a gesture corresponding to the speech matrix.

Still more preferably, the first neural network includes a bi-GRU (bidirectional Gated Recurrent Unit) network that receives as input a matrix obtained by concatenating the speech matrix in the training data, the noise matrix, and the seed pose vector, and outputs a gesture matrix corresponding to the speech matrix input to the first neural network.

A learning apparatus for a gesture generation model according to a second aspect of the present invention is an apparatus for training a gesture generation model for generating a person's gestures accompanying speech from speech audio. The apparatus includes a training data storage device that stores training data divided into a plurality of segments, each having a predetermined time length and a predetermined overlap period. Each of the plurality of segments includes a plurality of frames of a fixed length, and each segment includes a speech matrix whose components are predetermined acoustic features obtained from a person's speech in each frame of the segment, and a gesture matrix representing the person's gestures in those frames. The apparatus further includes: a first neural network that, for each of the plurality of segments, receives as input the speech matrix included in the segment, a noise matrix sampled from a predetermined distribution, and a predetermined seed pose vector, and generates and outputs, based on the speech matrix, the noise matrix, and the seed pose vector, a gesture matrix corresponding to the input speech matrix; a second neural network that receives the gesture matrix output by the first neural network and the speech matrix input to the first neural network, and outputs an evaluation value of the gesture matrix output by the first neural network with respect to the input speech matrix; and adversarial training means for training the first neural network and the second neural network by adversarial training that switches between a gesture matrix from the training data and the gesture matrix output by the first neural network.

The above and other objects, features, aspects, and advantages of the present invention will become apparent from the following detailed description of the invention, read in conjunction with the accompanying drawings.

FIG. 1 is a diagram schematically showing how the gesture generation model according to an embodiment of the present invention is used.
FIG. 2 is a schematic diagram outlining the learning method for the gesture generation model according to the embodiment of the present invention.
FIG. 3 is a block diagram showing the functional configuration of the learning apparatus for the gesture generation model according to the embodiment of the present invention.
FIG. 4 is a block diagram of the generator of the gesture generation model according to the embodiment of the present invention.
FIG. 5 is a block diagram of the discriminator of the gesture generation model according to the embodiment of the present invention.
FIG. 6 is a schematic diagram for explaining the operation during training of the gesture generation model according to the embodiment of the present invention.
FIG. 7 is a schematic diagram for explaining how part of the cost is calculated during training of the gesture generation model according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the control structure of a program that implements the learning method for the gesture generation model according to the embodiment of the present invention.
FIG. 9 is a functional block diagram of a gesture generation apparatus that uses the gesture generation model according to the embodiment of the present invention.
FIG. 10 is a schematic diagram for explaining the interpolation of gesture segments generated by the gesture generation model according to the embodiment of the present invention.
FIG. 11 is a table showing the objective evaluation of experiments conducted on the gesture generation model according to the embodiment of the present invention.
FIG. 12 is a table showing the items used for subjective evaluation in experiments conducted on the gesture generation model according to the embodiment of the present invention.
FIG. 13 is a graph showing the results of the subjective evaluation in experiments conducted on the gesture generation model according to the embodiment of the present invention.
FIG. 14 is a table showing how the effect varies with the weight on the continuity loss in the gesture generation model according to the embodiment of the present invention.
FIG. 15 is a graph showing the smoothness of gesture generation by the gesture generation model according to the embodiment of the present invention.
FIG. 16 is a diagram showing the external appearance of a computer system for implementing the gesture generation method and apparatus based on the gesture generation model according to the embodiment of the present invention.
FIG. 17 is a block diagram showing the hardware configuration of the computer system shown in FIG. 16.

In the following description and drawings, identical components are given identical reference numerals. Detailed descriptions of them are therefore not repeated.

1. First Embodiment
A. Configuration
One conceivable way to control a robot is to describe each of the robot's motions in a program or the like according to what the robot says. Such an approach, however, clearly limits the motions the robot can make. Making the gestures produced by such a program look natural also requires repeated cycles of experimentation and programming. Moreover, the range of possible utterances is unlimited. In practice, therefore, it is impossible to describe the robot's motions for every utterance in this way.

A promising way to solve this problem is, as shown in FIG. 1, to obtain a model 54 that generates, from a speech signal 50 corresponding to the robot's utterance, a gesture 52 corresponding to that utterance. Given a model 54 that yields natural gestures for any speech signal, there is no need to laboriously describe the robot's motions. The learning method for a gesture generation model according to this embodiment, described below, is a method for that purpose.

a. Overall configuration
FIG. 2 schematically shows the overall learning method for the gesture generation model according to this embodiment. Referring to FIG. 2, a learning system 100 that implements this learning method includes a training data storage device 110 for storing training data. The training data storage device 110 stores a plurality of training data sets. Each training data set includes a speech signal obtained from a person's utterance. Each speech signal is divided into segments of a fixed length, and each segment is further divided into a plurality of frames. A speech feature extraction unit 116 then extracts F0 and power, which are prosodic information of the speech, from each frame every 10 milliseconds and converts them into a vector. In this specification, this vector is called a speech vector. A speech vector is a two-dimensional vector.

In this embodiment, one segment is 1.5 seconds long and additionally includes a 0.2-second portion that overlaps the adjacent segment, so one segment is effectively 1.7 seconds long. One segment is divided into 34 frames, each corresponding to 0.05 seconds. The speech signal for one segment is thus a sequence of 34 two-dimensional speech vectors, forming a matrix of 34 rows and 2 columns. In this specification, this matrix is called a speech matrix.
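The segment arithmetic in this paragraph can be checked with a short sketch. It assumes 50 ms frames, a 30-frame (1.5 s) hop, and a 4-frame (0.2 s) overlap, and it uses NumPy. The function name `split_into_segments` and the choice to drop a trailing partial segment are illustrative assumptions, not details from the specification.

```python
import numpy as np

FRAME_SEC = 0.05          # one frame, per the embodiment
HOP_FRAMES = 30           # 1.5 s of new material per segment
OVERLAP_FRAMES = 4        # 0.2 s shared with the adjacent segment
SEG_FRAMES = HOP_FRAMES + OVERLAP_FRAMES  # 34 frames = 1.7 s


def split_into_segments(features: np.ndarray) -> np.ndarray:
    """Cut a (num_frames, dim) feature sequence into overlapping segments.

    Returns an array of shape (num_segments, SEG_FRAMES, dim). A trailing
    partial segment is dropped (an assumption made for this sketch).
    """
    num_frames, dim = features.shape
    if num_frames < SEG_FRAMES:
        return np.empty((0, SEG_FRAMES, dim), dtype=features.dtype)
    starts = range(0, num_frames - SEG_FRAMES + 1, HOP_FRAMES)
    return np.stack([features[s:s + SEG_FRAMES] for s in starts])


# Example: 10 s of (F0, power) speech vectors at 20 frames per second.
speech = np.random.default_rng(0).normal(size=(200, 2))
speech_matrices = split_into_segments(speech)  # one 34 x 2 speech matrix per segment
```

With this layout, the last four frames of one segment are the same frames as the first four of the next, which is what makes the 0.2-second overlap usable for joining segments.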

Each segment also includes a sequence of pose vectors, one for each frame. A pose vector is a vector whose components are the rotation angles of 13 upper-body joints, obtained from the person's posture in the corresponding frame. Because the rotation angle of each joint is three-dimensional, the pose vector for one frame has 3 × 13 = 39 components. One segment contains 34 pose vectors, so the pose information for one segment is a matrix of 34 rows and 39 columns. A sequence of poses forms a gesture, so this matrix is called a gesture matrix in this specification. The gesture for one segment generated from this gesture matrix is called a gesture chunk.
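As a minimal sketch of how per-joint rotations become a gesture matrix, the following NumPy snippet flattens a (frames, joints, angles) array into 34 rows of 39 components. The joint-major column ordering and the function name `to_gesture_matrix` are assumptions for illustration; the specification does not state how the 39 components are ordered.

```python
import numpy as np

NUM_JOINTS = 13        # upper-body joints, per the embodiment
ANGLES_PER_JOINT = 3   # three rotation components per joint
SEG_FRAMES = 34        # frames per segment


def to_gesture_matrix(joint_rotations: np.ndarray) -> np.ndarray:
    """Flatten (frames, 13, 3) joint rotations into a (frames, 39) gesture matrix."""
    frames, joints, angles = joint_rotations.shape
    if (joints, angles) != (NUM_JOINTS, ANGLES_PER_JOINT):
        raise ValueError("expected 13 joints with 3 rotation angles each")
    # Row t is the pose vector of frame t; columns 3j..3j+2 belong to joint j.
    return joint_rotations.reshape(frames, joints * angles)


rotations = np.zeros((SEG_FRAMES, NUM_JOINTS, ANGLES_PER_JOINT))
rotations[0, 1] = [0.1, 0.2, 0.3]   # joint 1 in frame 0
gesture_matrix = to_gesture_matrix(rotations)
```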

The joints used to obtain the gesture matrix are limited to the upper body because the gestures people make while speaking are concentrated almost entirely in the upper body. The segment length was kept short, at 1.5 seconds excluding the overlap, for two reasons: a long gesture can be regarded as a chain of short gestures, and generating gestures was expected to become difficult if the generated gestures were made too long. As described later, in this embodiment a gesture matrix is generated for each short 1.5-second segment, and the gesture generation model is trained so that adjacent segments connect smoothly.

For the reasons given above, the segment length is preferably selected from a range of 1 second to 2 seconds, more preferably 1.3 seconds to 1.7 seconds. As for the overlap period, too long a period is undesirable, while too short a period prevents smooth transitions between gestures; it is therefore preferably selected from a range of, for example, 10 milliseconds to 30 milliseconds, more preferably 15 milliseconds to 25 milliseconds.

The learning system 100 further includes a generator 118 that receives as input a speech matrix 128, whose components are the speech features from the speech feature extraction unit 116, and a noise matrix 132, whose components are random noise sampled from a predetermined distribution, and that is trained to output a gesture matrix 126, which is information representing the gesture predicted from the speech. In this embodiment, the noise is sampled from a standard normal distribution. Each noise vector making up the noise matrix 132 is a 20-dimensional vector generated for one frame, so the noise matrix 132 has 34 rows and 20 columns.
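Sampling the noise matrix is a one-liner in NumPy. The sketch below assumes independent standard-normal entries, one 20-dimensional noise vector per frame; the function name `sample_noise_matrix` is hypothetical.

```python
import numpy as np

SEG_FRAMES = 34   # frames per segment
NOISE_DIM = 20    # dimensions of each per-frame noise vector


def sample_noise_matrix(rng: np.random.Generator) -> np.ndarray:
    """Draw a (34, 20) noise matrix from the standard normal distribution."""
    return rng.standard_normal((SEG_FRAMES, NOISE_DIM))


rng = np.random.default_rng(0)
noise_a = sample_noise_matrix(rng)
noise_b = sample_noise_matrix(rng)  # a fresh draw for the same speech input
```

Because each call draws a new matrix, the same speech matrix can be paired with different noise, which is what lets the generator produce different gestures for identical speech.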

The learning system 100 further includes a discriminator 120 and a selector 122. The discriminator 120 is trained to identify whether an input gesture matrix is a gesture matrix composed of pose vectors 112 from actual data or a fake gesture matrix that the generator 118 produced from random noise, and to output an evaluation value 130 representing the distance between the distribution of pose vectors learned by the generator 118 and the distribution of actual pose vectors. For this training, the selector 122 selects either the gesture matrix 126 output by the generator 118 or a real gesture matrix 124 generated from the actual pose vectors 112 stored in the training data storage device 110, and supplies it to the discriminator 120.

b. Gesture model learning apparatus
FIG. 3 shows the configuration of the learning system 100 according to this embodiment. Referring to FIG. 3, the learning system 100 includes: a speech capture unit 166 for capturing uttered speech and converting it into a speech signal; an F0/power extraction unit 168 for dividing the speech signal output by the speech capture unit 166 into frames, calculating the F0 and power of the speech signal from each frame, and outputting them as a speech vector; a motion capture unit 162 for capturing the movement of predetermined body parts from the movement of the speaker's upper body and outputting motion capture data; a pose vector generation unit 164 for calculating the rotation angles of 13 upper-body joints at a rate of 34 frames per 1.5 seconds and outputting pose vectors; and a training data generation device 170 for generating training data by associating each speech vector output by the F0/power extraction unit 168 with the pose vector of the same frame output by the pose vector generation unit 164.

For training, a plurality of pairs are prepared, each consisting of the speech signal of an utterance and the gesture data for that utterance. The processing described above is performed on each pair, yielding a plurality of sets of training data. The learning system 100 further includes the training data storage device 110 shown in FIG. 2 for storing the sets of training data obtained in this way.

The learning system 100 further includes a gesture model learning apparatus 150 that includes the generator 118, the discriminator 120, and the selector 122 shown in FIG. 2, and that trains the generator 118 and the discriminator 120 by adversarial training using the training data stored in the training data storage device 110. Once training by the gesture model learning apparatus 150 is complete, the generator 118 is used for gesture generation.

The generator 118 has three inputs, first through third, and generates and outputs a gesture matrix 126 corresponding to the speech signal 114 from the data received at these inputs. The first input receives the speech signal 114 (speech matrix) from the training data storage device 110. The second input receives the noise matrix 132. For this purpose, the gesture model learning apparatus 150 includes a noise vector generation unit 180 that generates the noise matrix 132 of 34 rows and 20 columns from noise sampled from a normal distribution and supplies it to the second input of the generator 118. The third input of the generator 118 receives a seed pose vector 184, described later.

The gesture model learning apparatus 150 further includes an LPF (low-pass filter) 190 for smoothing the gesture matrix 126 output by the generator 118 by removing the high-frequency content of each of its components, and supplying the result to a first input of the selector 122. A second input of the selector 122 receives the real gesture matrix 124 from the training data storage device 110. That is, the selector 122 selects either the gesture matrix 126 output by the generator 118 or the real gesture matrix 124 supplied from the training data storage device 110, and outputs it as a gesture matrix 200.
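The specification does not say which low-pass filter is used. As one hedged illustration of the smoothing step, the sketch below applies a moving-average filter along the time axis of each column of a gesture matrix; the filter type, the window length of 5 frames, the edge padding, and the name `smooth_gesture_matrix` are all assumptions.

```python
import numpy as np


def smooth_gesture_matrix(gesture: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average low-pass filter applied to each column over time.

    gesture: (frames, pose_dim). The output has the same shape. Edges are
    padded by repeating the first and last frame so no frames are lost.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    pad = window // 2
    padded = np.pad(gesture, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    columns = [
        np.convolve(padded[:, j], kernel, mode="valid")
        for j in range(gesture.shape[1])
    ]
    return np.stack(columns, axis=1)


rng = np.random.default_rng(0)
noisy = rng.standard_normal((34, 39))      # stand-in for a generated gesture matrix
smoothed = smooth_gesture_matrix(noisy)
```

A moving average is the simplest choice that visibly reduces frame-to-frame jitter; a Butterworth or Savitzky-Golay filter would be a reasonable alternative if sharper frequency control is needed.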

The discriminator 120 has two inputs. The first input receives the speech matrix 128 from the training data storage device 110. The second input receives the gesture matrix 200 output by the selector 122. The discriminator 120 outputs an evaluation value 130 indicating whether the gesture matrix 200 received at the second input is a real gesture matrix for the speech matrix 128 received at the first input or a fake gesture matrix generated by the generator 118.

The gesture model learning apparatus 150 further includes an adversarial training unit 192 that controls the discriminator 120, the generator 118, and the selector 122 by a control signal 196 and trains the discriminator 120 and the generator 118 by adversarial training as described later, and a parameter storage unit 194 in which the adversarial training unit 192 temporarily saves the parameters of the neural network constituting the discriminator 120 during adversarial training.
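The role of the parameter storage unit can be illustrated with a control-flow sketch of one unrolled-GAN iteration. This is a simplified, first-order reading under stated assumptions: the discriminator takes one update that it keeps, its parameters are saved, it then takes a few look-ahead updates, the generator is updated against the looked-ahead discriminator, and the saved parameters are restored. A full unrolled GAN also backpropagates through the look-ahead steps, which this sketch omits. The function `unrolled_step` and the callbacks `d_step` and `g_step` are hypothetical names.

```python
import copy
from typing import Any, Callable, Dict

Params = Dict[str, Any]


def unrolled_step(
    d_params: Params,
    d_step: Callable[[Params], Params],
    g_step: Callable[[Params], None],
    unroll_steps: int = 5,
) -> Params:
    """Run one simplified unrolled-GAN iteration and return the kept discriminator.

    d_step returns updated discriminator parameters; g_step updates the
    generator against the discriminator parameters it is given.
    """
    d_params = d_step(d_params)          # the update the discriminator keeps
    saved = copy.deepcopy(d_params)      # plays the role of parameter storage unit 194
    lookahead = d_params
    for _ in range(unroll_steps):        # temporary look-ahead updates
        lookahead = d_step(lookahead)
    g_step(lookahead)                    # generator trains against the look-ahead
    return saved                         # look-ahead updates are discarded


# Toy usage: "training" just counts discriminator updates.
seen_by_generator = []
kept = unrolled_step(
    {"updates": 0},
    d_step=lambda p: {"updates": p["updates"] + 1},
    g_step=lambda p: seen_by_generator.append(p["updates"]),
    unroll_steps=3,
)
```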

The gesture model learning apparatus 150 further includes a seed pose vector extraction unit 182 that extracts and stores the pose vectors of the last four frames of the real gesture matrix 124 and supplies them to the third input of the generator 118 as the seed pose vector the next time the generator 118 generates a gesture matrix.
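A minimal sketch of the seed extraction, assuming gesture matrices are NumPy arrays of shape (34, 39). Using zeros as the seed for the very first segment is an assumption made here; the specification does not say what seed is used when there is no preceding segment. The function names are hypothetical.

```python
import numpy as np

SEED_FRAMES = 4   # number of trailing frames kept, per the embodiment
POSE_DIM = 39     # components per pose vector


def extract_seed_pose(real_gesture: np.ndarray) -> np.ndarray:
    """Return the last SEED_FRAMES pose vectors of a real gesture matrix."""
    return real_gesture[-SEED_FRAMES:].copy()


def seeds_for_sequence(real_gestures: np.ndarray) -> np.ndarray:
    """Seed for segment i comes from the real gesture matrix of segment i-1.

    real_gestures: (num_segments, frames, pose_dim). The first segment has no
    predecessor, so this sketch uses zeros there (an assumption).
    """
    seeds = np.zeros((len(real_gestures), SEED_FRAMES, real_gestures.shape[2]))
    for i in range(1, len(real_gestures)):
        seeds[i] = extract_seed_pose(real_gestures[i - 1])
    return seeds


real = np.random.default_rng(0).normal(size=(3, 34, POSE_DIM))
seeds = seeds_for_sequence(real)
```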

Referring to FIG. 4, the generator 118 includes: a fully connected layer 250 that has the first to third inputs described above and receives the voice matrix 128, the noise matrix 132, and the seed pose vector 184 supplied to them; a 2-layer bi-GRU 252 connected to receive the output of the fully connected layer 250; and a fully connected layer 254 connected to receive the output of the 2-layer bi-GRU 252 and output the gesture matrix 126.

The inputs to the fully connected layer 250 of the generator 118 are a gesture matrix in which four pose vectors are shaped into 34 rows × 39 columns, a voice matrix of 34 rows × 2 columns consisting of the voice vectors of one segment, and a noise matrix of 34 rows × 20 columns generated from noise. The output of the fully connected layer 250 is a matrix of 34 rows × 256 columns.

The input of the 2-layer bi-GRU 252 is a matrix of 34 rows × 256 columns, and its output is also a matrix of 34 rows × 256 columns.

The input of the fully connected layer 254 is a matrix of 34 rows × 256 columns, and its output is a matrix of 34 rows × 39 columns, the same shape as a matrix of pose vectors.
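
The shapes above can be checked with a small bookkeeping sketch. It assumes that the three inputs are concatenated along the column axis before the fully connected layer 250; the description only says that the layer receives all three, so the concatenation and the name `generator_shapes` are illustrative assumptions.

```python
FRAMES = 34          # rows (frames) per segment
SEED_COLS = 39       # columns of the shaped seed pose matrix
VOICE_COLS = 2       # columns of the voice matrix
NOISE_COLS = 20      # columns of the noise matrix
HIDDEN_COLS = 256    # width between the layers
POSE_COLS = 39       # columns of the output gesture matrix


def generator_shapes():
    """Return (stage name, matrix shape) pairs for the generator.

    Column-wise concatenation of the inputs is an assumption.
    """
    concat_cols = SEED_COLS + VOICE_COLS + NOISE_COLS
    return [
        ("concatenated input", (FRAMES, concat_cols)),
        ("fully connected layer 250", (FRAMES, HIDDEN_COLS)),
        ("2-layer bi-GRU 252", (FRAMES, HIDDEN_COLS)),
        ("fully connected layer 254", (FRAMES, POSE_COLS)),
    ]


for name, shape in generator_shapes():
    print(name, shape)
```

Under this assumption the fully connected layer 250 maps 61 input columns per frame to 256, and the row count of 34 is preserved throughout.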

Referring to FIG. 5, the discriminator 120 includes: a fully connected layer 300 that has two inputs and receives the voice matrix 128 and the gesture matrix 200 from the selector 122; a 1-D convolution layer 302 connected to receive the output of the fully connected layer 300; and a fully connected layer 304 that receives the output of the 1-D convolution layer 302 and outputs the evaluation value 130.

The inputs to the fully connected layer 300 are a voice matrix of 34 rows × 2 columns consisting of the voice vectors of one segment, and a matrix of 34 rows × 39 columns consisting of real or fake pose vectors received from the selector 122 shown in FIG. 3.

c. Details of Adversarial Learning
FIG. 6 outlines the training of the generator 118 and the discriminator 120 by the gesture model learning device 150. In this embodiment, training proceeds in the order of training the discriminator 120 and then training the generator 118.

Although not illustrated, the discriminator 120 is first given a voice matrix obtained from one segment of speech and a gesture matrix consisting of the pose vectors obtained from that segment. This gesture matrix is a real gesture matrix. The discriminator 120 performs computation using the parameters of its internal layers and outputs an evaluation value indicating whether the gesture matrix is a gesture corresponding to the voice matrix. At the start of training this evaluation value is unreliable. During training, the parameters of the discriminator 120 are updated so that this evaluation value approaches 0.

Next, the generator 118 receives the pose vectors 354 of the last four frames of the segment immediately preceding the segment being processed in the training data, a voice matrix 350 consisting of features extracted from the speech, and a noise vector 352 generated from noise sampled from a normal distribution. The generator 118 performs computation on these inputs using the parameters of each layer and outputs a gesture matrix 356 consisting of pose vectors. The gesture matrix 356 is a fake gesture matrix.

The gesture matrix 356 and the voice matrix 350 are supplied to the discriminator 120. The discriminator 120 performs computation on the gesture matrix 356 and the voice matrix 350 using the parameters of each layer and outputs an evaluation value 358 indicating whether the gesture matrix 356 consists of pose vectors corresponding to the voice matrix 350. Here, the parameters of the discriminator 120 are first updated so that the evaluation value 358 becomes larger. The parameters of the generator 118 are then updated so that the evaluation value 358 approaches 0.

The same processing is then repeated for each segment. During training, the actual pose vectors from the training data, not the gesture matrix 356 output by the generator 118, are used as the seed pose vector for the next step. This concludes the outline of adversarial learning for the generator 118 and the discriminator 120.

In practice, however, a technique called unrolled-GAN is used for this adversarial learning. In addition, the loss function used as the evaluation value during training differs from the usual one, as described next.

The loss function used in this embodiment is shown below.

In the above equation, y represents a pose vector, s a voice vector, z a noise vector, and m the number of frames in a segment. k is a hyperparameter indicating the number of pose vectors that are forced to overlap between consecutive segments. In this embodiment, k = 4. In the above equation, "Huber" refers to the Huber loss, whose definition is given in the following document.
Ross Girshick. 2015. Fast R-CNN. In IEEE International Conference on Computer Vision. 1440-1448.
The definition given in that document is as follows.
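
The equation itself is not reproduced in this text. For reference, the cited Fast R-CNN paper defines the smooth L1 loss, a Huber loss with threshold 1, applied to a scalar difference x as:

```latex
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1,\\
|x| - 0.5 & \text{otherwise.}
\end{cases}
```

Whether this embodiment uses exactly this threshold is an assumption based on the citation.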

Lcritic is the loss function used in an ordinary WGAN (Wasserstein GAN). In this embodiment, the discriminator 120 is trained in advance as a WGAN so that Lcritic can be computed, and that value is used as Lcritic in the training of the generator 118 and the discriminator 120 described above. Lgp is the gradient penalty, a regularization term that stabilizes training. The gradient penalty penalizes the discriminator so that the norm of its gradient becomes equal to 1. In this embodiment the discriminator has two inputs. Experiments showed that imposing the gradient penalty on both inputs had almost the same effect as imposing it on only one. In this embodiment the penalty is therefore imposed only on the input that receives the output of the generator.
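
As a hedged sketch only, the terms named in this description can be written in the form commonly used for WGAN with gradient penalty. The weighted-sum structure is inferred from the weights λgp and λc mentioned in the experimental settings, and the interpolated sample ŷ and the symbol y_seed are assumptions, not notation taken from the source.

```latex
L = L_{\mathrm{critic}} + \lambda_{gp}\, L_{gp} + \lambda_{c}\, L_{\mathrm{continuity}}

L_{gp} = \mathbb{E}_{\hat{y}}\!\left[\left(\lVert \nabla_{\hat{y}} D(s, \hat{y}) \rVert_{2} - 1\right)^{2}\right],
\qquad
\hat{y} = \epsilon\, y + (1 - \epsilon)\, G(s, z, y_{\mathrm{seed}}),
\quad \epsilon \sim U[0, 1]
```

Here the gradient is taken only with respect to the gesture input of the discriminator D, matching the statement that the penalty is imposed on one of the two inputs.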

The last term, Lcontinuity, is a loss function that did not exist previously and is introduced in this embodiment to preserve the continuity of the motion sequence. FIG. 7 illustrates the purpose of this continuity loss.

In this embodiment, one segment of the audio signal 114 input to the generator 118 is short, at 1.5 seconds. The gesture matrix 126 generated at one time from the audio signal 114 and the noise matrix 132 therefore also covers only 1.5 seconds. For longer utterances, gesture chunks must be generated for multiple segments and concatenated. For this reason, this embodiment imposes this loss function so that the pose vectors 450 of the first few (k) frames of the gesture matrix 126 output from the generator 118 for a segment closely resemble the gestures (poses) of the same number (k) of frames at the end of the immediately preceding segment, that is, the seed pose vector 184 input to the generator 118. As a result, a smooth gesture is obtained even when gesture chunks generated from the gesture matrices of consecutive segments are concatenated as they are.
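
The exact formula for the continuity loss is not reproduced in this text. A minimal sketch, assuming the loss is the mean Huber distance between the first k generated frames and the k seed frames, could look like this. The names `huber` and `continuity_loss` and the threshold `delta` are hypothetical.

```python
def huber(x, delta=1.0):
    """Huber loss of a scalar difference; delta=1 gives smooth L1."""
    a = abs(x)
    if a < delta:
        return 0.5 * x * x
    return delta * (a - 0.5 * delta)


def continuity_loss(generated, seed, k=4):
    """Mean Huber distance between the first k generated frames and
    the last k frames of `seed` (assumed form, not the patent's)."""
    total = 0.0
    count = 0
    for gen_frame, seed_frame in zip(generated[:k], seed[-k:]):
        for g, s in zip(gen_frame, seed_frame):
            total += huber(g - s)
            count += 1
    return total / count
```

The loss is 0 when the head of the generated segment reproduces the seed poses exactly and grows as the two diverge, which is the behavior the paragraph above describes.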

FIG. 8 shows, in flowchart form, the control structure of the main part of the program that implements unrolled-GAN, which is used to train the generator 118 and the discriminator 120 in this embodiment.

Referring to FIG. 8, this program includes step 500 of repeating the following process 502 for all the training data.

処理５０２は、本物のジェスチャデータを用いて識別器１２０のパラメータのみを更新する処理５１０と、処理５１０により更新された識別器１２０のパラメータを図３に示すパラメータ記憶部１９４に退避するステップ５１２と、処理５１０と同じく本物のジェスチャデータを用いて識別器１２０のパラメータのみを更新する処理５１４を所定回数だけ繰り返すステップ５１２とを含む。 The process 502 includes a process 510 of updating only the parameters of the classifier 120 using real gesture data, and a step 512 of saving the parameters of the classifier 120 updated in the process 510 to the parameter storage unit 194 shown in FIG. , step 512 of repeating process 514 for updating only the parameters of classifier 120 using real gesture data a predetermined number of times, similar to process 510.

Process 510 includes: step 540 of sampling a batch of real data; step 542 of sampling a batch of noise data; step 544 of computing the gradient of the discriminator 120 using the real data sampled in step 540 and data generated from the noise sampled in step 542; and step 546 of updating the parameters of the discriminator 120 based on the gradient computed in step 544.

処理５１６の内容も処理５１０と同じである。 The contents of process 516 are also the same as process 510.

Following process 514, process 502 further includes: step 518 of sampling a batch of real data; step 520 of sampling a batch of noise data; step 522 of operating the generator 118 and the discriminator 120 on the data sampled in steps 518 and 520 and computing the gradient of the generator 118 based on the evaluation value obtained from the discriminator 120; step 524 of updating the parameters of the generator 118 based on the gradient computed in step 522; step 525 of further updating the parameters of the generator 118 using the continuity loss; and step 526 of restoring the parameters of the discriminator 120 saved to the parameter storage unit 194 of FIG. 3 in step 512 and ending process 502.

Thus, in unrolled-GAN, the parameters of the discriminator are updated several times ahead of the generator, and after the parameters of the generator have been updated, the parameters of the discriminator are restored to their values as of the first update. This processing is said to help prevent the phenomenon called mode collapse, which often occurs in GAN training.
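
The save, look-ahead, and restore pattern described above can be sketched independently of any particular network library. In the following minimal sketch, `update_d` and `update_g` are hypothetical callables standing in for one discriminator update and one generator update (including the continuity-loss update); they are not functions defined in the source.

```python
import copy


def unrolled_step(d_params, g_params, update_d, update_g, unroll_steps):
    """One iteration of the unrolled training pattern.

    update_d(d_params, g_params) -> new discriminator parameters
    update_g(g_params, d_params) -> new generator parameters
    Both callables are placeholders for the real update rules.
    """
    # First discriminator update, then save the result (step 512).
    d_params = update_d(d_params, g_params)
    saved_d = copy.deepcopy(d_params)

    # Extra look-ahead updates of the discriminator only.
    for _ in range(unroll_steps):
        d_params = update_d(d_params, g_params)

    # Update the generator against the look-ahead discriminator.
    g_params = update_g(g_params, d_params)

    # Restore the discriminator to its saved state (step 526).
    return saved_d, g_params
```

For example, with toy scalar "parameters", `unrolled_step(0, 0, lambda d, g: d + 1, lambda g, d: g + d, 3)` returns a discriminator value of 1 (the saved state) and a generator value of 4 (updated against the discriminator after four updates).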

d. Gesture Generation Device
Using the generator 118 trained by the learning system 100, gesture information for a voice vector can be obtained from that voice vector. FIG. 9 shows the configuration of a gesture generation device 560 that uses the generator 118.

Referring to FIG. 9, the gesture generation device 560 includes: an audio capture unit 166, of the same configuration as that shown in FIG. 3, for capturing the speech for which gestures are to be generated (or playing back recorded speech); an F0/power extraction unit 168, of the same configuration as that shown in FIG. 3, for extracting F0 and power from each frame of the audio signal captured by the audio capture unit 166 and generating voice vectors; an audio segment unit 580 for generating the voice matrix of one segment by grouping the sequence of voice vectors output by the F0/power extraction unit 168 into groups of 34; and an audio data storage device 582 for storing the voice matrices output by the audio segment unit 580.

The gesture generation device 560 further includes a gesture generation unit 584 that reads each segment of the audio data stored in the audio data storage device 582 and generates gesture data 586 corresponding to that audio data.

The gesture generation unit 584 includes a noise vector generation unit 180 similar to that shown in FIG. 3, the generator 118, and a seed pose vector extraction unit 590 that extracts the last four frames of the gesture chunk output by the generator 118 as the seed pose vector for the next segment and stores them temporarily.

The first input of the generator 118 receives the voice matrix from the audio data storage device 582. The second input receives the noise matrix sampled by the noise vector generation unit 180. The third input receives, from the seed pose vector extraction unit 590, a seed pose matrix in which the pose vectors of the last four frames of the immediately preceding segment are shaped into the same form as the other inputs.

The gesture generation unit 584 further includes: an LPF 190, similar to that shown in FIG. 3, connected to receive the output of the generator 118; a gesture chunk storage unit 592 for temporarily storing the gesture chunks output by the LPF 190; and a gesture interpolation unit 594 that interpolates gestures between segments. For gesture chunks of adjacent segments among those stored in the gesture chunk storage unit 592, the gesture interpolation unit 594 adds, with equal weights, each of the last several (four in this embodiment) pose vectors of the preceding gesture chunk to the corresponding one of the same number of pose vectors at the head of the following gesture chunk. The gestures interpolated between segments by the gesture interpolation unit 594 are output as gesture data 586.

FIG. 10 shows the configuration of the gesture interpolation unit 594. Referring to FIG. 10, the gesture interpolation unit 594 receives a gesture chunk 600 generated from a preceding speech segment and a gesture chunk 606 generated from the segment immediately following it.

The gesture chunk 600 includes several pose vectors 604 at its end and the remaining pose vectors 602 of its first part. The gesture chunk 606 includes several pose vectors 610 at its head and the remaining pose vectors 608 of its latter part. In this embodiment, the pose vectors 604 and 610 each consist of four pose vectors.

The gesture interpolation unit 594 includes: a multiplication unit 612 that takes the last four pose vectors 604 from the gesture chunk 600 and multiplies each of their components by 0.5; a multiplication unit 614 that takes the first four pose vectors 610 from the gesture chunk 606 and multiplies each of their components by 0.5; and an addition unit 616 that adds each of the four pose vectors output by the multiplication unit 612 to the corresponding one of the four pose vectors output by the multiplication unit 614 to generate four new interpolated pose vectors 618.

Concatenating, in this order, the pose vectors 602 of the first part of the gesture chunk 600, the interpolated pose vectors 618, and the pose vectors 608 of the latter part of the gesture chunk 606 yields a single gesture from the two segments.
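
The interpolation and concatenation described above can be sketched as follows, with each gesture chunk represented as a list of pose vectors (lists of floats). The function name `blend_chunks` is hypothetical; the weights of 0.5 and the overlap of four pose vectors follow the description.

```python
def blend_chunks(prev_chunk, next_chunk, k=4):
    """Join two gesture chunks, averaging the k overlapping poses.

    The last k poses of prev_chunk and the first k poses of
    next_chunk are each weighted by 0.5 and added element-wise.
    """
    head = prev_chunk[:-k]      # corresponds to pose vectors 602
    tail = next_chunk[k:]       # corresponds to pose vectors 608
    blended = [
        [0.5 * a + 0.5 * b for a, b in zip(prev_pose, next_pose)]
        for prev_pose, next_pose in zip(prev_chunk[-k:], next_chunk[:k])
    ]
    return head + blended + tail
```

With two chunks of 34 pose vectors each and k = 4, the result contains 30 + 4 + 30 = 64 pose vectors.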

B. Operation
a. Training
Prior to training, training data is generated and stored in the learning data storage device 110. Before the generator 118 and the discriminator 120 are trained by unrolled-GAN, the discriminator 120 is trained by WGAN. This training makes it possible for the discriminator 120 to compute Lcritic in unrolled-GAN.

Referring to FIG. 3, of the training data stored in the learning data storage device 110, the voice matrix of the first segment is input to the generator 118 and the discriminator 120. The gesture matrix of the first segment of the training data is supplied to the selector 122. The adversarial learning unit 192 causes the selector 122 to select the gesture matrix from the training data. As a result, the real gesture matrix from the training data is input to the discriminator 120. The discriminator 120 uses the input voice matrix and gesture matrix to compute an evaluation value based on the loss function described above.

Meanwhile, the seed pose vector extraction unit 182 normally extracts the seed pose vector 184 from the end of the immediately preceding segment, shapes it into a matrix, and supplies it to the generator 118. The noise vector generation unit 180 generates the noise matrix 132 by sampling from a normal distribution and supplies it to the generator 118. The generator 118 generates the gesture matrix 126 using the voice matrix of the input training data, the noise matrix 132, and the seed pose vector 184. The gesture matrix 126, with its high-frequency components removed by the LPF 190, is supplied to the input of the selector 122. The selector 122 selects the output of the LPF 190 and supplies it to the discriminator 120. This input consists of fake pose vectors. The discriminator 120 now computes the evaluation value 130 between the voice matrix 128 and the fake pose vectors.

Based on the evaluation value 130 obtained from the real data and the evaluation value 130 obtained from the fake pose vectors, the adversarial learning unit 192 computes the parameter gradient of the discriminator 120 from the first two terms of the loss function and updates the parameters of the discriminator 120 based on the learning rate and this gradient. As a result of this update, the parameters of the discriminator 120 change so that the evaluation value is close to 0 for real data and large for fake data. The adversarial learning unit 192 saves the updated parameters to the parameter storage unit 194.

The adversarial learning unit 192 then repeats the same process of updating the parameters of the discriminator 120 a predetermined number of times. This time, however, the updated parameters are not saved to the parameter storage unit 194. Through this process, the training of the parameters of the discriminator 120 temporarily runs several generations ahead of the parameters of the generator 118.

After that, the parameters of the generator 118 are updated in the same manner, based on the evaluation value obtained from the discriminator 120 for the training data and the evaluation value obtained from the discriminator 120 for the noise data. In this update, the parameters of the generator 118 are updated so that the evaluation value output by the discriminator 120 for fake data becomes smaller.

Furthermore, Lcontinuity is computed using the last four pose vectors of the preceding segment and the first four pose vectors of the following segment, and the parameters of the generator 118 are updated so that the value of this loss function becomes smaller. The saved parameters of the discriminator 120 are then restored to the discriminator 120.

Training over all the training data is repeated in this way, and the training of the generator 118 ends when a predetermined termination condition is satisfied. One example of a termination condition is that training has been completed a predetermined number of times.

Through this training, the generator 118 learns the distribution of the gesture data represented by the training data.

b. Inference
During inference, the gesture generation device 560 (FIG. 9) operates as follows. Referring to FIG. 9, the audio capture unit 166 captures the speech for which gestures are to be generated and converts it into an audio signal. The F0/power extraction unit 168 computes F0 and power for each frame of this audio signal and forms them into vectors. The audio segment unit 580 divides the vector sequence output by the F0/power extraction unit 168 into a sequence of 1.5-second segments with a 0.2-second overlap between segments, and stores it in the audio data storage device 582.
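
The segmentation into overlapping segments can be sketched in terms of frame counts. The values of 34 frames per segment and 4 overlapping frames are taken from other parts of this description; how they map onto the 1.5-second and 0.2-second figures, and the choice to drop trailing frames that do not fill a segment, are assumptions. The name `split_segments` is hypothetical.

```python
def split_segments(frames, seg_len=34, overlap=4):
    """Split a sequence of voice vectors into overlapping segments.

    Consecutive segments share `overlap` frames. Trailing frames that
    do not fill a whole segment are dropped (an assumption).
    """
    stride = seg_len - overlap
    segments = []
    start = 0
    while start + seg_len <= len(frames):
        segments.append(frames[start:start + seg_len])
        start += stride
    return segments
```

With these defaults, the last four frames of each segment are the same as the first four frames of the next one.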

The generator 118 reads the voice matrix of the segment being processed from the audio data storage device 582. The noise vector generation unit 180 samples noise from a standard normal distribution and supplies it to the generator 118 as a noise matrix. The seed pose vector extraction unit 590 extracts, as the seed pose vector, the last four pose vectors of the gesture chunk obtained from the output of the generator 118 for the immediately preceding segment, and supplies it to the generator 118.

The generator 118 performs computation on these inputs using the parameters obtained through training and outputs a gesture matrix consisting of the sequence of pose vectors corresponding to the input voice matrix. This gesture matrix passes through the LPF 190 and is stored in the gesture chunk storage unit 592. The last four pose vectors of the stored gesture matrix are read by the seed pose vector extraction unit 590 as the seed pose vector for processing the next segment.
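
The segment-by-segment loop described above, in which each output supplies the seed for the next segment, can be sketched as follows. The callables `generator`, `sample_noise`, and `smooth` are hypothetical stand-ins for the trained generator 118, the noise vector generation unit 180, and the LPF 190. The seed used for the very first segment is not stated in this passage, so `initial_seed` is left as a caller-supplied assumption.

```python
def generate_gesture_chunks(voice_segments, generator, sample_noise,
                            initial_seed, k=4, smooth=lambda chunk: chunk):
    """Generate one gesture chunk per voice segment.

    generator(voice, noise, seed) -> list of pose vectors
    sample_noise() -> noise input for one segment
    smooth(chunk) -> filtered chunk (identity by default)
    """
    seed = initial_seed
    chunks = []
    for voice in voice_segments:
        chunk = smooth(generator(voice, sample_noise(), seed))
        chunks.append(chunk)
        # The last k poses of the stored chunk seed the next segment.
        seed = chunk[-k:]
    return chunks
```

Because the seed is taken from the generator's own previous output rather than from training data, this loop differs from the training procedure, where real pose vectors are used as the seed.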

For each pair of adjacent gesture chunks stored in the gesture chunk storage unit 592, the gesture interpolation unit 594 interpolates the gesture between the two segments by multiplying the corresponding vectors among the four pose vectors in the overlapping part of the two gesture matrices by 0.5 and adding them together.

C. Effects
According to the above embodiment, gestures that give a smoother and more natural impression than those of conventional methods can be generated. Because gestures are generated probabilistically, different gestures can be generated for the same utterance. The experiments described below showed that these effects are obtained under both objective and subjective evaluation.

Part 2: Experiment
A. Experimental Settings
In the experiment, motion capture of a speaking person yielded 1094 sets of training data, each a pair of a motion (gesture) and a Japanese utterance. The total duration of the data is 6 hours. For training the model, the learning rate was set to 10^-4 for both the generator 118 and the discriminator 120. The batch size was 128. In the loss function, λc was set to 1 and λgp to 10. The normal distribution used for noise sampling had mean 0 and variance 1. Training the model took 6 hours on a high-performance computer equipped with a GPU (Graphics Processing Unit).

Kernel density estimation (KDE) is said to be suitable for evaluating distributions because it outputs the log-likelihood of one distribution under a given distribution. KDE was therefore used for objective evaluation of the gestures generated in the following experiments. As a subjective evaluation, one based on user perception was used. That is, a user survey was conducted to assess the performance of the above embodiment. Joint rotation angles are not intuitive for humans. Users' evaluations of the gestures were therefore obtained by animating an avatar based on the gestures produced by the model.

In the experiment, the generator trained according to the above embodiment was compared with several control conditions using video clips. The conditions were as follows.

・Ground truth (GT)
Real data obtained from actual human gestures recorded by motion capture. Before being applied to the avatar, the data was preprocessed (smoothed) with a low-pass filter to remove jerky movements. These jerky movements are probably attributable to the equipment used for motion capture.

・Baseline (reference data)
As the baseline model, a state-of-the-art model that generates joint coordinate values was used, trained on the same data set as that used for training in the above embodiment. This model is a probabilistic gesture generation model based on GAN. For comparison, a neural network was used to convert the joint coordinate data into the corresponding joint rotation angles.

・Embodiment
The model of the embodiment generates gestures probabilistically. It can therefore generate different gesture sequences from the same speech segment. To evaluate results consisting of such different gestures, two videos were generated with the model of the embodiment for each utterance set, as the first plan and the second plan described later. In generating these videos, the model used different noise vectors for the same speech input.

B. Experimental Results
a. Objective Evaluation
The gestures generated by the embodiment and by the baseline model were each fitted with a KDE model. The optimal bandwidth was determined by grid search with 3-fold cross-validation. The larger the output value, the closer the gestures are to the actual data distribution. The results for each model are shown in FIG. 11.
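
The bandwidth selection can be illustrated with a deliberately simplified sketch: a one-dimensional Gaussian KDE written in plain Python, with the bandwidth chosen by 3-fold cross-validation over a small grid. The actual evaluation operates on multi-dimensional gesture data and presumably used a library implementation; the function names, the interleaved fold assignment, and the candidate grid here are all assumptions.

```python
import math


def kde_log_likelihood(train, test, bandwidth):
    """Mean log-likelihood of `test` points under a 1-D Gaussian KDE
    fitted on `train` points."""
    norm = 1.0 / (len(train) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for x in test:
        density = norm * sum(
            math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train
        )
        # Guard against log(0) when the density underflows.
        total += math.log(max(density, 1e-300))
    return total / len(test)


def select_bandwidth(data, candidates, folds=3):
    """Pick the bandwidth with the best cross-validated log-likelihood."""
    best_bw, best_score = None, -math.inf
    for bw in candidates:
        score = 0.0
        for f in range(folds):
            held_out = data[f::folds]
            train = [x for i, x in enumerate(data) if i % folds != f]
            score += kde_log_likelihood(train, held_out, bw)
        if score > best_score:
            best_bw, best_score = bw, score
    return best_bw
```

A bandwidth that is too small gives near-zero density to held-out points, and one that is too large flattens the estimate, so the cross-validated score favors an intermediate value.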

Referring to FIG. 11, the ground-truth result was computed by fitting a KDE model to the ground-truth data itself and calculating the likelihood of the ground-truth data under that model. This result can be regarded as the best attainable. The closer a model's log-likelihood is to that of the ground truth, the better the model fits the ground-truth data distribution.

As shown in FIG. 11, the model according to the embodiment of the present application achieved a log-likelihood closer to that of the ground-truth data.
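The objective evaluation above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiment: it assumes a one-dimensional Gaussian kernel, random stand-in data, and a hypothetical list of candidate bandwidths, and it reads the description as fitting the KDE to generated values and scoring the ground-truth values under it. The function names `kde_log_likelihood` and `select_bandwidth` are illustrative.

```python
import math
import random


def kde_log_likelihood(train, test, bandwidth):
    """Mean log-likelihood of `test` under a 1-D Gaussian KDE fitted to `train`."""
    norm = 1.0 / (len(train) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for x in test:
        density = norm * sum(
            math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train
        )
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total / len(test)


def select_bandwidth(samples, candidates, folds=3):
    """Grid search: keep the bandwidth with the best mean held-out log-likelihood."""
    best_bw, best_score = None, -math.inf
    for bw in candidates:
        score = 0.0
        for k in range(folds):
            # Fold k holds out every `folds`-th sample starting at index k.
            held_out = samples[k::folds]
            rest = [s for i, s in enumerate(samples) if i % folds != k]
            score += kde_log_likelihood(rest, held_out, bw)
        score /= folds
        if score > best_score:
            best_bw, best_score = bw, score
    return best_bw


# Stand-in data (assumption): real use would take joint values from gestures.
rng = random.Random(0)
generated = [rng.gauss(0.0, 1.0) for _ in range(150)]
ground_truth = [rng.gauss(0.0, 1.0) for _ in range(150)]

candidates = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]  # hypothetical grid
bandwidth = select_bandwidth(generated, candidates)
score = kde_log_likelihood(generated, ground_truth, bandwidth)
print(f"bandwidth={bandwidth}, mean log-likelihood={score:.3f}")
```

A model whose generated values lie far from the ground truth receives a lower mean log-likelihood, which is the sense in which a larger value indicates a closer match.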

b. Subjective evaluation
Three scales were used to evaluate the model according to the embodiment: the naturalness of the gestures, the consistency between gestures and speech (temporal consistency), and semantic consistency. To obtain these ratings, the questions shown in FIG. 12 were prepared.

For comparison, videos were created from the baseline model, the first and second plans (gestures generated by the embodiment from different noise vectors), and the ground-truth data. Five separate utterances were selected for the experiment, and for each utterance, video clips were created of an avatar reproducing the four types of gestures (ground truth, baseline, first plan, and second plan). The videos were presented in random order. After watching each video, subjects were asked to rate the avatar's movements on each item shown in FIG. 12 on a scale of 1 to 7 (1: strongly disagree, 7: strongly agree).

The subjects were 33 people recruited through a crowdsourcing service, 17 men and 16 women, all native speakers of Japanese. The mean age was 38 years and the variance was 9.5 years.

The results are shown in graph form in FIG. 13.

An analysis of variance (ANOVA) was performed to test statistically whether the scores differed significantly across the conditions of this experiment. The scores on all scales passed the ANOVA at p<0.01.

In addition, a Tukey HSD (Tukey's Honestly Significant Difference) test was performed to examine whether each pair of groups differed significantly. As shown in FIG. 13, for naturalness there was a significant difference between the baseline and the ground truth at p<0.05. Significant differences were found between the baseline and the first plan and between the baseline and the second plan at p<0.01. There was no significant difference between the ground truth and the first plan (p<0.77) or between the ground truth and the second plan (p=0.77). Nor was there a significant difference between the first plan and the second plan (p=0.9).

Regarding temporal consistency, there were significant differences at p<0.01 between the baseline and the ground truth, between the baseline and the first plan, and between the baseline and the second plan. There were no significant differences between the ground truth and the first plan, between the ground truth and the second plan, or between the first plan and the second plan (p=0.9).

Regarding semantic consistency, there was a significant difference between the baseline and the ground truth at p<0.01. There was a significant difference between the baseline and the first plan at p<0.05, and between the baseline and the second plan at p<0.01. No significant difference was found between the ground truth and the first plan or between the ground truth and the second plan (p=0.9 in both cases). No significant difference was found between the first plan and the second plan (p=0.86).

As described above, this embodiment produced natural gestures that were subjectively comparable to the ground truth.

In the above embodiment, the continuity loss Lc is used as a loss and is multiplied by λc in the loss function. Using this loss has two effects. First, forcing the pose vectors in the overlapping portion of two adjacently generated gesture matrices to be similar gives the gestures continuity. Second, using the last pose vectors of the gesture chunk of the preceding segment as the seed pose vectors when generating the next segment gives the following segment context, so that semantic consistency between segments is maintained.
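A continuity term of this kind can be sketched as below. The distance measure is an assumption: this passage does not state which one the embodiment uses, so mean squared difference is used here purely for illustration. The names `continuity_loss` and `weighted_continuity_term` and the toy pose values are hypothetical.

```python
def continuity_loss(prev_chunk, generated_chunk, overlap_frames):
    """Mean squared difference (assumed measure) between the last
    `overlap_frames` pose vectors of the previous chunk and the first
    `overlap_frames` pose vectors of the newly generated chunk."""
    tail = prev_chunk[-overlap_frames:]
    head = generated_chunk[:overlap_frames]
    total, count = 0.0, 0
    for tail_pose, head_pose in zip(tail, head):
        for a, b in zip(tail_pose, head_pose):
            total += (a - b) ** 2
            count += 1
    return total / count


def weighted_continuity_term(l_c, lambda_c):
    """Contribution of the continuity loss to the overall loss: lambda_c * Lc."""
    return lambda_c * l_c


# Toy chunks: 6 frames of 2-dimensional pose vectors, 4 overlapping frames.
prev_chunk = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5]]
smooth_chunk = [[0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5], [0.6, 0.6], [0.7, 0.7]]
jumpy_chunk = [[1.0, 1.0] for _ in range(6)]

print(continuity_loss(prev_chunk, smooth_chunk, 4))  # overlap matches exactly
print(continuity_loss(prev_chunk, jumpy_chunk, 4))   # overlap mismatches
```

With λc=0 the weighted term vanishes, so mismatches in the overlap are not penalized at all.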

To investigate how the value of the coefficient λc of the continuity loss Lc affects the gestures, several models with different parameters were built by training with only the value of λc varied and the other hyperparameters held constant. The result showed that with λc=0 the gestures appear disconnected from segment to segment. A gesture has at least two phases, preparation and stroke, and a stroke needs to be fully completed before the next preparation begins. When the continuity loss Lc was not used at all, however, the impression was that the stroke was cut off before reaching its end and the preparation for the next gesture began. When only interpolation between gestures was used, without the continuity loss Lc, the gestures became smoother but gave an unnatural impression instead.

From observation of the generated gestures, continuity is a problem in gestures generated by a model trained without the continuity loss, yet all of these gestures can be regarded as movements within the range of the ground truth. We therefore hypothesized that the log-likelihoods of joint angles and positions in the generated gestures should not differ greatly between models trained with and without the continuity loss, and conducted an experiment to test this. We also tested experimentally the hypothesis, likewise based on observation of the generated gestures, that the log-likelihood of joint velocities should differ considerably between models trained with the continuity loss and those trained without it. The experiment used to test these hypotheses was as follows.

First, the joint-angle outputs were converted to joint coordinate positions. The velocity of each joint was then calculated from the converted data and evaluated with KDE. FIG. 14 shows the KDE evaluation results for joint positions and velocities. Referring to FIG. 14, introducing the continuity loss was found to yield position and velocity distributions from the model that resemble those of the ground truth.

Furthermore, as shown in FIG. 15, abnormal joint velocities occur mainly in the transition intervals of the generated gestures. Referring to FIG. 15, no abnormally large velocities occur in graph 720, which shows the velocities obtained from the ground-truth data. Nor is any abnormal increase in velocity seen in graph 724 for the embodiment with λc=1. In contrast, for the model obtained by training with λc=0, that is, without the continuity loss, the velocity becomes abnormally large at regular intervals, as seen in graph 722. These intervals, interval 700 through interval 714, are all boundaries between segments, that is, transition intervals from one gesture chunk to the next. In particular, without the continuity loss, abnormally large velocities occur at the start and end of gesture chunks. The cause of this phenomenon is unknown, but the results show that introducing the continuity loss at least avoids the problem.
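The velocity analysis can be illustrated with a short sketch. It assumes a single joint given as (x, y, z) positions per frame and the 0.05-second frame duration mentioned elsewhere in this description. The rule for flagging an "abnormal" speed (more than three times the median) is a hypothetical choice for illustration, as are the function names and the toy trajectory.

```python
import math

FRAME_SECONDS = 0.05  # frame duration given elsewhere in this description


def joint_speeds(positions, frame_seconds=FRAME_SECONDS):
    """Speed of one joint between consecutive frames, from (x, y, z) positions."""
    return [
        math.dist(p0, p1) / frame_seconds
        for p0, p1 in zip(positions, positions[1:])
    ]


def abnormal_frames(speeds, factor=3.0):
    """Indices whose speed exceeds `factor` times the median (hypothetical rule)."""
    median = sorted(speeds)[len(speeds) // 2]
    return [i for i, s in enumerate(speeds) if s > factor * median]


# Toy trajectory: steady motion along x, with a sudden jump between frames
# 9 and 10, standing in for a discontinuity at a segment boundary.
positions = [
    (0.01 * i + (0.2 if i >= 10 else 0.0), 0.0, 0.0) for i in range(20)
]
speeds = joint_speeds(positions)
print(abnormal_frames(speeds))  # flags the transition at index 9
```

Applied to chunked output, spikes flagged this way would be expected to cluster at chunk boundaries when the chunks do not join smoothly.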

3. Implementation by Computer
FIG. 16 is an external view of an example of a computer system that implements the above embodiment. FIG. 17 is a block diagram showing an example of the hardware configuration of the computer system shown in FIG. 16.

Referring to FIG. 16, this computer system 950 includes a computer 970 having a DVD (Digital Versatile Disc) drive 1002, and a keyboard 974, a mouse 976, and a monitor 972 for interacting with the user, all connected to the computer 970. These are of course only an example of a configuration for cases where user interaction is required; any general hardware and software usable for user interaction (for example, a touch panel, voice input, or pointing devices in general) may be used.

Referring to FIG. 17, the computer 970 includes, in addition to the DVD drive 1002, a CPU (Central Processing Unit) 990, a GPU 992, a bus 1010 connected to the CPU 990, the GPU 992, and the DVD drive 1002, a ROM (Read-Only Memory) 996 connected to the bus 1010 that stores the boot-up program of the computer 970 and the like, a RAM (Random Access Memory) 998 connected to the bus 1010 that stores the instructions making up programs, system programs, working data, and the like, and an SSD (Solid State Drive) 1000, a nonvolatile memory connected to the bus 1010. The SSD 1000 stores the programs executed by the CPU 990 and the GPU 992 and the data used by those programs. The computer 970 further includes a network I/F (Interface) 1008 that provides a connection to a network 986 enabling communication with other terminals, and a USB port 1006 to which a USB (Universal Serial Bus) memory 984 can be attached and detached and which provides communication between the USB memory 984 and each part of the computer 970.

The computer 970 further includes an input/output I/F 1004 connected to a microphone 982, a speaker 980, a motion capture device (not shown), and the bus 1010. The input/output I/F 1004 reads audio signals, video signals, motion data, text data, and the like generated by the CPU 990 and stored in the RAM 998 or the SSD 1000 in accordance with instructions from the CPU 990, and converts them to analog form and amplifies them to drive the speaker 980; digitizes analog audio signals from the microphone 982 and stores them at any address in the RAM 998 or the SSD 1000 specified by the CPU 990; and receives motion capture signals from the motion capture device and stores them at an address specified by the CPU 990.

In the above embodiment, the programs for implementing the learning system 100 and the gesture generation device 560, or parts of them such as the generator 118, the discriminator 120, and the adversarial learning unit 192, as well as the neural network parameters and neural network programs, are all stored in, for example, the SSD 1000, the RAM 998, the DVD 978, or the USB memory 984 shown in FIG. 17, or in a storage medium of an external device (not shown) connected via the network I/F 1008 and the network 986. Typically, these data and parameters are written to the SSD 1000 from outside, for example, and loaded into the RAM 998 when executed by the computer 970.

The computer program for operating this computer system so as to realize the functions of the learning system 100 and the gesture generation device 560 shown in FIGS. 3 and 9, and of their components, is stored on the DVD 978 mounted in the DVD drive 1002 and transferred from the DVD drive 1002 to the SSD 1000. Alternatively, the program is stored in the USB memory 984, the USB memory 984 is attached to the USB port 1006, and the program is transferred to the SSD 1000. Alternatively, the program may be transmitted to the computer 970 through the network 986 and stored in the SSD 1000.

The program is loaded into the RAM 998 at execution time. Of course, a source program may be entered using the keyboard 974, the monitor 972, and the mouse 976, and the compiled object program stored in the SSD 1000. In the case of a scripting language, a script entered using the keyboard 974 or the like may be stored in the SSD 1000. In the case of a program that runs on a virtual machine, the program that functions as the virtual machine must be installed on the computer 970 in advance. Because training and testing a neural network involve a large amount of computation, it is preferable to implement each part of the embodiment of the present invention, in particular the program portions that actually perform the numerical computation, as object programs consisting of the computer's native code rather than in a scripting language.

The CPU 990 reads a program from the RAM 998 according to the address indicated by an internal register called the program counter (not shown) and interprets the instruction. It reads the data needed to execute the instruction from the RAM 998, the SSD 1000, or another device according to the address specified by the instruction, and executes the processing specified by the instruction. The CPU 990 stores the resulting data at an address specified by the program, such as in the RAM 998, the SSD 1000, or a register within the CPU 990. At this time, the value of the program counter is also updated by the program. The computer program may be loaded directly into the RAM 998 from the DVD 978, from the USB memory 984, or via the network. Some of the tasks in the program executed by the CPU 990 (mainly numerical computation) are dispatched to the GPU 992, either by instructions included in the program or according to the results of analysis performed when the CPU 990 executes the instructions.

A program that realizes the functions of each part of the embodiment described above in cooperation with the computer 970 includes a plurality of instructions written and arranged to cause the computer 970 to operate so as to realize those functions. Some of the basic functions needed to execute these instructions are provided by the operating system (OS) running on the computer 970, by third-party programs, or by modules of various toolkits installed on the computer 970. The program therefore need not include all the functions necessary to implement the system and method of this embodiment. The program need only include instructions that carry out the operations of the devices and components described above by statically linking, or by dynamically linking at run time, the appropriate functions or the functions of a "programming tool kit" in a controlled manner so as to obtain the desired result. How the computer 970 operates for this purpose is well known and is not repeated here.

The GPU 992 is capable of parallel processing and can execute the large volume of computation involved in machine learning simultaneously in parallel or in a pipelined manner. For example, parallel computational elements found in a program at compile time or at run time are dispatched as needed from the CPU 990 to the GPU 992 and executed; the results are returned to the CPU 990 directly or via a predetermined address in the RAM 998 and assigned to a predetermined variable in the program.

4. Modifications
In the above embodiment, the prosodic information F0 and power is used as the acoustic features. However, the invention is not limited to such an embodiment. Other acoustic features related to linguistic information, such as MFCCs (Mel-frequency cepstral coefficients), may also be used. The dimensions of the vectors and matrices used in the above embodiment are merely examples and may of course be changed as appropriate for the application. In the above embodiment, four vectors are used as the seed pose vectors. This corresponds to the overlap period of the audio segments, 0.2 seconds, divided by the duration of one frame, 0.05 seconds. Accordingly, if the overlap period of the audio segments or the duration of one frame is changed, the number of seed pose vectors may be changed to match.
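The relationship between the overlap period, the frame duration, and the number of seed pose vectors is simple arithmetic, sketched below. The function name `seed_pose_count` and the check that the overlap is a whole number of frames are illustrative assumptions, not part of the embodiment.

```python
def seed_pose_count(overlap_seconds, frame_seconds):
    """Number of seed pose vectors = overlap period / frame duration."""
    ratio = overlap_seconds / frame_seconds
    count = round(ratio)
    # Assumed sanity check: the overlap should span a whole number of frames.
    if count < 1 or abs(ratio - count) > 1e-6:
        raise ValueError("overlap must be a positive whole number of frames")
    return count


print(seed_pose_count(0.2, 0.05))  # 4, the value used in the embodiment
print(seed_pose_count(0.2, 0.1))   # 2, if the frame duration were doubled
```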

In the above embodiment, a standard normal distribution is used as the distribution from which noise is sampled. However, the invention is not limited to this, and a distribution other than the standard normal distribution may be used. The configurations of the generator 118 and the discriminator 120 are likewise not limited to those described above and may be changed as appropriate for the application. Furthermore, in the above embodiment, unrolled-GAN is used for the adversarial learning, because unrolled-GAN is less likely to cause mode collapse. Accordingly, any other adversarial learning algorithm may be used instead, provided that it does not cause mode collapse.
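Swapping the noise distribution can be pictured with the sketch below. The matrix dimensions, the function name `sample_noise_matrix`, and the uniform alternative are all hypothetical; this passage does not give the noise matrix size used in the embodiment.

```python
import random


def sample_noise_matrix(frames, noise_dim, sampler=None, seed=None):
    """Return a frames x noise_dim matrix of independently sampled noise.

    By default each component is drawn from the standard normal distribution;
    pass `sampler` (a function taking a random.Random) to use another one.
    """
    rng = random.Random(seed)
    draw = sampler if sampler is not None else (lambda r: r.gauss(0.0, 1.0))
    return [[draw(rng) for _ in range(noise_dim)] for _ in range(frames)]


# Hypothetical sizes: 30 frames, 20 noise components per frame.
normal_noise = sample_noise_matrix(30, 20, seed=1)
uniform_noise = sample_noise_matrix(
    30, 20, sampler=lambda r: r.uniform(-1.0, 1.0), seed=1
)
print(len(normal_noise), len(normal_noise[0]))
```

Drawing a fresh matrix for the same audio input is what lets the generator produce different gesture sequences, as with the first and second plans in the experiment.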

The embodiment disclosed herein is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim of the claims, taking into account the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording recited therein.

50, 114 Audio signal
52 Gesture
54 Model
100 Learning system
110 Learning data storage device
112, 354, 450, 602, 604, 608, 610 Pose vector
116 Audio feature extraction unit
118 Generator
120 Discriminator
122 Selector
124, 126, 200, 356 Gesture matrix
128, 350 Audio matrix
130, 358 Evaluation value
132 Noise matrix
150 Gesture model learning device
170 Learning data generation device
180 Noise vector generation unit
184 Seed pose vector
192 Adversarial learning unit
250, 254, 300, 304 Fully connected layer
252 2-layer bi-GRU
302 1-D convolution layer
352 Noise vector
560 Gesture generation device
580 Audio segment unit
582 Audio data storage device
584 Gesture generation unit
586 Gesture data
592 Gesture chunk storage unit
594 Gesture interpolation unit
600, 606 Gesture chunk
612, 614 Multiplication unit
616 Addition unit
618 Interpolated pose vector

Claims

1. A learning method for a gesture generation model for generating human gestures during speech from speech sounds, the method comprising:
a computer preparing learning data divided into a plurality of segments each having a predetermined length of time and a predetermined overlapping period, wherein
each of the plurality of segments includes a plurality of frames of a certain length, and
each of the plurality of segments includes an audio matrix whose components are predetermined acoustic features obtained from speech uttered by a person in each frame included in the segment, and a gesture matrix consisting of pose vectors each representing the pose of the person in each frame included in the segment;
the computer preparing a first neural network that receives as input, for each of the segments, the audio matrix included in the segment, a noise matrix whose components are noise sampled from a predetermined distribution, and a seed pose vector consisting of a predetermined number of pose vectors forming part of the gesture matrix of the immediately preceding segment, and that generates and outputs the gesture matrix corresponding to the input audio matrix based on the audio matrix, the noise matrix, and the seed pose vector;
the computer preparing a second neural network having a first input and a second input, which receives at the first input the audio matrix input to the first neural network and outputs an evaluation value of a gesture matrix received at the second input as the gesture matrix corresponding to the input audio matrix; and
the computer training the first neural network and the second neural network by adversarial learning that switches between the gesture matrix output from the first neural network and the gesture matrix of the learning data and evaluates the gesture matrix as a gesture matrix for the audio matrix.

2. The gesture generation model learning method according to claim 1, wherein the loss function in the adversarial learning comprises:
a first loss function representing a difference between a conditional distribution of output from the first neural network and a conditional distribution of correct data from the second neural network with respect to the input audio matrix;
a gradient penalty of the loss function in the second neural network; and
a third loss function representing a difference between the predetermined number of pose vectors at the end of the gesture matrix of the immediately preceding learning data and the predetermined number of pose vectors at the head of the gesture matrix output from the first neural network.

3. The gesture generation model learning method according to claim 1, wherein the predetermined time length is selected from a range of 1 second or more and 2 seconds or less.

4. The gesture generation model learning method according to claim 3, wherein the predetermined time length is selected from a range of 1.3 seconds or more and 1.7 seconds or less.

5. The gesture generation model learning method according to claim 3, wherein the predetermined overlapping period is selected from a range of 10 milliseconds or more and 30 milliseconds or less.

6. The gesture generation model learning method according to claim 5, wherein the predetermined overlapping period is selected from a range of 15 milliseconds or more and 25 milliseconds or less.

7. The gesture generation model learning method according to claim 1, wherein the predetermined acoustic feature includes F0, power, or both.

8. The gesture generation model learning method according to any one of claims 1 to 7, wherein the adversarial learning is performed by an unrolled-GAN.

9. The gesture generation model learning method according to any one of claims 1 to 8, wherein the second neural network comprises a convolutional neural network that receives as input the gesture matrix received at the second input and the audio matrix input to the first neural network, and outputs a scalar representing an evaluation value of the gesture matrix received at the second input as a gesture corresponding to the audio matrix.

10. The gesture generation model learning method according to any one of claims 1 to 9, wherein the first neural network comprises a bi-GRU network that receives as input a matrix in which the audio matrix, the noise matrix, and the seed pose vector in the learning data are concatenated, and outputs the gesture matrix corresponding to the audio matrix input to the first neural network.

11. A learning device for a gesture generation model for generating human gestures during speech from speech sounds, comprising:
a learning data storage device storing learning data divided into a plurality of segments each having a predetermined length of time and a predetermined overlapping period, wherein
each of the plurality of segments includes a plurality of frames of a certain length, and
each of the plurality of segments includes an audio matrix whose components are predetermined acoustic features obtained from speech uttered by a person in each frame included in the segment, and a gesture matrix representing the gesture of the person in the frame;
a first neural network that receives as input, for each of the plurality of segments, the audio matrix included in the segment, a noise matrix sampled from a predetermined distribution, and a predetermined seed pose vector, and generates and outputs the gesture matrix corresponding to the input audio matrix based on the audio matrix, the noise matrix, and the seed pose vector;
a second neural network that receives the gesture matrix output from the first neural network and the audio matrix input to the first neural network, and outputs an evaluation value of the gesture matrix output from the first neural network with respect to the input audio matrix; and
adversarial learning means for training the first neural network and the second neural network by adversarial learning that switches between the gesture matrix of the learning data and the gesture matrix output from the first neural network.