JP2020082246A

JP2020082246A - Posture data generation device, learning tool, computer program, learning data, posture data generation method and learning model generation method

Info

Publication number: JP2020082246A
Application number: JP2018217480A
Authority: JP
Inventors: Takumi Fujiwara; Toshiki Hara; Osamu Nakagawa; Tsuyoshi Maeda; Takayuki Nagai; Tomoaki Nakamura; Akihito Shimazu
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2020-06-04

Abstract

To provide a posture data generation device that can automatically generate gestures for a presentation, together with a learning tool, a computer program, learning data, a posture data generation method and a learning model generation method. SOLUTION: A posture data generation device includes: a learning tool generated using uttered voice data and human body posture data as learning data; an acquisition unit that acquires uttered voice data; and a generating unit that generates posture data on the basis of the uttered voice data acquired by the acquisition unit and the learning tool. SELECTED DRAWING: Figure 1

Description

The present invention relates to a posture data generation device, a learning device, a computer program, learning data, a posture data generation method, and a learning model generation method.

In recent years, agents such as robots and avatars have become increasingly common in society, and people have more opportunities to interact with agents in daily life. One example of such an agent is an agent that gives presentations.

Patent Document 1 discloses a service robot that has an application for face-to-face interaction with people, such as giving a presentation, and that can change its posture by rotating its arm joints and the orientation of its head to provide highly reliable communication.

Japanese Unexamined Patent Application Publication No. 2006-297531

For an agent to communicate naturally with people, however, gestures that look natural in communication must be produced. Producing natural gestures requires a creator with considerable skill. Moreover, realizing a variety of motion patterns requires programs and motion data to be created manually in advance. Producing natural gestures therefore takes a long time and is costly.

The present invention has been made in view of these circumstances, and an object thereof is to provide a posture data generation device, a learning device, a computer program, learning data, a posture data generation method, and a learning model generation method that can automatically generate presentation gestures.

A posture data generation device according to an embodiment of the present invention includes: a learning device generated using utterance voice data and human body posture data as learning data; an acquisition unit that acquires utterance voice data; and a generation unit that generates posture data based on the utterance voice data acquired by the acquisition unit and on the learning device.

A learning device according to an embodiment of the present invention is generated using utterance voice data and human body posture data as learning data.

A computer program according to an embodiment of the present invention causes a computer to execute: a process of acquiring utterance voice data; and a process of generating posture data by inputting the acquired utterance voice data to a learning device generated using utterance voice data and human body posture data as learning data.

Learning data according to an embodiment of the present invention includes time-series data of utterance voice data and time-series data of human body posture data, both extracted from a presentation video. The posture data includes three-dimensional data of a plurality of joint positions of the human body. The learning data is used to execute: a process of giving, to the output nodes of a recurrent neural network, time-series data of the three-dimensional data of the plurality of joint positions over a plurality of frames of the presentation video; a process of giving, to the input nodes of the recurrent neural network, time-series data over the plurality of frames of the utterance voice data sampled a required number of times during one frame of the presentation video; and a process of training the recurrent neural network based on the time-series data given to the output nodes and to the input nodes.

A posture data generation method according to an embodiment of the present invention acquires utterance voice data and generates posture data by inputting the acquired utterance voice data to a learning device generated using utterance voice data and human body posture data as learning data.

A learning model generation method according to an embodiment of the present invention acquires utterance voice data and human body posture data, and uses the acquired utterance voice data and human body posture data as learning data.

According to the present invention, presentation gestures can be generated automatically.

FIG. 1 is a block diagram showing an example of the configuration of the gesture generation device of the present embodiment.
FIG. 2 is a schematic diagram showing an example of utterance voice data.
FIG. 3 is a schematic diagram showing an example of time-series data of the pitch and energy of uttered voice.
FIG. 4 is a schematic diagram showing an example of three-dimensional posture data.
FIG. 5 is a schematic diagram showing an example of the configuration of the learning model.
FIG. 6 is a schematic diagram showing an example of how the learning model outputs posture data.
FIG. 7 is a schematic diagram showing a first example of a method of connecting gestures generated for each utterance sentence.
FIG. 8 is a schematic diagram showing a second example of a method of connecting gestures generated for each utterance sentence.
FIG. 9 is a schematic diagram showing an example of a gesture generated by the gesture generation device of the present embodiment.
FIG. 10 is a schematic diagram showing an example of calculating the time T of the gesture to be replaced among the generated gestures.
FIG. 11 is an explanatory diagram showing an example of the relationship between keywords and gesture data.
FIG. 12 is a schematic diagram showing an example of a method of resampling the replacement gesture.
FIG. 13 is a schematic diagram showing an example of the result of resampling.
FIG. 14 is a schematic diagram showing an example of the gesture after replacement with the gesture corresponding to a keyword.
FIG. 15 is a schematic diagram showing an example of interpolation of the gesture after replacement.
FIG. 16 is a block diagram showing an example of the configuration of the learning model generation unit of the present embodiment.
FIG. 17 is a schematic diagram showing an example of a criterion for dividing utterance sentences into learning units.
FIG. 18 is a schematic diagram showing an example of a learning model generation method.
FIG. 19 is a flowchart showing an example of the processing procedure of gesture generation by the gesture generation device of the present embodiment.
FIG. 20 is a flowchart showing an example of the processing procedure of learning model generation by the learning model generation unit.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing an example of the configuration of the gesture generation device 50 of the present embodiment. The gesture generation device 50 includes a control unit 51 that controls the entire device, an acquisition unit 52, a storage unit 53, a processing unit 54, and a generation unit 57. The processing unit 54 includes a learning model 55, which serves as the learning device, and a correction unit 56. The control unit 51 can be composed of a CPU, a ROM, a RAM, and the like.

The acquisition unit 52 can acquire utterance voice data.

FIG. 2 is a schematic diagram showing an example of utterance voice data. In FIG. 2, the vertical axis indicates the amplitude of the waveform, which can be expressed, for example, as a voltage level. The horizontal axis indicates time.

The control unit 51 has a voice analysis function and can extract the pitch and energy of the uttered voice from the utterance voice data acquired by the acquisition unit 52.

FIG. 3 is a schematic diagram showing an example of time-series data of the pitch and energy of uttered voice. The pitch of the uttered voice is the frequency of the voice waveform and represents how high or low the voice is. The energy of the uttered voice is the energy of the sound and represents how strong or weak the voice is. The time-series data of the pitch and energy of the uttered voice are collectively referred to as voice prosody time-series data.
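The source does not say how the control unit 51 extracts pitch and energy. As a minimal sketch, assuming mean squared amplitude for energy and an autocorrelation peak search for pitch (both are illustrative choices, and the function names, frame length, and search range of 75 to 400 Hz are hypothetical), the two series could be computed per frame as follows:

```python
import numpy as np

def frame_energy(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed energy measure)."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)

def frame_pitch(signal, sample_rate, frame_len, fmin=75.0, fmax=400.0):
    """Per-frame fundamental frequency (Hz) from the autocorrelation peak.

    Assumed method: the patent only states that pitch is extracted.
    Unvoiced frames are not detected in this sketch.
    """
    n_frames = len(signal) // frame_len
    lag_min = int(sample_rate / fmax)   # shortest period searched
    lag_max = int(sample_rate / fmin)   # longest period searched
    pitches = np.zeros(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        frame = frame - frame.mean()
        # Autocorrelation for non-negative lags only.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        pitches[i] = sample_rate / lag
    return pitches
```

Stacking the two outputs column-wise gives one pitch and energy pair per analysis frame, which is the form of prosody series the later sketches assume.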

The acquisition unit 52 can also acquire time-series data of the pitch and energy of the uttered voice directly. In that case, the control unit 51 does not need to extract the pitch and energy. The acquisition unit 52 can have, for example, a function of reading utterance voice data, or time-series data of the pitch and energy of uttered voice, stored in an external storage device, or a function of receiving such data via a communication network such as the Internet.

The storage unit 53 can store the utterance voice data acquired by the acquisition unit 52, or the time-series data of the pitch and energy of the uttered voice. The storage unit 53 also stores a plurality of keywords in association with time-series data of transmission three-dimensional data that conveys the meaning of each keyword. The keywords and the transmission three-dimensional data are described in detail later.

The processing unit 54 can be configured by combining hardware such as a CPU (for example, a multiprocessor with a plurality of processor cores), GPUs (Graphics Processing Units), DSPs (Digital Signal Processors), and FPGAs (Field-Programmable Gate Arrays). A quantum processor can also be combined.

The learning model 55 is generated using utterance voice data and human body posture data as learning data. For example, the learning model 55 can be generated using, as learning data, the utterance voice data of a person giving a presentation and posture data representing that person's movements.

More specifically, the learning model 55 is generated using, as learning data, the time-series data of the pitch and of the energy of the uttered voice illustrated in FIG. 3. The learning model 55 is also generated using, as learning data, time-series data of three-dimensional data of a plurality of joint positions of the human body.

FIG. 4 is a schematic diagram showing an example of three-dimensional posture data. In the figure, symbols P1 to P9 indicate the positions of joints of the upper half of the human body. FIG. 4 shows nine joints, but the number of joints is not limited to nine. The joint positions only need to cover the parts of the body whose movement is prominent during a presentation and, as in FIG. 4, can be the positions of a plurality of upper-body joints including the waist, arms, shoulders, neck, and head. The three-dimensional data can be xyz coordinates in a reference coordinate system.

The learning model 55 can thereby learn the relationship between a person's utterance and the body movement that accompanies it. The speaker's intention and enthusiasm during an utterance appear in the voice prosody, that is, in changes in the pitch and energy of the uttered voice. By using the voice prosody time-series data as learning data, the learning model 55 can therefore output posture data that expresses intention and enthusiasm.

The learning model 55 only needs to be a model that takes time-series data as learning data. It can be, for example, a recurrent neural network, but is not limited to this and may use another machine learning technique. The learning model 55 is described in detail later.

The generation unit 57 can generate posture data based on the acquired utterance voice data (more specifically, the time-series data of the pitch and of the energy of the uttered voice) and the learning model 55.

Because the learning model 55 is generated in advance using utterance voice data and human body posture data as learning data, when acquired utterance voice data is input to the learning model 55, the learning model 55 outputs posture data related to the input utterance voice data. Gestures (body movements) that accompany a person's utterance can thus be generated, so presentation gestures can be generated automatically. The cost of gesture production can also be reduced.

FIG. 5 is a schematic diagram showing an example of the configuration of the learning model 55. The learning model 55 includes an encoder 551 and a decoder 552. The encoder 551 encodes the utterance voice data (specifically, the time-series data of the pitch and energy of the uttered voice) input to the input nodes. The decoder 552 decodes the encoded data and outputs human body posture data (specifically, time-series data of three-dimensional data of a plurality of joint positions) from the output nodes.

The encoder 551 and the decoder 552 each have a plurality of intermediate layers called LSTM (Long Short-Term Memory) layers. An LSTM has a memory cell (not shown), retains the past time-series data that is needed, and can output a hidden state h_t to the LSTM at the next time step. Together, the encoder 551 and the decoder 552 can convert one time series into another. In the figure, <EOS> is a delimiter that signals the decoder 552 to start generating time-series data and is also used to signal the end. For convenience, FIG. 5 shows the plurality of LSTM layers collectively as a single LSTM, and omits the embedding layer, the fully connected layer, and so on.

FIG. 6 is a schematic diagram showing an example of how the learning model 55 outputs posture data. In the example of FIG. 6, for convenience, three frames (at time points t, t+1, and t+2) of three-dimensional posture time-series data are output from the output nodes. If the number of joint positions is nine, each joint has x, y, and z coordinate values, so each frame has 27 (= 9 × 3) values. In this case, if n is the number of samples of the voice prosody time-series data within one frame, the voice prosody time-series data is (X_1, ..., X_n) at time point t, (X_{n+1}, ..., X_{2n}) at time point t+1, and (X_{2n+1}, ..., X_{3n}) at time point t+2. Here, X is a physical quantity that includes the pitch and energy of the uttered voice.
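The frame layout described for FIG. 6 can be sketched with array shapes. This is an illustration only: the function names are hypothetical, and storing each prosody sample X as a (pitch, energy) pair is an assumption about how X is laid out.

```python
import numpy as np

NUM_JOINTS = 9   # P1..P9 in FIG. 4
COORDS = 3       # x, y, z per joint

def split_prosody_by_frame(prosody, n):
    """Group a (num_samples, 2) pitch/energy series into (num_frames, n, 2).

    Frame k holds samples X_{k*n+1} .. X_{(k+1)*n}; trailing samples that do
    not fill a whole frame are dropped (an assumption of this sketch).
    """
    num_frames = prosody.shape[0] // n
    return prosody[: num_frames * n].reshape(num_frames, n, prosody.shape[1])

def flatten_pose(joints):
    """Flatten (num_frames, 9, 3) joint positions into (num_frames, 27) vectors."""
    return joints.reshape(joints.shape[0], NUM_JOINTS * COORDS)
```

With n samples per video frame, each row of the flattened pose array pairs with one block of n prosody samples, which is the alignment between input and output nodes that the text describes.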

The generation unit 57 can generate time-series data of three-dimensional data of a plurality of joint positions of the human body. The generation unit 57 can thereby generate posture data representing a plurality of upper-body joint positions that change over time, and can automatically generate presentation gestures.

The generation unit 57 can also generate the time-series data of three-dimensional data of a plurality of joint positions in units of utterance sentences. An utterance sentence is a unit at whose beginning and end there is little voice and little gesture (the basic posture). By learning and generating gestures per utterance sentence, abrupt changes in movement around the connection points can be avoided when the generated gestures are connected.

Even if body movement is smooth within an utterance sentence unit, it may not be smooth between utterance sentences, and the generated gesture may look unnatural at those points. A method for smoothing the gestures between utterance sentences is described below.

FIG. 7 is a schematic diagram showing a first example of a method of connecting gestures generated for each utterance sentence. As shown in FIG. 7, suppose the utterance consists of an utterance sentence unit S1, "one orange on the surface of the moon", and an utterance sentence unit S2 that follows S1, "it is about as small as observing". The voice prosody time-series data of the utterance sentence unit S1 is input to the learning model 55, and the generation unit 57 generates a plurality of gestures (time-series data of three-dimensional data of a plurality of joint positions). As shown in FIG. 7, the second-to-last gesture and the last gesture of the utterance sentence unit S1 are denoted G12 and G11, respectively.

Similarly, the voice prosody time-series data of the utterance sentence unit S2 is input to the learning model 55, and the generation unit 57 generates a plurality of gestures (time-series data of three-dimensional data of a plurality of joint positions). As shown in FIG. 7, the first gesture and the next gesture of the utterance sentence unit S2 are denoted G21 and G22, respectively.

The correction unit 56 functions as a first correction unit. It corrects the three-dimensional data within one utterance sentence unit that adjoins the three-dimensional data within another utterance sentence unit connected to it.

In the example of FIG. 7, the gestures corrected by the correction unit 56 are shown as the gestures after connection. As shown in FIG. 7, the corrected version of gesture G11 can be generated by linear interpolation between gestures G11 and G21. Specifically, the data of gesture G11 is weighted by 34%, the data of gesture G21 is weighted by 66%, and the sum of the weighted data is taken as the corrected gesture G11.

The corrected version of gesture G12 can be generated by linear interpolation between gestures G12 and G21. Specifically, the data of gesture G12 is weighted by 67%, the data of gesture G21 is weighted by 33%, and the sum of the weighted data is taken as the corrected gesture G12.

Gestures G21 and G22 can be corrected in the same way. In the example of FIG. 7, both the pair G12 and G11 and the pair G21 and G22 are corrected, but this is not a limitation; only one of the two pairs may be corrected. The weighting percentages are examples and are not limited to those in FIG. 7.
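The weighted sums described for G12 and G11 can be written as a short sketch. It covers only the S1 side, because the text gives explicit weights for that side only; the function name and the (num_frames, 27) array layout are assumptions for illustration.

```python
import numpy as np

def correct_tail(unit1, unit2, w_g12=0.67, w_g11=0.34):
    """Blend the last two poses of unit S1 toward the first pose of unit S2.

    unit1, unit2: float arrays of shape (num_frames, 27).
    w_g12 and w_g11 are the weights kept by G12 and G11 (FIG. 7 values);
    the remainder of each weight goes to G21.
    """
    g21 = unit2[0]
    out = unit1.copy()
    out[-2] = w_g12 * unit1[-2] + (1.0 - w_g12) * g21   # corrected G12
    out[-1] = w_g11 * unit1[-1] + (1.0 - w_g11) * g21   # corrected G11
    return out
```

Passing `w_g11=0.5` and leaving G12 untouched would correspond to a single-frame 50/50 blend; other weights can be substituted since the text treats the percentages as examples.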

Within an utterance sentence unit, posture data representing natural, smooth movement can be generated. Between one utterance sentence and the next, however, the posture data may change sharply over time, producing an unnatural gesture. By correcting the posture data where utterance sentences connect, as shown in FIG. 7, the change in posture data between utterance sentences can be smoothed and a natural gesture can be generated.

FIG. 8 is a schematic diagram showing a second example of a method of connecting gestures generated for each utterance sentence. In the example of FIG. 7, the last two gestures in the utterance sentence unit S1 and the first two gestures in the utterance sentence unit S2 are corrected. In FIG. 8, only the last gesture in S1 and the first gesture in S2 are corrected.

As shown in FIG. 8, the data of gesture G11 is weighted by 50%, the data of gesture G21 is weighted by 50%, and the sum of the weighted data is taken as the corrected gesture G11. Gesture G21 is corrected in the same way. Alternatively, only one of gestures G11 and G21 may be corrected. The weighting percentages are examples and are not limited to those in FIG. 8.

Next, an example of correction using gestures that express the meaning of a word is described.

FIG. 9 is a schematic diagram showing an example of a gesture generated by the gesture generation device 50 of the present embodiment. The gesture shown in FIG. 9 was generated from the voice prosody time-series data extracted from the utterance "There is a big elephant at the center of the galaxy"; for convenience, 10 frames of the gesture are shown.

The correction unit 56 determines whether the utterance contains a specific word (also referred to as a "keyword") and, if it does, identifies the utterance timing of that keyword. The utterance timing can be determined from the start time and end time of the keyword's utterance. The start time and the length of the utterance may be used instead.

FIG. 10 is a schematic diagram showing an example of calculating the time T of the gesture to be replaced among the generated gestures. As shown in FIG. 10, the correction unit 56 determines the first and last time points of the gesture to be replaced within the generated gesture. The first time point can be a time t1 before the start of the keyword's utterance. For example, t1 can be 645 milliseconds, but it is not limited to this. t1 can be the lead time by which body movement begins ahead of the utterance of the word. The last time point can be a time t2 after the end of the keyword's utterance. For example, t2 can be 555 milliseconds, but it is not limited to this. t2 can be the time taken for body movement to end after the utterance of the word ends.

The correction unit 56 calculates the number of frames F of the gesture to be replaced by dividing its duration T by the frame period (the time for which each frame is displayed, that is, the reciprocal of the fps).
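As a small worked sketch of the window and frame-count calculation: the defaults below are the example values of t1 and t2 from the text, while the function names, the clamping of the start time at zero, and rounding F to the nearest integer are assumptions, since the source does not say how fractional frame counts are handled.

```python
def replacement_window(start_s, end_s, t1_s=0.645, t2_s=0.555):
    """Return (first, last) time points in seconds of the gesture to replace.

    first = keyword start - t1, last = keyword end + t2.
    Clamping first at 0.0 is an assumption for keywords near the beginning.
    """
    first = max(0.0, start_s - t1_s)
    last = end_s + t2_s
    return first, last

def frame_count(duration_s, fps):
    """F = T / frame period = T * fps, rounded to the nearest integer (assumed)."""
    return round(duration_s * fps)
```

For a keyword spoken from 2.0 s to 2.5 s, the window runs from about 1.355 s to about 3.055 s, so T is about 1.7 s, which is 17 frames at 10 fps.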

FIG. 11 is an explanatory diagram showing an example of the relationship between keywords and gesture data. A keyword is a word, such as "big", "small", "eat", "laugh", or "long", for which the movement expressing its meaning is distinctive and which is associated with movement intended to convey information. The gesture data is time-series data of gestures representing the body movement made when the corresponding keyword is spoken. For example, when a person says "big", the body movement can be represented by five gestures g1, g2, g3, g4, and g5. The relationship between keywords and gesture data can be stored in the storage unit 53.
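The association between keywords and gesture data can be pictured as a lookup table. The sketch below is hypothetical throughout: the table contents are zero placeholders rather than recorded motion, the word timings are assumed to come from some upstream alignment step that the text does not describe, and matching is by exact string, which ignores inflected forms.

```python
import numpy as np

NUM_VALUES = 27  # 9 joints x 3 coordinates, as in the FIG. 6 example

# Hypothetical keyword table: keyword -> (num_gestures, 27) time series.
# Zeros stand in for stored motion data; "big" has five gestures g1..g5.
GESTURE_TABLE = {
    "big": np.zeros((5, NUM_VALUES)),
    "small": np.zeros((4, NUM_VALUES)),
}

def find_keywords(timed_words, table=GESTURE_TABLE):
    """Return (word, start_s, end_s) for each spoken word found in the table.

    timed_words: iterable of (word, start_s, end_s) tuples (assumed input).
    """
    return [(w, s, e) for (w, s, e) in timed_words if w in table]
```

Each hit supplies both the stored gesture series and the utterance timing needed to place it in the generated sequence.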

FIG. 12 is a schematic diagram showing an example of a method of resampling the replacement gesture. As shown in FIG. 12, suppose the keyword "big" is expressed by five gestures g1, g2, g3, g4, and g5. The correction unit 56 calculates the sampling interval Δt for resampling. The sampling interval Δt can be obtained by dividing the time from the start of gesture g1 to the end of gesture g5 by (F - 1), where F is the number of frames of the gesture to be replaced. In the example of FIG. 12, the time from the start of gesture g1 to the end of gesture g5 is divided into three intervals.

サンプリングされたジェスチャーは、Ｇ１、Ｇ２、Ｇ３、Ｇ４となる。ジェスチャーＧ１は、ジェスチャーｇ１をそのまま使用する。ジェスチャーＧ４は、ジェスチャーｇ５をそのまま使用する。ジェスチャーＧ２は、サンプリングのタイミングに応じて、ジェスチャーｇ２とｇ３とを線形補間したものを使用する。ジェスチャーＧ３は、サンプリングのタイミングに応じて、ジェスチャーｇ３とｇ４とを線形補間したものを使用する。これにより、サンプリングしたジェスチャーＧ１〜Ｇ４の変化が滑らかになり自然な動きとすることができる。 The resampled gestures are G1, G2, G3, and G4. The gesture G1 uses the gesture g1 as it is. The gesture G4 uses the gesture g5 as it is. The gesture G2 is obtained by linearly interpolating the gestures g2 and g3 according to the sampling timing. The gesture G3 is obtained by linearly interpolating the gestures g3 and g4 according to the sampling timing. As a result, the changes among the resampled gestures G1 to G4 become smooth, producing a natural movement.
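
The resampling of FIG. 12 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the key gestures g1 to g5 are treated as evenly spaced point samples, each gesture is a flat list of coordinates, and the name `resample_gestures` is hypothetical.

```python
def resample_gestures(key_gestures, num_frames):
    """Resample key gestures (g1..gN) into num_frames gestures (G1..GF).

    The sampling interval is the total span divided by (F - 1), so the first
    and last outputs coincide with the first and last key gestures, and the
    outputs in between are linear interpolations of their two neighbours.
    """
    if num_frames < 2 or len(key_gestures) < 2:
        raise ValueError("need at least two key gestures and two frames")
    span = len(key_gestures) - 1              # span measured in key-gesture steps
    resampled = []
    for k in range(num_frames):
        pos = k * span / (num_frames - 1)     # sampling position k * delta_t
        i = min(int(pos), span - 1)           # index of the left neighbour
        w = pos - i                           # interpolation weight in [0, 1]
        left, right = key_gestures[i], key_gestures[i + 1]
        resampled.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return resampled


# Five one-dimensional key gestures resampled to F = 4 frames.
g = [[0.0], [1.0], [2.0], [3.0], [4.0]]
G = resample_gestures(g, 4)
print(G)
```

With five key gestures and F = 4, the second output falls between g2 and g3 and the third between g3 and g4, matching the description.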

図１３はリサンプリングの結果の一例を示す模式図である。「大きい」を表現するジェスチャーｇ１、ｇ２、ｇ３、ｇ４、ｇ５が、４つのジェスチャーＧ１、Ｇ２、Ｇ３、Ｇ４にリサンプリングされている。 FIG. 13 is a schematic diagram showing an example of the result of resampling. Gestures g1, g2, g3, g4, and g5 expressing “large” are resampled into four gestures G1, G2, G3, and G4.

図１４はキーワードに対応するジェスチャーで置き換えた後のジェスチャーの一例を示す模式図である。キーワード（図１４の例では、「大きな」）の発話の開始時点よりｔ１秒前の時点を始点として、元のジェスチャーが４つのジェスチャーＧ１、Ｇ２、Ｇ３、Ｇ４で置き換えられている。なお、置き換えたジェスチャーのうちの最初のジェスチャーＧ１とジェスチャーＧ１の前のジェスチャーの変化が滑らかでない可能性もある。同様に、置き換えたジェスチャーのうち最後のジェスチャーＧ４とジェスチャーＧ４の後のジェスチャーの変化が滑らかでない可能性もある。そこで、以下のように、ジェスチャーの補間を行うことができる。 FIG. 14 is a schematic diagram showing an example of the gestures after replacement with the gestures corresponding to a keyword. Starting from a time point t1 seconds before the start of the utterance of the keyword ("large" in the example of FIG. 14), the original gestures are replaced with the four gestures G1, G2, G3, and G4. Note that the change between the first replaced gesture G1 and the gesture preceding it may not be smooth. Similarly, the change between the last replaced gesture G4 and the gesture following it may not be smooth. Therefore, the gestures can be interpolated as follows.

図１５は置き換えた後のジェスチャーの補間の一例を示す模式図である。図１５の例では、置き換えたジェスチャーＧ１によりも前にある２つのジェスチャーを線形補間している。具体的には、ジェスチャーＧ１の１つ前のジェスチャーは、２つ前のジェスチャーよりもジェスチャーＧ１の重みを大きくすることができる。また、置き換えたジェスチャーＧ４によりも後ろある２つのジェスチャーを線形補間している。具体的には、ジェスチャーＧ４の１つ後ろのジェスチャーは、２つ後ろのジェスチャーよりもジェスチャーＧ４の重みを大きくすることができる。 FIG. 15 is a schematic diagram showing an example of gesture interpolation after replacement. In the example of FIG. 15, the two gestures preceding the replaced gesture G1 are linearly interpolated. Specifically, the gesture immediately before the gesture G1 can be given a larger weight for the gesture G1 than the gesture two before it. Likewise, the two gestures following the replaced gesture G4 are linearly interpolated. Specifically, the gesture immediately after the gesture G4 can be given a larger weight for the gesture G4 than the gesture two after it.
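
The boundary interpolation of FIG. 15 can be sketched as below. The concrete weights (2/3 for the nearest neighbouring frame, 1/3 for the next one) are assumptions; the description only requires that the frame closer to the replaced segment receives the larger weight. The names `lerp` and `replace_and_smooth` are hypothetical.

```python
def lerp(a, b, w):
    """Blend pose a toward pose b with weight w in [0, 1]."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def replace_and_smooth(frames, start, replacement, weights=(2 / 3, 1 / 3)):
    """Replace frames[start:start+len(replacement)] and blend the neighbours.

    weights[0] applies to the frame adjacent to the replaced segment and
    weights[1] to the frame one step further away (assumed values).
    """
    end = start + len(replacement)
    if start < 0 or end > len(frames):
        raise ValueError("replacement does not fit inside the sequence")
    out = [list(f) for f in frames]
    out[start:end] = [list(g) for g in replacement]
    first, last = replacement[0], replacement[-1]
    for dist, w in enumerate(weights, start=1):
        before, after = start - dist, end - 1 + dist
        if before >= 0:                       # frames preceding G1
            out[before] = lerp(frames[before], first, w)
        if after < len(frames):               # frames following the last gesture
            out[after] = lerp(frames[after], last, w)
    return out


original = [[0.0] for _ in range(8)]
smoothed = replace_and_smooth(original, 3, [[3.0], [3.0]])
print(smoothed)
```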

上述のように、補正部５６は、第２補正部としての機能を有し、生成部５７で生成した３次元データの時系列データに対応する発話文章内でキーワード（例えば、「大きい」に対応する３次元データの時系列データを、当該キーワードに関連付けられた伝達３次元データの時系列データ（ｇ１〜ｇ５、あるいはＧ１〜Ｇ４など）を用いて補正することができる。伝達３次元データの時系列データは、例えば、キーワードを表現する体の動きを表す３次元データの時間的変化を示すデータである。これにより、発話文章内の単語の意味を表現する動きでジェスチャーを補正することができるので、情報伝達を伴う動きを含む自然なジェスチャーを生成することができる。 As described above, the correction unit 56 functions as a second correction unit: within the utterance sentence corresponding to the time-series data of the three-dimensional data generated by the generation unit 57, it can correct the time-series data of the three-dimensional data corresponding to a keyword (for example, "large") using the time-series data of the transmission three-dimensional data (g1 to g5, G1 to G4, or the like) associated with that keyword. The time-series data of the transmission three-dimensional data is, for example, data indicating the temporal change of three-dimensional data representing a body movement that expresses the keyword. As a result, the gesture can be corrected with a movement expressing the meaning of a word in the utterance sentence, so that a natural gesture including movements that convey information can be generated.

次に、学習モデル５５の生成方法について説明する。 Next, a method of generating the learning model 55 will be described.

図１６は本実施の形態の学習モデル生成部６０の構成の一例を示すブロック図である。学習モデル生成部６０は、ジェスチャー生成装置５０に組み込んでもよく、あるいは別の学習用サーバに組み込んでもよい。学習モデル生成部６０は、プレゼンテーション動画取得部６１、発話音声データ抽出部６２、フレーム画像抽出部６３、ピッチ及びエネルギー抽出部６４、２次元姿勢抽出部６５、及び３次元姿勢推定部６６を備える。 FIG. 16 is a block diagram showing an example of the configuration of the learning model generation unit 60 of this embodiment. The learning model generation unit 60 may be incorporated in the gesture generation device 50, or may be incorporated in another learning server. The learning model generation unit 60 includes a presentation moving image acquisition unit 61, a speech voice data extraction unit 62, a frame image extraction unit 63, a pitch and energy extraction unit 64, a two-dimensional posture extraction unit 65, and a three-dimensional posture estimation unit 66.

プレゼンテーション動画取得部６１は、プレゼンテーション動画を取得する。学習用のデータを大量に集めるため、例えば、ウェブ上で誰もが使用可能に開示されているプレゼンテーション動画を用いることができる。 The presentation moving image acquisition unit 61 acquires a presentation moving image. To collect a large amount of learning data, presentation moving images that are publicly available on the web for anyone to use can be used, for example.

発話音声データ抽出部６２は、プレゼンテーション動画から発話音声データを抽出する。ピッチ及びエネルギー抽出部６４は、音声分析機能を備え、発話音声データから、所要のサンプリング周期で発話音声のエネルギー及びピッチを抽出することができる。ピッチ及びエネルギー抽出部６４で抽出された音声韻律時系列データは、学習データとして学習モデル５５の入力ノードに与えられる。 The utterance voice data extraction unit 62 extracts utterance voice data from the presentation moving image. The pitch and energy extraction unit 64 has a voice analysis function and can extract the energy and pitch of the uttered voice from the utterance voice data at a required sampling period. The voice prosody time-series data extracted by the pitch and energy extraction unit 64 is given to the input node of the learning model 55 as learning data.

フレーム画像抽出部６３は、プレゼンテーション動画からフレーム単位で画像を抽出する。２次元姿勢抽出部６５は、各フレームの画像から人間の顔、腕、足などの部位を特定し、特定した各部位を繋げて、複数の関節の２次元座標（２次元姿勢情報ともいう）を抽出する。 The frame image extraction unit 63 extracts images from the presentation moving image frame by frame. The two-dimensional posture extraction unit 65 identifies parts such as the human face, arms, and legs in the image of each frame, connects the identified parts, and extracts the two-dimensional coordinates of a plurality of joints (also referred to as two-dimensional posture information).

３次元姿勢推定部６６は、予め２次元姿勢情報と３次元姿勢情報とを対応付けたデータベースを備えており、２次元姿勢抽出部６５で抽出した２次元姿勢情報に基づいて、最も近い３次元姿勢情報を推定する。３次元姿勢情報は、複数の関節の３次元座標を含む。３次元姿勢推定部６６は、図４に例示した３次元姿勢データをフレーム単位で抽出することができる。３次元姿勢推定部６６で抽出した３次元姿勢時系列データは、学習データとして学習モデル５５の出力ノードに与えられる。 The three-dimensional posture estimation unit 66 includes a database in which two-dimensional posture information and three-dimensional posture information are associated with each other in advance, and estimates the closest three-dimensional posture information based on the two-dimensional posture information extracted by the two-dimensional posture extraction unit 65. The three-dimensional posture information includes the three-dimensional coordinates of a plurality of joints. The three-dimensional posture estimation unit 66 can extract the three-dimensional posture data illustrated in FIG. 4 in frame units. The three-dimensional posture time-series data extracted by the three-dimensional posture estimation unit 66 is given to the output node of the learning model 55 as learning data.
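
One way to read "estimates the closest three-dimensional posture information" is a nearest-neighbour lookup in the database of 2D/3D pairs. The sketch below assumes squared Euclidean distance over joint coordinates as the closeness measure; the description does not specify the measure, and the name `estimate_3d_pose` and the data layout are hypothetical.

```python
def estimate_3d_pose(pose_2d, database):
    """Return the 3D pose whose stored 2D pose is closest to pose_2d.

    pose_2d:  list of (x, y) joint coordinates.
    database: list of (stored_2d, stored_3d) pairs with the same joint order.
    Squared Euclidean distance is an assumption of this sketch.
    """
    def sq_dist(a, b):
        return sum((ax - bx) ** 2 + (ay - by) ** 2
                   for (ax, ay), (bx, by) in zip(a, b))

    _, best_3d = min(database, key=lambda entry: sq_dist(pose_2d, entry[0]))
    return best_3d


# Two-joint toy database: arms down versus arms raised.
db = [
    ([(0.0, 0.0), (0.0, -1.0)], [(0.0, 0.0, 0.0), (0.0, -1.0, 0.1)]),
    ([(0.0, 0.0), (0.0, 1.0)], [(0.0, 0.0, 0.0), (0.0, 1.0, 0.2)]),
]
print(estimate_3d_pose([(0.0, 0.1), (0.1, 0.9)], db))
```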

上述のように、学習モデル５５は、発話音声データから抽出された発話音声のピッチ及びエネルギーそれぞれの時系列データを学習データとして用いて生成することができる。発話音声のピッチは、音声波形の周波数であり、音声の高低を表すことができる。発話音声のエネルギーは、音声のエネルギーであり、音声の強弱を表すことができる。なお、発話音声のピッチ及びエネルギーの時系列データを纏めて音声韻律時系列データとも称する。 As described above, the learning model 55 can be generated using, as learning data, the time-series data of each of the pitch and the energy of the uttered voice extracted from the utterance voice data. The pitch of the uttered voice is the frequency of the voice waveform and represents how high or low the voice is. The energy of the uttered voice is the energy of the voice and represents how strong or weak the voice is. The time-series data of the pitch and energy of the uttered voice are collectively referred to as voice prosody time-series data.
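
As a rough illustration of the two prosody features, the sketch below computes the energy of one analysis window as the mean squared amplitude and estimates the pitch from the autocorrelation peak. Both estimators, the search range of 50 to 400 Hz, and the function names are assumptions; the description does not state how the pitch and energy extraction unit 64 computes them.

```python
import math


def frame_energy(samples):
    """Mean squared amplitude of one analysis window (assumed energy measure)."""
    return sum(s * s for s in samples) / len(samples)


def frame_pitch(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the pitch in Hz from the autocorrelation peak (assumed method)."""
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic 200 Hz tone sampled at 8 kHz, 400 samples (50 ms).
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(400)]
print(frame_pitch(tone, rate), frame_energy(tone))
```

Applying both functions to successive windows yields the pitch and energy time series that the text calls voice prosody time-series data.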

発話の際の話し手の意思や熱意は、音声韻律、すなわち発話音声のピッチ及びエネルギーの変化となって表れる。そこで、音声韻律時系列データを学習データとして用いることにより、学習モデル５５は、意思や熱意を表現する姿勢データを出力することができる。 The intention and enthusiasm of a speaker at the time of utterance appear as voice prosody, that is, as changes in the pitch and energy of the uttered voice. Therefore, by using the voice prosody time-series data as learning data, the learning model 55 can output posture data expressing intention or enthusiasm.

学習モデル５５は、人体の複数の関節位置の３次元データの時系列データを学習データとして用いて生成することができる。複数の関節位置は、プレゼンテーション時に人の動きが顕著に表れる部分を含めることができればよく、例えば、上半身の複数の関節の位置とすることができる。３次元データは、基準とする座標系でのｘｙｚ座標とすることができる。 The learning model 55 can be generated using time-series data of three-dimensional data of a plurality of joint positions of the human body as learning data. It is sufficient that the plurality of joint positions include a portion in which a person's movement is remarkably shown at the time of presentation, and may be, for example, positions of a plurality of joints in the upper body. The three-dimensional data can be xyz coordinates in a reference coordinate system.

プレゼンテーション動画から発話音声データ及び人体の姿勢データそれぞれを発話文章毎に抽出し、学習モデル５５は、発話文章毎の発話音声データ及び人体の姿勢データを一組の学習データとして用いて生成することができる。発話文章とは、発話の初めと終わりとで、音声と、ジェスチャーが少ない状態（基本の姿勢）となる単位である。発話文章単位で学習し、ジェスチャーを生成することにより、生成後のジェスチャーを接続した場合に、接続箇所の前後のジェスチャーの動きが急に変わることを避けることができる。 Utterance voice data and human body posture data are each extracted from the presentation moving image for each utterance sentence, and the learning model 55 can be generated using the utterance voice data and the human body posture data of each utterance sentence as one set of learning data. An utterance sentence is a unit at whose beginning and end there is little voice and little gesture (the basic posture). By learning and generating gestures in units of utterance sentences, abrupt changes in movement before and after the connection points can be avoided when the generated gestures are connected.

プレゼンテーション動画の１フレーム毎に人体の複数の関節位置の３次元データを抽出し、学習モデル５５は、抽出した３次元データの複数フレームに亘る時系列データを学習データとして用いて生成することができる。プレゼンテーション動画が、１秒当たり１０フレームの画像で構成されている場合（１０ｆｐｓ）、１０フレームに亘る時系列データを学習データとして用いることにより、１秒間のジェスチャーを生成することができる。これにより、所要の時間のジェスチャーを生成することができる。 Three-dimensional data of a plurality of joint positions of the human body is extracted for each frame of the presentation moving image, and the learning model 55 can be generated using, as learning data, the time-series data of the extracted three-dimensional data over a plurality of frames. When the presentation moving image consists of 10 frames per second (10 fps), a one-second gesture can be generated by using time-series data over 10 frames as learning data. In this way, a gesture of the required duration can be generated.

字幕が挿入されたプレゼンテーション動画から字幕テキスト挿入単位で発話音声データ及び人体の姿勢データそれぞれを抽出し、学習モデル５５は、抽出した発話音声データ及び人体の姿勢データを一組の学習データとして用いて生成することができる。 Utterance voice data and human body posture data are each extracted, in units of subtitle text insertion, from a presentation moving image into which subtitles have been inserted, and the learning model 55 can be generated using the extracted utterance voice data and human body posture data as one set of learning data.

図１７は発話文章を学習単位で区切る基準の一例を示す模式図である。図１７は字幕情報の要部を図示したものであり、字幕は複数の字幕テキストに区分され、それぞれの字幕テキストの挿入開始時点を示す開始時刻、挿入時間を示す時間長が対応付けて記録されている。例えば、プレゼンテーション動画の中で、字幕テキストＴｅｘｔ１は、時刻０．００に表示開始され、２．００秒間表示される。また、字幕テキストＴｅｘｔ２は、時刻２．００に表示開始され、１．４５秒間表示される。他の字幕テキストも同様である。 FIG. 17 is a schematic diagram showing an example of the criterion for dividing utterance sentences into learning units. FIG. 17 illustrates the main part of the subtitle information. The subtitles are divided into a plurality of subtitle texts, and for each subtitle text, a start time indicating when its insertion begins and a time length indicating how long it is inserted are recorded in association with it. For example, in the presentation moving image, the subtitle text Text1 starts to be displayed at time 0.00 and is displayed for 2.00 seconds. The subtitle text Text2 starts to be displayed at time 2.00 and is displayed for 1.45 seconds. The same applies to the other subtitle texts.
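
The subtitle-based segmentation can be sketched as a mapping from each subtitle entry to a range of video frames. The tuple layout `(text, start, duration)`, the truncation to whole frame indices, and the name `subtitle_segments` are assumptions made for this illustration.

```python
def subtitle_segments(subtitles, fps):
    """Map (text, start_s, duration_s) entries to half-open frame ranges.

    Returns a list of (text, first_frame, end_frame) tuples, where end_frame
    is exclusive. Truncating to whole frames is an assumption of this sketch.
    """
    segments = []
    for text, start_s, duration_s in subtitles:
        first_frame = int(start_s * fps)
        end_frame = int((start_s + duration_s) * fps)
        segments.append((text, first_frame, end_frame))
    return segments


# The two entries described for FIG. 17, at 10 fps.
subs = [("Text1", 0.00, 2.00), ("Text2", 2.00, 1.45)]
print(subtitle_segments(subs, 10))
```

The same start and end times can be applied to the audio track to cut the matching utterance voice data, so that each subtitle entry yields one set of learning data.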

字幕テキスト挿入単位のデータを一組の学習データとすることにより、体の動きが滑らかな時間内の音声発話データと姿勢データとを用いて学習できるので、学習モデル５５は、体の動きが滑らかな時間内での姿勢データを生成することができ、発話音声に連動した自然なジェスチャーを生成することができる。 By using the data of each subtitle text insertion unit as one set of learning data, learning can be performed with utterance voice data and posture data from a time span in which the body movement is smooth. The learning model 55 can therefore generate posture data within a time span in which the body movement is smooth, and can generate natural gestures linked to the uttered voice.

図１８は学習モデル５５の生成方法の一例を示す模式図である。図１８に示すように、再帰型ニューラルネットワークの出力ノードに、人体の複数の関節位置の３次元データのプレゼンテーション動画の複数フレーム（図１８の例では、発話文章単位となる３フレーム分）に亘る時系列データを与え、再帰型ニューラルネットワークの入力ノードに、当該プレゼンテーション動画の１フレームの間に所要回数（図１８の例では、４０回）サンプリングされた発話音声データの時系列データの当該複数フレーム（３フレーム分）に亘る時系列データを与えて、学習モデルを生成することができる。 FIG. 18 is a schematic diagram showing an example of a method of generating the learning model 55. As shown in FIG. 18, the learning model can be generated by giving the output node of a recurrent neural network the time-series data of the three-dimensional data of a plurality of joint positions of the human body over a plurality of frames of the presentation moving image (three frames, forming one utterance sentence unit, in the example of FIG. 18), and giving the input node of the recurrent neural network the time-series data, over the same plurality of frames (three frames), of the utterance voice data sampled a required number of times (40 times in the example of FIG. 18) during each frame of the presentation moving image.

例えば、プレゼンテーション動画のフレーム数を３とし、フレームの時点をｔ、ｔ＋１、ｔ＋２とする。再帰型ニューラルネットワークの出力ノードには、時点ｔ、ｔ＋１、ｔ＋２それぞれの３次元データが与えられる。発話音声データの１フレーム当たりのサンプリング数をｎ（図１８の例では、ｎ＝４０）とすると、再帰型ニューラルネットワークの入力ノードには、時点ｔに対応して、（Ｘ₁、…、Ｘ_n）の発話音声データの時系列データが与えられ、時点ｔ＋１に対応して、（Ｘ_n+1、…、Ｘ_2n）の発話音声データの時系列データが与えられ、時点ｔ＋２に対応して、（Ｘ_2n+1、…、Ｘ_3n）の発話音声データの時系列データが与えられる。 For example, assume that the presentation moving image has three frames at time points t, t+1, and t+2. The output node of the recurrent neural network is given the three-dimensional data at each of the time points t, t+1, and t+2. If the number of samples of the utterance voice data per frame is n (n=40 in the example of FIG. 18), the input node of the recurrent neural network is given the time-series data (X_1, ..., X_n) of the utterance voice data for time point t, the time-series data (X_{n+1}, ..., X_{2n}) for time point t+1, and the time-series data (X_{2n+1}, ..., X_{3n}) for time point t+2.
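
The alignment between audio samples and video frames can be sketched as follows: with n audio samples per frame, frame k receives samples X_{kn+1} to X_{(k+1)n} as input and the pose of frame k as the target. The name `pair_audio_with_frames` and the flat list layout are hypothetical, and the sketch shows only the data arrangement, not the training of the recurrent neural network.

```python
def pair_audio_with_frames(audio_features, poses, samples_per_frame):
    """Group audio samples into per-frame windows and pair them with poses.

    audio_features: sequence of length len(poses) * samples_per_frame.
    poses:          one target pose per video frame.
    Returns a list of (audio_window, pose) training pairs, one per frame.
    """
    n = samples_per_frame
    if len(audio_features) != len(poses) * n:
        raise ValueError("audio length must equal frames * samples_per_frame")
    return [(audio_features[k * n:(k + 1) * n], pose)
            for k, pose in enumerate(poses)]


# Three frames (t, t+1, t+2) with n = 40 audio samples each: X_1 .. X_120.
audio = list(range(1, 121))
pairs = pair_audio_with_frames(audio, ["pose_t", "pose_t+1", "pose_t+2"], 40)
print(pairs[1][0][0], pairs[1][0][-1], pairs[1][1])
```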

これにより、発話（発話の韻律）と体の動きの情報との関連性を学習することができ、発話（発話の韻律）に合わせたジェスチャーを生成することができる。 As a result, the relationship between the utterance (prosody of the utterance) and the information on the body movement can be learned, and a gesture that matches the utterance (prosody of the utterance) can be generated.

図１９は本実施の形態のジェスチャー生成装置５０によるジェスチャー生成の処理手順の一例を示すフローチャートである。以下では、便宜上、処理の主体を制御部５１として説明する。制御部５１は、発話音声データを取得し（Ｓ１１）、発話文章毎に所要のサンプリング周期で発話音声のピッチ及びエネルギーを抽出する（Ｓ１２）。 FIG. 19 is a flow chart showing an example of a procedure for generating a gesture by the gesture generation device 50 according to the present embodiment. Hereinafter, for convenience, the main body of processing will be described as the control unit 51. The control unit 51 acquires the utterance voice data (S11), and extracts the pitch and energy of the utterance voice at a required sampling cycle for each utterance sentence (S12).

制御部５１は、音声韻律時系列データを学習モデル５５に入力し（Ｓ１３）、３次元姿勢時系列データを出力する（Ｓ１４）。制御部５１は、一の発話文章と繋がる他の発話文章、及び当該一の発話文章の少なくとも一方の３次元姿勢データを補正する（Ｓ１５）。 The control unit 51 inputs the voice prosody time-series data to the learning model 55 (S13) and outputs three-dimensional posture time-series data (S14). The control unit 51 corrects the three-dimensional posture data of at least one of one utterance sentence and another utterance sentence connected to it (S15).

制御部５１は、発話文章の中にキーワードがあるか否かを判定し（Ｓ１６）、キーワードがある場合（Ｓ１６でＹＥＳ）、キーワードに関連付けられたジェスチャーデータを用いて、キーワードに対応する３次元姿勢データを補正し（Ｓ１７）、後述のステップＳ１８の処理を行う。 The control unit 51 determines whether or not the utterance sentence contains a keyword (S16). If it does (YES in S16), the control unit 51 corrects the three-dimensional posture data corresponding to the keyword using the gesture data associated with the keyword (S17), and then performs the process of step S18 described later.

発話文章の中にキーワードがない場合（Ｓ１６でＮＯ）、制御部５１は、処理を終了するか否かを判定し（Ｓ１８）、処理を終了しない場合（Ｓ１８でＮＯ）、ステップＳ１２以降の処理を続け、処理を終了する場合（Ｓ１８でＹＥＳ）、処理を終了する。 If the utterance sentence contains no keyword (NO in S16), the control unit 51 determines whether or not to end the process (S18). If the process is not to be ended (NO in S18), the control unit 51 continues the processes from step S12 onward; if the process is to be ended (YES in S18), the control unit 51 ends the process.
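
The flow of steps S11 to S18 can be summarized in a short control-flow sketch. All callables passed in (`extract_prosody`, `model`, `correct_joins`, `apply_keyword`) are hypothetical placeholders for the units described in this embodiment, and treating "end of input" as the termination condition of S18 is an assumption.

```python
def generate_gestures(sentences, extract_prosody, model,
                      correct_joins, apply_keyword, keywords):
    """Run the S11-S18 flow over (text, audio) pairs, one per utterance sentence."""
    results = []
    for text, audio in sentences:                 # S11: acquired utterance voice data
        prosody = extract_prosody(audio)          # S12: pitch and energy series
        poses = model(prosody)                    # S13, S14: model input and output
        if results:                               # S15: smooth the join to the
            poses = correct_joins(results[-1], poses)  # previous sentence
        for keyword in keywords:                  # S16: keyword present?
            if keyword in text:
                poses = apply_keyword(poses, keyword)  # S17: keyword correction
        results.append(poses)
    return results                                # S18: end when input is exhausted


# Toy stand-ins that only record which steps ran.
out = generate_gestures(
    [("a large box", [0.1, 0.2]), ("thank you", [0.3])],
    extract_prosody=lambda audio: audio,
    model=lambda prosody: ["pose"] * len(prosody),
    correct_joins=lambda prev, cur: ["joined"] + cur[1:],
    apply_keyword=lambda poses, kw: poses + [kw],
    keywords=["large"],
)
print(out)
```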

図２０は学習モデル生成部６０による学習モデル生成の処理手順の一例を示すフローチャートである。学習モデル生成部６０は、プレゼンテーション動画を取得し（Ｓ３１）、字幕テキスト挿入単位で発話音声データ及びフレーム画像を抽出する（Ｓ３２）。学習モデル生成部６０は、発話音声データから発話音声のピッチ及びエネルギーの時系列データ（音声韻律時系列データ）を抽出し（Ｓ３３）、フレーム画像から２次元姿勢情報を抽出し、３次元姿勢情報を推定する（Ｓ３４）。 FIG. 20 is a flowchart showing an example of the procedure by which the learning model generation unit 60 generates a learning model. The learning model generation unit 60 acquires a presentation moving image (S31) and extracts utterance voice data and frame images in units of subtitle text insertion (S32). The learning model generation unit 60 extracts time-series data of the pitch and energy of the uttered voice (voice prosody time-series data) from the utterance voice data (S33), extracts two-dimensional posture information from the frame images, and estimates three-dimensional posture information (S34).

学習モデル生成部６０は、推定した３次元姿勢情報に基づいて３次元姿勢時系列データを抽出し（Ｓ３５）、音声韻律時系列データ及び３次元姿勢時系列データを学習データとして用いて学習モデル５５を生成し（Ｓ３６）、処理を終了する。 The learning model generation unit 60 extracts three-dimensional posture time-series data based on the estimated three-dimensional posture information (S35), generates the learning model 55 using the voice prosody time-series data and the three-dimensional posture time-series data as learning data (S36), and ends the process.

本実施の形態のジェスチャー生成装置５０又は学習モデル生成部６０は、ＣＰＵ（プロセッサ）、ＧＰＵ、ＲＡＭ（メモリ）などを備えた汎用コンピュータを用いて実現することもできる。すなわち、図１９又は図２０に示すような、各処理の手順を定めたコンピュータプログラムをコンピュータに備えられたＲＡＭ（メモリ）にロードし、コンピュータプログラムをＣＰＵ（プロセッサ）で実行することにより、コンピュータ上でジェスチャー生成装置５０又は学習モデル生成部６０を実現することができる。コンピュータプログラムは記録媒体に記録され流通されてもよく、あるいは、ネットワークを介して、ジェスチャー生成装置５０にインストールされてもよい。 The gesture generation device 50 or the learning model generation unit 60 of the present embodiment can also be realized using a general-purpose computer including a CPU (processor), a GPU, a RAM (memory), and the like. That is, by loading a computer program defining the procedure of each process, as shown in FIG. 19 or FIG. 20, into the RAM (memory) of the computer and executing the computer program on the CPU (processor), the gesture generation device 50 or the learning model generation unit 60 can be realized on the computer. The computer program may be recorded on a recording medium and distributed, or may be installed in the gesture generation device 50 via a network.

本実施の形態によれば、発話音声に連動した自然なジェスチャーを自動的に生成することができ、ジェスチャー制作に要するコストを低減することができる。 According to the present embodiment, it is possible to automatically generate a natural gesture that is linked to a spoken voice, and reduce the cost required for gesture production.

また、本実施の形態によれば、学習モデル５５の生成に用いるプレゼンテーション動画に応じて、生成するジェスチャーから受ける印象を変えることができる。例えば、使用するプレゼンテーション動画に多数の話者が含まれる場合、話者の個性が平均化され、生成されるジェスチャーも平均的なものとすることができる。逆に特定の話者のプレゼンテーション動画を用いて学習モデルを生成した場合、個性が反映させたジェスチャーを生成することができる。 Further, according to the present embodiment, the impression given by the generated gestures can be changed according to the presentation moving images used to generate the learning model 55. For example, when the presentation moving images used include many speakers, the individual characteristics of the speakers are averaged out, and the generated gestures also become average. Conversely, when the learning model is generated using presentation moving images of a specific speaker, gestures reflecting that speaker's individual characteristics can be generated.

本実施の形態の姿勢データ生成装置は、発話音声データと人体の姿勢データとを学習データとして用いて生成してある学習器と、発話音声データを取得する取得部と、前記取得部で取得した発話音声データ及び前記学習器に基づいて姿勢データを生成する生成部とを備える。 The posture data generation device of the present embodiment includes a learning device generated using utterance voice data and human body posture data as learning data, an acquisition unit that acquires utterance voice data, and a generation unit that generates posture data based on the utterance voice data acquired by the acquisition unit and the learning device.

本実施の形態の学習器は、発話音声データと人体の姿勢データとを学習データとして用いて生成してある。 The learning device of the present embodiment is generated by using utterance voice data and human posture data as learning data.

本実施の形態のコンピュータプログラムは、コンピュータに、発話音声データを取得する処理と、発話音声データと人体の姿勢データとを学習データとして用いて生成してある学習器に、取得した発話音声データを入力して姿勢データを生成する処理とを実行させる。 The computer program of the present embodiment causes a computer to execute a process of acquiring utterance voice data, and a process of generating posture data by inputting the acquired utterance voice data to a learning device generated using utterance voice data and human body posture data as learning data.

本実施の形態の姿勢データ生成方法は、発話音声データを取得し、発話音声データと人体の姿勢データとを学習データとして用いて生成してある学習器に、取得した発話音声データを入力して姿勢データを生成する。 The posture data generation method of the present embodiment acquires utterance voice data, and generates posture data by inputting the acquired utterance voice data to a learning device generated using utterance voice data and human body posture data as learning data.

本実施の形態の学習モデルの生成方法は、発話音声データ及び人体の姿勢データを取得し、取得された発話音声データ及び人体の姿勢データを学習データとして用いる。 The learning model generation method of the present embodiment acquires utterance voice data and human body posture data, and uses the acquired utterance voice data and human body posture data as learning data.

学習器（学習モデル）は、発話音声データと人体の姿勢データとを学習データとして用いて生成してある。例えば、プレゼンテーションを行う人の発話音声データと当該人の動きを示す姿勢データとを学習データとして用いて学習器を生成することができる。これにより、学習器は、人の発話と当該発話に伴う体の動きとの関係性を学習することができる。学習器は、時系列データを学習データとするものであればよく、例えば、再帰型ニューラルネットワーク（Recurrent Neural Network）とすることができるが、これに限定されない。 The learning device (learning model) is generated by using the uttered voice data and the posture data of the human body as learning data. For example, it is possible to generate a learning device by using speech data of a person who gives a presentation and posture data indicating the movement of the person as learning data. Accordingly, the learning device can learn the relationship between the utterance of a person and the movement of the body accompanying the utterance. The learning device may be any device that uses time-series data as learning data, and can be, for example, a recurrent neural network (Recurrent Neural Network), but is not limited thereto.

取得部は、発話音声データを取得し、生成部は、取得した発話音声データ及び学習器に基づいて姿勢データを生成する。学習器は、発話音声データと人体の姿勢データとを学習データとして用いて予め生成されているので、取得した発話音声データを学習器に入力すると、学習器は、入力された発話音声データと関連性がある姿勢データを出力する。これにより、人の発話と当該発話に伴うジェスチャー（体の動き）を生成することができ、プレゼンテーションのジェスチャーを自動的に生成することができる。また、ジェスチャー制作のコストを低減することができる。 The acquisition unit acquires utterance voice data, and the generation unit generates posture data based on the acquired utterance voice data and the learning device. Since the learning device has been generated in advance using utterance voice data and human body posture data as learning data, when the acquired utterance voice data is input to the learning device, the learning device outputs posture data related to the input utterance voice data. As a result, the gestures (body movements) accompanying a person's utterance can be generated, and presentation gestures can be generated automatically. In addition, the cost of gesture production can be reduced.

本実施の形態の姿勢データ生成装置において、前記生成部は、人体の複数の関節位置の３次元データの時系列データを生成する。 In the posture data generation device of the present embodiment, the generation unit generates time-series data of three-dimensional data of a plurality of joint positions of a human body.

本実施の形態の学習器は、人体の複数の関節位置の３次元データの時系列データを学習データとして用いて生成してある。 The learning device according to the present embodiment is generated using time-series data of three-dimensional data of a plurality of joint positions of the human body as learning data.

学習器は、人体の複数の関節位置の３次元データの時系列データを学習データとして用いて生成してある。複数の関節位置は、プレゼンテーション時に人の動きが顕著に表れる部分を含めることができればよく、例えば、上半身の複数の関節の位置とすることができる。３次元データは、基準とする座標系でのｘｙｚ座標とすることができる。 The learning device is generated using time-series data of three-dimensional data of a plurality of joint positions of the human body as learning data. It is sufficient that the plurality of joint positions include a portion in which a person's movement is remarkably shown at the time of presentation, and may be, for example, positions of a plurality of joints in the upper body. The three-dimensional data can be xyz coordinates in a reference coordinate system.

生成部は、人体の複数の関節位置の３次元データの時系列データを生成する。これにより、生成部は、時間の経過とともに変化する、上半身の複数の関節位置を示す姿勢データを生成することができ、プレゼンテーションのジェスチャーを自動的に生成することができる。 The generation unit generates time-series data of three-dimensional data of a plurality of joint positions of a human body. Accordingly, the generation unit can generate posture data indicating a plurality of joint positions of the upper body, which change with the passage of time, and can automatically generate a gesture for presentation.

本実施の形態の姿勢データ生成装置において、前記生成部は、発話文章単位で人体の複数の関節位置の３次元データの時系列データを生成する。 In the posture data generation device according to the present embodiment, the generation unit generates time-series data of three-dimensional data of a plurality of joint positions of a human body in units of uttered sentences.

本実施の形態の学習器は、プレゼンテーション動画から発話音声データ及び人体の姿勢データそれぞれを発話文章毎に抽出し、発話文章毎の発話音声データ及び人体の姿勢データを一組の学習データとして用いて生成してある。 The learning device of the present embodiment is generated by extracting utterance voice data and human body posture data from a presentation moving image for each utterance sentence, and using the utterance voice data and the human body posture data of each utterance sentence as one set of learning data.

学習器は、プレゼンテーション動画から発話音声データ及び人体の姿勢データそれぞれを発話文章毎に抽出し、発話文章毎の発話音声データ及び人体の姿勢データを一組の学習データとして用いて生成してある。発話文章とは、発話の初めと終わりとで、音声と、ジェスチャーが少ない状態（基本の姿勢）となる単位である。発話文章単位で学習し、ジェスチャーを生成することにより、生成後のジェスチャーを接続した場合に、接続箇所の前後のジェスチャーの動きが急に変わることを避けることができる。 The learning device is generated by extracting utterance voice data and human body posture data from a presentation moving image for each utterance sentence, and using the utterance voice data and the human body posture data of each utterance sentence as one set of learning data. An utterance sentence is a unit at whose beginning and end there is little voice and little gesture (the basic posture). By learning and generating gestures in units of utterance sentences, abrupt changes in movement before and after the connection points can be avoided when the generated gestures are connected.

生成部は、発話文章単位で人体の複数の関節位置の３次元データの時系列データを生成する。これにより、体の動きが滑らかな時間内での姿勢データを生成することができ、発話音声に連動した自然なジェスチャーを生成することができる。 The generation unit generates time-series data of three-dimensional data of a plurality of joint positions of the human body in units of uttered sentences. As a result, it is possible to generate posture data within a time period in which the body movement is smooth, and it is possible to generate a natural gesture that is linked to the uttered voice.

本実施の形態の姿勢データ生成装置は、一の発話文章単位内の３次元データであって、前記一の発話文章と繋がる他の発話文章単位内の３次元データと繋がる３次元データを補正する第１補正部を備える。 The posture data generation device of the present embodiment includes a first correction unit that corrects three-dimensional data within one utterance sentence unit, namely the three-dimensional data that connects to three-dimensional data within another utterance sentence unit connected to the one utterance sentence.

第１補正部は、一の発話文章単位内の３次元データであって、一の発話文章と繋がる他の発話文章単位内の３次元データと繋がる３次元データを補正する。発話文章単位内では、自然な動き、滑らかな動きを示す姿勢データを生成することができる。しかし、発話文章と次の発話文章との間では、姿勢データの時間的変化を大きくなり、不自然なジェスチャーが生成される可能性がある。そこて、発話文章が繋がる箇所の姿勢データを補正することにより、発話文章間の姿勢データの変化を滑らかにして、自然なジェスチャーを生成することができる。 The first correction unit corrects three-dimensional data within one utterance sentence unit, namely the three-dimensional data that connects to three-dimensional data within another utterance sentence unit connected to the one utterance sentence. Within an utterance sentence unit, posture data showing natural, smooth movement can be generated. However, between one utterance sentence and the next, the temporal change in the posture data may become large, and an unnatural gesture may be generated. Therefore, by correcting the posture data at the point where utterance sentences connect, the change in posture data between utterance sentences can be smoothed and a natural gesture can be generated.

本実施の形態の姿勢データ生成装置は、複数のキーワードと該複数のキーワードそれぞれの意味を伝達する伝達３次元データの時系列データとを関連付けて記憶する記憶部と、前記生成部で生成した３次元データの時系列データに対応する発話文章内で前記キーワードに対応する３次元データの時系列データを前記キーワードに関連付けられた伝達３次元データの時系列データを用いて補正する第２補正部を備える。 The posture data generation device of the present embodiment includes a storage unit that stores a plurality of keywords in association with time-series data of transmission three-dimensional data conveying the meaning of each of the plurality of keywords, and a second correction unit that corrects, within the utterance sentence corresponding to the time-series data of the three-dimensional data generated by the generation unit, the time-series data of the three-dimensional data corresponding to a keyword using the time-series data of the transmission three-dimensional data associated with that keyword.

記憶部は、複数のキーワードと当該複数のキーワードそれぞれの意味を伝達する伝達３次元データの時系列データとを関連付けて記憶する。キーワードは、そのキーワードの意味を表現する動きに特徴がある情報伝達を目的とする動きに関連する単語であり、例えば、「大きい」、「小さい」、「食べる」、「笑う」などの単語を含む。伝達３次元データの時系列データは、例えば、キーワードを表現する体の動きを表す３次元データの時間的変化を示すデータである。 The storage unit stores a plurality of keywords in association with time-series data of transmission three-dimensional data conveying the meaning of each of the plurality of keywords. A keyword is a word whose meaning is expressed by a characteristic movement and which relates to a movement intended to convey information; examples include words such as "large", "small", "eat", and "laugh". The time-series data of the transmission three-dimensional data is, for example, data indicating the temporal change of three-dimensional data representing a body movement that expresses the keyword.

第２補正部は、生成部で生成した３次元データの時系列データに対応する発話文章内でキーワードに対応する３次元データの時系列データを、当該キーワードに関連付けられた伝達３次元データの時系列データを用いて補正する。これにより、発話文章内の単語の意味を表現する動きでジェスチャーを補正することができるので、情報伝達を伴う動きを含む自然なジェスチャーを生成することができる。 The second correction unit corrects, within the utterance sentence corresponding to the time-series data of the three-dimensional data generated by the generation unit, the time-series data of the three-dimensional data corresponding to a keyword using the time-series data of the transmission three-dimensional data associated with that keyword. As a result, the gesture can be corrected with a movement expressing the meaning of a word in the utterance sentence, so that a natural gesture including movements that convey information can be generated.

本実施の形態の学習器は、発話音声データから抽出された発話音声のピッチ及びエネルギーそれぞれの時系列データを学習データとして用いて生成してある。 The learning device according to the present embodiment is generated by using the time-series data of the pitch and energy of the speech voice extracted from the speech data as learning data.

学習器は、発話音声データから抽出された発話音声のピッチ及びエネルギーそれぞれの時系列データを学習データとして用いて生成してある。発話音声のピッチは、音声波形の周波数であり、音声の高低を表すことができる。発話音声のエネルギーは、音声のエネルギーであり、音声の強弱を表すことができる。なお、発話音声のピッチ及びエネルギーの時系列データを纏めて音声韻律時系列データとも称する。 The learning device is generated using, as learning data, the time-series data of each of the pitch and the energy of the uttered voice extracted from the utterance voice data. The pitch of the uttered voice is the frequency of the voice waveform and represents how high or low the voice is. The energy of the uttered voice is the energy of the voice and represents how strong or weak the voice is. The time-series data of the pitch and energy of the uttered voice are collectively referred to as voice prosody time-series data.

A speaker's intention and enthusiasm during an utterance appear in the voice prosody, that is, in changes in the pitch and energy of the uttered voice. By using the voice prosody time-series data as learning data, the learning device can therefore output posture data that expresses intention and enthusiasm.
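
As an illustration of what a voice prosody time series might look like, the following sketch computes one pitch value and one energy value per analysis frame. The extraction method is not specified in the text. Mean-square energy and an autocorrelation peak search are assumptions chosen for simplicity, and all function names and the 50 to 400 Hz search range are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one analysis frame."""
    return sum(s * s for s in frame) / len(frame)


def frame_pitch(frame: Sequence[float], sample_rate: int,
                f_min: float = 50.0, f_max: float = 400.0) -> float:
    """Estimate pitch in Hz from the strongest autocorrelation lag.

    Returns 0.0 when no positive autocorrelation peak is found (e.g. silence).
    """
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


def prosody_series(samples: Sequence[float], sample_rate: int,
                   frame_len: int) -> List[Tuple[float, float]]:
    """Return a (pitch, energy) pair for each non-overlapping frame."""
    series = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        series.append((frame_pitch(frame, sample_rate), frame_energy(frame)))
    return series


# Usage: a synthetic 200 Hz tone sampled at 8 kHz, analysed in 50 ms frames.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(800)]
series = prosody_series(tone, rate, frame_len=400)
```

A production system would more likely rely on a dedicated speech-analysis tool, but the output has the same shape: one pitch value and one energy value per time step.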

The learning device of the present embodiment is generated by extracting utterance voice data and human body posture data, in units of subtitle text insertion, from a presentation video into which subtitles have been inserted, and using the extracted utterance voice data and human body posture data as one set of learning data.

The learning device is generated by extracting utterance voice data and human body posture data, in units of subtitle text insertion, from a presentation video into which subtitles have been inserted, and using the extracted utterance voice data and human body posture data as one set of learning data. Treating the data of each subtitle text insertion unit as one set of learning data allows learning from utterance voice data and posture data that fall within a period in which the body moves smoothly. The learning device can therefore generate posture data for such a period and produce a natural gesture linked to the uttered voice.
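
The pairing of voice and posture data per subtitle unit can be sketched as follows, assuming the subtitle display intervals are available as (start, end) times in seconds. The function name `split_by_subtitles`, the interval format, and the rounding rule are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_by_subtitles(
    subtitles: Sequence[Tuple[float, float]],
    poses: Sequence,
    audio: Sequence,
    fps: int,
    sample_rate: int,
) -> List[Tuple[list, list]]:
    """Cut audio and per-frame pose data into one pair per subtitle interval.

    Each subtitle is a (start_sec, end_sec) interval. Times are converted to
    frame and sample indices by rounding, which is an assumed convention.
    """
    pairs = []
    for start, end in subtitles:
        f0, f1 = round(start * fps), round(end * fps)
        a0, a1 = round(start * sample_rate), round(end * sample_rate)
        pairs.append((list(audio[a0:a1]), list(poses[f0:f1])))
    return pairs


# Usage: 3 s of video at 10 fps and audio at 100 samples per second,
# with subtitles shown during 0-1 s and 1-3 s.
poses = list(range(30))
audio = list(range(300))
pairs = split_by_subtitles([(0.0, 1.0), (1.0, 3.0)], poses, audio,
                           fps=10, sample_rate=100)
```

Each returned pair then serves as one set of learning data.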

The learning device of the present embodiment is generated by extracting three-dimensional data of a plurality of joint positions of a human body for each frame of a presentation video, and using time-series data of the extracted three-dimensional data spanning a plurality of frames as learning data.

The learning device is generated by extracting three-dimensional data of a plurality of joint positions of the human body for each frame of the presentation video, and using time-series data of the extracted three-dimensional data spanning a plurality of frames as learning data. When the presentation video consists of 10 frame images per second (10 fps), using time-series data spanning 10 frames as learning data makes it possible to generate a one-second gesture. A gesture of the required duration can thus be generated.
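
The relation between frame rate, window length, and gesture duration can be shown with a short sketch. The helper names and the choice of non-overlapping windows are assumptions and do not come from the text.

```python
from typing import List, Sequence


def frames_for_duration(fps: int, seconds: float) -> int:
    """Number of frames needed to cover `seconds` of video at `fps`."""
    return round(fps * seconds)


def split_into_windows(frames: Sequence, window: int) -> List[list]:
    """Cut a per-frame series into consecutive, non-overlapping windows.

    Trailing frames that do not fill a whole window are dropped.
    """
    return [list(frames[i:i + window])
            for i in range(0, len(frames) - window + 1, window)]


# Usage: at 10 fps a one-second gesture needs 10 frames, so 25 frames of
# pose data yield two complete training windows.
window = frames_for_duration(fps=10, seconds=1.0)
windows = split_into_windows(list(range(25)), window)
```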

The learning device of the present embodiment is generated by using two kinds of time-series data as learning data. The first, given to the output nodes of a recurrent neural network, is time-series data of three-dimensional data of a plurality of joint positions of a human body, spanning a plurality of frames of a presentation video. The second, given to the input nodes of the recurrent neural network, is time-series data of utterance voice data sampled a required number of times during one frame of the presentation video, spanning the same plurality of frames.

The learning data of the present embodiment has time-series data of utterance voice data and time-series data of human body posture data, both extracted from a presentation video, and the posture data has three-dimensional data of a plurality of joint positions of the human body. The learning data is used to execute: a process of giving time-series data of the three-dimensional data of the plurality of joint positions, spanning a plurality of frames of the presentation video, to the output nodes of a recurrent neural network; a process of giving time-series data of the utterance voice data sampled a required number of times during one frame of the presentation video, spanning the plurality of frames, to the input nodes of the recurrent neural network; and a process of training the recurrent neural network based on the time-series data given to the output nodes and to the input nodes.

The learning device is generated by using two kinds of time-series data as learning data. The first, given to the output nodes of the recurrent neural network, is time-series data of three-dimensional data of a plurality of joint positions of the human body, spanning a plurality of frames of the presentation video. The second, given to the input nodes of the recurrent neural network, is time-series data of utterance voice data sampled a required number of times during one frame of the presentation video, spanning the same plurality of frames.

For example, let the number of frames of the presentation video be 3 and the frame time points be t, t+1, and t+2. The output nodes of the recurrent neural network are given the three-dimensional data at each of the time points t, t+1, and t+2. If the number of samples of utterance voice data per frame is n, the input nodes of the recurrent neural network are given the utterance voice time-series data (X_1, ..., X_n) for time point t, (X_{n+1}, ..., X_{2n}) for time point t+1, and (X_{2n+1}, ..., X_{3n}) for time point t+2.
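
The indexing in this example can be checked with a small sketch that groups a flat voice sample sequence into one chunk of n samples per video frame. The function name `chunk_audio_per_frame` is hypothetical, and the samples are represented by their 1-based indices so that the result can be compared directly with (X_1, ..., X_n), (X_{n+1}, ..., X_{2n}), and (X_{2n+1}, ..., X_{3n}).

```python
from typing import List, Sequence


def chunk_audio_per_frame(samples: Sequence, n: int,
                          num_frames: int) -> List[list]:
    """Group voice samples so that frame k receives samples k*n .. (k+1)*n - 1.

    Raises ValueError when there are too few samples for the frames requested.
    """
    if len(samples) < n * num_frames:
        raise ValueError("not enough samples for the requested frames")
    return [list(samples[k * n:(k + 1) * n]) for k in range(num_frames)]


# Usage: n = 3 samples per frame and 3 frames (t, t+1, t+2).
# The values 1..9 stand in for X_1..X_9.
n = 3
inputs = chunk_audio_per_frame(list(range(1, 10)), n, num_frames=3)
# inputs[0] feeds time point t, inputs[1] feeds t+1, inputs[2] feeds t+2.
```

Each chunk is paired with the three-dimensional joint data of the same frame, which is the target given to the output nodes during training.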

50 gesture generation device
51 control unit
52 acquisition unit
53 storage unit
54 processing unit
55 learning model
551 encoder
552 decoder
56 correction unit
57 generation unit
60 learning model generation unit
61 presentation video acquisition unit
62 utterance voice data extraction unit
63 frame image extraction unit
64 pitch and energy extraction unit
65 two-dimensional posture extraction unit
66 three-dimensional posture estimation unit

Claims

A posture data generation device comprising:
a learning device generated using utterance voice data and human body posture data as learning data;
an acquisition unit that acquires utterance voice data; and
a generation unit that generates posture data based on the utterance voice data acquired by the acquisition unit and the learning device.

The posture data generation device according to claim 1, wherein the generation unit generates time-series data of three-dimensional data of a plurality of joint positions of a human body.

The posture data generation device according to claim 1 or 2, wherein the generation unit generates time-series data of three-dimensional data of a plurality of joint positions of a human body for each utterance sentence.

The posture data generation device further comprising a first correction unit that corrects the three-dimensional data within one utterance sentence unit that adjoins the three-dimensional data within another utterance sentence unit continuous with the one utterance sentence.

The posture data generation device according to claim 3 or 4, further comprising:
a storage unit that stores a plurality of keywords in association with time-series data of transmission three-dimensional data that conveys the meaning of each of the plurality of keywords; and
a second correction unit that corrects the time-series data of three-dimensional data corresponding to a keyword in the utterance sentence corresponding to the time-series data of three-dimensional data generated by the generation unit, using the time-series data of the transmission three-dimensional data associated with that keyword.

A learning device generated using utterance voice data and human body posture data as learning data.

The learning device according to claim 6, which is generated by using, as learning data, time-series data of each of the pitch and the energy of the uttered voice extracted from the utterance voice data.

The learning device according to claim 6 or 7, which is generated by using time-series data of three-dimensional data of a plurality of joint positions of a human body as learning data.

The learning device according to any one of claims 6 to 8, which is generated by extracting utterance voice data and human body posture data from a presentation video for each utterance sentence, and using the utterance voice data and human body posture data of each utterance sentence as one set of learning data.

The learning device according to any one of claims 6 to 9, which is generated by extracting utterance voice data and human body posture data, in units of subtitle text insertion, from a presentation video into which subtitles have been inserted, and using the extracted utterance voice data and human body posture data as one set of learning data.

The learning device according to any one of claims 6 to 10, which is generated by extracting three-dimensional data of a plurality of joint positions of a human body for each frame of a presentation video, and using time-series data of the extracted three-dimensional data spanning a plurality of frames as learning data.

The learning device according to any one of claims 6 to 11, which is generated by using, as learning data:
time-series data of three-dimensional data of a plurality of joint positions of a human body, spanning a plurality of frames of a presentation video, which is given to the output nodes of a recurrent neural network; and
time-series data of utterance voice data sampled a required number of times during one frame of the presentation video, spanning the plurality of frames, which is given to the input nodes of the recurrent neural network.

A computer program that causes a computer to execute:
a process of acquiring utterance voice data; and
a process of inputting the acquired utterance voice data to a learning device generated using utterance voice data and human body posture data as learning data, and generating posture data.

Learning data having time-series data of utterance voice data and time-series data of human body posture data, both extracted from a presentation video, wherein
the posture data has three-dimensional data of a plurality of joint positions of the human body, and
the learning data is used to execute:
a process of giving time-series data of the three-dimensional data of the plurality of joint positions, spanning a plurality of frames of the presentation video, to the output nodes of a recurrent neural network;
a process of giving time-series data of the utterance voice data sampled a required number of times during one frame of the presentation video, spanning the plurality of frames, to the input nodes of the recurrent neural network; and
a process of training the recurrent neural network based on the time-series data given to the output nodes and to the input nodes.

A posture data generation method comprising:
acquiring utterance voice data; and
inputting the acquired utterance voice data to a learning device generated using utterance voice data and human body posture data as learning data, and generating posture data.

A learning model generation method comprising:
acquiring utterance voice data and human body posture data; and
generating a learning model using the acquired utterance voice data and human body posture data as learning data.