JP5402648B2

JP5402648B2 - Speech synthesis apparatus, robot apparatus, speech synthesis method, and program

Info

Publication number: JP5402648B2
Application number: JP2010000307A
Authority: JP
Inventors: 玲史近藤; 正徳加藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-01-05
Filing date: 2010-01-05
Publication date: 2014-01-29
Anticipated expiration: 2030-01-05
Also published as: JP2011141305A

Description

The present invention relates to a speech synthesizer that performs speech synthesis processing to synthesize speech representing a character string.

Robot apparatuses that perform predetermined actions are known. One such robot apparatus outputs speech in order to communicate with humans. To this end, the robot apparatus includes a speech synthesizer that performs speech synthesis processing (Text To Speech (TTS) processing), which synthesizes, based on a character string, speech representing that character string.

Non-Patent Document 1 discloses a technique (a lip-sync method) that generates, based on speech, an animation (moving image) in which the shape of the mouth changes in synchronization with that speech. In the generated animation, the shape of the mouth therefore changes in synchronization with the speech.

Meanwhile, the speech synthesizer described in Patent Document 1 sets the duration of a part of the speech to be synthesized based on information entered by a user through a graphical user interface. The speech synthesizer then synthesizes the speech so that this part continues for the set duration.

Patent Document 1: Japanese Patent Laid-Open No. 2003-036100

Non-Patent Document 1: 出渕亮一朗, “Consideration of lip-sync method in real-time computer animation”, [online], Atom Co., Ltd., [retrieved December 21, 2009], Internet <URL:http://www.atom.co.jp/bot/voice/technical/Lipsync.htm>

The action execution time, that is, the time a robot apparatus requires to execute an action, differs from one robot apparatus to another. Consequently, if a speech synthesizer synthesizes speech without taking into account the action execution time specific to the robot apparatus to which it is applied, the action and the speech associated with it may not end at the same time, even when execution of the action and output of the speech start simultaneously. In other words, the speech synthesizer described above has the problem that it may be unable to synthesize speech synchronized with the actual action of the robot apparatus.

Accordingly, an object of the present invention is to provide a speech synthesizer capable of solving the problem described above, namely that speech synchronized with the actual action of the robot apparatus cannot always be synthesized.

In order to achieve this object, a speech synthesizer according to one aspect of the present invention
is applied to a robot apparatus that performs predetermined actions, is configured to perform speech synthesis processing that synthesizes, based on a processing target character string (a character string to be processed), processing target speech (speech representing the processing target character string), and comprises:
action execution time acquisition means for acquiring an action execution time, which is the time the robot apparatus requires to execute the action associated with an action-accompanying character string that is at least a part of the processing target character string; and
speech synthesis processing execution means for performing the speech synthesis processing to synthesize the processing target speech including action-accompanying speech, which is speech that represents the action-accompanying character string and is uttered over the acquired action execution time.

Configured as described above, the present invention can synthesize speech synchronized with the actual action of the robot apparatus.

FIG. 1 is a diagram showing the schematic configuration of a robot apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram showing the processing basic information received by the robot apparatus according to the first embodiment of the present invention.
FIG. 3 is a table showing the action identification information and action content information stored by the robot apparatus according to the first embodiment of the present invention.
FIG. 4 is a time chart conceptually showing the operation of the robot apparatus according to the first embodiment of the present invention.
FIG. 5 is a time chart conceptually showing the operation of a robot apparatus according to a second embodiment of the present invention.
FIG. 6 is a diagram showing the schematic configuration of a robot apparatus according to a third embodiment of the present invention.
FIG. 7 is a table showing the action-accompanying character string specifying information and action identification information stored by the robot apparatus according to the third embodiment of the present invention.
FIG. 8 is a diagram showing the processing basic information received by a robot apparatus according to a fifth embodiment of the present invention.
FIG. 9 is a table showing the action identification information and action content information stored by the robot apparatus according to the fifth embodiment of the present invention.
FIG. 10 is an explanatory diagram conceptually showing the processing target speech synthesized by the robot apparatus according to the fifth embodiment of the present invention, in comparison with the first embodiment.
FIG. 11 is a diagram showing the schematic configuration of a speech synthesizer according to a sixth embodiment of the present invention.

Hereinafter, embodiments of a speech synthesizer, a robot apparatus, a speech synthesis method, and a program according to the present invention will be described with reference to FIGS. 1 to 11.

<First embodiment>
As shown in FIG. 1, the robot apparatus 1 according to the first embodiment is an apparatus that performs predetermined actions. The robot apparatus 1 includes a main body, a traveling unit, an arm unit, and a head.

The traveling unit includes a motor and wheels, and is configured to move the robot apparatus 1 by driving the wheels with the motor. Alternatively, the traveling unit may include a plurality of legs and may be configured to move the robot apparatus 1 by walking on the legs. The arm unit includes a motor and an arm member, and is configured to drive the arm member with the motor and to hold the arm member at any angle. The head includes a motor and is configured to be rotatable in the vertical or horizontal direction when driven by the motor.

The robot apparatus 1 further includes a speech synthesizer 2 and an action control device 3.
The speech synthesizer 2 includes an information reception unit (information reception means) 21, a speech synthesis processing execution unit (speech synthesis processing execution means) 22, an action execution time acquisition unit (action execution time acquisition means) 23, and a speech output unit (speech output means) 24.
The action control device 3 includes an action information storage unit 31, an action execution unit 32, an action execution time detection unit 33, and an action execution time storage unit 34.

The information reception unit 21 receives processing basic information that includes a processing target character string, an action-accompanying character string, and action identification information. The processing target character string is the character string to be processed. The action-accompanying character string is a character string within the processing target character string that is associated with an action. The action identification information is information for identifying an action.

In this example, the processing basic information received by the information reception unit 21 is information entered by a user. The processing basic information may instead be information that the robot apparatus 1 has received from another apparatus, or information that the robot apparatus 1 itself has generated.

In this example, as shown in FIG. 2, the processing basic information conforms to the XML (Extensible Markup Language) format. The part of the character string represented by the processing basic information that remains after the tags are removed is the processing target character string, the part (element) enclosed by the “dousa” tag is the action-accompanying character string, and the value of the “id” attribute of the “dousa” tag is the action identification information. That is, the processing target character string is “それは悲しいお話だよ” (“That's a sad story”), the action-accompanying character string is “悲しい” (“sad”), and the action identification information associated with the action-accompanying character string “悲しい” is “1”.
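The tag convention described above can be illustrated with a minimal Python sketch. FIG. 2 is not reproduced in this text, so the root element name `speech` and the exact layout of the sample are assumptions made for illustration; only the `dousa` tag and its `id` attribute come from the description. The function name `parse_processing_basic_info` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the root element name "speech" is an assumption,
# because FIG. 2 is not reproduced here. Only the "dousa" tag and its
# "id" attribute are taken from the description.
SAMPLE = '<speech>それは<dousa id="1">悲しい</dousa>お話だよ</speech>'


def parse_processing_basic_info(xml_text):
    """Return (processing target string, [(action-accompanying string, action id), ...])."""
    root = ET.fromstring(xml_text)
    # Processing target character string: all text with the tags removed.
    target = "".join(root.itertext())
    # Each <dousa> element gives one action-accompanying string and its id.
    actions = [("".join(el.itertext()), el.get("id")) for el in root.iter("dousa")]
    return target, actions


target, actions = parse_processing_basic_info(SAMPLE)
print(target)
print(actions)
```

Returning a list of pairs also covers input that contains more than one `dousa` element.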

To keep the explanation simple, the following description uses an example in which the processing target character string is a single sentence containing one action-accompanying character string. The processing target character string may, however, contain several sentences and several action-accompanying character strings. In that case, the robot apparatus 1 repeats the same processing for each action-accompanying character string.

Based on the processing target character string received by the information reception unit 21, the speech synthesis processing execution unit 22 performs speech synthesis processing that synthesizes processing target speech, that is, speech representing the processing target character string. Specifically, the speech synthesis processing execution unit 22 includes a text analysis unit 22a, a rhythm generation unit 22b, and a waveform generation unit 22c.

The text analysis unit 22a generates a phoneme string and accents by performing language analysis processing on the processing target character string received by the information reception unit 21. The language analysis processing includes processing that analyzes the relationships between words (dependencies), parts of speech, and the like, and processing that identifies the positions of accents in the character string. Examples of language analysis processing are disclosed in Japanese Patent No. 3379643, Japanese Patent No. 3518340, and elsewhere.

The text analysis unit 22a outputs to the rhythm generation unit 22b the generated phoneme string and accents, together with the action identification information associated with the phoneme string that corresponds to the action-accompanying character string. The text analysis unit 22a also outputs the action identification information associated with the generated phoneme string to the action execution time acquisition unit 23.

The action execution time acquisition unit 23 acquires the action execution time stored in the action execution time storage unit 34 in association with the action identification information output by the text analysis unit 22a. The action execution time is the time the robot apparatus 1 requires to execute the action identified by the action identification information (that is, the action associated with the action-accompanying character string).

In this example, as described later, the action execution time storage unit 34 stores, for each piece of action identification information, only the most recent of the action execution times detected by the action execution time detection unit 33, in association with that action identification information. In other words, the action execution time acquisition unit 23 acquires, as the action execution time, the value detected as the time the robot apparatus 1 required to execute the action the previous time.

The action execution time storage unit 34 may instead be configured to store, for each piece of action identification information, all of the action execution times detected by the action execution time detection unit 33 in association with that action identification information. In that case, the action execution time acquisition unit 23 preferably acquires the most recent of the action execution times stored in the action execution time storage unit 34 in association with the action identification information.

The action execution time acquisition unit 23 outputs the acquired action execution time, in association with the action identification information, to the rhythm generation unit 22b.

The rhythm generation unit 22b generates prosody information based on the phoneme string, accents, and action identification information output by the text analysis unit 22a, and on the action execution time and action identification information output by the action execution time acquisition unit 23.

The prosody information is information associated with each phoneme in the phoneme string and represents the prosody. The prosody covers the pitch of the sound (that is, the pitch pattern, such as the center pitch (mean F0) and the slope of F0), the length of the sound (that is, the duration for which a phoneme continues), and so on.

Specifically, the rhythm generation unit 22b generates the prosody information so that the duration of the phoneme string corresponding to the action-accompanying character string matches the action execution time for the action identification information associated with that phoneme string.

In this example, the rhythm generation unit 22b first generates prosody information without reference to the action execution time, following a method similar to those disclosed in Japanese Patent No. 3240691, Japanese Patent No. 3344487, and elsewhere. The rhythm generation unit 22b then obtains the duration of the phoneme string corresponding to the action-accompanying character string (the basic duration), following a method similar to the one described in Japanese Patent No. 2062933.

Next, the rhythm generation unit 22b calculates a correction ratio by dividing the action execution time associated with the phoneme string corresponding to the action-accompanying character string by the basic duration. The rhythm generation unit 22b then corrects the duration of each phoneme in the generated prosody information by replacing it with that duration multiplied by the correction ratio.

In this way, the rhythm generation unit 22b can generate prosody information in which the duration of the phoneme string corresponding to the action-accompanying character string matches the action execution time for the action identification information associated with that phoneme string.
The rhythm generation unit 22b may also be configured to generate the prosody information by another method.
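The duration correction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-phoneme durations are assumed to be available as a plain list of seconds, and the function name `correct_durations` and the use of a `slice` to mark the action-accompanying phonemes are hypothetical choices, not part of the description.

```python
import math


def correct_durations(durations, segment, action_time):
    """Scale phoneme durations so the action-accompanying segment lasts action_time.

    durations   -- per-phoneme durations in seconds (assumed representation)
    segment     -- slice selecting the phonemes of the action-accompanying string
    action_time -- acquired action execution time in seconds
    """
    basic_duration = sum(durations[segment])
    if basic_duration <= 0:
        raise ValueError("basic duration must be positive")
    # Correction ratio = action execution time / basic duration.
    ratio = action_time / basic_duration
    # Every phoneme duration is replaced by its value multiplied by the ratio.
    return [d * ratio for d in durations], ratio


# Example: phonemes 2..4 form the action-accompanying segment (0.4 s in total)
# and the robot needs 0.8 s for the action.
base = [0.1, 0.1, 0.2, 0.1, 0.1, 0.15]
seg = slice(2, 5)
corrected, ratio = correct_durations(base, seg, 0.8)
assert math.isclose(sum(corrected[seg]), 0.8)
```

Because every duration is multiplied by the same ratio, the corrected segment lasts exactly the action execution time while the relative timing of the phonemes is preserved.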

The waveform generation unit 22c synthesizes speech based on the prosody information generated by the rhythm generation unit 22b. Specifically, the waveform generation unit 22c retrieves, from a storage device that stores a plurality of speech units in advance, the speech units whose prosody information is closest to the generated prosody information.

The waveform generation unit 22c then converts the prosody of the speech units based on the retrieved speech units and on the generated phoneme string and prosody information. Next, the waveform generation unit 22c concatenates the converted speech units to generate (information representing) the processing target speech that represents the processing target character string; that is, it performs the speech synthesis processing.

In this way, the speech synthesis processing execution unit 22 synthesizes processing target speech that includes action-accompanying speech, which is speech that represents the action-accompanying character string and is uttered over the action execution time acquired by the action execution time acquisition unit 23.

The waveform generation unit 22c outputs the generated processing target speech to the speech output unit 24. In addition, at the point when output of the action-accompanying speech starts, the waveform generation unit 22c outputs an action execution instruction to the action execution unit 32. The action execution instruction is information that includes the action identification information associated with the action-accompanying speech and instructs the robot apparatus 1 to execute the action identified by that action identification information.
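As a rough illustration of the timing involved, the following Python sketch computes when, relative to the start of the utterance, the action execution instruction would be issued and when the action-accompanying speech would end. The description does not say how the waveform generation unit 22c determines this point, so deriving it from a list of per-phoneme durations is an assumption, and the function name `action_window` is hypothetical.

```python
def action_window(durations, segment):
    """Return (onset, end) of the action-accompanying speech in seconds.

    durations -- per-phoneme durations in seconds (assumed representation)
    segment   -- slice selecting the phonemes of the action-accompanying string
    """
    # The instruction is issued when the action-accompanying speech starts,
    # i.e. after all phonemes that precede the segment have been output.
    onset = sum(durations[:segment.start])
    end = onset + sum(durations[segment])
    return onset, end


durations = [0.25, 0.25, 0.5, 0.25, 0.25, 0.5]
onset, end = action_window(durations, slice(2, 5))
print(onset, end)
```

If the durations have already been corrected to the action execution time, `end - onset` equals that time, so the action and the action-accompanying speech finish together.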

The speech output unit 24 includes a speaker. The speech output unit 24 outputs (emits) the processing target speech output by the waveform generation unit 22c through the speaker. In this way, the speech output unit 24 outputs the processing target speech synthesized by the speech synthesis processing execution unit 22 so that output of the action-accompanying speech starts at the same time as the robot apparatus 1 starts executing the action.

As shown in FIG. 3, the action information storage unit 31 stores action identification information in association with action content information, which represents the content of an action. The content of an action is, for example, “shake the head sideways”, “nod the head up and down”, “raise and lower one hand”, or “move forward 30 cm and turn once clockwise”.

The action execution unit 32 executes an action in accordance with the action execution instruction output by the waveform generation unit 22c. Specifically, when the waveform generation unit 22c outputs an action execution instruction, the action execution unit 32 retrieves the action content information stored in the action information storage unit 31 in association with the action identification information contained in that instruction. The action execution unit 32 then controls the parts of the robot apparatus 1 to execute the action represented by the retrieved action content information.

The action execution time detection unit 33 detects the action execution time by measuring the time from when the action execution unit 32 starts executing an action until it finishes. The action execution time detection unit 33 stores the detected action execution time in the action execution time storage unit 34 in association with the action identification information that identifies the action.

The action execution time storage unit 34 stores the action execution time detected by the action execution time detection unit 33 in association with the action identification information. If an action execution time is already stored in association with that action identification information, the action execution time storage unit 34 replaces the stored value with the newly detected action execution time.
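The detection and storage behavior described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the class name `ActionTimeStore`, the function `run_and_record`, and the idea of seeding the store with initial values are assumptions, and the action is represented by an arbitrary Python callable.

```python
import time


class ActionTimeStore:
    """Keeps only the most recent execution time per action id (hypothetical class)."""

    def __init__(self, initial_times):
        # initial_times: assumed starting values used before any measurement exists.
        self._times = dict(initial_times)

    def get(self, action_id):
        return self._times[action_id]

    def put(self, action_id, seconds):
        # A newly detected value replaces the one already stored.
        self._times[action_id] = seconds


def run_and_record(store, action_id, action, clock=time.perf_counter):
    """Execute the action, measure start-to-finish time, and store it."""
    start = clock()
    action()
    elapsed = clock() - start
    store.put(action_id, elapsed)
    return elapsed


store = ActionTimeStore({"1": 2.0})
run_and_record(store, "1", lambda: time.sleep(0.01))
print(store.get("1"))
```

The `clock` parameter is there only so that the timing source can be swapped out, for example with a deterministic clock when testing.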

Next, the operation of the robot apparatus 1 described above will be explained with reference to FIG. 4. FIG. 4 is a time chart conceptually showing the operation of the robot apparatus 1.
In this example, it is assumed that the robot apparatus 1 receives processing basic information #1 and subsequently receives processing basic information #2.

Processing basic information #1 includes “それは悲しいね” (“That's sad”) as the processing target character string, “悲しい” (“sad”) as the action-accompanying character string, and “1” as the action identification information. Processing basic information #2 includes “駄目止まって” (“No, stop”) as the processing target character string, “駄目” (“no”) as the action-accompanying character string, and “1” as the action identification information. In FIG. 4, the action-accompanying character string is the part of the processing target character string enclosed in quotation marks.

When the information reception unit 21 receives processing basic information #1 at time t1, the text analysis unit 22a generates a phoneme string and accents by performing language analysis processing. The text analysis unit 22a then outputs to the rhythm generation unit 22b the generated phoneme string and accents, together with the action identification information “1” associated with the phoneme string that corresponds to the action-accompanying character string. The text analysis unit 22a also outputs the action identification information “1” associated with the generated phoneme string to the action execution time acquisition unit 23.

The action execution time acquisition unit 23 acquires action execution time #0, which is stored in the action execution time storage unit 34 in association with the action identification information “1” output by the text analysis unit 22a. In this example, it is assumed that at this point the robot apparatus 1 has not yet executed the action identified by the action identification information “1”. Action execution time #0 is therefore a preset value (default value).

Next, the rhythm generation unit 22b generates prosody information based on the phoneme string, accents, and action identification information “1” output by the text analysis unit 22a, and on action execution time #0 and the action identification information “1” output by the action execution time acquisition unit 23. In doing so, the rhythm generation unit 22b generates prosody information in which the duration of the phoneme string corresponding to the action-accompanying character string “悲しい” matches action execution time #0 for the action identification information “1” associated with that phoneme string.

Next, the waveform generation unit 22c generates the processing target speech (speech representing “sore wa kanashii ne”) based on the prosody information generated by the rhythm generation unit 22b. The waveform generation unit 22c outputs the generated processing target speech to the speech output unit 24. The speech output unit 24 thereby outputs the processing target speech (speech representing “sore wa kanashii ne”) from the speaker.

In addition, at the point when output of the action-accompanying speech (speech representing “kanashii”) starts, the waveform generation unit 22c outputs an action execution instruction containing the action identification information “1” to the action execution unit 32. The action execution unit 32 thereby executes the action based on the action content information “shake the head sideways” associated with the action identification information “1”.

At this stage, action execution time #0 differs from the time the robot apparatus 1 actually requires to execute the action identified by the action identification information “1”. The point at which the robot apparatus 1 finishes executing the action therefore does not coincide with the point at which it finishes outputting the action-accompanying speech (speech representing “kanashii”).

When the action execution unit 32 finishes executing the action, the action execution time detection unit 33 stores the detected action execution time #1 in the action execution time storage unit 34 in association with the action identification information “1”.

When the information reception unit 21 receives processing basic information #2 at time t2, the robot apparatus 1 operates in the same way as described above. This time, the action execution time acquisition unit 23 acquires action execution time #1 instead of action execution time #0. Action execution time #1 is the time the robot apparatus 1 actually required to execute the action identified by the action identification information “1”.

Accordingly, in the operation of the robot apparatus 1 based on processing basic information #2, the point at which the robot apparatus 1 finishes executing the action coincides with the point at which it finishes outputting the action-accompanying speech (speech representing “dame”).

The above description assumes that, as shown in FIG. 4, the motion execution time acquisition unit 23 starts operating on the subsequent processing basic information #2 only after the motion execution time stored in the motion execution time storage unit 34 has been updated based on the preceding processing basic information #1. Such processing is easily realized by using a token or a semaphore to guarantee the order in which the robot apparatus 1 operates.
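The ordering guarantee mentioned here can be illustrated with a semaphore. The following is a minimal sketch, not the patent's implementation: the dictionary standing in for the motion execution time storage unit 34, the function names, and the numeric values are all assumptions made for illustration.

```python
import threading

# Hypothetical stand-in for the motion execution time storage unit 34.
execution_times = {"1": 1.0}          # assumed initial value (execution time #0)
updated = threading.Semaphore(0)      # released once the stored value has been updated
acquired = []


def finish_motion() -> None:
    # Role of the detection unit 33: store the measured time, then signal.
    execution_times["1"] = 1.4        # assumed measured value (execution time #1)
    updated.release()


def acquire_for_next_utterance() -> None:
    # Role of the acquisition unit 23: wait until the update has happened.
    updated.acquire()
    acquired.append(execution_times["1"])


consumer = threading.Thread(target=acquire_for_next_utterance)
producer = threading.Thread(target=finish_motion)
consumer.start()   # started first on purpose; it blocks on the semaphore
producer.start()
consumer.join()
producer.join()
```

Because the consumer blocks until the producer releases the semaphore, the value read for the next utterance is always the updated one, regardless of thread scheduling.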

As described above, the robot apparatus 1 according to the first embodiment of the present invention can make the duration of the motion-accompanying speech contained in the synthesized processing target speech match the time the robot apparatus 1 requires to execute the motion associated with that motion-accompanying speech. As a result, speech synchronized with the actual motion of the robot apparatus 1 can be synthesized.

Furthermore, in the robot apparatus 1 according to the first embodiment, the motion execution time acquisition unit 23 is configured to acquire, as the motion execution time, the value detected as the time the robot apparatus 1 required to execute the motion the previous time.

The motion execution time may vary depending on the environment in which the robot apparatus 1 operates. Even in such a case, the above configuration brings the acquired motion execution time sufficiently close to the time the robot apparatus 1 actually requires to execute the motion.
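The "use the previously measured value" behavior can be sketched as a small store that is overwritten each time a motion finishes. This is an illustrative sketch only: the class name `ExecutionTimeStore`, its methods, and the numeric values are hypothetical and do not appear in the source.

```python
from typing import Dict


class ExecutionTimeStore:
    """Hypothetical stand-in for the motion execution time storage unit 34."""

    def __init__(self, initial_times: Dict[str, float]) -> None:
        # The initial values play the role of motion execution time #0.
        self._times = dict(initial_times)

    def record(self, motion_id: str, measured_seconds: float) -> None:
        # Called when a motion finishes: overwrite with the measured value.
        self._times[motion_id] = measured_seconds

    def acquire(self, motion_id: str) -> float:
        # Called before prosody is generated for the next utterance.
        return self._times[motion_id]


store = ExecutionTimeStore({"1": 1.0})  # assumed initial value
first_target = store.acquire("1")       # used for processing basic information #1
store.record("1", 1.4)                  # assumed measured value (execution time #1)
second_target = store.acquire("1")      # used for processing basic information #2
```

The first utterance is timed against the initial value, while the second is timed against the value actually measured, which is why only the second one ends together with the motion.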

In addition, the robot apparatus 1 according to the first embodiment includes a speech output unit 24 that outputs the synthesized processing target speech such that output of the motion-accompanying speech starts at the same point in time as execution of the motion starts.

This makes the point in time at which output of the motion-accompanying speech contained in the synthesized processing target speech starts coincide with the point in time at which the robot apparatus 1 starts executing the motion associated with that speech. As a result, speech synchronized with the actual motion of the robot apparatus 1 can be output.

In the first embodiment, the speech synthesizer 2 is applied to the robot apparatus 1, but it may instead be applied to a virtual robot apparatus. Here, a virtual robot apparatus is one realized in a moving image that simulates the motion of a robot apparatus.

In the robot apparatus 1 according to the first embodiment, one motion is associated with a plurality of mutually different motion-accompanying character strings; however, one motion may instead be associated with a single motion-accompanying character string.

<Second Embodiment>
Next, a robot apparatus according to a second embodiment of the present invention will be described. The robot apparatus according to the second embodiment differs from that of the first embodiment in that it acquires, as the motion execution time, the average of the values detected as the time the robot apparatus required to execute the motion at a plurality of past points in time (two in this example). The following description therefore focuses on this difference.

As shown in FIG. 5(g), the motion execution time storage unit 34 according to the second embodiment stores, for each piece of motion identification information, all values of the motion execution time detected by the motion execution time detection unit 33, in association with that motion identification information.

As shown in FIG. 5(h), the motion execution time acquisition unit 23 retrieves the two most recent values among the motion execution times stored in the motion execution time storage unit 34 in association with the motion identification information. It then acquires the average of these two values as the motion execution time.

In this example, when only one motion execution time value is stored in the motion execution time storage unit 34 in association with the motion identification information, the motion execution time acquisition unit 23 acquires that stored value as the motion execution time.
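The averaging rule of this embodiment, including the single-value fallback, can be sketched as follows. The `history` dictionary, the function name, and the recorded values are assumptions for illustration; only the "average the two most recent values, or use the single stored value" rule comes from the text.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical contents of the motion execution time storage unit 34:
# every detected value, oldest first, keyed by motion identification information.
history: Dict[str, List[float]] = {
    "1": [1.0, 1.4, 1.2],
    "2": [3.0],
}


def acquire_execution_time(motion_id: str, window: int = 2) -> float:
    """Average the `window` most recent values; fall back to fewer if needed."""
    values = history[motion_id]
    if not values:
        raise ValueError(f"no execution time recorded for motion {motion_id!r}")
    # Slicing with [-window:] returns the single value when only one is stored.
    return mean(values[-window:])


averaged = acquire_execution_time("1")  # averages 1.4 and 1.2
single = acquire_execution_time("2")    # only one value is stored
```

Slicing from the end of the list handles both cases with the same expression, so no special branch is needed for a motion that has been executed only once.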

The motion execution time for the same motion may vary each time the robot apparatus 1 executes it. Even in such a case, the robot apparatus 1 according to the second embodiment brings the acquired motion execution time sufficiently close to the time the robot apparatus 1 actually requires to execute the motion. As a result, speech synchronized with the actual motion of the robot apparatus 1 can be output reliably.

The motion execution time acquisition unit 23 may be configured to acquire the average of three or more values as the motion execution time. Alternatively, it may be configured to acquire, as the motion execution time, a predicted value obtained by regression analysis of the values detected as the time the robot apparatus 1 required to execute the motion at a plurality of past points in time.
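The source does not say which regression model is meant. As one possible reading, the sketch below fits an ordinary least-squares line to the past values, using the execution index (0, 1, 2, ...) as the explanatory variable, and extrapolates one step ahead. The function name and the sample values are hypothetical.

```python
from typing import Sequence


def predict_next(values: Sequence[float]) -> float:
    """Predict the next execution time with a least-squares line over the index."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one past value is required")
    if n == 1:
        return values[0]  # a line cannot be fitted to a single point
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * n  # extrapolate to the next execution


predicted = predict_next([1.0, 1.2, 1.4])  # steadily slowing motion (assumed data)
```

A predictor of this kind follows a trend, for example a motor that slows as a battery drains, whereas a plain average lags behind it.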

The motion execution time storage unit 34 may also be configured to store, for each motion, the average of the motion execution times detected by the motion execution time detection unit 33.

<Third Embodiment>
Next, a robot apparatus according to a third embodiment of the present invention will be described. The robot apparatus according to the third embodiment differs from that of the first embodiment in that it performs processing to identify the motion-accompanying character string within the processing target character string and to identify the motion associated with that motion-accompanying character string. The following description therefore focuses on this difference.

As shown in FIG. 6, the speech synthesizer 2 according to the third embodiment includes, in addition to the configuration of the speech synthesizer 2 according to the first embodiment, a motion generation information storage unit (information storage means) 25 and a motion generation unit 26.

The information receiving unit 21 receives processing basic information that contains the processing target character string but contains neither a motion-accompanying character string nor motion identification information.

As shown in FIG. 7, the motion generation information storage unit 25 stores motion-accompanying character string specifying information and motion identification information in association with each other. The motion-accompanying character string specifying information is information for identifying a motion-accompanying character string. In this example, it includes surface information and part-of-speech information. The surface information represents a character string. The part-of-speech information represents the part of speech of that character string within the processing target character string.

The text analysis unit 22a also outputs surface information representing the character string that constitutes each morpheme in the processing target character string, and part-of-speech information representing the part of speech of each morpheme.
The motion generation unit 26 identifies the motion-accompanying character string in the processing target character string based on the surface information and part-of-speech information output by the text analysis unit 22a and the surface information and part-of-speech information stored in the motion generation information storage unit 25.

Specifically, when surface information and part-of-speech information identical to those output by the text analysis unit 22a are stored in the motion generation information storage unit 25, the motion generation unit 26 identifies the character string represented by that surface information as a motion-accompanying character string. In other words, the motion generation unit 26 identifies the motion-accompanying character string by performing matching based on the surface information and part-of-speech information of each morpheme.

Furthermore, the motion generation unit 26 acquires, as the motion identification information identifying the motion associated with the identified motion-accompanying character string, the motion identification information stored in the motion generation information storage unit 25 in association with that surface information and part-of-speech information (that is, it identifies the motion associated with the motion-accompanying character string).
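The per-morpheme matching and lookup described above amounts to an exact-match table lookup keyed by surface form and part of speech. The sketch below assumes the morphemes have already been produced by a text analyzer; the `Morpheme` type, the part-of-speech labels, and the table contents are hypothetical, since the contents of FIG. 7 are not reproduced in this section.

```python
from typing import Dict, List, NamedTuple, Tuple


class Morpheme(NamedTuple):
    surface: str  # surface information: the character string itself
    pos: str      # part-of-speech information (labels here are assumed)


# Hypothetical contents of the motion generation information storage unit 25:
# (surface, part of speech) -> motion identification information.
MOTION_RULES: Dict[Tuple[str, str], str] = {
    ("かなしい", "adjective"): "1",
}


def find_motions(morphemes: List[Morpheme]) -> List[Tuple[str, str]]:
    """Return (motion-accompanying character string, motion ID) pairs."""
    matches = []
    for morpheme in morphemes:
        motion_id = MOTION_RULES.get((morpheme.surface, morpheme.pos))
        if motion_id is not None:
            matches.append((morpheme.surface, motion_id))
    return matches


# Assumed analyzer output for an input sentence.
analyzed = [Morpheme("とても", "adverb"), Morpheme("かなしい", "adjective")]
matches = find_motions(analyzed)
```

Keying on both fields means a string that happens to share a surface form but has a different part of speech does not trigger the motion.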

The motion generation unit 26 outputs the identified motion-accompanying character string and the acquired motion identification information to the text analysis unit 22a.

Then, like the text analysis unit 22a according to the first embodiment, the text analysis unit 22a outputs to the rhythm generation unit 22b the generated phoneme string and accent, together with the motion identification information associated with the phoneme string corresponding to the motion-accompanying character string. The text analysis unit 22a also outputs the motion identification information associated with the generated phoneme string to the motion execution time acquisition unit 23.

In this way, even when the robot apparatus 1 according to the third embodiment receives processing basic information that contains neither a motion-accompanying character string nor motion identification information, it can appropriately generate the motion-accompanying character string and the motion identification information from the processing target character string. This reduces the effort required of the user to input information associating a motion-accompanying character string in the processing target character string with a motion.

The robot apparatus 1 according to the third embodiment is configured to perform matching in units of morphemes, but it may instead be configured to perform matching in units of breath groups or sentences, or to perform matching based on rules (matching rules) for morpheme sequences consisting of a plurality of consecutive morphemes.

In a robot apparatus 1 according to a modification of the third embodiment, the motion-accompanying character string specifying information may include property information relating to position within the processing target character string (for example, information indicating whether the character string is located at the beginning of a sentence, the end of a sentence, the beginning of a breath group, or the end of a breath group).

<Fourth Embodiment>
Next, a robot apparatus according to a fourth embodiment of the present invention will be described. The robot apparatus according to the fourth embodiment differs from that of the third embodiment in that, when a plurality of pieces of motion identification information are stored in association with one motion-accompanying character string in the processing target character string, it selects one piece of motion identification information from among them. The following description therefore focuses on this difference.

When a plurality of pieces of motion identification information are stored in the motion generation information storage unit 25 in association with one set of surface information and part-of-speech information output by the text analysis unit 22a (that is, when a plurality of pieces of motion identification information associated with one motion-accompanying character string are stored), the motion generation unit 26 according to the fourth embodiment acquires all of those pieces as the motion identification information identifying the motions associated with the identified motion-accompanying character string. The motion generation unit 26 then outputs the identified motion-accompanying character string and the acquired plurality of pieces of motion identification information to the text analysis unit 22a.

The text analysis unit 22a then outputs to the rhythm generation unit 22b the generated phoneme string and accent, together with the plurality of pieces of motion identification information associated with the phoneme string corresponding to the motion-accompanying character string. It also outputs the plurality of pieces of motion identification information associated with the generated phoneme string to the motion execution time acquisition unit 23.

For each of the plurality of pieces of motion identification information output by the text analysis unit 22a, the motion execution time acquisition unit 23 acquires the motion execution time stored in the motion execution time storage unit 34 in association with that motion identification information. It then outputs each acquired motion execution time to the rhythm generation unit 22b in association with the corresponding motion identification information.

The rhythm generation unit (motion selection means) 22b selects one piece of motion identification information from among the plurality of pieces associated with one motion-accompanying character string, based on the phoneme string, accent, and motion identification information output by the text analysis unit 22a and on the motion execution times and motion identification information output by the motion execution time acquisition unit 23.

In doing so, the rhythm generation unit 22b selects the motion identification information with the highest effectiveness. Here, the effectiveness is a value that becomes higher as the motion execution time becomes closer to the duration of the motion-accompanying speech represented by prosodic information generated without reference to the motion execution time.

Next, the rhythm generation unit 22b selects the motion identification information that yielded the motion execution time closest to the basic duration. It then generates prosodic information based on the selected motion identification information, in the same manner as in the first embodiment.
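Selecting the candidate whose execution time is closest to the basic duration reduces to a minimum-distance search. The sketch below is illustrative: the function name, the candidate table, and the numbers are assumptions, and the absolute difference is used as the (inverse) effectiveness measure because the text only requires that effectiveness rise as the two times get closer.

```python
from typing import Dict


def select_motion(basic_duration: float, candidates: Dict[str, float]) -> str:
    """Pick the motion ID whose execution time is closest to the basic duration.

    `candidates` maps motion identification information to motion execution
    time in seconds (hypothetical values supplied by the caller).
    """
    if not candidates:
        raise ValueError("at least one candidate motion is required")
    return min(candidates, key=lambda mid: abs(candidates[mid] - basic_duration))


# Assumed data: speech would naturally last 0.8 s; three motions are possible.
selected = select_motion(0.8, {"1": 1.4, "2": 0.9, "3": 3.0})
```

Choosing the closest candidate keeps the amount of stretching or compression applied to the speech as small as possible.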

In this way, the robot apparatus 1 according to the fourth embodiment can associate an appropriate motion with the motion-accompanying character string, so the user hears the synthesized processing target speech as natural speech (speech closer to what a human actually utters).

When the rhythm generation unit 22b uses a cost function to generate prosodic information, it may be configured to select the motion identification information that yielded the motion execution time for which the cost function value is best.

The information receiving unit 21 may also be configured to receive processing basic information that contains a processing target character string, a motion-accompanying character string, and a plurality of pieces of motion identification information associated with that one motion-accompanying character string. In this case as well, the robot apparatus 1 preferably selects one piece of motion identification information from among the plurality of pieces associated with the one motion-accompanying character string, in the same manner as described above.

A robot apparatus 1 according to a modification of the fourth embodiment may be configured to select one piece of motion identification information at random from among the plurality of pieces associated with one motion-accompanying character string. Alternatively, the robot apparatus 1 may be configured to select one piece of motion identification information from among them according to a round-robin scheme.

With this configuration, when, for example, the processing target character string contains the same motion-accompanying character string more than once, a different motion can be associated with each occurrence. As a result, the user perceives the motion of the robot apparatus 1 as natural motion (motion closer to that of a human).
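Both selection schemes of this modification are simple to sketch. The helper names and the motion IDs below are hypothetical; the random selector takes an explicit `random.Random` instance only so that the example is reproducible.

```python
import itertools
import random
from typing import Callable, Sequence


def make_round_robin_selector(motion_ids: Sequence[str]) -> Callable[[], str]:
    """Return a function that yields the motion IDs in turn, wrapping around."""
    cycle = itertools.cycle(motion_ids)
    return lambda: next(cycle)


def select_random(motion_ids: Sequence[str], rng: random.Random) -> str:
    """Pick one motion ID uniformly at random."""
    return rng.choice(motion_ids)


candidates = ["1", "2", "3"]  # assumed IDs associated with one character string

next_motion = make_round_robin_selector(candidates)
round_robin_picks = [next_motion() for _ in range(4)]

random_pick = select_random(candidates, random.Random(0))
```

With round-robin selection, repeated occurrences of the same string are guaranteed to receive different motions until the candidate list wraps around; random selection gives variety without that guarantee.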

<Fifth Embodiment>
Next, a robot apparatus according to a fifth embodiment of the present invention will be described. The robot apparatus according to the fifth embodiment differs from that of the first embodiment in that, when a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, it generates prosodic information such that the speech representing the plurality of motion-accompanying character strings as a whole continues for the sum of the motion execution times of the motions associated with those character strings. The following description therefore focuses on this difference.

When a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, the rhythm generation unit 22b according to the fifth embodiment calculates the sum of the motion execution times acquired by the motion execution time acquisition unit 23 for each of those character strings. It then generates prosodic information such that the duration of the phoneme string corresponding to the plurality of motion-accompanying character strings as a whole matches the calculated sum.

The description continues on the assumption that the information receiving unit 21 has received the processing basic information shown in FIG. 8. This processing basic information contains 僕は楽しいダンスが好きなんだ (“I like fun dances”) as the processing target character string, 楽しい (“fun”) as the first motion-accompanying character string, ダンス (“dance”) as the second motion-accompanying character string, “1” as the first motion identification information associated with the first motion-accompanying character string “fun”, and “2” as the second motion identification information associated with the second motion-accompanying character string “dance”. The first motion-accompanying character string “fun” and the second motion-accompanying character string “dance” are arranged consecutively in the processing target character string.

Further, as shown in FIG. 9, the motion information storage unit 31 stores “1” as motion identification information in association with “laugh” as motion content information, and stores “2” as motion identification information in association with “dance” as motion content information.

When the above processing basic information is received, the rhythm generation unit 22b according to the first embodiment, as shown in FIG. 10(a), sets the duration of the speech corresponding to the first motion-accompanying character string “fun” to the motion execution time acquired for the motion identification information “1”, and sets the duration of the speech corresponding to the second motion-accompanying character string “dance” to the motion execution time acquired for the motion identification information “2”.

In this example, the time the robot apparatus 1 requires to execute the subsequent motion “dance” is considerably longer than the time it requires to execute the preceding motion “laugh”. The first motion-accompanying character string “fun” is therefore uttered within a relatively short time, and the second motion-accompanying character string “dance” within a relatively long time. That is, in the processing target speech, a motion-accompanying speech uttered relatively briefly is immediately followed by one uttered over a relatively long time. Users tend to hear such speech as unnatural.

In contrast, as shown in FIG. 10(b), the rhythm generation unit 22b according to the fifth embodiment sets the duration of the speech corresponding to the whole phrase “fun dance”, formed by the first motion-accompanying character string “fun” and the second motion-accompanying character string “dance”, to the sum of the motion execution time acquired for the motion identification information “1” and the motion execution time acquired for the motion identification information “2”.

This reduces the difference between the durations of the consecutive motion-accompanying speeches. As a result, the user hears the synthesized processing target speech as natural speech (speech closer to what a human actually utters).
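The effect of fitting the whole phrase to the summed execution time can be illustrated numerically. The sketch below assumes the phrase is stretched by one uniform scale factor; the source only requires that the total duration equal the sum, so the uniform scaling, the function name, and all numbers are assumptions.

```python
from typing import List, Sequence


def fit_consecutive(
    base_durations: Sequence[Sequence[float]],
    execution_times: Sequence[float],
) -> List[List[float]]:
    """Scale all phoneme durations so the whole phrase lasts sum(execution_times).

    `base_durations[i]` holds the unadjusted phoneme durations (seconds) of the
    i-th motion-accompanying character string.
    """
    total_target = sum(execution_times)
    total_base = sum(d for durations in base_durations for d in durations)
    if total_base <= 0:
        raise ValueError("base durations must sum to a positive value")
    scale = total_target / total_base  # one factor shared by the whole phrase
    return [[d * scale for d in durations] for durations in base_durations]


# Assumed data: "fun" and "dance" with execution times of 1.0 s and 4.0 s.
fitted = fit_consecutive([[0.3, 0.3], [0.2, 0.2]], [1.0, 4.0])
fun_duration = sum(fitted[0])
dance_duration = sum(fitted[1])
```

With these assumed numbers, fitting each string separately would give 1.0 s and 4.0 s, while fitting the phrase as a whole gives about 3.0 s and 2.0 s, so neighboring words are no longer uttered at very different speaking rates.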

<Sixth Embodiment>
Next, a speech synthesizer according to a sixth embodiment of the present invention will be described with reference to FIG. 11.
The speech synthesizer 100 according to the sixth embodiment is applied to a robot apparatus that performs predetermined motions, and is configured to perform speech synthesis processing that synthesizes, based on a processing target character string (the character string to be processed), processing target speech (speech representing that character string).

The speech synthesizer 100 includes:
a motion execution time acquisition unit (motion execution time acquisition means) 101 that acquires a motion execution time, which is the time the robot apparatus requires to execute the motion associated with a motion-accompanying character string that is at least part of the processing target character string; and
a speech synthesis processing execution unit (speech synthesis processing execution means) 102 that performs the speech synthesis processing to synthesize the processing target speech containing motion-accompanying speech, which is speech that represents the motion-accompanying character string and is uttered over the acquired motion execution time.

This makes it possible to match the duration of the motion-accompanying speech contained in the synthesized processing target speech to the time the robot apparatus requires to execute the motion associated with that motion-accompanying speech. As a result, speech synchronized with the actual motion of the robot apparatus can be synthesized.

The present invention has been described above with reference to the embodiments, but it is not limited to them. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

In the above embodiments, each function of the robot apparatus 1 is realized by hardware such as circuits. However, the robot apparatus 1 may instead include a processing device and a storage device storing a program (software), and may be configured to realize each function by having the processing device execute the program. In this case, the program may be stored on a computer-readable recording medium. The recording medium is, for example, a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

As further modifications of the above embodiments, any combination of the embodiments and modifications described above may be adopted.

The present invention is applicable to robot apparatuses and the like that output speech and perform preset motions.

<Appendix>
(Appendix 1)
A speech synthesizer that is applied to a robot apparatus that performs a predetermined motion and that is configured to perform speech synthesis processing for synthesizing, based on a processing target character string (the character string to be processed), processing target speech representing that character string, the speech synthesizer comprising:
motion execution time acquisition means for acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion associated with a motion-accompanying character string that is at least a part of the processing target character string; and
speech synthesis processing execution means for performing the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that represents the motion-accompanying character string and is uttered over the acquired motion execution time.
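
The core idea of Appendix 1, making the motion-accompanying speech last as long as the motion execution time, can be illustrated with a minimal sketch. The function name `fit_durations`, the list of default segment durations, and the uniform scaling rule are all assumptions made for illustration; the text does not prescribe how the duration is adjusted.

```python
def fit_durations(default_durations, motion_execution_time):
    """Scale the default durations (seconds) of the speech segments that make up
    the motion-accompanying speech so that they sum to the motion execution time.

    Uniform scaling is an assumed strategy, not one stated in the source.
    """
    total = sum(default_durations)
    if total <= 0:
        raise ValueError("default durations must sum to a positive value")
    if motion_execution_time <= 0:
        raise ValueError("motion execution time must be positive")
    scale = motion_execution_time / total
    return [d * scale for d in default_durations]


# Hypothetical example: three segments that would normally take 0.6 s in total
# are stretched to cover a motion that takes 1.5 s.
stretched = fit_durations([0.1, 0.2, 0.3], 1.5)
```

Under these assumptions the relative lengths of the segments are preserved; only the overall duration changes.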

(Appendix 2)
The speech synthesizer according to Appendix 1, further comprising
information receiving means for receiving the motion-accompanying character string and motion identification information for identifying the motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the received motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the received motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 3)
The speech synthesizer according to Appendix 1, further comprising
information storage means for storing, in association with each other, motion-accompanying character string specifying information for specifying the motion-accompanying character string and motion identification information for identifying a motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the stored motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string specified, within the processing target character string, by the stored motion-accompanying character string specifying information, and that is uttered over the motion execution time acquired based on the motion identification information stored in association with that specifying information.

According to this, the user is spared some of the effort of entering information that associates motion-accompanying character strings in the processing target character string with motions.
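
The stored association described in Appendix 3 can be pictured as a lookup table that is scanned against the processing target character string. The sketch below is only an illustration: the table contents, the motion names, the execution times, and the use of plain substring matching as the "specifying information" are all hypothetical.

```python
def find_motion_segments(text, motion_table, execution_times):
    """Return (start_index, phrase, motion_id, execution_time) for every stored
    phrase found in `text`, ordered by position.

    motion_table maps a phrase (assumed form of the specifying information)
    to a motion id; execution_times maps a motion id to seconds.
    """
    segments = []
    for phrase, motion_id in motion_table.items():
        start = text.find(phrase)
        while start != -1:
            segments.append((start, phrase, motion_id, execution_times[motion_id]))
            start = text.find(phrase, start + len(phrase))
    return sorted(segments)


# Hypothetical stored data.
motion_table = {"hello": "wave_hand", "this way": "point"}
execution_times = {"wave_hand": 1.2, "point": 0.8}
segments = find_motion_segments("hello, please come this way", motion_table, execution_times)
```

With such a table in place, the user supplies only the text; the phrases that should accompany a motion are picked out automatically.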

(Appendix 4)
The speech synthesizer according to Appendix 3, further comprising
motion selection means for selecting, when a plurality of pieces of motion identification information are stored in association with one motion-accompanying character string in the processing target character string, one piece of motion identification information from among them, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the selected motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string in the processing target character string and is uttered over the motion execution time acquired based on the selected motion identification information.

According to this, an appropriate motion can be associated with the motion-accompanying character string, so the user hears the synthesized processing target speech as natural speech (speech closer to what a human actually utters). In addition, for example, when the processing target character string contains the same motion-accompanying character string more than once, a different motion can be associated with each occurrence.
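
One way to picture the motion selection of Appendix 4 is a selector that hands out a different stored motion each time the same phrase occurs. The round-robin rule, the class name `MotionSelector`, and the motion names below are assumptions for illustration; this passage does not state how the selection is made.

```python
from itertools import cycle


class MotionSelector:
    """Pick one motion id per occurrence of a phrase, cycling through the
    candidates stored for that phrase (assumed selection rule)."""

    def __init__(self, candidates):
        # candidates: phrase -> list of motion ids stored for that phrase
        self._cycles = {phrase: cycle(ids) for phrase, ids in candidates.items()}

    def select(self, phrase):
        return next(self._cycles[phrase])


# Hypothetical data: the phrase "yes" has two stored motions.
selector = MotionSelector({"yes": ["nod", "raise_arm"]})
picks = [selector.select("yes") for _ in range(3)]
```

Any other rule (random choice, choice based on context) would fit the same interface.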

(Appendix 5)
The speech synthesizer according to any one of Appendices 1 to 4, wherein
the motion execution time acquisition means is configured to acquire, when a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, the motion execution time for each of the plurality of motion-accompanying character strings, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the plurality of motion-accompanying character strings as a whole and is uttered over the sum of the acquired motion execution times.

Incidentally, the motion execution time differs considerably from one motion to another. Therefore, when a motion-accompanying character string associated with a motion having a relatively long execution time and one associated with a motion having a relatively short execution time are arranged consecutively in the processing target character string, motion-accompanying speech uttered for a relatively long time and motion-accompanying speech uttered for a relatively short time are arranged consecutively in the processing target speech. The user tends to perceive such speech as unnatural.

Configuring the speech synthesizer as described above reduces the differences among the durations of the consecutive pieces of motion-accompanying speech. As a result, the user hears the synthesized processing target speech as natural speech (speech closer to what a human actually utters).
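
The effect of Appendix 5 can be shown with a small worked example. In this sketch the consecutive strings share the sum of their motion execution times in proportion to their default speech durations; the proportional split and the function name `fit_consecutive` are assumptions, since the text only requires that the whole be uttered over the sum.

```python
def fit_consecutive(default_durations, execution_times):
    """Spread the sum of the motion execution times over consecutive
    motion-accompanying strings in proportion to their default durations.

    The proportional split is an assumed rule, not one stated in the source.
    """
    total_default = sum(default_durations)
    if total_default <= 0:
        raise ValueError("default durations must sum to a positive value")
    total_time = sum(execution_times)
    return [d * total_time / total_default for d in default_durations]


# Two strings of equal default length; the motions take 3.0 s and 1.0 s.
# Fitting each string separately would give 3.0 s and 1.0 s of speech;
# fitting them as a whole gives 2.0 s each.
durations = fit_consecutive([0.5, 0.5], [3.0, 1.0])
```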

(Appendix 6)
The speech synthesizer according to any one of Appendices 1 to 5, wherein
the motion execution time acquisition means is configured to acquire, as the motion execution time, a value that the robot apparatus detected as the time it required to execute the motion.

(Appendix 7)
The speech synthesizer according to Appendix 6, wherein
the motion execution time acquisition means is configured to acquire, as the motion execution time, a value that the robot apparatus detected as the time it required to execute the motion the previous time.

Incidentally, the motion execution time may vary depending on the environment in which the robot apparatus operates. In such a case, with the above configuration, the acquired motion execution time can be brought sufficiently close to the time the robot apparatus actually requires to execute the motion.

(Appendix 8)
The speech synthesizer according to Appendix 6, wherein
the motion execution time acquisition means is configured to acquire, as the motion execution time, the average of values that the robot apparatus detected, at a plurality of past points in time, as the time it required to execute the motion.

Incidentally, the motion execution time for the same motion may vary each time the robot apparatus executes it. In such a case, with the above configuration, the acquired motion execution time can be brought sufficiently close to the time the robot apparatus actually requires to execute the motion.
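
The measured-value variants of Appendices 7 and 8 amount to keeping a history of detected execution times per motion and reading back either the latest value or the average. The class below is a hypothetical sketch; the name `MotionTimeHistory`, the motion id, and the sample values are not from the source.

```python
from statistics import mean


class MotionTimeHistory:
    """Store detected motion execution times (seconds) per motion id."""

    def __init__(self):
        self._samples = {}

    def record(self, motion_id, seconds):
        self._samples.setdefault(motion_id, []).append(seconds)

    def last(self, motion_id):
        # Appendix 7: the value detected the previous time the motion ran.
        return self._samples[motion_id][-1]

    def average(self, motion_id):
        # Appendix 8: the average of values detected at several past times.
        return mean(self._samples[motion_id])


history = MotionTimeHistory()
for seconds in (1.0, 2.0, 1.5):  # hypothetical measurements for one motion
    history.record("bow", seconds)
```

Asking for a motion that has never been recorded raises `KeyError` in this sketch; a real system would need a fallback such as a preset value.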

(Appendix 9)
The speech synthesizer according to Appendix 6, wherein
the motion execution time acquisition means is configured to acquire, as the motion execution time, a predicted value obtained by regression analysis of values that the robot apparatus detected, at a plurality of past points in time, as the time it required to execute the motion.
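
As an illustration of the regression-based variant in Appendix 9, the sketch below fits an ordinary least-squares line to past detected values and extrapolates one step ahead. The choice of a linear model, the use of the execution index as the explanatory variable, and the function name `predict_next` are assumptions; the text says only that regression analysis is used.

```python
def predict_next(durations):
    """Predict the next motion execution time from past detected values by
    fitting a least-squares line over the execution index 0, 1, 2, ...

    A linear trend over the index is an assumed model.
    """
    n = len(durations)
    if n == 0:
        raise ValueError("at least one past value is required")
    if n == 1:
        return durations[0]
    mean_x = (n - 1) / 2
    mean_y = sum(durations) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(durations))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


# Hypothetical history in which the motion gets 0.1 s slower each time.
predicted = predict_next([1.0, 1.1, 1.2])
```

Such a predictor would track a gradual drift (for example, a joint that slows as it warms up) better than a plain average, at the cost of being more sensitive to outliers.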

(Appendix 10)
The speech synthesizer according to any one of Appendices 1 to 9, further comprising
speech output means for outputting the synthesized processing target speech so that output of the motion-accompanying speech starts at the same point in time as the robot apparatus starts executing the motion.

According to this, the point in time at which output of the motion-accompanying speech included in the synthesized processing target speech starts can be made to coincide with the point in time at which the robot apparatus starts executing the motion associated with that speech. As a result, speech synchronized with the actual motion of the robot apparatus can be output.
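
The start-time alignment of Appendix 10 can be reduced to simple arithmetic: the motion should begin when the speech output reaches the motion-accompanying segment. The sketch below assumes the processing target speech is represented as a list of segment durations; that representation and the function name `motion_start_time` are hypothetical.

```python
def motion_start_time(speech_start, segment_durations, motion_segment_index):
    """Return the time (seconds) at which the motion should start so that it
    coincides with the start of the motion-accompanying speech.

    segment_durations lists the durations of the speech segments in output
    order; motion_segment_index identifies the motion-accompanying segment.
    """
    if not 0 <= motion_segment_index < len(segment_durations):
        raise IndexError("motion segment index out of range")
    return speech_start + sum(segment_durations[:motion_segment_index])


# Speech output starts at t = 10.0 s; the second segment accompanies a motion,
# so the motion should start after the first 0.5 s segment has been output.
start = motion_start_time(10.0, [0.5, 1.5, 0.25], 1)
```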

(Appendix 11)
A robot apparatus that performs a predetermined motion, comprising:
speech synthesis processing execution means for performing speech synthesis processing for synthesizing, based on a processing target character string (the character string to be processed), processing target speech representing that character string; and
motion execution time acquisition means for acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion associated with a motion-accompanying character string that is at least a part of the processing target character string, wherein
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that represents the motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 12)
The robot apparatus according to Appendix 11, further comprising
information receiving means for receiving the motion-accompanying character string and motion identification information for identifying the motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the received motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the received motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 13)
The robot apparatus according to Appendix 11, further comprising
information storage means for storing, in association with each other, motion-accompanying character string specifying information for specifying the motion-accompanying character string and motion identification information for identifying a motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the stored motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string specified, within the processing target character string, by the stored motion-accompanying character string specifying information, and that is uttered over the motion execution time acquired based on the motion identification information stored in association with that specifying information.

(Appendix 14)
The robot apparatus according to Appendix 13, further comprising
motion selection means for selecting, when a plurality of pieces of motion identification information are stored in association with one motion-accompanying character string in the processing target character string, one piece of motion identification information from among them, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the selected motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string in the processing target character string and is uttered over the motion execution time acquired based on the selected motion identification information.

(Appendix 15)
The robot apparatus according to any one of Appendices 11 to 14, wherein
the motion execution time acquisition means is configured to acquire, when a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, the motion execution time for each of the plurality of motion-accompanying character strings, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the plurality of motion-accompanying character strings as a whole and is uttered over the sum of the acquired motion execution times.

(Appendix 16)
The robot apparatus according to any one of Appendices 11 to 15, wherein
the motion execution time acquisition means is configured to acquire, as the motion execution time, a value detected as the time required to execute the motion.

(Appendix 17)
A speech synthesis method that is applied to a robot apparatus that performs a predetermined motion and that performs speech synthesis processing for synthesizing, based on a processing target character string (the character string to be processed), processing target speech representing that character string, the method comprising:
acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion associated with a motion-accompanying character string that is at least a part of the processing target character string; and
performing the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that represents the motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 18)
The speech synthesis method according to Appendix 17, comprising:
receiving the motion-accompanying character string and motion identification information for identifying the motion;
acquiring the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the received motion identification information; and
performing the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the received motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 19)
The speech synthesis method according to Appendix 17, comprising:
acquiring the motion execution time, which is the time required for the robot apparatus to execute the motion identified by motion identification information stored in a storage device that stores, in association with each other, motion-accompanying character string specifying information for specifying the motion-accompanying character string and motion identification information for identifying a motion; and
performing the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string specified, within the processing target character string, by the motion-accompanying character string specifying information stored in the storage device, and that is uttered over the motion execution time acquired based on the motion identification information stored in association with that specifying information.

(Appendix 20)
The speech synthesis method according to Appendix 19, comprising:
selecting, when a plurality of pieces of motion identification information are stored in the storage device in association with one motion-accompanying character string in the processing target character string, one piece of motion identification information from among them;
acquiring the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the selected motion identification information; and
performing the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string in the processing target character string and is uttered over the motion execution time acquired based on the selected motion identification information.

(Appendix 21)
The speech synthesis method according to any one of Appendices 17 to 20, comprising:
acquiring, when a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, the motion execution time for each of the plurality of motion-accompanying character strings; and
performing the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the plurality of motion-accompanying character strings as a whole and is uttered over the sum of the acquired motion execution times.

(Appendix 22)
A program for causing a speech synthesizer that is applied to a robot apparatus that performs a predetermined motion and that is configured to perform speech synthesis processing for synthesizing, based on a processing target character string (the character string to be processed), processing target speech representing that character string, to realize:
motion execution time acquisition means for acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion associated with a motion-accompanying character string that is at least a part of the processing target character string; and
speech synthesis processing execution means for performing the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that represents the motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 23)
The program according to Appendix 22, further causing the speech synthesizer to realize
information receiving means for receiving the motion-accompanying character string and motion identification information for identifying the motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the received motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the received motion-accompanying character string and is uttered over the acquired motion execution time.

(Appendix 24)
The program according to Appendix 22, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by motion identification information stored in a storage device that stores, in association with each other, motion-accompanying character string specifying information for specifying the motion-accompanying character string and motion identification information for identifying a motion, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string specified, within the processing target character string, by the motion-accompanying character string specifying information stored in the storage device, and that is uttered over the motion execution time acquired based on the motion identification information stored in association with that specifying information.

(Appendix 25)
The program according to Appendix 24, further causing the speech synthesizer to realize
motion selection means for selecting, when a plurality of pieces of motion identification information are stored in the storage device in association with one motion-accompanying character string in the processing target character string, one piece of motion identification information from among them, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the selected motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string in the processing target character string and is uttered over the motion execution time acquired based on the selected motion identification information.

(Appendix 26)
The program according to any one of Appendices 22 to 25, wherein
the motion execution time acquisition means is configured to acquire, when a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, the motion execution time for each of the plurality of motion-accompanying character strings, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the plurality of motion-accompanying character strings as a whole and is uttered over the sum of the acquired motion execution times.

1 Robot apparatus
2 Speech synthesizer
3 Motion control device
21 Information receiving unit
22 Speech synthesis processing execution unit
22a Text analysis unit
22b Rhythm generation unit
22c Waveform generation unit
23 Motion execution time acquisition unit
24 Speech output unit
25 Motion generation information storage unit
26 Motion generation unit
31 Motion information storage unit
32 Motion execution unit
33 Motion execution time detection unit
34 Motion execution time storage unit
100 Speech synthesizer
101 Motion execution time acquisition unit
102 Speech synthesis processing execution unit

Claims

A speech synthesizer that is applied to a robot apparatus that performs a predetermined motion, and that is configured to perform speech synthesis processing for synthesizing, based on a processing target character string (a character string to be processed), processing target speech (speech representing the processing target character string), the speech synthesizer comprising:
motion execution time acquisition means for acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion accompanying a motion-accompanying character string that is at least a part of the processing target character string; and
speech synthesis processing execution means for performing the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that is uttered over the acquired motion execution time and that represents the motion-accompanying character string.
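A minimal sketch of the duration matching described in the claim above, assuming the synthesizer works with per-phoneme durations and stretches them uniformly. The `Phoneme` type, the `fit_to_motion` function, and the uniform scaling rule are hypothetical illustrations; the claim does not specify how the duration is adjusted.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str
    duration_s: float  # natural duration before adjustment (hypothetical field)


def fit_to_motion(phonemes, motion_time_s):
    """Scale phoneme durations so that they sum to motion_time_s.

    Uniform scaling is an assumption made for this sketch only.
    """
    total = sum(p.duration_s for p in phonemes)
    if total <= 0:
        raise ValueError("phoneme durations must sum to a positive value")
    scale = motion_time_s / total
    return [Phoneme(p.symbol, p.duration_s * scale) for p in phonemes]
```

With this rule, the motion-accompanying speech lasts exactly as long as the acquired motion execution time, whatever its natural length was.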

The speech synthesizer according to claim 1, further comprising
information receiving means for receiving the motion-accompanying character string and motion identification information for identifying the motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the received motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that is uttered over the acquired motion execution time and that represents the received motion-accompanying character string.

The speech synthesizer according to claim 1, further comprising
information storage means for storing motion-accompanying character string specifying information, which specifies the motion-accompanying character string, in association with motion identification information, which identifies the motion, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the stored motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string specified, within the processing target character string, by the stored motion-accompanying character string specifying information, and that is uttered over the motion execution time acquired based on the motion identification information stored in association with that specifying information.

The speech synthesizer according to claim 3, further comprising
motion selection means for selecting, when a plurality of pieces of motion identification information are stored in association with one motion-accompanying character string in the processing target character string, one piece of motion identification information from among the plurality of pieces, wherein
the motion execution time acquisition means is configured to acquire the motion execution time, which is the time required for the robot apparatus to execute the motion identified by the selected motion identification information, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that represents the motion-accompanying character string in the processing target character string and that is uttered over the motion execution time acquired based on the selected motion identification information.
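The selection step in the claim above can be sketched as follows, assuming the stored associations are held in plain dictionaries and that one candidate is picked at random. The table contents, the names `MOTIONS_FOR_STRING`, `MOTION_TIME_S`, `select_motion`, and `motion_time_for`, and the random choice are all hypothetical; the claim only requires that one piece of motion identification information be selected.

```python
import random

# Hypothetical stored associations: character string -> motion identifiers.
MOTIONS_FOR_STRING = {"hello": ["wave_hand", "bow"]}

# Hypothetical motion execution times in seconds, keyed by motion identifier.
MOTION_TIME_S = {"wave_hand": 1.5, "bow": 2.0}


def select_motion(text, rng):
    """Pick one motion identifier among those stored for the given string."""
    candidates = MOTIONS_FOR_STRING[text]
    return rng.choice(candidates)


def motion_time_for(text, rng):
    """Return the selected motion identifier and its execution time."""
    motion_id = select_motion(text, rng)
    return motion_id, MOTION_TIME_S[motion_id]
```

Passing the random generator in as `rng` keeps the selection reproducible, for example with `random.Random(0)`.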

The speech synthesizer according to any one of claims 1 to 4, wherein
the motion execution time acquisition means is configured to acquire, when a plurality of motion-accompanying character strings are arranged consecutively in the processing target character string, the motion execution time for each of the plurality of motion-accompanying character strings, and
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including the motion-accompanying speech, which is speech that is uttered over the sum of the acquired motion execution times and that represents the whole of the plurality of motion-accompanying character strings.
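As a small sketch of the summation in the claim above, consecutive motion-accompanying character strings can be merged into one string whose target duration is the sum of the individual motion execution times. The `merge_consecutive` function and its input shape are assumptions made for illustration.

```python
def merge_consecutive(segments):
    """Merge consecutive (text, motion_time_s) pairs into one target.

    Returns the concatenated text and the sum of the motion execution times.
    """
    text = "".join(t for t, _ in segments)
    total_time_s = sum(s for _, s in segments)
    return text, total_time_s
```

For example, two adjacent strings with motion execution times of 1.0 s and 1.5 s would be synthesized as a single utterance lasting 2.5 s.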

The speech synthesizer according to any one of claims 1 to 5, wherein
the motion execution time acquisition means is configured to acquire, as the motion execution time, a value detected as the time that the robot apparatus required to execute the motion.

The speech synthesizer according to any one of claims 1 to 6, further comprising
speech output means for outputting the synthesized processing target speech such that output of the motion-accompanying speech starts at the same time as the robot apparatus starts executing the motion.
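One possible way to line up the two start times in the claim above, sketched under the assumption that the processing target speech is a sequence of segments with known durations: the motion is triggered at the offset where the motion-accompanying segment begins. The `motion_start_offset` function is hypothetical and is not taken from the claim.

```python
def motion_start_offset(segment_durations_s, motion_index):
    """Return the time, from the start of output, at which the segment
    at motion_index begins; the motion would be triggered at this offset."""
    if not 0 <= motion_index < len(segment_durations_s):
        raise IndexError("motion_index is out of range")
    return sum(segment_durations_s[:motion_index])
```

If the motion-accompanying speech is the first segment, the offset is zero and speech output and motion start together.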

A robot apparatus that performs a predetermined motion, the robot apparatus comprising:
speech synthesis processing execution means for performing speech synthesis processing for synthesizing, based on a processing target character string (a character string to be processed), processing target speech (speech representing the processing target character string); and
motion execution time acquisition means for acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion accompanying a motion-accompanying character string that is at least a part of the processing target character string, wherein
the speech synthesis processing execution means is configured to perform the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that is uttered over the acquired motion execution time and that represents the motion-accompanying character string.

A speech synthesis method that is applied to a robot apparatus that performs a predetermined motion, and that performs speech synthesis processing for synthesizing, based on a processing target character string (a character string to be processed), processing target speech (speech representing the processing target character string), the method comprising:
acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion accompanying a motion-accompanying character string that is at least a part of the processing target character string; and
performing the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that is uttered over the acquired motion execution time and that represents the motion-accompanying character string.

A program for causing a speech synthesizer, which is applied to a robot apparatus that performs a predetermined motion and which is configured to perform speech synthesis processing for synthesizing, based on a processing target character string (a character string to be processed), processing target speech (speech representing the processing target character string), to realize:
motion execution time acquisition means for acquiring a motion execution time, which is the time required for the robot apparatus to execute a motion accompanying a motion-accompanying character string that is at least a part of the processing target character string; and
speech synthesis processing execution means for performing the speech synthesis processing so as to synthesize the processing target speech including motion-accompanying speech, which is speech that is uttered over the acquired motion execution time and that represents the motion-accompanying character string.