JP2009186989A - Voice interactive device and voice interactive program

Info

Publication number: JP2009186989A
Application number: JP2008317700A
Authority: JP
Inventors: Akiko Yamato; 亜紀子大和
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2008-01-10
Filing date: 2008-12-12
Publication date: 2009-08-20
Also published as: WO2009087860A1

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive device and a voice interactive program capable of changing the tone of the output voice when the content of a conversation changes.

SOLUTION: When voice is input from a user (S4: YES), the input voice is analyzed and converted into a character string (S5). A keyword is extracted from the converted character string (S7), and a determined context is decided on the basis of the extracted keyword (S8). When the determined context has changed (S9: YES) and the time measured by a timer has not reached five minutes (S13: NO), it is determined whether the determined context is a semantic context. If it is, a first attribute information storage area is referenced and the attribute of the output voice is changed (S14). A response sentence is determined (S20), synthesized into voice according to the changed attribute (S21), and output from a speaker (S22).

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a voice interactive device and a voice interactive program. More specifically, it relates to a voice interactive device and a voice interactive program that can change the tone of the output voice when the content of a conversation changes.

Conventionally, when a user uses a computer, information is input with a keyboard or a mouse and output by displaying characters and images on a display. User support devices and systems that perform input and output by voice have been proposed so that the user can exchange information in a more familiar environment than such input and output provide (see, for example, Patent Document 1). In the user support device described in Patent Document 1, information is input and output through dialogue between the user support device and the user.
Patent Document 1: JP 2002-163171 A

When people converse, their tone and tempo change as the content of the conversation changes. For example, when the topic shifts from work to a hobby, the serious tone used for work gives way to a cheerful, light tone. However, devices such as the user support device described in Patent Document 1 converse with the user in a fixed voice at a fixed speed. Because the tone of the voice does not change even when the content of the conversation changes, the user may find the dialogue unnatural.

The present invention has been made to solve the above problem, and its object is to provide a voice interactive device and a voice interactive program that can change the tone of the output voice when the content of a conversation changes.

To solve the above problem, the voice interactive device of the invention according to claim 1 comprises: voice input means for inputting voice; conversion means for converting input voice, which is the voice input by the voice input means, into a character string; context storage means for storing conversation contexts in association with keywords; context determination means for extracting, from a converted character string, which is the character string produced by the conversion means, a keyword stored in the context storage means, and for determining the context stored in the context storage means in association with the extracted keyword as the context of the input voice; conversation sentence determination means for determining a conversation sentence according to the input voice; voice output means for outputting voice; attribute storage means for storing an attribute of the voice output by the voice output means; output control means for causing the voice output means to output, as voice with the attribute stored in the attribute storage means, the conversation sentence determined by the conversation sentence determination means; judgment means for judging whether a determined context, which is the context determined by the context determination means, has changed from a previous determined context, which is the context determined the previous time by the context determination means; and attribute changing means for changing the attribute of the voice stored in the attribute storage means when the judgment means judges that the determined context has changed.

In the voice interactive device of the invention according to claim 2, in addition to the configuration of the invention according to claim 1, attribute information storage means is provided for storing, in association with the contexts, voice attribute information relating to the attribute of the voice output by the voice output means. When the determined context changes to a context for which voice attribute information is stored in the attribute information storage means, the attribute changing means changes the attribute of the voice stored in the attribute storage means to the attribute indicated by the voice attribute information corresponding to the determined context.

In the voice interactive device of the invention according to claim 3, in addition to the configuration of the invention according to claim 1 or 2, the data structure of the context storage means is a tree structure that stores a plurality of contexts such that the conversation content becomes more detailed as the hierarchy of the tree proceeds from upper to lower levels. The attribute changing means changes the attribute of the voice when the determined context and the previous determined context are in a parent-child relationship in the tree structure or belong to the same hierarchy.

In the voice interactive device of the invention according to claim 4, in addition to the configuration of the invention according to claim 1 or 2, the data structure of the context storage means is a tree structure that stores a plurality of contexts such that the conversation content becomes more detailed as the hierarchy of the tree proceeds from upper to lower levels. The attribute changing means changes the attribute of the voice when the determined context becomes a context in a predetermined hierarchy of the tree structure.

In the voice interactive device of the invention according to claim 5, in addition to the configuration of the invention according to any one of claims 1 to 4, the attribute changing means changes the attribute of the voice when the determined context changes a predetermined number of times or more within a first predetermined time.

In the voice interactive device of the invention according to claim 6, in addition to the configuration of the invention according to any one of claims 1 to 5, the attribute changing means changes the attribute of the voice when the time during which the determined context does not change is equal to or longer than a second predetermined time.

The voice interactive program of the invention according to claim 7 causes a computer to operate as the various processing means of the voice interactive device according to any one of claims 1 to 6.

In the voice interactive device of the invention according to claim 1, the attribute of the output voice can be changed when the context changes. The attribute of the output voice is therefore changed without the user having to indicate a change of context. Switching the voice in response to a change of context adds variety to the conversation and helps the user enjoy it.

In the voice interactive device of the invention according to claim 2, in addition to the effect of the invention according to claim 1, storing voice attribute information that indicates attributes suited to a context in association with that context makes it possible to output a voice suited to the context, that is, to the content of the conversation. The output voice can thus be switched to a voice suited to the content of the conversation as the context changes, and the user can converse naturally without sensing a mismatch between the content of the conversation and the voice.

In the voice interactive device of the invention according to claim 3, in addition to the effect of the invention according to claim 1 or 2, the voice output from the voice interactive device tells the user whether the conversation is becoming deeper, becoming shallower, or moving between contexts at the same level. The user can therefore converse while keeping track of how the content of the conversation is changing, which helps the user enjoy the conversation.

In the voice interactive device of the invention according to claim 4, in addition to the effect of the invention according to claim 1 or 2, a user conversing with the voice interactive device can tell the hierarchy of the conversation context from the output voice. The user can therefore converse while keeping track of how the content of the conversation is changing, which helps the user enjoy the conversation. For example, if the specific hierarchy is the lowest hierarchy, the user learns that the context cannot become any more detailed. If the specific hierarchy is the highest hierarchy, the user learns that the conversation can be moved to more detailed content. Furthermore, if the tree structure is built so that contexts in a predetermined hierarchy carry some meaning, that meaning can be conveyed to the user through the change in voice attributes.

In the voice interactive device of the invention according to claim 5, in addition to the effect of the invention according to any one of claims 1 to 4, a user conversing with the voice interactive device can tell from the output voice that the context has switched many times within the predetermined time. The user can therefore converse while sensing how the content of the conversation is changing, which helps the user enjoy the conversation.

In the voice interactive device of the invention according to claim 6, in addition to the effect of the invention according to any one of claims 1 to 5, a user conversing with the voice interactive device can tell from the output voice that the same context has continued for the second predetermined time or longer. Because the attribute of the output voice changes even when the context does not, this helps the user enjoy the conversation.

The voice interactive program of the invention according to claim 7 can cause a computer to function as the various processing means of the voice interactive device according to any one of claims 1 to 6. It therefore provides the same effects as the invention according to any one of claims 1 to 6.

Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a hardware block diagram of the voice interactive device 100. The voice interactive device 100 of the present embodiment is a so-called personal computer. As shown in FIG. 1, the voice interactive device 100 is provided with a CPU 10 that controls the device. A RAM 11 that temporarily stores various data and a ROM 12 that stores the BIOS and the like are connected to the CPU 10. In addition, a hard disk device 13, an output control unit 14, an input control unit 15, a voice output control unit 16, a voice input control unit 17, and a timer 18 are connected to the CPU 10 via a bus. An output device 24 is connected to the output control unit 14, and an input device 25 is connected to the input control unit 15. The output device 24 is, for example, a display, and the input device 25 is, for example, a mouse or a keyboard. A speaker 26 is connected to the voice output control unit 16, and a microphone 27 is connected to the voice input control unit 17. The timer 18 measures time.

The hard disk device 13 is provided with at least a context tree storage area 131, an attribute information storage area 132, an acoustic model storage area 133, a voice interactive program storage area 134, and an other information storage area 135. The context tree storage area 131 stores a context tree that shows the relationships among contexts (conversation contents). The attribute information storage area 132 stores information on the voice attributes that are specified while a conversation in a context satisfying a predetermined condition is under way (hereinafter, “voice attribute information”). The acoustic model storage area 133 stores a plurality of acoustic models used to output voice from the speaker 26. The voice interactive program storage area 134 stores the voice interactive program executed by the CPU 10. The other information storage area 135 stores other information used by the voice interactive device 100.

The RAM 11 is provided with a current determined context storage area 111, a previous determined context storage area 112, and an attribute storage area 113. The current determined context storage area 111 stores the context ID of the current determined context (hereinafter, “determined context ID”). The previous determined context storage area 112 stores the context ID of the determined context immediately preceding the current one (hereinafter, “previous determined context ID”). The attribute storage area 113 stores the attributes used when synthesizing the voice output from the speaker 26. The attribute data items are, for example, speed, pitch, acoustic model, and voice quality after filtering.

In the present embodiment, when the voice interactive program is executed in the voice interactive device 100, a voice interaction agent is started. The voice interaction agent displays an image of a character on the output device (display) 24. This character image is the embodiment of the voice interaction agent, and the user converses with the agent as if talking to the character. The user's utterance (voice) is input from the microphone 27. The input voice is converted to text, analyzed, and treated as the user's input sentence. A response sentence corresponding to the input sentence is determined, converted to voice, and output from the speaker 26. During voice output, the character image is also drawn as if it were speaking, giving the user the sense of actually conversing with the character.

Furthermore, the content of the dialogue between the user and the voice interaction agent is determined from keywords in the user's input sentence. This dialogue content is called a “context”. The contexts are represented in a tree structure (see FIG. 3). The voice interaction agent changes the attributes of its output voice according to specific contexts and to how the context moves, and outputs a voice suited to the content of the conversation.

The context tree storage area 131 provided in the HDD 13 will be described with reference to FIGS. 2 and 3. FIG. 2 is a schematic diagram showing the configuration of the context tree storage area 131. FIG. 3 is a schematic diagram of the tree structure of the contexts stored in the context tree storage area 131.

As shown in FIG. 2, the context tree storage area 131 has the data items “context ID”, “context name”, and “keyword”. A context name is given for each context ID. Keywords are also assigned to each context ID. When a keyword appears in the conversation between the user and the voice interaction agent, the context associated with that keyword becomes the “determined context”, that is, the context of the current conversation. The contexts shown in FIG. 2 are only an example.
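The keyword lookup described above can be sketched as follows. This is a minimal illustration, not the implementation in the patent: the keywords in the table are invented, the rule "first matching keyword wins" is an assumption, and keeping the previous context when no keyword matches is also an assumption. The function name `determine_context` is hypothetical.

```python
# Hypothetical rows modeled on the "context ID / context name / keyword" items of FIG. 2.
# The context IDs and names come from the description; the keywords are invented.
CONTEXT_TABLE = [
    {"context_id": "0000", "name": "general", "keywords": []},
    {"context_id": "0100-0000", "name": "music", "keywords": ["music", "song"]},
    {"context_id": "0101-0000", "name": "art", "keywords": ["painting", "sculpture"]},
    {"context_id": "0102-0000", "name": "chat", "keywords": ["weather"]},
]


def determine_context(converted_text, previous_context_id):
    """Return the ID of the first context whose keyword occurs in the text.

    Assumption: if no stored keyword occurs, the previous context is kept.
    Matching is a plain substring test, which is crude for English text.
    """
    text = converted_text.lower()
    for row in CONTEXT_TABLE:
        for keyword in row["keywords"]:
            if keyword in text:
                return row["context_id"]
    return previous_context_id
```

For instance, `determine_context("I like this song", "0000")` selects the "music" context, while an utterance containing none of the keywords leaves the context unchanged.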

The rule for assigning context IDs will now be described. The context ID “0000” is given to the context at the root of the tree structure. A context on a branch is given an ID of four digits plus four digits, such as “0100-0000”. The last four digits, “0000”, are the context ID of the parent (one level up). That is, “0100-0000” is a child (one level down) of the context with context ID “0000”. Hereinafter, the last four digits are called the “parent ID”. In the context tree storage area 131 shown in FIG. 2, the context named “general” with context ID “0000” is the root, as shown in FIG. 3. Connected to the root as its children are the context named “music” with context ID “0100-0000”, the context named “art” with context ID “0101-0000”, and the context named “chat” with context ID “0102-0000”.

Of the first four digits of the context ID, the first two indicate the hierarchy in the tree structure. As shown in FIG. 3, for the root context these two digits are “00”, indicating hierarchy “00”. For the context named “music” with context ID “0100-0000”, the first two digits “01” indicate the first hierarchy. For the context IDs “0200-0100” and “0201-0100”, which are children of the context with context ID “0100-0000”, the first two digits “02” indicate the second hierarchy. The last two of the first four digits are an identification number within the same hierarchy. In the example shown in FIGS. 2 and 3, identification numbers are assigned in order from “00”, followed by “01” and “02”. Hereinafter, the first four digits are called the “own ID”; within the own ID, the first two digits are called the “hierarchy ID” and the last two digits the “identification number”. A context ID therefore has the form “(4-digit own ID)-(4-digit parent ID)”, that is, “(2-digit hierarchy ID)(2-digit identification number)-(4-digit parent ID)”. Because IDs that do not overlap are assigned to the contexts under this rule, each context can be identified by its context ID.
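The ID layout above can be illustrated with a small parser. This is only a sketch: the names `ContextId` and `parse_context_id` are hypothetical, an ASCII hyphen is assumed as the separator, and the root ID "0000" is treated as having no parent part.

```python
from typing import NamedTuple, Optional


class ContextId(NamedTuple):
    own_id: str                 # first four digits
    hierarchy_id: str           # first two digits of the own ID
    identification_number: str  # last two digits of the own ID
    parent_id: Optional[str]    # four digits after the hyphen; None for the root


def parse_context_id(context_id: str) -> ContextId:
    """Split "(hierarchy ID)(identification number)-(parent ID)" into its parts."""
    own_id, separator, parent_id = context_id.partition("-")
    if len(own_id) != 4 or not own_id.isdigit():
        raise ValueError(f"malformed own ID in {context_id!r}")
    if separator and (len(parent_id) != 4 or not parent_id.isdigit()):
        raise ValueError(f"malformed parent ID in {context_id!r}")
    return ContextId(own_id, own_id[:2], own_id[2:], parent_id if separator else None)
```

Parsing "0201-0100" yields own ID "0201", hierarchy ID "02", identification number "01", and parent ID "0100".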

Next, the attribute information storage area 132 provided in the HDD 13 will be described with reference to FIGS. 4 to 6. The attribute information storage area 132 includes a first attribute information storage area 1321, a second attribute information storage area 1322, and a third attribute information storage area 1323. FIG. 4 is a schematic diagram showing the configuration of the first attribute information storage area 1321, FIG. 5 is a schematic diagram showing the configuration of the second attribute information storage area 1322, and FIG. 6 is a schematic diagram showing the configuration of the third attribute information storage area 1323.

First, the first attribute information storage area 1321 will be described with reference to FIG. 4. The first attribute information storage area 1321 stores voice attribute information for changing attributes when a context with a special meaning becomes the determined context. As shown in FIG. 4, the first attribute information storage area 1321 has the data items “meaning”, “context ID”, “first change attribute”, and “second change attribute”. The “first change attribute” and “second change attribute” each have the items “type”, “method”, and “change value”. A context ID is assigned to each meaning, and two kinds of attributes can be set as change attributes. The attribute types include, for example, the speed of the output voice, the type of acoustic model used for voice synthesis, the pitch of the output voice, and the voice quality of the output voice after filtering. The attributes are not limited to these; any attribute that can be passed to the voice synthesis program may be used. Hereinafter, a context specified by a context ID assigned to a meaning is called a “semantic context”.

In the example shown in FIG. 4, the special meanings are “hobby”, “strong field”, “weak field”, and “chat”. The context ID assigned to “hobby” is “0101-0000”. The first change attribute is “speed” with a change value of “1.2”, so the speed of the output voice is changed to 1.2. The second change attribute is “pitch” with the method “higher”, so the pitch of the output voice is raised by a predetermined amount. The predetermined amount is set in advance: for example, if the method is “higher”, the pitch is raised by 0.1 from the current pitch, and if the method is “lower”, the pitch is lowered by 0.1 from the current pitch. For the meaning “strong field”, “voice type” is specified as the first change attribute with the change value “modelC”. This means that the acoustic model named “modelC” is used for voice synthesis. The acoustic models are stored in the acoustic model storage area 133 of the HDD 13. The example shown in FIG. 4 is only an example; other meanings may be set, and a plurality of contexts may be assigned to one meaning. The voice attribute information is likewise not limited to the information shown in FIG. 4.
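The way a semantic context updates the stored attributes can be sketched as follows. This is an illustration under stated assumptions: the initial attribute values, the field names, and the function names are hypothetical. Only the "hobby" entry (context ID "0101-0000", speed set to 1.2, pitch raised by the predetermined 0.1) is taken from the description.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceAttributes:
    # Initial values are assumptions; the description does not state them.
    speed: float = 1.0
    pitch: float = 1.0
    acoustic_model: str = "modelA"
    voice_quality: float = 1.0


STEP = 0.1  # predetermined amount for the "higher" / "lower" methods

# (type, method, change value) pairs per semantic context, after FIG. 4.
SEMANTIC_CHANGES = {
    "0101-0000": [("speed", None, 1.2), ("pitch", "higher", None)],  # "hobby"
}


def apply_change(attrs, kind, method, value):
    """Apply one change attribute: a relative step or an absolute change value."""
    if method == "higher":
        return replace(attrs, **{kind: getattr(attrs, kind) + STEP})
    if method == "lower":
        return replace(attrs, **{kind: getattr(attrs, kind) - STEP})
    return replace(attrs, **{kind: value})


def change_for_semantic_context(attrs, context_id):
    """Return updated attributes; unchanged if the context is not a semantic context."""
    for kind, method, value in SEMANTIC_CHANGES.get(context_id, []):
        attrs = apply_change(attrs, kind, method, value)
    return attrs
```

An entry such as `("acoustic_model", None, "modelC")` would express the "strong field" row in the same way, once its context ID is known.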

Next, the second attribute information storage area 1322 will be described with reference to FIG. 5. The second attribute information storage area 1322 stores voice attribute information for changing attributes when a context in a specific hierarchy of the context tree becomes the determined context. Hereinafter, a context belonging to a specific hierarchy is called a “specific hierarchy context”. As shown in FIG. 5, the second attribute information storage area 1322 has the data items “hierarchy” and “first change attribute”. The “first change attribute” has the items “type”, “method”, and “change value”. A first change attribute is assigned to each hierarchy, so one voice attribute can be set as the change attribute.

In the example shown in FIG. 5, the specific hierarchies are “highest layer”, “second hierarchy”, and “lowest layer”. If the determined context is in the highest layer of the context tree, that is, if the context ID is “0000”, an instruction is given to change all attributes to their initial values. If the determined context is in the second hierarchy, that is, if the context ID is “02**-****” (where * is any digit), an instruction is given to set the pitch to “0.6”. If the determined context is in the lowest layer of the context tree, that is, if the context ID is “04**-****” in the example shown in FIGS. 2 and 3, an instruction is given to set the voice quality to “0.4”. The change instructions shown in FIG. 5 are only an example; change instructions may be set for other hierarchies, and the changes themselves may differ.
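The hierarchy-based lookup can be sketched by reading the first two digits of the context ID. This is a minimal sketch: the function name and the tuple format of the returned instruction are hypothetical, and the lowest hierarchy "04" applies only to the example tree of FIGS. 2 and 3.

```python
LOWEST_HIERARCHY_ID = "04"  # lowest layer in the example of FIGS. 2 and 3


def change_for_hierarchy(context_id):
    """Return a (type, method, change value) instruction for a specific hierarchy context.

    Returns None when the context is not in one of the specific hierarchies.
    """
    own_id = context_id.split("-")[0]
    hierarchy_id = own_id[:2]
    if context_id == "0000":
        return ("all", "reset", None)         # restore every attribute to its initial value
    if hierarchy_id == "02":
        return ("pitch", "set", 0.6)          # second hierarchy
    if hierarchy_id == LOWEST_HIERARCHY_ID:
        return ("voice_quality", "set", 0.4)  # lowest layer
    return None
```

A first-hierarchy context such as "0100-0000" matches none of the rows, so no hierarchy-based change is requested for it.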

Next, the third attribute information storage area 1323 will be described with reference to FIG. 6. As described in detail later, when the determined context changes, the voice interactive device 100 judges how the determined context has moved within the context tree. The third attribute information storage area 1323 stores voice attribute information for changing attributes when the move of the determined context is a specific position change. As shown in FIG. 6, the third attribute information storage area 1323 has the data items “position change” and “first change attribute”. The “first change attribute” has the items “type”, “method”, and “change value”. A first change attribute is assigned to each position change, so one voice attribute can be set as the change attribute.

In the example shown in FIG. 6, the position changes provided are "move to neighbor (smaller ID)", "move to neighbor (larger ID)", "move up one level", "move down one level", "move up two levels", and "move down two levels". "Move to neighbor (smaller ID)" indicates a move to the neighboring context on the same level of the context tree whose identification number is smaller by one. That is, a move falls under "move to neighbor (smaller ID)" when the level IDs of the determined context before the move and the determined context after the move are equal and "identification number after the move = identification number before the move - 1" holds. "Move to neighbor (larger ID)" indicates a move to the neighboring context on the same level of the context tree whose identification number is larger by one. That is, a move falls under "move to neighbor (larger ID)" when the level IDs of the determined context before the move and the determined context after the move are equal and "identification number after the move = identification number before the move + 1" holds.

"Move up one level" indicates a move to a context one level higher in the context tree. A move falls under "move up one level" when the parent ID of the determined context before the move equals the own ID of the determined context after the move. "Move down one level" indicates a move to a context one level lower in the context tree. A move falls under "move down one level" when the own ID of the determined context before the move equals the parent ID of the determined context after the move. "Move up two levels" indicates a move to a context two levels higher in the context tree. A move falls under "move up two levels" when the parent ID of the context identified by the parent ID of the determined context before the move equals the own ID of the determined context after the move. "Move down two levels" indicates a move to a context two levels lower in the context tree. A move falls under "move down two levels" when the parent ID of the context identified by the parent ID of the determined context after the move equals the own ID of the determined context before the move. In other words, the voice attribute is changed when, viewed from the determined context before the move, the determined context after the move is its parent, its parent's parent, its child, or its child's child (in this embodiment these relationships are collectively called a "parent-child relationship").
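The six position changes can be sketched as a single classification function. This is a minimal Python sketch under assumptions: a context ID is taken to be "own ID-parent ID", with the own ID made of a two-digit level ID followed by a two-digit identification number (inferred from the example IDs in this description). `parent_of` is a hypothetical lookup table from own ID to parent ID, needed for the two-level cases.

```python
from typing import Dict, Optional, Tuple


def split_id(context_id: str) -> Tuple[str, Optional[str]]:
    """Split 'own-parent' into (own ID, parent ID); the root '0000' has no parent."""
    own_id, _, parent_id = context_id.partition("-")
    return own_id, (parent_id or None)


def classify_move(before: str, after: str,
                  parent_of: Dict[str, Optional[str]]) -> Optional[str]:
    """Name the position change from `before` to `after`, or None if none applies."""
    b_own, b_parent = split_id(before)
    a_own, a_parent = split_id(after)
    # Same level ID and identification numbers differing by one: a neighbor move.
    if b_own[:2] == a_own[:2]:
        if int(a_own[2:]) == int(b_own[2:]) - 1:
            return "move to neighbor (smaller ID)"
        if int(a_own[2:]) == int(b_own[2:]) + 1:
            return "move to neighbor (larger ID)"
    if b_parent == a_own:
        return "move up one level"
    if a_parent == b_own:
        return "move down one level"
    if b_parent is not None and parent_of.get(b_parent) == a_own:
        return "move up two levels"
    if a_parent is not None and parent_of.get(a_parent) == b_own:
        return "move down two levels"
    return None


# Lookup table built from the context IDs that appear in this description.
CONTEXT_IDS = ["0000", "0101-0000", "0202-0101", "0304-0202",
               "0400-0304", "0401-0304"]
PARENT_OF = dict(split_id(cid) for cid in CONTEXT_IDS)
```

For instance, `classify_move("0202-0101", "0304-0202", PARENT_OF)` reports a move down one level, since the own ID "0202" before the move equals the parent ID after it.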

Next, with reference to FIGS. 7 to 9, the operation of the voice dialogue device 100 when the voice dialogue agent is started will be described, focusing on how the voice attributes are changed. FIG. 7 is a flowchart showing the operation of the voice dialogue device 100. FIG. 8 is a flowchart of the first process executed within the main process. FIG. 9 is a flowchart of the second process executed within the main process. The main process shown in FIG. 7 is executed by the CPU 10 in accordance with the voice dialogue program stored in the hard disk device 13. First, the initial determined context and the initial voice attributes are set (S1). Both are predetermined. The initial context ID is stored in the current determined context storage area 111 of the RAM 11, and the initial voice attributes are stored in the attribute storage area 113 of the RAM 11. In the example shown in FIGS. 2 and 3, the context ID "0000", for example, is used as the initial determined context.

Next, the counter C, which counts the number of times the determined context has changed, is initialized to its initial value "0" (S2). The timer 18, which measures the time used as the reference for changing the voice attributes, is reset and time measurement is started (S3). Whether the user has input voice is then determined according to whether sound has been input from the microphone 27 (S4). If there is no voice input from the user (S4: NO), the check for input is repeated (S4), and the device waits for input from the user.

If there is voice input from the user (S4: YES), the input voice is analyzed by a well-known voice analysis technique and converted into text (S5). Whether an instruction to end the voice dialogue agent has been given is determined according to whether the resulting character string is a phrase indicating the end of the voice dialogue agent (S6). The phrases indicating the end of the voice dialogue agent are registered in advance; examples are "I'm finishing" (終わるよ), "bye-bye" (バイバイ), "goodbye" (さよなら), "see you" (じゃあね), "the end" (終わり), and "good night" (おやすみ). If the resulting character string is not an end instruction (S6: NO), a keyword is extracted from the character string (S7). Specifically, the character string is broken down into parts of speech, and it is determined whether the resulting words include a keyword. If the words include words registered under "keyword" in the context tree storage area 131, the keyword that appears earliest in the character string is used as the keyword for determining the context. The determined context is then decided on the basis of the extracted keyword (S8). Specifically, the context ID associated with the extracted keyword becomes the context ID of the determined context. The context ID stored in the current determined context storage area 111 is stored in the previous determined context storage area 112, and the context ID associated with the keyword is stored in the current determined context storage area 111.
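Steps S6 to S8 can be sketched as follows. This is a hedged Python sketch, not the device's implementation: substring search stands in for the part-of-speech decomposition, the keyword table contains only the keywords quoted in the FIG. 10 dialogue, and leaving the context untouched when no keyword is found is an assumption, since the text does not describe that case. The names `ContextState`, `first_keyword`, and `determine_context` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Keyword -> context ID pairs that appear in the FIG. 10 dialogue.
KEYWORDS = {
    "こんにちは": "0000",
    "展覧会": "0101-0000",
    "日本画": "0202-0101",
    "狩野派": "0304-0202",
    "狩野永徳": "0400-0304",
    "洛中洛外図屏風": "0401-0304",
}
END_PHRASES = ("終わるよ", "バイバイ", "さよなら", "じゃあね", "終わり", "おやすみ")


@dataclass
class ContextState:
    current: str = "0000"   # current determined context storage area 111
    previous: str = "0000"  # previous determined context storage area 112


def is_end_instruction(text: str) -> bool:
    """S6: does the recognized text contain a registered end phrase?"""
    return any(phrase in text for phrase in END_PHRASES)


def first_keyword(text: str) -> Optional[str]:
    """S7: return the registered keyword that appears earliest in the text."""
    hits = [(text.find(k), k) for k in KEYWORDS if k in text]
    return min(hits)[1] if hits else None


def determine_context(state: ContextState, text: str) -> bool:
    """S8 and S9: update areas 111 and 112; return True if the context changed."""
    keyword = first_keyword(text)
    if keyword is None:
        return False  # assumption: with no keyword, the context is left as it is
    state.previous = state.current
    state.current = KEYWORDS[keyword]
    return state.current != state.previous
```

With the input of dialogue number 5, which contains both 狩野派 and 展覧会, `first_keyword` picks 狩野派 because it appears earlier in the string.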

Next, it is determined whether the determined context has changed (S9). If the context ID stored in the previous determined context storage area 112 and the context ID stored in the current determined context storage area 111 are the same, it is determined that the determined context has not changed (S9: NO). The first process is then performed (S10).

When the first process shown in FIG. 8 is started, it is determined whether five minutes or more have elapsed on the timer 18 (S31). If five minutes have not elapsed (S31: NO), the process returns to the main process. If five minutes or more have elapsed (S31: YES), it is determined whether the value of the counter C is "0" (S32). If it is "0", that is, if the determined context has not changed for five minutes or more (S32: YES), the value of "pitch", one of the attributes, is multiplied by 0.8 (S33). The timer 18 is reset, time measurement is started (S34), and the process returns to the main process.

If the value of the counter C is not "0" (S32: NO), it is determined whether the value of the counter C is "5" or more (S35). If it is not "5" or more (S35: NO), the timer 18 is reset, time measurement is started (S34), and the process returns to the main process. If the value of the counter C is "5" or more, that is, if the determined context has changed at least five times within the five minutes (S35: YES), all voice attributes are changed to their initial values (S36). The value of the counter C is initialized to "0" (S37). The timer 18 is reset, time measurement is started (S34), and the process returns to the main process.
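The branching of the first process (S31 to S37) can be written compactly. The following Python sketch is illustrative only: the `State` class, the use of seconds for the timer, and passing the current time as the parameter `now` are assumptions made so the logic can be exercised without a real timer.

```python
from dataclasses import dataclass
from typing import Dict

FIVE_MINUTES = 300.0  # seconds


@dataclass
class State:
    attrs: Dict[str, object]    # attribute storage area 113
    initial: Dict[str, object]  # predetermined initial attribute values
    counter: int = 0            # counter C: number of context changes
    timer_start: float = 0.0    # time at which timer 18 was last reset


def first_process(state: State, now: float) -> None:
    """Run when the determined context did not change (S9: NO)."""
    if now - state.timer_start < FIVE_MINUTES:  # S31: NO
        return
    if state.counter == 0:                      # S32: YES, no change for 5 minutes
        state.attrs["pitch"] *= 0.8             # S33
    elif state.counter >= 5:                    # S35: YES, 5 or more changes
        state.attrs = dict(state.initial)       # S36
        state.counter = 0                       # S37
    state.timer_start = now                     # S34, reached on every S31: YES path
```

Note that, following the flowchart as described, a counter value from 1 to 4 only resets the timer and is itself left unchanged.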

When the process returns to the main process shown in FIG. 7, a response sentence replying to the text converted from the user's voice input is determined (S20). The response sentence is determined by a well-known dialogue technique on the basis of predetermined rules. Which response sentence is chosen is not important here, so its description is omitted. The response sentence determined in S20 is synthesized into speech by a well-known speech synthesis technique on the basis of the attributes stored in the attribute storage area 113 of the RAM 11 (S21), and is output from the speaker 26 (S22). The process then returns to S4 and waits for input from the user (S4).

If the determined context has changed (S9: YES), "1" is added to the value of the counter C, which counts the number of times the determined context has changed (S12). It is then determined whether five minutes or more have elapsed on the timer 18 (S13). If five minutes have not elapsed (S13: NO), the second process is performed (S14).

When the second process shown in FIG. 9 is started, it is first determined whether the determined context is a semantic context (S38). If the context ID of the determined context is stored under "context ID" in the first attribute information storage area 1321 (see FIG. 4), the determined context is judged to be a semantic context (S38: YES). The attributes of the output voice are then changed (S41). Specifically, the "first change attribute" and "second change attribute" in the first attribute information storage area 1321 are referenced, and in the attribute storage area 113 the attribute designated by "type" is changed according to the designation in "method" or "change value". For example, if the determined context ID is "0101-0000", "speed" is set to "1.2" and "0.1" is added to the value of "pitch". The process then returns to the main process.

If the determined context is not a semantic context (S38: NO), it is determined whether the determined context is a specific-level context (S39). If the determined context ID belongs to a level designated under "level" in the second attribute information storage area 1322 (see FIG. 5), the determined context is judged to be a specific-level context (S39: YES). In the example shown in FIG. 5, the determined context is judged to be a specific-level context when the own ID of the determined context ID is "0000" (top level), when the level ID of the determined context ID is "02" (second level), or when the level ID of the determined context ID is "04" (lowest level). In this case, in the attribute storage area 113, the attribute designated by "type" of the "first change attribute" in the second attribute information storage area 1322 is changed according to the designation in "method" or "change value" (S42). For example, if the level ID is "02", "pitch" is set to "0.6". The process then returns to the main process.

If the determined context is not a specific-level context (S39: NO), it is determined whether the move of the determined context is a predetermined position change (S40). The determined context ID is compared with the previous determined context ID, and if the move is one designated under "position change" in the third attribute information storage area 1323 (see FIG. 6), it is judged to be a predetermined position change (S40: YES). In the example shown in FIG. 6, for instance, the position change is judged to be "move up one level" when the parent ID of the determined context before the move equals the own ID of the determined context after the move. In this case, in the attribute storage area 113, the attribute designated by "type" of the "first change attribute" in the third attribute information storage area 1323 is changed according to the designation in "method" or "change value" (S43). The process then returns to the main process.
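The priority order of the second process (semantic context, then specific level, then position change) can be sketched as below. This Python sketch is an illustration under assumptions: the tables hold only the entries quoted in this description, the initial values for the acoustic model ("modelA") and the voice quality (1.0) are placeholders not given in the text, and the position change is passed in as an already classified label `move` rather than computed here.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, str, object]  # (type, method, change value)

# Placeholder initial values; only pitch 1.0 and speed 1.0 come from the text.
INITIAL = {"acoustic_model": "modelA", "pitch": 1.0, "speed": 1.0, "voice_quality": 1.0}

SEMANTIC: Dict[str, List[Change]] = {  # FIG. 4 entries quoted in the text
    "0101-0000": [("speed", "set", 1.2), ("pitch", "add", 0.1)],
    "0202-0101": [("acoustic_model", "set", "modelC")],
}
LEVELS: Dict[str, List[Change]] = {    # FIG. 5 entries, keyed by level ID
    "02": [("pitch", "set", 0.6)],
    "04": [("voice_quality", "set", 0.4)],
}
MOVES: Dict[str, List[Change]] = {     # the one FIG. 6 entry the text quotes
    "move down one level": [("speed", "add", 0.1)],
}


def apply_changes(attrs: Dict[str, object], changes: List[Change]) -> None:
    for kind, method, value in changes:
        if method == "set":
            attrs[kind] = value
        elif method == "add":
            attrs[kind] = round(attrs[kind] + value, 10)  # tame float noise


def second_process(attrs: Dict[str, object], context_id: str,
                   move: Optional[str]) -> Optional[str]:
    """Apply at most one rule set; return which step fired (S41, S42, S43) or None."""
    own_id = context_id.split("-")[0]
    if context_id in SEMANTIC:              # S38: YES
        apply_changes(attrs, SEMANTIC[context_id])
        return "S41"
    if own_id == "0000":                    # S39: YES, top level resets everything
        attrs.update(INITIAL)
        return "S42"
    if own_id[:2] in LEVELS:                # S39: YES, second or lowest level
        apply_changes(attrs, LEVELS[own_id[:2]])
        return "S42"
    if move in MOVES:                       # S40: YES
        apply_changes(attrs, MOVES[move])
        return "S43"
    return None
```

Because the semantic check comes first, a context such as "0202-0101" changes the acoustic model and never reaches the second-level pitch rule, which matches the handling of dialogue number 4 in FIG. 10.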

When the process returns to the main process shown in FIG. 7, a response sentence replying to the text converted from the user's voice input is determined (S20). The response sentence is synthesized into speech by a well-known speech synthesis technique on the basis of the changed attributes stored in the attribute storage area 113 (S21), and is output from the speaker 26 (S22). The process then returns to S4 and waits for input from the user (S4).

If the determined context has changed (S9: YES) and five minutes or more have elapsed on the timer 18 (S13: YES), it is determined whether the value of the counter C is "5" or more (S15). If the value of the counter C is not "5" or more (S15: NO), the timer 18 is reset, time measurement is started (S16), and the second process is performed (S14).

If the value of the counter C is "5" or more (S15: YES), all voice attributes are changed to their initial values (S17). The value of the counter C is initialized to "0" (S18). The timer 18 is reset and time measurement is started (S19). A response sentence is determined (S20), the determined response sentence is synthesized into speech (S21), and it is output from the speaker 26 (S22). The process then returns to S4 and waits for input from the user (S4).
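The branch taken when the determined context has changed (S12 to S19) can be sketched as follows. As before, this Python sketch is only illustrative: the `State` class, the seconds-based timer, and the `second_process` callback parameter are hypothetical stand-ins for the storage areas, the timer 18, and the FIG. 9 routine.

```python
from dataclasses import dataclass
from typing import Callable, Dict

FIVE_MINUTES = 300.0  # seconds


@dataclass
class State:
    attrs: Dict[str, object]    # attribute storage area 113
    initial: Dict[str, object]  # predetermined initial attribute values
    counter: int = 0            # counter C
    timer_start: float = 0.0    # time at which timer 18 was last reset


def on_context_changed(state: State, now: float,
                       second_process: Callable[[State], None]) -> None:
    """Main-process branch for S9: YES."""
    state.counter += 1                          # S12
    if now - state.timer_start < FIVE_MINUTES:  # S13: NO
        second_process(state)                   # S14
        return
    if state.counter >= 5:                      # S15: YES
        state.attrs = dict(state.initial)       # S17
        state.counter = 0                       # S18
        state.timer_start = now                 # S19
        return                                  # second process is skipped
    state.timer_start = now                     # S16
    second_process(state)                       # S14
```

The design point worth noting is that a reset to the initial values (S17) bypasses the second process entirely, so frequent topic changes override any per-context rule for that turn.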

The dialogue between the user and the voice dialogue agent proceeds as the processing of S4 to S22 is repeated. When the context changes, the attributes of the output voice are changed if the new context is a semantic context or a specific-level context, or if a predetermined position change has occurred. The attributes are also changed if the determined context has remained unchanged for a predetermined time or longer, and if the determined context has changed a predetermined number of times or more within the predetermined time. The response sentence output by the voice dialogue agent is converted into speech on the basis of the changed attributes stored in the attribute storage area 113, and the speech is output from the speaker 26. When the user inputs a phrase that instructs termination, this process ends.

Hereinafter, with reference to FIG. 10, a dialogue between the user and the voice dialogue agent in the example shown in FIGS. 2 to 6 will be described using a concrete example. FIG. 10 is a diagram showing an example of a dialogue between the user and the voice dialogue agent. In FIG. 10, "dialogue number" is a number assigned to each pair of an input sentence from the user and a response sentence from the voice dialogue agent. "Input sentence from user" is the sentence obtained by converting the voice input from the microphone 27 into text. "Keyword" is the keyword extracted from the input sentence. "Context" is the determined context decided from the keyword. Under "attribute", "acoustic model", "pitch", "speed", and "voice quality" are given as examples of voice attributes. "Agent response sentence" is the response sentence output by the voice dialogue agent in reply to the input sentence. In the following example, the entire dialogue takes place within five minutes.

First, the determined context ID of the initial determined context is set to "0000", and the initial attribute values are stored in the attribute storage area 113 of the RAM 11 (S1). For the input sentence of dialogue number 1, "Hello" (こんにちは), "hello" (こんにちは) is extracted as the keyword (S7). Since "hello" is associated with the context named "general", whose context ID is "0000" (see FIG. 2), the determined context ID is set to "0000" (S8). The previous determined context is also "0000", so the context has not changed (S9: NO). In this case, since the time measured by the timer 18 is less than five minutes (S31: NO), the attributes remain at their initial values. The response sentence "Hello, have you been out anywhere recently?" is determined (S20), speech synthesis is performed with the initial attribute values (S21), and the response sentence is output (S22).

Next, the user speaks again, and the input sentence of dialogue number 2, "Let me see, I went to an exhibition" (そうだなぁ、展覧会へ行ったよ), is input (S4: YES). The input voice is converted into text (S5), and "exhibition" (展覧会) is extracted as the keyword (S7). Since "exhibition" is associated with the context named "art", whose context ID is "0101-0000" (see FIG. 2), the determined context ID is set to "0101-0000" (S8). The previous determined context is "0000", so the context has changed (S9: YES). The time measured by the timer 18 is less than five minutes (S13: NO), and the determined context is a semantic context (S38: YES). Since "0101-0000" is the context ID of the semantic context with the meaning "hobby" (see FIG. 4), "0.1" is added to the initial pitch value "1.0" to give "1.1", and the speed is changed from the initial value "1.0" to the change value "1.2" (S41). The response sentence "Oh, an exhibition. Do you look at paintings, sculptures, and so on?" is determined (S20), speech synthesis is performed with the changed voice attributes (S21), and the response sentence is output (S22).

Next, the input sentence of dialogue number 3, "This time it was a painting exhibition" (今回は絵の展覧会だったよ), is input (S4: YES). "Exhibition" (展覧会) is extracted as the keyword (S7). Since "exhibition" is associated with the context named "art", whose context ID is "0101-0000", the determined context ID is set to "0101-0000" (S8). The previous determined context ID is also "0101-0000", so the determined context has not changed (S9: NO). The response sentence "What kind of paintings?" is determined (S20), speech synthesis is performed with the same attributes as before (S21), and the response sentence is output (S22).

Next, the input sentence of dialogue number 4, "Japanese paintings" (日本画だよ), is input (S4: YES). "Japanese painting" (日本画) is extracted as the keyword (S7). Since "Japanese painting" is associated with the context named "Japanese painting", whose context ID is "0202-0101" (see FIG. 2), the determined context ID is set to "0202-0101" (S8). The previous determined context is "0101-0000", so the context has changed (S9: YES). The time measured by the timer 18 is less than five minutes (S13: NO), and the determined context is a semantic context (S38: YES). Since the context ID "0202-0101" is a semantic context with the meaning "specialty" (see FIG. 4), the acoustic model is changed to "modelC" (S41). The response sentence "Oh, Japanese paintings. Old ones, or modern Japanese paintings?" is determined (S20), speech synthesis is performed with the changed attributes (S21), and the response sentence is output (S22).

Next, the input sentence of dialogue number 5, "Old ones. It was an exhibition of the Kano school." (昔のだね。狩野派の展覧会だったよ。), is input (S4: YES). "Kano school" (狩野派) is extracted as the keyword (S7). Since "Kano school" is associated with the context named "Kano school", whose context ID is "0304-0202" (see FIG. 2), the determined context ID is set to "0304-0202" (S8). The previous determined context is "0202-0101", so the context has changed (S9: YES). The time measured by the timer 18 is less than five minutes (S13: NO). The determined context is neither a semantic context nor a specific-level context (S38: NO, S39: NO), but it has moved one level down from the previous determined context (S40: YES). The speed is therefore set to "1.3" by adding "0.1" to the stored value "1.2" (S43). The response sentence "What Kano school works were there?" is determined (S20), speech synthesis is performed with the changed attributes (S21), and the response sentence is output (S22).

Next, the input sentence of dialogue number 6, "Mainly works by someone called Kano Eitoku were on display." (狩野永徳っていう人の作品がメインに展示されていたよ。), is input (S4: YES). "Kano Eitoku" (狩野永徳) is extracted as the keyword (S7). Since "Kano Eitoku" is associated with the context named "painter", whose context ID is "0400-0304" (see FIG. 2), the determined context ID is set to "0400-0304" (S8). The previous determined context is "0304-0202", so the context has changed (S9: YES). The time measured by the timer 18 is less than five minutes (S13: NO), and the determined context is a specific-level context (lowest level) (S39: YES). The voice quality is therefore changed to "0.4" (S42). The response sentence "What is Kano Eitoku's best-known work?" is determined (S20), speech synthesis is performed with the changed attributes (S21), and the response sentence is output (S22).

Next, the input sentence of dialogue number 7, "Probably the Rakuchu-rakugai-zu folding screens, a national treasure" (国宝の洛中洛外図屏風かなぁ), is input (S4: YES). "Rakuchu-rakugai-zu folding screens" (洛中洛外図屏風) is extracted as the keyword (S7). Since this keyword is associated with the context named "work", whose context ID is "0401-0304" (see FIG. 2), the determined context ID is set to "0401-0304" (S8). The previous determined context is "0400-0304", so the context has changed (S9: YES). The time measured by the timer 18 is less than five minutes (S13: NO), and the determined context is a specific-level context (lowest level) (S39: YES). The voice quality is therefore set to "0.4" (S42). The response sentence "Where is that painting kept?" is determined (S20), speech synthesis is performed with the changed attributes (S21), and the response sentence is output (S22).

Next, the input sentence of dialogue number 8, "Where was it again? I've forgotten. Bye-bye." (どこだったかな、わすれちゃった。バイバイ。), is input (S4: YES). "Bye-bye" (バイバイ) is an end instruction (S6: YES), and the dialogue between the user and the voice dialogue agent ends.
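The attribute values traced through dialogue numbers 1 to 8 can be reproduced with a small replay. The Python sketch below is a simplification under stated assumptions: the timer checks are omitted because the whole dialogue takes place within five minutes, only the rules quoted in this description are included, the initial acoustic model ("modelA") and voice quality (1.0) are placeholders, and substring search replaces part-of-speech decomposition. All function and table names are hypothetical.

```python
KEYWORDS = {"こんにちは": "0000", "展覧会": "0101-0000", "日本画": "0202-0101",
            "狩野派": "0304-0202", "狩野永徳": "0400-0304",
            "洛中洛外図屏風": "0401-0304"}
END_PHRASES = ("終わるよ", "バイバイ", "さよなら", "じゃあね", "終わり", "おやすみ")
INITIAL = {"acoustic_model": "modelA", "pitch": 1.0, "speed": 1.0, "voice_quality": 1.0}
SEMANTIC = {"0101-0000": [("speed", "set", 1.2), ("pitch", "add", 0.1)],
            "0202-0101": [("acoustic_model", "set", "modelC")]}
LEVELS = {"02": [("pitch", "set", 0.6)], "04": [("voice_quality", "set", 0.4)]}
DOWN_ONE_LEVEL = [("speed", "add", 0.1)]  # the one FIG. 6 entry the text quotes

DIALOGUE = ["こんにちは", "そうだなぁ、展覧会へ行ったよ", "今回は絵の展覧会だったよ",
            "日本画だよ", "昔のだね。狩野派の展覧会だったよ。",
            "狩野永徳っていう人の作品がメインに展示されていたよ。",
            "国宝の洛中洛外図屏風かなぁ", "どこだったかな、わすれちゃった。バイバイ。"]


def apply_changes(attrs, changes):
    for kind, method, value in changes:
        attrs[kind] = value if method == "set" else round(attrs[kind] + value, 10)


def second_process(attrs, before, after):
    if after in SEMANTIC:                                    # S38: YES
        apply_changes(attrs, SEMANTIC[after])
    elif after == "0000":                                    # S39: YES (top level)
        attrs.update(INITIAL)
    elif after[:2] in LEVELS:                                # S39: YES
        apply_changes(attrs, LEVELS[after[:2]])
    elif after.partition("-")[2] == before.partition("-")[0]:  # S40: down one level
        apply_changes(attrs, DOWN_ONE_LEVEL)


def replay(utterances):
    """Return (context ID, attribute snapshot) for each turn before the end phrase."""
    attrs, current, trace = dict(INITIAL), "0000", []
    for text in utterances:
        if any(p in text for p in END_PHRASES):              # S6: YES
            break
        hits = [(text.find(k), k) for k in KEYWORDS if k in text]
        if hits:
            previous, current = current, KEYWORDS[min(hits)[1]]
            if current != previous:                          # S9: YES
                second_process(attrs, previous, current)
        trace.append((current, dict(attrs)))
    return trace
```

Calling `replay(DIALOGUE)` yields seven entries, one per answered turn; the final snapshot has pitch 1.1, speed 1.3, acoustic model "modelC", and voice quality 0.4, in line with the walk-through above.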

As described above, the output voice of the voice dialogue agent can be changed according to the content (context) of the conversation between the user and the voice dialogue agent. Because the agent's output voice matches the context, a natural dialogue can take place.

If voice attribute information indicating attributes suited to a context is stored in association with that context, a voice suited to the context, that is, to the content of the conversation, can be output. The output voice can therefore be switched to one suited to the content of the conversation as the context changes, and the user can converse naturally without sensing a mismatch between the content of the conversation and the voice.

A user conversing with the voice dialogue device 100 can tell the level of the conversation's context from the output voice. The user can thus converse while keeping track of how the content of the conversation is changing, which helps make the conversation enjoyable. For example, if the specific level is the lowest level, the user can tell that the context cannot become any more detailed. If the specific level is the top level, the user can tell that the conversation can be moved on to more detailed content. Furthermore, if the tree structure is built so that the contexts on a given level carry some meaning, that meaning can be conveyed to the user through the change in voice attributes.

From the voice output by the voice dialogue device 100, the user can tell whether the content of the conversation is becoming deeper, becoming shallower, or shifting between contexts on the same level. The user can thus converse while keeping track of how the content of the conversation is changing, which helps make the conversation enjoyable.

A user conversing with the voice interactive apparatus 100 can tell from the output voice that the context has switched many times within a predetermined time. The user can therefore sense how the content of the conversation is changing while talking, which helps make the conversation enjoyable.
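Detecting that the context has switched a given number of times within a predetermined time can be sketched with a sliding window of timestamps. The class name `SwitchCounter`, the window length, and the threshold are assumptions chosen for illustration; the embodiment's actual timer handling may differ.

```python
from collections import deque


class SwitchCounter:
    """Counts context switches that fall inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds  # first predetermined time (assumed value)
        self.threshold = threshold            # predetermined number of switches (assumed value)
        self._times = deque()

    def record_switch(self, now):
        """Record a switch at time `now` (seconds) and report whether the
        number of switches inside the window has reached the threshold."""
        self._times.append(now)
        # Discard switches that are older than the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


counter = SwitchCounter(window_seconds=300, threshold=3)
for t in (0, 100, 200):
    reached = counter.record_switch(t)
print(reached)
```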

A user conversing with the voice interactive apparatus 100 can tell from the output voice that the same context has continued for a predetermined time or longer. Because the attributes of the output voice change even when the context does not, this also helps make the conversation enjoyable.
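The condition that the same context has continued for a predetermined time or longer reduces to a comparison of elapsed time. The following minimal sketch assumes times are given in seconds; the function name `context_unchanged_too_long` and the example value are hypothetical.

```python
def context_unchanged_too_long(last_change_time, now, second_predetermined_time):
    """Return True when the context has stayed the same for at least the
    second predetermined time (all values in seconds)."""
    return now - last_change_time >= second_predetermined_time


# Example: the context last changed at t=0 and the limit is 600 seconds.
print(context_unchanged_too_long(0, 600, 600))
```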

In the above embodiment, the context tree storage area 131 of the HDD 13 corresponds to the "context storage means". The attribute storage area 113 of the RAM 11 corresponds to the "attribute storage means". The attribute information storage area 132 of the HDD 13 corresponds to the "attribute information storage means". The microphone 27 corresponds to the "voice input means", and the speaker 26 corresponds to the "voice output means". The CPU 10 that converts the input voice into characters in S5 of the flowchart shown in FIG. 7 corresponds to the "conversion means". The CPU 10 that extracts a keyword from the character string of the input voice in S7 and determines the context based on the keyword in S8 corresponds to the "context determining means". The CPU 10 that determines the response sentence in S20 corresponds to the "conversation sentence determining means". The CPU 10 that determines in S9 whether the determined context has changed corresponds to the "determination means". The CPU 10 that, in S21 and S22, performs voice synthesis based on the voice attributes stored in the attribute storage area 113 and causes the speaker 26 to output the voice corresponds to the "output control means".
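The sequence of steps named above (S7, S8, S9, S20, S21/S22) can be sketched as a single dialogue turn operating on already-recognized text. This is a simplified sketch under stated assumptions: speech recognition (S5) and voice synthesis are omitted, the response sentence is a placeholder, the attribute is changed on every context change, and the previous context is kept when no keyword is found. The names `extract_keywords`, `dialogue_turn`, and the sample tables are hypothetical.

```python
def extract_keywords(text, keyword_to_context):
    """S7: return the registered keywords found in the text, in order of appearance."""
    found = [(text.find(k), k) for k in keyword_to_context if k in text]
    return [k for _, k in sorted(found)]


def dialogue_turn(text, state, keyword_to_context, attribute_info):
    """Process one user utterance that has already been converted to text (S5)."""
    keywords = extract_keywords(text, keyword_to_context)             # S7
    # S8: use the first keyword; keeping the old context otherwise is an assumption.
    context = keyword_to_context[keywords[0]] if keywords else state["context"]
    if context != state["context"]:                                   # S9
        # Simplification: change the attribute whenever an entry exists.
        state["attributes"] = attribute_info.get(context, state["attributes"])
        state["context"] = context
    response = f"(response sentence for context: {context})"          # S20 placeholder
    # S21/S22 would synthesize `response` with state["attributes"] and play it.
    return response, state["attributes"]


keyword_to_context = {"concert": "music", "laptop": "computer"}
attribute_info = {"music": "bright"}
state = {"context": None, "attributes": "default"}
print(dialogue_turn("I went to a concert yesterday", state, keyword_to_context, attribute_info))
```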

The voice interactive apparatus and voice interactive system of the present invention are not limited to the embodiment described above, and various modifications can of course be made without departing from the gist of the present invention. In the above embodiment, the voice interactive apparatus on which the voice interactive program is installed is a so-called personal computer, but the apparatus need not be a personal computer. For example, it may be a portable terminal, a mobile phone, or a television, as long as it has a microphone for inputting voice and a speaker for outputting voice.

The context tree shown in FIGS. 2 and 3 is only an example, and this particular context tree need not be adopted. To output voices that actually suit a conversation between the user and the voice dialogue agent, it is desirable to create contexts for many more fields and to use a context tree that is more finely subdivided and deeper. Likewise, the more finely the voice attribute information stored in the attribute information storage area 132 is set, the better the output voice can suit the conversation between the user and the voice dialogue agent. A well-designed context tree allows the apparatus to respond with an even more suitable voice. If the number of contexts at the same level is increased to 100 or more, the number of digits in the context ID must be increased. The rule for assigning context IDs is not limited to the rule used in the above embodiment. The voice interactive apparatus may also be configured so that the user can add contexts and keywords to the context tree. In this case, the character strings may be received through the input device (keyboard or mouse) 24.

In the above embodiment, keywords are extracted from the input sentence, and the context is determined based on the keyword that appears first in the input sentence. However, the keyword used to determine the context is not limited to the first one. For example, when the input sentence contains a plurality of keywords, the context at the lowest level among the contexts associated with those keywords may be used as the determined context. When the same keyword is associated with a plurality of contexts, the determined context may be chosen according to the context of the previous exchange, taking the flow of the dialogue into account. For example, suppose that the keyword "program" is assigned to both the context named "concert" and the context named "computer". In this case, if the context of the previous exchange was "music", the determined context may be "concert".
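The two alternatives described above, choosing the lowest-level context among several candidates and resolving an ambiguous keyword using the previous context, can be sketched as follows. The tree contents are hypothetical; in particular, placing "concert" under "music" is an assumption made so that the "program" example works, and the function names are illustrative only.

```python
# Hypothetical context tree expressed as child -> parent (None marks a root).
PARENT = {
    "hobby": None,
    "music": "hobby",
    "concert": "music",
    "technology": None,
    "computer": "technology",
}


def ancestors(context):
    """Return the chain of ancestors of a context, nearest first."""
    chain = []
    node = PARENT.get(context)
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain


def deepest_context(candidates):
    """Pick the candidate at the lowest level (ties go to the earliest candidate)."""
    return max(candidates, key=lambda c: len(ancestors(c)))


def disambiguate(candidates, previous_context):
    """Pick the candidate that equals the previous context or lies beneath it;
    falling back to the first candidate is an assumption."""
    for c in candidates:
        if c == previous_context or previous_context in ancestors(c):
            return c
    return candidates[0]


# The keyword "program" maps to both "concert" and "computer".
print(disambiguate(["concert", "computer"], previous_context="music"))
print(deepest_context(["music", "concert"]))
```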

In the above embodiment, the voice attributes are changed when, viewed from the determined context before the move, the determined context after the move is its parent, its parent's parent, its child, or its child's child. However, the attributes may instead be changed only when the determined contexts before and after the move are parent and child. The attributes may also be changed when the two contexts are four or more generations apart.
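The parent, grandparent, child, and grandchild relationships discussed above can be checked by walking parent links in the tree. In this sketch the tree, the function names, and the `max_generations` parameter are assumptions; setting `max_generations=1` gives the parent-child-only variant, and a larger value admits more distant generations.

```python
# Hypothetical context tree expressed as child -> parent (None marks a root).
PARENT = {
    "hobby": None,
    "music": "hobby",
    "concert": "music",
    "technology": None,
    "computer": "technology",
}


def generations_up(start, target, parent_of):
    """Return how many parent links lead from `start` up to `target`,
    or None if `target` is not `start` itself or one of its ancestors."""
    steps = 0
    node = start
    while node is not None:
        if node == target:
            return steps
        node = parent_of.get(node)
        steps += 1
    return None


def vertical_relation(before, after, parent_of, max_generations=2):
    """Classify a move as ("ancestor", n) or ("descendant", n), else None."""
    up = generations_up(before, after, parent_of)
    if up and up <= max_generations:      # up == 0 means no move at all
        return ("ancestor", up)
    down = generations_up(after, before, parent_of)
    if down and down <= max_generations:
        return ("descendant", down)
    return None


print(vertical_relation("concert", "hobby", PARENT))
print(vertical_relation("hobby", "music", PARENT))
```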

In the above embodiment, the voice attributes are changed when the determined contexts before and after the move both belong to the same level and their identification numbers differ by one. However, when the determined contexts before and after the move belong to the same level, the attributes may be changed regardless of the identification numbers. Alternatively, the attributes may be changed when the determined contexts before and after the move belong to the same level and have the same parent.
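The three same-level conditions described above can be compared side by side in a short sketch. How the identification number is derived from the context ID is not restated here, so it is modeled as a plain integer field; the `Context` class, its fields, and the `rule` names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Context:
    name: str
    level: int             # depth in the tree (assumed representation)
    number: int            # identification number within the level (assumed integer)
    parent: Optional[str]  # name of the parent context, None for a root


def same_level_change(before, after, rule="adjacent"):
    """Decide whether a move between two contexts counts as a same-level change.

    rule="adjacent":    same level and identification numbers differ by one
    rule="any":         same level, regardless of identification number
    rule="same_parent": same level and the same parent
    """
    if before.name == after.name or before.level != after.level:
        return False
    if rule == "adjacent":
        return abs(before.number - after.number) == 1
    if rule == "any":
        return True
    if rule == "same_parent":
        return before.parent == after.parent
    raise ValueError(f"unknown rule: {rule}")


music = Context("music", level=2, number=1, parent="hobby")
movies = Context("movies", level=2, number=2, parent="hobby")
computer = Context("computer", level=2, number=5, parent="technology")
print(same_level_change(music, movies), same_level_change(music, computer))
```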

FIG. 1 is a hardware block diagram of the voice interactive apparatus 100.
FIG. 2 is a schematic diagram showing the configuration of the context tree storage area 131.
FIG. 3 is a schematic diagram of the tree structure of the contexts stored in the context tree storage area 131.
FIG. 4 is a schematic diagram showing the configuration of the first attribute information storage area 1321.
FIG. 5 is a schematic diagram showing the configuration of the second attribute information storage area 1322.
FIG. 6 is a schematic diagram showing the configuration of the third attribute information storage area 1323.
FIG. 7 is a flowchart of the main processing of the voice interactive apparatus 100.
FIG. 8 is a flowchart of the first process executed in the main processing.
FIG. 9 is a flowchart of the second process executed in the main processing.
FIG. 10 is a diagram showing an example of a dialogue between the user and the voice dialogue agent.

Explanation of symbols

10 CPU
13 Hard disk device
16 Voice output control unit
17 Voice input control unit
26 Speaker
27 Microphone
100 Voice interactive apparatus
111 Currently determined context storage area
113 Attribute storage area
131 Context tree storage area
132 Attribute information storage area
133 Acoustic model storage area
134 Voice interactive program storage area
1321 First attribute information storage area
1322 Second attribute information storage area
1323 Third attribute information storage area

Claims

1. A voice dialogue apparatus comprising:
voice input means for inputting voice;
conversion means for converting an input voice, which is the voice input by the voice input means, into a character string;
context storage means for storing conversation contexts in association with keywords;
context determining means for extracting, from a converted character string that is the character string converted by the conversion means, a keyword stored in the context storage means, and for determining the context stored in the context storage means in association with the extracted keyword as the context of the input voice;
conversation sentence determining means for determining a conversation sentence according to the input voice;
voice output means for outputting voice;
attribute storage means for storing attributes of the voice output by the voice output means;
output control means for causing the voice output means to output the conversation sentence determined by the conversation sentence determining means with the attributes stored in the attribute storage means;
determination means for determining whether the determined context, which is the context determined by the context determining means, has changed from the previous determined context, which is the context previously determined by the context determining means; and
attribute changing means for changing the voice attributes stored in the attribute storage means when the determination means determines that the determined context has changed.

2. The voice dialogue apparatus according to claim 1, further comprising attribute information storage means for storing, in association with the contexts, voice attribute information relating to the attributes of the voice output by the voice output means,
wherein, when the determined context changes to a context for which voice attribute information is stored in the attribute information storage means, the attribute changing means changes the voice attributes stored in the attribute storage means to the voice attributes indicated by the voice attribute information corresponding to the determined context.

3. The voice dialogue apparatus according to claim 1, wherein the data structure of the context storage means is a tree structure in which a plurality of contexts are stored such that the conversation content becomes more detailed as the hierarchy of the tree structure proceeds from upper levels to lower levels, and
wherein the attribute changing means changes the voice attributes when the determined context and the previous determined context are in a parent-child relationship in the tree structure or belong to the same level.

4. The voice dialogue apparatus according to claim 1, wherein the data structure of the context storage means is a tree structure in which a plurality of contexts are stored such that the conversation content becomes more detailed as the hierarchy of the tree structure proceeds from upper levels to lower levels, and
wherein the attribute changing means changes the voice attributes when the determined context becomes a context at a predetermined level of the tree structure.

5. The voice dialogue apparatus according to claim 1, wherein the attribute changing means changes the voice attributes when the determined context changes a predetermined number of times within a first predetermined time.

6. The voice dialogue apparatus according to claim 1, wherein the attribute changing means changes the voice attributes when the time during which the determined context does not change is equal to or longer than a second predetermined time.

7. A voice dialogue program for causing a computer to function as the various processing means of the voice dialogue apparatus according to claim 1.