JP2018151661A

JP2018151661A - Speech control apparatus, speech control method, and program

Info

Publication number: JP2018151661A
Application number: JP2018096720A
Authority: JP
Inventors: Hiroaki Matsubara; Junya Ura; Takehiko Kawahara; Yuji Hisaminato; Katsuji Yoshimura
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2018-05-21
Filing date: 2018-05-21
Publication date: 2018-09-27
Anticipated expiration: 2033-09-30
Also published as: JP6536713B2

Abstract

PROBLEM TO BE SOLVED: To synthesize speech in response to a user's utterance so that the user feels as if interacting with a person.
SOLUTION: A speech recognition apparatus includes a speech input unit 102 to which an utterance is input as a speech signal, a reply generation unit 110 that obtains an answer to the utterance, and a voice control unit 109. The voice control unit 109 changes the interval between the input of the speech signal of the utterance and the output of the speech signal of the answer to an interval defined by one of a plurality of preset output rules, and sets, from among the plurality of output rules, the output rule for which the proportion of utterances made in response to answers satisfies a predetermined condition within a predetermined period.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice control device, a voice control method, and a program.

In recent years, the following speech synthesis techniques have been proposed: a technique that produces more human-like speech by synthesizing and outputting speech that matches the user's speaking tone and voice quality (see, for example, Patent Document 1), and a technique that analyzes the user's speech to diagnose the user's psychological state, health condition, and the like (see, for example, Patent Document 2).
A voice dialogue system has also been proposed that recognizes speech input by a user while outputting content specified in a scenario by speech synthesis, thereby realizing a spoken dialogue with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A
Patent Document 2: Japanese Patent No. 4495907
Patent Document 3: Japanese Patent No. 4832097

Now consider a dialogue system that combines the speech synthesis techniques and the voice dialogue system described above, and that responds to a user's spoken utterance by retrieving data and outputting it by speech synthesis. It has been pointed out that the speech output by such a system can feel unnatural to the user; specifically, it can give the impression that a machine is obviously doing the talking.
The present invention has been made in view of these circumstances, and one of its objects is to provide a voice control device, a voice control method, and a program that make the answer to a user's utterance feel natural.

In studying a man-machine system that outputs (replies with) an answer to a user's utterance by speech synthesis, the present inventor first considered what kind of dialogue takes place between people, focusing on pitch (frequency).

Here, consider a dialogue between people in which one person (call this person a) makes an utterance (including a question, a soliloquy, an inquiry, and the like) and the other person (call this person b) gives an answer (including a backchannel response). In this case, when a speaks, not only a but also b, who is about to answer, often retains a strong impression of the pitch in a certain section of the utterance. When b answers to express agreement, approval, affirmation, or the like, b speaks so that the pitch of the part that characterizes the answer, for example its ending or beginning, has a predetermined relationship, specifically a consonant interval relationship, to the pitch of the utterance that left the impression. Because the pitch that a remembers from a's own utterance and the pitch of the part that characterizes the answer are in this relationship, a, on hearing the answer, receives a pleasant and reassuring impression of b's answer. This is what the present inventor reasoned.

For example, when a says "Sou desho?" ("Isn't that right?"), both a and b retain in memory the pitch of the ending "sho", which strongly conveys emphasis or a request for confirmation. In this state, if b intends to answer the utterance affirmatively with "A, hai" ("Ah, yes"), b says "A, hai" so that the pitch of the part that characterizes the answer, for example the ending "i", has the above relationship to the remembered pitch of "sho".

FIG. 2 shows the formants in such an actual dialogue. In this figure, the horizontal axis is time, the vertical axis is frequency, and whiter regions of the spectrum indicate higher intensity.
As shown in the figure, the spectrum obtained by frequency analysis of human speech appears as a plurality of peaks that move over time, that is, formants. Specifically, the formants corresponding to "Sou desho?" ("Isn't that right?") and those corresponding to "A, hai" ("Ah, yes") each appear as three peak bands (white band-like portions that move along the time axis).
Focusing on the first formant, which has the lowest frequency of these three peak bands, the frequency at reference sign A (its central portion), which corresponds to "sho" in "Sou desho?", is approximately 400 Hz. The frequency at reference sign B, which corresponds to "i" in "A, hai", is approximately 260 Hz. The frequency at A is therefore approximately 3/2 of the frequency at B.

A frequency ratio of 3/2 corresponds, in terms of musical intervals, to the relationship of "do" to "sol" in the same octave, or of "la" in the octave one below to "mi"; as described later, this is a perfect fifth. This frequency ratio (the predetermined relationship between pitches) is one preferred example, and various other examples are possible, as described later.
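The 3/2 observation can be checked numerically from the two approximate frequencies read off FIG. 2. The following Python sketch is an illustration only and is not part of the patent; the helper name `cents_between` and the use of cents (1200 per octave) as a distance measure are assumptions introduced here.

```python
import math


def cents_between(ratio_a: float, ratio_b: float) -> float:
    """Signed distance from ratio_b to ratio_a in cents (1200 cents = 1 octave)."""
    return 1200.0 * math.log2(ratio_a / ratio_b)


# Approximate first-formant readings quoted in the description of FIG. 2.
utterance_ending_hz = 400.0  # reference sign A, the "sho" of the utterance
answer_ending_hz = 260.0     # reference sign B, the "i" of the answer

measured_ratio = utterance_ending_hz / answer_ending_hz
deviation = cents_between(measured_ratio, 3 / 2)

print(f"measured ratio: {measured_ratio:.3f}")
print(f"deviation from a perfect fifth (3/2): {deviation:.1f} cents")
```

With these approximate readings the measured ratio is about 1.54, which lies within half a semitone (50 cents) of an exact perfect fifth.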

FIG. 3 shows the relationship between pitch names (syllable names) and the frequencies of the human voice. This example also shows the frequency ratios relative to "do" of the fourth octave; "sol" is 3/2 relative to "do", as noted above. The frequency ratios relative to "la" of the third octave are shown alongside.

Thus, in a dialogue between people, the pitch of an utterance and the pitch of the answer to it are not unrelated but can be considered to have the relationship described above. The present inventor analyzed many dialogue examples and statistically aggregated evaluations by many people, which confirmed that this idea is broadly correct.
However, even if this holds statistically, the pitch relationship that feels pleasant differs from person to person. Moreover, in a dialogue system that outputs (replies with) an answer to a user's utterance by speech synthesis, it is important to increase the number and frequency of the user's utterances, in short, to keep the conversation with the machine lively.
Accordingly, to achieve the above object when synthesizing an answer to a user's utterance, the following configuration is adopted.

That is, to achieve the above object, a speech synthesizer according to one aspect of the present invention comprises: a voice input unit that receives an utterance as a voice signal; a pitch analysis unit that analyzes the pitch of a specific first section of the utterance; an acquisition unit that acquires an answer to the utterance; a speech synthesis unit that synthesizes the acquired answer as speech; and a voice control unit that causes the speech synthesis unit to change the pitch of a specific second section of the answer to a pitch having a relationship, defined by a preset pitch rule, to the pitch of the first section, and that sets the pitch rule in response to utterances being made to the answer.
According to this aspect, speech synthesis is controlled so that the pitch of the specific second section of the answer is changed to a pitch having the relationship defined by the pitch rule to the pitch of the specific first section of the utterance. Because the pitch rule is set in response to utterances being made to the answer, the conversation with the machine can be steered toward becoming livelier.

In this aspect, the first section is preferably, for example, the ending of the utterance, and the second section is preferably the beginning or ending of the answer. This is because, as described above, the section that characterizes the impression of an utterance is often its ending, and the section that characterizes the impression of an answer is often its beginning or ending.
The predetermined relationship is preferably a consonant interval other than a perfect unison. Here, consonance refers to the relationship in which multiple musical tones sounded simultaneously blend and harmonize well with one another, and intervals in this relationship are called consonant intervals. The simpler the frequency ratio between the two tones, the higher the degree of consonance. The simplest frequency ratios, 1/1 (perfect unison) and 2/1 (perfect octave), are specifically called absolute consonant intervals; together with 3/2 (perfect fifth) and 4/3 (perfect fourth), they are called perfect consonant intervals. The ratios 5/4 (major third), 6/5 (minor third), 5/3 (major sixth), and 8/5 (minor sixth) are called imperfect consonant intervals, and all other frequency ratios (major and minor seconds and sevenths, the various augmented and diminished intervals, and so on) are called dissonant intervals.
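The classification of intervals listed above can be written down as a small lookup. The Python sketch below is only an illustration under stated assumptions: the names `classify_interval` and `allowed_for_answer` are hypothetical, and ratios are matched exactly as fractions, whereas a real pitch comparison would need a tolerance.

```python
from fractions import Fraction

# Frequency ratios named in the text. The absolute consonances are also
# counted among the perfect consonances; they are kept separate here only
# so that the most specific label can be reported.
ABSOLUTE = {Fraction(1, 1): "perfect unison", Fraction(2, 1): "perfect octave"}
PERFECT = {Fraction(3, 2): "perfect fifth", Fraction(4, 3): "perfect fourth"}
IMPERFECT = {
    Fraction(5, 4): "major third",
    Fraction(6, 5): "minor third",
    Fraction(5, 3): "major sixth",
    Fraction(8, 5): "minor sixth",
}


def classify_interval(ratio: Fraction) -> str:
    """Return the consonance class of an exact frequency ratio."""
    if ratio in ABSOLUTE:
        return f"absolute consonance ({ABSOLUTE[ratio]})"
    if ratio in PERFECT:
        return f"perfect consonance ({PERFECT[ratio]})"
    if ratio in IMPERFECT:
        return f"imperfect consonance ({IMPERFECT[ratio]})"
    return "dissonance"


def allowed_for_answer(ratio: Fraction) -> bool:
    """True for a consonant interval other than the perfect unison."""
    return ratio != Fraction(1, 1) and classify_interval(ratio) != "dissonance"


for r in (Fraction(3, 2), Fraction(1, 1), Fraction(9, 8)):
    print(r, classify_interval(r), allowed_for_answer(r))
```

The ratio 9/8 (a major second) is used here as an example of a ratio outside the listed consonances.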

Note that the perfect unison is excluded from the consonant intervals above because, if the pitch of the beginning or ending of the answer is identical to the pitch of the ending of the utterance, the dialogue is considered to feel unnatural.
The answer is not limited to a specific answer to a question; it also includes backchannel responses (interjections) such as "naruhodo" ("I see") and "sou desu ne" ("that's right").

In a dialogue between people, the period from an utterance to the answer, the so-called pause, is one factor that determines how lively the dialogue is. Therefore, in the above aspect, the voice control unit may control the pause between the utterance and the output of the answer according to a preset output rule, and may set the output rule in response to utterances being made to the answer. With this configuration, the pause between the utterance and the output of the answer is set in response to utterances being made to the answer, so the conversation with the machine can be steered toward becoming livelier.

In the above aspect, the pitch rule may be set according to one of a plurality of scenes prepared in advance. A scene here means, for example, a combination of the speaker's gender and age with the gender and age of the synthesized voice, a combination of the speed of the utterance (fast or slow) with the speed of the synthesized answer, or the purpose of the dialogue (such as voice guidance).
Similarly, the output rule may be set according to one of a plurality of scenes prepared in advance.

The aspects of the present invention can be conceived not only as a speech synthesizer but also as a program that causes a computer to function as the speech synthesizer.
In the present invention, the pitch (frequency) of the utterance is the object of analysis and the pitch of the answer is the object of control. However, as is clear from the formant example above, human speech occupies a certain frequency range, so the analysis and control inevitably span a certain frequency range as well. Errors also naturally occur in analysis and control. For these reasons, in this application, pitch analysis and control are not limited to cases where the pitch (frequency) values are exactly equal; a certain range is permitted.

A block diagram showing the configuration of the speech synthesizer according to the first embodiment.
A diagram showing an example of speech formants in a dialogue.
A diagram showing the relationship between pitch names, frequencies, and the like.
A diagram showing an example of the index table in the speech synthesizer.
A diagram showing an example of how the operation periods switch in the speech synthesizer.
A flowchart showing the speech synthesis processing in the speech synthesizer.
A flowchart showing the speech synthesis processing in the speech synthesizer.
A diagram showing a specific example of identifying the ending of an utterance.
A diagram showing an example of a pitch shift applied to a voice sequence.
A diagram showing an example of the index table in the second embodiment.
A diagram showing an example of the pause between an utterance and an answer.
A diagram showing an example of the index table in the third embodiment.

Embodiments of the present invention are described below with reference to the drawings.

<First Embodiment>
FIG. 1 shows the configuration of a speech synthesizer 10 according to an embodiment of the present invention.
In this figure, the speech synthesizer 10 is a terminal device, such as a mobile phone, that has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 142. In the speech synthesizer 10, the CPU executes a preinstalled application program, whereby a plurality of functional blocks are constructed as follows.
Specifically, the following are constructed in the speech synthesizer 10: an utterance section detection unit 104, a pitch analysis unit 106, a language analysis unit 108, a voice control unit 109, an answer creation unit (acquisition unit) 110, a speech synthesis unit 112, a language database 122, an answer database 124, an information acquisition unit 126, a management database 127, and a voice library 128.
Although not specifically illustrated, the speech synthesizer 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device and input various operations to it. Likewise, although not illustrated, the speech synthesizer 10 incorporates a real-time clock and is configured to acquire time information such as the current time. The speech synthesizer 10 is not limited to a terminal device such as a mobile phone and may be a notebook or tablet personal computer.

The voice input unit 102, the details of which are omitted, consists of a microphone that converts the user's voice (utterance) into an electrical signal, an LPF (low-pass filter) that cuts the high-frequency components of the converted voice signal, and an A/D converter that converts the filtered voice signal into a digital signal.
The utterance section detection unit 104 processes the digitized voice signal to detect utterance (voiced) sections.

The pitch analysis unit 106 performs volume analysis and frequency analysis on the utterance in the voice signal detected as an utterance section, and supplies the voice control unit 109 with pitch data indicating the pitch in a specific section (the first section) of the utterance.
Here, the first section is, for example, the ending of the utterance. The pitch here refers to, for example, the first formant, which is the lowest-frequency component among the plurality of formants obtained by frequency analysis of the voice signal; in FIG. 2, this is the frequency (pitch) indicated by the peak band that ends at reference sign A. For the frequency analysis, an FFT (Fast Fourier Transform) or another known method can be used. An example of a specific technique for identifying the ending of an utterance is described later.
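As a rough illustration of picking the lowest-frequency prominent component from a spectrum, the following Python sketch uses a naive discrete Fourier transform in place of an FFT so that it needs only the standard library. It is not the patent's method: the function names, the 0.5 prominence threshold, and the synthetic two-tone test signal are all assumptions, and practical formant estimation on real speech is considerably more involved.

```python
import cmath
import math
from typing import List, Optional


def dft_magnitudes(samples: List[float]) -> List[float]:
    """Magnitudes of DFT bins 0 .. n/2 - 1 (naive O(n^2) transform)."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def lowest_peak_hz(samples: List[float], sample_rate: float,
                   threshold_ratio: float = 0.5) -> Optional[float]:
    """Frequency of the lowest-frequency local maximum whose magnitude is at
    least threshold_ratio times the largest non-DC magnitude."""
    mags = dft_magnitudes(samples)
    floor = threshold_ratio * max(mags[1:])
    for k in range(1, len(mags) - 1):
        if mags[k] >= floor and mags[k] >= mags[k - 1] and mags[k] >= mags[k + 1]:
            return k * sample_rate / len(samples)
    return None


# Synthetic stand-in for the ending of an utterance: a 400 Hz component plus
# a weaker 1200 Hz component (hypothetical values chosen to fall on DFT bins).
sample_rate = 8000.0
n = 400
signal = [
    math.sin(2 * math.pi * 400 * t / sample_rate)
    + 0.8 * math.sin(2 * math.pi * 1200 * t / sample_rate)
    for t in range(n)
]
ending_pitch_hz = lowest_peak_hz(signal, sample_rate)
print(ending_pitch_hz)
```

The value in `ending_pitch_hz` plays the role of the pitch data that would be passed on for control of the answer.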

Meanwhile, the language analysis unit 108 determines which phonemes the voice signal detected as an utterance section is closest to by referring to phoneme models created in advance in the language database 122, thereby analyzes (identifies) the utterance defined by the voice signal, and supplies the analysis result to the answer creation unit 110.

The answer creation unit 110 creates an answer corresponding to the utterance analyzed by the language analysis unit 108, with reference to the answer database 124 and the information acquisition unit 126.
In this embodiment, the answers created by the answer creation unit 110 are assumed to be of the following types:
(1) an answer indicating affirmation, denial, or the like of the utterance;
(2) an answer giving specific content in response to the utterance;
(3) an answer serving as a backchannel response to the utterance.
Examples of (1) include "hai" ("yes") and "iie" ("no"). An example of (2) is answering the utterance "Asu no tenki wa?" ("What is the weather tomorrow?") with the specific content "Hare desu" ("It will be sunny"). Examples of (3) include "sou desu ne" ("that's right") and "eeto" ("um"); such an answer is created (acquired) when the utterance is neither one that can be answered with "yes" or "no" as in (1) nor one that requires specific content as in (2).

For an answer of type (1), for example, in response to the utterance "Ima sanji desu ka?" ("Is it three o'clock now?"), the answer creation unit 110 can determine whether to answer "yes" or "no" by acquiring time information from the built-in real-time clock (not shown).
On the other hand, for an utterance such as "Asu wa hare desu ka?" ("Will it be sunny tomorrow?"), the speech synthesizer 10 cannot answer on its own without accessing an external server to obtain weather information. When the speech synthesizer 10 cannot answer on its own in this way, the information acquisition unit 126 accesses an external server via the Internet, acquires the information needed to create the answer, and supplies it to the answer creation unit 110. The answer creation unit 110 can then determine whether to answer the utterance with, for example, "yes" or "no".
For an answer of type (2), for example, in response to the utterance "Ima nanji?" ("What time is it now?"), the answer creation unit 110 acquires the time information and obtains the information other than the time from the answer database 124, and can thereby create the answer "Tadaima XX ji XX fun desu" ("It is now XX:XX"). In response to the utterance "Asu no tenki wa?" ("What is the weather tomorrow?"), on the other hand, the information acquisition unit 126 accesses an external server to acquire the information needed for the answer, and the answer creation unit 110 creates an answer such as "Hare desu" ("It will be sunny") from the answer database 124 and the external server.

The answer creation unit 110 creates and outputs a voice sequence from the created or acquired answer. This voice sequence is a phoneme string that specifies the pitch and sounding timing of each phoneme.
For answers of types (1) and (3), the voice sequences corresponding to the answers may, for example, be stored in the answer database 124, and the voice sequence corresponding to the determination result may be read out from the answer database 124. Specifically, for an answer of type (1), the answer creation unit 110 reads out a voice sequence such as "hai" ("yes") or "iie" ("no") according to the determination result; for an answer of type (3), it reads out a voice sequence such as "sou desu ne" ("that's right") or "eeto" ("um") according to the analysis result of the utterance and the determination result in the answer creation unit 110.
The voice sequence created or acquired by the answer creation unit 110 is supplied to both the voice control unit 109 and the speech synthesis unit 112.

The voice control unit 109 controls speech synthesis in the speech synthesis unit 112.
Because the voice sequence specifies the pitch and sounding timing of the speech, the speech synthesis unit 112 can output the basic voice of the answer simply by synthesizing speech according to the voice sequence.
However, as described above, the basic voice of the answer does not take into account the pitch of the ending or other parts of the utterance, so it can give the impression that a machine is talking. In this embodiment, therefore, first, the voice control unit 109 changes the pitch of the entire voice sequence supplied from the answer creation unit 110 so that the pitch of a specific section (the second section) of the sequence has a predetermined relationship to the pitch data. In this embodiment the second section is the ending of the answer, but it is not limited to the ending.
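One way to picture this whole-sequence pitch change is as a uniform scaling that places the ending of the answer at the target pitch while preserving the contour of the rest. The Python sketch below is a minimal illustration under assumptions: the `Phoneme` record, the function `shift_sequence`, the phoneme pitches, and the representation of a pitch rule as a frequency ratio (2/3 for a fifth below) are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Phoneme:
    symbol: str      # phoneme or syllable label
    pitch_hz: float  # pitch specified by the voice sequence
    onset_s: float   # sounding timing in seconds


def shift_sequence(sequence: List[Phoneme], utterance_ending_hz: float,
                   rule_ratio: float) -> List[Phoneme]:
    """Scale every pitch by one factor so that the last phoneme (the ending,
    used here as the second section) lands on utterance_ending_hz * rule_ratio."""
    target_hz = utterance_ending_hz * rule_ratio
    factor = target_hz / sequence[-1].pitch_hz
    return [replace(p, pitch_hz=p.pitch_hz * factor) for p in sequence]


# Hypothetical basic voice for the answer "a, hai".
answer = [
    Phoneme("a", 220.0, 0.00),
    Phoneme("ha", 240.0, 0.15),
    Phoneme("i", 200.0, 0.30),
]

# Utterance ending at about 400 Hz; rule "fifth below" expressed as ratio 2/3.
shifted = shift_sequence(answer, utterance_ending_hz=400.0, rule_ratio=2 / 3)
for p in shifted:
    print(f"{p.symbol}: {p.pitch_hz:.1f} Hz at {p.onset_s:.2f} s")
```

Scaling by a single factor keeps the frequency ratios between phonemes unchanged, so the intonation contour of the basic voice is preserved and only its overall height moves.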

As described above, which relationship (pitch rule) between the pitch of the second section of the answer and the pitch of the ending of the utterance feels pleasant and keeps the dialogue lively differs from user to user. In this embodiment, therefore, second, an evaluation period is provided as an operation period; during the evaluation period, answers to utterances are synthesized using a plurality of pitch rules, and at the end of the evaluation period the pitch rule under which the dialogue was liveliest is set and reflected in subsequent speech synthesis.
The management database 127 is managed by the voice control unit 109 and stores, among other things, a table (index table) that associates each pitch rule with an index indicating how lively the dialogue was.

FIG. 4 shows an example of the contents stored in the index table. As shown in the figure, the index table associates an utterance count and an application count with each pitch rule.
Here, a pitch rule specifies what relationship the pitch of the ending of the answer should have to the pitch of the ending of the utterance; as shown in the figure, examples are a fourth above, a third below, a fifth below, a sixth below, and an octave below.
The utterance count is the number of times, during the evaluation period, that the user made a further utterance within a predetermined time after the speech synthesizer 10 synthesized an answer to the user's utterance. Conversely, even if an answer to the user's utterance is synthesized during the evaluation period, it is not counted in the utterance count if the user makes no utterance after the answer or makes one only after the predetermined time has elapsed.
The application count is the number of times the corresponding pitch rule was applied during the evaluation period.
Therefore, by comparing the values obtained by dividing the utterance count by the application count, it can be determined which pitch rule produced the highest rate of user utterances in response to answers, that is, the rule under which the dialogue was liveliest.
Note that even when a pitch rule is applied and an answer is synthesized, the user may not speak within the predetermined time, so the application count is greater than the utterance count, as in the example in the figure.
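The index table and the comparison of utterance count divided by application count can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the names `RuleStats`, `record`, and `liveliest_rule` are hypothetical, and the counts are made-up values, since the actual numbers in FIG. 4 are not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RuleStats:
    utterance_count: int = 0    # follow-up utterances made within the time limit
    application_count: int = 0  # times the rule was used to synthesize an answer

    def rate(self) -> float:
        """Utterance count divided by application count (0.0 if never applied)."""
        if self.application_count == 0:
            return 0.0
        return self.utterance_count / self.application_count


def record(table: Dict[str, RuleStats], rule: str, replied_in_time: bool) -> None:
    """Update the table after one answer synthesized with the given rule."""
    stats = table.setdefault(rule, RuleStats())
    stats.application_count += 1
    if replied_in_time:
        stats.utterance_count += 1


def liveliest_rule(table: Dict[str, RuleStats]) -> str:
    """Rule with the highest utterance-to-application rate."""
    return max(table, key=lambda rule: table[rule].rate())


# Hypothetical counts for the five rules named in the text.
index_table = {
    "fourth above": RuleStats(4, 10),
    "third below": RuleStats(5, 10),
    "fifth below": RuleStats(8, 10),
    "sixth below": RuleStats(6, 10),
    "octave below": RuleStats(3, 10),
}

record(index_table, "fifth below", replied_in_time=True)
print(liveliest_rule(index_table))
```

Comparing rates rather than raw utterance counts keeps the comparison fair when the rules have been applied different numbers of times.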

The speech synthesis unit 112 synthesizes speech from the voice sequence under the control of the voice control unit 109. Specifically, for speech synthesis, the speech synthesis unit 112 uses the speech segment data registered in the voice library 128. The voice library 128 is a database, compiled in advance, of speech segment data defining the waveforms of the various speech segments that serve as the raw material of speech, such as single phonemes and phoneme-to-phoneme transitions. The speech synthesis unit 112 combines the speech segment data for each sound (phoneme) of the voice sequence, corrects the joins so that they are continuous, and generates a voice signal while changing the pitch of the answer according to the pitch rule determined by the voice control unit 109.
The synthesized voice signal is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 142.

Next, the operation of the speech synthesizer 10 will be described.
First, when the user performs a predetermined operation, for example selecting an icon corresponding to dialogue processing on a main menu screen (not shown), the CPU starts the application program corresponding to that processing. By executing this application program, the CPU constructs the functional blocks shown in FIG. 1.

FIG. 5 is a diagram showing the operation period resulting from execution of the application program. As shown in the figure, in the present embodiment rule fixed periods and evaluation periods alternate during the operation period. A rule fixed period is a period in which answers are synthesized with the pitch rule that was set at the end of an evaluation period. Here, the currently set pitch rule is assumed to be a fifth below, indicated by the white triangle in FIG. 4.

An evaluation period, on the other hand, is a period in which answers to the user's utterances are synthesized with a plurality of pitch rules, and the pitch rule under which the dialogue was liveliest is set.
In the present embodiment, the rule fixed period and the evaluation period alternate at predetermined time intervals as shown in FIG. 5, but the configuration may instead shift to the evaluation period only when a predetermined condition is satisfied, for example only when the user so instructs.

FIG. 6 is a flowchart showing the speech synthesis process. This process is executed regardless of whether the current period is a rule fixed period or an evaluation period.

First, the user inputs an utterance by voice to the voice input unit 102 (step Sa11). The utterance section detection unit 104 detects the utterance section, for example by comparing the amplitude of the voice with a threshold, and supplies the speech signal of that utterance section to both the pitch analysis unit 106 and the language analysis unit 108 (step Sa12).
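The threshold comparison in step Sa12 can be illustrated with a minimal sketch. The function name `detect_utterance_sections`, the sample values, and the threshold are hypothetical; a practical detector would typically also smooth the amplitude and bridge short pauses, which the description does not specify.

```python
def detect_utterance_sections(samples, threshold):
    """Return (start, end) index pairs where |amplitude| >= threshold; end is exclusive."""
    sections = []
    start = None
    for i, sample in enumerate(samples):
        loud = abs(sample) >= threshold
        if loud and start is None:
            start = i                      # an utterance section begins
        elif not loud and start is not None:
            sections.append((start, i))    # the section ended just before index i
            start = None
    if start is not None:                  # signal ended while still loud
        sections.append((start, len(samples)))
    return sections


# Hypothetical amplitude samples containing two bursts of speech.
samples = [0.0, 0.0, 0.5, -0.6, 0.0, 0.0, 0.7, 0.0]
sections = detect_utterance_sections(samples, threshold=0.3)
```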

The language analysis unit 108 analyzes the meaning of the utterance in the supplied speech signal and supplies data indicating that meaning to the answer creating unit 110 (step Sa13).
The answer creating unit 110 creates an answer corresponding to the language analysis result of the utterance, using the answer database 124 or, as necessary, acquiring information from an external server via the information acquiring unit 126 (step Sa14). The answer creating unit 110 then creates a speech sequence based on that answer and supplies it to the speech synthesis unit 112 (step Sa15).

For example, if the language analysis result of the user's utterance means "asu wa hare desu ka?" (Will it be sunny tomorrow?), the answer creating unit 110 accesses the external server to acquire the weather information needed for the answer, and outputs a speech sequence for "hai" (yes) if the acquired weather information indicates sunny weather, or for "iie" (no) otherwise.
If the language analysis result of the user's utterance is "asu no tenki wa?" (What is tomorrow's weather?), the answer creating unit 110 outputs a speech sequence such as "hare desu" (It will be sunny) or "kumori desu" (It will be cloudy) according to the weather information acquired from the external server.
On the other hand, if the language analysis result of the user's utterance means "asu wa hare kaa" (So it will be sunny tomorrow), the utterance is soliloquy (or muttering), so the answer creating unit 110 reads out from the answer database 124 and outputs a speech sequence for a back-channel response such as "sou desu ne" (That's right).

The voice control unit 109 identifies, from the speech sequence supplied by the answer creating unit 110, the pitch of the ending of that speech sequence (the initial pitch) (step Sa16).

Next, the voice control unit 109 determines whether the current time is within a rule fixed period (step Sa17). If it is (the determination result in step Sa17 is "Yes"), the voice control unit 109 applies the pitch rule set in the evaluation period preceding that rule fixed period (step Sa18).
If the current time is not within a rule fixed period but within an evaluation period (the determination result in step Sa17 is "No"), the voice control unit 109 selects, for example, one of three pitch rules, namely the pitch rule set in the evaluation period immediately preceding the current one and the two pitch rules adjacent to it above and below in the index table, and applies the selected pitch rule (step Sa19). Specifically, if the set pitch rule is a fifth below, indicated by the white triangle in FIG. 4, the voice control unit 109 selects, at random or in a predetermined order, one of three pitch rules: the fifth below, and the third below and sixth below that are adjacent to it in the index table.
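The selection in steps Sa17 to Sa19 can be sketched as below. The ordering in `RULE_ORDER` is an assumption inferred from the text (the third below and sixth below are adjacent to the fifth below); the actual layout of FIG. 4 is not reproduced here. The function names are hypothetical, and how candidates are chosen when the current rule sits at either end of the table is not stated in the description, so this sketch simply uses the neighbors that exist.

```python
import random

# Assumed order of pitch rules in the index table (hypothetical; see FIG. 4 for the real one).
RULE_ORDER = ["fourth_above", "third_below", "fifth_below", "sixth_below", "octave_below"]


def candidate_rules(current, order=RULE_ORDER):
    """Return the current rule together with its neighbors above and below in the table."""
    i = order.index(current)
    return order[max(i - 1, 0): i + 2]


def choose_rule(current, in_fixed_period, rng=random):
    """Step Sa18: keep the set rule. Step Sa19: pick one candidate at random."""
    if in_fixed_period:
        return current
    return rng.choice(candidate_rules(current))


candidates = candidate_rules("fifth_below")
```

Random choice is only one of the two options the text allows; cycling through the candidates in a predetermined order would spread the applications more evenly over a short evaluation period.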

Meanwhile, the pitch analysis unit 106 analyzes the speech signal of the utterance in the detected utterance section, identifies the pitch of the first section (the ending) of the utterance, and supplies pitch data indicating that pitch to the voice control unit 109 (step Sa20). An example of a specific method by which the pitch analysis unit 106 identifies the ending of an utterance is described here.

Assuming a dialogue in which the speaker wants an answer to the utterance, the volume in the portion corresponding to the ending of the utterance is considered to be temporarily higher than in other portions. The pitch of the first section (the ending) can therefore be obtained by the pitch analysis unit 106 as follows, for example.
First, the pitch analysis unit 106 separates the speech signal of the utterance detected as an utterance section into volume and pitch, and converts each into a waveform. FIG. 8(a) is an example of a volume waveform, with the volume of the speech signal on the vertical axis and elapsed time on the horizontal axis. FIG. 8(b) is a pitch waveform, with the pitch of the first formant obtained by frequency analysis of the same speech signal on the vertical axis and elapsed time on the horizontal axis. The volume waveform in (a) and the pitch waveform in (b) share a common time axis.
Second, the pitch analysis unit 106 identifies the timing of the temporally last local maximum P1 in the volume waveform of (a).
Third, the pitch analysis unit 106 regards a predetermined time range (for example, 100 μs to 300 μs) spanning the timing of the identified local maximum P1 as the ending.
Fourth, the pitch analysis unit 106 supplies the average pitch of the section Q1 of the pitch waveform in (b) that corresponds to the identified ending to the voice control unit 109 as pitch data.
By identifying the last local maximum P1 of the volume waveform in the utterance section as the timing corresponding to the ending of the utterance in this way, erroneous detection of the ending of a conversational utterance is considered to be reduced.
Here, a predetermined time range spanning the timing of the temporally last local maximum P1 in the volume waveform of (a) is regarded as the ending, but a predetermined time range that starts or ends at the timing of the local maximum P1 may be regarded as the ending instead. Also, instead of the average pitch of the section Q1 corresponding to the identified ending, the pitch at the start of the section Q1, at its end, or at the timing of the local maximum P1 may be output as the pitch data.
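The four steps above can be sketched on sampled waveforms as follows. This is an illustration under stated assumptions: `volume` and `pitch` are hypothetical lists sampled on the same time axis, the window is given in samples (`half_width`) rather than in time, and the function names are not from the patent.

```python
def last_local_maximum(volume):
    """Index of the temporally last local maximum P1 of the volume waveform, or None."""
    for i in range(len(volume) - 2, 0, -1):        # scan backwards from the end
        if volume[i - 1] < volume[i] >= volume[i + 1]:
            return i
    return None


def ending_pitch(volume, pitch, half_width):
    """Average pitch over the section Q1 spanning half_width samples on each side of P1."""
    p1 = last_local_maximum(volume)
    if p1 is None:
        return None
    lo = max(p1 - half_width, 0)
    hi = min(p1 + half_width + 1, len(pitch))
    window = pitch[lo:hi]
    return sum(window) / len(window)


# Hypothetical waveforms on a shared time axis.
volume = [0.1, 0.5, 0.2, 0.3, 0.9, 0.4, 0.1]
pitch = [200, 210, 220, 230, 240, 250, 260]
p1 = last_local_maximum(volume)
q1_average = ending_pitch(volume, pitch, half_width=1)
```

The variants mentioned in the text (a window starting or ending at P1, or the pitch at a single instant) would change only how `lo` and `hi` are chosen.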

On receiving the pitch data, the voice control unit 109 instructs the speech synthesis unit 112 so that the pitch of the ending of the answer has the relationship defined by the applied pitch rule with respect to the pitch indicated by the pitch data (step Sa21). In response to this instruction, the speech synthesis unit 112 changes the pitch of the entire speech sequence so that the pitch of the ending of the answer becomes the pitch defined by the pitch rule, and outputs the result.
In the present embodiment, the user may speak again after the answer is output by speech synthesis, so the processing procedure returns to step Sa11. The speech synthesis process ends on an explicit operation by the user (for example, operation of a software button).

FIG. 7 is a flowchart showing the operation of the table update process.
This table update process is executed independently of the speech synthesis process in FIG. 6. Its main purpose is to update the index table (see FIG. 4) during the evaluation period and to set the pitch rule to be applied in the rule fixed period.

First, the voice control unit 109 determines whether the current time is within an evaluation period (step Sb11). If it is not (the determination result in step Sb11 is "No"), the voice control unit 109 returns the processing procedure to step Sb11.
If the current time is within an evaluation period (the determination result in step Sb11 is "Yes"), the voice control unit 109 determines whether an answer synthesized by the speech synthesis unit 112 has been output (step Sb12).
If no answer has been output (the determination result in step Sb12 is "No"), the voice control unit 109 returns the processing procedure to step Sb11. The subsequent processing is therefore executed only when the current time is within an evaluation period and an answer has been output.
If an answer has been output (the determination result in step Sb12 is "Yes"), the voice control unit 109 determines whether the user spoke within a predetermined time (for example, 5 seconds) after the answer was output (step Sb13). This can be determined, for example, by whether pitch data was supplied from the pitch analysis unit 106 to the voice control unit 109 within the predetermined time after the answer was output.

If the user spoke within the predetermined time after the answer was output (the determination result in step Sb13 is "Yes"), the voice control unit 109 identifies the pitch rule applied in synthesizing that answer in order to update the index table (step Sb14). This pitch rule can be identified, for example, by storing the selected pitch rule in the management database 127 in association with the time of selection whenever a pitch rule is selected in step Sa19, and then searching for the pitch rule with the most recent time information.
In the index table, the voice control unit 109 increments by 1 each of the items (the number of utterances and the number of applications) for the pitch rule applied in synthesizing that answer (step Sb15).

On the other hand, if the user did not speak after the answer was output, or spoke only after the predetermined time had elapsed (the determination result in step Sb13 is "No"), the voice control unit 109 identifies the pitch rule applied in synthesizing that answer in the same manner as in step Sb14 (step Sb16). In this case, however, the answer is regarded as not having elicited an utterance from the user, so the voice control unit 109 increments by 1 only the number of applications for that pitch rule in the index table (step Sb17).

Next, the voice control unit 109 determines whether the current time is the end timing of the evaluation period (step Sb18).
If it is not the end timing of the evaluation period (the determination result in step Sb18 is "No"), the voice control unit 109 returns the processing procedure to step Sb11 in preparation for an utterance following a later answer.
If it is the end timing of the evaluation period (the determination result in step Sb18 is "Yes"), the voice control unit 109 compares, for the three pitch rules used in that evaluation period, the values obtained by dividing the number of utterances by the number of applications, and sets the pitch rule applied in the case where the dialogue was liveliest as the pitch rule to be applied in the rule fixed period following the evaluation period (step Sb19). For example, if at the time of step Sb18 the three pitch rules in the evaluation period are a third below, a fifth below, and a sixth below, and the number of utterances and number of applications for each pitch rule have the values shown in FIG. 4, the pitch rule applied in the rule fixed period is changed from the fifth below that had been set until then to the third below, indicated by the solid black triangle.
The voice control unit 109 then clears the number of utterances and the number of applications for the three pitch rules evaluated in that evaluation period (step Sb20), and returns the processing procedure to step Sb11 so that the same processing is performed in the next evaluation period.
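Steps Sb13 to Sb20 can be summarized in a small sketch. The class `IndexTable` and its methods are hypothetical names, the 5-second limit is the example value from the text, and treating a reply at exactly the limit as being in time is an assumption. Falling back to the previously set rule when no rule was applied is also an assumption, since the description does not cover that case.

```python
class IndexTable:
    """Per-rule counters: [number of utterances, number of applications]."""

    def __init__(self, rules):
        self.counts = {rule: [0, 0] for rule in rules}

    def record(self, rule, reply_delay, limit=5.0):
        """reply_delay: seconds until the user's next utterance, or None if there was none."""
        if reply_delay is not None and reply_delay <= limit:
            self.counts[rule][0] += 1      # step Sb15: count the utterance
        self.counts[rule][1] += 1          # steps Sb15 and Sb17: count the application

    def finish_evaluation(self, fallback):
        """Step Sb19: pick the liveliest rule. Step Sb20: clear the counters."""
        def ratio(rule):
            utterances, applications = self.counts[rule]
            return utterances / applications

        applied = [r for r in self.counts if self.counts[r][1] > 0]
        best = max(applied, key=ratio) if applied else fallback
        for rule in self.counts:
            self.counts[rule] = [0, 0]
        return best


# Hypothetical evaluation period with three candidate rules.
table = IndexTable(["third_below", "fifth_below", "sixth_below"])
table.record("third_below", 2.0)
table.record("third_below", 1.0)
table.record("fifth_below", 7.0)   # reply came too late: application only
table.record("fifth_below", 3.0)
table.record("sixth_below", None)  # no reply at all
best = table.finish_evaluation(fallback="fifth_below")
```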

Thus, in the present embodiment, different pitch rules are applied during the evaluation period to synthesize answers. If the user speaks within the predetermined time after an answer, the number of utterances and the number of applications for the applied pitch rule are both updated; if the user does not, only the number of applications is updated. At the end timing of the evaluation period, the pitch rule under which the dialogue was liveliest is set and is applied in the next rule fixed period.

Next, the pitch of an utterance, the basic pitch of a speech sequence, and the pitch of the changed speech sequence will be described using specific examples.

FIG. 9(a) is an example of an utterance by the user. In this example, the language analysis result of the utterance is "asu wa hare desu ka?" (Will it be sunny tomorrow?), and the pitch of each sound of the utterance is indicated by a note as shown in the figure. The pitch waveform of the utterance is actually a waveform like that shown in FIG. 8(b), but here the pitches are expressed as notes for convenience of explanation.
In this example, as described above, the answer creating unit 110 outputs a speech sequence for "hai" (yes), for example, if the weather information acquired in response to the utterance indicates sunny weather, and a speech sequence for "iie" (no) otherwise.
FIG. 9(b) is an example of the speech sequence for "hai". In this example, a note is assigned to each sound, defining the pitch and sounding timing of each sound (phoneme) of the basic speech. For simplicity, one note is assigned to each sound (phoneme) in this example, but a plurality of notes may be assigned to one sound, as with slurs or ties.

If a third below is applied as the pitch rule, the speech sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. In the utterance shown in (a), when the pitch data indicates that the pitch of the section of the ending "ka", denoted by symbol A, is "mi", the voice control unit 109 changes the pitch of the entire speech sequence so that the pitch of the section of the ending "i" of the answer "hai", denoted by symbol B, becomes "do", which is a third below "mi" (see FIG. 9(c)).

If a fifth below is applied as the pitch rule, the speech sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. The voice control unit 109 changes the pitch of the entire speech sequence so that the pitch of the section of the ending "i" of the answer "hai", denoted by symbol B, becomes "la", which is a fifth below the "mi" of symbol A (see FIG. 9(d)).
If a sixth below is applied as the pitch rule, the voice control unit 109 changes the pitch of the entire speech sequence so that the pitch of the section of the ending "i", denoted by symbol B, becomes "sol", which is a sixth below the "mi" of symbol A (see FIG. 9(e)).

Although not specifically illustrated, if a fourth above is applied as the pitch rule, the voice control unit 109 changes the pitch of the entire speech sequence so that the pitch of the section of the ending "i", denoted by symbol B, becomes "la", which is a fourth above the "mi" of symbol A. If an eighth below is applied as the pitch rule, the voice control unit 109 changes the pitch of the entire speech sequence so that the pitch of the section of the ending "i" becomes "mi", which is an eighth (one octave) below the "mi" of symbol A.
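The interval arithmetic in these examples can be checked with a short sketch. It works in diatonic scale steps on the do-re-mi scale, which is a simplification chosen for illustration: the description does not say how the speech synthesis unit 112 represents pitch, and a real implementation would more likely shift fundamental frequency. All names, the octave numbers, and the two-note sequence for "hai" are hypothetical.

```python
SCALE = ["do", "re", "mi", "fa", "sol", "la", "ti"]


def to_step(name, octave):
    """Encode a note as an absolute scale-step number."""
    return octave * len(SCALE) + SCALE.index(name)


def from_step(step):
    """Decode an absolute scale-step number into (note name, octave)."""
    return SCALE[step % len(SCALE)], step // len(SCALE)


def target_ending(utterance_ending, degree, direction):
    """An interval of degree n spans n - 1 scale steps (a third is 2 steps, an octave is 7)."""
    sign = 1 if direction == "up" else -1
    return utterance_ending + sign * (degree - 1)


def transpose(sequence, target):
    """Shift the whole answer so that its last note lands on the target ending pitch."""
    offset = target - sequence[-1]
    return [step + offset for step in sequence]


utterance_ending = to_step("mi", 4)                # ending "ka", symbol A
answer = [to_step("sol", 4), to_step("mi", 4)]     # hypothetical notes for "ha" and "i"
third_below = target_ending(utterance_ending, 3, "down")
shifted = [from_step(s) for s in transpose(answer, third_below)]
```

Shifting every note by the same offset mirrors the statement that the pitch of the entire speech sequence is changed, not only the ending.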

The description here has used "hai" as an example, but the pitch of the entire speech sequence is changed in the same way for "iie" (not illustrated). The same applies when the answer gives specific content, for example "hare desu" (It will be sunny) in response to the utterance "asu no tenki wa?" (What is tomorrow's weather?).

In the present embodiment, the answer is synthesized so that the pitch of the ending of the answer forms a consonant interval with the pitch of the ending of the utterance, so the answer does not strike the user as unnatural.
In addition, the pitch rule applied in a rule fixed period is the pitch rule under which the dialogue was liveliest in the evaluation period preceding that rule fixed period. The dialogue therefore tends to be lively in the rule fixed period as well; put simply, the user finds it easy to speak. Because this pitch rule is set anew at every evaluation period, it converges on conditions that the user finds comfortable and reassuring and under which the dialogue is lively.

<Second Embodiment>
In the first embodiment described above, a plurality of pitch rules are applied in the evaluation period, and the pitch rule under which the dialogue was liveliest is set and used in the rule fixed period. Besides pitch, however, another factor that makes a dialogue lively is the "pause", that is, the time from the utterance to the answer.
The second embodiment therefore describes an example in which, in addition to the pitch control of the answer by the pitch rule setting of the first embodiment, answers are output with a plurality of pauses during the evaluation period, and the pause under which the dialogue was liveliest is set and applied in the rule fixed period to control the pause before the answer.

The functional blocks constructed by executing the application program in the second embodiment are almost the same as those in the first embodiment (FIG. 1).
In the second embodiment, however, in addition to the table for evaluating pitch rules shown in FIG. 4, a table for evaluating answer output rules, such as that shown in FIG. 10, is used as an index table.

As shown in FIG. 10, the index table for evaluating answer output rules associates a number of utterances and a number of applications with each output rule. An output rule here defines, for the synthesis of an answer, for example the time from the end of the utterance (its ending) to the start of the answer (its beginning), and is defined in steps of 0.5 seconds, 1.0 seconds, 1.5 seconds, 2.0 seconds, and 2.5 seconds as shown in the figure.
The number of utterances and the number of applications associated with each output rule have the same meaning as in the first embodiment.

The operation of the second embodiment is roughly that of FIG. 6 and FIG. 7 in the first embodiment with "pitch rule" read as "pitch rule and output rule".
Specifically, in step Sa18 of FIG. 6, if the current time is within a rule fixed period, the voice control unit 109 decides to synthesize speech by applying the pitch rule and the output rule set in the evaluation period preceding that rule fixed period. In step Sa19, if the current time is within an evaluation period, the voice control unit 109 selects one of the three pitch rules, also selects one of three output rules, namely the output rule set in the evaluation period immediately preceding the current one and the two output rules adjacent to it above and below in the index table (see FIG. 10), and applies the selected pitch rule and output rule. In step Sa21, on receiving the pitch data, the voice control unit 109 instructs the speech synthesis unit 112 so that the pitch of the ending of the answer has the relationship defined by the applied pitch rule with respect to the pitch indicated by the pitch data, and so that the time from the ending of the utterance until output of the answer starts is the time defined by the applied output rule.
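The output rule handling can be sketched in the same style. The list `OUTPUT_RULES` holds the pause values given for FIG. 10; the function names are hypothetical, times are plain seconds on an arbitrary clock, and the behavior at either end of the table (only one neighbor) is an assumption.

```python
OUTPUT_RULES = [0.5, 1.0, 1.5, 2.0, 2.5]  # pause from utterance ending to answer start, in seconds


def candidate_pauses(current, rules=OUTPUT_RULES):
    """The set output rule together with its neighbors above and below in the index table."""
    i = rules.index(current)
    return rules[max(i - 1, 0): i + 2]


def remaining_wait(utterance_end, pause, now):
    """Seconds still to wait before starting the answer; 0 if the pause has already elapsed."""
    return max(utterance_end + pause - now, 0.0)


pauses = candidate_pauses(1.5)
wait = remaining_wait(utterance_end=10.0, pause=1.5, now=10.4)
```

Measuring the wait from the utterance ending, not from the moment the answer is ready, keeps the pause independent of how long language analysis and answer creation take, provided they finish within the pause.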

In steps Sb14 and Sb16 of FIG. 7, the voice control unit 109 identifies the pitch rule and the output rule applied to the answer in order to update the two index tables. In step Sb15, it increments by 1 both items for the pitch rule applied to the answer and both items for the output rule applied to the answer. In step Sb17, it increments by 1 only the number of applications for the pitch rule applied to the answer and only the number of applications for the output rule applied to the answer.
At the end timing of the evaluation period, the voice control unit 109 sets, in step Sb19, the pitch rule and the output rule applied in the case where the dialogue was liveliest in the evaluation period, and then, in step Sb20, clears the items for the pitch rules and output rules evaluated in that evaluation period.

According to the second embodiment, the pitch rule and the output rule under which the dialogue was liveliest in the evaluation period are applied in the rule fixed period following that evaluation period, so an answer that the user finds comfortable and pleasing is returned after a pause that makes it easy for the user to speak.
For example, as shown in FIG. 11, when the user W says "asu no tenki wa?" (What is tomorrow's weather?) and the speech synthesizer 10 outputs the answer "hare desu" (It will be sunny), the time Ta from the ending "wa" of the utterance to the beginning "ha" of the answer is set to a time under which the dialogue tends to be lively for the user W. In this case, although not specifically illustrated, the pitch of the ending "su" of the answer is set in relation to the pitch of the ending "wa" of the utterance according to the pitch rule under which the dialogue tends to be lively.
Therefore, in the second embodiment, as in the first embodiment, the answer is synthesized so that the pitch of the ending of the answer forms a consonant interval with the pitch of the ending of the utterance. In addition, compared with the first embodiment, the answer is synthesized after a pause that makes it easy to speak, so the dialogue with the user can be made even livelier.

In the second embodiment, the pause from the utterance to the answer is controlled in addition to the pitch control of the answer in the first embodiment, but the pause may also be controlled alone, separately from the pitch control. Such a configuration corresponds to FIG. 6 and FIG. 7 in the first embodiment with "pitch rule" read as "output rule", and a person skilled in the art should be able to infer its details sufficiently from the description of the second embodiment above.

<Third Embodiment>
Next, a third embodiment will be described.
To outline the premise of the third embodiment: as described above, which pitch relationship between the ending of an utterance and the ending of an answer feels comfortable differs from person to person. In particular, because the pitch of utterances differs greatly between women and men (higher for women, lower for men), their perceptions are thought to differ greatly as well.
In addition, speech synthesis in recent years can sometimes output the voice of a virtual character whose gender, age, and so on are defined. When the voice of the answering character is changed, and especially when its gender is changed, the user is likely to receive a different impression from the answers than before.
Therefore, the third embodiment assumes, as scenes, the combinations of the user's gender (female or male) and the gender of the synthesized voice, prepares an index table for each scene, and uses the index table of the scene that applies when the user speaks.

FIG. 12 is a diagram showing an example of the index tables in the third embodiment. One index table is prepared for each combination of the user's gender and the gender of the synthesized voice. Specifically, as shown in the figure, four index tables are prepared in the management table 127: two cases for the user (female, male) times two cases for the answering voice of the device (female, male).
The voice control unit 109 selects one of these four tables as follows.

Specifically, the voice control unit 109 identifies the user's gender from, for example, the personal information of the user logged in to the terminal device serving as the speech synthesizer 10. Alternatively, the voice control unit 109 may apply volume analysis, frequency analysis, or the like to the user's utterance, compare the result with male and female patterns stored in advance, and take the gender of the more similar pattern as the user's gender. The voice control unit 109 identifies the gender of the answering voice from the configured information (the gender information of the dialogue agent). Once the voice control unit 109 has identified the user's gender and the gender of the answering voice in this way, it selects the index table corresponding to that combination.
After the index table is selected, the rule fixed period and the evaluation period are repeated as in the first embodiment.
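The per-scene lookup can be sketched as a table keyed by the gender pair. This is only an illustration under stated assumptions: the names `INDEX_TABLES`, `new_index_table`, and `select_index_table` are hypothetical, and the contents of each table (one counter per pitch rule) are assumed, since this passage does not spell out the table layout of FIG. 12.

```python
# Pitch rules named in the description; the counter layout is an assumption.
PITCH_RULES = ("fourth up", "third down", "fifth down",
               "sixth down", "octave down")


def new_index_table():
    """Return a fresh table with one follow-up counter per pitch rule."""
    return {rule: 0 for rule in PITCH_RULES}


# Hypothetical stand-in for the four tables held in management table 127,
# keyed by (user gender, answering-voice gender).
INDEX_TABLES = {
    (user, voice): new_index_table()
    for user in ("female", "male")
    for voice in ("female", "male")
}


def select_index_table(user_gender, voice_gender):
    """Return the index table for the identified gender combination."""
    key = (user_gender, voice_gender)
    if key not in INDEX_TABLES:
        raise KeyError(f"no index table prepared for scene {key}")
    return INDEX_TABLES[key]


table = select_index_table("female", "male")
table["fifth down"] += 1  # counts recorded in one scene stay in that scene
print(table)
```

Keeping a separate table per scene means that counts gathered for one combination never influence the rule chosen for another.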

According to the third embodiment, the index table of the scene that applies when the user speaks is used. In the rule fixed period, the pitch of the ending of the answer is controlled so that it has, relative to the pitch of the ending of the utterance, the relationship of the pitch rule set in that index table. In the evaluation period, the pitch rule in that index table under which the dialogue was liveliest is set.
For this reason, the third embodiment can handle a variety of scenes while remaining comfortable for the user and keeping the conversation flowing.

In the first embodiment as well, repeating the rule fixed period and the evaluation period leads, even when the scene changes, to convergence on conditions that are comfortable for the user and keep the conversation flowing. However, the time required (the number of repetitions of the rule fixed period and the evaluation period) is expected to be long. In the third embodiment, by contrast, if an appropriate pitch rule is set in advance as the initial state for each scene, the time until convergence on such conditions can be shortened.

The third embodiment has been described using the pitch rules of the first embodiment in the index tables, but the output rules of the second embodiment may be used together with them and switched according to the scene.
As for scenes, age (age group) may be combined with gender. Scenes are also not limited to the gender and age of the user or of the answering character. They may be prepared for the speed of the utterance, the speed of the answer, or the intended use of the speech synthesizer 10, for example voice guidance in facilities (museums, art galleries, zoos, and the like) or voice dialogue in vending machines.

<Application examples and modifications>
The present invention is not limited to the embodiments described above, and various applications and modifications such as those described below are possible. One or more arbitrarily selected applications and modifications from those described below may also be combined as appropriate.

<Voice input unit>
In the embodiments, the voice input unit 102 picks up the user's voice (utterance) with a microphone and converts it into a voice signal, but the voice input unit recited in the claims is not limited to this configuration. That is, the voice input unit recited in the claims need only be configured to input, or to receive, an utterance given as a voice signal in some form. Specifically, the voice input unit recited in the claims is a concept that includes a configuration that receives a voice signal processed by another processing unit or a voice signal supplied (or transferred) from another device, and further includes an input interface circuit or the like that is built into an LSI and simply receives a voice signal and passes it to a subsequent stage.

<Audio waveform data>
In each embodiment, the answer creation unit 110 outputs, as the answer to an utterance, a voice sequence in which a pitch is assigned to each sound. Alternatively, the answer may be output as voice waveform data, for example in wav format.
Unlike the voice sequence described above, voice waveform data has no pitch assigned to each sound. The voice control unit 109 may therefore, for example, identify the pitch of the ending as it would sound if the data were simply reproduced, apply pitch conversion such as filter processing so that the identified pitch has the predetermined relationship with the pitch indicated by the pitch data, and then output (reproduce) the voice waveform data.
The pitch conversion may also be performed by so-called key control, a technique well known in karaoke equipment that shifts the pitch without changing the speaking speed.
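How far the waveform must be shifted follows directly from the ending pitch measured on plain playback and the pitch the rule calls for. The sketch below computes only that shift amount; the pitch conversion itself (filter processing or key control) is not shown. The function name `pitch_shift_semitones` and the use of equal-tempered semitones as the unit are assumptions made for illustration.

```python
import math


def pitch_shift_semitones(measured_hz, target_hz):
    """Return the shift, in equal-tempered semitones, that moves a tone at
    `measured_hz` to `target_hz`. Negative values mean a downward shift."""
    if measured_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(target_hz / measured_hz)


# Example: the ending of the waveform plays back at 220 Hz, and the rule
# calls for an ending at 165 Hz.
shift = pitch_shift_semitones(220.0, 165.0)
print(round(shift, 2))
```

A pitch converter that preserves speaking speed would then be driven with this value.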

<Endings and beginnings of answers and utterances>
In each embodiment, the pitch of the ending of the answer is controlled according to the pitch of the ending of the utterance. Depending on the language, dialect, or phrasing, however, a part of the answer other than its ending, for example its beginning, may be the characteristic part. In such a case, a person who has spoken and then receives an answer unconsciously compares the pitch of the ending of the utterance with the pitch of the characteristic beginning of the answer, and forms an impression of the answer on that basis. In this case, therefore, the pitch of the beginning of the answer may be controlled according to the pitch of the ending of the utterance. With this configuration, when the beginning of the answer is its characteristic part, a psychological impression can be given to the user who receives the answer.

The same applies to utterances: the judgment is not limited to the ending and may be based on the beginning. Moreover, for both utterances and answers, the judgment is not limited to the beginning or ending; it may be based on the average pitch, or on the pitch of the most strongly pronounced part. Accordingly, the first section of the utterance and the second section of the answer are not necessarily limited to the beginning or the ending.
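The alternatives listed above (beginning, ending, average, most strongly pronounced part) can be sketched as one function that returns a representative pitch for a section. The frame format, a time-ordered list of (pitch in Hz, volume) pairs for voiced frames, and the name `representative_pitch` are assumptions made for illustration; taking the single first or last frame as the beginning or ending is likewise a simplification.

```python
from statistics import fmean


def representative_pitch(frames, mode="ending"):
    """Return one pitch (Hz) that represents a section of speech.

    `frames` is a time-ordered list of (pitch_hz, volume) pairs.
    """
    if not frames:
        raise ValueError("no voiced frames in the section")
    if mode == "ending":
        return frames[-1][0]
    if mode == "beginning":
        return frames[0][0]
    if mode == "average":
        return fmean(pitch for pitch, _ in frames)
    if mode == "loudest":
        # Pitch of the frame with the highest volume.
        return max(frames, key=lambda frame: frame[1])[0]
    raise ValueError(f"unknown mode: {mode}")


frames = [(200.0, 0.2), (240.0, 0.9), (180.0, 0.4)]
for mode in ("beginning", "ending", "average", "loudest"):
    print(mode, representative_pitch(frames, mode))
```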

<Interval relationship>
In the embodiments described above, a fourth above, a third below, a fifth below, a sixth below, and an octave below were given as examples of pitch rules, but other intervals may be used. Moreover, even outside the consonant intervals, some interval relationships are recognized empirically as giving a good (or bad) impression, so the pitch of the answer may be controlled to have such a relationship. Even in this case, however, if the pitch of the ending or other part of the utterance and the pitch of the ending or other part of the answer are too far apart, the answer tends to sound unnatural. It is therefore desirable that the pitch of the utterance and the pitch of the answer lie within one octave of each other, above or below.
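The exemplified pitch rules can be turned into target frequencies as in the sketch below. Several points are assumptions made for illustration: equal temperament is used, the third and sixth are taken to be major (the text does not say whether they are major or minor), and the names `SEMITONES`, `answer_pitch`, and `within_one_octave` are hypothetical.

```python
# Semitone offsets assumed for each exemplified pitch rule.
SEMITONES = {
    "fourth up": 5,
    "third down": -4,   # major third assumed
    "fifth down": -7,
    "sixth down": -9,   # major sixth assumed
    "octave down": -12,
}


def answer_pitch(utterance_hz, rule):
    """Return the target pitch (Hz) of the answer for the given rule."""
    if utterance_hz <= 0:
        raise ValueError("utterance pitch must be positive")
    return utterance_hz * 2.0 ** (SEMITONES[rule] / 12.0)


def within_one_octave(utterance_hz, answer_hz):
    """True if the two pitches are at most one octave apart."""
    ratio = answer_hz / utterance_hz
    return 0.5 <= ratio <= 2.0


target = answer_pitch(220.0, "fifth down")
print(round(target, 1), within_one_octave(220.0, target))
```

A rule added beyond these five could be checked with `within_one_octave` before use, in line with the one-octave recommendation above.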

<Others>
In the embodiments, the language analysis unit 108, the language database 122, and the answer database 124, which together obtain the answer to an utterance, are provided in the speech synthesizer 10. On a terminal device or the like, however, the processing load becomes heavy and the storage capacity is limited, so these components may instead be provided on an external server. That is, it is sufficient for the answer creation unit 110 in the speech synthesizer 10 to obtain the answer to an utterance in some form and to output data defining the speech of that answer. It does not matter whether the answer is created by the speech synthesizer 10 itself or by some component other than the speech synthesizer 10 (for example, an external server).
If the speech synthesizer 10 is used in an application where answers to utterances can be created without accessing an external server or the like, the information acquisition unit 126 is unnecessary.

DESCRIPTION OF SYMBOLS 102 ... Voice input unit, 104 ... Utterance section detection unit, 106 ... Pitch analysis unit, 108 ... Language analysis unit, 109 ... Voice control unit, 110 ... Answer creation unit, 112 ... Speech synthesis unit, 126 ... Information acquisition unit.

Claims

A voice control device comprising:
a voice input unit to which an utterance is input as a voice signal;
an acquisition unit that acquires an answer to the utterance; and
a voice control unit that changes the interval from the input of the voice signal of the utterance to the output of the voice signal of the answer so that the interval is in a relationship defined by one output rule among a plurality of preset output rules,
and that sets, from among the plurality of output rules, one output rule for which the proportion of utterances made in response to the answer satisfies a predetermined condition within a predetermined period.

The voice control apparatus according to claim 1, wherein the output rule is set according to any one of a plurality of scenes prepared in advance.

A voice control method in which a computer:
acquires an answer to an utterance given by an input voice signal;
changes the interval from the input of the voice signal of the utterance to the output of the voice signal of the answer so that the interval is in a relationship defined by one output rule among a plurality of preset output rules; and
sets, from among the plurality of output rules, one output rule for which the proportion of utterances made in response to the answer satisfies a predetermined condition within a predetermined period.

A program that causes a computer to function as:
a voice input unit to which an utterance is input as a voice signal;
an acquisition unit that acquires an answer to the utterance; and
a voice control unit that changes the interval from the input of the voice signal of the utterance to the output of the voice signal of the answer so that the interval is in a relationship defined by one output rule among a plurality of preset output rules,
and that sets, from among the plurality of output rules, one output rule for which the proportion of utterances made in response to the answer satisfies a predetermined condition within a predetermined period.