JP2015069038A

JP2015069038A - Voice synthesizer and program

Info

Publication number: JP2015069038A
Application number: JP2013203840A
Authority: JP
Inventors: Hiroaki Matsubara; Junya Ura; Takehiko Kawahara; Yuji Hisaminato; Katsuji Yoshimura
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-09-30
Filing date: 2013-09-30
Publication date: 2015-04-13
Anticipated expiration: 2033-09-30
Also published as: JP6343895B2

Abstract

PROBLEM TO BE SOLVED: To synthesize speech in response to a user's utterance so that the exchange feels like a conversation with a person. SOLUTION: A voice synthesizer comprises a voice input section 102 to which an utterance is input as a voice signal; a non-language analysis section 106 that analyzes the pitch of a specific first section of the utterance and how the pitch of the utterance changes; an answer creation section 110 that acquires an answer to the utterance; a voice synthesis section 112 that synthesizes the acquired answer as speech; and a voice control section 109 that causes the voice synthesis section 112 to change the pitch of a specific second section of the answer to a pitch having a predetermined relationship to the pitch of the first section, and to change how the pitch of the answer changes according to how the pitch of the utterance changes.

Description

The present invention relates to a speech synthesizer and a program.

In recent years, the following speech synthesis techniques have been proposed: a technique that produces more human-sounding speech by synthesizing and outputting speech matched to the user's speaking style and voice quality (see, for example, Patent Document 1), and a technique that analyzes the user's speech to diagnose the user's psychological state, health condition, and the like (see, for example, Patent Document 2).
A voice dialogue system has also been proposed that recognizes speech input by the user and outputs content specified in a scenario by speech synthesis, thereby realizing a spoken dialogue with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A
Patent Document 2: Japanese Patent No. 4495907
Patent Document 3: Japanese Patent No. 4832097

Now consider a dialogue system that combines the speech synthesis techniques and the voice dialogue system described above, retrieving data in response to a user's spoken utterance and outputting it by speech synthesis. It has been pointed out that the synthesized speech output by such a system can feel unnatural to the user, specifically that it can sound unmistakably like a machine talking.
The present invention has been made in view of these circumstances. One of its objects is to provide a speech synthesizer and a program that respond to a user's utterance in a way that feels natural to the user, specifically in a way that gives the impression of conversing with a person.

In examining a man-machine system that outputs (replies with) an answer to a user's utterance by speech synthesis, the present inventor first considered what kind of dialogue takes place between people, focusing on pitch (frequency).

Consider a dialogue between two people in which one person (a) makes an utterance (including a question, a monologue, an inquiry, and the like) and the other person (b) answers it (including with a backchannel response). When a speaks, the pitch of a certain section of the utterance often remains as a strong impression not only for a but also for b, who is about to answer. When b answers with agreement, approval, affirmation, or the like, b speaks so that the pitch of the part that characterizes the answer, for example its ending or beginning, stands in a predetermined relationship, specifically a consonant interval, to the pitch of the utterance that left that impression. Because the pitch that a remembers from a's own utterance and the pitch of the characterizing part of the answer stand in this relationship, a hears b's answer as pleasant and reassuring and forms a favorable impression of it. This is what the present inventor hypothesized.

For example, when a says "sou desho?" ("Isn't that right?"), both a and b retain the pitch of the final "sho", where the sense of pressing for confirmation comes out strongly. If b then intends to answer affirmatively with "a, hai" ("Ah, yes"), b says "a, hai" so that the pitch of the part that characterizes the answer, for example the final "i", stands in the above relationship to the remembered pitch of "sho".

FIG. 2 shows the formants in such an actual dialogue. In this figure, the horizontal axis is time and the vertical axis is frequency, and whiter regions of the spectrum indicate higher intensity.
As the figure shows, the spectrum obtained by frequency analysis of human speech appears as a plurality of peaks that move over time, that is, formants. Specifically, the formants corresponding to "sou desho?" and those corresponding to "a, hai" each appear as three peak bands (white band-like portions running along the time axis).
Focusing on the first formant, the lowest in frequency of these three peak bands, the frequency at reference sign A (its central part), which corresponds to the "sho" of "sou desho?", is approximately 400 Hz. The frequency at reference sign B, which corresponds to the "i" of "a, hai", is approximately 260 Hz. The frequency at A is therefore roughly 3/2 times the frequency at B.
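The arithmetic behind the "roughly 3/2" observation can be checked with a short sketch. It uses only the two approximate frequencies read from FIG. 2; the helper name `interval_cents` and the use of cents as a unit are illustrative choices, not part of the original text.

```python
import math


def interval_cents(f_upper_hz: float, f_lower_hz: float) -> float:
    """Return the interval between two frequencies in cents (1200 per octave)."""
    return 1200.0 * math.log2(f_upper_hz / f_lower_hz)


# Approximate values read from FIG. 2 in the text above.
utterance_end_hz = 400.0  # "sho" of "sou desho?" (reference sign A)
answer_end_hz = 260.0     # "i" of "a, hai" (reference sign B)

measured_ratio = utterance_end_hz / answer_end_hz
measured_cents = interval_cents(utterance_end_hz, answer_end_hz)
fifth_cents = interval_cents(3.0, 2.0)  # a perfect fifth with ratio 3/2

print(f"measured ratio: {measured_ratio:.3f} (3/2 = 1.500)")
print(f"measured interval: {measured_cents:.0f} cents, "
      f"perfect fifth: {fifth_cents:.0f} cents")
```

The measured ratio is close to, but not exactly, 3/2, which is consistent with the text's later remark that pitch analysis and control are allowed a certain range.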

In terms of musical intervals, a frequency ratio of 3/2 is the relationship of "so" (G) to "do" (C) in the same octave, or of "mi" (E) to "la" (A) one octave below; as described later, it is a perfect fifth. This frequency ratio (the predetermined relationship between the pitches) is one preferred example, and various other examples are given later.

FIG. 3 shows the relationship between note names (solfège syllables) and the frequencies of the human voice. It also shows the frequency ratios relative to "do" in the fourth octave; "so" relative to "do" is 3/2, as noted above. The frequency ratios relative to "la" in the third octave are shown alongside.

Thus, in a dialogue between people, the pitch of an utterance and the pitch of the answer to it are not unrelated but stand in the relationship described above. The present inventor analyzed many dialogue examples and statistically aggregated evaluations by many people, which confirmed that this idea is largely correct.
Experience also shows that the way an answer is given depends not only on the pitch but also, for example, on how the pitch changes. For example, even when a spoken utterance ends in a noun, as in "asu wa hare" ("tomorrow, sunny"), if the pitch rises toward the end, the utterance becomes a question meaning "Will it be sunny tomorrow?"
If the pitch of "asu wa hare" stays almost constant, the utterance is merely a monologue or a murmur. In that case, the pitch of the answer (backchannel response) to it, for example "sou desu ne" ("That's right"), is also almost constant.
Therefore, in a dialogue system that outputs an answer to a user's utterance by speech synthesis, not only the pitch of the utterance but also non-linguistic information, namely how that pitch changes, can be an important factor in synthesizing the answer.
To achieve the above object for such speech synthesis, the following configuration is adopted.

That is, to achieve the above object, a speech synthesizer according to one aspect of the present invention comprises: a voice input unit to which an utterance is input as a voice signal; a non-linguistic analysis unit that analyzes the pitch of a specific first section of the utterance and how the pitch of the utterance changes; an acquisition unit that acquires an answer to the utterance; a speech synthesis unit that synthesizes the acquired answer as speech; and a voice control unit that causes the speech synthesis unit to change the pitch of a specific second section of the answer to a pitch having a predetermined relationship to the pitch of the first section, and to change how the pitch of the answer changes according to how the pitch of the utterance changes.
According to this aspect, the pitch of the specific second section of the answer is changed to a pitch having a predetermined relationship to the pitch of the specific first section of the utterance. In addition, the change in the pitch of the answer is controlled according to how the pitch of the utterance changes. This makes it possible to give the user the impression of conversing with a person.
Examples of controlling the pitch of the answer according to how the pitch of the utterance changes include the following: when the pitch of the utterance hardly changes (is flat), the pitch of the backchannel answer is also made flat; and when the utterance is a question whose pitch rises toward the end, the pitch of the answer is made to fall toward its end.
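The two control examples just given (flat utterance, flat answer; rising question, falling answer) can be sketched as a small mapping. This is a minimal illustration under stated assumptions: the contour labels, the thirds-based comparison, and the `flat_tolerance` threshold are all hypothetical and do not come from the original text.

```python
def classify_contour(pitches_hz, flat_tolerance=0.03):
    """Label a pitch track as 'flat', 'rising', or 'falling'.

    Compares the mean of the last third of the track with the mean of the
    first third. flat_tolerance is a hypothetical relative threshold.
    """
    if not pitches_hz:
        raise ValueError("pitch track is empty")
    n = max(1, len(pitches_hz) // 3)
    head = sum(pitches_hz[:n]) / n
    tail = sum(pitches_hz[-n:]) / n
    change = (tail - head) / head
    if abs(change) <= flat_tolerance:
        return "flat"
    return "rising" if change > 0 else "falling"


# Only the two cases named in the text are mapped.
ANSWER_CONTOUR = {"flat": "flat", "rising": "falling"}


def answer_contour(utterance_contour):
    # The text gives no rule for a falling utterance; this sketch leaves
    # the answer's own contour unchanged in that case.
    return ANSWER_CONTOUR.get(utterance_contour, "unchanged")


murmur = [200.0, 201.0, 199.0, 200.0, 202.0, 200.0]
question = [200.0, 210.0, 220.0, 240.0, 260.0, 280.0]
print(answer_contour(classify_contour(murmur)))
print(answer_contour(classify_contour(question)))
```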

In this aspect, the first section is preferably, for example, the ending of the utterance, and the second section is preferably the beginning or ending of the answer. This is because, as described above, the section that characterizes the impression of an utterance is often its ending, and the section that characterizes the impression of an answer is often its beginning or ending.
The predetermined relationship is preferably a consonant interval other than a perfect unison. Here, consonance refers to the relationship in which multiple musical tones sounded together blend and harmonize well; intervals with this relationship are called consonant intervals. The simpler the frequency ratio between the two tones, the higher the degree of consonance. The simplest ratios, 1/1 (perfect unison) and 2/1 (perfect octave), are called absolute consonances; together with 3/2 (perfect fifth) and 4/3 (perfect fourth), they are called perfect consonances. The ratios 5/4 (major third), 6/5 (minor third), 5/3 (major sixth), and 8/5 (minor sixth) are called imperfect consonances, and all other frequency ratios (major and minor seconds and sevenths, the various augmented and diminished intervals, and so on) are called dissonant intervals.
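The interval classes listed above can be written down as a lookup keyed by exact frequency ratio. The following sketch encodes only the ratios named in the text; the function name `classify_interval` and the returned labels are illustrative.

```python
from fractions import Fraction

# Ratios and names as listed in the text.
ABSOLUTE = {
    Fraction(1, 1): "perfect unison",
    Fraction(2, 1): "perfect octave",
}
# Absolute consonances plus these two form the perfect consonances.
PERFECT_EXTRA = {
    Fraction(3, 2): "perfect fifth",
    Fraction(4, 3): "perfect fourth",
}
IMPERFECT = {
    Fraction(5, 4): "major third",
    Fraction(6, 5): "minor third",
    Fraction(5, 3): "major sixth",
    Fraction(8, 5): "minor sixth",
}


def classify_interval(ratio: Fraction) -> str:
    """Return the consonance class of an exact frequency ratio."""
    if ratio in ABSOLUTE:
        return "absolute consonance"
    if ratio in PERFECT_EXTRA:
        return "perfect consonance"
    if ratio in IMPERFECT:
        return "imperfect consonance"
    return "dissonance"


print(classify_interval(Fraction(3, 2)))
print(classify_interval(Fraction(9, 8)))
```

Exact fractions are used here because the text defines the classes by exact ratios; measured pitches would need a tolerance around each ratio.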

If the pitch of the beginning or ending of the answer were the same as the pitch of the ending of the utterance, the dialogue would likely feel unnatural, so the perfect unison is excluded from the consonant intervals above.
In the above aspect, the most desirable example of the predetermined relationship is considered to be, as described above, one in which the pitch of the second section is a consonant interval a fifth below the pitch of the first section. However, the predetermined relationship is not limited to consonant intervals other than the perfect unison; it may be a dissonant interval, or any pitch relationship within one octave above or below, excluding the same pitch.
Answers are not limited to specific replies to questions; they also include backchannel responses (interjections) such as "naruhodo" ("I see") and "sou desu ne" ("That's right").
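As a worked illustration of the preferred relationship, the target pitch for the second section can be computed by multiplying the first-section pitch by a ratio. This sketch assumes a fifth below corresponds to a ratio of 2/3 (the inverse of 3/2) and enforces the stated limits (within one octave up or down, unison excluded); the function name and error handling are hypothetical.

```python
def target_pitch_hz(first_section_hz: float, ratio: float = 2.0 / 3.0) -> float:
    """Return the target pitch for the answer's second section.

    ratio is (answer pitch) / (utterance pitch). The default of 2/3 places
    the answer a perfect fifth below the utterance's first section.
    """
    if first_section_hz <= 0:
        raise ValueError("pitch must be positive")
    if ratio == 1.0:
        raise ValueError("the same pitch (perfect unison) is excluded")
    if not 0.5 <= ratio <= 2.0:
        raise ValueError("ratio must stay within one octave up or down")
    return first_section_hz * ratio


# Utterance ending at about 400 Hz, as in the "sou desho?" example.
print(round(target_pitch_hz(400.0), 1))
```

With the 400 Hz ending from the earlier example, the target falls near the roughly 260 Hz observed for the answer.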

In the above aspect, the non-linguistic analysis unit may also analyze how non-linguistic information other than the pitch of the utterance changes, and the voice control unit may control the speech synthesis according to how that non-linguistic information changes.
With this configuration, synthesis of the answer is controlled according to changes in non-linguistic information other than pitch, which makes it possible to give the user an even stronger impression of conversing with a person. A typical example of non-linguistic information other than pitch is volume: the average volume of the answer may be controlled to match the volume of the utterance, or the way the volume of the answer changes may be controlled to match the way the volume of the utterance changes. Another example is speaking rate: the speaking rate of the answer may be controlled to match that of the utterance.

Aspects of the present invention can be conceived not only as a speech synthesizer but also as a program that causes a computer to function as the speech synthesizer.
In the present invention, the pitch (frequency) of the utterance is the object of analysis and the pitch of the answer is the object of control. However, as the formant example above makes clear, human speech occupies a certain frequency band, so analysis and control inevitably cover a certain frequency range as well. Analysis and control are also naturally subject to error. In this application, therefore, pitch analysis and control are not required to match numerical pitch (frequency) values exactly; a certain range is permitted.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the embodiment.
FIG. 2 is a diagram showing an example of speech formants in a dialogue.
FIG. 3 is a diagram showing the relationship between note names, frequencies, and the like.
FIG. 4 is a flowchart showing the speech synthesis process in the speech synthesizer.
FIG. 5 is a diagram showing a specific example of identifying the ending and the pitch change.
FIG. 6 is a diagram showing an example of a pitch shift applied to a voice sequence.
FIG. 7 is a diagram showing an example of a pitch shift applied to a voice sequence.
FIG. 8 is a block diagram showing the configuration of a speech synthesizer according to a modification.
FIG. 9 is a diagram showing the main part of the process in application example 1.
FIG. 10 is a diagram showing the main part of the process in application example 2.
FIG. 11 is a diagram showing the main part of the process in application example 3.

Embodiments of the present invention will be described below with reference to the drawings.
<Speech synthesizer>

FIG. 1 shows the configuration of a speech synthesizer 10 according to an embodiment of the present invention.
In this figure, the speech synthesizer 10 is a terminal device, such as a mobile phone, that has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 142. The CPU of the speech synthesizer 10 executes a preinstalled application program, which builds a plurality of functional blocks as follows.
Specifically, the speech synthesizer 10 builds an utterance section detection unit 104, a non-linguistic analysis unit 106, a language analysis unit 108, a voice control unit 109, an answer creation unit (acquisition unit) 110, a speech synthesis unit 112, a language database 122, an answer database 124, an information acquisition unit 126, and a voice library 128.
Although not shown, the speech synthesizer 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device and enter various operations. The speech synthesizer 10 is not limited to a terminal device such as a mobile phone; it may be a notebook or tablet personal computer.

Although the details are omitted, the voice input unit 102 consists of a microphone that converts speech into an electrical signal, an LPF (low-pass filter) that removes the high-frequency components of the converted voice signal, and an A/D converter that converts the filtered voice signal into a digital signal.
The utterance section detection unit 104 processes the digitized voice signal to detect the utterance (voiced) section.

The non-linguistic analysis unit 106 performs volume analysis and frequency analysis on the utterance in the voice signal detected as the utterance section and outputs, as non-linguistic information (information other than the linguistic content of the utterance), data indicating the pitch of a specific section (the first section) of the utterance and data indicating how the pitch of the utterance changes.
The data indicating the pitch is supplied to the voice control unit 109, and the data indicating the pitch change is supplied to both the voice control unit 109 and the answer creation unit 110.
Here, the first section is, for example, the ending of the utterance. The pitch referred to here is, for example, the first formant, the lowest-frequency component among the formants obtained by frequency analysis of the voice signal; in FIG. 2, it is the frequency (pitch) indicated by the peak band whose end is marked with reference sign A. An FFT (Fast Fourier Transform) or another known method can be used for the frequency analysis. Examples of specific methods for identifying the ending of the utterance and for identifying the pitch change are described later.
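To make the frequency-analysis step concrete, here is a minimal sketch that finds the strongest spectral component of one frame. It is not a formant tracker and not the patent's method: a naive DFT in pure Python stands in for the FFT mentioned above, and the strongest non-DC bin is used only as a crude proxy for the lowest peak band. The function name, frame length, and sample rate are assumptions.

```python
import cmath
import math


def peak_frequency_hz(samples, sample_rate_hz):
    """Return the frequency of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    if n < 4:
        raise ValueError("frame is too short")
    best_bin, best_mag = 1, 0.0
    # Only bins below the Nyquist frequency are examined.
    for k in range(1, n // 2):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate_hz / n


# Synthetic test frame: a 400 Hz sine sampled at 8 kHz for 400 samples,
# so the bin spacing is 20 Hz and 400 Hz falls exactly on a bin.
sample_rate_hz = 8000
frame = [math.sin(2 * math.pi * 400 * i / sample_rate_hz) for i in range(400)]
print(peak_frequency_hz(frame, sample_rate_hz))
```

In practice an FFT routine would replace the inner loop, and the analysis would run frame by frame over the utterance section to produce a pitch track.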

The language analysis unit 108, for its part, determines which phonemes the voice signal detected as the utterance section is closest to by referring to phoneme models prepared in advance in the language database 122, thereby analyzes (identifies) the utterance represented by the voice signal, and supplies the result to the answer creation unit 110.

The answer creation unit 110 creates an answer corresponding to the utterance analyzed by the language analysis unit 108, referring to the answer database 124 and the information acquisition unit 126 and using the data indicating the pitch change analyzed by the non-linguistic analysis unit 106.
In this embodiment, the answers created by the answer creation unit 110 are assumed to be of three kinds:
(1) an answer indicating affirmation, negation, or the like of the utterance;
(2) an answer giving specific content in response to the utterance;
(3) an answer serving as a backchannel response to the utterance.
Examples of (1) include "hai" ("yes") and "iie" ("no"). An example of (2) is answering "hare desu" ("It will be sunny") to the utterance "asu no tenki wa?" ("What is tomorrow's weather?"). Examples of (3) include "sou desu ne" ("That's right") and "eeto" ("Um"); such an answer is created (acquired) when the utterance is neither one that can be answered with "yes" or "no" as in (1) nor one that requires a specific answer as in (2).

For an answer of type (1), given an utterance such as "ima sanji desu ka?" ("Is it three o'clock now?"), the answer creation unit 110 can determine whether to answer "yes" or "no" by obtaining time information from a built-in real-time clock (not shown).
By contrast, for an utterance such as "asu wa hare desu ka?" ("Will it be sunny tomorrow?"), the speech synthesizer 10 cannot answer on its own without accessing an external server to obtain weather information. When the speech synthesizer 10 alone cannot answer, the information acquisition unit 126 accesses an external server via the Internet, acquires the information needed to create the answer, and supplies it to the answer creation unit 110. The answer creation unit 110 can then determine whether the utterance is correct and create the answer.
For an answer of type (2), given an utterance such as "ima nanji?" ("What time is it now?"), the answer creation unit 110 obtains the time information and takes the remaining wording from the answer database 124, and can thus create the answer "tadaima XX-ji XX-fun desu" ("It is now XX:XX"). For the utterance "asu no tenki wa?" ("What is tomorrow's weather?"), the information acquisition unit 126 accesses an external server to acquire the information needed for the answer, and the answer creation unit 110 creates an answer such as "hare desu" ("It will be sunny") by referring to the answer database 124.
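The two clock-based cases above (a type (1) yes/no answer and a type (2) specific answer) can be sketched as follows. This is only an illustration: the function names are hypothetical, the system clock stands in for the real-time clock, the answers are given in English, and treating "three o'clock" as matching any time within that hour on a 12-hour clock is an assumption the text does not specify.

```python
from datetime import datetime


def answer_is_it_hour(asked_hour: int, now: datetime) -> str:
    """Type (1): answer "Is it <asked_hour> o'clock now?" with yes or no.

    Assumption: the hour matches if the current hour equals the asked
    hour on a 12-hour clock, regardless of minutes.
    """
    same_hour = now.hour % 12 == asked_hour % 12
    return "yes" if same_hour else "no"


def answer_what_time(now: datetime) -> str:
    """Type (2): fill the time into a fixed answer template."""
    return f"It is now {now.hour}:{now.minute:02d}."


# A fixed example time is used so the sketch is reproducible;
# a device would read its real-time clock instead.
example_now = datetime(2013, 9, 30, 15, 10)
print(answer_is_it_hour(3, example_now))
print(answer_what_time(example_now))
```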

The answer creation unit 110 creates and outputs a voice sequence from the answer it has created or acquired. A voice sequence is a string of phonemes that specifies the pitch and sounding timing of each phoneme.
For answers of types (1) and (3), the voice sequences corresponding to the answers may be stored in the answer database 124 in advance, and the voice sequence corresponding to the determination result may be read from the answer database 124. Specifically, for an answer of type (1), the answer creation unit 110 reads a voice sequence such as "hai" or "iie" according to the determination result; for an answer of type (3), it reads a voice sequence such as "sou desu ne" or "eeto" according to the analysis of the utterance and the determination made by the answer creation unit 110.
The voice sequence created or acquired by the answer creation unit 110 is supplied to both the voice control unit 109 and the speech synthesis unit 112.
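One possible in-memory shape for a voice sequence, as described above (a phoneme string with a pitch and sounding timing per phoneme), is sketched below. The class and field names and the numeric values for "hai" are hypothetical; the text does not specify a data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeEvent:
    """One entry of a voice sequence (field names are illustrative)."""
    phoneme: str     # phoneme symbol
    pitch_hz: float  # pitch specified for this phoneme
    onset_s: float   # sounding timing, in seconds from the start


VoiceSequence = List[PhonemeEvent]

# Hypothetical voice sequence for the answer "hai"; values are made up.
hai: VoiceSequence = [
    PhonemeEvent("h", 300.0, 0.00),
    PhonemeEvent("a", 300.0, 0.05),
    PhonemeEvent("i", 280.0, 0.20),
]

ending = hai[-1]
print(ending.phoneme, ending.pitch_hz)
```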

The voice control unit 109 determines how the voice sequence is to be controlled according to the pitch data supplied from the non-linguistic analysis unit 106 and the data indicating how the pitch of the utterance changes.

Because a voice sequence specifies the pitch and sounding timing of the speech, the speech synthesis unit 112 can output the basic voice of the answer simply by synthesizing speech according to the voice sequence. However, the basic voice of the answer does not take into account the pitch of the ending of the utterance or the like, so, as described above, it can sound like a machine talking.
In this embodiment, therefore, the speech synthesis unit 112 synthesizes speech after modifying the basic voice specified by the voice sequence according to the control determined by the voice control unit 109, as follows. The voice control unit 109 changes the pitch of a specific section (the second section) of the voice sequence so that it has the predetermined relationship to the pitch data, and in some cases also changes the pitch movement leading up to the ending according to the data indicating how the pitch of the utterance changes.
In this embodiment, the second section is the ending of the answer, although, as described later, it is not limited to the ending. Also in this embodiment, the pitch having the predetermined relationship to the pitch data is the pitch a fifth below it, although, as described later, a relationship other than a fifth below may be used.
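The pitch change described above can be sketched as a shift applied to the pitches of a voice sequence so that its ending lands a fifth below the utterance's pitch data. This sketch assumes, for illustration only, that every pitch in the sequence is scaled by the same factor so the answer keeps its own contour; the function name and sample values are hypothetical.

```python
def shift_sequence(pitches_hz, utterance_end_hz, ratio=2.0 / 3.0):
    """Scale all pitches so the last one equals utterance_end_hz * ratio.

    ratio = 2/3 places the answer's ending a perfect fifth below the
    utterance's ending. Uniform scaling is an assumption of this sketch.
    """
    if not pitches_hz:
        raise ValueError("voice sequence has no pitches")
    target_end_hz = utterance_end_hz * ratio
    scale = target_end_hz / pitches_hz[-1]
    return [p * scale for p in pitches_hz]


# Hypothetical basic voice for "hai" and an utterance ending at 400 Hz.
basic = [300.0, 320.0, 280.0]
shifted = shift_sequence(basic, 400.0)
print([round(p, 1) for p in shifted])
```

Scaling by a common factor preserves the ratios between successive pitches, so only the overall register of the answer moves.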

When synthesizing speech, the speech synthesis unit 112 uses speech segment data registered in the voice library 128. The voice library 128 is a database, built in advance, of speech segment data defining the waveforms of the various speech segments that serve as the raw material of speech, such as single phonemes and phoneme-to-phoneme transitions. Specifically, the speech synthesis unit 112 combines the speech segment data for each sound (phoneme) in the voice sequence, corrects the joins so that they are continuous, and changes the pitch of the ending of the answer as described above to generate the voice signal.
The synthesized voice signal is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 142.

Next, the operation of the speech synthesizer 10 is described.
FIG. 4 is a flowchart showing the speech synthesis process in the speech synthesizer 10.
When the user performs a predetermined operation, for example selecting an icon corresponding to the dialogue process on a main menu screen (not shown), the CPU starts the application program corresponding to that process. By executing this application program, the CPU builds the functional blocks shown in FIG. 1.

First, the user inputs an utterance by voice to the voice input unit 102 (step Sa11). The utterance section detection unit 104 detects the utterance section by comparing the amplitude of the voice with a threshold, and supplies the voice signal of that utterance section to both the non-linguistic analysis unit 106 and the language analysis unit 108 (step Sa12).
The language analysis unit 108 performs language analysis on the utterance in the supplied voice signal and supplies data indicating its meaning to the answer creation unit 110 (step Sa13).
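By way of illustration only, the amplitude comparison of step Sa12 might be sketched in Python as follows. The function name, the list-of-samples representation, and the choice to treat everything from the first to the last supra-threshold sample as one utterance section are assumptions made for this sketch; the description states only that the amplitude is compared with a threshold.

```python
def detect_utterance_section(samples, threshold):
    """Return (start, end) sample indices of the utterance section, or None.

    Assumption: the section runs from the first to the last sample whose
    absolute amplitude exceeds the threshold; `end` is exclusive.
    """
    above = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not above:
        return None  # nothing louder than the threshold
    return above[0], above[-1] + 1


# Hypothetical input: quiet lead-in, speech with a brief dip, quiet tail.
signal = [0.0, 0.01, 0.5, -0.7, 0.02, 0.6, 0.0]
section = detect_utterance_section(signal, threshold=0.1)
```

The samples inside the returned range would then be handed to both analysis units.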

Meanwhile, the non-linguistic analysis unit 106 analyzes the voice signal in the detected utterance section and converts it into separate waveforms for volume and pitch (step Sa14). FIG. 5(a) is an example of a volume waveform, with the volume of the voice signal on the vertical axis and elapsed time on the horizontal axis; FIG. 5(b) is a pitch waveform, with the pitch of the first formant obtained by frequency analysis of the same voice signal on the vertical axis and elapsed time on the horizontal axis. The volume waveform in (a) and the pitch waveform in (b) share a common time axis.

In a dialogue in which the user wants an answer to an utterance, the volume is expected to rise temporarily at the portion corresponding to the ending of the utterance compared with the other portions. The non-linguistic analysis unit 106 therefore obtains the pitch of the first section (the ending), for example as follows, and outputs pitch data indicating that pitch (step Sa15).
First, the non-linguistic analysis unit 106 identifies the timing of the temporally last local maximum P1 in the volume waveform of FIG. 5(a).
Second, the non-linguistic analysis unit 106 treats a predetermined time range (for example, 100 μsec to 300 μsec) that includes the timing of the identified maximum P1, extending before and after it, as the ending Q1.
Third, the non-linguistic analysis unit 106 obtains the average pitch N1 of the section of the pitch waveform in (b) corresponding to the ending Q1, and outputs pitch data indicating the pitch N1.
Identifying the last local maximum P1 of the volume waveform in the utterance section as the timing corresponding to the ending of the utterance is expected to reduce false detection of the ending Q1 of a conversational utterance.
Here, the time range that includes the timing of the last local maximum P1 of the volume waveform in (a), extending before and after it, is treated as the ending Q1, but a time range that starts or ends at the timing of the maximum P1 may be treated as the ending Q1 instead. Also, instead of the average pitch of the section corresponding to the ending, the pitch at the start or end of the ending Q1, or at the timing of the maximum P1, may be output as the pitch data.
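The three steps of step Sa15 might be sketched as follows, purely as an illustration. The sketch assumes that the volume and pitch waveforms are equal-length lists sampled on the common time axis, and that the ending Q1 is expressed as a hypothetical `half_window` number of frames on each side of P1; neither the data layout nor the names come from the description.

```python
def ending_pitch(volume, pitch, half_window):
    """Return (p1_index, n1): the last volume peak and the mean pitch around it.

    Assumptions: `volume` and `pitch` share one time axis (same length), and
    the ending Q1 spans `half_window` frames on each side of the peak P1.
    """
    if len(volume) != len(pitch):
        raise ValueError("volume and pitch must share the same time axis")
    p1 = None
    # Scan backwards so that the first peak found is the temporally last one.
    for i in range(len(volume) - 2, 0, -1):
        if volume[i - 1] < volume[i] >= volume[i + 1]:
            p1 = i
            break
    if p1 is None:
        raise ValueError("no local maximum in the volume waveform")
    start = max(0, p1 - half_window)
    end = min(len(pitch), p1 + half_window + 1)
    section = pitch[start:end]           # pitch samples inside the ending Q1
    return p1, sum(section) / len(section)


# Hypothetical waveforms: two volume peaks, the later one at index 5.
volume = [0, 1, 3, 1, 2, 5, 2, 0]
pitch = [100, 100, 110, 120, 130, 140, 150, 160]
p1, n1 = ending_pitch(volume, pitch, half_window=1)
```

With these made-up values the last peak lies at index 5 and the average over the three frames around it is used as N1.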

In parallel with identifying the pitch of the ending of the utterance, the non-linguistic analysis unit 106 identifies how the pitch changes in the utterance, for example as follows (step Sa16).
First, the non-linguistic analysis unit 106 obtains the pitch N0 at a timing P0 that precedes the timing of the maximum P1 of the volume waveform in FIG. 5(a) by a time Ts (for example, 0.3 seconds).
Second, the non-linguistic analysis unit 106 obtains the pitch change (N1 - N0) from the pitch N0 to the pitch N1 and supplies it to the voice control unit 109 and the answer creation unit 110.
Alternatively, the timing P0 may be set to the timing corresponding to the beginning of the utterance, so that the pitch variation is taken over the span from the beginning to the ending of the utterance; or the pattern of pitch changes across the individual sounds of the utterance may be taken as the pitch variation in the utterance.
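As a minimal sketch of step Sa16, the pitch change (N1 - N0) could be computed as below. The frame rate parameter, the clamping of P0 to the start of the waveform, and all names are assumptions for illustration; only the look-back time Ts (0.3 seconds in the example) is taken from the description.

```python
def pitch_change(pitch, p1_index, n1, frames_per_second, ts_seconds=0.3):
    """Return N1 - N0, where N0 is the pitch Ts seconds before the peak P1.

    Assumptions: `pitch` is sampled at `frames_per_second`, and P0 is clamped
    to index 0 when the utterance is shorter than Ts.
    """
    frames_back = round(ts_seconds * frames_per_second)
    p0_index = max(0, p1_index - frames_back)
    n0 = pitch[p0_index]
    return n1 - n0


# Hypothetical pitch waveform sampled at 10 frames per second.
pitch = [100, 100, 110, 120, 130, 140, 150, 160]
delta = pitch_change(pitch, p1_index=5, n1=140.0, frames_per_second=10)
```

A positive result corresponds to a pitch that rises toward the ending, and a value near zero or negative to a flat or falling pitch.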

Meanwhile, the answer creation unit 110 creates an answer corresponding to the language analysis result of the utterance, taking the pitch change of the utterance into account, by using the answer database 124 or, as necessary, by obtaining information from an external server via the information acquisition unit 126 (step Sa17). The answer creation unit 110 then creates a speech sequence based on that answer and supplies it to the speech synthesis unit 112 (step Sa18).

For example, even if the language analysis result of the user's utterance is "asu wa hare" (あすははれ, literally "tomorrow, sunny"), if the pitch of the utterance rises toward the ending, the utterance is a question meaning "Will it be sunny tomorrow?". In this case, the answer creation unit 110 accesses the external server to obtain the weather information needed for the answer, and outputs the speech sequence "hai" (yes) if the obtained weather information indicates sunny weather, or the speech sequence "iie" (no) otherwise.
On the other hand, even if the language analysis result is "asu wa hare", if the pitch of the utterance is flat toward the ending, or falls, the utterance is a monologue (or a murmur) along the lines of "So it will be sunny tomorrow". In this case, the answer creation unit 110 reads a backchannel speech sequence such as "sou desu ne" ("that's right") from the answer database 124 and outputs it.
The answer creation unit 110 determines, for example, that the pitch of the utterance is rising toward the ending if the pitch change of the utterance exceeds a threshold, and that it is flat (or falling) toward the ending if the pitch change is equal to or below the threshold.
If the language analysis result of the user's utterance is "asu no tenki wa?" ("What is tomorrow's weather?"), the answer creation unit 110 outputs a speech sequence such as "hare desu" ("It will be sunny") or "kumori desu" ("It will be cloudy") according to the weather information obtained from the external server.
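The branching described for steps Sa17 and Sa18 might be sketched as follows. This is an illustration under stated assumptions: the romanized strings stand in for language analysis results and speech sequences, and the `is_sunny` flag is a hypothetical stand-in for the weather information obtained from the external server.

```python
def choose_answer(analysis_result, pitch_change, threshold, is_sunny):
    """Pick an answer from the analysis result and the utterance pitch change.

    Assumption: `is_sunny` replaces the external weather lookup, and answers
    are returned as romanized text rather than as speech sequences.
    """
    rising = pitch_change > threshold   # at or below the threshold: flat or falling
    if analysis_result == "asu wa hare":
        if rising:                      # treated as a question
            return "hai" if is_sunny else "iie"
        return "sou desu ne"            # treated as a monologue: backchannel
    if analysis_result == "asu no tenki wa?":
        return "hare desu" if is_sunny else "kumori desu"
    return None                         # other utterances are outside this sketch


question = choose_answer("asu wa hare", pitch_change=30, threshold=10, is_sunny=True)
monologue = choose_answer("asu wa hare", pitch_change=0, threshold=10, is_sunny=True)
```

The same analysis result thus leads to different answers depending only on the pitch change.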

The voice control unit 109 identifies, from the speech sequence supplied by the answer creation unit 110, the pitch of the ending in that speech sequence (the initial pitch) (step Sa19).
Next, the voice control unit 109 decides how to change the pitch of the speech sequence, based on the pitch data and the data indicating the pitch variation supplied from the non-linguistic analysis unit 106, as follows (step Sa20). Specifically, if the pitch of the user's utterance rises toward the ending, the voice control unit 109 decides to change the pitch of the entire speech sequence so that the initial pitch of the ending defined in the speech sequence comes to be a fifth below the pitch indicated by the pitch data. If, on the other hand, the pitch of the user's utterance is flat (or falling) toward the ending, the voice control unit 109 decides to change every pitch in the speech sequence to that pitch a fifth below.
The voice control unit 109 controls the speech synthesis by the speech synthesis unit 112 according to the decided content (step Sa21). As a result, the speech synthesis unit 112 synthesizes and outputs the speech of the speech sequence at the pitches decided by the voice control unit 109.
Although not specifically shown, once the speech of the answer has been output, the CPU terminates the application program and returns to the menu screen.
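The two cases of step Sa20 might be sketched as follows. The sketch assumes that a speech sequence is a list of MIDI-style note numbers, one per sound, and that "a fifth below" means a perfect fifth of seven semitones; both are assumptions made for illustration, as are the function name and the note values.

```python
PERFECT_FIFTH = 7  # semitones; assumes "a fifth below" means a perfect fifth


def adjust_sequence(sequence, utterance_ending_note, rising):
    """Return the note numbers of the answer after the pitch change.

    Assumption: `sequence` holds one MIDI-style note number per sound, and
    its last element is the ending of the answer.
    """
    target = utterance_ending_note - PERFECT_FIFTH
    if rising:
        # Shift the whole sequence so that its ending lands on the target.
        shift = target - sequence[-1]
        return [note + shift for note in sequence]
    # Flat or falling utterance: every sound takes the target pitch.
    return [target] * len(sequence)


# Hypothetical values: utterance ending on G4 (67); the answer notes are made up.
shifted = adjust_sequence([64, 67], utterance_ending_note=67, rising=True)
flattened = adjust_sequence([60, 62, 64, 62, 60], utterance_ending_note=67, rising=False)
```

In the rising case the melodic shape of the answer is preserved and only transposed, whereas in the flat case the shape itself is replaced by a single pitch.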

Next, the pitch and pitch change of the utterance and the resulting change to the speech sequence will be described using a specific example.

The left column of FIG. 6(b) is an example of an utterance by the user. In this figure, the language analysis result of the utterance is "asu wa hare", and the pitch of each sound of the utterance is indicated by the notes shown in that column. The pitch waveform of an utterance is actually a waveform like the one shown in FIG. 5(b), but here the pitches are written as notes for convenience of explanation. In this example, the pitch of the utterance rises toward the ending, so the answer creation unit 110 determines that the utterance is a question. As described above, the answer creation unit 110 therefore outputs, for example, the speech sequence "hai" (yes) if the weather information obtained in response to the utterance indicates sunny weather, and the speech sequence "iie" (no) otherwise.
FIG. 6(a) is an example of the speech sequence for "hai". In this example a note is assigned to each sound, defining the pitch and sounding timing of each sound (phoneme) of the basic speech. For simplicity, one note is assigned to each sound (phoneme) here, but multiple notes may be assigned to one sound, as with a slur or a tie.
This speech sequence is changed by the voice control unit 109 as follows.
If the pitch data indicates that the pitch of the section of the ending "re", marked A in the utterance shown in the left column of (b), is "sol", the voice control unit 109 changes the pitch of the entire speech sequence so that the pitch of the section of the ending "i", marked B in the answer "hai", becomes "do", the pitch a fifth below "sol" (see the right column of FIG. 6(b)).
Although "hai" has been used as the example here, the pitch of the entire speech sequence is changed in the same way for "iie" (not shown). Likewise, when a specific answer such as "hare desu" is given to the utterance "asu no tenki wa?", the pitch of the entire speech sequence is changed in the same way.

Because the answer is synthesized so that the pitch of its ending is a fifth below the pitch of the ending of the utterance, the user is given a favorable impression, as if holding a conversation, without any sense of unnaturalness.

On the other hand, even if the language analysis result is "asu wa hare", when the pitch of the utterance is flat toward the ending, as shown in the left column of FIG. 7(b), the answer creation unit 110 determines that the utterance is a monologue or the like. As described above, the answer creation unit 110 therefore outputs, for example, the speech sequence "sou desu ne".
FIG. 7(a) is an example of the speech sequence for "sou desu ne".
This speech sequence is changed by the voice control unit 109 as follows.
If the pitch data indicates that the pitch of the section of the ending "re", marked A in the utterance shown in the left column of FIG. 7(b), is "sol", the voice control unit 109 changes the pitches of the speech sequence so that every pitch of the answer "sou desu ne", including the ending "ne" marked B, becomes "do", the pitch a fifth below "sol" (see the right column of FIG. 7(b)).

In this case too, the answer is synthesized so that the pitch of the ending of the backchannel answer is a fifth below the pitch of the ending of the utterance, so the user is given a favorable impression, as if holding a conversation, without any sense of unnaturalness.
Moreover, in this embodiment, even when the language analysis result of the utterance is the same, the answer is created according to the pitch change toward the ending of the utterance. Furthermore, if the pitch of the utterance is flat, the pitch of the backchannel response to it is also flattened; that is, the pitch variation defined in the original speech sequence is itself changed. This gives the user the impression of conversing with a person rather than a machine.

<Application examples and modifications>
The present invention is not limited to the embodiment described above, and various applications and modifications such as those described below are possible. One or more of the applications and modifications described below may also be selected and combined as appropriate.

<Voice input unit>
In the embodiment, the voice input unit 102 is configured to pick up the user's voice (utterance) with a microphone and convert it into a voice signal, but the voice input unit recited in the claims is not limited to this configuration. That is, the voice input unit recited in the claims need only be configured to input, or receive as input, an utterance in the form of a voice signal in some manner. Specifically, the voice input unit recited in the claims is a concept that includes a configuration that inputs a voice signal processed by another processing unit or a voice signal supplied (or transferred) from another device, and also an input interface circuit or the like that is built into an LSI and simply receives a voice signal and passes it to a subsequent stage.

<Speech waveform data>
In the embodiment, the answer creation unit 110 outputs, as the answer to an utterance, a speech sequence in which a pitch is assigned to each sound, but it may instead output the answer as speech waveform data, for example in wav format.
Unlike the speech sequence described above, speech waveform data does not have a pitch assigned to each sound. The voice control unit 109 may therefore, for example, identify the pitch of the ending as it would be if the data were simply played back, apply pitch conversion such as filtering so that the identified pitch has the predetermined relationship with the pitch indicated by the pitch data, and then output (play back) the speech waveform data.
The pitch conversion may also be performed by the so-called key control technique, well known in karaoke equipment, which shifts the pitch without changing the speaking rate.
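How far the waveform would need to be shifted can be estimated with a short calculation. The following sketch assumes that pitches are given as fundamental frequencies in hertz, that intervals follow twelve-tone equal temperament, and that the predetermined relationship is a perfect fifth (seven semitones) below; it computes only the required shift and leaves the pitch conversion itself to whatever key-control or filtering technique is used.

```python
import math


def required_shift_semitones(answer_ending_hz, utterance_ending_hz,
                             interval_semitones=-7):
    """Return the shift (in semitones) to apply to the answer waveform.

    Assumptions: equal temperament, and a target that lies
    `interval_semitones` away from the utterance ending (-7 = fifth below).
    """
    target_hz = utterance_ending_hz * 2 ** (interval_semitones / 12)
    return 12 * math.log2(target_hz / answer_ending_hz)


# Hypothetical values: answer ending at 220 Hz, utterance ending at 440 Hz.
shift = required_shift_semitones(answer_ending_hz=220.0, utterance_ending_hz=440.0)
```

With these made-up frequencies the answer would have to be raised by about five semitones to sit a fifth below the utterance ending.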

<Endings and beginnings of answers, etc.>
In the embodiment, the pitch of the ending of the answer is controlled according to the pitch of the ending of the utterance, but depending on the language, dialect, phrasing, and so on, a part of the answer other than the ending, for example the beginning, may be the characteristic part. In such a case, when the speaker receives an answer to the utterance, the speaker unconsciously compares the pitch of the utterance with the pitch of the characteristic beginning of the answer and judges the impression of the answer accordingly. In this case, therefore, the pitch of the beginning of the answer may be controlled according to the pitch of the ending of the utterance. With this configuration, when the beginning of the answer is the characteristic part, it is possible to give a psychological impression to the user who receives the answer.

The same applies to the utterance: the judgment is not necessarily based on the ending and may be based on the beginning. Moreover, for both the utterance and the answer, the judgment may be based not on the beginning or ending but on the average pitch, or on the pitch of the most strongly pronounced part. The first section of the utterance and the second section of the answer are therefore not necessarily limited to the beginning or the ending.

<Pitch variation of the answer, average pitch of the answer, etc.>
In the embodiment, the pitch of the entire speech sequence is shifted, or the pitches of the speech sequence are flattened, so that the pitch of the ending or the like of the answer is, for example, a fifth below the pitch of the ending or the like of the utterance; as a result, the pitch variation of the answer defined in the original speech sequence and the average pitch of the answer are changed.
The invention is not limited to this configuration. For example, the pitch variation of the original speech sequence may be changed so that the pitch of the answer falls toward the ending if the pitch of the utterance rises toward the ending, and rises toward the ending if the pitch of the utterance falls toward the ending.
The pitch of all or part of the original speech sequence may also be changed so that the overall average pitch of the answer varies according to the pitch of the ending or the like of the utterance, or according to the pitch change of the utterance.

<Volume and volume change of the utterance; volume and volume change of the answer>
In the embodiment, the volume change of the utterance is used to identify the ending of the utterance, but the volume of the utterance has various other possible uses as non-linguistic information besides pitch. For example, the volume of the synthesized answer may be controlled according to the average volume of the utterance. The volume change of the answer may also be controlled to follow the volume change (amplitude envelope) of the utterance.

<Content of the conversation>
In the embodiment, the operation ends when the speech synthesizer 10 has output the answer to the user's utterance by speech synthesis. In a dialogue between people, however, the exchange often does not end with one utterance and one answer; utterances and answers are repeated, and the number of repetitions grows or shrinks depending on the meaning of the utterances and answers.
Therefore, as shown in FIG. 8, the language analysis unit 108 may perform language analysis not only on the user's utterance but also on the answer created by the answer creation unit 110 and supply the result to the voice control unit 109, and the voice control unit 109 may control the pitch of the ending or the like of the answer, the pitch variation of the answer, the average pitch of the answer, and so on.

<Interval relationship>
In the embodiment described above, speech synthesis is controlled so that the pitch of the ending or the like of the answer is a fifth below that of the ending or the like of the utterance, but it may be controlled to a consonant interval other than a fifth below. For example, as mentioned above, the interval may be a perfect octave, a perfect fifth, a perfect fourth, a major or minor third, or a major or minor sixth.
Even outside the consonant intervals, there may be interval relationships that are empirically found to give a good (or bad) impression, so the pitch of the answer may be controlled to such an interval. Even in this case, however, if the interval between the pitch of the ending or the like of the utterance and the pitch of the ending or the like of the answer is too wide, the answer tends to sound unnatural, so it is desirable that the pitch of the answer lie within one octave above or below the pitch of the utterance.
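For reference, the consonant intervals listed above can be expressed as semitone distances and checked together with the one-octave limit, as in the following sketch. The semitone sizes are the standard ones for these intervals; the use of MIDI-style note numbers and the function name are assumptions made for illustration.

```python
# Standard semitone sizes of the consonant intervals named in the text.
CONSONANT_INTERVALS = {
    "minor third": 3,
    "major third": 4,
    "perfect fourth": 5,
    "perfect fifth": 7,
    "minor sixth": 8,
    "major sixth": 9,
    "perfect octave": 12,
}


def is_consonant_within_octave(utterance_note, answer_note):
    """True if the two notes form a listed interval no wider than one octave.

    Assumption: pitches are MIDI-style note numbers (one unit per semitone).
    """
    distance = abs(answer_note - utterance_note)
    return distance <= 12 and distance in CONSONANT_INTERVALS.values()


fifth_below = is_consonant_within_octave(67, 60)   # seven semitones apart
tritone = is_consonant_within_octave(67, 61)       # six semitones apart
```

A candidate answer pitch that fails this check would be either a dissonant interval or more than an octave away from the utterance ending.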

<Pitch shift of the answer>
In a configuration that controls the pitch of the ending or the like of the answer defined by the speech sequence so that it has a predetermined relationship with the pitch of the ending or the like of the utterance, specifically one that changes it to, for example, a fifth below as in the embodiment, the answer may be synthesized at an unnaturally low pitch if the target pitch is too low. Application examples (Part 1 and Part 2) for avoiding this are described next.

FIG. 9 is a diagram showing the main part of the processing in Application Example (Part 1). The main part of the processing here means the processing executed in "determine the pitch of the answer" in step Sa20 of FIG. 4. That is, in Application Example (Part 1), the processing shown in FIG. 9 is executed in step Sa20 of FIG. 4, as detailed below.
First, the voice control unit 109 provisionally sets the pitch of the ending or the like of the answer to the pitch a fifth below the pitch indicated by the pitch data from the non-linguistic analysis unit 106 (step Sb171).
Next, the voice control unit 109 determines whether the provisionally determined pitch is lower than a predetermined threshold pitch (step Sb172). The threshold pitch is set, for example, to the pitch corresponding to the lower frequency limit for speech synthesis, or to a pitch below which the result would sound unnatural.

If the provisionally determined pitch, that is, the pitch a fifth below the pitch of the ending of the utterance, is lower than the threshold pitch (if the determination in step Sb172 is "Yes"), the voice control unit 109 shifts the provisionally determined pitch up by one octave (step Sb173).
If, on the other hand, the provisionally determined pitch is equal to or higher than the threshold pitch (if the determination in step Sb172 is "No"), step Sb173 is skipped.
The voice control unit 109 then finalizes the target pitch of the ending used when shifting the answer, as follows (step Sb174). If the provisionally determined pitch is lower than the threshold pitch, the target is the provisionally determined pitch raised by one octave; if the provisionally determined pitch is equal to or higher than the threshold pitch, the target is the provisionally determined pitch as it is.
After step Sb174, the procedure moves to step Sa21 of FIG. 4: the voice control unit 109 decides, as the control content, to shift the pitch of the ending of the answer to the finalized pitch, and the speech synthesis unit 112 synthesizes and outputs the speech of the speech sequence according to that control content.
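Steps Sb171 to Sb174 might be sketched as follows. Pitches are again assumed to be MIDI-style note numbers, with a perfect fifth taken as seven semitones and an octave as twelve; these representations, the function name, and the example threshold are assumptions made for illustration.

```python
FIFTH = 7     # semitones; assumes a perfect fifth
OCTAVE = 12   # semitones


def decide_target_pitch(utterance_ending_note, threshold_note):
    """Return the final target pitch for the ending of the answer.

    Assumption: pitches are MIDI-style note numbers.
    """
    provisional = utterance_ending_note - FIFTH   # Sb171: provisional pitch
    if provisional < threshold_note:              # Sb172: too low?
        provisional += OCTAVE                     # Sb173: raise one octave
    return provisional                            # Sb174: finalized target


# Hypothetical threshold of note 48: a low utterance triggers the octave shift.
low_voice_target = decide_target_pitch(utterance_ending_note=50, threshold_note=48)
normal_target = decide_target_pitch(utterance_ending_note=67, threshold_note=48)
```

Raising the pitch by a whole octave keeps the target in the same pitch class, so its relationship to the utterance ending remains consonant.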

According to Application Example (Part 1), if the target pitch is lower than the threshold pitch, it is shifted to the pitch one octave higher, which avoids the answer being synthesized at an unnaturally low pitch.
In this example the pitch of the ending or the like of the answer is shifted up by one octave, but it may instead be shifted down by one octave. Specifically, when the pitch of the ending or the like of the user's utterance is high, the pitch a fifth below it may still be too high, and the answer would be synthesized at an unnaturally high pitch. To avoid this, if the pitch a fifth below the pitch indicated by the pitch data (the provisionally determined pitch) is higher than a threshold pitch, the pitch of the ending or the like of the answer may be shifted to one octave below the provisionally determined pitch.

In speech synthesis, it is sometimes possible to output the voice of a virtual character whose gender, age group (child or adult), and so on are specified. When a female or child character is specified, uniformly lowering the pitch to a fifth below the ending of the utterance would cause the answer to be synthesized at a low pitch unsuited to the character, so in this case too the pitch may be shifted up by one octave.

FIG. 10 is a diagram showing the main part of the processing in this Application Example (Part 2), and shows the processing executed in "determine the pitch of the answer" in step Sa20 of FIG. 4. Focusing on the differences from FIG. 9: after provisionally determining, in step Sb171, the pitch a fifth below the pitch indicated by the pitch data from the non-linguistic analysis unit 106, the voice control unit 109 determines whether female or child is specified as an attribute defining the character (step Sc172).

If female or child is designated as the attribute (if the determination result in step Sc172 is “Yes”), the voice control unit 109 shifts the provisionally determined pitch to a pitch one octave higher (step Sb173). If female or child is not designated as the attribute, for example if male or adult is designated (if the determination result in step Sc172 is “No”), the processing of step Sb173 is skipped. The subsequent processing is the same as in application example (part 1).
According to this application example (part 2), if the answer is set to be given in the voice of a woman or a child, the pitch is shifted to one octave above the provisionally determined pitch. This avoids the problem of the answer being synthesized in an unnaturally low voice while maintaining the predetermined interval relationship.
In this example, the pitch is shifted one octave higher if female or child is designated as the attribute. However, if, for example, adult male is designated as the attribute, the pitch may be shifted one octave lower in order to avoid the answer being synthesized in a high voice unsuited to the character corresponding to that attribute.
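The branch on the character attribute in steps Sb171, Sc172, and Sb173 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: pitches are taken to be frequencies in Hz, a fifth is taken to be seven equal-tempered semitones, and the names `CharacterAttribute`, `answer_pitch_for_character`, and `lower_for_adult_male` are hypothetical.

```python
from enum import Enum


class CharacterAttribute(Enum):
    """Hypothetical attribute values for the virtual character."""
    FEMALE = "female"
    CHILD = "child"
    ADULT_MALE = "adult_male"


def fifth_below(pitch_hz: float) -> float:
    """Assumption: a fifth below is 7 equal-tempered semitones down."""
    return pitch_hz * 2.0 ** (-7 / 12.0)


def answer_pitch_for_character(
    utterance_ending_hz: float,
    attribute: CharacterAttribute,
    lower_for_adult_male: bool = False,
) -> float:
    provisional = fifth_below(utterance_ending_hz)  # step Sb171
    if attribute in (CharacterAttribute.FEMALE, CharacterAttribute.CHILD):  # step Sc172
        return provisional * 2.0  # step Sb173: one octave higher
    if lower_for_adult_male and attribute is CharacterAttribute.ADULT_MALE:
        # Optional variant mentioned in the text: one octave lower.
        return provisional / 2.0
    return provisional  # step Sb173 skipped
```

Because an octave shift doubles or halves the frequency, the answer still ends on the same note name in every branch, which is how the predetermined interval relationship is kept.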

<Dissonant intervals>
In the embodiment described above, speech synthesis is controlled so that the pitch of the ending or the like of the answer has a consonant interval relationship with the ending or the like of the utterance, but speech synthesis may instead be controlled so that the relationship is a dissonant interval. There is a concern that synthesizing the answer at a pitch in a dissonant interval relationship gives the user who made the utterance an unnatural feeling, a bad impression, a hostile feeling, or the like, so that a smooth dialogue is not established. On the other hand, there is also a view that such a feeling is, conversely, good for stress relief.
Therefore, a mode in which an answer giving a good impression or the like is desired (first mode) and a mode in which an answer giving a bad impression or the like is desired (second mode) may be prepared as operation modes, and speech synthesis may be controlled according to whichever mode is set.

FIG. 11 shows the main part of the processing in this application example (part 3), namely the processing executed in “answer pitch determination” in step Sa20 of FIG. 4. Focusing on the differences from FIG. 9, the voice control unit 109 determines whether the first mode is set as the operation mode (step Sd172).

If the first mode is set as the operation mode (if the determination result in step Sd172 is “Yes”), the voice control unit 109 determines the pitch of, for example, the ending of the answer so that it has a consonant interval relationship with the pitch of, for example, the ending of the utterance (step Sd173A). If the second mode is set as the operation mode (if the determination result in step Sd172 is “No”), the voice control unit 109 determines the pitch of the ending of the answer so that it has a dissonant interval relationship with the pitch of the ending of the utterance (step Sd173B). The subsequent processing is the same as in application example (part 1) and application example (part 2).
Therefore, according to this application example (part 3), if the first mode is set, the answer is synthesized so that its ending has a consonant interval relationship with the pitch of the ending of the utterance, whereas if the second mode is set, the answer is synthesized so that its ending has a dissonant interval relationship with the pitch of the ending of the utterance. The user can thus select the operation mode as appropriate.
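The mode-dependent pitch determination of steps Sd172, Sd173A, and Sd173B can be sketched as follows. The text does not say which dissonant interval is used; the tritone (six semitones) below is chosen here purely as an assumption for illustration, as is the use of a fifth (seven semitones) below for the consonant case, the representation of pitch in Hz, and all names in the code.

```python
from enum import Enum


class OperationMode(Enum):
    FIRST = 1   # an answer giving a good impression is desired
    SECOND = 2  # an answer giving a bad impression is desired


# Assumed interval offsets in equal-tempered semitones (negative = below).
CONSONANT_SEMITONES = -7  # perfect fifth below
DISSONANT_SEMITONES = -6  # tritone below; the text names no specific interval


def decide_answer_pitch(utterance_ending_hz: float, mode: OperationMode) -> float:
    """Return the pitch (Hz) for the ending of the answer."""
    if mode is OperationMode.FIRST:       # step Sd172: "Yes"
        offset = CONSONANT_SEMITONES      # step Sd173A
    else:                                 # step Sd172: "No"
        offset = DISSONANT_SEMITONES      # step Sd173B
    return utterance_ending_hz * 2.0 ** (offset / 12.0)
```

With an utterance ending at 440 Hz, the first mode would give roughly 293.66 Hz and the second mode roughly 311.13 Hz under these assumptions.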

Application example (part 1), application example (part 2), and application example (part 3) have been described using voice sequences as an example, but voice waveform data may of course be used instead.

<Others>
In the embodiment, the language analysis unit 108, the language database 122, and the answer database 124, which constitute the configuration for acquiring an answer to an utterance, are provided on the speech synthesizer 10 side. However, in a terminal device or the like, considering that the processing load becomes heavy and that storage capacity is limited, they may instead be provided on an external server side. That is, it suffices that the answer creation unit 110 in the speech synthesizer 10 acquires an answer to the utterance in some form and outputs data defining the voice of that answer; it does not matter whether the answer is created on the speech synthesizer 10 side or by some other configuration (for example, an external server).
Note that the information acquisition unit 126 is unnecessary in applications where the speech synthesizer 10 can create an answer to an utterance without accessing an external server or the like.

DESCRIPTION OF SYMBOLS: 102 ... voice input unit, 104 ... utterance section detection unit, 106 ... non-linguistic analysis unit, 108 ... language analysis unit, 109 ... voice control unit, 110 ... answer creation unit, 112 ... speech synthesis unit, 126 ... information acquisition unit.

Claims

A speech synthesizer comprising:
a voice input unit to which an utterance in the form of a voice signal is input;
a non-linguistic analysis unit that analyzes the pitch of a specific first section of the utterance and how the pitch of the utterance changes;
an acquisition unit that acquires an answer to the utterance;
a speech synthesis unit that synthesizes speech of the acquired answer; and
a voice control unit that causes the speech synthesis unit to change the pitch of a specific second section of the answer so that it has a predetermined relationship with the pitch of the first section, and to change how the pitch of the answer changes in accordance with how the pitch of the utterance changes.

The speech synthesizer according to claim 1, wherein
the non-linguistic analysis unit also analyzes changes in non-linguistic information of the utterance other than pitch, and
the voice control unit controls the speech synthesis in accordance with the changes in the non-linguistic information other than pitch.

A program causing a computer to function as:
a non-linguistic analysis unit that analyzes the pitch of a specific first section of an utterance given by an input voice signal and how the pitch of the utterance changes;
an acquisition unit that acquires an answer to the utterance;
a speech synthesis unit that synthesizes speech of the acquired answer; and
a voice control unit that causes the speech synthesis unit to change the pitch of a specific second section of the answer so that it has a predetermined relationship with the pitch of the first section, and to change how the pitch of the answer changes in accordance with how the pitch of the utterance changes.