JP2015087740A

JP2015087740A - Speech synthesis device, and program

Info

Publication number: JP2015087740A
Application number: JP2014048636A
Authority: JP
Inventors: Hiroaki Matsubara; Junya Ura; Takehiko Kawahara; Yuji Hisaminato; Katsuji Yoshimura
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-05-31
Filing date: 2014-03-12
Publication date: 2015-05-07
Anticipated expiration: 2034-03-12
Also published as: JP2018128690A; JP2016136284A; JP6566076B2; JP6323491B2; JP5954348B2

Abstract

PROBLEM TO BE SOLVED: To synthesize speech that gives a psychological impression to a user who has asked a question.
SOLUTION: A speech synthesis device according to the present invention comprises: a speech input unit 102 for inputting a question given as a voice signal; an answer creation unit 110 for outputting a voice sequence for an answer to the question; a pitch analysis unit 106 for analyzing a pitch in the question, for example, the pitch at the end of the phrase; and a speech synthesis unit 112 for synthesizing, as speech, the answer indicated by the voice sequence. The speech synthesis unit changes the pitch at, for example, the end of the phrase of the answer so that it has a prescribed relation to the pitch at the end of the phrase of the question, for example, a pitch a fifth below, and outputs the synthesized speech.

Description

The present invention relates to a speech synthesizer and a program.

In recent years, the following speech synthesis techniques have been proposed: a technique that sounds more human by synthesizing and outputting speech matched to the user's speaking tone and voice quality (see, for example, Patent Document 1), and a technique that analyzes the user's speech to diagnose the user's psychological state, health condition, and the like (see, for example, Patent Document 2).
A voice dialogue system has also been proposed that recognizes speech input by a user and outputs content specified in a scenario by speech synthesis, thereby realizing a voice dialogue with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A
Patent Document 2: Japanese Patent No. 4495907
Patent Document 3: Japanese Patent No. 4832097

Consider a dialogue system that combines the speech synthesis technology and the voice dialogue system described above, searching for data in response to a user's spoken question and outputting the result by speech synthesis. It has been pointed out that the speech output by such a system can feel unnatural to the user, specifically, that it can sound very much like a machine talking.
The present invention has been made in view of these circumstances. One of its objects is to provide a speech synthesizer and a program that give the user a natural feeling, specifically, that can give the user a good impression, a bad impression, or the like.

In studying a man-machine system that outputs (replies with) an answer to a user's question by speech synthesis, the present inventors first considered what kind of dialogue takes place between people, focusing on non-linguistic (non-verbal) information rather than linguistic information, and in particular on the pitch (frequency) that characterizes a dialogue.

Here, as a dialogue between people, consider the case where one person (a) asks a question and the other person (b) replies. When a asks the question, not only a but also b, who is about to answer it, often retains a strong impression of the pitch in a certain section of the question. When b answers with agreement, approval, affirmation, or the like, b speaks so that the pitch of the part that characterizes the answer, for example its ending or beginning, has a predetermined relationship, specifically a consonant-interval relationship, to the pitch of the question that remains in the impression. Because the pitch that a remembers from a's own question and the pitch of the part that characterizes the answer are in this relationship, a, on hearing the answer, forms a pleasant, reassuring, favorable impression of b's answer. This is what the present inventors hypothesized.
People have communicated with one another since ancient times, before language existed, and the inventors surmise that pitch, together with volume, played a very important role in communication in such an environment. In the modern era, with its developed languages, this pitch-based communication has been forgotten, but the inventors also surmise that the ancient "predetermined pitch relationship" is engraved in DNA and handed down, which is why it feels "somehow pleasant".

To describe a dialogue between people with a specific example: when a asks "sou desho?" ("That's right, isn't it?"), both a and b retain in memory the pitch of "sho", the ending of the question, where the sense of pressing for confirmation appears strongly. In this state, when b intends to answer the question affirmatively with "a, hai" ("Ah, yes"), b says "a, hai" so that the pitch of the part that characterizes the answer, for example the ending "i", is in the above relationship to the pitch of "sho" that remains in the impression.

FIG. 2 shows the formants in such an actual dialogue. In this figure, the horizontal axis is time and the vertical axis is frequency, and the spectrum is drawn so that whiter areas indicate higher intensity.
As shown in the figure, the spectrum obtained by frequency analysis of human speech appears as a plurality of peaks that move over time, that is, as formants. Specifically, the formants corresponding to the question "sou desho?" ("That's right, isn't it?") and the formants corresponding to the answer "a, hai" ("Ah, yes") each appear as three peak bands (white band-like portions moving along the time axis).
Focusing on the first formant, the lowest in frequency of these three peak bands, the frequency at reference sign A (its central part), which corresponds to "sho" in "sou desho?", is approximately 400 Hz. The frequency at reference sign B, which corresponds to "i" in "a, hai", is approximately 260 Hz. The frequency at A is therefore approximately 3/2 of the frequency at B.
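The 3/2 observation can be checked numerically. The following is a minimal sketch and not part of the patent: the helper name `interval_cents` is hypothetical, and the two frequencies are the approximate values read from FIG. 2. It compares the measured ratio with a just perfect fifth (3/2), expressed in cents (1200 cents per octave).

```python
import math


def interval_cents(f_high: float, f_low: float) -> float:
    """Size of the interval between two frequencies, in cents."""
    return 1200.0 * math.log2(f_high / f_low)


# Approximate first-formant frequencies quoted for FIG. 2.
FREQ_A_HZ = 400.0  # ending of the question
FREQ_B_HZ = 260.0  # ending of the answer

ratio = FREQ_A_HZ / FREQ_B_HZ
measured = interval_cents(FREQ_A_HZ, FREQ_B_HZ)
perfect_fifth = interval_cents(3.0, 2.0)

print(f"ratio A/B      = {ratio:.3f}")
print(f"measured       = {measured:.0f} cents")
print(f"perfect fifth  = {perfect_fifth:.0f} cents")
```

The measured ratio is close to, but not exactly, 3/2, which matches the text's wording that the relationship holds only approximately.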

A frequency ratio of 3/2 corresponds, in terms of musical intervals, to "do" (C) in the same octave relative to "so" (G), or to "la" (A) one octave below relative to "mi" (E); as described later, this is the relationship of a perfect fifth. This frequency ratio (the predetermined relationship between pitches) is one preferred example, and various other examples are given later.

FIG. 3 shows the relationship between note names (syllable names) and the frequencies of the human voice. This example also shows the frequency ratios relative to "do" in the fourth octave; relative to "do", "so" is 3/2 as noted above. The frequency ratios relative to "la" in the third octave are shown alongside.

Thus, in a dialogue between people, the pitch of the question and the pitch of the answer given in reply are not unrelated but can be considered to have the relationship described above. The present inventors analyzed many dialogue examples and statistically aggregated evaluations by many people, and confirmed that this idea is broadly correct. Based on these considerations and this supporting evidence, the following configuration was adopted, in a dialogue system that outputs (replies with) an answer to a user's question by speech synthesis, in order to achieve the above object.

That is, in order to achieve the above object, a speech synthesizer according to one aspect of the present invention comprises: a speech input unit that inputs a question given as a speech signal; a pitch analysis unit that analyzes the pitch of a specific first section of the question; an acquisition unit that acquires an answer to the question; and a speech synthesis unit that changes the pitch of a specific second section of the acquired answer so that it has a predetermined relationship to the pitch of the first section, and outputs the result.
According to this aspect, the speech-synthesized answer to a question input as a speech signal can be kept from sounding unnatural. The answer is not limited to a specific answer to the question; it also includes back-channel responses (interjections) such as "ee" ("yes"), "naruhodo" ("I see"), and "sou desu ne" ("that's right"). Besides human voices, answers also include animal calls such as "wan" (bowwow) and "nyaa" (meow). In other words, "answer" and "speech" here are concepts that include not only voices uttered by people but also animal calls.

In the above aspect, the first section is preferably the ending of the question, and the second section is preferably the beginning or the ending of the answer. This is because, as described above, the section that characterizes the impression of a question is often its ending, and the section that characterizes the impression of an answer is often its beginning or ending.

The predetermined relationship is preferably a consonant-interval relationship other than a perfect unison. Here, consonance refers to a relationship in which multiple musical tones sounded at the same time blend and harmonize well with one another, and intervals in this relationship are called consonant intervals. The simpler the frequency ratio between the two tones, the higher the degree of consonance. The simplest frequency ratios, 1/1 (perfect unison) and 2/1 (perfect octave), are called absolute consonances; these together with 3/2 (perfect fifth) and 4/3 (perfect fourth) are called perfect consonances. The ratios 5/4 (major third), 6/5 (minor third), 5/3 (major sixth), and 8/5 (minor sixth) are called imperfect consonances, and all other frequency-ratio relationships (major and minor seconds and sevenths, the various augmented and diminished intervals, and so on) are called dissonant intervals.
If the pitch of the beginning or ending of the answer were the same as the pitch of the ending of the question, the dialogue would be expected to feel unnatural, so the perfect unison is excluded from the relationship between the pitch of the question and the pitch of the answer.
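The classification of intervals by frequency ratio can be written down as a small lookup. This is an illustrative sketch only: the names `classify_interval` and `allowed_for_answer` are hypothetical, and matching exact fractions is an idealization, since the text itself allows a tolerance in real pitch analysis.

```python
from fractions import Fraction

# Ratios listed in the text (upper tone over lower tone).
PERFECT_CONSONANCES = {
    Fraction(1, 1): "perfect unison",
    Fraction(2, 1): "perfect octave",
    Fraction(3, 2): "perfect fifth",
    Fraction(4, 3): "perfect fourth",
}
IMPERFECT_CONSONANCES = {
    Fraction(5, 4): "major third",
    Fraction(6, 5): "minor third",
    Fraction(5, 3): "major sixth",
    Fraction(8, 5): "minor sixth",
}


def classify_interval(ratio: Fraction) -> str:
    """Classify an exact frequency ratio as in the text."""
    if ratio in PERFECT_CONSONANCES:
        return "perfect consonance"
    if ratio in IMPERFECT_CONSONANCES:
        return "imperfect consonance"
    return "dissonance"


def allowed_for_answer(ratio: Fraction) -> bool:
    """Consonant intervals are allowed, except the perfect unison."""
    return classify_interval(ratio) != "dissonance" and ratio != Fraction(1, 1)


print(classify_interval(Fraction(3, 2)))   # perfect fifth
print(classify_interval(Fraction(9, 8)))   # a major second
print(allowed_for_answer(Fraction(1, 1)))  # unison is excluded
```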

In the above aspect, the predetermined relationship between the pitch of the question and the pitch of the answer need not be limited to a consonant interval other than a perfect unison; it may instead be a pitch relationship within the following range. That is, the speech synthesis unit may change the pitch of the second section so that it lies within one octave above or below the pitch of the first section, excluding the same pitch, and output the result. This is based on the finding that when the pitch of the answer is an octave or more away from the pitch of the question, not only does the consonant-interval relationship fail to hold, but the dialogue also becomes unnatural. In this configuration too, the case where the pitch of the answer equals the pitch of the question is excluded from the pitch relationships within one octave above or below, because, as described above, it makes the dialogue unnatural.

In the above aspect, the speech synthesis unit preferably changes the pitch of the second section so that it is in a consonant-interval relationship a fifth below the pitch of the first section, and outputs the result. With this configuration, the user who asked the question can be given a good impression of the answer returned to it.

In the above aspect, when the speech synthesis unit is to change the pitch of the second section to a pitch having the predetermined relationship to the pitch of the first section, it may shift the target pitch one octave up if the target pitch is lower than a predetermined threshold pitch, or shift it one octave down if the target pitch is higher than a predetermined threshold pitch. With this configuration, when the pitch of the second section of the answer is to be changed and that pitch is lower (higher) than the threshold pitch, it is shifted one octave up (down), which avoids, for example, the answer being synthesized at an unnaturally low (high) pitch.
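The threshold rule can be sketched as follows. This is a hedged illustration, not the patented implementation: the function name `target_pitch` is hypothetical, the relation defaults to a fifth below (frequency multiplied by 2/3), and the two threshold values are placeholders chosen only for the example, since the text does not give concrete thresholds.

```python
FIFTH_BELOW = 2.0 / 3.0  # a pitch a fifth below has 2/3 of the frequency


def target_pitch(
    question_hz: float,
    relation: float = FIFTH_BELOW,
    low_threshold_hz: float = 100.0,   # hypothetical lower threshold
    high_threshold_hz: float = 500.0,  # hypothetical upper threshold
) -> float:
    """Apply the pitch relation, then shift by one octave if out of range."""
    pitch = question_hz * relation
    if pitch < low_threshold_hz:
        pitch *= 2.0   # too low: one octave up
    elif pitch > high_threshold_hz:
        pitch /= 2.0   # too high: one octave down
    return pitch


print(target_pitch(400.0))  # within range, no octave shift
print(target_pitch(120.0))  # below the lower threshold, shifted up
print(target_pitch(900.0))  # above the upper threshold, shifted down
```

An octave shift keeps the interval class intact, so the shifted pitch remains consonant with the question pitch.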

In the above aspect, when the speech synthesis unit is to change the pitch of the second section to a pitch having the predetermined relationship to the pitch of the first section, and a predetermined attribute has been specified, it may further shift the pitch having the predetermined relationship one octave up or down. The attribute is, for example, an attribute of the voice to be synthesized: if synthesis with a woman's or child's (adult man's) voice has been specified, shifting the target pitch one octave above (below) the pitch having the predetermined relationship avoids synthesis at an unnaturally low (high) pitch.

In the above aspect, there may be a first mode and a second mode as operation modes. If the operation mode is the first mode, the speech synthesis unit changes the pitch of the second section so that it is in a consonant-interval relationship, other than a perfect unison, to the pitch of the first section, and outputs the result; if the operation mode is the second mode, it changes the pitch of the second section so that it is in a dissonant-interval relationship to the pitch of the first section, and outputs the result. In the second mode, an answer in a dissonant-interval relationship is synthesized, which can make the user who asked the question feel that something is off. Put another way, the second mode can be used to call the user's attention or to deliberately give the user an unpleasant feeling.
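The two operation modes can be illustrated with one representative interval per mode. The specific choices here are assumptions made for the example: a fifth below (2/3) for the first mode, which the text names as preferred, and a major second below (8/9) for the second mode, which falls under the dissonant intervals listed in the text. The names `Mode` and `answer_pitch` are hypothetical.

```python
from enum import Enum
from fractions import Fraction


class Mode(Enum):
    FIRST = 1   # consonant answer (good impression)
    SECOND = 2  # dissonant answer (alerting or unpleasant impression)


# Ratio of answer pitch to question pitch, one example per mode.
RELATION_BY_MODE = {
    Mode.FIRST: Fraction(2, 3),   # perfect fifth below
    Mode.SECOND: Fraction(8, 9),  # major second below (dissonant)
}


def answer_pitch(question_hz: float, mode: Mode) -> float:
    """Target pitch of the answer's second section for the given mode."""
    return question_hz * float(RELATION_BY_MODE[mode])


print(answer_pitch(450.0, Mode.FIRST))
print(answer_pitch(450.0, Mode.SECOND))
```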

The aspects of the present invention can be conceived not only as a speech synthesizer but also as a program that causes a computer to function as the speech synthesizer.
In the present invention, the pitch (frequency) of the question is the object of analysis and the pitch of the answer is the object of control. However, as is clear from the formant example above, human speech occupies a certain frequency range, so it is unavoidable that analysis and control also span a certain frequency range. Errors naturally occur in analysis and control as well. For these reasons, pitch analysis and control in the present case are not required to yield exactly identical pitch (frequency) values; a certain range is permitted.

Block diagram showing the configuration of the speech synthesizer according to the first embodiment.
Diagram showing an example of speech formants in a dialogue.
Diagram showing the relationship between note names, frequencies, and the like.
Flowchart showing the operation of the speech synthesizer.
Diagram showing a specific example of identifying the ending.
Diagram showing an example of pitch shifting applied to a voice sequence.
Diagram showing the psychological effect of synthesized speech on a user's question.
Block diagram showing the configuration of the speech synthesizer according to the second embodiment.
Diagram showing an example of pitch conversion applied to speech waveform data.
Diagram showing the main part of the processing in application example 1.
Diagram showing the main part of the processing in application example 2.
Diagram showing the main part of the processing in application example 3.
Diagram showing an outline of the operation of application example 4.

Embodiments of the present invention will be described below with reference to the drawings.

<First Embodiment>
First, the speech synthesizer according to the first embodiment will be described.
FIG. 1 is a diagram showing the configuration of a speech synthesizer 10 according to the first embodiment of the present invention.
In this figure, the speech synthesizer 10 is a terminal device, such as a mobile phone, that has a CPU (Central Processing Unit), a speech input unit 102, and a speaker 142. In the speech synthesizer 10, the CPU executes a pre-installed application program, whereby a plurality of functional blocks are constructed as follows.
Specifically, the speech synthesizer 10 constructs an utterance section detection unit 104, a pitch analysis unit 106, a language analysis unit 108, an answer creation unit 110, a speech synthesis unit 112, a language database 122, an answer database 124, an information acquisition unit 126, and a speech library 128.
Although not specifically illustrated, the speech synthesizer 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device and input various operations to it. The speech synthesizer 10 is not limited to a terminal device such as a mobile phone and may be a notebook or tablet personal computer.

The speech input unit 102, details of which are omitted, consists of a microphone that converts speech into an electrical signal, an LPF (low-pass filter) that cuts the high-frequency components of the converted speech signal, and an A/D converter that converts the filtered speech signal into a digital signal.
The utterance section detection unit 104 processes the speech signal converted into a digital signal and detects utterance (voiced) sections.

The pitch analysis unit 106 performs frequency analysis on the speech signal detected as an utterance section, obtains the pitch of a specific section (first section) of the first formant obtained by the analysis, and outputs pitch data indicating that pitch. The first section is, for example, the ending of the question. The first formant is the lowest-frequency component among the plurality of formants obtained when, for example, speech is frequency-analyzed; in the example of FIG. 2, it is the peak band whose end is marked with reference sign A. For the frequency analysis, FFT (Fast Fourier Transform) or other known methods can be used. An example of a specific method for identifying the ending of a question is described later.

The language analysis unit 108 determines which phonemes the speech signal detected as an utterance section is close to by referring to phoneme models created in advance in the language database 122, and analyzes (identifies) the meaning of the words defined by the speech signal. For such phoneme models, hidden Markov models, for example, can be used.

The answer creation unit 110 creates an answer corresponding to the meaning of the words analyzed by the language analysis unit 108, referring to the answer database 124 and the information acquisition unit 126. For example, for the question "ima nanji?" ("What time is it now?"), the speech synthesizer 10 can create the answer "It is now XX:XX" by acquiring time information from a built-in real-time clock (not shown) and acquiring the information other than the time information from the answer database 124.
For the question "ashita no tenki wa?" ("What will the weather be tomorrow?"), on the other hand, the speech synthesizer 10 cannot create an answer on its own unless it accesses an external server to acquire weather information. When an answer cannot be created from the answer database 124 alone in this way, the information acquisition unit 126 accesses an external server via the Internet and acquires the information needed for the answer. That is, the answer creation unit 110 is configured to acquire the answer to a question from the answer database 124 or from an external server.
In this embodiment, the answer creation unit 110 outputs the answer as a voice sequence, that is, a phoneme string in which the pitch and sounding timing of each phoneme are specified. If the speech synthesis unit 112 synthesized speech exactly according to the voice sequence with its specified pitches and sounding timings, it could output the basic speech of the answer. In this embodiment, however, the speech synthesis unit 112 modifies the basic speech defined by the voice sequence before outputting it.

The speech synthesis unit 112 changes the pitch of a specific section (second section) of the voice sequence created by the answer creation unit 110 to a pitch having a predetermined relationship to the pitch data supplied from the pitch analysis unit 106, synthesizes the speech, and outputs it as a speech signal. In this embodiment the second section is the ending of the answer, although, as described later, it is not limited to the ending. Also in this embodiment, the pitch having the predetermined relationship to the pitch data is a pitch a fifth below, although, as described later, a relationship other than a fifth below may be used.
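To make the idea of a voice sequence concrete, the sketch below models it as a list of phonemes with a pitch and a sounding time, and sets the pitch of the ending to a fifth below the question pitch. Everything here is an assumption for illustration: the `Note` type, the function `shift_ending`, and the example pitches are hypothetical, and this passage does not say whether the rest of the sequence is shifted along with the ending, so only the last element is changed.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Note:
    phoneme: str     # phoneme or syllable label
    pitch_hz: float  # pitch specified by the voice sequence
    onset_s: float   # sounding timing in seconds


def shift_ending(
    sequence: List[Note],
    question_pitch_hz: float,
    relation: float = 2.0 / 3.0,  # fifth below, as in this embodiment
) -> List[Note]:
    """Return a copy whose last note (the ending) has the target pitch."""
    if not sequence:
        return []
    target = question_pitch_hz * relation
    return list(sequence[:-1]) + [replace(sequence[-1], pitch_hz=target)]


# Hypothetical basic voice sequence for the answer "a, hai".
basic = [Note("a", 300.0, 0.00), Note("ha", 310.0, 0.30), Note("i", 290.0, 0.45)]
shifted = shift_ending(basic, question_pitch_hz=400.0)
for note in shifted:
    print(note)
```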

In synthesizing speech, the speech synthesis unit 112 uses the speech unit data registered in the speech library 128. The speech library 128 is a database, compiled in advance, of speech unit data that defines the waveforms of the various speech units serving as the raw material of speech, such as single phonemes and phoneme-to-phoneme transitions. Specifically, the speech synthesis unit 112 combines the speech unit data for each sound (phoneme) of the voice sequence, corrects the joins so that they are continuous, and changes the pitch of the ending of the answer as described above to generate the speech signal.
The synthesized speech signal is converted into an analog signal by a D/A conversion unit (not shown) and then converted into sound and output by the speaker 142.

Next, the operation of the speech synthesizer 10 will be described. FIG. 4 is a flowchart showing the processing operation of the speech synthesizer 10.
First, when the user performs a predetermined operation, for example, selects an icon corresponding to dialogue processing on a main menu screen (not shown), the CPU starts the application program corresponding to that processing. By executing this application program, the CPU constructs the functional blocks shown in FIG. 1.

First, in step Sa11, the user inputs a question by voice to the speech input unit 102. Next, in step Sa12, the utterance section detection unit 104 treats as a silent section any section in which the magnitude of the speech, that is, the volume, stays at or below a threshold continuously for a predetermined period or longer, detects the remaining sections as utterance sections, and supplies the speech signal of each utterance section to both the pitch analysis unit 106 and the language analysis unit 108.
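The detection rule of step Sa12 can be sketched over a per-frame volume contour. This is a minimal illustration under stated assumptions: the function name `detect_utterances`, the frame-based representation, and the example threshold and period are all hypothetical; the text specifies only the rule itself (volume at or below a threshold for at least a predetermined period counts as silence).

```python
from typing import List, Tuple


def detect_utterances(
    volumes: List[float],
    threshold: float,
    min_silence_frames: int,
) -> List[Tuple[int, int]]:
    """Return utterance sections as (start, end) frame indices, end exclusive."""
    n = len(volumes)
    silent = [False] * n

    # Mark runs of low volume that last at least min_silence_frames.
    i = 0
    while i < n:
        if volumes[i] <= threshold:
            j = i
            while j < n and volumes[j] <= threshold:
                j += 1
            if j - i >= min_silence_frames:
                for k in range(i, j):
                    silent[k] = True
            i = j
        else:
            i += 1

    # Everything that is not silence is an utterance section.
    sections = []
    start = None
    for idx, is_silent in enumerate(silent):
        if not is_silent and start is None:
            start = idx
        elif is_silent and start is not None:
            sections.append((start, idx))
            start = None
    if start is not None:
        sections.append((start, n))
    return sections


frames = [0, 0, 0, 5, 6, 0, 7, 8, 0, 0, 0, 0]
print(detect_utterances(frames, threshold=1, min_silence_frames=3))
```

Note that the single quiet frame in the middle of the example is shorter than the required period, so it stays inside the utterance section rather than splitting it.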

In step Sa13, the pitch analysis unit 106 analyzes the speech signal of the question in the detected utterance section, identifies the pitch of the first section (ending) of the question, and supplies pitch data indicating that pitch to the speech synthesis unit 112. An example of a specific method by which the pitch analysis unit 106 identifies the ending of the question is described here.

In a dialogue in which the person asking a question wants an answer to it, the volume in the part corresponding to the ending of the question is considered to rise temporarily compared with the other parts. The pitch of the first section (the ending) can therefore be obtained by the pitch analysis unit 106, for example, as follows.
First, the pitch analysis unit 106 converts the voice signal of the question detected as an utterance section into two waveforms, one for volume and one for pitch. FIG. 5(a) is an example of a volume waveform, with the volume of the voice signal on the vertical axis and elapsed time on the horizontal axis. FIG. 5(b) is a pitch waveform, with the pitch of the first formant obtained by frequency analysis of the same voice signal on the vertical axis and elapsed time on the horizontal axis. The volume waveform in (a) and the pitch waveform in (b) share a common time axis.
Second, the pitch analysis unit 106 identifies the timing of the temporally last local maximum P1 in the volume waveform of (a).
Third, the pitch analysis unit 106 recognizes a predetermined time range (for example, 100 μsec to 300 μsec) extending before and after the timing of the identified local maximum P1 as the ending.
Fourth, the pitch analysis unit 106 outputs, as pitch data, the average pitch of the section Q1 of the pitch waveform in (b) that corresponds to the recognized ending.
Identifying the last local maximum P1 of the volume waveform in the utterance section as the timing corresponding to the ending of the question in this way is considered to reduce erroneous detection of the ending of a conversational question.
Here, a predetermined time range extending before and after the timing of the last local maximum P1 in the volume waveform of (a) is recognized as the ending, but a predetermined time range that starts or ends at the timing of the local maximum P1 may be recognized as the ending instead. Also, instead of the average pitch of the section Q1 corresponding to the recognized ending, the pitch at the start of the section Q1, at its end, or at the timing of the local maximum P1 may be output as pitch data.
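The four steps above can be sketched in Python as follows. This is a minimal illustration, assuming the volume and pitch waveforms are already available as equal-length lists with one value per frame; the function name and the `half_window` parameter (the number of frames taken on each side of P1) are hypothetical.

```python
def ending_pitch(volumes, pitches, half_window):
    """Average pitch around the last local maximum of the volume waveform.

    volumes and pitches share one time axis (same length, one value per frame).
    Returns None if the volume waveform has no interior local maximum.
    """
    if len(volumes) != len(pitches):
        raise ValueError("volume and pitch waveforms must share a time axis")

    last_peak = None
    for i in range(1, len(volumes) - 1):
        # a rise followed by a non-rise marks a local maximum
        if volumes[i - 1] < volumes[i] >= volumes[i + 1]:
            last_peak = i  # keep overwriting so the last one wins (P1)
    if last_peak is None:
        return None

    lo = max(0, last_peak - half_window)
    hi = min(len(pitches), last_peak + half_window + 1)
    section = pitches[lo:hi]  # corresponds to section Q1
    return sum(section) / len(section)
```

The variants mentioned in the text (a window that starts or ends at P1, or the pitch at a single instant) would only change how `lo` and `hi` are chosen.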

Meanwhile, in step Sa14, the language analysis unit 108 analyzes the meaning of the words in the voice signal and supplies data indicating that meaning to the answer creation unit 110. In step Sa15, the answer creation unit 110 creates an answer corresponding to the analyzed meaning by using the answer database 124, or acquires one from an external server via the information acquisition unit 126 as necessary, creates a voice sequence based on the answer, and supplies it to the speech synthesis unit 112.

FIG. 6(a) is an example of the voice (voice sequence) of an answer created in response to, for example, the question "Ashita no tenki wa?" ("What will the weather be tomorrow?"). In this example, a note is assigned to each sound of the answer "Hare desu" ("It will be sunny"), indicating the pitch and sounding timing of each sound (phoneme) of the basic voice defined by the voice sequence. For simplicity of explanation, one note is assigned to each sound (phoneme) in this example, but a plurality of notes may be assigned to one sound, as with a slur or a tie.

Next, in step Sa16, the speech synthesis unit 112 identifies, from the voice sequence supplied by the answer creation unit 110, the pitch of the ending in that voice sequence (the initial pitch).
Subsequently, in step Sa17, the speech synthesis unit 112 changes the pitches defined in the voice sequence so that the initial pitch of the ending defined in the voice sequence becomes a fifth below the pitch indicated by the pitch data from the pitch analysis unit 106.
For example, as shown in FIG. 6(b), when the pitch data indicates that the pitch of the section of the ending "wa", denoted by symbol A, in the question "Ashita no tenki wa?" is "sol", the speech synthesis unit 112 changes the pitch of the entire voice sequence so that the pitch of the section of the ending "su", denoted by symbol B, in the answer "Hare desu" becomes "do", which is a fifth below "sol".
Then, in step Sa18, the speech synthesis unit 112 synthesizes and outputs the voice of the changed voice sequence. After the answer voice is output, the CPU terminates execution of the application program and returns to the menu screen, although this is not specifically illustrated.
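The transposition of steps Sa16 and Sa17 can be sketched in Python as follows. This is an illustration under assumptions: the voice sequence is reduced to a list of note frequencies in hertz, a fifth below is taken as the just-intonation ratio 2/3, and the function name and the example melody are hypothetical (only the ending pitches "sol" = 396.0 Hz and "fa" = 352.0 Hz appear in the description).

```python
from fractions import Fraction

FIFTH_BELOW = Fraction(2, 3)  # assumption: just-intonation perfect fifth downward


def shift_sequence(note_freqs, question_end_freq, ratio=FIFTH_BELOW):
    """Scale every note so the last note lands at question_end_freq * ratio.

    The whole sequence is multiplied by one common factor, so the melodic
    shape of the answer is preserved while its ending moves to the target.
    """
    target = Fraction(question_end_freq) * ratio      # desired ending pitch
    factor = target / Fraction(note_freqs[-1])        # initial ending pitch (Sa16)
    return [float(Fraction(f) * factor) for f in note_freqs]  # Sa17


# Hypothetical four-note answer ending on "fa" (352 Hz); question ends on "sol".
shifted = shift_sequence([264, 297, 330, 352], 396)
# shifted[-1] is 264.0, that is, "do", a fifth below "sol"
```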

FIG. 7 is a diagram for explaining the impression that the speech synthesizer 10 according to this embodiment gives to the user. As shown in (a) of the figure, a user W inputs the question "Ashita no tenki wa?" to the speech synthesizer 10, which is a terminal device. If the pitch of "wa", which corresponds to the ending of the question, is "sol", then in the embodiment, as shown in (b) of the figure, the voice sequence "Hare desu" is pitch-shifted and synthesized so that the pitch of "su", which corresponds to its ending, becomes "do". The answer therefore does not feel unnatural to the user W and can give a good impression, as if a real dialogue were taking place.
On the other hand, as shown in (c) of the figure, when the voice sequence "Hare desu" is synthesized without shifting its pitch (see FIG. 6(a)), "su", which corresponds to the ending, is output at the pitch "fa". In this case the pitch "fa" is in a dissonant interval relationship with "sol", the pitch of "wa" at the ending of the question "Ashita no tenki wa?". That is, referring to FIG. 3, the frequency of "sol" (396.0 Hz) is in a 9/8 relationship to the frequency of "fa" (352.0 Hz). The answer therefore gives the user W not so much an unnatural feeling as a bad impression akin to disgust. As described later, however, the speech synthesizer 10 may also be configured to give the user such a bad impression deliberately.

<Second Embodiment>
Next, a second embodiment will be described.
FIG. 8 is a block diagram showing the configuration of the speech synthesizer 10 according to the second embodiment.
In the first embodiment, the answer creation unit 110 outputs, as the answer to a question, a voice sequence in which a pitch is assigned to each sound. In the second embodiment, by contrast, an answer voice output unit 113 acquires an answer to the question and outputs voice waveform data of that answer.
The acquired answer may be one created by the answer voice output unit 113, one acquired from an external server, or one selected from a plurality of answers prepared in advance. The voice waveform data is data in, for example, the wav format, and unlike the voice sequence described above, no pitch is assigned to each sound. Consequently, simply reproducing such voice waveform data yields, as shown in FIG. 9(a), a voice that has intonation but sounds mechanical.

When the voice waveform data is reproduced, it is the post-processing unit 114 that changes the data so that the ending of the answer is in a consonant interval relationship with the ending of the question. Specifically, the post-processing unit 114 analyzes the pitch that the ending would have if the voice waveform data were simply reproduced, and applies pitch conversion to the voice waveform data output from the answer voice output unit 113 so that the analyzed pitch becomes, for example, a fifth below the pitch indicated by the pitch data from the pitch analysis unit 106. That is, in the second embodiment, the post-processing unit 114 changes the pitch of the ending of the acquired answer to a pitch a fifth below the pitch of the ending of the question, a fifth below being one example of a consonant interval, and outputs the result.
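How much the post-processing unit 114 has to convert the pitch can be sketched in Python as follows. This is only an illustration: it computes the scaling factor and its equivalent in equal-tempered semitones, and leaves the actual pitch conversion of the waveform to whatever signal-processing method is used. The function names and the default ratio of 2/3 for a fifth below are assumptions.

```python
import math


def pitch_conversion_ratio(answer_end_freq, question_end_freq, interval_ratio=2 / 3):
    """Factor by which the pitch of the answer waveform must be scaled.

    answer_end_freq is the ending pitch analyzed from the unmodified waveform;
    question_end_freq is the pitch indicated by the pitch data.
    """
    target = question_end_freq * interval_ratio
    return target / answer_end_freq


def ratio_to_semitones(ratio):
    """Express a frequency ratio as a signed number of equal-tempered semitones."""
    return 12 * math.log2(ratio)
```

For an answer whose ending is at 352.0 Hz and a question ending at 396.0 Hz, the factor is about 0.75, that is, the waveform is lowered by roughly five semitones.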

As shown in FIG. 9(b), the result of this conversion is almost the same as the pitch shift shown in FIG. 6(b). With this configuration, when the answer to a question does not need to be specific, for example when answering with a simple reply such as "yes" or "no" or with a backchannel response such as "that's right", the answer voice output unit 113 need only select and output, in response to the question, one of a plurality of voice waveform data items stored in advance.

<Applications and Modifications>
The present invention is not limited to the first and second embodiments described above, and various applications and modifications such as those described below are possible. One or more of the applications and modifications described below may also be selected arbitrarily and combined as appropriate.

<Voice Input Unit>
In the embodiments, the voice input unit 102 receives the user's voice (utterance) through a microphone and converts it into a voice signal, but it is not limited to this configuration and may instead receive a voice signal processed by another processing unit or a voice signal supplied (or transferred) from another device. In other words, the voice input unit 102 need only be configured to receive an utterance in the form of a voice signal in some way.

<Ending and Beginning of the Answer, etc.>
In the first and second embodiments, the pitch of the ending of the answer is controlled according to the pitch of the ending of the question. Depending on the language, dialect, or phrasing, however, a part of the answer other than its ending, for example its beginning, may be the characteristic part. In such a case, when the person who asked the question receives an answer, that person unconsciously compares the pitch of the ending of the question with the pitch of the characteristic beginning of the answer and forms an impression of the answer. In this case, therefore, the pitch of the beginning of the answer should be controlled according to the pitch of the ending of the question. With this configuration, when the beginning of the answer is characteristic, a psychological impression can be given to the user who receives the answer.

The same applies to the question: the judgment may be based not on its ending but on its beginning. Furthermore, for both the question and the answer, the judgment may be based not on the beginning or ending but on the average pitch, or on the pitch of the most strongly pronounced part. The first section of the question and the second section of the answer are therefore not necessarily limited to a beginning or an ending.

<Interval Relationship>
In the embodiments described above, speech synthesis is controlled so that the pitch of the ending or the like of the answer is a fifth below that of the ending or the like of the question, but it may be controlled to a consonant interval relationship other than a fifth below. For example, as described above, the interval may be a perfect octave, a perfect fifth, a perfect fourth, a major or minor third, or a major or minor sixth.
In addition, some interval relationships that are not consonant intervals are known empirically to give a good (or bad) impression, so the pitch of the answer may be controlled to such an interval relationship. Even in this case, however, if the interval between the pitch of the ending or the like of the question and the pitch of the ending or the like of the answer is too wide, the answer tends to sound unnatural, so it is desirable that the pitch of the answer lie within one octave above or below the pitch of the question.
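The consonant intervals listed above and the one-octave condition can be sketched in Python as follows. This is an illustration under assumptions: the intervals are represented by their standard just-intonation frequency ratios, which the description does not spell out, and the function names and the 1 % tolerance are hypothetical.

```python
from fractions import Fraction

# Assumption: just-intonation ratios (higher frequency / lower frequency).
CONSONANT_RATIOS = {
    "perfect octave": Fraction(2, 1),
    "perfect fifth": Fraction(3, 2),
    "perfect fourth": Fraction(4, 3),
    "major third": Fraction(5, 4),
    "minor third": Fraction(6, 5),
    "major sixth": Fraction(5, 3),
    "minor sixth": Fraction(8, 5),
}


def interval_name(freq_a, freq_b, tolerance=0.01):
    """Name the consonant interval between two frequencies, or return None."""
    high, low = max(freq_a, freq_b), min(freq_a, freq_b)
    ratio = high / low
    for name, r in CONSONANT_RATIOS.items():
        if abs(ratio - float(r)) <= tolerance * float(r):
            return name
    return None


def within_one_octave(freq_a, freq_b):
    """True if the two pitches are at most one octave apart."""
    high, low = max(freq_a, freq_b), min(freq_a, freq_b)
    return high / low <= 2.0
```

With these definitions, 396.0 Hz against 264.0 Hz is classified as a perfect fifth, while 396.0 Hz against 352.0 Hz (the 9/8 relationship) matches none of the consonant intervals.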

<Pitch Shift of the Answer>
In a configuration that controls the pitch of the ending or the like of the answer, as defined by the voice sequence or voice waveform data, so that it has a predetermined relationship with the pitch of the ending or the like of the question, and more specifically in a configuration that changes it to, for example, a fifth below as in the embodiments, the answer may be synthesized in an unnaturally low voice if the pitch a fifth below is too low. Application examples (Part 1 and Part 2) for avoiding such a case are described next.

FIG. 10 is a diagram showing the main part of the processing in application example (Part 1). The main part of the processing here means the processing executed in "determine answer pitch" in step Sa17 of FIG. 4. That is, in application example (Part 1), the processing shown in FIG. 10 is executed in step Sa17 shown in FIG. 4. The details are as follows.
First, the speech synthesis unit 112 obtains and provisionally determines a pitch that is, for example, a fifth below the pitch indicated by the pitch data from the pitch analysis unit 106 (step Sb171).
Next, the speech synthesis unit 112 determines whether the provisionally determined pitch is lower than a predetermined threshold pitch (step Sb172). The threshold pitch is set to, for example, a pitch corresponding to the lower limit frequency for speech synthesis, or a pitch below which the voice would sound unnatural.

If the provisionally determined pitch, that is, the pitch a fifth below the pitch of the ending of the question, is lower than the threshold pitch (if the determination result in step Sb172 is "Yes"), the speech synthesis unit 112 shifts the provisionally determined pitch up by one octave (step Sb173).
If, on the other hand, the obtained pitch is equal to or higher than the threshold pitch (if the determination result in step Sb172 is "No"), the processing of step Sb173 is skipped.
The speech synthesis unit 112 then finally determines the target pitch of the ending used when shifting the pitch of the answer (step Sb174). That is, if the provisionally determined pitch is lower than the threshold pitch, the target is the provisionally determined pitch raised by one octave; if the provisionally determined pitch is equal to or higher than the threshold pitch, the target is the provisionally determined pitch as it is.
After step Sb174, the processing procedure returns to step Sa18 of FIG. 4. The speech synthesis unit 112 therefore synthesizes and outputs the voice of the voice sequence changed to the finally determined pitch.
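Steps Sb171 to Sb174 can be sketched in Python as follows. This is a minimal illustration, assuming pitches are handled as frequencies in hertz and a fifth below is the ratio 2/3; the function name and parameters are hypothetical.

```python
def decide_target_pitch(question_end_freq, threshold_freq, interval_ratio=2 / 3):
    """Target ending pitch of the answer with a lower-limit guard."""
    provisional = question_end_freq * interval_ratio  # Sb171: provisional pitch
    if provisional < threshold_freq:                  # Sb172: below the threshold?
        provisional *= 2                              # Sb173: one octave up
    return provisional                                # Sb174: final target pitch
```

Raising the pitch by an octave doubles the frequency, so the result is still in a fourth-above relationship to the question ending, which remains one of the consonant intervals.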

According to application example (Part 1), if the pitch to be changed to is lower than the threshold pitch, it is shifted to a pitch one octave higher, so that the answer is not synthesized in an unnaturally low voice.
In this example the pitch of the ending or the like of the answer is shifted up by one octave, but it may instead be shifted down by one octave. Specifically, when the pitch of the ending or the like of the question uttered by the user is high, the pitch a fifth below it may still be too high, and the answer would be synthesized in an unnaturally high voice. To avoid this, if the pitch a fifth below the pitch indicated by the pitch data (the provisionally determined pitch) is higher than a threshold pitch, the pitch of the ending or the like of the answer may be shifted to a pitch one octave below the provisionally determined pitch.

In speech synthesis, it may also be possible to output the voice of a virtual character whose gender, age group (child or adult), and so on are defined. When a female or child character is specified in such a case, uniformly lowering the pitch to a fifth below the ending of the question would cause the answer to be synthesized in a low voice that does not suit the character, so the pitch may likewise be shifted up by one octave.

FIG. 11 is a diagram showing the main part of the processing in this application example (Part 2), namely the processing executed in "determine answer pitch" in step Sa17 of FIG. 4. Focusing on the differences from FIG. 10: after obtaining and provisionally determining, in step Sb171, a pitch a fifth below the pitch indicated by the pitch data from the pitch analysis unit 106, the speech synthesis unit 112 determines whether female or child is specified as an attribute defining the character (step Sc172).

If female or child is specified as the attribute (if the determination result in step Sc172 is "Yes"), the speech synthesis unit 112 shifts the provisionally determined pitch up by one octave (step Sb173). If female or child is not specified as the attribute, for example if male or adult is specified (if the determination result in step Sc172 is "No"), the processing of step Sb173 is skipped. The subsequent processing is the same as in application example (Part 1).
According to application example (Part 2), if the device is set to answer in the voice of a woman or a child, the pitch is shifted to one octave above the provisionally determined pitch, so that the answer is not synthesized in an unnaturally low voice.
In this example the pitch is shifted up by one octave when female or child is specified as the attribute, but when, for example, adult male is specified as the attribute, the pitch may be shifted down by one octave to avoid synthesizing the answer in a high voice that does not suit the character corresponding to that attribute.
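The attribute-based branch of application example (Part 2) can be sketched in Python as follows. This is an illustration only: the attribute is assumed to be a plain string, and the function name and the attribute values are hypothetical labels.

```python
def decide_target_pitch_for_character(question_end_freq, character, interval_ratio=2 / 3):
    """Target ending pitch of the answer, adjusted for the character attribute."""
    provisional = question_end_freq * interval_ratio  # Sb171: provisional pitch
    if character in ("female", "child"):              # Sc172: attribute check
        provisional *= 2                              # Sb173: one octave up
    return provisional
```

The downward variant described for an adult male character would divide the provisional pitch by 2 in a corresponding branch.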

<Dissonant Intervals>
In the embodiments described above, speech synthesis is controlled so that the pitch of the ending or the like of the answer is in a consonant interval relationship with that of the ending or the like of the question, but it may instead be controlled so that the pitches are in a dissonant interval relationship. There is a concern that synthesizing the answer at a pitch in a dissonant interval relationship gives the user who asked the question an unnatural feeling, a bad impression, a sense of hostility, or the like, so that a smooth dialogue cannot be established; conversely, there is also a view that such a feeling helps relieve stress.
Accordingly, a mode in which answers giving a good impression are desired (first mode) and a mode in which answers giving a bad impression are desired (second mode) may be provided as operation modes, and speech synthesis may be controlled according to whichever mode is set.

FIG. 12 is a diagram showing the main part of the processing in this application example (Part 3), namely the processing executed in "determine answer pitch" in step Sa17 of FIG. 4. Focusing on the differences from FIG. 10, the speech synthesis unit 112 determines whether the first mode is set as the operation mode (step Sd172).

If the first mode is set as the operation mode (if the determination result in step Sd172 is "Yes"), the speech synthesis unit 112 determines the pitch of, for example, the ending of the answer so that it is in a consonant interval relationship with the pitch of, for example, the ending of the question (step Sd173A). If the second mode is set as the operation mode (if the determination result in step Sd172 is "No"), the speech synthesis unit 112 determines the pitch of the ending of the answer so that it is in a dissonant interval relationship with the pitch of the ending of the question (step Sd173B). The subsequent processing is the same as in application examples (Part 1) and (Part 2).
According to application example (Part 3), therefore, the answer is synthesized at a pitch in a consonant interval relationship with the pitch of the question when the first mode is set, and at a pitch in a dissonant interval relationship with the pitch of the question when the second mode is set, so the user can switch between the operation modes as appropriate.
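The mode switch of steps Sd172, Sd173A, and Sd173B can be sketched in Python as follows. This is an illustration under assumptions: the consonant choice is a fifth below (ratio 2/3) as in the embodiments, and the dissonant choice is taken to be the 9/8 relationship mentioned for "sol" and "fa", applied downward (ratio 8/9). The description does not fix a particular dissonant interval, and the function and mode names are hypothetical.

```python
CONSONANT_RATIO = 2 / 3  # a fifth below, as in the embodiments
DISSONANT_RATIO = 8 / 9  # assumption: the 9/8 relationship, applied downward


def decide_answer_pitch(question_end_freq, mode):
    """Target ending pitch of the answer for the given operation mode."""
    if mode == "first":                              # Sd172 "Yes"
        return question_end_freq * CONSONANT_RATIO   # Sd173A: consonant interval
    return question_end_freq * DISSONANT_RATIO       # Sd173B: dissonant interval
```

For a question ending at 396.0 Hz, the first mode yields about 264.0 Hz ("do") and the second mode about 352.0 Hz ("fa").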

Application examples (Part 1), (Part 2), and (Part 3) have been described using the voice sequence of the first embodiment, but they can of course also be applied when the voice waveform data of the second embodiment is used.

<Voice and Answer>
In the embodiments, the answer is synthesized in a human voice, but it may instead be synthesized as an animal cry. That is, "voice" here is a concept that is not limited to the human voice and includes animal cries. An application example (Part 4) in which the answer is synthesized as an animal cry is described next.

FIG. 13 is a diagram showing an outline of the operation of application example (Part 4). When the answer is synthesized as an animal cry, the processing merely sets the ending of the cry to a predetermined pitch relative to the pitch of the ending of the question. Processing such as analyzing the meaning of the question, acquiring information corresponding to the analyzed meaning, and creating an answer based on that information is therefore unnecessary.

As shown in (a) of the figure, when the user W utters "Ii tenki da ne" ("Nice weather, isn't it?") and it is input to the speech synthesizer 10, the speech synthesizer 10 analyzes the pitch of "ne", which corresponds to the ending of the question. If that pitch is, for example, "sol", the speech synthesizer 10 post-processes the voice waveform data of the dog bark "wan" so that the pitch of "n", which corresponds to the ending of "wan", becomes "do", a fifth below the pitch of the ending of the question and one example of a consonant interval, and outputs the result.

When the answer is synthesized as an animal cry, the user cannot obtain the desired information from the answer. That is, even if the user asks "Asu no tenki wa?" ("What will the weather be tomorrow?"), the user cannot obtain tomorrow's weather information. However, if, when the user utters some question, the cry is synthesized so that its ending is, for example, a fifth below the pitch of the ending of the question, the cry gives a pleasant, reassuring impression, just as when the answer is synthesized in a human voice. Even when an animal cry is synthesized, therefore, it can be expected to give the user a kind of soothing effect, as if the user and the virtual animal uttering the cry understood each other.

A display unit may be provided in the speech synthesizer 10 to display a virtual animal as shown in (b) of the figure, with the animal shown in an animation, such as wagging its tail or tilting its head, in synchronization with the speech synthesis. Such a configuration can further enhance the soothing effect.
When the animal whose cry is synthesized is, for example, a dog, the breed (Chihuahua, Pomeranian, Golden Retriever, etc.) may be made selectable.
The speech synthesizer 10 that synthesizes the answer as an animal cry is not limited to a terminal device and may be applied to a pet robot imitating the animal, a stuffed toy, or the like.

<Others>
In the embodiment, the language analysis unit 108, the language database 122, and the answer database 124, which together obtain an answer to a question, are provided in the speech synthesizer 10. In a terminal device or the like, however, they may instead be provided on an external server, in view of the heavy processing load, the limited storage capacity, and so on. That is, it suffices that the answer creation unit 110 (answer voice output unit 113) in the speech synthesizer 10 obtains an answer to the question in some form and outputs a voice sequence (speech waveform data) of that answer. It does not matter whether the answer is created by the speech synthesizer 10 or by a component other than the speech synthesizer 10 (for example, an external server).
Note that the information acquisition unit 126 is unnecessary in applications where the speech synthesizer 10 can create answers to questions without accessing an external server or the like.
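The point that the answer may come either from the device itself or from an external server can be illustrated with a minimal sketch. All names here (AnswerSource, LocalAnswerSource, RemoteAnswerSource, acquire_answer) are hypothetical and do not appear in the specification; the external server is represented only by an injected callable, so no real network API is assumed.

```python
from typing import Callable, Dict, Optional, Protocol


class AnswerSource(Protocol):
    """Anything that can return an answer text for a question (hypothetical)."""

    def get_answer(self, question: str) -> Optional[str]:
        ...


class LocalAnswerSource:
    """Answers held on the device, loosely analogous to a local answer database."""

    def __init__(self, answers: Dict[str, str]) -> None:
        self._answers = answers

    def get_answer(self, question: str) -> Optional[str]:
        return self._answers.get(question)


class RemoteAnswerSource:
    """Stand-in for an external server; `fetch` is an injected callable."""

    def __init__(self, fetch: Callable[[str], Optional[str]]) -> None:
        self._fetch = fetch

    def get_answer(self, question: str) -> Optional[str]:
        return self._fetch(question)


def acquire_answer(question: str, source: AnswerSource) -> Optional[str]:
    """The caller does not care where the answer was created."""
    return source.get_answer(question)
```

With this arrangement, the code that later adjusts the pitch of the answer is the same whichever source is used, which mirrors the statement that the place where the answer is created does not matter.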

DESCRIPTION OF SYMBOLS 102 ... voice input unit, 104 ... utterance section detection unit, 106 ... pitch analysis unit, 108 ... language analysis unit, 110 ... answer creation unit, 112 ... speech synthesis unit, 126 ... information acquisition unit.

Claims

A speech synthesizer comprising:
a voice input unit that receives a question as a voice signal;
a pitch analysis unit that analyzes the pitch of a specific first section of the question;
an acquisition unit that acquires an answer to the question; and
a speech synthesis unit that changes the pitch of a specific second section of the acquired answer so that it has a predetermined relationship with the pitch of the first section, and outputs the result.

The speech synthesizer according to claim 1, wherein
the first section is the end of the question, and
the second section is the beginning or the end of the answer.

The speech synthesizer according to claim 1 or 2, wherein
the predetermined relationship is a consonant interval other than a perfect unison.

The speech synthesizer according to claim 1 or 2, wherein
the speech synthesis unit changes the pitch of the second section to a pitch within one octave above or below the pitch of the first section, excluding the same pitch as the first section, and outputs the result.

The speech synthesizer according to claim 4, wherein
the speech synthesis unit changes the pitch of the second section to a pitch forming a consonant interval a fifth below the pitch of the first section, and outputs the result.
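As a worked illustration of a pitch a fifth below, the following sketch computes the target frequency from the question pitch. It assumes pitches are expressed as fundamental frequencies in hertz, and it shows both twelve-tone equal temperament (seven semitones down) and the just-intonation ratio 2:3; the claims do not specify a tuning, and the function names are hypothetical.

```python
def shift_semitones(freq_hz: float, semitones: float) -> float:
    """Shift a frequency by a number of equal-tempered semitones."""
    return freq_hz * 2.0 ** (semitones / 12.0)


def fifth_below_equal_tempered(freq_hz: float) -> float:
    """A perfect fifth is seven semitones, so move seven semitones down."""
    return shift_semitones(freq_hz, -7)


def fifth_below_just(freq_hz: float) -> float:
    """In just intonation a perfect fifth is the ratio 3:2, so multiply by 2/3."""
    return freq_hz * 2.0 / 3.0
```

For a question ending at 440 Hz (A4), the equal-tempered target is roughly 293.66 Hz (D4), and the just-intonation target is roughly 293.33 Hz.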

The speech synthesizer according to claim 3 or 4, wherein,
when changing the pitch of the second section to a pitch having the predetermined relationship with the pitch of the first section, the speech synthesis unit
further shifts the target pitch one octave up if the target pitch is lower than a predetermined threshold pitch, or
further shifts the target pitch one octave down if the target pitch is higher than a predetermined threshold pitch.
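The octave correction described above can be sketched as follows. This is a minimal illustration that assumes pitches are frequencies in hertz and that either threshold may be omitted; the function name and parameter names are hypothetical, and a single octave shift is applied, as in the claim.

```python
from typing import Optional


def correct_by_octave(
    target_hz: float,
    low_threshold_hz: Optional[float] = None,
    high_threshold_hz: Optional[float] = None,
) -> float:
    """Shift the target pitch by one octave if it crosses a threshold.

    One octave up doubles the frequency; one octave down halves it.
    """
    if low_threshold_hz is not None and target_hz < low_threshold_hz:
        return target_hz * 2.0  # too low: raise by one octave
    if high_threshold_hz is not None and target_hz > high_threshold_hz:
        return target_hz / 2.0  # too high: lower by one octave
    return target_hz  # within range: leave unchanged
```

Because an octave shift preserves the pitch class, the corrected pitch keeps the same consonant or dissonant character relative to the question pitch while staying in a range that is natural for the synthesized voice.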

The speech synthesizer according to claim 3 or 4, wherein,
when changing the pitch of the second section to a pitch having the predetermined relationship with the pitch of the first section, the speech synthesis unit further shifts the pitch having the predetermined relationship one octave up or down if a predetermined attribute is defined.

The speech synthesizer according to claim 3, having a first mode and a second mode as operation modes, wherein
the speech synthesis unit,
when the operation mode is the first mode, changes the pitch of the second section to a pitch forming a consonant interval other than a perfect unison with the pitch of the first section, and outputs the result, and,
when the operation mode is the second mode, changes the pitch of the second section to a pitch forming a dissonant interval with the pitch of the first section, and outputs the result.
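The two operation modes can be sketched as a choice between a consonant and a dissonant interval. The sketch below rests on several assumptions that are not taken from the claims: intervals are counted in equal-tempered semitones, the conventional music-theory classification of consonant intervals is used, the first mode uses a fifth below, and the second mode uses a tritone below as one example of a dissonant interval. All names are hypothetical.

```python
# Consonant intervals within one octave, in semitones, excluding the unison (0):
# minor/major third, perfect fourth, perfect fifth, minor/major sixth, octave.
CONSONANT_SEMITONES = {3, 4, 5, 7, 8, 9, 12}


def is_consonant(semitones: int) -> bool:
    """True if the interval (up or down) is consonant and not a unison."""
    return abs(semitones) in CONSONANT_SEMITONES


def answer_pitch(question_hz: float, mode: str) -> float:
    """Pick the answer pitch for the given operation mode.

    "first"  -> fifth below (consonant), assumed choice
    "second" -> tritone below (dissonant), assumed choice
    """
    if mode == "first":
        semitones = -7
    elif mode == "second":
        semitones = -6
    else:
        raise ValueError(f"unknown mode: {mode}")
    return question_hz * 2.0 ** (semitones / 12.0)
```

For a question ending at 440 Hz, the first mode gives roughly 293.66 Hz and the second mode roughly 311.13 Hz.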

A program causing a computer to function as:
an acquisition unit that acquires an answer to a question given by an input voice signal;
a pitch analysis unit that analyzes the pitch of a specific first section of the question; and
a speech synthesis unit that changes the pitch of a specific second section of the acquired answer so that it has a predetermined relationship with the pitch of the first section, and outputs the result.