JP6232892B2

JP6232892B2 - Speech synthesis apparatus and program

Info

Publication number: JP6232892B2
Application number: JP2013205260A
Authority: JP
Inventors: 松原　弘明; 弘明松原; 純也浦; 川▲原▼　毅彦; 毅彦川▲原▼; 久湊　裕司; 裕司久湊; 克二吉村
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-09-30
Filing date: 2013-09-30
Publication date: 2017-11-22
Anticipated expiration: 2033-09-30
Also published as: JP2015069138A

Description

The present invention relates to a speech synthesizer and a program.

In recent years, the following speech synthesis techniques have been proposed: a technique that produces more human-like speech by synthesizing and outputting speech that matches the user's speaking tone and voice quality (see, for example, Patent Document 1), and a technique that analyzes the user's speech to diagnose the user's psychological state, health condition, and the like (see, for example, Patent Document 2).
A voice dialogue system has also been proposed that recognizes speech input by a user and outputs content specified in a scenario by speech synthesis, thereby realizing a voice dialogue with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A
Patent Document 2: Japanese Patent No. 4495907
Patent Document 3: Japanese Patent No. 4832097

Consider a dialogue system that combines the speech synthesis techniques and the voice dialogue system described above, so that in response to a user's spoken utterance it retrieves data and outputs an answer by speech synthesis. It has been pointed out that the synthesized speech can sound unnatural to the user; specifically, it can sound unmistakably as though a machine is talking.
The present invention has been made in view of these circumstances. One of its objects is to provide a speech synthesizer and a program that make the answer to a user's utterance sound natural to the user and that give the user a kind of pleasure in carrying on the dialogue.

In studying a man-machine system that outputs (replies with) an answer to a user's utterance by speech synthesis, the present inventor first considered what kind of dialogue takes place between people, focusing on pitch (frequency).

Consider, as a dialogue between people, the case where one person (call this person a) makes an utterance (including a question, a monologue, an inquiry, and so on) and the other person (call this person b) answers it (including with a backchannel response). When a speaks, not only a but also b, who is about to answer, often retains a strong impression of the pitch in a certain section of the utterance. When b answers with agreement, approval, affirmation, or the like, b speaks so that the pitch of the part that characterizes the answer, for example its ending or beginning, has a predetermined relationship, specifically a consonant-interval relationship, with the pitch of the utterance that left the impression. The present inventor reasoned that, because the pitch that a remembers from a's own utterance and the pitch of the characterizing part of the answer are in this relationship, a hears b's answer as pleasant and reassuring and forms a favorable impression of it.

For example, when a says "sou desho?" ("That's right, isn't it?"), both a and b retain in memory the pitch of the ending "sho", which strongly conveys insistence or a request for confirmation. If b then intends to answer affirmatively with "a, hai" ("Ah, yes"), b says "a, hai" so that the pitch of the part that characterizes the answer, for example the ending "i", has the above relationship with the remembered pitch of "sho".

FIG. 2 shows the formants in such an actual dialogue. In the figure, the horizontal axis is time, the vertical axis is frequency, and whiter regions of the spectrum indicate higher intensity.
As shown in the figure, the spectrum obtained by frequency analysis of human speech appears as a plurality of peaks that move over time, that is, as formants. Specifically, the formants corresponding to "sou desho?" and those corresponding to "a, hai" each appear as three peak bands (white bands running along the time axis).
Focusing on the first formant, which has the lowest frequency of the three peak bands, the frequency at reference sign A (its central part), which corresponds to "sho" in "sou desho?", is approximately 400 Hz. The frequency at reference sign B, which corresponds to "i" in "a, hai", is approximately 260 Hz. The frequency at A is therefore approximately 3/2 of the frequency at B.
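As a rough check of the values read from FIG. 2, the following sketch compares the ratio of 400 Hz to 260 Hz with a just perfect fifth (3/2). The helper name interval_in_cents is illustrative and does not appear in the specification; cents are a standard logarithmic unit of interval size, with 1200 cents per octave.

```python
import math


def interval_in_cents(f_high: float, f_low: float) -> float:
    """Size of the interval between two frequencies, in cents (1200 per octave)."""
    return 1200.0 * math.log2(f_high / f_low)


# Approximate frequencies read from FIG. 2 (reference signs A and B).
freq_a_hz = 400.0
freq_b_hz = 260.0

ratio = freq_a_hz / freq_b_hz                      # close to 3/2
measured = interval_in_cents(freq_a_hz, freq_b_hz)  # interval observed in the dialogue
fifth = interval_in_cents(3.0, 2.0)                 # just perfect fifth

print(f"ratio = {ratio:.3f}, measured = {measured:.1f} cents, fifth = {fifth:.1f} cents")
```

The measured interval comes out about 44 cents wider than a just fifth, which is in line with the text describing both frequencies and the 3/2 ratio as approximate.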

In terms of musical intervals, a frequency ratio of 3/2 is the relationship between "so" and the "do" of the same octave, or between "mi" and the "la" of the octave below; as described later, it is a perfect fifth. This frequency ratio (the predetermined relationship between pitches) is one preferred example, and various other examples are given later.

FIG. 3 shows the relationship between note names (solfège syllable names) and the frequencies of the human voice. The example also shows the frequency ratios relative to "do" of the fourth octave; relative to "do", "so" is 3/2, as noted above. The frequency ratios relative to "la" of the third octave are shown alongside.

In a dialogue between people, then, the pitch of an utterance and the pitch of the answer to it are not unrelated; they can be considered to have the relationship described above. The present inventor analyzed many dialogue examples and statistically aggregated evaluations by many people, and thereby confirmed that this idea is broadly correct.

Meanwhile, a dialogue system that outputs (replies with) an answer to a user's utterance by speech synthesis can be expected to be used by people with a wide variety of attributes, of any age or gender. In addition, data such as the speech segments used for speech synthesis are collected from a model speaker. Conversely, if a plurality of models are prepared, answers can be synthesized in a variety of voice qualities. An answer output by speech synthesis can therefore be output with various attributes (agent attributes).
A dialogue system must therefore take into account that the combinations of the user's speaker attribute and the agent attribute vary widely.
Specifically, when the speaker is a woman and the respondent is a man, for example, if the man tries to answer so that the pitch of the ending or the like of his answer has the predetermined relationship with the pitch of the ending of the woman's utterance, that pitch is too high for a man and the answer instead sounds unnatural. Conversely, when the speaker is a man and the respondent is a woman, if the woman tries to answer so that the pitch of the ending or the like of her answer has the predetermined relationship with the pitch of the ending of the man's utterance, that pitch is too low for a woman.
Accordingly, to achieve the above object when synthesizing an answer to a user's utterance, the following configuration is adopted.

That is, to achieve the above object, a speech synthesizer according to one aspect of the present invention comprises: a voice input unit that receives an utterance by a speaker; a pitch analysis unit that analyzes the pitch of a specific first section of the utterance; an acquisition unit that acquires an answer to the utterance; a speech synthesis unit that synthesizes the acquired answer as speech with a predetermined agent attribute; and a voice control unit that controls the speech synthesis unit under a rule that changes the pitch of a specific second section of the answer to a pitch having a predetermined relationship with the pitch of the first section, and that modifies the rule according to at least one of the speaker attribute of the speaker and the agent attribute.
According to this aspect, speech synthesis is controlled under a rule that changes the pitch of the specific second section of the answer to a pitch having a predetermined relationship with the pitch of the specific first section of the utterance. The rule is further modified according to at least one of the speaker attribute of the speaker and the agent attribute. This makes it possible for the answer to the user's utterance to sound natural to the user and to give the user a kind of pleasure in carrying on the dialogue.
The speaker attribute is, for example, the speaker's gender. Gender includes male and female as well as neutral. Besides gender, the speaker attribute may include age, generation, or an age category such as child, adult, or elderly. The speaker attribute may be set in the speech synthesizer in advance or may be determined by the speech synthesizer itself.
The agent attribute is an attribute of the model used for speech synthesis and, like the speaker attribute, is gender or age (generation). The agent attribute is, for example, set in the speech synthesizer in advance.

In this aspect, the first section is preferably, for example, the ending of the utterance, and the second section is preferably the beginning or ending of the answer. As noted above, the section that characterizes the impression of an utterance is often its ending, and the section that characterizes the impression of an answer is often its beginning or ending.
The predetermined relationship is preferably a consonant interval other than a perfect unison. Consonance here means the relationship in which, when a plurality of musical tones sound simultaneously, they blend and harmonize well with each other; intervals in such a relationship are called consonant intervals. The simpler the frequency ratio between the two tones, the higher the degree of consonance. The simplest ratios, 1/1 (perfect unison) and 2/1 (perfect octave), are called absolute consonances; together with 3/2 (perfect fifth) and 4/3 (perfect fourth) they are called perfect consonances. The ratios 5/4 (major third), 6/5 (minor third), 5/3 (major sixth), and 8/5 (minor sixth) are called imperfect consonances, and all other frequency-ratio relationships (major and minor seconds and sevenths, the various augmented and diminished intervals, and so on) are called dissonant intervals.
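The classification above can be written down as a small lookup. The following is a minimal sketch, assuming exact just-intonation ratios; the names classify and is_preferred_relationship are hypothetical and are not taken from the specification. Note that the text counts the absolute consonances among the perfect consonances, whereas this sketch reports the narrowest class for each ratio.

```python
from fractions import Fraction

# Frequency ratios named in the text, grouped by degree of consonance.
ABSOLUTE = {Fraction(1, 1): "perfect unison", Fraction(2, 1): "perfect octave"}
PERFECT = {Fraction(3, 2): "perfect fifth", Fraction(4, 3): "perfect fourth"}
IMPERFECT = {
    Fraction(5, 4): "major third",
    Fraction(6, 5): "minor third",
    Fraction(5, 3): "major sixth",
    Fraction(8, 5): "minor sixth",
}


def classify(ratio: Fraction) -> str:
    """Return the narrowest consonance class for an exact frequency ratio."""
    if ratio in ABSOLUTE:
        return "absolute consonance"
    if ratio in PERFECT:
        return "perfect consonance"
    if ratio in IMPERFECT:
        return "imperfect consonance"
    return "dissonance"


def is_preferred_relationship(ratio: Fraction) -> bool:
    """True for a consonant interval other than the perfect unison."""
    return classify(ratio) != "dissonance" and ratio != Fraction(1, 1)


print(classify(Fraction(3, 2)), is_preferred_relationship(Fraction(1, 1)))
```

Measured pitches will never match these ratios exactly, so a practical comparison would need a tolerance around each ratio; the sketch omits that for brevity.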

If the pitch of the beginning or ending of the answer were the same as the pitch of the ending of the utterance, the dialogue would be expected to sound unnatural, which is why the perfect unison is excluded from the consonant-interval relationship above.
In the above aspect, the most desirable example of the predetermined relationship is considered to be, as described above, the one in which the pitch of the second section is a consonant interval a fifth below the pitch of the first section. The predetermined relationship is not, however, limited to consonant intervals other than the perfect unison; it may be a dissonant interval, or any pitch relationship within one octave above or below, excluding the same pitch.
Answers are not limited to specific answers to questions; they also include backchannel responses (interjections) such as "naruhodo" ("I see") and "sou desu ne" ("That's right").

The above aspect may further include a language analysis unit that analyzes the linguistic information of the utterance, wherein, when the linguistic information is of a predetermined kind, the answering unit acquires a backchannel response as the answer to the utterance, and the voice control unit controls the speech synthesis unit's output of the backchannel response according to the speaker attribute of the speaker. With this configuration, backchannel responses are output according to the user's own attributes, so the user gets a more natural impression, as if conversing with a person.
Control of the backchannel output includes, besides controlling its output timing, outputting the backchannel response repeatedly and not outputting it at all (staying silent).

Aspects of the present invention can be conceived not only as a speech synthesizer but also as a program that causes a computer to function as the speech synthesizer.
In the present invention, the pitch (frequency) of the utterance is what is analyzed and the pitch of the answer is what is controlled. However, as the formant example above makes clear, human speech occupies a certain frequency band, so analysis and control inevitably span a certain frequency range as well. Analysis and control are also naturally subject to error. In this specification, therefore, pitch analysis and control are not required to match pitch (frequency) values exactly; a certain range is permitted.

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to an embodiment.
FIG. 2 is a diagram showing an example of speech formants in a dialogue.
FIG. 3 is a diagram showing the relationship between note names, frequencies, and the like.
FIG. 4 is a flowchart showing speech synthesis processing in the speech synthesizer.
FIG. 5 is a flowchart showing speech synthesis processing in the speech synthesizer.
FIG. 6 is a diagram showing a specific example of identifying the ending of an utterance.
FIG. 7 is a diagram showing an example of a pitch shift applied to a voice sequence.
FIG. 8 is a diagram showing an example of a pitch shift applied to a voice sequence.
FIG. 9 is a diagram showing an example of a pitch shift applied to a voice sequence.
FIG. 10 is a diagram showing an example of a pitch shift applied to a voice sequence.
FIG. 11 is a diagram showing an example of a pitch shift applied to a voice sequence.

Embodiments of the present invention are described below with reference to the drawings.
<Speech synthesizer>

FIG. 1 shows the configuration of a speech synthesizer 10 according to an embodiment of the present invention.
In the figure, the speech synthesizer 10 is a terminal device, such as a mobile phone, that has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 142. When the CPU executes a pre-installed application program, a plurality of functional blocks are constructed in the speech synthesizer 10 as follows.
Specifically, the speech synthesizer 10 constructs an utterance section detection unit 104, a pitch analysis unit 106, a language analysis unit 108, a voice control unit 109, an answer creation unit (acquisition unit) 110, a speech synthesis unit 112, a language database 122, an answer database 124, an information acquisition unit 126, and a voice library 128.
Although not shown, the speech synthesizer 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device and input various operations to it. The speech synthesizer 10 is not limited to a terminal device such as a mobile phone; it may be a notebook or tablet personal computer.

The voice input unit 102, details of which are omitted, consists of a microphone that converts sound into an electrical signal, an LPF (low-pass filter) that cuts the high-frequency components of the converted audio signal, and an A/D converter that converts the filtered audio signal into a digital signal.
The utterance section detection unit 104 processes the digitized audio signal to detect utterance (voiced) sections.

The pitch analysis unit 106 performs volume analysis and frequency analysis on the utterance in the audio signal detected as an utterance section, and supplies the voice control unit 109 with pitch data indicating the pitch in a specific section (the first section) of the utterance.
Here, the first section is, for example, the ending of the utterance. The pitch referred to here is, for example, the first formant, the lowest-frequency component among the plurality of formants obtained by frequency analysis of the audio signal; in FIG. 2, it is the frequency (pitch) indicated by the peak band whose end is marked by reference sign A. FFT (Fast Fourier Transform) or any other known method can be used for the frequency analysis. A specific example of a method for identifying the ending of an utterance is described later.
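As one illustration of the frequency analysis step, the sketch below picks the strongest spectral component within a low frequency band by evaluating a direct Fourier sum at candidate frequencies. This is a simplification chosen for the example, not the method of the specification, which refers to the first formant and to FFT or other known methods. The function name dominant_frequency, the band limits, and the step size are all assumptions.

```python
import math


def dominant_frequency(samples, sample_rate, f_min=80.0, f_max=500.0, step=5.0):
    """Return the candidate frequency (Hz) in [f_min, f_max] with the most energy.

    Evaluates a direct Fourier sum at each candidate frequency; this is slow
    but dependency-free. The band limits and step are illustrative values.
    """
    best_freq, best_power = f_min, -1.0
    freq = f_min
    while freq <= f_max:
        w = 2.0 * math.pi * freq / sample_rate
        re = sum(s * math.cos(w * n) for n, s in enumerate(samples))
        im = sum(s * math.sin(w * n) for n, s in enumerate(samples))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
        freq += step
    return best_freq


# Synthetic stand-in for the ending of an utterance: 0.1 s of a 400 Hz tone.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 400.0 * n / sample_rate) for n in range(800)]
estimated = dominant_frequency(tone, sample_rate)
print(estimated)
```

In practice the analysis would be applied only to the samples of the first section (for example, the ending of the utterance) rather than to a synthetic tone, and real speech would call for a more robust estimator.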

Meanwhile, the language analysis unit 108 determines which phonemes the audio signal detected as an utterance section is closest to by referring to phoneme models created in advance in the language database 122, thereby analyzes (identifies) the utterance defined by the audio signal, and supplies the analysis result to the answer creation unit 110.

The answer creation unit 110 creates an answer corresponding to the utterance analyzed by the language analysis unit 108, referring to the answer database 124 and the information acquisition unit 126.
In this embodiment, the answers created by the answer creation unit 110 are assumed to be of the following kinds:
(1) an answer indicating affirmation, negation, or the like of the utterance;
(2) an answer giving specific content in response to the utterance;
(3) an answer serving as a backchannel response to the utterance.
Examples of (1) include "hai" ("yes") and "iie" ("no"). An example of (2) is answering the utterance "asu no tenki wa?" ("What is tomorrow's weather?") with the specific content "hare desu" ("It will be sunny"). Examples of (3) include "sou desu ne" ("That's right") and "eeto" ("Let me see"); such an answer is created (acquired) when the utterance is neither one that can be answered with "yes" or "no" as in (1) nor one that requires a specific answer as in (2).

For an answer of type (1), for example in response to the utterance "ima sanji desu ka?" ("Is it three o'clock now?"), the answer creation unit 110 can determine whether to answer "hai" or "iie" by acquiring time information from a built-in real-time clock (not shown).
For an utterance such as "asu wa hare desu ka?" ("Will it be sunny tomorrow?"), on the other hand, the speech synthesizer 10 cannot answer on its own unless it accesses an external server and acquires weather information. When the speech synthesizer 10 alone cannot answer in this way, the information acquisition unit 126 accesses an external server via the Internet, acquires the information needed to create the answer, and supplies it to the answer creation unit 110. The answer creation unit 110 can then determine whether the utterance is correct and create an answer.
For an answer of type (2), for example in response to the utterance "ima nanji?" ("What time is it now?"), the answer creation unit 110 acquires the time information and also acquires the information other than the time from the answer database 124, and can thereby create the answer "tadaima ○○ ji ○○ fun desu" ("It is now ○○:○○"). In response to the utterance "asu no tenki wa?" ("What is tomorrow's weather?"), the information acquisition unit 126 accesses an external server and acquires the information needed for the answer, and the answer creation unit 110 creates an answer such as "hare desu" ("It will be sunny") based on the answer database 124 and the acquired information.

The answer creation unit 110 creates and outputs a voice sequence from the answer it has created or acquired. The voice sequence is a phoneme string that specifies the pitch and sounding timing for each phoneme.
For answers of types (1) and (3), the voice sequences corresponding to the answers may, for example, be stored in the answer database 124 and the voice sequence corresponding to the determination result read out from it. Specifically, for an answer of type (1) the answer creation unit 110 need only read out a voice sequence such as "hai" or "iie" according to the determination result, and for an answer of type (3) it need only read out a voice sequence such as "sou desu ne" or "eeto" according to the analysis result of the utterance and the determination result of the answer creation unit 110.
The voice sequence created or acquired by the answer creation unit 110 is supplied to both the voice control unit 109 and the speech synthesis unit 112.
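One possible in-memory representation of such a voice sequence is sketched below. The specification only states that the sequence is a phoneme string with a pitch and sounding timing per phoneme; the class name PhonemeEvent, its field names, and the numeric pitch and timing values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeEvent:
    """One element of a voice sequence (field names are hypothetical)."""
    phoneme: str      # phoneme or syllable label
    pitch_hz: float   # pitch assigned to this phoneme
    onset_s: float    # sounding timing, in seconds from the start of the answer


VoiceSequence = List[PhonemeEvent]


def ending_pitch(sequence: VoiceSequence) -> float:
    """Pitch of the last phoneme, i.e. the ending of the answer."""
    return sequence[-1].pitch_hz


# Illustrative voice sequence for the answer "hai"; the numbers are made up.
hai: VoiceSequence = [
    PhonemeEvent("ha", 220.0, 0.00),
    PhonemeEvent("i", 200.0, 0.15),
]
print(ending_pitch(hai))
```

Keeping the pitch per phoneme makes it straightforward for a later stage to rescale the whole sequence while preserving its internal pitch contour.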

Because the voice sequence specifies the pitch and sounding timing of the speech, the speech synthesis unit 112 can output the basic voice of the answer simply by synthesizing speech according to the voice sequence. As noted above, however, the basic voice of the answer does not take into account the pitch of the ending or the like of the utterance, so it can give the impression that a machine is talking.
In this embodiment, therefore, the voice control unit 109 changes the pitch of the entire voice sequence by applying a rule as follows. The rule (the default rule) is that the pitch of a specific section (the second section) of the voice sequence from the answer creation unit 110 is changed to a pitch having a predetermined relationship with the pitch data. Because adhering strictly to this rule can make the synthesized answer sound unnatural instead, the default rule is modified as appropriate according to the speaker attribute and the agent attribute.

In this embodiment the second section is the ending of the answer, but as described later it is not limited to the ending. Also in this embodiment, the default rule specifies that the pitch of the ending of the answer is to be a pitch having a predetermined relationship with the pitch indicated by the pitch data, specifically a pitch a fifth below it; as described later, a relationship other than a fifth below may be used. In this embodiment the speaker attribute is data specifying the user's gender; for example, the personal information about the user registered in the terminal device serving as the speech synthesizer 10 is used. The agent attribute is information indicating the virtual persona of the speech synthesizer 10. That is, the agent attribute is data indicating the attributes of the person by whom the synthesized answer is supposed to be spoken; here, to simplify the description, it is data specifying gender, like the speaker attribute. The agent attribute is set by the user via the operation input unit.
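The default rule can be sketched as a uniform rescaling of the answer's pitches so that its ending lands a fifth below the pitch of the utterance's ending. Since a perfect fifth corresponds to a frequency ratio of 3/2, a fifth below is a factor of 2/3. The function name apply_default_rule and the example pitches are assumptions; the attribute-dependent modification of the rule is not covered by this sketch.

```python
from typing import List

FIFTH_BELOW = 2.0 / 3.0  # perfect fifth is 3/2, so a fifth below is 2/3


def apply_default_rule(pitches_hz: List[float],
                       utterance_ending_hz: float,
                       relation: float = FIFTH_BELOW) -> List[float]:
    """Scale every pitch of the answer so that its ending hits the target.

    The whole sequence is shifted by one common factor, which keeps the
    pitch contour of the answer intact.
    """
    target = utterance_ending_hz * relation
    scale = target / pitches_hz[-1]
    return [p * scale for p in pitches_hz]


# Answer pitches (made-up values) and an utterance ending at about 400 Hz.
shifted = apply_default_rule([220.0, 200.0], 400.0)
print([round(p, 1) for p in shifted])
```

With an utterance ending near 400 Hz, the ending of the answer moves to roughly 267 Hz, close to the approximately 260 Hz observed at reference sign B in FIG. 2.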

When synthesizing speech, the speech synthesis unit 112 uses speech segment data registered in the voice library 128. The voice library 128 is a database, compiled in advance for each of a plurality of agent attributes, of speech segment data defining the waveforms of the various speech segments that serve as the raw material of speech, such as single phonemes and phoneme-to-phoneme transitions. Specifically, the speech synthesis unit 112 uses the speech segment data specified by the agent attribute, combines the speech segment data for each sound (phoneme) of the voice sequence, adjusts them so that the joints are continuous, and generates an audio signal while changing the pitch of the answer according to the rule specified by the voice control unit 109 (including the modified rule).
The synthesized audio signal is converted into an analog signal by a D/A converter (not shown), and then converted into sound and output by the speaker 142.

Next, the operation of the speech synthesizer 10 is described.
FIG. 4 is a flowchart showing the speech synthesis processing in the speech synthesizer 10.
When the user performs a predetermined operation, for example selecting an icon corresponding to the dialogue processing on a main menu screen (not shown), the CPU starts the application program corresponding to that processing. By executing this application program, the CPU constructs the functional blocks shown in FIG. 1.

First, the user inputs an utterance by voice to the voice input unit 102 (step Sa11). The utterance section detection unit 104 detects the utterance section, for example by comparing the amplitude of the voice with a threshold, and supplies the voice signal of the utterance section to both the pitch analysis unit 106 and the language analysis unit 108 (step Sa12).
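
The threshold comparison in step Sa12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `detect_utterance_section`, the list-of-samples input, and the choice to treat everything from the first to the last above-threshold sample as one utterance section are all assumptions.

```python
def detect_utterance_section(samples, threshold):
    """Return (start, end) sample indices of the utterance section, or None.

    Assumption: the section runs from the first sample whose absolute
    amplitude reaches the threshold to just past the last such sample.
    """
    active = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not active:
        return None  # nothing loud enough to count as an utterance
    return active[0], active[-1] + 1  # end index is exclusive


if __name__ == "__main__":
    signal = [0.0, 0.01, 0.5, -0.6, 0.02, 0.4, 0.0]
    print(detect_utterance_section(signal, threshold=0.1))
```

A practical detector would more likely work on a smoothed amplitude envelope and tolerate short pauses, which the text does not specify.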

The language analysis unit 108 analyzes the meaning of the utterance in the supplied voice signal and supplies data indicating its meaning to the answer creating unit 110 (step Sa13).
The answer creating unit 110 creates an answer corresponding to the language analysis result of the utterance, using the answer database 124 and, as necessary, obtaining information from an external server via the information acquiring unit 126 (step Sa14). The answer creating unit 110 then creates a voice sequence based on the answer, as described above, and supplies it to the speech synthesis unit 112 (step Sa15).

For example, if the language analysis result of the user's utterance means "Asu wa hare desu ka? (Will it be sunny tomorrow?)", the answer creating unit 110 accesses the external server to obtain the weather information needed for the answer, and outputs the voice sequence "Hai" (yes) if the obtained weather information indicates sunny weather, or the voice sequence "Iie" (no) otherwise.
If the language analysis result of the user's utterance is "Asu no tenki wa? (What is tomorrow's weather?)", the answer creating unit 110 outputs a voice sequence such as "Hare desu" (it will be sunny) or "Kumori desu" (it will be cloudy) according to the weather information obtained from the external server.
If, on the other hand, the language analysis result of the user's utterance means "Asu wa hare kaa" (so it will be sunny tomorrow), the utterance is a soliloquy (or a mutter), so the answer creating unit 110 reads a backchannel voice sequence such as "Sou desu ne" (yes, that's right) from the answer database 124 and outputs it.
The voice control unit 109 identifies, from the voice sequence supplied by the answer creating unit 110, the pitch of the ending in that voice sequence (the initial pitch) (step Sa16).

Meanwhile, the pitch analysis unit 106 analyzes the voice signal of the utterance in the detected utterance section, identifies the pitch of the first section (the ending) of the utterance, and supplies pitch data indicating that pitch to the voice control unit 109 (step Sa17). An example of a specific method by which the pitch analysis unit 106 identifies the ending of the utterance is described below.

In a dialogue where the speaker wants an answer to the utterance, the volume is considered to rise temporarily, relative to the other parts, in the portion corresponding to the ending of the utterance. The pitch of the first section (the ending) can therefore be obtained by the pitch analysis unit 106 as follows, for example.
First, the pitch analysis unit 106 separates the voice signal of the utterance detected as the utterance section into volume and pitch and converts each into a waveform. In FIG. 6, (a) is an example of a volume waveform showing the volume of the voice signal on the vertical axis and elapsed time on the horizontal axis, and (b) is a pitch waveform showing, for the same voice signal, the pitch of the first formant obtained by frequency analysis on the vertical axis and elapsed time on the horizontal axis. The volume waveform in (a) and the pitch waveform in (b) share the same time axis.
Second, the pitch analysis unit 106 identifies the timing of the temporally last local maximum P1 in the volume waveform of (a).
Third, the pitch analysis unit 106 determines that a predetermined time range (for example, 100 μsec to 300 μsec) extending before and after the timing of the identified local maximum P1 is the ending.
Fourth, the pitch analysis unit 106 outputs, as pitch data, the average pitch of the section Q1 of the pitch waveform in (b) that corresponds to the determined ending.
Identifying the last local maximum P1 of the volume waveform in the utterance section as the timing corresponding to the ending of the utterance in this way is considered to reduce erroneous detection of the ending of a conversational utterance.
Here, a predetermined time range extending before and after the timing of the temporally last local maximum P1 in the volume waveform of (a) is determined to be the ending, but a predetermined time range that starts or ends at the timing of the local maximum P1 may be determined to be the ending instead. Furthermore, instead of the average pitch of the section Q1 corresponding to the determined ending, the pitch at the start or end of the section Q1, or at the timing of the local maximum P1, may be output as the pitch data.
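
The four steps above can be sketched roughly as follows. The sketch assumes the volume and pitch waveforms are already available as two equal-length lists sampled on the same time axis; the function names and the `half_window` parameter (the number of frames taken on each side of P1) are hypothetical and stand in for the "predetermined time range" of the text.

```python
def last_local_maximum(volume):
    """Index of the temporally last local maximum P1, or None if there is none."""
    for i in range(len(volume) - 2, 0, -1):  # scan interior frames backwards
        if volume[i] > volume[i - 1] and volume[i] >= volume[i + 1]:
            return i
    return None


def ending_pitch(volume, pitch, half_window):
    """Average pitch over section Q1, the frames around the last maximum P1."""
    p1 = last_local_maximum(volume)
    if p1 is None:
        return None
    start = max(0, p1 - half_window)
    end = min(len(pitch), p1 + half_window + 1)
    section_q1 = pitch[start:end]
    return sum(section_q1) / len(section_q1)


if __name__ == "__main__":
    volume = [0.1, 0.5, 0.2, 0.3, 0.8, 0.4, 0.1]
    pitch = [200, 210, 220, 230, 240, 250, 260]
    print(ending_pitch(volume, pitch, half_window=1))
```

With real recordings the volume envelope would normally be smoothed first, since raw frame-to-frame jitter produces many spurious local maxima; the text does not say how this is handled.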

Meanwhile, the voice control unit 109, having received the pitch data, executes the following rule correction process (step Sa18).
FIG. 5 is a flowchart showing the details of this rule correction process. First, the voice control unit 109 acquires data indicating the speaker attribute and data indicating the agent attribute (step Sb11).

Next, the voice control unit 109 determines from the acquired data whether the speaker attribute, that is, the attribute of the user, is female (step Sb12).
If the speaker attribute is female (if the determination result in step Sb12 is "Yes"), the voice control unit 109 corrects the default rule so that the pitch of the ending of the answer is not a fifth below the pitch indicated by the pitch data but, for example, a sixth below it, a consonant interval one rank lower. As a result, the pitch of the ending of the answer is lowered below the pitch set by the default rule (step Sb13).
The term "rank" here has no musical meaning and is used purely for convenience. Taking a pitch a fifth below the pitch indicated by the pitch data as the reference, lowering the rank by one gives a pitch a sixth (major sixth) below, and lowering it by one more gives a pitch an octave below. Likewise, taking the pitch a fifth below as the reference, raising the rank by one gives a pitch a third (major third) below, and raising it by one more gives a pitch a fourth above.
If, on the other hand, the speaker attribute of the user is not female (if the determination result in step Sb12 is "No"), the voice control unit 109 determines whether the speaker attribute is male (step Sb14).
If the speaker attribute is male (if the determination result in step Sb14 is "Yes"), the voice control unit 109 corrects the default rule so that the pitch of the ending of the answer is a third below the pitch indicated by the pitch data. As a result, the pitch of the ending of the answer is raised above the pitch set by the default rule (step Sb15).
If the speaker attribute is neutral or has not been registered (if the determination result in step Sb14 is "No"), the voice control unit 109 skips the processing of steps Sb13 and Sb15 and leaves the default rule uncorrected.

Next, the voice control unit 109 determines whether the agent attribute is female (step Sb16). If the agent attribute is female (if the determination result in step Sb16 is "Yes"), the voice control unit 109 corrects the corrected rule (or the uncorrected default rule) so that the pitch of the ending of the answer is raised by one rank (step Sb17).
For example, if the voice control unit 109 corrected the default rule in step Sb13 so that the pitch of the ending of the answer is a sixth below the pitch indicated by the pitch data, one rank lower, then in step Sb17 it returns to the default rule so that the pitch is the original fifth below. If the voice control unit 109 corrected the default rule in step Sb15 so that the pitch of the ending of the answer is a third below the pitch indicated by the pitch data, one rank higher, then in step Sb17 it corrects the rule again so that the pitch is a fourth above, one more rank higher.
If the processing of steps Sb13 and Sb15 was skipped, the voice control unit 109 corrects the default rule in step Sb17 so that the pitch of the ending of the answer is a third below the pitch indicated by the pitch data, one rank higher.

If, on the other hand, the agent attribute is not female (if the determination result in step Sb16 is "No"), the voice control unit 109 determines whether the agent attribute is male (step Sb18). If the agent attribute is male (if the determination result in step Sb18 is "Yes"), the voice control unit 109 corrects the corrected rule so that the pitch of the ending of the answer is lowered by one rank (step Sb19).
For example, if the voice control unit 109 corrected the default rule in step Sb13 so that the pitch of the ending of the answer is a sixth below the pitch indicated by the pitch data, one rank lower, then in step Sb19 it corrects the rule again so that the pitch is an octave below, one more rank lower. If the voice control unit 109 corrected the default rule in step Sb15 so that the pitch of the ending of the answer is a third below the pitch indicated by the pitch data, one rank higher, then in step Sb19 it returns to the default rule so that the pitch is the original fifth below. If the processing of steps Sb13 and Sb15 was skipped, the voice control unit 109 corrects the default rule in step Sb19 so that the pitch of the ending of the answer is a sixth below the pitch indicated by the pitch data, one rank lower.

If the agent attribute is neutral or has not been set (if the determination result in step Sb18 is "No"), the voice control unit 109 skips the processing of steps Sb17 and Sb19. After the processing of step Sb17 or Sb19 ends, or after it is skipped, the procedure moves to step Sa19 in FIG. 4.
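
The rule correction process of FIG. 5 (steps Sb12 to Sb19) amounts to moving up or down an ordered list of five interval rules. The following is a minimal sketch of that reading; the names `RANKS`, `DEFAULT_RANK`, and `corrected_rule`, and the string values used for the attributes, are assumptions made for illustration.

```python
# Interval rules ordered from lowest rank to highest rank, as described in the text.
RANKS = ["octave below", "sixth below", "fifth below", "third below", "fourth above"]
DEFAULT_RANK = 2  # the default rule: a fifth below


def corrected_rule(speaker, agent):
    """Return the rule selected for the given speaker and agent attributes.

    Attributes are "female", "male", or anything else (neutral or unset).
    """
    rank = DEFAULT_RANK
    if speaker == "female":    # Sb12 "Yes" -> Sb13: one rank lower
        rank -= 1
    elif speaker == "male":    # Sb14 "Yes" -> Sb15: one rank higher
        rank += 1
    if agent == "female":      # Sb16 "Yes" -> Sb17: one rank higher
        rank += 1
    elif agent == "male":      # Sb18 "Yes" -> Sb19: one rank lower
        rank -= 1
    return RANKS[rank]


if __name__ == "__main__":
    for speaker in ("female", "male", None):
        for agent in ("female", "male", None):
            print(speaker, agent, "->", corrected_rule(speaker, agent))
```

Because each attribute moves the rank by at most one step, the result always stays within the five listed rules.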

Next, the voice control unit 109 decides to change the voice sequence supplied from the answer creating unit 110 by applying the rule corrected in step Sa18 (or the default rule) (step Sa19). Specifically, if the corrected rule specifies that the pitch of the ending of the answer is, for example, a third below the pitch indicated by the pitch data, the voice control unit 109 decides to shift the pitch of the entire voice sequence so that the ending of the answer specified by the voice sequence supplied from the answer creating unit 110 has a pitch a third below the pitch indicated by the pitch data.
The voice control unit 109 controls speech synthesis by the speech synthesis unit 112 according to what it has decided (step Sa20). The speech synthesis unit 112 thus synthesizes and outputs the voice of the voice sequence whose change was decided by the voice control unit 109, at the decided pitch.
Although not specifically shown, once the voice of the answer has been output, the CPU ends execution of the application program and returns to the menu screen.

Next, the pitch of the utterance, the basic pitch of the voice sequence, and the pitch of the changed voice sequence will be described using specific examples.

The left column of (b) in FIG. 7 shows an example of an utterance by the user. In this figure, the language analysis result of the utterance is "Asu wa hare desu ka? (Will it be sunny tomorrow?)", and the pitch of each sound of the utterance is indicated by the notes shown in that column. The pitch waveform of the utterance is actually a waveform like the one shown in (b) of FIG. 6, but here the pitches are expressed as notes for convenience of explanation.
In this example, as described above, the answer creating unit 110 outputs, for example, the voice sequence "Hai" if the weather information obtained in response to the utterance indicates sunny weather, and the voice sequence "Iie" otherwise.
In FIG. 7, (a) is an example of the voice sequence "Hai"; in this example, a note is assigned to each sound to specify the pitch and sounding timing of each sound (phoneme) of the basic voice. To simplify the description, one note is assigned to each sound (phoneme) in this example, but a plurality of notes may be assigned to one sound, as with a slur or a tie.

If the default rule is applied, the voice sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. In the utterance shown in the left column of (b), if the pitch data indicates that the pitch of the section of the ending "ka", marked A, is "mi", the voice control unit 109 changes the pitch of the entire voice sequence so that, in the answer "Hai", the pitch of the section of the ending "i", marked B, becomes "la", a fifth below "mi" (see the right column of (b) in FIG. 7).
In the present embodiment, the default rule is applied in three cases in FIG. 5: first, when the determination results of steps Sb12, Sb14, Sb16, and Sb18 are all "No"; second, when the determination result of step Sb12 is "Yes" and that of step Sb16 is "Yes"; and third, when the determination result of step Sb12 is "No", that of step Sb14 is "Yes", that of step Sb16 is "No", and that of step Sb18 is "Yes".

When the utterance is as shown in the left column of (b) in FIG. 7 and a corrected rule, for example a sixth below, is applied, the voice sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. That is, the pitch of the entire voice sequence is changed so that, in the answer "Hai", the pitch of the section of the ending "i", marked B, becomes "sol", a sixth below "mi" (see the right column of FIG. 8).
In the present embodiment, the sixth-below rule is applied in two cases: first, when the determination result of step Sb12 is "Yes" and those of steps Sb16 and Sb18 are "No"; and second, when the determination results of steps Sb12 and Sb14 are "No", that of step Sb16 is "No", and that of step Sb18 is "Yes".

When the utterance is as shown in the left column of (b) in FIG. 7 and a corrected rule, for example an octave below, is applied, the voice sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. That is, the pitch of the entire voice sequence is changed so that, in the answer "Hai", the pitch of the ending "i", marked B, becomes the "mi" an octave below "mi" (see the right column of FIG. 9).
In the present embodiment, the octave-below rule is applied in one case: when the determination result of step Sb12 is "Yes", that of step Sb16 is "No", and that of step Sb18 is "Yes".

When the utterance is as shown in the left column of (b) in FIG. 7 and a corrected rule, for example a third below, is applied, the voice sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. That is, the pitch of the entire voice sequence is changed so that, in the answer "Hai", the pitch of the section of the ending "i", marked B, becomes "do", a third below "mi" (see the right column of FIG. 10).
In the present embodiment, the third-below rule is applied in two cases: first, when the determination result of step Sb12 is "No", that of step Sb14 is "Yes", and those of steps Sb16 and Sb18 are "No"; and second, when the determination results of steps Sb12 and Sb14 are "No" and that of step Sb16 is "Yes".

When the utterance is as shown in the left column of (b) in FIG. 7 and a corrected rule, for example a fourth above, is applied, the voice sequence from the answer creating unit 110 is changed by the voice control unit 109 as follows. That is, the pitch of the entire voice sequence is changed so that, in the answer "Hai", the pitch of the section of the ending "i", marked B, becomes "la", a fourth above "mi" (see the right column of FIG. 11).
In the present embodiment, the fourth-above rule is applied in one case: when the determination result of step Sb12 is "No", that of step Sb14 is "Yes", and that of step Sb16 is "Yes".

Although "Hai" has been used as the example here, the pitch of the entire voice sequence is changed in the same way for "Iie", though this is not specifically illustrated. The pitch of the entire voice sequence is likewise changed when the utterance "Asu no tenki wa?" is answered with specific content such as "Hare desu".
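
The whole-sequence shift of step Sa19 and the five worked examples above can be checked with a short sketch. Several things here are assumptions rather than part of the source: pitches are represented as MIDI-style semitone numbers, the fifth, fourth, and octave are taken as perfect intervals (the text states only that the sixth and third are major), the note numbers chosen for "Hai" are hypothetical, and the syllable names follow fixed-do naming.

```python
# Assumed semitone offsets for the five rules.
RULE_SEMITONES = {
    "octave below": -12,
    "sixth below": -9,   # major sixth, as stated in the text
    "fifth below": -7,   # perfect fifth (assumed)
    "third below": -4,   # major third, as stated in the text
    "fourth above": 5,   # perfect fourth (assumed)
}

# Fixed-do syllable for each natural pitch class (assumed naming).
SOLFEGE = {0: "do", 2: "re", 4: "mi", 5: "fa", 7: "sol", 9: "la", 11: "si"}


def shift_sequence(answer_notes, utterance_ending_note, rule):
    """Shift every note so that the last note (the ending) hits the target pitch."""
    target = utterance_ending_note + RULE_SEMITONES[rule]
    offset = target - answer_notes[-1]
    return [note + offset for note in answer_notes]


def solfege(note):
    """Syllable name of a note number, or '?' for a chromatic pitch."""
    return SOLFEGE.get(note % 12, "?")


if __name__ == "__main__":
    hai = [62, 60]          # hypothetical basic pitches for "ha" and "i"
    utterance_ending = 64   # "mi", the ending "ka" of the utterance
    for rule in RULE_SEMITONES:
        shifted = shift_sequence(hai, utterance_ending, rule)
        print(rule, shifted, solfege(shifted[-1]))
```

Under these assumptions the ending of the answer lands on mi, sol, la, do, and la for the octave-below, sixth-below, fifth-below, third-below, and fourth-above rules respectively, matching the examples in the text, while the interval between "ha" and "i" is preserved because the same offset is applied to every note.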

In the present embodiment, the default rule is corrected so that the pitch of the ending of the answer is in a consonant-interval relationship within one octave of the pitch of the ending of the utterance, so the answer to the utterance does not strike the user as unnatural.
According to the present embodiment, starting from the default rule that the pitch of the ending of the answer is a fifth below the pitch of the ending of the utterance, the answer is synthesized with the pitch lowered by one rank if the speaker attribute is female and raised by one rank if the speaker attribute is male. Likewise, starting from the default rule, the answer is synthesized with the pitch raised by one rank if the agent attribute is female and lowered by one rank if the agent attribute is male. Because the pitch of the answer is changed to suit the speaker attribute and the agent attribute in this way, the user can be given a kind of freshness and delight.

<Application examples and modifications>
The present invention is not limited to the embodiment described above, and various applications and modifications such as the following are possible. One or more freely selected aspects of the applications and modifications described below may also be combined as appropriate.

<Voice input unit>
In the embodiment, the voice input unit 102 is configured to receive the user's voice (utterance) through a microphone and convert it into a voice signal, but the voice input unit recited in the claims is not limited to this configuration. That is, the voice input unit recited in the claims need only be configured to input, or receive, an utterance in the form of a voice signal in some manner. More specifically, the voice input unit recited in the claims is a concept that includes a configuration that receives a voice signal processed by another processing unit or a voice signal supplied (or transferred) from another device, as well as an input interface circuit or the like that is built into an LSI and simply receives a voice signal and passes it to a subsequent stage.

<Voice waveform data>
In the embodiment, the answer creating unit 110 outputs, as the answer to an utterance, a voice sequence in which a pitch is assigned to each sound, but it may instead output the answer as voice waveform data, for example in wav format.
Unlike the voice sequence described above, voice waveform data has no pitch assigned to each sound. Therefore, for example, the voice control unit 109 may identify the pitch of the ending that would result from simple playback, apply pitch conversion such as filtering so that the identified pitch has the predetermined relationship with the pitch indicated by the pitch data, and then output (play back) the voice waveform data.
Pitch conversion may also be performed by so-called key control, a technique well known in karaoke equipment that shifts the pitch without changing the speaking speed.
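
For waveform data, the amount of pitch conversion can be expressed as a frequency ratio. The sketch below computes only that ratio and does not perform the conversion itself. It assumes that both ending pitches are available as frequencies in hertz and that intervals are measured in equal-tempered semitones; neither assumption comes from the source, and the function name is hypothetical.

```python
def pitch_shift_ratio(utterance_ending_hz, answer_ending_hz, semitones):
    """Ratio by which the answer waveform's pitch must be scaled.

    semitones is the signed interval of the applied rule, for example -7
    for a fifth below (equal temperament assumed).
    """
    target_hz = utterance_ending_hz * 2.0 ** (semitones / 12.0)
    return target_hz / answer_ending_hz


if __name__ == "__main__":
    # Utterance ends at 440 Hz; the unmodified answer ends at 220 Hz.
    print(pitch_shift_ratio(440.0, 220.0, -7))
```

A ratio above 1 means the answer must be raised and a ratio below 1 means it must be lowered; the resulting ratio would then be handed to whatever pitch conversion or key control process is used.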

<Ending and beginning of the answer, etc.>
In the embodiment, the pitch of the ending of the answer is controlled in accordance with the pitch of the ending of the utterance, but depending on the language, dialect, phrasing, and so on, a part of the answer other than the ending, for example the beginning, may be the characteristic part. In such a case, when an answer to an utterance is given, the person who made the utterance unconsciously compares the pitch of the ending of the utterance with the pitch of the characteristic beginning of the answer and forms an impression of the answer. In this case, therefore, the pitch of the beginning of the answer may be controlled in accordance with the pitch of the ending of the utterance. With this configuration, when the beginning of the answer is characteristic, a psychological impression can be given to the user who receives the answer.

The same applies to the utterance, which may be judged by its beginning rather than its ending. Utterances and answers may also be judged not by the beginning or ending but by the average pitch or by the pitch of the most strongly voiced part. Accordingly, the first section of the utterance and the second section of the answer are not necessarily limited to the beginning or the ending.

<Speaker attribute>
In the embodiment, the user's personal information registered in the terminal device serving as the speech synthesizer 10 is used as the speaker attribute, but the attribute may instead be detected by the speech synthesizer 10. For example, the user's utterance may be subjected to volume analysis, frequency analysis, or the like and compared with patterns stored in advance for various combinations of gender and age, and the attribute of the pattern with the highest similarity may be detected as the speaker attribute.
If the speaker attribute cannot be detected, the determination results of steps Sb12 and Sb14 in FIG. 5 are "No".

<Agent attribute>
In the embodiment, the agent attribute is gender, but three or more types may be provided by combining gender, age, and so on.

＜相槌の連呼、相槌の出力タイミング等＞
ところで、人同士の対話を、発言者の性別という観点でみたとき、次のような傾向が見られる場合がある。例えば、女性であれば、対話において雰囲気や調和などを重視する傾向や、場を盛り上げるような傾向が見られる。具体的には、相槌を多用したり、相槌を連呼したり、発言から回答までの間を短くしたり、するなどの傾向が見られる。このため、利用者が女性であれば、発言に対する回答を音声合成で出力する音声合成装置１０に対しても、そのような傾向を期待するはずである。そこで、音声制御部１０９は、話者属性が女性であれば、その旨を回答作成部１１０に通知して、当該回答作成部１１０が、発言に対する相槌としての（３）の回答の作成頻度を高くしたり、同じ相槌の音声シーケンスを繰り返し出力したりしても良い。また、音声制御部１０９は、音声合成部１１２に対して、利用者による発言の終了から回答を出力開始するまでの時間を、相対的に早めるように制御しても良い。
一方、男性であれば、対話において内容や、論理性、個性などを重視する傾向が見られる場合がある。具体的には、必要以上に相槌を用いず、状況によっては敢えて無回答としたり（黙ったり）、発言から回答までの間を長くしたり、するなどの傾向が見られる。
そこで、音声制御部１０９は、話者属性が男性であれば、その旨を回答作成部１１０に通知して、当該回答作成部１１０が、発言に対する相槌の作成頻度を低くするとともに、所定の確率で無回答としても良い。また、音声制御部１０９は、音声合成部１１２に対して、利用者による発言の終了から回答を出力開始するまでの時間を、相対的に遅くするように制御しても良い。 <Repetition of backchannel responses, output timing of backchannel responses, etc.>
Incidentally, when dialogue between people is viewed in terms of the gender of the speaker, the following tendencies may be observed. For example, women may tend to place importance on atmosphere and harmony in a dialogue, and to liven up the conversation. Specifically, tendencies such as using backchannel responses frequently, repeating backchannel responses, and shortening the interval between an utterance and its answer may be observed. For this reason, if the user is a woman, she is expected to look for the same tendencies in the speech synthesizer 10, which outputs answers to utterances by speech synthesis. Therefore, if the speaker attribute is female, the voice control unit 109 may notify the answer creating unit 110 to that effect, and the answer creating unit 110 may increase the frequency of creating the answer of (3) as a backchannel response to the utterance, or may repeatedly output the voice sequence of the same backchannel response. In addition, the voice control unit 109 may control the voice synthesis unit 112 so that the time from the end of the user's utterance to the start of output of the answer is relatively shortened.
On the other hand, men may tend to place importance on content, logic, individuality, and the like in a dialogue. Specifically, tendencies such as not using backchannel responses more than necessary, deliberately giving no answer (staying silent) depending on the situation, and lengthening the interval between an utterance and its answer may be observed.
Therefore, if the speaker attribute is male, the voice control unit 109 may notify the answer creating unit 110 to that effect, and the answer creating unit 110 may lower the frequency of creating backchannel responses to the utterance and may give no answer with a predetermined probability. In addition, the voice control unit 109 may control the voice synthesis unit 112 so that the time from the end of the user's utterance to the start of output of the answer is relatively lengthened.
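
The attribute-dependent control of backchannel frequency, repetition, silence, and response timing described above could be sketched as follows. This is only an illustrative sketch: the names BackchannelPolicy, POLICIES, and plan_response are hypothetical, and every numeric value is an assumption chosen for illustration, since the description gives no concrete probabilities or delays.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BackchannelPolicy:
    backchannel_prob: float  # chance of replying with a backchannel response
    repeat_count: int        # times the same backchannel sequence is output
    silence_prob: float      # chance of deliberately giving no answer
    response_delay_s: float  # pause from end of utterance to start of answer


# All values below are illustrative assumptions, not taken from the patent.
POLICIES = {
    "female": BackchannelPolicy(0.6, 2, 0.0, 0.3),
    "male": BackchannelPolicy(0.2, 1, 0.1, 0.9),
}
DEFAULT_POLICY = BackchannelPolicy(0.4, 1, 0.0, 0.6)


def plan_response(speaker_attribute: str,
                  roll: Optional[float] = None) -> Tuple[str, int, float]:
    """Return (kind, repeat_count, delay_s) for one utterance.

    `roll` is a number in [0, 1); it is drawn at random when omitted.
    """
    policy = POLICIES.get(speaker_attribute, DEFAULT_POLICY)
    if roll is None:
        roll = random.random()
    if roll < policy.silence_prob:
        return ("silence", 0, policy.response_delay_s)
    if roll < policy.silence_prob + policy.backchannel_prob:
        return ("backchannel", policy.repeat_count, policy.response_delay_s)
    return ("answer", 1, policy.response_delay_s)
```

Passing the random draw as an argument keeps the decision deterministic for testing; in use, plan_response("female") would simply draw its own value.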

＜音程の関係＞
上述した実施形態では、デフォルトルールを、発言の語尾等に対して回答の語尾等の音高が５度下にする、という内容であったが、５度下以外の協和音程の関係に制御する構成であっても良い。例えば、上述したように完全８度、完全５度、完全４度、長・短３度、長・短６度であっても良い。
なお、協和音程の関係でなくても、経験的に良い（または悪い）印象を与える音程の関係の存在が認められる場合もあるので、当該音程の関係に回答の音高を制御する構成としても良い。ただし、この場合においても、発言の語尾等の音高と回答の語尾等の音高との２音間の音程が離れ過ぎると、発言に対する回答が不自然になりやすいので、発言の音高と回答の音高とが上下１オクターブの範囲内にあることが望ましい。 <Pitch relationship>
In the above-described embodiment, the default rule is that the pitch of the ending or the like of the answer is set a fifth below the pitch of the ending or the like of the utterance, but the pitch may instead be controlled to a consonant interval other than a fifth below. For example, as described above, the interval may be a perfect octave, a perfect fifth, a perfect fourth, a major or minor third, or a major or minor sixth.
Even for intervals that are not consonant, there may be interval relationships that are empirically known to give a good (or bad) impression, so the pitch of the answer may be controlled to such an interval relationship. Even in this case, however, if the interval between the pitch of the ending or the like of the utterance and the pitch of the ending or the like of the answer is too wide, the answer to the utterance tends to sound unnatural, so it is desirable that the pitch of the answer be within one octave above or below the pitch of the utterance.
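
As a rough illustration of the interval relationships above, the target pitch of the answer could be computed from the pitch of the utterance as in the following sketch. It assumes 12-tone equal temperament (the description does not specify a tuning system), and the names CONSONANT_SEMITONES, answer_pitch_hz, and within_one_octave are hypothetical.

```python
import math

# Sizes, in semitones, of the consonant intervals listed in the text.
# Equal temperament is an assumption of this sketch.
CONSONANT_SEMITONES = {
    "minor third": 3,
    "major third": 4,
    "perfect fourth": 5,
    "perfect fifth": 7,
    "minor sixth": 8,
    "major sixth": 9,
    "perfect octave": 12,
}


def answer_pitch_hz(utterance_hz: float,
                    interval: str = "perfect fifth",
                    below: bool = True) -> float:
    """Pitch of the answer ending, the given interval from the utterance ending."""
    semitones = CONSONANT_SEMITONES[interval]
    if below:
        semitones = -semitones
    return utterance_hz * 2.0 ** (semitones / 12.0)


def within_one_octave(utterance_hz: float, answer_hz: float) -> bool:
    """True if the two pitches are no more than one octave apart."""
    return abs(math.log2(answer_hz / utterance_hz)) <= 1.0 + 1e-9
```

With the default arguments, an utterance ending at 440 Hz gives an answer ending a fifth below, at roughly 293.7 Hz.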

また、図５におけるステップＳｂ１３において、回答の語尾の音高を、デフォルトルールで定められていた音高よりも下げるときの条件として、話者属性が女性であることに対して、さらに、発言の語尾等の音高が第１閾値音高（周波数）以上であることを加重しても良い（ステップＳｂ１３における※）。女性による発言の音高が高い場合に、音声合成による回答が不自然に高くなってしまうのを回避するためである。
同様に、上記ステップＳｂ１５において、回答の語尾の音高を、デフォルトルールで定められていた音高よりも上げるときの条件として、話者属性が男性であることに対して、さらに、発言の語尾等の音高が第２閾値音高以下であることを加重しても良い（ステップＳｂ１５における※）。男性による発言の音高が低い場合に、音声合成による回答が不自然に低くなってしまうのを回避するためである。 Further, in step Sb13 in FIG. 5, as a condition for lowering the pitch of the ending of the answer below the pitch defined by the default rule, the condition that the pitch of the ending or the like of the utterance is equal to or higher than a first threshold pitch (frequency) may be added to the condition that the speaker attribute is female (* in step Sb13). This is to prevent the answer produced by speech synthesis from becoming unnaturally high when the pitch of an utterance by a woman is high.
Similarly, in step Sb15, as a condition for raising the pitch of the ending of the answer above the pitch defined by the default rule, the condition that the pitch of the ending or the like of the utterance is equal to or lower than a second threshold pitch may be added to the condition that the speaker attribute is male (* in step Sb15). This is to prevent the answer produced by speech synthesis from becoming unnaturally low when the pitch of an utterance by a man is low.
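
The additional threshold conditions of steps Sb13 and Sb15 could be expressed as in the following sketch. The function name ending_pitch_shift and the default threshold values are hypothetical; the description does not give concrete values for the first and second threshold pitches.

```python
def ending_pitch_shift(speaker_attribute: str,
                       utterance_end_hz: float,
                       first_threshold_hz: float = 300.0,
                       second_threshold_hz: float = 100.0) -> int:
    """Direction in which to modify the default-rule pitch of the answer ending.

    Returns -1 to lower it (step Sb13), +1 to raise it (step Sb15),
    and 0 to keep the pitch given by the default rule.
    The default thresholds are illustrative assumptions only.
    """
    # Sb13: female speaker AND utterance ending at or above the first threshold.
    if speaker_attribute == "female" and utterance_end_hz >= first_threshold_hz:
        return -1
    # Sb15: male speaker AND utterance ending at or below the second threshold.
    if speaker_attribute == "male" and utterance_end_hz <= second_threshold_hz:
        return 1
    return 0
```

Under these assumed thresholds, a female speaker whose utterance ends at 250 Hz leaves the default rule unchanged, because only one of the two conditions is met.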

＜その他＞
実施形態にあっては、発言に対する回答を取得する構成である言語解析部１０８、言語データベース１２２および回答データベース１２４を音声合成装置１０の側に設けたが、端末装置などでは、処理の負荷が重くなる点や、記憶容量に制限がある点などを考慮して、外部サーバの側に設ける構成としても良い。すなわち、音声合成装置１０において回答作成部１１０は、発言に対する回答をなんらかの形で取得するとともに、当該回答の音声を規定するデータを出力する構成であれば足り、その回答を、音声合成装置１０の側で作成するのか、音声合成装置１０以外の他の構成（例えば外部サーバ）の側で作成するのか、については問われない。
なお、音声合成装置１０において、発言に対する回答について、外部サーバ等にアクセスしないで作成可能な用途であれば、情報取得部１２６は不要である。 <Others>
In the embodiment, the language analysis unit 108, the language database 122, and the answer database 124, which constitute the configuration for acquiring an answer to an utterance, are provided on the speech synthesizer 10 side. However, considering that the processing load becomes heavy and the storage capacity is limited in a terminal device or the like, they may instead be provided on an external server side. That is, in the speech synthesizer 10, it suffices that the answer creating unit 110 acquires an answer to the utterance in some form and outputs data defining the voice of the answer; it does not matter whether the answer is created on the speech synthesizer 10 side or by a configuration other than the speech synthesizer 10 (for example, an external server).
Note that the information acquisition unit 126 is unnecessary in applications where the speech synthesizer 10 can create answers to utterances without accessing an external server or the like.
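
The point that the answer creating unit only needs to obtain an answer in some form, regardless of where it is created, could be modeled as in the following sketch. All names here (AnswerSource, LocalAnswerSource, ExternalAnswerSource, create_answer) are hypothetical, and the external server is represented by an injected callable rather than by any real network API.

```python
from typing import Callable, Dict, Protocol


class AnswerSource(Protocol):
    """Anything that can supply an answer to an utterance."""

    def get_answer(self, utterance: str) -> str: ...


class LocalAnswerSource:
    """Answers created on the speech synthesizer side from a local table."""

    def __init__(self, answers: Dict[str, str]) -> None:
        self._answers = answers

    def get_answer(self, utterance: str) -> str:
        # An empty string stands for "no answer found" in this sketch.
        return self._answers.get(utterance, "")


class ExternalAnswerSource:
    """Answers created elsewhere, e.g. by an external server."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # stand-in for a request to the server

    def get_answer(self, utterance: str) -> str:
        return self._fetch(utterance)


def create_answer(source: AnswerSource, utterance: str) -> str:
    """The caller does not care where the answer was created."""
    return source.get_answer(utterance)
```

Because create_answer depends only on the get_answer method, a local table and a server-backed source can be swapped without changing the rest of the pipeline.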

１０２…音声入力部、１０４…発話区間検出部、１０６…音高解析部、１０８…言語解析部、１０９…音声制御部、１１０…回答作成部、１１２…音声合成部、１２６…情報取得部。
DESCRIPTION OF SYMBOLS: 102 ... voice input unit, 104 ... utterance section detection unit, 106 ... pitch analysis unit, 108 ... language analysis unit, 109 ... voice control unit, 110 ... answer creating unit, 112 ... voice synthesis unit, 126 ... information acquisition unit.

Claims

A speech synthesizer comprising:
a voice input unit that inputs an utterance by a speaker;
a pitch analysis unit that analyzes the pitch of a specific first section of the utterance;
an acquisition unit that acquires an answer to the utterance;
a voice synthesis unit that synthesizes speech of the acquired answer with a predetermined agent attribute; and
a voice control unit that controls the voice synthesis unit according to a rule for changing the pitch of a specific second section of the answer to a pitch having a predetermined relationship with the pitch of the first section, and that modifies the rule according to at least one of a speaker attribute of the speaker and the agent attribute.

A program for causing a computer to function as:
an acquisition unit that acquires an answer to an utterance by a speaker;
a pitch analysis unit that analyzes the pitch of a specific first section of the utterance;
a voice synthesis unit that synthesizes speech of the acquired answer with a predetermined agent attribute; and
a voice control unit that controls the voice synthesis unit according to a rule for changing the pitch of a specific second section of the answer to a pitch having a predetermined relationship with the pitch of the first section, and that modifies the rule according to at least one of a speaker attribute of the speaker and the agent attribute.