JP6446993B2

JP6446993B2 - Voice control device and program

Info

Publication number: JP6446993B2
Application number: JP2014213852A
Authority: JP
Inventors: 嘉山 啓; 松原 弘明
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-10-20
Filing date: 2014-10-20
Publication date: 2019-01-09
Anticipated expiration: 2034-10-20
Also published as: EP3211637A4; US20190139535A1; CN107077840B; US20170221470A1; JP2016080944A; EP3211637A1; US10789937B2; CN107077840A; EP3211637B1; WO2016063879A1; US10217452B2

Description

The present invention relates to a voice control device and a program.

In recent years, the following speech synthesis techniques have been proposed: a technique that produces more human-like speech by synthesizing and outputting a voice matched to the user's speaking tone and voice quality (see, for example, Patent Document 1), and a technique that analyzes the user's voice to diagnose the user's psychological state, health condition, and the like (see, for example, Patent Document 2). A voice dialogue system has also been proposed that recognizes voice input by a user and outputs content specified in a scenario by speech synthesis, thereby realizing a voice dialogue with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A
Patent Document 2: Japanese Patent No. 4495907
Patent Document 3: Japanese Patent No. 4832097

Consider a dialogue system that combines the speech synthesis techniques described above with a voice dialogue system, so that in response to a question spoken by a user, data is retrieved and output by speech synthesis. In this case, it has been pointed out that the voice output by speech synthesis can sound unnatural to the user, specifically, as if a machine were speaking.
The present invention has been made in view of such circumstances, and one of its objects is to provide a voice control device and a program capable of giving the user a natural impression.

In considering a man-machine system that outputs an answer to a user's question by speech synthesis, we first examine what kind of dialogue takes place between people, focusing on information other than linguistic information, in particular the pitch (frequency) that characterizes the dialogue.

As a dialogue between people, consider the case where one person (a) asks a question and the other person (b) responds. In this case, when a asks the question, not only a but also b, who is about to answer, often retains a strong impression of the pitch in a specific section of the question. When b answers with the intent of agreement, approval, affirmation, or the like, b speaks so that the pitch of the part characterizing the answer has a specific relationship, specifically a consonant interval relationship, to the pitch of the question that left the impression. Because the pitch that a remembers from a's own question and the pitch of the part characterizing the answer are in this relationship, a, on hearing the answer, is thought to form a favorable impression of b's answer, one of comfort and reassurance.

Thus, in a dialogue between people, the pitch of the question and the pitch of the answer are not unrelated but have the relationship described above. Based on this observation, in studying a dialogue system that outputs (returns) an answer to a user's question by speech synthesis, the following configuration was adopted for the speech synthesis in order to achieve the above object.

That is, in order to achieve the above object, a speech synthesizer according to one aspect of the present invention comprises: a first pitch acquisition unit that acquires the pitch of a specific section of a question given by an input voice signal; an answer acquisition unit that acquires voice data of an answer to the question; a second pitch acquisition unit that acquires a pitch based on the acquired voice data of the answer; a pitch shift amount determination unit that determines a pitch shift amount up to a target pitch that lies within a predetermined pitch range relative to the pitch based on the voice data of the answer and that maintains a specific relationship to the pitch of the specific section; and an answer synthesis unit that synthesizes the answer by shifting the pitch based on the voice data of the answer by the pitch shift amount.

According to this aspect, an answer to a question uttered by the user can be synthesized (reproduced) without sounding unnatural and without degrading the perceived audio quality.
Note that the answer is not limited to a specific reply to the question and also includes back-channel responses (interjections). In addition to human voices, answers include animal cries such as "bowwow" and "meow". That is, the terms "answer" and "voice" here cover not only voices uttered by people but also animal cries.
The pitch of the specific section of a question refers to the pitch of the part that leaves a strong impression. Specifically, it is preferably the highest pitch in a section where the volume is equal to or higher than a predetermined value, or the pitch of the end section of the question.
The pitch based on the voice data is, for example, the pitch of a characteristic part when the voice data is reproduced in the standard manner. The characteristic part may be the pitch at the beginning of the word, the pitch at the part with the highest volume, the average pitch, or the like.
Here, the specific relationship is preferably a consonant interval relationship. Consonance refers to a relationship in which multiple musical tones, when sounded simultaneously, blend and harmonize well with each other, and such interval relationships are called consonant intervals. The simpler the frequency ratio between the two tones, the higher the degree of consonance.
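The frequency-ratio view of consonant intervals can be illustrated with a short sketch. The ratios below are the usual just-intonation values from music theory rather than values stated in this description, and the names CONSONANT_RATIOS and pitch_below are hypothetical.

```python
from fractions import Fraction

# Just-intonation frequency ratios (upper tone : lower tone) for a few
# consonant intervals. These are textbook values, assumed for illustration.
CONSONANT_RATIOS = {
    "unison": Fraction(1, 1),
    "octave": Fraction(2, 1),
    "perfect_fifth": Fraction(3, 2),
    "perfect_fourth": Fraction(4, 3),
}

def pitch_below(question_hz: float, interval: str) -> float:
    """Return the frequency lying the given interval below question_hz."""
    return question_hz / float(CONSONANT_RATIOS[interval])

# A question pitch of 330 Hz and an answer a perfect fifth below it.
fifth_below = pitch_below(330.0, "perfect_fifth")
```

Under these assumptions, an answer a fifth below a 330 Hz question pitch sits at 220 Hz, a 3:2 ratio.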

In the above aspect, the pitch shift amount determination unit may be configured to change the pitch shift amount in units of octaves so that the shifted pitch falls within the pitch range relative to the pitch based on the voice data of the answer. When voice data is shifted by the pitch shift amount, a large shift amount degrades the sound quality, but this aspect can prevent such degradation.

The first pitch acquisition unit is preferably configured to acquire, as the pitch of the specific section, the highest pitch in a section where the volume of the input voice signal is equal to or higher than a predetermined value. The determination of whether the volume is equal to or higher than the predetermined value may be given a hysteresis characteristic, and the condition that the pitch is detectable may be added.
The aspect of the present invention can be conceived not only as a speech synthesizer but also as a program that causes a computer to function as the speech synthesizer.
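The hysteresis characteristic mentioned for the volume determination could look like the following minimal sketch. It assumes per-frame volume values and two thresholds; the function name, the parameter names, and the numeric values are illustrative and not taken from the source.

```python
def active_frames(volumes, on_threshold, off_threshold):
    """Mark frames as active using two thresholds (hysteresis).

    A section starts when the volume reaches on_threshold and ends only
    when the volume falls below the lower off_threshold, so brief dips
    between the two thresholds do not split the section.
    """
    active = False
    flags = []
    for v in volumes:
        if not active and v >= on_threshold:
            active = True
        elif active and v < off_threshold:
            active = False
        flags.append(active)
    return flags

flags = active_frames([0.1, 0.6, 0.4, 0.35, 0.2, 0.1],
                      on_threshold=0.5, off_threshold=0.3)
```

With a single threshold of 0.5 the section would already end at the third frame; with the lower release threshold it continues until the volume drops below 0.3.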

A block diagram showing the configuration of the speech synthesizer according to the embodiment.
A flowchart showing the operation of the speech synthesizer.
A flowchart showing the operation of the speech synthesizer.
A diagram showing an example of the pitches of a question by a user and an answer by the speech synthesizer.
A diagram for explaining the premise of the application examples.
A diagram showing the main part of the processing in application example 1.
A diagram showing the main part of the processing in application example 2.
A diagram showing the main part of the processing in application example 3.
A diagram showing an outline of the operation of application example 4.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a speech synthesizer 10 according to an embodiment of the present invention.
The speech synthesizer 10 is, for example, a device incorporated in a stuffed toy that, when a user asks the stuffed toy a question, synthesizes and outputs an answer such as a back-channel response. The speech synthesizer 10 has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 142. When the CPU executes an application program installed in advance, the following functional blocks are constructed: a voice feature amount acquisition unit 106, an answer selection unit 110, an answer pitch acquisition unit 112, a pitch shift amount determination unit 114, and an answer synthesis unit 116.

Although not specifically illustrated, the speech synthesizer 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device, input various operations to the device, and make various settings. The speech synthesizer 10 is not limited to a toy such as a stuffed toy and may be a so-called pet robot, a terminal device such as a mobile phone, a tablet personal computer, or the like.

Although the details are omitted, the voice input unit 102 consists of a microphone that converts sound into an electrical signal and an A/D converter that converts the resulting voice signal into a digital signal.

The voice feature amount acquisition unit 106 (first pitch acquisition unit) analyzes the voice signal converted into a digital signal, divides it into utterance sections and non-utterance sections, detects the pitch of a specific section within the voiced section of an utterance section, and supplies data indicating that pitch to the answer selection unit 110 and the pitch shift amount determination unit 114. Here, an utterance section is, for example, a section in which the volume of the voice signal is equal to or higher than a threshold, and a non-utterance section is a section in which the volume is below the threshold. A voiced section is a section within an utterance section in which the pitch of the voice signal can be detected. That the pitch can be detected means that the voice signal has a periodic part and that part can be detected.
Here, the specific section is the end section of the voiced section, and the pitch is the highest value in that end section. The end section is the section of a predetermined length (for example, 180 msec) immediately preceding the end of the voiced section. As described later, for the voiced section, the volume of the voice signal may be evaluated against two (or three or more) thresholds.
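The extraction of the highest pitch in the end section can be sketched as follows. The sketch assumes that a separate pitch detector has already produced one pitch estimate per fixed-length frame, with None marking frames where no pitch was detected. The 10 msec frame length and the name end_section_pitch are assumptions, and gaps inside the window are simply skipped.

```python
from typing import Optional, Sequence

def end_section_pitch(
    pitches_hz: Sequence[Optional[float]],
    frame_ms: float = 10.0,
    end_section_ms: float = 180.0,
) -> Optional[float]:
    """Highest pitch in the final end_section_ms of the voiced frames.

    pitches_hz holds one pitch estimate per frame of the utterance section.
    None marks a frame where no pitch could be detected (unvoiced).
    """
    voiced_idx = [i for i, p in enumerate(pitches_hz) if p is not None]
    if not voiced_idx:
        return None  # no voiced section at all
    last = voiced_idx[-1]  # last frame of the voiced section
    n_frames = max(1, int(end_section_ms // frame_ms))
    start = max(0, last - n_frames + 1)
    window = [p for p in pitches_hz[start:last + 1] if p is not None]
    return max(window)

# 300 msec at 200 Hz, a short rise at the end, then unvoiced frames.
contour = [200.0] * 30 + [210.0, 250.0, 230.0] + [None] * 5
p1 = end_section_pitch(contour)
```

A peak that occurs earlier than the end section is ignored, which matches the choice of the end section as the part of the question that leaves the impression.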

The answer library 124 stores in advance a plurality of voice data of answers to questions by the user. The voice data are recordings of the voice of a model person, for example replies and back-channel responses to questions such as "hai" (yes), "iie" (no), "sou" (that's right), "un" (uh-huh), "fūn" (hmm), and "naruhodo" (I see). The voice data of an answer is in a format such as wav or mp3. The pitch of each waveform sample (or each waveform period) when reproduced in the standard manner, and the average pitch obtained by averaging those pitches, are determined in advance, and data indicating the average pitch (the pitch based on the answer) is stored in the answer library 124 in association with the voice data. Here, standard reproduction means reproducing the voice data under the same condition (sampling frequency) as when it was recorded.

When data indicating the pitch of the specific section is output from the voice feature amount acquisition unit 106, the answer selection unit 110 (answer acquisition unit) selects one item of answer voice data for that voice from the answer library 124, and reads out and outputs the selected voice data together with the data indicating the associated average pitch.
The rule by which the answer selection unit 110 selects one item from the plurality of voice data may be, for example, random selection, or selection of the voice data whose average pitch is closest to the pitch of the specific section of the question.
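The two selection rules mentioned here, random selection and nearest average pitch, might be sketched as below. The Answer record, its fields, and the file names are hypothetical, and closeness is measured in Hz purely for simplicity; the source does not specify the scale.

```python
import random
from dataclasses import dataclass

@dataclass
class Answer:
    name: str                # hypothetical label such as a file name
    average_pitch_hz: float  # average pitch stored with the voice data

def select_answer(library, question_pitch_hz=None, rng=random):
    """Pick one answer from the library.

    With a question pitch, choose the answer whose average pitch is
    closest to it; without one, choose at random.
    """
    if question_pitch_hz is None:
        return rng.choice(library)
    return min(library,
               key=lambda a: abs(a.average_pitch_hz - question_pitch_hz))

library = [Answer("hai.wav", 220.0), Answer("un.wav", 180.0)]
nearest = select_answer(library, question_pitch_hz=190.0)
```

Choosing the nearest average pitch tends to reduce the shift amount needed later, which is one plausible reason for preferring it over random selection.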

In this embodiment, the semantic content of the user's question is not taken into account in selecting the answer, but this is sufficient if the speech synthesizer 10 is regarded as a device that returns back-channel responses and the like to questions uttered by the user.
Alternatively, as indicated by the broken line in the figure, a language analysis unit 108 may be provided so that the language analysis unit 108 analyzes the semantic content of the question defined by the voice signal and the answer selection unit 110 creates an answer to the question via a database or the like.

The answer pitch acquisition unit 112 (second pitch acquisition unit) extracts the data indicating the average pitch of the answer from the data read out by the answer selection unit 110 and supplies it to the pitch shift amount determination unit 114.
The pitch shift amount determination unit 114 determines, as described later, the amount by which the pitch is shifted when the answer voice data is reproduced, from the difference between the pitch of the specific section in the voice signal output from the voice feature amount acquisition unit 106 and the average pitch of the answer output from the answer pitch acquisition unit 112.
The answer synthesis unit 116 reproduces (synthesizes) the answer voice data read out from the answer library 124, shifted by the pitch shift amount determined by the pitch shift amount determination unit 114. The pitch-shifted voice signal is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 142.
The data associated with the pitch of the answer, that is, the data stored in the answer library 124 and used by the pitch shift amount determination unit 114 to determine the pitch shift amount, need not be data indicating the average pitch. For example, it may be an intermediate value of the pitch or the average pitch of a predetermined section of the voice data.

Next, the operation of the speech synthesizer 10 will be described.
FIG. 2 is a flowchart showing the processing operation of the speech synthesizer 10.
The processing shown in this flowchart starts when the user speaks a question to the stuffed toy to which the speech synthesizer 10 is applied. Here, for convenience, the case where the pitch of the answer voice data is higher than the pitch of the user's voice (question) is described as an example.

First, in step Sa11, the voice signal converted by the voice input unit 102 is supplied to the voice feature amount acquisition unit 106.
Next, in step Sa12, the voice feature amount acquisition unit 106 performs analysis processing on the voice signal from the voice input unit 102, that is, processing to detect the pitch of the question uttered by the user.
In step Sa13, it is determined whether the answer synthesis unit 116 is currently reproducing an answer.

If no answer is being reproduced (the determination result in step Sa13 is "No"), the voice feature amount acquisition unit 106 determines whether the question (utterance) in the voice signal from the voice input unit 102 has ended (step Sa14). Specifically, whether the question has ended is determined, for example, by whether the volume of the voice signal has remained below a predetermined threshold for a predetermined time.
If the question has not ended (the determination result in step Sa14 is "No"), the processing returns to step Sa11, and the voice feature amount acquisition unit 106 continues the analysis processing of the voice signal from the voice input unit 102.
If the question has ended (the determination result in step Sa14 is "Yes"), the pitch shift amount determination unit 114 determines, as described later, the pitch shift amount to be used when reproducing the answer voice data selected by the answer selection unit 110 (step Sa15).
The pitch shift amount determination unit 114 then notifies the answer synthesis unit 116 of the determined pitch shift amount and instructs it to reproduce the answer voice data selected by the answer selection unit 110 (step Sa16). In accordance with this instruction, the answer synthesis unit 116 reproduces the voice data shifted by the pitch shift amount determined by the pitch shift amount determination unit 114 (step Sa17).
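The end-of-question check in step Sa14 (volume below a threshold for a predetermined time) can be sketched as follows. The per-frame volume list, the frame-count form of the predetermined time, and the name question_ended are assumptions made for illustration.

```python
def question_ended(volumes, threshold, hold_frames):
    """Return True if the last hold_frames volume values are all below threshold.

    volumes is the per-frame volume history so far, and hold_frames is the
    predetermined time expressed as a positive number of frames.
    """
    if hold_frames < 1:
        raise ValueError("hold_frames must be a positive integer")
    if len(volumes) < hold_frames:
        return False  # not enough history yet
    return all(v < threshold for v in volumes[-hold_frames:])

ended = question_ended([0.5, 0.1, 0.1, 0.1], threshold=0.2, hold_frames=3)
```

In a loop corresponding to FIG. 2, a False result would send the processing back to step Sa11 to keep analyzing the incoming signal.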

The case in which it is determined in step Sa13 that the answer synthesis unit 116 is reproducing an answer (the determination result in step Sa13 is "Yes") arises, for example, when the user utters the next question while an answer to an earlier question is still being reproduced. In this case, the processing does not take the path back through steps Sa14 and Sa11 but proceeds to step Sa17, so reproduction of the answer takes priority.

FIG. 3 is a flowchart showing the details of the processing of step Sa15 in FIG. 2, that is, the processing that determines the pitch shift amount of the answer voice data.
The preconditions for executing this processing are that the answer synthesis unit 116 is not reproducing an answer (the determination result in step Sa13 is "No") and that the user has finished inputting the question (the determination result in step Sa14 is "Yes").
First, in step Sb11, the pitch shift amount determination unit 114 acquires, from the voice feature amount acquisition unit 106, data indicating the pitch of the specific section of the question.

Meanwhile, the answer selection unit 110 selects voice data of an answer to the user's question from the answer library 124 and reads out the selected voice data and the data indicating the average pitch associated with it. The answer pitch acquisition unit 112 supplies the data indicating the average pitch, out of the data read out, to the pitch shift amount determination unit 114. The pitch shift amount determination unit 114 thereby acquires data indicating the average pitch of the answer selected by the answer selection unit 110 (step Sb12).

Next, the pitch shift amount determination unit 114 provisionally determines, as the pitch at which to answer with the voice data, a pitch that has a predetermined relationship (for example, a fifth below) to the pitch of the specific section of the question (step Sb13).

Subsequently, the pitch shift amount determination unit 114 calculates the pitch shift amount from the average pitch of the answer selected by the answer selection unit 110 to the provisionally determined pitch (including a pitch changed in step Sb16 or Sb18, described later, as well as the pitch from step Sb13) (step Sb14). The pitch shift amount determination unit 114 then determines whether the pitch obtained by shifting the average pitch of the answer by the pitch shift amount (the shifted pitch) is lower than the lower threshold (step Sb15). Here, the lower threshold indicates how far below the average pitch of the answer a pitch is allowed to be, and is described in detail later.

If the shifted pitch is lower than the lower threshold (the determination result in step Sb15 is "Yes"), the pitch shift amount determination unit 114 raises the provisionally determined answer pitch by one octave and provisionally determines the raised pitch as the pitch at which to answer with the voice data (step Sb16). The processing then returns to step Sb14, the pitch shift amount is calculated again, and the determinations of steps Sb15 and Sb17 are executed.

On the other hand, if the shifted pitch is not lower than the lower threshold (the determination result in step Sb15 is "No"), the pitch shift amount determination unit 114 determines whether the shifted pitch is higher than the upper threshold (step Sb17). Here, the upper threshold indicates how far above the average pitch of the answer a pitch is allowed to be, and is described in detail later.
If the shifted pitch is higher than the upper threshold (the determination result in step Sb17 is "Yes"), the pitch shift amount determination unit 114 lowers the provisionally determined answer pitch by one octave and provisionally determines the lowered pitch as the pitch at which to answer with the voice data (step Sb18). The processing then returns to step Sb14, the pitch shift amount is calculated again, and the determinations of steps Sb15 and Sb17 are executed.

If the shifted pitch is not higher than the upper threshold (the determination result in step Sb17 is "No"), the shifted pitch lies within the predetermined pitch range, that is, at or above the lower threshold and at or below the upper threshold. The pitch shift amount determination unit 114 therefore advances the processing to step Sb19, finalizes the pitch that is currently provisional, and notifies the answer synthesis unit 116 of the corresponding pitch shift amount.
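The loop of steps Sb13 to Sb19 can be summarized in a minimal sketch. It works in semitones relative to the answer's average pitch, takes a fifth below as 7 equal-tempered semitones, and uses a range of 6 semitones on either side of the average pitch. All of these numeric choices, the function name decide_pitch_shift, and the bounded iteration count are assumptions for illustration, since the description does not give threshold values.

```python
import math

def decide_pitch_shift(question_hz, answer_avg_hz,
                       interval=-7.0, t_l=6.0, t_h=6.0):
    """Return (shift, target_hz); shift is in semitones from answer_avg_hz.

    interval is the target relationship to the question pitch in semitones
    (-7 approximates a fifth below). t_l and t_h are the allowed downward
    and upward differences from the answer's average pitch.
    """
    if t_l + t_h < 12.0:
        raise ValueError("pitch range must span at least one octave")
    # Sb13: provisional target, expressed relative to the answer's average pitch.
    target = 12.0 * math.log2(question_hz / answer_avg_hz) + interval
    for _ in range(32):         # bounded loop rather than an open-ended one
        shift = target          # Sb14: shift from the average pitch to the target
        if shift < -t_l:        # Sb15: below the lower threshold?
            target += 12.0      # Sb16: raise the target by one octave
        elif shift > t_h:       # Sb17: above the upper threshold?
            target -= 12.0      # Sb18: lower the target by one octave
        else:                   # Sb19: within range, so finalize
            return shift, answer_avg_hz * 2.0 ** (shift / 12.0)
    raise RuntimeError("no target pitch found within the pitch range")

# A low question pitch (110 Hz) and a higher-pitched answer recording (440 Hz).
shift, target_hz = decide_pitch_shift(110.0, 440.0)
```

In this example the provisional target starts 31 semitones below the answer's average pitch and is raised by three octaves, ending 5 semitones above it. Because only whole octaves are added, the final target keeps the same interval class relative to the question pitch.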

FIG. 4 is a diagram illustrating the relationship between a question input by voice by the user and the answer synthesized by the speech synthesizer 10, with pitch on the vertical axis and time on the horizontal axis.
In this figure, the solid line denoted by T1 shows, simplified as a straight line, the pitch change of the user's question. P1 denotes the pitch of the specific section of the question T1.
Also in the figure, the solid line denoted by A1 shows, in simplified form, the pitch change when the answer voice data selected for the question T1 is reproduced in the standard manner, and P2 denotes its average pitch.

If the answer A1 is reproduced in response to the question T1 without shifting its pitch, it tends to sound mechanical. In this embodiment, therefore, the first step is to attempt to reproduce the answer as answer A1-1, obtained by shifting the answer A1 so that its pitch becomes pitch P2-1, which has a consonant interval relationship, for example a fifth below, to the pitch P1 of the specific section (the ending) that is the characteristic and impressive part of the question T1. D1 denotes the pitch difference between pitch P1 and pitch P2-1.
However, if the pitch shift amount D2 from the answer A1 to the answer A1-1 is too large, the perceived quality degrades when the pitch-shifted answer A1-1 is reproduced. In particular, when the pitch of the specific section of the question and the average pitch of the answer are far apart (for example, when the user asking the question is male and the model for the answer is female), reproduction with the pitch shifted downward tends to sound unnatural and to degrade markedly.

In the present embodiment, therefore, the second step is to shift the pitch P2-1 of the answer A1-1 stepwise, in units of one octave, until it falls within a predetermined pitch range relative to the average pitch P2 of the original answer A1, while the pitch of the answer synthesized by the answer synthesis unit 116 keeps its specific relationship to the pitch P1. In the example of FIG. 4, the answer A1-4 is obtained by raising the answer A1-1 by three octaves, via the answers A1-2 and A1-3, until it falls within the pitch range defined with reference to the pitch P2 of the answer A1.
In FIG. 4, within the pitch range set with reference to the average pitch P2 of the answer A1, the pitch difference from the average pitch P2 down to the lower threshold Pth_L is denoted by the symbol T_L, and the pitch difference up to the upper threshold Pth_H is denoted by the symbol T_H. That is, the lower threshold Pth_L is a relative value defined by the pitch difference T_L with reference to the average pitch P2 of the answer A1, and the upper threshold Pth_H is likewise a relative value defined by the pitch difference T_H with reference to the average pitch P2. Because the answer library 124 stores voice data for a plurality of answers, the lower threshold Pth_L and the upper threshold Pth_H that define the pitch range differ from answer to answer. Defining them relative to the average pitch P2 by pitch differences, however, removes the need to store Pth_L and Pth_H in advance in association with each item of answer voice data.
Note that the pitch P2-1 is in a consonant-interval relationship with the pitch P1 of the question T1, and the pitch P2-4 is three octaves above the pitch P2-1. The frequency of the pitch P2-4 and the frequency of the pitch P2-1 therefore remain in an integer ratio, so the consonant-interval relationship is also substantially maintained between the pitch P1 and the pitch P2-4.
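The two-step determination described above (first a consonant interval relative to P1, then octave steps until the result lies between Pth_L and Pth_H) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the use of cents for T_L and T_H, the default thresholds of 700 cents, and the frequency ratio 2/3 standing in for "a fifth below" are all hypothetical choices, not values given in the description.

```python
import math


def determine_pitch_shift(p1_hz, p2_hz, interval_ratio=2 / 3,
                          t_low_cents=700.0, t_high_cents=700.0,
                          max_octave_steps=10):
    """Return (target_hz, shift_cents) for an answer whose average pitch is p2_hz.

    p1_hz: pitch P1 of the specific section (ending) of the question.
    p2_hz: average pitch P2 of the unshifted answer.
    interval_ratio: assumed frequency ratio for the consonant interval
        (2/3 is used here as a fifth below).
    t_low_cents / t_high_cents: assumed T_L and T_H, expressed in cents.
    """
    # Step 1: candidate pitch P2-1, a consonant interval relative to P1.
    target = p1_hz * interval_ratio

    # Thresholds Pth_L and Pth_H are defined relative to P2, not stored per answer.
    lower = p2_hz * 2.0 ** (-t_low_cents / 1200.0)
    upper = p2_hz * 2.0 ** (t_high_cents / 1200.0)

    # Step 2: move the candidate one octave at a time toward the range.
    # Octave steps keep the frequency ratio to P1 a simple one.
    for _ in range(max_octave_steps):
        if target < lower:
            target *= 2.0
        elif target > upper:
            target /= 2.0
        else:
            break

    if not (lower <= target <= upper):
        # Can happen if the range is narrower than one octave.
        raise ValueError("no octave of the target pitch lies within the pitch range")

    # Shift to apply to the answer, measured from its average pitch P2.
    shift_cents = 1200.0 * math.log2(target / p2_hz)
    return target, shift_cents
```

With the defaults, the range spans 1400 cents, which is wider than one octave, so some octave of the candidate always falls inside it. A narrower range would need a different fallback than the error raised here.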

In some cases the answer A1-2 could be adopted and reproduced as it is. However, a shift of only one octave from the answer A1-1 may still leave a large shift from the original answer A1, so that the result remains unnatural or the degradation in audible quality cannot be overlooked. For this reason, the answer is shifted until it falls within the predetermined pitch range.

According to the present embodiment, an answer to a question uttered by the user can be synthesized (reproduced) so that it sounds neither mechanical nor unnatural in tone, while degradation of audible quality is prevented. In addition, there is no need to attach attribute information indicating whether the answer voice is female or male to the answer voice data and to determine the pitch shift amount according to that attribute information.

In the present embodiment, answers that carry emotion, such as an angry answer or a half-hearted answer, can be synthesized by synthesizing the answer as exemplified below.
FIG. 5 is a diagram for explaining the terms used below. In the figure, the symbol Av denotes the pitch change width of the answer A1, the symbol d denotes the time from the end of the question T1 until playback of the answer A1 starts, and the symbol Ad denotes the playback time of the answer A1. The symbol Tg denotes the temporal change in volume of the question T1, and the symbol Ag denotes the temporal change in volume of the answer A1.

For example, in application example 1 shown in FIG. 6, the playback speed of the answer A1 is increased so that it is reproduced as the answer A11, the time d11 from the end of the question T1 until playback of the answer A11 starts is made shorter than the time d, and the volume Ag11 of the answer A11 is made larger than the volume Ag. An answer expressing anger can thereby be output. Because the playback speed of the answer A11 is increased, the playback time Ad11 of the answer A11 is shorter than the playback time Ad of the answer A1.

In application example 2 shown in FIG. 7, the playback speed of the answer A1 is reduced so that it is reproduced as the answer A12, the time d12 from the end of the question T1 until playback of the answer A12 starts is made longer than the time d, and the volume Ag12 of the answer A12 is made smaller than the volume Ag. A so-called half-hearted answer can thereby be output. Because the playback speed of the answer A12 is reduced, the playback time Ad12 of the answer A12 is longer than the playback time Ad of the answer A1.

In addition, in application example 3 shown in FIG. 8, the answer A1 is reproduced as the answer A13 so that its pitch rises toward the end, that is, so that the pitch of the answer A13 rises by the pitch change width Av13. An answer that sounds like a question can thereby be output.

When answers with emotion are synthesized in this way, the pitch change width of the answer relative to the question T1 (including its direction, up or down), the time from the end of the question T1 until playback of the answer starts, the playback volume of the answer, the playback speed of the answer, and the like may be made settable by the user or others via the operation input unit or the like.
The user may also be allowed to select the type of answer: an angry answer, a half-hearted answer, or a questioning answer.
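As a rough illustration of how such selectable answer types could be represented, the sketch below maps each type to multipliers on the quantities of FIG. 5 (the time d, the playback time Ad, the volume Ag, and the pitch change width Av). The class, the dictionary, and every numeric value are hypothetical; the description only states the direction of each change, not its size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerStyle:
    speed: float                 # playback speed multiplier (>1 plays faster)
    delay_scale: float           # multiplier on the time d before the answer starts
    gain: float                  # multiplier on the answer volume Ag
    end_rise_cents: float = 0.0  # pitch rise toward the end of the answer (Av)


# Illustrative presets only; the magnitudes are assumptions.
STYLES = {
    "neutral": AnswerStyle(speed=1.0, delay_scale=1.0, gain=1.0),
    "angry": AnswerStyle(speed=1.3, delay_scale=0.5, gain=1.5),
    "half_hearted": AnswerStyle(speed=0.8, delay_scale=1.8, gain=0.6),
    "questioning": AnswerStyle(speed=1.0, delay_scale=1.0, gain=1.0,
                               end_rise_cents=400.0),
}


def playback_params(style_name, d_sec, ad_sec, ag):
    """Derive playback settings for one answer from its neutral values."""
    style = STYLES[style_name]
    return {
        "delay_sec": d_sec * style.delay_scale,
        # Faster playback shortens the playback time, slower playback lengthens it.
        "duration_sec": ad_sec / style.speed,
        "volume": ag * style.gain,
        "end_rise_cents": style.end_rise_cents,
    }
```

Keeping the presets as data rather than code would make it straightforward to expose the individual values, or the choice of preset, through an operation input unit.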

The utterance section, the voiced section, and the like may also be detected from the voice signal of the question uttered by the user as follows.

FIG. 9 is a diagram showing, for application example 4, the relationship between the volume thresholds and the detection of the utterance section, the non-utterance section, and the voiced section.
For a question uttered by the user, the figure shows the temporal change in pitch in (a) and the temporal change in volume in (b). Specifically, it shows the pitch and the volume rising gradually and then turning downward partway through.

Here, the threshold Thvg_H is applied when a pitch can be detected from the voice signal and the volume of the question is rising. When the volume reaches or exceeds the threshold Thvg_H, the start of the utterance section and of the voiced section is detected.
The threshold Thvg_L is applied when a pitch can be detected from the voice signal and the volume of the question is falling. When the volume falls below the threshold Thvg_L, the end of the voiced section is detected.
In speech, the volume may swing back even after it has fallen below the threshold Thvg_L. In the example of this figure, therefore, a threshold Thuvg is provided as the lower limit at which a pitch can be detected from the voice signal of the question. When the volume of the question is falling and, after dropping below the threshold Thvg_L, drops further below the threshold Thuvg, the end of the utterance section (the start of the non-utterance section) is detected.
The thresholds Thvg_H, Thvg_L, and Thuvg satisfy the relationship
Thvg_H > Thvg_L > Thuvg.
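The three-threshold scheme can be pictured as a small state machine over per-frame volume values. The sketch below is a simplified illustration under assumptions: the function name, the event names, and the frame-based input are hypothetical, and it omits the condition that a pitch must be detectable from the signal. The rising and falling directions are not tested explicitly; they follow from the state the detector is in.

```python
def detect_sections(volumes, thvg_h, thvg_l, thuvg):
    """Return a list of (frame_index, event) pairs for one volume sequence.

    volumes: per-frame volume values (any consistent unit).
    thvg_h, thvg_l, thuvg: thresholds with thvg_h > thvg_l > thuvg.
    """
    if not (thvg_h > thvg_l > thuvg):
        raise ValueError("thresholds must satisfy thvg_h > thvg_l > thuvg")

    events = []
    state = "silent"  # silent -> voiced -> trailing -> silent
    for i, v in enumerate(volumes):
        if state == "silent" and v >= thvg_h:
            # Rising volume reaches Thvg_H: utterance and voiced sections begin.
            events.append((i, "utterance_start"))
            events.append((i, "voiced_start"))
            state = "voiced"
        if state == "voiced" and v < thvg_l:
            # Falling volume drops below Thvg_L: the voiced section ends.
            events.append((i, "voiced_end"))
            state = "trailing"
        if state == "trailing" and v < thuvg:
            # Volume drops further below Thuvg: the utterance section ends.
            events.append((i, "utterance_end"))
            state = "silent"
    return events
```

Using separate thresholds for the start and the end gives hysteresis, so small fluctuations of the volume around a single level do not produce a burst of start and end events.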

The highest pitch in the voiced section detected using the thresholds Thvg_H and Thvg_L may be detected as the pitch of the specific section of the question.
If a voiced section detected in this way is relatively short, it is likely that noise has been picked up as the voice signal. As a condition for detection as a voiced section, it may therefore be required that, when a pitch can be detected from the voice signal and the volume of the question is rising, a predetermined time or more has elapsed since the volume reached or exceeded the threshold Thvg_H.
If a non-voiced (unvoiced) section is relatively short, the question has probably not ended. As a condition for detection as an unvoiced section, it may therefore be required that, when a pitch can be detected from the voice signal and the volume of the question is falling, a predetermined time has elapsed since the volume fell below the threshold Thvg_L.
Of course, when an unvoiced section, detected on the condition that a predetermined time has elapsed since the volume fell below the threshold Thvg_L, follows a voiced section detected on the condition that a predetermined time or more has elapsed since the volume reached or exceeded the threshold Thvg_H, the highest pitch in that preceding voiced section may be detected as the pitch of the specific section of the question.
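The duration conditions can be combined with the pitch selection in one pass over the frames. The following is a minimal sketch, assuming per-frame volume and pitch values and expressing the predetermined times as frame counts; the function name and parameters are hypothetical, and a pitch value of None stands for a frame where no pitch was detected.

```python
def question_pitch(volumes, pitches, thvg_h, thvg_l,
                   min_voiced_frames, min_unvoiced_frames):
    """Return the highest pitch of the first confirmed voiced section, or None.

    A voiced section counts only if it lasts at least min_voiced_frames, and it
    is closed only after the volume stays below thvg_l for min_unvoiced_frames.
    """
    start = None  # frame where the volume first reached thvg_h
    below = 0     # consecutive frames below thvg_l
    for i, v in enumerate(volumes):
        if start is None:
            if v >= thvg_h:
                start = i
                below = 0
            continue
        if v < thvg_l:
            below += 1
            end = i - below + 1  # first frame of the current run below thvg_l
            if end - start < min_voiced_frames:
                # Too short to be speech: treat it as noise and start over.
                start = None
                continue
            if below >= min_unvoiced_frames:
                # The unvoiced run is long enough: the voiced section is confirmed.
                return max((p for p in pitches[start:end] if p is not None),
                           default=None)
        else:
            # A short dip that recovers does not end the voiced section.
            below = 0
    return None
```

In this sketch a short noise burst is discarded before its pitch can influence the result, and a brief dip in volume inside the question does not split the voiced section.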

If, in the user's question, the ending section of the voiced section is an unvoiced sound (in short, a sound produced without vocal cord vibration), the pitch of that unvoiced part may be estimated from the immediately preceding voiced part.
The specific section of the user's question has been described as the ending section of the voiced section, but it may instead be, for example, the beginning section, or the user may be allowed to set freely which part of the question has its pitch determined.
Instead of using both volume and pitch to detect the voiced section, only one of them may be used, or the user may be allowed to select which is used to detect the voiced section.

For the answer voice data stored in the answer library 124, answers with the same content may be stored for a plurality of persons A, B, C, and so on. The persons A, B, C, ... may be, for example, celebrities, TV personalities, or singers, and the voice data is organized into a library for each person.
When libraries are created in this way, the answer voice data may be stored in the answer library 124 via a medium such as a memory card, or the voice synthesizer 10 may be given a network connection function so that answer voice data is downloaded from a specific server and stored in the answer library 124. Answer voice data obtained from a memory card or a server may be free of charge or paid.
In addition, the user may be allowed to select, via the operation input unit or the like, which person should serve as the model for answers to questions, or the person may be determined at random for each of various conditions (day, week, month, and so on).

The answer voice data may also be a library of recordings of the user, the user's family members, or acquaintances, recorded via the microphone of the voice input unit 102 (or converted into data by a separate device).
When answers are given in the voice of a familiar person in this way, the user who asks a question can feel as if he or she were actually conversing with that person.

The answer may also be the cry of an animal (a dog, a cat, and so on), and the dog breed or the like may be made selectable as appropriate. Using an animal cry as the answer gives a kind of soothing effect, as if the user were conversing with the animal.

The answer pitch acquisition unit 112 may be configured to analyze the voice data of the answer determined by the answer selection unit 110, acquire the average pitch of that voice data when reproduced in the standard manner, and supply data indicating this pitch to the pitch shift amount determination unit 114. With this configuration, there is no need to store data indicating the pitch in the answer library 124 in advance in association with the answer voice data.

The embodiment has been described using the case where the pitch of the answer voice data is higher than the pitch of the user's question, but it is equally applicable to the opposite case, where the pitch of the answer voice data is lower than the pitch of the user's question.

DESCRIPTION OF SYMBOLS
102 ... voice input unit, 106 ... voice feature amount acquisition unit (first pitch acquisition unit), 110 ... answer selection unit, 112 ... answer pitch acquisition unit (second pitch acquisition unit), 114 ... pitch shift amount determination unit, 116 ... answer synthesis unit, 124 ... answer library.

Claims

A voice control device comprising:
a first pitch acquisition unit that acquires, based on an input voice signal, the pitch of a specific partial section of a question;
an answer acquisition unit that acquires an answer to the question;
a second pitch acquisition unit that acquires a pitch based on the voice signal of the acquired answer;
a pitch shift amount determination unit that determines a pitch shift amount up to a target pitch that lies within a predetermined pitch range relative to the pitch based on the voice signal of the answer and that maintains a specific relationship with the pitch of the partial section; and
a voice control unit that shifts the pitch based on the voice signal of the answer by the pitch shift amount.

The voice control device according to claim 1, wherein the pitch shift amount determination unit changes the pitch shift amount in octave units so that the pitch based on the voice signal of the answer falls within the pitch range.

The voice control device according to claim 1 or 2, wherein the first pitch acquisition unit acquires, as the pitch of the partial section, the maximum pitch value in a section in which the volume of the input voice signal is equal to or higher than a predetermined value.

The voice control device according to claim 1, wherein the first pitch acquisition unit acquires, as the pitch of the partial section, the pitch of the ending section of a voiced section of the input voice signal.

The voice control device according to claim 4, wherein the ending section is a section of a predetermined time measured from the temporal end at which the volume of the voice signal, from which a pitch can be detected, decreases and falls below a predetermined threshold.

The voice control device according to claim 1, wherein the first pitch acquisition unit acquires, as the pitch of the partial section, the pitch of the beginning section of the input voice signal.

The voice control device according to claim 1, wherein the user can freely set which part of the input voice signal is used as the partial section.

The voice control device according to claim 1, wherein the partial section is a section of the input voice signal that leaves a strong impression on a person who hears the voice signal.

The voice control device according to claim 1, wherein the pitch based on the voice signal of the answer is an average value or an intermediate value of the pitch of the voice signal of the answer.

A program that causes a computer to function as:
a first pitch acquisition unit that acquires, based on an input voice signal, the pitch of a specific partial section of a question;
an answer acquisition unit that acquires a voice signal of an answer to the question;
a second pitch acquisition unit that acquires a pitch based on the voice signal of the acquired answer;
a pitch shift amount determination unit that determines a pitch shift amount up to a target pitch that lies within a predetermined pitch range relative to the pitch based on the voice signal of the answer and that maintains a specific relationship with the pitch of the partial section; and
a voice control unit that shifts the pitch based on the voice signal of the answer by the pitch shift amount.