JP2018159776A

JP2018159776A - Voice reproduction controller, and program

Info

Publication number: JP2018159776A
Application number: JP2017056323A
Authority: JP
Inventors: 嘉山　啓; Hiroshi Kayama; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-03-22
Filing date: 2017-03-22
Publication date: 2018-10-11

Abstract

PROBLEM TO BE SOLVED: To reproduce an answer that suppresses sound quality deterioration and does not strike the user as unnatural, without requiring a large amount of resources. SOLUTION: The provided voice reproduction controller includes: a pitch acquiring unit that acquires a pitch based on voice data of an answer corresponding to the voice signal of an input question; a target pitch determining unit that determines a first target pitch which maintains a first relation to the pitch of a specific section of the voice signal of the input question and falls within a first pitch range determined according to the pitch of the answer; a target pitch changing unit that changes the first target pitch to a second target pitch when the first target pitch does not fall within a second pitch range narrower than the first pitch range; and a reproduction instructing unit that instructs the voice reproduction unit to reproduce the answer with the pitch based on the voice data of the answer shifted to the pitch determined by the target pitch determining unit or the pitch changed by the target pitch changing unit. SELECTED DRAWING: Figure 1

Description

The present invention relates to audio reproduction technology.

In recent years, the following audio reproduction techniques have been proposed: a technique that reproduces more human-like speech by reproducing speech synthesized to match the user's speaking tone and voice quality (see, for example, Patent Document 1), and a technique that analyzes a user's voice to diagnose the user's psychological state, health condition, and the like (see, for example, Patent Document 2).
A voice dialogue system has also been proposed that recognizes speech input by a user while synthesizing and reproducing speech whose content is specified by a scenario, thereby realizing a spoken dialogue with the user (see, for example, Patent Document 3).

Patent Document 1: JP 2003-271194 A
Patent Document 2: Japanese Patent No. 4495907
Patent Document 3: Japanese Patent No. 4832097

Now consider a dialogue system that combines the speech synthesis technology and the voice dialogue system described above: in response to a question spoken by a user, it retrieves data and then synthesizes and reproduces an answer voice by speech synthesis. In this case, if the pitch of the answer voice is far from the pitch of the question voice, the answer lacks affinity with the question and gives the user an unnatural impression. One way to avoid this is to prepare several answer voices of different pitches for each question content and to select and reproduce one of them according to the pitch of the question voice. However, when resources are constrained, for example when the storage device that stores the answer voice data cannot be given sufficient capacity, it is not possible to prepare several answer voices of different pitches for each question content. Under such tight resource constraints, one could prepare only one answer voice per question content and use pitch shifting to synthesize and reproduce an answer voice whose pitch corresponds to the pitch of the question voice. However, pitch shifting is accompanied by sound quality deterioration, so the larger the pitch shift amount, the more marked the deterioration of the answer voice.

The present invention has been made in view of these circumstances, and one of its objects is to provide an audio reproduction control device and an audio reproduction control program that make it possible, without requiring a large amount of resources, to reproduce an answer voice that suppresses sound quality deterioration and does not strike the user as unnatural.

In studying a man-machine system that reproduces an answer voice in response to a user's question, we first consider what kind of dialogue takes place between people, focusing on information other than linguistic information, in particular the pitch (frequency) that characterizes a dialogue.

As a dialogue between people, consider the case where one person (call them a) asks a question and the other person (call them b) replies. When a utters the question, not only a but also b, who is about to answer it, often retains a strong impression of the pitch (for example, the lowest pitch) in a specific section of the question, such as the section at the end of the utterance. When b answers with agreement, approval, affirmation, or the like, b speaks so that the pitch of the part that characterizes the answer has a specific relationship, for example that of the same pitch, to the remembered pitch of the question. Because the pitch that a remembers from their own question and the pitch of the part that characterizes the answer stand in this relationship, a, on hearing the answer, is thought to form a favorable impression of b's answer, finding it pleasant and reassuring.

Thus, in a dialogue between people, the pitch of the question and the pitch of the answer are not unrelated but stand in the relationship described above. Based on this consideration, in studying a dialogue system that reproduces (replies with) an answer voice in response to a user's question, the following configuration was adopted for the voice reproduction in order to achieve the above object.

That is, in order to achieve the above object, an audio reproduction control device according to one aspect of the present invention comprises: a pitch acquisition unit that acquires a pitch based on voice data of an answer corresponding to the voice signal of an input question; a target pitch determination unit that determines a first target pitch which maintains a first relationship to the pitch of a specific section of the voice signal of the input question and falls within a first pitch range determined according to the pitch of the answer; a target pitch change unit that changes the first target pitch to a second target pitch when the first target pitch does not fall within a second pitch range narrower than the first pitch range; and a reproduction instruction unit that instructs a voice reproduction unit to reproduce the answer with the pitch based on the voice data of the answer shifted to the pitch determined by the target pitch determination unit or to the pitch changed by the target pitch change unit.

Here, one example of the first relationship is a pitch that is the same as the pitch of the specific section or differs from it by octaves. A pitch that differs by octaves means that the pitch difference between the two sounds is an integer multiple of one octave. Even when two sounds produced at the same time differ in pitch, harmony is maintained if the pitch difference is an integer multiple of one octave, and the listener feels little discomfort.
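The first relationship can be illustrated with a small sketch. It assumes pitches are given as fundamental frequencies in Hz and uses the usual definition of cents (1200 cents per octave); the function names and the tolerance parameter are hypothetical and do not appear in the specification.

```python
import math


def cents_between(f_ref_hz: float, f_hz: float) -> float:
    """Signed interval from f_ref_hz to f_hz in cents (1200 cents = 1 octave)."""
    return 1200.0 * math.log2(f_hz / f_ref_hz)


def is_same_or_octave_apart(f_a_hz: float, f_b_hz: float,
                            tol_cents: float = 1.0) -> bool:
    """True if the two pitches are equal or differ by a whole number of octaves.

    tol_cents is an assumed tolerance for measurement error.
    """
    diff = cents_between(f_a_hz, f_b_hz)
    nearest_octave_multiple = round(diff / 1200.0) * 1200.0
    return abs(diff - nearest_octave_multiple) <= tol_cents
```

For instance, 220 Hz and 440 Hz satisfy the relationship (one octave apart), whereas 220 Hz and 330 Hz, roughly a fifth apart, do not.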

According to this aspect, an answer to a question uttered by the user can be reproduced without sounding unnatural and without perceptible degradation of quality. Note that answers are not limited to specific answers to a question but also include back-channel responses (interjections). Besides human voices, answers also include animal cries such as “bowwow” and “meow”. In other words, the terms answer and voice as used here cover not only voices uttered by people but also animal cries.

The specific section is the portion that leaves a strong impression, for example the section at the end of the question (the tail section). A concrete example of the pitch of the specific section is the lowest pitch in the tail section where the volume is at or above a predetermined value (that is, in a voiced section). The pitch based on the voice data is, for example, the pitch at a characteristic portion when the voice data is reproduced in the standard manner; examples include the pitch at the beginning of the word, the pitch at the portion with the highest volume, and the average pitch.

In the above aspect, the target pitch determination unit determines the first target pitch by changing the target pitch in units of a first shift amount until it falls within the first pitch range, and the target pitch change unit determines the second target pitch by changing the target pitch, until it falls within the second pitch range, in units of a second shift amount that is smaller than the first shift amount unit and is determined according to the pitch of the specific section. This makes the pitch shift amount to the pitch changed by the target pitch change unit smaller than the pitch shift amount before the change, and allows the second target pitch to be set finely. The first pitch range may be defined in octave units, and the second pitch range in half-octave units.

As a more preferable configuration, the lowest pitch value in the tail section of the question where the volume of the input voice signal is at or above a predetermined value may be acquired as the pitch of the specific section. The determination of whether the volume is at or above the predetermined value may be made with a hysteresis characteristic.
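One common way to give a volume decision a hysteresis characteristic is to use two thresholds. The sketch below is only an illustration under that assumption: the specification does not say how the hysteresis is realized, and the function name, the per-frame volume list, and both threshold parameters are hypothetical.

```python
def loud_mask_with_hysteresis(volumes, on_threshold, off_threshold):
    """Mark frames treated as 'volume at or above the predetermined value'.

    A frame switches on when its volume reaches on_threshold and stays on
    until the volume drops below the lower off_threshold, which avoids
    rapid toggling when the volume hovers around a single threshold.
    """
    if off_threshold > on_threshold:
        raise ValueError("off_threshold must not exceed on_threshold")
    active = False
    mask = []
    for v in volumes:
        if active:
            active = v >= off_threshold
        else:
            active = v >= on_threshold
        mask.append(active)
    return mask
```

With `on_threshold=0.5` and `off_threshold=0.3`, a frame at volume 0.4 counts as loud only if the preceding frame was already loud.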

The aspects of the present invention can be conceived not only as an audio reproduction control device but also as a program that causes a computer to function as that audio reproduction control device.

FIG. 1 is a block diagram showing the configuration of the audio reproduction control device according to the embodiment.
FIG. 2 is a flowchart showing the operation of the audio reproduction control device.
FIG. 3 is a flowchart showing the operation of the audio reproduction control device.
FIG. 4 is a diagram showing an example of the pitches of a question by a user and an answer by the audio reproduction control device.

Embodiments of the present invention will be described below with reference to the drawings.
(A: Configuration)
FIG. 1 is a diagram showing the configuration of an audio reproduction control device 10 according to an embodiment of the present invention.
The audio reproduction control device 10 is, for example, a device built into a stuffed toy that, when a user asks the stuffed toy a question, synthesizes and outputs an answer such as a back-channel response. The audio reproduction control device 10 has a CPU (Central Processing Unit), a voice input unit 102, and a speaker 142, and the CPU executes a preinstalled application program to construct the following functional blocks: a voice feature acquisition unit 106, an answer selection unit 110, a pitch acquisition unit 112, a reproduction instruction unit 114, and an answer reproduction unit 116.

Although not specifically illustrated, the audio reproduction control device 10 also has a display unit, an operation input unit, and the like, so that the user can check the status of the device, input various operations to it, and make various settings. The audio reproduction control device 10 is not limited to a toy such as a stuffed toy and may be a so-called pet robot, a terminal device such as a mobile phone, a tablet personal computer, or the like.

The voice input unit 102, details of which are omitted, consists of a microphone that converts sound into an electrical signal and an A/D converter that converts the resulting voice signal into a digital signal.

The voice feature acquisition unit 106 analyzes the voice signal converted into a digital signal, separates it into utterance sections and non-utterance sections, detects the lowest pitch in a specific section of the voiced section within the utterance section, and supplies data indicating that pitch to the answer selection unit 110 and the reproduction instruction unit 114. Here, an utterance section is, for example, a section in which the volume of the voice signal is at or above a threshold, and conversely a non-utterance section is a section in which the volume is below the threshold. A voiced section is a section within an utterance section in which the pitch of the voice signal can be detected. A section in which the pitch can be detected means that the voice signal has a periodic portion and that portion can be detected. Here, the specific section is the tail section of the voiced section. The tail section is the section of a predetermined time (for example, 180 msec) immediately preceding the end of the voiced section.
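The extraction of the lowest pitch in the tail section can be sketched as follows. This is a simplified illustration, not the device's actual analysis: it assumes a pitch tracker (not shown) has already produced one pitch value per frame, with `None` for frames where no pitch was detected, and it treats the last frame with a detected pitch as the end of the voiced section. The function name and the `hop_ms` parameter are hypothetical; the 180 msec default comes from the example in the text.

```python
from typing import Optional, Sequence


def lowest_pitch_in_tail(pitch_track_hz: Sequence[Optional[float]],
                         hop_ms: float,
                         tail_ms: float = 180.0) -> Optional[float]:
    """Lowest detected pitch in the last tail_ms of the voiced section."""
    # Find the last frame with a detected pitch (end of the voiced section).
    last_voiced = None
    for i in range(len(pitch_track_hz) - 1, -1, -1):
        if pitch_track_hz[i] is not None:
            last_voiced = i
            break
    if last_voiced is None:
        return None  # no pitch was detected anywhere

    # Number of frames covering the tail section, at least one.
    n_tail = max(1, int(round(tail_ms / hop_ms)))
    start = max(0, last_voiced - n_tail + 1)
    tail = [p for p in pitch_track_hz[start:last_voiced + 1] if p is not None]
    return min(tail)
```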

The answer library 124 stores in advance a plurality of pieces of voice data of answers to questions from the user. The voice data are recordings of the voice of a model person, for example replies and back-channel responses to a question such as “hai” (yes), “iie” (no), “sou” (that's right), “un” (uh-huh), “fuun” (hmm), and “naruhodo” (I see). The answer voice data are in a format such as wav or mp3. The pitch of each waveform sample (or each waveform period) when reproduced in the standard manner, and the average pitch obtained by averaging those pitches, are determined in advance, and data indicating the average pitch (the pitch based on the answer) are stored in the answer library 124 in association with the voice data. Standard reproduction here means reproducing the voice data under the same condition (sampling frequency) as when it was recorded.

When the voice feature acquisition unit 106 outputs data indicating the lowest pitch of the specific section, the answer selection unit 110 (answer acquisition unit) selects one piece of answer voice data for that voice from the answer library 124, and reads out and outputs the selected answer voice data together with the data indicating the associated average pitch. The rule by which the answer selection unit 110 selects one piece of voice data from among the plurality may be, for example, random selection, or selection of the voice data whose average pitch is closest to the lowest pitch of the specific section of the question.
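The two selection rules mentioned above can be sketched as follows. The `Answer` record, its field names, and the `rule` argument are hypothetical, and measuring closeness on a logarithmic (musical) scale is an assumption; the text only says "closest".

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    name: str                 # hypothetical label for the stored voice data
    average_pitch_hz: float   # average pitch stored with the voice data


def select_answer(library, question_pitch_hz, rule="closest", rng=None):
    """Pick one answer at random or by closest average pitch."""
    if not library:
        raise ValueError("answer library is empty")
    if rule == "random":
        return (rng or random).choice(library)
    if rule == "closest":
        # Assumption: distance is measured as a pitch interval (log scale).
        return min(
            library,
            key=lambda a: abs(math.log2(a.average_pitch_hz / question_pitch_hz)),
        )
    raise ValueError(f"unknown rule: {rule}")
```

Choosing the closest answer tends to keep the later pitch shift small, which fits the stated aim of limiting sound quality deterioration.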

In the present embodiment, the semantic content of the user's question is not taken into account in selecting the answer, but this is sufficient if the audio reproduction control device 10 is regarded as a device that returns a back-channel response or the like in reply to a question uttered by the user. Alternatively, as indicated by the broken line in the figure, a language analysis unit 108 may be provided so that the language analysis unit 108 analyzes the semantic content of the question defined by the voice signal and the answer selection unit 110 creates an answer to the question via a database or the like.

The pitch acquisition unit 112 extracts, from the data read out by the answer selection unit 110, the data indicating the average pitch of the answer and supplies it to the reproduction instruction unit 114.

The reproduction instruction unit 114 determines, from the difference between the lowest pitch of the specific section of the voice signal output from the voice feature acquisition unit 106 and the average pitch of the answer output from the pitch acquisition unit 112, the target pitch for reproducing the answer voice data, and determines the pitch shift amount that shifts the average pitch of the answer to that target pitch. As shown in FIG. 1, the reproduction instruction unit 114 includes a target pitch determination unit 114a and a target pitch change unit 114b.

The target pitch determination unit 114a determines a first target pitch, which is a target pitch that maintains a predetermined first relationship to the lowest pitch of the specific section of the question voice and falls within a first pitch range determined according to the average pitch of the answer. The target pitch determination unit 114a also calculates the pitch shift amount that shifts the average pitch of the answer to the first target pitch. In the present embodiment, the first relationship is a pitch equal to the lowest pitch of the specific section of the question voice or a pitch that differs from it by a whole number of octaves. This is because, when two sounds are produced at the same time, harmony is maintained and the listener feels little discomfort not only when their pitches are the same but also when they differ, provided the pitch difference is an integer multiple of one octave. The first pitch range is a one-octave range centered on the average pitch of the answer, that is, from 600 cents below the average pitch to 600 cents above it.

The target pitch change unit 114b changes the first target pitch to a second target pitch when the first target pitch determined by the target pitch determination unit 114a does not fall within a second pitch range that is narrower than the first pitch range. More specifically, when the first target pitch does not fall within the second pitch range, the target pitch change unit 114b changes the first target pitch to a second target pitch, which is a target pitch that maintains, with respect to the lowest pitch of the specific section, a second relationship different from the first relationship and falls within the second pitch range, and corrects the pitch shift amount calculated by the target pitch determination unit 114a to the pitch shift amount that shifts to the second target pitch. Here, the second relationship is a pitch relationship with high affinity to the lowest pitch of the specific section, such as a consonant relationship like that of “so” to “do”. If the second relationship is maintained, the listener feels little discomfort, although not as little as when the first relationship is maintained. The second pitch range is a half-octave range centered on the average pitch of the answer, that is, from 300 cents below the average pitch to 300 cents above it.
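The condition under which the target pitch change unit 114b acts can be sketched as a simple range check. The sketch assumes all pitches are expressed in cents relative to a common reference and treats the range boundaries as inclusive; the function and constant names are hypothetical. How the second target pitch itself is chosen is not covered here.

```python
FIRST_RANGE_HALF_WIDTH_CENTS = 600.0   # first pitch range: average +/- 600 cents
SECOND_RANGE_HALF_WIDTH_CENTS = 300.0  # second pitch range: average +/- 300 cents


def needs_second_target(first_target_cents: float,
                        answer_average_cents: float) -> bool:
    """True if the first target pitch lies outside the second pitch range."""
    offset = first_target_cents - answer_average_cents
    if abs(offset) > FIRST_RANGE_HALF_WIDTH_CENTS:
        # The first target pitch is expected to be inside the first range.
        raise ValueError("first target pitch is outside the first pitch range")
    return abs(offset) > SECOND_RANGE_HALF_WIDTH_CENTS
```

For example, a first target pitch 450 cents above the answer's average pitch is inside the first range but outside the second, so it would be handed to the change step.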

The reproduction instruction unit 114 instructs the answer reproduction unit 116 to reproduce the answer shifted by the pitch shift amount needed to reach the pitch determined by the target pitch determination unit 114a or the pitch changed by the target pitch change unit 114b. The answer reproduction unit 116 reproduces (synthesizes) the answer voice data read from the answer library 124, shifted by the pitch shift amount instructed by the reproduction instruction unit 114. The pitch-shifted voice signal is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 142. The data associated with the pitch of the answer, that is, the data stored in the answer library 124 and used by the reproduction instruction unit 114 to determine the pitch shift amount, may be something other than data indicating the average pitch. For example, it may be an intermediate value of the pitch, or the average pitch over a predetermined section of the voice data.

(B: Operation)
Next, the operation of the audio reproduction control device 10 will be described.
FIG. 2 is a flowchart showing the processing operation of the audio reproduction control device 10.
The processing shown in this flowchart is started when the user asks a question by voice to the stuffed toy to which the audio reproduction control device 10 is applied. For convenience, the description here takes as an example the case where the pitch of the answer voice data is lower than the pitch of the user's voice (the question).

First, in step Sa11, the voice signal converted by the voice input unit 102 is supplied to the voice feature acquisition unit 106. Next, in step Sa12, the voice feature acquisition unit 106 performs analysis processing on the voice signal from the voice input unit 102, that is, processing to detect the pitch of the question uttered by the user. In step Sa13, it is determined whether the answer reproduction unit 116 is reproducing an answer.

If no answer is being reproduced (if the determination result in step Sa13 is “No”), the voice feature acquisition unit 106 determines whether the question (utterance) in the voice signal from the voice input unit 102 has ended (step Sa14). Specifically, whether the question has ended is determined, for example, by whether the volume of the voice signal has remained below a predetermined threshold for a predetermined time. If the question has not ended (if the determination result in step Sa14 is “No”), the procedure returns to step Sa11, so that the voice feature acquisition unit 106 continues analyzing the voice signal from the voice input unit 102.
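The end-of-question test in step Sa14 can be sketched as follows, assuming the volume is available as one value per analysis frame so that the "predetermined time" becomes a number of frames. The function name and parameters are hypothetical.

```python
def utterance_ended(volumes, threshold, hold_frames):
    """True if the volume stayed below threshold for the last hold_frames frames."""
    if hold_frames <= 0:
        raise ValueError("hold_frames must be positive")
    if len(volumes) < hold_frames:
        return False  # not enough history yet to decide
    return all(v < threshold for v in volumes[-hold_frames:])
```

A single quiet frame does not end the question; the quiet state has to persist for the whole hold period, which keeps short pauses within an utterance from triggering an answer.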

If the question has ended (if the determination result in step Sa14 is “Yes”), the reproduction instruction unit 114 determines, as described later, the pitch shift amount for reproducing the answer voice data selected by the answer selection unit 110 (step Sa15). The reproduction instruction unit 114 then notifies the answer reproduction unit 116 of the determined pitch shift amount and instructs it to reproduce the answer voice data selected by the answer selection unit 110 (step Sa16). In accordance with this instruction, the answer reproduction unit 116 reproduces the voice data shifted by the pitch shift amount notified by the reproduction instruction unit 114 (step Sa17).

The case where it is determined in step Sa13 that the answer reproduction unit 116 is reproducing an answer (the determination result in step Sa13 is “Yes”) is, for example, the case where the user utters the next question while the answer to an earlier question is being reproduced. In this case, the procedure does not go back through steps Sa14 and Sa11 but proceeds to step Sa17, so reproduction of the answer takes priority.

FIG. 3 is a flowchart showing the details of the processing of step Sa15 in FIG. 2, that is, the processing that determines the pitch shift amount for the answer voice data. The preconditions for executing this processing are that the answer reproduction unit 116 is not reproducing an answer (the determination result in step Sa13 is “No”) and that the user has finished inputting the question (the determination result in step Sa14 is “Yes”).

First, in step Sb11, the reproduction instruction unit 114 acquires, from the voice feature acquisition unit 106, data indicating the lowest pitch of the specific section of the question.

Meanwhile, the answer selection unit 110 selects voice data of an answer to the user's question from the answer library 124 and reads out the selected answer voice data and the data indicating the average pitch associated with that voice data. The pitch acquisition unit 112 supplies the data indicating the average pitch, out of the data read out, to the reproduction instruction unit 114. The reproduction instruction unit 114 thereby acquires data indicating the average pitch of the answer selected by the answer selection unit 110 (step Sb12).

Next, the target pitch determination unit 114a provisionally sets the lowest pitch of the specific section of the question as the first target pitch described above (step Sb13).

Next, the target pitch determination unit 114a calculates the pitch shift amount from the average pitch of the answer selected by the answer selection unit 110 to the provisionally determined first target pitch (which includes, besides the value from step Sb13, the first target pitch as changed in steps Sb16 and Sb18 described later) (step Sb14). The target pitch determination unit 114a determines whether the provisionally determined first target pitch is lower than the lower limit threshold L1 of the first pitch range described above (step Sb15). The lower limit threshold L1 defines how far below the average pitch of the answer a pitch shift is allowed, and in the present embodiment it is the average pitch of the answer minus 600 cents.

If the provisional first target pitch is lower than the lower limit threshold L1 (if the determination result of step Sb15 is "Yes"), the target pitch determination unit 114a raises the first target pitch by one octave (1200 cents) and provisionally sets it again (step Sb16). The procedure then returns to step Sb14, the pitch shift amount is calculated again, and the determination of step Sb15 is executed.

On the other hand, if the provisional first target pitch is not lower than the lower limit threshold L1 (if the determination result of step Sb15 is "No"), the target pitch determination unit 114a determines whether the provisional first target pitch is higher than the upper limit threshold H1 of the first pitch range (step Sb17). The upper limit threshold H1 specifies how far above the average pitch of the answer a pitch shift is permitted; in this embodiment it is the average pitch of the answer plus 600 cents. If the provisional first target pitch is higher than the upper limit threshold H1 (if the determination result of step Sb17 is "Yes"), the target pitch determination unit 114a lowers the first target pitch by one octave and provisionally sets it again (step Sb18). The procedure then returns to step Sb14, the pitch shift amount is calculated again, and the determinations of steps Sb15 and Sb17 are executed.

If the provisional first target pitch is not higher than the upper limit threshold H1 (if the determination result of step Sb17 is "No"), the provisional first target pitch falls within the first pitch range. The target pitch determination unit 114a finalizes the first target pitch as it stands when the determination result of step Sb17 becomes "No", and moves the procedure to step Sb19.
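Steps Sb13 to Sb18 can be sketched as a loop that moves the target by octaves until it falls within the first pitch range. This is a minimal sketch under stated assumptions: all pitches are expressed in cents relative to one arbitrary common reference, and the function and parameter names are hypothetical rather than taken from the embodiment.

```python
OCTAVE = 1200.0  # cents

def determine_first_target(question_pitch, answer_avg, half_width=600.0):
    """Steps Sb13-Sb18: move the target by octaves into [avg - 600, avg + 600].

    Returns the finalized first target pitch and the pitch shift amount,
    both in cents.
    """
    target = question_pitch                      # Sb13: provisional target
    while True:
        shift = target - answer_avg              # Sb14: pitch shift amount
        if target < answer_avg - half_width:     # Sb15: below L1?
            target += OCTAVE                     # Sb16: raise one octave
        elif target > answer_avg + half_width:   # Sb17: above H1?
            target -= OCTAVE                     # Sb18: lower one octave
        else:
            return target, shift                 # within the first pitch range
```

Because the first pitch range is exactly one octave wide and the step is one octave, the loop always terminates, and the result is the same note name as the question pitch, possibly in a different octave.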

In step Sb19, the target pitch changing unit 114b provisionally sets, as the second target pitch, the pitch obtained by shifting by the pitch shift amount, and determines whether this provisional second target pitch is lower than the lower limit threshold L2 of the second pitch range described above. When step Sb19 is executed immediately after the determination result of step Sb17 becomes "No", the first target pitch finalized by the target pitch determination unit 114a is provisionally set as the second target pitch. The lower limit threshold L2 specifies, more strictly than the lower limit threshold L1, how far below the average pitch of the answer a pitch shift is permitted; in this embodiment it is the average pitch of the answer minus 300 cents.

If the determination result of step Sb19 is "Yes", the target pitch changing unit 114b raises the second target pitch by a predetermined amount, provisionally sets it again, and recalculates the pitch shift amount (step Sb20). The amount by which the pitch is raised in step Sb20 is set smaller than the amount in step Sb16 (one octave), because the second pitch range is narrower than the first pitch range, as described above. The amount in step Sb20 may be set according to the width of the second pitch range and the second relationship described above. Specifically, one conceivable setting is 700 cents, so that when the pitch of the question is "do", the pitch of the answer becomes "so", that is, not the same note in a different octave but a pitch within one octave that has a high affinity with it (for example, a consonant interval). This embodiment adopts that setting. After step Sb20, the procedure returns to step Sb19 and the determination of step Sb19 is executed again.

On the other hand, if the provisional second target pitch is not lower than the lower limit threshold L2 (if the determination result of step Sb19 is "No"), the target pitch changing unit 114b determines whether the provisional second target pitch is higher than the upper limit threshold H2 of the second pitch range (step Sb21). The upper limit threshold H2 specifies, more strictly than the upper limit threshold H1, how far above the average pitch of the answer a pitch shift is permitted; in this embodiment it is the average pitch of the answer plus 300 cents. If the shifted pitch is higher than the upper limit threshold H2 (if the determination result of step Sb21 is "Yes"), the target pitch changing unit 114b lowers the second target pitch by a predetermined amount, provisionally sets it again, and recalculates the pitch shift amount (step Sb22). The amount by which the pitch is lowered in step Sb22 is likewise set smaller than the amount in step Sb18 (one octave). As with the amount in step Sb20, it may be set according to the width of the second pitch range and the second relationship described above.
In this embodiment, the amount by which the pitch is lowered in step Sb22 is set to 500 cents, so that when the pitch of the question is "do", the pitch of the answer becomes "so", that is, not the same note in a different octave but a pitch within one octave that has a high affinity with it (for example, a consonant interval). After step Sb22, the procedure returns to step Sb19 and the determinations of steps Sb19 and Sb21 are executed again.
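The reason a 700-cent raise and a 500-cent lowering both lead from "do" to "so" can be checked with simple arithmetic modulo one octave. The sketch below is illustrative only: it assumes twelve-tone equal temperament (100 cents per semitone), and the note-name table and function name are hypothetical.

```python
# Note names for the twelve equal-tempered semitones, counted upward from "do".
NOTE_NAMES = ["do", "do#", "re", "re#", "mi", "fa",
              "fa#", "so", "so#", "la", "la#", "ti"]

def note_name(cents_from_do: float) -> str:
    """Name the equal-tempered note lying cents_from_do above (or below) "do"."""
    semitones = round(cents_from_do / 100.0)
    return NOTE_NAMES[semitones % 12]   # Python's % is non-negative for a positive modulus

raised = note_name(+700)     # a fifth above "do"   -> "so"
lowered = note_name(-500)    # a fourth below "do"  -> "so"
octave = note_name(-1200)    # an octave below "do" -> "do"
```

Since 700 and -500 differ by exactly 1200 cents, the two steps reach the same note name in adjacent octaves, whereas the octave steps of the first stage never change the note name.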

If the provisional second target pitch is not higher than the upper limit threshold H2 (if the determination result of step Sb21 is "No"), the provisional second target pitch falls within the second pitch range. The target pitch changing unit 114b finalizes the second target pitch as it stands when the determination result of step Sb21 becomes "No" as the pitch of the answer to be output, and notifies the answer reproduction unit 116 of the pitch shift amount at that point as the amount by which the average pitch of the answer is to be shifted to reach the finalized pitch (step Sb23).
The above is the pitch determination operation of the reproduction instruction unit 114.
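Steps Sb19 to Sb23 can be sketched in the same way as a second loop with a narrower range and smaller steps. This is a minimal sketch, assuming pitches in cents relative to an arbitrary common reference; the function and parameter names are hypothetical, and the default values are the ones stated for this embodiment (range of plus or minus 300 cents, raise by 700 cents, lower by 500 cents).

```python
def change_to_second_target(first_target, answer_avg,
                            half_width=300.0, raise_by=700.0, lower_by=500.0):
    """Steps Sb19-Sb23: move the target into [avg - 300, avg + 300].

    Returns the finalized second target pitch and the pitch shift amount
    that would be passed to the answer reproduction unit, both in cents.
    """
    target = first_target                         # Sb19: provisional second target
    while True:
        if target < answer_avg - half_width:      # Sb19: below L2?
            target += raise_by                    # Sb20: raise by 700 cents
        elif target > answer_avg + half_width:    # Sb21: above H2?
            target -= lower_by                    # Sb22: lower by 500 cents
        else:
            return target, target - answer_avg    # Sb23: finalize and report shift
```

For example, `change_to_second_target(400, 0)` lowers the target once and yields a shift of -100 cents, while a first target that already lies within 300 cents of the average is returned unchanged.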

FIG. 4 illustrates the relationship between the voice of a question input by the user and the voice of the answer reproduced (synthesized) by the voice reproduction control device 10, with pitch on the vertical axis and time on the horizontal axis. In the figure, the solid line labeled T1 shows, simplified to a straight line, the pitch change of the user's spoken question "ano ne" ("you know"). Reference symbol P1 indicates the lowest pitch in the specific section of question T1, specifically the pitch "do". The solid line labeled A1 shows, in simplified form, the pitch change when the voice data of the answer "un" ("uh-huh") selected for question T1 is reproduced at its standard, unshifted pitch, and reference symbol P0 indicates its average pitch.

If answer A1 is reproduced in response to question T1 without shifting its pitch, the result tends to sound unnatural. In this embodiment, therefore, the lowest pitch P1 of the specific section (the end of the utterance), which is the characteristic and memorable part of question T1, is first provisionally set as the pitch at which answer A1 is to be reproduced (FIG. 3: step Sb13), and the pitch shift amount for shifting the pitch P0 of answer A1 to the provisional pitch P1 is calculated (step Sb14).

As shown in FIG. 4, the provisional pitch P1 exceeds the upper limit threshold H1 of the first pitch range, so the determination result of step Sb15 is "No" and that of step Sb17 is "Yes". As a result, step Sb18 is executed: the pitch P2, one octave below the pitch P1, is provisionally set as the pitch at which answer A1 is to be reproduced, and the pitch shift amount is recalculated (FIG. 3: step Sb14).

As shown in FIG. 4, the pitch P2 is above the lower limit threshold L1 of the first pitch range and below the upper limit threshold H1. The determination results of steps Sb15 and Sb17 are therefore both "No", and step Sb19 is executed. As FIG. 4 also shows, the pitch P2 is above the lower limit threshold L2 of the second pitch range but also above the upper limit threshold H2. The determination result of step Sb19 is therefore "No", the determination of step Sb21 is executed, and its result is "Yes". As a result, step Sb22 is executed: the pitch P3, lower than the pitch P2 by the predetermined amount (500 cents in this embodiment), is provisionally set as the pitch at which answer A1 is to be reproduced, and the pitch shift amount for shifting the pitch P0 to the pitch P3 is recalculated.

As shown in FIG. 4, the pitch P3 is above the lower limit threshold L2 of the second pitch range and below the upper limit threshold H2. The determination results of steps Sb19 and Sb21 are therefore both "No", and the pitch P3 is finalized as the pitch at which answer A1 is to be reproduced (step Sb23).

As a result of the above operation, answer A1 with its pitch shifted to the pitch P3 is reproduced as the response to question T1. The point to note here is that the pitch shift amount D1 for shifting answer A1 to the pitch P3 is smaller than the pitch shift amount D2 for shifting it to the pitch P2. The pitch P3 of answer A1 as reproduced in this embodiment is not an octave apart from the lowest pitch P1 ("do") of the specific section of question T1, but it is a pitch with a high affinity to it ("so"). Moreover, because D1 is smaller than D2, the sound quality degradation caused by the pitch shift can be kept small.
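The walk from P1 through P2 to P3 can be traced numerically. FIG. 4 gives no numeric values, so the numbers below are hypothetical and chosen only so that the same branches are taken as in the figure: P0 is placed at 0 cents and P1 at 1600 cents. The function name is also an assumption made for this sketch.

```python
def trace_answer_pitch(question_pitch, answer_avg):
    """Return every provisional target visited, in cents; the last one is final."""
    visited = [question_pitch]
    target = question_pitch
    # Stage 1 (steps Sb15-Sb18): octave steps into [avg - 600, avg + 600].
    while not (answer_avg - 600 <= target <= answer_avg + 600):
        target += 1200 if target < answer_avg - 600 else -1200
        visited.append(target)
    # Stage 2 (steps Sb19-Sb22): +700 / -500 steps into [avg - 300, avg + 300].
    while not (answer_avg - 300 <= target <= answer_avg + 300):
        target += 700 if target < answer_avg - 300 else -500
        visited.append(target)
    return visited

# Hypothetical numbers chosen to mimic FIG. 4.
p0 = 0
p1, p2, p3 = trace_answer_pitch(1600, p0)   # 1600 -> 400 -> -100
d2 = abs(p2 - p0)   # shift needed to reach P2: 400 cents
d1 = abs(p3 - p0)   # shift needed to reach P3: 100 cents
```

With these assumed values, D1 (100 cents) is smaller than D2 (400 cents), which matches the relationship described for FIG. 4.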

Thus, according to this embodiment, an answer to a question uttered by the user can be synthesized (reproduced) so that it does not sound unnatural and its perceived quality is not degraded. In addition, this embodiment does not require voice data of multiple answers at different pitches to be prepared for a single question, so it can be realized with few resources.

(C: Modifications and applications)
The present invention is not limited to the embodiment described above, and various applications and modifications such as those described below are possible. One or more of the following applications and modifications, selected arbitrarily, may also be combined as appropriate.

For the answer voice data stored in the answer library 124, answers with the same content may be stored for a plurality of persons A, B, C, and so on. Persons A, B, C, and so on may be, for example, celebrities, TV personalities, or singers, with the voice data organized into a library for each person. When the data is organized in this way, the answer voice data may be stored in the answer library 124 via a medium such as a memory card, or the voice reproduction control device 10 may be given a network connection function so that the answer voice data is downloaded from a specific server and stored in the answer library 124. Answer voice data obtained from a memory card or server may be free of charge or paid.

Meanwhile, the configuration may let the user select, via an operation input unit or the like, which person is to serve as the model answering the questions, or the person may be chosen at random for each of various conditions (day, week, month, etc.).

The answer voice data may also be a library of recordings of the user's own voice, or of the voices of the user's family members or acquaintances, captured through the microphone of the voice input unit 102 (or converted to data by a separate device). When answers are given in the voice of a familiar person, the user, on asking a question, can feel as if conversing with that person.

The answer may also be an animal call (dog, cat, etc.), and the configuration may allow the dog breed or the like to be selected as appropriate. Using an animal call as the answer provides a kind of soothing effect, as if the user were conversing with that animal.

The pitch acquisition unit 112 may instead analyze the voice data of the answer determined by the answer selection unit 110, obtain the average pitch of that voice data when reproduced at its standard, unshifted pitch, and supply data indicating this pitch to the reproduction instruction unit 114. With this configuration, the data indicating the pitch no longer needs to be associated with the answer voice data and stored in the answer library 124 in advance. In the above embodiment, the reproduction instruction unit 114 includes the target pitch determination unit 114a and the target pitch changing unit 114b, but these units may be provided separately from the reproduction instruction unit 114. Likewise, the voice reproduction control device 10 of the above embodiment includes the answer reproduction unit 116, but the answer reproduction unit 116 may be a voice synthesis device separate from the voice reproduction control device 10.

The embodiment has been described for the case where the pitch of the answer voice data is lower than the pitch of the user's question, but the invention is equally applicable to the opposite case, where the pitch of the answer voice data is higher than the pitch of the user's question.

Description of reference symbols: 102 ... voice input unit, 106 ... voice feature acquisition unit, 110 ... answer selection unit, 112 ... pitch acquisition unit, 114 ... reproduction instruction unit, 114a ... target pitch determination unit, 114b ... target pitch changing unit, 116 ... answer reproduction unit, 124 ... answer library.

Claims

A voice reproduction control device comprising:
a pitch acquisition unit that acquires a pitch based on voice data of an answer corresponding to a voice signal of an input question;
a target pitch determination unit that determines a first target pitch, the first target pitch being a target pitch that maintains a predetermined first relationship with the pitch of a specific section of the voice signal of the input question and that falls within a first pitch range determined according to the pitch based on the voice data of the answer;
a target pitch changing unit that changes the first target pitch to a second target pitch when the first target pitch does not fall within a second pitch range narrower than the first pitch range; and
a reproduction instruction unit that instructs an answer reproduction unit to reproduce the answer with the pitch based on the voice data of the answer shifted by a pitch shift amount for changing it to the pitch determined by the target pitch determination unit or the pitch changed by the target pitch changing unit.

The voice reproduction control device according to claim 1, wherein the target pitch determination unit determines the first target pitch by changing the target pitch in units of a first shift amount until it falls within the first pitch range, and
the target pitch changing unit determines the second target pitch by changing the target pitch in units of a second shift amount, smaller than the first shift amount, until it falls within the second pitch range.

The voice reproduction control device according to claim 1 or 2, wherein the first pitch range is defined in units of an octave, and the second pitch range is defined in units of a half octave.

The voice reproduction control device according to any one of claims 1 to 3, wherein the minimum pitch in a section in which the volume of the input voice signal is equal to or higher than a predetermined value is used as the pitch of the specific section.

A program that causes a computer to function as:
a pitch acquisition unit that acquires a pitch based on voice data of an answer corresponding to a voice signal of an input question;
a target pitch determination unit that determines a first target pitch, the first target pitch being a target pitch that maintains a predetermined first relationship with the pitch of a specific section of the voice signal of the input question and that falls within a first pitch range determined according to the pitch based on the voice data of the answer;
a target pitch changing unit that changes the first target pitch to a second target pitch when the first target pitch does not fall within a second pitch range narrower than the first pitch range; and
a reproduction instruction unit that instructs an answer reproduction unit to reproduce the answer with the pitch based on the voice data of the answer shifted by a pitch shift amount for changing it to the pitch determined by the target pitch determination unit or the pitch changed by the target pitch changing unit.