JP6181921B2

JP6181921B2 - Voice reproduction apparatus, voice synthesis reproduction apparatus, and programs thereof

Info

Publication number: JP6181921B2
Application number: JP2012254292A
Authority: JP
Inventors: 世木　寛之
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-11-20
Filing date: 2012-11-20
Publication date: 2017-08-16
Anticipated expiration: 2032-11-20
Also published as: JP2014102379A

Description

The present invention relates to an audio playback device that applies signal processing to audio and plays it back, and to a program therefor, as well as to a speech synthesis and playback device that applies signal processing to synthesized speech and plays it back, and to a program therefor.

Conventionally, an audio playback method has been proposed that plays back audio while changing its length in real time by converting the speech rate of the input audio (see Patent Document 1). The method proposed in Patent Document 1 processes speech uttered by a person and plays it back while converting the speech rate in real time. When converting the speech rate of the audio being listened to, it constantly monitors, in fixed processing units, the data length of the input audio, the output data length calculated in advance by a conversion function for a given expansion/contraction ratio, and the data length of the audio actually being output, so that the series of processes can be carried out without loss of information.

Patent Document 1: Japanese Patent No. 3220043 (Japanese Patent Laid-Open No. 10-301598)

However, although the method proposed in Patent Document 1 can convert the speech rate of input audio in real time and play back the converted audio, playback may be interrupted if converting the input audio takes longer than playing back the converted audio. For example, suppose that while speech-rate-converted audio A is being played back, audio B, which is to be played next, is converted in parallel. If the playback time of audio A is 5 seconds and the conversion time of audio B is 10 seconds, nothing is played for 5 seconds after playback of audio A ends.
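The arithmetic in this example can be written out as a short Python sketch. The function name `silent_gap` is an illustrative assumption and does not appear in the patent; the sketch only restates the 5-second and 10-second figures given above.

```python
def silent_gap(playback_seconds: float, conversion_seconds: float) -> float:
    """Seconds of silence when the next segment is converted while the current one plays.

    Hypothetical helper: conversion of the next segment starts when playback
    of the current segment starts, so any conversion time beyond the playback
    time is heard as silence.
    """
    return max(0.0, conversion_seconds - playback_seconds)


# Audio A plays for 5 s while audio B needs 10 s of conversion.
gap = silent_gap(playback_seconds=5.0, conversion_seconds=10.0)
print(gap)  # 5.0
```

When conversion finishes before playback does, the gap is zero, which is the condition the invention aims to maintain.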

The present invention has been made in view of this point, and its object is to provide a highly stable audio playback device and speech synthesis and playback device that can play back signal-processed audio without interruption, and programs therefor.

To solve the above problem, the audio playback device according to claim 1 is an audio playback device that applies signal processing to audio and plays it back, and comprises input audio buffering means, audio signal processing means, output audio buffering means, audio playback means, output audio buffer detection means, and audio signal processing control means.

In an audio playback device with this configuration, the input audio buffering means stores the input audio, and the audio signal processing means processes the audio stored in the input audio buffering means when a control signal is input from the audio signal processing control means. That is, the audio signal processing means processes audio when the amount of signal-processed audio stored in the output audio buffering means falls below a threshold. As a result, the amount of audio stored in the output audio buffering means, that is, the amount of signal-processed audio to be played back by the audio playback means, is always kept at a constant level.

In addition, in the audio playback device, the output audio buffering means stores the audio processed by the audio signal processing means, and the audio playback means plays back the audio stored in the output audio buffering means. The output audio buffer detection means detects the amount of audio stored in the output audio buffering means. The audio signal processing control means determines whether the amount detected by the output audio buffer detection means is less than a predetermined threshold and, if it is, outputs to the audio signal processing means a control signal instructing it to process the audio stored in the input audio buffering means.

Further, the audio playback device is configured such that, when the audio signal processing means processes a predetermined test audio, the output audio buffering means stores the processed test audio, the audio playback means plays back the test audio stored in the output audio buffering means, and the output audio buffer detection means detects the amount of test audio stored in the output audio buffering means, the threshold used by the audio signal processing control means is the minimum detected amount of test audio multiplied by a predetermined reliability factor that indicates the stability of the processing performance of the audio playback device. The reliability factor is a value set in advance by the user. When each means of the audio playback device is implemented by a computer, for example, it takes a value such as "1.0" when the CPU is fast or few programs run in parallel, and a value such as "2.0" or "3.0" when the CPU is slow or several programs run in parallel. In other words, the audio playback device can use, as the threshold in the audio signal processing control means, a value derived from the output of the output audio buffer detection means obtained by feeding in and processing the test audio.

The audio playback device according to claim 3 is the audio playback device according to claim 1 in which the audio signal processing means performs signal processing that converts the speech rate of the audio stored in the input audio buffering means. That is, the audio signal processing means converts the speech rate of audio when the amount of speech-rate-converted audio stored in the output audio buffering means falls below the threshold. As a result, the amount of audio stored in the output audio buffering means, that is, the amount of speech-rate-converted audio to be played back by the audio playback means, is always kept at a constant level.

To solve the above problem, the speech synthesis and playback device according to claim 4 is a speech synthesis and playback device that includes the audio playback device according to any one of claims 1 to 3, and further comprises speech synthesis means, input audio buffer detection means, and speech synthesis control means.

In a speech synthesis and playback device with this configuration, the speech synthesis means synthesizes speech corresponding to an input sentence when a control signal is input from the speech synthesis control means, and the input audio buffering means stores the speech synthesized by the speech synthesis means. That is, the speech synthesis means synthesizes speech when the amount of synthesized speech stored in the input audio buffering means falls below a threshold. As a result, the amount of synthesized speech stored in the input audio buffering means, that is, the amount of synthesized speech to be processed by the audio signal processing means, is always kept at a constant level.

The input audio buffer detection means detects the amount of audio stored in the input audio buffering means. The speech synthesis control means determines whether the amount detected by the input audio buffer detection means is less than a predetermined threshold and, if it is, outputs to the speech synthesis means a control signal instructing it to synthesize speech corresponding to the input sentence.
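The two threshold checks described above (one on the input buffer that gates synthesis, one on the output buffer that gates signal processing) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function name `run_synthesis_playback`, the use of chunk counts as the "amount" of audio, the string placeholders standing in for synthesis and signal processing, and the single-threaded loop in which each stage handles at most one chunk per iteration. It is not the patented implementation.

```python
from collections import deque


def run_synthesis_playback(sentences, in_threshold=2, out_threshold=2):
    """Toy cascade: synthesize -> input buffer -> process -> output buffer -> play.

    Hypothetical model. Buffer "amount" is a chunk count, and each stage
    handles at most one chunk per loop iteration.
    """
    if in_threshold < 1 or out_threshold < 1:
        raise ValueError("thresholds must be at least 1, or nothing is ever produced")

    pending = deque(sentences)  # input sentences not yet synthesized
    input_buf = deque()         # synthesized speech awaiting signal processing
    output_buf = deque()        # processed speech awaiting playback
    played = []

    while pending or input_buf or output_buf:
        # Speech synthesis control: synthesize only while the input buffer is short.
        if pending and len(input_buf) < in_threshold:
            input_buf.append(f"synth({pending.popleft()})")
        # Signal processing control: process only while the output buffer is short.
        if input_buf and len(output_buf) < out_threshold:
            output_buf.append(f"proc({input_buf.popleft()})")
        # Playback consumes the oldest processed chunk.
        if output_buf:
            played.append(output_buf.popleft())
    return played


print(run_synthesis_playback(["hello", "world"]))
```

Because each stage is gated only by the fill level of the buffer it writes to, the upstream stage never needs to know how fast the downstream stage runs.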

To solve the above problem, the audio playback program according to claim 5 causes a computer to function as the audio playback device according to any one of claims 1 to 3.

To solve the above problem, the speech synthesis and playback program according to claim 6 causes a computer to function as the speech synthesis and playback device according to claim 4.

According to the inventions of claims 1, 2, and 5, audio is processed and played back while the amount of signal-processed audio is constantly managed, so the signal-processed audio can be played back stably and without interruption.

According to the inventions of claims 2 and 3, audio is speech-rate-converted and played back while the amount of speech-rate-converted audio is constantly managed, so the speech-rate-converted audio can be played back stably and without interruption.

According to the inventions of claims 4 and 6, speech is synthesized while the amount of synthesized speech is constantly managed, and the synthesized speech is processed and played back while the amount of signal-processed synthesized speech is constantly managed, so the signal-processed synthesized speech can be played back stably and without interruption.

FIG. 1 is a block diagram showing the overall configuration of the audio playback device according to the present invention.
FIG. 2 is a flowchart showing the processing procedure of the audio playback device according to the present invention.
FIG. 3 is a block diagram showing the overall configuration of the speech synthesis and playback device according to the present invention.
FIG. 4 is a flowchart showing the processing procedure of the speech synthesis and playback device according to the present invention.

An audio playback device, a speech synthesis and playback device, and programs therefor according to embodiments of the present invention are described below with reference to the drawings. In the following description, identical components are given the same names and reference numerals, and their detailed description is omitted. The audio playback device is described first, followed by the speech synthesis and playback device and the programs.

[Configuration of the audio playback device]
The configuration of the audio playback device according to the present invention is described with reference to FIG. 1. The audio playback device 1 applies signal processing to input audio and plays it back. Specifically, as shown in FIG. 1, it processes audio input from outside according to predetermined audio signal processing parameters and plays back the processed audio (hereinafter, "signal-processed audio"). The audio playback device 1 is used, for example, when a viewer watching a television program at home adjusts the speech rate or pitch of the audio in real time with a remote control or the like.

As shown in FIG. 1, the audio playback device 1 here includes input audio buffering means 11, audio signal processing means 12, output audio buffering means 13, audio playback means 14, output audio buffer detection means 15, and audio signal processing control means 16.

The input audio buffering means 11 stores audio before signal processing. As shown in FIG. 1, it stores unprocessed audio input from outside and outputs that audio at the request of the audio signal processing means 12 described later. Specifically, the input audio buffering means 11 is a hard disk, flash memory, or other device capable of storing data. Although the input audio buffering means 11 is shown inside the audio playback device 1 in FIG. 1, it may instead be provided outside the audio playback device 1.

The audio signal processing means 12 applies signal processing to audio. As shown in FIG. 1, it processes the audio stored in the input audio buffering means 11 according to predetermined audio signal processing parameters and outputs the signal-processed audio to the output audio buffering means 13. Specific examples of the signal processing performed by the audio signal processing means 12 include speech rate conversion, voice quality conversion, pitch modification, and modification of the emotional component of speech.

When the audio signal processing means 12 performs speech rate conversion as the signal processing, an expansion/contraction ratio for the audio is input from outside as the audio signal processing parameter. The audio signal processing means 12 then detects speech sections using the power, zero-crossing count, and autocorrelation function of the audio data, and extracts the pitch period for each speech section. Based on the time length determined by the pitch period and the expansion/contraction ratio, it thins out or repeats fundamental periods of the speech waveform and joins the waveforms by overlapping them over an appropriate time length, thereby converting the speech rate. Known techniques can be used for this kind of speech rate conversion (for example, Japanese Patent No. 3327936 and Japanese Patent No. 2955247).
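The three quantities named above (power, zero-crossing count, and autocorrelation) can be computed as in the following Python sketch. This is a textbook-style illustration, not the procedure of the cited patents: the function names, the fixed lag search range, and the choice of the lag that maximizes the raw autocorrelation as the pitch period are all assumptions made here. The thinning, repetition, and overlap steps of the actual rate conversion are not shown.

```python
import math


def short_time_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def estimate_pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) that maximizes the raw autocorrelation of the frame.

    Assumed simplification: no windowing or normalization, and the search
    is restricted to min_lag..max_lag inclusive.
    """
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Synthetic voiced frame: a sine wave with a period of 50 samples.
frame = [math.sin(2 * math.pi * n / 50) for n in range(400)]
period = estimate_pitch_period(frame, min_lag=20, max_lag=100)
print(period)  # 50
```

A frame with high power and a low zero-crossing count is a typical candidate for a voiced speech section, for which a pitch period estimate of this kind is meaningful.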

When the audio signal processing means 12 performs processing that changes the emotional component of speech, an emotion parameter, for example one corresponding to joy, anger, sorrow, or pleasure, is input from outside as the audio signal processing parameter. The audio signal processing means 12 then changes the emotional component of the speech by adjusting the fundamental frequency and spectrum of the speech waveform according to that emotion parameter.

Here, the audio signal processing means 12 does not simply process whatever audio is stored in the input audio buffering means 11. Instead, it performs signal processing according to the amount of audio it has already processed. That is, as shown in FIG. 1, it processes the audio stored in the input audio buffering means 11 when the amount of signal-processed audio already stored in the output audio buffering means 13 is less than a predetermined threshold.

More specifically, as shown in FIG. 1, when the amount of signal-processed audio stored in the output audio buffering means 13 is less than the threshold, a control signal instructing signal processing is input to the audio signal processing means 12 from the audio signal processing control means 16. Only when this control signal is input does the audio signal processing means 12 take the required number of audio items from the input audio buffering means 11, oldest first, apply signal processing such as speech rate conversion, and output the signal-processed audio to the output audio buffering means 13. Here, the audio signal processing means 12 performs this signal processing in parallel with the playback of signal-processed audio by the audio playback means 14 described later.

The output audio buffering means 13 stores signal-processed audio. As shown in FIG. 1, it stores the signal-processed audio input from the audio signal processing means 12 and outputs it at the request of the audio playback means 14 described later. Specifically, the output audio buffering means 13 is a hard disk, flash memory, or other device capable of storing data. Although the output audio buffering means 13 is shown inside the audio playback device 1 in FIG. 1, it may instead be provided outside the audio playback device 1.

The audio playback means 14 plays back signal-processed audio. As shown in FIG. 1, it takes the required number of signal-processed audio items from the output audio buffering means 13, oldest first, and outputs them to an audio device (not shown) such as a speaker for playback. Here, the audio playback means 14 plays back signal-processed audio in parallel with the signal processing performed by the audio signal processing means 12.

The output audio buffer detection means 15 detects the amount of signal-processed audio. As shown in FIG. 1, it continuously detects, at a predetermined sampling period, the amount of signal-processed audio stored in the output audio buffering means 13 and outputs that amount to the audio signal processing control means 16.

The audio signal processing control means 16 controls the signal processing in the audio signal processing means 12. As shown in FIG. 1, the amount of signal-processed audio stored in the output audio buffering means 13 is input to the audio signal processing control means 16 from the output audio buffer detection means 15. The audio signal processing control means 16 determines whether that amount is less than a predetermined threshold and, if it is, outputs to the audio signal processing means 12 a control signal instructing it to process the audio stored in the input audio buffering means 11. In this way, the audio playback device 1 uses the audio signal processing control means 16 to decide, according to the amount of audio stored in the output audio buffering means 13, whether the audio signal processing means 12 needs to perform signal processing, and thereby controls the signal processing of the audio.
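The decision made by the audio signal processing control means 16 reduces to a single comparison, sketched below in Python. The class name `ProcessingController`, the use of seconds as the unit of "amount", and the boolean return value standing in for the control signal are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ProcessingController:
    """Hypothetical stand-in for the audio signal processing control means 16."""

    threshold_seconds: float

    def control_signal(self, buffered_seconds: float) -> bool:
        """Return True (process more audio) only when the buffered amount
        is strictly less than the threshold, as the text specifies."""
        return buffered_seconds < self.threshold_seconds


controller = ProcessingController(threshold_seconds=1.0)
print(controller.control_signal(0.4))  # True: the output buffer is running low
print(controller.control_signal(1.0))  # False: exactly at the threshold, so wait
```

The strict inequality matters at the boundary: an amount equal to the threshold does not trigger processing.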

The threshold is a value obtained in advance empirically and experimentally, and is determined according to the performance of the hardware that implements the present invention (for example, the CPU and the data transfer rate). Alternatively, under the conditions described next, the threshold can be the minimum amount of test audio detected by the output audio buffer detection means 15 multiplied by a predetermined reliability factor that indicates the stability of the processing performance of the audio playback device 1. The reliability factor is a value set in advance by the user. When each means of the audio playback device 1 is implemented by a computer, for example, it takes "1.0" when the CPU is fast or few programs run in parallel, and "2.0" or "3.0" when the CPU is slow or several programs run in parallel.

The conditions are as follows: the audio signal processing means 12 processes a predetermined test audio, the output audio buffering means 13 stores the test audio processed by the audio signal processing means 12, the audio playback means 14 plays back the test audio stored in the output audio buffering means 13, and the output audio buffer detection means 15 detects the amount of test audio stored in the output audio buffering means 13.
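The threshold rule described above (the minimum buffered amount observed during a test run, multiplied by the reliability factor) can be sketched in Python as follows. The function name `calibrate_threshold`, the unit of seconds, and the sample measurements are illustrative assumptions; only the formula itself comes from the text.

```python
def calibrate_threshold(test_buffer_amounts, reliability_factor):
    """Threshold = minimum detected test-audio amount x reliability factor.

    test_buffer_amounts: amounts reported by the output buffer detector
        while the test audio was processed and played (hypothetical units: seconds).
    reliability_factor: user-set value, e.g. 1.0 for a fast, lightly loaded
        machine, 2.0 or 3.0 for a slow or heavily loaded one.
    """
    if not test_buffer_amounts:
        raise ValueError("at least one measurement is required")
    if reliability_factor <= 0:
        raise ValueError("reliability factor must be positive")
    return min(test_buffer_amounts) * reliability_factor


# Made-up measurements from a test run, with a cautious factor of 2.0.
threshold = calibrate_threshold([0.8, 0.5, 1.2], reliability_factor=2.0)
print(threshold)  # 1.0
```

A larger reliability factor raises the threshold, so processing is triggered earlier and more processed audio is held in reserve on less predictable hardware.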

In the audio playback device 1 configured as described above, the audio signal processing means 12 processes audio (for example, converts its speech rate) when the amount of signal-processed audio (for example, speech-rate-converted audio) stored in the output audio buffering means 13 falls below the threshold. As a result, the amount of audio stored in the output audio buffering means 13, that is, the amount of signal-processed audio to be played back by the audio playback means 14, is always kept at a constant level. The audio playback device 1 therefore processes and plays back audio while constantly managing the amount of signal-processed audio, and can play back the signal-processed audio stably and without interruption.

Playback would also never be interrupted if, unlike the audio playback device 1 according to the present invention, the speech rate were not converted in real time but, for example, all of the audio were converted in a batch beforehand and played back only after conversion was complete. In that case, however, the time at which the conversion is performed differs from the time at which the converted audio is played back, so the expansion/contraction ratio specified at conversion time can differ from the ratio requested at playback time. In other words, audio that was converted in advance with a ratio of "2x" naturally cannot be changed to any speech rate other than "2x" at playback time.

Conventionally, therefore, there has been a trade-off. Performing speech rate conversion and playback in parallel makes the device respond quickly to a change in the specified conversion ratio (expansion/contraction ratio) but risks interrupting the playback, whereas performing conversion and playback separately prevents interruptions but makes the response to a change in the specified ratio slow. With the configuration described above, the audio playback device 1 according to the present invention achieves both a quick response to the specified conversion ratio and uninterrupted playback.

[Processing procedure of the audio playback device]
The processing procedure of the audio playback device 1 according to the present invention is described with reference to FIG. 2 (and FIG. 1 as appropriate).

First, unprocessed audio is input from outside to the input audio buffering means 11 of the audio playback device 1 (step S1). Next, the output audio buffer detection means 15 detects the amount of signal-processed audio stored in the output audio buffering means 13 (step S2). The audio signal processing control means 16 then determines whether the amount of signal-processed audio is less than the threshold (step S3).

If the amount of signal-processed audio is less than the threshold (Yes in step S3), the audio signal processing means 12 processes the audio stored in the input audio buffering means 11 (step S4), and the output audio buffering means 13 stores the signal-processed audio (step S5). Next, the audio playback means 14 plays back the signal-processed audio stored in the output audio buffering means 13 through an audio device such as a speaker (step S6). If the input of audio from outside has ended (Yes in step S7), the audio playback device 1 proceeds to step S8. If it has not ended (No in step S7), the device returns to step S1 and repeats the above processing.

If playback of all signal-processed audio has finished (Yes in step S8), the audio playback device 1 ends the processing. If it has not finished (No in step S8), the device returns to step S2 and repeats the above processing. One way to determine in step S8 whether playback of all signal-processed audio has finished is, for example, for the audio signal processing means 12 and the audio playback means 14 to detect whether any audio remains in the input audio buffering means 11 and in the output audio buffering means 13, respectively.

On the other hand, when the amount of signal-processed audio in the output audio buffering means 13 is equal to or greater than the threshold (No in step S3), the audio reproduction device 1 waits without performing signal processing (step S9) and then carries out the processing from step S6 onward. The audio reproduction device 1 reproduces signal-processed audio through the procedure described above.
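The flow of steps S2 to S9 can be illustrated with a minimal single-threaded sketch. It assumes that the "amount" of audio is counted in chunks and that `process` and `play` are caller-supplied stand-ins for the audio signal processing means 12 and the audio reproduction means 14; the names `playback_step` and `run_playback` are hypothetical and do not appear in the specification.

```python
from collections import deque


def playback_step(input_buf, output_buf, threshold, process, play):
    """One pass over steps S2 to S6 (or S9) for audio that is already buffered."""
    # S2/S3: detect the amount of processed audio and compare it with the threshold.
    if len(output_buf) < threshold and input_buf:
        # S4/S5: signal-process the oldest input chunk and store the result.
        output_buf.append(process(input_buf.popleft()))
    # Otherwise (S9): skip signal processing on this pass.
    # S6: reproduce the oldest processed chunk, if there is one.
    if output_buf:
        play(output_buf.popleft())


def run_playback(chunks, threshold, process):
    """Drive playback_step until both buffers are empty (the S8 check)."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1 chunk")
    input_buf = deque(chunks)  # stands in for audio already received in S1
    output_buf = deque()
    played = []
    while input_buf or output_buf:
        playback_step(input_buf, output_buf, threshold, process, played.append)
    return played
```

In a real device the processing and playback would run at different speeds, so the output buffer would fill and drain over time; this sketch only shows the order of the decisions.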

[Configuration of the speech synthesis/playback device]
The configuration of the speech synthesis/playback device 2 according to the present invention is described with reference to FIG. 3. The speech synthesis/playback device 2 performs signal processing on synthesized speech and reproduces it. Specifically, as shown in FIG. 3, it synthesizes speech from an externally supplied input sentence (text), performs signal processing on the synthesized speech based on predetermined audio signal processing parameters, and reproduces the signal-processed synthesized speech (hereinafter, "signal-processed synthesized speech"). The speech synthesis/playback device 2 is used for the same applications as the audio reproduction device 1 described above.

Here, as shown in FIG. 3, the speech synthesis/playback device 2 includes, in addition to the components of the audio reproduction device 1 described above (the input audio buffering means 11, the audio signal processing means 12, the output audio buffering means 13, the audio reproduction means 14, the output audio buffer detection means 15, and the audio signal processing control means 16), a speech synthesis means 21, an input audio buffer detection means 22, and a speech synthesis control means 23. In the following, detailed description of components shared with the audio reproduction device 1 is omitted.

The speech synthesis means 21 synthesizes speech corresponding to an input sentence (text). As shown in FIG. 3, it performs speech synthesis based on externally supplied text, using speech data and speech synthesis parameters. The speech synthesis means 21 can use a common method such as HMM-based speech synthesis, waveform editing, or waveform concatenation. Examples of the speech synthesis parameters include volume, speech rate, pitch (fundamental frequency), and spectrum.

Here, the speech synthesis means 21 does not simply synthesize speech from the input sentence; it synthesizes speech according to the amount of speech it has already synthesized. That is, as shown in FIG. 3, it synthesizes speech from the input sentence when the amount of already-synthesized speech (hereinafter, "synthesized speech") stored in the input audio buffering means 11 is less than a predetermined threshold.

More specifically, as shown in FIG. 3, when the amount of synthesized speech stored in the input audio buffering means 11 is less than the threshold, the speech synthesis control means 23 inputs to the speech synthesis means 21 a control signal instructing it to synthesize speech. Only when this control signal is input does the speech synthesis means 21 synthesize speech from the input sentence and output the synthesized speech to the input audio buffering means 11. Here, the speech synthesis means 21 synthesizes speech in parallel with the signal processing of synthesized speech by the audio signal processing means 12 and the reproduction of signal-processed synthesized speech by the audio reproduction means 14. The unit of synthesis is not particularly limited; for example, speech can be synthesized phrase by phrase or sentence by sentence.

The input audio buffering means 11 has the same configuration as in the audio reproduction device 1 described above, but here it stores speech synthesized by the speech synthesis means 21 rather than externally input audio.

The input audio buffer detection means 22 detects the amount of synthesized speech. As shown in FIG. 3, it continuously detects, at a predetermined sampling period, the amount of synthesized speech stored in the input audio buffering means 11 and outputs that amount to the speech synthesis control means 23.

The speech synthesis control means 23 controls speech synthesis in the speech synthesis means 21. As shown in FIG. 3, it receives from the input audio buffer detection means 22 the amount of synthesized speech stored in the input audio buffering means 11. It then determines whether that amount is less than a predetermined threshold and, if so, outputs to the speech synthesis means 21 a control signal instructing it to synthesize speech from the input sentence. In this way, the speech synthesis/playback device 2 can use the speech synthesis control means 23 to decide, according to the amount of synthesized speech stored in the input audio buffering means 11, whether synthesis by the speech synthesis means 21 is needed, and thereby control speech synthesis.
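The interaction between the detection means 22, the control means 23, and the synthesis means 21 might be sketched as follows. This is only an illustration under stated assumptions: the buffered amount is measured in seconds, each buffer entry is an `(audio, seconds)` pair, and `buffered_seconds`, `SynthesisController`, and `maybe_synthesize` are hypothetical names introduced here.

```python
from collections import deque


def buffered_seconds(input_buf):
    """Stand-in for the input audio buffer detection means 22."""
    return sum(seconds for _, seconds in input_buf)


class SynthesisController:
    """Stand-in for the speech synthesis control means 23."""

    def __init__(self, threshold_seconds):
        self.threshold_seconds = threshold_seconds

    def should_synthesize(self, amount_seconds):
        # The control signal is issued only while the amount is below the threshold.
        return amount_seconds < self.threshold_seconds


def maybe_synthesize(pending_phrases, input_buf, controller, synthesize):
    """Synthesize the next phrase only if the controller signals it."""
    if pending_phrases and controller.should_synthesize(buffered_seconds(input_buf)):
        # `synthesize` is a caller-supplied function returning (audio, seconds).
        input_buf.append(synthesize(pending_phrases.popleft()))
        return True
    return False
```

Calling `maybe_synthesize` once per sampling period reproduces the gating behavior: synthesis proceeds phrase by phrase while the buffer is short and pauses once enough synthesized speech is waiting.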

The threshold is a value determined in advance empirically and experimentally, according to the performance of the hardware that embodies or implements the present invention (for example, the CPU and the data transfer rate).

In the speech synthesis/playback device 2 configured as described above, the speech synthesis means 21 synthesizes speech when the amount of synthesized speech stored in the input audio buffering means 11 falls below the threshold. As a result, the amount of synthesized speech stored in the input audio buffering means 11, that is, the amount of synthesized speech waiting to be signal-processed (for example, speech-rate converted) by the audio signal processing means 12, is always kept at a constant level. Likewise, the audio signal processing means 12 signal-processes synthesized speech when the amount of signal-processed synthesized speech (for example, speech-rate-converted synthesized speech) stored in the output audio buffering means 13 falls below the threshold. As a result, the amount of signal-processed synthesized speech stored in the output audio buffering means 13, that is, the amount waiting to be reproduced by the audio reproduction means 14, is always kept at a constant level. Therefore, because the speech synthesis/playback device 2 synthesizes speech while constantly managing the amount of synthesized speech, and signal-processes and reproduces the synthesized speech while constantly managing the amount of signal-processed synthesized speech, it can reproduce the signal-processed synthesized speech stably and without interruption.

[Processing procedure of the speech synthesis/playback device]
The processing procedure of the speech synthesis/playback device 2 according to the present invention is described with reference to FIG. 4 (and FIG. 3 as appropriate).

First, text is input from outside to the speech synthesis means 21 of the speech synthesis/playback device 2 (step S11). Next, the speech synthesis/playback device 2 uses the input audio buffer detection means 22 to detect the amount of synthesized speech stored in the input audio buffering means 11 (step S12). It then uses the speech synthesis control means 23 to determine whether the amount of synthesized speech is less than the threshold (step S13).

When the amount of synthesized speech is less than the threshold (Yes in step S13), the speech synthesis/playback device 2 uses the speech synthesis means 21 to synthesize speech corresponding to the text (step S14), and uses the input audio buffering means 11 to store the synthesized speech (step S15). Next, it uses the output audio buffer detection means 15 to detect the amount of signal-processed synthesized speech stored in the output audio buffering means 13 (step S16). It then uses the audio signal processing control means 16 to determine whether the amount of signal-processed synthesized speech is less than the threshold (step S17).

When the amount of signal-processed synthesized speech is less than the threshold (Yes in step S17), the speech synthesis/playback device 2 uses the audio signal processing means 12 to signal-process the synthesized speech stored in the input audio buffering means 11 (step S18), and uses the output audio buffering means 13 to store the signal-processed synthesized speech (step S19). Next, it uses the audio reproduction means 14 to reproduce the signal-processed synthesized speech stored in the output audio buffering means 13 through an audio device such as a speaker (step S20). Then, if the external text input has ended (Yes in step S21), the speech synthesis/playback device 2 proceeds to step S22; if the external text input has not ended (No in step S21), it returns to step S11 and repeats the processing described above.

If reproduction of all the signal-processed synthesized speech has finished (Yes in step S22), the speech synthesis/playback device 2 ends the processing; if it has not finished (No in step S22), the device returns to step S12 and repeats the processing described above. One way to make the determination in step S22 is, for example, to have the audio signal processing means 12 and the audio reproduction means 14 detect whether any audio remains in the input audio buffering means 11 and in the output audio buffering means 13, respectively.

On the other hand, when the amount of synthesized speech in the input audio buffering means 11 is equal to or greater than the threshold (No in step S13), the speech synthesis/playback device 2 waits without synthesizing speech (step S23) and then carries out the processing from step S16 onward. Similarly, when the amount of signal-processed synthesized speech in the output audio buffering means 13 is equal to or greater than the threshold (No in step S17), the device waits without performing signal processing (step S24) and then carries out the processing from step S20 onward. The speech synthesis/playback device 2 reproduces signal-processed synthesized speech through the procedure described above.
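Steps S12 to S24 chain two threshold checks, one per buffer. The following is a minimal single-threaded sketch of that chain, assuming amounts are counted in buffer entries and that the default `synthesize` and `process` functions are placeholders that merely wrap strings; `pipeline_step` and `speak` are hypothetical names, and a real device would run the stages in parallel as the description states.

```python
from collections import deque


def pipeline_step(phrases, in_buf, out_buf, synth_threshold, proc_threshold,
                  synthesize, process, play):
    """One pass over steps S12 to S20 (with S23/S24 as the skipped branches)."""
    # S12/S13: is too little synthesized speech waiting in the input buffer?
    if len(in_buf) < synth_threshold and phrases:
        in_buf.append(synthesize(phrases.popleft()))   # S14/S15
    # S16/S17: is too little processed speech waiting in the output buffer?
    if len(out_buf) < proc_threshold and in_buf:
        out_buf.append(process(in_buf.popleft()))      # S18/S19
    # S20: reproduce the oldest processed entry, if any.
    if out_buf:
        play(out_buf.popleft())


def speak(text_phrases, synth_threshold=2, proc_threshold=2,
          synthesize=lambda p: f"wav({p})", process=lambda w: f"fast({w})"):
    """Run the pipeline until the text and both buffers are exhausted (S21/S22)."""
    if synth_threshold < 1 or proc_threshold < 1:
        raise ValueError("thresholds must be at least 1")
    phrases = deque(text_phrases)
    in_buf, out_buf, played = deque(), deque(), []
    while phrases or in_buf or out_buf:
        pipeline_step(phrases, in_buf, out_buf, synth_threshold, proc_threshold,
                      synthesize, process, played.append)
    return played
```

The two thresholds are kept as separate parameters because the description treats the synthesis control and the signal processing control as independent decisions.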

[Specific example of speech synthesis and playback processing]
A specific example of processing by the speech synthesis/playback device 2 according to the present invention is described with reference to FIG. 3. As an example, consider synthesizing speech from the input sentence "自分だけ費用を負担しないということであれば、かなり失礼ではないですかね？" ("If you alone are not going to bear the cost, isn't that rather rude?") and converting its speech rate. In this example, the input audio buffering means 11 and the output audio buffering means 13 initially hold no data.

First, the input audio buffer detection means 22 detects the amount of synthesized speech in the input audio buffering means 11 and outputs "0" to the speech synthesis control means 23. Next, because the output "0" of the input audio buffer detection means 22 is below the threshold, the speech synthesis control means 23 outputs to the speech synthesis means 21 a control signal instructing it to synthesize speech from the input sentence. The speech synthesis means 21 then generates synthesized speech corresponding to the input sentence and outputs it to the input audio buffering means 11. Assume that, at this point, the speech synthesis means 21 has generated and output synthesized speech corresponding to "自分だけ" ("you alone") in the input sentence. The input audio buffering means 11 then stores the synthesized speech input from the speech synthesis means 21.

Next, as before, the input audio buffer detection means 22 detects the amount of synthesized speech in the input audio buffering means 11 and outputs to the speech synthesis control means 23 the length of the synthesized speech for "自分だけ" that was just stored. The speech synthesis control means 23 receives this output; if it is below the threshold, the control means outputs to the speech synthesis means 21 a control signal instructing it to synthesize speech from the input sentence, and if it is equal to or greater than the threshold, the control means does nothing.

Meanwhile, the output audio buffer detection means 15 detects the amount of signal-processed synthesized speech in the output audio buffering means 13 and outputs "0" to the audio signal processing control means 16. Because the output "0" of the output audio buffer detection means 15 is below the threshold, the audio signal processing control means 16 outputs to the audio signal processing means 12 a control signal instructing it to signal-process the synthesized speech. The audio signal processing means 12 then takes as many items of synthesized speech as needed from the input audio buffering means 11, oldest first, performs speech-rate conversion to generate signal-processed synthesized speech, and outputs it to the output audio buffering means 13. Next, the audio reproduction means 14 takes as many items of signal-processed synthesized speech as needed from the output audio buffering means 13, oldest first, and outputs them to an audio device (not shown) such as a speaker for reproduction.

Next, as before, the output audio buffer detection means 15 detects the amount of signal-processed synthesized speech in the output audio buffering means 13 and outputs it to the audio signal processing control means 16. The audio signal processing control means 16 receives this output; if it is below the threshold, the control means outputs to the audio signal processing means 12 a control signal instructing it to signal-process the synthesized speech, and if it is equal to or greater than the threshold, the control means does nothing.

After a while, as the signal-processed synthesized speech is reproduced, the output of the output audio buffer detection means 15 falls below the threshold, so the audio signal processing means 12 converts the speech rate of the synthesized speech stored in the input audio buffering means 11 and outputs the result to the output audio buffering means 13. At the same time, the output of the input audio buffer detection means 22 also falls below the threshold, so the speech synthesis means 21 synthesizes speech from the input sentence and outputs the synthesized speech to the input audio buffering means 11. In this processing, the timing of the speech synthesis and speech-rate conversion operations is important. Setting the thresholds used by the speech synthesis control means 23 and the audio signal processing control means 16 to the minimum values at which operation remains stable makes it possible to reflect in the speech the most recent speech synthesis parameters, such as emotion, and the most recent signal processing parameters, such as the stretch ratio for speech-rate conversion.
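The link between a small threshold and fresh parameters can be illustrated with a toy simulation. It is a simplified model rather than the device's actual timing: it assumes processing is fast enough to refill the buffer up to the threshold on every tick, that playback consumes exactly one chunk per tick, and that each chunk is tagged with the stretch ratio in force when it was processed. The function name `ticks_until_audible` and the ratio values 1.0 and 1.5 are hypothetical.

```python
from collections import deque


def ticks_until_audible(threshold, change_tick=20, total_ticks=60):
    """Return how many ticks pass between a parameter change and the
    first reproduced chunk that was processed with the new parameter.

    threshold must be at least 1 so that there is always a chunk to play.
    """
    buf = deque()
    ratio = 1.0
    for tick in range(total_ticks):
        if tick == change_tick:
            ratio = 1.5  # the user changes the stretch ratio here
        # Refill: process chunks while the buffer is below the threshold.
        while len(buf) < threshold:
            buf.append(ratio)  # tag the chunk with the ratio used to process it
        # Playback consumes one chunk per tick, oldest first.
        if buf.popleft() == 1.5:
            return tick - change_tick
    return None
```

In this model the delay equals the number of chunks already buffered at the moment of the change, which is one less than the threshold, so a threshold of 1 reflects the new ratio immediately while a threshold of 8 delays it by seven ticks. The price of a smaller threshold is less slack if processing stalls, which is why the description asks for the smallest value that still keeps operation stable.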

[Audio reproduction program and speech synthesis/playback program]
The audio reproduction device 1 and the speech synthesis/playback device 2 described above can be realized by running, on a general-purpose computer, a program that causes the computer to function as each of the means and units described above. This program can be distributed over a communication line, or written to a recording medium such as a CD-ROM and distributed.

Specifically, the audio reproduction program can cause the computer of the audio reproduction device 1, which includes the input audio buffering means 11 for storing input audio and the output audio buffering means 13 for storing audio signal-processed by the audio signal processing means 12, and which signal-processes and reproduces audio, to function as the audio signal processing means 12 and the audio reproduction means 14 described above. Likewise, the speech synthesis/playback program can cause the computer of the speech synthesis/playback device 2, which includes the input audio buffering means 11 for storing synthesized speech produced by the speech synthesis means 21 and the output audio buffering means 13 for storing signal-processed synthesized speech produced by the audio signal processing means 12, and which signal-processes and reproduces synthesized speech, to function as the speech synthesis means 21, the audio signal processing means 12, and the audio reproduction means 14 described above.

The audio reproduction device, the speech synthesis/playback device, and their programs according to the present invention have been described above in detail by way of embodiments. However, the gist of the present invention is not limited to these descriptions and must be construed broadly based on the claims. It goes without saying that various changes and modifications made on the basis of these descriptions are also included in the gist of the present invention.

[Description of symbols]
1 Audio reproduction device
11 Input audio buffering means
12 Audio signal processing means
13 Output audio buffering means
14 Audio reproduction means
15 Output audio buffer detection means
16 Audio signal processing control means
2 Speech synthesis/playback device
21 Speech synthesis means
22 Input audio buffer detection means
23 Speech synthesis control means

Claims

An audio reproduction device that performs signal processing on audio and reproduces it, comprising:
input audio buffering means for storing input audio;
audio signal processing means for signal-processing the audio stored in the input audio buffering means;
output audio buffering means for storing the audio signal-processed by the audio signal processing means;
audio reproduction means for reproducing the audio stored in the output audio buffering means;
output audio buffer detection means for detecting the amount of audio stored in the output audio buffering means; and
audio signal processing control means for determining whether the amount of audio detected by the output audio buffer detection means is less than a predetermined threshold and, if it is less than the threshold, outputting to the audio signal processing means a control signal instructing it to signal-process the audio stored in the input audio buffering means,
wherein, when the audio signal processing means performs signal processing on a predetermined test audio,
the output audio buffering means stores the test audio signal-processed by the audio signal processing means,
the audio reproduction means reproduces the test audio stored in the output audio buffering means,
and the output audio buffer detection means detects the amount of test audio stored in the output audio buffering means,
the threshold used in the audio signal processing control means is the minimum value of the amount of test audio detected by the output audio buffer detection means multiplied by a predetermined reliability ratio indicating the stability of the processing performance of the audio reproduction device, and
the audio signal processing means performs signal processing on the audio stored in the input audio buffering means when the control signal is input from the audio signal processing control means.

An audio reproduction device that performs signal processing on audio and reproduces it, comprising:
input audio buffering means for storing input audio;
audio signal processing means for signal-processing the audio stored in the input audio buffering means;
output audio buffering means for storing the audio signal-processed by the audio signal processing means;
audio reproduction means for reproducing the audio stored in the output audio buffering means;
output audio buffer detection means for detecting the amount of audio stored in the output audio buffering means; and
audio signal processing control means for determining whether the amount of audio detected by the output audio buffer detection means is less than a predetermined threshold and, if it is less than the threshold, outputting to the audio signal processing means a control signal instructing it to signal-process the audio stored in the input audio buffering means,
wherein the audio signal processing means, when the control signal is input from the audio signal processing control means, performs signal processing that converts the speech rate of the audio stored in the input audio buffering means.

The audio reproduction device according to claim 1, wherein the audio signal processing means performs signal processing that converts the speech rate of the audio stored in the input audio buffering means.

A speech synthesis/playback device comprising the audio reproduction device according to any one of claims 1 to 3, and further comprising:
speech synthesis means for synthesizing speech corresponding to an input sentence;
input audio buffer detection means for detecting the amount of audio stored in the input audio buffering means; and
speech synthesis control means for determining whether the amount of audio detected by the input audio buffer detection means is less than a predetermined threshold and, if it is less than the threshold, outputting to the speech synthesis means a control signal instructing it to synthesize speech corresponding to the input sentence,
wherein the speech synthesis means synthesizes speech corresponding to the input sentence when the control signal is input from the speech synthesis control means, and
the input audio buffering means stores the speech synthesized by the speech synthesis means.

An audio reproduction program for causing a computer to function as the audio reproduction apparatus according to any one of claims 1 to 3.

A speech synthesis / playback program for causing a computer to function as the speech synthesis / playback device according to claim 4.