JPH05307395A

JPH05307395A - Voice synthesizer

Info

Publication number: JPH05307395A
Application number: JP4135706A
Authority: JP
Inventors: Yoshiaki Oikawa; 芳明及川
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1992-04-30
Filing date: 1992-04-30
Publication date: 1993-11-19

Abstract

PURPOSE: To obtain synthesized speech that is easy to hear under changing ambient noise conditions. CONSTITUTION: The synthesizer comprises a synthesized speech signal generating circuit 20; a speaker 6 that converts the synthesized speech signal into sound waves and outputs them; a microphone 16 that picks up the synthesized speech together with ambient noise and converts them into an audio signal; and an ambient sound extraction and analysis circuit 21 that extracts and analyzes the ambient noise component of the audio signal picked up by the microphone 16. The parameters used to generate the synthesized speech signal are adjusted based on the result of the analysis.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesizer that artificially produces speech.

[0002]

[Prior Art] Recent speech synthesis methods for artificially producing speech include, for example: the recording-and-editing method, in which speech waveforms uttered by a person are stored as they are or after waveform coding and are concatenated and output as needed; the parameter-editing method, in which speech waveforms uttered by a person are analyzed and stored in parametric form, and the stored parameters are concatenated to drive a speech synthesizer that produces the speech; and the synthesis-by-rule method, in which speech is produced from a character string or phoneme symbol string on the basis of phonetic and linguistic rules.

[0003] That is, speech synthesis by the recording-and-editing method stores (records) speech uttered by a person in advance in units such as words or phrases, and reads out and concatenates (edits) those units as needed to synthesize speech. In general, with the recording-and-editing method, the quality of the spoken sentence depends on how well the acoustic features of the speech (spectral envelope, amplitude, fundamental frequency, speaking rate, and so on) remain continuous across the joins between different stored units. The larger the stored unit, such as a phrase or a sentence, the better the quality (intelligibility, naturalness, and so on) of the speech produced, but the vocabulary and the kinds of sentences that can be synthesized are limited. Conversely, if the stored unit is made small, such as a syllable or a single phone, the freedom of vocabulary and sentences that can be synthesized increases, but speech quality drops markedly.

[0004] Speech synthesis by the parameter-editing method also uses units such as words and phrases, as in the recording-and-editing method, but speech uttered by a person is analyzed in advance on the basis of a speech production model and stored as parameter time series; the parameter time series, concatenated as needed, then drive a speech synthesizer to synthesize the speech. Because the speech is stored as sound source and spectral envelope parameters, the naturalness of the synthesized speech is somewhat lower, depending on the method, than when waveforms are stored, but the information can be compressed substantially. Furthermore, by manipulating the parameters, it is possible to stretch or shrink durations and to smooth pitch and spectral changes at the joins.

[0005] Furthermore, speech synthesis by the synthesis-by-rule method stores feature parameters in small basic units such as syllables, phonemes, or single-pitch-period waveforms, and instead precisely defines the rules for concatenating them and the rules for controlling prosodic information such as pitch and amplitude, with the aim of synthesizing any utterance from a sequence of phoneme symbols, syllable symbols, or characters. For the speech to sound natural and be easy to listen to, changes in pitch and stress and temporal changes in the spectrum must be smooth, and pauses and the like must be natural. In this method, therefore, the control rules (control information and control mechanism) for the acoustic parameters, based on the phonetic and linguistic features of natural speech, play an important role alongside the quality of the basic units used for synthesis.

[0006] FIG. 8 shows an example of a conventional speech synthesizer that synthesizes speech artificially as described above. In FIG. 8, the sound source generation unit 2 generates sound source data based on sound source parameters supplied from the sound source parameter storage unit 7. This sound source data is sent to the speech synthesis filter 3. The speech synthesis filter 3 is also supplied with spectrum parameters from the spectrum parameter storage unit 8, so that filtering the sound source data in the speech synthesis filter 3 according to the spectrum parameters yields the waveform data of the synthesized speech signal. This waveform data is sent to the level adjuster 4. The level adjuster 4 is also supplied with level adjustment data from the volume control unit 1, corresponding to the volume set by the user, and adjusts the level of the synthesized speech signal according to that data. The level-adjusted synthesized speech signal is converted into an analog synthesized speech signal by the digital/analog (D/A) converter 5 and is then output as synthesized speech by sound emitting means such as the speaker 6.
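The signal chain of FIG. 8 (sound source generation, synthesis filtering, level adjustment) can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent does not specify the excitation or the filter structure, so the impulse-train source, the all-pole (LPC-style) filter, and every function and parameter name here are hypothetical.

```python
def impulse_train(n_samples, pitch_hz, sample_rate):
    """Assumed voiced excitation: one unit impulse per pitch period."""
    period = max(1, round(sample_rate / pitch_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, a):
    """Assumed synthesis filter: y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


def adjust_level(signal, gain):
    """Level adjuster: scale the waveform by the user-set gain."""
    return [gain * s for s in signal]


def synthesize(n_samples, pitch_hz, a, gain, sample_rate=8000):
    """Source parameters -> excitation -> spectral shaping -> level adjustment."""
    source = impulse_train(n_samples, pitch_hz, sample_rate)
    wave = all_pole_filter(source, a)
    return adjust_level(wave, gain)
```

Here `pitch_hz` stands in for the sound source parameters, the coefficient list `a` for the spectrum parameters, and `gain` for the level adjustment data from the volume control.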

[0007]

[Problems to be Solved by the Invention] The conventional speech synthesizer described above assumes that it will be used in a constant acoustic environment, so its speech synthesis output level is normally set to a fixed value.

[0008] However, with such a conventional speech synthesizer, the following problems arise when the surrounding acoustic environment changes.

[0009] The surrounding acoustic environment changes when, for example, the speech synthesizer is used in a mobile robot or inside an automobile. When the surrounding acoustic environment changes in this way, the synthesized speech becomes hard to hear. To make the synthesized speech easier to hear, a user of the conventional speech synthesizer therefore has to, for example, adjust the volume to suit the surrounding acoustic environment, use headphones, or have the speech synthesized again (have the device repeat itself). Such a device cannot be called easy to use. The conventional speech synthesizer thus has the problem that it does not adequately provide for cases in which the noise environment in which it is used changes.

[0010] The present invention has been proposed in view of the above circumstances, and its object is to provide a speech synthesizer that can produce synthesized speech that is easy to hear even when the ambient noise environment changes.

[0011]

[Means for Solving the Problems] The speech synthesizer of the present invention has been proposed to achieve the above object. It comprises: synthesized speech signal generating means for generating a synthesized speech signal using predetermined parameters; sound emitting means for converting the synthesized speech signal from the synthesized speech signal generating means into sound waves and outputting synthesized speech; sound collecting means for picking up ambient noise together with the synthesized speech output from the sound emitting means and converting them into an audio signal; and analyzing means for extracting the ambient noise component of the audio signal picked up by the sound collecting means and analyzing that component. The parameters used when the synthesized speech signal generating means generates the synthesized speech signal are adjusted based on the analysis result from the analyzing means (the result of analyzing the ambient noise component).

[0012] That is, the speech synthesizer of the present invention is, for example, a speech synthesizer used in an environment where the ambient noise changes. Even when the state of the ambient noise changes, it takes the surrounding sound into the device using sound collecting means such as a microphone, analyzes the ambient noise, and adjusts the parameters used in synthesizing speech based on the analysis result, thereby producing speech synthesis output that is easy to hear.

[0013] In another form, the speech synthesizer of the present invention comprises: synthesized speech signal generating means for generating a synthesized speech signal using sound source parameters and spectrum parameters, adjusting its level, and outputting it; sound emitting means for converting the synthesized speech signal from the synthesized speech signal generating means into sound waves and outputting synthesized speech; sound collecting means for picking up ambient noise together with the synthesized speech output from the sound emitting means and converting them into an audio signal; and analyzing means for extracting the ambient noise component of the audio signal picked up by the sound collecting means, performing frequency analysis of that component by fast Fourier transform (FFT) processing, and calculating, from the frequency analysis result (the frequency analysis result of the ambient noise component), a so-called masking curve that describes how one sound masks another. Using the masking curve calculated by the analyzing means, the sound source parameters, the spectrum parameters, and the level adjustment value in the synthesized speech signal generating means are controlled so that the synthesized speech output from the sound emitting means is not masked by the ambient noise.

[0014] That is, this speech synthesizer applies a fast Fourier transform (FFT) to the ambient noise signal to perform frequency analysis, obtains a masking curve from the resulting spectrum, and uses this masking curve to control the sound source parameters, the spectrum parameters, and the level adjustment value so that the output synthesized sound is not masked by the ambient sound.

[0015]

[Operation] According to the speech synthesizer of the present invention, synthesized speech that has been synthesized and output from the sound emitting means is picked up by the sound collecting means together with the ambient noise; the ambient noise signal is extracted from the picked-up signal and analyzed; and the parameters are adjusted based on the result of this extraction and analysis before speech is synthesized again, so that synthesized speech that is less affected by ambient noise is obtained.

[0016]

[Embodiments] Embodiments of the present invention will be described below with reference to the drawings.

[0017] As shown in FIG. 1, the basic configuration of the speech synthesizer according to the embodiment of the present invention comprises: a synthesized speech signal generating circuit 20 that generates a synthesized speech signal using predetermined parameters; sound emitting means (speaker 6) that converts the synthesized speech signal from the synthesized speech signal generating circuit 20 into sound waves and outputs synthesized speech; sound collecting means (microphone 16) that picks up ambient noise together with the synthesized speech output from the speaker 6 and converts them into an audio signal; and analyzing means (ambient sound extraction and analysis circuit 21) that extracts the ambient noise component of the audio signal picked up by the microphone 16 and analyzes the extracted component. The parameters used when the synthesized speech signal generating circuit 20 generates the synthesized speech signal are adjusted based on the analysis result from the ambient sound extraction and analysis circuit 21.

[0018] That is, the speech synthesizer of this embodiment is built to produce synthesized speech output that is easy to hear even in an acoustic environment that changes as described earlier. The microphone 16 takes into the device the current surrounding acoustic environment together with the synthesized speech generated by the synthesized speech signal generating circuit 20 and output from the speaker 6. The ambient sound extraction and analysis circuit 21 extracts the noise (ambient noise) component from the sound picked up by the microphone 16 and analyzes it, and the synthesized speech signal generating circuit 20 adjusts the speech synthesis parameters based on the analysis result so that the synthesized speech becomes easier to hear. Because this embodiment handles signals digitally, the synthesized speech signal, which is digital data, is converted into an analog signal by the digital/analog (D/A) converter 5 before being sent to the speaker 6, and the analog audio signal from the microphone 16 is converted into a digital signal by the analog/digital (A/D) converter 15 before being sent to the ambient sound extraction and analysis circuit 21.

[0019] FIG. 2 shows the configuration of a more detailed specific example (first specific example) of the above basic configuration. In FIG. 2, components that are the same as those in FIG. 1 and in FIG. 8 described above are given the same reference numerals.

[0020] In this first specific example, the synthesized speech signal generating circuit 20 consists of the volume control unit 1, the sound source generation unit 2, the speech synthesis filter 3, the sound source parameter storage unit 7, the spectrum parameter storage unit 8, and the level adjuster 10. The ambient sound extraction and analysis circuit (analyzing means) 21 consists of the buffer 11, the RMS (root mean square) value calculation unit 12, the adder 13, and the buffer 14.

[0021] That is, in FIG. 2, the sound source generation unit 2 generates sound source data based on the sound source parameters supplied from the sound source parameter storage unit 7. This sound source data is sent to the speech synthesis filter 3. The speech synthesis filter 3 is also supplied with spectrum parameters from the spectrum parameter storage unit 8, so that filtering the sound source data in the speech synthesis filter 3 according to the spectrum parameters yields the waveform data of the synthesized speech signal. This waveform data is sent to the level adjuster 10. The level adjuster 10 is also supplied with level adjustment data from the volume control unit 1, corresponding to the volume set by the user, and adjusts the level of the synthesized speech signal according to that data. The level-adjusted synthesized speech signal is converted into an analog synthesized speech signal by the D/A converter 5 and is then output as synthesized speech by sound emitting means such as the speaker 6.

[0022] Meanwhile, the microphone 16 picks up ambient noise together with the synthesized speech output from the speaker 6. The audio signal picked up by the microphone 16 is converted into a digital audio signal by the A/D converter 15, stored temporarily in the buffer 14, and then sent to the adder 13 as the signal to be added. The adder 13 is also supplied, as the signal to be subtracted, with the synthesized speech signal output from the level adjuster 10 and passed through the buffer 11. The buffers 11 and 14 are provided to align timing, so that the synthesized speech signal output from the level adjuster 10 and supplied to the adder 13 coincides with the synthesized speech signal contained in the audio signal picked up by the microphone 16 and passed through the A/D converter 15. As a result, the output of the adder 13 is the signal from which the synthesized speech component picked up by the microphone 16 along with the ambient noise has been removed, that is, a signal containing only the ambient noise component. This ambient noise signal from the adder 13 is sent to the RMS value calculation unit 12, which computes its RMS value. The RMS value is sent to the level adjuster 10, which adjusts the level based on it.
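The timing alignment, subtraction, and RMS calculation described for the first specific example can be sketched as follows. This is a simplified illustration, not the circuit itself: it assumes the path from speaker to microphone is a pure delay with unity gain, and the names `extract_noise`, `rms`, and `delay` are hypothetical.

```python
import math


def extract_noise(mic, reference, delay):
    """Subtract the delayed reference (synthesized speech) from the mic signal.

    `delay` is the assumed number of samples by which the reference must be
    shifted so that it lines up with its copy inside the microphone signal.
    """
    noise = []
    for n, m in enumerate(mic):
        k = n - delay
        r = reference[k] if 0 <= k < len(reference) else 0.0
        noise.append(m - r)
    return noise


def rms(signal):
    """Root mean square of a sample sequence (0.0 for an empty sequence)."""
    if not signal:
        return 0.0
    return math.sqrt(sum(s * s for s in signal) / len(signal))
```

In a real room the loudspeaker-to-microphone path also changes the gain and spectrum of the synthesized speech, so a plain subtraction like this would leave residue; the sketch only shows the role the two buffers and the adder play.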

[0023] In other words, in the configuration of the first specific example, the sound in the environment where the synthesized speech is output (sound containing both the synthesized speech and the ambient noise) is picked up by the microphone 16, A/D converted, and stored in the buffer 14; the waveform data of the synthesized speech signal that was output at that time is likewise stored in the buffer 11; and the adder 13 extracts the noise (ambient noise) component from the waveform data stored in the two buffers 11 and 14. The RMS value calculation unit 12 obtains the RMS value of the ambient noise from this noise component, and this RMS value is input to the level adjuster 10, which adjusts the level of the speech synthesis output and controls it so that the output is louder than the ambient sound.

[0024] As a result, with the configuration of the first specific example, as shown in FIG. 3, when the ambient noise level a_rms (the level corresponding to the ambient sound RMS value) is higher than the preset volume (set volume) of the output volume (level) A_OUT of the synthesized speech from the speaker 6, the output volume A_OUT can be controlled so that the output is louder than the ambient sound. The level by which the output volume A_OUT is made louder than the ambient sound can be set in advance by the user.
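The level rule of FIG. 3 can be written as a small decision function. This is a sketch under assumptions: the patent does not say how the user-set amount is expressed, so a multiplicative `margin` (at least 1) is assumed here, and the function name is hypothetical.

```python
def output_level(set_level, noise_rms, margin=1.5):
    """Return the output level A_OUT for a given set volume and noise level a_rms.

    If the ambient noise level exceeds the set volume, the output is raised
    above the noise by the assumed user-chosen factor `margin`; otherwise the
    set volume is used unchanged.
    """
    if margin < 1.0:
        raise ValueError("margin must be at least 1.0")
    if noise_rms > set_level:
        return noise_rms * margin
    return set_level
```

For example, with a set level of 1.0, a noise level of 0.5 leaves the output at 1.0, while a noise level of 2.0 with `margin=1.5` raises it to 3.0.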

[0025] Next, the speech synthesizer of the present invention shown in FIG. 1 can also be configured as in the second specific example shown in FIG. 4. In FIG. 4, components corresponding to those in FIG. 1 and FIG. 2 are given the same reference numerals.

[0026] In the device of the second specific example shown in FIG. 4, the synthesized speech signal generating circuit 20 consists of the volume control unit 1, the sound source generation unit 2, the speech synthesis filter 3, the sound source parameter storage unit 30, the spectrum parameter storage unit 31, and the level adjuster 32; it generates a synthesized speech signal using the sound source parameters and the spectrum parameters, adjusts its level, and outputs it. The ambient sound extraction and analysis circuit (analyzing means) 21 consists of the buffers 11 and 14, the adder 13, the fast Fourier transform (FFT) circuit 34, and the masking curve calculation unit 33. It extracts the ambient noise component of the audio signal picked up by the microphone 16, performs frequency analysis of the extracted component by fast Fourier transform processing in the FFT circuit 34, and calculates a masking curve in the masking curve calculation unit 33 based on the frequency analysis result (the frequency analysis result of the ambient noise component).

[0027] In this second specific example, the masking curve data calculated by the masking curve calculation unit 33 is used to control the sound source parameters, the spectrum parameters, and the level adjustment value in the synthesized speech signal generating circuit 20 so that the synthesized speech output from the speaker 6 is not masked by the ambient noise.

[0028] That is, in FIG. 4, the sound source generation unit 2 generates sound source data based on the sound source parameters supplied from the sound source parameter storage unit 30, and this sound source data is sent to the speech synthesis filter 3. The speech synthesis filter 3 is also supplied with spectrum parameters from the spectrum parameter storage unit 31, and filtering the sound source data in the speech synthesis filter 3 according to the spectrum parameters yields the waveform data of the synthesized speech signal. This waveform data is sent to the level adjuster 32, which adjusts the level of the synthesized speech signal based on the level adjustment data from the volume control unit 1. The synthesized speech signal level-adjusted by the level adjuster 32 is passed through the D/A converter 5 and output as synthesized speech by the speaker 6.

[0029] Meanwhile, the microphone 16 picks up ambient noise together with the synthesized speech output from the speaker 6. The audio signal picked up by the microphone 16 passes through the A/D converter 15, is stored temporarily in the buffer 14, and is then sent to the adder 13 as the signal to be added. The adder 13 is also supplied, as the signal to be subtracted, with the synthesized speech signal output from the level adjuster 32 and passed through the buffer 11. As before, the output of the adder 13 is the signal from which the synthesized speech component picked up by the microphone 16 along with the ambient noise has been removed, that is, a signal containing only the ambient noise component. This ambient noise signal from the adder 13 is sent to the FFT circuit 34, which applies fast Fourier transform processing to it. The result of this processing (spectrum data) is sent to the masking curve calculation unit 33, which obtains the so-called masking curve describing how one sound (the ambient noise) masks another sound (the synthesized speech). The masking curve data is sent to the sound source parameter storage unit 30, the spectrum parameter storage unit 31, and the level adjuster 32, where it is used to control the sound source parameters, the spectrum parameters, and the level adjustment value so that the output synthesized sound is not masked by the ambient noise.
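The frequency analysis and masking curve steps can be sketched as follows. The patent does not state how the masking curve is derived from the spectrum, so the rule used here is purely an assumption for illustration: every spectral bin masks its neighbours at its own level minus a fixed offset, falling off linearly in dB with distance in bins. It is not a psychoacoustic model, and all names and default values are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum_db(frame, floor_db=-120.0):
    """Power spectrum in dB for bins 0..N/2, clipped at `floor_db`."""
    n = len(frame)
    levels = []
    for c in fft(frame)[: n // 2 + 1]:
        power = (abs(c) / n) ** 2
        level = 10.0 * math.log10(power) if power > 0.0 else floor_db
        levels.append(max(level, floor_db))
    return levels


def masking_curve(spectrum_db, slope_db_per_bin=6.0, offset_db=10.0):
    """Assumed spreading rule: max over bins of level - offset - slope * distance."""
    n = len(spectrum_db)
    return [
        max(spectrum_db[j] - offset_db - slope_db_per_bin * abs(i - j)
            for j in range(n))
        for i in range(n)
    ]
```

For a noise frame that is a pure sinusoid, the resulting curve peaks at the bin of the sinusoid and slopes away on both sides, which is the general shape that a narrow masker is expected to produce.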

[0030] Examples of the parameter adjustment in the second specific example are given below. When the ambient noise has a waveform like a sine wave, for example, the spectrum parameters are changed so as to avoid the frequency of that sine wave (the frequency of the ambient noise), shifting the so-called formant frequencies, which correspond to the resonance frequencies of the vocal tract, as shown in FIG. 5. In FIG. 5, curve EN_A is the envelope of the synthesized speech, curve MC_a is the masking curve of the ambient noise, and curve en_A is the envelope of the synthesized speech as corrected by this specific example. Likewise, when the ambient noise has a waveform like a sine wave, the sound source parameters may be changed to shift the pitch frequency, as shown in FIG. 6. In FIG. 6, curve PT_A is the pitch of the synthesized speech, curve MC_a is the masking curve of the ambient noise, and curve pt_A is the pitch of the synthesized speech as corrected by this specific example.
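The idea of moving a formant away from a sinusoidal noise component can be sketched as follows. The patent gives no rule for how far or in which direction to move, so the minimum gap `min_gap_hz` and the policy of pushing the formant to the nearer side of the noise frequency are assumptions, as is the function name.

```python
def shift_formants(formants_hz, noise_hz, min_gap_hz=200.0):
    """Move any formant closer than `min_gap_hz` to the noise frequency.

    A formant at or above the noise frequency is moved up to
    noise_hz + min_gap_hz; one below it is moved down to
    noise_hz - min_gap_hz. Other formants are left unchanged.
    """
    shifted = []
    for f in formants_hz:
        if abs(f - noise_hz) < min_gap_hz:
            f = noise_hz + min_gap_hz if f >= noise_hz else noise_hz - min_gap_hz
        shifted.append(f)
    return shifted
```

With formants at 700, 1200, and 2500 Hz and a noise tone at 1150 Hz, only the second formant moves, to 1350 Hz. The same kind of rule could plausibly be applied to the pitch frequency for the case of FIG. 6, with a correspondingly smaller gap.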

[0031] When the ambient noise is band noise, the spectrum parameter is changed, as shown in FIG. 7, to adjust the level of the frequency envelope so that it does not fall below the masking curve. In FIG. 7, curve EN_A shows the envelope of the synthesized voice, curve MC_a shows the masking curve of the ambient noise, and curve en_A shows the envelope of the synthesized voice as corrected by this specific example.
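The band-noise case, in which the frequency envelope is kept from falling below the masking curve, can be sketched as a per-band comparison. This is a minimal sketch, assuming the envelope and the masking curve are both available as dB values on the same frequency bands; the function name `lift_envelope` and the safety margin are hypothetical and not taken from the source.

```python
def lift_envelope(envelope_db, masking_db, margin_db=3.0):
    """Raise each envelope band to at least the masking curve plus a margin.

    Bands that already exceed masking + margin are left unchanged, so only
    the masked parts of the synthesized-voice envelope are boosted.
    """
    return [max(env, mask + margin_db)
            for env, mask in zip(envelope_db, masking_db)]
```

For example, `lift_envelope([60, 50, 40, 55], [30, 55, 60, 20])` returns `[60, 58.0, 63.0, 55]`: only the two bands lying under the masking curve are raised.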

[0032] As described above, the speech synthesizer of this embodiment can provide synthesized voice output that is easier to hear than the hard-to-hear output of a conventional speech synthesizer in situations where the ambient noise environment changes (for example, on a mobile robot or in an automobile).

【００３３】[0033]

[Effects of the Invention] As described above, the speech synthesizer of the present invention generates a synthesized voice signal using predetermined parameters, converts the synthesized voice signal into sound waves and outputs it as synthesized voice, captures the ambient noise together with the output synthesized voice and converts them into an audio signal, and adjusts the parameters used in generating the synthesized voice signal based on the result of analyzing the ambient noise component of that audio signal. As a result, easily heard synthesized voice can be obtained even in situations where the ambient noise environment changes.

[Brief description of drawings]

FIG. 1 is a block circuit diagram showing the basic configuration of a speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a block circuit diagram showing a first specific configuration of the present embodiment.

FIG. 3 is a diagram for explaining the level adjustment in the first specific example.

FIG. 4 is a block circuit diagram showing a second specific configuration of the present embodiment.

FIG. 5 is a diagram for explaining how the spectrum parameter is adjusted when the ambient noise is a sine wave in the second specific example.

FIG. 6 is a diagram for explaining how the sound source parameter is adjusted when the ambient noise is a sine wave in the second specific example.

FIG. 7 is a diagram for explaining how the spectrum parameter is adjusted when the ambient noise is band noise in the second specific example.

FIG. 8 is a block circuit diagram showing the schematic configuration of a conventional speech synthesizer.

[Explanation of symbols]

1 ... Volume control unit
2 ... Sound source generation unit
3 ... Voice synthesis filter
5 ... D/A converter
6 ... Speaker
7, 30 ... Sound source parameter accumulating unit
8, 31 ... Spectrum parameter accumulating unit
10, 32 ... Level adjuster
11, 14 ... Buffer
12 ... RMS value calculation unit
13 ... Adder
15 ... A/D converter
20 ... Synthesized voice signal generating circuit
21 ... Ambient sound extraction and analysis circuit
33 ... Masking curve calculation unit
34 ... Fast Fourier transform unit

Claims

[Claims]

1. A voice synthesizing apparatus comprising: synthetic voice signal generating means for generating a synthetic voice signal using a predetermined parameter; sound emitting means for converting the synthetic voice signal from the synthetic voice signal generating means into sound waves and outputting a synthesized voice; sound collecting means for capturing ambient sound together with the synthesized voice output from the sound emitting means and converting them into an audio signal; and analyzing means for analyzing an ambient sound component of the audio signal captured by the sound collecting means, wherein the parameter used when the synthetic voice signal generating means generates the synthetic voice signal is adjusted based on the analysis result of the analyzing means.

2. A voice synthesizing apparatus comprising: synthetic voice signal generating means for generating a synthetic voice signal using a sound source parameter and a spectrum parameter, adjusting its level, and outputting it; sound emitting means for converting the synthetic voice signal from the synthetic voice signal generating means into sound waves and outputting a synthesized voice; sound collecting means for capturing ambient sound together with the synthesized voice output from the sound emitting means and converting them into an audio signal; and analyzing means for performing frequency analysis, by fast Fourier transform processing, on an ambient sound component of the audio signal captured by the sound collecting means and calculating a masking curve based on the frequency analysis result, wherein, using the masking curve calculated by the analyzing means, the sound source parameter, the spectrum parameter, and the level adjustment value in the synthetic voice signal generating means are controlled so that the synthesized voice output from the sound emitting means is not masked by the ambient sound.