JP2004004952A - Voice synthesizer and voice synthetic method

Info

Publication number: JP2004004952A
Application number: JP2003282641A
Authority: JP
Inventors: Yumiko Kato; 加藤　弓子; Takahiro Kamai; 釜井　孝浩; Katsuyoshi Yamagami; 山上　勝義; Kenji Matsui; 松井　謙二
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-07-30
Filing date: 2003-07-30
Publication date: 2004-01-08

Abstract

PROBLEM TO BE SOLVED: To convey information reliably by voice even in noisy environments and to users with hearing difficulties, conditions under which conventional voice synthesis has been regarded as inapplicable.

SOLUTION: The voice synthesizer is provided with a voice synthesizing section that synthesizes voice according to a text, and an emphasis processing section that applies one or more phoneme emphasis processes to the voice synthesized by the voice synthesizing section. With this invention, information is conveyed reliably even to users with hearing difficulties and even in use under noise, so its practical effect is great.

COPYRIGHT: (C)2004,JPO

Description

The present invention relates to a rule-based speech synthesis system that converts text to speech, and in particular to a technique for conveying speech to hearing-impaired persons or when the system is used under noise.

Rule-based speech synthesis technology, which converts text to speech, is important as one means of conveying information transmitted as characters in a form that is easy for humans to understand. For example, most of the information sent over information networks is text, and conveying a large amount of text information to a person as it is requires either a display with a large display capacity or printing on paper.

However, as information terminals become smaller and are carried around, large displays and printers cannot be used, so conversion to speech is the most effective approach. FIG. 55 is a block diagram showing the configuration of a typical conventional speech synthesizer. In FIG. 55, 10 is text input means for inputting the target text, 20 is language processing means for parsing the text, 30m is a speech synthesis unit that synthesizes speech, 40m is operation means for manipulating the voice quality of the synthesized speech, 50m is voice quality control means for controlling the voice quality according to the input from the operation means, and 60 is an electroacoustic transducer. The speech synthesis unit 30m has speech synthesis control means 70m, which controls the speech synthesis unit according to the reading information and prosody information input from the language processing; a segment database 80, which stores speech in desired synthesis units such as vowel/consonant/vowel chains; and segment connection means 90m, which joins synthesis units to generate synthesized speech.

The operation of the conventional speech synthesizer configured as described above is described below.

First, the text input means 10 inputs the target text to the language processing means 20. Next, the language processing means 20 parses the text input from the text input means 10, generates reading information and prosody information, and outputs them to the speech synthesis control means 70m. The segment database 80 outputs synthesis units to the segment connection means 90m according to the reading information input from the speech synthesis control means 70m. The segment connection means 90m connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70m and the control signal input from the voice quality control means 50m, generates synthesized speech, and outputs it through the electroacoustic transducer 60.

Next, the method for creating speech segments is described. Speech segments are created by cutting units such as CV, VCV, or CVC out of the waveform of previously recorded speech. Here, C represents a consonant and V represents a vowel. Synthesis methods that use these synthesis units are called the CV method, the VCV method, the CVC method, and so on.

In the CV method, a unit is, for example, “ka”, the combination of the consonant k and the vowel a. In the VCV method, the synthesis unit is, for example, “aka”, the combination of the vowel a, the consonant k, and the vowel a; in the CVC method, it is, for example, “kat”, the combination of the consonant k, the vowel a, and the consonant t. Each method has its own advantages and disadvantages, such as the number of segment types and the quality of the synthesized sound, but every method generates synthesized sound by connecting speech segments one after another.
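The CV and VCV unit types can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each phoneme is written as a single Latin letter and that the vowels are a, i, u, e, o; the function names `cv_units` and `vcv_units` are hypothetical and do not come from the specification.

```python
VOWELS = set("aiueo")  # assumption: a five-vowel inventory written in Latin letters


def cv_units(phonemes):
    """Pair each consonant with the vowel that follows it (CV method).

    A phoneme that cannot start a consonant+vowel pair forms a unit on its own.
    """
    units = []
    i = 0
    while i < len(phonemes):
        is_cv = (
            phonemes[i] not in VOWELS
            and i + 1 < len(phonemes)
            and phonemes[i + 1] in VOWELS
        )
        if is_cv:
            units.append(phonemes[i] + phonemes[i + 1])
            i += 2
        else:
            units.append(phonemes[i])
            i += 1
    return units


def vcv_units(phonemes):
    """Collect vowel-consonant-vowel chains (VCV method).

    Neighbouring units share a vowel, e.g. "aka" and "ata" share the middle "a".
    """
    units = []
    for i in range(len(phonemes) - 2):
        v1, c, v2 = phonemes[i:i + 3]
        if v1 in VOWELS and c not in VOWELS and v2 in VOWELS:
            units.append(v1 + c + v2)
    return units
```

For the phoneme string "akata", `cv_units` yields the units a, ka, ta, while `vcv_units` yields aka and ata, which shows why the VCV method needs more segment types but keeps each consonant together with both of its neighbouring vowels.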

When creating the speech segments used in such synthesis methods, the amount of computation at synthesis time can be reduced if preprocessing puts the segments into a form that makes the modifications required at synthesis time easy to perform. For example, the pitch must be modified at synthesis time to obtain the target pitch pattern, and Japanese Patent Application No. 6-302471 describes a method in which waveforms are cut out in advance by windowing in units of the pitch period. That method is described below with reference to the drawings.

FIG. 56 shows the waveform cutting method. As shown in FIG. 56, marks are placed at the peak positions corresponding to the pitch periods of the waveform, and the waveform is cut out with a window that is centered on each mark and is at most twice the pitch period long. A waveform cut out in this way is called a pitch waveform. Unvoiced consonant portions, which have no concept of pitch, are cut out as they are as a continuous waveform. This is called an initial waveform.
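The cutting of pitch waveforms can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited application: the window shape (a Hann window), the fixed pitch period in samples, and the names `hann` and `cut_pitch_waveforms` are all assumptions introduced here. The text above only requires a window centered on each mark with a length of at most twice the pitch period.

```python
import math


def hann(n):
    """Symmetric Hann window of n samples (n >= 3)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def cut_pitch_waveforms(signal, marks, period):
    """Cut one windowed pitch waveform per pitch mark.

    signal: list of samples
    marks:  sample indices of the pitch marks (waveform peaks)
    period: pitch period in samples (assumed constant here, period >= 2)

    The window is centred on each mark and spans 2 * period - 1 samples,
    which stays below twice the pitch period. Samples that would fall
    outside the signal are treated as zero.
    """
    half = period - 1
    window = hann(2 * half + 1)
    pieces = []
    for mark in marks:
        piece = []
        for k, weight in enumerate(window):
            idx = mark - half + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            piece.append(sample * weight)
        pieces.append(piece)
    return pieces
```

Because the window tapers to zero at both ends, each pitch waveform can later be shifted and summed with its neighbours without introducing discontinuities at the joins.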

FIG. 57 shows the processing at synthesis time. As shown in the figure, the pitch waveforms are superimposed so that the target pitch period is obtained. To raise the pitch, they are superimposed at narrower intervals; to lower the pitch, they are superimposed at wider intervals.
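The superposition at synthesis time can be sketched in a few lines of Python. The sketch assumes that all pitch waveforms have the same length and that the target pitch period is constant; the function name `overlap_add` is hypothetical and chosen only for this illustration.

```python
def overlap_add(pitch_waveforms, target_period):
    """Superimpose pitch waveforms with their starts target_period samples apart.

    A smaller target_period packs the waveforms closer together (higher pitch);
    a larger one spreads them out (lower pitch).
    Assumes every pitch waveform has the same number of samples.
    """
    if not pitch_waveforms:
        return []
    piece_len = len(pitch_waveforms[0])
    total = (len(pitch_waveforms) - 1) * target_period + piece_len
    out = [0.0] * total
    for i, piece in enumerate(pitch_waveforms):
        start = i * target_period
        for k, sample in enumerate(piece):
            out[start + k] += sample
    return out
```

With three identical five-sample waveforms, a target period of 2 samples gives a 9-sample output and a target period of 3 samples gives an 11-sample output: the same waveforms are reused, and only their spacing, and hence the perceived pitch, changes.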

With such a speech synthesizer and speech segment creation method, the synthesized speech is difficult to hear when it is used under noise or by a person with a hearing impairment. With current speech synthesis technology it is difficult to achieve sufficient intelligibility even when a person with normal hearing uses it in a quiet environment, and intelligibility degrades even more seriously when it is used under noise or by a hearing-impaired person. This is because synthesized sound uses a limited set of speech segments and loses much information through the connection and modification processing at synthesis time, which makes it susceptible to masking by noise and to the effects of hearing impairment. The conventional technology therefore has the problem that it is difficult to convey the information needed to recognize speech under noise or in the presence of hearing impairment.

The present invention is intended to solve the conventional problem described above, and is a speech synthesizer comprising a speech synthesis unit that synthesizes speech according to a text, and an emphasis processing unit that applies one or more phoneme emphasis processes to the speech synthesized by the speech synthesis unit.

Preferably, the emphasis process is a consonant emphasis process that, based on phoneme information, emphasizes the amplitude of a consonant, or of a consonant and the transition to the vowel that follows it.

Preferably, the emphasis process is a band emphasis process that emphasizes a frequency band of a consonant based on phoneme information.

As described above, according to the present invention, information can be conveyed reliably even to users with hearing impairments and even in use under noise, so its practical effect is great.

In one embodiment of the present invention, emphasis processing or processing that compresses the dynamic range of the amplitude is applied to the synthesized speech in accordance with the user's auditory characteristics, or in accordance with the noise environment of the use scene. Alternatively, speech is synthesized after emphasis processing or amplitude dynamic range compression has been applied to the synthesis units stored in the database, in accordance with the user's auditory characteristics or with the noise environment of the use scene. Alternatively, speech is synthesized using synthesis units to which emphasis processing or amplitude dynamic range compression has been applied in advance. In addition, when speech synthesis is interrupted, synthesis resumes from a point in the text that lies before the stop position and from which the content is easy to understand, chosen on the basis of the language processing result. Furthermore, the portions to be emphasized are set on the basis of the language processing. In these ways, information can be conveyed reliably even to users with hearing impairments and even in use under noise.

A first embodiment of the present invention is a speech synthesizer comprising a speech synthesis unit that synthesizes speech according to a text, and an emphasis processing unit that applies one or more phoneme emphasis processes to the speech synthesized by the speech synthesis unit.

Preferably, the emphasis process is a formant emphasis process.

Preferably, the formant emphasis process emphasizes the peaks of the speech spectrum.

Preferably, the formant emphasis process emphasizes a band that contains a formant frequency predetermined for each phoneme, based on the phoneme information input from the speech synthesis unit to the emphasis processing unit.

Preferably, the formant emphasis process emphasizes a band that contains a formant frequency, based on the formant information input from the speech synthesis unit to the emphasis processing unit.

Preferably, the speech synthesizer comprises a microphone and a control unit that analyzes the environmental sound input from the microphone and controls the emphasis processing unit based on the physical characteristics of that environmental sound. Further, the control unit analyzes the environmental sound input from the microphone and, based on its physical characteristics, selects the emphasis processing method used by the emphasis processing unit.

Preferably, the speech synthesizer comprises operation means with which the user adjusts the method and degree of emphasis, and a control unit that controls the emphasis processing unit based on the signal input from the operation means.

Preferably, the speech synthesizer comprises a measurement unit that measures the user's auditory characteristics and preferences, and a control unit that controls the emphasis processing unit based on those auditory characteristics and preferences. Further, the control unit selects the emphasis processing method used by the emphasis processing unit based on the user's auditory characteristics and preferences input from the measurement unit.

Preferably, the speech synthesizer comprises storage means that stores the user's auditory characteristics and preferences, and a control unit that controls the emphasis processing unit based on those auditory characteristics and preferences. More preferably, the control unit selects the emphasis processing method used by the emphasis processing unit based on the user's auditory characteristics and preferences stored in the storage means.

Preferably, the speech synthesizer comprises auditory characteristic reading means and a control unit, and the control unit controls the emphasis processing unit by referring to the user's auditory characteristics and preferences, which are stored on a recording medium and read by the auditory characteristic reading means. Further, the control unit selects the emphasis processing method used by the emphasis processing unit based on the user's auditory characteristics and preferences read by the auditory characteristic reading means.

A second embodiment of the present invention is a speech synthesizer comprising a speech segment database that stores speech in desired synthesis units such as vowel/consonant/vowel chains, a segment modification unit that applies emphasis processing to the synthesis units, and a speech synthesis unit that synthesizes speech by connecting, according to the target text, the synthesis units to which the segment modification unit has applied emphasis processing.

Preferably, the formant emphasis process emphasizes a band that contains a formant frequency predetermined for each phoneme, based on phoneme information.

Preferably, the formant emphasis process emphasizes a band that contains a formant frequency, based on formant information.

Preferably, the emphasis process is a closure emphasis process that lengthens the closure of a consonant based on linguistic information.

Preferably, the emphasis process is a lengthening process that extends the phoneme duration based on linguistic information.

Preferably, the speech synthesizer comprises a microphone and a control unit that analyzes the environmental sound input from the microphone and controls the segment modification unit based on the physical characteristics of that environmental sound. Further, the control unit analyzes the environmental sound input from the microphone and, based on its physical characteristics, selects the emphasis processing method used by the segment modification unit.

Preferably, the speech synthesizer comprises operation means with which the user adjusts the method and degree of emphasis, and a control unit that controls the segment modification unit based on the signal input from the operation means.

Preferably, the speech synthesizer comprises a measurement unit that measures the user's auditory characteristics and preferences, and a control unit that controls the segment modification unit based on those auditory characteristics and preferences. Further, the control unit selects the emphasis processing method used by the segment modification unit based on the user's auditory characteristics and preferences input from the measurement unit.

Preferably, the speech synthesizer comprises storage means that stores the user's auditory characteristics and preferences, and a control unit that controls the segment modification unit based on those auditory characteristics and preferences. Further, the control unit selects the emphasis processing method used by the segment modification unit based on the user's auditory characteristics and preferences stored in the storage means.

Preferably, the speech synthesizer comprises auditory characteristic reading means and a control unit, and the control unit controls the segment modification unit by referring to the user's auditory characteristics and preferences, which are stored on a recording medium and read by the auditory characteristic reading means. Further, the control unit selects the emphasis processing method used by the segment modification unit based on the user's auditory characteristics and preferences read by the auditory characteristic reading means.

A third embodiment of the present invention is a speech synthesizer comprising a speech segment database that stores speech to which phoneme emphasis processing has been applied in advance, in desired synthesis units such as vowel/consonant/vowel chains, and a speech synthesis unit that synthesizes speech by connecting the synthesis units according to the target text.

Preferably, the speech synthesizer comprises a plurality of speech segment databases that differ in the method and degree of emphasis, a microphone, and a control unit that analyzes the environmental sound input from the microphone and, based on its physical characteristics, selects the speech segment database that the speech synthesis unit uses for speech synthesis.

Preferably, the speech synthesizer comprises a plurality of speech segment databases that differ in the method and degree of emphasis, operation means with which the user adjusts the state of emphasis, and a control unit that selects, based on the signal input from the operation means, the speech segment database that the speech synthesis unit uses for speech synthesis.

Preferably, the speech synthesizer comprises a plurality of speech segment databases that differ in the method and degree of emphasis, a measurement unit that measures the user's auditory characteristics and preferences, and a control unit that selects, based on those auditory characteristics and preferences, the speech segment database that the speech synthesis unit uses for speech synthesis.

Preferably, the speech synthesizer comprises a plurality of speech segment databases that differ in the method and degree of emphasis, storage means that stores the user's auditory characteristics and preferences, and a control unit that selects, based on those auditory characteristics and preferences, the speech segment database that the speech synthesis unit uses for speech synthesis.

Preferably, the speech synthesizer comprises a storage medium that stores a plurality of speech segment databases differing in the method and degree of emphasis, and speech segment database reading means.

A fourth embodiment of the present invention is a speech synthesis method in which one or more phoneme emphasis processes are applied to speech synthesized by a speech synthesis unit that synthesizes speech according to a text.

A fifth embodiment of the present invention is a speech synthesis method in which emphasis processing is applied to synthesis units of speech output from a speech segment database that stores speech in desired synthesis units such as vowel/consonant/vowel chains, and speech is synthesized by connecting the emphasized synthesis units according to the target text.

A sixth embodiment of the present invention is a speech synthesis method in which speech is synthesized by connecting, according to the target text, synthesis units of speech output from a speech segment database that stores speech to which phoneme emphasis processing has been applied in advance, in desired synthesis units such as vowel/consonant/vowel chains.

(Example 1)
A first example of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of the first example of the speech synthesizer of the present invention. FIG. 2 is a flowchart for explaining the operation of the first example, and FIGS. 3, 4, 5, and 6 show parts of that flowchart. FIGS. 7 and 8 are schematic diagrams of the emphasis processing of the first example. In FIG. 1, components or portions identical to those in FIG. 55 carry the same reference numerals and are not described again; only the differences are described. The configuration is the same as that of FIG. 55 except that the speech synthesis unit 30m is replaced by a speech synthesis unit 30a, the voice quality control means 50m is replaced by voice quality control means 50a, the operation means 40m is replaced by a microphone 110, and auditory characteristic measurement means 120 is added. The speech synthesis unit 30a has: speech synthesis control means 70a, which controls the speech synthesis unit 30a based on the reading information, prosody information, and emphasis portion information input from the language processing means 20; the segment database 80, which stores speech in desired synthesis units such as vowel/consonant/vowel chains; phoneme emphasis processing means 130a, which applies emphasis processing to the synthesis units stored in the segment database 80; segment connection means 90a, which joins the synthesis units processed by the phoneme emphasis processing means 130a to generate synthesized speech; and compression processing means 140a, which applies compression processing that compresses the dynamic range of the amplitude to the synthesized speech generated by the segment connection means 90a.

The operation of the speech synthesizer of this example, configured as described above, is described below with reference to FIGS. 1, 2, 3, 4, 5, and 6.

　まず聴覚特性測定手段１２０で使用者の聴覚特性を測定し、測定結果を声質制御手段５０ａに出力する。（ステップ１０００）。測定方法は例えば１９９２年、Audiology　Japan巻３５、４０１頁から４０２頁や平成５年,音響学会講演論文集春季、３２９頁〜３３０頁に示された測定方法のようにするものとする。声質制御手段５０ａは聴覚特性測定手段１２０より入力された測定結果に基づき強調処理の設定を決定する（ステップ１１００）。まず使用者の周波数分解能を示すｐ値を１５と比較する（ステップ１１１０）。ステップ１１１０においてｐ値が１５未満の場合はフォルマント強調情報を真とする（ステップ１１２０）。もしステップ１１１０においてｐ値が１５以上の場合はフォルマント強調情報を偽とする（ステップ１１２５）。次に使用者の時間分解能を示すギャップ検出閾値と１０msを比較する（ステップ１１３０）。ステップ１１３０においてギャップの検出閾値が１０ms以上である場合子音強調情報を真とする（ステップ１１４０）。もしステップ１１３０でギャップの検出閾値が１０ms未満の場合は子音強調情報を偽とする（ステップ１１５０）。次に使用者の２ｋＨｚ未満の平均聴力レベルと２ｋＨｚ以上の平均聴力レベルを比較する（ステップ１１６０）。ステップ１１６０において２ｋＨｚ以上の平均聴力レベルから２ｋＨｚ未満の平均聴力レベルを減じた値が３０ｄＢ以上の場合は帯域強調情報を真とする（ステップ１１７０）。もしステップ１１７０において２ｋＨｚ以上の平均聴力レベルから２ｋＨｚ未満の平均聴力レベルを減じた値が３０ｄＢ未満の場合は帯域強調情報を偽とする（ステップ１１８０）。テキスト入力手段１０は言語処理手段２０に目的のテキストを入力する（ステップ１２００）。次に言語処理手段２０はテキスト入力手段１０より入力されたテキストの構文解析を行い、読み情報、韻律情報および強調部情報を生成し音声合成制御手段７０ａに出力する（ステップ１３００）。素片データベース８０は音声合成制御手段７０ａより入力された読み情報に従って音韻強調処理手段１３０ａに合成単位を出力する（ステップ１４００）。音韻強調処理手段１３０ａは音声合成制御手段７０ａより入力された強調部情報と声質制御手段５０ａより入力された制御信号に従って合成単位に強調処理を施す（ステップ１５００）。音韻強調処理手段１３０ａは音声合成制御手段７０ａより入力された強調部情報が真か偽かを判定する（ステップ１５１０）。ステップ１５１０において強調部情報が真である場合、合成単位中の母音定常部の時間長を２０％延長し（ステップ１５２０）。声質制御手段５０ａより入力されたフォルマント強調情報が真か偽かを判定する（ステップ１５３０）。もしステップ１５１０において強調部情報が偽である場合、声質制御手段５０ａより入力されたフォルマント強調情報が真か偽かを判定する（ステップ１５３０）。ステップ１５３０においてフォルマント強調情報が真である場合、図７に示すように音韻強調処理手段１３０ａは素片データベース８０に記憶された合成単位に対応するフォルマント情報に従って、図７ｂ）に示すようにフォルマントを含む帯域を選択的に通過させるようフィルタバンクの各フィルタの中心周波数および帯域幅を設定し、図７ｃ）に示すようにフォルマントを含む帯域とフォルマントを含まない帯域とのコントラストを強調する（ステップ１５４０）。次に声質制御手段５０ａより入力された子音強調情報が真か偽かを判定する（ステップ１５５０）。もしステップ１５３０においてフォルマント強調情報が偽である場合、声質制御手段５０ａより入力された子音強調情報が真か偽かを判定する（ステップ１５５０）。ステップ１５５０において子音強調情報が真である場合、音韻強調処理手段１３０ａは図８に示すような素片データベース８０に記憶された合成単位に対応するラベル情報に従って、合成単位中の子音および子音から母音への渡りの振幅を図８に示すように増幅する（ステップ１５６０）。次に声質制御手段５０ａより入力された帯域強調情報が真か偽かを判定する（ステップ１５７０）。もしステップ１５６０において子音強調情報が偽である場合、声質制御手段５０ａより入力された帯域強調情報が真か偽かを判定する（ステップ１５７０）。ステップ１５７０において帯域強調情報が真である場合、音韻強調処理手段１３０ａは合成単位中の子音に２ｋＨｚ以上の帯域を強調する高帯域強調処理を行い（ステップ１５８０）、合成単位を素片接続手段９０ａに出力する（ステップ１５９０）。もしステップ１５７０において帯域強調情報が偽である場合、音韻強調処理手段１３０ａは合成単位を素片接続手段９０に出力する（ステップ１５９０）。素片接続手段９０ａ
は音声合成制御手段７０ａより入力された韻律情報および強調部情報に従って音韻強調処理手段１３０ａより入力された合成単位を合成し合成音声を生成する（ステップ１６００）。まず素片接続手段９０ａは音声合成制御手段７０ａより入力された強調部情報が真か偽かを判定する（ステップ１６１０）。ステップ１６１０において強調部情報が真の場合、素片接続手段９０ａは合成単位に対応するクロージャーの値を２０％延長し（ステップ１６２０）、音声合成制御手段７０ａより入力された韻律情報に従って合成音声を生成し（ステップ１６３０）、圧縮処理手段１４０ａに出力する（ステップ１６４０）。もしステップ１６１０において強調処理情報が偽の場合、素片接続手段９０ａは音声合成制御手段７０ａより入力された韻律情報に従って合成音声を生成し（ステップ１６３０）圧縮処理手段１４０ａに出力する（ステップ１６４０）。圧縮処理手段１４０ａは声質制御手段５０ａの制御信号に従って素片接続手段９０ａで生成された合成音声の振幅のダイナミックレンジを圧縮する（ステップ１７００）。まず声質制御手段５０ａはマイクロフォン１１０より入力された環境音を１ｋＨｚ以下、１ｋＨｚ〜２ｋＨｚ、２ｋＨｚ〜４ｋＨｚ、４ｋＨｚ以上の帯域に分割し、帯域ごとに１００ｍｓの平均レベルを求める（ステップ１７１０）。１ｋＨｚ以下の環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７３０）。ステップ１７３０において１ｋＨｚ以下の環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上である場合、声質制御手段５０ａは合成音声の１ｋＨｚ以下の成分のレベルのダイナミックレンジが１ｋＨｚ以下の環境音の平均レベルの値〜９０ｄＢＳＰＬとなるように圧縮処理のパラメータを設定し（ステップ１７４０）、１ｋＨｚ〜２ｋＨｚの環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７５０）。もしステップ１７３０において１ｋＨｚ以下の環境音が２０ｄＢＳＰＬ／Ｈｚ未満である場合、１ｋＨｚ〜２ｋＨｚの環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７５０）。ステップ１７５０において１ｋＨｚ〜２ｋＨｚの環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上である場合、声質制御手段５０ａは合成音声の１ｋＨｚ〜２ｋＨｚの成分のレベルのダイナミックレンジが１ｋＨｚ〜２ｋＨｚの環境音の平均レベルの値〜９０ｄＢＳＰＬとなるように圧縮処理のパラメータを設定し（ステップ１７６０）、２ｋＨｚ〜４ｋＨｚの環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７７０）。もしステップ１７５０において１ｋＨｚ〜２ｋＨｚの環境音が２０ｄＢＳＰＬ／Ｈｚ未満である場合、２ｋＨｚ〜４ｋＨｚの環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７７０）。ステップ１７７０において２ｋＨｚ〜４ｋＨｚの環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ以上である場合、声質制御手段５０ａは合成音声の２ｋＨｚ〜４ｋＨｚの成分のレベルのダイナミックレンジが２ｋＨｚ〜４ｋＨｚの環境音の平均レベルの値〜８０ｄＢＳＰＬとなるように圧縮処理のパラメータを設定し（ステップ１７８０）、４ｋＨｚ以上の環境音の平均レベルと１０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７９０）。もしステップ１７７０において２ｋＨｚ〜４ｋＨｚの環境音が１５ｄＢＳＰＬ／Ｈｚ未満である場合、４ｋＨｚ以上の環境音の平均レベルと１０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７９０）。ステップ１７９０において４ｋＨｚ以上の環境音の平均レベルが１０ｄＢＳＰＬ／Ｈｚ以上である場合、声質制御手段５０ａは合成音声の４ｋＨｚ以上の成分のレベルのダイナミックレンジが４ｋＨｚ以上の環境音の平均レベルの値〜６０ｄＢＳＰＬとなるように圧縮処理のパラメータを設定し（ステップ１８００）、圧縮処理手段１４０ａに制御信号を出力する（ステップ１８１０）。もしステップ１７９０において４ｋＨｚ以上の環境音の平均レベルが１０ｄＢＳＰＬ／Ｈｚ未満である場合、圧縮処理手段１４０ａに制御信号を出力する（ステップ１８１０）。圧縮処理手段１４０ａは声質制御手段５０ａより入力された制御信号に基づき素片接続手段９０ａより入力された合成音声に圧縮処理を行う（ステップ１８２０）。圧縮処理の方法は例えば１９９１年音響学会誌、巻４７、３７３頁から３７９頁に示された処理のようにするものとする。圧縮処理手段１４０ａは電気音響変換器６０を通して合成音声を出力する（ステップ１９００）。 {Circle around (1)} First, the hearing characteristics of the user are measured by the hearing 
characteristics measuring means 120 and the measurement result is output to the voice quality control means 50a. (Step 1000). The measuring method is, for example, the measuring method described in Audiology @ Japan Vol. 35, pp. 401-402 in 1992, and 1993, Spring Meeting of the Acoustical Society of Japan, pp. 329-330. The voice quality control unit 50a determines the setting of the emphasis processing based on the measurement result input from the auditory characteristic measurement unit 120 (step 1100). First, the p value indicating the frequency resolution of the user is compared with 15 (step 1110). If the p value is less than 15 in step 1110, the formant emphasis information is set to true (step 1120). If the p value is 15 or more in step 1110, the formant emphasis information is set to false (step 1125). Next, a gap detection threshold value indicating the time resolution of the user is compared with 10 ms (step 1130). If the gap detection threshold is 10 ms or more in step 1130, the consonant emphasis information is set to true (step 1140). If the gap detection threshold is less than 10 ms in step 1130, the consonant emphasis information is set to false (step 1150). Next, the average hearing level of the user less than 2 kHz is compared with the average hearing level of 2 kHz or more (step 1160). When the value obtained by subtracting the average hearing level below 2 kHz from the average hearing level above 2 kHz in Step 1160 is 30 dB or more, the band emphasis information is set to true (Step 1170). If the value obtained by subtracting the average hearing level below 2 kHz from the average hearing level above 2 kHz in step 1170 is less than 30 dB, the band emphasis information is set to false (step 1180). The text input means 10 inputs a target text to the language processing means 20 (step 1200). 
Next, the language processing means 20 performs a syntax analysis of the text input from the text input means 10, generates reading information, prosody information, and emphasis section information, and outputs them to the speech synthesis control means 70a (step 1300). The segment database 80 outputs a synthesis unit to the phoneme emphasis processing means 130a according to the reading information input from the speech synthesis control means 70a (step 1400). The phoneme emphasis processing means 130a performs emphasis processing on the synthesis unit according to the emphasis section information input from the speech synthesis control means 70a and the control signal input from the voice quality control means 50a (step 1500). The phoneme emphasis processing means 130a determines whether the emphasis section information input from the speech synthesis control means 70a is true or false (step 1510). If the emphasis section information is true in step 1510, the duration of the steady-state vowel portion in the synthesis unit is extended by 20% (step 1520), and it is then determined whether the formant emphasis information input from the voice quality control means 50a is true or false (step 1530). If the emphasis section information is false in step 1510, it is determined whether the formant emphasis information input from the voice quality control means 50a is true or false (step 1530). If the formant emphasis information is true in step 1530, the phoneme emphasis processing means 130a, according to the formant information corresponding to the synthesis unit stored in the segment database 80, sets the center frequency and bandwidth of each filter in the filter bank as shown in FIG. 7(b) so as to selectively pass the bands that contain formants, and enhances the contrast between the bands that contain formants and the bands that do not, as shown in FIG. 7(c) (step 1540). 
Next, it is determined whether the consonant emphasis information input from the voice quality control means 50a is true or false (step 1550). If the formant emphasis information is false in step 1530, it is determined whether the consonant emphasis information input from the voice quality control means 50a is true or false (step 1550). If the consonant emphasis information is true in step 1550, the phoneme emphasis processing means 130a, according to the label information corresponding to the synthesis unit stored in the segment database 80, amplifies the amplitude of the consonants and of the consonant-to-vowel transitions in the synthesis unit as shown in FIG. 8 (step 1560). Next, it is determined whether the band emphasis information input from the voice quality control means 50a is true or false (step 1570). If the consonant emphasis information is false in step 1560, it is determined whether the band emphasis information input from the voice quality control means 50a is true or false (step 1570). If the band emphasis information is true in step 1570, the phoneme emphasis processing means 130a performs high-band emphasis processing, which emphasizes the band of 2 kHz and above, on the consonants in the synthesis unit (step 1580), and outputs the synthesis unit to the segment connection means 90a (step 1590). If the band emphasis information is false in step 1570, the phoneme emphasis processing means 130a outputs the synthesis unit to the segment connection means 90 (step 1590). The segment connection means 90a combines the synthesis units input from the phoneme emphasis processing means 130a according to the prosody information and emphasis section information input from the speech synthesis control means 70a to generate synthesized speech (step 1600). First, the segment connection means 90a determines whether the emphasis section information input from the speech synthesis control means 70a is true or false (step 1610). 
If the emphasis section information is true in step 1610, the segment connection means 90a extends the closure value corresponding to the synthesis unit by 20% (step 1620), generates synthesized speech according to the prosody information input from the speech synthesis control means 70a (step 1630), and outputs it to the compression processing means 140a (step 1640). If the emphasis processing information is false in step 1610, the segment connection means 90a generates synthesized speech according to the prosody information input from the speech synthesis control means 70a (step 1630) and outputs it to the compression processing means 140a (step 1640). The compression processing means 140a compresses the dynamic range of the amplitude of the synthesized speech generated by the segment connection means 90a according to the control signal from the voice quality control means 50a (step 1700). First, the voice quality control means 50a divides the environmental sound input from the microphone 110 into four bands (1 kHz or less, 1 kHz to 2 kHz, 2 kHz to 4 kHz, and 4 kHz or more) and obtains the average level over 100 ms for each band (step 1710). The average level of the environmental sound of 1 kHz or less is compared with 20 dBSPL/Hz (step 1730). If the average level of the environmental sound of 1 kHz or less is 20 dBSPL/Hz or more in step 1730, the voice quality control means 50a sets the compression parameters so that the dynamic range of the level of the components of the synthesized speech at 1 kHz or less extends from the average level of the environmental sound of 1 kHz or less up to 90 dBSPL (step 1740), and then compares the average level of the environmental sound of 1 kHz to 2 kHz with 20 dBSPL/Hz (step 1750). 
If the average level of the environmental sound of 1 kHz or less is less than 20 dBSPL/Hz in step 1730, the average level of the environmental sound of 1 kHz to 2 kHz is compared with 20 dBSPL/Hz (step 1750). If the average level of the environmental sound of 1 kHz to 2 kHz is 20 dBSPL/Hz or more in step 1750, the voice quality control means 50a sets the compression parameters so that the dynamic range of the level of the 1 kHz to 2 kHz components of the synthesized speech extends from the average level of the environmental sound of 1 kHz to 2 kHz up to 90 dBSPL (step 1760), and then compares the average level of the environmental sound of 2 kHz to 4 kHz with 15 dBSPL/Hz (step 1770). If the average level of the environmental sound of 1 kHz to 2 kHz is less than 20 dBSPL/Hz in step 1750, the average level of the environmental sound of 2 kHz to 4 kHz is compared with 15 dBSPL/Hz (step 1770). If the average level of the environmental sound of 2 kHz to 4 kHz is 15 dBSPL/Hz or more in step 1770, the voice quality control means 50a sets the compression parameters so that the dynamic range of the level of the 2 kHz to 4 kHz components of the synthesized speech extends from the average level of the environmental sound of 2 kHz to 4 kHz up to 80 dBSPL (step 1780), and then compares the average level of the environmental sound of 4 kHz or more with 10 dBSPL/Hz (step 1790). If the average level of the environmental sound of 2 kHz to 4 kHz is less than 15 dBSPL/Hz in step 1770, the average level of the environmental sound of 4 kHz or more is compared with 10 dBSPL/Hz (step 1790). 
If the average level of the environmental sound of 4 kHz or more is 10 dBSPL/Hz or more in step 1790, the voice quality control means 50a sets the compression parameters so that the dynamic range of the level of the components of the synthesized speech at 4 kHz or more extends from the average level of the environmental sound of 4 kHz or more up to 60 dBSPL (step 1800), and outputs a control signal to the compression processing means 140a (step 1810). If the average level of the environmental sound of 4 kHz or more is less than 10 dBSPL/Hz in step 1790, a control signal is output to the compression processing means 140a (step 1810). The compression processing means 140a applies compression processing to the synthesized speech input from the segment connection means 90a based on the control signal input from the voice quality control means 50a (step 1820). The compression method is, for example, the processing described in the Journal of the Acoustical Society of Japan, Vol. 47, pp. 373-379 (1991). The compression processing means 140a outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
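
The band-by-band parameter setting of steps 1730 to 1800 amounts to a table lookup, which the following Python sketch tries to make explicit. The thresholds and upper limits are those stated above; the band keys, the function names, and in particular `compress_level`, a plain linear mapping in dB, are assumptions made for illustration. The patent delegates the actual compression method to the cited journal article.

```python
# Per band: (noise threshold in dBSPL/Hz, upper limit of the target range in dBSPL),
# as stated in steps 1730 to 1800.
BANDS = {
    "<=1k": (20.0, 90.0),
    "1k-2k": (20.0, 90.0),
    "2k-4k": (15.0, 80.0),
    ">=4k": (10.0, 60.0),
}


def compression_ranges(noise_levels):
    """Return {band: (floor, ceiling)} for every band whose noise reaches its threshold.

    noise_levels maps each band key to the 100 ms average level of the
    environmental sound in that band.
    """
    ranges = {}
    for band, (threshold, ceiling) in BANDS.items():
        level = noise_levels[band]
        if level >= threshold:
            # The target dynamic range runs from the noise level up to the ceiling.
            ranges[band] = (level, ceiling)
    return ranges


def compress_level(level_db, in_lo, in_hi, out_lo, out_hi):
    """Hypothetical linear-in-dB mapping of an input level into the target range."""
    t = (level_db - in_lo) / (in_hi - in_lo)
    t = min(1.0, max(0.0, t))  # clamp levels that fall outside the input range
    return out_lo + t * (out_hi - out_lo)
```

Bands whose noise stays below the threshold are simply absent from the result, which mirrors the flowchart skipping the parameter-setting step for them.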

　（実施例２）
　以下本発明の第２の実施例について、図面を参照しながら説明する。 (Example 2)
Hereinafter, a second embodiment of the present invention will be described with reference to the drawings.

　図９は本発明の音声合成装置の第２の実施例を示す構成ブロック図である。図１０に第２の実施例の動作を説明するための流れ図を、図１１に動作を説明するための流れ図の一部を示す。図９において図１と同一物または部分については同一符号を付しているので説明を省略し、異なった部分についてのみ説明する。図１の音声合成部３０ａが音声合成部３０ｂに置き換わり、声質制御手段５０ａが声質制御手段５０ｂに置き換わり、マイクロフォン１１０が操作手段４０ｂに置き換わり、聴覚特性測定手段１２０が聴覚特性記憶手段２２０に置き換わった以外は図１と同一な構成である。前記の音声合成部３０ｂは、音声合成制御手段７０ｂ、合成単位を記憶しておくデータベース部２００ｂ、合成単位に振幅のダイナミックレンジを圧縮する圧縮処理を施す圧縮処理手段１４０ｂ、圧縮処理手段１４０ｂで処理された合成単位をつなげて合成音声を生成する素片接続手段９０ｂを有する。前記のデータベース部２００ｂは異なる複数の強調処理を施された素片を施された強調処理ごとに記憶する複数の素片データベース２８０ａ〜ｎと、複数の素片データベース２８０ａ〜ｎと圧縮処理手段１４０ｂとの接続を切り替えるスイッチ２１０ｂとを有する。 FIG. 9 is a block diagram showing the configuration of a second embodiment of the speech synthesizer of the present invention. FIG. 10 shows a flowchart for explaining the operation of the second embodiment, and FIG. 11 shows a part of that flowchart. In FIG. 9, the same components or portions as those in FIG. 1 are denoted by the same reference numerals and their description is omitted; only the differing portions are described. The configuration is the same as that of FIG. 1 except that the speech synthesis unit 30a of FIG. 1 is replaced by the speech synthesis unit 30b, the voice quality control means 50a by the voice quality control means 50b, the microphone 110 by the operation means 40b, and the auditory characteristic measuring means 120 by the auditory characteristic storage means 220. The speech synthesis unit 30b has a speech synthesis control means 70b, a database unit 200b that stores synthesis units, a compression processing means 140b that applies compression processing to the synthesis units to compress the dynamic range of their amplitude, and a segment connection means 90b that connects the synthesis units processed by the compression processing means 140b to generate synthesized speech. The database unit 200b has a plurality of segment databases 280a to 280n, which store segments subjected to a plurality of different emphasis processes separately for each emphasis process, and a switch 210b that switches the connection between the segment databases 280a to 280n and the compression processing means 140b.

　以上のように構成されたこの実施例の音声合成装置において、以下その動作を図９、図１０、図１１に従って説明する。 The operation of the speech synthesizing apparatus according to this embodiment configured as described above will be described below with reference to FIGS. 9, 10, and 11.

　図１０、図１１において図２、図４と同一の動作については同一符号を付しているので説明を省略し、異なった部分についてのみ説明する。まず聴覚特性記憶手段２２０に記憶されたあらかじめ測定された聴覚特性を声質制御手段５０ｂに出力する。（ステップ２０００）。声質制御手段５０ｂは聴覚特性記憶手段２２０より入力された聴覚特性に基づき圧縮処理のパラメータを設定し圧縮処理手段１４０ｂへ出力する（ステップ２１００）。圧縮処理のパラメータ設定方法は例えば聴覚研究会資料、資料番Ｈ−９５−４、１頁〜８頁に示された設定方法のようにする。テキスト入力手段１０は言語処理手段２０に目的のテキストを入力する（ステップ１２００）。次に言語処理手段２０はテキスト入力手段１０より入力されたテキストの構文解析を行い、読み情報、韻律情報および強調部情報を生成し音声合成制御手段７０ｂに出力する（ステップ１３００）。使用者は操作手段４０ｂに強調の種類および強調の程度を入力し、操作手段４０ｂは入力結果を強調選択情報として声質制御手段５０ｂに出力する（ステップ２４００）。声質制御手段５０ｂは操作手段４０ｂより入力された強調選択情報に最も近い強調が施された素片データベースを素片データベース２８０ａ〜ｎより選択し、スイッチ２１０ｂを切り替えて圧縮処理手段１４０ｂに接続する（ステップ２５００）。ステップ２５００で圧縮処理手段１４０ｂと接続された素片データベース２８０は音声合成制御手段７０ｂより入力された読み情報に従って圧縮処理手段１４０ｂに合成単位を出力する（ステップ２６００）。圧縮処理手段１４０ｂは声質制御手段５０ｂより入力された圧縮処理パラメータに従って素片データベース２８０より入力された合成単位の振幅のダイナミックレンジを圧縮し、素片接続手段９０ｂに出力する（ステップ２７００）。素片接続手段９０ｂは音声合成制御手段７０ｂより入力された韻律情報および強調部情報に従って圧縮処理手段１４０ｂより入力された合成単位を合成し合成音声を生成する（ステップ２８００）。まず素片接続手段９０ｂは音声合成制御手段７０ｂより入力された強調部情報が真か偽かを判定する（ステップ１６１０）。ステップ１６１０において強調部情報が真の場合、素片接続手段９０ｂは合成単位中の母音定常部の時間長を２０％延長し（ステップ２９２０）、さらに合成単位に対応するクロージャーの値を２０％延長し（ステップ１６２０）、音声合成制御手段７０ｂより入力された韻律情報に従って合成音声を生成する（ステップ２９３０）。もしステップ１６１０において強調処理情報が偽の場合、素片接続手段９０ｂは音声合成制御手段７０ｂより入力された韻律情報に従って合成音声を生成する（ステップ２９３０）。素片接続手段９０ｂは電気音響変換器６０を通して合成音声を出力する（ステップ１９００）。、 In FIGS. 10 and 11, the same operations as those in FIGS. 2 and 4 are denoted by the same reference numerals, and the description thereof will not be repeated. Only different parts will be described. First, the pre-measured hearing characteristics stored in the hearing characteristics storage means 220 are output to the voice quality control means 50b. (Step 2000). The voice quality control means 50b sets parameters for compression processing based on the auditory characteristics input from the auditory characteristic storage means 220, and outputs the parameters to the compression processing means 140b (step 2100). The parameter setting method of the compression process is, for example, the setting method described in Auditory Study Group Material, Material No. H-95-4, pp. 1-8. 
The text input means 10 inputs the target text to the language processing means 20 (step 1200). Next, the language processing means 20 performs a syntax analysis of the text input from the text input means 10, generates reading information, prosody information, and emphasis section information, and outputs them to the speech synthesis control means 70b (step 1300). The user inputs the type and degree of emphasis to the operation means 40b, and the operation means 40b outputs the input to the voice quality control means 50b as emphasis selection information (step 2400). The voice quality control means 50b selects, from the segment databases 280a to 280n, the segment database whose emphasis is closest to the emphasis selection information input from the operation means 40b, and sets the switch 210b to connect that database to the compression processing means 140b (step 2500). The segment database 280 connected to the compression processing means 140b in step 2500 outputs synthesis units to the compression processing means 140b according to the reading information input from the speech synthesis control means 70b (step 2600). The compression processing means 140b compresses the dynamic range of the amplitude of the synthesis units input from the segment database 280 according to the compression parameters input from the voice quality control means 50b, and outputs them to the segment connection means 90b (step 2700). The segment connection means 90b combines the synthesis units input from the compression processing means 140b according to the prosody information and emphasis section information input from the speech synthesis control means 70b to generate synthesized speech (step 2800). First, the segment connection means 90b determines whether the emphasis section information input from the speech synthesis control means 70b is true or false (step 1610). 
If the emphasis section information is true in step 1610, the segment connection means 90b extends the duration of the steady-state vowel portion in the synthesis unit by 20% (step 2920), further extends the closure value corresponding to the synthesis unit by 20% (step 1620), and generates synthesized speech according to the prosody information input from the speech synthesis control means 70b (step 2930). If the emphasis processing information is false in step 1610, the segment connection means 90b generates synthesized speech according to the prosody information input from the speech synthesis control means 70b (step 2930). The segment connection means 90b outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
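
Step 2500 selects the segment database "closest" to the user's request, but the text does not say how closeness is measured. The Python sketch below assumes one plausible reading: prefer databases built with the requested emphasis type, then take the one whose emphasis degree is nearest. The `SegmentDatabase` record, the 0 to 1 degree scale, and the fallback to all databases when no type matches are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentDatabase:
    """Hypothetical description of one pre-emphasized segment database."""
    name: str      # e.g. "280a"
    kind: str      # emphasis type applied to the stored segments
    degree: float  # emphasis strength on an assumed 0-1 scale


def select_database(databases, kind, degree):
    """Pick the database nearest to the requested emphasis type and degree."""
    # Prefer databases of the requested type; fall back to all if none match.
    candidates = [d for d in databases if d.kind == kind] or list(databases)
    return min(candidates, key=lambda d: abs(d.degree - degree))
```

Because every database already holds segments with emphasis applied, the selection replaces run-time emphasis processing with a lookup; only the amplitude compression of step 2700 and the duration changes of steps 2920 and 1620 remain to be done at synthesis time.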

　（実施例３）
　以下本発明の第３の実施例について、図面を参照しながら説明する。 (Example 3)
Hereinafter, a third embodiment of the present invention will be described with reference to the drawings.

　図１２は本発明の音声合成装置の第３の実施例を示す構成ブロック図である。図１３に第３の実施例の動作を説明するための流れ図を示す。第３の実施例の構成において図９に示した第２の実施例の構成と同一物または部分については説明を省略し、異なった部分についてのみ説明する。図９の聴覚特性記憶手段２２０が聴覚特性読み取り手段３１０に置き換わり、音声合成部３０ｂが音声合成部３０ｃに置き換わり、声質制御手段５０ｂが声質制御手段５０ｃに置き換わり、素片データベース３８０ａ〜ｎ、聴覚特性３２０ａ〜ｎがつけ加わった以外は図９と同一な構成である。前記の音声合成部３０ｃは図９の音声合成制御手段７０ｂが音声合成制御手段７０ｃに置き換わり、データベース部２００ｂが素片データベース読み取り手段３００に置き換わった以外は図９の音声合成部３０ｂと同一な構成である。素片データベース３８０ａ〜ｎは複数の異なる強調の種類と強調の程度の強調処理を施した合成単位を強調処理ごとに格納した記憶媒体である。素片データベース読み取り手段３００は圧縮処理手段１４０ｂが参照する素片データベース３８０を読みとるものである。聴覚特性３２０ａ〜ｎはあらかじめ測定された複数の使用者の聴覚特性を個人ごとに格納した記憶媒体である。聴覚特性読み取り手段３１０は声質制御手段５０ｃが参照する聴覚特性を読みとるものである。 FIG. 12 is a block diagram showing the configuration of a third embodiment of the speech synthesizer of the present invention. FIG. 13 is a flowchart for explaining the operation of the third embodiment. In the configuration of the third embodiment, description of the components or portions that are the same as in the second embodiment shown in FIG. 9 is omitted, and only the differing portions are described. The configuration is the same as that of FIG. 9 except that the auditory characteristic storage means 220 of FIG. 9 is replaced by the auditory characteristic reading means 310, the speech synthesis unit 30b by the speech synthesis unit 30c, and the voice quality control means 50b by the voice quality control means 50c, and that the segment databases 380a to 380n and the hearing characteristics 320a to 320n are added. The speech synthesis unit 30c has the same configuration as the speech synthesis unit 30b of FIG. 9 except that the speech synthesis control means 70b of FIG. 9 is replaced by the speech synthesis control means 70c and the database unit 200b is replaced by the segment database reading means 300. The segment databases 380a to 380n are storage media that store synthesis units subjected to emphasis processing of a plurality of different emphasis types and degrees, separately for each emphasis process. 
The unit database reading unit 300 reads the unit database 380 referenced by the compression processing unit 140b. The hearing characteristics 320a to 320n are storage media in which the previously measured hearing characteristics of a plurality of users are stored for each individual. The auditory characteristic reading unit 310 reads the auditory characteristic referred to by the voice quality control unit 50c.

　以上のように構成されたこの実施例の音声合成装置において、以下その動作を図１２、図１３に従って説明する。 The operation of the speech synthesizing apparatus of the present embodiment configured as described above will be described below with reference to FIGS. 12 and 13.

　図１３において図１０と同一の動作については同一符号を付しているので説明を省略し、異なった部分についてのみ説明する。まず聴覚特性読み取り手段３１０により、あらかじめセットした使用者に対応する聴覚特性３２０を読み出し、声質制御手段５０ｃに出力する。（ステップ３０００）。声質制御手段５０ｃは聴覚特性読み取り手段３１０より入力された聴覚特性に基づき圧縮処理のパラメータを設定し圧縮処理手段１４０ｂへ出力する（ステップ２１００）。テキスト入力手段１０は言語処理手段２０に目的のテキストを入力する（ステップ１２００）。次に言語処理手段２０はテキスト入力手段１０より入力されたテキストを構文解析を行い、読み情報、韻律情報および強調部情報を生成し音声合成制御手段７０ｃに出力する（ステップ１３００）。素片データベース読み取り手段３００は音声合成制御手段７０ｃより入力された読み情報に従って、あらかじめ使用者の好みおよび使用する場面に応じてセットされた素片データベース３８０より合成単位を読み出し圧縮処理手段１４０ｂに出力する（ステップ３６００）。圧縮処理手段１４０ｂは声質制御手段５０ｃより入力された圧縮処理パラメータに従って素片データベース３８０より入力された合成単位の振幅のダイナミックレンジを圧縮し、素片接続手段９０ｂに出力する（ステップ２７００）。素片接続手段９０ｂは音声合成制御手段７０ｃより入力された韻律情報および強調部情報に従って圧縮処理手段１４０ｂより入力された合成単位を合成し合成音声を生成する（ステップ２８００）。素片接続手段９０ｂは電気音響変換器６０を通して合成音声を出力する（ステップ１９００）。 In FIG. 13, the same operations as those in FIG. 10 are denoted by the same reference numerals, and the description thereof will not be repeated. Only different parts will be described. First, the auditory characteristic reading means 310 reads out the auditory characteristic 320 corresponding to the preset user and outputs it to the voice quality control means 50c. (Step 3000). The voice quality control unit 50c sets parameters for compression processing based on the auditory characteristics input from the auditory characteristic reading unit 310, and outputs the parameters to the compression processing unit 140b (step 2100). The text input means 10 inputs a target text to the language processing means 20 (step 1200). Next, the language processing unit 20 performs syntax analysis on the text input from the text input unit 10, generates reading information, prosody information, and emphasis unit information, and outputs the generated information to the speech synthesis control unit 70c (step 1300). 
The segment database reading means 300 reads synthesis units, according to the reading information input from the speech synthesis control means 70c, from the segment database 380 that was set in advance according to the user's preference and the usage situation, and outputs them to the compression processing means 140b (step 3600). The compression processing means 140b compresses the dynamic range of the amplitude of the synthesis units input from the segment database 380 according to the compression parameters input from the voice quality control means 50c, and outputs them to the segment connection means 90b (step 2700). The segment connection means 90b combines the synthesis units input from the compression processing means 140b according to the prosody information and emphasis section information input from the speech synthesis control means 70c to generate synthesized speech (step 2800). The segment connection means 90b outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).

　（実施例４）
　以下本発明の第４の実施例について、図面を参照しながら説明する。 (Example 4)
Hereinafter, a fourth embodiment of the present invention will be described with reference to the drawings.

　図１４は本発明の音声合成装置の第４の実施例を示す構成ブロック図である。図１５に第４の実施例の動作を説明するための流れ図を、図１６、図１７に動作を説明するための流れ図の一部を示す。図１４において図１と同一物または部分については同一符号を付しているので説明を省略し、異なった部分についてのみ説明する。図１の音声合成部３０ａが音声合成部３０ｄに置き換わり、声質制御手段５０ａが声質制御手段５０ｄに置き換わり、聴覚特性測定手段１２０が削除された以外は図１と同一な構成である。前記の音声合成部３０ｄは、音声合成制御手段７０ｄ、合成単位を記憶しておく素片データベース８０、素片データベース８０に記憶された合成単位をつなげて合成音声を生成する素片接続手段９０ｄ、および素片接続手段９０ｄで生成された合成音声に強調処理を施す音声音韻強調処理手段１３０ｄを有する。 FIG. 14 is a block diagram showing the configuration of a fourth embodiment of the speech synthesizer of the present invention. FIG. 15 shows a flowchart for explaining the operation of the fourth embodiment, and FIGS. 16 and 17 show a part of the flowchart for explaining the operation. In FIG. 14, the same components or portions as those in FIG. 1 are denoted by the same reference numerals, and the description thereof will be omitted. The configuration is the same as that of FIG. 1 except that the voice synthesis unit 30a in FIG. 1 is replaced with the voice synthesis unit 30d, the voice quality control unit 50a is replaced with the voice quality control unit 50d, and the auditory characteristic measurement unit 120 is deleted. The speech synthesis unit 30d includes a speech synthesis control unit 70d, a segment database 80 that stores synthesis units, a unit connection unit 90d that connects the synthesis units stored in the unit database 80 to generate synthesized speech, And speech phoneme emphasis processing means 130d for emphasizing the synthesized speech generated by the unit connection means 90d.

　以上のように構成されたこの実施例の音声合成装置において、以下その動作を図１４、図１５、図１６、図１７、図１８に従って説明する。図１５、図１６、図１７、図１８において図２、図４、図５、図６と同一の動作については同一符号を付しているので説明を省略し、異なった部分についてのみ説明する。 The operation of the speech synthesizing apparatus according to this embodiment configured as described above will be described below with reference to FIGS. 14, 15, 16, 17, and 18. 15, 16, 17, and 18, the same operations as those in FIGS. 2, 4, 5, and 6 are denoted by the same reference numerals, and the description thereof will not be repeated. Only different parts will be described.

　まずテキスト入力手段１０は言語処理手段２０に目的のテキストを入力する（ステップ１２００）。次に言語処理手段２０はテキスト入力手段１０より入力されたテキストを構文解析を行い、読み情報、韻律情報および強調部情報を生成し音声合成制御手段７０ｄに出力する（ステップ１３００）。素片データベース８０は音声合成制御手段７０ｄより入力された読み情報に従って素片接続手段９０ｄに合成単位を出力する（ステップ４４００）。素片接続手段９０ｄは音声合成制御手段７０ｄより入力された韻律情報および強調部情報に従って素片データベース８０より入力された合成単位を接続して合成音声を生成し、音韻強調処理手段１３０ｄに出力する（ステップ１６００）。声質制御手段５０ｄは強調処理方法の設定を行う（ステップ４７００）。まず声質制御手段５０ｄはマイクロフォン１１０より入力された環境音を１ｋＨｚ以下、１ｋＨｚ〜２ｋＨｚ、２ｋＨｚ〜４ｋＨｚ、４ｋＨｚ以上の帯域に分割し、帯域ごとに１００ｍｓの平均レベルを求める（ステップ１７１０）。１ｋＨｚ以下の環境音の平均レベル、１ｋＨｚ〜２ｋＨｚの環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚ、他の帯域の環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚを比較する（ステップ４７２０）。１ｋＨｚ以下の環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上で、かつ１ｋＨｚ〜２ｋＨｚの環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上で、かつ他の帯域の環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ未満の場合、フォルマント強調情報を真とし（ステップ４７３０）、子音強調情報を偽とする（４７８０）。次に全帯域の帯域強調情報を偽とし（ステップ４８００）、制御信号を音韻強調処理手段１３０ｄに出力する（ステップ４８１０）。もしステップ４７２０で１ｋＨｚ以下の環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上で、かつ１ｋＨｚ〜２ｋＨｚの環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上で、かつ他の帯域の環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ未満でない場合は、フォルマント強調情報を偽とし（ステップ４７４０）、１ｋＨｚ〜２ｋＨｚの環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚ、他の帯域の環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚを比較する（ステップ４７５０）。ステップ４７５０で１ｋＨｚ〜２ｋＨｚの環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上、かつ２ｋＨｚ〜４ｋＨｚの環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ以上、かつ１ｋＨｚ以下の環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ未満、かつ４ｋＨｚ以上の環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ未満である場合、子音強調情報を真とし（ステップ４７６０）、全帯域の帯域強調情報を偽とし（ステップ４８００）、制御信号を音韻強調処理手段１３０ｄに出力する（ステップ４８１０）。もしステップ４７５０で１ｋＨｚ〜２ｋＨｚの環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上、かつ２ｋＨｚ〜４ｋＨｚの環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ以上、かつ１ｋＨｚ以下の環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ未満、かつ４ｋＨｚ以上の環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ未満でない場合、子音強調情報を偽とし（ステップ４７７０）、各帯域の帯域強調情報を設定する（ステップ４７９０）。１ｋＨｚ以下・BR>フ環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７３０）。ステップ１７３０において１ｋＨｚ以下の環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上である場合、１ｋＨｚ以下の帯域強調情報を真とし（ステップ４７９１）、１ｋＨｚ〜２ｋＨｚの環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７５０）。もしステップ１７３０において１ｋＨｚ以下の環境音が２０ｄＢＳＰＬ／Ｈｚ未満である場合、１ｋＨｚ以下の帯域強調情報を偽とし（ステップ４７９２）、１ｋＨｚ〜２ｋＨｚの環境音の平均レベルと２０ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７５０）。ステップ１７５０において１ｋＨｚ〜２ｋＨｚの環境音の平均レベルが２０ｄＢＳＰＬ／Ｈｚ以上である場合、１ｋＨｚ〜２ｋＨｚの帯域強調情報を真とし（ステップ４７９３）、２ｋＨｚ〜４ｋＨｚの環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７７０）。もしステップ１７５０において１ｋＨｚ〜２ｋＨｚの環境音が２０ｄＢＳＰＬ／Ｈｚ未満である場合、１ｋＨｚ〜２ｋＨｚの帯域強調情報を偽とし（ステップ４７９４）、２ｋＨｚ〜４ｋＨｚの環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７７０）。ステップ１７７０において２ｋＨｚ〜４ｋＨｚの環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ以上である場合、２ｋＨｚ〜４ｋＨｚの帯域強調情報を真とし（ステップ４７９５）、４ｋＨｚ以上の環境
音の平均レベルと１５ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７９０）。もしステップ１７７０において２ｋＨｚ〜４ｋＨｚの環境音が１５ｄＢＳＰＬ／Ｈｚ未満である場合、２ｋＨｚ〜４ｋＨｚの帯域強調情報を偽とし（ステップ４７９６）、４ｋＨｚ以上の環境音の平均レベルと１５ｄＢＳＰＬ／Ｈｚとを比較する（ステップ１７９０）。ステップ１７９０において４ｋＨｚ以上の環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ以上である場合、４ｋＨｚ以上の帯域強調情報を真とし（ステップ４７９７）、制御信号を音韻強調処理手段１３０ｄに出力する（ステップ４８１０）。もしステップ１７９０において４ｋＨｚ以上の環境音の平均レベルが１５ｄＢＳＰＬ／Ｈｚ未満である場合、４ｋＨｚ以上の帯域強調情報を偽とし（ステップ４７９８）、制御信号を音韻強調処理手段１３０ｄに出力する（ステップ４８１０）。音韻強調処理手段１３０ｄは音声合成制御手段７０ｄより入力された強調部情報および声質制御手段５０ｄより入力された制御信号に従って強調処理を行う（ステップ４９００）。音韻強調処理手段１３０ｄは音声合成制御手段７０ｄより入力された強調部情報が真か偽かを判定する（ステップ１５１０）。ステップ１５１０において強調部情報が真である場合、合成単位中の母音定常部の時間長を２０％延長し（ステップ１５２０）。声質制御手段５０ｄより入力されたフォルマント強調情報が真か偽かを判定する（ステップ１５３０）。もしステップ１５１０において強調部情報が偽である場合、声質制御手段５０ｄより入力されたフォルマント強調情報が真か偽かを判定する（ステップ１５３０）。ステップ１５３０においてフォルマント強調情報が真である場合、素片接続手段９０ｄより入力された合成音声のスペクトル包絡を求め、スペクトルピークを強調する（ステップ４９１０）。スペクトルピークの強調の方法については例えば平成５年、日本音響学会講演論文集春季２８５頁〜２８６頁に示すような方法を用いるものとする。次に声質制御手段５０ｄより入力された子音強調情報が真か偽かを判定する（ステップ１５５０）。もしステップ１５３０においてフォルマント強調情報が偽である場合、声質制御手段５０ｄより入力された子音強調情報が真か偽かを判定する（ステップ１５５０）。ステップ１５５０において子音強調情報が真である場合、音韻強調処理手段１３０ｄは合成単位中の子音および子音から母音への渡りの振幅を増幅する（ステップ４９２０）。子音強調の方法は例えば１９９２年、電子情報通信学会技術研究報告、巻９１、５１３号３１頁〜３８頁に示すような方法を用いるものとする。次に声質制御手段５０ｄより入力された１ｋＨｚ以下の帯域強調情報が真か偽かを判定する（ステップ４９３０）。もしステップ１５６０において子音強調情報が偽である場合、声質制御手段５０より入力された１ｋＨｚ以下の帯域強調情報が真か偽かを判定する（ステップ４９３０）。ステップ４９３０において１ｋＨｚ以下の帯域強調情報が真である場合、音韻強調処理手段１３０ｄは素片接続手段９０ｄより入力された合成音声の１ｋＨｚ以下の帯域成分の強調処理を行い（ステップ４９４０）、１ｋＨｚ〜２ｋＨｚの帯域強調情報が真か偽かを判定する（ステップ４９５０）。もしステップ４９３０において１ｋＨｚ以下の帯域強調情報が偽である場合、１ｋＨｚ〜２ｋＨｚの帯域強調情報が真か偽かを判定する（ステップ４９５０）。ステップ４９５０において１ｋＨｚ〜２ｋＨｚの帯域強調情報が真である場合、音韻強調処理手段１３０ｄは素片接続手段９０ｄより入力された合成音声の１ｋＨｚ〜２ｋＨｚの帯域成分の強調処理を行い（ステップ４９６０）、２ｋＨｚ〜４ｋＨｚの帯域強調情報が真か偽かを判定する（ステップ４９７０）。もしステップ４９５０において１ｋＨｚ〜２ｋＨｚの帯域強調情報が偽である場合、２ｋＨｚ〜４ｋＨｚの帯域強調情報が真か偽かを判定する（ステップ４９７０）。ステップ４９７０において２ｋＨｚ〜４ｋＨｚの帯域強調情報が真である場合、音韻強調処理手段１３０ｄは素片接続手段９０ｄより入力された合成音声の２ｋＨｚ〜４ｋＨｚの帯域成分の強調処理を行い（ステップ４９８０）、４ｋＨｚ以上の帯域強調情報が真か偽かを判定する（ステップ４９９０）。もしステップ４９７０において２ｋＨｚ〜４ｋＨｚの帯域強調情報が偽である場合、４ｋＨｚ以上の帯域強調情報が真か偽かを判定する（ステップ４９９０）。ステップ４９９０において４ｋＨｚ以上の帯域強調情報が真である場合、音韻強調処理手段１３０ｄは素片接続手段９０ｄより入力された合成音声の４ｋＨｚ以上の帯域成分の強調処理を行い（ステップ５０００）、電気音響変換器６０を通して合成音声を出力する（ステップ１９００）。もしス
テップ４９９０において４ｋＨｚ以上の帯域強調情報が偽である場合、電気音響変換器６０を通して合成音声を出力する（ステップ１９００）。 First, the text input means 10 inputs the target text to the language processing means 20 (step 1200). Next, the language processing means 20 performs a syntax analysis of the text input from the text input means 10, generates reading information, prosody information, and emphasis section information, and outputs them to the speech synthesis control means 70d (step 1300). The segment database 80 outputs synthesis units to the segment connection means 90d according to the reading information input from the speech synthesis control means 70d (step 4400). The segment connection means 90d connects the synthesis units input from the segment database 80 according to the prosody information and emphasis section information input from the speech synthesis control means 70d to generate synthesized speech, and outputs it to the phoneme emphasis processing means 130d (step 1600). The voice quality control means 50d sets the emphasis processing method (step 4700). First, the voice quality control means 50d divides the environmental sound input from the microphone 110 into four bands (1 kHz or less, 1 kHz to 2 kHz, 2 kHz to 4 kHz, and 4 kHz or more) and obtains the average level over 100 ms for each band (step 1710). The average levels of the environmental sound of 1 kHz or less and of 1 kHz to 2 kHz are compared with 20 dBSPL/Hz, and the average levels of the environmental sound in the other bands are compared with 15 dBSPL/Hz (step 4720). If the average level of the environmental sound of 1 kHz or less is 20 dBSPL/Hz or more, the average level of the environmental sound of 1 kHz to 2 kHz is 20 dBSPL/Hz or more, and the average levels of the environmental sound in the other bands are less than 15 dBSPL/Hz, the formant emphasis information is set to true (step 4730) and the consonant emphasis information is set to false (step 4780). 
Next, the band emphasis information of all bands is set to false (step 4800), and a control signal is output to the phoneme emphasis processing means 130d (step 4810). If the condition of step 4720 does not hold, that is, unless the average level of the environmental sound of 1 kHz or less is 20 dBSPL/Hz or more, the average level of the environmental sound of 1 kHz to 2 kHz is 20 dBSPL/Hz or more, and the average levels of the environmental sound in the other bands are less than 15 dBSPL/Hz, the formant emphasis information is set to false (step 4740), and the average level of the environmental sound of 1 kHz to 2 kHz is compared with 20 dBSPL/Hz and the average levels of the environmental sound in the other bands with 15 dBSPL/Hz (step 4750). If, in step 4750, the average level of the environmental sound of 1 kHz to 2 kHz is 20 dBSPL/Hz or more, the average level of the environmental sound of 2 kHz to 4 kHz is 15 dBSPL/Hz or more, the average level of the environmental sound of 1 kHz or less is less than 20 dBSPL/Hz, and the average level of the environmental sound of 4 kHz or more is less than 15 dBSPL/Hz, the consonant emphasis information is set to true (step 4760), the band emphasis information of all bands is set to false (step 4800), and the control signal is output to the phoneme emphasis processing means 130d (step 4810). If the condition of step 4750 does not hold, the consonant emphasis information is set to false (step 4770), and the band emphasis information of each band is set (step 4790). The average level of the environmental sound of 1 kHz or less is compared with 20 dBSPL/Hz (step 1730). 
If the average level of the environmental sound of 1 kHz or less is 20 dBSPL/Hz or more in step 1730, the band emphasis information for 1 kHz or less is set to true (step 4791), and the average level of the environmental sound of 1 kHz to 2 kHz is compared with 20 dBSPL/Hz (step 1750). If the average level of the environmental sound of 1 kHz or less is less than 20 dBSPL/Hz in step 1730, the band emphasis information for 1 kHz or less is set to false (step 4792), and the average level of the environmental sound of 1 kHz to 2 kHz is compared with 20 dBSPL/Hz (step 1750). If the average level of the environmental sound of 1 kHz to 2 kHz is 20 dBSPL/Hz or more in step 1750, the band emphasis information for 1 kHz to 2 kHz is set to true (step 4793), and the average level of the environmental sound of 2 kHz to 4 kHz is compared with 15 dBSPL/Hz (step 1770). If the average level of the environmental sound of 1 kHz to 2 kHz is less than 20 dBSPL/Hz in step 1750, the band emphasis information for 1 kHz to 2 kHz is set to false (step 4794), and the average level of the environmental sound of 2 kHz to 4 kHz is compared with 15 dBSPL/Hz (step 1770). If the average level of the environmental sound of 2 kHz to 4 kHz is 15 dBSPL/Hz or more in step 1770, the band emphasis information for 2 kHz to 4 kHz is set to true (step 4795), and the average level of the environmental sound of 4 kHz or more is compared with 15 dBSPL/Hz (step 1790). If the average level of the environmental sound of 2 kHz to 4 kHz is less than 15 dBSPL/Hz in step 1770, the band emphasis information for 2 kHz to 4 kHz is set to false (step 4796), and the average level of the environmental sound of 4 kHz or more is compared with 15 dBSPL/Hz (step 1790). 
If the average level of the environmental sound of 4 kHz or more is 15 dBSPL/Hz or more in step 1790, the band emphasis information of 4 kHz or more is set to true (step 4797); if it is less than 15 dBSPL/Hz, the band emphasis information of 4 kHz or more is set to false (step 4798). In either case, the control signal is output to the phoneme emphasis processing means 130d (step 4810). The phoneme emphasis processing means 130d performs emphasis processing according to the emphasis part information input from the speech synthesis control means 70d and the control signal input from the voice quality control means 50d (step 4900). First, the phoneme emphasis processing means 130d determines whether the emphasis part information input from the speech synthesis control means 70d is true or false (step 1510). If the emphasis part information is true in step 1510, the duration of the steady-state portion of the vowel in the synthesis unit is extended by 20% (step 1520), and it is then determined whether the formant emphasis information input from the voice quality control means 50d is true or false (step 1530). If the emphasis part information is false in step 1510, processing proceeds directly to the determination of whether the formant emphasis information input from the voice quality control means 50d is true or false (step 1530). If the formant emphasis information is true in step 1530, the spectral envelope of the synthesized speech input from the segment connection means 90d is obtained, and the spectral peaks are emphasized (step 4910). As a method of emphasizing the spectral peaks, for example, the method shown in pp. 285 to 286 of the 1993 Spring Meeting of the Acoustical Society of Japan is used. Next, it is determined whether the consonant emphasis information input from the voice quality control means 50d is true or false (step 1550).
If the formant emphasis information is false in step 1530, it is determined whether the consonant emphasis information input from the voice quality control means 50d is true or false (step 1550). If the consonant emphasis information is true in step 1550, the phoneme emphasis processing means 130d amplifies the amplitude of the consonant in the synthesis unit and of the transition from the consonant to the vowel (step 4920). As a method of emphasizing consonants, for example, the method shown in IEICE Technical Report, Vol. 91, 513, pp. 31-38, 1992 is used. Next, it is determined whether the band emphasis information of 1 kHz or less input from the voice quality control means 50d is true or false (step 4930). If the consonant emphasis information is false in step 1550, it is likewise determined whether the band emphasis information of 1 kHz or less input from the voice quality control means 50d is true or false (step 4930). If the band emphasis information of 1 kHz or less is true in step 4930, the phoneme emphasis processing means 130d performs emphasis processing on the band component of 1 kHz or less of the synthesized speech input from the segment connection means 90d (step 4940), and it is then determined whether the band emphasis information of 1 kHz to 2 kHz is true or false (step 4950). If the band emphasis information of 1 kHz or less is false in step 4930, it is determined whether the band emphasis information of 1 kHz to 2 kHz is true or false (step 4950). If the band emphasis information of 1 kHz to 2 kHz is true in step 4950, the phoneme emphasis processing means 130d performs emphasis processing on the band component of 1 kHz to 2 kHz of the synthesized speech input from the segment connection means 90d (step 4960), and it is then determined whether the band emphasis information of 2 kHz to 4 kHz is true or false (step 4970).
If the band emphasis information of 1 kHz to 2 kHz is false in step 4950, it is determined whether the band emphasis information of 2 kHz to 4 kHz is true or false (step 4970). If the band emphasis information of 2 kHz to 4 kHz is true in step 4970, the phoneme emphasis processing means 130d performs emphasis processing on the band component of 2 kHz to 4 kHz of the synthesized speech input from the segment connection means 90d (step 4980), and it is then determined whether the band emphasis information of 4 kHz or more is true or false (step 4990). If the band emphasis information of 2 kHz to 4 kHz is false in step 4970, it is determined whether the band emphasis information of 4 kHz or more is true or false (step 4990). If the band emphasis information of 4 kHz or more is true in step 4990, the phoneme emphasis processing means 130d performs emphasis processing on the band component of 4 kHz or more of the synthesized speech input from the segment connection means 90d (step 5000), and the synthesized speech is output through the electroacoustic transducer 60 (step 1900). If the band emphasis information of 4 kHz or more is false in step 4990, the synthesized speech is output through the electroacoustic transducer 60 without this processing (step 1900).
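
For illustration only, the threshold logic from step 4750 onward (the path on which the formant emphasis information has already been set to false) can be sketched in Python as follows. The function names, dictionary keys, and the shape of the returned control signal are hypothetical and are not part of the disclosure; only the thresholds in dBSPL/Hz and the step numbers come from the description above.

```python
# Per-band thresholds in dBSPL/Hz, as stated for steps 1730, 1750, 1770, 1790.
BAND_THRESHOLDS = {
    "le_1k": 20.0,   # 1 kHz or less
    "1k_2k": 20.0,   # 1 kHz to 2 kHz
    "2k_4k": 15.0,   # 2 kHz to 4 kHz
    "ge_4k": 15.0,   # 4 kHz or more
}


def consonant_emphasis(levels):
    """Condition of step 4750: noise concentrated in 1 kHz to 4 kHz."""
    return (
        levels["1k_2k"] >= 20.0
        and levels["2k_4k"] >= 15.0
        and levels["le_1k"] < 20.0
        and levels["ge_4k"] < 15.0
    )


def band_emphasis_flags(levels):
    """Steps 4791 to 4798: a band is flagged when its noise reaches the threshold."""
    return {band: levels[band] >= thr for band, thr in BAND_THRESHOLDS.items()}


def control_signal(levels):
    """Hypothetical control signal assembled before step 4810."""
    if consonant_emphasis(levels):
        # Steps 4760 and 4800: consonant emphasis on, all band flags off.
        return {"consonant": True, "bands": {b: False for b in BAND_THRESHOLDS}}
    # Steps 4770 and 4790: consonant emphasis off, band flags set individually.
    return {"consonant": False, "bands": band_emphasis_flags(levels)}
```

Under these assumptions, mid-band noise selects consonant emphasis alone, while any other noise pattern falls through to independent per-band flags.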

(Example 5)
Hereinafter, a fifth embodiment of the present invention will be described with reference to the drawings.

FIG. 19 is a block diagram showing the configuration of a fifth embodiment of the speech synthesizer of the present invention. FIG. 20 is a flowchart for explaining the operation of the fifth embodiment. In FIG. 19, components or portions identical to those in FIG. 9 are given the same reference numerals, so their description is omitted and only the differing portions are described. The configuration is the same as that of FIG. 9 except that the speech synthesis unit 30b is replaced by a speech synthesis unit 30e, the voice quality control means 50b is replaced by voice quality control means 50e, the operation means 40b is replaced by operation means 40e, and the auditory characteristic storage means 220 is deleted. The speech synthesis unit 30e includes speech synthesis control means 70e, a database unit 200e that stores synthesis units, and segment connection means 90e that connects synthesis units to generate synthesized speech. The database unit 200e includes a plurality of segment databases 580a to 580n, which store segments subjected to compression processing with different parameters, one database for each parameter used in the compression processing, and a switch 210e that switches the connection between the segment databases 580a to 580n and the segment connection means 90e.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 19 and 20.

In FIG. 20, operations identical to those in FIG. 10 are given the same reference numerals, so their description is omitted and only the differing portions are described. First, the text input means 10 inputs a target text to the language processing means 20 (step 1200). Next, the language processing means 20 parses the text input from the text input means 10, generates reading information, prosody information, and emphasis part information, and outputs them to the speech synthesis control means 70e (step 1300). The user inputs the desired degree of compression to the operation means 40e, and the operation means 40e outputs the input to the voice quality control means 50e as compression ratio selection information (step 5400). From the segment databases 580a to 580n, the voice quality control means 50e selects the segment database whose segments were compressed at the compression ratio closest to the compression ratio selection information input from the operation means 40e, and switches the switch 210e to connect that database to the segment connection means 90e (step 5500). The segment database 580 connected to the segment connection means 90e in step 5500 outputs synthesis units to the segment connection means 90e according to the reading information input from the speech synthesis control means 70e (step 5600).
The segment connection means 90e connects the synthesis units input from the segment database 580 according to the prosody information and the emphasis part information input from the speech synthesis control means 70e to generate synthesized speech (step 2800), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
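
As a minimal sketch, the database selection of step 5500 amounts to a nearest-value lookup. The Python below assumes that each segment database is tagged with a single numeric compression ratio; the database labels, the ratio values, and the function name are hypothetical, since the description does not specify how the compression parameters are represented.

```python
def select_segment_database(database_ratios, requested_ratio):
    """Return the label of the database whose compression ratio is closest
    to the ratio requested through the operation means (step 5500)."""
    return min(
        database_ratios,
        key=lambda label: abs(database_ratios[label] - requested_ratio),
    )


# Hypothetical ratios for three of the databases 580a to 580n.
example_ratios = {"580a": 1.0, "580b": 2.0, "580c": 4.0}
selected = select_segment_database(example_ratios, 2.6)  # "580b"
```

Because the segments are compressed in advance, the only run-time cost of changing the degree of compression is this lookup and the switching of the switch 210e.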

(Example 6)
Hereinafter, a sixth embodiment of the present invention will be described with reference to the drawings.

FIG. 21 is a block diagram showing the configuration of a sixth embodiment of the speech synthesizer of the present invention. FIG. 22 is a flowchart for explaining the operation of the sixth embodiment. In the configuration of the sixth embodiment, the description of components or portions identical to those of the third embodiment shown in FIG. 12 is omitted, and only the differing portions are described. The configuration is the same as that of FIG. 12 except that the speech synthesis unit 30c is replaced by a speech synthesis unit 30f, the segment databases 380a to 380n are replaced by segment databases 680a to 680n, and the auditory characteristic reading means 310, the voice quality control means 50c, and the auditory characteristics a to n are deleted. The speech synthesis unit 30f has the same configuration as the speech synthesis unit 30c of FIG. 12 except that the speech synthesis control means 70c is replaced by speech synthesis control means 70f, the segment connection means 90b is replaced by segment connection means 90f, and the compression processing means 140b is deleted. The segment databases 680a to 680n are storage media that store segments subjected to compression processing with different parameters, one medium for each parameter used in the compression processing.
The segment database reading means 300 reads the segment database 680 referred to by the segment connection means 90f.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 21 and 22.

In FIG. 22, operations identical to those in FIG. 13 are given the same reference numerals, so their description is omitted and only the differing portions are described. First, the text input means 10 inputs a target text to the language processing means 20 (step 1200). Next, the language processing means 20 parses the text input from the text input means 10, generates reading information, prosody information, and emphasis part information, and outputs them to the speech synthesis control means 70f (step 1300). In accordance with the reading information input from the speech synthesis control means 70f, the segment database reading means 300 reads synthesis units from the segment database 680 that has been set in advance according to the user's preference and the situation of use, and outputs them to the segment connection means 90f (step 6600). The segment connection means 90f connects the synthesis units input from the segment database reading means 300 according to the prosody information and the emphasis part information input from the speech synthesis control means 70f to generate synthesized speech (step 2800), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).

(Example 7)
Hereinafter, a seventh embodiment of the present invention will be described with reference to the drawings.

FIG. 23 is a block diagram showing the configuration of a seventh embodiment of the speech synthesizer of the present invention. FIG. 24 is a flowchart for explaining the operation of the seventh embodiment, and FIG. 25 shows a part of the flowchart for explaining the operation. In FIG. 23, components or portions identical to those in FIG. 1 are given the same reference numerals, so their description is omitted and only the differing portions are described. The configuration is the same as that of FIG. 1 except that the speech synthesis unit 30a is replaced by a speech synthesis unit 30g and the voice quality control means 50a is replaced by voice quality control means 50g. The speech synthesis unit 30g has the same configuration as the speech synthesis unit 30a of FIG. 1 except that the speech synthesis control means 70a is replaced by speech synthesis control means 70g, the compression processing means 140a is replaced by compression processing means 140g, the segment connection means 90a is replaced by segment connection means 90g, and the phoneme emphasis processing means 130a is deleted.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 23, 24, and 25.

First, the auditory characteristic measuring means 120 measures the user's auditory characteristics and outputs the measurement result to the voice quality control means 50g (step 1000). The text input means 10 inputs a target text to the language processing means 20 (step 1200). Next, the language processing means 20 parses the text input from the text input means 10, generates reading information, prosody information, and emphasis part information, and outputs them to the speech synthesis control means 70g (step 1300). The segment database 80 outputs synthesis units to the compression processing means 140g according to the reading information input from the speech synthesis control means 70g (step 7400). The compression processing means 140g compresses the dynamic range of the amplitude of the synthesis units input from the segment database 80 according to the control signal input from the voice quality control means 50g (step 7500). In detail, the voice quality control means 50g first divides the environmental sound input from the microphone 110 into the bands of 1 kHz or less, 1 kHz to 2 kHz, 2 kHz to 4 kHz, and 4 kHz or more, and obtains the average level over 100 ms for each band (step 1710). The average level of the environmental sound of 1 kHz or less is compared with the user's minimum audible value at 500 Hz input from the auditory characteristic measuring means 120 (step 7720).
If the average level of the environmental sound of 1 kHz or less is equal to or higher than the user's minimum audible value at 500 Hz in step 7720, the voice quality control means 50g sets the compression processing parameters so that the dynamic range of the level of the components of 1 kHz or less of the synthesis unit extends from the value obtained by adding the user's minimum audible value at 500 Hz, input from the auditory characteristic measuring means 120, to the average level of the environmental sound of 1 kHz or less, up to 90 dBSPL (step 7730). If the average level of the environmental sound of 1 kHz or less is lower than the user's minimum audible value at 500 Hz in step 7720, the compression processing parameters are set based on the measurement result input from the auditory characteristic measuring means 120 (step 7740). In either case, the average level of the environmental sound of 1 kHz to 2 kHz is then compared with the user's minimum audible value at 1 kHz input from the auditory characteristic measuring means 120 (step 7750). The compression processing parameters are set, for example, in the same manner as in the second and third embodiments.
If the average level of the environmental sound of 1 kHz to 2 kHz is equal to or higher than the user's minimum audible value at 1 kHz in step 7750, the voice quality control means 50g sets the compression processing parameters so that the dynamic range of the level of the components of 1 kHz to 2 kHz of the synthesis unit extends from the value obtained by adding the user's minimum audible value at 1 kHz to the average level of the environmental sound of 1 kHz to 2 kHz, up to 90 dBSPL (step 7760). If the average level of the environmental sound of 1 kHz to 2 kHz is lower than the user's minimum audible value at 1 kHz in step 7750, the compression processing parameters are set based on the measurement result input from the auditory characteristic measuring means 120 (step 7770). In either case, the average level of the environmental sound of 2 kHz to 4 kHz is then compared with the user's minimum audible value at 2 kHz input from the auditory characteristic measuring means 120 (step 7780). If the average level of the environmental sound of 2 kHz to 4 kHz is equal to or higher than the user's minimum audible value at 2 kHz in step 7780, the voice quality control means 50g sets the compression processing parameters so that the dynamic range of the level of the components of 2 kHz to 4 kHz of the synthesis unit extends from the value obtained by adding the user's minimum audible value at 2 kHz to the average level of the environmental sound of 2 kHz to 4 kHz, up to 90 dBSPL (step 7790). If the average level of the environmental sound of 2 kHz to 4 kHz is lower than the user's minimum audible value at 2 kHz in step 7780, the compression processing parameters are set based on the measurement result input from the auditory characteristic measuring means 120 (step 7800). In either case, the average level of the environmental sound of 4 kHz or more is then compared with the user's minimum audible value at 4 kHz input from the auditory characteristic measuring means 120 (step 7810).
If the average level of the environmental sound of 4 kHz or more is equal to or higher than the user's minimum audible value at 4 kHz in step 7810, the voice quality control means 50g sets the compression processing parameters so that the dynamic range of the level of the components of 4 kHz or more of the synthesis unit extends from the value obtained by adding the user's minimum audible value at 4 kHz to the average level of the environmental sound of 4 kHz or more, up to 90 dBSPL (step 7820), and outputs a control signal to the compression processing means 140g (step 1810). If the average level of the environmental sound of 4 kHz or more is lower than the user's minimum audible value at 4 kHz in step 7810, the compression processing parameters are set based on the measurement result input from the auditory characteristic measuring means 120 (step 7830), and a control signal is output to the compression processing means 140g (step 1810). The compression processing means 140g applies compression processing to the synthesis units input from the segment database 80 based on the control signal input from the voice quality control means 50g, and outputs the result to the segment connection means 90g (step 7840). The segment connection means 90g connects the synthesis units input from the compression processing means 140g according to the prosody information and the emphasis part information input from the speech synthesis control means 70g to generate synthesized speech (step 7900), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
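
The per-band decision of steps 7720 through 7830 can be sketched as follows, purely as an illustration. The Python names and data layout are hypothetical. Where the environmental sound is below the user's minimum audible value, the description defers to the parameter setting of the second and third embodiments, which is not reproduced here, so the sketch only marks that case with `None`.

```python
# Each band is paired with the audiometric frequency (Hz) whose minimum
# audible value it is compared against, as stated in the description.
BANDS = [("le_1k", 500), ("1k_2k", 1000), ("2k_4k", 2000), ("ge_4k", 4000)]
UPPER_LIMIT_DBSPL = 90.0


def target_dynamic_ranges(noise_levels, min_audible):
    """Return, per band, the (lower, upper) target range in dBSPL.

    noise_levels: average environmental sound level per band (100 ms average).
    min_audible:  user's minimum audible value per audiometric frequency.
    """
    ranges = {}
    for band, freq_hz in BANDS:
        noise = noise_levels[band]
        threshold = min_audible[freq_hz]
        if noise >= threshold:
            # Steps 7730, 7760, 7790, 7820: the lower bound is the noise
            # level plus the minimum audible value; the upper bound is 90 dBSPL.
            ranges[band] = (noise + threshold, UPPER_LIMIT_DBSPL)
        else:
            # Steps 7740, 7770, 7800, 7830: parameters come from the
            # hearing measurement alone (method not specified in this sketch).
            ranges[band] = None
    return ranges
```

In this reading, louder environmental sound raises the floor of the target range, so the same synthesis unit is squeezed into a narrower span below the fixed 90 dBSPL ceiling.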

(Example 8)
Hereinafter, an eighth embodiment of the present invention will be described with reference to the drawings.

FIG. 26 is a block diagram showing the configuration of an eighth embodiment of the speech synthesizer of the present invention. FIG. 27 is a flowchart for explaining the operation of the eighth embodiment, and FIG. 28 is a flowchart for explaining a part of the operation of the eighth embodiment. FIG. 29 is a schematic diagram of the result of the formant emphasis processing of the eighth embodiment. In the configuration of the eighth embodiment, the description of components or portions identical to those of the third embodiment shown in FIG. 12 is omitted, and only the differing portions are described. The configuration is the same as that of FIG. 12 except that the speech synthesis unit 30c is replaced by a speech synthesis unit 30h, the voice quality control means 50c is replaced by voice quality control means 50h, and the segment databases 380a to 380n are deleted. The speech synthesis unit 30h has the same configuration as the speech synthesis unit 30c of FIG. 12 except that the speech synthesis control means 70c is replaced by speech synthesis control means 70h, the segment database reading means 300 is replaced by a segment database 80, the compression processing means 140b is replaced by phoneme emphasis processing means 130h, and an emphasis filter unit 800 is added.
The emphasis filter unit 800 includes formant emphasis filters 810a to 810n, each set in advance to emphasize the formants of one phoneme, and a switch 820 that switches the connection between the formant emphasis filters 810 and the phoneme emphasis processing means 130h.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 26, 27, 28, and 29.

　図２７、図２８、図２９において図２、図４、図１３と同一の動作については同一符号を付しているので説明を省略し、異なった部分についてのみ説明する。まず聴覚特性読み取り手段３１０により、あらかじめセットした使用者に対応する聴覚特性を読み出し、声質制御手段５０ｈに出力する。（ステップ３０００）。声質制御手段５０は聴覚特性読み取り手段３１０より入力された聴覚特性に基づき強調処理の設定を決定し音韻強調処理手段１３０ｈへ出力する（ステップ１１００）。テキスト入力手段１０は言語処理手段２０に目的のテキストを入力する（ステップ１２００）。次に言語処理手段２０はテキスト入力手段１０より入力されたテキストの構文解析を行い、読み情報、韻律情報および強調部情報を生成し音声合成制御手段７０ｈに出力する（ステップ１３００）。素片データベース８０は音声合成制御手段７０ｈより入力された読み情報に従って音韻強調処理手段１３０ｈに合成単位を出力する（ステップ１４００）。音韻強調処理手段１３０ｈは音声合成制御手段７０ｈより入力された強調部情報と声質制御手段５０ｈより入力された制御信号に従って合成単位に強調処理を施す（ステップ８５００）。音韻強調処理手段１３０ｈは音声合成制御手段７０ｈより入力された強調部情報が真か偽かを判定する（ステップ１５１０）。ステップ１５１０において強調部情報が真である場合、合成単位中の母音定常部の時間長を２０％延長し（ステップ１５２０）。声質制御手段５０ｈより入力されたフォルマント強調情報が真か偽かを判定する（ステップ１５３０）。もしステップ１５１０において強調部情報が偽である場合、声質制御手段５０ｈより入力されたフォルマント強調情報が真か偽かを判定する（ステップ１５３０）。ステップ１５３０においてフォルマント強調情報が真である場合、音声合成制御手段７０ｈより出力された制御信号により素片データベース８０より出力された合成単位に対応するフォルマント強調フィルタ８１０にスイッチ８２０を接続する（ステップ８５１０）。図２８に示すように、ステップ８５１０で接続されたあらかじめ音韻ごとに設定されたフィルタバンクを用いて、フォルマントを含む帯域を選択的に通過させ、図７ｃ）に示すようにフォルマントを含む帯域とフォルマントを含まない帯域とのコントラストを強調する（ステップ８５４０）。次に声質制御手段５０より入力された子音強調情報が真か偽かを判定する（ステップ１５５０）。もしステップ１５３０においてフォルマント強調情報が偽である場合、声質制御手段５０ｈより入力された子音強調情報が真か偽かを判定する（ステップ１５５０）。ステップ１５５０において子音強調情報が真である場合、合成単位中の子音および子音から母音への渡りの振幅を増幅する（ステップ１５６０）。次に声質制御手段５０ｈより入力された帯域強調情報が真か偽かを判定する（ステップ１５７０）。もしステップ１５６０において子音強調情報が偽である場合、声質制御手段５０ｈより入力された帯域強調情報が真か偽かを判定する（ステップ１５７０）。ステップ１５７０において帯域強調情報が真である場合、合成単位中の子音に２ｋＨｚ以上の帯域を強調する高帯域強調処理を行い（ステップ１５８０）、音韻強調処理手段１３０ｈは合成単位を素片接続手段９０ｈに出力する（ステップ１５９０）。もしステップ１５７０において帯域強調情報が偽である場合、音韻強調処理手段１３０ｈは合成単位を素片接続手段９０ｈに出力する（ステップ１５９０）。素片接続手段９０ｈは音声合成制御手段７０ｈより入力された韻律情報および強調部情報に従って音韻強調処理手段１３０ｈより入力された合成単位を接続して合成音声を生成し（ステップ１６００）、電気音響変換器６０を通して合成音声を出力する（ステップ１９００）。 In FIGS. 27, 28, and 29, the same operations as those in FIGS. 2, 4, and 13 are denoted by the same reference numerals, and description thereof will be omitted. Only different parts will be described. First, the auditory characteristic reading means 310 reads the auditory characteristic corresponding to the preset user and outputs it to the voice quality control means 50h. (Step 3000). 
The voice quality control means 50h determines the settings of the emphasis processing based on the auditory characteristics input from the auditory characteristic reading means 310, and outputs the settings to the phoneme emphasis processing means 130h (step 1100). The text input means 10 inputs the target text to the language processing means 20 (step 1200). Next, the language processing means 20 performs syntax analysis on the text input from the text input means 10, generates reading information, prosody information and emphasis part information, and outputs them to the speech synthesis control means 70h (step 1300). The segment database 80 outputs synthesis units to the phoneme emphasis processing means 130h according to the reading information input from the speech synthesis control means 70h (step 1400). The phoneme emphasis processing means 130h performs emphasis processing on each synthesis unit according to the emphasis part information input from the speech synthesis control means 70h and the control signal input from the voice quality control means 50h (step 8500).
The emphasis processing of step 8500 proceeds as follows. The phoneme emphasis processing means 130h determines whether the emphasis part information input from the speech synthesis control means 70h is true or false (step 1510). If it is true, the duration of the vowel steady part in the synthesis unit is extended by 20% (step 1520) and the process proceeds to step 1530; if it is false, the process proceeds directly to step 1530. In step 1530, it is determined whether the formant emphasis information input from the voice quality control means 50h is true or false. If it is true, the switch 820 is connected, by the control signal output from the speech synthesis control means 70h, to the formant emphasis filter 810 corresponding to the synthesis unit output from the segment database 80 (step 8510). As shown in FIG. 28, the filter bank preset for each phoneme and connected in step 8510 selectively passes the bands that contain formants, so that the contrast between bands containing formants and bands not containing them is emphasized as shown in FIG. 7(c) (step 8540), and the process proceeds to step 1550; if the formant emphasis information is false in step 1530, the process proceeds directly to step 1550. In step 1550, it is determined whether the consonant emphasis information input from the voice quality control means 50h is true or false. If it is true, the amplitude of the consonant in the synthesis unit and of the transition from the consonant to the vowel is amplified (step 1560) and the process proceeds to step 1570; if it is false in step 1550, the process proceeds directly to step 1570. In step 1570, it is determined whether the band emphasis information input from the voice quality control means 50h is true or false. If it is true, high-band emphasis processing that emphasizes the band of 2 kHz and above is applied to the consonants in the synthesis unit (step 1580), and the phoneme emphasis processing means 130h outputs the synthesis unit to the segment connection means 90h (step 1590). If the band emphasis information is false in step 1570, the phoneme emphasis processing means 130h outputs the synthesis unit to the segment connection means 90h without high-band emphasis (step 1590).
The segment connection means 90h connects the synthesis units input from the phoneme emphasis processing means 130h according to the prosody information and the emphasis part information input from the speech synthesis control means 70h to generate synthesized speech (step 1600), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
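
As a non-limiting illustration, the order of the decisions in steps 1510 to 1580 can be sketched as follows. The class `EmphasisSettings`, the function `plan_emphasis` and the operation labels are hypothetical names introduced only for this sketch; the signal processing itself (vowel lengthening, filter bank, amplification) is not implemented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmphasisSettings:
    """Flags set by the voice quality control means (hypothetical container)."""
    formant: bool
    consonant: bool
    band: bool


def plan_emphasis(in_emphasis_part: bool, settings: EmphasisSettings) -> List[str]:
    """Return the emphasis operations to apply to one synthesis unit, in order."""
    operations = []
    if in_emphasis_part:        # steps 1510 -> 1520
        operations.append("extend_vowel_steady_part_by_20_percent")
    if settings.formant:        # steps 1530 -> 8510, 8540
        operations.append("formant_filter_bank")
    if settings.consonant:      # steps 1550 -> 1560
        operations.append("amplify_consonant_and_cv_transition")
    if settings.band:           # steps 1570 -> 1580
        operations.append("emphasize_consonant_band_above_2khz")
    return operations
```

Each flag is tested independently, so any combination of the four emphasis operations may be applied to the same synthesis unit.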

(Example 9)
Hereinafter, a ninth embodiment of the present invention will be described with reference to the drawings.

FIG. 30 is a block diagram showing the configuration of a ninth embodiment of the speech synthesizer of the present invention. FIG. 31 is a flowchart for explaining the operation of the ninth embodiment, and FIG. 32 shows a part of that flowchart. In FIG. 30, components identical to those in FIG. 42 are denoted by the same reference numerals and their description is omitted; only the differences are described. The configuration is the same as that of FIG. 42 except that the voice quality control means 50m of FIG. 42 is replaced by voice quality control means 50i and the operation means 40m is replaced by a microphone 110.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 30, 31 and 32. In FIG. 31, operations identical to those in FIG. 2 are denoted by the same reference numerals and their description is omitted; only the differences are described.

The voice quality control means 50i sets the fundamental frequency of the synthesized speech (step 9100). First, the microphone 110 outputs an environmental sound signal to the voice quality control means 50i (step 9110). The voice quality control means 50i compares the level of the environmental sound input from the microphone 110 with 30 dB(A) (step 9120). If the environmental sound level is 30 dB(A) or higher in step 9120, the fundamental frequency is set 20% higher than a predetermined standard value (step 9130), and the text input means 10 then inputs the target text to the language processing means 20 (step 1200). If the environmental sound level is below 30 dB(A) in step 9120, the text input means 10 inputs the target text to the language processing means 20 directly (step 1200). Next, the language processing means 20 performs syntax analysis on the text input from the text input means 10, generates reading information and prosody information, and outputs them to the speech synthesis control means 70m (step 1300). The segment database 80 outputs synthesis units to the segment connection means 90m according to the reading information input from the speech synthesis control means 70m (step 9400). The segment connection means 90m connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70m and the control signal input from the voice quality control means 50i to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
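
By way of illustration only, the threshold decision of steps 9120 and 9130 can be sketched as below. The function name `f0_for_noise` and its parameters are assumptions made for this sketch; the environmental sound level is assumed to be already available as an A-weighted value in dB, and how that level is measured is outside the sketch.

```python
def f0_for_noise(noise_level_dba: float,
                 standard_f0_hz: float,
                 threshold_dba: float = 30.0,
                 raise_ratio: float = 0.20) -> float:
    """Return the fundamental frequency target for the synthesized speech."""
    if noise_level_dba >= threshold_dba:      # step 9120 -> step 9130
        return standard_f0_hz * (1.0 + raise_ratio)
    return standard_f0_hz                     # below threshold: standard value kept
```

For example, with a standard value of 120 Hz, a measured level of 45 dB(A) gives a target of about 144 Hz, while 25 dB(A) leaves the target at 120 Hz.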

(Example 10)
Hereinafter, a tenth embodiment of the present invention will be described with reference to the drawings.

FIG. 33 is a block diagram showing the configuration of a tenth embodiment of the speech synthesizer of the present invention. FIG. 34 is a flowchart for explaining the operation of the tenth embodiment, and FIG. 35 shows a part of that flowchart. In FIG. 33, components identical to those in FIG. 30 are denoted by the same reference numerals and their description is omitted; only the differences are described. The configuration is the same as that of FIG. 30 except that the voice quality control means 50i of FIG. 30 is replaced by voice quality control means 50j and the microphone 110 is replaced by auditory characteristic measuring means 120.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 33, 34 and 35. In FIGS. 34 and 35, operations identical to those in FIGS. 31 and 32 are denoted by the same reference numerals and their description is omitted; only the differences are described.

The auditory characteristic measuring means 120 measures the user's auditory characteristics (step 10000). The measurement method is, for example, the same as in the first embodiment. The voice quality control means 50j sets the fundamental frequency of the synthesized speech according to the user's auditory characteristics and preferences input from the auditory characteristic measuring means 120 (step 10100). The auditory characteristic measuring means 120 outputs the measurement result to the voice quality control means 50j (step 10110). The voice quality control means 50j compares the user's average hearing level below 2 kHz with the average hearing level at 2 kHz and above (step 10120). If, in step 10120, the value obtained by subtracting the average hearing level below 2 kHz from the average hearing level at 2 kHz and above is 30 dB or more, the fundamental frequency of the synthesized speech is set 20% higher than a predetermined standard value (step 9130), and the text input means 10 then inputs the target text to the language processing means 20 (step 1200). If that value is less than 30 dB in step 10120, the text input means 10 inputs the target text to the language processing means 20 directly (step 1200). Next, the language processing means 20 performs syntax analysis on the text input from the text input means 10, generates reading information and prosody information, and outputs them to the speech synthesis control means 70m (step 1300). The segment database 80 outputs synthesis units to the segment connection means 90m according to the reading information input from the speech synthesis control means 70m (step 9400). The segment connection means 90m connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70m and the control signal input from the voice quality control means 50j to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
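
The comparison of step 10120 may be sketched, under stated assumptions, as follows. The audiogram is assumed to be given as a mapping from test frequency in Hz to hearing level in dB; the function name `f0_scale_from_audiogram`, the mapping format and the fallback for an incomplete audiogram are hypothetical and are not taken from the embodiment.

```python
from statistics import mean
from typing import Dict


def f0_scale_from_audiogram(audiogram_db: Dict[int, float],
                            split_hz: int = 2000,
                            gap_db: float = 30.0) -> float:
    """Return the multiplier applied to the standard fundamental frequency."""
    low = [level for freq, level in audiogram_db.items() if freq < split_hz]
    high = [level for freq, level in audiogram_db.items() if freq >= split_hz]
    if not low or not high:
        # Assumption of this sketch: without both bands, keep the standard value.
        return 1.0
    if mean(high) - mean(low) >= gap_db:      # step 10120 -> step 9130
        return 1.2
    return 1.0
```

A sloping audiogram such as 20, 20 and 30 dB at 250, 500 and 1000 Hz and 55, 60 and 65 dB at 2, 4 and 8 kHz gives a difference of roughly 37 dB between the two band averages and therefore selects the raised fundamental frequency.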

(Example 11)
Hereinafter, an eleventh embodiment of the present invention will be described with reference to the drawings.

FIG. 36 is a block diagram showing the configuration of an eleventh embodiment of the speech synthesizer of the present invention. FIG. 37 is a flowchart for explaining the operation of the eleventh embodiment, and FIG. 38 shows a part of that flowchart. In FIG. 36, components identical to those in FIG. 33 are denoted by the same reference numerals and their description is omitted; only the differences are described. The configuration is the same as that of FIG. 33 except that the voice quality control means 50j of FIG. 33 is replaced by voice quality control means 50k and the auditory characteristic measuring means 120 is replaced by auditory characteristic storage means 220.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 36, 37 and 38. In FIG. 37, operations identical to those in FIG. 34 are denoted by the same reference numerals and their description is omitted; only the differences are described.

The voice quality control means 50k sets the fundamental frequency of the synthesized speech (step 11100). First, the auditory characteristic storage means 220 outputs the user's previously measured auditory characteristics to the voice quality control means 50k (step 11110). The voice quality control means 50k compares the user's average hearing level with 40 dBHL (step 11120). If the user's average hearing level is 40 dBHL or higher in step 11120, the speech rate of the synthesized speech is set 10% slower than a predetermined standard value (step 11130), and the text input means 10 then inputs the target text to the language processing means 20 (step 1200). If the user's average hearing level is below 40 dBHL in step 11120, the text input means 10 inputs the target text to the language processing means 20 directly (step 1200). Next, the language processing means 20 performs syntax analysis on the text input from the text input means 10, generates reading information and prosody information, and outputs them to the speech synthesis control means 70m (step 1300). The segment database 80 outputs synthesis units to the segment connection means 90m according to the reading information input from the speech synthesis control means 70m (step 9400). The segment connection means 90m connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70m and the control signal input from the voice quality control means 50k to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
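
A minimal sketch of the speech rate decision of steps 11120 and 11130 is given below. It assumes that "10% slower" means multiplying the standard speech rate by 0.9, which lengthens each duration by a factor of 1/0.9; this interpretation and the names `speech_rate_scale` and `scaled_duration_ms` are assumptions of the sketch and are not stated in the embodiment.

```python
def speech_rate_scale(average_hearing_level_dbhl: float,
                      threshold_dbhl: float = 40.0,
                      slowdown: float = 0.10) -> float:
    """Return the multiplier applied to the standard speech rate."""
    if average_hearing_level_dbhl >= threshold_dbhl:   # step 11120 -> step 11130
        return 1.0 - slowdown
    return 1.0


def scaled_duration_ms(duration_ms: float, rate_scale: float) -> float:
    """A lower speech rate stretches durations by the inverse of the rate multiplier."""
    return duration_ms / rate_scale
```

Under this interpretation a 90 ms segment becomes about 100 ms when the rate multiplier is 0.9.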

(Example 12)
Hereinafter, a twelfth embodiment of the present invention will be described with reference to the drawings.

FIG. 39 is a block diagram showing the configuration of a twelfth embodiment of the speech synthesizer of the present invention. FIG. 40 is a flowchart for explaining the operation of the twelfth embodiment, and FIG. 41 shows a part of that flowchart. In FIG. 39, components identical to those in FIG. 36 are denoted by the same reference numerals and their description is omitted; only the differences are described. The configuration is the same as that of FIG. 36 except that the voice quality control means 50k of FIG. 36 is replaced by voice quality control means 50l, the auditory characteristic storage means 220 is replaced by auditory characteristic reading means 310, and auditory characteristics 320a to 320n are added.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 39, 40 and 41. In FIGS. 40 and 41, operations identical to those in FIGS. 37 and 38 are denoted by the same reference numerals and their description is omitted; only the differences are described.

The voice quality control means 50l sets the fundamental frequency of the synthesized speech (step 12100). First, the auditory characteristic reading means 310 reads the user's auditory characteristic 320 that has been set in advance and outputs it to the voice quality control means 50l (step 12110). The voice quality control means 50l compares the user's average hearing level with 40 dBHL (step 11120). If the user's average hearing level is 40 dBHL or higher in step 11120, the speech rate of the synthesized speech is set 10% slower than a predetermined standard value (step 11130), and the text input means 10 then inputs the target text to the language processing means 20 (step 1200). If the user's average hearing level is below 40 dBHL in step 11120, the text input means 10 inputs the target text to the language processing means 20 directly (step 1200). Next, the language processing means 20 performs syntax analysis on the text input from the text input means 10, generates reading information and prosody information, and outputs them to the speech synthesis control means 70m (step 1300). The segment database 80 outputs synthesis units to the segment connection means 90m according to the reading information input from the speech synthesis control means 70m (step 9400). The segment connection means 90m connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70m and the control signal input from the voice quality control means 50l to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).

(Example 13)
Hereinafter, a thirteenth embodiment of the present invention will be described with reference to the drawings.

FIG. 42 is a block diagram showing the configuration of a thirteenth embodiment of the speech synthesizer of the present invention. FIG. 43 is a flowchart for explaining the operation of the thirteenth embodiment. In FIG. 42, components identical to those in FIG. 30 are denoted by the same reference numerals and their description is omitted; only the differences are described. The configuration is the same as that of FIG. 30 except that the language processing means 20 of FIG. 30 is replaced by a language processing unit 900, the speech synthesis unit 30m is replaced by a speech synthesis unit 30n, the voice quality control means 50i is deleted, and the microphone 110 is connected to the speech synthesis control means 70n. The language processing unit 900 has syntax analysis means 910 and speech synthesis start position determining means 920. The speech synthesis unit 30n has speech synthesis control means 70n, a segment database 80, and segment connection means 90n.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 42 and 43. In FIG. 43, operations identical to those in FIG. 31 are denoted by the same reference numerals and their description is omitted; only the differences are described.

First, the text input means 10 inputs the target text to the syntax analysis means 910 (step 13100). Next, the syntax analysis means 910 performs syntax analysis on the text input from the text input means 10, generates syntax information and outputs it to the speech synthesis start position determining means 920, and generates reading information and prosody information and outputs them to the speech synthesis control means 70n (step 13200). The speech synthesis start position determining means 920 determines speech synthesis start positions according to the syntax information input from the syntax analysis means 910, and outputs start position information to the speech synthesis control means 70n (step 13300). The speech synthesis control means 70n takes in the environmental sound signal from the microphone 110 and compares the 100 ms average level of the environmental sound with 70 dB(A) (step 13400). If the average level of the environmental sound is below 70 dB(A) in step 13400, the segment database 80 outputs synthesis units to the segment connection means 90n according to the reading information input from the speech synthesis control means 70n (step 9400). If the average level of the environmental sound is 70 dB(A) or higher in step 13400, the speech synthesis control means 70n outputs a speech synthesis stop signal to the segment connection means 90n and stops the generation of synthesized speech (step 13500). The speech synthesis control means 70n then compares the average level of the environmental sound with 70 dB(A) (step 13600), and repeats step 13600 as long as the average level is 70 dB(A) or higher. Only when the average level of the environmental sound is below 70 dB(A) in step 13600 is speech synthesis restarted, according to the start position information input from the speech synthesis start position determining means 920, from the speech synthesis start position that precedes the stop position in the text and is closest to it (step 13700); the segment database 80 then outputs synthesis units to the segment connection means 90n according to the reading information input from the speech synthesis control means 70n (step 9400). The segment connection means 90n connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70n to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
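
The stop and restart behavior of steps 13400 to 13700 can be illustrated by the following simplified simulation. It is a sketch under several assumptions: positions are indices into a list of synthesis units, one environmental sound reading (already averaged over 100 ms, in dB(A)) is consumed per check, "before the stop position" is taken as strictly before, and the text beginning is used when no earlier start position exists. The names `resume_position` and `speak_with_pauses` are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List, Sequence


def resume_position(start_positions: Sequence[int], stop_position: int) -> int:
    """Closest start position strictly before stop_position (step 13700).

    start_positions must be sorted in ascending order.
    """
    i = bisect_left(start_positions, stop_position)
    return start_positions[i - 1] if i > 0 else 0


def speak_with_pauses(units: Sequence[str],
                      start_positions: Sequence[int],
                      noise_levels_dba: Iterable[float],
                      threshold_dba: float = 70.0) -> List[str]:
    """Return the units in the order they would be output."""
    levels = iter(noise_levels_dba)
    spoken = []
    position = 0
    while position < len(units):
        level = next(levels, 0.0)             # step 13400 (quiet once readings run out)
        if level >= threshold_dba:            # step 13500: stop synthesis
            while level >= threshold_dba:     # step 13600: wait for the noise to fall
                level = next(levels, 0.0)
            position = resume_position(start_positions, position)  # step 13700
        spoken.append(units[position])        # steps 9400, 9500, 1900
        position += 1
    return spoken
```

With units `a` to `f`, start positions 0 and 3, and a noise burst arriving after `d` has been output, the simulated output is `a b c d d e f`: synthesis backs up to the start position at index 3 and repeats `d` before continuing.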

(Example 14)
Hereinafter, a fourteenth embodiment of the present invention will be described with reference to the drawings.

FIG. 44 is a block diagram showing the configuration of a fourteenth embodiment of the speech synthesizer of the present invention. FIG. 45 is a flowchart for explaining the operation of the fourteenth embodiment. In FIG. 44, components identical to those in FIG. 42 are denoted by the same reference numerals and their description is omitted; only the differences are described. The configuration is the same as that of FIG. 42 except that the speech synthesis unit 30n of FIG. 42 is replaced by a speech synthesis unit 30o and the microphone 110 is replaced by operation means 40o. The speech synthesis unit 30o has speech synthesis control means 70o, a segment database 80, and segment connection means 90n.

The operation of the speech synthesizer of this embodiment configured as described above will be described below with reference to FIGS. 44 and 45. In FIG. 45, operations identical to those in FIG. 43 are denoted by the same reference numerals and their description is omitted; only the differences are described.

First, the text input means 10 inputs the target text to the syntax analysis means 910 (step 13100). Next, the syntax analysis means 910 performs syntax analysis on the text input from the text input means 10, generates syntax information and outputs it to the speech synthesis start position determining means 920, and generates reading information and prosody information and outputs them to the speech synthesis control means 70o (step 13200). The speech synthesis start position determining means 920 determines speech synthesis start positions according to the syntax information input from the syntax analysis means 910, and outputs start position information to the speech synthesis control means 70o (step 13300). The speech synthesis control means 70o takes in the operation signal from the operation means 40o and determines whether or not the user has input a speech synthesis stop signal (step 14400). If no speech synthesis stop signal has been input in step 14400, the segment database 80 outputs synthesis units to the segment connection means 90n according to the reading information input from the speech synthesis control means 70o (step 9400). If a speech synthesis stop signal has been input in step 14400, the speech synthesis control means 70o outputs the speech synthesis stop signal to the segment connection means 90n and stops the generation of synthesized speech (step 13500). The speech synthesis control means 70o then takes in the operation signal from the operation means 40o and determines whether or not the user has input a speech synthesis restart signal (step 14600), and repeats step 14600 as long as no restart signal has been input. Only when a speech synthesis restart signal has been input in step 14600 is speech synthesis restarted, according to the start position information input from the speech synthesis start position determining means 920, from the speech synthesis start position that precedes the stop position in the text and is closest to it (step 13700); the segment database 80 then outputs synthesis units to the segment connection means 90n according to the reading information input from the speech synthesis control means 70o (step 9400). The segment connection means 90n connects the synthesis units input from the segment database 80 according to the prosody information input from the speech synthesis control means 70o to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).

(Example 15)
Hereinafter, a fifteenth embodiment of the present invention will be described with reference to the drawings.

FIG. 46 is a block diagram showing the configuration of a fifteenth embodiment of the speech synthesizer of the present invention. FIG. 47 is a flowchart for explaining the operation of the fifteenth embodiment. In FIG. 46, components identical to those in FIG. 42 are denoted by the same reference numerals and their description is omitted; only the differences are described. The language processing unit 900n of FIG. 42 is replaced by a language processing unit 900p, in which emphasized word determining means is added that receives the syntax analysis result from the syntax analysis means 910 and determines the words to be emphasized. The speech synthesis unit 30n of FIG. 42 is replaced by a speech synthesis unit 30p, in which two elements are added: timing means 940 connected to the speech synthesis control means 70p, and phoneme emphasis processing means 130p that receives the segment output of the segment database, applies emphasis processing to the segments based on a control signal from the speech synthesis control means, and outputs the result to the segment connection means. In addition, the speech synthesis control means receives emphasized word information from the emphasized word determining means 930. Except for these changes, the configuration is the same as that of FIG. 42.

The operation of the speech synthesizer of this embodiment, configured as described above, will now be described with reference to FIGS. 46 and 47. In FIG. 47, operations identical to those in FIG. 43 are denoted by the same reference numerals and their description is omitted; only the differences are described.

First, the text input means 10 inputs the target text to the syntax analysis means 910 (step 13100). The syntax analysis means 910 analyzes the syntax of the text input from the text input means 10, generates syntax information and outputs it to the speech synthesis start position determination means 920 and the emphasized word determination means 930, and generates reading information and prosody information and outputs them to the speech synthesis control means 70p (step 13200). The speech synthesis start position determination means 920 determines speech synthesis start positions according to the syntax information input from the syntax analysis means 910 and outputs start position information to the speech synthesis control means 70p (step 13300a). At the same time, the emphasized word determination means 930 determines the words to be emphasized according to the syntax information input from the syntax analysis means 910 and outputs emphasized word information to the speech synthesis control means 70p (step 13300b).

The speech synthesis control means 70p takes in the environmental sound signal from the microphone 110 and compares the average level of the environmental sound over 100 ms with 70 dB(A) (step 13400). If the average level of the environmental sound is less than 70 dB(A) in step 13400, the segment database 80 outputs synthesis units to the phoneme emphasis processing means 130p according to the reading information input from the speech synthesis control means 70p, and the phoneme emphasis processing means 130p passes the synthesis units to the segment connection means 90n without performing emphasis processing (step 9400a). If the average level of the environmental sound is 70 dB(A) or more in step 13400, the speech synthesis control means 70n outputs a speech synthesis stop signal to the segment connection means 90n and stops generation of the synthesized speech (step 13500). It then sends a measurement start signal to the timer means 940 to start time measurement (step 14100). The speech synthesis control means 70p compares the average level of the environmental sound with 70 dB(A) (step 13600) and repeats step 13600 while the average level is 70 dB(A) or more. When the average level of the environmental sound is less than 70 dB(A) in step 13600, it sends a measurement end signal to the timer means 940 to end time measurement and takes in the elapsed time (step 14200).

The speech synthesis control means 70p sets the rank of the speech synthesis start position to 1 if the elapsed time is from 0 to less than 1 second, to 2 if the elapsed time is 1 second or more and less than 2 seconds, to 3 if the elapsed time is 2 seconds or more and less than 3 seconds, and to 4 if the elapsed time is 3 seconds or more (step 14300). The speech synthesis control means resumes speech synthesis from the speech synthesis start position that precedes the position at which synthesis was stopped, is closest to that stop position, and has a rank equal to or greater than the rank value determined in step 14300. If no speech synthesis start position with a rank equal to or greater than the rank value determined in step 14300 is found, speech synthesis resumes from the beginning of the sentence (step 14400). The repetition count of the start position from which speech synthesis resumes is then incremented by one (step 14500).

The segment database 80 outputs synthesis units to the phoneme emphasis processing means 130p according to the reading information input from the speech synthesis control means 70p (step 14600). The speech synthesis control means 70p determines whether the repetition count of the start position from which speech synthesis resumes is 2 or more (step 14700). If the repetition count is 2 or more in step 14700, it outputs an emphasis control signal to the phoneme emphasis processing means 130p for the section from the start position to the stop position, and the phoneme emphasis processing means 130p performs emphasis processing on each synthesis unit (step 14800). If the repetition count is less than 2 in step 14700, the speech synthesis control means 70p does not output the emphasis control signal, and the phoneme emphasis processing means 130p does not perform emphasis processing on the segments. The segment connection means 90p connects the synthesis units input from the segment emphasis processing means according to the prosody information input from the speech synthesis control means 70p to generate synthesized speech (step 9500), and outputs the synthesized speech through the electroacoustic transducer 60 (step 1900).
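By way of non-limiting illustration, the rank selection of step 14300, the restart position search of step 14400, and the repetition test of step 14700 might be sketched as follows. The names `StartPosition`, `rank_from_elapsed`, `choose_restart`, and `should_emphasize` are hypothetical, as is the representation of text positions as integer indices with the beginning of the sentence at index 0; "before the stop position" is assumed here to mean strictly before.

```python
from dataclasses import dataclass


@dataclass
class StartPosition:
    index: int  # position in the text (hypothetical integer index)
    rank: int   # rank assigned to this speech synthesis start position


def rank_from_elapsed(seconds: float) -> int:
    """Step 14300: map the length of the interruption to a required rank."""
    if seconds < 1.0:
        return 1
    if seconds < 2.0:
        return 2
    if seconds < 3.0:
        return 3
    return 4


def choose_restart(positions, stop_index: int, elapsed_seconds: float) -> int:
    """Step 14400: pick the start position closest to, and before, the stop
    position whose rank is at least the required rank. If none exists,
    fall back to the beginning of the sentence (index 0)."""
    required = rank_from_elapsed(elapsed_seconds)
    candidates = [
        p for p in positions
        if p.index < stop_index and p.rank >= required
    ]
    if not candidates:
        return 0
    return max(candidates, key=lambda p: p.index).index


def should_emphasize(repeat_count: int) -> bool:
    """Step 14700: emphasize only when the same start position has been
    used two or more times."""
    return repeat_count >= 2
```

Under these assumptions, a longer interruption demands a higher rank and therefore tends to move the restart point further back in the text, toward a boundary with a longer pause.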

(Example 16)
Hereinafter, a sixteenth embodiment of the present invention will be described with reference to the drawings.

FIG. 48 shows the configuration of the language processing unit of a speech synthesizer according to one embodiment of the present invention. The syntax analysis unit 101 performs morphological analysis and syntax analysis on the input sentence and outputs a syntax analysis result containing the word sequence, the phrase sequence, and the dependency structure between phrases that make up the input sentence. The speech synthesis start position rule holding unit 103 holds rules that describe conditions on the phrases before and after a speech synthesis start position to be set by the speech synthesis start position determination unit 102, and on the dependency structure between phrases. FIG. 49 shows an example of the speech synthesis start position rules held by the speech synthesis start position rule holding unit 103. The speech synthesis start position determination unit 102 sets speech synthesis start positions between phrases in the phrase sequence of the syntax analysis result. In FIG. 49, the preceding phrase pattern specifies the condition on the phrase located immediately before the speech synthesis start position. Likewise, the following phrase pattern specifies the condition on the phrase located immediately after the speech synthesis start position. Expressed in BNF notation, the format of each phrase pattern is:

  <phrase pattern> := * | (<phrase name> <morpheme sequence>)
  <phrase name> := noun phrase | predicate phrase | adverb phrase | ...
  <morpheme sequence> := * | (<morpheme>) | (<morpheme> <morpheme sequence>)
  <morpheme> := * | + | (<part of speech> <notation>)
  <part of speech> := noun | particle | comma | ...
  <notation> := * | は | から | 、 | ...

Here "*" denotes any phrase, any morpheme sequence, any morpheme, or any notation, and "+" denotes any run of morphemes. The rank is a value assigned to the corresponding speech synthesis start position, and the control unit 106 selects a speech synthesis start position based on this value. In this embodiment, the longer the pause inserted at the speech synthesis start position when the input text is synthesized, the larger the rank value. The first speech synthesis start position rule in FIG. 49 means that a speech synthesis start position of rank 3 is set between a noun phrase ending with the particle は (wa) and any phrase.

The speech synthesis start position determination unit 102 matches the speech synthesis start position rules held in the speech synthesis start position rule holding unit 103 against the phrase sequence contained in the syntax analysis result output by the syntax analysis unit 101, and sets a speech synthesis start position and a rank wherever matching succeeds. FIG. 50 illustrates the processing of the speech synthesis start position determination unit. The input text is processed by the syntax analysis unit 101, which generates a phrase sequence such as that shown in FIG. 50. The speech synthesis start position determination unit 102 matches the speech synthesis start position rules in order against each pair of adjacent phrases, starting from the beginning of the phrase sequence, and sets, between each pair that matches, a speech synthesis start position having the rank written in the rule. In the example of FIG. 50, the second rule of FIG. 49 matches the first pair of phrases, the third rule of FIG. 49 matches the second pair, and the fourth rule of FIG. 49 matches the third pair, so that speech synthesis start positions with the ranks shown at the bottom of FIG. 50 are set. No speech synthesis start position is set between two phrases that match none of the speech synthesis start position rules.
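By way of non-limiting illustration, the matching of phrase patterns against adjacent phrase pairs might be sketched as follows. The data representation is hypothetical: a phrase is a pair `(phrase name, list of (part of speech, notation))`, and a rule is a triple `(preceding pattern, following pattern, rank)`. Two points are assumptions not stated in the text: "+" is taken to match zero or more morphemes, and the first rule that matches a pair is taken to win.

```python
ANY = "*"  # any phrase, any morpheme sequence, any morpheme, or any notation
RUN = "+"  # any run of morphemes (assumed: zero or more)


def match_morphemes(pattern, morphemes):
    """Match a list of morpheme patterns against (part of speech, notation) pairs."""
    if not pattern:
        return not morphemes
    head, rest = pattern[0], pattern[1:]
    if head == RUN:
        # Try letting "+" absorb 0, 1, 2, ... morphemes.
        return any(
            match_morphemes(rest, morphemes[i:])
            for i in range(len(morphemes) + 1)
        )
    if not morphemes:
        return False
    if head != ANY:
        pos, notation = head
        m_pos, m_notation = morphemes[0]
        if pos != m_pos or (notation != ANY and notation != m_notation):
            return False
    return match_morphemes(rest, morphemes[1:])


def match_phrase(pattern, phrase):
    """Match one phrase pattern against one phrase."""
    if pattern == ANY:
        return True
    name, morpheme_pattern = pattern
    phrase_name, morphemes = phrase
    if name != phrase_name:
        return False
    return morpheme_pattern == ANY or match_morphemes(morpheme_pattern, morphemes)


def set_start_positions(phrases, rules):
    """Return {i: rank}, where boundary i lies between phrases[i] and phrases[i + 1]."""
    positions = {}
    for i in range(len(phrases) - 1):
        for before, after, rank in rules:
            if match_phrase(before, phrases[i]) and match_phrase(after, phrases[i + 1]):
                positions[i] = rank
                break  # assumed: the first matching rule is applied
    return positions
```

With a rule such as `(("noun phrase", [RUN, ("particle", "は")]), ANY, 3)`, a noun phrase ending in は followed by any phrase yields a rank-3 start position at that boundary, in the manner of the first rule described for FIG. 49.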

The emphasized word determination unit 104 matches the emphasized word rules held in the emphasized word rule holding unit 105 against the word sequence contained in the syntax analysis result output by the syntax analysis unit 101, and determines the words to be pronounced with emphasis. FIG. 51 shows an example of the emphasized word rules held by the emphasized word rule holding unit 105. In FIG. 51, the emphasized word condition describes the condition on a word to be emphasized. Expressed in BNF notation, the format of an emphasized word condition is:

  <emphasized word condition> := (<part of speech> <notation>)
  <part of speech> := noun | verb | adjective | ...
  <notation> := * | ある | ない | ...

Here "*" is a symbol denoting any notation. Emphasis information is assigned to each word that satisfies an emphasized word condition according to the emphasis ON/OFF entry in the right-hand column. FIG. 52 illustrates the processing of the emphasized word determination unit. In FIG. 52, the syntax analysis unit 101 processes the input text and generates a word sequence. The emphasized word determination unit 104 matches the words against the emphasized word rules in order from the beginning of the word sequence and, where matching succeeds, attaches emphasis ON/OFF information. For the word (adjective ない) in FIG. 52, both the emphasized word condition (adjective *) and the emphasized word condition (adjective ない) match, but (adjective ない) is the more specific condition because its notation is specified, so it takes priority in matching. Matching yields the emphasized word information shown at the bottom of FIG. 52. Emphasis is OFF for words that match none of the emphasized word rules.
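By way of non-limiting illustration, the priority of a condition with a specified notation over a wildcard condition might be sketched as follows. The function name `decide_emphasis` and the rule table are hypothetical; the ON/OFF values shown in the usage note are invented for the example and are not taken from FIG. 51.

```python
ANY = "*"  # any notation


def decide_emphasis(words, rules):
    """Assign emphasis ON (True) or OFF (False) to each word.

    words: list of (part of speech, notation) pairs.
    rules: dict mapping (part of speech, notation or ANY) to True/False.
    A condition whose notation is specified is more specific than a
    wildcard condition and is therefore checked first.
    """
    result = []
    for pos, notation in words:
        if (pos, notation) in rules:
            result.append(rules[(pos, notation)])
        elif (pos, ANY) in rules:
            result.append(rules[(pos, ANY)])
        else:
            result.append(False)  # no rule matched: emphasis is OFF
    return result
```

For example, with the hypothetical rules `{("adjective", ANY): False, ("adjective", "ない"): True}`, the word `("adjective", "ない")` receives emphasis ON because the specific condition overrides the wildcard one, while any other adjective receives OFF.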

(Example 17)
Hereinafter, a seventeenth embodiment of the present invention will be described with reference to the drawings, taking as an example emphasis processing performed at the time of speech segment creation.

FIG. 53 is a flowchart showing the speech segment creation operation according to the seventeenth embodiment, and FIG. 54 is a schematic diagram of the input/output characteristics of the amplitude compression processing.

First, the first waveform is cut out from the target speech waveform (step 15000). The cut-out waveform data is then multiplied by a preset gain value G (step 15010), and the maximum absolute value of the result is obtained and stored in Amax (step 15020). If Amax is larger than the preset value Alim (step 15030), the cut-out waveform is multiplied by (Alim/Amax) (step 15040). If Amax is smaller than or equal to Alim, nothing is done. If the waveform just cut out is the last waveform (step 15050), processing ends. Otherwise, the next waveform is cut out (step 15060) and processing repeats from step 15010.
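By way of non-limiting illustration, steps 15010 to 15040 applied to one cut-out waveform, and the loop over all cut-out waveforms, might be sketched as follows. The function names are hypothetical, the waveforms are assumed to be plain lists of samples, and the factor (Alim/Amax) is assumed to be applied to the waveform after it has been multiplied by G.

```python
def compress_segment(waveform, gain, a_lim):
    """Apply gain G, then rescale so that the peak does not exceed Alim."""
    scaled = [gain * x for x in waveform]                    # step 15010
    a_max = max((abs(x) for x in scaled), default=0.0)      # step 15020
    if a_max > a_lim:                                        # step 15030
        factor = a_lim / a_max
        scaled = [factor * x for x in scaled]                # step 15040
    return scaled                                            # unchanged if Amax <= Alim


def compress_all(waveforms, gain, a_lim):
    """Steps 15000 to 15060: process every cut-out waveform in turn."""
    return [compress_segment(w, gain, a_lim) for w in waveforms]
```

Because each cut-out waveform is scaled as a whole by a single factor, the waveform shape within the segment is preserved and no attack or release time constant is involved.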

In this way, ideal amplitude compression is possible without the time-constant problems that arise when a limiter is applied to a speech waveform. Viewed as the input/output characteristic of a limiter, the amplitude compression processing shown in FIG. 53 can be represented as in FIG. 54(a). Because this curve can be chosen freely, for example as in FIG. 54(b) or FIG. 54(c), a variety of amplitude compression processes are possible. Amplitude compression per phoneme is also possible, for example by selecting the curve according to the type of the target speech segment (unvoiced consonant, voiced consonant, and so on). Furthermore, by labeling the start point, end point, and so on of the consonant part in advance, the consonant part and the vowel part can be compressed with different curves.
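By way of non-limiting illustration, selecting an input/output curve per segment type might be sketched as follows. Everything here is an assumption made for the example: the curve shapes `hard_limit` and `soft_knee`, their parameter values, and the mapping `CURVE_BY_SEGMENT_TYPE` are hypothetical and are not the curves of FIG. 54.

```python
def hard_limit(peak, a_lim=1.0):
    """Peaks above a_lim are brought down to a_lim (a limiter-like curve)."""
    return min(peak, a_lim)


def soft_knee(peak, knee=0.5, ratio=4.0):
    """Above the knee, the output peak grows only 1/ratio as fast as the input."""
    if peak <= knee:
        return peak
    return knee + (peak - knee) / ratio


# Hypothetical assignment of curves to segment types.
CURVE_BY_SEGMENT_TYPE = {
    "unvoiced_consonant": soft_knee,
    "voiced_consonant": soft_knee,
    "vowel": hard_limit,
}


def compress_with_curve(waveform, curve):
    """Scale the whole cut-out waveform so that its peak follows the curve."""
    peak = max((abs(x) for x in waveform), default=0.0)
    if peak == 0.0:
        return list(waveform)
    factor = curve(peak) / peak
    return [factor * x for x in waveform]
```

A caller would look up the curve from the label of the segment, for example `compress_with_curve(samples, CURVE_BY_SEGMENT_TYPE["vowel"])`, and could apply different curves to separately labeled consonant and vowel parts.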

Because a variety of amplitude compression methods can be chosen as described above, this approach is effective as a phoneme emphasis method, for example for emphasizing a specific part of a specific consonant. That is, such processing at the time of speech segment creation offers a very high degree of freedom as a speech emphasis method and allows fine-grained processing. Moreover, since this processing is performed entirely as preprocessing, it has the advantage of not affecting the processing speed at synthesis time at all.

Accordingly, speech emphasis processing of any complexity can be applied. Synthesized speech suited to hearing-impaired listeners and to use under noise can therefore be provided by applying frequency-domain emphasis such as formant emphasis, by dividing the speech waveform to be cut out into multiple bands and applying amplitude compression or the like, or by applying equivalent processing at the time of cutting out. In particular, applying processing equivalent to the signal processing of a hearing aid when the waveforms are cut out enables fine-grained emphasis processing that was previously impossible because of time constants and the limits of processing unknown input.

In the seventeenth embodiment, amplitude modification and frequency-characteristic modification aimed mainly at emphasizing consonants were described as the processing applied to speech segments, but various other waveform modification processes may be performed, such as adjusting the length of the consonant part using a known duration modification technique in order to improve intelligibility.

In the fifteenth embodiment, the microphone 110 takes in the environmental sound signal, but it may instead take in the user's utterance.

In the fifteenth embodiment, emphasis processing is applied to the segments, but a segment database that has undergone emphasis processing and a segment database that has not may be switched instead, or emphasis processing may be applied to the synthesized speech after segment connection.

In the first and eighth embodiments, the emphasis processing consists of extension of the vowel part, extension of the closure, formant emphasis, consonant emphasis, and band emphasis, but other emphasis methods may be used.

In the first and eighth embodiments, the formant emphasis information is set to true when p is smaller than 15, but another value may be used.

In the first and eighth embodiments, the consonant emphasis information is set to true when the gap detection range is 10 ms or more, but another value may be used.

In the first and eighth embodiments, the band emphasis information is set to true when the difference between the average hearing level at 2 kHz and above and the average hearing level below 2 kHz is 30 dB or more, but a frequency other than 2 kHz may be used as the band boundary. The criterion for the difference in average hearing level between the bands may also be a value other than 30 dB.

In the first, second, fourth, and eighth embodiments, the steady-state part of the vowel is extended by 20% when the emphasis part information is true, but another value may be used. The duration of the consonant part may also be extended.

In the first and second embodiments, the closure is extended by 20% when the emphasis part information is true, but another value may be used.

In the first, fourth, and seventh embodiments, the environmental sound is divided into bands of 1 kHz or less, 1 kHz to 2 kHz, 2 kHz to 4 kHz, and 4 kHz or more, but other ways of dividing the bands may be used.

In the first embodiment, compression parameters are set and compression processing is performed when the environmental sound in the bands of 1 kHz or less, 1 kHz to 2 kHz, 2 kHz to 4 kHz, and 4 kHz or more is at or above 20 dBSPL/Hz, 20 dBSPL/Hz, 15 dBSPL/Hz, and 10 dBSPL/Hz, respectively, but other values may be used.

In the fourth embodiment, the formant emphasis information is set to true when the average environmental sound level at 1 kHz or less is 20 dBSPL/Hz or more, the average environmental sound level from 1 kHz to 2 kHz is 20 dBSPL/Hz, and the average environmental sound level in the other bands is 15 dBSPL/Hz or less, but other values may be used.

In the fourth embodiment, the consonant emphasis information is set to true when the average environmental sound level from 1 kHz to 2 kHz is 20 dBSPL/Hz or more, the average environmental sound level from 2 kHz to 4 kHz is 15 dBSPL/Hz, and either the average environmental sound level at 1 kHz or less is 20 dBSPL/Hz or less or the average environmental sound level at 4 kHz or more is 15 dBSPL/Hz or less, but other values may be used.

In the fourth embodiment, the band emphasis information of each band is set to true when the environmental sound in the bands of 1 kHz or less, 1 kHz to 2 kHz, 2 kHz to 4 kHz, and 4 kHz or more is at or above 20 dBSPL/Hz, 20 dBSPL/Hz, 15 dBSPL/Hz, and 10 dBSPL/Hz, respectively, but other values may be used.
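By way of non-limiting illustration, the per-band decision described for the fourth embodiment might be sketched as follows. The band labels, the dictionary layout, and the function name are hypothetical; the threshold values are those stated above and, as noted, other values may be used.

```python
# Thresholds in dBSPL/Hz for each band, as stated for the fourth embodiment.
BAND_THRESHOLDS_DB = {
    "<=1kHz": 20.0,
    "1-2kHz": 20.0,
    "2-4kHz": 15.0,
    ">=4kHz": 10.0,
}


def band_emphasis_flags(levels_db, thresholds=BAND_THRESHOLDS_DB):
    """Return, per band, whether the band emphasis information is true.

    levels_db maps each band label to the measured environmental sound
    level in dBSPL/Hz; a band is flagged when its level is at or above
    the threshold for that band.
    """
    return {
        band: levels_db[band] >= limit
        for band, limit in thresholds.items()
    }
```

Passing a different `thresholds` mapping changes both the band division and the criteria without altering the decision logic.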

In the seventh embodiment, the compression parameters are set as in step 7500, but other criteria and methods may be used.

In the ninth embodiment, the fundamental frequency is raised by 20% when the average level of the environmental sound is 30 dB(A) or more, but another reference value may be used. The amount of change in the fundamental frequency may also be another value.

In the tenth embodiment, the fundamental frequency is lowered by 20% when the difference between the average hearing level at 2 kHz and above and the average hearing level below 2 kHz is 30 dB or more, but a frequency other than 2 kHz may be used as the band boundary. The criterion for the difference may also be another value, as may the amount of change in the fundamental frequency.

In the eleventh and twelfth embodiments, the speech rate is slowed by 10% when the average hearing level is 40 dBHL or more, but auditory characteristics other than the average hearing level may be used for the decision. The criterion for the average hearing level may be a value other than 40 dBHL, and the speech rate may be slowed by an amount other than 10%.
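By way of non-limiting illustration, the prosody adjustments noted for the ninth to twelfth embodiments might be sketched as follows. The function and parameter names are hypothetical. Two interpretations are assumptions: the hearing-level difference is taken as the level at 2 kHz and above minus the level below 2 kHz, and "slowed by 10%" is taken as multiplying the speech rate by 0.9.

```python
def f0_for_noise(f0_hz, noise_dba, threshold_dba=30.0, change=0.20):
    """Ninth embodiment: raise F0 when the average noise level is at or above the threshold."""
    return f0_hz * (1.0 + change) if noise_dba >= threshold_dba else f0_hz


def f0_for_hearing_difference(f0_hz, high_band_hl_db, low_band_hl_db,
                              threshold_db=30.0, change=0.20):
    """Tenth embodiment: lower F0 when the hearing level difference between
    the bands (assumed: high band minus low band) is at or above the threshold."""
    if high_band_hl_db - low_band_hl_db >= threshold_db:
        return f0_hz * (1.0 - change)
    return f0_hz


def rate_for_hearing_level(rate, avg_hl_dbhl, threshold_dbhl=40.0, change=0.10):
    """Eleventh and twelfth embodiments: slow the speech rate when the
    average hearing level is at or above the threshold."""
    return rate * (1.0 - change) if avg_hl_dbhl >= threshold_dbhl else rate
```

Keeping the thresholds and change amounts as parameters reflects the statement that values other than those of the embodiments may be used.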

In the thirteenth and fifteenth embodiments, speech synthesis is stopped when the average level of the environmental sound exceeds 70 dB(A), but another value may be used. In the second, fifth, and eighth embodiments, a switch is used to select the segment database or the formant emphasis filter, but the selection may instead be made in software.

As described above, according to the present embodiment, emphasis processing or processing that compresses the dynamic range of the amplitude is applied to the synthesized speech in accordance with the user's auditory characteristics, or in accordance with the noise environment in which the device is used. Alternatively, speech is synthesized after emphasis processing or dynamic range compression has been applied to the synthesis units stored in the database, in accordance with the user's auditory characteristics or with the noise environment in which the device is used. Alternatively, speech is synthesized using synthesis units to which emphasis processing or dynamic range compression has been applied in advance. In addition, when speech synthesis is interrupted, synthesis resumes from a point in the text, before the stop position, at which the content is easy to understand, based on the result of language processing. The portions to which emphasis processing is applied are likewise set based on language processing. As a result, information can be conveyed reliably even to hearing-impaired users and even in use under noise, and the practical effect is large.

FIG. 1: Block diagram showing the configuration of a first embodiment of the speech synthesizer according to the present invention
FIGS. 2 to 6: Flowcharts for explaining the operation of the first embodiment
FIG. 7: Schematic diagram of the formant emphasis method of the first embodiment
FIG. 8: Schematic diagram of the consonant emphasis method of the first embodiment
FIG. 9: Block diagram showing the configuration of a second embodiment of the speech synthesizer according to the present invention
FIGS. 10 and 11: Flowcharts for explaining the operation of the second embodiment
FIG. 12: Block diagram showing the configuration of a third embodiment of the speech synthesizer according to the present invention
FIG. 13: Flowchart for explaining the operation of the third embodiment
FIG. 14: Block diagram showing the configuration of a fourth embodiment of the speech synthesizer according to the present invention
FIGS. 15 to 18: Flowcharts for explaining the operation of the fourth embodiment
FIG. 19: Block diagram showing the configuration of a fifth embodiment of the speech synthesizer according to the present invention
FIG. 20: Flowchart for explaining the operation of the fifth embodiment
FIG. 21: Block diagram showing the configuration of a sixth embodiment of the speech synthesizer according to the present invention
FIG. 22: Flowchart for explaining the operation of the sixth embodiment
FIG. 23: Block diagram showing the configuration of a seventh embodiment of the speech synthesizer according to the present invention
FIGS. 24 and 25: Flowcharts for explaining the operation of the seventh embodiment
FIG. 26: Block diagram showing the configuration of an eighth embodiment of the speech synthesizer according to the present invention
FIGS. 27 and 28: Flowcharts for explaining the operation of the eighth embodiment
FIG. 29: Schematic diagram of the formant emphasis method of the eighth embodiment
FIG. 30: Block diagram showing the configuration of a ninth embodiment of the speech synthesizer according to the present invention
FIGS. 31 and 32: Flowcharts for explaining the operation of the ninth embodiment
FIG. 33: Block diagram showing the configuration of a tenth embodiment of the speech synthesizer according to the present invention
FIGS. 34 and 35: Flowcharts for explaining the operation of the tenth embodiment
FIG. 36: Block diagram showing the configuration of an eleventh embodiment of the speech synthesizer according to the present invention
FIGS. 37 and 38: Flowcharts for explaining the operation of the eleventh embodiment
FIG. 39: Block diagram showing the configuration of a twelfth embodiment of the speech synthesizer according to the present invention
FIGS. 40 and 41: Flowcharts for explaining the operation of the twelfth embodiment
FIG. 42: Block diagram showing the configuration of a thirteenth embodiment of the speech synthesizer according to the present invention
FIG. 43: Flowchart for explaining the operation of the thirteenth embodiment
FIG. 44: Block diagram showing the configuration of a fourteenth embodiment of the speech synthesizer according to the present invention
FIG. 45: Flowchart for explaining the operation of the fourteenth embodiment
FIG. 46: Block diagram showing the configuration of a fifteenth embodiment of the speech synthesizer according to the present invention
FIG. 47: Flowchart for explaining the operation of the fifteenth embodiment
FIG. 48: Block diagram showing the configuration of a sixteenth embodiment of the speech synthesizer according to the present invention
FIG. 49: Schematic diagram of the rank determination method of the sixteenth embodiment
FIG. 50: Schematic diagram for explaining the operation of the sixteenth embodiment
FIG. 51: Schematic diagram of the emphasized portion selection method of the sixteenth embodiment
FIG. 52: Schematic diagram for explaining the operation of the sixteenth embodiment
FIG. 53: Flowchart for explaining the operation of the seventeenth embodiment, a speech segment creation method according to the present invention
FIG. 54: Schematic diagram of the amplitude compression processing method of the seventeenth embodiment
FIG. 55: Block diagram showing the configuration of a conventional speech synthesizer
FIG. 56: Schematic diagram of the waveform cutting method in a conventional speech segment creation method
FIG. 57: Schematic diagram of the segment connection method of a conventional speech synthesizer

Explanation of reference numerals

  10    Text input means
  20    Language processing means
  30a, 30b, 30c, 30d, 30e, 30f, 30g, 30h, 30m, 30n, 30o, 30p    Speech synthesis unit
  40b, 40e, 40m, 40o    Operation means
  50a, 50b, 50c, 50d, 50e, 50g, 50h, 50i, 50j, 50k, 50l, 50m    Voice quality control means
  60    Electroacoustic transducer
  70a, 70b, 70c, 70d, 70e, 70f, 70g, 70h, 70m, 70n, 70o, 70p    Speech synthesis control means
  80, 280a, 280b, 280c, 280d, 280n, 380a, 380b, 380c, 380d, 380n, 580a, 580b, 580c, 580d, 580n, 680a, 680b, 680c, 680d, 680n    Segment database
  90a, 90b, 90d, 90e, 90f, 90g, 90h, 90m, 90n    Segment connection means
  110    Microphone
  120    Auditory characteristic measurement means
  130a, 130d, 130h, 130p    Phoneme emphasis processing means
  140a, 140b, 140g    Compression processing means
  200b, 200e    Database unit
  210b, 210e    Switch
  220    Auditory characteristic storage means
  300    Segment database reading means
  310    Auditory characteristic reading means
  320a, 320b, 320c, 320d, 320n    Auditory characteristics
  800    Emphasis filter unit
  810a, 810b, 810c, 810d, 810n    Formant emphasis filter
  820    Switch
  900, 900p    Language processing unit
  910    Syntax analysis means
  920    Speech synthesis start position determination means
  930    Emphasized word determination means
  940    Timer means
  101    Syntax analysis unit
  102    Speech synthesis start position determination unit
  103    Speech synthesis start position rule holding unit
  104    Emphasized word determination unit
  105    Emphasized word rule holding unit
  106    Control unit

Claims

1. A speech synthesizer comprising: a speech synthesis unit that synthesizes speech in accordance with a text; and an emphasis processing unit that performs one or more phoneme emphasis processes on the speech synthesized by the speech synthesis unit.

2. The speech synthesizer according to claim 1, wherein the emphasis processing is consonant emphasis processing that, based on phoneme information, amplifies the amplitude of a consonant, or of a consonant and the vowel following it.

3. The speech synthesizer according to claim 1, wherein the emphasis processing is band emphasis processing that emphasizes a consonant frequency band based on phoneme information.

4. The speech synthesizer according to claim 1, further comprising a control unit that analyzes environmental sound input from a microphone and controls the emphasis processing unit based on the physical characteristics of the environmental sound.

5. The speech synthesizer according to claim 4, wherein the control unit analyzes the environmental sound input from the microphone and selects, based on the physical characteristics of the environmental sound, the emphasis processing method used in the emphasis processing unit.

6. The speech synthesizer according to claim 1, further comprising: an operation unit that allows a user to adjust the method and degree of the emphasis processing; and a control unit that controls the emphasis processing unit based on a signal input from the operation unit.

7. The speech synthesizer according to claim 1, further comprising: a measurement unit that measures a user's hearing characteristics; and a control unit that controls the emphasis processing unit based on the user's hearing characteristics.

8. The speech synthesizer according to claim 7, wherein the control unit selects the emphasis processing method used in the emphasis processing unit based on the user's hearing characteristics input from the measurement unit.

9. The speech synthesizer according to claim 1, further comprising: a storage unit that stores a user's hearing characteristics; and a control unit that controls the emphasis processing unit based on the user's hearing characteristics.

10. The speech synthesizer according to claim 9, wherein the control unit selects the emphasis processing method used in the emphasis processing unit based on the user's hearing characteristics stored in the storage unit.

11. The speech synthesizer according to claim 1, further comprising: a hearing characteristic reading unit; and a control unit that controls the emphasis processing unit by referring to a user's hearing characteristics read from a recording medium by the hearing characteristic reading unit.

12. The speech synthesizer according to claim 11, wherein the control unit selects the emphasis processing method used in the emphasis processing unit based on the user's hearing characteristics read by the hearing characteristic reading unit.
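The consonant amplitude emphasis and the environmental-sound-based control recited in the claims above can be illustrated with a minimal Python sketch. It is not the implementation disclosed in this publication: the function names, the segment representation, the RMS noise measure, and the thresholds and gain values are all assumptions chosen for illustration.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def choose_gain_db(noise_rms, quiet_threshold=0.01, loud_threshold=0.1):
    """Map an ambient noise level to a consonant gain in dB.

    The thresholds and the 0 / 6 / 12 dB steps are illustrative assumptions.
    """
    if noise_rms < quiet_threshold:
        return 0.0
    if noise_rms < loud_threshold:
        return 6.0
    return 12.0

def emphasize_consonants(samples, segments, gain_db):
    """Amplify only the samples that fall inside consonant segments.

    segments: list of (start, end, kind) tuples, end exclusive,
    where kind is "C" (consonant) or "V" (vowel). This stands in for
    the phoneme information supplied by the synthesizer.
    """
    gain = 10.0 ** (gain_db / 20.0)
    out = list(samples)
    for start, end, kind in segments:
        if kind == "C":
            for i in range(start, end):
                # Clip to the nominal [-1.0, 1.0] range after amplification.
                out[i] = max(-1.0, min(1.0, out[i] * gain))
    return out

# Usage: measure the microphone signal, pick a gain, then emphasize.
speech = [0.1] * 4 + [0.2] * 4
segments = [(0, 4, "C"), (4, 8, "V")]
noise = [0.05] * 10
emphasized = emphasize_consonants(speech, segments, choose_gain_db(rms(noise)))
```

In this sketch the control step only changes the degree of emphasis; selecting between different emphasis methods would replace `choose_gain_db` with a function that returns the method to apply.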

13. A speech synthesizer comprising: a speech unit database that stores speech in desired synthesis units such as vowel-consonant-vowel chains; a unit deformation unit that performs emphasis processing on the synthesis units; and a speech synthesis unit that synthesizes speech by connecting, in accordance with a target text, the synthesis units subjected to the emphasis processing by the unit deformation unit.

14. The speech synthesizer according to claim 13, wherein the emphasis processing is consonant emphasis processing that, based on phoneme information, amplifies the amplitude of a consonant, or of a consonant and the vowel following it.

15. The speech synthesizer according to claim 13, wherein the emphasis processing is band emphasis processing that emphasizes a consonant frequency band based on phoneme information.

16. The speech synthesizer according to claim 13, wherein the emphasis processing is closure emphasis processing that extends the closure of a consonant based on linguistic information.

17. The speech synthesizer according to claim 13, wherein the emphasis processing is lengthening processing that extends the length of a phoneme based on linguistic information.

18. The speech synthesizer according to claim 13, further comprising a control unit that analyzes environmental sound input from a microphone and controls the unit deformation unit based on the physical characteristics of the environmental sound.

19. The speech synthesizer according to claim 18, wherein the control unit analyzes the environmental sound input from the microphone and selects an emphasis processing method used in the unit deformation unit based on physical characteristics of the environmental sound.

20. The speech synthesizer according to claim 13, further comprising: an operation unit that allows a user to adjust the method and degree of the emphasis processing; and a control unit that controls the unit deformation unit based on a signal input from the operation unit.

21. The speech synthesizer according to claim 13, further comprising: a measurement unit that measures a user's hearing characteristics; and a control unit that controls the unit deformation unit based on the user's hearing characteristics.

22. The speech synthesizer according to claim 21, wherein the control unit selects an emphasis processing method used in the unit deformation unit based on the user's auditory characteristics input from the measurement unit.

23. The speech synthesizer according to claim 13, further comprising: a storage unit that stores a user's hearing characteristics; and a control unit that controls the unit deformation unit based on the user's hearing characteristics.

24. The speech synthesizer according to claim 23, wherein the control unit selects an emphasis processing method to be used in the unit deformation unit based on the user's auditory characteristics stored in the storage unit.

25. The speech synthesizer according to claim 13, further comprising: a hearing characteristic reading unit; and a control unit that controls the unit deformation unit by referring to a user's hearing characteristics read from a recording medium by the hearing characteristic reading unit.

26. The speech synthesizer according to claim 25, wherein the control unit selects an emphasis processing method used in the unit deformation unit based on the user's auditory characteristics read by the auditory characteristic reading unit.
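The unit-level processing recited in the claims above, in which synthesis units are deformed before they are connected, can be sketched as follows. This is a hedged illustration only: the `Segment` type, its `kind` labels, the zero-padding used to lengthen a closure, and the default factor of 1.5 are assumptions and are not taken from this publication.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Segment:
    label: str      # hypothetical phoneme label, e.g. "a" or "k"
    kind: str       # "vowel", "closure", or "burst"
    samples: tuple  # waveform samples belonging to this segment

def extend_closure(seg, factor):
    """Lengthen a closure segment by appending silence.

    Segments of any other kind are returned unchanged.
    """
    if seg.kind != "closure":
        return seg
    extra = int(round(len(seg.samples) * (factor - 1.0)))
    return replace(seg, samples=seg.samples + (0.0,) * max(extra, 0))

def deform_unit(unit, closure_factor=1.5):
    """Apply closure extension to every segment of one synthesis unit."""
    return [extend_closure(seg, closure_factor) for seg in unit]

def concatenate(units):
    """Connect deformed units into a single waveform (no smoothing)."""
    out = []
    for unit in units:
        for seg in unit:
            out.extend(seg.samples)
    return out

# Usage: a vowel-consonant-vowel unit "aka" with a stop closure.
aka = [
    Segment("a", "vowel", (0.3, 0.3)),
    Segment("k", "closure", (0.0, 0.0, 0.0, 0.0)),
    Segment("k", "burst", (0.5,)),
    Segment("a", "vowel", (0.3, 0.3)),
]
waveform = concatenate([deform_unit(aka)])
```

A practical unit connection step would also smooth the joins between units; the sketch omits this to keep the deformation step visible.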

27. A speech synthesizer comprising: a speech unit database that stores speech subjected to phoneme emphasis processing in desired synthesis units such as vowel-consonant-vowel chains; and a speech synthesis unit that synthesizes speech by connecting the synthesis units in accordance with a target text.

28. The speech synthesizer according to claim 27, further comprising: a plurality of speech unit databases that differ in emphasis method and degree; and a control unit that analyzes environmental sound input from a microphone and selects, based on the physical characteristics of the environmental sound, the speech unit database used by the speech synthesis unit for speech synthesis.

29. The speech synthesizer according to claim 27, further comprising: a plurality of speech unit databases that differ in emphasis method and degree; an operation unit that allows a user to adjust the state of emphasis; and a control unit that selects, based on a signal input from the operation unit, the speech unit database used by the speech synthesis unit for speech synthesis.

30. The speech synthesizer according to claim 27, further comprising: a plurality of speech unit databases that differ in emphasis method and degree; a measurement unit that measures a user's hearing characteristics; and a control unit that selects, based on the user's hearing characteristics, the speech unit database used by the speech synthesis unit for speech synthesis.

31. The speech synthesizer according to claim 27, further comprising: a plurality of speech unit databases that differ in emphasis method and degree; a storage unit that stores a user's hearing characteristics; and a control unit that selects, based on the user's hearing characteristics, the speech unit database used by the speech synthesis unit for speech synthesis.

32. The speech synthesizer according to claim 27, further comprising: a storage medium that stores a plurality of speech unit databases that differ in emphasis method and degree; and a speech unit database reading unit.
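The selection among several pre-emphasized speech unit databases recited in the claims above can be sketched in Python as follows. The database keys ("plain", "mild", "strong"), the dB thresholds, and the dictionary layout are hypothetical and serve only to show the control flow under those assumptions.

```python
import math

def level_db(samples):
    """RMS level of the samples in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))

def select_database(noise_db, databases):
    """Pick a unit database by ambient noise level.

    The -40 dB and -20 dB thresholds are illustrative assumptions.
    Returns the chosen key together with the database itself.
    """
    if noise_db < -40.0:
        key = "plain"
    elif noise_db < -20.0:
        key = "mild"
    else:
        key = "strong"
    return key, databases[key]

def synthesize(unit_names, database):
    """Connect the named units from the selected database in order."""
    out = []
    for name in unit_names:
        out.extend(database[name])
    return out

# Usage: each database stores the same units with a different degree of emphasis.
databases = {
    "plain": {"aka": [0.1, 0.2, 0.1]},
    "mild": {"aka": [0.1, 0.4, 0.1]},
    "strong": {"aka": [0.1, 0.8, 0.1]},
}
key, db = select_database(level_db([0.5] * 8), databases)
speech = synthesize(["aka", "aka"], db)
```

Because the emphasis is applied when the databases are built, the run-time cost in this arrangement is limited to the noise measurement and a lookup.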

33. A speech synthesis method comprising applying one or more phoneme emphasis processes to speech synthesized by a speech synthesis unit that synthesizes speech in accordance with a text.

34. A speech synthesis method comprising: applying emphasis processing to synthesis units output from a speech unit database that stores speech in desired synthesis units such as vowel-consonant-vowel chains; and synthesizing speech by connecting the synthesis units subjected to the emphasis processing in accordance with a target text.

35. A speech synthesis method comprising synthesizing speech by connecting, in accordance with a target text, synthesis units output from a speech unit database that stores speech subjected in advance to phoneme emphasis processing in desired synthesis units such as vowel-consonant-vowel chains.