JP2005070430A - Speech output device and method

Info

Publication number: JP2005070430A
Application number: JP2003300071A
Authority: JP
Inventors: Toru Marumoto; 徹丸本; Nozomi Saito; 望齊藤
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2003-08-25
Filing date: 2003-08-25
Publication date: 2005-03-17
Also published as: US20050080626A1

Abstract

PROBLEM TO BE SOLVED: To provide a speech output device and a speech output method that can output highly intelligible (easy-to-hear) speech regardless of the content of the output words.

SOLUTION: The device comprises a speech DB 1, in which familiarity information indicating how familiar each word or word string is has been recorded, and a sound pressure adjustment part 3, which adjusts the sound pressure level in units of words or word strings according to the familiarity information read out of the speech DB 1 by a reproduction part 2 together with the speech data. Words etc. with low familiarity are corrected to a higher sound pressure. Consequently, even when speech with low word familiarity, such as an unfamiliar place name, is output, it is output at a higher sound pressure than high-familiarity words, so that even low-familiarity words become easy to hear.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech output device and method, and is particularly suitable for devices and methods that correct speech output from in-vehicle equipment so that a user in the vehicle cabin can hear it easily.

In recent years, demand for speech output in the vehicle cabin has kept growing: not only guidance speech from navigation devices, but also the other party's voice on a hands-free device and read-aloud speech for Web information and e-mail received by an information communication device. Speech output methods include a record-and-playback method, which reproduces speech recorded in advance on a medium such as a DVD (Digital Versatile Disk) or hard disk, and a TTS (Text-to-Speech) method, which constructs and reproduces a speech waveform from given text information.

A speech output device based on the latter TTS method broadly consists of two processing units: a language processing unit, which adds readings and accents to the given text information using dictionary data for text analysis, and a speech synthesis unit, which generates speech using dictionary data of waveforms and phoneme segments.

Conventionally, a speech intelligibility improvement system based on loudness compensation has been provided for speech output. By appropriately adjusting the sound pressure level of the output speech according to the level of ambient noise and the like picked up by a microphone, this system lets the output speech be heard more clearly even in noise (see, for example, Patent Document 1).
Japanese Patent Laid-Open No. 11-166835

Conventional speech correction devices, typified by this intelligibility improvement system, correct the sound pressure of speech based on physical quantities that reflect the ambient noise environment, such as the noise level and the vehicle speed signal. However, even if the output speech is corrected for clarity based on these physical quantities, the listener is human, so not every word or sentence is understood equally well. This is because, even at the same sound pressure level, the intelligibility of a word varies with its familiarity (how well known the word is).

FIG. 5 is a characteristic diagram of test results showing the relationship among word familiarity, word intelligibility, and sound pressure. It shows how word intelligibility changes as the sound pressure level is varied, and also how it changes with the familiarity of the word being heard. As the diagram makes clear, at the same sound pressure level, words with higher familiarity have higher intelligibility and words with lower familiarity have lower intelligibility.

Thus, even if the sound pressure level of the output speech is corrected based on physical quantities, ease of hearing still differs with the content of the speech. As a result, during route guidance by a navigation device, for example, the less familiar a place name is, the harder it is to catch.

As an information processing device that takes word familiarity into account, there is a kana-kanji conversion device that displays candidate kanji in descending order of familiarity (see, for example, Patent Document 2). There is also a pattern recognition device that, when multiple words express the same concept for an input pattern sequence, searches for the word with higher familiarity and outputs it as the recognition result (see, for example, Patent Document 3).
JP 2001-216295 A
JP 2002-162991 A

The present invention has been made to solve the problems described above, and its object is to provide highly intelligible (easy-to-hear) speech regardless of the content of the words being output.

To solve the above problem, the speech output device of the present invention prepares familiarity information indicating how familiar each of a plurality of words or word strings is and, based on that familiarity information, adjusts the sound pressure level of the speech to be output in units of words or word strings.

According to the present invention configured as described above, even when speech with low word familiarity, such as an unfamiliar place name, is output, it is output at a higher sound pressure than high-familiarity words, which raises the intelligibility of the word. As a result, highly intelligible speech can always be provided, even when the output speech consists of low-familiarity words or word strings, or when high-familiarity and low-familiarity words are mixed in the output speech.

(First embodiment)
A first embodiment of the present invention is described below with reference to the drawings. In the first embodiment, the present invention is applied to a speech output device of the record-and-playback type. FIG. 1 shows an example configuration of the main part of the speech output device according to the first embodiment. As shown in FIG. 1, the speech output device of this embodiment comprises a speech database (DB) 1, a reproduction unit 2, a sound pressure adjustment unit 3, and a volume 4.

The speech DB 1 consists of waveform-coded speech data recorded on a medium such as a DVD or hard disk. In the speech DB 1, the speech data to be output is recorded in units of words or word strings. A word string is, for example, a combination of words that are likely to be used together, an idiom made up of several words, or a simple sentence. Hereinafter, "words or word strings" are collectively referred to as "words etc."

For example, when the speech output device of this embodiment is applied to a navigation device, a guidance utterance such as "Traffic is congested ahead in the direction of XX." (in Japanese, "Kono saki XX homen, jutai desu.") is split into words etc. as | kono saki (ahead) | XX | homen (direction) | jutai (congestion) | desu (is) |, and each is recorded as a separate speech pattern. To reproduce the whole utterance, the separately recorded speech patterns are read out in sequence and output.

The speech DB 1 additionally records, for each speech pattern recorded in units of words etc., familiarity information indicating how familiar it is. This familiarity information expresses the word familiarity of each word etc. as a numerical value (for example, 1.0 to 7.0). The speech DB 1 thus constitutes the familiarity information storage means of the present invention.

The reproduction unit 2 reproduces speech data and familiarity information from the speech DB 1. By arbitrarily selecting and reading out multiple speech patterns from the speech DB 1 (that is, selecting and reading the tags of the data positions where the speech patterns for the desired utterance are recorded), the reproduction unit 2 can reproduce guidance speech with arbitrary content. It also reads out the familiarity information stored for each speech pattern it reads.

The sound pressure adjustment unit 3 controls the volume 4 based on the familiarity information read from the speech DB 1 by the reproduction unit 2, and thereby adjusts the sound pressure level of each speech pattern (each word etc.) that the reproduction unit 2 reads from the speech DB 1. Specifically, the lower the familiarity of a speech pattern, the larger the volume value is made.

For example, when outputting guidance speech such as "Traffic is congested ahead in the direction of XX," no volume adjustment is made if the place name XX is a high-familiarity word such as "Koriyama" (郡山). If XX is a low-familiarity word such as "Saiso" (差塩) or "Dozuki" (百槻), the volume value is raised.

How much the volume value is raised should preferably depend on the word familiarity value. For example, suppose that, in the characteristic diagram of FIG. 5, a word intelligibility of 80% is to be achieved and the original sound pressure level is about 20 dB. Speech patterns with a word familiarity of 7.0 to 5.5 then need no sound pressure adjustment. Speech patterns with a word familiarity of 5.5 to 4.0 have their sound pressure level raised by about 5 dB, those with 4.0 to 2.5 by about 15 dB, and those with 2.5 to 1.0 by about 20 dB.
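The stepwise correction described above can be sketched as a small lookup. This is only an illustration, not the patent's implementation: the function names are hypothetical, and because the stated ranges share their end points (5.5, 4.0, 2.5), assigning each boundary value to the upper band is an assumption made here.

```python
def familiarity_boost_db(familiarity: float) -> float:
    """Return the sound pressure boost in dB for a word familiarity in [1.0, 7.0].

    Bands follow the example for 80% intelligibility at about 20 dB.
    Boundary values are assigned to the higher-familiarity band (an assumption).
    """
    if not 1.0 <= familiarity <= 7.0:
        raise ValueError("familiarity must be between 1.0 and 7.0")
    if familiarity >= 5.5:
        return 0.0   # familiar enough: no adjustment
    if familiarity >= 4.0:
        return 5.0
    if familiarity >= 2.5:
        return 15.0
    return 20.0      # least familiar words get the largest boost


def db_to_amplitude(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)
```

A volume control that works on linear amplitude would multiply each sample of the word by `db_to_amplitude(familiarity_boost_db(f))`.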

As described in detail above, according to the first embodiment, familiarity information is added and stored, in units of words etc., in the speech DB 1 in which the words etc. to be output are recorded, and the sound pressure level of each word etc. to be output is adjusted as appropriate based on the familiarity information read out together with the speech. Highly intelligible (easy-to-hear) speech can therefore be provided even when the speech has content the user does not usually hear. Thus, for example, even when a navigation device using the speech output device of this embodiment is used in an unfamiliar area, the guidance speech is always easy to hear.

The first embodiment above describes an example in which the words etc. with the highest word familiarity are taken as the reference sound pressure level and the sound pressure level of words etc. with lower familiarity is raised, but the present invention is not limited to this. For example, the words etc. with the lowest word familiarity may be taken as the reference, and the sound pressure level of words etc. with higher familiarity lowered. Alternatively, words etc. with medium word familiarity may be taken as the reference, with the sound pressure level lowered for words etc. more familiar than the reference and raised for words etc. less familiar than the reference, so that all words etc. have roughly equal intelligibility.

The first embodiment above also describes an example in which the sound pressure level is adjusted according to word familiarity so that all words etc. have roughly equal intelligibility, but the present invention is not limited to this. It is sufficient for the sound pressure adjustment to make word intelligibility exceed a predetermined value; the intelligibility of all words etc. need not be roughly equal.

The first embodiment above describes an example in which the sound pressure level of the speech is adjusted based on the familiarity information. In addition to or instead of this, words etc. whose word familiarity is lower than a predetermined value may be reproduced two or more times. For example, when outputting guidance speech such as "Traffic is congested ahead in the direction of Saiso," the sound pressure adjustment unit 3 raises the sound pressure level for the "Saiso" portion, and the low-familiarity word is reproduced twice, as in "Traffic is congested ahead in the direction of Saiso. In the direction of Saiso." This repeated reproduction can be controlled in the reproduction unit 2. Repeating even an unfamiliar word etc. in this way raises its intelligibility.
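As a rough sketch of the repeated-reproduction idea, the reproduction order could be built by appending every low-familiarity pattern a second time after the utterance. The function name, the threshold of 4.0, and the choice to repeat only the word itself (rather than a short phrase such as "in the direction of Saiso") are assumptions for illustration; the text only says that a word below a predetermined value is reproduced two or more times.

```python
from typing import List, Tuple

# (label of a recorded speech pattern, word familiarity on the 1.0-7.0 scale)
Pattern = Tuple[str, float]


def playback_order(patterns: List[Pattern], threshold: float = 4.0) -> List[str]:
    """Return pattern labels in reproduction order, repeating unfamiliar ones.

    The full utterance is reproduced first; every pattern whose familiarity
    is below `threshold` (a hypothetical value) is then reproduced once more.
    """
    order = [label for label, _ in patterns]
    repeats = [label for label, familiarity in patterns if familiarity < threshold]
    return order + repeats
```

With hypothetical familiarity values, `playback_order([("ahead", 6.5), ("Saiso", 1.5), ("direction", 6.0)])` places "Saiso" again at the end of the sequence.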

In addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, the reproduction speed of the speech may be adjusted. For example, when outputting guidance speech such as "Traffic is congested ahead in the direction of Saiso," the sound pressure adjustment unit 3 raises the sound pressure level for the "Saiso" portion, and the "Saiso" portion is reproduced at a slower speed than the other portions. This reproduction speed can also be controlled in the reproduction unit 2. Slowing the reproduction speed in this way raises intelligibility even for unfamiliar words etc.

In addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, words etc. whose word familiarity is lower than a predetermined value may be displayed on a screen. This display can be controlled using a display controller, not shown (in a navigation device, for example, the controller normally provided for displaying map images and the like on the display device). Displaying even an unfamiliar word etc. on the screen so that it can also be confirmed visually raises its intelligibility.

(Second embodiment)
Next, a second embodiment of the present invention is described with reference to the drawings. In the second embodiment, the present invention is applied to a TTS-type speech output device. FIG. 2 shows an example configuration of the main part of the speech output device according to the second embodiment. As shown in FIG. 2, the speech output device of this embodiment comprises a text generation unit 11, a TTS engine 12, a sound pressure adjustment unit 13, and a volume 14.

The text generation unit 11 generates text information that represents, as a character string, the content of the speech to be output. The text generation unit 11 may generate text information for an arbitrary character string manually, through the user operating a keyboard (not shown), or a controller may generate text information for an arbitrary character string automatically according to predetermined rules.

The TTS engine 12 comprises a language processing unit 15, a text analysis dictionary 16, a speech synthesis unit 17, and a phoneme segment dictionary 18. The text analysis dictionary 16 is a dictionary database for text analysis that stores text information consisting of various words etc. in association with the phonological and prosodic information to be added to those words etc.

The text analysis dictionary 16 additionally records, for each item of text information recorded in units of words etc., familiarity information for that word etc. This familiarity information expresses the word familiarity of each word etc. as a numerical value (for example, 1.0 to 7.0). The text analysis dictionary 16 thus constitutes the familiarity information storage means of the present invention.

The language processing unit 15 looks up the text information input from the text generation unit 11 in the text analysis dictionary 16 and adds the corresponding phonological and prosodic information to the character strings of the words etc. indicated by the text information, thereby generating phonetic character string information. At this time, the language processing unit 15 also reads out the familiarity information stored for the input text information.

The phoneme segment dictionary 18 is a dictionary database of phoneme segments that stores, in units of character strings consisting of various words etc., the waveform information to be applied to those character strings. The speech synthesis unit 17 looks up the phonetic character string information output from the language processing unit 15 in the phoneme segment dictionary 18 and processes the phonetic character string using the waveform information, thereby generating synthesized speech.

The sound pressure adjustment unit 13 controls the volume 14 based on the familiarity information that the language processing unit 15 extracts from the text analysis dictionary 16, and thereby adjusts the sound pressure level of the synthesized speech generated by the speech synthesis unit 17 in units of words etc. For example, the words etc. with the highest word familiarity are taken as the reference sound pressure level, and the volume value is raised for words etc. with lower familiarity. As in the first embodiment, how much the volume value is raised should preferably depend on the word familiarity value.
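A minimal sketch of word-by-word level adjustment on synthesized speech follows. It assumes that the synthesizer can hand over each word's samples separately together with that word's familiarity; the patent does not state how the word boundaries reach the sound pressure adjustment unit 13, so that interface, the function name, and the `boost_db` callback are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One synthesized word: (audio samples, word familiarity on the 1.0-7.0 scale)
WordAudio = Tuple[Sequence[float], float]


def adjust_word_levels(
    words: Sequence[WordAudio],
    boost_db: Callable[[float], float],
) -> List[float]:
    """Scale each word's samples by a familiarity-dependent gain and join them.

    `boost_db` maps a familiarity value to a gain in dB; its shape is left
    to the caller because the text only says it should depend on familiarity.
    """
    output: List[float] = []
    for samples, familiarity in words:
        scale = 10.0 ** (boost_db(familiarity) / 20.0)  # dB to linear amplitude
        output.extend(sample * scale for sample in samples)
    return output
```

Passing the gain rule as a callback keeps the choice of reference level (highest, lowest, or medium familiarity) outside the scaling loop.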

As described in detail above, according to the second embodiment, familiarity information is added and stored, in units of words etc., in the text analysis dictionary 16 of the TTS engine 12, which synthesizes and reproduces a speech waveform from given text information, and the sound pressure level of each word etc. to be output is adjusted as appropriate based on the familiarity information extracted when the text information is analyzed. Highly intelligible (easy-to-hear) speech can therefore be provided even when the speech has content the user does not usually hear. Thus, for example, even when a navigation device using the speech output device of this embodiment is used in an unfamiliar area, the guidance speech is always easy to hear.

The second embodiment above also describes an example in which the words etc. with the highest word familiarity are taken as the reference sound pressure level and the sound pressure level of words etc. with lower familiarity is raised, but the present invention is not limited to this. For example, the words etc. with the lowest word familiarity, or words etc. with medium word familiarity, may be taken as the reference sound pressure level.

In the second embodiment as well, the sound pressure level need not be adjusted so that all words etc. have roughly equal intelligibility; it is sufficient for the sound pressure adjustment to make word intelligibility exceed a predetermined value.

The second embodiment above also describes an example in which the sound pressure level of the speech is adjusted based on the familiarity information. In addition to or instead of this, words etc. whose word familiarity is lower than a predetermined value may be reproduced two or more times. This repeated reproduction can be achieved, for example, by having the speech synthesis unit 17 synthesize the same word etc. twice.

In addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, the reproduction speed of the speech may be adjusted. The reproduction speed can be controlled, for example, by making variable the timing at which the synthesized speech is output from the speech synthesis unit 17.

In addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, words etc. whose word familiarity is lower than a predetermined value may be displayed on a screen. This display can be controlled using a display controller, not shown (in a navigation device, for example, the controller normally provided for displaying map images and the like on the display device).

(Third embodiment)
Next, a third embodiment of the present invention is described with reference to the drawings. In the third embodiment, the present invention is applied to a speech intelligibility improvement system that uses loudness compensation. FIG. 3 shows an example configuration of the main part of the speech intelligibility improvement system according to the third embodiment.

As shown in FIG. 3, the speech intelligibility improvement system of this embodiment comprises a speech DB 21, a reproduction unit 22, a volume or equalizer (hereinafter simply "volume etc.") 23, a sound pressure adjustment unit 24, a gain control unit 25, an adaptive filter (ADF) 26, a speaker 27, a microphone 28, and a subtractor 29.

The speech DB 21 consists of waveform-coded speech data recorded on a medium such as a DVD or hard disk. In the speech DB 21, the speech data to be output is recorded in units of words etc. For example, when the speech intelligibility improvement system of this embodiment is applied to a navigation device, a navigation utterance such as "Traffic is congested ahead in the direction of XX." (in Japanese, "Kono saki XX homen, jutai desu.") is split into words etc. as | kono saki (ahead) | XX | homen (direction) | jutai (congestion) | desu (is) |, and each is recorded as a separate speech pattern.

The speech DB 21 additionally records familiarity information for each speech pattern recorded in units of words etc. This familiarity information expresses the word familiarity of each word etc. as a numerical value (for example, 1.0 to 7.0). The speech DB 21 thus constitutes the familiarity information storage means of the present invention.

The reproduction unit 22 reproduces speech data and familiarity information from the speech DB 21. By arbitrarily selecting and reading out multiple speech patterns from the speech DB 21 (that is, selecting and reading the tags of the data positions where the speech patterns for the desired utterance are recorded), the reproduction unit 22 can reproduce navigation speech with arbitrary content. It also reads out the familiarity information stored for each speech pattern it reads.

The volume etc. 23 controls the loudness of the navigation speech reproduced by the reproduction unit 22. The speaker 27 outputs the navigation speech whose sound pressure has been corrected by the volume etc. 23. The microphone 28 is intended for spoken voice input, but in practice it picks up not only spoken voice commands but also the navigation speech output from the speaker 27, audio sound output from other speakers (not shown), driving noise, and so on (hereinafter, audio sound and driving noise together are called "ambient noise").

The adaptive filter 26 comprises a coefficient identification unit and a speech correction filter. The coefficient identification unit is a filter for identifying the transfer function of the acoustic system between the speaker 27 and the microphone 28 (that is, the filter coefficients of the speech correction filter); an adaptive filter based on the LMS (Least Mean Square) algorithm or the N-LMS (Normalized LMS) algorithm is used. The coefficient identification unit operates so as to minimize the power of the error signal output from the subtractor 29 (described below), and thereby identifies the impulse response of the acoustic system.

The speech correction filter convolves the filter coefficients determined by the coefficient identification unit with the sound-pressure-corrected navigation speech to be controlled, thereby giving that corrected navigation speech the same transfer characteristic as the acoustic system described above. This produces a simulated navigation speech that models the navigation speech as it arrives at the position of the microphone 28.

The subtractor 29 extracts the ambient noise by subtracting the simulated navigation speech generated by the adaptive filter 26 from the sound input through the microphone 28 (a mixture of navigation speech and ambient noise). The extracted ambient noise is fed back as an error signal to the coefficient identification unit of the adaptive filter 26 and to the gain control unit 25.
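
The loop formed by the adaptive filter 26 and the subtractor 29 can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the disclosed implementation: the names `room_response` and `nlms_noise_extractor`, the tap count, and the step size `mu` are hypothetical, and signals are plain lists of samples.

```python
def room_response(nav, h):
    """Convolve the speaker signal with an impulse response h
    (a stand-in for the speaker-to-microphone acoustic path)."""
    return [sum(h[k] * nav[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(nav))]


def nlms_noise_extractor(nav, mic, taps=8, mu=0.5, eps=1e-8):
    """Estimate ambient noise at the microphone with an N-LMS adaptive filter.

    nav: samples sent to the speaker (corrected navigation speech)
    mic: samples picked up by the microphone (speech plus ambient noise)
    Returns (ambient, w): the error signal and the identified coefficients.
    """
    w = [0.0] * taps   # current estimate of the acoustic impulse response
    x = [0.0] * taps   # most recent speaker samples, newest first
    ambient = []
    for s, d in zip(nav, mic):
        x = [s] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))   # simulated navigation speech
        e = d - y                                   # subtractor output (error signal)
        norm = sum(xi * xi for xi in x) + eps       # input power used for normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        ambient.append(e)
    return ambient, w
```

When `mic` contains no ambient noise, the returned coefficients approach the true impulse response and the error signal decays toward zero. When noise is present, the error signal serves as the ambient noise estimate that is fed back to the gain control.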

Based on the simulated navigation speech output from the adaptive filter 26 and the ambient noise output from the subtractor 29, the gain control unit 25 calculates the optimum gain to apply to the navigation speech to be controlled, which is reproduced by the reproduction unit 22, and outputs the calculated gain value to the sound pressure adjustment unit 24. Here the ambient noise (error signal) is treated as noise with respect to the navigation speech, and the gain of the navigation speech is adjusted so that the navigation speech output from the speaker 27 is clearly audible to the user. The gain control unit 25 thus constitutes the gain calculation means of the present invention.
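
This passage does not give a formula for the optimum gain. As one possible reading, the sketch below raises the gain until an assumed target signal-to-noise ratio is reached, up to an assumed ceiling. The function name `correction_gain_db` and the parameters `target_snr_db` and `max_boost_db` are hypothetical and are not taken from the specification.

```python
import math


def correction_gain_db(sim_nav, ambient, target_snr_db=15.0, max_boost_db=12.0):
    """Return a correction gain in dB from a frame of simulated navigation
    speech and a frame of extracted ambient noise (assumed target-SNR rule)."""
    if not sim_nav or not ambient:
        return 0.0
    p_nav = sum(s * s for s in sim_nav) / len(sim_nav)      # speech power at the mic
    p_noise = sum(n * n for n in ambient) / len(ambient)    # ambient noise power
    if p_nav == 0.0 or p_noise == 0.0:
        return 0.0                                           # nothing to compensate
    snr_db = 10.0 * math.log10(p_nav / p_noise)
    # Boost only when the speech falls short of the target, and cap the boost.
    return min(max(target_snr_db - snr_db, 0.0), max_boost_db)
```

The returned value would play the role of the gain value passed to the sound pressure adjustment unit 24; a practical system would likely smooth it over time to avoid audible pumping.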

The sound pressure adjustment unit 24 controls the volume control 23 based on the correction gain calculated by the gain control unit 25 to adjust the overall sound pressure level of the navigation speech to be output. It also controls the volume control 23 based on the familiarity information read from the speech DB 21 by the reproduction unit 22 to adjust the sound pressure level of that navigation speech in units of words or word strings. For example, the word or word string with the highest word familiarity is taken as the reference sound pressure level, and the volume value is increased for words with lower word familiarity.

For example, when outputting navigation speech such as "Traffic is congested ahead in the direction of XX," the volume value is adjusted for the utterance as a whole so that it can be heard clearly despite ambient sound. In addition, when the place name XX is a word with low word familiarity, such as "Saiso" (差塩) or "Dozuki" (百槻), the volume value is adjusted to add further compensation during that word interval. As in the first embodiment, the amount by which the volume value is raised for each word or word string preferably varies with the word familiarity value.
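
The two-stage adjustment described above (one gain for the whole utterance plus extra compensation for low-familiarity words) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the familiarity scores, the fallback score `default`, and the linear slope `db_per_step` are assumptions chosen for the example.

```python
def word_gains_db(words, familiarity, overall_gain_db=0.0, db_per_step=1.5, default=4.0):
    """Return [(word, gain_db)] for one utterance.

    The most familiar word in the utterance is the reference and receives only
    the overall gain; less familiar words receive an extra boost proportional
    to how far their score falls below the reference.
    """
    if not words:
        return []
    scores = [familiarity.get(w, default) for w in words]
    ref = max(scores)
    return [(w, overall_gain_db + (ref - s) * db_per_step)
            for w, s in zip(words, scores)]


def apply_gain(samples, gain_db):
    """Scale samples by a gain given in dB (amplitude ratio 10 ** (dB / 20))."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]
```

For instance, with hypothetical scores of 6.5 for a common word and 2.5 for a rare place name, the place name would receive 6 dB more than the common word under the default slope.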

As described in detail above, according to the third embodiment, in a loudness-compensation speech intelligibility improvement system, the amount of speech compensation is adjusted in units of words or word strings based on the word familiarity information. The speech to be output can therefore be heard clearly regardless of ambient sound, and remains easy to hear even when it contains unfamiliar content. Thus, for example, even when a navigation device using this speech intelligibility improvement system is used in an unfamiliar area, the guidance speech is always easy to hear.

The third embodiment has also been described using an example in which the word or word string with the highest word familiarity is taken as the reference sound pressure level and the sound pressure level of words with lower word familiarity is raised, but the present invention is not limited to this. For example, the word with the lowest word familiarity, or a word with medium word familiarity, may be used as the reference sound pressure level.

In the third embodiment as well, it is not necessary to adjust the sound pressure level according to word familiarity so that all words or word strings have roughly equal intelligibility; it is sufficient that the sound pressure adjustment raises word intelligibility above a predetermined value.

The third embodiment has also been described using an example in which the sound pressure level of the speech is adjusted based on the familiarity information. In addition to this, or instead of it, words or word strings whose word familiarity is below a predetermined value may be reproduced two or more times. This repeated reproduction can be controlled by the reproduction unit 22.

Likewise, in addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, the reproduction speed of the speech may be adjusted. This reproduction speed can also be controlled by the reproduction unit 22.

(Fourth embodiment)
Next, a fourth embodiment of the present invention will be described with reference to the drawings. In the fourth embodiment, the present invention is applied to a voice call system (for example, a hands-free system). FIG. 4 is a diagram showing an example configuration of the main part of the voice call system according to the fourth embodiment.

As shown in FIG. 4, the voice call system of this embodiment comprises an acoustic model DB 31, a language model DB 32, a first continuous recognition unit 33, a first sound pressure adjustment unit 34, a first volume control 35, a speaker 36, a microphone 37, a second continuous recognition unit 38, a second sound pressure adjustment unit 39, and a second volume control 40.

The acoustic model DB 31 is a speech dictionary database that stores the character string of each word or word string to be recognized in association with the feature values of its speech pattern. The language model DB 32 is a parsing dictionary database that stores the information needed to analyze the syntax of a recognized speech pattern. The language model DB 32 additionally stores information that relates text information representing the character strings of various words or word strings to their familiarity. The language model DB 32 thus constitutes the familiarity information storage means of the present invention.

The first continuous recognition unit 33 calculates feature values from the received speech, compares them with the feature values of each word or word string stored in advance in the acoustic model DB 31, searches for the speech pattern with the highest similarity, and recognizes the character string having that speech pattern as the character string of the received speech. It then converts the input received speech into text information for the recognized character string. The first continuous recognition unit 33 thus constitutes the first speech recognition means of the present invention.

The first continuous recognition unit 33 also refers to the language model DB 32 using the converted text information, reads out the familiarity information stored for that text information, and supplies it to the first sound pressure adjustment unit 34. The first sound pressure adjustment unit 34 controls the first volume control 35 based on the familiarity information supplied from the first continuous recognition unit 33, thereby adjusting the sound pressure level of the received speech in units of words or word strings. For example, the word or word string with the highest word familiarity is taken as the reference sound pressure level, and the volume value is increased for words with lower word familiarity. The received speech whose sound pressure has been corrected in this way is output from the speaker 36.
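
The receive-side flow (match features against the acoustic model, look up familiarity in the language model, derive a per-word boost) can be sketched as below. This is a deliberately simplified illustration: cosine similarity over fixed-length feature vectors stands in for the recognizer, the two dictionaries stand in for the acoustic model DB 31 and the language model DB 32, and the reference score `ref` and slope `db_per_step` are assumed values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recognize(features, acoustic_model):
    """Return the text whose stored feature vector is most similar."""
    return max(acoustic_model, key=lambda text: cosine(features, acoustic_model[text]))


def received_word_gain_db(features, acoustic_model, language_model,
                          ref=7.0, db_per_step=1.5):
    """Recognize one word and return (text, boost_db) from its familiarity."""
    text = recognize(features, acoustic_model)
    fam = language_model.get(text, ref)          # unknown text: no extra boost
    return text, max(ref - fam, 0.0) * db_per_step
```

The same function could serve the transmit side by feeding it features computed from the microphone input instead of the received speech.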

The second continuous recognition unit 38 calculates feature values from the transmitted speech input through the microphone 37, compares them with the feature values of each word or word string stored in advance in the acoustic model DB 31, searches for the speech pattern with the highest similarity, and recognizes the character string having that speech pattern as the character string of the transmitted speech. It then converts the input transmitted speech into text information for the recognized character string. The second continuous recognition unit 38 thus constitutes the second speech recognition means of the present invention.

The second continuous recognition unit 38 also refers to the language model DB 32 using the converted text information, reads out the familiarity information stored for that text information, and supplies it to the second sound pressure adjustment unit 39. The second sound pressure adjustment unit 39 controls the second volume control 40 based on the familiarity information supplied from the second continuous recognition unit 38, thereby adjusting the sound pressure level of the transmitted speech in units of words or word strings. For example, the word or word string with the highest word familiarity is taken as the reference sound pressure level, and the volume value is increased for words with lower word familiarity. The transmitted speech whose sound pressure has been corrected in this way is sent to the other party.

In the example of FIG. 4, a continuous recognition unit and a sound pressure adjustment unit are provided on both the receiving side and the transmitting side. As a result, even if the other party's voice communication system does not have a configuration like that of FIG. 4, call speech whose sound pressure is adjusted according to the utterance content can be provided in both directions. In the present invention, however, the continuous recognition unit and the sound pressure adjustment unit need not be provided on both the receiving and transmitting sides; either one alone is sufficient.

When a continuous recognition unit and a sound pressure adjustment unit are provided on both the receiving and transmitting sides and the other party has the same configuration, speech whose sound pressure was adjusted on one's own transmitting side is adjusted again on the other party's receiving side, so the sound pressure is adjusted more than necessary. Therefore, before the call starts (at the initial call setup), a predetermined communication is performed to check whether the other party has a sound pressure adjustment unit. If the other party has one, control can be performed to suspend the function of at least one of the first sound pressure adjustment unit 34 and the second sound pressure adjustment unit 39.

For example, when a call is first placed, the calling-side voice communication system sends an inquiry signal to the called-side voice communication system to ask whether the called side has a sound pressure adjustment unit. The called-side system responds by reporting the presence or absence of a sound pressure adjustment unit to the calling-side system. On receiving a reply that a sound pressure adjustment unit is present, the calling-side system suspends the function of its own first sound pressure adjustment unit 34. It also sends the called-side system a signal instructing it to suspend the function of its first sound pressure adjustment unit 34, so that the first sound pressure adjustment unit 34 of the called-side system is suspended as well.

Alternatively, when the calling-side system receives a reply from the called-side system that a sound pressure adjustment unit is present, the second sound pressure adjustment unit 39 of the calling-side system and the second sound pressure adjustment unit 39 of the called-side system may be suspended. As another option, the first sound pressure adjustment unit 34 and the second sound pressure adjustment unit 39 of the calling-side system may both be suspended while the functions of the called-side system are left active. Furthermore, neither system need suspend its sound pressure adjustment units 34 and 39; instead, both may be controlled so that the range of sound pressure increase or decrease is about half the normal range.
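
The alternatives described for call setup can be summarized as a small decision routine. The sketch below is illustrative only: the class `AdjusterState`, the function `negotiate`, and the policy names are hypothetical, and the inquiry and reply signals themselves are reduced to a single boolean.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AdjusterState:
    rx_enabled: bool = True   # first sound pressure adjustment unit 34 (received speech)
    tx_enabled: bool = True   # second sound pressure adjustment unit 39 (transmitted speech)
    scale: float = 1.0        # multiplier applied to the normal adjustment range


def negotiate(peer_has_adjuster: bool,
              policy: str = "suspend_rx_both") -> Tuple[AdjusterState, Optional[AdjusterState]]:
    """Return (caller_state, callee_state) after the pre-call inquiry.

    callee_state is None when the called side reports no adjustment unit.
    """
    caller = AdjusterState()
    if not peer_has_adjuster:
        return caller, None                     # keep both local units active
    callee = AdjusterState()
    if policy == "suspend_rx_both":             # both sides suspend unit 34
        caller.rx_enabled = callee.rx_enabled = False
    elif policy == "suspend_tx_both":           # both sides suspend unit 39
        caller.tx_enabled = callee.tx_enabled = False
    elif policy == "suspend_caller":            # caller suspends 34 and 39
        caller.rx_enabled = caller.tx_enabled = False
    elif policy == "halve_both":                # keep all units, halve the range
        caller.scale = callee.scale = 0.5
    else:
        raise ValueError(f"unknown policy: {policy}")
    return caller, callee
```

Under each of the three suspension policies, speech in either direction passes through exactly one active adjustment unit; under `halve_both` it passes through two units that each apply about half the usual range.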

As described in detail above, according to the fourth embodiment, a voice communication system recognizes and parses the call speech and uses the result to adjust the sound pressure in units of words or word strings based on the word familiarity information. Even when the call contains unfamiliar utterance content, the sound pressure correction makes it easier to hear, so a comfortable call is always possible.

The fourth embodiment has also been described using an example in which the word or word string with the highest word familiarity is taken as the reference sound pressure level and the sound pressure level of words with lower word familiarity is raised, but the present invention is not limited to this. For example, the word with the lowest word familiarity, or a word with medium word familiarity, may be used as the reference sound pressure level.

For the fourth embodiment as well, it is sufficient that the sound pressure adjustment raises word intelligibility above a predetermined value; the intelligibility of all words or word strings need not be made roughly equal.

The fourth embodiment has also been described using an example in which the sound pressure level of the speech is adjusted based on the familiarity information. In addition to this, or instead of it, words or word strings whose word familiarity is below a predetermined value may be reproduced two or more times. This repeated reproduction can be controlled, for example, as follows: the received or transmitted speech is digitized and temporarily stored in a buffer memory, the buffer memory is read out two or more times, and the read-out speech is converted back to an analog signal.
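
The buffer-based repetition can be sketched as follows, assuming the digitized speech has already been split into per-word segments with a familiarity score attached. The function name, the threshold, and the repeat count are hypothetical values for illustration.

```python
def repeat_low_familiarity(segments, threshold=3.0, repeats=2):
    """Build the output sample stream from buffered word segments.

    segments: list of (samples, familiarity) pairs in utterance order.
    Segments whose familiarity is below `threshold` are read from the
    buffer `repeats` times; all other segments are read once.
    """
    out = []
    for samples, familiarity in segments:
        times = repeats if familiarity < threshold else 1
        for _ in range(times):
            out.extend(samples)
    return out
```

A practical system would probably insert a short pause between repetitions; that detail is not stated in this passage and is omitted here.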

In addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, the reproduction speed of the speech may be adjusted according to word familiarity. This reproduction speed can be controlled, for example, as follows: the received or transmitted speech is digitized and temporarily stored in a buffer memory, and the timing of reads from the buffer memory is varied according to word familiarity.
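
One simple way to vary the read timing is to step through the buffer at a fractional rate and interpolate between stored samples. The sketch below assumes that approach; the names `playback_rate` and `stretch`, the threshold, and the slow rate of 0.8 are hypothetical. Plain resampling of this kind also lowers the pitch, so a practical system would more likely use a pitch-preserving time-stretch method.

```python
def playback_rate(familiarity, threshold=3.0, slow_rate=0.8):
    """Return the buffer read rate: below 1.0 for low-familiarity words."""
    return slow_rate if familiarity < threshold else 1.0


def stretch(samples, rate):
    """Read `samples` at `rate` input samples per output sample,
    using linear interpolation between neighbouring samples."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(samples) < 2:
        return list(samples)
    out = []
    last = len(samples) - 1
    pos = 0.0
    while pos <= last:
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
        pos += rate
    return out
```

A rate of 0.5 roughly doubles the duration of a word, while a rate of 1.0 returns the samples unchanged.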

In addition to or instead of adjusting the sound pressure level of the speech based on the familiarity information, words or word strings whose word familiarity is below a predetermined value may be displayed on a screen. This display control can be performed using a display controller (not shown), for example one normally provided to show telephone numbers and the like on a display device.

The sound pressure adjustment techniques of the first to fourth embodiments described above can be implemented in hardware, with a DSP, or in software. For example, when implemented in software, the speech output device of the embodiments actually comprises a computer CPU or MPU, RAM, ROM, and the like, and is realized by running a program stored in the RAM or ROM.

Accordingly, the embodiments can be realized by recording a program that causes a computer to perform the functions of the above embodiments on a recording medium such as a CD-ROM and having the computer read it. Besides a CD-ROM, the recording medium may be a flexible disk, hard disk, magnetic tape, optical disk, magneto-optical disk, DVD, nonvolatile memory card, or the like. The embodiments can also be realized by downloading the program to a computer over a network such as the Internet.

Each of the first to fourth embodiments is merely one concrete example of carrying out the present invention, and the technical scope of the present invention must not be construed restrictively on their account. That is, the present invention can be implemented in various forms without departing from its spirit or its principal features.

The speech output device and method of the present invention can be applied widely to devices and systems that adjust sound pressure in units of words or word strings according to familiarity, and are useful, for example, in the navigation device and voice call system described in the above embodiments. They are also useful in information communication devices that read aloud Web information or e-mail received over an information network such as the Internet. They are further useful in language learning systems that read out conversations, words, and the like, for adjusting the sound pressure of unfamiliar words, highly difficult words, and words that are hard to pronounce.

FIG. 1 is a diagram showing an example configuration of the main part of the speech output device according to the first embodiment.
FIG. 2 is a diagram showing an example configuration of the main part of the speech output device according to the second embodiment.
FIG. 3 is a diagram showing an example configuration of the main part of the speech intelligibility improvement system according to the third embodiment.
FIG. 4 is a diagram showing an example configuration of the main part of the voice call system according to the fourth embodiment.
FIG. 5 is a characteristic chart of test results showing the relationship among word familiarity, word intelligibility, and sound pressure.

Explanation of symbols

1 Speech DB
2 Reproduction unit
3 Sound pressure adjustment unit
4 Volume control
11 Text generation unit
12 TTS engine
13 Sound pressure adjustment unit
14 Volume control
15 Language processing unit
16 Text analysis dictionary
17 Speech synthesis unit
18 Phoneme segment dictionary
21 Speech DB
22 Reproduction unit
23 Volume control or equalizer
24 Sound pressure adjustment unit
25 Gain control unit
26 Adaptive filter
27 Speaker
28 Microphone
29 Subtractor
31 Acoustic model DB
32 Language model DB
33 First continuous recognition unit
34 First sound pressure adjustment unit
35 First volume control
36 Speaker
37 Microphone
38 Second continuous recognition unit
39 Second sound pressure adjustment unit
40 Second volume control

Claims

A speech output device comprising:
familiarity information storage means for storing familiarity information indicating how familiar each of a plurality of words or word strings is; and
sound pressure adjustment means for adjusting the sound pressure level of speech to be output, in units of words or word strings, based on the familiarity information stored in the familiarity information storage means.

The speech output device according to claim 1, wherein the familiarity information storage means is configured by adding the familiarity information, in units of words or word strings, to a speech database in which the speech of each word or word string to be output is recorded.

The speech output device according to claim 1, wherein the familiarity information storage means is configured by adding the familiarity information, in units of words or word strings, to a text analysis dictionary database provided in a device that synthesizes and reproduces a speech waveform from given text information.

The speech output device according to claim 1, further comprising gain calculation means for calculating a correction gain for the output speech based on the sound pressure level of the output speech and the sound pressure level of ambient sound audible at the listening position of the output speech,
wherein the sound pressure adjustment means adjusts the sound pressure level of the speech to be output based on the correction gain calculated by the gain calculation means, and also adjusts that sound pressure level in units of words or word strings based on the familiarity information stored in the familiarity information storage means.

The speech output device according to claim 1, further comprising speech recognition means for comparing input speech with a speech dictionary prepared in advance, recognizing the word or word string of the input speech, and converting it into text information,
wherein the familiarity information storage means stores information indicating the relationship between text information representing a plurality of words or word strings and their familiarity, and
the sound pressure adjustment means adjusts the sound pressure level of the input speech in units of words or word strings based on the familiarity information obtained by referring to the familiarity information storage means using the text information converted by the speech recognition means.

The speech output device according to claim 5, wherein the speech recognition means takes received speech in a voice call system as input, compares it with a speech dictionary prepared in advance, recognizes the word or word string of the received speech, and converts it into text information.

The speech output device according to claim 5, wherein the speech recognition means takes transmitted speech in a voice call system as input, compares it with a speech dictionary prepared in advance, recognizes the word or word string of the transmitted speech, and converts it into text information.

The speech output device according to claim 5, wherein the speech recognition means comprises first speech recognition means that takes received speech in a voice call system as input, compares it with a speech dictionary prepared in advance, recognizes the word or word string of the received speech, and converts it into text information, and second speech recognition means that takes transmitted speech in the voice call system as input, compares it with a speech dictionary prepared in advance, recognizes the word or word string of the transmitted speech, and converts it into text information, and
the sound pressure adjustment means comprises first sound pressure adjustment means for adjusting the sound pressure level of the received speech in units of words or word strings based on the familiarity information obtained by referring to the familiarity information storage means using the text information converted by the first speech recognition means, and second sound pressure adjustment means for adjusting the sound pressure level of the transmitted speech in units of words or word strings based on the familiarity information obtained by referring to the familiarity information storage means using the text information converted by the second speech recognition means.

The speech output device according to claim 8, further comprising: determination means for determining, before a call starts in the voice call system, whether the other party has the sound pressure adjustment means; and
control means for suspending the function of at least one of the first sound pressure adjustment means and the second sound pressure adjustment means when the determination means determines that the other party has the sound pressure adjustment means.

The speech output device according to claim 1, further comprising reproduction control means for performing control, based on the familiarity information stored in the familiarity information storage means, so that a word or word string whose familiarity is below a predetermined value is reproduced two or more times.

The speech output device, further comprising reproduction control means for adjusting the reproduction speed of each word or word string to be output based on the familiarity information stored in the familiarity information storage means.

The speech output device according to claim 1, further comprising display control means for performing control, based on the familiarity information stored in the familiarity information storage means, so that a word or word string whose familiarity is below a predetermined value is displayed on a screen.

A speech output method in which a sound pressure adjustment unit refers to familiarity information indicating how familiar each of a plurality of words or word strings is, and adjusts the sound pressure level of each word or word string to be output according to the familiarity information.

The speech output method according to claim 13, wherein, when speech is reproduced from a speech database in which the speech of each word or word string to be output is recorded, the sound pressure adjustment unit refers to the familiarity information recorded in units of words or word strings in the speech database and adjusts the sound pressure level of the reproduced speech in units of words or word strings.

The speech output method according to claim 13, wherein, when a speech waveform is synthesized and reproduced from given text information, the sound pressure adjustment unit refers to the familiarity information recorded in units of words or word strings in a text analysis dictionary database and adjusts the sound pressure level of the reproduced speech in units of words or word strings.

The speech output method according to claim 13, wherein, when speech input from outside is reproduced, a speech recognition unit compares the input speech with a speech dictionary prepared in advance to recognize the word or word string of the input speech, and the sound pressure adjustment unit refers to the familiarity information corresponding to the recognized word or word string and adjusts the sound pressure level of the reproduced speech in units of words or word strings.

The speech output method according to claim 13, wherein, in a speech intelligibility improvement system that obtains a correction gain based on the sound pressure level of the output speech and the sound pressure level of ambient sound audible at the listening position of the output speech and corrects the sound pressure level of the output speech based on the correction gain,
the sound pressure adjustment unit adjusts the sound pressure level of the output speech based on the correction gain and also adjusts it in units of words or word strings based on the familiarity information.

The speech output method according to claim 13, wherein, based on the familiarity information, a word or word string whose familiarity is below a predetermined value is reproduced and output two or more times.

The speech output method according to claim 13, wherein, based on the familiarity information, a word or word string whose familiarity is below a predetermined value is reproduced and output at a slower speed than a word or word string whose familiarity is at or above the predetermined value.

The speech output method according to claim 13, wherein, based on the familiarity information, a word or word string whose familiarity is below a predetermined value is displayed on a screen.