JP2006208820A

JP2006208820A - Speech processor

Info

Publication number: JP2006208820A
Application number: JP2005021866A
Authority: JP
Inventors: Takahiro Adachi (足立 隆弘); Reiko Yamada (山田 玲子)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-01-28
Filing date: 2005-01-28
Publication date: 2006-08-10
Anticipated expiration: 2025-01-28
Also published as: JP4644876B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech processor capable of outputting input speech with improved clarity for the listener.

SOLUTION: The speech processor comprises a frequency analysis section 1202 which performs frequency analysis on the input speech signal, a phoneme detection section 1206 which detects the respective phoneme parts based on the analysis result of the frequency analysis section 1202, an amplification processing section 1208 which selectively amplifies the phoneme parts according to the detection result of the phoneme detection section 1206 and attribute information of a user, and an output signal selection section 1210 which combines the input speech signal with the selectively amplified parts and outputs the result.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to the configuration of a speech processing apparatus capable of improving the clarity of input speech before outputting it.

Humans communicate by speech in a wide variety of noisy environments, and many previous studies have reported that noise interferes with speech perception (see, for example, Non-Patent Document 1). It has also been reported that speech in a non-native language is more susceptible to noise than speech in the listener's native language (see, for example, Non-Patent Document 2).

Therefore, if foreign-language speech learning materials are to help learners acquire the ability to converse even in noisy environments, the differences between native and non-native speakers in how added noise affects speech perception must be investigated in detail, and effective training methods must be considered.

In this regard, there is a report on an experiment that systematically manipulated the SN ratio for speech contrasting in American English /r/ and /l/ (hereinafter RL), which Japanese speakers find difficult to tell apart, and examined the difference in the percentage of correct responses between native speakers of American English and native speakers of Japanese (see, for example, Non-Patent Document 3).

Non-Patent Document 1: Kalikow, D. N., Stevens, K. N., and Elliott, L. L., Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability, J. Acoust. Soc. Am., 61, pp. 1337-1351, 1977
Non-Patent Document 2: Florentine, M., Speech perception in noise by fluent non-native listeners, Trans. Tech. Comm. Physiol. Acoust., H-85-16, 1985
Non-Patent Document 3: Kazuo Ueda, Ryo Komaki, Reiko Yamada, Effects of noise on the perception of American English /r/ and /l/, Proceedings of the 65th Annual Meeting of the Japanese Psychological Association, p. 120, 2001

Incidentally, both the acoustic differences and the cues used in listening vary from one phoneme contrast to another. Consequently, although /b/ and /v/ (hereinafter BV) and /s/ and /θ/ (hereinafter STH) are also phonemes that Japanese speakers find difficult to tell apart, the effect of added noise on them may differ from its effect on RL. In this document, phonemes that speakers of a given native language find difficult to distinguish perceptually are called "phonemes with a phoneme contrast".

Therefore, when a computer is used to provide foreign-language listening practice under noisy conditions as described above, simply adding noise to the model speech to degrade its SN ratio from the very beginning may not yield a sufficient learning effect.

Moreover, the issue is not limited to foreign-language learning. More generally, when, for example, a Japanese speaker and a non-Japanese speaker communicate by voice over a communication link, the sound quality on the transmitting and receiving sides needs to be controlled with regard to how ambient noise affects ease of listening.

Conventionally, however, it has not necessarily been clear what measures should be taken against the loss of listenability caused by ambient noise. In particular, what kind of speech processing is desirable when the sender and receiver have different native languages has not been sufficiently studied.

The present invention has been made to solve the problems described above, and its object is to provide a speech processing apparatus that can output input speech with improved clarity for the listener.

To achieve this object, the speech processing apparatus of the present invention comprises: storage means for storing enhancement information on the phonemes to be enhanced according to the attributes of a user; frequency analysis means for performing frequency analysis on an input speech signal; phoneme detection means for detecting each phoneme portion based on the analysis result of the frequency analysis means; enhancement processing means for selectively enhancing phoneme portions according to the detection result of the phoneme detection means and the enhancement information; and output signal selection means for combining the input speech signal with the selectively enhanced portions and outputting the result.

Preferably, the phonemes to be enhanced are plosive phonemes.

Preferably, the phoneme detection means detects a phoneme corresponding to a plosive by the presence or absence, in the analysis result of the frequency analysis means, of a vertical pulse, that is, power at or above a certain level extending from a low frequency band to a high frequency band within a predetermined time.

Preferably, the speech processing apparatus further comprises phoneme acoustic model storage means for storing acoustic models, and detects phonemes by likelihood calculation based on the acoustic model for each phoneme.

Embodiments of the present invention are described below with reference to the drawings.

(System configuration of the present invention)
FIG. 1 is a conceptual diagram showing an example of a communication system 1000 that uses the speech processing apparatus of the present invention.

The following description takes voice communication between computers as an example of a sender and a receiver at remote locations communicating by speech signals using the speech processing apparatus of the present invention. The invention is not limited to this case, however. More generally, it can be applied to other communication systems such as mobile phones, to broadcasting systems such as television, and to any system in which input speech needs to be made clearer for the listener. For example, a foreign-language learning device of the kind described above may deliberately add a predetermined level of noise to listening tasks in order to improve the learner's listening ability in noisy environments. The invention is also applicable there: depending on the learner's level of training, the learner first listens to model speech processed for improved clarity and is then gradually moved on to listening practice with the original, unprocessed model speech.

Referring to FIG. 1, the system 1000 includes a computer 100 with which a user 2 communicates by voice, via a network 400 such as the Internet, with the user of a remote computer 300. In the following description, the computer 100 functions as the speech processing apparatus.

Referring to FIG. 1, the computer 100 includes: a computer main body 102 having a disk drive 108 for reading information from a recording medium such as a CD-ROM (Compact Disc Read-Only Memory) and an FD drive 106 for reading and writing information on a flexible disk (hereinafter FD) 116; a display 104 connected to the computer main body 102 as a display device; a keyboard 110 and a mouse 112, likewise connected to the computer main body 102, as input devices; a microphone 132 as a voice input device; and a speaker 134 as a voice output device.

Note that the microphone 132 and the speaker 134 may instead be the microphone and headphones of a headset worn by the user 2.

The computer 300 is assumed to have basically the same configuration as the computer 100.

FIG. 2 is a block diagram showing the hardware configuration of the computer 100.

As shown in FIG. 2, the computer main body 102 of the computer 100 includes, in addition to the disk drive 108 and the FD drive 106, the following components, each connected to a bus BS: a CPU (Central Processing Unit) 120; a memory 122 including ROM (Read Only Memory) and RAM (Random Access Memory); a direct access storage device such as a hard disk 124; and an interface 128 for exchanging data with the microphone 132 or the speaker 134 and for communicating with the network 400. A CD-ROM 118, for example, is loaded into the disk drive 108, and the FD 116 is loaded into the FD drive 106.

The CD-ROM 118 may be replaced by any other medium capable of recording information such as programs to be installed on the computer main body, for example a DVD-ROM (Digital Versatile Disc) or a memory card, in which case the computer main body 102 is provided with a drive capable of reading that medium.

The main part of the speech processing apparatus of the present invention consists of computer hardware and software executed by the CPU 120. Such software is generally distributed on a storage medium such as the CD-ROM 118 or the FD 116, read from the medium by the CD-ROM drive 108 or the FD drive 106, and stored temporarily on the hard disk 124. Alternatively, if the apparatus is connected to the network 310, the software is first copied from a server on the network to the hard disk 124. It is then read from the hard disk 124 into the RAM in the memory 122 and executed by the CPU 120. When the apparatus is connected to a network, the software may also be loaded directly into the RAM and executed without being stored on the hard disk 124.

The computer hardware shown in FIGS. 1 and 2 and its operating principles are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD 116, the CD-ROM 118, or the hard disk 124.

However, some of the functions described below as software processing, such as frequency analysis, may instead be implemented in hardware.

FIG. 3 is a functional block diagram showing the configuration of the computer 100 functioning as the speech processing apparatus of the present invention.

As shown in FIG. 3, the CPU 120 contains, as functional blocks operating under a speech processing program, a frequency analysis unit 1202 that performs the frequency analysis described later and a clarification processing unit 1204 that performs speech clarification processing.

In addition, the hard disk 124, which is connected to the CPU 120 via the bus BS, stores a phoneme acoustic model database 1242 holding the phoneme acoustic models used to detect plosives and other phonemes, and a user attribute database 1244. The phoneme acoustic models are not limited to any particular type; hidden Markov models, for example, can be used. The user attribute database 1244 stores in advance, in association with the attributes of the user of the apparatus (the listener of the output speech), information on the phonemes that need to be enhanced. For example, it records that phonemes containing a plosive such as /b/ need to be enhanced for both native speakers of Japanese and native speakers of American English, whereas phonemes such as /r/, /l/, /s/, and /th/ need to be enhanced only when the user is a native speaker of Japanese. These user attributes are registered in the system 1000 by the user 2 (the person inputting the speech) before the apparatus is used.

In the following description, "phoneme enhancement" is performed by selectively amplifying the phoneme portion concerned. The same enhancement can, however, also be achieved by selectively attenuating the unnecessary portions other than that phoneme.

The interface 128 further includes: an image signal interface 1282 that converts image data, output to a video RAM (not shown) under the control of the CPU 120 and sent over the bus BS, into a corresponding image signal for output to the display 104; an audio converter 1284 that converts digital audio data, sent over the bus BS under the control of the CPU 120, into a corresponding audio signal for output to the speaker 134; and an analog-to-digital converter (hereinafter A/D converter) 1286 that converts the analog audio signal input from the microphone 132 into a corresponding digital audio signal. Although not shown in FIG. 3, the memory 122 is assumed to contain, for example, a storage area that serves as the video RAM mentioned above and a storage area that serves as an input/output buffer for audio signals.

FIG. 4 is a block diagram explaining in more detail the operation of the frequency analysis unit 1202 and the clarification processing unit 1204 described with reference to FIG. 3.

Referring to FIG. 4, when a speech signal is input from the microphone 132, the A/D converter 1286 digitizes the analog electrical signal by quantization.

Next, the frequency analysis unit 1202 performs frequency analysis using a transform algorithm such as the FFT (Fast Fourier Transform) or a wavelet transform, dividing the speech signal into time segments and analyzing the intensity of each frequency component it contains.
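The time-segmented frequency analysis can be illustrated with a minimal sketch. The code below is not the patented implementation: it uses a naive discrete Fourier transform in place of an FFT so that it needs only the standard library, and the function names, the Hann window, the frame length, and the hop size are all assumptions chosen for illustration.

```python
import cmath
import math


def frame_power_spectrum(frame):
    """Power at each non-negative frequency bin of one frame (naive DFT)."""
    n = len(frame)
    # A Hann window reduces leakage between neighbouring frequency bins.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectrogram(samples, frame_len=64, hop=32):
    """Split the signal into overlapping frames and analyse each one."""
    return [
        frame_power_spectrum(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
```

Each row of the returned list corresponds to one time segment and each column to one frequency band, which is the layout assumed by the later sketches in this description.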

The phoneme detection unit 1206 in the clarification processing unit 1204 then detects each phoneme in the analyzed frequency components. The phoneme detection unit 1206 thus detects all phonemes, not only those corresponding to plosives.

The processing of the phoneme detection unit 1206 is described more specifically below.

First, the phoneme detection unit 1206 checks each detected phoneme against the data in the user attribute database 1244. If the user's attributes indicate that the detected phoneme needs to be amplified, it has the amplification processing unit 1208 amplify the portion of the signal within the time range corresponding to that phoneme. If amplification is not needed, the signal is output from the clarification processing unit 1204 without amplification. Although the implementation is not limited to any particular form, in the configuration of FIG. 4 the non-amplified case can be handled either by having the phoneme detection unit 1206 pass that signal portion through unchanged or by having the amplification processing unit 1208 process it with a gain of 1.
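The lookup against the user attribute database can be sketched as a small table keyed by the listener's native language. The table contents follow the example given for database 1244 (plosives such as /b/ for both groups; /r/, /l/, /s/, /th/ only for Japanese listeners), but the names `ENHANCE_TABLE` and `gain_for`, the language codes, and the boost value of 2.0 are hypothetical.

```python
# Hypothetical stand-in for the user attribute database 1244:
# native language of the listener -> phonemes that should be enhanced.
ENHANCE_TABLE = {
    "ja": {"b", "r", "l", "s", "th"},
    "en-US": {"b"},
}


def gain_for(phoneme, listener_language, boost=2.0):
    """Return the gain for a detected phoneme; 1.0 means pass through."""
    targets = ENHANCE_TABLE.get(listener_language, set())
    return boost if phoneme in targets else 1.0
```

Returning a gain of 1.0 for phonemes that need no enhancement corresponds to the second option mentioned above, in which every portion goes through the amplifier and unamplified portions simply use unit gain.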

A plosive such as /b/ can be detected, for example, as follows.

i) The phoneme detection unit 1206 detects the closure (silence or an unaspirated sound) that should precede the burst. If no closure is found, steps ii) and iii) below are not performed.

ii) If a closure is found, the phoneme detection unit 1206 calculates the spectral envelope of the speech that follows the closure.

iii) The phoneme detection unit 1206 determines whether there is a vertical pulse, in which power at or above a certain level is present from a low frequency band through to a high frequency band, or a noise pulse. Because these components generally last 40 ms or less, a candidate is regarded as a plosive component only if it persists for no longer than that.
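Steps i) to iii) can be sketched as a scan over spectrogram frames. This is a simplified illustration under stated assumptions: each frame is a list of band powers, the per-band powers stand in for the spectral envelope of step ii), and the thresholds and the function name `detect_plosive_bursts` are hypothetical. Only the 40 ms limit comes from the description.

```python
def detect_plosive_bursts(frames, frame_ms, silence_thresh, band_thresh,
                          max_burst_ms=40.0):
    """Return (start, end) frame index ranges judged to be plosive bursts."""

    def is_closure(frame):
        # Step i): the closure is approximated as a near-silent frame.
        return sum(frame) < silence_thresh

    def is_vertical_pulse(frame):
        # Step iii): power at or above the threshold in every band,
        # from the lowest band to the highest.
        return all(power >= band_thresh for power in frame)

    bursts = []
    i = 1
    while i < len(frames):
        if is_vertical_pulse(frames[i]) and is_closure(frames[i - 1]):
            j = i
            while j < len(frames) and is_vertical_pulse(frames[j]):
                j += 1
            # Accept only bursts that last no longer than the limit.
            if (j - i) * frame_ms <= max_burst_ms:
                bursts.append((i, j))
            i = j
        else:
            i += 1
    return bursts
```

A broadband run that is not preceded by a closure, or that lasts longer than 40 ms, is skipped, which mirrors the two rejection conditions in the steps above.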

Other phonemes can be detected, for example, by using the acoustic model for each phoneme, built from speech uttered by humans and stored in the phoneme acoustic model database 1242, and determining by likelihood calculation whether a given phoneme, plosives included, has been uttered.

The amplification processing unit 1208 performs the amplification and sends the data to the signal selection unit 1210, which carries out the subsequent processing. The gain applied by the amplification processing unit 1208 is either a preset default value or a value determined from the sound pressure of previously input speech.
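The description does not say how the gain is derived from the sound pressure of earlier speech, so the following is only one plausible sketch. It assumes that the burst is raised toward a fixed fraction of the RMS level of recent speech and that the gain is clamped to a safe range; `choose_gain`, `target_ratio`, and `max_gain` are hypothetical names and values.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def choose_gain(burst, recent_speech, default_gain=2.0,
                target_ratio=0.5, max_gain=8.0):
    """Pick a gain for the burst, falling back to the preset default."""
    burst_level = rms(burst)
    reference_level = rms(recent_speech)
    if burst_level == 0.0 or reference_level == 0.0:
        # No usable history: use the preset default value.
        return default_gain
    gain = target_ratio * reference_level / burst_level
    # Never attenuate, and never amplify beyond the cap.
    return min(max(gain, 1.0), max_gain)
```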

The signal selection unit 1210 selectively combines the non-amplified data sent from the phoneme detection unit 1206 with the amplified data from the amplification processing unit 1208 and sends the result to the audio converter 1284.

The audio converter 1284 performs digital-to-analog conversion so that the speech can be reproduced from the speaker 134. If the digital audio data is instead to be transmitted onward via another communication system (mobile phone, television or radio broadcast, and so on), it is encoded as required and sent to the receiver.

FIG. 5 shows an example of a speech waveform input from the microphone 132.

FIG. 5 shows the waveform of the English word "LAB" spoken by an American whose native language is English.

FIG. 6(a) shows the result of frequency analysis of the waveform in FIG. 5.

That is, frequency analysis of the waveform in FIG. 5 yields a voiceprint pattern like that in FIG. 6(a). The faint vertical band at around 500 ms in FIG. 6(a) is the plosive component, called a "buzz bar". When its power is this weak (in the figure, power is indicated by the darkness of the shading), the sound may be perceived as "V" rather than "B".

FIG. 6(b) shows the voiceprint pattern of speech in which the plosive component has been detected and only that component has been amplified.

In FIG. 6(b), the gain is set to a moderate level matched to the preceding speech, and the amplification is applied with an envelope so that the amplified portion joins smoothly with the speech before and after it. That is, the gain is raised gradually as the plosive portion approaches and lowered gradually after reaching its maximum.
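The enveloped amplification can be sketched as a per-sample gain curve that ramps up before the plosive portion, holds at its maximum, and ramps back down to unity. The linear ramp shape, the function names, and the default values are assumptions; the description only states that the gain rises and falls gradually.

```python
def enveloped_gain(n, start, end, peak_gain, ramp):
    """Per-sample gains: 1.0 outside, peak_gain inside [start, end).

    `ramp` is the number of samples (must be > 0) over which the gain
    rises before `start` and falls after `end`.
    """
    gains = [1.0] * n
    for i in range(max(0, start - ramp), min(n, end + ramp)):
        if i < start:
            weight = (i - (start - ramp)) / ramp   # rising edge
        elif i < end:
            weight = 1.0                           # plosive portion
        else:
            weight = ((end + ramp) - i) / ramp     # falling edge
        gains[i] = 1.0 + (peak_gain - 1.0) * weight
    return gains


def enhance(samples, start, end, peak_gain=3.0, ramp=4):
    """Apply the enveloped gain to the samples and return the result."""
    gains = enveloped_gain(len(samples), start, end, peak_gain, ramp)
    return [x * g for x, g in zip(samples, gains)]
```

Because samples outside the ramps keep a gain of exactly 1.0, the output already contains both the untouched and the amplified portions, which is one simple way to realize the combining role of the signal selection unit 1210.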

In FIG. 6(b), because the "LA" portion is not amplified and the amplification is applied with an envelope, the overall volume does not rise and the speech does not sound unpleasantly loud. The plosive component, however, is strongly amplified, so the listener can now perceive "B" and hence perceive the word as "LAB".

The following describes experimental results showing the advantage of selectively amplifying waveform portions that carry a phoneme contrast, such as the plosives described above (hereinafter "phoneme contrast portions").

[Experimental results]
The acoustic differences and listening cues associated with each phoneme contrast generally differ between listeners with different native languages. Consequently, although /b/ and /v/ (hereinafter BV) and /s/ and /θ/ (hereinafter STH) are also difficult for native speakers of Japanese to perceive, the effect of added noise on them may differ from its effect on RL.

Experiments were therefore conducted with native speakers of Japanese (hereinafter JA) and native speakers of American English (hereinafter AE), in which noise of different types was added to spoken American English words contrasting in RL, BV, or STH and clarity was measured. In addition, because a preliminary experiment with native speakers of American English confirmed that, for some phonemes, perception is affected by the presentation sound pressure, this effect was examined as well.

(1. Experimental method)
(1.1 Stimuli)
The experiment used minimal pairs of English words contrasting in one of three phoneme contrasts: RL pairs (right-light, etc.), BV pairs (base-vase, etc.), and STH pairs (mouse-mouth, etc.). There were 50, 30, and 30 pairs for the respective contrasts (110 pairs, 220 words in total), and recordings of these words spoken by two native speakers of American English (one male, one female) served as the speech stimuli. The speech, recorded in an anechoic room, was saved word by word as PCM (Pulse Code Modulation) files at 44.1 kHz with 16-bit resolution.

For the stimuli used in the noise-added experiment, amplitudes were adjusted so that the across-word mean of the peak A-weighted sound pressure level of each word, as output through headphones, was 59 dB for the RL and STH contrasts and 65 dB for the BV contrast.

White noise and pink noise produced by a noise generator were adjusted in amplitude so that the peak A-weighted sound pressure level, as output through headphones, gave the SN ratio of each condition, and were then added to the speech used in the experiment. The noise overlaid on each word was 200 ms longer than the speech at both the beginning and the end.
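The noise mixing can be approximated in code. Note the substitution: the experiment set levels from peak A-weighted sound pressure measured at the headphones, whereas this sketch, for simplicity, defines the SN ratio as a ratio of mean digital signal power. It also generates only white noise. The function name, the seed, and the return shape are assumptions; the 200 ms padding follows the description.

```python
import math
import random


def mix_at_snr(speech, snr_db, sample_rate=44100, pad_ms=200, seed=0):
    """Overlay white noise on speech at the given power-based SN ratio.

    The noise starts pad_ms before the speech and ends pad_ms after it.
    Returns (mixed_signal, scaled_noise).
    """
    pad = int(sample_rate * pad_ms / 1000)
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(len(speech) + 2 * pad)]

    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    # Scale the noise so that speech_power / noise_power = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = [x * scale for x in noise]

    mixed = list(scaled_noise)
    for i, x in enumerate(speech):
        mixed[pad + i] += x
    return mixed, scaled_noise
```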

FIG. 7 shows the SN ratios used as experimental conditions.

As stimuli for measuring the effect of presentation sound pressure on clarity, amplitudes were adjusted in 5 dB steps so that the mean peak A-weighted sound pressure level of each word, as output through headphones, ranged from 39 dB to 69 dB for each phoneme contrast.
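The relation between a level change in decibels and the amplitude factor applied to the waveform is the standard one, factor = 10^(ΔdB / 20). The short sketch below enumerates the seven presentation levels and computes the factor relative to a reference level; taking 59 dB as the reference is an assumption made only for the example.

```python
# Presentation levels from 39 dB to 69 dB in 5 dB steps.
PRESENTATION_LEVELS_DB = list(range(39, 70, 5))


def amplitude_scale(current_db, target_db):
    """Factor by which to multiply the waveform to move between levels."""
    return 10 ** ((target_db - current_db) / 20)


# Scale factors relative to a stimulus currently presented at 59 dB
# (the 59 dB reference is illustrative, not taken from the experiment).
scales = {level: amplitude_scale(59, level) for level in PRESENTATION_LEVELS_DB}
```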

(1.2 Participants)
In the JA experiment, 11 university students who were native speakers of Japanese and had never stayed abroad for three months or longer took part. In the AE experiment, three native speakers of American English aged 23 to 43 took part. All participants were confirmed to have normal hearing.

(1.3 Procedure)
The experiment was carried out in a soundproof room over three days. The two English words of a minimal pair were presented visually on a computer screen while one of the two was presented binaurally through headphones. The participant judged which of the two words on the screen had been heard and selected it.

(Noise-added speech session)
This session was spread over two days, one for each type of added noise. Each day consisted of two sections, one per talker, and the order of talkers was fixed. Each section consisted of one block per phoneme contrast, containing speech at every SN ratio, and the blocks were presented in the order RL, BV, STH. Within each block all speech stimuli were presented in random order, and no feedback was given on whether responses were correct.

(Sound pressure variation session)
The sound pressure variation session was held after the noise-added speech session had been completed. Apart from the stimuli, its structure and method were the same as those of the noise-added speech session.

(2. Results)
(JA experiment)
FIG. 8 shows the results of the noise-added speech session in the JA experiment.

For every phoneme contrast, the percentage of correct responses tended to fall as the SN ratio decreased.

For each phoneme contrast, a two-way analysis of variance was performed with noise type and SN ratio as within-subject factors and the arcsine-transformed percentage of correct responses as the dependent variable. For the BV contrast, the -9 dB white-noise condition was excluded from the analysis. The main effect of SN ratio was significant for every phoneme contrast (RL: F(6, 60) = 24.950, p < 0.01; BV: F(7, 70) = 18.641, p < 0.01; STH: F(6, 60) = 32.152, p < 0.01), whereas neither the main effect of noise type nor the interaction was significant.
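Two details of this analysis can be checked with a short sketch. The description does not state which variant of the arcsine transformation was used, so the common form asin(sqrt(p)) is assumed here. The degrees-of-freedom helper uses the standard formula for the main effect of a within-subject factor; the reported values are consistent with 11 participants and with 7 SN-ratio levels for RL and STH and 8 for BV.

```python
import math


def arcsine_transform(proportion_correct):
    """Variance-stabilising transform for proportions (assumed variant)."""
    return math.asin(math.sqrt(proportion_correct))


def within_subject_df(levels, participants):
    """Degrees of freedom (effect, error) for a within-subject main effect."""
    effect_df = levels - 1
    error_df = (levels - 1) * (participants - 1)
    return effect_df, error_df
```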

Next, FIG. 9 is a diagram showing the results of the sound pressure fluctuation session in the JA experiment.

A two-factor ANOVA was performed with phoneme contrast and presentation sound pressure as within-subject factors and the arcsine-transformed correct answer rate as the dependent variable. As a result, the main effect of the presentation sound pressure factor was significant [F(6,60) = 10.503, p < 0.01]. Neither the main effect of the phoneme contrast factor nor the interaction was significant. However, when the correct answer rates at the 39 dB and 63 dB conditions were compared, the BV contrast showed a larger change in correct answer rate than the other phoneme contrasts.

(AE experiment)
FIG. 10 is a diagram showing the results of the noise-added speech session for each phoneme contrast in the AE experiment. For every phoneme contrast, the correct answer rate tended to decrease as the S/N ratio decreased.

Next, FIG. 11 is a diagram showing the results of the sound pressure fluctuation session in the AE experiment. For the RL and STH contrasts, almost no change was observed over the range of presentation sound pressures used in the experiment, whereas for the BV contrast the correct answer rate was shown to be readily affected by the presentation sound pressure.

Summarizing the above analysis results, for both native Japanese speakers and native American English speakers, the correct answer rate decreased as the S/N ratio decreased for all phoneme contrasts. Furthermore, the following relationships were found between the effect of noise addition and native language, phoneme pair, and presentation sound pressure.

(Native and non-native languages)
For native American English speakers, phoneme pairs other than BV have a range of S/N ratios in which perception is hardly affected by added noise, whereas for native Japanese speakers the correct answer rate tended to fall even with slight noise addition.

In addition, the effect of the noise type sometimes differed depending on the participant's native language (for example, a comparison between the AE -15 dB condition and the JA -9 dB condition for the RL contrast). This suggests that the acoustic features used for perception differed depending on the native language.

(Phoneme pairs)
The effect of noise differed depending on the phoneme pair. The RL contrast was relatively robust to the noise used in this experiment, the BV contrast was strongly affected even by slight noise addition, and for the STH contrast the correct answer rate decreased at a nearly constant rate. This indicates that the acoustic features used for discrimination differ between phoneme pairs, so that the same added noise affects them differently.

(Presentation sound pressure)
For both native Japanese speakers and native American English speakers, lowering the presentation sound pressure reduces the correct answer rate for the BV contrast, and perception is impaired. For RL and STH, however, the correct answer rate falls with lower presentation sound pressure only for native Japanese speakers.

FIGS. 1 to 6 were described using the plosive pair /b/ and /v/ as an example of a phoneme contrast. The above results show, however, that for any other phoneme that requires amplification because of a phoneme contrast, selectively extracting and amplifying the phoneme portion improves perception, at least for speakers with a certain native language.
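The selective extraction and amplification described here can be sketched as follows. This is only an illustration under stated assumptions: the function name `amplify_segments`, the representation of detected phoneme portions as `(start, end)` sample index pairs, and the gain expressed in dB are all hypothetical choices, not details given in the text.

```python
def amplify_segments(samples, segments, gain_db):
    """Return a copy of `samples` with each detected segment amplified.

    samples  : list of float sample values (the input speech signal)
    segments : list of (start, end) sample indices, end exclusive
               (hypothetical output of a phoneme detector)
    gain_db  : amplification applied inside the segments, in decibels
               (hypothetical parameter; it could depend on the listener)
    """
    gain = 10.0 ** (gain_db / 20.0)  # convert dB to a linear amplitude factor
    out = list(samples)              # samples outside segments pass through
    for start, end in segments:
        lo = max(start, 0)
        hi = min(end, len(out))
        for i in range(lo, hi):
            # Scale from the original sample so that overlapping
            # segments are not amplified twice.
            out[i] = samples[i] * gain
    return out


# Usage with made-up values: amplify samples 1 and 2 by 20 dB (factor 10).
emphasized = amplify_segments([1.0, 1.0, 1.0, 1.0], [(1, 3)], 20.0)
```

The returned signal combines the unmodified input with the amplified portions, which corresponds to outputting the input speech together with the selectively emphasized parts.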

Also, for example, the above experiments revealed that perception errors are likely to occur when the burst spectrum cannot be perceived properly. Furthermore, even within Japanese alone, sounds such as "pa", "ta", and "ka" are discriminated by the frequency characteristics of the burst spectrum, which require fine perception. Using the speech processing apparatus of the present invention is therefore expected to improve clarity not only in speech communication between English speakers and Japanese speakers but also between Japanese speakers.

The embodiment disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

FIG. 1 is a conceptual diagram showing an example of a communication system 1000 using the speech processing apparatus of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the computer 100.
FIG. 3 is a functional block diagram showing the configuration of the computer 100 functioning as the speech processing apparatus of the present invention.
FIG. 4 is a block diagram for explaining in more detail the operations of the frequency analysis unit 1202 and the clarification processing unit 1204 described with reference to FIG. 3.
FIG. 5 is a diagram showing an example of a speech waveform input from the microphone 132.
FIG. 6 is a diagram showing the result of frequency analysis of the waveform and the result of selective amplification.
FIG. 7 is a diagram showing the S/N ratios used as experimental conditions.
FIG. 8 is a diagram showing the results of the noise-added speech session in the JA experiment.
FIG. 9 is a diagram showing the results of the sound pressure fluctuation session in the JA experiment.
FIG. 10 is a diagram showing the results of the noise-added speech session for each phoneme contrast in the AE experiment.
FIG. 11 is a diagram showing the results of the sound pressure fluctuation session in the AE experiment.

Explanation of symbols

100 computer, 102 computer main body, 104 display, 106 FD drive, 108 disk drive, 110 keyboard, 112 mouse, 118 CD-ROM, 120 CPU, 122 memory, 124 hard disk, 128 interface, 132 microphone, 134 speaker, 1000 system, 1202 frequency analysis unit, 1204 clarification processing unit.

Claims

A speech processing apparatus comprising:
storage means for storing emphasis information on phonemes to be emphasized according to user attributes;
frequency analysis means for performing frequency analysis on an input speech signal;
phoneme detection means for detecting each phoneme portion based on the analysis result of the frequency analysis means;
emphasis processing means for selectively emphasizing the phoneme portion according to the detection result of the phoneme detection means and the emphasis information; and
output signal selection means for combining and outputting the input speech signal and the selectively emphasized portion.

The speech processing apparatus according to claim 1, wherein the phoneme to be emphasized is a plosive.

The speech processing apparatus according to claim 2, wherein the phoneme detection means detects, based on the analysis result of the frequency analysis means, a phoneme corresponding to a plosive according to the presence or absence of a vertical pulse, in which power of a certain level is present from a low frequency band to a high frequency band within a predetermined time.

The speech processing apparatus according to claim 1, further comprising phoneme acoustic model storage means for storing an acoustic model for each phoneme, wherein the phoneme is detected by likelihood calculation based on the acoustic model for each phoneme.
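As an illustration of the vertical-pulse criterion for plosive detection recited above, the following is a minimal sketch, not the implementation of the invention. It assumes a spectrogram given as a list of frames, each a list of band powers ordered from low to high frequency. The names `detect_vertical_pulses`, `power_threshold`, `min_band_fraction`, and `max_frames` are hypothetical, as are all numeric values.

```python
def detect_vertical_pulses(spectrogram, power_threshold,
                           min_band_fraction=0.9, max_frames=2):
    """Find short broadband bursts ("vertical pulses") in a spectrogram.

    spectrogram       : list of frames; frame[k] is the power in band k,
                        bands ordered from low to high frequency
    power_threshold   : minimum power for a band to count as active
    min_band_fraction : fraction of bands that must be active for a frame
                        to count as broadband (assumed parameter)
    max_frames        : longest run of broadband frames still treated as a
                        pulse, standing in for the "predetermined time"
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    broadband = []
    for frame in spectrogram:
        active = sum(1 for power in frame if power >= power_threshold)
        broadband.append(len(frame) > 0
                         and active >= min_band_fraction * len(frame))

    pulses = []
    t = 0
    while t < len(broadband):
        if broadband[t]:
            start = t
            while t < len(broadband) and broadband[t]:
                t += 1
            # Keep only short runs; long broadband stretches are not pulses.
            if t - start <= max_frames:
                pulses.append((start, t))
        else:
            t += 1
    return pulses


# Made-up spectrogram: frame 1 is a one-frame broadband burst, frame 2 is
# narrowband, and frames 3 to 5 are a broadband run too long to be a pulse.
example = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
pulses = detect_vertical_pulses(example, power_threshold=0.5,
                                min_band_fraction=1.0, max_frames=2)
```

Limiting the run length is one simple way to separate a brief burst from sustained broadband energy; a practical detector would need thresholds tuned to the actual frame rate and signal level.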