JP2005202335A

JP2005202335A - Method, device, and program for speech processing

Info

Publication number: JP2005202335A
Application number: JP2004011111A
Authority: JP
Inventors: Takayuki Arai; 隆行荒井; Nao Hodoshima; 奈緒程島; Takakimi Goto; 崇公後藤
Original assignee: Individual
Current assignee: Individual
Priority date: 2004-01-19
Filing date: 2004-01-19
Publication date: 2005-07-28

Abstract

PROBLEM TO BE SOLVED: To improve the intelligibility of speech radiated from a loudspeaker by processing the speech signal picked up by a microphone before it is output to the loudspeaker.

SOLUTION: The speech signal digitized by an A/D converter 11 is input to a windowing processing part 12 for frame division and then passed through an FFT 13. A logarithmic spectrum calculation part 14 calculates a logarithmic spectrum, which is subjected to an IFFT 15 to generate cepstrum coefficients. Regression coefficient calculation parts 16-1 to 16-n then calculate regression coefficients of the cepstrum coefficients viewed in the time direction, and a square averaging part 17 calculates the average of the squares of the regression coefficients (the D value). The D value is passed through a threshold processing part 18 to find the steady-state portions of the speech signal, and a multiplier 19 suppresses the amplitude of the speech signal in those portions. The result is output through a D/A converter 20.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech processing method, apparatus, and program that improve the intelligibility of speech amplified through loudspeakers in a room.

When a lecture or talk is given in a room such as a lecture hall, multipurpose hall, classroom, or church, the talker's speech is picked up by a microphone, processed electrically (for example, amplified), radiated into the room as sound from loudspeakers installed in the venue, and finally reaches the ears of the audience.

In such a situation, room reverberation usually lowers the intelligibility of the speech radiated from the loudspeakers. The harm is especially large for people with age-related hearing loss or other hearing impairments, for whom the speech becomes very hard to understand. Reverberation is also undesirable in spoken communication in a language other than the listener's native language. For example, if the same speech is played back in different reverberant environments during a language listening test, some examinees may be put at a disadvantage.

To address this problem, various attempts have been made to improve the intelligibility of the speech that is radiated from the loudspeaker and reaches the audience by applying specific preprocessing to the speech signal detected by the microphone before it is output to the loudspeaker. As one such attempt, the inventors proposed, in Takayuki Arai, Keisuke Kinoshita, Nao Hodoshima, Akiko Kusumoto, and Tomoko Kitamura, "Effect of steady-state suppression of speech against reverberation," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, vol. 1, pp. 449-450, October 2001 (Non-Patent Document 1), applying "steady-state suppression" to the input speech signal with the aim of reducing overlap-masking caused by reverberation, and confirmed that the loss of intelligibility due to reverberation can be avoided in certain noisy environments.

Overlap-masking is considered one of the factors by which reverberation lowers speech intelligibility. Overlap-masking is the effect in which the reverberation accompanying a preceding phoneme masks the following phoneme. The effect is thought to be especially large when the preceding phoneme has high energy and the following phoneme has low energy. One way to reduce overlap-masking is to remove suitable parts of the original speech, but removing them mechanically discards speech information and ends up lowering intelligibility instead.

Non-Patent Document 1 therefore removes only the steady-state portions of the speech signal. A steady-state portion is typically the center of a vowel (the syllable nucleus). Its energy is high, but it carries little information as speech. The transition portions of the speech signal, on the other hand, are known to play a very important role in the perception of speech information (for example, S. Furui, "On the role of spectral transition for speech perception," J. Acoust. Soc. Am., 80(4):1016-1025, 1986: Non-Patent Document 2). Non-Patent Document 2 reports listening experiments with stimuli in which the initial and final parts of syllables were deleted at various positions. The results showed that the transition portions play a very important role in speech perception and that the steady-state portion of a vowel is not necessary for recognizing the vowel or the syllable.

Among the steady-state portions of a speech signal, those of vowels generally have high energy, so the transition portions and low-energy consonants that follow them bear the full effect of overlap-masking. Steady-state suppression therefore reduces the amount of masking imposed on the transition portions while keeping the loss of speech information to a minimum.

Specifically, Non-Patent Document 1 performs the following signal processing. First, the speech signal is divided into 1/3-octave bands using a filter bank of, for example, FIR filters, and the temporal envelope is extracted in each band. Next, the temporal envelope of each band is downsampled to 100 Hz, and for each sample a regression coefficient is calculated on the logarithmic envelope trajectory over five points in total (the sample itself and the two samples before and after it). The mean square of the regression coefficients over all bands (hereinafter, the D value) is then obtained. Following Non-Patent Document 2, the D value is a parameter indicating the spectral transition of the speech signal. After the D value is restored to the original sampling frequency, portions where it is smaller than a certain threshold are regarded as steady-state portions, and the amplitude of the original waveform is suppressed there. Applying steady-state suppression to the speech signal in this way reduces the effect of overlap-masking due to reverberation and prevents the loss of speech intelligibility.

Non-Patent Document 1: Takayuki Arai, Keisuke Kinoshita, Nao Hodoshima, Akiko Kusumoto, and Tomoko Kitamura, "Effect of steady-state suppression of speech against reverberation," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, vol. 1, pp. 449-450, October 2001.
Non-Patent Document 2: S. Furui, "On the role of spectral transition for speech perception," J. Acoust. Soc. Am., 80(4):1016-1025, 1986.

The steady-state suppression disclosed in Non-Patent Document 1 is effective in reducing overlap-masking and avoiding the loss of intelligibility due to reverberation. However, it is not necessarily suited to real-time processing, mainly because the filter bank used for band division introduces a large processing delay. The original purpose is to improve intelligibility by preprocessing the speech signal when a talker's speech is picked up by a microphone and radiated from a loudspeaker. If the processing does not run in real time, the talker's mouth movements and gestures will not match the speech coming from the loudspeaker. Real-time operation of the steady-state suppression is therefore very important.

An object of the present invention is to provide a speech processing method, apparatus, and program that make it easy to perform, in real time, steady-state suppression for improving intelligibility on a speech signal detected by a microphone before the signal is output to a loudspeaker.

To solve the above problem, the present invention provides a speech processing method that processes an input speech signal before it is output to a loudspeaker, comprising the steps of: applying windowing to the speech signal to divide it into a plurality of frames; calculating a logarithmic spectrum for the speech signal of each frame; calculating cepstrum coefficients from the logarithmic spectrum; calculating regression coefficients of the cepstrum coefficients viewed in the time direction; obtaining the mean square of the regression coefficients; finding the steady-state portions of the speech signal by applying threshold processing to the mean square; and suppressing the amplitude of the speech signal in the steady-state portions.

The present invention also provides a speech processing apparatus that processes an input speech signal before it is output to a loudspeaker, comprising: a windowing processing unit that applies windowing to the speech signal to divide it into a plurality of frames; a Fourier transform unit that applies a Fourier transform to the speech signal of each frame produced by the windowing processing unit; a logarithmic spectrum calculation unit that calculates a logarithmic spectrum from the output signal of the Fourier transform unit; a cepstrum coefficient calculation unit that generates cepstrum coefficients by applying an inverse Fourier transform to the logarithmic spectrum; a regression coefficient calculation unit that calculates regression coefficients of the cepstrum coefficients viewed in the time direction; a mean square unit that obtains the mean square of the regression coefficients; a threshold processing unit that finds the steady-state portions of the speech signal by applying threshold processing to the mean square; and a suppression processing unit that suppresses the amplitude of the speech signal in the steady-state portions.

Furthermore, the present invention can provide a speech processing program that causes a computer to process an input speech signal before it is output to a loudspeaker, the program causing the computer to perform: a process of applying windowing to the speech signal to divide it into a plurality of frames; a process of calculating a logarithmic spectrum for the speech signal of each frame; a process of calculating cepstrum coefficients from the logarithmic spectrum; a process of calculating regression coefficients of the cepstrum coefficients viewed in the time direction; a process of obtaining the mean square of the regression coefficients; a process of finding the steady-state portions of the speech signal by applying threshold processing to the mean square; and a process of suppressing the amplitude of the speech signal in the steady-state portions.

By suppressing the steady-state portions of a speech signal detected by a microphone or the like, the intelligibility of the speech radiated from the loudspeaker can be improved effectively, including for hearing-impaired and elderly listeners, and real-time processing can be realized easily.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows an example of a public-address system to which a speech processing apparatus according to an embodiment of the present invention is applied. In a room 1 such as a lecture hall, multipurpose hall, classroom, or church, speech produced by a talker 2 giving a lecture or talk is detected by a microphone 3. The speech signal output from the microphone 3 as an electrical signal is amplified by a preamplifier 4 and then input to a speech processing apparatus 5 according to an embodiment of the present invention.

The speech processing apparatus 5 applies signal processing to the input speech signal to improve intelligibility. As described in detail later, it suppresses the amplitude of the steady-state portions of the speech signal to reduce the effect of overlap-masking due to reverberation. The speech signal processed by the speech processing apparatus 5 is amplified by a power amplifier 6 and supplied to a loudspeaker 7 installed in the room 1. It is radiated from the loudspeaker 7 as sound and finally reaches the ears of the audience 8.

FIG. 2 illustrates overlap-masking due to reverberation. The speech is the word "October" (speaker: EngM2, male) from the University of Tsukuba Multilingual Speech Corpus. FIG. 2(a) shows the original speech waveform. The bottom row of FIG. 2(a) is the sum of the speech waveforms in the upper five rows, which were obtained by segmenting the speech into /o/, /k/, /t/, /o/, /b/, and /er/. FIG. 2(b) shows the waveform obtained by convolving the waveform of FIG. 2(a) with an impulse response having a reverberation time of 1.1 seconds. It shows that low-energy consonants such as /k/, /t/, and /b/ are masked by the reverberation added to the immediately preceding vowel. In other words, when the preceding sound is a high-energy phoneme such as a vowel, the following phoneme is strongly affected by the reverberant tail.

The speech processing apparatus 5 therefore suppresses in advance the steady-state portions of the speech signal, which have relatively high energy but are considered not very important for recognizing the speech. This reduces the effect of overlap-masking due to reverberation and improves intelligibility. The speech processing apparatus 5 is described in detail below with reference to FIG. 3.

In FIG. 3, the speech signal amplified by the preamplifier 4 shown in FIG. 1 is input to an input terminal 10. This input speech signal is sampled by an A/D converter 11 at a sampling frequency of, for example, 16 kHz and converted into a digital signal of about 16 bits. The digitized speech signal output from the A/D converter 11 is first input to a windowing processing unit 12, where it is windowed with, for example, a 20 ms Hanning or Hamming window.

That is, so that the steady-state portions of vowels can be detected using the cepstrum coefficients described later, the windowing processing unit 12 cuts the digitized speech signal into a plurality of frames, each 20 ms long and overlapping its neighbors by, for example, 10 ms (50%), and then applies a Hanning or Hamming window of the same 20 ms width.
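The framing step can be sketched as follows. This is a minimal illustration using the example values from the text (16 kHz sampling, 20 ms frames, 10 ms overlap); the function name `frame_signal`, the use of a periodic Hanning window, and the test tone are assumptions introduced here, not details from the patent.

```python
import math

def frame_signal(samples, sample_rate=16000, frame_ms=20, hop_ms=10):
    """Split samples into overlapping frames and apply a Hanning window."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000       # 160 samples, i.e. 50% overlap
    # Periodic Hanning window (assumption); a Hamming window is applied the same way.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        chunk = samples[start:start + frame_len]
        frames.append([x * w for x, w in zip(chunk, window)])
    return frames

# One second of a 440 Hz tone as stand-in input (hypothetical test signal).
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(tone)
print(len(frames), len(frames[0]))
```

Trailing samples that do not fill a whole frame are dropped in this sketch; a streaming implementation would instead buffer them until the next hop arrives.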

The speech signal of each frame output from the windowing processing unit 12 is input to a fast Fourier transform (FFT) unit 13, where an FFT is applied. From the output signal of the FFT unit 13, a logarithmic spectrum calculation unit 14 calculates the logarithmic spectrum of the speech signal of each frame. Specifically, the logarithmic spectrum calculation unit 14 takes the absolute value of the FFT output and squares it to obtain the power spectrum, then computes 10*log10 of the result to convert it to dB (decibels), and outputs this as the logarithmic spectrum.
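As a rough sketch of this step, the function below computes 10*log10 of the squared magnitude of each transform bin. It uses a direct DFT from the standard library purely for readability, where the apparatus uses an FFT. The `floor` value that guards against log10(0) and the name `log_power_spectrum` are assumptions added here.

```python
import cmath
import math

def log_power_spectrum(frame, floor=1e-12):
    """Return 10*log10(|X[k]|^2) in dB for every DFT bin of one frame."""
    n = len(frame)
    spectrum_db = []
    for k in range(n):
        # Direct DFT for clarity; a real system would call an FFT routine.
        xk = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                 for t in range(n))
        power = abs(xk) ** 2                      # absolute value, then square
        # `floor` avoids log10(0); it is an assumption, not from the source.
        spectrum_db.append(10.0 * math.log10(max(power, floor)))
    return spectrum_db

# A cosine centred on bin 4 of a 32-point frame peaks at bins 4 and 28.
frame = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
spec = log_power_spectrum(frame)
peak_bin = max(range(16), key=lambda k: spec[k])
```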

Next, an inverse Fourier transform (IFFT) unit 15 applies an IFFT to the logarithmic spectrum calculated by the logarithmic spectrum calculation unit 14, which generates the cepstrum coefficients. Among the generated cepstrum coefficients, the low-order coefficients represent the spectral envelope of the speech signal. Liftering is therefore applied to the cepstrum coefficients so that only the coefficients representing the spectral envelope, for example those up to the 30th order, are kept and output. FIG. 4 shows the logarithmic spectrum 41 (solid line) of the original waveform of the speech signal input to the input terminal 10 and the spectral envelope 42 (dashed line) represented by the cepstrum coefficients up to the 30th order.
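The cepstrum and liftering steps might look like the sketch below, again with a direct DFT standing in for the FFT/IFFT pair. The text does not say whether the 0th coefficient is among the 30 that are kept. This sketch assumes orders 1 to 30 are kept and c0 (overall level) is dropped, and the helper names and test frame are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct (inverse) DFT; stands in for the FFT/IFFT units."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def cepstrum(frame, n_keep=30, floor=1e-12):
    """Log power spectrum -> inverse DFT -> keep low-order coefficients."""
    log_spec = [10.0 * math.log10(max(abs(v) ** 2, floor)) for v in dft(frame)]
    ceps = [v.real for v in dft(log_spec, inverse=True)]
    # Liftering: keep orders 1..n_keep (dropping c0 is an assumption).
    return ceps[1:n_keep + 1]

# Hypothetical 64-sample frame with a broad spectrum.
frame = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) + 0.1 * ((t * 7) % 5)
         for t in range(64)]
c_orig = cepstrum(frame)
c_loud = cepstrum([2.0 * x for x in frame])   # same frame, 6 dB louder
```

With c0 excluded, scaling the input changes none of the retained coefficients, so `c_orig` and `c_loud` agree to within rounding error. That property makes the later transition measure independent of playback level, which is one reason for choosing this convention here.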

Next, the liftered cepstrum coefficients generated by the inverse Fourier transform unit 15, for example those up to the 30th order, are input to regression coefficient calculation units 16-1 to 16-n (n = 30 in this case). For the time trajectory of each cepstrum coefficient, a regression coefficient is calculated for each sample by the least-squares method over, for example, five points in total (the sample itself and the two points before and after it). As another example, the regression coefficient may be calculated for each sample over seven points in total (three points before and after).

In FIG. 5, the solid line shows the cepstrum coefficient at five points of the time trajectory and the dashed line shows the regression line. The slope of the regression line is the regression coefficient (delta coefficient). Because cepstrum coefficients up to the 30th order are used in this case, 30 delta coefficients are obtained per frame.
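For equally spaced frames, the least-squares slope over a centered window has a simple closed form, which the sketch below uses. The function name `delta_coefficients` and the handling of the first and last frames (edge values are repeated) are assumptions made for illustration; the text does not describe edge handling.

```python
def delta_coefficients(track, half_width=2):
    """Least-squares slope of `track` over 2*half_width + 1 points per frame.

    `track` is the time trajectory of one cepstrum coefficient, one value
    per frame. half_width=2 gives the five-point case, 3 the seven-point case.
    """
    offsets = range(-half_width, half_width + 1)
    denom = sum(k * k for k in offsets)           # 10 for five points
    last = len(track) - 1
    deltas = []
    for t in range(len(track)):
        num = 0.0
        for k in offsets:
            idx = min(max(t + k, 0), last)        # repeat edge values (assumption)
            num += k * track[idx]
        deltas.append(num / denom)
    return deltas

# A straight-line trajectory with slope 3 per frame (hypothetical data).
ramp = [3.0 * t + 1.0 for t in range(10)]
slopes = delta_coefficients(ramp)
flat = delta_coefficients([5.0] * 6)
```

Away from the edges, `slopes` recovers the slope of the ramp, and a constant trajectory gives zero everywhere. A centered window needs frames after the current one, so a real-time system has to wait `half_width` hops before it can emit a slope.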

Next, a mean square unit 17 calculates the mean square of the 30 delta coefficients, that is, the regression coefficients calculated by the regression coefficient calculation units 16-1 to 16-n, and this value is taken as the representative D value of the frame. The D value is a parameter, defined in accordance with Non-Patent Document 2, that indicates the spectral transition of the speech signal. One D value is obtained for each frame.
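Written as formulas, and assuming equally spaced frames with K frames on each side of the current one (K = 2 in the five-point case), the delta coefficient and the D value can be expressed as below. The symbols c_i(t) for the i-th cepstrum coefficient at frame t, a_i(t) for its regression coefficient, and K are notation introduced here for illustration.

```latex
a_i(t) = \frac{\sum_{k=-K}^{K} k \, c_i(t+k)}{\sum_{k=-K}^{K} k^{2}},
\qquad
D(t) = \frac{1}{n} \sum_{i=1}^{n} a_i(t)^{2}
```

With n = 30 and K = 2, the denominator of a_i(t) is 10 and D(t) averages 30 squared slopes per frame.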

FIG. 6 shows an example of the original waveform 61 (filled area) of a vowel part of the speech signal and the D value 62 (line), which is the mean square obtained by the mean square unit 17. Places where the D value is small correspond to the steady-state portion of the vowel. The D value is therefore input to a threshold processing unit 18 and compared with a predetermined threshold, and places where the D value is smaller than the threshold are taken as the steady-state portion of the vowel. The output of the threshold processing unit 18 is a steady-state detection signal, a two-valued signal that takes, for example, the value α (0 ≤ α < 1) in the steady-state portion of a vowel and 1 elsewhere. In this example α = 0.4, but any value satisfying 0 ≤ α < 1 may be used. The steady-state detection signal is input to a multiplier 19 and multiplied by the digitized speech signal output from the A/D converter 11, which suppresses the amplitude of the speech signal in the steady-state portions.
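The threshold and multiplier stages can be sketched as follows. The text gives one D value per frame but does not say how the frame-rate detection signal is aligned with individual samples. Holding each frame's gain over one hop of samples, as `apply_gains` does here, is an assumption, as are both function names and the example numbers.

```python
def steady_state_gains(d_values, threshold, alpha=0.4):
    """Two-valued detection signal: alpha where D < threshold, else 1."""
    return [alpha if d < threshold else 1.0 for d in d_values]

def apply_gains(samples, gains, hop_len=160):
    """Multiply each sample by the gain of the frame covering its hop interval.

    Holding a frame's gain for hop_len samples is an assumption; the source
    only states that the detection signal is multiplied with the speech signal.
    """
    out = []
    for n, x in enumerate(samples):
        frame = min(n // hop_len, len(gains) - 1)
        out.append(x * gains[frame])
    return out

# Hypothetical D values for four frames and a tiny hop for readability.
gains = steady_state_gains([5.0, 0.1, 0.2, 4.0], threshold=1.0)
suppressed = apply_gains([1.0] * 8, gains, hop_len=2)
```

Switching the gain abruptly between 1 and α can produce audible discontinuities, so an implementation might smooth the gain across frame boundaries; the source does not discuss this.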

FIG. 7 shows the original waveform 71 of the speech signal (the lightly and darkly shaded areas together) and the waveform 72 after the steady-state portions have been suppressed (the darkly shaded area). The speech signal from the multiplier 19, after steady-state suppression, is output from an output terminal 21. The speech signal output from the output terminal 21 is input to, for example, the power amplifier 6 of FIG. 1 and radiated as sound from the loudspeaker 7.

The speech processing apparatus of this embodiment can thus suppress the amplitude of the steady-state portions of the input speech signal. Supplying the processed speech signal through the power amplifier 6 to the loudspeaker 7 installed in the room 1, as shown in FIG. 1, therefore produces highly intelligible speech.

In addition, the speech processing apparatus of this embodiment performs the intelligibility improvement, which reduces the effect of overlap-masking due to reverberation, frame by frame on the input speech signal. Compared with Non-Patent Document 1, which performs similar processing after dividing the speech signal into bands with a filter bank, the processing delay is very short and real-time processing becomes easy.

In the speech processing apparatus shown in FIG. 3, the processing from the output of the A/D converter 11 to the D/A converter 20 can be implemented in software on a DSP (Digital Signal Processor) or a general-purpose CPU (Central Processing Unit). The speech processing apparatus shown in FIG. 3 can also be implemented with dedicated hardware.

Next, the results of listening experiments conducted to confirm the effect of the embodiment of the present invention are described. The results of a listening experiment in a laboratory environment are described first.

The reverberant environment was created by convolving the speech signal with a reverberant impulse response on a computer. The impulse responses used were based on an impulse response measured in the large hall of Higashiyamato City (without reflectors) and were processed artificially to vary the reverberation time over the range from 0.4 to 1.3 seconds.
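Simulating reverberation by convolution can be sketched as below. The experiment used impulse responses derived from a hall measurement, which are not reproduced here. As a stand-in, this sketch assumes a synthetic impulse response made of exponentially decaying Gaussian noise whose amplitude falls by 60 dB over the requested reverberation time. All names and parameter values are hypothetical.

```python
import math
import random

def synthetic_impulse_response(rt60_s, sample_rate=8000, seed=0):
    """Decaying-noise stand-in for a measured room impulse response."""
    rng = random.Random(seed)
    n = round(rt60_s * sample_rate)
    decay = 3.0 * math.log(10.0) / rt60_s    # amplitude drops 60 dB in rt60_s
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * t / sample_rate)
            for t in range(n)]

def convolve(signal, ir):
    """Direct-form convolution; output length is len(signal) + len(ir) - 1."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

# Tiny example: a unit impulse through a 10-tap response returns the response.
ir = synthetic_impulse_response(0.01, sample_rate=1000)
wet = convolve([1.0, 0.0, 0.0], ir)
```

Direct-form convolution is quadratic in length and is shown only for clarity. For second-long responses at audio rates an FFT-based convolution would normally be used.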

The targets were Japanese CV (consonant-vowel) monosyllables, each inserted into the Japanese carrier sentence 「題目としては＿といいます」. The vowels V were /a/ and /i/, and the consonants C were the 14 consonants /p/, /t/, /k/, /b/, /d/, /g/, /s/, /ʃ/, /h/, /tʃ/, /dz/, /dʒ/, /m/, and /n/. In the end, 24 CVs were used in the experiment. The stimuli were taken from the ATR Japanese speech database for research (speaker: MAU, 40-year-old male). Two sets of stimuli were prepared: one in which reverberation was convolved with the original speech signal (unprocessed), and one in which reverberation was convolved after the steady-state suppression of the embodiment had been applied (processed). The subjects were 44 normal-hearing native speakers of Japanese (22 for the set with short reverberation times and 22 for the set with long reverberation times).

Instructions for the experiment were given on a computer screen in a soundproof room. The stimuli were presented over headphones (STAX SR-303) at a sound pressure level adjusted to suit each subject. In each trial, the stimulus was presented only once. After the presentation ended, the 24 CVs used in the experiment were displayed on the screen in kana as the response choices. The subject was required to answer by clicking one of the choices with the mouse. Once a choice was made, the next stimulus was presented automatically. Each subject was presented with a total of 240 stimuli (5 reverberation conditions × 24 monosyllables × 2 processing conditions) in random order.

As the results of the monosyllable intelligibility test conducted in the laboratory environment under the above conditions, Table 1 (set with short reverberation times) and Table 2 (set with long reverberation times) show the mean percentage of correctly identified consonants for each reverberation condition and processing condition.

The percentage of correctly identified vowels was 100% in every condition. The main effect of processing was significant in both sets (p < .001). According to t-tests between the processing conditions, the processed stimuli had a significantly higher percentage correct at reverberation times of 0.8, 0.9, and 1.0 seconds in Table 1, and at reverberation times of 0.9, 1.0, 1.1, and 1.2 seconds in Table 2.

These results show that the processed stimuli had a higher percentage correct in every reverberation condition, and the effect of the processing was confirmed for reverberation times of 0.8 to 1.2 seconds.

Next, the results of an experiment in a university auditorium are described. The experiment was conducted to check whether the steady-state suppression, which proved effective in the laboratory environment described above, is also effective in a real reverberant environment. It consisted of a monosyllable intelligibility test and a sentence dictation test.

The monosyllable intelligibility test used those of the stimuli described above whose vowel was /a/ (14 monosyllables, with the carrier sentence, processed and unprocessed). The sentence intelligibility test used 20 sentences taken from the NTT-AT set of 1000 phonetically balanced sentences. The subjects were 24 normal-hearing native speakers of Japanese.

The experiment was conducted in the auditorium of Building No. 10, which has the largest seating capacity (822) on the Sophia University campus. A loudspeaker was placed on the stage, and stimuli prepared in advance were played back from a PC. The subjects were seated in the rear block facing the front of the auditorium. After the subjects had been given instructions, the output level was adjusted using test sentences so that all subjects could hear without difficulty.

In the monosyllable intelligibility test, 28 stimuli (each of the 14 monosyllables with and without processing) were presented twice each, for a total of 56 stimuli, in random order. In each trial the stimulus was presented only once, and subjects made a forced choice of one answer from the list of 14 monosyllables and wrote it on the answer sheet. The interval before the next stimulus was 5 s.

In the sentence intelligibility test, the 24 subjects were divided into group A (13 subjects) and group B (11 subjects), and the experiment was run separately for each group. Each group was presented with 20 distinct sentences, 10 with processing and 10 without, in random order. The conditions were counterbalanced: the 10 sentences presented with processing to group A were presented without processing to group B, and conversely the 10 sentences presented without processing to group A were presented with processing to group B. In each trial the stimulus was presented twice, 20 s apart, and subjects wrote their responses in kana on the answer sheet.
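The counterbalanced assignment described above can be sketched as follows. This is only an illustration: the function name, the sentence identifiers, the condition labels, the choice of which 10 sentences form each half, and the random seed are all assumptions, not details given in the source.

```python
import random

def counterbalanced_lists(sentences, seed=0):
    """Build presentation lists for groups A and B with swapped conditions.

    Assumption: the first half of `sentences` is processed for group A and
    the second half for group B; the source does not say how the 20
    sentences were split.
    """
    rng = random.Random(seed)
    half = len(sentences) // 2
    first, second = sentences[:half], sentences[half:]
    group_a = [(s, "processed") for s in first] + [(s, "unprocessed") for s in second]
    group_b = [(s, "unprocessed") for s in first] + [(s, "processed") for s in second]
    # Each group hears its 20 sentences in its own random order.
    rng.shuffle(group_a)
    rng.shuffle(group_b)
    return group_a, group_b

# Hypothetical sentence identifiers standing in for the 20 test sentences.
list_a, list_b = counterbalanced_lists([f"sentence_{i:02d}" for i in range(20)])
```

With this arrangement every sentence is heard in both conditions across the two groups, so differences between sentences do not bias the comparison between conditions.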

In the monosyllable intelligibility test, a comparison of consonant correct rates showed a higher rate with processing (69.3%) than without processing (62.7%). In the sentence intelligibility test, the transcribed sentences were compared between the two conditions. The per-mora correct rate was high, at 95% or more, both with and without processing, and almost no difference was observed.

The monosyllable intelligibility test used the same stimuli as in the diotic listening case in the laboratory environment, and the effect was confirmed in the binaural (dichotic) environment as well. In sentence dictation, context information is available, so some difficulty in hearing poses no problem, particularly for normal-hearing listeners. The stimulus sentences used here were relatively simple and were spoken slowly and clearly by a trained announcer; in addition, the reverberation time was not very long and the direct-sound energy was strong. These factors are considered to be why intelligibility was high to begin with. However, in a poorer reverberant environment, when low-familiarity words are present, or when the speech is fast or unclear as in spontaneous speech, the effect of the processing according to the embodiment of the present invention is expected to appear markedly. This should be all the more true for elderly and hearing-impaired listeners.

Conceptual diagram of a public-address system using an audio processing device according to an embodiment of the present invention
Diagram showing an example of overlap masking caused by reverberation
Block diagram showing the configuration of an audio processing device according to an embodiment of the present invention
Diagram showing the logarithmic spectrum and spectral envelope of an original speech waveform
Diagram showing an example of regression coefficient calculation
Diagram showing an example of an original speech waveform and the mean square of the regression coefficients (D value)
Diagram showing an example of an original speech waveform and the speech waveform with the stationary part suppressed

Explanation of symbols

10 ... Input terminal
11 ... A/D converter
12 ... Windowing processing unit
13 ... Fast Fourier transformer
14 ... Logarithmic spectrum calculation unit
15 ... Inverse fast Fourier transformer
16-1 to 16-n ... Regression coefficient calculation units
17 ... Mean square calculation unit
18 ... Threshold processing unit
19 ... Multiplier
20 ... D/A converter
21 ... Output terminal

Claims

An audio processing method for processing an input audio signal before it is output to a speaker, the method comprising:
performing windowing on the audio signal to divide the audio signal into a plurality of frames;
calculating a logarithmic spectrum for the audio signal of each divided frame;
calculating cepstrum coefficients from the logarithmic spectrum;
calculating regression coefficients of the cepstrum coefficients viewed in the time direction;
obtaining a mean square of the regression coefficients;
obtaining a stationary part of the audio signal by performing threshold processing on the mean square; and
suppressing the amplitude of the audio signal in the stationary part.

An audio processing device that processes an input audio signal before it is output to a speaker, the device comprising:
a windowing processing unit that performs windowing on the audio signal and divides the audio signal into a plurality of frames;
a Fourier transform unit that performs a Fourier transform on the audio signal of each frame divided by the windowing processing unit;
a logarithmic spectrum calculation unit that calculates a logarithmic spectrum based on an output signal from the Fourier transform unit;
a cepstrum coefficient calculation unit that generates cepstrum coefficients by performing an inverse Fourier transform on the logarithmic spectrum;
a regression coefficient calculation unit that calculates regression coefficients of the cepstrum coefficients viewed in the time direction;
a mean square calculation unit that obtains a mean square of the regression coefficients;
a threshold processing unit that obtains a stationary part of the audio signal by performing threshold processing on the mean square; and
a suppression processing unit that suppresses the amplitude of the audio signal in the stationary part.

An audio processing program for causing a computer to process an input audio signal before it is output to a speaker, the program causing the computer to perform:
a process of performing windowing on the audio signal to divide the audio signal into a plurality of frames;
a process of calculating a logarithmic spectrum for the audio signal of each divided frame;
a process of calculating cepstrum coefficients from the logarithmic spectrum;
a process of calculating regression coefficients of the cepstrum coefficients viewed in the time direction;
a process of obtaining a mean square of the regression coefficients;
a process of obtaining a stationary part of the audio signal by performing threshold processing on the mean square; and
a process of suppressing the amplitude of the audio signal in the stationary part.
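
For illustration only, the sequence of steps recited in the claims (windowing, logarithmic spectrum, cepstrum, time-direction regression, mean square, thresholding, amplitude suppression) can be sketched in Python with NumPy. This is a minimal sketch and not the implementation of the embodiment. The function names are hypothetical. The frame length, hop size, number of cepstrum coefficients, regression window, the median-based default threshold, the gain of 0.4, and the interpolation of the gain between frames are all assumptions made for this example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames (one per row) and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def cepstra(frames, n_ceps):
    """Log-magnitude spectrum of each frame, then inverse FFT to get cepstrum coefficients."""
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    return np.fft.irfft(log_spec, axis=1)[:, :n_ceps]

def d_values(ceps, half_width):
    """Mean square, over cepstral index, of the least-squares slope along time."""
    k = np.arange(-half_width, half_width + 1)
    n_frames = ceps.shape[0]
    # Repeat the edge frames so every frame has a full regression window.
    padded = np.pad(ceps, ((half_width, half_width), (0, 0)), mode="edge")
    # Slope of a line fitted to c(t-M..t+M): sum(k * c(t+k)) / sum(k^2).
    slopes = sum(
        kk * padded[half_width + kk : half_width + kk + n_frames] for kk in k
    ) / np.sum(k ** 2)
    return np.mean(slopes ** 2, axis=1)

def suppress_steady_state(x, frame_len=512, hop=128, n_ceps=16,
                          half_width=2, threshold=None, gain=0.4):
    """Attenuate frames whose D value falls below the threshold."""
    d = d_values(cepstra(frame_signal(x, frame_len, hop), n_ceps), half_width)
    if threshold is None:
        threshold = np.median(d)  # assumed placeholder, not from the source
    frame_gain = np.where(d < threshold, gain, 1.0)
    # Assumption: interpolate the per-frame gain to a per-sample gain so the
    # multiplier does not introduce abrupt amplitude steps.
    centers = frame_len // 2 + hop * np.arange(len(d))
    sample_gain = np.interp(np.arange(len(x)), centers, frame_gain)
    return x * sample_gain, d
```

A small D value means the spectral envelope is changing little over time, so those frames are treated as the stationary part and attenuated, while frames with a large D value (transitions) pass through unchanged.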