JP2012163682A - Voice processor and voice processing method

Info

Publication number: JP2012163682A
Application number: JP2011022715A
Authority: JP
Inventors: Osamu Shimada; 修嶋田; Tomoshi Hosokawa; 知志細川; Atsushi Kuroda; 淳黒田; Yuichiro Kishinami; 雄一郎岸波; Toshiyuki Ueno; 寿之上野; Tsugio Endo; 次夫遠藤
Original assignee: NEC Casio Mobile Communications Ltd
Current assignee: NEC Casio Mobile Communications Ltd
Priority date: 2011-02-04
Filing date: 2011-02-04
Publication date: 2012-08-30

Abstract

PROBLEM TO BE SOLVED: To provide a voice processor and a voice processing method capable of enhancing the intelligibility of received voice even while the user is speaking.

SOLUTION: A voice signal input from a microphone is divided into a plurality of voice signals of predetermined frequency bands, the voice signal of each frequency band is weighted according to its signal-to-noise ratio, and the amount of background noise contained in the voice signal of each frequency band is estimated. A received voice signal is likewise divided into a plurality of voice signals of predetermined frequency bands. Based on the estimated amount of background noise in each frequency band, a gain is set for the voice signal of each band such that the signal becomes larger as the amount of background noise increases, and the received voice signal is corrected by multiplying the voice signal of each frequency band by that gain.

Description

The present invention relates to a voice processing apparatus and method for correcting a voice signal according to background noise.

Mobile phones, for example, are often used outdoors, so the received voice frequently becomes hard to hear because of background noise such as surrounding traffic noise and advertising noise. This happens because the background noise masks some frequency components of the received voice, which lowers its intelligibility. Various techniques have therefore been proposed for correcting the voice signal (received voice) according to the background noise so that the received voice remains easy to hear in noisy conditions.

For example, Patent Document 1 describes a technique that improves the intelligibility of the received voice by shaping its spectrum into a target spectrum (for example, a spectrum in which frequency components with relatively low signal levels, such as consonants, are boosted). To prevent the user's own transmitted voice from being judged to be background noise, which would cause the received voice to be amplified excessively, Patent Document 1 also describes analyzing the spectrum of the voice signal input from the microphone to determine whether that signal is the user's own transmitted voice. When it is, the three types of processing shown in (1) to (3) below are performed.
(1) During the period judged to be the user's own transmitted voice, the filter coefficient of the filter unit that shapes the spectrum of the received voice into the target spectrum is set to an initial value (for example, a value that outputs the received voice unchanged) (hereinafter referred to as first background processing).
(2) During the period judged to be the user's own transmitted voice, the filter coefficient is limited to a preset maximum value or less (hereinafter referred to as second background processing).
(3) During the period judged to be the user's own transmitted voice, updating of the filter coefficient is stopped. That is, the filter coefficient from immediately before the signal was judged to be the user's own transmitted voice is used (hereinafter referred to as third background processing).

Patent Document 1: JP 2004-061617 A

The technique described in Patent Document 1 has the problem that the intelligibility of the received voice cannot be improved while the user is speaking.

For example, in the first background processing, the filter coefficient is fixed (for example, at its initial value) while the user is speaking, so the received voice is not corrected. The intelligibility of the received voice therefore cannot be improved while the user is speaking.

In the second background processing, the received voice is corrected even while the user is speaking, but because the filter coefficient is limited to the maximum value or less, the received voice cannot always be shaped into the target spectrum. The intelligibility of the received voice therefore may not improve while the user is speaking.

In the third background processing, the state from immediately before the signal was judged to be the user's own transmitted voice (the value of the filter coefficient) is maintained, so the correction of the received voice cannot follow changes in the background noise while the user is speaking. As with the second background processing, the intelligibility of the received voice therefore may not improve while the user is speaking.

The present invention has been made to solve the problems of the background art described above, and its object is to provide a voice processing apparatus and method that can improve the intelligibility of the received voice even while the user is speaking.

To achieve the above object, a voice processing apparatus of the present invention has:
a first frequency analysis unit that divides a voice signal input from a microphone into a plurality of first voice signals of predetermined frequency bands;
a background noise estimation unit that weights the first voice signal of each frequency band divided by the first frequency analysis unit according to its signal-to-noise ratio, and estimates the background noise amount of each frequency band contained in the voice signal;
a second frequency analysis unit that divides a received voice signal into a plurality of second voice signals, one for each predetermined frequency band;
a characteristic correction unit that, based on the background noise amount of each frequency band estimated by the background noise estimation unit, sets a gain to be applied to the second voice signal of each frequency band such that the second voice signal output from the second frequency analysis unit becomes larger as the background noise amount becomes larger, and corrects the received voice signal by multiplying the second voice signal of the corresponding frequency band by the gain; and
a received voice correction unit including a frequency synthesis unit that frequency-synthesizes the corrected second voice signals of the respective frequency bands output from the characteristic correction unit and reproduces the corrected received voice signal.

A voice processing method of the present invention, in turn, is a method that:
divides a voice signal input from a microphone into a plurality of first voice signals of predetermined frequency bands, weights the first voice signal of each frequency band according to its signal-to-noise ratio, and estimates the background noise amount of each of the divided frequency bands contained in the voice signal;
divides a received voice signal into a plurality of second voice signals, one for each predetermined frequency band, and, based on the estimated background noise amount of each frequency band, sets a gain to be applied to the second voice signal of each frequency band such that the corresponding second voice signal becomes larger as the background noise amount becomes larger;
corrects the received voice signal by multiplying the second voice signal of the corresponding frequency band by the gain; and
frequency-synthesizes the corrected second voice signals of the respective frequency bands and reproduces the corrected received voice signal.

According to the present invention, the intelligibility of the received voice can be improved even while the user is speaking.

FIG. 1 is a block diagram showing a configuration example of the voice processing apparatus of the first embodiment.
FIG. 2 is a block diagram showing a configuration example of the background noise estimation unit shown in FIG. 1.
FIG. 3 is a block diagram showing a configuration example of the characteristic correction unit of the first embodiment.
FIG. 4 is a block diagram showing a configuration example of the voice processing apparatus of the second embodiment.
FIG. 5 is a block diagram showing a configuration example of the characteristic correction unit of the second embodiment.
FIG. 6 is a schematic diagram showing a processing example of the gain limiter unit shown in FIG. 5.
FIG. 7 is a block diagram showing a configuration example of the characteristic correction unit of the third embodiment.

Next, the present invention will be described with reference to the drawings.
(First Embodiment)
FIG. 1 is a block diagram showing a configuration example of the voice processing apparatus of the first embodiment.

As shown in FIG. 1, the voice processing apparatus of the first embodiment has a transmitted voice analysis unit 1 and a received voice correction unit 2.

The transmitted voice analysis unit 1 includes a first frequency analysis unit 10 and a background noise estimation unit 11. The received voice correction unit 2 includes a second frequency analysis unit 12, a characteristic correction unit 13, and a frequency synthesis unit 14.

The voice processing apparatus shown in FIG. 1 can be realized with a well-known signal processing circuit that includes, for example, an A/D converter that converts the voice signal (analog signal) input from the microphone into a digital signal, a D/A converter that converts the voice signal (digital signal) output from the received voice correction unit 2 into an analog signal, a CPU that processes the voice signal according to a program, a DSP that performs arithmetic processing, a memory that stores the program and the various data needed for processing, and various logic circuits.

The first frequency analysis unit 10 divides the voice signal input from the microphone into voice signals of a plurality of frequency bands (first voice signals) by processing such as a DFT (Discrete Fourier Transform). The first frequency analysis unit 10 may instead be realized with band-division filters such as IIR (Infinite Impulse Response) filters. The first frequency analysis unit 10 may divide the voice signal input from the microphone into bands of constant width, or, taking human auditory characteristics into account, into bands whose width increases with frequency.
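As a rough illustration of the band division described above, the following sketch computes the DFT power spectrum of one frame and sums it into bands whose width grows with frequency. The function names, the 32-sample frame, and the band edges are assumptions chosen for illustration; the specification does not fix any of them.

```python
import cmath
import math


def dft_power(frame):
    """Return the power |X[k]|^2 for bins 0..N/2 of a real-valued frame (naive DFT)."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power


def split_into_bands(power, edges):
    """Sum bin powers into bands; band b covers bins edges[b] .. edges[b + 1] - 1."""
    return [sum(power[lo:hi]) for lo, hi in zip(edges, edges[1:])]


# Hypothetical band edges (in DFT bins): bands get wider at higher frequencies.
EDGES = [0, 1, 2, 4, 8, 17]

# A pure tone centered on bin 5 of a 32-sample frame falls into band 3 (bins 4..7).
frame = [math.sin(2 * math.pi * 5 * i / 32) for i in range(32)]
bands = split_into_bands(dft_power(frame), EDGES)
```

A practical implementation would use an FFT and a window function; the naive DFT is used here only to keep the sketch dependency-free.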

The background noise estimation unit 11 estimates the background noise amount contained in the voice signal input from the microphone for each of the frequency bands divided by the first frequency analysis unit 10. If the voice signal contains the signal of the user's own transmitted voice (hereinafter referred to as the transmitted voice signal), the unit calculates the remaining background noise amount with the transmitted voice signal removed as far as possible.

FIG. 2 is a block diagram showing a configuration example of the background noise estimation unit shown in FIG. 1.

As shown in FIG. 2, the background noise estimation unit 11 includes a weighted voice calculation unit 100 and an estimated noise calculation unit 101.

The weighted voice calculation unit 100 calculates the SNR (signal-to-noise ratio) of each frequency band from the per-band voice signal input from the microphone, which contains background noise and the user's own transmitted voice, and the per-band background noise amount estimated by the estimated noise calculation unit 101. It then weights the per-band voice signal input from the microphone according to that SNR.

The estimated noise calculation unit 101 estimates the background noise amount of each frequency band from the per-band voice signal input from the microphone and the per-band weighted voice signal output from the weighted voice calculation unit 100. Based on the per-band weighted voice signal output from the weighted voice calculation unit 100, the estimated noise calculation unit 101 calculates the per-band background noise amount from the voice signals of a predetermined number of frames estimated to be background noise, and updates that value as it goes.

A specific method by which the background noise estimation unit 11 shown in FIG. 2 calculates the background noise amount is described in, for example, JP 2008-216721 A.

In this way, the per-band voice signal input from the microphone, which is used to update the estimated background noise amount, is weighted according to the SNR before the background noise amount is estimated. This reduces the influence of the user's own transmitted voice contained in the microphone signal, so the background noise amount can be estimated more accurately.
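The idea of SNR-dependent weighting can be sketched as follows. This is not the method of JP 2008-216721 A, whose details are not given here. It is a simplified stand-in in which a recursive average replaces the window of a predetermined number of frames, and the weight curve, the thresholds `lo_db` and `hi_db`, and the update rate `alpha` are all assumed values.

```python
import math


def snr_weight(snr_db, lo_db=0.0, hi_db=10.0):
    """Weight 1.0 at or below lo_db, 0.0 at or above hi_db, linear in between."""
    if snr_db <= lo_db:
        return 1.0
    if snr_db >= hi_db:
        return 0.0
    return (hi_db - snr_db) / (hi_db - lo_db)


def update_noise(noise, power, alpha=0.1, floor=1e-12):
    """Update per-band noise estimates from one frame of per-band microphone power."""
    updated = []
    for n_est, p in zip(noise, power):
        snr_db = 10.0 * math.log10(max(p, floor) / max(n_est, floor))
        w = snr_weight(snr_db)
        # High-SNR bands (likely the user's own voice) barely move the estimate.
        updated.append(n_est + alpha * w * (p - n_est))
    return updated


# Band 0 is slightly above the noise estimate; band 1 is 20 dB above it (speech).
updated = update_noise([1.0, 1.0], [1.2, 100.0])
```

In this example the estimate for band 0 rises slightly toward the new observation, while band 1 is left unchanged because its SNR gives it a weight of zero.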

The second frequency analysis unit 12, like the first frequency analysis unit 10, divides the signal of the received voice (hereinafter referred to as the received voice signal) into voice signals of a plurality of frequency bands (second voice signals) by processing such as a DFT.

The characteristic correction unit 13 can be realized with, for example, the configuration shown in FIG. 3.

FIG. 3 is a block diagram showing a configuration example of the characteristic correction unit of the first embodiment.

As shown in FIG. 3, the characteristic correction unit 13 includes a smoothing unit 200, a gain generation unit 201, a gain matrix unit 202, and a correction unit 203.

The smoothing unit 200 smooths the per-band background noise amount estimated by the background noise estimation unit 11 in the time axis direction or the frequency axis direction. The smoothing unit 200 may also smooth the background noise amount of each frequency band in both the time axis direction and the frequency axis direction. For example, when the voice processing apparatus processes the voice signal in units of predetermined frames, the background noise amounts of a plurality of frames adjacent in the time axis direction may be smoothed. Alternatively, the background noise amount may be smoothed over a plurality of frames adjacent in the frequency axis direction, or over a plurality of frames adjacent in both the time axis direction and the frequency axis direction. Providing the smoothing unit 200 keeps the received voice from changing unnaturally even when the background noise amount changes abruptly.
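One possible way to smooth in each direction is sketched below: exponential smoothing across frames for the time axis, and a three-point moving average across neighboring bands for the frequency axis. Both filters and the constant `beta` are assumptions; the text only requires that some smoothing be applied.

```python
def smooth_time(previous, current, beta=0.8):
    """Exponential smoothing across frames: beta weights the previous smoothed value."""
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]


def smooth_frequency(values):
    """Three-point moving average across neighboring bands (edges use available neighbors)."""
    out = []
    for b in range(len(values)):
        window = values[max(0, b - 1):b + 2]
        out.append(sum(window) / len(window))
    return out


# A sudden jump in band 1 is damped first over time, then across bands.
prev = [1.0, 1.0, 1.0]
cur = [1.0, 11.0, 1.0]
in_time = smooth_time(prev, cur)
in_freq = smooth_frequency(in_time)
```

A larger `beta` makes the estimate react more slowly to abrupt noise changes, which trades responsiveness for a steadier received voice.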

The gain generation unit 201 calculates, from the background noise amount of each frequency band smoothed by the smoothing unit 200, the amplification factor (gain) to be applied to the received voice signal of each frequency band. The gain generation unit 201 may, for example, set the gain of the received voice signal of each frequency band so that the gain increases as the background noise amount increases. The gain may be obtained from, for example, a first-order (linear) expression of the background noise amount, or from an expression of second or higher order.
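A minimal sketch of such a gain rule is a polynomial in the background noise amount, evaluated per band. The coefficient values below are made up for illustration. With non-negative coefficients and non-negative noise amounts, the gain never decreases as the noise grows, which is the property the text asks for.

```python
def generate_gains(noise, coeffs=(1.0, 0.5)):
    """Evaluate gain = c0 + c1*n + c2*n^2 + ... for each band's noise amount n."""
    gains = []
    for n in noise:
        g = 0.0
        for c in reversed(coeffs):  # Horner's rule, highest-order coefficient first
            g = g * n + c
        gains.append(g)
    return gains


# Two coefficients give a first-order rule, three give a second-order rule.
linear = generate_gains([0.0, 2.0])
quadratic = generate_gains([0.0, 2.0], coeffs=(1.0, 0.5, 0.25))
```

Here a band with no noise keeps unity gain under both rules, and only the noisy band is amplified.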

The gain matrix unit 202 mixes and smooths the gains of the frequency bands calculated by the gain generation unit 201 using, for example, Equation (1). Equation (1) is an example of the calculation performed by the gain matrix unit 202 when the received voice signal is divided into N frequency bands (N is a positive integer; f1, f2, ..., fN in order of increasing frequency). The left-hand side of Equation (1) gives the gain of each frequency band (f1, f2, ..., fN) after processing by the gain matrix unit 202.

The gain matrix unit 202 smooths the per-band gains calculated by the gain generation unit 201 in the frequency axis direction. As a rule, the mixing coefficients are set so that the closer a frequency band is to the band being processed, the larger its coefficient. The mixing coefficients may be set in advance, or determined dynamically from the distribution of the gains calculated by the gain generation unit 201. For example, when the variance of the gains is large, the gains of a larger number of frequency bands may be used for smoothing. This processing by the gain matrix unit 202 keeps the received voice from sounding unnatural after frequency synthesis by the frequency synthesis unit 14. Not every mixing coefficient needs a nonzero value; some may be set to 0. The more coefficients are set to 0, the smaller the amount of computation and the less memory is needed to hold the coefficients. However, the mixing coefficients must be set so that the gain of each frequency band after processing by the gain matrix unit 202 does not become 0. For example, setting the mixing coefficient of each frequency band for its own band to 1 and all other mixing coefficients to 0 outputs the gains calculated by the gain generation unit 201 unchanged as the processed gains.
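Assuming that Equation (1) takes the form of a matrix-vector product, in which each processed gain is a weighted sum of the gains from the gain generation unit 201, the mixing can be sketched as follows. The matrix form, the function names, and the tridiagonal coefficient values are assumptions for illustration.

```python
def mix_gains(gains, coeffs):
    """Return processed gains: out[i] = sum over j of coeffs[i][j] * gains[j]."""
    return [sum(a * g for a, g in zip(row, gains)) for row in coeffs]


def tridiagonal_coeffs(n, center=0.5):
    """Hypothetical mixing matrix for n >= 2 bands: each band mixes itself with its
    direct neighbors, every other coefficient is 0, and each row sums to 1."""
    rows = []
    for i in range(n):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        row = [0.0] * n
        row[i] = center
        for j in neighbors:
            row[j] = (1.0 - center) / len(neighbors)
        rows.append(row)
    return rows


gains = [1.0, 3.0, 1.0]
mixed = mix_gains(gains, tridiagonal_coeffs(3))

# Coefficient 1 for each band's own gain and 0 elsewhere passes the gains through.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
passthrough = mix_gains(gains, identity)
```

Because most coefficients in the tridiagonal matrix are 0, an implementation could store and multiply only the nonzero entries, which matches the remark above about saving computation and memory.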

The correction unit 203 multiplies the voice signal of each frequency band by the processed gain for that band output from the gain matrix unit 202, and outputs the result.

The frequency synthesis unit 14 performs the inverse of the transform applied by the second frequency analysis unit 12 (a DFT or the like), thereby frequency-synthesizing the per-band voice signals output from the characteristic correction unit 13 and reproducing the corrected received voice signal.

According to the voice processing apparatus of the first embodiment, the voice signal input from the microphone is weighted according to the SNR of each frequency band before the background noise amount is estimated. This reduces the influence of the user's own transmitted voice contained in the microphone signal, so the background noise amount can be estimated more accurately. Because the estimated background noise amount is largely free of the influence of the user's own transmitted voice, correcting the received voice based on it makes it possible to correct the received voice, and so improve its intelligibility, even while the user is speaking.
(Second Embodiment)
FIG. 4 is a block diagram showing a configuration example of the voice processing apparatus of the second embodiment, and FIG. 5 is a block diagram showing a configuration example of the characteristic correction unit of the second embodiment.

As shown in FIG. 4, the voice processing apparatus of the second embodiment has a characteristic correction unit 15 in place of the characteristic correction unit 13 of the first embodiment. As shown in FIG. 5, the characteristic correction unit 15 of the second embodiment is the characteristic correction unit 13 of the first embodiment with a gain limiter unit 204 added. The rest of the configuration and operation of the voice processing apparatus is the same as in the first embodiment, so its description is omitted.

The characteristic correction unit 13 of the first embodiment uses the gains calculated by the gain generation unit 201 as they are, so digital clipping may occur in the received voice signal after frequency synthesis by the frequency synthesis unit 14. The gain limiter unit 204 limits the per-band gains calculated by the gain generation unit 201 and thereby suppresses digital clipping in the received voice signal after frequency synthesis.

FIG. 6 is a schematic diagram showing a processing example of the gain limiter unit shown in FIG. 5. The horizontal axis of FIG. 6 is the time axis (frames), and the vertical axis is the upper limit of the gain for each frequency band (gain limit value) set by the gain limiter unit 204.

The gain limiter unit 204 first calculates, from the amplitude (digital value) of the received voice signal, the maximum allowable gain (solid line) for each frame in which the voice signal is processed. The maximum gain is obtained by dividing the largest value that the digital signal processing can handle by the maximum absolute value of the received voice signal within the frame. The maximum gain therefore increases as the amplitude of the received voice signal decreases, and decreases as the amplitude increases.
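The per-frame maximum gain described above is a single division. In the sketch below, the full-scale value assumes signed 16-bit samples, and returning infinity for an all-zero frame is a design choice of this example rather than something stated in the text.

```python
FULL_SCALE = 32767  # assumed: largest magnitude of a signed 16-bit sample


def max_gain(frame, full_scale=FULL_SCALE):
    """Largest gain that keeps every sample of the frame within full scale."""
    peak = max(abs(s) for s in frame)
    if peak == 0:
        return float("inf")  # a silent frame cannot clip at any gain
    return full_scale / peak


loud = max_gain([100, -16384, 50])  # peak 16384: just under 2x headroom
quiet = max_gain([10, -20, 5])      # peak 20: a very large allowable gain
```

As the text notes, the quiet frame tolerates a much larger gain than the loud one.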

The gain limiter unit 204 sets the gain limiter value (dotted line) so that it is at or below the calculated maximum gain. When the maximum gain rises sharply, the gain limiter unit 204 leaves the gain limiter value unchanged for a preset number of frames (the number of hold frames). For example, from time T1 in FIG. 6 onward the maximum gain rises, so the gain of each frequency band could also be raised. Nevertheless, the gain limiter unit 204 does not raise the gain limiter value during the hold period. If the gain limiter value were raised sharply and the per-band gains were raised sharply with it, the volume of the received voice after frequency synthesis would increase unnaturally; the hold prevents such an unnatural increase in volume.

After the hold period has elapsed (time T2 in FIG. 6), the gain limiter unit 204 raises the gain limiter value gradually. It may raise the value at a preset rate, or vary the rate according to the background noise amount or the maximum gain.

Once the raised gain limiter value has become equal to the maximum gain (time T3 in FIG. 6), the gain limiter unit 204 repeats the processing described for T1 onward if the maximum gain subsequently rises. If the maximum gain falls, the gain limiter value is lowered to follow it.
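The hold-then-ramp behavior from T1 to T3 can be sketched as a small state machine over the per-frame maximum gains. The function name, the fixed ramp step, and the way the hold counter is reset are assumptions for illustration; the text also allows the ramp rate to vary.

```python
def track_limit(max_gains, hold_frames=3, step=0.25):
    """Return the gain limiter value per frame for a sequence of maximum gains."""
    limits = []
    limit = max_gains[0]
    held = 0  # frames elapsed since the maximum gain rose above the limiter value
    for m in max_gains:
        if m <= limit:
            limit = m  # follow the maximum gain downward immediately
            held = 0
        else:
            held += 1
            if held > hold_frames:
                limit = min(m, limit + step)  # ramp up, never exceeding m
                if limit == m:
                    held = 0  # caught up (T3): a later rise starts a new hold
        limits.append(limit)
    return limits


# The maximum gain drops to 1.0, then jumps to 3.0 and stays there.
max_gains = [2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
limits = track_limit(max_gains, hold_frames=2, step=0.5)
```

In this trace the limiter value follows the drop at once, stays at 1.0 for two frames after the jump, and then climbs in steps of 0.5 until it reaches 3.0. Each per-band gain would then be capped with `min(gain, limit)` before being applied.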

According to the voice processing apparatus of the second embodiment, the gain limiter unit 204 limits the gain of each frequency band of the received voice to an upper limit or less according to the amplitude of the received voice signal, so the corrected received voice signal never exceeds the largest value that the digital signal processing can handle. Compared with the voice processing apparatus of the first embodiment, the received voice can therefore be made clearer without degrading sound quality.
(Third Embodiment)
FIG. 7 is a block diagram showing a configuration example of the characteristic correction unit of the third embodiment.

As shown in FIG. 7, the voice processing apparatus of the third embodiment adds a background noise masking calculation unit 205 and a received voice masking calculation unit 206 to the characteristic correction unit 13 of the first embodiment. The rest of the configuration and operation of the voice processing apparatus is the same as in the first embodiment, so its description is omitted.

The characteristic correction unit of this embodiment uses the background noise amount estimated by the background noise estimation unit 11 to increase the gain only in those frequency bands of the received voice signal that are masked by the background noise.

The background noise amount masking calculation unit 205 calculates a well-known masking threshold for each divided frequency band from the background noise amount smoothed by the smoothing unit 200, and uses that masking threshold to calculate the humanly audible background noise amount for each frequency band.

The received voice masking calculation unit 206 calculates a well-known masking threshold for each divided frequency band from the received voice signal, and uses that masking threshold to calculate the humanly audible received voice amount for each frequency band.

The masking threshold is the limiting sound pressure level at which a desired sound can still be heard when another sound that masks it is present. Because received voice and background noise normally contain many frequency components, one frequency component of the received voice may mask other frequency components of the same received voice. The background noise amount masking calculation unit 205 compares the masking threshold with the background noise amount for each frequency band and calculates, for each frequency band, the humanly audible background noise amount that is not masked by other frequency components of the background noise. Likewise, the received voice masking calculation unit 206 compares the masking threshold with the received voice amount for each frequency band and calculates, for each frequency band, the humanly audible received voice amount that is not masked by other frequency components of the received voice. Methods for calculating the masking threshold and for correcting a voice signal using the masking threshold are disclosed in, for example, Japanese Patent Application Laid-Open No. 2009-175420.
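The per-band comparison described above can be pictured with a small sketch. It assumes that the masking thresholds have already been computed by some method (for example, the one in the cited publication) and are simply passed in. It also assumes that a band whose level does not exceed its threshold contributes an audible amount of zero, which is one possible reading of the text rather than something the text states. The function name `audible_amounts` is hypothetical.

```python
def audible_amounts(levels, thresholds):
    """Return the audible amount for each frequency band.

    levels:     per-band signal amounts (background noise or received voice)
    thresholds: per-band masking thresholds, assumed to be precomputed
    Assumption: a band at or below its masking threshold is treated as
    inaudible and reported as 0.0; otherwise its level is passed through.
    """
    if len(levels) != len(thresholds):
        raise ValueError("levels and thresholds must cover the same bands")
    return [lv if lv > th else 0.0 for lv, th in zip(levels, thresholds)]
```

The same function would serve both calculation units in this sketch, once with the smoothed background noise amounts and once with the received voice amounts.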

The gain generation unit 207 compares, for each frequency band, the background noise amount output from the background noise amount masking calculation unit 205 with the received voice amount output from the received voice masking calculation unit 206. When the background noise amount is larger than the received voice amount, it sets the gain of the voice signal (second voice signal) in the corresponding frequency band to "1" or more.

Here, if the gain of a frequency band in which the background noise amount is larger than the received voice amount is set to (background noise amount / received voice amount), the background noise amount in that frequency band and the received voice amount after correction become equal.

If, in a frequency band where the background noise amount is larger than the received voice amount, the received voice amount after correction should always exceed the background noise amount, the gain generation unit 207 may set the gain of the voice signal (received voice) in that frequency band to a value equal to or greater than (background noise amount / received voice amount). For example, the gain of the voice signal (received voice) in the corresponding frequency band may be set to (background noise amount / received voice amount) × α (α > 1.0) or to (background noise amount / received voice amount) + α (α > 0.0).

When the received voice amount is larger than the background noise amount, the gain generation unit 207 sets the gain of the voice signal (received voice) in the corresponding frequency band to "1" so that the signal is not amplified.
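The gain rule described for the gain generation unit 207 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment. The names `band_gains` and `apply_gains`, the parameter `alpha` (the multiplicative margin α), and the use of plain lists of per-band amounts are assumptions introduced here. Bands whose received voice amount is zero are left at gain 1 to avoid division by zero, a case the text does not address.

```python
def band_gains(noise, voice, alpha=1.0):
    """Per-band gain: (noise / voice) * alpha where noise exceeds voice, else 1.

    noise, voice: audible background noise and received voice amounts per band
    alpha: multiplicative margin (hypothetical name). alpha == 1.0 makes the
           corrected voice equal to the noise; alpha > 1.0 keeps it above.
    """
    if len(noise) != len(voice):
        raise ValueError("noise and voice must cover the same bands")
    if alpha < 1.0:
        raise ValueError("alpha must be at least 1.0")
    gains = []
    for n, v in zip(noise, voice):
        if v > 0 and n > v:
            # Noise dominates this band: raise the voice to the noise level
            # (times the margin alpha).
            gains.append((n / v) * alpha)
        else:
            # Voice is already at or above the noise (or the band is silent,
            # an assumption of this sketch): do not amplify.
            gains.append(1.0)
    return gains


def apply_gains(voice, gains):
    """Multiply each band of the received voice by its gain."""
    return [v * g for v, g in zip(voice, gains)]
```

For example, with noise amounts `[4.0, 1.0, 2.0]` and voice amounts `[2.0, 3.0, 2.0]`, only the first band is amplified; with `alpha=1.0` its corrected voice amount equals the noise amount in that band.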

According to the speech processing apparatus of the third embodiment, the received voice signal is amplified only in frequency bands where the background noise amount is larger than the received voice amount, so the received voice is not increased unnecessarily. Therefore, received voice of higher quality than in the speech processing apparatus of the first embodiment is reproduced. Note that the background noise amount masking calculation unit 205 and the received voice masking calculation unit 206 shown in FIG. 7 can also be provided in the characteristic correction unit 15 of the second embodiment.

DESCRIPTION OF SYMBOLS
1 Transmission voice analysis unit
2 Received voice correction unit
10 First frequency analysis unit
11 Background noise estimation unit
12 Second frequency analysis unit
13, 15 Characteristic correction units
14 Frequency synthesis unit
100 Weighted speech calculation unit
101 Estimated noise calculation unit
200 Smoothing unit
201 Gain generation unit
202 Gain Matrix unit
203 Correction unit
204 Gain limiter unit
205 Background noise amount masking calculation unit
206 Received voice masking calculation unit

Claims

A speech processing apparatus comprising:
a first frequency analysis unit that divides a voice signal input from a microphone into a plurality of first voice signals of predetermined frequency bands;
a background noise estimation unit that weights the first voice signal of each frequency band divided by the first frequency analysis unit according to its signal-to-noise ratio and estimates the background noise amount of each frequency band contained in the voice signal;
a second frequency analysis unit that divides a received voice signal into a plurality of second voice signals of predetermined frequency bands;
a characteristic correction unit that, based on the background noise amount of each frequency band estimated by the background noise estimation unit, sets a gain to be applied to the second voice signal of each frequency band such that the second voice signal output from the second frequency analysis unit is increased as the background noise amount increases, and corrects the received voice signal by multiplying the second voice signal of the corresponding frequency band by the gain; and
a received voice correction unit including a frequency synthesis unit that frequency-synthesizes the corrected second voice signals of the respective frequency bands output from the characteristic correction unit and reproduces the corrected received voice signal.

The speech processing apparatus according to claim 1, wherein the characteristic correction unit includes a gain limiter unit that limits the gain of each frequency band to a predetermined upper limit value or less.

The speech processing apparatus according to claim 1 or 2, wherein the characteristic correction unit includes:
a background noise amount masking calculation unit that calculates, for each frequency band, the humanly audible background noise amount that is not masked by other frequency components of the background noise; and
a received voice masking calculation unit that calculates, for each frequency band, the humanly audible received voice amount that is not masked by other frequency components of the received voice,
and wherein the gain generation unit compares, for each frequency band, the background noise amount output from the background noise amount masking calculation unit with the received voice amount output from the received voice masking calculation unit and, when the background noise amount is larger than the received voice amount, sets the gain applied to the second voice signal of the corresponding frequency band to 1 or more.

A speech processing method comprising:
dividing a voice signal input from a microphone into a plurality of first voice signals of predetermined frequency bands, weighting the first voice signal of each frequency band according to its signal-to-noise ratio, and estimating the background noise amount of each divided frequency band contained in the voice signal;
dividing a received voice signal into a plurality of second voice signals of predetermined frequency bands and, based on the estimated background noise amount of each frequency band, setting a gain to be applied to the second voice signal of each frequency band such that the corresponding second voice signal is increased as the background noise amount increases;
correcting the received voice signal by multiplying the second voice signal of the corresponding frequency band by the gain; and
frequency-synthesizing the corrected second voice signals of the respective frequency bands and reproducing the corrected received voice signal.

The speech processing method according to claim 4, wherein the gain of each frequency band is limited to a predetermined upper limit value or less.

The speech processing method according to claim 4 or 5, further comprising:
calculating, for each frequency band, the humanly audible background noise amount that is not masked by other frequency components of the background noise;
calculating, for each frequency band, the humanly audible received voice amount that is not masked by other frequency components of the received voice; and
comparing the background noise amount with the received voice amount for each frequency band and, when the background noise amount is larger than the received voice amount, setting the gain applied to the second voice signal of the corresponding frequency band to 1 or more.