JP4745916B2

JP4745916B2 - Noise suppression speech quality estimation apparatus, method and program

Info

Publication number: JP4745916B2
Application number: JP2006225158A
Authority: JP
Inventors: 則次恵木; 仁志青木; 玲高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-06-07
Filing date: 2006-08-22
Publication date: 2011-08-10
Anticipated expiration: 2026-08-22
Also published as: JP2008015443A

Description

The present invention relates to a technique for evaluating speech quality in voice communication services that use noise suppression processing, and in particular to a technique for evaluating speech quality in communication environments strongly affected by ambient noise.

In voice communication in an environment with loud ambient noise, noise enters the transmitter, so the listener hears speech with noise superimposed on it. In hands-free communication, the distance between the talker's mouth and the microphone is longer than with a handset or headset, so the microphone's pickup range widens and the call is more susceptible to ambient noise. Mobile phones are also often used outdoors, where ambient noise is loud, so even handset communication is susceptible to ambient noise. Noise suppression processing is therefore important in these forms of communication.

Noise suppression techniques based on a variety of methods have been developed. To provide a high-quality voice communication service, it is important to understand the performance of a noise suppression technique accurately, to optimize its parameters, and to select among methods. A quality evaluation method for noise-suppressed speech is therefore desired.
The basis of speech quality evaluation is subjective quality evaluation, a psychological evaluation in which subjects actually listen to speech or hold conversations. Subjective quality evaluation directly measures the quality that users perceive, but it is not simple to carry out: it requires a sufficient number of subjects and dedicated facilities, and it costs considerable time and money.

A technique is therefore desired that estimates subjective quality efficiently from physical quantities of the speech signal instead of relying on human subjective evaluation. Such a technique is called objective quality evaluation. The amount of noise removed is currently the most widely used objective index of noise suppression performance, but it does not always correspond well to subjective quality. This is because noise suppression distorts both the speech and the noise, and these distortions affect subjective quality. To estimate subjective quality properly, such distortions must also be taken into account.

One objective quality evaluation technique that can evaluate speech distortion is PESQ (Perceptual Evaluation of Speech Quality), disclosed in Non-Patent Document 1. PESQ takes as input an original speech signal and a degraded speech signal processed by the coding scheme or device under evaluation, and measures the quality of the evaluation target from the difference between the two signals. FIGS. 8 and 9 are block diagrams showing configuration examples of a quality evaluation system using PESQ. FIG. 8 shows the configuration for evaluating the quality of a speech signal with no superimposed noise, and FIG. 9 shows the configuration for evaluating the quality of a speech signal with superimposed noise. In FIGS. 8 and 9, reference numeral 100 denotes an evaluation target device, 101 a PESQ device, and 102 a speech adder.

Non-Patent Document 1: ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs", Feb. 2001

As shown in FIG. 9, PESQ takes as input the speech signal before noise is superimposed and the noise-suppressed speech signal, so it can evaluate the quality of noise-superimposed speech with speech distortion taken into account. However, PESQ has no input for the noise before suppression, so it cannot take noise distortion into account. The PESQ disclosed in Non-Patent Document 1 therefore cannot evaluate noise-suppressed speech accurately.

The present invention has been made to solve the above problem, and its object is to provide a noise-suppressed speech quality estimation apparatus, method, and program that can evaluate noise-suppressed speech accurately.

The present invention is a noise-suppressed speech quality estimation apparatus that objectively estimates the quality of noise-suppressed speech. The apparatus comprises detection means for detecting feature quantities of quality factors of the noise-suppressed speech signal that is output from a noise suppression processing device under evaluation when a noise-superimposed speech signal is given to that device as input, and estimation means for estimating the quality of the noise-suppressed speech signal based on the feature quantities detected by the detection means. The detection means comprises discrimination means for determining whether each section obtained by dividing the noise-suppressed speech signal at fixed time intervals is a speech section or a non-speech section, speech section feature detection means for detecting feature quantities of quality factors of the noise-suppressed speech signal in the speech sections, and non-speech section feature detection means for detecting feature quantities of quality factors of the noise-suppressed speech signal in the non-speech sections. The speech section feature detection means has a speech distortion measurement unit that detects speech distortion as a feature quantity by comparing the speech signal before noise is superimposed with the corresponding noise-suppressed speech signal in the speech sections, and a volume measurement unit that detects, as a feature quantity, the volume of the noise-suppressed speech signal in the speech sections. The non-speech section feature detection means has a noise distortion measurement unit that detects noise distortion as a feature quantity by comparing the noise-suppressed speech signal in the non-speech sections with the corresponding noise-superimposed speech signal or with the noise signal from which the noise-superimposed speech signal was derived, and a noise amount measurement unit that detects, as a feature quantity, the amount of noise of the noise-suppressed speech signal in the non-speech sections. The noise distortion is the PESQ value or Wideband-PESQ value obtained when the noise-suppressed speech signal in the non-speech sections is used as the degraded speech signal and the corresponding noise-superimposed speech signal, or the noise signal from which it was derived, is used as the reference signal. The estimation means estimates the quality of the noise-suppressed speech signal based on the speech distortion, the volume of the noise-suppressed speech signal, the noise distortion, and the amount of noise of the noise-suppressed speech signal.

In one configuration example, the noise-suppressed speech quality estimation apparatus of the present invention further comprises a speech database in which speech signals are registered in advance, speech reproduction means for reproducing a speech signal from the speech database in an actual call environment, and speech recording means for outputting, as the noise-superimposed speech signal, the signal obtained by picking up the reproduced speech.
In another configuration example, the apparatus further comprises a speech database in which speech signals are registered in advance, speech recording means for picking up a noise signal in an actual call environment, and speech addition means for outputting, as the noise-superimposed speech signal, the sum of a speech signal from the speech database and the noise signal picked up by the speech recording means.
In a further configuration example, the apparatus further comprises a speech database in which speech signals are registered in advance, a noise database in which noise signals are registered in advance, and speech addition means for outputting, as the noise-superimposed speech signal, the sum of a speech signal from the speech database and a noise signal from the noise database.
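The speech addition means described above can be illustrated with a minimal Python sketch. The source only states that the speech signal and the noise signal are added. Scaling the noise to a target signal-to-noise ratio, the function name `mix_at_snr`, and the parameter `snr_db` are assumptions introduced here for illustration.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech sample by sample.

    Assumption (not in the source): the noise is first scaled so that the
    ratio of speech power to noise power equals snr_db decibels.
    Both inputs are truncated to the shorter length.
    """
    n = min(len(speech), len(noise))
    if n == 0:
        return []
    speech, noise = speech[:n], noise[:n]
    p_speech = sum(s * s for s in speech) / n
    p_noise = sum(v * v for v in noise) / n
    if p_noise == 0:
        # Nothing to add: the "noisy" signal is just the speech.
        return list(speech)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(speech, noise)]
```

Sweeping `snr_db` over several values would be one way to produce noise-superimposed signals for a range of noise environments from a single pair of database entries.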

The noise-suppressed speech quality estimation method of the present invention comprises a detection procedure for detecting feature quantities of quality factors of the noise-suppressed speech signal that is output from a noise suppression processing device under evaluation when a noise-superimposed speech signal is given to that device as input, and an estimation procedure for estimating the quality of the noise-suppressed speech signal based on the feature quantities detected by the detection procedure. The detection procedure consists of a discrimination procedure for determining whether each section obtained by dividing the noise-suppressed speech signal at fixed time intervals is a speech section or a non-speech section, a speech section feature detection procedure for detecting feature quantities of quality factors of the noise-suppressed speech signal in the speech sections, and a non-speech section feature detection procedure for detecting feature quantities of quality factors of the noise-suppressed speech signal in the non-speech sections. The speech section feature detection procedure consists of a speech distortion measurement procedure that detects speech distortion as a feature quantity by comparing the speech signal before noise is superimposed with the corresponding noise-suppressed speech signal in the speech sections, and a volume measurement procedure that detects, as a feature quantity, the volume of the noise-suppressed speech signal in the speech sections. The non-speech section feature detection procedure consists of a noise distortion measurement procedure that detects noise distortion as a feature quantity by comparing the noise-suppressed speech signal in the non-speech sections with the corresponding noise-superimposed speech signal or with the noise signal from which the noise-superimposed speech signal was derived, and a noise amount measurement procedure that detects, as a feature quantity, the amount of noise of the noise-suppressed speech signal in the non-speech sections. The noise distortion is the PESQ value or Wideband-PESQ value obtained when the noise-suppressed speech signal in the non-speech sections is used as the degraded speech signal and the corresponding noise-superimposed speech signal, or the noise signal from which it was derived, is used as the reference signal. The estimation procedure estimates the quality of the noise-suppressed speech signal based on the speech distortion, the volume of the noise-suppressed speech signal, the noise distortion, and the amount of noise of the noise-suppressed speech signal.

The noise-suppressed speech quality estimation program of the present invention causes a computer to execute a detection procedure for detecting feature quantities of quality factors of the noise-suppressed speech signal that is output from a noise suppression processing device under evaluation when a noise-superimposed speech signal is given to that device as input, and an estimation procedure for estimating the quality of the noise-suppressed speech signal based on the feature quantities detected by the detection procedure. The detection procedure consists of a discrimination procedure for determining whether each section obtained by dividing the noise-suppressed speech signal at fixed time intervals is a speech section or a non-speech section, a speech section feature detection procedure for detecting feature quantities of quality factors of the noise-suppressed speech signal in the speech sections, and a non-speech section feature detection procedure for detecting feature quantities of quality factors of the noise-suppressed speech signal in the non-speech sections. The speech section feature detection procedure consists of a speech distortion measurement procedure that detects speech distortion as a feature quantity by comparing the speech signal before noise is superimposed with the corresponding noise-suppressed speech signal in the speech sections, and a volume measurement procedure that detects, as a feature quantity, the volume of the noise-suppressed speech signal in the speech sections. The non-speech section feature detection procedure consists of a noise distortion measurement procedure that detects noise distortion as a feature quantity by comparing the noise-suppressed speech signal in the non-speech sections with the corresponding noise-superimposed speech signal or with the noise signal from which the noise-superimposed speech signal was derived, and a noise amount measurement procedure that detects, as a feature quantity, the amount of noise of the noise-suppressed speech signal in the non-speech sections. The noise distortion is the PESQ value or Wideband-PESQ value obtained when the noise-suppressed speech signal in the non-speech sections is used as the degraded speech signal and the corresponding noise-superimposed speech signal, or the noise signal from which it was derived, is used as the reference signal. The estimation procedure estimates the quality of the noise-suppressed speech signal based on the speech distortion, the volume of the noise-suppressed speech signal, the noise distortion, and the amount of noise of the noise-suppressed speech signal.

According to the present invention, a noise-superimposed speech signal is given as input to the noise suppression processing device under evaluation, and when the feature quantities of the quality factors of the noise-suppressed speech signal output from that device are detected, at least noise distortion is detected as a feature quantity by comparing the noise-suppressed speech signal with the noise-superimposed speech signal or with the noise signal from which it was derived. Estimating the quality of the noise-suppressed speech signal from this feature quantity yields a quality estimate that matches the user's experience of noise-suppressed speech, so noise suppression techniques can be configured appropriately and compared inexpensively and easily. For example, by supplying speech and noise signals that match the environment in which calls are made, the performance of a noise suppression technique in each environment can be determined. Developers of noise suppression techniques can learn the performance of the techniques they develop. Designers of voice communication terminals who incorporate noise suppression into a terminal can select and configure the best technique for the expected usage environment.
In the present invention, when the amount of noise of the noise-suppressed speech signal is measured, sudden noise is captured by measuring the volume at fixed time intervals. This makes it possible to measure the amount of noise of the noise-suppressed speech signal in a way that takes into account the effect of sudden noise on perceived quality.
In the present invention, when the speech distortion of the noise-suppressed speech signal is measured, the influence of distortion caused by noise is removed, based on the noise volume, from the magnitude of distortion detected by comparing the noise-suppressed speech signal with the speech signal from which the noise-superimposed speech signal was derived. This makes it possible to measure the pure speech distortion of the noise-suppressed speech signal.
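The idea of measuring volume at fixed time intervals to catch sudden noise can be sketched as follows in Python. This passage does not say how the per-interval volumes are computed or combined, so the decibel scale, the power floor, and the function name `interval_levels_db` are assumptions made only for illustration.

```python
import math

def interval_levels_db(signal, interval_len, floor=1e-12):
    """Return the mean power of each full interval, in dB.

    Assumptions (not in the source): power is averaged per interval,
    expressed as 10*log10(power), and clamped at `floor` to avoid log(0).
    A trailing partial interval is ignored.
    """
    levels = []
    for start in range(0, len(signal) - interval_len + 1, interval_len):
        chunk = signal[start:start + interval_len]
        power = sum(s * s for s in chunk) / interval_len
        levels.append(10 * math.log10(max(power, floor)))
    return levels
```

A short burst raises the level of only the interval in which it falls, so inspecting the per-interval values (for example, their maximum) exposes sudden noise that a single average over the whole signal would dilute.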

Further, in the present invention, a speech signal from the speech database is reproduced in an actual call environment, and the signal obtained by picking up the reproduced speech is output as the noise-superimposed speech signal. This makes it possible to estimate, accurately and easily, quality that matches the user's experience of noise-suppressed speech in the actual call environment.

Further, in the present invention, a noise signal is picked up in an actual call environment, and the sum of a speech signal from the speech database and the picked-up noise signal is output as the noise-superimposed speech signal. This makes it possible to estimate, accurately and easily, quality that matches the user's experience of noise-suppressed speech in the actual call environment.

Further, in the present invention, the sum of a speech signal from the speech database and a noise signal from the noise database is output as the noise-superimposed speech signal. This makes it possible to estimate, accurately and easily, quality that matches the user's experience of noise-suppressed speech in a variety of noise environments.

[First Embodiment]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of a noise-suppressed speech quality estimation apparatus according to the first embodiment of the present invention.
As shown in FIG. 1, the noise-suppressed speech quality estimation apparatus 1 comprises a speech database unit 2, a speech reproduction unit 3, a speaker 4, a speech recording unit 5, a microphone 6, a speech section detection unit 7 (discrimination means), delay correction units 8-1, 8-2-1, and 8-2-2, a switch control unit 9, speech concatenation units 10 (10-1 to 10-4), a speech distortion measurement unit 11, a noise distortion measurement unit 12, a volume measurement unit 13, a noise amount measurement unit 14, a speech quality estimation unit 15 (estimation means), a speech quality output unit 16, and switches 20, 21, and 22.

The speech section detection unit 7, the delay correction units 8-1, 8-2-1, and 8-2-2, the switch control unit 9, the speech concatenation units 10, the speech distortion measurement unit 11, the noise distortion measurement unit 12, the volume measurement unit 13, the noise amount measurement unit 14, and the switches 20 to 22 constitute the detection means that detects the feature quantities of the quality factors of the noise-suppressed speech signal.

The speech reproduction unit 3 acquires a speech signal A from the speech database unit 2 and outputs it from the speaker 4 in the environment in which voice calls are made.
The microphone 6 picks up the sound output from the speaker 4. Because noise is superimposed on the sound picked up by the microphone 6, the signal output from the microphone 6 is referred to as the noise-superimposed speech signal B. The speech recording unit 5 acquires the noise-superimposed speech signal B picked up by the microphone 6 and inputs it to the evaluation target device 100 (the noise suppression processing device) and to the delay correction unit 8-2-2. The noise-suppressed speech signal that the evaluation target device 100 outputs when the noise-superimposed speech signal B is input to it is referred to as C. A voice communication terminal is one example of the evaluation target device 100.

The delay correction unit 8-1 takes the speech signal A and the noise-suppressed speech signal C as input and measures the delay of C relative to A. It does so by finding the time shift at which the short-time cross-correlation coefficient between A and C is maximized. The delay correction unit 8-2-1 delays the speech signal A by the delay measured by the delay correction unit 8-1 and thereby outputs a speech signal A' that is time-synchronized with the noise-suppressed speech signal C. The delay correction unit 8-2-2 delays the noise-superimposed speech signal B by the same measured delay and thereby outputs a noise-superimposed speech signal B' that is time-synchronized with the noise-suppressed speech signal C.
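The delay measurement and correction can be sketched in Python as below. This is a simplified illustration under stated assumptions: it correlates over the whole signal rather than over short-time windows, searches only non-negative lags up to a hypothetical `max_lag`, and the function names `estimate_delay` and `apply_delay` are not from the source.

```python
import math

def estimate_delay(reference, degraded, max_lag):
    """Return the lag in samples (0..max_lag) that maximises the
    normalised cross-correlation between reference and degraded."""
    best_lag, best_corr = 0, -math.inf
    for lag in range(max_lag + 1):
        n = min(len(reference), len(degraded) - lag)
        if n <= 0:
            break
        x = reference[:n]
        y = degraded[lag:lag + n]
        num = sum(a * b for a, b in zip(x, y))
        den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
        corr = num / den if den > 0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

def apply_delay(signal, lag):
    """Delay a signal by `lag` samples: prepend zeros, keep the length."""
    return ([0.0] * lag + list(signal))[:len(signal)]
```

In terms of the description, `estimate_delay(A, C, max_lag)` plays the role of unit 8-1, and `apply_delay` applied to A and to B with the returned lag plays the roles of units 8-2-1 and 8-2-2.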

The speech section detection unit 7 divides the speech signal A' into fixed short sections (20 ms) and uses VAD (Voice Activity Detection) to determine, section by section, whether each is a speech section, in which speech is present, or a non-speech section. The speech section detection unit 7 sends the resulting type information for each section of the speech signal A' (speech section or non-speech section) to the switch control unit 9.
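A minimal Python sketch of the 20 ms segmentation and per-section classification follows. The source does not specify which VAD algorithm is used. The mean-power threshold below is a stand-in chosen for illustration, and the names `split_frames`, `classify_frames`, and `threshold` are hypothetical.

```python
def split_frames(signal, sample_rate, frame_ms=20):
    """Split a signal into consecutive frames of frame_ms milliseconds.
    A trailing partial frame is dropped."""
    frame_len = sample_rate * frame_ms // 1000
    count = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(count)]

def classify_frames(frames, threshold=1e-4):
    """Label each frame True (speech) or False (non-speech).

    Assumption: a frame counts as speech when its mean power exceeds
    `threshold`. A real VAD would be considerably more elaborate.
    """
    labels = []
    for frame in frames:
        power = sum(s * s for s in frame) / len(frame)
        labels.append(power > threshold)
    return labels
```

Running the classifier on the clean, delay-corrected signal A' rather than on the noisy signals is what lets a simple criterion work here, since A' contains no superimposed noise.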

The switch control unit 9 controls the switches 20 to 22 based on the type of each section of the speech signal A' reported by the speech section detection unit 7.
FIG. 2 shows the switch control when a short section x1 of the speech signal A' is determined to be a speech section. When the switch control unit 9 receives type information indicating that the short section x1 of the speech signal A' is a speech section, it sets the switch 20 so that the short section x1 of the speech signal A' is input to the speech concatenation unit 10-1 and, at the same time, sets the switch 21 so that the short section of the noise-suppressed speech signal C is input to the speech concatenation unit 10-3. Because the noise-suppressed speech signal C is synchronized with the speech signal A', the short section of C that corresponds to the short section x1 of A' is the one input to the speech concatenation unit 10-3. The switch control unit 9 also controls the switch 22 so that no signal is input to the speech concatenation unit 10-2.

FIG. 3 shows the switch control when a short section x2 of the speech signal A' is determined to be a non-speech section. When the switch control unit 9 receives type information indicating that the short section x2 of the speech signal A' is a non-speech section, it sets the switch 21 so that the short section of the noise-suppressed speech signal C is input to the speech concatenation unit 10-4 and, at the same time, sets the switch 22 so that the short section of the noise-superimposed speech signal B' is input to the speech concatenation unit 10-2. Because the noise-superimposed speech signal B' and the noise-suppressed speech signal C are synchronized with the speech signal A', the short section of B' that corresponds to the short section x2 of A' is input to the speech concatenation unit 10-2, and the short section of C that corresponds to x2 is input to the speech concatenation unit 10-4. The switch control unit 9 also controls the switch 20 so that no signal is input to the speech concatenation unit 10-1.

The speech concatenation unit 10-1 stores the first short-section signal it receives. Thereafter, each time a short-section signal is input, it appends that signal to the end of the signal it currently holds and stores the concatenated result. In this way the speech concatenation unit 10-1 concatenates and stores all input short sections of the speech signal A'. The speech concatenation units 10-2, 10-3, and 10-4 likewise concatenate and store the short-section signals input to them.

As a result, the speech concatenation unit 10-1 stores speech signal a, the concatenation of all speech segments of speech signal A′. The speech concatenation unit 10-2 stores noise signal b, the concatenation of the short segments of the noise-superimposed speech signal B′ that correspond to all non-speech segments of A′. The speech concatenation unit 10-3 stores speech signal c1, the concatenation of the short segments of the noise-suppressed speech signal C that correspond to all speech segments of A′. The speech concatenation unit 10-4 stores noise signal c2, the concatenation of the short segments of C that correspond to all non-speech segments of A′. FIG. 4 shows the relationship between speech signal A′, the noise-superimposed speech signal B′, and the noise-suppressed speech signal C on the one hand, and speech signals a and c1 and noise signals b and c2 on the other. In FIG. 4, the vertical axis is signal level and the horizontal axis is time.
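The routing and concatenation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `route_segments`, the fixed segment length `seg_len`, and the per-segment boolean flags are assumptions made for the example, and the three inputs are assumed to be already time-aligned.

```python
from typing import List, Sequence, Tuple


def route_segments(a_sync: Sequence[float],
                   b_sync: Sequence[float],
                   c: Sequence[float],
                   is_speech: Sequence[bool],
                   seg_len: int) -> Tuple[List[float], List[float], List[float], List[float]]:
    """Split three time-aligned signals into short segments and concatenate
    them by segment type, mirroring concatenation units 10-1 to 10-4."""
    a: List[float] = []    # 10-1: A' during speech segments
    b: List[float] = []    # 10-2: B' during non-speech segments
    c1: List[float] = []   # 10-3: C during speech segments
    c2: List[float] = []   # 10-4: C during non-speech segments
    for k, speech in enumerate(is_speech):
        lo, hi = k * seg_len, (k + 1) * seg_len
        if speech:
            a.extend(a_sync[lo:hi])
            c1.extend(c[lo:hi])
        else:
            b.extend(b_sync[lo:hi])
            c2.extend(c[lo:hi])
    return a, b, c1, c2


# Toy input: three segments of two samples each; the middle one is non-speech.
a, b, c1, c2 = route_segments(
    a_sync=[1, 2, 0, 0, 3, 4],
    b_sync=[11, 12, 13, 14, 15, 16],
    c=[21, 22, 23, 24, 25, 26],
    is_speech=[True, False, True],
    seg_len=2,
)
```

Each output list grows only when a segment of the matching type arrives, which corresponds to the "no input" state of the other concatenation units for that segment.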

Speech signals a and c1 are input to the speech distortion measurement unit 11. The speech distortion measurement unit 11 measures the speech distortion of the noise-suppressed speech signal C by comparing speech signals a and c1. In this embodiment, the well-known PESQ is used for the comparison, with speech signal a as the reference signal and speech signal c1 as the degraded speech signal. PESQ outputs the distortion feature value as a PESQ value. The speech distortion measurement unit 11 sends the measured PESQ value to the speech quality estimation unit 15 as the speech distortion amount x₁.

Noise signals b and c2 are input to the noise distortion measurement unit 12. The noise distortion measurement unit 12 measures the noise distortion of the noise-suppressed speech signal C by comparing noise signals b and c2. In this embodiment, as in the speech distortion measurement, the well-known PESQ is used for the comparison, with noise signal b as the reference signal and noise signal c2 as the degraded speech signal. The noise distortion measurement unit 12 sends the measured PESQ value to the speech quality estimation unit 15 as the noise distortion amount x₂.

Speech signal c1 is input to the volume measurement unit 13. The volume measurement unit 13 measures the volume of the noise-suppressed speech signal C by measuring the volume of speech signal c1. In this embodiment, the method standardized in ISO 532 is used to measure the volume. The volume measurement unit 13 sends the measured volume x₃ to the speech quality estimation unit 15.

Noise signal c2 is input to the noise amount measurement unit 14. The noise amount measurement unit 14 measures the noise amount of the noise-suppressed speech signal C by measuring the volume of noise signal c2. In this embodiment, as in the volume measurement, the method standardized in ISO 532 is used. The noise amount measurement unit 14 sends the measured noise amount x₄ to the speech quality estimation unit 15.

The speech quality estimation unit 15 estimates the subjective quality of the noise-suppressed speech signal C from the speech distortion amount, noise distortion amount, volume, and noise amount of C received from the speech distortion measurement unit 11, the noise distortion measurement unit 12, the volume measurement unit 13, and the noise amount measurement unit 14, and sends the estimated subjective quality to the speech quality output unit 16. The speech quality estimation unit 15 can estimate subjective quality using, for example, an estimation equation obtained as follows.

First, to obtain the estimation equation, speech samples covering a variety of feature values for speech distortion, noise distortion, volume, and noise amount are prepared in advance, and multiple subjects rate each speech sample on a five-grade absolute category scale. The average of the ratings obtained in this subjective quality evaluation is called the MOS (Mean Opinion Score) value. On the MOS scale, 5 means very good and 1 means very bad.
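As a small illustration of the MOS definition above, the sketch below averages the ratings each sample received. The function name `mos`, the sample names, and the ratings themselves are made up for the example and are not data from the patent.

```python
from statistics import mean


def mos(ratings):
    """Mean Opinion Score: average of ratings on the 1-5 absolute category scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


# Hypothetical ratings from four subjects for two speech samples.
ratings_by_sample = {
    "sample_01": [4, 5, 3, 4],
    "sample_02": [2, 1, 2, 3],
}
mos_by_sample = {name: mos(r) for name, r in ratings_by_sample.items()}
```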

Then, based on the MOS value of each speech sample, multiple regression analysis is used to obtain an equation that estimates subjective quality with the feature values of the four quality factors (speech distortion, noise distortion, volume, and noise amount) as variables. This yields Equation (1):
$$y = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 + \alpha_5 \qquad (1)$$
Here, x₁ is the speech distortion amount, x₂ is the noise distortion amount, x₃ is the volume, x₄ is the noise amount, and y is the MOS value (the subjective quality estimate). α₁, α₂, α₃, α₄, and α₅ are constants. The speech quality estimation unit 15 obtains the subjective quality estimate using Equation (1).
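The regression step behind Equation (1) can be sketched with ordinary least squares. The patent says only that multiple regression analysis is used, so solving the normal equations is an assumption here, as are the helper names `solve_linear`, `fit_equation_1`, and `estimate_mos`. The numbers are synthetic, generated from known coefficients so that the fit can be checked; they are not measured data.

```python
from typing import List, Sequence


def solve_linear(m: List[List[float]], v: List[float]) -> List[float]:
    """Solve m @ x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_equation_1(features: Sequence[Sequence[float]],
                   mos_values: Sequence[float]) -> List[float]:
    """Least-squares fit of y = a1*x1 + a2*x2 + a3*x3 + a4*x4 + a5."""
    rows = [list(x) + [1.0] for x in features]  # append the intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) alpha = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, mos_values)) for i in range(k)]
    return solve_linear(xtx, xty)


def estimate_mos(alpha: Sequence[float], x: Sequence[float]) -> float:
    """Evaluate Equation (1); alpha[-1] is the constant term a5."""
    return sum(a * xi for a, xi in zip(alpha, x)) + alpha[-1]


# Synthetic feature vectors (x1, x2, x3, x4) and MOS values generated from
# known coefficients, purely to exercise the fit.
alpha_true = [0.5, 0.3, 0.01, -0.02, 1.0]
features = [
    [3.0, 2.5, 60.0, 30.0],
    [2.0, 3.0, 65.0, 40.0],
    [4.0, 1.5, 55.0, 35.0],
    [3.5, 3.5, 70.0, 25.0],
    [1.5, 2.0, 62.0, 45.0],
    [2.5, 4.0, 58.0, 20.0],
]
mos_values = [estimate_mos(alpha_true, x) for x in features]
alpha_fit = fit_equation_1(features, mos_values)
```

With real listening-test data the fit would not be exact, and a library routine for least squares would normally replace the hand-written solver.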

The speech quality output unit 16 outputs the subjective quality estimate of the noise-suppressed speech signal C received from the speech quality estimation unit 15 as the output value of the noise-suppressed speech quality estimation device 1.

As described above, to solve the problems of the prior art, this embodiment detects the noise distortion of noise-suppressed speech by comparing the noise before and after noise suppression processing. To this end, the noise-superimposed speech signal before suppression is used as the information on the noise signal before noise suppression processing. In addition, to obtain the noise after noise suppression processing, the noise-suppressed speech signal is divided into speech segments and non-speech segments. The noise distortion of the noise-suppressed speech can be detected accurately from the difference between the noise before and after noise suppression processing in the non-speech segments. The noise amount is detected by measuring the volume of the noise-suppressed speech in the non-speech segments. Furthermore, speech distortion is detected from the difference between the speech and the noise-suppressed speech in the speech segments of the noise-suppressed speech signal, and the volume is detected by measuring the volume of the noise-suppressed speech in those speech segments.

The user-perceived quality of the noise-suppressed speech is estimated from the quality factors detected in this way: speech distortion, noise distortion, volume, and noise amount. In this embodiment, an estimation equation obtained in advance is used to estimate the quality of the noise-suppressed speech. The equation is derived by preparing noise-suppressed speech with a variety of feature values for each quality factor, obtaining a subjective quality rating for each through subjective quality evaluation experiments, and relating the ratings obtained to the feature values of the quality factors.

In this way, this embodiment makes it easy to evaluate the quality of noise-suppressed speech while taking noise distortion into account, which was not possible with conventional objective quality evaluation. As a result, this embodiment can produce estimates closer to user-perceived quality than the prior art. Furthermore, in this embodiment the speech signal in the speech database unit 2 is played back in an actual call environment, and the signal obtained by picking up the reproduced speech is used as the noise-superimposed speech signal. Using a noise-superimposed speech signal that arises in an actual call environment makes it possible to estimate, accurately and easily, the quality of noise-suppressed speech in line with the user experience.

[Second Embodiment]
A second embodiment of the present invention is described below with reference to the drawings. FIG. 5 is a block diagram showing a configuration example of a noise-suppressed speech quality estimation device according to the second embodiment; components that are the same as in FIG. 1 are given the same reference numerals.
In this embodiment, the configuration of the noise-suppressed speech quality estimation device is almost the same as in the first embodiment, so only the differences from the first embodiment are described.

First, in the first embodiment the noise amount measurement unit 14 measures the volume of noise signal c2 by the method standardized in ISO 532; when carrying out that method, the following approach can also be used. The noise amount measurement unit 14 substitutes the volumes B₁, B₂, ..., Bₙ, measured every fixed time t (ms), into Equation (2) to calculate the volume x₅ [dB] of noise c2, which takes into account the effect of sudden noise on perceived quality. In this embodiment t = 120 and p = 4, but the values are not limited to these. The noise amount measurement unit 14 sends the measured volume x₅ to the speech quality estimation unit 15 as the noise amount of noise c2. Separately from the noise amount x₅, the noise amount measurement unit 14 also calculates the volume x₄ [dB] of noise c2 using the method standardized in ISO 532, and sends the measured volume x₄ to the speech distortion measurement unit 11.

Next, in the first embodiment PESQ is used for the speech comparison in the speech distortion measurement unit 11, but Wideband-PESQ may be used instead. Wideband-PESQ extends the scope of the well-known PESQ from the telephone band to wideband and can evaluate speech distortion over a band of up to 7 kHz. In this embodiment, in addition to speech a and c1, the speech distortion measurement unit 11 receives the volume x₃ from the volume measurement unit 13 and the volume x₄ from the noise amount measurement unit 14.

The speech distortion measurement unit 11 obtains the speech distortion of the noise-suppressed speech C by comparing speech a with speech c1. In this embodiment, Wideband-PESQ is used to calculate the W-PESQ value x₆′, with speech a as the reference signal and speech c1 as the degraded speech. However, because Wideband-PESQ also treats superimposed noise as distortion, the W-PESQ value x₆′ is corrected based on the speech volume x₃ in the speech segments and the noise volume x₄ in the non-speech segments. Substituting the W-PESQ value x₆′ and the volumes x₃ and x₄ into Equation (3) gives the corrected value x₆.

Equation (3) sets the corrected value x₆ to the smaller of α₁ and x₆′/(1 − x₆′/α₂^{α₃(x₃−x₄)}):
$$x_6 = \min\left(\alpha_1,\ \frac{x_6'}{1 - x_6' / \alpha_2^{\alpha_3 (x_3 - x_4)}}\right) \qquad (3)$$
Here α₁, α₂, and α₃ are constants. In this embodiment α₁ = 4.644, α₂ = 3, and α₃ = 0.07, but the values are not limited to these. The speech distortion measurement unit 11 sends the value x₆ to the speech quality estimation unit 15 as the speech distortion amount of speech c1.
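The correction of Equation (3) can be sketched directly from the verbal description above. The function name `corrected_wpesq` is hypothetical. The text does not say what happens when the denominator is zero or negative; returning the cap α₁ in that case is an assumption made here so that the function is defined for all inputs.

```python
def corrected_wpesq(x6_raw: float, x3: float, x4: float,
                    a1: float = 4.644, a2: float = 3.0, a3: float = 0.07) -> float:
    """Corrected W-PESQ value x6 per the description of Equation (3).

    x6_raw: uncorrected W-PESQ value x6'
    x3:     speech volume in the speech segments [dB]
    x4:     noise volume in the non-speech segments [dB]
    """
    denom = 1.0 - x6_raw / (a2 ** (a3 * (x3 - x4)))
    if denom <= 0.0:
        # Assumption: not specified in the source; fall back to the cap a1.
        return a1
    return min(a1, x6_raw / denom)


# With a 30 dB gap between speech and noise volume the correction is mild.
mild = corrected_wpesq(3.0, x3=70.0, x4=40.0)
# A higher raw score with the same gap exceeds the cap and is clipped to a1.
capped = corrected_wpesq(4.0, x3=70.0, x4=40.0)
```

The larger the gap x₃ − x₄, the closer the denominator is to 1 and the smaller the upward correction applied to x₆′.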

Next, in the first embodiment PESQ is used for the comparison in the noise distortion measurement unit 12, but Wideband-PESQ may be used instead. Noise b and c2 are input to the noise distortion measurement unit 12, which obtains the noise distortion of the noise-suppressed speech C by comparing noise b with noise c2. In this embodiment, Wideband-PESQ is used to calculate the W-PESQ value x₇, with noise b as the reference signal and noise c2 as the degraded speech. The noise distortion measurement unit 12 sends the calculated W-PESQ value x₇ to the speech quality estimation unit 15 as the noise distortion amount of speech c1.

When Wideband-PESQ is used instead of PESQ for the comparison, the speech quality estimation unit 15 estimates the subjective quality of the noise-suppressed speech signal C from the input volume x₃, noise amount x₅, speech distortion amount x₆, and noise distortion amount x₇. The speech quality estimation unit 15 calculates the value Q by substituting the volume x₃, noise amount x₅, speech distortion amount x₆, and noise distortion amount x₇ into Equation (4).

In Equation (4), x₃ − x₅ ≥ 10 is imposed as a constraint. β₁ to β₁₁ are constants. In this embodiment, β₁ = 3.7, β₂ = 0.215, β₃ = 0.4, β₄ = 3, β₅ = 1.1, β₆ = 0.9, β₇ = 0.05, β₈ = 0.2, β₉ = 4, β₁₀ = 0.0002, and β₁₁ = 24, but the values are not limited to these. The speech quality estimation unit 15 sends the calculated value Q to the speech quality output unit 16 as the estimate of the subjective quality of the noise-suppressed speech signal C.
The rest of the configuration is the same as in the first embodiment. This embodiment therefore provides the same effects as the first embodiment.

[Third Embodiment]
A third embodiment of the present invention is described below with reference to the drawings. FIG. 6 is a block diagram showing a configuration example of a noise-suppressed speech quality estimation device according to the third embodiment; components that are the same as in FIG. 1 are given the same reference numerals.
As shown in FIG. 6, the noise-suppressed speech quality estimation device 17 comprises a speech database unit 2, a speech recording unit 5, a microphone 6, a speech section detection unit 7, delay correction units 8-1, 8-2-1, and 8-2-2, a switch control unit 9, speech concatenation units 10 (10-1 to 10-4), a speech distortion measurement unit 11, a noise distortion measurement unit 12, a volume measurement unit 13, a noise amount measurement unit 14, a speech quality estimation unit 15, a speech quality output unit 16, a speech addition unit 18, and switches 20 to 22.

In this embodiment, the real-environment noise picked up by the microphone 6 in the call environment and acquired by the speech recording unit 5 is taken as noise signal B. The speech recording unit 5 inputs noise signal B to the speech addition unit 18 and the delay correction unit 8-2-2.
The speech addition unit 18 receives speech signal A, obtained from the speech database unit 2, and noise signal B, and adds them to generate a noise-superimposed speech signal. This noise-superimposed speech signal is input to the device under evaluation 100, and the noise-suppressed speech signal output by the device under evaluation 100 is denoted C.
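The addition performed by the speech addition unit 18 amounts to a sample-wise sum, sketched below. The function name `add_noise` is hypothetical, and truncating to the shorter input is an assumption; the text does not say how signals of different length are handled.

```python
from typing import List, Sequence


def add_noise(speech: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Sample-wise sum of speech signal A and noise signal B."""
    n = min(len(speech), len(noise))  # assumption: truncate to the shorter input
    return [s + v for s, v in zip(speech[:n], noise[:n])]


noisy = add_noise([1.0, 2.0, 3.0], [0.5, -0.5])
```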

As in the first embodiment, the delay correction unit 8-1 receives speech signal A and the noise-suppressed speech signal C and measures the delay of C relative to A. The delay correction unit 8-2-1 delays speech signal A by the delay time measured by the delay correction unit 8-1 and outputs speech signal A′, which is time-synchronized with the noise-suppressed speech signal C. The delay correction unit 8-2-2 delays noise signal B by the delay time measured by the delay correction unit 8-1 and outputs noise signal B′, which is time-synchronized with the noise-suppressed speech signal C.
As in the first embodiment, the speech section detection unit 7 divides speech signal A′ into short segments and sends the type information of each segment to the switch control unit 9.
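The delay measurement and correction can be sketched as follows. The passage does not state how the delay correction unit 8-1 measures the delay; picking the lag that maximizes the cross-correlation is a common technique and is used here as an assumption. The names `estimate_delay`, `delay_signal`, and `max_lag` are hypothetical.

```python
from typing import List, Sequence


def estimate_delay(ref: Sequence[float], deg: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) of deg relative to ref that maximizes
    their cross-correlation, searching lags 0..max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(r * d for r, d in zip(ref, deg[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_signal(sig: Sequence[float], lag: int) -> List[float]:
    """Delay a signal by prepending `lag` zero samples."""
    return [0.0] * lag + list(sig)


# Toy signals: C is A attenuated and delayed by two samples.
sig_a = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
sig_b = [0.2, -0.1, 0.3, 0.0, -0.2, 0.1, 0.0, 0.1]
sig_c = [0.0, 0.0] + [0.5 * s for s in sig_a]

lag = estimate_delay(sig_a, sig_c, max_lag=4)
a_sync = delay_signal(sig_a, lag)  # corresponds to A'
b_sync = delay_signal(sig_b, lag)  # corresponds to B'
```

The same measured lag is applied to both A and B, matching the way units 8-2-1 and 8-2-2 reuse the delay measured by unit 8-1.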

The switch control unit 9 and the speech concatenation units 10 perform the same processing as in the first embodiment. However, the signal input to the speech concatenation unit 10-2 through switch 22 is noise signal B′ rather than the noise-superimposed speech signal. As a result, the speech concatenation unit 10-2 stores noise signal b, the concatenation of the short segments of noise signal B′ that correspond to all non-speech segments of speech signal A′. Otherwise, as in the first embodiment, the speech concatenation unit 10-1 stores speech signal a, the concatenation of all speech segments of A′; the speech concatenation unit 10-3 stores speech signal c1, the concatenation of the short segments of the noise-suppressed speech signal C that correspond to all speech segments of A′; and the speech concatenation unit 10-4 stores noise signal c2, the concatenation of the short segments of C that correspond to all non-speech segments of A′.

The speech distortion measurement unit 11, the noise distortion measurement unit 12, the volume measurement unit 13, the noise amount measurement unit 14, the speech quality estimation unit 15, and the speech quality output unit 16 perform the same processing as in the first embodiment.
With the above configuration, this embodiment provides the same effects as the first embodiment.

[Fourth Embodiment]
A fourth embodiment of the present invention is described below with reference to the drawings. FIG. 7 is a block diagram showing a configuration example of a noise-suppressed speech quality estimation device according to the fourth embodiment; components that are the same as in FIG. 1 are given the same reference numerals.
The noise-suppressed speech quality estimation device 19 of this embodiment uses a noise database unit 23 in place of the speech recording unit 5 and the microphone 6 of the noise-suppressed speech quality estimation device 17 of the third embodiment.

In this embodiment, a noise signal registered in advance in the noise database unit 23 is taken as B. This noise signal B is input to the speech addition unit 18 and the delay correction unit 8-2-2. Subsequent operation is the same as in the third embodiment.
With the above configuration, this embodiment provides the same effects as the first embodiment. In addition, because a signal registered in advance in the noise database unit 23 is used as noise signal B, quality can be estimated accurately and easily, in line with the user experience of noise-suppressed speech, under a variety of noise environments.

Needless to say, the third and fourth embodiments may also be applied to the second embodiment.
The noise-suppressed speech quality estimation devices of the first to fourth embodiments can be realized by a computer having a CPU, a storage device, and an external interface, together with a program that controls these hardware resources. For such a computer, the noise-suppressed speech quality estimation program that implements the noise-suppressed speech quality estimation method of the present invention is provided on a recording medium such as a flexible disk, CD-ROM, DVD-ROM, or memory card. The CPU writes the program read from the recording medium into the storage device and executes the processing described in the first to fourth embodiments in accordance with the program.

The present invention can be applied to speech quality evaluation techniques.

FIG. 1 is a block diagram showing a configuration example of the noise-suppressed speech quality estimation device according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the switch control when a short segment of the speech signal is classified as a speech segment in the first embodiment of the present invention.
FIG. 3 is a diagram showing the switch control when a short segment of the speech signal is classified as a non-speech segment in the first embodiment of the present invention.
FIG. 4 is a waveform diagram showing the relationship between the speech signal, the noise-superimposed speech signal, and the noise-suppressed speech signal in the first embodiment of the present invention and the speech and noise signals obtained by dividing these signals into short segments and concatenating them.
FIG. 5 is a block diagram showing a configuration example of the noise-suppressed speech quality estimation device according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing a configuration example of the noise-suppressed speech quality estimation device according to the third embodiment of the present invention.
FIG. 7 is a block diagram showing a configuration example of the noise-suppressed speech quality estimation device according to the fourth embodiment of the present invention.
FIG. 8 is a block diagram showing a configuration example of a quality evaluation system for noise-free speech signals using PESQ.
FIG. 9 is a block diagram showing a configuration example of a quality evaluation system for noise-superimposed speech signals using PESQ.

Explanation of symbols

1, 17, 19 ... noise-suppressed speech quality estimation device, 2 ... speech database unit, 3 ... speech playback unit, 4 ... speaker, 5 ... speech recording unit, 6 ... microphone, 7 ... speech section detection unit, 8-1, 8-2-1, 8-2-2 ... delay correction unit, 9 ... switch control unit, 10 ... speech concatenation unit, 11 ... speech distortion measurement unit, 12 ... noise distortion measurement unit, 13 ... volume measurement unit, 14 ... noise amount measurement unit, 15 ... speech quality estimation unit, 16 ... speech quality output unit, 18 ... speech addition unit, 20, 21, 22 ... switch, 23 ... noise database unit.

Claims

A noise-suppressed speech quality estimation device that objectively estimates the quality of noise-suppressed speech,
Detecting means for detecting a feature quantity of a quality factor of a noise-suppressed speech signal output from the noise suppression processing device when a noise-superimposed speech signal is given as an input to the noise suppression processing device to be evaluated;
An estimation unit that estimates the quality of the noise-suppressed speech signal based on the feature amount detected by the detection unit;
The detection means includes
Discriminating means for discriminating whether each section when the noise-suppressed speech signal is divided at a certain time is a speech section or a silent section;
Speech section feature amount detecting means for detecting a feature amount of a quality factor of the noise-suppressed speech signal in a speech section;
A voiceless section feature quantity detecting means for detecting a feature quantity of a quality factor of the noise-suppressed voice signal in a voiceless section,
The voice section feature amount detection means includes:
A speech distortion measuring unit that detects speech distortion as a feature quantity of the quality factor of the noise-suppressed speech signal by comparing the speech signal before noise is superimposed with the noise-suppressed speech signal in the speech section corresponding to that speech signal; and
A volume measuring unit that detects the volume of the noise-suppressed speech signal in the speech section as a feature quantity of the quality factor of the noise-suppressed speech signal;
The silent section feature quantity detecting means is
By comparing the noise-suppressed speech signal in the no-speech section with the corresponding noise-superimposed speech signal or the noise signal that is the basis of this noise-superposed speech signal, the characteristic amount of the quality factor of the noise-suppressed speech signal A noise distortion measurement unit for detecting noise distortion;
A noise amount measuring unit that detects a noise amount of the noise-suppressed speech signal in the no-speech interval as a feature amount of the quality factor of the noise-suppressed speech signal;
The noise distortion is a PESQ value obtained when the noise-suppressed speech signal in the silent period is a degraded speech signal, the noise superimposed speech signal corresponding to the noise suppressed speech signal or the noise signal that is the basis of the noise superimposed speech signal is a reference signal, or Wideband-PESQ value,
The estimation means estimates the quality of the noise-suppressed speech signal based on the speech distortion, the volume of the noise-suppressed speech signal, the noise distortion, and the amount of noise of the noise-suppressed speech signal.

The noise-suppressed speech quality estimation apparatus according to claim 1, further comprising:
A speech database in which speech signals are registered in advance;
Speech playback means for playing back a speech signal from the speech database in an actual call environment; and
Speech recording means for outputting, as the noise-superimposed speech signal, the signal obtained by picking up the played-back speech.

The noise-suppressed speech quality estimation apparatus according to claim 1, further comprising:
A speech database in which speech signals are registered in advance;
Speech recording means for picking up a noise signal in an actual call environment; and
Speech adding means for outputting, as the noise-superimposed speech signal, the signal obtained by adding a speech signal from the speech database and the noise signal picked up by the speech recording means.

The noise-suppressed speech quality estimation apparatus according to claim 1, further comprising:
A speech database in which speech signals are registered in advance;
A noise database in which noise signals are registered in advance; and
Speech adding means for outputting, as the noise-superimposed speech signal, the signal obtained by adding a speech signal from the speech database and a noise signal from the noise database.
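
The speech adding means can be pictured as a sample-by-sample sum of a database speech signal and a noise signal. The Python sketch below is one possible reading: the function names are hypothetical, and the optional scaling of the noise to a target speech-to-noise ratio (`snr_db`) is an added assumption that the claim itself does not require.

```python
import math

def mean_power(samples):
    """Mean-square power of a non-empty list of samples."""
    return sum(s * s for s in samples) / len(samples)

def add_noise(speech, noise, snr_db=None):
    """Return speech + noise, sample by sample.

    The noise is repeated or truncated to the speech length. If snr_db is
    given, the noise is first scaled so that the mix has that
    speech-to-noise ratio (an assumption, not part of the claim).
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    fitted = [noise[i % len(noise)] for i in range(len(speech))]
    gain = 1.0
    if snr_db is not None:
        noise_power = mean_power(fitted)
        if noise_power > 0.0:
            target_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
            gain = math.sqrt(target_power / noise_power)
    return [s + gain * n for s, n in zip(speech, fitted)]
```

Called without `snr_db`, the function performs the plain addition described in the claim; with `snr_db` it allows the same speech and noise material to be reused at several noise levels.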

A noise-suppressed speech quality estimation method for objectively estimating the quality of noise-suppressed speech, comprising:
A detection procedure for detecting a feature quantity of a quality factor of a noise-suppressed speech signal output from the noise suppression processing device when a noise-superimposed speech signal is given as an input to the noise suppression processing device to be evaluated;
An estimation procedure for estimating the quality of the noise-suppressed speech signal based on the feature amount detected by the detection procedure,
The detection procedure includes:
A determination procedure for determining whether each section obtained by dividing the noise-suppressed speech signal at fixed time intervals is a speech section or a silent section;
A speech section feature amount detection procedure for detecting a feature amount of a quality factor of the noise-suppressed speech signal in a speech section;
A silent section feature amount detection procedure for detecting a feature amount of a quality factor of the noise-suppressed speech signal in a silent section,
The speech section feature amount detection procedure includes:
A speech distortion measurement procedure for detecting speech distortion as a feature amount of the quality factor of the noise-suppressed speech signal by comparing the speech signal before the noise is superimposed with the noise-suppressed speech signal in the corresponding speech section; and
A volume measurement procedure for detecting the volume of the noise-suppressed speech signal in the speech section as a feature amount of the quality factor of the noise-suppressed speech signal,
The silent section feature amount detection procedure includes:
A noise distortion measurement procedure for detecting noise distortion as a feature amount of the quality factor of the noise-suppressed speech signal by comparing the noise-suppressed speech signal in the silent section with the corresponding noise-superimposed speech signal or with the noise signal on which that noise-superimposed speech signal is based; and
A noise amount measurement procedure for detecting the noise amount of the noise-suppressed speech signal in the silent section as a feature amount of the quality factor of the noise-suppressed speech signal,
The noise distortion is a PESQ value or Wideband-PESQ value obtained by taking the noise-suppressed speech signal in the silent section as the degraded signal and taking the corresponding noise-superimposed speech signal, or the noise signal on which that noise-superimposed speech signal is based, as the reference signal, and
The estimation procedure estimates the quality of the noise-suppressed speech signal based on the speech distortion, the volume of the noise-suppressed speech signal, the noise distortion, and the noise amount of the noise-suppressed speech signal.
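
The determination procedure in the method above divides the signal into fixed-length sections and labels each as speech or silent. The claim does not state the decision rule, so the following Python sketch assumes a plain energy threshold; `classify_sections`, `split_by_label`, and the default of -40 dB are hypothetical choices for illustration only.

```python
import math

def classify_sections(signal, section_len, threshold_db=-40.0):
    """Split signal into fixed-length sections and label each one.

    Returns a list of (start_index, is_speech) tuples. A section counts as
    speech when its mean-square level exceeds threshold_db (dB re 1.0).
    A trailing partial section is kept and judged on the samples it has.
    """
    if section_len <= 0:
        raise ValueError("section_len must be positive")
    labels = []
    for start in range(0, len(signal), section_len):
        section = signal[start:start + section_len]
        power = sum(s * s for s in section) / len(section)
        level = 10.0 * math.log10(power) if power > 0.0 else float("-inf")
        labels.append((start, level > threshold_db))
    return labels

def split_by_label(signal, section_len, labels):
    """Collect the samples of speech sections and of silent sections."""
    speech, silent = [], []
    for start, is_speech in labels:
        target = speech if is_speech else silent
        target.extend(signal[start:start + section_len])
    return speech, silent
```

The two sample lists returned by `split_by_label` correspond to the inputs of the speech section and silent section feature amount detection procedures, respectively.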

A noise-suppressed speech quality estimation program that causes a computer to operate as a noise-suppressed speech quality estimation apparatus that objectively estimates the quality of noise-suppressed speech, the program causing the computer to execute:
A detection procedure for detecting a feature amount of a quality factor of a noise-suppressed speech signal output from the noise suppression processing device when a noise-superimposed speech signal is given as an input to the noise suppression processing device to be evaluated; and
An estimation procedure for estimating the quality of the noise-suppressed speech signal based on the feature amount detected by the detection procedure,
The detection procedure includes:
A determination procedure for determining whether each section obtained by dividing the noise-suppressed speech signal at fixed time intervals is a speech section or a silent section;
A speech section feature amount detection procedure for detecting a feature amount of a quality factor of the noise-suppressed speech signal in a speech section;
A silent section feature amount detection procedure for detecting a feature amount of a quality factor of the noise-suppressed speech signal in a silent section,
The speech section feature amount detection procedure includes:
A speech distortion measurement procedure for detecting speech distortion as a feature amount of the quality factor of the noise-suppressed speech signal by comparing the speech signal before the noise is superimposed with the noise-suppressed speech signal in the corresponding speech section; and
A volume measurement procedure for detecting the volume of the noise-suppressed speech signal in the speech section as a feature amount of the quality factor of the noise-suppressed speech signal,
The silent section feature amount detection procedure includes:
A noise distortion measurement procedure for detecting noise distortion as a feature amount of the quality factor of the noise-suppressed speech signal by comparing the noise-suppressed speech signal in the silent section with the corresponding noise-superimposed speech signal or with the noise signal on which that noise-superimposed speech signal is based; and
A noise amount measurement procedure for detecting the noise amount of the noise-suppressed speech signal in the silent section as a feature amount of the quality factor of the noise-suppressed speech signal,
The noise distortion is a PESQ value or Wideband-PESQ value obtained by taking the noise-suppressed speech signal in the silent section as the degraded signal and taking the corresponding noise-superimposed speech signal, or the noise signal on which that noise-superimposed speech signal is based, as the reference signal, and
The estimation procedure estimates the quality of the noise-suppressed speech signal based on the speech distortion, the volume of the noise-suppressed speech signal, the noise distortion, and the noise amount of the noise-suppressed speech signal.