JP6163468B2 - Sound quality evaluation apparatus, sound quality evaluation method, and program

Info

Publication number: JP6163468B2
Application number: JP2014170109A
Authority: JP
Inventors: 祥子栗原; 島内　末廣; 末廣島内; 仲大室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-08-25
Filing date: 2014-08-25
Publication date: 2017-07-12
Anticipated expiration: 2034-08-25
Also published as: JP2016046695A

Description

The present invention relates to techniques for evaluating call quality, and more particularly to quality evaluation testing techniques for loudspeaker communication systems.

Conventionally, when a conversational MOS (Mean Opinion Score) value or a listening MOS value is estimated from a PESQ (Perceptual Evaluation of Speech Quality) value, which is an objective evaluation value, it has been necessary to formulate, on the basis of a reference signal, a nonlinear function representing the correspondence between the PESQ value and the conversational or listening MOS value, and to perform a nonlinear conversion based on that function (see, for example, Non-Patent Document 1).

The Telecommunication Technology Committee (TTC): “IP Phone Call Quality Evaluation Method”, JJ-201.01, 5th edition, August 2008.

With this method, converting a PESQ value into an estimate of the conversational MOS or listening MOS requires complicated nonlinear processing, which makes the computation complex.

An object of the present invention is to provide a technique for estimating a MOS value from a PESQ value with a small amount of computation.

In the present invention, a first PESQ value is obtained for a first reference acoustic signal and a first evaluation target acoustic signal that is based on a signal containing the first reference acoustic signal. The first PESQ value is then linearly converted into a first MOS value on the basis of a linear relationship between (a) a second PESQ value corresponding to a second reference acoustic signal and a second evaluation target acoustic signal based on a signal containing the second reference acoustic signal, and (b) a second MOS value based on a five-grade rating of the difference between a reference sound corresponding to the second reference acoustic signal and an evaluation sound corresponding to the second evaluation target acoustic signal.

In the present invention, adopting a MOS value based on a five-grade rating of the difference between the reference sound and the evaluation sound made it possible to approximate the relationship between the PESQ value and the MOS value as linear. As a result, the MOS value can be estimated from the PESQ value with a small amount of computation.
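The linear conversion summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only states that a linear relationship is used, so the least-squares fit, the clipping to the 1 to 5 range, the function names `fit_linear_map` and `pesq_to_mos`, and the calibration numbers are all assumptions introduced here.

```python
def fit_linear_map(pesq_values, mos_values):
    """Least-squares fit of mos ~ slope * pesq + intercept (fit method is an assumption)."""
    n = len(pesq_values)
    mean_p = sum(pesq_values) / n
    mean_m = sum(mos_values) / n
    sxx = sum((p - mean_p) ** 2 for p in pesq_values)
    sxy = sum((p - mean_p) * (m - mean_m) for p, m in zip(pesq_values, mos_values))
    slope = sxy / sxx
    return slope, mean_m - slope * mean_p


def pesq_to_mos(pesq, slope, intercept, lo=1.0, hi=5.0):
    """Linear conversion; clipping to the five-grade scale is an added assumption."""
    return min(hi, max(lo, slope * pesq + intercept))


# Hypothetical calibration pairs (second PESQ values, second MOS values),
# made up purely for illustration.
calib_pesq = [1.0, 2.0, 3.0, 4.0]
calib_mos = [1.5, 2.5, 3.5, 4.5]
slope, intercept = fit_linear_map(calib_pesq, calib_mos)

# Convert a (hypothetical) first PESQ value into a first MOS value.
first_mos = pesq_to_mos(2.2, slope, intercept)
```

Once the slope and intercept are known, each conversion costs one multiplication and one addition, which is the computational saving the text refers to.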

FIG. 1 is a block diagram illustrating the functional configuration of the data generation apparatus of the first embodiment.
FIG. 2 is a conceptual diagram for explaining the data structure generated by the data generation apparatus of the first embodiment.
FIG. 3 is a diagram illustrating an example of the data structure generated by the data generation apparatus of the first embodiment.
FIG. 4 is a block diagram illustrating the functional configuration of the data generation apparatus of the second embodiment.
FIG. 5A is a block diagram illustrating the communication environment simulation processing unit of FIG. 4. FIG. 5B is a block diagram illustrating the signal processing unit of FIG. 4.
FIG. 6 is a block diagram illustrating the functional configuration of the sound quality evaluation apparatus of the third embodiment.
FIG. 7 is a diagram illustrating the display contents in the sound quality evaluation test of the third embodiment.
FIG. 8 is a diagram illustrating the sound quality evaluation method.
FIG. 9 is a diagram illustrating the sound quality evaluation method.
FIG. 10 is a diagram illustrating the sound quality evaluation method.
FIG. 11 is a diagram illustrating the sound quality evaluation method.
FIG. 12 is a diagram illustrating the sound quality evaluation method.
FIG. 13 is a block diagram illustrating the functional configuration of the sound quality evaluation apparatus of the fourth embodiment.
FIG. 14 is a diagram illustrating the relationship between the DMOS value and the PESQ value.
FIG. 15 is a block diagram illustrating the functional configuration of a sound quality evaluation apparatus according to a modification of the fourth embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
<Evaluation test simulating a conversational MOS test in a loudspeaker communication system>
First, an evaluation test that simulates a conversational MOS test in a loudspeaker communication system is described conceptually. In this evaluation test, a near-end speaker and a far-end speaker converse through the loudspeaker communication system, and an evaluator located on the near-end speaker side evaluates the quality of that system. Here, a loudspeaker communication system is a communication system that transmits and receives acoustic signals between terminal devices each equipped with a microphone and a loudspeaker, in which at least part of the sound output from the loudspeaker of a terminal device is picked up by the microphone of that same terminal device (that is, acoustic feedback from loudspeaker to microphone occurs). Examples of loudspeaker communication systems are audio conferencing systems and video conferencing systems.

In the loudspeaker communication system illustrated in FIG. 2, the near-end speaker's voice is picked up by the microphone on the near-end speaker side, the resulting acoustic signal is transmitted over the network to the far-end speaker side, and the sound represented by that signal is output from the loudspeaker on the far-end speaker side. Likewise, sound on the far-end speaker side is picked up by the microphone on the far-end speaker side, the resulting acoustic signal is transmitted over the network to the near-end speaker side, and the sound represented by that signal is output from the loudspeaker on the near-end speaker side. However, at least part of the sound output from the far-end loudspeaker is also picked up by the far-end microphone. That is, the far-end sound picked up by the far-end microphone is the far-end speaker's voice with the feedback (acoustic echo) of the near-end speaker's voice superimposed on it. The acoustic signal transmitted to the near-end speaker side may be derived from a processed signal obtained by applying predetermined “signal processing” to the signal representing the sound picked up by the far-end microphone, or it may be obtained without such signal processing. The “signal processing” may be any processing; an example is processing that includes at least one of echo cancellation and noise cancellation.
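The superposition at the far-end microphone can be pictured with a toy model. This is only a sketch under simplifying assumptions that do not come from the source: the echo path is reduced to a pure delay with a fixed gain, signals are plain lists of samples, and the names `far_end_mic_signal`, `echo_delay`, and `echo_gain` are hypothetical.

```python
def far_end_mic_signal(far_voice, near_voice_received, echo_delay, echo_gain):
    """Toy far-end microphone: far-end voice plus delayed, attenuated near-end voice.

    echo_delay is in samples; a real room response is far richer than this
    single-tap model (assumption made for illustration only).
    """
    length = max(len(far_voice), len(near_voice_received) + echo_delay)
    mic = [0.0] * length
    for i, sample in enumerate(far_voice):
        mic[i] += sample                                # far-end speaker's own voice
    for i, sample in enumerate(near_voice_received):
        mic[i + echo_delay] += echo_gain * sample       # acoustic echo of near-end voice
    return mic


mic = far_end_mic_signal([1.0, 0.0, 0.0], [2.0, 4.0], echo_delay=1, echo_gain=0.5)
```

In this model, sending `mic` back to the near end unchanged corresponds to the case without “signal processing”, and sending an echo-reduced version corresponds to the case with it.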

The evaluator uses a binaural sound reproduction device such as headphones or earphones, listens to the direct sound from the near-end speaker with one ear (for example, the non-dominant ear, e.g., the right ear) and to the sound output from the near-end loudspeaker with the other ear (for example, the dominant ear, e.g., the left ear), and subjectively evaluates the call quality (opinion evaluation). In this embodiment, the channel carrying the direct sound from the near-end speaker is denoted “Rch”, and the channel carrying the sound output from the near-end loudspeaker is denoted “Lch”. As described above, the sound output from the near-end loudspeaker arises as follows: the far-end sound, in which the acoustic echo of the near-end speaker's voice is superimposed on the far-end speaker's voice, is picked up by the far-end microphone; the resulting acoustic signal is transmitted to the near-end side; and it is output from the near-end loudspeaker. Consequently, the acoustic echo component of the near-end speaker's voice contained in the sound output from the near-end loudspeaker lags the direct sound of that voice (by the time the acoustic signal takes to make one round trip between the near-end and far-end sides). Similarly, the far-end speaker's voice component contained in that sound lags the moment at which the far-end speaker's voice was uttered (by the time the acoustic signal takes to travel from the far-end side to the near-end side). Here, the pair consisting of an acoustic signal representing the direct sound from the near-end speaker and an acoustic signal representing the sound output from the near-end loudspeaker when acoustic feedback is present on the far-end side is called a “degraded signal”. In particular, a “degraded signal” to which the above “signal processing” has not been applied is denoted “degraded signal D₁”, and one to which it has been applied is denoted “degraded signal D₂”. For reference, the pair consisting of an acoustic signal representing the direct sound from the near-end speaker and an acoustic signal representing the sound that would be output from the near-end loudspeaker if there were no acoustic feedback on the far-end side is called a “reference signal”. The evaluator subjectively evaluates the call quality by comparing, for example, any combination of “degraded signal D₁”, “degraded signal D₂”, and the “reference signal”.

<Data generation apparatus>
Next, a data generation apparatus that generates the data structures used to carry out the evaluation test simulating a conversational MOS test in a loudspeaker communication system is described by way of example. As illustrated in FIG. 1, the data generation apparatus 1 of this embodiment includes a near-end speaker acoustic signal storage unit 101, a far-end speaker acoustic signal storage unit 102, playback units 103 and 104, loudspeakers 105 and 106, a microphone 107, a time adjustment processing unit 108, a recording processing unit 109, a near-end terminal unit 110, a far-end terminal unit 120, output units 131, 132, 141, 142, 151, and 152, and a data storage unit 180. The far-end terminal unit 120 includes a signal processing unit 121, and the near-end terminal unit 110 and the far-end terminal unit 120 are configured to communicate with each other through a network (NW). At least the loudspeakers 105 and 106 and the microphone 107 are placed in the same room. The data generation apparatus 1 is, for example, an apparatus configured by having one or more general-purpose or dedicated computers execute a predetermined program, the computers being connected to loudspeakers and a microphone and including a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory). Each computer may have a single processor and memory, or multiple processors and memories. The program may be installed on the computer or recorded in advance in ROM or the like. Some or all of the processing units may also be implemented with electronic circuitry that realizes the processing functions on its own, rather than with circuitry such as a CPU that realizes a functional configuration by loading a program. The electronic circuitry constituting a single apparatus may include multiple CPUs.

<Data generation processing>
Next, the data generation processing of this embodiment is described.
As preprocessing, data of a near-end speaker acoustic signal (a first acoustic signal on the first-end side of the system), which represents a sound corresponding to the direct sound of the near-end speaker (the near-end speaker's voice) heard by the evaluator, is stored in the near-end speaker acoustic signal storage unit 101, and data of a far-end speaker acoustic signal (a second acoustic signal on the second-end side of the system), which represents a sound corresponding to the direct sound of the far-end speaker (the far-end speaker's voice), is stored in the far-end speaker acoustic signal storage unit 102. The near-end and far-end speaker acoustic signals of this embodiment are both time-series acoustic signals, obtained, for example, from speech recorded in a soundproof room. This does not limit the invention, however; at least one of the two signals may have been recorded in an ordinary room environment. In this embodiment, no constraint is placed on the utterance timing between the near-end speech represented by the near-end speaker acoustic signal and the far-end speech represented by the far-end speaker acoustic signal (that is, the time of the far-end utterance relative to the near-end utterance, for example, the overlap between near-end and far-end speech). Again, this does not limit the invention, and some constraint may be placed on the utterance timing. There are also no restrictions on who the near-end and far-end speakers are: they may be persons other than the evaluator, or at least one of them may be the same person as the evaluator.

On these premises, the data structures for carrying out the above evaluation test are generated as follows. The playback unit 103 extracts the data of the near-end speaker acoustic signal from the near-end speaker acoustic signal storage unit 101 and outputs the near-end speaker acoustic signal. The near-end speaker acoustic signal output from the playback unit 103 is sent to the output units 131, 141, and 151 and to the near-end terminal unit 110. The output units 131, 141, and 151 output the received near-end speaker acoustic signal (the first acoustic signal on the first-end side of the system) as the Rch data (first-channel data containing the first acoustic signal on the first-end side of the system) of “degraded signal D₁”, “degraded signal D₂”, and the “reference signal”, respectively. The near-end terminal unit 110 transmits the received near-end speaker acoustic signal to the far-end terminal unit 120 via the network. The far-end terminal unit 120 sends the transmitted near-end speaker acoustic signal (a signal derived from the first acoustic signal) to the loudspeaker 105, and the loudspeaker 105 outputs the sound represented by the near-end speaker acoustic signal (a reproduced signal derived from the first acoustic signal sent to the second-end side of the system).

The playback unit 104 extracts the data of the far-end speaker acoustic signal from the far-end speaker acoustic signal storage unit 102 and outputs the far-end speaker acoustic signal. The far-end speaker acoustic signal output from the playback unit 104 is sent to the time adjustment processing unit 108 and the loudspeaker 106. The time adjustment processing unit 108 delays the received far-end speaker acoustic signal and sends it to the output unit 152. The delay τ applied by the time adjustment processing unit 108 simulates the transmission delay B from the far-end terminal unit 120 to the near-end terminal unit 110 and is determined, for example, on the basis of B. For example, τ is set to the transmission delay B itself, a predicted value of B, an average value of B, or an approximate value or corrected value (function value) of any of these. Here, an “approximate value of α” means a value in the range from α−β₁ to α+β₂ inclusive, where β₁ and β₂ are positive values (for example, constants); either β₁=β₂ or β₁≠β₂ may hold. The transmission delay B is about half of the round-trip delay C (the time from when the near-end speaker acoustic signal is transmitted from the near-end terminal unit 110 to the far-end terminal unit 120, through the output of the corresponding sound from the loudspeaker 105 and its pickup by the microphone 107, until the resulting signal has been transmitted from the far-end terminal unit 120 back to the near-end terminal unit 110). The delay τ may therefore be determined on the basis of C. For example, τ may be set to half of C, half of a predicted value of C, half of an average value of C, or a function value of any of these. The delay τ may be a fixed value or may be determined from an actually measured transmission delay B. Note, however, that depending on the network environment the outbound and return delays may differ. Moreover, because B and C change when the near-end terminal unit 110, the far-end terminal unit 120, the signal processing unit 121, or the network environment changes, it is desirable to set τ in accordance with such changes. The output unit 152 outputs the far-end speaker acoustic signal delayed by the time adjustment processing unit 108 (the reference acoustic signal; a second comparison signal based on the second acoustic signal) as the Lch data (second-channel data representing the reference acoustic signal) of the “reference signal”.
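The choice of τ and the delaying of the far-end speaker acoustic signal can be sketched as below. This is an illustration only: the function names, the use of seconds and a sample rate, and the rounding of τ to a whole number of samples are assumptions introduced here, not details given in the text.

```python
def choose_delay(one_way_delay_b=None, round_trip_delay_c=None):
    """Return tau in seconds: B if it is known, otherwise half of C."""
    if one_way_delay_b is not None:
        return one_way_delay_b
    if round_trip_delay_c is not None:
        return round_trip_delay_c / 2.0
    raise ValueError("either B or C must be given")


def is_approximate_value(value, alpha, beta1, beta2):
    """True if value lies in [alpha - beta1, alpha + beta2], with beta1, beta2 > 0."""
    return alpha - beta1 <= value <= alpha + beta2


def delay_signal(samples, tau_seconds, sample_rate_hz):
    """Delay a signal by prepending silence (rounding to whole samples is an assumption)."""
    n = round(tau_seconds * sample_rate_hz)
    return [0.0] * n + list(samples)


tau = choose_delay(round_trip_delay_c=0.3)            # hypothetical measured C of 0.3 s
reference_lch = delay_signal([1.0, 2.0], 0.002, 1000)  # 2 ms at 1 kHz -> 2 samples
```

Because the outbound and return delays can differ, halving C is only an estimate of B; a measured one-way delay, when available, takes precedence in this sketch.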

The loudspeaker 106 outputs the sound (a reproduced signal derived from the second acoustic signal on the second-end side) represented by the received far-end speaker acoustic signal (the second acoustic signal on the second-end side of the system). The sound output from the loudspeaker 105 and the sound output from the loudspeaker 106 are superimposed in the room and picked up by the microphone 107. The picked-up signal obtained by the microphone 107 (a signal based on the signal derived from the first acoustic signal and on the second acoustic signal) is sent to the signal processing unit 121 of the far-end terminal unit 120. The signal processing unit 121 can control whether or not signal processing is applied to the picked-up signal. When signal processing is applied, the signal processing unit 121 processes the picked-up signal to obtain a processed signal, and the far-end terminal unit 120 transmits the processed signal to the near-end terminal unit 110 (the first-end side) via the network. This signal processing may additionally use the near-end speaker acoustic signal transmitted from the near-end terminal unit 110 to the far-end terminal unit 120 via the network (the near-end speaker acoustic signal input to the loudspeaker 105). When signal processing is not applied, the far-end terminal unit 120 transmits the picked-up signal sent to the signal processing unit 121 to the near-end terminal unit 110 (the first-end side) via the network. The signal processing unit 121 also sends, for example, information indicating whether signal processing was applied to the recording processing unit 109. Alternatively, the signal processing unit 121 may process the picked-up signal to obtain a processed signal, which the far-end terminal unit 120 transmits to the near-end terminal unit 110 via the network, and in addition the same picked-up signal as was used for that processing, or a picked-up signal obtained under the same conditions that can be regarded as identical, may be transmitted to the near-end terminal unit 110 via the network. In other words, of two picked-up signals that are identical or can be regarded as identical, one may go through the series of steps with signal processing and the other through the series of steps without it. “The same conditions” means that at least the data generation apparatus 1, the near-end speaker acoustic signal, the far-end speaker acoustic signal, and the utterance timing are the same. The “signal processing” may be any processing; an example is processing that includes at least one of echo cancellation and noise cancellation. Echo cancellation here means processing by an echo canceller in the broad sense, that is, any processing for reducing echo. It may be realized, for example, solely by an echo canceller in the narrow sense using an adaptive filter, by a voice switch, by echo reduction, by a combination of at least some of these techniques, or by a combination with still other techniques (see, for example, IEICE Knowledge Base “Forest of Knowledge”, Group 2, Part 6, Chapter 5, “Acoustic Echo Canceller”). Noise cancellation means processing that suppresses or removes noise components caused by any environmental noise, other than the far-end speaker's voice, arising around the microphone of the far-end terminal. Environmental noise refers to, for example, office air-conditioning noise, noise inside a moving car, traffic noise at an intersection, insect sounds, keyboard typing sounds, and the voices of multiple people (babble), regardless of loudness or of whether it occurs indoors or outdoors.
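As one concrete picture of an echo canceller “in the narrow sense using an adaptive filter”, the sketch below uses a normalized LMS (NLMS) filter. The text does not say which adaptive algorithm the signal processing unit 121 uses, so NLMS, the function name `nlms_echo_cancel`, the tap count, the step size, and the synthetic test signals are all assumptions chosen for illustration.

```python
import random


def nlms_echo_cancel(mic, loudspeaker_ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of loudspeaker_ref from mic (NLMS)."""
    w = [0.0] * taps                      # current estimate of the echo path
    out = []
    for n in range(len(mic)):
        # Most recent reference samples, newest first, zero-padded at the start.
        x = [loudspeaker_ref[n - k] if 0 <= n - k < len(loudspeaker_ref) else 0.0
             for k in range(taps)]
        echo_estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_estimate        # residual sent on to the near end
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out


# Synthetic check: the microphone hears only an echo (gain 0.6, delay 2 samples)
# of a noise-like reference; these signals are made up for illustration.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0, 0.0] + [0.6 * s for s in ref[:-2]]
residual = nlms_echo_cancel(mic, ref)
```

Here `loudspeaker_ref` plays the role of the near-end speaker acoustic signal fed to the loudspeaker 105, which the text says the signal processing may use. In a real double-talk situation the far-end voice is also present in `mic`, and practical cancellers add safeguards that this sketch omits.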

The signal transmitted from the far-end terminal unit 120 via the network (a superimposed signal based on the signal derived from the first acoustic signal and on the second acoustic signal on the second-end side of the system) is input to the near-end terminal unit 110 and sent to the recording processing unit 109. When the signal processing unit 121 is applying signal processing (signal processing ON), the recording processing unit 109 sends the received signal (a superimposed signal derived from the processed signal obtained by applying signal processing to the signal based on the signal derived from the first acoustic signal and on the second acoustic signal) to the output unit 142. The output unit 142 outputs the received signal (evaluation target acoustic signal T₂) as the Lch data (second-channel data containing the superimposed signal) of “degraded signal D₂”. When the signal processing unit 121 is not applying signal processing (signal processing OFF), the recording processing unit 109 sends the received signal (a first comparison signal obtained by sending the picked-up signal to the first-end side) to the output unit 132. The output unit 132 outputs the received signal (evaluation target acoustic signal T₁) as the Lch data (second-channel data containing the superimposed signal) of “degraded signal D₁”.

The pair consisting of the Rch near-end speaker acoustic signal data output from the output unit 131 and the Lch data of evaluation target acoustic signal T₁ output from the output unit 132 is stored in the data storage unit 180 as “degraded signal D₁”. The pair consisting of the Rch near-end speaker acoustic signal data output from the output unit 141 and the Lch data of evaluation target acoustic signal T₂ output from the output unit 142 is stored in the data storage unit 180 as “degraded signal D₂”. The pair consisting of the Rch near-end speaker acoustic signal data output from the output unit 151 and the Lch reference acoustic signal data output from the output unit 152 is stored in the data storage unit 180 as the “reference signal”. Note that the Rch near-end speaker acoustic signals of “degraded signal D₁”, “degraded signal D₂”, and the “reference signal” corresponding to the same time interval are identical to one another. It is therefore not strictly necessary to store a separate copy of the same Rch near-end speaker acoustic signal data in the data storage unit 180 for each of them, although doing so is also permissible.
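The routing by signal-processing state and the storage of the three two-channel pairs can be sketched as follows. The class and function names, the dictionary keys, and the use of Python lists for signals are hypothetical; the sketch takes the option mentioned above of sharing one Rch buffer among the three records rather than storing three copies.

```python
from dataclasses import dataclass


@dataclass
class TwoChannelRecord:
    label: str
    rch: list   # near-end speaker acoustic signal (direct sound)
    lch: list   # signal corresponding to the near-end loudspeaker output


def lch_destination(signal_processing_on):
    """Pick the record whose Lch receives the signal arriving at the recording unit."""
    return "D2" if signal_processing_on else "D1"


def build_records(near_signal, t1, t2, reference_lch):
    """Assemble the three pairs; all of them reference the same Rch object."""
    return {
        "D1": TwoChannelRecord("degraded signal D1", near_signal, t1),
        "D2": TwoChannelRecord("degraded signal D2", near_signal, t2),
        "reference": TwoChannelRecord("reference signal", near_signal, reference_lch),
    }


# Tiny made-up sample values, purely to show the shape of the data.
records = build_records([0.1, 0.2], [0.3, 0.4], [0.05, 0.06], [0.0, 0.2])
```

Sharing the Rch buffer saves storage, while copying it per record keeps each pair self-contained; the text allows either.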

FIG. 3 illustrates the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” obtained as described above. In the example of FIG. 3, the series of processes with signal processing is applied to one of the two identical (or effectively identical) received sound signals described earlier, and the series of processes without signal processing is applied to the other, so that both the “degraded signal D₂” (signal processing performed) and the “degraded signal D₁” (signal processing not performed) are obtained. In the example of FIG. 3, the “signal processing” is processing that includes echo cancellation.

The data structure of the “reference signal” of this embodiment includes Rch data containing the near-end speaker acoustic signal described above (first-channel data containing the first acoustic signal at the first end of the system) and Lch data containing the reference acoustic signal based on the far-end speaker acoustic signal described above (second-channel data containing the second comparison signal based on the second acoustic signal at the second end). The data structure of the “degraded signal D₁” of this embodiment includes Rch data containing the near-end speaker acoustic signal (first-channel data containing the first acoustic signal at the first end of the system) and Lch data containing the evaluation target acoustic signal T₁ (second-channel data containing a superimposed signal based on a signal derived from the first acoustic signal and on the second acoustic signal at the second end of the system). The evaluation target acoustic signal T₁ is the “first comparison signal”, obtained without signal processing. The data structure of the “degraded signal D₂” of this embodiment includes Rch data containing the near-end speaker acoustic signal (first-channel data containing the first acoustic signal at the first end of the system) and Lch data containing the evaluation target acoustic signal T₂ (second-channel data containing a superimposed signal derived from a processed signal obtained by applying signal processing to a signal based on the signal derived from the first acoustic signal and on the second acoustic signal). Both the “Lch data containing the evaluation target acoustic signal T₁” and the “Lch data containing the evaluation target acoustic signal T₂” correspond to “second-channel data containing a superimposed signal based on a signal derived from the first acoustic signal and on the second acoustic signal at the second end of the system”. The “Lch data containing the evaluation target acoustic signal T₂” is, specifically, data whose superimposed signal is derived from a processed signal obtained by applying signal processing to a signal based on the signal derived from the first acoustic signal and on the second acoustic signal.
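The three stored items described above share one layout: an Rch holding the near-end speaker acoustic signal and an Lch holding the signal to be compared. The following is a minimal sketch of that layout, not the patent's actual storage format. The class name `TwoChannelRecord`, the function `build_store`, and the dictionary keys are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TwoChannelRecord:
    """One stored item: Rch is the near-end talker, Lch is the signal under comparison."""
    rch: List[float]  # first channel: near-end speaker acoustic signal
    lch: List[float]  # second channel: reference acoustic signal, T1, or T2


def build_store(near_end: List[float],
                reference: List[float],
                t1: List[float],
                t2: List[float]) -> Dict[str, TwoChannelRecord]:
    # All three records point at the same near-end list object, so the
    # identical Rch data is held only once (storing three copies would
    # also be acceptable according to the description).
    return {
        "reference": TwoChannelRecord(near_end, reference),
        "degraded_D1": TwoChannelRecord(near_end, t1),
        "degraded_D2": TwoChannelRecord(near_end, t2),
    }
```

Sharing one Rch object is only one possible design choice; it mirrors the remark that the identical Rch data need not be stored for each signal.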

As illustrated in FIG. 3, the time interval a-b of the Rch data of the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” contains the same near-end speaker acoustic signal (first acoustic signal). The time interval e-d′ of the Lch data of the “degraded signal D₁” and the “degraded signal D₂” contains the acoustic echo component of the near-end speaker acoustic signal. The acoustic echo component is a signal derived from the near-end speaker acoustic signal (a signal derived from the first acoustic signal), but it lags the near-end speaker acoustic signal by the time interval a-e (delay amount C). The delay amount C corresponds to the time required for the near-end speaker acoustic signal to be transmitted from the near-end terminal unit 110 to the far-end terminal unit 120, for the corresponding sound to be output from the speaker 105 and picked up by the microphone 107, and for the resulting signal to be transmitted from the far-end terminal unit 120 back to the near-end terminal unit 110.

The time interval c-d of the Lch data of the “reference signal” contains a far-end speaker acoustic signal component based on the far-end speaker acoustic signal (component 2-2 based on the second acoustic signal). A far-end speaker acoustic signal component based on the far-end speaker acoustic signal (component 2-1 based on the second acoustic signal) is superimposed on the time interval c′-d′ of the Lch data of the “degraded signal D₁”, and a far-end speaker acoustic signal component based on the far-end speaker acoustic signal (the first component based on the second acoustic signal) is superimposed on the time interval c′-d′ of the Lch data of the “degraded signal D₂”. In the “degraded signal D₁” and the “degraded signal D₂”, there is a time difference a-c′ between the start time a of the Rch near-end speaker acoustic signal and the start time c′ of the Lch far-end speaker acoustic signal component. In the “reference signal”, there is a time difference a-c between the start time a of the Rch near-end speaker acoustic signal and the start time c of the Lch far-end speaker acoustic signal component. The time difference a-c′ in the “degraded signal D₁” and the “degraded signal D₂” corresponds to the sum A + B of the time difference A between the start timing of the near-end speaker acoustic signal and the start timing of the far-end speaker acoustic signal, and the transmission delay amount B incurred while the signal is transmitted from the far-end terminal unit 120 to the near-end terminal unit 110. The time difference a-c in the “reference signal” corresponds to the sum A + τ of the time difference A and the delay amount τ applied by the time adjustment processing unit 108. Because the delay amount τ is determined from the transmission delay amount B as described earlier, τ and B match or are close to each other, so the time difference a-c can be made to match or approximate the time difference a-c′. Consequently, in an evaluation test using this data structure, the time from outputting the near-end speaker acoustic signal on the Rch to outputting the far-end speaker acoustic signal component on the Lch can be made to match or approximate between the “degraded signal D₂” and the “reference signal”. The same holds between the “degraded signal D₁” and the “reference signal”, and between the “degraded signal D₁” and the “degraded signal D₂”. In other words, the superimposed signal contains a first component based on the second acoustic signal, the comparison signal contains a second component (component 2-1 or component 2-2) based on the second acoustic signal, and the time from outputting the first acoustic signal on the first channel to outputting the first component on the second channel can be made to match or approximate the time from outputting the first acoustic signal on the first channel to outputting the second component on the second channel. FIG. 3 illustrates a situation in which the near-end speaker speaks before the far-end speaker, but the far-end speaker may speak first, or the time difference may be a-c′ ≈ 0. For example, when the time difference A between the start timing of the near-end speaker acoustic signal and the start timing of the far-end speaker acoustic signal equals the transmission delay amount B from the far-end terminal unit 120 to the near-end terminal unit 110, the time difference a-c′ = difference A − B ≈ 0 may result. Furthermore, when the far-end speaker starts speaking earlier than the near-end speaker by more than the transmission delay amount B, the positional relationship of the waveforms is reversed, and the start time c′ of the Lch far-end speaker acoustic signal component may precede the start time a of the Rch near-end speaker acoustic signal in the “degraded signal D₁” and the “degraded signal D₂”. Time adjustment can be performed in the same way in these cases as well.
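The timing relationship above reduces to simple arithmetic: the onset gap is A + B in the degraded signals and A + τ in the reference signal, so choosing τ equal (or close) to B aligns them. The sketch below works one case in integer milliseconds. The numeric values and the function names are illustrative assumptions, not figures from the specification.

```python
def onset_gap_degraded(a_gap_ms: int, b_delay_ms: int) -> int:
    """Gap a-c' in a degraded signal: talk-start difference A plus transmission delay B."""
    return a_gap_ms + b_delay_ms


def onset_gap_reference(a_gap_ms: int, tau_ms: int) -> int:
    """Gap a-c in the reference signal: A plus the delay tau of the time adjustment unit."""
    return a_gap_ms + tau_ms


# Hypothetical values chosen only for illustration.
A_MS = 400    # far-end talker starts 400 ms after the near-end talker
B_MS = 120    # transmission delay from far-end terminal to near-end terminal
TAU_MS = 120  # delay applied to the reference signal, set equal to B

gap_degraded = onset_gap_degraded(A_MS, B_MS)
gap_reference = onset_gap_reference(A_MS, TAU_MS)
mismatch_ms = abs(gap_degraded - gap_reference)
print(gap_degraded, gap_reference, mismatch_ms)
```

If τ only approximates B, `mismatch_ms` equals |τ − B|, which is the residual misalignment a listener would experience between the reference and degraded presentations.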

In the data structure described above, the Rch near-end speaker acoustic signal data and the Lch reference acoustic signal data are associated with each other as the “reference signal”; the Rch near-end speaker acoustic signal data and the Lch evaluation target acoustic signal T₁ data are associated with each other as the “degraded signal D₁”; and the Rch near-end speaker acoustic signal data and the Lch evaluation target acoustic signal T₂ data are associated with each other as the “degraded signal D₂”. An evaluation test using this data structure can perform control that outputs the reference acoustic signal on the Lch while outputting the near-end speaker acoustic signal on the Rch, together with control that outputs the evaluation target acoustic signal T₁ on the Lch while outputting the near-end speaker acoustic signal on the Rch. Likewise, it can perform control that outputs the reference acoustic signal on the Lch while outputting the near-end speaker acoustic signal on the Rch, together with control that outputs the evaluation target acoustic signal T₂ on the Lch while outputting the near-end speaker acoustic signal on the Rch. It can also perform control that outputs the evaluation target acoustic signal T₁ on the Lch while outputting the near-end speaker acoustic signal on the Rch, together with control that outputs the evaluation target acoustic signal T₂ on the Lch while outputting the near-end speaker acoustic signal on the Rch. That is, both control that outputs the comparison signal on the second channel while outputting the first acoustic signal on the first channel and control that outputs the superimposed signal on the second channel while outputting the first acoustic signal on the first channel are possible.

During the evaluation test, the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” are played back in some order. The playback sound of the Rch signal of each of these is output from, for example, the right speaker of a binaural sound reproduction device (such as headphones), and the playback sound of the Lch signal is output from, for example, the left speaker of the same device (stereo playback). The evaluator wears the device on both ears, listens to the stereo playback, and subjectively rates the call quality. The evaluator preferably listens to the Lch playback with the dominant ear (for example, the left ear) and to the Rch playback with the non-dominant ear (for example, the right ear). Details of the evaluation test are described in the third embodiment.
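The stereo playback described above can be prepared by interleaving the two channels so that the Lch sample goes to the left ear and the Rch sample to the right ear. The sketch below packs 16-bit little-endian stereo frames with the left sample first, which is the usual ordering for interleaved stereo PCM. The function name `interleave_for_headset`, the sample format, and the assumption that samples lie in [-1, 1] are illustrative choices, not details from the specification.

```python
import struct
from typing import Sequence


def interleave_for_headset(lch: Sequence[float], rch: Sequence[float]) -> bytes:
    """Pack Lch as the left-ear sample and Rch as the right-ear sample of each stereo frame."""
    if len(lch) != len(rch):
        raise ValueError("both channels must have the same length")
    frames = bytearray()
    for left, right in zip(lch, rch):
        # Clip to [-1, 1], then scale to signed 16-bit integers.
        left16 = int(max(-1.0, min(1.0, left)) * 32767)
        right16 = int(max(-1.0, min(1.0, right)) * 32767)
        frames += struct.pack("<hh", left16, right16)  # left first, then right
    return bytes(frames)
```

Swapping the two arguments would route the evaluation target signal to the right ear instead, which may suit an evaluator whose dominant ear is the right one.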

[Modification of First Embodiment]
In the first embodiment, the far-end speaker acoustic signal delayed by the delay amount τ is used as the Lch reference acoustic signal of the “reference signal”. The purpose is to make the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) match or approximate between the “reference signal” and the “degraded signal D₁” and “degraded signal D₂” (for example, to make the time interval a-c and the time interval a-c′ in FIG. 3 match or approximate each other). This purpose can also be achieved by other means. For example, the far-end speaker acoustic signal output from the playback unit 104 may be output from the output unit 152 without delay as the Lch reference acoustic signal of the “reference signal”, and the near-end speaker acoustic signal output from the playback unit 103, advanced in time by τ (that is, shifted in the direction opposite to a delay), may be used as the Rch near-end speaker acoustic signal of the “reference signal”. Alternatively, the far-end speaker acoustic signal output from the playback unit 104 may be delayed by τ − T and output from the output unit 152 as the Lch reference acoustic signal of the “reference signal”, and the near-end speaker acoustic signal output from the playback unit 103, advanced in time by T, may be used as the Rch near-end speaker acoustic signal of the “reference signal”, where T satisfies, for example, 0 ≤ T ≤ τ. Alternatively, the data structure may be one that allows this time interval to be matched or approximated between the “reference signal” and the “degraded signal D₁” and “degraded signal D₂” by processing performed at evaluation test time. For example, a data structure that holds the file names of the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” and the time information of the signals constituting them is sufficient. The data structure may additionally hold information for specifying the delay amount τ. In such cases, the time interval need not be matched or approximated in the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” as stored in the data storage unit 180. In short, any data structure is acceptable as long as the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) can be made to match or approximate between the “reference signal” and the “degraded signal D₁” and “degraded signal D₂” by some method. Furthermore, depending on the environment, the evaluation test may be performed without adjusting this time interval at all. In that case, the data structure may be one for which matching or approximating the time interval is impossible. The data structure may also be one in which this time interval does not match between the “degraded signal D₁” and the “degraded signal D₂”.
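The split-shift variant above (delay the far-end signal by τ − T and advance the near-end signal by T) keeps the relative onset gap fixed for any T with 0 ≤ T ≤ τ. The sketch below checks this on sample indices. The helper names `shift`, `onset`, and `reference_pair`, the zero-padding convention, and the assumption that the near-end signal has at least T leading silent samples are all illustrative.

```python
from typing import List, Sequence, Tuple


def shift(signal: Sequence[float], samples: int) -> List[float]:
    """Positive: delay by prepending zeros. Negative: advance by dropping leading samples."""
    if samples >= 0:
        return [0.0] * samples + list(signal)
    return list(signal)[-samples:]


def onset(signal: Sequence[float]) -> int:
    """Index of the first nonzero sample."""
    return next(i for i, x in enumerate(signal) if x != 0.0)


def reference_pair(near_end: Sequence[float], far_end: Sequence[float],
                   tau: int, t: int) -> Tuple[List[float], List[float]]:
    """Return (Rch, Lch): near-end advanced by t, far-end delayed by tau - t."""
    if not 0 <= t <= tau:
        raise ValueError("t must satisfy 0 <= t <= tau")
    return shift(near_end, -t), shift(far_end, tau - t)


# Hypothetical signals: near-end speech starts at sample 5, far-end at sample 8.
near = [0.0] * 5 + [1.0, 1.0]
far = [0.0] * 8 + [1.0, 1.0]
TAU = 4
gaps = []
for t in (0, 2, 4):
    rch, lch = reference_pair(near, far, TAU, t)
    gaps.append(onset(lch) - onset(rch))
print(gaps)
```

Every choice of T yields the same gap (the original 3-sample lead plus τ = 4), which is why the specification treats these variants as interchangeable.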

[Second Embodiment]
The second embodiment is a modification of the first embodiment in which a data generation device that electrically simulates the communication environment and the room environment generates the data structure used for the evaluation test. The description below focuses on the differences from what has been explained so far. For items already described, the same reference numbers are reused and the description is simplified.

＜Data generation device＞
As illustrated in FIG. 4, the data generation device 2 of this embodiment includes a near-end speaker acoustic signal storage unit 101, a far-end speaker acoustic signal storage unit 102, a time adjustment processing unit 208, a communication environment simulation processing unit 260, a signal processing unit 270, output units 131, 132, 141, 142, 151, and 152, and a data storage unit 180. The data generation device 2 is, for example, a device configured by one or more general-purpose or dedicated computers capable of processing audio signals executing a predetermined program. Some or all of the processing units may instead be implemented with electronic circuits that realize the processing functions on their own.

The communication environment simulation processing unit 260 performs communication environment simulation processing that electrically simulates the communication environment and the surrounding environment (spatial transfer system). This processing includes at least a process of superimposing a signal obtained by applying processing that includes a first time adjustment process to the near-end speaker acoustic signal (first acoustic signal) and a signal obtained by applying processing that includes a second time adjustment process to the far-end speaker acoustic signal (second acoustic signal). The communication environment simulation processing may further include a process of superimposing at least one of a pseudo echo and pseudo noise. For example, as illustrated in FIG. 5A, the communication environment simulation processing unit 260 includes time adjustment processing units 264 and 266, a pseudo echo generation unit 265, an addition unit 267, input units 261 and 262, and an output unit 263. The communication environment simulation processing unit 260 may further include a pseudo noise source 268. The pseudo noise source 268 simulates all environmental noise, other than the far-end speaker's voice, that arises around the microphone of the far-end terminal unit.

The signal processing unit 270 applies predetermined signal processing to the input signal and outputs the result. As in the first embodiment, the “signal processing” may be any processing; one example is processing that includes at least one of echo cancellation and noise cancellation. Echo cancellation here means processing by an echo canceller in the broad sense, that is, any processing that reduces echo. For example, as illustrated in FIG. 5B, the signal processing unit 270 includes input units 271 and 272, an output unit 273, an addition unit 274, an adaptive filter 275, and a time adjustment processing unit 276. The signal processing unit 270 may further include a noise removal unit 278 and a multiplication unit 277. In FIG. 5B the echo canceller is built with the adaptive filter 275, but it may instead be built with a voice switch, echo reduction, or another technique, or with a combination of such a technique and the adaptive filter 275.

Next, the data generation process of this embodiment is described.
As in the first embodiment, as preprocessing, the data of the near-end speaker acoustic signal (first acoustic signal) is first stored in the near-end speaker acoustic signal storage unit 101, and the data of the far-end speaker acoustic signal (second acoustic signal) is stored in the far-end speaker acoustic signal storage unit 102. On this premise, the data structure for the evaluation test described above is generated as follows.

The near-end speaker acoustic signal is read from the near-end speaker acoustic signal storage unit 101 and sent to the output units 131, 141, and 151, to the input unit 262 of the communication environment simulation processing unit 260, and to the input unit 272 of the signal processing unit 270. The far-end speaker acoustic signal is read from the far-end speaker acoustic signal storage unit 102 and input to the time adjustment processing unit 208 and to the input unit 261 of the communication environment simulation processing unit 260.

The output units 131, 141, and 151 output the near-end speaker acoustic signal (first acoustic signal) sent to them as the Rch data (first-channel data containing the first acoustic signal) of the “degraded signal D₁”, the “degraded signal D₂”, and the “reference signal”, respectively.

The communication environment simulation processing unit 260 applies the “communication environment simulation processing” described above to the far-end speaker acoustic signal (second acoustic signal) and the near-end speaker acoustic signal (first acoustic signal) input to the input units 261 and 262, respectively, and outputs the resulting simulated signal from the output unit 263. In the example of FIG. 5A, the far-end speaker acoustic signal input to the input unit 261 is passed to the time adjustment processing unit 266, and the near-end speaker acoustic signal input to the input unit 262 is passed to the time adjustment processing unit 264. The time adjustment processing unit 266 delays the far-end speaker acoustic signal by the delay amount B′ and sends the resulting signal to the addition unit 267 (first time adjustment process). The time adjustment processing unit 264 delays the near-end speaker acoustic signal by the delay amount C′ and sends the delayed near-end speaker acoustic signal to the pseudo echo generation unit 265 (second time adjustment process). The pseudo echo generation unit 265 creates a pseudo echo from the delayed near-end speaker acoustic signal (for example, it generates as the pseudo echo a signal that simulates the spatial transfer system and the waveform distortion at pickup when the near-end speaker acoustic signal (first acoustic signal) is played from the speaker on the far-end speaker side and picked up by the microphone on the far-end speaker side) and sends the resulting signal to the addition unit 267. The addition unit 267 superimposes the signal obtained by the first time adjustment process and the signal obtained via the second time adjustment process. When the pseudo noise source 268 is present, the addition unit 267 may additionally superimpose the pseudo noise signal output from the pseudo noise source 268. The signal obtained by the addition unit 267 is sent to the output unit 263, which outputs it as the simulated signal.
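The FIG. 5A signal flow described above (delay the far-end signal by B′, delay the near-end signal by C′, generate a pseudo echo, then add everything together) can be sketched in plain Python. This sketch assumes the pseudo echo is produced by convolving with a short FIR impulse response and that the pseudo noise is uniform random noise. The specification fixes neither choice, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def delay(signal: Sequence[float], n: int) -> List[float]:
    """Delay a signal by n samples by prepending zeros."""
    return [0.0] * n + list(signal)


def convolve(signal: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Direct-form FIR convolution (stand-in for the room's spatial transfer system)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def simulate_environment(near_end: Sequence[float], far_end: Sequence[float],
                         b_delay: int, c_delay: int,
                         echo_path: Sequence[float],
                         noise_level: float = 0.0, seed: int = 0) -> List[float]:
    far = delay(far_end, b_delay)                          # role of unit 266
    echo = convolve(delay(near_end, c_delay), echo_path)   # roles of units 264 and 265
    rng = random.Random(seed)                              # role of pseudo noise source 268
    mixed = []
    for k in range(max(len(far), len(echo))):              # role of addition unit 267
        f = far[k] if k < len(far) else 0.0
        e = echo[k] if k < len(echo) else 0.0
        mixed.append(f + e + rng.uniform(-noise_level, noise_level))
    return mixed
```

With `noise_level=0.0` the output is deterministic, which makes it easy to inspect where the delayed far-end speech and the pseudo echo land in time.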

The delay amount B′ simulates, for example, the transmission delay amount B of the first embodiment (the transmission delay from the far-end terminal unit 120 to the near-end terminal unit 110). The delay amount C′ simulates, for example, the delay amount C of the first embodiment (the time required for a signal to be transmitted from the near-end terminal unit 110 to the far-end terminal unit 120, for the corresponding sound to be output from the speaker 105 and picked up by the microphone 107, and for the resulting signal to be transmitted from the far-end terminal unit 120 back to the near-end terminal unit 110). It is therefore desirable that B′ < C′ (for example, C′ = 2 × B′). This does not limit the present invention, however: B′ = C′, B′ > C′, or B′ = C′ = 0 is also acceptable.

The simulated signal output from the output unit 263 is input to the output unit 132 and to the input unit 271 of the signal processing unit 270. The output unit 132 outputs the simulated signal (evaluation target acoustic signal T₁, the first comparison signal) as the Lch data of the “degraded signal D₁” (second-channel data containing the superimposed signal).

The signal processing unit 270 uses the simulated signal input to the input unit 271 and the near-end speaker acoustic signal input to the input unit 272, and applies signal processing to the simulated signal to obtain a superimposed signal. In the example of FIG. 5B, echo cancellation is performed by having the adding unit 274 superimpose the simulated signal and the signal obtained by applying the adaptive filter 275 to the near-end speaker acoustic signal delayed by the time adjustment processing unit 276. When the noise removing unit 278 and the multiplying unit 277 are present, noise cancellation is additionally performed, and the superimposed signal is thereby obtained. In one noise cancellation method, for example, the noise estimation unit 278 estimates the stationary noise level of the pseudo noise emitted by the pseudo noise source 268 of FIG. 5A while neither the near-end nor the far-end speaker acoustic signal is present, and the multiplying unit 277 multiplies the output signal of the adding unit 274 by a gain value so that its amplitude is suppressed by the estimated stationary noise level (see, for example, Sumitaka Sakauchi, Yoichi Haneda, Masafumi Tanaka, Junko Sasaki, Akitoshi Kataoka, “Acoustic Echo Canceller with Noise Suppression and Echo Suppression”, IEICE Transactions, Vol. J87-A, No. 4, pp. 448-457 (April 2004)). The obtained superimposed signal is output from the output unit 273. The output unit 273 sends the superimposed signal (a superimposed signal derived from a processed signal obtained by applying signal processing to a signal based on the second acoustic signal and a signal derived from the first acoustic signal) to the output unit 142. The output unit 142 outputs the received superimposed signal (evaluation target acoustic signal T₂) as the Lch data of the “degraded signal D₂” (second-channel data containing the superimposed signal).
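The echo cancellation and noise suppression performed by the signal processing unit 270 can be illustrated with the following minimal Python sketch. It is not taken from the specification: the NLMS update rule, the parameter names (`taps`, `mu`, `eps`), and the simple amplitude-subtraction gain are assumptions chosen for illustration, and the near-end signal is assumed to have already been delayed by the time adjustment processing unit 276.

```python
def nlms_echo_cancel(near_end, mixed, taps=4, mu=0.5, eps=1e-8):
    """Remove the echo of `near_end` from `mixed` with an NLMS filter (assumed algorithm).

    Plays the role of adaptive filter 275 plus adding unit 274: an echo
    replica is estimated from the near-end signal and subtracted.
    """
    weights = [0.0] * taps          # adaptive filter coefficients
    history = [0.0] * taps          # most recent near-end samples, newest first
    residual = []
    for x, d in zip(near_end, mixed):
        history = [x] + history[:-1]
        replica = sum(w * h for w, h in zip(weights, history))
        e = d - replica             # output of the adding unit
        power = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def suppress_noise(signal, noise_level):
    """Multiply each sample by a gain that lowers its amplitude by `noise_level`.

    Hypothetical stand-in for multiplying unit 277; amplitudes below the
    estimated stationary noise level are reduced to zero.
    """
    out = []
    for s in signal:
        amplitude = abs(s)
        gain = max(amplitude - noise_level, 0.0) / amplitude if amplitude > 0 else 0.0
        out.append(gain * s)
    return out
```

In a fuller model the stationary noise level would be measured during an interval in which neither speaker is active, as the text describes; here it is simply passed in as `noise_level`.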

In addition, the time adjustment processing unit 208 delays the input far-end speaker acoustic signal by a delay amount τ′ and sends the delayed far-end speaker acoustic signal to the output unit 152. The delay amount τ′ in this embodiment corresponds, for example, to the delay amount B′ described above. For example, the delay amount B′, or an approximate or corrected value (function value) of B′, is used as τ′. Alternatively, τ′ may correspond to the delay amount C′; for example, τ′ may be C′/2 or a function value of C′/2. Alternatively, τ′ may correspond to both B′ and C′. The output unit 152 outputs the far-end speaker acoustic signal delayed by the time adjustment processing unit 208 (the reference acoustic signal, that is, the second comparison signal based on the second acoustic signal) as the Lch data of the “reference signal” (second-channel data representing the reference acoustic signal).
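The delay applied by the time adjustment processing unit 208 can be sketched as follows. This is only an illustration under stated assumptions: signals are lists of samples, τ′ is expressed as a whole number of samples, and the function names `delay_signal` and `tau_from_c` are hypothetical.

```python
def delay_signal(samples, tau):
    """Return `samples` delayed by `tau` samples.

    Zeros are inserted at the start and the result is truncated to the
    original length, so the delayed signal stays aligned with the other channel.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return ([0.0] * tau + list(samples))[:len(samples)]


def tau_from_c(c_prime):
    """One option named in the text: tau' = C'/2 (integer sample counts assumed)."""
    return c_prime // 2
```

For example, `delay_signal(far_end, tau_from_c(c_prime))` would produce the Lch reference acoustic signal when τ′ is derived from C′.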

The above processing can also yield a data structure such as the one illustrated in FIG. 3. The obtained data structure is stored in the data storage unit 180.

[Modification of Second Embodiment]
In the second embodiment, the delay processing of the time adjustment processing units 208, 264, 266, and 276 makes the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) match or approximate one another across the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” (the time interval a-c and the time interval a-c′ in FIG. 3 match or approximate each other). However, as in the modification of the first embodiment, this can also be achieved by other means. For example, the far-end speaker acoustic signal read from the far-end speaker acoustic signal storage unit 102 may be output from the output unit 152 without delay as the Lch reference acoustic signal of the “reference signal”, and the near-end speaker acoustic signal read from the near-end speaker acoustic signal storage unit 101, advanced in time by τ′, may be used as the Rch near-end speaker acoustic signal of the “reference signal”. In short, it suffices to provide one or more time adjustment processing units that perform at least one of the following:
(1) matching or approximating the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of the “degraded signal D₂” to the output of the far-end speaker acoustic signal component (first component) contained in its Lch evaluation target acoustic signal T₂ (superimposed signal) with the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of the “reference signal” to the output of the far-end speaker acoustic signal component (22nd component) contained in its Lch reference acoustic signal; and
(2) matching or approximating the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of the “degraded signal D₁” to the output of the far-end speaker acoustic signal component (21st component) contained in its Lch evaluation target acoustic signal T₁ with the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of the “reference signal” to the output of the far-end speaker acoustic signal component (22nd component) contained in its Lch reference acoustic signal.
Alternatively, the data structure may be one in which the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) can be matched or approximated across the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” by processing performed at the time of the evaluation test. In short, any data structure suffices as long as this time interval can be matched or approximated across the three signals by some method. Furthermore, depending on the environment, the evaluation test may be carried out without adjusting this time interval. In that case, the data structure may be one in which the time interval cannot be matched or approximated.

[Third Embodiment]
The third embodiment describes a quality evaluation method that uses the data structure generated as described above.

<Sound quality evaluation apparatus>
As illustrated in FIG. 6, the sound quality evaluation apparatus 3 of this embodiment includes a data storage unit 180, a tabulation result storage unit 305, a reproduction control unit 301, a display control unit 302, a tabulation unit 303, a control unit 304, sound output processing units 310-n, display units 320-n, and input units 330-n, where n = 1, ..., N and N is an integer of 1 or more (for example, N is 1 or more and 4 or less). The sound quality evaluation apparatus 3 is configured, for example, by one or more computers of the kind described above, each equipped with a display device (such as a display) and an input device (such as a keyboard or mouse), executing a predetermined program. Some or all of the processing units may instead be implemented with electronic circuits that realize the processing functions on their own.

<Sound quality evaluation processing>
Using the data structure described above, the sound quality evaluation apparatus 3 performs, under the control of the control unit 304, an evaluation test that simulates a conversation MOS test on the loudspeaker communication system described above.

For n = 1, ..., N, Rch (the first channel, for example the right channel), which is one channel of the binaural sound reproduction device 340-n, is connected to the output unit 311-n of the sound output processing unit 310-n, and Lch (the second channel, for example the left channel), which is the other channel of the device, is connected to the output unit 312-n. The binaural sound reproduction device 340-n is a stereo-capable sound reproduction device having a speaker dedicated to one ear that outputs the sound of channel Rch and a speaker dedicated to the other ear that outputs the sound of channel Lch. Specific examples are headphones and earphones. The evaluator 350-n wears the binaural sound reproduction device 340-n, subjectively evaluates the sound output from it in accordance with the content displayed on the display unit 320-n, and enters the evaluation result into the input unit 330-n. It is desirable that the evaluator 350-n wear the speaker that outputs channel Lch on the dominant ear (for example, the left ear) and the speaker that outputs channel Rch on the non-dominant ear (for example, the right ear). These processes are described in detail below.

Under the control of the control unit 304 (described later), the reproduction control unit 301 extracts one of the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂” from the data structure in the data storage unit 180 and sends it to the sound output processing units 310-n (n = 1, ..., N). At this point, processing may be performed to match or approximate the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch). The sound output processing unit 310-n performs the following processing according to the signal it receives. In what follows, the sound represented by the reference acoustic signal of the “reference signal” is called the “reference sound”, and the sounds represented by the evaluation target acoustic signal T₁ of the “degraded signal D₁” and by the evaluation target acoustic signal T₂ of the “degraded signal D₂” are called “evaluation sounds”.

≪When the “reference signal” is sent≫
When the “reference signal” is sent, the sound output processing unit 310-n (n = 1, ..., N) outputs the near-end speaker acoustic signal (first acoustic signal) of the “reference signal” from the output unit 311-n to Rch (the first channel), which is one channel of the binaural sound reproduction device 340-n, while outputting the reference acoustic signal of the “reference signal” from the output unit 312-n to Lch (the second channel), which is the other channel of the device (first processing).

≪When the “degraded signal D₁” is sent≫
When the “degraded signal D₁” is sent, the sound output processing unit 310-n (n = 1, ..., N) outputs the near-end speaker acoustic signal (first acoustic signal) of the “degraded signal D₁” from the output unit 311-n to Rch (the first channel) of the binaural sound reproduction device 340-n, while outputting the evaluation target acoustic signal T₁ of the “degraded signal D₁” (a superimposed signal representing an evaluation sound based on the second acoustic signal and a signal derived from the first acoustic signal) from the output unit 312-n to Lch (the second channel) of the device (second processing).

≪When the “degraded signal D₂” is sent≫
When the “degraded signal D₂” is sent, the sound output processing unit 310-n (n = 1, ..., N) outputs the near-end speaker acoustic signal (first acoustic signal) of the “degraded signal D₂” from the output unit 311-n to Rch (the first channel) of the binaural sound reproduction device 340-n, while outputting the evaluation target acoustic signal T₂ of the “degraded signal D₂” from the output unit 312-n to Lch (the second channel) of the device (second processing). The evaluation target acoustic signal T₂ is a superimposed signal representing an evaluation sound based on the second acoustic signal and a signal derived from the first acoustic signal; this superimposed signal is derived from a processed signal obtained by applying signal processing to a signal based on the second acoustic signal and a signal derived from the first acoustic signal.
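The channel assignment common to the three cases above can be summarized in a short Python sketch. The class name `StereoSignal`, the labels in `name`, and the frame layout are hypothetical and serve only to illustrate that Rch always carries the near-end speaker acoustic signal while Lch carries either the reference acoustic signal or an evaluation target acoustic signal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StereoSignal:
    """One entry of the data structure: a name plus its two channels."""
    name: str          # "reference", "degraded_1" or "degraded_2" (hypothetical labels)
    rch: List[float]   # near-end speaker acoustic signal (first channel)
    lch: List[float]   # reference or evaluation target acoustic signal (second channel)


def to_stereo_frames(signal: StereoSignal) -> List[Tuple[float, float]]:
    """Interleave the channels as (left, right) frames, zero-padding the shorter one."""
    n = max(len(signal.rch), len(signal.lch))
    left = signal.lch + [0.0] * (n - len(signal.lch))
    right = signal.rch + [0.0] * (n - len(signal.rch))
    return list(zip(left, right))
```

The same routine serves the first processing and both variants of the second processing; only the content of `lch` differs.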

Under the control of the control unit 304 (described later), the display control unit 302 sends display information to the display units 320-n (n = 1, ..., N). In accordance with the received display information, the display unit 320-n displays evaluation categories that include three or more categories, each a combination of whether a difference between the reference sound and the evaluation sound is noticeable and one of two or more degrees of difficulty in hearing the evaluation sound. The evaluator 350-n subjectively evaluates the sound output from the binaural sound reproduction device 340-n in accordance with this display. Here, the “reference sound” corresponds to the acoustic signal received from the far-end speaker under ideal conditions. Presenting it together with the “near-end speaker sound”, which corresponds to the direct sound from the near-end speaker, simulates the ideal state of a loudspeaker communication system. Presenting the “near-end speaker sound” at the same time as the “reference acoustic signal” makes it easier to distinguish the wraparound of the near-end speaker's voice (acoustic echo) from the far-end speaker's voice. By always comparing the “evaluation sound” with the “reference sound”, how close to or far from the ideal state the communication system under evaluation is can be evaluated objectively as well as subjectively. If only the “evaluation sound” were presented, hesitations in the far-end speaker's speech, ambient noise around the far-end speaker, and the like would likely be judged as degradation factors and lead to low ratings. Always comparing with the “reference sound” excludes degradation factors other than the communication system from the evaluation, yielding accurate evaluation values with little variation. Moreover, these evaluation categories set criteria not only for the degradation of the evaluation sound relative to the reference sound but also for how hard (or easy) the evaluation sound is to hear. Displaying evaluation categories that combine the degree of degradation from the reference sound with the degree of ease of hearing makes the evaluation criteria clearer than displaying categories that focus on degradation alone, as in conventional DCR (degradation category rating), and reduces evaluation variation even in environments where multiple factors are intricately intertwined. In addition, displaying criteria for how “hard” the evaluation sound is to hear (negative criteria) makes the choices of the evaluator 350-n stricter than displaying criteria for how “easy” it is to hear (positive criteria), which improves evaluation accuracy. This is based on a natural law of physiology.

Preferably, the evaluation categories include four or more categories, each a combination of whether a difference between the reference sound and the evaluation sound is noticeable and one of three or more degrees of difficulty in hearing the evaluation sound. Setting criteria for three or more degrees of difficulty further improves evaluation accuracy. In particular, it is desirable that the evaluation categories include one category indicating that no difference between the reference sound and the evaluation sound is noticeable, and four categories each combining the fact that a difference is noticeable with one of four degrees of difficulty in hearing the evaluation sound. A specific example of the evaluation categories is shown below.
Here, “no noticeable difference from the reference sound”, “there is a difference, but”, and “there is a difference, and” express whether a difference between the reference sound and the evaluation sound is noticeable, while “no problem in hearing”, “slightly hard to hear”, “hard to hear”, and “very hard to hear” express the degree of difficulty in hearing the evaluation sound. Each evaluation category in this example is associated with a rating value from 1 to 5, a larger value indicating higher quality. The categories here assume that the “reference sound” represents the ideal state, but the “evaluation sound” could be rated higher than the “reference sound” owing to, for example, the noise canceller of the communication system under evaluation. In that case, “there is a difference, but it is easy to hear” may be included as a still higher category.
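The five-level scale described above can be written down as a simple lookup, sketched here in Python. The table of categories itself is not reproduced in this text, so the pairing of each rating value with a label is an assumed reconstruction from the phrases quoted in the preceding paragraph, and the names `EVALUATION_CATEGORIES` and `category_label` are hypothetical.

```python
# Rating value -> category label. The wording and pairing are an assumed
# reconstruction; only "1 to 5, larger is better" is stated in the text.
EVALUATION_CATEGORIES = {
    5: "No noticeable difference from the reference sound",
    4: "There is a difference, but no problem in hearing",
    3: "There is a difference, and it is slightly hard to hear",
    2: "There is a difference, and it is hard to hear",
    1: "There is a difference, and it is very hard to hear",
}


def category_label(value):
    """Return the label for a rating value, rejecting anything outside 1..5."""
    if value not in EVALUATION_CATEGORIES:
        raise ValueError(f"rating must be one of {sorted(EVALUATION_CATEGORIES)}")
    return EVALUATION_CATEGORIES[value]
```

If the optional higher category (“there is a difference, but it is easy to hear”) were adopted, it could be added as an extra key above 5.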

The evaluation categories used in conventional DCR (degradation category rating), which focus on degradation alone, are shown below. Compared with the evaluation categories in Table 1, they contain more subjective, introspective expressions.

Furthermore, the display information output by the display control unit 302 may include information for instructing evaluation of the ease of hearing the evaluation sound, and the display unit 320-n may additionally display an instruction to evaluate the ease of hearing the evaluation sound (a display indicating “what to evaluate”). For example, the display unit 320-n may display “Please rate how easy the ‘female voice (left side)’ of the evaluation sound is to hear”. In this example, the left side refers to the output of the Lch (second channel) speaker for the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂”. As described above, the evaluation categories combine whether a difference between the reference sound and the evaluation sound is noticeable with the degree of difficulty in hearing the evaluation sound. Physiologically, humans are sensitive to the presence or absence of differences and can judge whether the reference sound and the evaluation sound differ without paying particular attention. Ease of hearing, by contrast, cannot be evaluated properly without paying attention. On the basis of this natural law, having the display unit 320-n additionally display an instruction to evaluate the ease of hearing the evaluation sound can improve evaluation accuracy and reduce evaluation variation. If the display indicating what to evaluate were instead an instruction to evaluate how “hard” the evaluation sound is to hear, the evaluator 350-n would, physiologically, tend to focus too much on details and rate even degradation that has little effect on ease of hearing. Using an instruction to evaluate how “easy” the evaluation sound is to hear as the display indicating what to evaluate makes the evaluation by the evaluator 350-n appropriate, which can improve evaluation accuracy and reduce evaluation variation.

Furthermore, the display information output by the display control unit 302 may include information indicating what to focus on, and the display unit 320-n may display “what to focus on”. For example, the display unit 320-n may display an instruction to focus on the reference sound during the “first processing” described above and an instruction to focus on the evaluation sound during the “second processing”. For example, the display unit 320-n may display “Reference sound (1): focus on the ‘female voice (left side)’” during the “first processing”, “Evaluation sound (1): focus on the ‘female voice (left side)’” during the “second processing” that outputs the “degraded signal D₁”, and “Evaluation sound (2): focus on the ‘female voice (left side)’” during the “second processing” that outputs the “degraded signal D₂”. This makes the evaluation target clear, directs the attention of the evaluator 350-n to the evaluation target acoustic signal (the far-end speaker acoustic signal side), and keeps it away from the near-end speaker acoustic signal side. In addition, because the “what to focus on” and “what to evaluate” displays on the display unit 320-n change according to the signal output from the sound output processing unit 310-n, the evaluator can visually recognize when the evaluation target acoustic signal occurs.

Having made the subjective evaluation, the evaluator 350-n enters into the input unit 330-n an evaluation value I-n, which is information representing the category selected from the evaluation categories (information representing the evaluation result). FIG. 7 illustrates a display screen 321 displayed by the display unit 320-n. The display screen 321 includes a focus content presentation unit 3211 that displays “what to focus on”, an evaluation instruction presentation unit 3212 that displays “what to evaluate”, an evaluation category presentation unit 3213 that displays the evaluation categories, icons 3214 to 3218 that are touched or clicked to enter the rating values “1” to “5” (evaluation value I-n), and an icon 3219 that is touched or clicked to confirm the input. Following the displays in the focus content presentation unit 3211, the evaluation instruction presentation unit 3212, and the evaluation category presentation unit 3213, the evaluator 350-n subjectively evaluates the sound output from the binaural sound reproduction device 340-n, touches or clicks whichever of the icons 3214 to 3218 corresponds to the rating, and then touches or clicks the confirmation icon 3219. While the icons 3214 to 3219 are active and until the icon 3219 is touched or clicked, the evaluator 350-n can reselect among the icons 3214 to 3218 any number of times. In this way, the evaluation value I-n representing the category selected from the evaluation categories is input to the input unit 330-n. To keep the evaluation conditions identical, it is desirable that the evaluation test described above be carried out simultaneously by all evaluators 350-n (n = 1, ..., N). If an evaluator has not confirmed a rating after a certain time, a screen prompting that evaluator to confirm may be displayed, along with a screen asking the other evaluators to wait.
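The select-then-confirm behaviour of the icons 3214 to 3219 can be modelled as a small state holder, sketched below in Python. The class `EvaluationInput` and its method names are hypothetical; the sketch only illustrates that a rating may be reselected freely until the confirmation icon is pressed.

```python
class EvaluationInput:
    """Holds one evaluator's rating until it is confirmed (hypothetical model)."""

    def __init__(self):
        self.selected = None     # last rating chosen via icons 3214 to 3218
        self.confirmed = False   # True once icon 3219 has been pressed

    def select(self, value):
        """Choose or re-choose a rating; allowed any number of times before confirmation."""
        if self.confirmed:
            raise RuntimeError("rating already confirmed")
        if value not in (1, 2, 3, 4, 5):
            raise ValueError("rating must be an integer from 1 to 5")
        self.selected = value

    def confirm(self):
        """Fix the current selection and return it as the evaluation value."""
        if self.selected is None:
            raise RuntimeError("no rating selected")
        self.confirmed = True
        return self.selected
```

A controller running several evaluators at once could poll `confirmed` on each instance to decide when to show the prompt and wait screens mentioned above.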

The evaluation value I-n entered into the input unit 330-n is sent to the tabulation unit 303. The tabulation unit 303 tabulates the evaluation values I-n and stores the result in the tabulation result storage unit 305. For example, the tabulation result is stored together with an ID representing the evaluator 350-n and with the acoustic signals used in the evaluation test, such as the “degraded signal D₂”, and their conditions. The tabulation result may be the set of evaluation values I-n itself, or the maximum, minimum, mean, variance, and so on for each acoustic signal used in the evaluation test. The maximum, minimum, mean, variance, and so on computed after excluding the evaluation values I-n of any evaluator 350-n whose ratings are in doubt may also be used as the tabulation result. More detailed analysis may additionally be performed by another processing device.
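The tabulation described above can be sketched in Python as follows. The input layout (a mapping from signal name to a mapping from evaluator ID to rating), the function name `tabulate`, and the use of the population variance are assumptions made for illustration; the text does not specify which variance is meant.

```python
from statistics import mean, pvariance


def tabulate(scores, excluded=()):
    """Summarize ratings per acoustic signal.

    scores:   {signal_name: {evaluator_id: rating}}
    excluded: evaluator IDs whose ratings are in doubt and are left out
    """
    result = {}
    for signal, by_evaluator in scores.items():
        values = [v for k, v in by_evaluator.items() if k not in excluded]
        if not values:
            continue  # nothing left to summarize for this signal
        result[signal] = {
            "max": max(values),
            "min": min(values),
            "mean": mean(values),
            "variance": pvariance(values),  # population variance (assumed)
        }
    return result
```

For example, `tabulate({"D2": {1: 4, 2: 2, 3: 3}}, excluded={2})` summarizes only the ratings of evaluators 1 and 3.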

≪Control by the control unit 304≫
Next, the control performed by the control unit 304 is illustrated with reference to FIGS. 8 to 12. The horizontal axis in these figures is the time axis, with later times toward the right of the page. The “Lch” row shows the sound output from the Lch speaker of the binaural sound reproduction device 340-n, and the “Rch” row shows the sound output from the Rch speaker. The “3211” column shows what the focus content presentation unit 3211 presents (what to focus on), the “3212” column shows what the evaluation instruction presentation unit 3212 presents (what to evaluate), and the “3213” column shows what the evaluation category presentation unit 3213 presents (the evaluation categories).

≪図８の例≫
図８の例では、まず、再生制御部３０１がデータ記憶部１８０から「参照信号」を読み込み、それを音響出力処理部３１０−ｎ（ただし、ｎ＝１，・・・，Ｎ）に送る。音響出力処理部３１０−ｎは、出力部３１２−ｎから「参照信号」の基準音響信号を出力し、出力部３１１−ｎから「参照信号」の近端話者音響信号を出力する。これにより、両耳装着型音響再生装置３４０−ｎのＬｃｈからは基準音響信号が表す「基準音」が出力され、Ｒｃｈからは近端話者からの直接音に相当する「近端話者音」が出力される。この際、表示制御部３０２は、着目内容Ｆ_１および評価カテゴリーを表す表示情報を表示部３２０−ｎに送る。なお、着目内容Ｆ_１は、基準音（Ｌｃｈ）に着目する旨の指示を表す内容（例えば「基準音（１）：「女声（左側）」に着目してください」）を意味する。また、評価カテゴリーは、前述の「基準音と評価音との違いが分かるか否かと、評価音の聞き取りにくさについての２段階以上の度合いと、の組み合わせからなる３段階以上のカテゴリーを含む評価カテゴリー」である。表示部３２０−ｎは、着目内容Ｆ_１を着目内容提示部３２１１に提示し、評価カテゴリーを評価カテゴリー提示部３２１３に提示する（ステップＳ１）。 ≪Example of FIG. 8≫
In the example of FIG. 8, the reproduction control unit 301 first reads the “reference signal” from the data storage unit 180 and sends it to the sound output processing units 310-n (where n = 1, ..., N). The sound output processing unit 310-n outputs the reference acoustic signal of the “reference signal” from the output unit 312-n, and outputs the near-end speaker acoustic signal of the “reference signal” from the output unit 311-n. As a result, the “reference sound” represented by the reference acoustic signal is output from the Lch of the binaural-mounted sound reproduction device 340-n, and the “near-end speaker sound”, which corresponds to the direct sound from the near-end speaker, is output from the Rch. At this time, the display control unit 302 sends display information representing the focus content F₁ and the evaluation categories to the display unit 320-n. The focus content F₁ is content representing an instruction to focus on the reference sound (Lch) (for example, “Reference sound (1): please focus on the ‘female voice (left side)’”). The evaluation categories are the aforementioned “evaluation categories including three or more categories, each consisting of a combination of whether or not the difference between the reference sound and the evaluation sound can be perceived and a degree, on two or more levels, of how hard the evaluation sound is to hear”. The display unit 320-n presents the focus content F₁ in the focus content presentation unit 3211 and presents the evaluation categories in the evaluation category presentation unit 3213 (step S1).

次に、再生制御部３０１がデータ記憶部１８０から「劣化信号Ｄ_２」を読み込み、それを音響出力処理部３１０−ｎ（ただし、ｎ＝１，・・・，Ｎ）に送る。音響出力処理部３１０−ｎは、出力部３１２−ｎから「劣化信号Ｄ_２」の評価対象音響信号Ｔ_２を出力し、出力部３１１−ｎから「劣化信号Ｄ_２」の近端話者音響信号を出力する。これにより、両耳装着型音響再生装置３４０−ｎのＬｃｈからは「劣化信号Ｄ_２」の評価対象音響信号Ｔ_２が表す「評価音」が出力され、Ｒｃｈからは近端話者音響信号が表す「近端話者音」が出力される。この際、表示制御部３０２は、着目内容Ｆ_２、評価指示Ｓ_１、および、評価カテゴリーを表す表示情報を表示部３２０−ｎに送る。なお、着目内容Ｆ_２は、評価音（Ｌｃｈ）に着目する旨の指示を表す内容（例えば「評価音（１）：『女声（左側）』に着目してください」）を意味する。評価指示Ｓ_１は、評価音（Ｌｃｈ）の聞き取り易さの評価の指示（例えば「評価音の『女声（左側）』の聞き取り易さ、を評価してください」）を意味する。表示部３２０−ｎは、着目内容Ｆ_２を着目内容提示部３２１１に提示し、評価指示Ｓ_１を評価指示提示部３２１２に提示し、評価カテゴリーを評価カテゴリー提示部３２１３に提示する（ステップＳ２）。 Next, the reproduction control unit 301 reads the “degraded signal D₂” from the data storage unit 180 and sends it to the sound output processing units 310-n (where n = 1, ..., N). The sound output processing unit 310-n outputs the evaluation target acoustic signal T₂ of the “degraded signal D₂” from the output unit 312-n, and outputs the near-end speaker acoustic signal of the “degraded signal D₂” from the output unit 311-n. As a result, the “evaluation sound” represented by the evaluation target acoustic signal T₂ of the “degraded signal D₂” is output from the Lch of the binaural-mounted sound reproduction device 340-n, and the “near-end speaker sound” represented by the near-end speaker acoustic signal is output from the Rch. At this time, the display control unit 302 sends display information representing the focus content F₂, the evaluation instruction S₁, and the evaluation categories to the display unit 320-n. The focus content F₂ is content representing an instruction to focus on the evaluation sound (Lch) (for example, “Evaluation sound (1): please focus on the ‘female voice (left side)’”). The evaluation instruction S₁ is an instruction to evaluate how easy the evaluation sound (Lch) is to hear (for example, “Please evaluate how easy the ‘female voice (left side)’ of the evaluation sound is to hear”). The display unit 320-n presents the focus content F₂ in the focus content presentation unit 3211, presents the evaluation instruction S₁ in the evaluation instruction presentation unit 3212, and presents the evaluation categories in the evaluation category presentation unit 3213 (step S2).

次に、ステップＳ１をもう一度実行し（ステップＳ３）、さらにステップＳ２をもう一度実行する（ステップＳ４）。ステップＳ１、ステップＳ２の繰り返しを３回以上としてもよい。 Next, step S1 is executed once again (step S3), and step S2 is executed again (step S4). Step S1 and step S2 may be repeated three or more times.

その後、アイコン３２１４〜３２１９をアクティブにして、入力部３３０−ｎからの評価値Ｉ−ｎおよび確定の旨の入力を受け付ける（ステップＳ５）。 Thereafter, the icons 3214 to 3219 are activated, and the evaluation value I-n and the input of confirmation are received from the input unit 330-n (step S5).

さらに、ステップＳ１〜Ｓ５の「劣化信号Ｄ_２」を「劣化信号Ｄ_１」に置換し、「評価対象音響信号Ｔ_２」を「評価対象音響信号Ｔ_１」に置換した処理が実行されてもよい。また、評価カテゴリー提示部３２１３の評価カテゴリーの提示はステップＳ１〜Ｓ５を通して継続的に行われてもよいし、各ステップが終了するたびに評価カテゴリーの提示が消えてもよい。 Furthermore, a process in which the “degraded signal D₂” in steps S1 to S5 is replaced with the “degraded signal D₁” and the “evaluation target acoustic signal T₂” is replaced with the “evaluation target acoustic signal T₁” may also be executed. The evaluation category presentation unit 3213 may present the evaluation categories continuously throughout steps S1 to S5, or the presentation may disappear each time a step ends.
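
The sequence of FIG. 8 (reference, evaluation, reference, evaluation, then rating) can be written down as a simple schedule. The sketch below is only an illustration: the function name `fig8_schedule`, the step labels, and the signal names are hypothetical, and it models only the order of playback and rating, not the display control.

```python
def fig8_schedule(degraded_signal="D2", repeats=2):
    """Return the ordered steps of one FIG. 8 style trial as (action, signal) tuples.

    repeats: how many reference/evaluation passes are played before rating.
             The text describes two passes and allows three or more.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    steps = []
    for _ in range(repeats):
        steps.append(("play", "reference"))      # reference sound (steps S1, S3)
        steps.append(("play", degraded_signal))  # evaluation sound (steps S2, S4)
    steps.append(("rate", degraded_signal))      # rating and confirmation (step S5)
    return steps
```

Calling `fig8_schedule("D1")` gives the variant in which the degraded signal D₂ is replaced by the degraded signal D₁.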

≪図９の例≫
図９の例では、「基準音」、評価対象音響信号Ｔ_１が表す「評価音」、および評価対象音響信号Ｔ_２が表す「評価音」のうち、対比を行う一組の音をランダムに選択し、選択した音を順番に出力する。 ≪Example of FIG. 9≫
In the example of FIG. 9, a pair of sounds to be compared is selected at random from among the “reference sound”, the “evaluation sound” represented by the evaluation target acoustic signal T₁, and the “evaluation sound” represented by the evaluation target acoustic signal T₂, and the selected sounds are output one after the other.

以下に処理の具体例を示す。
まず再生制御部３０１は、「参照信号」「劣化信号Ｄ_１」「劣化信号Ｄ_２」から、対比する組をランダムに選択する。対比する組の例は、「参照信号」と「劣化信号Ｄ_１」とからなる組、「参照信号」と「劣化信号Ｄ_２」とからなる組、「劣化信号Ｄ_１」と「劣化信号Ｄ_２」とからなる組である。対比する組を構成する信号のうち、先に出力する信号を「第１出力信号」とよび、後に出力する信号を「第２出力信号」とよぶ。対比する組を構成する信号のうち何れを先に出力してもかまわない。例えば、「参照信号」と「劣化信号Ｄ_１」とからなる組を対比する場合、「参照信号」を「第１出力信号」とし、「劣化信号Ｄ_１」を「第２出力信号」としてもよいし、「参照信号」を「第２出力信号」とし、「劣化信号Ｄ_１」を「第１出力信号」としてもよい。 A specific example of processing is shown below.
First, the reproduction control unit 301 randomly selects a pair to be compared from among the “reference signal”, the “degraded signal D₁”, and the “degraded signal D₂”. Examples of pairs to be compared are the pair consisting of the “reference signal” and the “degraded signal D₁”, the pair consisting of the “reference signal” and the “degraded signal D₂”, and the pair consisting of the “degraded signal D₁” and the “degraded signal D₂”. Of the signals constituting the pair, the signal output first is called the “first output signal”, and the signal output afterwards is called the “second output signal”. Either signal of the pair may be output first. For example, when comparing the pair consisting of the “reference signal” and the “degraded signal D₁”, the “reference signal” may be the “first output signal” and the “degraded signal D₁” the “second output signal”, or the “reference signal” may be the “second output signal” and the “degraded signal D₁” the “first output signal”.
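
As a rough sketch of this selection, the pair can be drawn from the three possible combinations and then ordered. The names `SIGNALS` and `choose_comparison` are hypothetical, and choosing the playback order at random is an assumption of this example; the text only states that either signal may be output first.

```python
import random
from itertools import combinations

# Hypothetical labels for the "reference signal" and the two degraded signals.
SIGNALS = ["reference", "degraded_D1", "degraded_D2"]


def choose_comparison(rng=random):
    """Pick a pair to compare and decide which signal is output first."""
    pair = rng.choice(list(combinations(SIGNALS, 2)))  # one of the three pairs
    first_output, second_output = rng.sample(pair, 2)  # random order (assumption)
    return first_output, second_output
```

Passing a seeded `random.Random` instance as `rng` makes a test session reproducible.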

次に、Ｌｃｈから「第１出力信号」に対応する「基準音または評価音」が出力され、Ｒｃｈから「第１出力信号」に対応する「近端話者音」が出力される（ステップＳ２１）。「第１出力信号」が「参照信号」である場合のステップＳ２１の処理は、前述のステップＳ１と同じである。「第１出力信号」が「劣化信号Ｄ_２」である場合のステップＳ２１の処理は、評価指示Ｓ_１を評価指示提示部３２１２に提示しない以外、前述のステップＳ２と同じである。「第１出力信号」が「劣化信号Ｄ_１」である場合のステップＳ２１の処理は、前述のステップＳ２の処理において「劣化信号Ｄ_２」を「劣化信号Ｄ_１」に置換し、「評価対象音響信号Ｔ_２」を「評価対象音響信号Ｔ_１」に置換し、評価指示Ｓ_１を評価指示提示部３２１２に提示しないこととした処理である。 Next, the “reference sound or evaluation sound” corresponding to the “first output signal” is output from the Lch, and the “near-end speaker sound” corresponding to the “first output signal” is output from the Rch (step S21). When the “first output signal” is the “reference signal”, the process of step S21 is the same as step S1 described above. When the “first output signal” is the “degraded signal D₂”, the process of step S21 is the same as step S2 described above, except that the evaluation instruction S₁ is not presented in the evaluation instruction presentation unit 3212. When the “first output signal” is the “degraded signal D₁”, the process of step S21 is the process of step S2 described above with the “degraded signal D₂” replaced by the “degraded signal D₁”, the “evaluation target acoustic signal T₂” replaced by the “evaluation target acoustic signal T₁”, and the evaluation instruction S₁ not presented in the evaluation instruction presentation unit 3212.

次にＬｃｈから「第２出力信号」に対応する「基準音または評価音」が出力され、Ｒｃｈから「第２出力信号」に対応する「近端話者音」が出力される（ステップＳ２２）。「第２出力信号」が「参照信号」である場合のステップＳ２２の処理は、前述のステップＳ１に加え、評価指示Ｓ_１を評価指示提示部３２１２に提示する処理を行うものである。「第２出力信号」が「劣化信号Ｄ_２」である場合のステップＳ２１の処理は、前述のステップＳ２と同じである。「第２出力信号」が「劣化信号Ｄ_１」である場合のステップＳ２１の処理は、前述のステップＳ２の処理において「劣化信号Ｄ_２」を「劣化信号Ｄ_１」に置換し、「評価対象音響信号Ｔ_２」を「評価対象音響信号Ｔ_１」に置換した処理である。 Next, the “reference sound or evaluation sound” corresponding to the “second output signal” is output from the Lch, and the “near-end speaker sound” corresponding to the “second output signal” is output from the Rch (step S22). When the “second output signal” is the “reference signal”, the process of step S22 performs, in addition to step S1 described above, a process of presenting the evaluation instruction S₁ in the evaluation instruction presentation unit 3212. When the “second output signal” is the “degraded signal D₂”, the process of step S21 is the same as step S2 described above. When the “second output signal” is the “degraded signal D₁”, the process of step S21 is the process of step S2 described above with the “degraded signal D₂” replaced by the “degraded signal D₁” and the “evaluation target acoustic signal T₂” replaced by the “evaluation target acoustic signal T₁”.

最後に、評価値の入力とその確定が行われる（ステップＳ５）。 Finally, the evaluation value is input and confirmed (step S5).

その他、ステップＳ２１，２２の変形例として、Ｌｃｈから出力されている音が「基準音」であるか「評価音」であるかを提示しないこととしてもよい。すなわち、着目内容Ｆ_１および着目内容Ｆ_２に代えて、Ｌｃｈに着目する旨の指示を表す内容（例えば「『女声（左側）』に着目してください」）を提示してもよい。この場合、評価者３５０−ｎは提示されている音が「基準音」であるか「評価音」であるかを知らされることなく、主観評価を行うことになる。 In addition, as a modified example of steps S21 and S22, it may not indicate whether the sound output from the Lch is a “reference sound” or an “evaluation sound”. That is, instead of the focus content F ₁ and the focus content F ₂ , content indicating an instruction to focus on Lch (eg, “focus on“ female voice (left side) ”) may be presented. In this case, the evaluator 350-n performs the subjective evaluation without being notified whether the presented sound is the “reference sound” or the “evaluation sound”.

≪図１０の例≫
図１０の例では、１回目に「基準音」が出力され、２回目および３回目にそれぞれ「隠された基準音」または評価対象音響信号Ｔ_１が表す「評価音」もしくは評価対象音響信号Ｔ_２が表す「評価音」が出力される。ここで、２回目に「隠された基準音」が出力された場合、３回目には評価対象音響信号Ｔ_１が表す「評価音」もしくは評価対象音響信号Ｔ_２が表す「評価音」が出力される（パターン１）。一方、２回目に評価対象音響信号Ｔ_１が表す「評価音」もしくは評価対象音響信号Ｔ_２が表す「評価音」が出力された場合、３回目に「隠された基準音」が出力される（パターン２）。なお、「隠された基準音」とは、「基準音」であることを示さずに出力する「基準音」を意味する。また、パターン１とするかパターン２とするかはランダムに定められる。 ≪Example of FIG. 10≫
In the example of FIG. 10, the “reference sound” is output the first time, and on the second and third times either the “hidden reference sound” or an “evaluation sound” (the one represented by the evaluation target acoustic signal T₁ or the one represented by the evaluation target acoustic signal T₂) is output. If the “hidden reference sound” is output the second time, the “evaluation sound” represented by T₁ or the “evaluation sound” represented by T₂ is output the third time (pattern 1). Conversely, if the “evaluation sound” represented by T₁ or by T₂ is output the second time, the “hidden reference sound” is output the third time (pattern 2). The “hidden reference sound” is a “reference sound” that is output without indicating that it is the “reference sound”. Whether pattern 1 or pattern 2 is used is determined at random.

以下に処理の具体例を示す。 A specific example of processing is shown below.

まず、Ｌｃｈから「参照信号」に対応する「基準音」が出力され、Ｒｃｈから「参照信号」に対応する「近端話者音」が出力される（ステップＳ３１）。ステップＳ３１の処理は、前述のステップＳ２１と同じである。 First, the “reference sound” corresponding to the “reference signal” is output from the Lch, and the “near-end speaker sound” corresponding to the “reference signal” is output from the Rch (step S31). The process in step S31 is the same as that in step S21 described above.

次に、再生制御部３０１は、パターン１とするかパターン２とするかをランダムに選択する。
パターン１が選択された場合、まず、Ｌｃｈから「参照信号」に対応する「隠された基準音」が出力され、Ｒｃｈから「参照信号」に対応する「近端話者音」が出力され（ステップＳ３２）、次に、Ｌｃｈから「劣化信号Ｄ_１」の評価対象音響信号Ｔ_１が表す「評価音」もしくは「劣化信号Ｄ_２」の評価対象音響信号Ｔ_２が表す「評価音」が出力され、Ｒｃｈから「劣化信号Ｄ_１」もしくは「劣化信号Ｄ_２」に対応する「近端話者音」が出力される（ステップＳ３３）。
一方、パターン２が選択された場合、Ｌｃｈから評価対象音響信号Ｔ_１が表す「評価音」もしくは評価対象音響信号Ｔ_２が表す「評価音」が出力され、Ｒｃｈから「劣化信号Ｄ_１」もしくは「劣化信号Ｄ_２」に対応する「近端話者音」が出力され（ステップＳ３２）、次に、Ｌｃｈから「参照信号」に対応する「隠された基準音」が出力され、Ｒｃｈから「参照信号」に対応する「近端話者音」が出力される（ステップＳ３３）。 Next, the playback control unit 301 randomly selects pattern 1 or pattern 2.
When pattern 1 is selected, the “hidden reference sound” corresponding to the “reference signal” is first output from the Lch while the “near-end speaker sound” corresponding to the “reference signal” is output from the Rch (step S32); next, the “evaluation sound” represented by the evaluation target acoustic signal T₁ of the “degraded signal D₁” or the “evaluation sound” represented by the evaluation target acoustic signal T₂ of the “degraded signal D₂” is output from the Lch while the “near-end speaker sound” corresponding to the “degraded signal D₁” or the “degraded signal D₂” is output from the Rch (step S33).
On the other hand, when pattern 2 is selected, the “evaluation sound” represented by the evaluation target acoustic signal T₁ or the “evaluation sound” represented by the evaluation target acoustic signal T₂ is output from the Lch while the “near-end speaker sound” corresponding to the “degraded signal D₁” or the “degraded signal D₂” is output from the Rch (step S32); next, the “hidden reference sound” corresponding to the “reference signal” is output from the Lch while the “near-end speaker sound” corresponding to the “reference signal” is output from the Rch (step S33).

Ｌｃｈから「参照信号」に対応する「隠された基準音」を出力し、Ｒｃｈから「参照信号」に対応する「近端話者音」を出力する処理は、着目内容Ｆ_２に代えて着目内容Ｆ_１を着目内容提示部３２１１に提示し、評価指示Ｓ_１を評価指示提示部３２１２に提示する以外は、前述のステップＳ１と同じである。また、Ｌｃｈから評価対象音響信号Ｔ_１が表す「評価音」もしくは評価対象音響信号Ｔ_２が表す「評価音」を出力し、Ｒｃｈから「劣化信号Ｄ_１」もしくは「劣化信号Ｄ_２」に対応する「近端話者音」を出力する処理は、前述のステップＳ２の処理、またはステップＳ２の処理において「劣化信号Ｄ_２」を「劣化信号Ｄ_１」に置換し、「評価対象音響信号Ｔ_２」を「評価対象音響信号Ｔ_１」に置換した処理と同じである。 The process of outputting the “hidden reference sound” corresponding to the “reference signal” from the Lch and the “near-end speaker sound” corresponding to the “reference signal” from the Rch is the same as step S1 described above, except that the focus content F₁ is presented in the focus content presentation unit 3211 in place of the focus content F₂ and the evaluation instruction S₁ is presented in the evaluation instruction presentation unit 3212. The process of outputting the “evaluation sound” represented by the evaluation target acoustic signal T₁ or by the evaluation target acoustic signal T₂ from the Lch and the “near-end speaker sound” corresponding to the “degraded signal D₁” or the “degraded signal D₂” from the Rch is the same as the process of step S2 described above, or as the process of step S2 with the “degraded signal D₂” replaced by the “degraded signal D₁” and the “evaluation target acoustic signal T₂” replaced by the “evaluation target acoustic signal T₁”.

最後に、評価値の入力とその確定が行われる（ステップＳ５）。ただし、評価者３５０−ｎは、ステップＳ３２，Ｓ３３で出力された音のうち、どちらが評価音かを判断し、評価音と判断した音に対してのみ評価値を入力する。評価音と判断されなかった音については自働的に「隠された基準音」と判断したとみなされ、隠された基準音に対する評価値「５」が付与される。また、評価者３５０−ｎが入力部３３０−ｎに指示入力を行うことにより、ステップＳ５の前に、ステップＳ３１〜Ｓ３３を所望の順序で何度でも実行できる構成であってもよい。 Finally, the evaluation value is input and confirmed (step S5). However, the evaluator 350-n determines which one of the sounds output in steps S32 and S33 is the evaluation sound, and inputs the evaluation value only for the sound determined to be the evaluation sound. A sound that is not judged as an evaluation sound is automatically regarded as a “hidden reference sound” and is given an evaluation value “5” for the hidden reference sound. Further, the evaluator 350-n may input instructions to the input unit 330-n so that steps S31 to S33 can be executed any number of times in a desired order before step S5.
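
The FIG. 10 procedure (explicit reference first, then a hidden reference and an evaluation sound in random order, with the value “5” assigned automatically to the sound not judged to be the evaluation sound) might be sketched like this. The names `build_trial`, `record_scores`, and `HIDDEN_REFERENCE_SCORE`, as well as the signal labels, are assumptions made for the illustration.

```python
import random

# Value assigned automatically to the sound taken to be the hidden reference.
HIDDEN_REFERENCE_SCORE = 5


def build_trial(evaluation_signal, rng=random):
    """Return (pattern, playback order) for playbacks 1 to 3."""
    pattern = rng.choice([1, 2])  # pattern 1 or 2, chosen at random
    if pattern == 1:
        # Step S32: hidden reference, step S33: evaluation sound.
        order = ["reference", "reference", evaluation_signal]
    else:
        # Step S32: evaluation sound, step S33: hidden reference.
        order = ["reference", evaluation_signal, "reference"]
    return pattern, order


def record_scores(judged_position, score):
    """Map playback positions (2 or 3) to evaluation values.

    judged_position: the playback the evaluator judged to be the evaluation sound.
    score: the evaluation value the evaluator entered for that playback.
    """
    if judged_position not in (2, 3):
        raise ValueError("judged_position must be 2 or 3")
    other_position = 3 if judged_position == 2 else 2
    return {judged_position: score, other_position: HIDDEN_REFERENCE_SCORE}
```

Note that `record_scores` follows the evaluator's judgment, not the true pattern, so a misidentified hidden reference still receives the entered value.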

≪図１１の例≫
図１１の例でも、１回目に「基準音」が出力され、２回目および３回目にそれぞれ、ランダムに選択されたパターン１またはパターン２に従い、「隠された基準音」または評価対象音響信号Ｔ_１が表す「評価音」もしくは評価対象音響信号Ｔ_２が表す「評価音」が出力される。ただし、２回目および３回目の出力時にそれぞれに対する評価値が入力され（ステップＳ１３２，Ｓ１３３）、最後に評価値の確定入力のみがなされる（ステップＳ１０５）。なお、評価者３５０−ｎは、ステップＳ１３２，Ｓ１３３で出力された音のうち、「隠された基準音」と判断したほうに評価値「５」を入力し、「評価音」と判断したほうに自らの評価値を入力する。その他の詳細は、図１０の例と同じである。 ≪Example of FIG. 11≫
In the example of FIG. 11 as well, the “reference sound” is output the first time, and on the second and third times, following a randomly selected pattern 1 or pattern 2, either the “hidden reference sound” or an “evaluation sound” (the one represented by the evaluation target acoustic signal T₁ or the one represented by the evaluation target acoustic signal T₂) is output. However, an evaluation value is input for each sound at the time of the second and third outputs (steps S132 and S133), and at the end only the confirmation of the evaluation values is input (step S105). Of the sounds output in steps S132 and S133, the evaluator 350-n inputs the evaluation value “5” for the one judged to be the “hidden reference sound” and inputs his or her own evaluation value for the one judged to be the “evaluation sound”. Other details are the same as in the example of FIG. 10.

≪図１２の例≫
図１２では、１回目に「基準音」が出力され（ステップＳ４１）、２回目からｘ＋１回目（ｘは３以上の整数（例えばｘは１４以下））に「評価音１」から「評価音ｘ」が出力され（ステップＳ４２−１〜Ｓ４２−ｘ）、評価値の入力とその確定が行われる（ステップＳ５）。なお、「評価音１」から「評価音ｘ」は、評価対象音響信号Ｔ_１が表す「評価音」および評価対象音響信号Ｔ_２が表す「評価音」の少なくとも一方、１個の「隠された基準音」、１個以上の「アンカー音」を含む。なお、「アンカー音」とは悪い音響品質の基準となる音を表す。複数のアンカー音を含む場合は、段階的に悪くなる音響品質の基準を用いてよい。また、ステップＳ５では、ステップＳ４２−１〜Ｓ４２−ｘで出力された音それぞれの評価値が入力される。また、「評価音１」から「評価音ｘ」の出力順序はランダムに定められる。ただし、評価者３５０−ｎが入力部３３０−ｎに指示入力を行うことにより、ステップＳ５の前に、ステップＳ４２−１〜Ｓ４２−ｘを所望の順序で何度でも実行できる構成であってもよい。その他は、図１０の例と同様である。 << Example of FIG. 12 >>
In FIG. 12, the “reference sound” is output the first time (step S41), “evaluation sound 1” to “evaluation sound x” are output from the second to the (x+1)-th time (x is an integer of 3 or more, for example 14 or less) (steps S42-1 to S42-x), and then the evaluation values are input and confirmed (step S5). “Evaluation sound 1” to “evaluation sound x” include at least one of the “evaluation sound” represented by the evaluation target acoustic signal T₁ and the “evaluation sound” represented by the evaluation target acoustic signal T₂, one “hidden reference sound”, and one or more “anchor sounds”. An “anchor sound” is a sound that serves as a reference for poor acoustic quality. When multiple anchor sounds are included, references of progressively worse acoustic quality may be used. In step S5, an evaluation value is input for each of the sounds output in steps S42-1 to S42-x. The output order of “evaluation sound 1” to “evaluation sound x” is determined at random. However, the configuration may allow the evaluator 350-n, by inputting instructions to the input unit 330-n, to execute steps S42-1 to S42-x any number of times in any desired order before step S5. Otherwise, this example is the same as the example of FIG. 10.
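
A playlist for the FIG. 12 procedure could be assembled as below. This is a sketch under assumptions: the function name `build_playlist` and the string labels are hypothetical, and the upper bound on x is taken from the example value of 14 mentioned in the text.

```python
import random


def build_playlist(evaluation_sounds, anchor_sounds, rng=random, max_items=14):
    """Return the playback order: explicit reference first, then x shuffled items.

    evaluation_sounds: at least one evaluation sound (e.g. from T1 and/or T2).
    anchor_sounds: at least one anchor sound of poor acoustic quality.
    """
    if not evaluation_sounds:
        raise ValueError("at least one evaluation sound is required")
    if not anchor_sounds:
        raise ValueError("at least one anchor sound is required")

    # "Evaluation sound 1" to "evaluation sound x": evaluation sounds,
    # exactly one hidden reference, and the anchors.
    items = list(evaluation_sounds) + ["hidden_reference"] + list(anchor_sounds)
    if len(items) > max_items:
        raise ValueError("too many items for one trial")

    rng.shuffle(items)            # output order is determined at random
    return ["reference"] + items  # step S41 always plays the explicit reference
```

Because one evaluation sound, one hidden reference, and one anchor are required, the number of shuffled items is always at least 3, matching the condition on x.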

［第４実施形態］
本発明者は、第３実施形態で得られる評価値（基準音響信号に対応する基準音と評価対象音響信号に対応する評価音との違いについての５段階評価に基づくＭＯＳ値（表１に例示））と、ＰＥＳＱ（これらの基準音響信号および評価対象音響信号に対応するＰＥＳＱ値）との関係が線形関係に近似できることを見出した。このようなことは従来知られていない（例えば、非特許文献１の「付図Ｖ−１／ＪＪ−２０１．０１＜ＰＥＳＱ値と受聴ＭＯＳ値の関係の定式化＞」等参照）。本実施形態では、この知見に基づき、線形演算によってＰＥＳＱから煩雑な主観評価や計算量の多い非線形演算を行うことなく、演算量の少ない線形演算でＭＯＳ値を推定できる。以下、詳細に説明する。 [Fourth Embodiment]
The inventor found that the relationship between the evaluation value obtained in the third embodiment (a MOS value based on a five-level evaluation of the difference between the reference sound corresponding to the reference acoustic signal and the evaluation sound corresponding to the evaluation target acoustic signal, as exemplified in Table 1) and the PESQ (the PESQ value corresponding to the same reference acoustic signal and evaluation target acoustic signal) can be approximated by a linear relationship. This was not previously known (see, for example, “Appended Figure V-1 / JJ-201.01 <Formulation of the relationship between PESQ value and listening MOS value>” in Non-Patent Document 1). Based on this finding, the present embodiment can estimate the MOS value from the PESQ value by a linear operation with a small amount of computation, without cumbersome subjective evaluation or computationally expensive nonlinear operations. Details are described below.

図１４は、第３実施形態で例示したように「参照信号」と「劣化信号」とを用いて評価試験を行って得られたＭＯＳ値（ＤＭＯＳ（Degradation MOS）値）と、それらに対応する「基準音響信号」と「評価対象音響信号」とから得られたＰＥＳＱ値との関係を表したグラフである。縦軸はＭＯＳ値（ＤＭＯＳ値）を表し、横軸はＰＥＳＱ値を表す。小さなダイヤ形のマークは主観評価試験による測定値を表し、破線直線上の大きな正方形のマークはそれらの線形関係に基づいた推定値を表す。この図に示すように、第３実施形態で得られたＭＯＳ値とそれに対応するＰＥＳＱ値との関係は線形関係で近似できる。そのため、用意しておいた基準音響信号と評価対象音響信号とからなるリファレンス信号を用い、この線形関係を表す線形関数（一次関数）等を定式化しておけば、新たな基準音響信号と評価対象音響信号とからＰＥＳＱ値を算出し、そのＰＥＳＱ値をこの線形関数に代入してＭＯＳ値を算出できる。 FIG. 14 is a graph showing the relationship between the MOS values (DMOS (Degradation MOS) values) obtained by performing an evaluation test using the “reference signal” and “degraded signals” as exemplified in the third embodiment, and the PESQ values obtained from the corresponding “reference acoustic signals” and “evaluation target acoustic signals”. The vertical axis represents the MOS value (DMOS value), and the horizontal axis represents the PESQ value. The small diamond-shaped marks represent values measured in the subjective evaluation test, and the large square marks on the dashed straight line represent values estimated from their linear relationship. As this figure shows, the relationship between the MOS values obtained in the third embodiment and the corresponding PESQ values can be approximated by a linear relationship. Therefore, if a linear function (first-order function) or the like representing this linear relationship is formulated in advance using reference signals, each consisting of a prepared reference acoustic signal and evaluation target acoustic signal, then a PESQ value can be calculated from a new reference acoustic signal and evaluation target acoustic signal, and a MOS value can be calculated by substituting that PESQ value into the linear function.

＜構成＞
図１３に例示するように、本実施形態の音響品質評価装置４は、ＰＥＳＱ算出部４１および線形変換部４２を有する。音響品質評価装置４は、例えば、前述のような１個以上のコンピュータが所定のプログラムを実行することで構成される装置である。また、単独で処理機能を実現する電子回路を用いて一部またはすべての処理部が構成されてもよい。 <Configuration>
As illustrated in FIG. 13, the acoustic quality evaluation device 4 of this embodiment includes a PESQ calculation unit 41 and a linear conversion unit 42. The acoustic quality evaluation apparatus 4 is an apparatus configured by, for example, one or more computers as described above executing a predetermined program. Further, a part or all of the processing units may be configured using an electronic circuit that realizes a processing function independently.

＜前処理＞
音響品質評価処理の前処理として、基準音響信号と当該基準音響信号を含む信号に基づく評価対象音響信号との組をリファレンス信号として用い、基準音響信号（第２の基準音響信号）と当該基準音響信号を含む信号に基づく評価対象音響信号（第２の評価対象音響信号）とに対応するＰＥＳＱ値（第２のＰＥＳＱ値）と、当該基準音響信号に対応する基準音と当該評価対象音響信号に対応する評価音との違いについての５段階評価に基づくＭＯＳ値（第２のＭＯＳ値）と、の線形関係を求めておく。このとき基準音響信号と当該基準音響信号を含む信号に基づく評価対象音響信号の組については様々な組み合わせを行い、また評価者についても複数人で主観評価試験を実施し、リファレンス信号への依存性や評価者個人差への依存性を軽減する形で、線形関係を統計的に解析する。この解析結果として得た情報が、図１４に示されるＰＥＳＱ値（第２のＰＥＳＱ値）とＭＯＳ値（第２のＭＯＳ値）との線形関係である。このような線形関係を表す情報は線形変換部４２に設定される。「線形関係を表す情報」の例は、この線形関係を表す線形関数Ｆや、この線形関数Ｆを特定するパラメータ等である。線形関数Ｆの例は、ＰＥＳＱ値を入力としてそれに対応するＭＯＳ値を出力する関数であり、例えば、ＭＯＳ値＝α×ＰＥＳＱ値＋βである。なお、αおよびβはパラメータである。 <Pretreatment>
As preprocessing for the acoustic quality evaluation process, pairs of a reference acoustic signal and an evaluation target acoustic signal based on a signal including that reference acoustic signal are used as reference signals to obtain, in advance, the linear relationship between the PESQ value (second PESQ value) corresponding to a reference acoustic signal (second reference acoustic signal) and an evaluation target acoustic signal (second evaluation target acoustic signal) based on a signal including that reference acoustic signal, and the MOS value (second MOS value) based on a five-level evaluation of the difference between the reference sound corresponding to that reference acoustic signal and the evaluation sound corresponding to that evaluation target acoustic signal. Here, various combinations of reference acoustic signal and evaluation target acoustic signal are used, and the subjective evaluation test is carried out with multiple evaluators, so that the linear relationship is analyzed statistically in a way that reduces dependence on the particular reference signals and on individual differences between evaluators. The information obtained from this analysis is the linear relationship between the PESQ value (second PESQ value) and the MOS value (second MOS value) shown in FIG. 14. Information representing this linear relationship is set in the linear conversion unit 42. Examples of “information representing the linear relationship” are a linear function F representing the relationship and parameters specifying the linear function F. An example of the linear function F is a function that takes a PESQ value as input and outputs the corresponding MOS value, for example, MOS value = α × PESQ value + β, where α and β are parameters.
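
One way to obtain the parameters α and β from collected (PESQ value, MOS value) pairs is an ordinary least-squares fit. The text only says that the linear relationship is analyzed statistically, so the use of least squares here is an assumption, and the function name `fit_linear_function` is hypothetical.

```python
def fit_linear_function(pesq_values, mos_values):
    """Fit MOS = alpha * PESQ + beta by ordinary least squares.

    pesq_values, mos_values: equally long sequences of second PESQ values
    and second MOS values gathered in the preprocessing stage.
    Returns the tuple (alpha, beta).
    """
    n = len(pesq_values)
    if n != len(mos_values):
        raise ValueError("both sequences must have the same length")
    if n < 2:
        raise ValueError("at least two data points are required")

    mean_pesq = sum(pesq_values) / n
    mean_mos = sum(mos_values) / n

    # Sum of squared deviations of PESQ and the PESQ/MOS cross term.
    sxx = sum((x - mean_pesq) ** 2 for x in pesq_values)
    if sxx == 0:
        raise ValueError("PESQ values must not all be identical")
    sxy = sum((x - mean_pesq) * (y - mean_mos)
              for x, y in zip(pesq_values, mos_values))

    alpha = sxy / sxx
    beta = mean_mos - alpha * mean_pesq
    return alpha, beta
```

The resulting pair (α, β) is the kind of “information representing the linear relationship” that would be set in the linear conversion unit 42.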

なお、基準音響信号は、第１〜３実施形態で例示したような遠端話者音響信号であってもよいし、その他の音声信号であってもよいし、音楽や背景音等のその他の音響信号であってもよい。評価対象音響信号は、基準音響信号を含む信号に基づくものであればどのようなものでもよい。評価対象音響信号の例は、基準音響信号を含む信号の劣化信号であり、例えば、第１〜３実施形態で例示したような基準音響信号にエコー成分およびノイズ成分の少なくとも一方が重畳した信号である。 The reference acoustic signal may be a far-end speaker acoustic signal as exemplified in the first to third embodiments, another speech signal, or another acoustic signal such as music or background sound. The evaluation target acoustic signal may be any signal that is based on a signal including the reference acoustic signal. An example of the evaluation target acoustic signal is a degraded version of a signal including the reference acoustic signal, for example, a signal in which at least one of an echo component and a noise component is superimposed on the reference acoustic signal, as exemplified in the first to third embodiments.

ＰＥＳＱ算出部４１におけるＰＥＳＱ値の算出方法は周知であり、例えば、「ITU-T Recommendation P.862」等に詳細に記載されている。「ITU-T Recommendation P.862」の記載における「original X(t)」が本発明の基準音響信号に、「degraded signal Y(t)」が本発明の評価対象音響信号に、それぞれ該当する。なお、通常のＰＥＳＱ値の算出処理は、基準音響信号と評価対象音響信号との時間ずれを補正する処理が含まれる。 The calculation method of the PESQ value in the PESQ calculation unit 41 is well known, and is described in detail in “ITU-T Recommendation P.862”, for example. “Original X (t)” in the description of “ITU-T Recommendation P.862” corresponds to the reference acoustic signal of the present invention, and “degraded signal Y (t)” corresponds to the acoustic signal to be evaluated of the present invention. Note that the normal PESQ value calculation processing includes processing for correcting a time lag between the reference acoustic signal and the evaluation target acoustic signal.

基準音響信号に対応する基準音と評価対象音響信号に対応する評価音との違いについての５段階評価に基づくＭＯＳ値は、例えば、受聴された基準音と評価音との違いについて５段階評価（主観評価）の平均値である。５段階評価自体は５段階の評価カテゴリーを表す５つの値の何れかであるが、その平均値であるＭＯＳ値は１以上５以下の範囲に属する何れかの値である。「基準音と評価音との違いについて５段階評価」の内容に限定はない。このような５段階評価の例は、「基準音と評価音との違いが分かるか否かと、評価音の聞き取り易さおよび／または聞き取りにくさについての度合いと、の組み合わせからなる評価カテゴリーについての５段階評価」である。特に、このような５段階評価が「基準音と評価音との違いが分かるか否かと、評価音の聞き取りにくさについての４段階の度合いと、の組み合わせからなる評価カテゴリーについての５段階評価」である場合、より誤差の小さな線形関係が成り立つ。より好ましくは、このような５段階評価が、「基準音と評価音との違いが分からないことを表す１段階のカテゴリーと、基準音と評価音との違いが分かる旨と評価音の聞き取りにくさについての４段階の度合いとの組み合わせからなる４段階のカテゴリーと、を含む評価カテゴリーについての５段階評価」であることが望ましい。なお、「基準音と評価音との違いが分かるか否か」および「評価音の聞き取りにくさについての度合い」の具体例は、第３実施形態に例示した通りである。「評価音の聞き取り易さについての度合い」の具体例は、「聞き取りには問題がない」「少し聞き取り易い」「聞き取り易い」「非常に聞き取り易い」である。また、このような５段階評価に基づくＭＯＳ値は、「評価音の聞き取り易さの評価」を指示して得られた５段階評価に基づくものであることが望ましい。例えば、第３実施形態で例示したように、主観評価試験時に「評価音の『女声（左側）』の聞き取り易さ、を評価してください」等の内容が評価者に提示されて得られた５段階評価に基づくＭＯＳ値であることが望ましい。 The MOS value based on a five-level evaluation of the difference between the reference sound corresponding to the reference acoustic signal and the evaluation sound corresponding to the evaluation target acoustic signal is, for example, the average of five-level (subjective) evaluations of the difference between the reference sound and the evaluation sound as heard. Each five-level evaluation is one of five values representing the five evaluation categories, whereas the MOS value, being their average, is some value in the range from 1 to 5. The content of the “five-level evaluation of the difference between the reference sound and the evaluation sound” is not limited. An example of such a five-level evaluation is a “five-level evaluation over evaluation categories consisting of combinations of whether or not the difference between the reference sound and the evaluation sound can be perceived and a degree of how easy and/or how hard the evaluation sound is to hear”. In particular, when the five-level evaluation is a “five-level evaluation over evaluation categories consisting of combinations of whether or not the difference between the reference sound and the evaluation sound can be perceived and a four-level degree of how hard the evaluation sound is to hear”, a linear relationship with smaller error holds. More preferably, the five-level evaluation is a “five-level evaluation over evaluation categories that include one category indicating that the difference between the reference sound and the evaluation sound cannot be perceived, and four categories each combining the fact that the difference can be perceived with one of four levels of how hard the evaluation sound is to hear”. Specific examples of “whether or not the difference between the reference sound and the evaluation sound can be perceived” and of the “degree of how hard the evaluation sound is to hear” are as given in the third embodiment. Specific examples of the “degree of how easy the evaluation sound is to hear” are “no problem hearing”, “slightly easy to hear”, “easy to hear”, and “very easy to hear”. It is also desirable that the MOS value based on such a five-level evaluation be based on five-level evaluations obtained by instructing the evaluator to “evaluate how easy the evaluation sound is to hear”. For example, as in the third embodiment, the MOS value is desirably based on five-level evaluations obtained by presenting the evaluator, during the subjective evaluation test, with content such as “Please evaluate how easy the ‘female voice (left side)’ of the evaluation sound is to hear”.

＜音響品質評価処理＞
以上の前提のもと、以下のように音響品質評価処理が行われる。まず、ＰＥＳＱ算出部４１は、基準音響信号（第１の基準音響信号）と当該基準音響信号を含む信号に基づく評価対象音響信号（第１の評価対象音響信号）とを入力とし、当該基準音響信号と当該評価対象音響信号とに対するＰＥＳＱ値（第１のＰＥＳＱ値）を得て出力する。このＰＥＳＱ値は線形変換部４２に入力される。線形変換部４２は、上述した線形関係に基づいて、入力されたＰＥＳＱ値を線形変換してＭＯＳの推定値（第１のＭＯＳ値）を得て出力する。例えば、線形変換部４２は、ＰＥＳＱ値を前述の線形関数Ｆに代入して得られた結果をＭＯＳの推定値として出力する。 <Sound quality evaluation process>
Based on the above assumptions, the sound quality evaluation process is performed as follows. First, the PESQ calculation unit 41 receives a reference acoustic signal (first reference acoustic signal) and an evaluation target acoustic signal (first evaluation target acoustic signal) based on a signal including the reference acoustic signal, and inputs the reference acoustic signal. A PESQ value (first PESQ value) for the signal and the evaluation target acoustic signal is obtained and output. This PESQ value is input to the linear conversion unit 42. The linear conversion unit 42 linearly converts the input PESQ value based on the linear relationship described above to obtain and output an estimated MOS value (first MOS value). For example, the linear conversion unit 42 outputs the result obtained by substituting the PESQ value into the above-described linear function F as the MOS estimated value.
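
The conversion step performed by the linear conversion unit 42 amounts to evaluating the linear function F. The sketch below assumes that the PESQ value has already been computed elsewhere (for example by an implementation of ITU-T Recommendation P.862, which is not reproduced here); the class name `LinearConverter` and the parameter values in the usage note are purely illustrative and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearConverter:
    """Holds the parameters of F: MOS = alpha * PESQ + beta."""

    alpha: float
    beta: float

    def estimate_mos(self, pesq_value: float) -> float:
        """Return the estimated MOS value (first MOS value) for a PESQ value."""
        return self.alpha * pesq_value + self.beta
```

For instance, with the made-up parameters `LinearConverter(alpha=1.5, beta=-0.5)`, calling `estimate_mos(2.0)` evaluates 1.5 × 2.0 − 0.5. Only one multiplication and one addition are needed per estimate, which is the computational saving the embodiment aims for.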

[Modification 1 of the fourth embodiment]
As illustrated in FIG. 15, the sound quality evaluation apparatus of this modification includes a PESQ calculation unit 41, a linear conversion unit 42, a far-end speaker acoustic signal storage unit 102, and a data storage unit 180. The PESQ calculation unit 41 reads the far-end speaker acoustic signal from the far-end speaker acoustic signal storage unit 102 as the reference acoustic signal, and reads the evaluation target acoustic signal T_1 corresponding to this far-end speaker acoustic signal from the data storage unit 180 (see FIG. 3). The PESQ calculation unit 41 obtains and outputs a PESQ value for these signals. The subsequent processing is the same as in the fourth embodiment. Note that, instead of reading the far-end speaker acoustic signal from the far-end speaker acoustic signal storage unit 102 as the reference acoustic signal, the PESQ calculation unit 41 may read the reference acoustic signal from the data storage unit 180.

[Modification 2 of the fourth embodiment]
The evaluation target acoustic signal T_1 in Modification 1 of the fourth embodiment may be replaced with the evaluation target acoustic signal T_2. That is, the PESQ calculation unit 41 reads the far-end speaker acoustic signal from the far-end speaker acoustic signal storage unit 102 as the reference acoustic signal, and reads the evaluation target acoustic signal T_2 corresponding to this far-end speaker acoustic signal from the data storage unit 180. The PESQ calculation unit 41 obtains and outputs a PESQ value for these signals. The subsequent processing is the same as in the fourth embodiment.

[Modification 3 of the fourth embodiment]
As illustrated in FIG. 15, the sound quality evaluation apparatus of this modification includes a PESQ calculation unit 41, a linear conversion unit 42, a near-end speaker acoustic signal storage unit 101, a far-end speaker acoustic signal storage unit 102, a data storage unit 180, and a signal processing unit 621. The signal processing unit 621 is a processing unit that performs some kind of “signal processing”. An example of “signal processing” is processing that includes at least one of echo cancellation processing and noise cancellation processing. Alternatively, the “signal processing” may be processing that includes neither echo cancellation processing nor noise cancellation processing. The PESQ calculation unit 41 reads the far-end speaker acoustic signal from the far-end speaker acoustic signal storage unit 102 as the reference acoustic signal. The signal processing unit 621 reads the evaluation target acoustic signal T_1 corresponding to this far-end speaker acoustic signal from the data storage unit 180, and reads the near-end speaker acoustic signal corresponding to the evaluation target acoustic signal T_1 from the near-end speaker acoustic signal storage unit 101 (see FIG. 3). Using these, the signal processing unit 621 performs signal processing on the evaluation target acoustic signal T_1 and sends the resulting signal to the PESQ calculation unit 41 as the evaluation target signal. The PESQ calculation unit 41 obtains and outputs a PESQ value for the input signals. The subsequent processing is the same as in the fourth embodiment.
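The data flow of this modification can be sketched as follows, under the assumption that the linear conversion has the form α × PESQ + β given in the claims. All names (`evaluate_modification_3`, `identity_processing`, `pesq_fn`) are hypothetical. The signal processing and the PESQ computation are injected as callables, and the only concrete processing shown is a pass-through, which the text permits because the “signal processing” may include neither echo cancellation nor noise cancellation.

```python
from typing import Callable, List, Sequence

Signal = Sequence[float]
# Hypothetical interfaces for the injected processing stages.
SignalProcessing = Callable[[Signal, Signal], Signal]
PesqFn = Callable[[Signal, Signal], float]


def identity_processing(target: Signal, near_end: Signal) -> List[float]:
    """Pass-through stand-in for the signal processing unit 621.

    A real implementation might use near_end for echo or noise cancellation;
    here the near-end signal is deliberately ignored.
    """
    return list(target)


def evaluate_modification_3(
    far_end: Signal,
    target_t1: Signal,
    near_end: Signal,
    signal_processing: SignalProcessing,
    pesq_fn: PesqFn,
    alpha: float,
    beta: float,
) -> float:
    """Process T_1, score it against the far-end reference, then convert to MOS."""
    # Signal processing unit 621: operates on T_1 using the near-end signal.
    processed = signal_processing(target_t1, near_end)
    # PESQ calculation unit 41: far-end signal is the reference.
    pesq_value = pesq_fn(far_end, processed)
    # Linear conversion unit 42.
    return alpha * pesq_value + beta
```

Keeping the processing stage injectable makes it possible to compare the estimated MOS with and without a given echo or noise canceller while holding the rest of the flow fixed.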

[Other modifications]
The present invention is not limited to the embodiments described above. For example, the reference signal and the degraded signal may be obtained from acoustic signals other than speech (music, background sound, and so on). The reference signal and the degraded signal also need not be time-series signals. In addition, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. It goes without saying that other modifications are possible without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on the computer implements the above processing functions on the computer. The program describing this processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to that program. Furthermore, each time a program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are implemented solely through execution instructions and result acquisition.

In the above embodiments, the processing functions of the apparatus are implemented by executing a predetermined program on a computer, but at least some of these processing functions may be implemented in hardware.

1, 2 Data generation apparatus
3 to 6 Sound quality evaluation apparatus

Claims

A first process of outputting, while outputting a first acoustic signal to a first channel that is one channel of a binaural sound reproduction device, a signal representing a reference sound corresponding to a second reference acoustic signal to a second channel that is the other channel of the binaural sound reproduction device, and a second process of outputting to the second channel, while outputting the first acoustic signal to the first channel, a second evaluation target acoustic signal that is a superimposed signal representing an evaluation sound based on a signal including a signal derived from the first acoustic signal and the second reference acoustic signal;
A display unit that displays five-step evaluation categories for the difference between the reference sound and the evaluation sound;
An input unit that receives input of information representing a category selected from the evaluation categories;
A PESQ calculation unit that obtains a first PESQ value for a first reference acoustic signal and a first evaluation target acoustic signal based on a signal including the first reference acoustic signal; and
A linear conversion unit that obtains and outputs a first MOS value by linearly converting the first PESQ value based on a linear relationship between a second PESQ value for the second reference acoustic signal and the second evaluation target acoustic signal, and a second MOS value that is obtained from the information representing the selected category and represents a five-step evaluation value for the difference between the reference sound and the evaluation sound;
A sound quality evaluation apparatus comprising the above.

The sound quality evaluation apparatus according to claim 1,
wherein the five-step evaluation categories consist of combinations of whether or not the difference between the reference sound and the evaluation sound is perceptible and the degree of ease and/or difficulty of hearing the evaluation sound.

The sound quality evaluation apparatus according to claim 1 or 2,
wherein the five-step evaluation categories consist of combinations of whether or not the difference between the reference sound and the evaluation sound is perceptible and a four-step degree of difficulty in hearing the evaluation sound.

The sound quality evaluation apparatus according to any one of claims 1 to 3,
wherein the five-step evaluation categories include a one-step category indicating that the difference between the reference sound and the evaluation sound is not perceptible, and a four-step category formed by combining the fact that the difference between the reference sound and the evaluation sound is perceptible with a four-step degree of difficulty in hearing the evaluation sound.

The sound quality evaluation apparatus according to any one of claims 1 to 4,
wherein the display unit displays information instructing evaluation of the ease of hearing of the evaluation sound.

The sound quality evaluation apparatus according to claim 1,
wherein α and β are constants, and
the linear conversion unit outputs, as the first MOS value, the value obtained by computing (the first PESQ value) × α + β.

A first process of outputting, while outputting a first acoustic signal to a first channel that is one channel of a binaural sound reproduction device, a signal representing a reference sound corresponding to a second reference acoustic signal to a second channel that is the other channel of the binaural sound reproduction device, and a second process of outputting to the second channel, while outputting the first acoustic signal to the first channel, a second evaluation target acoustic signal that is a superimposed signal representing an evaluation sound based on a signal including a signal derived from the first acoustic signal and the second reference acoustic signal;
A display step of displaying five-step evaluation categories for the difference between the reference sound and the evaluation sound;
An input step of receiving input of information representing a category selected from the evaluation categories;
A PESQ calculation step of obtaining a first PESQ value for a first reference acoustic signal and a first evaluation target acoustic signal based on a signal including the first reference acoustic signal; and
A linear conversion step of obtaining and outputting a first MOS value by linearly converting the first PESQ value based on a linear relationship between a second PESQ value for the second reference acoustic signal and the second evaluation target acoustic signal, and a second MOS value that is obtained from the information representing the selected category and represents a five-step evaluation value for the difference between the reference sound and the evaluation sound;
A sound quality evaluation method comprising the above steps.

A program for causing a computer to function as the sound quality evaluation apparatus according to any one of claims 1 to 6.