JP2005077970A - Device and method for speech quality objective evaluation - Google Patents


Info

Publication number
JP2005077970A
JP2005077970A (application JP2003311090A)
Authority
JP
Japan
Prior art keywords
point
speech
distortion amount
evaluation
signal
Prior art date
Legal status
Granted
Application number
JP2003311090A
Other languages
Japanese (ja)
Other versions
JP4113481B2 (en)
Inventor
Rei Takahashi
Atsuko Kurashima
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority: JP2003311090A
Publication of JP2005077970A
Application granted
Publication of JP4113481B2
Status: Expired - Fee Related



Landscapes

  • Telephonic Communication Services (AREA)

Abstract

PROBLEM TO BE SOLVED: To accurately estimate the subjective quality of speech in a real telephone call by taking into account the influence that the speaking state of the party evaluating the quality (the receiving side) has on the quality evaluation.

SOLUTION: A distortion amount measurement unit 12 compares the point A speech signal obtained from a speech DB 11 with the degraded speech obtained by passing that signal through an evaluation target system 2, and quantifies the distortion of the degraded speech as a time series. A double-talk section detection unit 13 detects double-talk sections by comparing the point A and point B speech signals obtained from the speech DB 11. A weighting unit 14 computes a weighted average of the distortion time series output by the distortion amount measurement unit 12, applying a smaller weight to the quality degradation in double-talk sections than in single-talk sections.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to speech quality evaluation techniques, and more particularly to a speech quality objective evaluation device and a speech quality objective evaluation method that estimate subjective quality from measurements of the physical features of a speech signal, without conducting a subjective assessment test in which human listeners rate the quality of the speech they hear.

FIG. 7 shows the block configuration of a conventional speech quality objective evaluation device. In FIG. 7, reference numeral 2 denotes the evaluation target system, that is, the system whose transmission-induced degradation of speech quality is to be evaluated, and 4 denotes the speech quality objective evaluation device. Within the device 4, reference numeral 41 denotes a speech database (DB) holding evaluation sound sources, 42 denotes a distortion amount measurement unit that measures the amount of distortion introduced by the evaluation target system 2, and 43 denotes an averaging unit that time-averages the distortion time series measured by the distortion amount measurement unit 42. The evaluation target system 2 may be, for example, a fixed telephone system, a mobile telephone system, or an IP telephone system.

In the conventional speech quality objective evaluation device 4, using the input signal supplied from the speech DB 41 to the evaluation target system 2 and the output signal of the evaluation target system 2 (hereinafter referred to as degraded speech), the distortion amount measurement unit 42 calculates a time series of distortion amounts based on an objective speech quality evaluation algorithm (for example, the objective evaluation method specified in ITU-T Recommendation P.862).

Specifically, the objective evaluation method specified in ITU-T Recommendation P.862 performs frequency spectrum analysis of the evaluation sound source and of the degraded speech that has passed through the evaluation target system 2, computes the difference between the two, and applies weighting based on human auditory characteristics, thereby quantifying the perceived amount of distortion.
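The full P.862 (PESQ) model is elaborate, but the spectral-difference idea it rests on can be sketched in a few lines. The following is a simplified stand-in, not the recommendation itself: a frame-wise log-spectral distance between the reference and the degraded signal, with the frame length and floor constant chosen for illustration.

```python
import numpy as np

def distortion_time_series(reference, degraded, frame=256, hop=128):
    """Frame-wise log-spectral distance between a reference and a
    degraded signal. A crude proxy for the perceptually weighted
    distortion of ITU-T P.862, for illustration only."""
    n = min(len(reference), len(degraded))
    win = np.hanning(frame)
    d = []
    for start in range(0, n - frame + 1, hop):
        ref_spec = np.abs(np.fft.rfft(reference[start:start + frame] * win))
        deg_spec = np.abs(np.fft.rfft(degraded[start:start + frame] * win))
        # Log-magnitude difference; a small floor avoids log(0).
        diff = np.log10(ref_spec + 1e-10) - np.log10(deg_spec + 1e-10)
        d.append(float(np.sqrt(np.mean(diff ** 2))))
    return np.array(d)
```

An undistorted signal yields a distortion series of zeros, and adding noise raises every frame's value, which is the monotonic behavior the averaging step below relies on.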

In this case, only the one-way speech signal of a two-way call is used as the test signal. The averaging unit 43 obtains the final objective evaluation value by time-averaging the distortion time series output by the distortion amount measurement unit 42.

In general, a two-way call (a call between point A and point B) is classified into the following four states, as shown in FIG. 2:
(1) mutual silence section;
(2)-A point A single-talk section;
(2)-B point B single-talk section;
(3) double-talk section (both parties talking).

In the conventional method, focusing on point A for example, the averaging unit 43 shown in FIG. 7 distinguishes point A speech sections (that is, point A single-talk sections and double-talk sections) from point A silent sections (that is, point B single-talk sections and mutual silence sections), and quantifies the degradation consistently with subjective quality characteristics by applying different weights to the distortion amount of the speech signal in the two kinds of section.

The objective speech quality evaluation algorithm used in the distortion amount measurement unit 42 of FIG. 7 is described, for example, in Non-Patent Document 1 below. This algorithm has been adopted as ITU-T Recommendation P.862.
A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), vol. 2, 7-11 May 2001, pp. 749-752.

In practice, however, even within point A speech sections, the sensitivity of the listener at point B to the speech quality of the talker at point A differs between point A single-talk sections and double-talk sections. In a double-talk section, the talker at point B is also speaking, so even if the speech from point A is distorted, it is masked by the point B talker's own voice and the distortion is harder to perceive; as a result, for the same amount of distortion, the subjective quality degradation is smaller than in a point A single-talk section. Because the conventional method does not take this effect into account, the estimated subjective quality is more severe than the subjective quality actually perceived by the talker at point B in a real call, which is a problem in terms of estimation accuracy.

An object of the present invention is to provide a speech quality objective evaluation device and a speech quality objective evaluation method that can accurately estimate the subjective quality of speech in an actual call by taking into account the influence that the speaking state of the party evaluating the quality has on the quality evaluation.

To solve the above problem, the main feature of the present invention is that, when evaluating the speech quality of point A in a call between points A and B (that is, the speech quality perceived by the talker at point B), the speaking state of the talker at point B is taken into account and the distortion amount of the point A talker's speech is weighted accordingly.

That is, in a speech quality objective evaluation device that estimates the subjective quality of a speech signal (the quality a person perceives when listening to the signal) from measurements of its physical features, the present invention reduces the weight applied to the quality degradation in double-talk sections relative to single-talk sections. Furthermore, noting that the double-talk rate is correlated with the end-to-end transmission delay, the double-talk rate of the speech signal used for evaluation is determined from the transmission delay time of the evaluation target system.

The conventional method quantifies the distortion amount and estimates subjective quality based solely on analysis of the point A talker's speech signal; this is the difference from the present invention. Because the present invention evaluates the speech quality of the point A talker as perceived by the point B talker while taking the point B talker's speaking state into account, it enables more accurate subjective quality estimation than the conventional method, which ignores this factor.

According to the speech quality objective evaluation device of the present invention, objective evaluation of speech quality that accounts for the influence of the evaluating talker's speaking state becomes possible, and as a result the subjective quality of speech in a real call can be estimated accurately. In particular, by determining the double-talk rate of the evaluation speech signal according to the transmission delay time and using a speech signal with that double-talk rate as the evaluation sound source, still more accurate subjective quality estimation becomes possible.

FIG. 1 shows the block configuration of a first embodiment of the present invention. In FIG. 1, reference numeral 1 denotes the speech quality objective evaluation device according to the present invention. As mentioned above, the evaluation target system 2 is assumed to be, for example, an IP telephone system, a fixed telephone system, or a mobile telephone system.

The speech quality objective evaluation device 1 comprises: a speech DB 11 holding the speech signals of both points A and B; a distortion amount measurement unit 12 that quantifies the distortion of the degraded speech as a time series by comparing the point A speech signal obtained from the speech DB 11 with the degraded speech obtained by passing that signal through the evaluation target system 2; a double-talk section detection unit 13 that detects double-talk sections by comparing the point A and point B speech signals obtained from the speech DB 11; and a weighting unit 14 that weights the distortion time series output by the distortion amount measurement unit 12 based on the information obtained from the double-talk section detection unit 13.

The speech DB 11 stores the speech signals of the talkers at each point. As a concrete speech signal, the ITU-T Recommendation P.59 artificial conversational speech can be used. This artificial speech signal consists of a two-channel speech signal as shown in FIG. 2.

FIG. 2 illustrates the characteristics of a two-way call. The speech sections at point A and point B are shown hatched. In FIG. 2, section (1) is a mutual silence section, section (2)-A is a point A single-talk section, section (2)-B is a point B single-talk section, and section (3) is a double-talk section.

For the calculation of the distortion amount in the distortion amount measurement unit 12, the algorithm specified in ITU-T Recommendation P.862, for example, is applied. The double-talk section detection unit 13 detects double-talk sections by comparing the powers of the two channel signals (the point A and point B speech signals shown in FIG. 2), and provides this information to the weighting unit 14.
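The power comparison just described can be sketched as a frame-by-frame classifier over the four conversation states of FIG. 2. This is a minimal sketch; the frame length and power threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def double_talk_sections(chan_a, chan_b, frame=160, threshold=1e-4):
    """Classify each frame of a two-channel conversation into one of
    the four states by comparing short-time power per channel."""
    n_frames = min(len(chan_a), len(chan_b)) // frame
    states = []
    for i in range(n_frames):
        pa = np.mean(chan_a[i * frame:(i + 1) * frame] ** 2)
        pb = np.mean(chan_b[i * frame:(i + 1) * frame] ** 2)
        a_on, b_on = pa > threshold, pb > threshold
        if a_on and b_on:
            states.append("double-talk")      # section (3)
        elif a_on:
            states.append("A-single-talk")    # section (2)-A
        elif b_on:
            states.append("B-single-talk")    # section (2)-B
        else:
            states.append("silence")          # section (1)
    return states
```

The list of frame states is exactly the information the weighting unit needs to tell single-talk distortion samples apart from double-talk ones.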

Based on the information obtained from the double-talk section detection unit 13, the weighting unit 14 distinguishes, within the point A speech sections, the point A single-talk sections ((2)-A in FIG. 2) from the double-talk sections ((3) in FIG. 2), and computes a weighted average of the distortion time series output by the distortion amount measurement unit 12.

With Ωs the set of point A single-talk sections used in the evaluation, Ωd the set of double-talk sections, Ωe the set of mutual silence sections, and D(t) (t: time) the distortion time series, the objective evaluation value Y is determined, for example, as follows.

[The equation is published only as images in JP2005077970A and is not reproduced here.]

The specific weighting coefficient α is optimized in advance so that the correlation between subjective evaluation values (training data) obtained from subjective evaluation experiments and the objective evaluation value Y above is maximized.
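The published form of the equation is available only as an image, but a natural reading of the surrounding description is a weighted time average in which double-talk samples of D(t) receive a reduced weight α (0 ≤ α ≤ 1) relative to single-talk samples. A sketch under that assumption — the exact published formula may differ:

```python
def objective_value(d, labels, alpha=0.5):
    """Weighted average of a distortion time series d, where labels[t]
    marks each sample as 'single' (point A single-talk, weight 1),
    'double' (double-talk, weight alpha), or anything else (ignored).
    A hypothetical reading of the patent's equation, for illustration."""
    weights = {"single": 1.0, "double": alpha}
    num = sum(weights[s] * x for x, s in zip(d, labels) if s in weights)
    den = sum(weights[s] for s in labels if s in weights)
    return num / den if den else 0.0
```

Setting alpha = 1 recovers the conventional uniform average over all point A speech sections; alpha < 1 implements the masking-based reduction the invention proposes.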

FIG. 3 shows an example of the speech quality objective evaluation processing flow according to the first embodiment of the present invention. First, the distortion amount measurement unit 12 quantifies the distortion of the degraded speech as a time series by comparing the point A speech signal obtained from the speech DB 11 with the degraded speech obtained by passing it through the evaluation target system 2 (step S1).

The double-talk section detection unit 13 detects double-talk sections by comparing the point A and point B speech signals obtained from the speech DB 11 (step S2). Then, based on the information obtained from the double-talk section detection unit 13, the weighting unit 14 computes the weighted average of the distortion time series output by the distortion amount measurement unit 12 according to the above equation, and calculates the objective evaluation value Y (step S3).

FIG. 4 shows the block configuration of a second embodiment of the present invention. The speech quality objective evaluation device 3 comprises: a test signal database (DB) 31 holding the test signal used to measure the transmission delay time of the evaluation target system 2 between points A and B; a delay time measurement unit 32 that measures the transmission delay time by comparing the time at which the test signal is transmitted from point A with the time at which it is received at point B; a double-talk rate table 34 holding correspondence information between transmission delay time and double-talk rate; a double-talk rate determination unit 33 that determines the double-talk rate by looking up the table 34 with the transmission delay time output by the delay time measurement unit 32; a speech signal generation unit 35 that generates a two-channel speech signal realizing that double-talk rate; a distortion amount measurement unit 36 that quantifies the distortion of the degraded speech as a time series by comparing the point A speech signal with the degraded speech obtained by passing it through the evaluation target system; a double-talk section detection unit 37 that detects double-talk sections by comparing the point A and point B speech signals obtained from the speech signal generation unit 35; and a weighting unit 38 that weights the distortion time series output by the distortion amount measurement unit 36 based on the information obtained from the double-talk section detection unit 37.

The first embodiment assumed a fixed double-talk rate and used speech signals realizing that rate, stored in advance in the speech DB 11 shown in FIG. 1. In general, however, the longer the transmission delay between points A and B, the harder conversation becomes and the more likely speech collisions are. That is, the longer the transmission delay time, the higher the double-talk rate.

Accordingly, in this embodiment, the relationship between transmission delay time and double-talk rate is prepared in advance as a table (the double-talk rate table 34), and an appropriate double-talk rate is determined based on the result of measuring the transmission delay time.

FIG. 5 shows an example data configuration of the double-talk rate table 34. As shown in FIG. 5, the table stores correspondence information between transmission delay time (msec) and double-talk rate (%). This table can be created by conducting conversation experiments with the transmission delay time as a parameter, analyzing the double-talk rate of the recorded conversational speech, examining the correspondence between the two, and tabulating the results.
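The table lookup itself is a simple step function over the measured delay. FIG. 5 is not reproduced here, so the delay/rate pairs below are illustrative placeholders, not the patent's experimental values; only the lookup mechanism is the point.

```python
import bisect

# Hypothetical delay / double-talk-rate pairs standing in for FIG. 5.
DELAY_MS = [0, 100, 200, 400, 800]    # transmission delay (msec)
DOUBLE_TALK_PCT = [5, 6, 8, 12, 20]   # double-talk rate (%)

def double_talk_rate(delay_ms):
    """Return the tabulated double-talk rate for a measured delay,
    using the largest tabulated delay not exceeding the measurement."""
    i = bisect.bisect_right(DELAY_MS, delay_ms) - 1
    return DOUBLE_TALK_PCT[max(i, 0)]
```

Interpolating between table rows instead of stepping would be an equally plausible reading; the patent only requires that the mapping be fixed in advance.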

To obtain a speech signal realizing the double-talk rate determined by the double-talk rate determination unit 33, the speech signal generation unit 35 generates an artificial two-channel speech signal according to an algorithm conforming, for example, to ITU-T Recommendation P.59. In this generation, the state transition probabilities among the four states (talker A single-talk, talker B single-talk, double-talk, and mutual silence) are set so as to satisfy the target double-talk rate, and the artificial speech signal for each state is generated using the method defined in ITU-T Recommendation P.50, so that speech features such as the long-term average spectrum, instantaneous amplitude distribution, and pitch frequency have average characteristics.
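The four-state structure described above is a Markov chain whose transition probabilities control the double-talk rate. The sketch below generates only the state sequence (not the speech itself), with illustrative transition probabilities that are assumptions, not the P.59 parameters; raising the barge-in probability raises the fraction of double-talk frames.

```python
import random

STATES = ["A_single", "B_single", "double", "silence"]

def state_sequence(n_frames, p_enter_double, seed=0):
    """Generate a four-state conversation state sequence with a simple
    Markov chain; p_enter_double is the per-frame chance that the
    listening party barges in during the other's single-talk."""
    rng = random.Random(seed)
    seq, state = [], "silence"
    for _ in range(n_frames):
        r = rng.random()
        if state in ("A_single", "B_single"):
            if r < p_enter_double:
                state = "double"          # other party barges in
            elif r < p_enter_double + 0.05:
                state = "silence"         # talker falls silent
        elif state == "double":
            if r < 0.3:                   # collision resolves
                state = rng.choice(["A_single", "B_single"])
        else:  # silence
            if r < 0.2:
                state = rng.choice(["A_single", "B_single"])
        seq.append(state)
    return seq

def measured_double_talk_rate(seq):
    """Fraction of double-talk frames, in percent."""
    return 100.0 * seq.count("double") / len(seq)
```

In practice one would search for transition probabilities whose long-run double-talk fraction matches the rate read from the table, then synthesize per-state speech with average P.50 characteristics.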

Instead of providing the speech signal generation unit 35, it is also possible to prepare two-channel speech signals with various double-talk rates in advance as a speech database, and to select and use a speech signal with the appropriate double-talk rate.

The subsequent operation follows the first embodiment. That is, the double-talk section detection unit 37 detects double-talk sections by comparing the point A and point B speech signals obtained from the speech signal generation unit 35, and the weighting unit 38 weights the distortion time series output by the distortion amount measurement unit 36 based on the information obtained from the double-talk section detection unit 37.

FIG. 6 shows an example of the speech quality objective evaluation processing flow according to the second embodiment of the present invention. First, the delay time measurement unit 32 measures the transmission delay time by comparing the time at which the test signal is transmitted from point A with the time at which it is received at point B (step S11).

The double-talk rate determination unit 33 determines the double-talk rate from the measured transmission delay time by referring to the double-talk rate table 34 (step S12). Based on the result, the speech signal generation unit 35 generates a two-channel speech signal (the point A and point B speech signals) realizing the determined double-talk rate (step S13).

The distortion amount measurement unit 36 quantifies the distortion of the degraded speech as a time series by comparing the point A speech signal with the degraded speech obtained by passing it through the evaluation target system (step S14). The double-talk section detection unit 37 detects double-talk sections by comparing the point A and point B speech signals obtained from the speech signal generation unit 35 (step S15). Based on the information from the double-talk section detection unit 37, the weighting unit 38 computes the weighted average of the distortion time series output by the distortion amount measurement unit 36 according to the equation given above, and calculates the objective evaluation value Y (step S16).

FIG. 1 is a block diagram of the first embodiment of the present invention.
FIG. 2 illustrates the characteristics of a two-way call.
FIG. 3 shows an example of the speech quality objective evaluation processing flow (first embodiment).
FIG. 4 is a block diagram of the second embodiment of the present invention.
FIG. 5 shows an example data configuration of the double-talk rate table.
FIG. 6 shows an example of the speech quality objective evaluation processing flow (second embodiment).
FIG. 7 is a block diagram of a conventional speech quality objective evaluation device.

Explanation of Symbols

1, 3, 4 Speech quality objective evaluation device
2 Evaluation target system
11, 41 Speech database (DB)
12, 36, 42 Distortion amount measurement unit
13, 37 Double-talk section detection unit
14, 38 Weighting unit
31 Test signal database (DB)
32 Delay time measurement unit
33 Double-talk rate determination unit
34 Double-talk rate table
35 Speech signal generation unit
43 Averaging unit

Claims (4)

第1地点から第2地点までの評価対象系を通した音声信号の物理的特徴量の測定結果から音声品質を客観評価する音声品質客観評価装置であって,
第1地点の音声信号とこれを評価対象系に通して得られる劣化音声とを比較することにより劣化音声の歪量を測定し,歪量時系列として定量化する歪量測定手段と,
第1地点の音声信号と第2地点の音声信号とを比較することにより双方発話区間を検出する双方発話区間検出手段と,
前記検出された双方発話区間の歪量に対する重みを,単独発話区間に比べて軽減した重み付け値を用いて,前記歪量測定手段が出力する歪量時系列の重み付け平均を算出する重み付け手段とを備える
ことを特徴とする音声品質客観評価装置。
An audio quality objective evaluation device for objectively evaluating audio quality from measurement results of physical features of audio signals through an evaluation target system from a first point to a second point,
A distortion amount measuring means for measuring the distortion amount of the deteriorated voice by comparing the voice signal of the first point with the deteriorated voice obtained by passing the voice signal through the evaluation target system, and quantifying the distortion amount as a time series;
A both-speaking section detecting means for detecting a both-speaking section by comparing the sound signal of the first point and the sound signal of the second point;
Weighting means for calculating a weighted average of distortion amount time series output from the distortion amount measuring means using a weighting value obtained by reducing the weight for the distortion amount of the detected both utterance intervals compared to a single utterance interval; A voice quality objective evaluation device characterized by comprising:
第1地点から第2地点までの評価対象系を通した音声信号の物理的特徴量の測定結果から音声品質を客観評価する音声品質客観評価装置であって,
前記第1地点から第2地点までの評価対象系の伝送遅延時間を測定する遅延時間測定手段と,
前記測定された伝送遅延時間に基づいて,予め定められた伝送遅延時間と通話における双方発話率との対応情報から双方発話率を決定する双方発話率決定手段と,
前記決定された双方発話率を実現する音声信号を評価音源として生成または予め用意された音声データベースから選択する音声信号生成/選択手段と,
前記評価音源の第1地点の音声信号とこれを評価対象系に通して得られる劣化音声とを比較することにより劣化音声の歪量を測定し,歪量時系列として定量化する歪量測定手段と,
前記評価音源の第1地点の音声信号と第2地点の音声信号とを比較することにより双方発話区間を検出する双方発話区間検出手段と,
前記検出された双方発話区間の歪量に対する重みを,単独発話区間に比べて軽減した重み付け値を用いて,前記歪量測定手段が出力する歪量時系列の重み付け平均を算出する重み付け手段とを備える
ことを特徴とする音声品質客観評価装置。
An audio quality objective evaluation device for objectively evaluating audio quality from measurement results of physical features of audio signals through an evaluation target system from a first point to a second point,
A delay time measuring means for measuring a transmission delay time of an evaluation target system from the first point to the second point;
Based on the measured transmission delay time, a both-side speech rate determining means for determining a two-way speech rate from correspondence information between a predetermined transmission delay time and a two-way speech rate in a call;
An audio signal generating / selecting means for generating an audio signal for realizing the determined bilateral speech rate as an evaluation sound source or selecting from an audio database prepared in advance;
Distortion amount measuring means for measuring the distortion amount of the deteriorated speech by comparing the sound signal of the first point of the evaluation sound source with the deteriorated speech obtained by passing this through the evaluation target system, and quantifying it as a distortion time series When,
A both-speaking section detecting means for detecting a both-speaking section by comparing a voice signal of the first point and a second point of the evaluation sound source;
Weighting means for calculating a weighted average of distortion amount time series output from the distortion amount measuring means using a weighting value obtained by reducing the weight for the distortion amount of the detected both utterance intervals compared to a single utterance interval; A voice quality objective evaluation device characterized by comprising:
第1地点から第2地点までの評価対象系を通した音声信号の物理的特徴量の測定結果から音声品質を客観評価する音声品質客観評価方法であって,
第1地点の音声信号とこれを評価対象系に通して得られる劣化音声とを比較することにより劣化音声の歪量を測定し,歪量時系列として定量化する歪量測定ステップと,
第1地点の音声信号と第2地点の音声信号とを比較することにより双方発話区間を検出する双方発話区間検出ステップと,
前記検出された双方発話区間の歪量に対する重みを,単独発話区間に比べて軽減した重み付け値を用いて,前記歪量時系列の重み付け平均を算出する重み付けステップとを有する
ことを特徴とする音声品質客観評価方法。
A speech quality objective evaluation method that objectively evaluates speech quality from the measured physical features of a speech signal passed through an evaluation target system from a first point to a second point, comprising:
a distortion amount measuring step of measuring the distortion amount of degraded speech by comparing the first-point speech signal with the degraded speech obtained by passing it through the evaluation target system, and quantifying the result as a distortion amount time series;
a two-way speech section detecting step of detecting two-way speech sections by comparing the first-point speech signal with the second-point speech signal; and
a weighting step of calculating a weighted average of the distortion amount time series using weighting values in which the weight on the distortion amount of the detected two-way speech sections is reduced relative to single speech sections.
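One simple way to realize the two-way speech section detecting step above is frame-wise energy thresholding on both channels: a frame belongs to a two-way speech section when both the first-point and the second-point signal carry speech energy. A minimal sketch — the `frame_len` and `threshold` values are illustrative assumptions, and the patent does not prescribe a particular detector:

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def detect_two_way_speech(sig_a, sig_b, frame_len=160, threshold=1e-4):
    """Frame-wise two-way speech detection: a frame is marked True when
    the short-time energy of BOTH channels exceeds a silence threshold.
    frame_len=160 corresponds to 20 ms at 8 kHz; both values are
    illustrative choices, not taken from the patent."""
    n = min(len(sig_a), len(sig_b)) // frame_len
    mask = []
    for i in range(n):
        fa = sig_a[i * frame_len:(i + 1) * frame_len]
        fb = sig_b[i * frame_len:(i + 1) * frame_len]
        mask.append(frame_energy(fa) > threshold and frame_energy(fb) > threshold)
    return mask
```

The resulting boolean mask, one entry per frame, aligns with the distortion amount time series and feeds the weighting step.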
A speech quality objective evaluation method that objectively evaluates speech quality from the measured physical features of a speech signal passed through an evaluation target system from a first point to a second point, comprising:
a delay time measuring step of measuring the transmission delay time of the evaluation target system from the first point to the second point;
a two-way speech rate determining step of determining, based on the measured transmission delay time, a two-way speech rate from predetermined correspondence information between transmission delay time and the two-way speech rate in a call;
a speech signal generating/selecting step of generating, as an evaluation sound source, a speech signal that realizes the determined two-way speech rate, or selecting such a signal from a speech database prepared in advance;
a distortion amount measuring step of measuring the distortion amount of degraded speech by comparing the first-point speech signal of the evaluation sound source with the degraded speech obtained by passing it through the evaluation target system, and quantifying the result as a distortion amount time series;
a two-way speech section detecting step of detecting two-way speech sections by comparing the first-point speech signal of the evaluation sound source with the second-point speech signal; and
a weighting step of calculating a weighted average of the distortion amount time series using weighting values in which the weight on the distortion amount of the detected two-way speech sections is reduced relative to single speech sections.
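The weighting step in these claims can be sketched as a weighted mean over the distortion amount time series, with frames inside detected two-way speech sections given a smaller weight than single speech frames. A minimal illustration — the default weight of 0.3 is an assumption for the example, not a value taken from the patent:

```python
def weighted_distortion_average(distortion, two_way_mask, two_way_weight=0.3):
    """Average a per-frame distortion time series, down-weighting frames
    inside two-way speech sections.

    distortion     : per-frame distortion values
    two_way_mask   : booleans, True where both parties speak
    two_way_weight : weight applied to two-way speech frames, relative
                     to the weight 1.0 used for single speech frames
                     (0.3 is an illustrative choice, not from the patent)
    """
    # Reduce the weight of two-way speech frames, reflecting that
    # degradation there affects perceived quality less.
    weights = [two_way_weight if m else 1.0 for m in two_way_mask]
    return sum(w * d for w, d in zip(weights, distortion)) / sum(weights)
```

With `two_way_weight=1.0` this reduces to the plain frame average used by conventional objective measures; smaller values shift the estimate toward the distortion observed in single speech sections.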
JP2003311090A 2003-09-03 2003-09-03 Voice quality objective evaluation apparatus and voice quality objective evaluation method Expired - Fee Related JP4113481B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003311090A JP4113481B2 (en) 2003-09-03 2003-09-03 Voice quality objective evaluation apparatus and voice quality objective evaluation method

Publications (2)

Publication Number Publication Date
JP2005077970A true JP2005077970A (en) 2005-03-24
JP4113481B2 JP4113481B2 (en) 2008-07-09

Family

ID=34412740

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003311090A Expired - Fee Related JP4113481B2 (en) 2003-09-03 2003-09-03 Voice quality objective evaluation apparatus and voice quality objective evaluation method

Country Status (1)

Country Link
JP (1) JP4113481B2 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US10387298B2 (en) 2017-04-04 2019-08-20 Hailo Technologies Ltd Artificial neural network incorporating emphasis and focus techniques
US11238334B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method of input alignment for efficient vector operations in an artificial neural network
US11551028B2 (en) 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
US11615297B2 (en) 2017-04-04 2023-03-28 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network compiler

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011250289A (en) * 2010-05-28 2011-12-08 Nippon Telegr & Teleph Corp <Ntt> Speech quality estimation method, speech quality estimation device, and speech quality estimation system
CN111276161A (en) * 2020-03-05 2020-06-12 公安部第三研究所 Voice quality scoring system and method
CN111276161B (en) * 2020-03-05 2023-03-10 公安部第三研究所 Voice quality scoring system and method
CN114486286A (en) * 2022-01-12 2022-05-13 中国重汽集团济南动力有限公司 Method and equipment for evaluating quality of door closing sound of vehicle
CN114486286B (en) * 2022-01-12 2024-05-17 中国重汽集团济南动力有限公司 Method and equipment for evaluating quality of door closing sound of vehicle


Similar Documents

Publication Publication Date Title
JP2009050013A (en) Echo detection and monitoring
EP1979900B1 (en) Apparatus for estimating sound quality of audio codec in multi-channel and method therefor
Hines et al. ViSQOL: The virtual speech quality objective listener
JP4745916B2 (en) Noise suppression speech quality estimation apparatus, method and program
KR101430321B1 (en) Method and system for determining a perceived quality of an audio system
Rix Perceptual speech quality assessment-a review
JP2011501206A (en) Method and system for measuring voice comprehension of audio transmission system
KR20190111134A (en) Methods and devices for improving call quality in noisy environments
US8566082B2 (en) Method and system for the integral and diagnostic assessment of listening speech quality
Ding et al. Non-intrusive single-ended speech quality assessment in VoIP
JP4113481B2 (en) Voice quality objective evaluation apparatus and voice quality objective evaluation method
US20090161882A1 (en) Method of Measuring an Audio Signal Perceived Quality Degraded by a Noise Presence
JP2007013674A (en) Comprehensive speech communication quality evaluating device and comprehensive speech communication quality evaluating method
Moeller et al. Objective estimation of speech quality for communication systems
JP4761391B2 (en) Listening quality evaluation method and apparatus
US7412375B2 (en) Speech quality assessment with noise masking
JP4116955B2 (en) Voice quality objective evaluation apparatus and voice quality objective evaluation method
Gierlich et al. Advanced speech quality testing of modern telecommunication equipment: An overview
Brachmański Estimation of logatom intelligibility with the STI method for polish speech transmitted via communication channels
Ghimire Speech intelligibility measurement on the basis of ITU-T Recommendation P. 863
JP2015106896A (en) Speech quality estimation method, speech quality estimation device, and program
Egi et al. Objective quality evaluation method for noise-reduced speech
Somek et al. Speech quality assessment
Lingapuram Measuring speech quality of laptop microphone system using PESQ
Hedlund et al. Quantification of audio quality loss after wireless transfer

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20050715

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20080317

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20080408

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20080411

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110418

Year of fee payment: 3

RD02 Notification of acceptance of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: R3D02

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120418

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130418

Year of fee payment: 5

LAPS Cancellation because of no payment of annual fees