JPH09331391A - Speech quality objective estimation device - Google Patents

Speech quality objective estimation device

Info

Publication number
JPH09331391A
Authority
JP
Japan
Prior art keywords
speech
similarity
voice signal
voice
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP15106796A
Other languages
Japanese (ja)
Inventor
Tetsuro Yamazaki
哲朗 山崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP15106796A priority Critical patent/JPH09331391A/en
Publication of JPH09331391A publication Critical patent/JPH09331391A/en
Pending legal-status Critical Current


Abstract

PROBLEM TO BE SOLVED: To improve speech quality estimation accuracy by incorporating a section that identifies the coding system of a voice signal, so that the standard degraded voice signal and the weighting coefficients are selected to match the coding system of the test voice signal.

SOLUTION: A coded speech identification section 1 identifies the coding system of a coded voice signal containing code errors. A similarity calculation section 3 obtains the time series of the similarity between the analysis result produced by a speech analysis section 2 and a standard degraded voice signal selected according to the identification result of the coded speech identification section 1. A weighting-coefficient database 6 stores the inter-unit weighting coefficients obtained by neural-network learning from the subjective evaluation values of various degraded voice signals and their similarity time series. A quality estimate calculation section 4 receives the similarity time series from section 3 together with the weighting coefficients selected from database 6 according to the identification result of section 1, and estimates the speech quality.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech quality objective estimation apparatus, and more particularly to an apparatus that objectively estimates, from physical quantities of the voice signal, the speech quality of a voice signal degraded by distortion and noise arising in a telephone transmission system.

[0002]

2. Description of the Related Art

A conventional example will be described with reference to FIG. 1, in which the portion surrounded by the chain line shows the conventional arrangement. A speech quality objective estimation apparatus that objectively estimates speech quality using pattern matching and a neural network feeds the test voice signal, i.e., the voice signal whose quality is to be estimated, into a speech analysis section 2. A similarity calculation section 3 then performs pattern matching between this analysis result and standard degraded voice signals, stored in advance in a standard degraded voice database 5, which represent the features of degraded voice signals as short-time parameter sequences. The resulting similarity time series is input to a quality estimate calculation section 4 forming a neural network, which estimates the speech quality. The similarity calculation in section 3 and the learning process that creates the standard degraded voice signals are performed in the same manner as in the speech quality objective measuring method of Japanese Patent Application No. 3-118924. The neural network, which takes the similarity time series as input and outputs the MOS (Mean Opinion Score), a measure of speech quality, consists of three layers: an input layer, a hidden layer, and an output layer. Its learning and quality estimation proceed as follows.

[0003] In the learning process, the time series of similarity between the training voice signal and a standard degraded voice signal, computed for each short interval (hereafter, frame), is first divided evenly into five segments, and the similarities within each segment are averaged and fed to the input layer of the neural network. The hidden layer has five units. The single-unit output layer is given the MOS of the training voice signal. The weighting coefficients between the units of each layer are then learned by back-propagation (D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing, Vol. 1, MIT Press, pp. 318-362, 1986). A sigmoid function is used as the input/output function of each unit in the hidden and output layers. After every learning pass, the quality of the training samples and the held-out samples is estimated and compared with the subjective quality measurements; training stops at the iteration where the difference between the training samples' quality estimates and the subjective quality has stabilized at a small value and the difference for the held-out samples reaches its minimum. The inter-unit weighting coefficients obtained from this learning process are stored in a weighting-coefficient database and used during quality estimation. In the quality estimation process, the time series of per-frame similarities between the test voice and the standard degraded voice signal is divided evenly into five segments, the similarities in each segment are averaged, and the averages are input to a neural network with the same structure as the one used in learning; the MOS of the test voice is determined using the learned inter-unit weighting coefficients and the sigmoid function.

[0004]

PROBLEMS TO BE SOLVED BY THE INVENTION

The speech quality objective estimation apparatus described above determines the standard degraded voice signals and the inter-unit weighting coefficients using training samples coded with the same coding system as the test voice signal, and can therefore estimate quality with sufficient accuracy. When estimating the quality of a test voice signal whose coding system is unknown, however, the apparatus may use a standard degraded voice signal and weighting coefficients that do not match the coding system. With a mismatched standard degraded voice signal, the similarity between the standard degraded voice signal and the test voice signal cannot be computed accurately; and because the weighting coefficients learned by the neural network depend on the coding system, a correct quality estimate cannot be obtained when the coding system differs.

[0005] The present invention incorporates into this speech quality objective estimation apparatus a section that identifies the coding system of the voice signal, so that a standard degraded voice signal and weighting coefficients matched to the coding system of the test voice are selected. It thereby provides a speech quality objective estimation apparatus with good estimation accuracy that solves the problems described above.

[0006]

MEANS FOR SOLVING THE PROBLEMS

In a speech quality objective estimation apparatus that obtains the similarity between a standard degraded voice signal and a test voice signal and feeds the similarity time series to a neural network to objectively estimate the speech quality of the test voice signal, the invention provides: a speech analysis section 2 that frequency-analyzes the test voice signal; a standard degraded voice database 5 that stores various degraded voice signals in advance; a coded speech identification section 1 that identifies the coding system of a coded voice signal containing code errors; a similarity calculation section 3 that obtains the time series of similarity between the analysis result of the speech analysis section 2 and the standard degraded voice signal selected from database 5 according to the identification result of section 1; a weighting-coefficient database 6 that stores in advance the inter-unit weighting coefficients obtained by neural-network learning from the subjective evaluation values of various degraded voice signals and their similarity time series; and a quality estimate calculation section 4, forming a neural network, that estimates the speech quality from the similarity time series obtained by section 3 and the weighting coefficients selected from database 6 according to the identification result of section 1.

[0007] Further, the coded speech identification section 1 frequency-analyzes the test voice signal, determines the spectral envelope of the test voice signal from the analysis result, and identifies the coding system of the test voice signal from that spectral envelope.

[0008]

BEST MODE FOR CARRYING OUT THE INVENTION

In one embodiment of the speech quality objective estimation apparatus of this invention, the test voice is first fast-Fourier-transformed frame by frame, the resulting spectrum is divided evenly into several bands, and the spectrum is averaged within each band. The band averages are then further averaged over the length of the utterance, i.e., over all frames. Next, the difference between the lowest-band spectrum and the mean of the all-band spectrum is divided by the standard deviation of the all-band spectrum. The quotient is taken as the "physical quantity used for identification", and the coding system is identified by comparing this quantity with a threshold. The threshold is determined from the distribution of this quantity computed over samples of voice signals containing code errors.
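The statistic described above can be written out directly; the sketch below is a minimal pure-Python version, where the function names and the assumption that per-frame magnitude spectra are already available are illustrative.

```python
import math

def band_average(spectrum, n_bands=9):
    """Split one frame's magnitude spectrum into equal bands and
    average the spectrum within each band."""
    size = len(spectrum) // n_bands
    return [sum(spectrum[b * size:(b + 1) * size]) / size
            for b in range(n_bands)]

def identification_quantity(frame_spectra, n_bands=9):
    """Compute the 'physical quantity used for identification':
    band-average each frame, average the bands over all frames,
    then normalise the lowest band's gap from the overall mean
    by the standard deviation of the band profile."""
    banded = [band_average(s, n_bands) for s in frame_spectra]
    n_frames = len(banded)
    profile = [sum(f[b] for f in banded) / n_frames
               for b in range(n_bands)]
    mean = sum(profile) / n_bands
    var = sum((v - mean) ** 2 for v in profile) / n_bands
    return (profile[0] - mean) / math.sqrt(var)
```

Comparing the returned quantity with a threshold calibrated on error-corrupted samples then yields the coding-system decision.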

[0009]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

An embodiment of the invention will be described with reference to FIG. 1. In FIG. 1, reference numeral 1 denotes the coded speech identification section added by this invention, which receives the test voice signal and identifies the coding system of a coded voice signal containing code errors. Its identification result is used to select a standard degraded voice signal from the standard degraded voice database 5 and to select, from the weighting-coefficient database 6, the weighting coefficients obtained by the neural network. The speech analysis section 2 computes the LPC cepstrum coefficients of the test voice signal frame by frame. The similarity calculation section 3 performs frame-by-frame pattern matching between the test voice signal and the standard degraded voice signal selected from database 5 according to the identification result of section 1, yielding the similarity time series; here, the physical quantity representing the features of both the test voice signal and the standard degraded voice signal is the LPC cepstrum coefficients. The quality estimate calculation section 4 receives the similarity time series divided evenly into five segments, with the similarities in each segment averaged, together with the weighting coefficients selected from database 6 according to the coding-system identification result, and computes the quality (MOS: mean opinion score). The standard degraded voice database 5 is a database of degraded voice signals representative of various kinds of degradation. The weighting-coefficient database 6 is a database of the inter-unit weighting coefficients obtained by neural-network learning from the subjective evaluation values of various degraded voice signals and their similarity time series.
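The frame-by-frame pattern matching can be illustrated as follows. The patent defers the exact similarity measure to the earlier application (No. 3-118924); the Euclidean cepstral distance mapped through exp(-d) below is only one plausible stand-in, and the frame alignment is assumed already done.

```python
import math

def cepstral_distance(c_test, c_ref):
    """Euclidean distance between two LPC cepstrum coefficient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_test, c_ref)))

def similarity_series(test_frames, ref_frames):
    """Per-frame similarity time series between the test signal and a
    standard degraded signal; exp(-d) maps distance 0 to similarity 1,
    with similarity falling toward 0 as the frames diverge."""
    return [math.exp(-cepstral_distance(t, r))
            for t, r in zip(test_frames, ref_frames)]
```

The resulting series is what is split into five segments and averaged before entering the quality estimate calculation section.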

[0010] FIG. 2 shows the processing flow of the coded speech identification section 1. First, in step 7, each frame of the test voice signal is fast-Fourier-transformed at 512 points to obtain a 256-point spectrum. In step 8, since the target is telephone-band voice (3400 Hz and below), 225 of the 256 spectrum points are divided evenly into 9 bands and averaged within each band. In step 9, this 9-band spectrum is averaged over the length of the utterance, i.e., over all frames. In step 10, the difference between the lowest-band spectrum and the mean of the 9-band spectrum is divided by the standard deviation of the 9-band spectrum; the quotient is the "physical quantity used for identification", and the coding system is identified by comparing it with a threshold. That is, in step 11, the value obtained in step 10 is compared with a threshold that separates ADPCM- and LD-CELP-coded voice signals at BER = 10^-2 within a set of ADPCM- and LD-CELP-coded voice signals containing code errors at three error rates (BER = 0, 10^-3, 10^-2). If the test voice signal is an ADPCM-coded voice signal at BER = 10^-2, the identification result output to the similarity calculation section 3 and the quality estimate calculation section 4 is "1"; if it is an LD-CELP-coded voice signal at BER = 10^-2, the result is "2"; otherwise it is "3". When the result is "1", sections 3 and 4 select the standard degraded voice data and weighting coefficients for estimating the speech quality of ADPCM-coded voice at BER = 10^-2; likewise, when the result is "2", those for LD-CELP-coded voice at BER = 10^-2; and when the result is "3", those for voice other than ADPCM- and LD-CELP-coded voice at BER = 10^-2.
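Step 11 can be sketched as a simple threshold decision. The patent says only that the threshold comes from the distribution of the identification quantity over error-corrupted samples; the midpoint rule and the use of two ordered thresholds for the three-way decision below are assumptions made for illustration.

```python
def calibrate_threshold(values_a, values_b):
    """One simple way to place a threshold between two sample
    distributions of the identification quantity: the midpoint
    of the two class means (an illustrative choice)."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    return (mean_a + mean_b) / 2.0

def classify_coding(quantity, th_adpcm, th_ldcelp):
    """Map the identification quantity to the labels of step 11:
    1 = ADPCM at BER 10^-2, 2 = LD-CELP at BER 10^-2, 3 = other.
    Assumes th_adpcm > th_ldcelp on the quantity axis."""
    if quantity >= th_adpcm:
        return 1
    if quantity >= th_ldcelp:
        return 2
    return 3
```

The returned label (1, 2, or 3) is what drives the selection of the standard degraded voice data and weighting coefficients.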

[0011] In an experiment on 576 sentences of ADPCM- and LD-CELP-coded voice containing code errors at three error rates (BER = 0, 10^-3, 10^-2), in which ADPCM- and LD-CELP-coded voice signals at BER = 10^-2 were to be identified, the correct-identification rate was 97.9% for LD-CELP-coded voice at BER = 10^-2, 83.3% for ADPCM-coded voice at BER = 10^-2, and 75.0% for the other voice signals, giving an average over the three categories of 87.3%. A quality measurement experiment with the identification section incorporated showed an accuracy improvement of about 0.25 compared with the apparatus without it.

[0012]

EFFECTS OF THE INVENTION

As described above, by incorporating into a speech quality objective estimation apparatus based on pattern matching and a neural network a section that identifies the coding system of the voice signal, the coding system of the test voice is identified, and speech quality estimation is performed using a standard degraded voice signal and weighting coefficients matched to that coding system, so the accuracy of speech quality estimation can be improved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating the speech quality objective estimation apparatus.

FIG. 2 is a diagram illustrating the processing procedure of the coded speech identification section.

EXPLANATION OF SYMBOLS

1 coded speech identification section
2 speech analysis section
3 similarity calculation section
4 quality estimate calculation section
5 standard degraded voice database
6 weighting-coefficient database

Claims (2)

1. A speech quality objective estimation apparatus that obtains the similarity between a standard degraded voice signal and a test voice signal and feeds the time series of the similarity to a neural network to objectively estimate the speech quality of the test voice signal, comprising: a speech analysis section that frequency-analyzes the test voice signal; a standard degraded voice database that stores various degraded voice signals in advance; a coded speech identification section that identifies the coding system of a coded voice signal containing code errors; a similarity calculation section that obtains the time series of similarity between the analysis result of the speech analysis section and the standard degraded voice signal selected from the standard degraded voice database according to the identification result of the coded speech identification section; a weighting-coefficient database that stores in advance the inter-unit weighting coefficients obtained by neural-network learning from the subjective evaluation values of various degraded voice signals and their similarity time series; and a quality estimate calculation section, forming a neural network, that estimates the speech quality from the similarity time series obtained by the similarity calculation section and the weighting coefficients selected from the weighting-coefficient database according to the identification result of the coded speech identification section.
2. The speech quality objective estimation apparatus according to claim 1, wherein the coded speech identification section frequency-analyzes the test voice signal, determines the spectral envelope of the test voice signal from the analysis result, and identifies the coding system of the test voice signal from the spectral envelope.
JP15106796A 1996-06-12 1996-06-12 Speech quality objective estimation device Pending JPH09331391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP15106796A JPH09331391A (en) 1996-06-12 1996-06-12 Speech quality objective estimation device


Publications (1)

Publication Number Publication Date
JPH09331391A true JPH09331391A (en) 1997-12-22

Family

ID=15510588

Family Applications (1)

Application Number Title Priority Date Filing Date
JP15106796A Pending JPH09331391A (en) 1996-06-12 1996-06-12 Speech quality objective estimation device

Country Status (1)

Country Link
JP (1) JPH09331391A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009525633A (en) * 2006-01-31 2009-07-09 テレフオンアクチーボラゲット エル エム エリクソン(パブル) Non-intrusive signal quality assessment
JP2012516591A (en) * 2009-01-30 2012-07-19 テレフオンアクチーボラゲット エル エム エリクソン(パブル) Audio signal quality prediction
JP2021015137A (en) * 2019-07-10 2021-02-12 三菱電機株式会社 Information processing device, program, and information processing method

