JP3351746B2

JP3351746B2 - Audio signal compression method, audio signal compression device, speech signal compression method, speech signal compression device, speech recognition method, and speech recognition device

Info

Publication number: JP3351746B2
Application number: JP28160498A
Authority: JP
Inventors: 良久中藤; 武志則松; 峰生津島; 智一石川; 光彦芹川; 大朗片山; 順一中橋; 順子八木
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1997-10-03
Filing date: 1998-10-02
Publication date: 2002-12-03
Anticipated expiration: 2018-10-02
Also published as: JPH11327600A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to an audio signal compression method and an audio signal compression device, and to a speech signal compression method and a speech signal compression device, that allow an audio signal (music converted into an electric signal) or a speech signal (human voice converted into an electric signal) to be transmitted over a low-capacity transmission path and stored efficiently on a recording medium. In particular, when the audio signal or speech signal is compressed based on frequency weighting that corresponds to the auditory sensitivity characteristic, which is a property of human hearing, the invention compresses the signal more efficiently than conventional methods while maintaining high sound quality. The present invention also relates to a speech recognition method and a speech recognition device that, in order to realize a high-performance speech recognizer, obtain a higher recognition rate than before when recognition is performed using features whose resolution varies with frequency, the features being obtained by a linear prediction analysis method that incorporates the auditory sensitivity characteristic.

【０００２】[0002]

[Prior Art] Various audio signal compression methods of this kind have been proposed. One example is described below.

[0003] First, the time series of the input audio signal is converted, for each frame of fixed length, into a frequency characteristic signal sequence by, for example, the MDCT (modified discrete cosine transform) or the FFT (fast Fourier transform). The input audio signal is also subjected to linear prediction analysis (LPC analysis) frame by frame to extract LPC coefficients (linear prediction coefficients), LSP coefficients (line spectrum pair coefficients), PARCOR coefficients (partial autocorrelation coefficients), or the like, and an LPC spectral envelope is obtained from these coefficients. Next, the frequency characteristic signal sequence is normalized by dividing it by the LPC spectral envelope, which flattens its frequency characteristic, and its power is then normalized based on the maximum value, the average value, or the like of the power. In the following description, the output coefficients at the point where this power normalization has been performed are also called the residual signal. The flattened residual signal is then vector-quantized using the spectral envelope as a weight. An example of such an audio signal compression method is TwinVQ (Iwakami, Moriya, Miki: "Audio coding using frequency-weighted interleave vector quantization (TwinVQ)," Proceedings of the Acoustical Society of Japan, 1-P-1, pp. 339-340 (1994)).
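The envelope extraction and flattening steps described in [0003] can be sketched as follows. This is a minimal illustration, not the TwinVQ implementation: it assumes the autocorrelation method with the Levinson-Durbin recursion as the LPC analysis, and the function names (`autocorrelation`, `levinson_durbin`, `lpc_envelope`, `flatten`) are hypothetical. The sign convention of the PARCOR values varies between texts; the one used here follows from writing the predictor as A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.

```python
import cmath
import math


def autocorrelation(frame, order):
    """Short-time autocorrelation r[0..order] of one frame."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]


def levinson_durbin(r):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, parcor, err) for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p,
    where err is the final prediction error power.
    """
    order = len(r) - 1
    a = [1.0] + [0.0] * order
    parcor = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err                 # reflection (PARCOR) coefficient
        parcor.append(k_i)
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + k_i * prev[i - k]
        a[i] = k_i
        err *= 1.0 - k_i * k_i
    return a, parcor, err


def lpc_envelope(a, err, n_bins):
    """Envelope sqrt(err) / |A(e^{jw})| sampled at n_bins points in [0, pi)."""
    env = []
    for b in range(n_bins):
        w = math.pi * b / n_bins
        A = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        env.append(math.sqrt(err) / abs(A))
    return env


def flatten(spectrum, envelope):
    """Normalize spectral coefficients by dividing by the envelope."""
    return [s / e for s, e in zip(spectrum, envelope)]
```

For a frame `x`, `levinson_durbin(autocorrelation(x, p))` gives the order-`p` coefficients, and `flatten(spectrum, lpc_envelope(a, err, len(spectrum)))` gives the flattened coefficients that [0003] then power-normalizes and quantizes.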

[0004] Next, a conventional speech signal compression method is described. First, the time series of the input speech signal is subjected to linear prediction analysis (LPC analysis) frame by frame, which separates it into an LPC spectral envelope component, expressed as LPC coefficients (linear prediction coefficients), LSP coefficients (line spectrum pair coefficients), PARCOR coefficients (partial autocorrelation coefficients), or the like, and a residual signal whose frequency characteristic has been flattened. The LPC spectral envelope component is scalar-quantized, and the flattened residual signal is quantized using a sound source codebook prepared in advance; each is thereby converted into a digital signal. An example of such a speech signal compression method is CELP (M. R. Schroeder and B. S. Atal: "Code-excited linear prediction (CELP): high quality speech at very low rates," Proc. ICASSP-85 (March 1985)).
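The separation into an envelope component and a flattened residual described in [0004] amounts to inverse filtering the signal with the prediction polynomial. The sketch below is illustrative only; `inverse_filter` and `synthesize` are hypothetical names, and the coefficients are assumed to be given as A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p with zero signal history before the frame.

```python
def inverse_filter(x, a):
    """Residual e[n] = sum_k a[k] * x[n-k], with a[0] == 1 and x[n] = 0 for n < 0."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def synthesize(e, a):
    """Undo inverse_filter: x[n] = e[n] - sum_{k>=1} a[k] * x[n-k]."""
    x = []
    for n in range(len(e)):
        past = sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0)
        x.append(e[n] - past)
    return x
```

If `x` is the impulse response of 1/A(z), the residual is a single impulse, which is the sense in which the residual is "flattened": all of the spectral shape has moved into the coefficients.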

[0005] A conventional speech recognition method is described next. In general, a speech recognition device creates a standard model for each phoneme or word in advance from reference speech data, obtains features corresponding to the spectral envelope from the input speech, computes the similarity between the time series of those features and each standard model, and performs recognition by finding the phoneme or word whose standard model has the highest similarity. The standard model used here is, for example, a hidden Markov model (HMM) or the time series of representative features itself (Seiichi Nakagawa, "Speech Recognition by Stochastic Models," edited by IEICE, pp. 18-20).
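The case in [0005] where the standard model is itself a time series of features can be illustrated with template matching. The sketch below uses dynamic time warping (DTW) as the similarity measure; DTW is a swapped-in, commonly used technique and is not named in the source, and the names `dtw_distance` and `recognize` are hypothetical. A smaller distance plays the role of a higher similarity.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def dtw_distance(seq, template):
    """Accumulated distance along the best time alignment of two sequences."""
    inf = float("inf")
    rows, cols = len(seq), len(template)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(seq[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the template
                                 cost[i][j - 1],      # stretch the input
                                 cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]


def recognize(seq, templates):
    """Return the label whose template is closest to the input sequence."""
    return min(templates, key=lambda label: dtw_distance(seq, templates[label]))
```

Here `templates` maps each phoneme or word label to one reference feature sequence. An HMM-based recognizer would replace the distance with a model likelihood.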

[0006] Conventionally, recognition has used one of two kinds of feature time series obtained from the input speech. The first is LPC cepstrum coefficients, obtained by converting the time series of the input speech into linear prediction coefficients (LPC coefficients) for each frame of fixed length by, for example, linear prediction analysis (LPC analysis), and then applying a cepstrum transform to those coefficients (Kiyohiro Shikano, Satoshi Nakamura, Shiro Ise, "Digital Signal Processing of Speech and Sound Information," Shokodo, pp. 10-16). The second is cepstrum coefficients, obtained by converting the input speech into a power spectrum for each frame of fixed length with a DFT, a band-pass filter bank, or the like, and then applying a cepstrum transform to that power spectrum.
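The conversion from LPC coefficients to LPC cepstrum coefficients mentioned in [0006] can be done with a well-known recursion, sketched below. The function name `lpc_to_cepstrum` is hypothetical, and the coefficients are assumed to describe the all-pole model 1/A(z) with A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p; the gain term c[0] is omitted.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of 1/A(z), where a[0] == 1.

    Recursion: c[n] = -a[n] - sum_{k} (k/n) * c[k] * a[n-k],
    with a[n] taken as 0 for n > p and the sum limited to valid indices.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]
```

For the first-order model 1/(1 - 0.5 z^-1), the exact cepstrum is 0.5^n / n, which gives a quick check of the recursion.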

【０００７】[0007]

[Problems to Be Solved by the Invention] In the conventional audio signal compression method, the frequency characteristic signal sequence calculated by the MDCT, the FFT, or the like is divided by the LPC spectral envelope to obtain a normalized residual signal. In the conventional speech signal compression method, the input speech signal is separated into an LPC spectral envelope calculated by linear prediction analysis and a residual signal. The two conventional methods are therefore alike in that both remove the spectral envelope component from the input signal by ordinary linear prediction analysis, that is, both normalize (flatten) the input signal by the spectral envelope to obtain the residual signal. Consequently, if the performance of this linear prediction analysis can be improved, or the estimation accuracy of the spectral envelope it yields can be raised, information can be compressed more efficiently than before while maintaining high sound quality.

[0008] In ordinary linear prediction analysis, however, the envelope is estimated with the same frequency resolution in every frequency band. Raising the frequency resolution of the perceptually important low frequency band, that is, obtaining the spectral envelope of the low band accurately, therefore requires a higher analysis order, which in turn increases the amount of information. Raising the analysis order also raises the resolution of the perceptually less important high frequency band more than necessary, so a spectral envelope with peaks in the high band may be calculated, which degrades sound quality.

[0009] In addition, when vector quantization is performed as in the conventional audio signal compression method, the weighting used in quantization is based only on the spectral envelope, so with ordinary linear prediction analysis the quantization cannot efficiently exploit the properties of human hearing.

[0010] In the conventional speech recognition method, on the other hand, LPC cepstrum coefficients obtained by ordinary linear prediction analysis, for example, may not deliver sufficient recognition performance, because the analysis does not incorporate the auditory sensitivity characteristic, which is a property of human hearing. It is generally known that human hearing tends to emphasize low-frequency components and to give high-frequency components less weight than low-frequency ones. One method therefore performs recognition using LPC mel coefficients obtained by applying a mel transform to these LPC cepstrum coefficients (Kiyohiro Shikano, Satoshi Nakamura, Shiro Ise, "Digital Signal Processing of Speech and Sound Information," Shokodo, pp. 39-40). However, the LPC cepstrum coefficients themselves come from a linear prediction analysis that does not adequately consider the characteristics of human hearing. As a result, the mel-transformed LPC mel-cepstrum coefficients do not adequately reflect the perceptually important low-band information either.

[0011] The mel scale is a scale derived from the characteristics of human pitch perception. Pitch depends strongly on frequency, but it is well known that pitch is affected by sound intensity as well as by frequency. A pure tone of 1000 Hz at 40 dB SPL is therefore taken as the reference and assigned 1000 mel, and tones perceived as twice as high or half as high are measured by a magnitude measurement method or the like and assigned 2000 mel and 500 mel, respectively. As noted above, however, the LPC cepstrum coefficients themselves come from a linear prediction analysis that does not adequately consider the characteristics of human hearing, so no essential improvement in recognition performance can be expected from mel conversion, that is, from applying the mel transform afterward.
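As a rough numerical illustration of the mel scale in [0011], the sketch below uses a widely used closed-form approximation, m = 2595 log10(1 + f/700). This formula is an assumption brought in for illustration and does not appear in the source; it is chosen because it maps 1000 Hz to approximately 1000 mel, matching the reference point described above. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Approximate mel value for a frequency in Hz (common closed-form fit)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Under this approximation a 100 Hz step near 100 Hz spans more mels than a 100 Hz step near 4000 Hz, which reflects the greater perceptual weight of the low band.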

[0012] Furthermore, ordinary linear prediction analysis estimates the spectral envelope with the same frequency resolution in every frequency band. Raising the frequency resolution of the perceptually important low frequency band, that is, obtaining the spectral envelope of the low band accurately, therefore requires a higher analysis order, which increases the number of features and hence the amount of processing needed for recognition. Raising the analysis order also raises the resolution of the high frequency band more than necessary, so the features pick up unneeded detail in the high band, which can actually degrade recognition performance.

[0013] Another method performs speech recognition using, as features, cepstrum coefficients or mel-cepstrum coefficients obtained from a DFT, a band-pass filter bank, or the like. However, the amount of computation required by the DFT or the band-pass filter bank is far larger than that of linear prediction analysis.

[0014] The present invention has been made to solve the problems described above. It is based on the observation that recognition performance can be improved further by improving the performance of linear prediction analysis, that is, by performing a linear prediction analysis method that incorporates the auditory sensitivity characteristic, which is a property of human hearing (hereinafter called the mel linear prediction analysis method, or MLPC analysis method), and by using the following for speech recognition: the resulting mel-scaled linear prediction coefficients (hereinafter called mel linear prediction coefficients); mel-scaled PARCOR coefficients (hereinafter called mel PARCOR coefficients), which can be obtained from the mel linear prediction coefficients by the same known technique used to obtain PARCOR coefficients from ordinary linear prediction coefficients; mel-scaled LSP coefficients (hereinafter called mel LSP coefficients), which can be obtained from the mel linear prediction coefficients by the same known technique used to obtain LSP coefficients from ordinary linear prediction coefficients; and mel LPC cepstrum coefficients, which are obtained by applying a cepstrum transform to the mel linear prediction coefficients. It has long been supposed that mel-scaled coefficients of this kind could improve the compression performance for audio and speech signals and the performance of speech recognition, but in practice the amount of computation was enormous and they were never put to practical use. In view of this situation, the present inventors conducted intensive research. Computing coefficients of this kind originally required an infinite number of operations, and truncating the computation to a finite number of operations introduced a computational error. The inventors found that an entirely new computation exists which, with only a desired preset number of operations, is equivalent to performing the infinite computation, and which introduces no error. By using this new computation, the present invention aims to provide an audio signal compression method, an audio signal compression device, a speech signal compression method, a speech signal compression device, a speech recognition method, and a speech recognition device that apply frequency weighting corresponding to the auditory sensitivity characteristic, which is a property of human hearing, and thereby improve the compression performance for audio and speech signals and the performance of speech recognition.

[0015] That is, an object of the present invention is to provide an audio signal compression method and an audio signal compression device, or a speech signal compression method and a speech signal compression device, that obtain the spectral envelope based on frequency weighting corresponding to the auditory sensitivity characteristic, which is a property of human hearing, thereby improving the performance of linear prediction analysis or raising the estimation accuracy of the spectral envelope obtained by linear prediction analysis, and that can compress signals more efficiently than before while maintaining high sound quality.

[0016] A further object is to provide a speech recognition method and a speech recognition device that achieve high recognition performance with less processing than before. Because the features corresponding to the spectral envelope are obtained by mel linear prediction analysis based on frequency weighting that corresponds to the auditory sensitivity characteristic, which is a property of human hearing, even a small number of features captures the characteristics of the spectral envelope efficiently, and these features are then used for speech recognition.

【００１７】[0017]

[Means for Solving the Problems] To solve the above problems, the audio signal compression method according to the present invention (claim 1) is an audio signal compression method that encodes an input audio signal and compresses its amount of information, in which: an autocorrelation function on the mel frequency axis is obtained using the input audio signal and an audio signal obtained by expanding and contracting the frequency axis of the input audio signal in accordance with the human auditory sensitivity characteristic; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as the spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input audio signal is smoothed frame by frame using the spectral envelope.

【００１８】[0018]

[0019] The audio signal compression method according to the present invention (claim 2) is an audio signal compression method that encodes an input audio signal and compresses its amount of information, in which: an audio signal of fixed time length is cut out from the input audio signal; the audio signal of fixed time length is passed through all-pass filters in a plurality of stages to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis has been expanded and contracted in accordance with the human auditory sensitivity characteristic, is obtained by a finite number of product-sum operations (Equation 1) between the input audio signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as the spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input audio signal is smoothed frame by frame using the spectral envelope. Here, (Equation 1) is

$$\phi(i,j) = \sum_{n=0}^{N-1} x[n]\, y_{(i-j)}[n]$$

where φ(i, j) is the autocorrelation function, x[n] is the input signal, y_(i-j)[n] is the filter output signal of each stage, and N is the number of samples in the signal of fixed time length.
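The computation described in [0019] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each stage is assumed to be the first-order all-pass filter (z^-1 - α)/(1 - α z^-1), the warping coefficient `alpha` and the function names are hypothetical, the product-sum is taken over the samples of the cut-out frame, and the stage index is taken as |i - j|.

```python
def allpass_chain(x, alpha, stages, length):
    """Outputs of a cascade of first-order all-pass filters.

    Returns [y_0, y_1, ..., y_stages], where y_0 is x zero-padded to
    `length` and y_i[n] = alpha * (y_i[n-1] - y_{i-1}[n]) + y_{i-1}[n-1].
    """
    prev = list(x) + [0.0] * (length - len(x))
    outs = [prev]
    for _ in range(stages):
        y = [0.0] * length
        for n in range(length):
            y_n1 = y[n - 1] if n > 0 else 0.0
            p_n1 = prev[n - 1] if n > 0 else 0.0
            y[n] = alpha * (y_n1 - prev[n]) + p_n1
        outs.append(y)
        prev = y
    return outs


def mel_autocorrelation(x, alpha, order):
    """phi[k] = sum over the frame of x[n] * y_k[n], for k = 0..order."""
    n_samples = len(x)
    y = allpass_chain(x, alpha, order, n_samples)
    return [sum(x[n] * y[k][n] for n in range(n_samples))
            for k in range(order + 1)]
```

Because the filters are causal and the frame is zero outside its N samples, the finite sum equals the infinite-length correlation between the stage outputs y_i and y_j, which is the sense in which a finite number of operations replaces an infinite one. With `alpha = 0` each stage is a pure one-sample delay and the result reduces to the ordinary autocorrelation.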

【００２０】[0020]

[0021] The audio signal compression method according to the present invention (claim 3) is the audio signal compression method according to claim 2, in which the all-pass filter is a first-order all-pass filter.
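How a first-order all-pass filter expands and contracts the frequency axis can be checked numerically. The sketch below assumes the filter form (z^-1 - α)/(1 - α z^-1), which the source does not spell out in this section; `alpha` and the function names are hypothetical. The filter has unit magnitude at every frequency, and its phase lag defines the warped frequency.

```python
import cmath
import math


def allpass_response(omega, alpha):
    """Frequency response of (z^-1 - alpha) / (1 - alpha z^-1) at z = e^{j omega}."""
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - alpha) / (1.0 - alpha * z_inv)


def warped_frequency(omega, alpha):
    """Phase lag of the all-pass filter, i.e. the warped frequency in radians."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))
```

For a positive `alpha`, the warped frequency exceeds the original one in the low band, so low frequencies occupy a larger share of the warped axis and receive finer resolution, in line with the mel or Bark behavior described in the text.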

【００２２】[0022]

[0023] The audio signal compression method according to the present invention (claim 4) is the audio signal compression method according to claim 2 or claim 3, in which the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, so that frequency weighting corresponding to the human auditory sensitivity characteristic is applied.

【００２４】[0024]

[0025] The audio signal compression device according to the present invention (claim 5) is an audio signal compression device that encodes an input audio signal and compresses its amount of information, comprising: time-frequency conversion means that converts the input audio signal into a frequency domain signal and outputs it; spectral envelope calculation means that obtains an autocorrelation function on the mel frequency axis using the input audio signal and an audio signal obtained by expanding and contracting the frequency axis of the input audio signal in accordance with the human auditory sensitivity characteristic, and that either uses the mel linear prediction coefficients obtained from that autocorrelation function as the spectral envelope or obtains a spectral envelope from the mel linear prediction coefficients; normalization means that normalizes the frequency domain signal by the spectral envelope to obtain a residual signal; power normalization means that normalizes the residual signal based on the maximum value or the average value of its power to obtain a normalized residual signal; and vector quantization means that vector-quantizes the normalized residual signal using a residual codebook and converts it into residual codes.
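The two normalization stages of the device in [0025] can be sketched as follows. This is an illustrative reading, not the device's actual implementation: the function names and the `mode` parameter are hypothetical, and the gain is taken as the square root of the maximum or mean power so that the chosen power statistic becomes 1 after normalization.

```python
import math


def normalize_by_envelope(spectrum, envelope):
    """Normalization means: divide the frequency domain signal by the envelope."""
    return [s / e for s, e in zip(spectrum, envelope)]


def power_normalize(residual, mode="max"):
    """Power normalization means: scale by the max or mean power of the residual.

    Returns (normalized_residual, gain); the gain would be sent to the decoder.
    """
    power = [r * r for r in residual]
    ref = max(power) if mode == "max" else sum(power) / len(power)
    gain = math.sqrt(ref)
    return [r / gain for r in residual], gain
```

The normalized residual returned here is what the vector quantization means would then encode with the residual codebook.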

【００２６】[0026]

[0027] The audio signal compression device according to the present invention (claim 6) is the audio signal compression device according to claim 5, further comprising perceptual weighting calculation means that applies frequency weighting corresponding to the human auditory sensitivity characteristic to the spectral envelope and outputs the result as perceptual weighting coefficients, in which the vector quantization means quantizes the normalized residual signal using the perceptual weighting coefficients.
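Quantization with weighting coefficients, as in [0027], is commonly realized as a codebook search under a weighted squared-error distortion. The sketch below shows that search only; the name `weighted_vq` is hypothetical, and how the perceptual weights are derived from the spectral envelope is left outside the sketch.

```python
def weighted_vq(vector, codebook, weights):
    """Index of the code vector minimizing sum_k w[k] * (v[k] - c[k])**2."""
    def distortion(code):
        return sum(w * (v - c) ** 2 for v, c, w in zip(vector, code, weights))
    return min(range(len(codebook)), key=lambda i: distortion(codebook[i]))
```

Changing the weights changes which code vector wins: components with large weights are matched more closely at the expense of components with small weights.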

【００２８】[0028]

【００２９】[0029]

【００３０】[0030]

[0031] The audio signal compression device according to the present invention (claim 7) is the audio signal compression device according to claim 6, in which the vector quantization means is multistage quantization means composed of a plurality of such vector quantization means connected in cascade, and at least one of the vector quantization means constituting the multistage quantization means quantizes the residual signal using the weighting coefficients.
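A cascade of vector quantizers, as in [0031], can be sketched by letting each stage quantize the error left by the previous one. This is a generic multistage VQ sketch with hypothetical names; for brevity every stage uses an unweighted squared-error search, whereas the text requires at least one stage to use the weighting coefficients.

```python
def nearest(vector, codebook):
    """Index of the code vector with the smallest squared error."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2
                                 for v, c in zip(vector, codebook[i])))


def multistage_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the remaining error."""
    indices = []
    error = list(vector)
    for cb in codebooks:
        idx = nearest(error, cb)
        indices.append(idx)
        error = [e - c for e, c in zip(error, cb[idx])]
    return indices, error


def multistage_decode(indices, codebooks):
    """Sum the selected code vectors of all stages."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out
```

The decoder only needs one index per stage, and the reconstruction improves as later stages remove more of the remaining error.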

【００３２】[0032]

[0033] The audio signal compression device according to the present invention (claim 8) is the audio signal compression device according to any one of claims 5 to 7, in which the spectral envelope calculation means: cuts out an audio signal of fixed time length from the input audio signal; passes the audio signal of fixed time length through all-pass filters in a plurality of stages to obtain a filter output signal for each stage; obtains an autocorrelation function on the mel frequency axis, in which the frequency axis has been expanded and contracted in accordance with the human auditory sensitivity characteristic, by a finite number of product-sum operations (Equation 2) between the input audio signal and the filter output signal of each stage; obtains mel linear prediction coefficients from the autocorrelation function on the mel frequency axis; and either uses the mel linear prediction coefficients themselves as the spectral envelope or obtains a spectral envelope from the mel linear prediction coefficients. Here, (Equation 2) is

$$\phi(i,j) = \sum_{n=0}^{N-1} x[n]\, y_{(i-j)}[n]$$

where φ(i, j) is the autocorrelation function, x[n] is the input signal, y_(i-j)[n] is the filter output signal of each stage, and N is the number of samples in the signal of fixed time length.

【００３４】[0034]

[0035] The audio signal compression device according to the present invention (claim 9) is the audio signal compression device according to claim 8, in which the all-pass filter is a first-order all-pass filter.

【００３６】[0036]

[0037] The audio signal compression device according to the present invention (claim 10) is the audio signal compression device according to claim 8 or claim 9, in which the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, so that frequency weighting corresponding to the human auditory sensitivity characteristic is applied.

【００３８】[0038]

[0039] The speech signal compression method according to the present invention (claim 11) is a speech signal compression method that encodes an input speech signal and compresses its amount of information, in which: an autocorrelation function on the mel frequency axis is obtained using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with the human auditory sensitivity characteristic; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as the spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input speech signal is smoothed using the spectral envelope.

【００４０】[0040]

【００４１】[0041] Further, the speech signal compression method according to the present invention (claim 12) encodes an input speech signal and compresses its amount of information, wherein a speech signal of a fixed time length is cut out from the input speech signal; the fixed-length speech signal is passed through a multi-stage all-pass filter to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis is warped in accordance with human auditory sensitivity characteristics, is obtained by a finite number of product-sum operations (Equation 3) between the input speech signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from that autocorrelation function; either the mel linear prediction coefficients themselves are used as a spectral envelope, or a spectral envelope is obtained from them; and the input speech signal is smoothed using the spectral envelope. Here, (Equation 3) is

【数３】 (Equation 3)

where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(i-j)[n] is the filter output signal of each stage.
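The image for (Equation 3) is not reproduced in this text. Judging from the symbol definitions above and from the later description of the mel autocorrelation function as an N-term product sum, it presumably has the following form. The symbol N (the number of samples in the cut-out frame) and the summation limits are assumptions, not taken from the equation itself.

```latex
\phi(i,j) \;=\; \sum_{n=0}^{N-1} x[n]\, y_{(i-j)}[n]
```

Under this reading, φ(i, j) depends only on the stage difference i - j, which is what makes the coefficient matrix of the normal equation Toeplitz.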

【００４２】[0042]

【００４３】[0043]

【００４４】[0044]

【００４５】[0045] Further, the speech signal compression method according to the present invention (claim 13) is the speech signal compression method according to claim 12, wherein the all-pass filter is a first-order all-pass filter.

【００４６】[0046]

【００４７】[0047] Further, the speech signal compression method according to the present invention (claim 14) is the speech signal compression method according to claim 12 or claim 13, wherein the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, so that weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

【００４８】[0048]

【００４９】[0049] Further, the speech signal compression apparatus according to the present invention (claim 15) encodes an input speech signal and compresses its amount of information, and comprises: feature calculation means for obtaining an autocorrelation function on the mel frequency axis using the input speech signal and a speech signal obtained by warping the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, and for converting the mel linear prediction coefficients obtained from that autocorrelation function into a feature quantity representing a spectral envelope; envelope normalization means for normalizing the input speech signal by inverse filtering with the feature quantity to obtain a residual signal; power normalization means for normalizing the residual signal on the basis of the maximum or average value of its power to obtain a normalized residual signal; and vector quantization means for vector-quantizing the normalized residual signal with a residual codebook to convert it into a residual code.

【００５０】[0050]

【００５１】[0051] Further, the speech signal compression apparatus according to the present invention (claim 16) is the speech signal compression apparatus according to claim 15, wherein the feature calculation means cuts out a speech signal of a fixed time length from the input speech signal; passes the fixed-length speech signal through a multi-stage all-pass filter to obtain a filter output signal for each stage; obtains an autocorrelation function on the mel frequency axis, in which the frequency axis is warped in accordance with human auditory sensitivity characteristics, by a finite number of product-sum operations (Equation 4) between the input speech signal and the filter output signal of each stage; obtains mel linear prediction coefficients from that autocorrelation function; and converts the mel linear prediction coefficients into a feature quantity representing a spectral envelope. Here, (Equation 4) is

【数４】 (Equation 4)

where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(i-j)[n] is the filter output signal of each stage.

【００５２】[0052]

【００５３】[0053]

【００５４】[0054]

【００５５】[0055] Further, the speech signal compression apparatus according to the present invention (claim 17) is the speech signal compression apparatus according to claim 16, wherein the all-pass filter is a first-order all-pass filter.

【００５６】[0056]

【００５７】[0057] Further, the speech signal compression apparatus according to the present invention (claim 18) is the speech signal compression apparatus according to claim 16 or claim 17, wherein the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, so that weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

【００５８】[0058]

【００５９】[0059] Further, the speech recognition method according to the present invention (claim 19) recognizes speech from an input speech signal, wherein an autocorrelation function on the mel frequency axis is obtained using the input speech signal and a speech signal obtained by warping the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics; mel linear prediction coefficients are obtained from that autocorrelation function; and a feature quantity representing a spectral envelope is obtained from the mel linear prediction coefficients.

【００６０】[0060]

【００６１】[0061] Further, the speech recognition method according to the present invention (claim 20) recognizes speech from an input speech signal, wherein a speech signal of a fixed time length is cut out from the input speech signal; the fixed-length speech signal is passed through a multi-stage all-pass filter to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis is warped in accordance with human auditory sensitivity characteristics, is obtained by a finite number of product-sum operations (Equation 5) between the input speech signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from that autocorrelation function; and a feature quantity representing a spectral envelope is obtained from the mel linear prediction coefficients. Here, (Equation 5) is

【数５】 (Equation 5)

where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(i-j)[n] is the filter output signal of each stage.

【００６２】[0062]

【００６３】[0063] Further, the speech recognition method according to the present invention (claim 21) is the speech recognition method according to claim 20, wherein the all-pass filter is a first-order all-pass filter.

【００６４】[0064]

【００６５】[0065]

【００６６】[0066]

【００６７】[0067] Further, the speech recognition method according to the present invention (claim 22) is the speech recognition method according to claim 20 or claim 21, wherein the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, so that weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

【００６８】[0068]

【００６９】[0069] Further, the speech recognition apparatus according to the present invention (claim 23) recognizes speech from an input speech signal, and comprises: mel linear prediction analysis means for obtaining an autocorrelation function on the mel frequency axis using the input speech signal and a speech signal obtained by warping the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, and for obtaining mel linear prediction coefficients from that autocorrelation function; cepstrum coefficient calculation means for calculating cepstrum coefficients from the mel linear prediction coefficients; and speech recognition means for calculating the distances between a plurality of frames of the cepstrum coefficients and a plurality of standard models, and recognizing the standard model with the shortest distance as the one with the highest similarity among the plurality of standard models.
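The two later stages of claim 23 can be illustrated with a minimal sketch. It assumes the prediction error filter is written A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p and uses the standard recursion for the cepstrum of 1/A(z); the function names `lpc_to_cepstrum` and `recognize`, the squared Euclidean distance, and the frame-by-frame comparison of equal-length sequences are all assumptions made for illustration. The specification does not state which distance measure or time alignment is used.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of 1/A(z), with A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.

    a[0] is expected to be 1.0. The sign convention is an assumption.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]


def recognize(frames, models):
    """Return the name of the standard model closest to the input frames.

    frames: list of cepstrum vectors (one per frame).
    models: dict mapping a model name to a list of cepstrum vectors.
    Sequences are compared frame by frame (no time warping), which is a
    simplification chosen for this sketch.
    """
    def distance(seq_a, seq_b):
        return sum((u - v) ** 2
                   for fa, fb in zip(seq_a, seq_b)
                   for u, v in zip(fa, fb))

    return min(models, key=lambda name: distance(frames, models[name]))
```

In this sketch the model with the smallest summed distance is taken to be the most similar one, mirroring the wording of the claim.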

【００７０】[0070]

【００７１】[0071] Further, the speech recognition apparatus according to the present invention (claim 24) is the speech recognition apparatus according to claim 23, wherein the mel linear prediction analysis means cuts out a speech signal of a fixed time length from the input speech signal; passes the fixed-length speech signal through a multi-stage all-pass filter to obtain a filter output signal for each stage; obtains an autocorrelation function on the mel frequency axis, in which the frequency axis is warped in accordance with human auditory sensitivity characteristics, by a finite number of product-sum operations (Equation 6) between the input speech signal and the filter output signal of each stage; and obtains mel linear prediction coefficients from that autocorrelation function. Here, (Equation 6) is

【数６】 (Equation 6)

where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(i-j)[n] is the filter output signal of each stage.

【００７２】[0072]

【００７３】[0073] Further, the speech recognition apparatus according to the present invention (claim 25) is the speech recognition apparatus according to claim 24, wherein the all-pass filter is a first-order all-pass filter.

【００７４】[0074]

【００７５】[0075] Further, the speech recognition apparatus according to the present invention (claim 26) is the speech recognition apparatus according to claim 24 or claim 25, wherein the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, so that weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

【００７６】[0076]

【００７７】[0077]

【００７８】[0078]

【００７９】[0079]

【００８０】[0080]

【００８１】[0081]

【００８２】[0082]

【００８３】[0083]

【００８４】[0084]

【００８５】[0085]

【００８６】[0086]

【００８７】[0087]

【００８８】[0088]

【００８９】[0089]

【００９０】[0090]

【００９１】[0091]

【００９２】[0092]

【００９３】[0093]

【００９４】[0094]

【００９５】[0095]

【００９６】[0096]

【００９７】[0097]

【発明の実施の形態】DESCRIPTION OF THE EMBODIMENTS (Embodiment 1) FIG. 1 is a block diagram showing the configuration of an audio signal compression apparatus according to a first embodiment of the present invention. In the figure, 1 is a time-frequency conversion unit that converts the time series of an input digital audio signal or speech signal into a frequency characteristic signal sequence for each fixed-length period (frame), for example by MDCT or FFT. 2 is a spectrum envelope calculation unit that obtains, for each frame, a spectral envelope whose analysis accuracy varies with frequency from the input audio signal, using mel linear prediction analysis, in which a frequency warping function is incorporated into the prediction model. 3 is a normalization unit that flattens the frequency characteristic by dividing the frequency characteristic signal sequence calculated by the time-frequency conversion unit 1 by the spectral envelope obtained by the spectrum envelope calculation unit 2, thereby normalizing it. 4 is a power normalization unit that normalizes the power of the frequency characteristic signal sequence flattened by the normalization unit 3, on the basis of the maximum value, the average value, or the like of the power. 5 is a multi-stage quantization unit that vector-quantizes the frequency characteristic signal sequence flattened by the normalization unit 3 and the power normalization unit 4; the multi-stage quantization unit 5 includes a first-stage quantizer 51, a second-stage quantizer 52, ..., and an N-th-stage quantizer 53 connected in cascade. 6 is an auditory weighting calculation unit that receives the frequency characteristic signal sequence output from the time-frequency conversion unit 1 and the spectral envelope obtained by the spectrum envelope calculation unit 2, and obtains, on the basis of human auditory sensitivity characteristics, the weighting coefficients used for quantization in the quantization unit 5.
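A cascade of vector quantizers such as the multi-stage quantization unit 5 is commonly arranged so that each stage quantizes the error left by the previous stage. The sketch below assumes that arrangement, which this passage does not spell out, together with a weighted squared-error distance that uses the weighting coefficients from the auditory weighting calculation unit 6. The names `nearest` and `multistage_vq` and the toy codebooks are hypothetical.

```python
def nearest(codebook, vec, weights):
    """Index of the code vector with the smallest weighted squared error."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum(w * (v - c) ** 2 for w, v, c in zip(weights, vec, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def multistage_vq(vec, codebooks, weights):
    """Quantize vec stage by stage; each stage codes the previous residual.

    Returns the list of chosen indices (one per stage) and the final residual.
    """
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        index = nearest(codebook, residual, weights)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices, residual


# Two toy stages with two code vectors each.
stage_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes, error = multistage_vq([1.25, 0.75], stage_codebooks, [1.0, 1.0])
```

Only the indices would be transmitted; a decoder holding the same codebooks sums the selected code vectors to rebuild the flattened spectrum.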

【００９８】[0098] Next, the operation will be described. The time series of the input digital audio signal (hereinafter also referred to as the input signal) is converted into a frequency characteristic signal sequence by MDCT, FFT, or the like in the time-frequency conversion unit 1 for each fixed-length period (frame).

【００９９】[0099] Further, for each frame, the spectrum envelope calculation unit 2 obtains from the input signal a spectral envelope whose analysis accuracy varies with frequency, using mel linear prediction analysis, in which frequency warping is incorporated into the prediction model. FIG. 2 shows the spectrum envelope calculation unit 2, which obtains this spectral envelope from the input signal. In the figure, the spectrum envelope calculation unit 2 consists of a mel coefficient calculation unit 21, which uses mel linear prediction analysis to obtain linear prediction coefficients whose analysis accuracy varies with frequency, that is, mel-warped linear prediction coefficients, and an envelope calculation unit 22, which calculates the linear-frequency spectral envelope used for spectral flattening. The mel coefficient calculation unit 21 and the envelope calculation unit 22 are each described below.

【０１００】[0100] First, FIG. 3 outlines the processing in the mel coefficient calculation unit 21. In FIG. 3, 211 is an all-pass filter that warps the frequency axis of the input signal; 212 is a linear combination unit that forms a linear combination of the output signals of the all-pass filter 211 and the prediction coefficients and outputs a predicted value of the input signal of the all-pass filter 211; and 213 is a least-squares operation unit that applies the least-squares method to the predicted value output from the linear combination unit 212 and the output signal of the all-pass filter 211 and outputs mel-warped linear prediction coefficients. Next, the method of estimating linear prediction coefficients whose analysis accuracy varies with frequency, that is, mel-warped linear prediction coefficients, is described with reference to FIG. 3. First, the input signal x[n] is passed through the i-stage all-pass filter 211

【０１０１】[0101]

【数１３】 (Equation 13)

【０１０２】[0102] to give the output signal y_i[n]. The linear combination of y_i[n] with the prediction coefficients created by the linear combination unit 212,

【０１０３】[0103]

【数１４】 [Equation 14]

【０１０４】[0104] yields the predicted value of x[n],

【０１０５】[0105]

【数１５】 (Equation 15)

【０１０６】[0106] which is given by (Equation 16).

【０１０７】[0107]

【数１６】 (Equation 16)

【０１０８】[0108] Here, [ ] denotes a sequence on the time axis. The all-pass filter (Equation 13) is expressed by (Equation 17), and the output signal y_i[n] is obtained from (Equation 21) and (Equation 29), described later.

【０１０９】[0109]

【数１７】 [Equation 17]

【０１１０】[0110] Here, z denotes the z-transform operator.

【０１１１】[0111] FIG. 5 shows the frequency characteristic of this all-pass filter. In FIG. 5, the horizontal axis is the frequency axis before transformation and the vertical axis is the frequency axis after transformation. The figure shows the characteristic as α is varied from α = -0.5 to α = 0.8 in steps of 0.1. It can be seen that when α is positive, the low-frequency band is stretched and the high-frequency band is compressed; when α is negative, the opposite holds.
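The warping curves of FIG. 5 can be reproduced numerically. The sketch below assumes that (Equation 17), whose image is not reproduced here, is the usual first-order all-pass section (z^-1 - α) / (1 - α z^-1); for that filter the warped frequency is the negative of the phase response. The function name `warped_frequency` is hypothetical.

```python
import math


def warped_frequency(omega, alpha):
    """Warped angular frequency for a first-order all-pass section.

    omega: angular frequency in rad/sample, 0 <= omega <= pi.
    alpha: warping coefficient, |alpha| < 1.
    Assumes the all-pass form (z^-1 - alpha) / (1 - alpha z^-1).
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Sample one curve of the family: alpha = 0.5, eight points from 0 to pi.
curve = [warped_frequency(math.pi * k / 8, 0.5) for k in range(9)]
```

With a positive α the curve lies above the diagonal, so a given stretch of low frequencies occupies more of the warped axis, in line with the behavior described for FIG. 5.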

【０１１２】[0112] The present invention assumes, as input signals, audio signals and speech signals with different sampling frequencies, that is, different bandwidths. By choosing the value of α for each signal according to its sampling frequency, a frequency resolution matched to human auditory characteristics can be obtained when the spectral envelope is computed. For example, the Bark scale is generally known as a scale derived from observations of the critical bandwidth, which relates to the frequency resolution of hearing, and the value of α can be determined on the basis of this characteristic.

【０１１３】[0113] The Bark scale is derived from the concept of the auditory filter proposed by Fletcher. Fletcher's auditory filter is a band-pass filter whose center frequency varies continuously: the band-pass filter whose center frequency is closest to a signal tone performs the frequency analysis of that tone, and the noise components that affect masking of the tone are limited to the frequency components within that band-pass filter. Fletcher named the bandwidth of this band-pass filter the critical band. The mel scale, in turn, is generally known as a psychological scale that directly quantifies the sensation of pitch on the basis of human subjective judgment, and the value of α can also be determined on the basis of this characteristic.

【０１１４】[0114] For example, when the mel scale is adopted as the frequency weighting corresponding to auditory sensitivity characteristics, we used α = 0.31 at a sampling frequency of 8 kHz, α = 0.35 at 10 kHz, α = 0.41 at 12 kHz, α = 0.45 at 16 kHz, and α = 0.6 to 0.7 at 44.1 kHz. When the Bark scale is adopted instead, α may be changed from these values as appropriate; for example, at 12 kHz we adopt α = 0.51 for the Bark scale.
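The values above can be held in a small lookup table. This is only a convenience sketch: the names `MEL_ALPHA`, `BARK_ALPHA`, and `mel_alpha` are hypothetical, each entry is stored as a (low, high) pair because the 44.1 kHz value is given only as a range, and no interpolation is attempted for sampling frequencies the text does not list.

```python
# Warping coefficients quoted in the text, keyed by sampling frequency in Hz.
MEL_ALPHA = {
    8000: (0.31, 0.31),
    10000: (0.35, 0.35),
    12000: (0.41, 0.41),
    16000: (0.45, 0.45),
    44100: (0.6, 0.7),   # given only as a range
}

# The text quotes a single Bark-scale value.
BARK_ALPHA = {12000: (0.51, 0.51)}


def mel_alpha(sampling_hz):
    """Return the (low, high) alpha pair quoted for this sampling frequency.

    Raises KeyError for frequencies the text does not list, rather than
    guessing an interpolated value.
    """
    return MEL_ALPHA[sampling_hz]
```

For example, `mel_alpha(16000)` returns the pair (0.45, 0.45), while 44.1 kHz returns the whole quoted range and leaves the choice within it to the caller.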

【０１１５】[0115] Next, so as to minimize the total squared error ε, given by

【０１１６】[0116]

【数１８】 (Equation 18)

【０１１７】[0117] between the all-pass filter output signal y_i[n] and the predicted value (Equation 15), the least-squares operation unit 213 obtains, by the least-squares method, the coefficients

【０１１８】[0118]

【数１９】 [Equation 19]

【０１１９】[0119] Here, p is the order of the prediction coefficients. Its value may be set in advance through preliminary experiments, taking the computational cost of signal compression into account: for example, 8 to 14 when the input signal is a speech signal, and 10 to 20 when it is an audio signal. In the above,

【０１２０】[0120]

【数２０】 (Equation 20)

【０１２１】[0121]

【数２１】 (Equation 21)

【０１２２】[0122] hold.

【０１２３】[0123] The mel-warped linear prediction coefficients that minimize the total squared error ε of (Equation 18) are given by the following normal equation.

【０１２４】[0124]

【数２２】 (Equation 22)

【０１２５】[0125] Here, the coefficients

【０１２６】[0126]

【数２３】 (Equation 23)

【０１２７】[0127] are the autocorrelation function on the mel frequency axis (mel frequency domain), called the mel autocorrelation function, and are given by the following equation.

【０１２８】[0128]

【数２４】 (Equation 24)

【０１２９】[0129] By Parseval's theorem, (Equation 23) is related to the spectrum on the linear frequency axis

【０１３０】[0130]

【数２５】 (Equation 25)

【０１３１】[0131] by (Equation 26). Here, ( ) denotes a sequence in the frequency domain.

【０１３２】[0132]

【数２６】 (Equation 26)

【０１３３】[0133] Further, rewriting (Equation 26) in the form of an expression on the mel frequency axis gives

【０１３４】[0134]

【数２７】 [Equation 27]

【０１３５】[0135] where

【０１３６】[0136]

【数２８】 [Equation 28]

【０１３７】[0137] This (Equation 28) is obtained by Fourier-transforming the all-pass filter expressed by (Equation 17). (Equation 27) means that the mel autocorrelation function (Equation 23) is equal to the inverse Fourier transform of the power spectrum on the mel frequency axis. Consequently, the coefficient matrix of (Equation 22) is a Toeplitz autocorrelation matrix, and the mel-warped linear prediction coefficients can be obtained with a simple recurrence. The actual calculation procedure for obtaining the mel-warped linear prediction coefficients is described below; its flow is shown in FIG. 4.
(Step 1) In step S1 the input signal x[n] is obtained, and in step S2 it is passed through the i-stage all-pass filter; the output signal y_i[n] obtained in step S3 is given by the following equation.

【０１３８】[0138]

【数２９】 (Equation 29)

【０１３９】[0139] Here, (Equation 21) holds.
(Step 2) In step S4, the product sum of the input signal x[n] and the filter output signal y_i[n] of each stage is computed as in the following equation, which yields the autocorrelation function on the mel frequency axis in step S5. From the relation in (Equation 27), the mel autocorrelation function (Equation 23) depends only on the difference in the number of all-pass filter stages

【０１４０】[0140]

【数３０】 [Equation 30]

【０１４１】[0141] and can therefore be computed by an N-term product-sum operation, as in (Equation 31) below, with no need for an approximation by truncating the operation. (Equation 31) is obtained by transforming (Equation 24) using (Equation 21) and (Equation 29).

【０１４２】[0142]

【数３１】 (Equation 31)

【０１４３】[0143] That is, as (Equation 31) shows, a calculation that would inherently require an infinite number of operations with the ordinary method of (Equation 24) finishes in a finite number of operations, so an enormous amount of computation is not needed. Moreover, none of the approximations, such as waveform truncation, that would be needed if the computation were cut off after a finite number of operations is required, and no error due to waveform truncation arises at all. In addition, the amount of computation is only about twice that of ordinary autocorrelation coefficients, so the function can be obtained directly from the waveform. This is an important point that differs decisively from the conventional calculation method of (Equation 24).
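Steps 1 and 2 can be sketched as follows. The images for (Equation 29) and (Equation 31) are not reproduced in this text, so the sketch assumes the recursion that follows from a first-order all-pass section (z^-1 - α) / (1 - α z^-1) with zero initial state, and an N-term product sum between the frame and each stage output. The function name `mel_autocorrelation` and the zero initial conditions are assumptions.

```python
def mel_autocorrelation(x, order, alpha):
    """Return [r[0], ..., r[order]], the mel autocorrelation of one frame.

    x: list of N samples of the cut-out (windowed) frame.
    order: prediction order p (number of all-pass stages).
    alpha: warping coefficient of the first-order all-pass filter.
    """
    n_samples = len(x)
    y_prev = list(x)                      # stage 0 output is the frame itself
    r = [sum(a * b for a, b in zip(x, y_prev))]
    for _ in range(order):
        y_cur = [0.0] * n_samples
        prev_in = 0.0                     # previous-stage output at n - 1
        prev_out = 0.0                    # this stage's output at n - 1
        for n in range(n_samples):
            # First-order all-pass recursion (assumed form).
            y_cur[n] = alpha * (prev_out - y_prev[n]) + prev_in
            prev_in = y_prev[n]
            prev_out = y_cur[n]
        # N-term product sum of the frame with this stage's output.
        r.append(sum(a * b for a, b in zip(x, y_cur)))
        y_prev = y_cur
    return r
```

With α = 0 each stage reduces to a one-sample delay and the result is the ordinary autocorrelation, which gives a simple sanity check. Each extra lag costs one filtering pass plus one product sum over the frame, consistent with the stated cost of about twice that of ordinary autocorrelation.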

【０１４４】[0144] (Step 3) In step S6, the normal equation of (Equation 22) is solved with the mel autocorrelation function (Equation 23) by a known algorithm, for example Durbin's method, and in step S7 the mel-warped linear prediction coefficients (mel linear prediction coefficients) are obtained.
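Durbin's method, named in Step 3, can be sketched as a generic Levinson-Durbin recursion. This is not code from the specification: the function name `levinson_durbin` and the sign convention (error filter A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p) are assumptions, and the input is taken to be the mel autocorrelation values r[0] to r[p].

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations by the Levinson-Durbin recursion.

    r: autocorrelation values r[0..order], with r[0] > 0.
    Returns (a, err): a[0] = 1.0 followed by the predictor coefficients
    under the assumed sign convention, and the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                    # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


# Autocorrelation of a first-order process with coefficient 0.5.
coeffs, residual_energy = levinson_durbin([1.0, 0.5, 0.25], 2)
```

The recursion needs on the order of p squared operations, which is why the Toeplitz structure of the mel autocorrelation matrix matters in practice. The values k computed along the way are reflection coefficients, which under some sign conventions correspond to the PARCOR coefficients mentioned elsewhere in the description.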

【０１４５】[0145] Next, FIG. 6 outlines the envelope calculation unit 22. In FIG. 6, 221 is an inverse mel transform unit that applies an inverse mel transform to the mel-warped linear prediction coefficients and outputs linear-frequency linear prediction coefficients, and 222 is an FFT unit that Fourier-transforms the linear-frequency linear prediction coefficients and outputs a spectral envelope. Next, the method of obtaining the linear-frequency spectral envelope used for spectral flattening from the linear prediction coefficients whose analysis accuracy varies with frequency, that is, the mel-warped linear prediction coefficients (Equation 19), is described with reference to FIG. 6. First, the inverse mel transform unit 221 applies to the mel-warped linear prediction coefficients (Equation 19) the inverse mel transform

【０１４６】[0146]

【数３２】 (Equation 32)

【０１４７】[0147] which yields the linear-frequency linear prediction coefficients

【０１４８】[0148]

【数３３】 [Equation 33]

【０１４９】[0149] In practice, (Equation 32) can be solved by computing the well-known Oppenheim recurrence. Here, the all-pass filter

【０１５０】[0150]

【数３４】 (Equation 34)

【０１５１】[0151] must be the all-pass filter of (Equation 35), obtained by replacing α with -α in (Equation 17).

【０１５２】[0152]

【数３５】 (Equation 35)

【０１５３】[0153] This makes it possible to obtain prediction coefficients converted from the mel frequency axis to the linear frequency axis. Further, in the FFT unit 222, the linear-frequency spectral envelope S(ejα) used for spectral flattening can be obtained from these linear-frequency linear prediction coefficients (Equation 33) by (Equation 36) using an FFT.
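The last stage, turning linear-frequency prediction coefficients into a spectral envelope, can be sketched as below. The image for (Equation 36) is not reproduced here, so the sketch assumes the common all-pole form gain / |A(e^{jω})|^2 with A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p, and evaluates it with a direct DFT rather than an FFT to stay dependency-free. The function name `lpc_envelope` and the `gain` parameter are hypothetical, and the inverse mel transform (Oppenheim recurrence) is not covered.

```python
import cmath
import math


def lpc_envelope(a, n_bins, gain=1.0):
    """Power envelope gain / |A(e^{jw})|^2 at n_bins frequencies in [0, pi).

    a: error-filter coefficients with a[0] = 1.0 (assumed convention).
    """
    envelope = []
    for k in range(n_bins):
        omega = math.pi * k / n_bins
        # Evaluate A(e^{jw}) directly; an FFT gives the same values faster.
        response = sum(c * cmath.exp(-1j * omega * m) for m, c in enumerate(a))
        envelope.append(gain / abs(response) ** 2)
    return envelope


# Envelope of a one-pole model with its pole at z = 0.5.
env = lpc_envelope([1.0, -0.5], 2)
```

A zero-padded FFT of the coefficient vector would produce the same samples of A(e^{jω}) in one call, which is presumably why the description uses an FFT unit.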

【０１５４】[0154]

【数３６】 [Equation 36]

【０１５５】[0155] Next, the normalization unit 3 flattens the frequency characteristic signal sequence by dividing the sequence calculated above by the spectral envelope, thereby normalizing it. The frequency characteristic signal sequence flattened by the normalization unit 3 is further power-normalized in the power normalization unit 4 on the basis of the maximum value, the average value, or the like of its power.
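The two normalization stages can be sketched together. The sketch assumes the envelope is expressed on the same (amplitude) scale as the frequency characteristic signal sequence and that power normalization divides by the square root of the maximum or mean power; the function name `flatten_and_normalize` and the returned gain value are hypothetical details not stated in the text.

```python
def flatten_and_normalize(spectrum, envelope, mode="max"):
    """Divide by the envelope, then normalize the power of the result.

    spectrum, envelope: equal-length lists; envelope entries must be nonzero.
    mode: "max" to use the maximum power, anything else to use the mean power.
    Returns (normalized, gain); gain would be coded separately as side
    information so that a decoder can undo the scaling.
    """
    flat = [s / e for s, e in zip(spectrum, envelope)]
    powers = [v * v for v in flat]
    reference = max(powers) if mode == "max" else sum(powers) / len(powers)
    gain = reference ** 0.5
    if gain == 0.0:                       # silent frame: nothing to scale
        return flat, gain
    return [v / gain for v in flat], gain


normalized, gain = flatten_and_normalize([2.0, 4.0, 6.0], [1.0, 2.0, 2.0])
```

After this step the values lie in a bounded range regardless of input level, which is what lets a fixed codebook be used in the subsequent vector quantization.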

【０１５６】Incidentally, speech signal compression performs a normalization by the spectral envelope similar to that of the normalization unit 3. That is, the time series of the input speech signal is subjected to linear prediction analysis (LPC analysis) frame by frame and is thereby separated into an LPC spectral envelope component, such as LPC coefficients (linear prediction coefficients), LSP coefficients (line spectrum pair coefficients), or PARCOR coefficients (partial autocorrelation coefficients), and a residual signal whose frequency characteristic is flattened. This is equivalent to the division by the spectral envelope component in the frequency domain described in the above embodiment, and is also equivalent to inverse filtering on the time axis using a spectral envelope component such as the linear prediction coefficients, LSP coefficients, or PARCOR coefficients obtained by linear prediction analysis. Accordingly, speech signal compression can also be performed by inverse filtering on the time axis, or by separation into a spectral envelope component and a residual signal, using any of the following: the mel-warped linear prediction coefficients obtained from the input speech as in the present invention; mel-warped PARCOR coefficients obtained from the mel-warped linear prediction coefficients by the same known method used to obtain PARCOR coefficients from ordinary linear prediction coefficients; or mel-warped LSP coefficients obtained from the mel-warped linear prediction coefficients by the same known method used to obtain LSP coefficients from ordinary linear prediction coefficients.
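As a rough illustration of inverse filtering on the time axis, the sketch below removes an all-pole envelope from a signal and then restores it. It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and zero initial conditions; the function names and the sample values are hypothetical.

```python
def inverse_filter(x, a):
    """Residual e[n] = x[n] + sum_k a[k-1] * x[n-k], with x[n] = 0 for n < 0."""
    residual = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc += a[k - 1] * x[n - k]
        residual.append(acc)
    return residual


def synthesis_filter(e, a):
    """Inverse of inverse_filter: x[n] = e[n] - sum_k a[k-1] * x[n-k]."""
    x = []
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * x[n - k]
        x.append(acc)
    return x


# A decaying signal generated by a single pole at 0.5 (made-up example).
signal = [1.0, 0.5, 0.25, 0.125]
lpc = [-0.5]

residual = inverse_filter(signal, lpc)       # flattened: a single impulse
restored = synthesis_filter(residual, lpc)   # envelope re-applied
```

Here the residual collapses to one impulse because the envelope model matches the signal exactly; for real speech the residual is only approximately flat.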

【０１５７】Meanwhile, the auditory weighting calculation unit 6 receives the frequency characteristic signal sequence output from the time-frequency conversion unit 1 and the spectral envelope obtained by the spectral envelope calculation unit 2. For the spectrum of the frequency characteristic signal sequence, it calculates a characteristic signal that reflects human auditory sensitivity characteristics, such as the minimum audible limit and auditory masking, and then obtains the weighting coefficients used for quantization from this characteristic signal and the spectral envelope.

【０１５８】The residual signal output from the power normalization unit 4 is quantized by the first-stage quantization unit 51 of the multi-stage quantization unit 5 using the weighting coefficients obtained by the auditory weighting calculation unit 6. The quantization error component produced by the first-stage quantization unit 51 is quantized by the second-stage quantization unit 52 of the multi-stage quantization unit 5, again using the weighting coefficients obtained by the auditory weighting calculation unit 6. In the same manner, each subsequent quantization unit quantizes the quantization error component produced by the preceding stage. Each quantization unit outputs a code as its quantization result. Finally, the Nth-stage quantization unit 53 quantizes the quantization error component produced by the (N-1)th-stage quantization unit using the weighting coefficients obtained by the auditory weighting calculation unit 6, which completes the compression coding of the audio signal.
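The cascade of weighted quantizers can be sketched as follows. This is only an illustration under simplifying assumptions: an exhaustive nearest-neighbour search, a weighted squared-error distance, and tiny hand-made codebooks. All names and values are hypothetical.

```python
def weighted_nearest(vec, codebook, weights):
    """Index of the code vector with the smallest weighted squared error."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum(w * (v - c) ** 2 for v, c, w in zip(vec, code, weights))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def multistage_quantize(vec, codebooks, weights):
    """Each stage quantizes the error left over by the previous stage."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        index = weighted_nearest(residual, codebook, weights)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


def multistage_decode(codes, codebooks):
    """Sum the selected code vectors of all stages."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Two stages with made-up codebooks; the weights emphasize the first component.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
]
weights = [1.0, 0.5]

codes, error = multistage_quantize([1.5, 0.5], codebooks, weights)
decoded = multistage_decode(codes, codebooks)
```

Only the list of code indices needs to be transmitted; the decoder reconstructs the vector by summing the corresponding entries from its own copy of the codebooks.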

【０１５９】As described above, in the audio signal compression method and audio signal compression apparatus according to the first embodiment, the normalization unit 3 normalizes the frequency characteristic signal sequence calculated from the input audio signal using a spectral envelope whose analysis accuracy varies with frequency in accordance with human auditory sensitivity characteristics. The frequency characteristic signal sequence can therefore be flattened accurately, and efficient quantization can be performed.

【０１６０】In addition, the burden of vector quantization in the multi-stage quantization unit 5 is reduced, and efficient quantization can be performed. Vector quantization represents the frequency characteristic signal sequence with a limited amount of information (codes), so the simpler the shape of the sequence, the fewer codes are needed to represent it. In the present invention, the shape of the frequency characteristic signal sequence is simplified by normalizing it with the spectral envelope, which expresses the rough shape of the sequence. Using a spectral envelope whose analysis accuracy varies with frequency as this rough shape simplifies the shape of the sequence more accurately and allows efficient quantization.

【０１６１】Furthermore, the vector quantization units 51 to 53 of the multi-stage quantization unit 5 perform vector quantization using, as the quantization weights, the frequency-domain weighting coefficients that the auditory weighting calculation unit 6 calculates from the spectrum of the input audio signal, the human auditory sensitivity characteristics, and the spectral envelope whose analysis accuracy varies with frequency in accordance with those characteristics. Efficient quantization that exploits the properties of human hearing can therefore be performed.

【０１６２】The mel coefficient calculation unit 21 obtains, from the input signal, linear prediction coefficients whose analysis accuracy varies with frequency, that is, mel-warped linear prediction coefficients, by mel linear prediction analysis. These coefficients may instead be obtained by the following method: the frequency axis of the input signal is warped with an all-pass filter to obtain a frequency-warped signal, and ordinary linear prediction analysis is applied to this frequency-warped signal to obtain a spectral envelope whose analysis accuracy varies with frequency. This method of estimating the mel-warped linear prediction coefficients is described below. First, the input signal x[n] is transformed by

【０１６３】[0163]

【数３７】 (37)

【０１６４】so that the output signal whose frequency axis has been converted to the mel frequency,

【０１６５】[0165]

【数３８】 (38)

【０１６６】is obtained. Here, the all-pass filter

【０１６７】[0167]

【数３９】 [Equation 39]

【０１６８】is expressed by (Equation 17).

【０１６９】Next, by performing ordinary linear prediction analysis on this output signal (Equation 38), the mel-warped linear prediction coefficients, that is, linear prediction coefficients whose analysis accuracy varies with frequency,

【０１７０】[0170]

【数４０】 (Equation 40)

【０１７１】can be obtained. In practice, (Equation 37) can be solved by computing the well-known Oppenheim recurrence. The mel coefficient calculation unit 21 may use linear prediction coefficients whose analysis accuracy varies with frequency that have been obtained in this way.

【０１７２】Besides obtaining the frequency-warped signal by warping the frequency axis of the input signal directly with an all-pass filter, the spectral envelope calculation unit 2 can also obtain a spectral envelope whose analysis accuracy varies with frequency as follows: the power spectrum of the input signal is resampled on the frequency axis, that is, interpolated, to obtain a frequency-warped (mel-transformed) power spectrum, and an inverse DFT is applied to this power spectrum.

【０１７３】The spectral envelope calculation unit 2 can also pass the autocorrelation function obtained from the input signal through m stages of the all-pass filter to obtain an autocorrelation function with a warped frequency axis, and obtain from it a spectral envelope whose analysis accuracy varies with frequency.

【０１７４】In the audio signal compression apparatus of FIG. 1, the auditory weighting calculation unit 6 uses the spectral envelope to calculate the weighting coefficients. Alternatively, the weighting coefficients may be calculated using only the spectrum of the input audio signal and the human auditory sensitivity characteristics.

【０１７５】In the audio signal compression apparatus of FIG. 1, all of the vector quantization units of the multi-stage quantization unit 5 quantize using the weighting coefficients based on the auditory sensitivity characteristics obtained by the auditory weighting calculation unit 6. However, as long as any one of the vector quantizers of the multi-stage quantization unit 5 quantizes using weighting coefficients based on the auditory sensitivity characteristics, quantization is more efficient than when no such weighting coefficients are used. Furthermore, although the signal to be compressed by the apparatus of FIG. 1 has been described as an audio-band signal, it may instead be a speech-band signal, in which case the apparatus of FIG. 1 serves as a speech signal compression apparatus without modification. Also, the apparatus of FIG. 1 uses the mel scale as the frequency weighting corresponding to the human auditory sensitivity characteristics, but by changing the value of α of the all-pass filter appropriately, the same block configuration of FIG. 1 can be turned into an audio signal compression apparatus that compresses the signal based on the Bark scale.
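To see how α controls the frequency warping, the sketch below evaluates the phase response of a first-order all-pass filter. It assumes the common form (z^-1 - α) / (1 - α z^-1), because the filter equation itself is not reproduced in this text; the function name and the α values are illustrative only.

```python
import math


def warp_frequency(omega, alpha):
    """Warped angular frequency produced by a first-order all-pass filter.

    Assumes the filter (z^-1 - alpha) / (1 - alpha z^-1); `omega` is the
    normalized angular frequency in radians, from 0 to pi.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Compare an unwarped axis (alpha = 0) with two illustrative warping strengths.
for alpha in (0.0, 0.31, 0.45):
    points = [warp_frequency(k * math.pi / 4, alpha) / math.pi for k in range(5)]
    print(alpha, [round(p, 3) for p in points])
```

With α = 0 the axis is left unchanged. As α grows, low frequencies are stretched over a larger share of the warped axis, which is what gives them finer resolution; a different α yields a different auditory-style scale.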

【０１７６】(Embodiment 2) FIG. 7 is a block diagram showing the configuration of a speech recognition apparatus according to a second embodiment of the present invention. In the figure, reference numeral 7 denotes a mel linear prediction analysis unit that uses mel linear prediction analysis, in which frequency warping is incorporated into the prediction model, to calculate for each frame mel linear prediction coefficients whose resolution varies with frequency from the input speech. Reference numeral 8 denotes a cepstrum coefficient calculation unit that converts the mel linear prediction coefficients calculated by the mel linear prediction analysis unit 7 into cepstrum coefficients. Reference numeral 9 denotes a speech recognition unit that calculates the similarity between the time series of cepstrum coefficients calculated by the cepstrum coefficient calculation unit 8 and a plurality of reference models prepared in advance, such as words and phonemes, and recognizes the word or phoneme with the highest similarity. The speech recognition unit 9 may perform either speaker-dependent or speaker-independent recognition.

【０１７７】Next, the operation is described in detail. First, for each fixed-length period (frame) of the time series of the input digital speech (hereinafter also called the "input signal"), the mel linear prediction analysis unit 7 calculates mel linear prediction coefficients corresponding to a spectral envelope whose resolution varies with frequency, using mel linear prediction analysis in which frequency warping is incorporated into the prediction model. The operation of the mel linear prediction analysis unit 7 is described below.

【０１７８】First, FIG. 8 shows an outline of the mel linear prediction analysis unit 7. The method of calculating linear prediction coefficients whose resolution varies with frequency, that is, mel-warped linear prediction coefficients, is described with reference to FIG. 8.

【０１７９】[0179]

【数４１】 [Equation 41]

【０１８０】The model obtained by replacing with (Equation 41),

【０１８１】[0181]

【数４２】 (Equation 42)

【０１８２】is used. Here,

【０１８３】[0183]

【数４３】 [Equation 43]

【０１８４】are the mel linear prediction coefficients, and α is the warping coefficient that varies the resolution of the linear prediction analysis with frequency. The frequency characteristic of the all-pass filter has already been shown in FIG. 5. Suitable values of the warping coefficient are, for example, α = 0.31 for a sampling frequency of 8 kHz, α = 0.35 for 10 kHz, α = 0.41 for 12 kHz, α = 0.45 for 16 kHz, and α = 0.6 to 0.7 for 44.1 kHz. Here, for a finite-length waveform x[n] (n = 0, ..., N-1) of length N, the prediction error is evaluated by

【０１８５】[0185]

【数４４】 [Equation 44]

【０１８６】which is the total squared prediction error over an infinite interval. In this case,

【０１８７】[0187]

【数４５】 [Equation 45]

【０１８８】holds. Further, letting yi[n] denote the output waveform obtained by passing the input signal x[n] through i stages of the all-pass filter, the predicted value of yi[n],

【０１８９】[0189]

【数４６】 [Equation 46]

【０１９０】is expressed as the following linear combination.

【０１９１】[0191]

【数４７】 [Equation 47]

【０１９２】From this, the coefficients (Equation 43) that minimize the prediction error are given by the following simultaneous equations.

【０１９３】[0193]

【数４８】 [Equation 48]

【０１９４】Here, φij is the covariance of the infinite-length waveforms yi[n] and yj[n]. However, by applying Parseval's theorem together with the all-pass filter

【０１９５】[0195]

【数４９】 [Equation 49]

【０１９６】expressed on the frequency axis by its Fourier transform, φij is given by a finite number of product-sum operations, as in the following equation.

【０１９７】[0197]

【数５０】 [Equation 50]

【０１９８】Further, setting

【０１９９】[0199]

【数５１】 (Equation 51)

【０２００】it can be shown that r[m] has the properties of an autocorrelation function, so that

【０２０１】[0201]

【数５２】 (Equation 52)

【０２０２】is also guaranteed to be stable. As can be seen from (Equation 50), a conventional calculation of the form shown in the middle expression of (Equation 50) would in principle require an infinite number of operations, whereas the form shown on the right-hand side of (Equation 50) finishes in a finite number of operations, so no enormous amount of computation is needed. Moreover, no approximation such as truncating the waveform, which would otherwise be required to cut an infinite computation down to a finite one, is needed at all, and so no truncation error arises. In addition, the computational cost is only a few times that of ordinary autocorrelation coefficients, so the coefficients can be obtained directly from the waveform. This is an important point that differs decisively from conventional calculation methods.

【０２０３】FIG. 8 shows the actual calculation procedure for obtaining the mel linear prediction coefficients. This part is the same as FIG. 3 of the first embodiment. In FIG. 8, reference numeral 71 denotes an all-pass filter that warps the frequency axis of the input signal; 72 denotes a linear combination unit that forms a linear combination of the output signals of the all-pass filter 71 and the prediction coefficients and outputs a predicted value of the input signal of the all-pass filter 71; and 73 denotes a least squares calculation unit that applies the least squares method to the predicted value output from the linear combination unit 72 and the input signal and outputs the mel-warped linear prediction coefficients. The method of estimating linear prediction coefficients whose analysis accuracy varies with frequency, that is, mel-warped linear prediction coefficients, is described next with reference to FIG. 8.

【０２０４】(Step 1) The output signal yi[n], obtained by passing the input signal x[n] through i stages of the all-pass filter 71, is calculated by the following equation.

【０２０５】[0205]

【数５３】 (Equation 53)

【０２０６】Here, (Equation 21) applies.

(Step 2) In the linear combination unit 72, the autocorrelation function on the mel frequency axis is obtained as the sum of products of the input signal x[n] and the filter output signal yi[n] of each stage, as in the following equation. Because of the relation in (Equation 27), the mel autocorrelation function (Equation 23) depends only on the difference in the number of all-pass filter stages,

【０２０７】[0207]

【数５４】 (Equation 54)

【０２０８】and it can therefore be computed by a product-sum operation over N terms, without any truncation approximation, as in the following equation.

【０２０９】[0209]

【数５５】 [Equation 55]

【０２１０】(Step 3) In the least squares calculation unit 73, the normal equations of (Equation 22) are solved using the mel autocorrelation function (Equation 23) by a known algorithm, for example Durbin's method, to obtain the mel-warped linear prediction coefficients (mel linear prediction coefficients).
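Steps 1 to 3 can be sketched as follows. Because the equations themselves are not reproduced in this text, the sketch assumes the common first-order all-pass form (z^-1 - α) / (1 - α z^-1) with y0[n] = x[n], and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The function names and the test frame are hypothetical.

```python
def allpass_stage(prev, alpha):
    """One all-pass stage: y[n] = alpha * (y[n-1] - prev[n]) + prev[n-1]."""
    out = []
    y_last = 0.0      # y[n-1], zero before the frame starts
    prev_last = 0.0   # prev[n-1]
    for p in prev:
        y = alpha * (y_last - p) + prev_last
        out.append(y)
        y_last, prev_last = y, p
    return out


def mel_autocorrelation(x, order, alpha):
    """Steps 1 and 2: r[m] = sum over the N frame samples of x[n] * y_m[n]."""
    r = []
    y = list(x)  # y_0[n] = x[n]
    for _ in range(order + 1):
        r.append(sum(xn * yn for xn, yn in zip(x, y)))
        y = allpass_stage(y, alpha)
    return r


def durbin(r):
    """Step 3: solve the normal equations; returns (coefficients, PARCOR values)."""
    order = len(r) - 1
    a = [1.0] + [0.0] * order
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        reflection.append(k)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a[1:], reflection


frame = [1.0, 0.5, -0.3, 0.8, -0.6, 0.2, 0.1, -0.4]  # made-up samples
r_mel = mel_autocorrelation(frame, order=3, alpha=0.31)
mel_lpc, parcor = durbin(r_mel)
```

Only the N samples of the frame enter each sum, so no truncation of the infinitely long filter outputs is involved. With α = 0 each stage reduces to a unit delay and `mel_autocorrelation` returns the ordinary autocorrelation.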

【０２１１】The mel linear prediction coefficients (Equation 43) obtained as described above are converted into cepstrum coefficients by the cepstrum coefficient calculation unit 8. The conversion to cepstrum coefficients is already known and is described in detail in, for example, Kiyohiro Shikano, Satoshi Nakamura, and Shiro Ise, "Digital Signal Processing of Speech and Sound Information" (音声・音情報のディジタル信号処理), Shokodo, pp. 10-16. The mel linear prediction coefficients may simply be treated in the same way as ordinary linear prediction coefficients for this conversion. As a result, cepstrum coefficients on the mel frequency axis are obtained.
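One widely used form of this conversion is the recursion sketched below. It is not taken from the cited reference; it assumes the all-pole model 1 / A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and omits the gain term c[0]. The function name is hypothetical.

```python
def lpc_to_cepstrum(a, n_cep):
    """Cepstrum c[1..n_cep] of the all-pole model 1 / A(z).

    `a` holds a_1..a_p for A(z) = 1 + sum_k a_k z^-k (assumed sign convention).
    """
    p = len(a)
    c = [0.0] * (n_cep + 1)  # c[0] (log gain) is not computed here
    for n in range(1, n_cep + 1):
        acc = 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k - 1]
        c[n] = -acc / n
        if n <= p:
            c[n] -= a[n - 1]
    return c[1:]


# Single pole at 0.5: the exact cepstrum is 0.5**n / n.
cepstrum = lpc_to_cepstrum([-0.5], 3)
```

Feeding mel linear prediction coefficients into the same recursion gives cepstrum coefficients on the warped frequency axis, in line with the statement that they may be handled like ordinary linear prediction coefficients.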

【０２１２】In the speech recognition unit 9, the similarity between the time series of the cepstrum coefficients calculated in this way (hereinafter called mel LPC cepstrum coefficients) and a plurality of reference models prepared in advance, such as words and phonemes, is calculated, and the word or phoneme with the highest similarity is recognized.

【０２１３】One type of reference model is the hidden Markov model (HMM), which represents the time series of features for each word in the recognition vocabulary as probabilistic transitions. HMMs are already widely used and well known (see, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (確率モデルによる音声認識), edited by the Institute of Electronics, Information and Communication Engineers). In the HMM approach, models are trained in advance on time series of phoneme or word features that include variation between individuals, and recognition is performed by evaluating, as a probability, how close the input speech is to each model. In this embodiment, the time series of mel LPC cepstrum coefficients described above is used as the time series of features.

【０２１４】Alternatively, the reference model may be a representative feature time series chosen from the feature time series for each word in the recognition vocabulary, or a normalized feature time series obtained by normalizing (stretching or compressing) the feature time series in time or in frequency. For example, DP matching (dynamic programming) is a method of normalizing to an arbitrary length on the time axis, and it can normalize the time series of temporal features according to a predetermined alignment rule. Whichever of these reference models is used, this embodiment poses no problem, because the time series of mel LPC cepstrum coefficients described above can simply be used as the feature time series.
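A minimal sketch of DP matching against stored templates is given below. The local Euclidean distance, the three allowed path steps, and the toy one-dimensional "cepstrum" vectors are assumptions chosen for brevity; practical systems add slope constraints and path-length normalization.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated distance of the best time alignment of two feature sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = sum((p - q) ** 2
                        for p, q in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = local + min(cost[i - 1][j],      # stretch seq_b
                                     cost[i][j - 1],      # stretch seq_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]


def recognize(features, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical templates: one short feature sequence per vocabulary word.
templates = {
    "up": [[0.0], [1.0], [2.0]],
    "down": [[2.0], [1.0], [0.0]],
}
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]  # a slower rendition of "up"
result = recognize(utterance, templates)
```

The utterance is longer than either template, yet the alignment absorbs the difference in speaking rate, which is the role of time-axis normalization described above.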

【０２１５】In this embodiment, recognition uses mel LPC cepstrum coefficients as the time series of features obtained from the input speech. However, speech recognition can also use mel PARCOR coefficients, which can be obtained from the mel linear prediction coefficients by the same known method used to obtain PARCOR coefficients from ordinary linear prediction coefficients, or mel LSP coefficients, which can be obtained from the mel linear prediction coefficients by the same known method used to obtain LSP coefficients from ordinary linear prediction coefficients. Moreover, the mel linear prediction coefficients and the mel PARCOR coefficients, mel LSP coefficients, mel LPC cepstrum coefficients, and the like derived from them can be used in place of the linear prediction coefficients, PARCOR coefficients, LSP coefficients, LPC cepstrum coefficients, and the like obtained by conventional linear prediction analysis, not only in speech recognition but also in a wide range of fields such as speech synthesis and speech coding.

【０２１６】In this embodiment, the mel linear prediction analysis unit 7 obtains, from the input signal, linear prediction coefficients whose resolution varies with frequency, that is, mel-warped linear prediction coefficients, by mel linear prediction analysis. They may instead be obtained by the same alternative method as in the first embodiment: the frequency axis of the input signal is warped with an all-pass filter to obtain a frequency-warped signal, and ordinary linear prediction analysis is applied to this frequency-warped signal to obtain a spectral envelope whose resolution varies with frequency.

【０２１７】As described above, mel linear prediction analysis based on frequency weighting that corresponds to human auditory sensitivity characteristics yields features corresponding to a spectral envelope whose resolution varies with frequency in accordance with those characteristics. The features of the spectral envelope can therefore be captured efficiently even with a small number of features, and using these features for speech recognition achieves high recognition performance with a smaller amount of processing than before.

【０２１８】(Embodiment 3) FIG. 9 is a block diagram showing the configuration of an audio signal compression apparatus according to a third embodiment of the present invention. The audio signal compression apparatus of this embodiment is described as a speech signal compression apparatus of the kind used mainly for compressing narrowband signals such as speech. In the figure, reference numeral 11 denotes a mel parameter calculation unit that obtains, for each frame, mel linear prediction coefficients representing a spectral envelope whose analysis accuracy varies with frequency from the input audio signal, by mel linear prediction analysis in which frequency warping is incorporated into the prediction model. Reference numeral 12 denotes a parameter conversion unit that converts the mel linear prediction coefficients on the mel frequency axis obtained by the mel parameter calculation unit 11 into features representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis. Reference numeral 13 denotes an envelope normalization unit that calculates a residual signal by inverse filtering the input audio signal with the features obtained by the parameter conversion unit 12 and thereby normalizing it. Reference numeral 14 denotes a power normalization unit that normalizes the power of the residual signal calculated by the envelope normalization unit 13 based on, for example, the maximum or the average value of the power. Reference numeral 15 denotes a vector quantization unit that vector-quantizes the normalized residual signal output by the power normalization unit 14 using the residual codebook 16 and converts it into a residual code.

【０２１９】Next, the operation is described. For each fixed-length period (frame) of the time series of the input digital audio signal such as speech (hereinafter also called the input signal or input speech), the mel parameter calculation unit 11 obtains mel linear prediction coefficients representing a spectral envelope whose analysis accuracy varies with frequency from the input signal, by mel linear prediction analysis in which frequency warping is incorporated into the prediction model. The mel linear prediction coefficients representing the spectral envelope are obtained by the same method as described for the mel coefficient calculation unit 21 of the first embodiment, so the features representing the spectral envelope can be obtained by the same procedure.

【０２２０】次に、パラメータ変換部１２では、メルパ
ラメータ算出部１１で算出されたメル周波数軸上のメル
線形予測係数を直線周波数軸の線形予測係数などスペク
トル包絡を表現する特徴量へと変換する。この部分も、
実施の形態１で説明している方法と同じであり、包絡算
出部２２と同様な方法で実現できる。ところで主に音声
信号の圧縮では、入力された音声信号の時系列は、フレ
ーム毎に線形予測分析（ＬＰＣ分析）することにより、
ＬＰＣ係数（線形予測係数）やＬＳＰ係数（line spect
rum pair coefficient），あるいはＰＡＲＣＯＲ係数
（偏自己相関係数）等のＬＰＣスペクトル包絡成分を表
わす特徴量を求め、この特徴量で逆フィルタリングして
正規化することにより残差信号を算出している。そこで
本実施の形態のような入力音声から求めたメル化された
線形予測係数を正規化のための特徴量として用いたり、
あるいは通常の線形予測係数からＰＡＲＣＯＲ係数を求
めるのと同様の公知の手法によりメル化された線形予測
係数から求めたメル化されたＰＡＲＣＯＲ係数や、ある
いは通常の線形予測係数からＬＳＰ係数を求めるのと同
様の公知の手法によりメル化された線形予測係数から求
めたメル化されたＬＳＰ係数を用いて、時間軸上での逆
フィルタリング処理や、あるいはスペクトル包絡成分と
残差信号とに分離を行えば、より精度の良い正規化や分
離が可能となる。Next, the parameter conversion unit 12 converts the mel linear prediction coefficients on the mel frequency axis, calculated by the mel parameter calculation unit 11, into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis. This step is also the same as the method described in Embodiment 1 and can be realized in the same manner as the envelope calculation unit 22. In speech signal compression, the time series of the input speech signal is conventionally subjected to linear prediction analysis (LPC analysis) frame by frame to obtain a feature quantity representing the LPC spectral envelope component, such as LPC coefficients (linear prediction coefficients), LSP coefficients (line spectrum pair coefficients), or PARCOR coefficients (partial autocorrelation coefficients), and the residual signal is calculated by inverse filtering with this feature quantity to normalize the signal. More accurate normalization and separation become possible if the inverse filtering on the time axis, or the separation into a spectral envelope component and a residual signal, is instead performed using one of the following: the mel-warped linear prediction coefficients obtained from the input speech as in this embodiment; mel-warped PARCOR coefficients obtained from the mel-warped linear prediction coefficients by the same known method that derives PARCOR coefficients from ordinary linear prediction coefficients; or mel-warped LSP coefficients obtained from the mel-warped linear prediction coefficients by the same known method that derives LSP coefficients from ordinary linear prediction coefficients.
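One known way of deriving PARCOR (reflection) coefficients from linear prediction coefficients is the step-down (backward Levinson) recursion. The sketch below is only an illustration of that textbook recursion, not a procedure quoted from this document; the function name and the sign convention are assumptions. Applied to mel-warped linear prediction coefficients, the same recursion would yield mel-warped PARCOR coefficients.

```python
def lpc_to_parcor(a):
    """Convert predictor coefficients into reflection coefficients.

    a holds a_1..a_p of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    Returns k_1..k_p. The sign convention is an assumption; some texts
    define the PARCOR coefficients as -k.
    """
    cur = list(a)
    k = [0.0] * len(cur)
    for m in range(len(cur), 0, -1):
        km = cur[m - 1]              # k_m is the last coefficient of order m
        if abs(km) >= 1.0:
            raise ValueError("unstable predictor: |k| >= 1")
        k[m - 1] = km
        # Undo stage m to obtain the order m-1 predictor coefficients.
        cur = [(cur[i] - km * cur[m - 2 - i]) / (1.0 - km * km)
               for i in range(m - 1)]
    return k
```

For example, the second-order predictor with coefficients `[0.35, -0.3]` corresponds to the reflection coefficients `[0.5, -0.3]` under this convention.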

【０２２１】同様に、本実施の形態の包絡正規化部１３
では、パラメータ変換部１２で変換された直線周波数軸
の線形予測係数などスペクトル包絡を表現する特徴量を
用いて、逆フィルタリングし、スペクトル包絡成分の正
規化を行い、残差信号を算出している。さらにパワー正
規化部１４では、包絡正規化部３で求められた残差信号
をパワーの最大値，あるいは平均値等に基づいてパワー
の正規化が行われる。そしてベクトル量子化部１５で
は、パワー正規化部１４から出力された残差信号が、あ
らかじめ求めておいた残差コードブック１６を用いてベ
クトル量子化される。その結果、ベクトル量子化部１５
は、量子化結果としてコードを出力することにより入力
信号の圧縮符号化が完了する。Similarly, the envelope normalization unit 13 of this embodiment performs inverse filtering using the feature quantity representing the spectral envelope, such as the linear prediction coefficients on the linear frequency axis converted by the parameter conversion unit 12, thereby normalizing the spectral envelope component and calculating the residual signal. The power normalization unit 14 then normalizes the power of the residual signal obtained by the envelope normalization unit 13 on the basis of, for example, the maximum or average value of the power. The vector quantization unit 15 vector-quantizes the residual signal output from the power normalization unit 14 using the residual codebook 16 prepared in advance. The vector quantization unit 15 outputs a code as the quantization result, which completes the compression encoding of the input signal.
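The inverse filtering and power normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions: the envelope is given as linear prediction coefficients of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, samples before the start of the frame are treated as zero, and "max" and "mean" stand for normalization by the peak magnitude and by the root mean power respectively. All names are hypothetical.

```python
import math


def inverse_filter(x, a):
    """Residual e[n] = x[n] + sum_{i=1..p} a_i * x[n-i].

    a holds a_1..a_p; samples before the frame start are taken as zero
    (an assumption of this sketch).
    """
    p = len(a)
    return [x[n] + sum(a[i] * x[n - 1 - i] for i in range(min(p, n)))
            for n in range(len(x))]


def power_normalize(residual, mode="max"):
    """Scale the residual to unit peak ("max") or unit RMS ("mean").

    Returns (normalized, gain). A decoder would need the gain as side
    information; how it is coded is outside this sketch.
    """
    if mode == "max":
        gain = max(abs(v) for v in residual)
    elif mode == "mean":
        gain = math.sqrt(sum(v * v for v in residual) / len(residual))
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    if gain == 0.0:
        return list(residual), 0.0
    return [v / gain for v in residual], gain
```

If the signal was generated by the all-pole filter 1/A(z), inverse filtering with A(z) recovers the excitation, which is why the residual is flatter and easier to quantize than the input.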

【０２２２】このように、本実施の形態によるオーディ
オ信号圧縮方法、およびオーディオ信号圧縮装置によれ
ば、メルパラメータ算出部１において、入力オーディオ
信号から算出された周波数特性信号系列を人間の聴覚的
な性質である聴覚感度特性に応じて周波数毎に分析精度
を変化させたスペクトル包絡を表現するメル線形予測係
数を求め、パラメータ変換部２で、このメル線形予測係
数を直線周波数軸の線形予測係数などのスペクトル包絡
を表現する特徴量へと変換し、さらに包絡正規化部３
で、パラメータ変換部２で求めた特徴量で逆フィルタリ
ングして正規化することにより、残差信号を正規化する
構成としたので、正確に周波数特性信号系列の平坦化が
行え、効率の良い量子化を行なうことができる。また、
ベクトル量子化では、ある限られた情報（コード）で残
差信号を表現するため、残差信号の形状が単純であれば
あるほど、より少ないコードで表現することができる。
そこで本発明では、残差信号の形状を単純化するため
に、周波数毎に分析精度を変化させたスペクトル包絡を
用いることで、より正確に残差信号の形状の単純化を行
うことができ、効率の良い量子化を行なうことができ
る。As described above, in the audio signal compression method and audio signal compression apparatus of this embodiment, the mel parameter calculation unit 11 obtains, from the input audio signal, mel linear prediction coefficients representing a spectral envelope whose analysis accuracy is varied for each frequency in accordance with the auditory sensitivity characteristic, a property of human hearing; the parameter conversion unit 12 converts these mel linear prediction coefficients into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis; and the envelope normalization unit 13 normalizes the signal by inverse filtering with the feature quantity obtained by the parameter conversion unit 12 to produce the residual signal. The frequency characteristic of the signal can therefore be flattened accurately, and efficient quantization can be performed. Moreover, because vector quantization represents the residual signal with a limited amount of information (codes), the simpler the shape of the residual signal, the fewer codes are needed to represent it. The present invention therefore uses a spectral envelope whose analysis accuracy is varied for each frequency, so that the shape of the residual signal is simplified more accurately and efficient quantization can be performed.
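The vector quantization step can be sketched as a nearest-neighbour search in a codebook. This is a minimal illustration that assumes a squared Euclidean distance and a codebook held as a list of equal-length vectors; the function name and the toy codebook in the test are hypothetical and are not taken from the residual codebook 16.

```python
def vector_quantize(vec, codebook):
    """Return (index, codeword) of the codebook entry nearest to vec.

    Distance is the squared Euclidean distance (an assumption); only the
    index needs to be transmitted, since the decoder holds the same codebook.
    """
    def dist(codeword):
        return sum((v - c) ** 2 for v, c in zip(vec, codeword))

    index = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return index, codebook[index]
```

With a codebook of 2^b entries, each vector costs b bits regardless of its dimension, so the flatter and simpler the residual, the smaller the codebook that represents it adequately.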

【０２２３】（実施の形態４）図１０は本発明の第４の
実施の形態による携帯電話機の構成を示すブロック図で
ある。本実施の形態による携帯電話機は、実施の形態３
における，主に音声などの狭帯域信号圧縮において用い
られている音声信号圧縮装置を用いて信号圧縮を行うよ
うしたものについて説明したものである。同図におい
て、１１は、予測モデルに周波数伸縮を組み込んだメル
線形予測分析により、入力オーディオ信号から周波数毎
に分析精度を変化させたスペクトル包絡を表現するメル
線形予測係数をフレーム毎に求めるメルパラメータ算出
部である。１２は、メルパラメータ算出部１で求めたメ
ル周波数軸上のメル線形予測係数を直線周波数軸の線形
予測係数などのスペクトル包絡を表現する特徴量へと変
換するパラメータ変換部である。１３は、入力オーディ
オ信号をパラメータ変換部２で求めた特徴量で逆フィル
タリングして正規化することにより残差信号を算出する
包絡正規化部、１４は、包絡正規化部１３で算出した残
差信号をパワーの最大値，あるいは平均値等に基づいて
パワーの正規化を行なうパワー正規化部である。１５
は、パワー正規化部１４で正規化された正規化残差信号
を残差コードブック１６によりベクトル量子化し、残差
符号へと変換するベクトル量子化部である。１０はこれ
らメルパラメータ算出部１１，パラメータ変換部１２，
包絡正規化部１３，パワー正規化部１４，ベクトル量子
化部１５および残差コードブック１６からなり、マイク
ロフォンなどから入力される入力音声信号を、人間の聴
覚的な性質である聴覚感度特性に対応した周波数上の重
み付けに基づいて情報圧縮する音声圧縮部である。３１
はこの音声圧縮部１０により情報圧縮されたコードを、
携帯電話機の仕様に応じた周波数および変調方式の高周
波信号に変調し送信する送信部、３２はこの送信部３１
からの高周波信号を送信するアンテナである。(Embodiment 4) FIG. 10 is a block diagram showing the configuration of a mobile phone according to a fourth embodiment of the present invention. The mobile phone of this embodiment performs signal compression using the speech signal compression apparatus of Embodiment 3, which is mainly used for compressing narrowband signals such as speech. In the figure, reference numeral 11 denotes a mel parameter calculation unit that obtains, for each frame, mel linear prediction coefficients representing a spectral envelope whose analysis accuracy varies with frequency, from the input audio signal by mel linear prediction analysis in which frequency warping is incorporated into the prediction model. Reference numeral 12 denotes a parameter conversion unit that converts the mel linear prediction coefficients on the mel frequency axis obtained by the mel parameter calculation unit 11 into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis. Reference numeral 13 denotes an envelope normalization unit that calculates a residual signal by inverse filtering the input audio signal with the feature quantity obtained by the parameter conversion unit 12, thereby normalizing it, and 14 denotes a power normalization unit that normalizes the power of the residual signal calculated by the envelope normalization unit 13 on the basis of, for example, the maximum or average value of the power. Reference numeral 15 denotes a vector quantization unit that vector-quantizes the normalized residual signal output by the power normalization unit 14 using the residual codebook 16 and converts it into a residual code. Reference numeral 10 denotes a speech compression unit that comprises the mel parameter calculation unit 11, the parameter conversion unit 12, the envelope normalization unit 13, the power normalization unit 14, the vector quantization unit 15, and the residual codebook 16, and that compresses the information of an input speech signal from a microphone or the like on the basis of frequency weighting corresponding to the auditory sensitivity characteristic, a property of human hearing. Reference numeral 31 denotes a transmission unit that modulates the code compressed by the speech compression unit 10 into a high-frequency signal whose frequency and modulation scheme conform to the specification of the mobile phone and transmits it, and 32 denotes an antenna that radiates the high-frequency signal from the transmission unit 31.

【０２２４】次に動作について説明する。音声圧縮部１
０の動作は第３の実施の形態による音声信号圧縮装置と
同様である。即ち、入力された音声などのディジタルオ
ーディオ信号（以下、入力信号あるいは入力音声とも記
す）の時系列は、一定周期の長さ（フレーム）毎に、メ
ルパラメータ算出部１１で、予測モデルに周波数伸縮を
組み込んだメル線形予測分析により、入力信号から周波
数毎に分析精度を変化させたスペクトル包絡を表現する
メル線形予測係数が求められる。スペクトル包絡を表現
するメル線形予測係数を求める部分は、実施の形態１の
メル化係数算出部２１で説明している方法と同じであ
り、同様の手順でスペクトル包絡を表現する特徴量を求
めることができる。Next, the operation will be described. The speech compression unit 10 operates in the same way as the speech signal compression apparatus of the third embodiment. That is, for each frame of fixed length, the mel parameter calculation unit 11 applies mel linear prediction analysis, in which frequency warping is incorporated into the prediction model, to the time series of an input digital audio signal such as speech (hereinafter also referred to as the input signal or input speech), and obtains mel linear prediction coefficients representing a spectral envelope whose analysis accuracy varies with frequency. The procedure for obtaining these mel linear prediction coefficients is the same as that described for the mel coefficient calculation unit 21 of Embodiment 1, so the feature quantity representing the spectral envelope can be obtained in the same way.

【０２２５】次に、パラメータ変換部１２では、メルパ
ラメータ算出部１１で算出されたメル周波数軸上のメル
線形予測係数を直線周波数軸の線形予測係数などスペク
トル包絡を表現する特徴量へと変換する。この部分も、
実施の形態１で説明している方法と同じであり、包絡算
出部２２と同様な方法で実現できる。ところで主に音声
信号の圧縮では、入力された音声信号の時系列は、フレ
ーム毎に線形予測分析（ＬＰＣ分析）することにより、
ＬＰＣ係数（線形予測係数）やＬＳＰ係数（line spect
rum pair coefficient），あるいはＰＡＲＣＯＲ係数
（偏自己相関係数）等のＬＰＣスペクトル包絡成分を表
わす特徴量を求め、この特徴量で逆フィルタリングして
正規化することにより残差信号を算出している。そこで
本実施の形態のような入力音声から求めたメル化された
線形予測係数を正規化のための特徴量として用いたり、
あるいは通常の線形予測係数からＰＡＲＣＯＲ係数を求
めるのと同様の公知の手法によりメル化された線形予測
係数から求めたメル化されたＰＡＲＣＯＲ係数や、ある
いは通常の線形予測係数からＬＳＰ係数を求めるのと同
様の公知の手法によりメル化された線形予測係数から求
めたメル化されたＬＳＰ係数を用いて、時間軸上での逆
フィルタリング処理や、あるいはスペクトル包絡成分と
残差信号とに分離を行えば、より精度の良い正規化や分
離が可能となる。Next, the parameter conversion unit 12 converts the mel linear prediction coefficients on the mel frequency axis, calculated by the mel parameter calculation unit 11, into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis. This step is also the same as the method described in Embodiment 1 and can be realized in the same manner as the envelope calculation unit 22. In speech signal compression, the time series of the input speech signal is conventionally subjected to linear prediction analysis (LPC analysis) frame by frame to obtain a feature quantity representing the LPC spectral envelope component, such as LPC coefficients (linear prediction coefficients), LSP coefficients (line spectrum pair coefficients), or PARCOR coefficients (partial autocorrelation coefficients), and the residual signal is calculated by inverse filtering with this feature quantity to normalize the signal. More accurate normalization and separation become possible if the inverse filtering on the time axis, or the separation into a spectral envelope component and a residual signal, is instead performed using one of the following: the mel-warped linear prediction coefficients obtained from the input speech as in this embodiment; mel-warped PARCOR coefficients obtained from the mel-warped linear prediction coefficients by the same known method that derives PARCOR coefficients from ordinary linear prediction coefficients; or mel-warped LSP coefficients obtained from the mel-warped linear prediction coefficients by the same known method that derives LSP coefficients from ordinary linear prediction coefficients.

【０２２６】同様に、本実施の形態の包絡正規化部１３
では、パラメータ変換部１２で変換された直線周波数軸
の線形予測係数などスペクトル包絡を表現する特徴量を
用いて、逆フィルタリングし、スペクトル包絡成分の正
規化を行い、残差信号を算出している。さらにパワー正
規化部１４では、包絡正規化部３で求められた残差信号
をパワーの最大値，あるいは平均値等に基づいてパワー
の正規化が行われる。そしてベクトル量子化部１５で
は、パワー正規化部１４から出力された残差信号が、あ
らかじめ求めておいた残差コードブック１６を用いてベ
クトル量子化される。その結果、ベクトル量子化部１５
は、量子化結果としてコードを出力することにより音声
信号の圧縮符号化が完了する。そして、このように音声
圧縮部１０において圧縮符号化された音声信号のコード
は、送信部３１に入力され、この送信部３１において、
携帯電話機が採用している仕様に則った周波数および変
調方式の高周波に変換され、アンテナ３２を介して基地
局に向けて送信される。Similarly, the envelope normalization unit 13 of this embodiment performs inverse filtering using the feature quantity representing the spectral envelope, such as the linear prediction coefficients on the linear frequency axis converted by the parameter conversion unit 12, thereby normalizing the spectral envelope component and calculating the residual signal. The power normalization unit 14 then normalizes the power of the residual signal obtained by the envelope normalization unit 13 on the basis of, for example, the maximum or average value of the power. The vector quantization unit 15 vector-quantizes the residual signal output from the power normalization unit 14 using the residual codebook 16 prepared in advance. The vector quantization unit 15 outputs a code as the quantization result, which completes the compression encoding of the speech signal. The code of the speech signal compression-encoded by the speech compression unit 10 is then input to the transmission unit 31, where it is converted into a high-frequency signal whose frequency and modulation scheme conform to the specification adopted by the mobile phone, and is transmitted toward a base station via the antenna 32.

【０２２７】このように、本実施の形態による携帯電話
機によれば、メルパラメータ算出部１において、入力オ
ーディオ信号から算出された周波数特性信号系列を人間
の聴覚的な性質である聴覚感度特性に応じて周波数毎に
分析精度を変化させたスペクトル包絡を表現するメル線
形予測係数を求め、パラメータ変換部２で、このメル線
形予測係数を直線周波数軸の線形予測係数などのスペク
トル包絡を表現する特徴量へと変換し、さらに包絡正規
化部３で、パラメータ変換部２で求めた特徴量で逆フィ
ルタリングして正規化することにより、残差信号を正規
化する構成としたので、正確に周波数特性信号系列の平
坦化が行え、効率の良い量子化を行なうことができる。
また、ベクトル量子化では、ある限られた情報（コー
ド）で残差信号を表現するため、残差信号の形状が単純
であればあるほど、より少ないコードで表現することが
できる。そこで本発明では、残差信号の形状を単純化す
るために、周波数毎に分析精度を変化させたスペクトル
包絡を用いることで、より正確に残差信号の形状の単純
化を行うことができ、効率の良い量子化を行なうことが
できる。このため、同一の帯域を使用するのであれば、
従来のものに比しより通話品質を向上させることがで
き、従来と同等の通話品質でよいのであれば、よりチャ
ンネル数を増すことが可能となる。なお、本実施の形態
は、携帯電話機以外にも、自動車電話機等の移動体通信
に適用することが可能である。As described above, in the mobile phone of this embodiment, the mel parameter calculation unit 11 obtains, from the input audio signal, mel linear prediction coefficients representing a spectral envelope whose analysis accuracy is varied for each frequency in accordance with the auditory sensitivity characteristic, a property of human hearing; the parameter conversion unit 12 converts these mel linear prediction coefficients into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis; and the envelope normalization unit 13 normalizes the signal by inverse filtering with the feature quantity obtained by the parameter conversion unit 12 to produce the residual signal. The frequency characteristic of the signal can therefore be flattened accurately, and efficient quantization can be performed. Moreover, because vector quantization represents the residual signal with a limited amount of information (codes), the simpler the shape of the residual signal, the fewer codes are needed to represent it. The present invention therefore uses a spectral envelope whose analysis accuracy is varied for each frequency, so that the shape of the residual signal is simplified more accurately and efficient quantization can be performed. Consequently, if the same bandwidth is used, call quality can be improved over the conventional art, and if call quality equivalent to the conventional art is sufficient, the number of channels can be increased. This embodiment is applicable not only to mobile phones but also to other mobile communication equipment such as car phones.

【０２２８】（実施の形態５）図１１は本発明の第５の
実施の形態によるネットワーク機器の構成を示すブロッ
ク図である。本実施の形態によるネットワーク機器は、
実施の形態３における，主に音声などの狭帯域信号圧縮
において用いられている音声信号圧縮装置を用いて信号
圧縮を行い、これをインターネット等のネットワークを
介して他のネットワーク機器に送り込む，インターネッ
ト電話等を想定しているものである。同図において、１
１は、予測モデルに周波数伸縮を組み込んだメル線形予
測分析により、入力オーディオ信号から周波数毎に分析
精度を変化させたスペクトル包絡を表現するメル線形予
測係数をフレーム毎に求めるメルパラメータ算出部であ
る。１２は、メルパラメータ算出部１で求めたメル周波
数軸上のメル線形予測係数を直線周波数軸の線形予測係
数などのスペクトル包絡を表現する特徴量へと変換する
パラメータ変換部である。１３は、入力オーディオ信号
をパラメータ変換部２で求めた特徴量で逆フィルタリン
グして正規化することにより残差信号を算出する包絡正
規化部、１４は、包絡正規化部１３で算出した残差信号
をパワーの最大値，あるいは平均値等に基づいてパワー
の正規化を行なうパワー正規化部である。１５は、パワ
ー正規化部１４で正規化された正規化残差信号を残差コ
ードブック１６によりベクトル量子化し、残差符号へと
変換するベクトル量子化部である。１０はこれらメルパ
ラメータ算出部１１，パラメータ変換部１２，包絡正規
化部１３，パワー正規化部１４，ベクトル量子化部１５
および残差コードブック１６からなり、マイクロフォン
などから入力される入力音声信号を、人間の聴覚的な性
質である聴覚感度特性に対応した周波数上の重み付けに
基づいて情報圧縮する音声圧縮部である。４０はこの音
声圧縮部１０により情報圧縮されたコードを、ネットワ
ークで音声データの伝送用のコードに変換し、ＴＣＰ／
ＩＰプロトコル等のネットワークの仕様に応じたプロト
コルに則って伝送するネットワークインターフェース部
である。(Embodiment 5) FIG. 11 is a block diagram showing the configuration of a network device according to a fifth embodiment of the present invention. The network device of this embodiment performs signal compression using the speech signal compression apparatus of Embodiment 3, which is mainly used for compressing narrowband signals such as speech, and sends the result to other network devices via a network such as the Internet; an Internet telephone or the like is envisaged. In the figure, reference numeral 11 denotes a mel parameter calculation unit that obtains, for each frame, mel linear prediction coefficients representing a spectral envelope whose analysis accuracy varies with frequency, from the input audio signal by mel linear prediction analysis in which frequency warping is incorporated into the prediction model. Reference numeral 12 denotes a parameter conversion unit that converts the mel linear prediction coefficients on the mel frequency axis obtained by the mel parameter calculation unit 11 into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis. Reference numeral 13 denotes an envelope normalization unit that calculates a residual signal by inverse filtering the input audio signal with the feature quantity obtained by the parameter conversion unit 12, thereby normalizing it, and 14 denotes a power normalization unit that normalizes the power of the residual signal calculated by the envelope normalization unit 13 on the basis of, for example, the maximum or average value of the power. Reference numeral 15 denotes a vector quantization unit that vector-quantizes the normalized residual signal output by the power normalization unit 14 using the residual codebook 16 and converts it into a residual code. Reference numeral 10 denotes a speech compression unit that comprises the mel parameter calculation unit 11, the parameter conversion unit 12, the envelope normalization unit 13, the power normalization unit 14, the vector quantization unit 15, and the residual codebook 16, and that compresses the information of an input speech signal from a microphone or the like on the basis of frequency weighting corresponding to the auditory sensitivity characteristic, a property of human hearing. Reference numeral 40 denotes a network interface unit that converts the code compressed by the speech compression unit 10 into a code for transmitting speech data over the network and transmits it in accordance with a protocol suited to the network specification, such as TCP/IP.

【０２２９】次に動作について説明する。音声圧縮部１
０の動作は第３の実施の形態による音声信号圧縮装置と
同様である。即ち、入力された音声などのディジタルオ
ーディオ信号（以下、入力信号とも記す）の時系列は、
一定周期の長さ（フレーム）毎に、メルパラメータ算出
部１１で、予測モデルに周波数伸縮を組み込んだメル線
形予測分析により、入力オーディオ信号から周波数毎に
分析精度を変化させたスペクトル包絡を表現するメル線
形予測係数が求められる。スペクトル包絡を表現するメ
ル線形予測係数を求める部分は、実施の形態１のメル化
係数算出部２１で説明している方法と同じであり、同様
の手順でスペクトル包絡を表現する特徴量を求めること
ができる。Next, the operation will be described. The speech compression unit 10 operates in the same way as the speech signal compression apparatus of the third embodiment. That is, for each frame of fixed length, the mel parameter calculation unit 11 applies mel linear prediction analysis, in which frequency warping is incorporated into the prediction model, to the time series of an input digital audio signal such as speech (hereinafter also referred to as the input signal), and obtains mel linear prediction coefficients representing a spectral envelope whose analysis accuracy varies with frequency. The procedure for obtaining these mel linear prediction coefficients is the same as that described for the mel coefficient calculation unit 21 of Embodiment 1, so the feature quantity representing the spectral envelope can be obtained in the same way.

【０２３０】次に、パラメータ変換部１２では、メルパ
ラメータ算出部１１で算出されたメル周波数軸上のメル
線形予測係数を直線周波数軸の線形予測係数などスペク
トル包絡を表現する特徴量へと変換する。この部分も、
実施の形態１で説明している方法と同じであり、包絡算
出部２２と同様な方法で実現できる。ところで主に音声
信号の圧縮では、入力された音声信号の時系列は、フレ
ーム毎に線形予測分析（ＬＰＣ分析）することにより、
ＬＰＣ係数（線形予測係数）やＬＳＰ係数（line spect
rum pair coefficient），あるいはＰＡＲＣＯＲ係数
（偏自己相関係数）等のＬＰＣスペクトル包絡成分を表
わす特徴量を求め、この特徴量で逆フィルタリングして
正規化することにより残差信号を算出している。そこで
本実施の形態のような入力音声から求めたメル化された
線形予測係数を正規化のための特徴量として用いたり、
あるいは通常の線形予測係数からＰＡＲＣＯＲ係数を求
めるのと同様の公知の手法によりメル化された線形予測
係数から求めたメル化されたＰＡＲＣＯＲ係数や、ある
いは通常の線形予測係数からＬＳＰ係数を求めるのと同
様の公知の手法によりメル化された線形予測係数から求
めたメル化されたＬＳＰ係数を用いて、時間軸上での逆
フィルタリング処理や、あるいはスペクトル包絡成分と
残差信号とに分離を行えば、より精度の良い正規化や分
離が可能となる。Next, the parameter conversion unit 12 converts the mel linear prediction coefficients on the mel frequency axis, calculated by the mel parameter calculation unit 11, into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis. This step is also the same as the method described in Embodiment 1 and can be realized in the same manner as the envelope calculation unit 22. In speech signal compression, the time series of the input speech signal is conventionally subjected to linear prediction analysis (LPC analysis) frame by frame to obtain a feature quantity representing the LPC spectral envelope component, such as LPC coefficients (linear prediction coefficients), LSP coefficients (line spectrum pair coefficients), or PARCOR coefficients (partial autocorrelation coefficients), and the residual signal is calculated by inverse filtering with this feature quantity to normalize the signal. More accurate normalization and separation become possible if the inverse filtering on the time axis, or the separation into a spectral envelope component and a residual signal, is instead performed using one of the following: the mel-warped linear prediction coefficients obtained from the input speech as in this embodiment; mel-warped PARCOR coefficients obtained from the mel-warped linear prediction coefficients by the same known method that derives PARCOR coefficients from ordinary linear prediction coefficients; or mel-warped LSP coefficients obtained from the mel-warped linear prediction coefficients by the same known method that derives LSP coefficients from ordinary linear prediction coefficients.

【０２３１】同様に、本実施の形態の包絡正規化部１３
では、パラメータ変換部１２で変換された直線周波数軸
の線形予測係数などスペクトル包絡を表現する特徴量を
用いて、逆フィルタリングし、スペクトル包絡成分の正
規化を行い、残差信号を算出している。さらにパワー正
規化部１４では、包絡正規化部１３で求められた残差信
号をパワーの最大値，あるいは平均値等に基づいてパワ
ーの正規化が行われる。そしてベクトル量子化部１５で
は、パワー正規化部１４から出力された残差信号が、あ
らかじめ求めておいた残差コードブック１６を用いてベ
クトル量子化される。その結果、ベクトル量子化部１５
は、量子化結果としてコードを出力することにより音声
信号の圧縮符号化が完了する。そして、このように音声
圧縮部１０において圧縮符号化された音声信号のコード
は、ネットワークインターフェース部４０に入力され、
このネットワークインターフェース部４０において、音
声圧縮部１０により情報圧縮されたコードを、ネットワ
ークで音声データの伝送用のコードに変換し、ＴＣＰ／
ＩＰプロトコル等のネットワークの仕様に応じたプロト
コルに則ってネットワークに向けて送出する。Similarly, the envelope normalization unit 13 of this embodiment performs inverse filtering using the feature quantity representing the spectral envelope, such as the linear prediction coefficients on the linear frequency axis converted by the parameter conversion unit 12, thereby normalizing the spectral envelope component and calculating the residual signal. The power normalization unit 14 then normalizes the power of the residual signal obtained by the envelope normalization unit 13 on the basis of, for example, the maximum or average value of the power. The vector quantization unit 15 vector-quantizes the residual signal output from the power normalization unit 14 using the residual codebook 16 prepared in advance. The vector quantization unit 15 outputs a code as the quantization result, which completes the compression encoding of the speech signal. The code of the speech signal compression-encoded by the speech compression unit 10 is then input to the network interface unit 40, which converts it into a code for transmitting speech data over the network and sends it out to the network in accordance with a protocol suited to the network specification, such as TCP/IP.
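The conversion of compressed codes into a form suitable for network transmission is not specified in detail here. Purely as an illustration, the sketch below serializes the codebook indices of one frame into a byte payload that could then be handed to a TCP/IP socket. The packet layout (a 16-bit frame number, a 16-bit code count, then 16-bit indices in network byte order) is entirely hypothetical and is not the format used by the network interface unit 40.

```python
import struct

# Hypothetical header: frame number and number of codes, both unsigned 16-bit,
# in network (big-endian) byte order.
HEADER = struct.Struct("!HH")


def pack_frame(frame_no, codes):
    """Serialize one frame of codebook indices into bytes."""
    body = struct.pack("!%dH" % len(codes), *codes)
    return HEADER.pack(frame_no, len(codes)) + body


def unpack_frame(payload):
    """Inverse of pack_frame: recover the frame number and the indices."""
    frame_no, count = HEADER.unpack_from(payload, 0)
    codes = struct.unpack_from("!%dH" % count, payload, HEADER.size)
    return frame_no, list(codes)
```

A fixed byte order matters because sender and receiver may run on machines with different native endianness.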

【０２３２】このように、本実施の形態によるネットワ
ーク機器によれば、メルパラメータ算出部１１におい
て、入力オーディオ信号から算出された周波数特性信号
系列を人間の聴覚的な性質である聴覚感度特性に応じて
周波数毎に分析精度を変化させたスペクトル包絡を表現
するメル線形予測係数を求め、パラメータ変換部１２
で、このメル線形予測係数を直線周波数軸の線形予測係
数などのスペクトル包絡を表現する特徴量へと変換し、
さらに包絡正規化部１３で、パラメータ変換部１２で求
めた特徴量で逆フィルタリングして正規化することによ
り、残差信号を正規化する構成としたので、正確に周波
数特性信号系列の平坦化が行え、効率の良い量子化を行
なうことができる。また、ベクトル量子化では、ある限
られた情報（コード）で残差信号を表現するため、残差
信号の形状が単純であればあるほど、より少ないコード
で表現することができる。そこで本発明では、残差信号
の形状を単純化するために、周波数毎に分析精度を変化
させたスペクトル包絡を用いることで、より正確に残差
信号の形状の単純化を行うことができ、効率の良い量子
化を行なうことができる。このため、ネットワークのデ
ータ転送速度が同一であれば、従来のものに比しより通
話品質を向上させることができ、従来と同等の通話品質
でよいのであれば、より収容できる端末の数を増すこと
が可能となる。なお、本実施の形態は、パソコンやイン
ターネット電話機，インターネットＴＶ等のインターネ
ット機器を想定しているが、パソコン通信等、インター
ネット以外のプロトコルを用いる端末にも適用すること
が可能である。As described above, in the network device of this embodiment, the mel parameter calculation unit 11 obtains, from the input audio signal, mel linear prediction coefficients representing a spectral envelope whose analysis accuracy is varied for each frequency in accordance with the auditory sensitivity characteristic, a property of human hearing; the parameter conversion unit 12 converts these mel linear prediction coefficients into a feature quantity representing the spectral envelope, such as linear prediction coefficients on the linear frequency axis; and the envelope normalization unit 13 normalizes the signal by inverse filtering with the feature quantity obtained by the parameter conversion unit 12 to produce the residual signal. The frequency characteristic of the signal can therefore be flattened accurately, and efficient quantization can be performed. Moreover, because vector quantization represents the residual signal with a limited amount of information (codes), the simpler the shape of the residual signal, the fewer codes are needed to represent it. The present invention therefore uses a spectral envelope whose analysis accuracy is varied for each frequency, so that the shape of the residual signal is simplified more accurately and efficient quantization can be performed. Consequently, if the data transfer rate of the network is the same, call quality can be improved over the conventional art, and if call quality equivalent to the conventional art is sufficient, the number of terminals that can be accommodated can be increased. This embodiment assumes Internet devices such as personal computers, Internet telephones, and Internet TVs, but it can also be applied to terminals that use protocols other than those of the Internet, such as personal computer communication.

【０２３３】（実施の形態６）図１２は本発明の第６の
実施の形態によるネットワーク機器の構成を示すブロッ
ク図である。本実施の形態によるネットワーク機器は、
実施の形態１における，主にオーディオ帯域の信号圧縮
において用いられているオーディオ信号圧縮装置を用い
て信号圧縮を行い、これをインターネット等のネットワ
ークを介して他のネットワーク機器に送り込む，インタ
ーネット機器等を想定しているものである。同図におい
て、１は、例えば、ＭＤＣＴ，あるいはＦＦＴ等により
入力されたディジタルオーディオ信号や音声信号の時系
列を、一定周期の長さ（フレーム）毎に周波数特性信号
系列に変換する時間周波数変換部である。また、２は、
予測モデルに周波数伸縮機能を組み込んだメル線形予測
分析を用いて、入力オーディオ信号から、周波数毎に分
析精度を変化させたスペクトル包絡をフレーム毎に求め
るスペクトル包絡算出部である。３は時間周波数変換部
１で算出された周波数特性信号系列をスペクトル包絡算
出部２で求めたスペクトル包絡で割り算して正規化する
ことにより、周波数特性を平坦化する正規化部、４は正
規化部３で平坦化された周波数特性信号系列に対し、パ
ワーの最大値，あるいは平均値等に基づいてパワーの正
規化を行なうパワー正規化部である。５は、正規化部
３，パワー正規化部４で平坦化された周波数特性信号系
列をベクトル量子化する多段量子化部であり、この多段
量子化部５は、互いに縦列接続された第１段の量子化器
５１，第２段の量子化器５２，・・・，第Ｎ段の量子化
器５３を含む。６は、時間周波数変換部１から出力され
た周波数特性信号系列とスペクトル包絡算出部２で求め
たスペクトル包絡を入力とし、人間の聴覚感度特性に基
づいて、量子化部５での量子化の際に用いる重み付け係
数を求める聴覚重み付け計算部である。２０はこれら時
間周波数変換部１，スペクトル包絡算出部２，正規化部
３，パワー正規化部４，量子化部５および聴覚重み付け
計算部６からなり、外部から入力される入力オーディオ
音声信号を、人間の聴覚的な性質である聴覚感度特性に
対応した周波数上の重み付けに基づいて情報圧縮するオ
ーディオ信号圧縮部である。４１はこのオーディオ信号
圧縮部２０により情報圧縮されたコードを、ネットワー
クでオーディオデータの伝送用のコードに変換し、ＴＣ
Ｐ／ＩＰプロトコル等のネットワークの仕様に応じたプ
ロトコルに則って伝送するネットワークインターフェー
ス部である。(Embodiment 6) FIG. 12 is a block diagram showing the configuration of a network device according to a sixth embodiment of the present invention. The network device of this embodiment performs signal compression using the audio signal compression apparatus of Embodiment 1, which is mainly used for compressing audio-band signals, and sends the result to other network devices via a network such as the Internet; an Internet device or the like is envisaged. In the figure, reference numeral 1 denotes a time-frequency conversion unit that converts the time series of an input digital audio signal or speech signal into a frequency characteristic signal sequence for each frame of fixed length by, for example, MDCT or FFT. Reference numeral 2 denotes a spectral envelope calculation unit that obtains, for each frame, a spectral envelope whose analysis accuracy varies with frequency from the input audio signal, using mel linear prediction analysis in which a frequency warping function is incorporated into the prediction model. Reference numeral 3 denotes a normalization unit that flattens the frequency characteristic by dividing the frequency characteristic signal sequence calculated by the time-frequency conversion unit 1 by the spectral envelope obtained by the spectral envelope calculation unit 2, and 4 denotes a power normalization unit that normalizes the power of the frequency characteristic signal sequence flattened by the normalization unit 3 on the basis of, for example, the maximum or average value of the power. Reference numeral 5 denotes a multistage quantization unit that vector-quantizes the frequency characteristic signal sequence flattened by the normalization unit 3 and the power normalization unit 4; it includes a first-stage quantizer 51, a second-stage quantizer 52, ..., and an N-th-stage quantizer 53 connected in cascade. Reference numeral 6 denotes a perceptual weighting calculation unit that receives the frequency characteristic signal sequence output from the time-frequency conversion unit 1 and the spectral envelope obtained by the spectral envelope calculation unit 2 and, on the basis of the human auditory sensitivity characteristic, obtains the weighting coefficients used for quantization in the quantization unit 5. Reference numeral 20 denotes an audio signal compression unit that comprises the time-frequency conversion unit 1, the spectral envelope calculation unit 2, the normalization unit 3, the power normalization unit 4, the quantization unit 5, and the perceptual weighting calculation unit 6, and that compresses the information of an externally input audio or speech signal on the basis of frequency weighting corresponding to the auditory sensitivity characteristic, a property of human hearing. Reference numeral 41 denotes a network interface unit that converts the code compressed by the audio signal compression unit 20 into a code for transmitting audio data over the network and transmits it in accordance with a protocol suited to the network specification, such as TCP/IP.

【０２３４】Next, the operation will be described. The audio signal compression unit 20 operates in the same way as the audio signal compression device according to the first embodiment. That is, the time series of the input digital audio signal (hereinafter also referred to as the input signal) is converted into a frequency characteristic signal sequence by MDCT, FFT, or the like in the time-frequency conversion unit 1 for each fixed-length period (frame).

【０２３５】In addition, for each frame, the spectrum envelope calculation unit 2 obtains from the input signal a spectrum envelope whose analysis accuracy varies with frequency, using mel linear prediction analysis in which frequency warping is incorporated in the prediction model. Next, the normalization unit 3 flattens the frequency characteristic signal sequence calculated above by dividing it by the spectrum envelope. The power normalization unit 4 then normalizes the power of the flattened frequency characteristic signal sequence based on the maximum value, the average value, or the like of the power. Meanwhile, the auditory weighting calculation unit 6 receives the frequency characteristic signal sequence output from the time-frequency conversion unit 1 and the spectrum envelope obtained by the spectrum envelope calculation unit 2. For the spectrum of the frequency characteristic signal sequence, it calculates a characteristic signal that takes into account human auditory sensitivity characteristics such as the minimum audible limit characteristic and the auditory masking characteristic, and then obtains the weighting coefficients used for quantization from this characteristic signal and the spectrum envelope.
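The two normalization steps described above (division by the spectrum envelope, then power normalization by the maximum or the average) can be sketched as follows. This is a minimal illustration only: the function name `flatten_and_normalize`, the `use_max` switch, and the small `eps` guard against division by zero are assumptions made for the example and do not appear in the patent text.

```python
import math

def flatten_and_normalize(spectrum, envelope, use_max=True):
    """Flatten a frame's spectrum by its envelope, then normalize its power.

    spectrum, envelope: equal-length lists of coefficients for one frame.
    Returns (normalized_residual, gain); gain must be sent as side
    information so that a decoder could undo the power normalization.
    """
    eps = 1e-12  # hypothetical guard against division by zero
    # Normalization unit 3: divide each coefficient by the envelope value.
    flat = [s / max(e, eps) for s, e in zip(spectrum, envelope)]
    # Power normalization unit 4: scale by the maximum or the mean power.
    powers = [v * v for v in flat]
    ref = max(powers) if use_max else sum(powers) / len(powers)
    gain = math.sqrt(ref) if ref > 0.0 else 1.0
    return [v / gain for v in flat], gain

residual, gain = flatten_and_normalize([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(residual, gain)
```

A spectrum that follows its envelope exactly collapses to a constant sequence, which is the "simple shape" that the later vector quantization stages benefit from.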

【０２３６】The residual signal output from the power normalization unit 4 is quantized by the first-stage quantizer 51 of the multi-stage quantization unit 5 using the weighting coefficients obtained by the auditory weighting calculation unit 6. The quantization error component produced by the first-stage quantizer 51 is then quantized by the second-stage quantizer 52 using the weighting coefficients obtained by the auditory weighting calculation unit 6, and in the same manner each subsequent stage quantizes the quantization error component of the preceding stage. Each quantizer outputs a code as its quantization result. Compression encoding of the audio signal is complete when the N-th-stage quantizer 53 has quantized, using the weighting coefficients obtained by the auditory weighting calculation unit 6, the quantization error component produced by the (N-1)-th-stage quantizer. The code compressed and encoded in this way by the audio signal compression unit 20 is input to the network interface unit 41, which converts it into a code for transmitting audio data over the network and sends it out to the network in accordance with a protocol suited to the network specification, such as TCP/IP.
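The cascade of quantizers described above, in which each stage quantizes the error left by the preceding stage under a perceptual weighting, can be sketched as follows. This is an illustrative sketch under stated assumptions: the weighted squared-error distance, the use of one weight vector for every stage, and the names `nearest_code`, `multistage_quantize`, and `multistage_dequantize` are choices made for the example, not details taken from the patent.

```python
def nearest_code(vector, codebook, weights):
    """Return the index of the codebook entry with the smallest
    weighted squared error to `vector`."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum(w * (v - c) ** 2 for v, c, w in zip(vector, code, weights))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def multistage_quantize(vector, codebooks, weights):
    """Quantize `vector` stage by stage; each stage codes the error
    left by the previous one. Returns (indices, final_error)."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook, weights)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices, residual

def multistage_dequantize(indices, codebooks):
    """Rebuild the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out

# Two tiny hypothetical codebooks (stage 1 coarse, stage 2 fine).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]]
indices, error = multistage_quantize([1.2, 0.8], codebooks, [1.0, 1.0])
print(indices, multistage_dequantize(indices, codebooks))
```

Only the list of indices needs to be transmitted; raising a weight makes errors in that coefficient more costly, so the search favors codes that are accurate where hearing is most sensitive.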

【０２３７】As described above, in the network device according to the sixth embodiment, the normalization unit 3 normalizes the frequency characteristic signal sequence calculated from the input audio signal using a spectrum envelope whose analysis accuracy varies with frequency in accordance with the auditory sensitivity characteristic, which is a property of human hearing. The frequency characteristic signal sequence can therefore be flattened accurately, which reduces the load on vector quantization in the multi-stage quantization unit 5 and enables efficient quantization. Vector quantization represents a frequency characteristic signal sequence with a limited amount of information (codes), so the simpler the shape of the sequence, the fewer codes are needed to represent it. For this reason, the present invention simplifies the shape of the frequency characteristic signal sequence by normalizing it with a spectrum envelope that represents its rough shape. Using as this rough shape a spectrum envelope whose analysis accuracy varies with frequency simplifies the shape of the sequence more accurately and enables efficient quantization.

【０２３８】Further, the vector quantizers 51 to 53 of the multi-stage quantization unit 5 perform vector quantization using, as the weighting for quantization, the frequency-domain weighting coefficients that the auditory weighting calculation unit 6 calculates from the spectrum of the input audio signal, the human auditory sensitivity characteristic, and the spectrum envelope whose analysis accuracy varies with frequency in accordance with that characteristic. Efficient quantization that exploits the properties of human hearing can therefore be performed. Because the audio signal is quantized efficiently in this way, audio quality can be improved over conventional devices at the same network data transfer rate; and if audio quality equal to the conventional level suffices, the number of terminals that can be accommodated can be increased. Although the present embodiment assumes Internet devices such as personal computers and Internet TVs, it can also be applied to terminals that use protocols other than those of the Internet, such as PC communication services.

【０２３９】[0239]

[Effects of the Invention] As described above, according to the audio signal compression method of the present invention (claim 1), in an audio signal compression method that encodes an input audio signal and compresses its amount of information, an autocorrelation function on the mel frequency axis is obtained using the input audio signal and an audio signal obtained by warping the frequency axis of the input audio signal in accordance with human auditory sensitivity characteristics; mel linear prediction coefficients are obtained from that autocorrelation function; either the mel linear prediction coefficients themselves are used as the spectral envelope or a spectral envelope is obtained from them; and the input audio signal is smoothed frame by frame using the spectral envelope. This yields an audio signal compression method that can compress signals efficiently by exploiting the properties of human hearing.
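The step from an autocorrelation function to prediction coefficients and then to a spectral envelope can be illustrated with the standard Levinson-Durbin recursion. This is a sketch of the textbook recursion and is not stated in this passage to be the solver the patent uses; the function names `levinson_durbin` and `lpc_envelope` are hypothetical. If the input autocorrelation is computed on the mel frequency axis, the resulting envelope is likewise expressed on that warped axis.

```python
import cmath
import math

def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + sum_k a[k] z^-k.

    r: autocorrelation values r[0..order].
    Returns (a, err): coefficients with a[0] == 1.0 and the final
    prediction-error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err                      # reflection coefficient
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + k_i * a[i - k]
        new_a[i] = k_i
        a = new_a
        err *= 1.0 - k_i * k_i
    return a, err

def lpc_envelope(a, err, num_bins):
    """Power envelope err / |A(e^{jw})|^2 at num_bins points in [0, pi)."""
    env = []
    for b in range(num_bins):
        w = math.pi * b / num_bins
        A = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        env.append(err / abs(A) ** 2)
    return env

# Autocorrelation of a first-order process with coefficient 0.5.
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, err, lpc_envelope(a, err, 4))
```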

【０２４０】Further, according to the audio signal compression method of the present invention (claim 2), in an audio signal compression method that encodes an input audio signal and compresses its amount of information, an audio signal of a fixed time length is cut out from the input audio signal; the fixed-length audio signal is passed through multiple stages of all-pass filters to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis has been warped in accordance with human auditory sensitivity characteristics, is obtained by a product-sum (Equation 1), performed a finite number of times, of the input audio signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from that autocorrelation function; either the mel linear prediction coefficients themselves are used as the spectral envelope or a spectral envelope is obtained from them; and the input audio signal is smoothed frame by frame using the spectral envelope. This yields an audio signal compression method that can compress signals efficiently by exploiting the properties of human hearing. In addition, the mel linear prediction coefficients, whose calculation originally required an infinite number of operations, are obtained by a preset finite number of operations without any approximation, so both the compression performance for the input audio signal and recognition performance can be improved. Here, (Equation 1) is

【０２４１】Further, according to the audio signal compression method of the present invention (claim 3), in the audio signal compression method of claim 2, the all-pass filter is a first-order all-pass filter. By using a first-order all-pass filter, which can actually be implemented, a calculation that originally required an infinite number of operations can be completed in a preset finite number of operations without any approximation, so signals can be compressed efficiently.
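The cascade of first-order all-pass filters and the finite product-sum can be sketched as follows. The recursion below is the difference equation of the first-order all-pass section (z^-1 - alpha) / (1 - alpha z^-1); the summation range (the frame length only), the zero initial state, and the names `allpass_stage` and `warped_autocorrelation` are assumptions made for this illustration, since the patent's own equation is not reproduced in this passage.

```python
def allpass_stage(prev, alpha):
    """Pass one sequence through (z^-1 - alpha) / (1 - alpha z^-1).

    Implements y_i[n] = alpha * (y_i[n-1] - y_{i-1}[n]) + y_{i-1}[n-1]
    with zero initial state.
    """
    out = []
    prev_in = 0.0    # y_{i-1}[n-1]
    prev_out = 0.0   # y_i[n-1]
    for sample in prev:
        cur = alpha * (prev_out - sample) + prev_in
        out.append(cur)
        prev_in, prev_out = sample, cur
    return out

def warped_autocorrelation(frame, order, alpha):
    """Finite product-sum of the frame with each stage's output.

    r[m] = sum over the frame of x[n] * y_m[n], where y_0 = x and
    y_m is the output of the m-th all-pass stage.
    """
    stage = list(frame)
    r = []
    for _ in range(order + 1):
        r.append(sum(x * y for x, y in zip(frame, stage)))
        stage = allpass_stage(stage, alpha)
    return r

# With alpha = 0 each stage is a plain unit delay, so the result is the
# ordinary autocorrelation of the frame.
print(warped_autocorrelation([1.0, 2.0, 3.0], 2, 0.0))
```

Each lag costs one pass over the frame, so the whole computation takes (order + 1) passes regardless of how long the all-pass impulse responses ring.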

【０２４２】Further, according to the audio signal compression method of the present invention (claim 4), in the audio signal compression method of claim 2 or claim 3, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter so that a frequency weighting corresponding to human auditory sensitivity characteristics is applied. Using the Bark scale or the mel scale makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band. This yields an audio signal compression method that can compress signals efficiently by exploiting the properties of human hearing.
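How a single all-pass coefficient stretches the low-frequency region can be seen from the phase response of a first-order all-pass section. The sketch below evaluates the well-known warping function of the section (z^-1 - alpha) / (1 - alpha z^-1); the coefficient value 0.4 is purely illustrative and is not a value taken from the patent, and the function name `warp` is hypothetical.

```python
import math

def warp(omega, alpha):
    """Warped angular frequency produced by a first-order all-pass
    section with coefficient alpha, for omega in [0, pi]."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

alpha = 0.4  # illustrative value only
for fraction in (0.1, 0.25, 0.5, 0.75):
    omega = fraction * math.pi
    print(round(fraction, 2), round(warp(omega, alpha) / math.pi, 3))
```

For a positive coefficient the mapping is steeper than the identity near zero frequency, with a slope of about (1 + alpha) / (1 - alpha), and flatter near the Nyquist frequency. Low frequencies therefore occupy a larger share of the warped axis, which is what gives them finer resolution when the coefficient is chosen to approximate the Bark or mel scale.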

【０２４３】Further, according to the audio signal compression device of the present invention (claim 5), an audio signal compression device that encodes an input audio signal and compresses its amount of information comprises: time-frequency conversion means that converts the input audio signal into a frequency-domain signal and outputs it; spectral envelope calculation means that obtains an autocorrelation function on the mel frequency axis using the input audio signal and an audio signal obtained by warping the frequency axis of the input audio signal in accordance with human auditory sensitivity characteristics, and that either uses the mel linear prediction coefficients obtained from that autocorrelation function as the spectral envelope or obtains a spectral envelope from them; normalization means that normalizes the frequency-domain signal by the spectral envelope to obtain a residual signal; power normalization means that normalizes the residual signal based on the maximum value or the average value of its power to obtain a normalized residual signal; and vector quantization means that vector-quantizes the normalized residual signal using a residual codebook and converts it into a residual code. This yields an audio signal compression device that can compress signals efficiently by exploiting the properties of human hearing.

【０２４４】Further, according to the audio signal compression device of the present invention (claim 6), the audio signal compression device of claim 5 further comprises auditory weighting calculation means that applies to the spectral envelope a frequency weighting corresponding to human auditory sensitivity characteristics and outputs the result as auditory weighting coefficients, and the vector quantization means quantizes the normalized residual signal using the auditory weighting coefficients. Because the auditory weighting coefficients are derived from a spectral envelope obtained from mel linear prediction coefficients that can be computed by a preset finite number of operations rather than an infinite number, this yields an audio signal compression device that can compress signals efficiently by exploiting the properties of human hearing.

【０２４５】[0245]

【０２４６】[0246]

【０２４７】Further, according to the audio signal compression device of the present invention (claim 7), in the audio signal compression device of claim 6, the vector quantization means is multiple quantization means composed of a plurality of such vector quantization means connected in cascade, and at least one of the vector quantization means constituting the multiple quantization means quantizes the residual signal using the auditory weighting coefficients. This makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band. In addition, the spectral envelope used to calculate the individual auditory weighting coefficients used by each of the plurality of quantization means can be obtained by a preset finite number of operations rather than an infinite number. This yields an audio signal compression device that can compress signals efficiently by exploiting the properties of human hearing.

【０２４８】Further, according to the audio signal compression device of the present invention (claim 8), in the audio signal compression device of any one of claims 5 to 7, the spectral envelope calculation means cuts out an audio signal of a fixed time length from the input audio signal; passes the fixed-length audio signal through multiple stages of all-pass filters to obtain a filter output signal for each stage; obtains an autocorrelation function on the mel frequency axis, in which the frequency axis has been warped in accordance with human auditory sensitivity characteristics, by a product-sum (Equation 2), performed a finite number of times, of the input audio signal and the filter output signal of each stage; obtains mel linear prediction coefficients from that autocorrelation function; and either uses the mel linear prediction coefficients themselves as the spectral envelope or obtains a spectral envelope from them. Audio signal compression can therefore be processed by a preset finite number of operations without any approximation, which yields an audio signal compression device that can compress signals efficiently by exploiting the properties of human hearing. Here, (Equation 2) is

【０２４９】Further, according to the audio signal compression device of the present invention (claim 9), in the audio signal compression device of claim 8, the all-pass filter is a first-order all-pass filter. By using a first-order all-pass filter, which can actually be implemented, a calculation that originally required an infinite number of operations can be completed in a preset finite number of operations without any approximation, so signals can be compressed efficiently.

【０２５０】Further, according to the audio signal compression device of the present invention (claim 10), in the audio signal compression device of claim 8 or claim 9, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter so that a frequency weighting corresponding to human auditory sensitivity characteristics is applied. Using the Bark scale or the mel scale makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band. This yields an audio signal compression method that can compress signals efficiently by exploiting the properties of human hearing.

【０２５１】Further, according to the speech signal compression method of the present invention (claim 11), in a speech signal compression method that encodes an input speech signal and compresses its amount of information, an autocorrelation function on the mel frequency axis is obtained using the input speech signal and a speech signal obtained by warping the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics; mel linear prediction coefficients are obtained from that autocorrelation function; either the mel linear prediction coefficients themselves are used as the spectral envelope or a spectral envelope is obtained from them; and the input speech signal is smoothed using the spectral envelope. This yields a speech signal compression method that can compress signals efficiently by exploiting the properties of human hearing.

【０２５２】Further, according to the speech signal compression method of the present invention (claim 12), in a speech signal compression method that encodes an input speech signal and compresses its amount of information, a speech signal of a fixed time length is cut out from the input speech signal; the fixed-length speech signal is passed through multiple stages of all-pass filters to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis has been warped in accordance with human auditory sensitivity characteristics, is obtained by a product-sum (Equation 3), performed a finite number of times, of the input speech signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from that autocorrelation function; either the mel linear prediction coefficients themselves are used as the spectral envelope or a spectral envelope is obtained from them; and the input speech signal is smoothed using the spectral envelope. The mel linear prediction coefficients can therefore be obtained by a preset finite number of operations without any approximation, which yields a speech signal compression method that can compress signals efficiently by exploiting the properties of human hearing. Here, (Equation 3) is

【０２５３】[0253]

【０２５４】[0254]

【０２５５】Further, according to the speech signal compression method of the present invention (claim 13), in the speech signal compression method of claim 12, the all-pass filter is a first-order all-pass filter. By using a first-order all-pass filter, which can actually be implemented, a calculation that originally required an infinite number of operations can be completed in a preset finite number of operations without any approximation, so signals can be compressed efficiently.

【０２５６】Further, according to the speech signal compression method of the present invention (claim 14), in the speech signal compression method of claim 12 or claim 13, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter so that a frequency weighting corresponding to human auditory sensitivity characteristics is applied. Using the Bark scale or the mel scale makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band. This yields a speech signal compression method that can compress signals efficiently by exploiting the properties of human hearing.

【０２５７】Further, according to the speech signal compression device of the present invention (claim 15), a speech signal compression device that encodes an input speech signal and compresses its amount of information comprises: feature calculation means that obtains an autocorrelation function on the mel frequency axis using the input speech signal and a speech signal obtained by warping the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, and that converts the mel linear prediction coefficients obtained from that autocorrelation function into a feature representing the spectral envelope; envelope normalization means that normalizes the input speech signal by inverse filtering it with that feature to obtain a residual signal; power normalization means that normalizes the residual signal based on the maximum value or the average value of its power to obtain a normalized residual signal; and vector quantization means that vector-quantizes the normalized residual signal using a residual codebook and converts it into a residual code. This yields a speech signal compression device that can compress signals efficiently by exploiting the properties of human hearing.
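The idea of obtaining a residual by inverse filtering with prediction coefficients can be illustrated with an ordinary prediction-error filter. This sketch shows only the unwarped case, with a plain unit delay between taps; a mel linear prediction inverse filter would presumably use the all-pass section in place of each delay, which is not shown here. The function name `inverse_filter` and the coefficient convention A(z) = 1 + sum_k a[k] z^-k are assumptions made for the example.

```python
def inverse_filter(signal, a):
    """Apply the prediction-error filter A(z) to `signal`.

    a: coefficients with a[0] == 1.0, so that
       residual[n] = sum_k a[k] * signal[n - k].
    Samples before the start of the signal are taken as zero.
    """
    residual = []
    for n in range(len(signal)):
        acc = 0.0
        for k, coeff in enumerate(a):
            if n - k >= 0:
                acc += coeff * signal[n - k]
        residual.append(acc)
    return residual

# A decaying sequence that a first-order predictor models exactly:
# everything after the first sample is predicted, so the residual is
# a single impulse.
print(inverse_filter([1.0, 0.5, 0.25, 0.125], [1.0, -0.5]))
```

When the predictor matches the signal well, the residual carries far less energy and structure than the input, which is why it is the residual rather than the waveform that is passed on to power normalization and vector quantization.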

【０２５８】Further, according to the speech signal compression device of the present invention (claim 16), in the speech signal compression device of claim 15, the feature calculation means cuts out a speech signal of a fixed time length from the input speech signal; passes the fixed-length speech signal through multiple stages of all-pass filters to obtain a filter output signal for each stage; obtains an autocorrelation function on the mel frequency axis, in which the frequency axis has been warped in accordance with human auditory sensitivity characteristics, by a product-sum (Equation 4), performed a finite number of times, of the input speech signal and the filter output signal of each stage; obtains mel linear prediction coefficients from that autocorrelation function; and converts the mel linear prediction coefficients into a feature representing the spectral envelope. Speech signal compression, which originally required an infinite number of operations, can therefore be processed by a preset finite number of operations without any approximation, which yields a speech signal compression device that can compress signals efficiently by exploiting the properties of human hearing. Here, (Equation 4) is

【０２５９】[0259]

【０２６０】[0260]

【０２６１】Further, according to the speech signal compression device of the present invention (claim 17), in the speech signal compression device of claim 16, the all-pass filter is a first-order all-pass filter. By using a first-order all-pass filter, which can actually be implemented, a calculation that originally required an infinite number of operations can be completed in a preset finite number of operations without any approximation, so speech signals can be compressed efficiently.

【０２６２】Further, according to the speech signal compression device of the present invention (claim 18), in the speech signal compression device of claim 16 or claim 17, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter so that a frequency weighting corresponding to human auditory sensitivity characteristics is applied. Using the Bark scale or the mel scale makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band. This yields a speech signal compression device that can compress signals efficiently by exploiting the properties of human hearing.

【０２６３】[0263] According to the speech recognition method of the present invention (claim 19), in a speech recognition method for recognizing speech from an input speech signal, an autocorrelation function on the mel frequency axis is obtained using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis, and a feature quantity representing the spectral envelope is obtained from the mel linear prediction coefficients. This makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band, and a speech recognition method that performs highly accurate speech recognition by exploiting human auditory characteristics is obtained.
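The step from an autocorrelation function to linear prediction coefficients can be sketched as follows. This is a minimal illustration using the standard Levinson-Durbin recursion, under the assumption that the autocorrelation values depend only on the lag; the paragraph above does not name a particular solver, and the function name `levinson_durbin` is hypothetical.

```python
def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    r holds autocorrelation values r[0..order] (r[0] must be positive).
    Returns (a, residual_energy).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current prediction error and lag i.
        acc = r[i]
        for k in range(1, i):
            acc += a[k] * r[i - k]
        reflection = -acc / err
        # Update the coefficients of orders 1..i.
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + reflection * a[i - k]
        new_a[i] = reflection
        a = new_a
        err *= 1.0 - reflection * reflection
    return a, err
```

When the same recursion is fed an autocorrelation computed on the mel frequency axis, the resulting coefficients play the role of the mel linear prediction coefficients described above.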

【０２６４】[0264] According to the speech recognition method of the present invention (claim 20), in a speech recognition method for recognizing speech from an input speech signal, a speech signal of fixed time length is cut out from the input speech signal, the fixed-length speech signal is passed through a multi-stage all-pass filter to obtain a filter output signal for each stage, an autocorrelation function on the mel frequency axis, in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, is obtained by a product-sum (Equation 5) performed a finite number of times between the input speech signal and the filter output signal of each stage, mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis, and a feature quantity representing the spectral envelope is obtained from the mel linear prediction coefficients. Therefore, processing that originally required an infinite number of operations when compressing the speech signal can be handled with a preset finite number of operations and no approximate calculation at all, and a speech recognition method that performs more accurate speech recognition by exploiting human auditory characteristics is obtained. Here, (Equation 5) is

【０２６５】[0265] According to the speech recognition method of the present invention (claim 21), in the speech recognition method according to claim 20, the all-pass filter is a first-order all-pass filter. With a practically realizable first-order all-pass filter, processing that would otherwise require an infinite number of operations is completed in a preset finite number of operations, with no approximate calculation at all, so that the amount of processing is reduced and highly accurate speech recognition can be performed.
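The finite product-sum between the input frame and the outputs of a chain of first-order all-pass filters can be sketched as below. This is an illustration only: the equation itself is not reproduced in this text, so the code assumes the form phi[j] = sum over the frame of x[n] * y_j[n], where y_0 is the frame and y_j is the output of the j-th all-pass stage. The names `allpass_stage` and `mel_autocorrelation` are hypothetical.

```python
def allpass_stage(x, alpha):
    """First-order all-pass: y[n] = -alpha*x[n] + x[n-1] + alpha*y[n-1], zero initial state."""
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for xn in x:
        yn = -alpha * xn + x_prev + alpha * y_prev
        y.append(yn)
        x_prev, y_prev = xn, yn
    return y

def mel_autocorrelation(frame, alpha, order):
    """Return phi[0..order], where phi[j] = sum_n frame[n] * y_j[n] over the frame only.

    y_0 is the frame itself and y_j is y_(j-1) passed through one more stage.
    Every sum runs over len(frame) samples, so the work is finite and fixed in advance.
    """
    phi = []
    y = list(frame)
    for _ in range(order + 1):
        phi.append(sum(xn * yn for xn, yn in zip(frame, y)))
        y = allpass_stage(y, alpha)
    return phi
```

With `alpha = 0` each stage reduces to a one-sample delay and the result is the ordinary autocorrelation of the frame, which is a convenient sanity check; a nonzero coefficient replaces the unit delays with frequency-dependent delays.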

【０２６６】[0266]

【０２６７】[0267]

【０２６８】[0268] According to the speech recognition method of the present invention (claim 22), in the speech recognition method according to claim 20 or claim 21, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed. Using the Bark scale or the mel scale therefore makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band, and a speech recognition method that performs more accurate speech recognition by exploiting human auditory characteristics is obtained.
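One way to choose an all-pass coefficient that follows the mel scale is to fit the filter's frequency mapping to a mel curve. The sketch below is an assumption-laden illustration: it uses one common mel formula, 2595 log10(1 + f/700), and a simple least-squares grid search, neither of which is stated in this paragraph. The function names are hypothetical.

```python
import math

def warped(omega, alpha):
    """Frequency mapping of the all-pass (z^-1 - alpha) / (1 - alpha z^-1)."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

def mel(f_hz):
    # One widely used mel formula (assumption; the text does not specify one).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def fit_alpha(sample_rate_hz, points=200):
    """Grid-search alpha in 0.00..0.99 to match the normalized mel curve."""
    nyquist = sample_rate_hz / 2.0
    best_alpha, best_err = 0.0, float("inf")
    for step in range(100):
        alpha = step / 100.0
        err = 0.0
        for k in range(1, points):
            f = nyquist * k / points
            target = mel(f) / mel(nyquist)                     # mel axis scaled to 0..1
            got = warped(math.pi * f / nyquist, alpha) / math.pi  # warped axis scaled to 0..1
            err += (got - target) ** 2
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```

The fitted value depends on the sampling rate, because the portion of the mel curve below the Nyquist frequency changes shape as the bandwidth grows; a Bark-scale fit would proceed the same way with a Bark formula in place of `mel`.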

【０２６９】[0269] According to the speech recognition apparatus of the present invention (claim 23), a speech recognition apparatus for recognizing speech from an input speech signal comprises: mel linear prediction analysis means for obtaining an autocorrelation function on the mel frequency axis using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, and for obtaining mel linear prediction coefficients from the autocorrelation function on the mel frequency axis; cepstrum coefficient calculation means for calculating cepstrum coefficients from the mel linear prediction coefficients; and speech recognition means for calculating the distances between a plurality of frames of the cepstrum coefficients and a plurality of standard models, and recognizing the standard model with the shortest distance as the one with the highest similarity among the plurality of standard models. This makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band, and a speech recognition apparatus that performs highly accurate speech recognition by exploiting human auditory characteristics is obtained.
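The last two means, cepstrum calculation and selection of the closest standard model, can be sketched as follows. This is a minimal illustration: it uses the standard recursion from prediction coefficients to cepstrum coefficients, and a frame-by-frame Euclidean distance without time alignment, which is an assumption since the paragraph does not specify the distance measure. All function names and the model data are hypothetical.

```python
import math

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of 1/A(z) for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p (a[0] == 1)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def sequence_distance(frames, model):
    """Sum of per-frame Euclidean distances (no time alignment; an assumption)."""
    return sum(math.dist(f, m) for f, m in zip(frames, model))

def recognize(frames, standard_models):
    """Return the name of the standard model with the shortest distance."""
    return min(standard_models,
               key=lambda name: sequence_distance(frames, standard_models[name]))

# Illustrative usage with made-up two-frame models.
models = {"a": [[0.5, 0.1], [0.4, 0.1]], "b": [[-0.5, 0.0], [-0.4, 0.0]]}
result = recognize([[0.45, 0.1], [0.4, 0.12]], models)
```

A practical recognizer would normally align sequences of different lengths before comparing them; the fixed pairing used here only shows the "shortest distance wins" decision rule.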

【０２７０】[0270] According to the speech recognition apparatus of the present invention (claim 24), in the speech recognition apparatus according to claim 23, the mel linear prediction analysis means cuts out a speech signal of fixed time length from the input speech signal, passes the fixed-length speech signal through a multi-stage all-pass filter to obtain a filter output signal for each stage, obtains, by a product-sum (Equation 6) performed a finite number of times between the input speech signal and the filter output signal of each stage, an autocorrelation function on the mel frequency axis in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, and obtains mel linear prediction coefficients from the autocorrelation function on the mel frequency axis. Therefore, processing that originally required an infinite number of operations when compressing the speech signal can be handled with a preset finite number of operations and no approximate calculation at all, and a speech recognition apparatus that performs highly accurate speech recognition by exploiting human auditory characteristics is obtained. Here, (Equation 6) is

【０２７１】[0271] According to the speech recognition apparatus of the present invention (claim 25), in the speech signal compression method according to claim 24, the all-pass filter is a first-order all-pass filter. By using a practically realizable first-order all-pass filter, processing that would otherwise require an infinite number of operations is completed in a preset finite number of operations, with no approximate calculation at all, so that the amount of processing is reduced and more accurate speech recognition can be performed.

【０２７２】[0272] According to the speech recognition apparatus of the present invention (claim 26), in the speech compression method according to claim 24 or claim 25, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed. Using the Bark scale or the mel scale therefore makes it possible to analyze the low frequency band, which is important to human hearing, with higher frequency resolution than the high frequency band, and a speech recognition apparatus that performs highly accurate speech recognition by exploiting human auditory characteristics is obtained.

【０２７３】[0273]

【０２７４】[0274]

【０２７５】[0275]

【０２７６】[0276]

【０２７７】[0277]

【０２７８】[0278]

【０２７９】[0279]

【０２８０】[0280]

【０２８１】[0281]

【０２８２】[0282]

【０２８３】[0283]

【０２８４】[0284]

[Brief description of the drawings]

【図１】FIG. 1 is a block diagram showing the configuration of an audio signal compression apparatus according to a first embodiment of the present invention.

【図２】FIG. 2 is a block diagram showing an example of the detailed configuration of the spectral envelope calculation unit of the audio signal compression apparatus according to the first embodiment of the present invention.

【図３】FIG. 3 is a block diagram showing an example of the detailed configuration of the mel-scaled coefficient calculation unit of the audio signal compression apparatus according to the first embodiment of the present invention.

【図４】FIG. 4 is a block diagram showing an example of the detailed calculation procedure of the mel-scaled coefficient calculation unit of the audio signal compression apparatus according to the first embodiment of the present invention.

【図５】FIG. 5 is a diagram showing the characteristics of the frequency axis expansion and contraction function (all-pass filter).

【図６】FIG. 6 is a block diagram showing an example of the detailed configuration of the envelope calculation unit of the audio signal compression apparatus according to the first embodiment of the present invention.

【図７】FIG. 7 is a block diagram showing the configuration of a speech recognition apparatus according to a second embodiment of the present invention.

【図８】FIG. 8 is a block diagram showing an example of the detailed configuration of the mel linear prediction analysis unit of the speech recognition apparatus according to the second embodiment of the present invention.

【図９】FIG. 9 is a block diagram showing the configuration of an audio signal compression apparatus according to a third embodiment of the present invention.

【図１０】FIG. 10 is a block diagram showing the configuration of a mobile phone according to a fourth embodiment of the present invention.

【図１１】FIG. 11 is a block diagram showing the configuration of a network device according to a fifth embodiment of the present invention.

【図１２】FIG. 12 is a block diagram showing the configuration of a network device according to a sixth embodiment of the present invention.

[Explanation of symbols]

1 Time-frequency conversion unit
2 Spectral envelope calculation unit
3 Normalization unit
4 Power normalization unit
5 Multistage quantization unit
6 Perceptual weighting calculation unit
7 Mel linear prediction analysis unit
8 Cepstrum coefficient calculation unit
9 Speech recognition unit
51 First-stage quantizer
52 Second-stage quantizer
53 Third-stage quantizer

Continuation of the front page

(51) Int.Cl.⁷, identification symbol, FI: G10L 9/08 301A; 3/00 515Z
(72) Inventor: Tomokazu Ishikawa, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Mitsuhiko Serikawa, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Dairo Katayama, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Junichi Nakahashi, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Junko Yagi, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(56) References: JP-A-3-138700 (JP, A); JP-A-5-313695 (JP, A); JP-A-7-160297 (JP, A); JP-A-7-191696 (JP, A); JP-A-8-115095 (JP, A); JP-A-9-244698 (JP, A)
(58) Fields searched (Int.Cl.⁷, DB name): G10L 15/00 - 15/02; G10L 19/00 - 19/14

Claims

(57) [Claims]

1. An audio signal compression method for encoding an input audio signal and compressing its amount of information, characterized in that: an autocorrelation function on the mel frequency axis is obtained using the input audio signal and an audio signal obtained by expanding and contracting the frequency axis of the input audio signal in accordance with human auditory sensitivity characteristics; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as a spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input audio signal is smoothed frame by frame using the spectral envelope.

2. An audio signal compression method for encoding an input audio signal and compressing its amount of information, characterized in that: an audio signal of fixed time length is cut out from the input audio signal; the fixed-length audio signal is passed through a multi-stage all-pass filter to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, is obtained by a product-sum (Equation 1) performed a finite number of times between the input audio signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as a spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input audio signal is smoothed frame by frame using the spectral envelope. Here, (Equation 1) is [Equation 1], where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(ij)[n] is the filter output signal of each stage.

3. The audio signal compression method according to claim 2, characterized in that the all-pass filter is a first-order all-pass filter.

4. The audio signal compression method according to claim 2, characterized in that the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

5. An audio signal compression apparatus for encoding an input audio signal and compressing its amount of information, characterized by comprising: time-frequency conversion means for converting the input audio signal into a frequency-domain signal and outputting it; spectral envelope calculation means for obtaining an autocorrelation function on the mel frequency axis using the input audio signal and an audio signal obtained by expanding and contracting the frequency axis of the input audio signal in accordance with human auditory sensitivity characteristics, and for using the mel linear prediction coefficients obtained from the autocorrelation function on the mel frequency axis as a spectral envelope, or obtaining a spectral envelope from the mel linear prediction coefficients; normalization means for normalizing the frequency-domain signal with the spectral envelope to obtain a residual signal; power normalization means for normalizing the residual signal by the maximum value or the average value of its power to obtain a normalized residual signal; and vector quantization means for vector-quantizing the normalized residual signal with a residual codebook and converting it into a residual code.

6. The audio signal compression apparatus according to claim 5, characterized by further comprising perceptual weighting calculation means for weighting the spectral envelope on frequency in accordance with human auditory sensitivity characteristics and outputting the result as perceptual weighting coefficients, wherein the vector quantization means quantizes the normalized residual signal using the perceptual weighting coefficients.

7. The audio signal compression apparatus according to claim 6, characterized in that the vector quantization means is multistage quantization means composed of a plurality of vector quantization means connected in cascade, and at least one of the vector quantization means constituting the multistage quantization means quantizes the normalized residual signal using the perceptual weighting coefficients.
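The chain of normalization, power normalization and cascaded, perceptually weighted vector quantization recited in claims 5 to 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the power normalization uses the root mean square (one reading of "average value of power"), the weighted squared error is an assumed distortion measure, and all function names and codebook contents are hypothetical.

```python
def normalize_by_envelope(spectrum, envelope):
    """Divide the frequency-domain signal by the spectral envelope (residual signal)."""
    return [s / e for s, e in zip(spectrum, envelope)]

def power_normalize(residual):
    """Scale the residual to unit RMS; returns (normalized_residual, gain)."""
    gain = (sum(r * r for r in residual) / len(residual)) ** 0.5
    if gain == 0.0:
        return list(residual), 0.0
    return [r / gain for r in residual], gain

def nearest_code(vector, codebook, weights):
    """Index of the code vector with the smallest weighted squared error."""
    def weighted_error(code):
        return sum(w * (v - c) ** 2 for v, c, w in zip(vector, code, weights))
    return min(range(len(codebook)), key=lambda i: weighted_error(codebook[i]))

def multistage_quantize(vector, codebooks, weights):
    """Quantize stage by stage; each stage codes what the previous stages left over."""
    indices = []
    remaining = list(vector)
    for codebook in codebooks:
        idx = nearest_code(remaining, codebook, weights)
        indices.append(idx)
        remaining = [r - c for r, c in zip(remaining, codebook[idx])]
    return indices, remaining

# Illustrative usage with made-up data.
residual = normalize_by_envelope([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
normalized, gain = power_normalize(residual)
stage1 = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
stage2 = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
codes, leftover = multistage_quantize(normalized, [stage1, stage2], [1.0, 1.0, 1.0, 1.0])
```

The list of stage indices stands in for the residual code; larger weights on perceptually important frequency bins make errors there more costly when the code vector is selected.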

8. The audio signal compression apparatus according to claim 5, characterized in that the spectral envelope calculation means cuts out an audio signal of fixed time length from the input audio signal, passes the fixed-length audio signal through a multi-stage all-pass filter to obtain a filter output signal for each stage, obtains, by a product-sum (Equation 2) performed a finite number of times between the input audio signal and the filter output signal of each stage, an autocorrelation function on the mel frequency axis in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, obtains mel linear prediction coefficients from the autocorrelation function on the mel frequency axis, and uses the mel linear prediction coefficients themselves as a spectral envelope or obtains a spectral envelope from the mel linear prediction coefficients. Here, (Equation 2) is [Equation 2], where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(ij)[n] is the filter output signal of each stage.

9. The audio signal compression apparatus according to claim 8, characterized in that the all-pass filter is a first-order all-pass filter.

10. The audio signal compression apparatus according to claim 8, characterized in that the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

11. A speech signal compression method for encoding an input speech signal and compressing its amount of information, characterized in that: an autocorrelation function on the mel frequency axis is obtained using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as a spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input speech signal is smoothed using the spectral envelope.

12. A speech signal compression method for encoding an input speech signal and compressing its amount of information, characterized in that: a speech signal of fixed time length is cut out from the input speech signal; the fixed-length speech signal is passed through a multi-stage all-pass filter to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, is obtained by a product-sum (Equation 3) between the input speech signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; the mel linear prediction coefficients themselves are used as a spectral envelope, or a spectral envelope is obtained from the mel linear prediction coefficients; and the input speech signal is smoothed using the spectral envelope. Here, (Equation 3) is [Equation 3], where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(ij)[n] is the filter output signal of each stage.

13. The speech signal compression method according to claim 12, characterized in that the all-pass filter is a first-order all-pass filter.

14. The speech signal compression method according to claim 12 or claim 13, characterized in that the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

15. A speech signal compression apparatus for encoding an input speech signal and compressing its amount of information, characterized by comprising: feature quantity calculation means for obtaining an autocorrelation function on the mel frequency axis using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, obtaining mel linear prediction coefficients from the autocorrelation function on the mel frequency axis, and converting the mel linear prediction coefficients into a feature quantity representing the spectral envelope; envelope normalization means for normalizing the input speech signal using the feature quantity to obtain a residual signal; power normalization means for normalizing the residual signal by the maximum value or the average value of its power to obtain a normalized residual signal; and vector quantization means for vector-quantizing the normalized residual signal with a residual codebook and converting it into a residual code.

16. The speech signal compression apparatus according to claim 15, characterized in that the feature quantity calculation means cuts out a speech signal of fixed time length from the input speech signal, passes the fixed-length speech signal through a multi-stage all-pass filter to obtain a filter output signal for each stage, obtains, by a product-sum (Equation 4) between the input speech signal and the filter output signal of each stage, an autocorrelation function on the mel frequency axis in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, obtains mel linear prediction coefficients from the autocorrelation function on the mel frequency axis, and converts the mel linear prediction coefficients into a feature quantity representing the spectral envelope. Here, (Equation 4) is [Equation 4], where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(ij)[n] is the filter output signal of each stage.

17. The speech signal compression apparatus according to claim 16, characterized in that the all-pass filter is a first-order all-pass filter.

18. The speech signal compression apparatus according to claim 16 or claim 17, characterized in that the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

19. A speech recognition method for recognizing speech from an input speech signal, characterized in that: an autocorrelation function on the mel frequency axis is obtained using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; and a feature quantity representing the spectral envelope is obtained from the mel linear prediction coefficients.

20. A speech recognition method for recognizing speech from an input speech signal, characterized in that: a speech signal of fixed time length is cut out from the input speech signal; the fixed-length speech signal is passed through a multi-stage all-pass filter to obtain a filter output signal for each stage; an autocorrelation function on the mel frequency axis, in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, is obtained by a product-sum (Equation 5) performed a finite number of times between the input speech signal and the filter output signal of each stage; mel linear prediction coefficients are obtained from the autocorrelation function on the mel frequency axis; and a feature quantity representing the spectral envelope is obtained from the mel linear prediction coefficients. Here, (Equation 5) is [Equation 5], where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(ij)[n] is the filter output signal of each stage.

21. The speech recognition method according to claim 20, characterized in that the all-pass filter is a first-order all-pass filter.

22. The speech recognition method according to claim 20, characterized in that the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed.

23. A speech recognition apparatus for recognizing speech from an input speech signal, characterized by comprising: mel linear prediction analysis means for obtaining an autocorrelation function on the mel frequency axis using the input speech signal and a speech signal obtained by expanding and contracting the frequency axis of the input speech signal in accordance with human auditory sensitivity characteristics, and for obtaining mel linear prediction coefficients from the autocorrelation function on the mel frequency axis; cepstrum coefficient calculation means for calculating cepstrum coefficients from the mel linear prediction coefficients; and speech recognition means for calculating the distances between a plurality of frames of the cepstrum coefficients and a plurality of standard models, and recognizing the standard model with the shortest distance as the one with the highest similarity among the plurality of standard models.

24. The speech recognition apparatus according to claim 23, characterized in that the mel linear prediction analysis means cuts out a speech signal of fixed time length from the input speech signal, passes the fixed-length speech signal through a multi-stage all-pass filter to obtain a filter output signal for each stage, obtains, by a product-sum (Equation 6) performed a finite number of times between the input speech signal and the filter output signal of each stage, an autocorrelation function on the mel frequency axis in which the frequency axis has been expanded and contracted in accordance with human auditory sensitivity characteristics, and obtains mel linear prediction coefficients from the autocorrelation function on the mel frequency axis. Here, (Equation 6) is [Equation 6], where φ(i, j) is the autocorrelation function, x[n] is the input signal, and y_(ij)[n] is the filter output signal of each stage.

25. A speech recognition apparatus characterized in that, in the speech signal compression method according to claim 24, the all-pass filter is a first-order all-pass filter.

26. A speech recognition apparatus characterized in that, in the speech compression method according to claim 24 or claim 25, the Bark scale or the mel scale is used for the filter coefficient of the all-pass filter, and weighting on frequency corresponding to human auditory sensitivity characteristics is performed.