JP4578145B2 - Speech coding apparatus, speech decoding apparatus, and methods thereof

Info

Publication number: JP4578145B2
Application number: JP2004131945A
Authority: JP
Inventors: 薫佐藤; 利幸森井
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2003-04-30
Filing date: 2004-04-27
Publication date: 2010-11-10
Anticipated expiration: 2024-04-27
Also published as: JP2004348120A

Abstract

PROBLEM TO BE SOLVED: To perform scalable encoding with a small amount of computation and a small amount of encoded information.

SOLUTION: In this speech encoding apparatus, a base layer encoding section 101 encodes an input signal to obtain base layer encoded information. A base layer decoding section 102 decodes the base layer encoded information to obtain a base layer decoded signal and long-term prediction information (pitch lag). An addition section 103 inverts the polarity of the base layer decoded signal and adds it to the input signal to obtain a residual signal. An enhancement layer encoding section 104 encodes a long-term prediction coefficient, calculated using the long-term prediction information and the residual signal, to obtain enhancement layer encoded information. In the speech decoding apparatus, a base layer decoding section 152 decodes the base layer encoded information to obtain a base layer decoded signal and long-term prediction information. An enhancement layer decoding section 153 decodes the enhancement layer encoded information using the long-term prediction information to obtain an enhancement layer decoded signal. An addition section 154 adds the base layer decoded signal and the enhancement layer decoded signal to obtain a speech/musical sound signal.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech encoding apparatus, a speech decoding apparatus, and methods thereof for use in a communication system that encodes and transmits speech and musical sound signals.

In fields such as digital wireless communication, packet communication typified by Internet communication, and speech storage, speech signal encoding/decoding technology is indispensable for making effective use of transmission path capacity (such as radio bandwidth) and of storage media, and many speech encoding/decoding schemes have been developed to date. Among them, CELP speech encoding/decoding has been put into practical use as the mainstream scheme (see, for example, Non-Patent Document 1).

A CELP speech encoding apparatus encodes input speech based on speech models stored in advance. Specifically, the digitized speech signal is divided into frames of about 20 ms, linear prediction analysis is performed on the speech signal frame by frame to obtain linear prediction coefficients and a linear prediction residual vector, and the linear prediction coefficients and the linear prediction residual vector are each encoded separately.
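The framing and linear prediction analysis described above can be illustrated with a minimal sketch. It is not the codec of the cited document: the 8 kHz sampling rate, the autocorrelation method with the Levinson-Durbin recursion, and all function names are assumptions chosen for illustration.

```python
def split_frames(samples, sample_rate=8000, frame_ms=20):
    """Split a sample list into whole frames of frame_ms milliseconds (assumed 8 kHz)."""
    n = sample_rate * frame_ms // 1000
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def autocorr(frame, order):
    """Autocorrelation r[0..order] of one frame."""
    return [sum(frame[i] * frame[i - k] for i in range(k, len(frame)))
            for k in range(order + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns a[1..order] with x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (silent) frame: leave remaining coefficients at zero
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(frame, coeffs):
    """Linear prediction residual e[n] = x[n] - sum_k a[k] * x[n-k] within one frame."""
    return [x - sum(c * frame[n - 1 - k]
                    for k, c in enumerate(coeffs) if n - 1 - k >= 0)
            for n, x in enumerate(frame)]
```

For a decaying exponential such as `[0.9 ** n for n in range(160)]`, a first-order analysis returns a coefficient very close to 0.9, and the residual is essentially zero after the first sample.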

To carry out low bit rate communication, the amount of speech models that can be stored is limited, so conventional CELP type speech encoding/decoding schemes mainly store models of uttered speech.

In a communication system that transmits packets, such as Internet communication, packet loss occurs depending on the state of the network, so it is desirable that speech and musical sound can be decoded from the remaining part of the encoded information even if part of the encoded information is lost. Similarly, in a variable rate communication system that changes the bit rate according to the communication capacity, it is desirable that, when the capacity decreases, the load on the communication capacity can easily be reduced by transmitting only part of the encoded information. Scalable coding has recently attracted attention as a technique that allows speech and musical sound to be decoded using either all of the encoded information or only part of it. Several scalable coding schemes have been disclosed in the past (see, for example, Patent Document 1).

A scalable coding scheme generally consists of a base layer and enhancement layers, which form a hierarchical structure with the base layer as the lowest layer. Each layer encodes a residual signal, which is the difference between the input signal and the output signal of the layer below it. With this configuration, a speech or musical sound signal can be decoded using either the encoded information of all layers or the encoded information of the lower layers only.

Patent Document 1: JP-A-10-97295
Non-Patent Document 1: M. R. Schroeder, B. S. Atal, "Code Excited Linear Prediction: High Quality Speech at Low Bit Rate", IEEE proc., ICASSP'85, pp. 937-940
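The layered structure of scalable coding, in which each layer encodes the residual left by the layers below it, can be sketched as follows. The uniform quantizers standing in for the per-layer codecs, the step sizes, and the function names are assumptions made purely for illustration; they are not taken from the cited documents.

```python
def quantize(signal, step):
    """Stand-in layer encoder: uniform scalar quantization (an assumption)."""
    return [round(x / step) for x in signal]


def dequantize(codes, step):
    """Stand-in layer decoder matching quantize()."""
    return [c * step for c in codes]


def encode_layers(signal, steps):
    """Encode layer by layer; each layer codes the residual left by the layers below."""
    layers = []
    residual = list(signal)
    for step in steps:
        codes = quantize(residual, step)
        decoded = dequantize(codes, step)
        residual = [r - d for r, d in zip(residual, decoded)]
        layers.append(codes)
    return layers


def decode_layers(layers, steps):
    """Decode from however many layers were received (lowest layers first)."""
    out = None
    for codes, step in zip(layers, steps):
        decoded = dequantize(codes, step)
        out = decoded if out is None else [o + d for o, d in zip(out, decoded)]
    return out
```

Calling `decode_layers(layers[:1], steps)` reconstructs a coarse signal from the base layer alone, while passing all layers refines it, which is the property that makes partial decoding possible.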

However, conventional scalable coding schemes use CELP type speech encoding/decoding for both the base layer and the enhancement layers, so a correspondingly large amount of both computation and encoded information is required.

The present invention has been made in view of the above, and its object is to provide a speech encoding apparatus, a speech decoding apparatus, and methods thereof that can realize scalable coding with a small amount of computation and a small amount of encoded information.

The speech encoding apparatus of the present invention employs a configuration comprising: base layer encoding means for encoding an input signal to generate first encoded information; base layer decoding means for decoding the first encoded information to generate a first decoded signal and for generating long-term prediction information, which is information representing the long-term correlation of speech and musical sound; addition means for obtaining a residual signal that is the difference between the input signal and the first decoded signal; and enhancement layer encoding means for calculating a long-term prediction coefficient using the long-term prediction information and the residual signal and encoding the long-term prediction coefficient to generate second encoded information.

The base layer decoding means in the speech encoding apparatus of the present invention employs a configuration in which information indicating the position at which an adaptive excitation vector is cut out from the driving excitation signal samples is used as the long-term prediction information.

The enhancement layer encoding means in the speech encoding apparatus of the present invention employs a configuration comprising: means for obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; means for cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; means for calculating a long-term prediction coefficient using the residual signal and the long-term prediction signal; means for generating the enhancement layer encoded information by encoding the long-term prediction coefficient; means for decoding the enhancement layer encoded information to generate a decoded long-term prediction coefficient; and means for calculating a new long-term prediction signal using the decoded long-term prediction coefficient and the long-term prediction signal and updating the buffer using the new long-term prediction signal.
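The sequence of means listed above can be sketched as one encoding step per frame. Several details are assumptions and not taken from this passage: the enhancement layer lag is set equal to the base layer pitch lag (same sampling rate in both layers), the coefficient is computed as a least-squares gain, lags shorter than the frame are handled by periodic repetition, and `GAIN_CODEBOOK` is a hypothetical scalar codebook.

```python
GAIN_CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]  # hypothetical quantizer levels


def cut_out(buffer, lag, n):
    """Cut out n samples starting lag samples back from the end of the buffer.
    Lags shorter than n are extended periodically (an assumption)."""
    start = len(buffer) - lag
    return [buffer[start + (i % lag)] for i in range(n)]


def ltp_coefficient(residual, ltp_signal):
    """Least-squares gain between residual and long-term prediction signal (assumed form)."""
    energy = sum(s * s for s in ltp_signal)
    if energy == 0.0:
        return 0.0
    return sum(e * s for e, s in zip(residual, ltp_signal)) / energy


def encode_enhancement(residual, pitch_lag, buffer):
    """Encode one frame; returns the codebook index and updates buffer in place."""
    lag = pitch_lag  # assumption: lag taken directly from the base layer
    ltp = cut_out(buffer, lag, len(residual))
    beta = ltp_coefficient(residual, ltp)
    index = min(range(len(GAIN_CODEBOOK)),
                key=lambda i: abs(GAIN_CODEBOOK[i] - beta))
    beta_q = GAIN_CODEBOOK[index]           # decoded long-term prediction coefficient
    new_ltp = [beta_q * s for s in ltp]     # new long-term prediction signal
    buffer[:] = (buffer + new_ltp)[-len(buffer):]
    return index
```

Only the coefficient index is produced as enhancement layer information in this sketch; the lag itself is never encoded because it is derived from the base layer.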

With these configurations, the residual signal can be long-term predicted in the enhancement layer by exploiting the long-term correlation of speech and musical sound, so that speech and musical sound signals with a wide frequency band can be encoded/decoded effectively with a small amount of encoded information, and the amount of computation can be reduced. In addition, the encoded information can be reduced by obtaining the long-term prediction lag from the long-term prediction information of the base layer rather than encoding/decoding the lag itself.

The enhancement layer encoding means in the speech encoding apparatus of the present invention employs a configuration further comprising: means for obtaining a long-term prediction residual signal that is the difference between the residual signal and the long-term prediction signal; means for generating long-term prediction residual encoded information by encoding the long-term prediction residual signal; means for decoding the long-term prediction residual encoded information to calculate a decoded long-term prediction residual signal; and means for adding the new long-term prediction signal and the decoded long-term prediction residual signal and updating the buffer using the result of the addition.

With this configuration, the difference between the residual signal and the long-term prediction signal (the long-term prediction residual signal) can be encoded/decoded, so that a decoded signal of still higher quality can be obtained.

The speech decoding apparatus of the present invention is a speech decoding apparatus that receives the first encoded information and the second encoded information from any of the above speech encoding apparatuses and decodes speech, and employs a configuration comprising: base layer decoding means for decoding the first encoded information to generate a first decoded signal and for generating long-term prediction information, which is information representing the long-term correlation of speech and musical sound; enhancement layer decoding means for decoding the second encoded information using the long-term prediction information to generate a second decoded signal; and addition means for adding the first decoded signal and the second decoded signal and outputting the resulting speech/musical sound signal.

The base layer decoding means in the speech decoding apparatus of the present invention employs a configuration in which information indicating the position at which an adaptive excitation vector is cut out from the driving excitation signal samples is used as the long-term prediction information.

The enhancement layer decoding means in the speech decoding apparatus of the present invention employs a configuration comprising: means for obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; means for cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; means for decoding the enhancement layer encoded information to obtain a decoded long-term prediction coefficient; and means for calculating a long-term prediction signal using the decoded long-term prediction coefficient and the cut-out long-term prediction signal and updating the buffer using the calculated long-term prediction signal, wherein the calculated long-term prediction signal is used as the enhancement layer decoded signal.
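The decoder side of the enhancement layer can be sketched in the same spirit. As before, this is only an illustration under assumptions: the lag equals the base layer pitch lag, short lags are extended periodically, and `GAIN_CODEBOOK` is a hypothetical table that must match the one used by the encoder.

```python
GAIN_CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]  # hypothetical, shared with the encoder


def decode_enhancement(index, pitch_lag, buffer, frame_len):
    """Decode one enhancement layer frame and update buffer in place."""
    lag = pitch_lag  # assumption: lag derived directly from the base layer pitch lag
    start = len(buffer) - lag
    ltp = [buffer[start + (i % lag)] for i in range(frame_len)]  # cut-out signal
    beta_q = GAIN_CODEBOOK[index]                 # decoded long-term prediction coefficient
    decoded = [beta_q * s for s in ltp]           # enhancement layer decoded signal
    buffer[:] = (buffer + decoded)[-len(buffer):]
    return decoded
```

Because the decoder applies the same buffer update as the encoder, the two buffers stay synchronized as long as every frame's index is received.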

With these configurations, the residual signal can be long-term predicted in the enhancement layer by exploiting the long-term correlation of speech and musical sound, so that speech and musical sound signals with a wide frequency band can be encoded/decoded effectively with a small amount of encoded information, and the amount of computation can be reduced. In addition, the encoded information can be reduced by obtaining the long-term prediction lag from the long-term prediction information of the base layer rather than encoding/decoding the lag itself. Furthermore, decoding the base layer encoded information alone yields the decoded signal of the base layer, so that a CELP type speech encoding/decoding method can provide the ability to decode speech and musical sound even from part of the encoded information (scalable coding).

The enhancement layer decoding means in the speech decoding apparatus of the present invention employs a configuration comprising: means for decoding the long-term prediction residual encoded information to obtain a decoded long-term prediction residual signal; and means for adding the long-term prediction signal and the decoded long-term prediction residual signal, wherein the result of the addition is used as the enhancement layer decoded signal.

The speech signal transmitting apparatus of the present invention employs a configuration comprising any of the above speech encoding apparatuses. The speech signal receiving apparatus of the present invention employs a configuration comprising any of the above speech decoding apparatuses. The base station apparatus of the present invention employs a configuration comprising at least one of the above speech signal transmitting apparatus and speech signal receiving apparatus. The communication terminal apparatus of the present invention likewise employs a configuration comprising at least one of the above speech signal transmitting apparatus and speech signal receiving apparatus.

With these configurations, speech and musical sound signals with a wide frequency band can be encoded/decoded effectively with a small amount of encoded information, and the amount of computation can be reduced.

The speech encoding method of the present invention comprises the steps of: encoding an input signal to generate first encoded information; decoding the first encoded information to generate a first decoded signal and generating long-term prediction information, which is information representing the long-term correlation of speech and musical sound; obtaining a residual signal that is the difference between the input signal and the first decoded signal; and calculating a long-term prediction coefficient using the long-term prediction information and the residual signal and encoding the long-term prediction coefficient to generate second encoded information.

With this method, the residual signal can be long-term predicted in the enhancement layer by exploiting the long-term correlation of speech and musical sound, so that speech and musical sound signals with a wide frequency band can be encoded/decoded effectively with a small amount of encoded information, and the amount of computation can be reduced. In addition, the encoded information can be reduced by obtaining the long-term prediction lag from the long-term prediction information of the base layer rather than encoding/decoding the lag itself.

The speech decoding method of the present invention is a speech decoding method that decodes speech using the first encoded information and the second encoded information generated by the above speech encoding method, and comprises the steps of: decoding the first encoded information to generate a first decoded signal and generating long-term prediction information, which is information representing the long-term correlation of speech and musical sound; decoding the second encoded information using the long-term prediction information to generate a second decoded signal; and adding the first decoded signal and the second decoded signal and outputting the resulting speech/musical sound signal.

With this method, the residual signal can be long-term predicted in the enhancement layer by exploiting the long-term correlation of speech and musical sound, so that speech and musical sound signals with a wide frequency band can be encoded/decoded effectively with a small amount of encoded information, and the amount of computation can be reduced. In addition, the encoded information can be reduced by obtaining the long-term prediction lag from the long-term prediction information of the base layer rather than encoding/decoding the lag itself. Furthermore, decoding the base layer encoded information alone yields the decoded signal of the base layer, so that a CELP type speech encoding/decoding method can provide the ability to decode speech and musical sound even from part of the encoded information (scalable coding).

As described above, according to the present invention, speech and musical sound signals with a wide frequency band can be encoded/decoded effectively with a small amount of encoded information, and the amount of computation can be reduced. In addition, the encoded information can be reduced by obtaining the long-term prediction lag from the long-term prediction information of the base layer. Furthermore, decoding the base layer encoded information alone yields the decoded signal of the base layer, so that a CELP type speech encoding/decoding method can provide the ability to decode speech and musical sound even from part of the encoded information (scalable coding).

The gist of the present invention is to provide an enhancement layer that performs long-term prediction, to improve the quality of the decoded signal by performing long-term prediction of the residual signal in the enhancement layer using the long-term correlation of speech and musical sound, and to reduce the amount of computation by obtaining the long-term prediction lag from the long-term prediction information of the base layer.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. Each of the following embodiments describes the case where long-term prediction is performed in the enhancement layer of a two-layer speech encoding/decoding method consisting of a base layer and an enhancement layer. However, the present invention places no limit on the number of layers, and is also applicable to the case where, in a hierarchical speech encoding/decoding method with three or more layers, long-term prediction is performed in an upper layer using the long-term prediction information of a lower layer. A hierarchical speech encoding method is one in which a plurality of speech encoding methods, each of which encodes a residual signal (the difference between the input signal of the lower layer and the decoded signal of the lower layer) by long-term prediction and outputs encoded information, exist in upper layers and form a hierarchical structure. Likewise, a hierarchical speech decoding method is one in which a plurality of speech decoding methods that decode a residual signal exist in upper layers and form a hierarchical structure. Here, the speech/musical sound encoding/decoding method in the lowest layer is called the base layer, and a speech/musical sound encoding/decoding method in a layer above the base layer is called an enhancement layer.

Each embodiment of the present invention is described taking as an example the case where the base layer performs CELP type speech encoding/decoding.

(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of a speech encoding apparatus and a speech decoding apparatus according to Embodiment 1 of the present invention.

In FIG. 1, speech encoding apparatus 100 mainly comprises base layer encoding section 101, base layer decoding section 102, addition section 103, enhancement layer encoding section 104, and multiplexing section 105. Speech decoding apparatus 150 mainly comprises demultiplexing section 151, base layer decoding section 152, enhancement layer decoding section 153, and addition section 154.

Base layer encoding section 101 receives a speech/musical sound signal as input, encodes the input signal using a CELP type speech encoding method, and outputs the base layer encoded information obtained by the encoding both to base layer decoding section 102 and to multiplexing section 105.

Base layer decoding section 102 decodes the base layer encoded information using a CELP type speech decoding method, and outputs the base layer decoded signal obtained by the decoding to addition section 103. Base layer decoding section 102 also outputs the pitch lag to enhancement layer encoding section 104 as the long-term prediction information of the base layer.

Here, "long-term prediction information" is information representing the long-term correlation of a speech/musical sound signal. The "pitch lag" is position information specified in the base layer, and is described in detail later.

Addition section 103 inverts the polarity of the base layer decoded signal output from base layer decoding section 102, adds it to the input signal, and outputs the residual signal resulting from the addition to enhancement layer encoding section 104.

Enhancement layer encoding section 104 calculates a long-term prediction coefficient using the long-term prediction information output from base layer decoding section 102 and the residual signal output from addition section 103, encodes the long-term prediction coefficient, and outputs the enhancement layer encoded information obtained by the encoding to multiplexing section 105.

Multiplexing section 105 multiplexes the base layer encoded information output from base layer encoding section 101 and the enhancement layer encoded information output from enhancement layer encoding section 104, and outputs the result as multiplexed information to demultiplexing section 151 via a transmission path.

Demultiplexing section 151 separates the multiplexed information transmitted from speech encoding apparatus 100 into base layer encoded information and enhancement layer encoded information, outputs the separated base layer encoded information to base layer decoding section 152, and outputs the separated enhancement layer encoded information to enhancement layer decoding section 153.
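The multiplexing and demultiplexing of the two layers can be illustrated with a simple byte layout. The text does not specify a packet format, so the 2-byte length prefix and the function names below are assumptions for illustration only.

```python
import struct


def multiplex(base_info: bytes, enh_info: bytes) -> bytes:
    """Pack both layers into one packet: 2-byte big-endian base length, then payloads.
    The layout is an assumed one; base_info must be shorter than 65536 bytes."""
    return struct.pack(">H", len(base_info)) + base_info + enh_info


def demultiplex(packet: bytes):
    """Split a packet produced by multiplex() back into (base_info, enh_info)."""
    (base_len,) = struct.unpack_from(">H", packet, 0)
    base_info = packet[2:2 + base_len]
    enh_info = packet[2 + base_len:]
    return base_info, enh_info
```

Placing the base layer first means a receiver (or an intermediate node) can drop the trailing enhancement bytes and still recover a decodable base layer.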

Base layer decoding section 152 decodes the base layer encoded information using a CELP type speech decoding method, and outputs the base layer decoded signal obtained by the decoding to addition section 154. Base layer decoding section 152 also outputs the pitch lag to enhancement layer decoding section 153 as the long-term prediction information of the base layer.

Enhancement layer decoding section 153 decodes the enhancement layer encoded information using the long-term prediction information, and outputs the enhancement layer decoded signal obtained by the decoding to addition section 154.

Addition section 154 adds the base layer decoded signal output from base layer decoding section 152 and the enhancement layer decoded signal output from enhancement layer decoding section 153, and outputs the speech/musical sound signal resulting from the addition to an apparatus in a subsequent stage.
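The signal flow between the sections of FIG. 1 can be summarized in a short sketch. The layer codecs are passed in as callables because their internals are not the point here; the function names, the callable signatures, and the use of `None` to represent missing enhancement layer information are assumptions for illustration.

```python
def encode_speech(x, base_encode, base_decode, enh_encode):
    """Encoder-side flow of speech encoding apparatus 100 (sketch)."""
    base_info = base_encode(x)                            # section 101
    base_decoded, pitch_lag = base_decode(base_info)      # section 102
    residual = [a - b for a, b in zip(x, base_decoded)]   # section 103
    enh_info = enh_encode(residual, pitch_lag)            # section 104
    return base_info, enh_info                            # handed to section 105


def decode_speech(base_info, enh_info, base_decode, enh_decode):
    """Decoder-side flow of speech decoding apparatus 150 (sketch)."""
    base_decoded, pitch_lag = base_decode(base_info)      # section 152
    if enh_info is None:
        # Assumed fallback: only the base layer arrived, so output it as is.
        return base_decoded
    enh_decoded = enh_decode(enh_info, pitch_lag, len(base_decoded))  # section 153
    return [a + b for a, b in zip(base_decoded, enh_decoded)]         # section 154
```

Note that the pitch lag is produced by the base layer decoder on both sides, so it never has to appear in the enhancement layer information.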

Next, the internal configuration of base layer encoding section 101 in FIG. 1 will be described using the block diagram of FIG. 2.

The input signal of base layer encoding section 101 is input to preprocessing section 200. Preprocessing section 200 performs high-pass filtering to remove the DC component, as well as waveform shaping and pre-emphasis that lead to improved performance of the subsequent encoding process, and outputs the processed signal (Xin) to LPC analysis section 201 and adder 204.
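Two of the preprocessing operations mentioned above, DC removal and pre-emphasis, can be sketched with first-order filters. The filter structures and the constants `r` and `mu` are assumptions chosen for illustration; the text does not give the actual filters used by preprocessing section 200.

```python
def remove_dc(x, r=0.95):
    """First-order DC-blocking high-pass filter: y[n] = x[n] - x[n-1] + r * y[n-1].
    The value of r is an assumed constant."""
    y, prev_x, prev_y = [], 0.0, 0.0
    for v in x:
        out = v - prev_x + r * prev_y
        y.append(out)
        prev_x, prev_y = v, out
    return y


def pre_emphasis(x, mu=0.68):
    """Pre-emphasis filter: y[n] = x[n] - mu * x[n-1]. The value of mu is assumed."""
    y, prev = [], 0.0
    for v in x:
        y.append(v - mu * prev)
        prev = v
    return y
```

Feeding a constant signal into `remove_dc` shows the intended effect: after an initial transient the output decays toward zero.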

LPC analysis section 201 performs linear prediction analysis using Xin and outputs the analysis result (linear prediction coefficients) to LPC quantization section 202. LPC quantization section 202 quantizes the linear prediction coefficients (LPC) output from LPC analysis section 201, outputs the quantized LPC to synthesis filter 203, and outputs a code (L) representing the quantized LPC to multiplexing section 213.

Synthesis filter 203 generates a synthesized signal by filtering the driving excitation output from adder 210 (described later) with filter coefficients based on the quantized LPC, and outputs the synthesized signal to adder 204.

Adder 204 calculates an error signal by inverting the polarity of the synthesized signal and adding it to Xin, and outputs the error signal to perceptual weighting section 211.

Adaptive excitation codebook 205 stores in a buffer the driving excitation signals output in the past by adder 210, cuts out one frame of samples as an adaptive excitation vector from the past driving excitation signal samples at the position specified by the signal output from parameter determination section 212, and outputs it to multiplier 208.

Quantization gain generation section 206 outputs the adaptive excitation gain and the fixed excitation gain specified by the signal output from parameter determination section 212 to multipliers 208 and 209, respectively.

Fixed excitation codebook 207 multiplies a pulse excitation vector having the shape specified by the signal output from parameter determination section 212 by a diffusion vector, and outputs the resulting fixed excitation vector to multiplier 209.

乗算器２０８は、量子化利得生成部２０６から出力された量子化適応音源利得を、適応音源符号帳２０５から出力された適応音源ベクトルに乗じて、加算器２１０へ出力する。乗算器２０９は、量子化利得生成部２０６から出力された量子化固定音源利得を、固定音源符号帳２０７から出力された固定音源ベクトルに乗じて、加算器２１０へ出力する。 Multiplier 208 multiplies the adaptive excitation vector output from adaptive excitation codebook 205 by the quantized adaptive excitation gain output from quantization gain generator 206 and outputs the result to adder 210. Multiplier 209 multiplies the fixed excitation vector output from fixed excitation codebook 207 by the quantized fixed excitation gain output from quantization gain generation section 206 and outputs the result to adder 210.

加算器２１０は、利得乗算後の適応音源ベクトルと固定音源ベクトルとをそれぞれ乗算器２０８と乗算器２０９から入力し、これらをベクトル加算し、加算結果である駆動音源を合成フィルタ２０３および適応音源符号帳２０５へ出力する。なお、適応音源符号帳２０５に入力された駆動音源は、バッファに記憶される。 The adder 210 receives the gain-multiplied adaptive excitation vector and fixed excitation vector from the multiplier 208 and the multiplier 209, respectively, adds them as vectors, and outputs the resulting excitation to the synthesis filter 203 and the adaptive excitation codebook 205. The excitation input to the adaptive excitation codebook 205 is stored in the buffer.
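
The signal path through multipliers 208 and 209, adder 210, and synthesis filter 203 can be illustrated with a minimal sketch. The function names, the plain-list representation, and the sign convention of the LPC coefficients are assumptions made for illustration and are not taken from the patent.

```python
def build_excitation(adaptive_vec, fixed_vec, gain_a, gain_f):
    """Scale each codebook vector by its gain and add them sample by sample."""
    return [gain_a * a + gain_f * f for a, f in zip(adaptive_vec, fixed_vec)]


def synthesize(excitation, lpc, memory=None):
    """All-pole synthesis: y[i] = x[i] - sum_k lpc[k] * y[i-1-k].

    `lpc` holds a_1..a_P of A(z) = 1 + sum_k a_k z^-k (assumed convention).
    `memory` holds the previous outputs, most recent first.
    """
    order = len(lpc)
    history = list(memory) if memory is not None else [0.0] * order
    out = []
    for x in excitation:
        y = x - sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        if order:
            history = [y] + history[:-1]  # shift the filter memory
    return out


excitation = build_excitation([1.0, 2.0], [3.0, 4.0], gain_a=0.5, gain_f=2.0)
synthesized = synthesize(excitation, lpc=[-0.5])
```

In an analysis-by-synthesis encoder, the synthesized frame would then be subtracted from Xin to form the error signal that the perceptual weighting stage evaluates.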

聴覚重み付け部２１１は、加算器２０４から出力された誤差信号に対して聴覚的な重み付けをおこない、聴覚重み付け領域でのXinと合成信号との歪みを算出し、パラメータ決定部２１２へ出力する。 The auditory weighting unit 211 performs auditory weighting on the error signal output from the adder 204, calculates the distortion between Xin and the synthesized signal in the auditory weighting region, and outputs the distortion to the parameter determination unit 212.

パラメータ決定部２１２は、聴覚重み付け部２１１から出力された符号化歪みを最小とする適応音源ベクトル、固定音源ベクトル及び量子化利得を、各々適応音源符号帳２０５、固定音源符号帳２０７及び量子化利得生成部２０６から選択し、選択結果を示す適応音源ベクトル符号（Ａ）、音源利得符号（Ｇ）及び固定音源ベクトル符号（Ｆ）を多重化部２１３に出力する。なお、適応音源ベクトル符号（Ａ）は、ピッチラグに対応する符号である。 The parameter determination unit 212 selects, from the adaptive excitation codebook 205, the fixed excitation codebook 207, and the quantization gain generation unit 206 respectively, the adaptive excitation vector, fixed excitation vector, and quantization gain that minimize the coding distortion output from the perceptual weighting unit 211. It then outputs the adaptive excitation vector code (A), excitation gain code (G), and fixed excitation vector code (F) indicating the selection result to the multiplexing unit 213. The adaptive excitation vector code (A) is a code corresponding to the pitch lag.

多重化部２１３は、ＬＰＣ量子化部２０２から量子化ＬＰＣを表す符号（Ｌ）を入力し、パラメータ決定部２１２から適応音源ベクトルを表す符号（Ａ）、固定音源ベクトルを表す符号（Ｆ）および量子化利得を表す符号（Ｇ）を入力し、これらの情報を多重化して基本レイヤ符号化情報として出力する。 The multiplexing unit 213 receives the code (L) representing the quantized LPC from the LPC quantization unit 202, and receives the code (A) representing the adaptive excitation vector, the code (F) representing the fixed excitation vector, and the code (G) representing the quantization gain from the parameter determination unit 212. It multiplexes these codes and outputs the result as base layer encoded information.

以上が、図１の基本レイヤ符号化部１０１の内部構成の説明である。 The above is the description of the internal configuration of the base layer encoding unit 101 in FIG. 1.

次に、図３を用いて、パラメータ決定部２１２が、適応音源符号帳２０５から生成される信号を決定する処理を簡単に説明する。図３において、バッファ３０１は適応音源符号帳２０５が備えるバッファであり、位置３０２は適応音源ベクトルの切り出し位置であり、ベクトル３０３は、切り出された適応音源ベクトルである。また、数値「４１」、「２９６」は、切り出し位置３０２を動かす範囲の下限と上限とに対応している。 Next, the process in which the parameter determination unit 212 determines a signal generated from the adaptive excitation codebook 205 will be briefly described with reference to FIG. In FIG. 3, a buffer 301 is a buffer included in the adaptive excitation codebook 205, a position 302 is a cut-out position of the adaptive excitation vector, and a vector 303 is a cut-out adaptive excitation vector. Numerical values “41” and “296” correspond to the lower limit and the upper limit of the range in which the cutout position 302 is moved.

切り出し位置３０２を動かす範囲は、適応音源ベクトルを表す符号（Ａ）に割り当てるビット数を「８」とする場合、「２５６」の長さの範囲（例えば、４１〜２９６）に設定することができる。また、切り出し位置３０２を動かす範囲は、任意に設定することができる。 The range in which the cutout position 302 is moved can be set to a range of a length of “256” (for example, 41 to 296) when the number of bits assigned to the code (A) representing the adaptive excitation vector is “8”. . Further, the range in which the cutout position 302 is moved can be arbitrarily set.

パラメータ決定部２１２は、切り出し位置３０２を設定された範囲内で動かし、適応音源ベクトル３０３をそれぞれフレームの長さだけ切り出す。そして、パラメータ決定部２１２は、聴覚重み付け部２１１から出力される符号化歪みが最小となる切り出し位置３０２を求める。 The parameter determination unit 212 moves the cutout position 302 within the set range, and cuts out the adaptive excitation vector 303 by the length of each frame. Then, the parameter determination unit 212 obtains a cutout position 302 at which the coding distortion output from the auditory weighting unit 211 is minimized.

このように、パラメータ決定部２１２によって求められるバッファの切り出し位置３０２が「ピッチラグ」である。 Thus, the buffer cut-out position 302 obtained by the parameter determination unit 212 is “pitch lag”.
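
The cut-out position search described above can be sketched as an exhaustive loop over the lag range 41 to 296. This sketch minimizes a plain (unweighted) squared error with an optimal gain, whereas the text uses the perceptually weighted distortion from unit 211; the function names and the requirement that the lag be at least one frame long are assumptions.

```python
def cut_adaptive_vector(buffer, lag, frame_len):
    """Cut frame_len samples starting `lag` samples before the buffer end.

    Assumes lag >= frame_len so that the slice lies inside the buffer.
    """
    start = len(buffer) - lag
    return buffer[start:start + frame_len]


def search_pitch_lag(buffer, target, lag_min=41, lag_max=296):
    """Return the lag whose gain-scaled cut-out vector best matches `target`."""
    best_lag, best_score = lag_min, -1.0
    for lag in range(lag_min, min(lag_max, len(buffer)) + 1):
        vec = cut_adaptive_vector(buffer, lag, len(target))
        energy = sum(v * v for v in vec)
        if energy == 0:
            continue
        corr = sum(t * v for t, v in zip(target, vec))
        score = corr * corr / energy  # maximizing this minimizes squared error
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


past = [i % 50 for i in range(300)]             # periodic toy signal, period 50
target = [(300 + i) % 50 for i in range(40)]    # the frame that follows it
lag = search_pitch_lag(past, target)
code_a = lag - 41                               # 8-bit code (A) for the range 41..296
```

With 8 bits for code (A), the 256 lag values 41 to 296 map onto codes 0 to 255 by subtracting the lower limit, which is one plausible reading of the range given in the text.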

次に、図１の基本レイヤ復号化部１０２（１５２）の内部構成について図４を用いて説明する。 Next, the internal configuration of base layer decoding section 102 (152) in FIG. 1 will be described using FIG. 4.

図４において、基本レイヤ復号化部１０２（１５２）に入力された基本レイヤ符号化情報は、多重化分離部４０１によって個々の符号（Ｌ、Ａ、Ｇ、Ｆ）に分離される。分離されたＬＰＣ符号（Ｌ）はＬＰＣ復号化部４０２に出力され、分離された適応音源ベクトル符号（Ａ）は適応音源符号帳４０５に出力され、分離された音源利得符号（Ｇ）は量子化利得生成部４０６に出力され、分離された固定音源ベクトル符号（Ｆ）は固定音源符号帳４０７へ出力される。 In FIG. 4, the base layer encoded information input to the base layer decoding unit 102 (152) is separated into individual codes (L, A, G, F) by the demultiplexing unit 401. The separated LPC code (L) is output to the LPC decoding unit 402, the separated adaptive excitation vector code (A) is output to the adaptive excitation codebook 405, the separated excitation gain code (G) is output to the quantization gain generation unit 406, and the separated fixed excitation vector code (F) is output to the fixed excitation codebook 407.

ＬＰＣ復号化部４０２は、多重化分離部４０１から出力された符号（Ｌ）からＬＰＣを復号し、合成フィルタ４０３に出力する。 The LPC decoding unit 402 decodes the LPC from the code (L) output from the demultiplexing unit 401 and outputs the LPC to the synthesis filter 403.

適応音源符号帳４０５は、多重化分離部４０１から出力された符号（Ａ）で指定される過去の駆動音源信号サンプルから１フレーム分のサンプルを適応音源ベクトルとして取り出して乗算器４０８へ出力する。また、適応音源符号帳４０５は、ピッチラグを長期予測情報として拡張レイヤ符号化部１０４（拡張レイヤ復号化部１５３）に出力する。 The adaptive excitation codebook 405 extracts a sample for one frame from the past driving excitation signal samples specified by the code (A) output from the demultiplexing unit 401 as an adaptive excitation vector, and outputs it to the multiplier 408. Also, adaptive excitation codebook 405 outputs pitch lag as long-term prediction information to enhancement layer encoding section 104 (enhancement layer decoding section 153).

量子化利得生成部４０６は、多重化分離部４０１から出力された音源利得符号（Ｇ）で指定される適応音源ベクトル利得と固定音源ベクトル利得を復号し乗算器４０８及び乗算器４０９へ出力する。 The quantization gain generation unit 406 decodes the adaptive excitation vector gain and the fixed excitation vector gain specified by the excitation gain code (G) output from the multiplexing / separation unit 401 and outputs them to the multiplier 408 and the multiplier 409.

固定音源符号帳４０７は、多重化分離部４０１から出力された符号（Ｆ）で指定される固定音源ベクトルを生成し、乗算器４０９へ出力する。 The fixed excitation codebook 407 generates a fixed excitation vector specified by the code (F) output from the demultiplexing unit 401 and outputs the fixed excitation vector to the multiplier 409.

乗算器４０８は、適応音源ベクトルに適応音源ベクトル利得を乗算して、加算器４１０へ出力する。乗算器４０９は、固定音源ベクトルに固定音源ベクトル利得を乗算して、加算器４１０へ出力する。 Multiplier 408 multiplies the adaptive excitation vector by the adaptive excitation vector gain and outputs the result to adder 410. Multiplier 409 multiplies the fixed excitation vector by the fixed excitation vector gain and outputs the result to adder 410.

加算器４１０は、乗算器４０８、４０９から出力された利得乗算後の適応音源ベクトルと固定音源ベクトルの加算を行って駆動音源ベクトルを生成し、これを合成フィルタ４０３及び適応音源符号帳４０５に出力する。 The adder 410 adds the gain-multiplied adaptive excitation vector and fixed excitation vector output from the multipliers 408 and 409 to generate an excitation vector, and outputs it to the synthesis filter 403 and the adaptive excitation codebook 405.

合成フィルタ４０３は、加算器４１０から出力された駆動音源ベクトルを駆動信号として、ＬＰＣ復号化部４０２によって復号されたフィルタ係数を用いて、フィルタ合成を行い、合成した信号を後処理部４０４へ出力する。 The synthesis filter 403 performs filter synthesis using the excitation vector output from the adder 410 as the driving signal and the filter coefficients decoded by the LPC decoding unit 402, and outputs the synthesized signal to the post-processing unit 404.

後処理部４０４は、合成フィルタ４０３から出力された信号に対して、ホルマント強調やピッチ強調といったような音声の主観的な品質を改善する処理や、定常雑音の主観的品質を改善する処理などを施し、基本レイヤ復号化信号として出力する。 The post-processing unit 404 applies, to the signal output from the synthesis filter 403, processing that improves the subjective quality of speech, such as formant enhancement and pitch enhancement, and processing that improves the subjective quality of stationary noise, and outputs the result as the base layer decoded signal.

以上が、図１の基本レイヤ復号化部１０２（１５２）の内部構成の説明である。 The above is the description of the internal configuration of base layer decoding section 102 (152) in FIG. 1.

次に、図１の拡張レイヤ符号化部１０４の内部構成について図５のブロック図を用いて説明する。 Next, the internal configuration of enhancement layer encoding section 104 in FIG. 1 will be described using the block diagram in FIG. 5.

拡張レイヤ符号化部１０４では、残差信号をＮサンプルずつ区切り（Ｎは自然数）、Ｎサンプルを1フレームとしてフレーム毎に符号化を行う。以下、残差信号をｅ（０）〜ｅ（Ｘ−１）と表し、符号化の対象となるフレームをｅ（ｎ）〜ｅ（ｎ＋Ｎ−１）と表すこととする。ここで、Ｘは残差信号の長さであり、Ｎはフレームの長さに相当する。また、ｎは各フレームの先頭に位置するサンプルであり、ｎはＮの整数倍に相当する。なお、あるフレームの信号を過去に生成された信号から予測して生成する方法は長期予測と呼ばれる。また、長期予測を行うフィルタはピッチフィルタ、コムフィルタ等と呼ばれる。 The enhancement layer encoding unit 104 divides the residual signal into blocks of N samples (N is a natural number) and encodes it frame by frame, with N samples forming one frame. Hereinafter, the residual signal is denoted e(0) to e(X−1), and the frame to be encoded is denoted e(n) to e(n+N−1). Here, X is the length of the residual signal and N is the frame length. Also, n is the index of the sample at the head of each frame and is an integer multiple of N. A method of generating the signal of a frame by predicting it from previously generated signals is called long-term prediction, and a filter that performs long-term prediction is called a pitch filter, a comb filter, or the like.

図５において、長期予測ラグ指示部５０１は、基本レイヤ復号化部１０２で求められる長期予測情報ｔを入力し、これに基づいて拡張レイヤの長期予測ラグＴを求め、これを長期予測信号記憶部５０２に出力する。なお、基本レイヤと拡張レイヤとの間でサンプリング周波数の違いが生じる場合、長期予測ラグＴは、以下の式（１）により求めることができる。なお、式（１）において、Ｄは拡張レイヤのサンプリング周波数、ｄは基本レイヤのサンプリング周波数である。 In FIG. 5, the long-term prediction lag instruction unit 501 receives the long-term prediction information t obtained by the base layer decoding unit 102, obtains the long-term prediction lag T of the enhancement layer from it, and outputs T to the long-term prediction signal storage unit 502. When the sampling frequency differs between the base layer and the enhancement layer, the long-term prediction lag T can be obtained by the following equation (1), where D is the sampling frequency of the enhancement layer and d is the sampling frequency of the base layer.

Ｔ＝Ｄ×ｔ／ｄ・・・（１） T = D × t / d ... (1)

長期予測信号記憶部５０２は、過去に生成された長期予測信号を記憶するバッファを備える。バッファの長さをＭとした場合、バッファは過去に生成された長期予測信号の系列ｓ（ｎ−Ｍ−１）〜ｓ（ｎ−１）で構成される。長期予測信号記憶部５０２は、長期予測ラグ指示部５０１より長期予測ラグＴを入力すると、バッファに記憶されている過去の長期予測信号の系列から長期予測ラグＴだけ遡った長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を切り出し、これを長期予測係数計算部５０３及び長期予測信号生成部５０６に出力する。また、長期予測信号記憶部５０２は、長期予測信号生成部５０６から長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を入力し、以下の式（２）によりバッファの更新を行う。 The long-term prediction signal storage unit 502 includes a buffer that stores long-term prediction signals generated in the past. When the length of the buffer is M, the buffer consists of the sequence of previously generated long-term prediction signals s(n−M−1) to s(n−1). On receiving the long-term prediction lag T from the long-term prediction lag instruction unit 501, the long-term prediction signal storage unit 502 cuts out, from the sequence of past long-term prediction signals stored in the buffer, the signals s(n−T) to s(n−T+N−1) located T samples back, and outputs them to the long-term prediction coefficient calculation unit 503 and the long-term prediction signal generation unit 506. The long-term prediction signal storage unit 502 also receives the long-term prediction signals s(n) to s(n+N−1) from the long-term prediction signal generation unit 506 and updates the buffer using the following equation (2).
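
As a rough illustration of equation (1) and of the cut-out and update behavior of the storage unit, consider the sketch below. Equation (2) is not reproduced in this text, so the update is modeled as a simple shift that discards the oldest samples, which is an assumption. The class and function names, the rounding of T to a whole sample, and the example sampling rates are also illustrative assumptions.

```python
def enhancement_lag(t, base_rate_hz, enh_rate_hz):
    """Equation (1): T = D * t / d, rounded to a whole sample (rounding assumed)."""
    return int(round(enh_rate_hz * t / base_rate_hz))


class LongTermBuffer:
    """Past long-term prediction samples, oldest first and newest last."""

    def __init__(self, length):
        self.samples = [0.0] * length

    def cut(self, lag, frame_len):
        """Return s(n-T) .. s(n-T+N-1); assumes frame_len <= lag <= buffer length."""
        start = len(self.samples) - lag
        return self.samples[start:start + frame_len]

    def update(self, new_frame):
        """Drop the oldest samples and append the newly generated frame."""
        self.samples = self.samples[len(new_frame):] + list(new_frame)


# Example: a base layer pitch lag of 40 samples at 8 kHz maps to 80 samples at 16 kHz.
T = enhancement_lag(40, base_rate_hz=8000, enh_rate_hz=16000)
buf = LongTermBuffer(length=6)
buf.update([1.0, 2.0, 3.0])
segment = buf.cut(lag=3, frame_len=2)
```

Because both the encoder and the decoder derive T from the base layer pitch lag in the same way, no lag information needs to be transmitted for the enhancement layer.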

なお、長期予測ラグＴがフレーム長Ｎより短く、長期予測信号記憶部５０２が長期予測信号を切り出すことができない場合、長期予測ラグＴをフレーム長Ｎより長くなるまで整数倍することにより長期予測信号を切り出すことができる。あるいは、長期予測ラグＴだけ遡った長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を繰り返して、フレーム長Ｎの長さまで充当させることにより切り出すことができる。

When the long-term prediction lag T is shorter than the frame length N and the long-term prediction signal storage unit 502 cannot cut out the long-term prediction signal, the signal can be cut out by multiplying the long-term prediction lag T by an integer until it becomes longer than the frame length N. Alternatively, the long-term prediction signal located T samples back can be repeated until it fills the frame length N.
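
The two ways of handling a lag shorter than the frame can be sketched as follows. The function names are hypothetical. The first function stops as soon as the multiple reaches the frame length, since a lag equal to N already allows a full frame to be cut out; this is one reading of the text, which says "longer than".

```python
def extend_lag(lag, frame_len):
    """Option 1: smallest integer multiple of `lag` that is at least frame_len."""
    multiple = lag
    while multiple < frame_len:
        multiple += lag
    return multiple


def cut_with_repetition(past, lag, frame_len):
    """Option 2: repeat the last `lag` samples of `past` until the frame is full."""
    start = len(past) - lag
    return [past[start + (i % lag)] for i in range(frame_len)]


longer_lag = extend_lag(30, 80)
frame = cut_with_repetition([1, 2, 3, 4, 5, 6], lag=2, frame_len=5)
```

Both options keep the cut-out segment aligned with the pitch period, so the predicted frame retains the periodicity implied by the base layer lag.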

長期予測係数計算部５０３は、残差信号ｅ（ｎ）〜ｅ（ｎ＋Ｎ−１）及び長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を入力し、これらを用いて以下の式（３）により、長期予測係数βを算出し、これを長期予測係数符号化部５０４に出力する。 The long-term prediction coefficient calculation unit 503 receives the residual signals e(n) to e(n+N−1) and the long-term prediction signals s(n−T) to s(n−T+N−1), calculates the long-term prediction coefficient β from them by the following equation (3), and outputs β to the long-term prediction coefficient encoding unit 504.
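
Equation (3) is not reproduced in this text. A common choice for such a coefficient is the least-squares gain, the ratio of the cross-correlation between the residual frame and the cut-out segment to the energy of the segment. The sketch below assumes that form; the function name and the zero-energy fallback are illustrative.

```python
def long_term_coefficient(residual_frame, past_segment):
    """Assumed form of equation (3): beta = sum(e * s) / sum(s * s).

    Returns 0.0 when the cut-out segment has no energy.
    """
    energy = sum(s * s for s in past_segment)
    if energy == 0:
        return 0.0
    cross = sum(e * s for e, s in zip(residual_frame, past_segment))
    return cross / energy


beta = long_term_coefficient([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

With this choice, a residual frame that is an exact scaled copy of the past segment yields that scale factor as β.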

長期予測係数符号化部５０４は、長期予測係数βを符号化し、符号化によって求められる拡張レイヤ符号化情報を長期予測係数復号化部５０５に出力し、伝送路を介して拡張レイヤ復号化部１５３に出力する。なお、長期予測係数βの符号化方法として、スカラ量子化により行う方法等が知られている。

The long-term prediction coefficient encoding unit 504 encodes the long-term prediction coefficient β and outputs the enhancement layer encoded information obtained by the encoding to the long-term prediction coefficient decoding unit 505 and, via the transmission path, to the enhancement layer decoding unit 153. Scalar quantization is one known method for encoding the long-term prediction coefficient β.
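
Scalar quantization of β can be sketched with a small uniform table. The table range, the bit count, and all names below are hypothetical; the patent does not specify the quantizer design.

```python
def make_uniform_levels(low, high, bits):
    """Build 2**bits evenly spaced reconstruction levels between low and high."""
    count = 1 << bits
    step = (high - low) / (count - 1)
    return [low + step * i for i in range(count)]


def encode_beta(beta, levels):
    """Return the index of the level nearest to beta."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - beta))


def decode_beta(index, levels):
    """Return the decoded coefficient beta_q for a transmitted index."""
    return levels[index]


levels = make_uniform_levels(0.0, 1.5, bits=2)
index = encode_beta(0.6, levels)
beta_q = decode_beta(index, levels)
```

The encoder runs the same decoding step locally (unit 505) so that its prediction buffer evolves identically to the one in the decoder.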

長期予測係数復号化部５０５は、拡張レイヤ符号化情報を復号化し、これによって求められる復号化長期予測係数β_ｑを長期予測信号生成部５０６に出力する。 The long-term prediction coefficient decoding unit 505 decodes the enhancement layer encoded information and outputs the resulting decoded long-term prediction coefficient β_q to the long-term prediction signal generation unit 506.

長期予測信号生成部５０６は、復号化長期予測係数β_ｑ及び長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を入力し、これらを用いて以下の式（４）により、長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を算出し、これを長期予測信号記憶部５０２に出力する。 The long-term prediction signal generation unit 506 receives the decoded long-term prediction coefficient β _q and the long-term prediction signals s (n−T) to s (n−T + N−1), and uses them according to the following equation (4): Long-term prediction signals s (n) to s (n + N−1) are calculated and output to the long-term prediction signal storage unit 502.
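
Equations (3) and (4) are not reproduced in this text. Under the assumption that the long-term prediction signal is a gain-scaled copy of the cut-out segment, the two steps are linked by the following short derivation. The forms below are a reconstruction consistent with the surrounding definitions, not the patent's own equation images.

```latex
\begin{align*}
E(\beta) &= \sum_{i=0}^{N-1} \bigl( e(n+i) - \beta\, s(n-T+i) \bigr)^{2} \\
\frac{dE}{d\beta} = 0 \;&\Longrightarrow\;
\beta = \frac{\sum_{i=0}^{N-1} e(n+i)\, s(n-T+i)}{\sum_{i=0}^{N-1} s(n-T+i)^{2}}
  && \text{(assumed form of (3))} \\
s(n+i) &= \beta_q \, s(n-T+i), \qquad i = 0, 1, \ldots, N-1
  && \text{(assumed form of (4))}
\end{align*}
```

Here β_q is the decoded (quantized) value of β, so the encoder and decoder generate the same s(n) to s(n+N−1).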

以上が、図１の拡張レイヤ符号化部１０４の内部構成の説明である。

The above is the description of the internal configuration of the enhancement layer encoding unit 104 in FIG. 1.

次に、図１の拡張レイヤ復号化部１５３の内部構成について図６のブロック図を用いて説明する。 Next, the internal configuration of enhancement layer decoding section 153 in FIG. 1 will be described using the block diagram in FIG. 6.

図６において、長期予測ラグ指示部６０１は、基本レイヤ復号化部１５２から出力された長期予測情報を用いて拡張レイヤの長期予測ラグＴを求め、これを長期予測信号記憶部６０２に出力する。 In FIG. 6, the long-term prediction lag instruction unit 601 obtains the enhancement layer long-term prediction lag T using the long-term prediction information output from the base layer decoding unit 152, and outputs this to the long-term prediction signal storage unit 602.

長期予測信号記憶部６０２は、過去に生成された長期予測信号を記憶するバッファを備える。バッファの長さをＭとした場合、バッファは過去に生成された長期予測信号の系列ｓ（ｎ−Ｍ−１）〜ｓ（ｎ−１）で構成される。長期予測信号記憶部６０２は、長期予測ラグ指示部６０１より長期予測ラグＴを入力すると、バッファに記憶されている過去の長期予測信号の系列から長期予測ラグＴだけ遡った長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を切り出し、これを長期予測信号生成部６０４に出力する。また、長期予測信号記憶部６０２は、長期予測信号生成部６０４から長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を入力し、上記式（２）によりバッファの更新を行う。 The long-term prediction signal storage unit 602 includes a buffer that stores long-term prediction signals generated in the past. When the length of the buffer is M, the buffer is composed of a series of long-term prediction signals s (n−M−1) to s (n−1) generated in the past. When the long-term prediction signal storage unit 602 receives the long-term prediction lag T from the long-term prediction lag instruction unit 601, the long-term prediction signal s (n) that is traced back from the series of past long-term prediction signals stored in the buffer by the long-term prediction lag T. −T) to s (n−T + N−1) are cut out and output to the long-term prediction signal generation unit 604. The long-term prediction signal storage unit 602 receives the long-term prediction signals s (n) to s (n + N−1) from the long-term prediction signal generation unit 604, and updates the buffer according to the above equation (2).

長期予測係数復号化部６０３は、拡張レイヤ符号化情報を復号化し、復号化によって求められる復号化長期予測係数β_ｑを長期予測信号生成部６０４に出力する。 The long-term prediction coefficient decoding unit 603 decodes the enhancement layer coding information and outputs a decoded long-term prediction coefficient β _q obtained by decoding to the long-term prediction signal generation unit 604.

長期予測信号生成部６０４は、復号化長期予測係数β_ｑ及び長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を入力し、これらを用いて上記式（４）により、長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を算出し、これを長期予測信号記憶部６０２及び加算部１５４に拡張レイヤ復号化信号として出力する。 The long-term prediction signal generation unit 604 receives the decoded long-term prediction coefficient β_q and the long-term prediction signals s(n−T) to s(n−T+N−1), calculates the long-term prediction signals s(n) to s(n+N−1) from them by the above equation (4), and outputs them as the enhancement layer decoded signal to the long-term prediction signal storage unit 602 and the addition unit 154.

以上が、図１の拡張レイヤ復号化部１５３の内部構成の説明である。 The above is the description of the internal configuration of the enhancement layer decoding unit 153 in FIG.

このように、長期予測を行う拡張レイヤを設け、音声・楽音の長期的な相関の性質を利用して残差信号を拡張レイヤにおいて長期予測することにより、少ない符号化情報で周波数帯域の広い音声・楽音信号を効果的に符号化／復号化することができ、また、演算量の削減を図ることができる。 Thus, by providing an enhancement layer that performs long-term prediction and predicting the residual signal in the enhancement layer using the long-term correlation of speech and musical sound, a speech / musical sound signal with a wide frequency band can be encoded and decoded effectively with a small amount of encoded information, and the amount of computation can be reduced.

このとき、長期予測ラグを符号化／復号化するのではなく、基本レイヤの長期予測情報を利用して長期予測ラグを求めることにより、符号化情報の削減を図ることができる。 At this time, the encoding information can be reduced by obtaining the long-term prediction lag using the long-term prediction information of the base layer instead of encoding / decoding the long-term prediction lag.

また、基本レイヤ符号化情報を復号化することによって、基本レイヤの復号化信号のみを得ることができ、ＣＥＬＰタイプの音声符号化／復号化方法において、符号化情報の一部からでも音声・楽音を復号化できる機能（スケーラブル符号化）を実現することができる。 Also, decoding the base layer encoded information alone yields the base layer decoded signal, so a CELP-type speech encoding / decoding method can provide the ability to decode speech / musical sound from only part of the encoded information (scalable coding).

また、長期予測においては、音声・楽音が有する長期的な相関を利用し、現フレームとの相関が最も高いフレームをバッファから切り出し、切り出したフレームの信号を用いて現フレームの信号を表現する。しかしながら、現フレームとの相関が最も高いフレームをバッファから切り出す手段において、ピッチラグなどの音声・楽音が有する長期的な相関を表わす情報が無い場合には、バッファからフレームを切り出す際の切り出し位置を変化させながら、切り出したフレームと現フレームとの自己相関関数を計算し、最も相関が高くなるフレームを探索する必要があり、探索に掛かる計算量は非常に大きくなってしまう。 In the long-term prediction, a long-term correlation of voice / musical sound is used, a frame having the highest correlation with the current frame is cut out from the buffer, and a signal of the current frame is expressed using the signal of the cut-out frame. However, if there is no information indicating the long-term correlation of voice / musical sounds such as pitch lag in the means for cutting out the frame having the highest correlation with the current frame from the buffer, the cutout position when cutting out the frame from the buffer is changed. Therefore, it is necessary to calculate the autocorrelation function between the clipped frame and the current frame and search for a frame with the highest correlation, and the amount of calculation required for the search becomes very large.

ところが、基本レイヤ符号化部１０１で求めたピッチラグを用いて切り出し位置を一意に定めることにより、通常の長期予測を行う際に掛かる計算量を大幅に削減することができる。 However, by uniquely determining the cut-out position using the pitch lag obtained by the base layer encoding unit 101, the amount of calculation required for normal long-term prediction can be significantly reduced.

なお、本実施の形態で説明した拡張レイヤ長期予測方法では、基本レイヤ復号化部より出力される長期予測情報がピッチラグである場合について説明したが、本発明はこれに限られず、音声・楽音が有する長期的な相関を表す情報であれば長期予測情報として用いることができる。 In the enhancement layer long-term prediction method described in the present embodiment, the case where the long-term prediction information output from the base layer decoding unit is the pitch lag has been described. However, the present invention is not limited to this, and any information representing the long-term correlation of speech / musical sound can be used as the long-term prediction information.

また、本実施の形態では、長期予測信号記憶部５０２がバッファから長期予測信号を切り出す位置を長期予測ラグＴとする場合について説明したが、これを長期予測ラグＴ付近の位置Ｔ＋α（αは微小な数であり、任意に設定可能）とする場合についても本発明は適用することができ、長期予測ラグＴに微小な誤差が生じる場合でも本実施の形態と同様の作用・効果を得ることができる。 In the present embodiment, the case has been described where the position at which the long-term prediction signal storage unit 502 cuts out the long-term prediction signal from the buffer is the long-term prediction lag T. The present invention is also applicable when this position is T + α, a position near the long-term prediction lag T (α is a small number that can be set arbitrarily), and the same operation and effects as in the present embodiment are obtained even when a small error occurs in the long-term prediction lag T.

例えば、長期予測信号記憶部５０２は、長期予測ラグ指示部５０１より長期予測ラグＴを入力し、バッファに記憶されている過去の長期予測信号の系列からＴ＋αだけ遡った長期予測信号ｓ（ｎ−Ｔ−α）〜ｓ（ｎ−Ｔ−α＋Ｎ−１）を切り出し、以下の式（５）を用いて判定値Ｃを算出し、判定値Ｃが最大となるαを求め、これを符号化する。復号化を行う場合、長期予測信号記憶部６０２は、αの符号化情報を復号化してαを求め、また、長期予測ラグＴを用いて長期予測信号ｓ（ｎ−Ｔ−α）〜ｓ（ｎ−Ｔ−α＋Ｎ−１）を切り出す。 For example, the long-term prediction signal storage unit 502 receives the long-term prediction lag T from the long-term prediction lag instruction unit 501, and the long-term prediction signal s (n−) that is traced back by T + α from the series of past long-term prediction signals stored in the buffer. T−α) to s (n−T−α + N−1) are cut out, a determination value C is calculated using the following equation (5), α that maximizes the determination value C is obtained, and this is encoded. . When decoding, the long-term prediction signal storage unit 602 decodes the encoding information of α to obtain α, and uses the long-term prediction lag T to determine long-term prediction signals s (n−T−α) to s ( n-T-α + N-1) is cut out.
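
The refinement around T can be sketched as a small search over α. Equation (5) is not reproduced in this text; the decision value below is assumed to be the squared cross-correlation normalized by the segment energy, a criterion commonly used for lag selection. The function names and the search width are hypothetical.

```python
def decision_value(residual_frame, segment):
    """Assumed form of C: (sum e*s)**2 / sum s*s, or 0.0 for a silent segment."""
    energy = sum(s * s for s in segment)
    if energy == 0:
        return 0.0
    cross = sum(e * s for e, s in zip(residual_frame, segment))
    return cross * cross / energy


def search_alpha(past, residual_frame, lag, max_offset=2):
    """Try offsets -max_offset..max_offset around `lag`; return the best alpha."""
    n = len(residual_frame)
    best_alpha, best_c = 0, -1.0
    for alpha in range(-max_offset, max_offset + 1):
        start = len(past) - lag - alpha   # position of s(n - T - alpha)
        if start < 0 or start + n > len(past):
            continue                      # segment would fall outside the buffer
        c = decision_value(residual_frame, past[start:start + n])
        if c > best_c:
            best_alpha, best_c = alpha, c
    return best_alpha


past = [0.0] * 20
past[8:11] = [1.0, -2.0, 3.0]
alpha = search_alpha(past, [1.0, -2.0, 3.0], lag=10)
```

Only α needs to be transmitted in this variant, which costs far fewer bits than sending a full lag value.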

また、本実施の形態では、音声・楽音信号を用いて長期予測を行う場合について説明したが、ＭＤＣＴ、ＱＭＦ等の直交変換を用いて音声・楽音信号を時間領域から周波数領域へ変換し、変換後の信号（周波数パラメータ）を用いて長期予測を行う場合についても本発明は適用することができ、本実施の形態と同様の作用・効果を得ることができる。例えば、音声・楽音信号の周波数パラメータで拡張レイヤ長期予測を行う場合には、図５において、長期予測係数計算部５０３に、長期予測信号ｓ（ｎ−Ｔ）〜ｓ（ｎ−Ｔ＋Ｎ−１）を時間領域から周波数領域へ変換する機能及び残差信号を周波数パラメータへ変換する機能を新たに設け、長期予測信号生成部５０６に、長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を周波数領域から時間領域へ逆変換する機能を新たに設ける。また、図６において、長期予測信号生成部６０４に、長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を周波数領域から時間領域へ逆変換する機能を新たに設ける。

In the present embodiment, the case where long-term prediction is performed on the speech / musical sound signal itself has been described. The present invention is also applicable when the speech / musical sound signal is converted from the time domain to the frequency domain using an orthogonal transform such as MDCT or QMF and long-term prediction is performed on the converted signal (frequency parameters), and the same operation and effects as in the present embodiment are obtained. For example, when enhancement layer long-term prediction is performed on the frequency parameters of the speech / musical sound signal, the long-term prediction coefficient calculation unit 503 in FIG. 5 is newly provided with a function of converting the long-term prediction signals s(n−T) to s(n−T+N−1) from the time domain to the frequency domain and a function of converting the residual signal into frequency parameters, and the long-term prediction signal generation unit 506 is newly provided with a function of inversely converting the long-term prediction signals s(n) to s(n+N−1) from the frequency domain to the time domain. Likewise, in FIG. 6, the long-term prediction signal generation unit 604 is newly provided with a function of inversely converting the long-term prediction signals s(n) to s(n+N−1) from the frequency domain to the time domain.

また、通常の音声・楽音符号化／復号化方法では、伝送路において誤り検出もしくは誤り訂正に用いる冗長ビットを符号化情報に付加させて、冗長ビットを含む符号化情報を伝送することが一般的であるが、本発明では、基本レイヤ符号化部１０１より出力される符号化情報（Ａ）と拡張レイヤ符号化部１０４より出力される符号化情報（Ｂ）とに割り当てる冗長ビットのビット配分を符号化情報（Ａ）に重みを付けて振り分けることができる。 In ordinary speech / musical sound encoding / decoding methods, redundant bits used for error detection or error correction on the transmission path are generally added to the encoded information, and the encoded information including the redundant bits is transmitted. In the present invention, the redundant bits allocated to the encoded information (A) output from the base layer encoding unit 101 and the encoded information (B) output from the enhancement layer encoding unit 104 can be distributed with a weighting that favors the encoded information (A).

（実施の形態２） (Embodiment 2)
実施の形態２では、残差信号と長期予測信号との差（長期予測残差信号）の符号化／復号化を行う場合について説明する。 In the second embodiment, a case will be described in which the difference between the residual signal and the long-term prediction signal (the long-term prediction residual signal) is encoded / decoded.

本実施の形態の音声符号化装置／音声復号化装置は、構成が図１と同様であり、拡張レイヤ符号化部１０４及び拡張レイヤ復号化部１５３の内部構成のみが異なる。 The speech coding apparatus / speech decoding apparatus according to the present embodiment has the same configuration as that of FIG. 1, and only the internal configurations of enhancement layer encoding section 104 and enhancement layer decoding section 153 are different.

図７は、本実施の形態に係る拡張レイヤ符号化部１０４の内部構成を示すブロック図である。なお、図７において、図５と共通する構成部分には図５と同一符号を付して説明を省略する。 FIG. 7 is a block diagram showing the internal configuration of enhancement layer encoding section 104 according to the present embodiment. In FIG. 7, components common to FIG. 5 are given the same reference numerals as in FIG. 5, and their description is omitted.

図７の拡張レイヤ符号化部１０４は、図５と比較して、加算部７０１、長期予測残差信号符号化部７０２、符号化情報多重化部７０３、長期予測残差信号復号化部７０４及び加算部７０５を追加した構成を採る。 Compared with FIG. 5, the enhancement layer encoding unit 104 in FIG. 7 additionally includes an addition unit 701, a long-term prediction residual signal encoding unit 702, an encoded information multiplexing unit 703, a long-term prediction residual signal decoding unit 704, and an addition unit 705.

長期予測信号生成部５０６は、算出した長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）を加算部７０１及び加算部７０５に出力する。 The long-term prediction signal generation unit 506 outputs the calculated long-term prediction signals s (n) to s (n + N−1) to the addition unit 701 and the addition unit 705.

加算部７０１は、以下の式（６）に示すように、長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）の極性を反転させて残差信号ｅ（ｎ）〜ｅ（ｎ＋Ｎ−１）に加算し、加算結果である長期予測残差信号ｐ（ｎ）〜ｐ（ｎ＋Ｎ−１）を長期予測残差信号符号化部７０２に出力する。 As shown in the following equation (6), the addition unit 701 inverts the polarity of the long-term prediction signals s(n) to s(n+N−1), adds them to the residual signals e(n) to e(n+N−1), and outputs the resulting long-term prediction residual signals p(n) to p(n+N−1) to the long-term prediction residual signal encoding unit 702.
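
Equation (6) is not reproduced in this text. From the description of the addition unit 701 (polarity inversion followed by addition), it presumably takes the following form, written here in reconstructed notation:

```latex
p(n+i) = e(n+i) - s(n+i), \qquad i = 0, 1, \ldots, N-1 \tag{6}
```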

長期予測残差信号符号化部７０２は、長期予測残差信号ｐ（ｎ）〜ｐ（ｎ＋Ｎ−１）の符号化を行い、符号化によって求められる符号化情報（以下、「長期予測残差符号化情報」という）を符号化情報多重化部７０３及び長期予測残差信号復号化部７０４に出力する。なお、長期予測残差信号の符号化は、ベクトル量子化が一般的である。

The long-term prediction residual signal encoding unit 702 encodes the long-term prediction residual signals p(n) to p(n+N−1) and outputs the encoded information obtained by the encoding (hereinafter, “long-term prediction residual encoded information”) to the encoded information multiplexing unit 703 and the long-term prediction residual signal decoding unit 704. Vector quantization is generally used to encode the long-term prediction residual signal.

ここで、長期予測残差信号ｐ（ｎ）〜ｐ（ｎ＋Ｎ−１）の符号化方法について８ビットでベクトル量子化を行う場合を例に説明する。この場合、長期予測残差信号符号化部７０２の内部には、予め作成された２５６種類のコードベクトルが格納されたコードブックが用意される。このコードベクトルCODE^（ｋ）（０）〜CODE^（ｋ）（Ｎ−１）は、Ｎの長さのベクトルである。また、ｋはコードベクトルのインデクスであり、０から２５５までの値をとる。長期予測残差信号符号化部７０２は、以下の式（７）により長期予測残差信号ｐ（ｎ）〜ｐ（ｎ＋Ｎ−１）とコードベクトルCODE^（ｋ）（０）〜CODE^（ｋ）（Ｎ−１）との二乗誤差er を求める。 Here, the method of encoding the long-term prediction residual signals p(n) to p(n+N−1) is described taking 8-bit vector quantization as an example. In this case, a codebook storing 256 code vectors created in advance is prepared inside the long-term prediction residual signal encoding unit 702. Each code vector CODE^(k)(0) to CODE^(k)(N−1) is a vector of length N. Here, k is the index of the code vector and takes a value from 0 to 255. The long-term prediction residual signal encoding unit 702 obtains the squared error er between the long-term prediction residual signals p(n) to p(n+N−1) and the code vector CODE^(k)(0) to CODE^(k)(N−1) by the following equation (7).

そして、長期予測残差信号符号化部７０２は、二乗誤差er が最小となるｋの値を長期予測残差符号化情報として決定する。

Then, long-term prediction residual signal encoding section 702 determines the value of k that minimizes square error er as long-term prediction residual encoding information.
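
The codebook search described above can be sketched as a nearest-neighbor search. The squared-error equation is not reproduced in this text; the sum of squared sample differences is assumed. The tiny three-entry codebook and the function names are placeholders, whereas a real 8-bit codebook would hold 256 trained vectors of length N.

```python
def squared_error(frame, code_vector):
    """Assumed form of er: sum over i of (p(n+i) - CODE_k(i))**2."""
    return sum((p - c) ** 2 for p, c in zip(frame, code_vector))


def vq_encode(frame, codebook):
    """Return the index k of the code vector with the smallest squared error."""
    return min(range(len(codebook)), key=lambda k: squared_error(frame, codebook[k]))


def vq_decode(index, codebook):
    """Return the decoded long-term prediction residual p_q for index k."""
    return list(codebook[index])


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, -2.0]]   # placeholder entries
k = vq_encode([0.9, 1.2], codebook)
p_q = vq_decode(k, codebook)
```

Only the index k is transmitted, so the bit cost per frame equals the base-2 logarithm of the codebook size.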

符号化情報多重化部７０３は、長期予測係数符号化部５０４より入力した拡張レイヤ符号化情報と、長期予測残差信号符号化部７０２より入力した長期予測残差符号化情報を多重化し、多重化後の情報を伝送路を介して拡張レイヤ復号化部１５３に出力する。 The encoded information multiplexing unit 703 multiplexes the enhancement layer encoded information input from the long-term prediction coefficient encoding unit 504 and the long-term prediction residual encoded information input from the long-term prediction residual signal encoding unit 702, and outputs the multiplexed information to enhancement layer decoding section 153 via the transmission path.

長期予測残差信号復号化部７０４は、長期予測残差符号化情報の復号化を行い、復号化によって求められた復号化長期予測残差信号ｐ_ｑ（ｎ）〜ｐ_ｑ（ｎ＋Ｎ−１）を加算部７０５に出力する。 The long-term prediction residual signal decoding unit 704 decodes the long-term prediction residual encoded information, and decodes long-term prediction residual signals p _q (n) to p _q (n + N−1) obtained by decoding. Is output to the adder 705.

加算部７０５は、長期予測信号生成部５０６より入力した長期予測信号ｓ（ｎ）〜ｓ（ｎ＋Ｎ−１）と長期予測残差信号復号化部７０４より入力した復号化長期予測残差信号ｐ_ｑ（ｎ）〜ｐ_ｑ（ｎ＋Ｎ−１）とを加算し、加算結果を長期予測信号記憶部５０２に出力する。この結果、長期予測信号記憶部５０２は、以下の式（８）によりバッファの更新を行う。 The addition unit 705 adds the long-term prediction signals s(n) to s(n+N−1) input from the long-term prediction signal generation unit 506 and the decoded long-term prediction residual signals p_q(n) to p_q(n+N−1) input from the long-term prediction residual signal decoding unit 704, and outputs the sum to the long-term prediction signal storage unit 502. As a result, the long-term prediction signal storage unit 502 updates the buffer according to the following equation (8).

以上が、本実施の形態に係る拡張レイヤ符号化部１０４の内部構成の説明である。

The above is the description of the internal configuration of enhancement layer encoding section 104 according to the present embodiment.

次に、本実施の形態に係る拡張レイヤ復号化部１５３の内部構成について、図８のブロック図を用いて説明する。なお、図８において、図６と共通する構成部分には図６と同一符号を付して説明を省略する。 Next, the internal configuration of enhancement layer decoding section 153 according to the present embodiment will be described using the block diagram of FIG. 8. In FIG. 8, components common to FIG. 6 are given the same reference numerals as in FIG. 6, and their description is omitted.

図８の拡張レイヤ復号化部１５３は、図６と比較して、符号化情報分離部８０１、長期予測残差信号復号化部８０２及び加算部８０３を追加した構成を採る。 The enhancement layer decoding unit 153 in FIG. 8 employs a configuration in which an encoded information separation unit 801, a long-term prediction residual signal decoding unit 802, and an addition unit 803 are added as compared to FIG.

符号化情報分離部８０１は、伝送路より受信した多重化されている符号化情報を、拡張レイヤ符号化情報と長期予測残差符号化情報とに分離し、拡張レイヤ符号化情報を長期予測係数復号化部６０３に出力し、長期予測残差符号化情報を長期予測残差信号復号化部８０２に出力する。 The encoded information separation unit 801 separates the multiplexed encoded information received from the transmission path into enhancement layer encoded information and long-term prediction residual encoded information, and converts the enhancement layer encoded information into the long-term prediction coefficient. It outputs to decoding section 603 and outputs long-term prediction residual coding information to long-term prediction residual signal decoding section 802.

The long-term prediction residual signal decoding unit 802 decodes the long-term prediction residual encoded information to obtain the decoded long-term prediction residual signals p_q(n) to p_q(n+N−1), and outputs them to the addition unit 803.

The addition unit 803 adds the long-term prediction signals s(n) to s(n+N−1) input from the long-term prediction signal generation unit 604 and the decoded long-term prediction residual signals p_q(n) to p_q(n+N−1) input from the long-term prediction residual signal decoding unit 802, outputs the addition result to the long-term prediction signal storage unit 602, and also outputs the addition result as the enhancement layer decoded signal.

The above is the description of the internal configuration of the enhancement layer decoding unit 153 according to the present embodiment.

In this way, by encoding and decoding the difference between the residual signal and the long-term prediction signal (the long-term prediction residual signal), a decoded signal of even higher quality than in Embodiment 1 can be obtained.
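The data flow of this embodiment can be illustrated with a minimal Python sketch. It is not the implementation described here: `quantize` and `dequantize` are hypothetical stand-ins (a uniform scalar quantizer) for the long-term prediction residual coder, the shift-and-append buffer update is an assumed form of equation (8), the long-term prediction signal is assumed to be the cut-out samples scaled by the decoded coefficient `beta`, and the lag is assumed to be at least one frame long. The point it shows is that the encoder adds the decoded residual, not the original one, so both sides keep identical buffers.

```python
def quantize(p, step=0.25):
    """Hypothetical residual coder: uniform scalar quantizer indices."""
    return [round(x / step) for x in p]


def dequantize(indices, step=0.25):
    """Inverse of the hypothetical residual coder."""
    return [i * step for i in indices]


def predict(buffer, lag, beta, n_samples):
    """Cut out n_samples starting `lag` samples back and scale by beta.

    Assumes lag >= n_samples so the cut-out stays inside the buffer.
    """
    start = len(buffer) - lag
    return [beta * buffer[start + i] for i in range(n_samples)]


def update(buffer, new_samples):
    """Assumed buffer update: drop the oldest frame, append the newest."""
    return buffer[len(new_samples):] + new_samples


def encode_frame(buffer, residual, lag, beta):
    """Encoder side: code e - s, then update with s + p_q (local decoding)."""
    s = predict(buffer, lag, beta, len(residual))
    p = [e - si for e, si in zip(residual, s)]        # long-term prediction residual
    code = quantize(p)
    p_q = dequantize(code)                            # decoded residual
    recon = [si + pq for si, pq in zip(s, p_q)]       # role of addition unit 705
    return code, update(buffer, recon)


def decode_frame(buffer, code, lag, beta):
    """Decoder side: s + p_q is both the output and the buffer update."""
    s = predict(buffer, lag, beta, len(code))
    p_q = dequantize(code)
    out = [si + pq for si, pq in zip(s, p_q)]         # role of addition unit 803
    return out, update(buffer, out)
```

Because both sides add the same decoded residual to the same prediction, the buffers stay synchronized frame after frame, and the reconstruction error per sample is bounded by the quantizer alone.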

Although this embodiment has described the case where the long-term prediction residual signal is encoded by vector quantization, the present invention places no restriction on the encoding method; for example, encoding may be performed by shape-gain VQ, split VQ, transform VQ, or multistage VQ.

The following describes encoding with a 13-bit shape-gain VQ that uses 8 bits for the shape and 5 bits for the gain. In this case, two codebooks are prepared: a shape codebook and a gain codebook. The shape codebook consists of 256 shape code vectors, and each shape code vector SCODE^(k1)(0) to SCODE^(k1)(N−1) is a vector of length N. Here, k1 is the index of the shape code vector and takes a value from 0 to 255. The gain codebook consists of 32 gain codes, and each gain code GCODE^(k2) is a scalar. Here, k2 is the index of the gain code and takes a value from 0 to 31. The long-term prediction residual signal encoding unit 702 obtains the gain "gain" and the shape vector shape(0) to shape(N−1) of the long-term prediction residual signals p(n) to p(n+N−1) by the following equation (9), and then, by the following equation (10), obtains the gain error gainer between the gain and the gain code GCODE^(k2), and the square error shapeer between the shape vector shape(0) to shape(N−1) and the shape code vector SCODE^(k1)(0) to SCODE^(k1)(N−1).

Then, the long-term prediction residual signal encoding unit 702 obtains the value of k2 that minimizes the gain error gainer and the value of k1 that minimizes the square error shapeer, and uses these values as the long-term prediction residual encoded information.
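The shape-gain search can be sketched in Python as follows. Since equations (9) and (10) are not reproduced in this text, the sketch makes its own assumptions: the gain is taken as the RMS value of the frame, the shape is the frame divided by that gain, gainer is an absolute difference, and shapeer is a sum of squared differences. The function names, the bit layout in `pack_indices`, and the decoder form (gain code times shape code vector) are likewise hypothetical.

```python
import math


def shape_gain_encode(p, shape_codebook, gain_codebook):
    """Return (k1, k2): shape and gain indices for one residual frame p."""
    n = len(p)
    # Assumed form of equation (9): RMS gain and gain-normalized shape.
    gain = math.sqrt(sum(x * x for x in p) / n)
    shape = [x / gain for x in p] if gain > 0.0 else [0.0] * n

    # Assumed form of equation (10): scalar gain error and squared shape error.
    def gainer(k2):
        return abs(gain - gain_codebook[k2])

    def shapeer(k1):
        return sum((a - b) ** 2 for a, b in zip(shape, shape_codebook[k1]))

    k1 = min(range(len(shape_codebook)), key=shapeer)
    k2 = min(range(len(gain_codebook)), key=gainer)
    return k1, k2


def pack_indices(k1, k2):
    """Pack an 8-bit shape index and a 5-bit gain index into 13 bits (assumed layout)."""
    return (k1 << 5) | k2


def shape_gain_decode(k1, k2, shape_codebook, gain_codebook):
    """Assumed decoder: scale the selected shape code vector by the gain code."""
    return [gain_codebook[k2] * c for c in shape_codebook[k1]]
```

Searching the two codebooks independently costs 256 + 32 comparisons per frame, whereas a single 13-bit codebook would need 8192.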

Next, encoding with an 8-bit split VQ will be described. In this case, two codebooks are prepared: a first split codebook and a second split codebook. The first split codebook consists of 16 first split code vectors SPCODE^(k3)(0) to SPCODE^(k3)(N/2−1), the second split codebook consists of 16 second split code vectors SPCODE^(k4)(0) to SPCODE^(k4)(N/2−1), and each code vector is a vector of length N/2. Here, k3 is the index of the first split code vector and takes a value from 0 to 15, and k4 is the index of the second split code vector and takes a value from 0 to 15. The long-term prediction residual signal encoding unit 702 divides the long-term prediction residual signals p(n) to p(n+N−1) into a first split vector sp_1(0) to sp_1(N/2−1) and a second split vector sp_2(0) to sp_2(N/2−1) by the following equation (11), and then, by the following equation (12), obtains the square error spliter_1 between the first split vector sp_1(0) to sp_1(N/2−1) and the first split code vector SPCODE^(k3)(0) to SPCODE^(k3)(N/2−1), and the square error spliter_2 between the second split vector sp_2(0) to sp_2(N/2−1) and the second split code vector SPCODE^(k4)(0) to SPCODE^(k4)(N/2−1).

Then, the long-term prediction residual signal encoding unit 702 obtains the value of k3 that minimizes the square error spliter_1 and the value of k4 that minimizes the square error spliter_2, and uses these values as the long-term prediction residual encoded information.
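A minimal Python sketch of the split VQ search is given below. Equation (11) is not reproduced in this text, so the split into a first half and a second half of the frame is an assumption (an interleaved split would fit the description equally well); the helper names are hypothetical.

```python
def nearest(vec, codebook):
    """Index of the code vector with the smallest squared error to vec."""
    def sq_err(k):
        return sum((a - b) ** 2 for a, b in zip(vec, codebook[k]))
    return min(range(len(codebook)), key=sq_err)


def split_vq_encode(p, codebook1, codebook2):
    """Return (k3, k4) for one residual frame p of even length N."""
    half = len(p) // 2
    sp1 = p[:half]   # assumed first split vector sp_1(0..N/2-1)
    sp2 = p[half:]   # assumed second split vector sp_2(0..N/2-1)
    k3 = nearest(sp1, codebook1)   # minimizes spliter_1
    k4 = nearest(sp2, codebook2)   # minimizes spliter_2
    return k3, k4


def split_vq_decode(k3, k4, codebook1, codebook2):
    """Concatenate the two selected code vectors (assumed decoder)."""
    return list(codebook1[k3]) + list(codebook2[k4])
```

With 16 entries per codebook, each index fits in 4 bits, which gives the 8 bits stated above.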

Next, encoding with an 8-bit transform VQ that uses the discrete Fourier transform will be described. In this case, a transform codebook consisting of 256 transform code vectors is prepared, and each transform code vector TCODE^(k5)(0) to TCODE^(k5)(N/2−1) is a vector of length N. Here, k5 is the index of the transform code vector and takes a value from 0 to 255. The long-term prediction residual signal encoding unit 702 applies the discrete Fourier transform to the long-term prediction residual signals p(n) to p(n+N−1) by the following equation (13) to obtain the transform vector tp(0) to tp(N−1), and then, by the following equation (14), obtains the square error transer between the transform vector tp(0) to tp(N−1) and the transform code vector TCODE^(k5)(0) to TCODE^(k5)(N/2−1).

Then, the long-term prediction residual signal encoding unit 702 obtains the value of k5 that minimizes the square error transer, and uses this value as the long-term prediction residual encoded information.
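The transform VQ search can be sketched in Python as follows. Equations (13) and (14) are not reproduced in this text, so the sketch assumes an unnormalized DFT and a squared error summed over all N complex coefficients; storing each transform code vector as N complex values is also an assumption about the codebook layout, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Unnormalized discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def transform_vq_encode(p, transform_codebook):
    """Return k5: index of the code vector closest to the DFT of frame p."""
    tp = dft(p)  # transform vector tp(0..N-1)

    def transer(k5):
        # Squared error between complex coefficients (assumed error measure).
        return sum(abs(a - b) ** 2 for a, b in zip(tp, transform_codebook[k5]))

    return min(range(len(transform_codebook)), key=transer)
```

Under these assumptions the DFT-domain squared error is N times the time-domain squared error, so the selected index matches a time-domain search against the inverse transforms of the code vectors; a practical design would typically weight or prune coefficients, which this sketch does not attempt.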

Next, encoding with a 13-bit two-stage VQ that uses 5 bits for the first stage and 8 bits for the second stage will be described. In this case, two codebooks are prepared: a first-stage codebook and a second-stage codebook. The first-stage codebook consists of 32 first-stage code vectors PHCODE_1^(k6)(0) to PHCODE_1^(k6)(N−1), the second-stage codebook consists of 256 second-stage code vectors PHCODE_2^(k7)(0) to PHCODE_2^(k7)(N−1), and each code vector is a vector of length N. Here, k6 is the index of the first-stage code vector and takes a value from 0 to 31, and k7 is the index of the second-stage code vector and takes a value from 0 to 255. The long-term prediction residual signal encoding unit 702 obtains the square error phaseer_1 between the long-term prediction residual signals p(n) to p(n+N−1) and the first-stage code vector PHCODE_1^(k6)(0) to PHCODE_1^(k6)(N−1) by the following equation (15), obtains the value of k6 that minimizes the square error phaseer_1, and sets this value as kmax.

Then, the long-term prediction residual signal encoding unit 702 obtains the error vector ep(0) to ep(N−1) by the following equation (16), obtains the square error phaseer_2 between the error vector ep(0) to ep(N−1) and the second-stage code vector PHCODE_2^(k7)(0) to PHCODE_2^(k7)(N−1) by the following equation (17), obtains the value of k7 that minimizes the square error phaseer_2, and uses this value and kmax as the long-term prediction residual encoded information.
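The two-stage search can be sketched in Python as follows. Equations (15) to (17) are not reproduced in this text, so the sketch assumes that the error vector is the frame minus the selected first-stage code vector and that both stages use a plain sum of squared differences; the decoder (sum of the two selected code vectors) and all function names are hypothetical.

```python
def nearest(vec, codebook):
    """Index of the code vector with the smallest squared error to vec."""
    def sq_err(k):
        return sum((a - b) ** 2 for a, b in zip(vec, codebook[k]))
    return min(range(len(codebook)), key=sq_err)


def two_stage_vq_encode(p, stage1, stage2):
    """Return (kmax, k7) for one residual frame p."""
    kmax = nearest(p, stage1)                        # minimizes phaseer_1
    ep = [a - b for a, b in zip(p, stage1[kmax])]    # assumed error vector ep(i)
    k7 = nearest(ep, stage2)                         # minimizes phaseer_2
    return kmax, k7


def two_stage_vq_decode(kmax, k7, stage1, stage2):
    """Assumed decoder: first-stage code vector plus second-stage refinement."""
    return [a + b for a, b in zip(stage1[kmax], stage2[k7])]
```

The second stage only has to cover the spread of first-stage errors, which is why a sequential search over 32 + 256 entries can approach the accuracy of a much larger single codebook.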

(Embodiment 3)
FIG. 9 is a block diagram showing the configuration of a speech signal transmitting apparatus and a speech signal receiving apparatus that include the speech encoding apparatus and speech decoding apparatus described in Embodiments 1 and 2.

In FIG. 9, a speech signal 901 is converted into an electrical signal by an input device 902 and output to an A/D conversion device 903. The A/D conversion device 903 converts the (analog) signal output from the input device 902 into a digital signal and outputs it to a speech encoding apparatus 904. The speech encoding apparatus 904 implements the speech encoding apparatus 100 shown in FIG. 1, encodes the digital speech signal output from the A/D conversion device 903, and outputs the encoded information to an RF modulation device 905. The RF modulation device 905 converts the speech encoded information output from the speech encoding apparatus 904 into a signal for transmission over a propagation medium such as radio waves, and outputs it to a transmission antenna 906. The transmission antenna 906 transmits the output signal of the RF modulation device 905 as a radio wave (RF signal). The RF signal 907 in the figure represents the radio wave (RF signal) transmitted from the transmission antenna 906. The above is the configuration and operation of the speech signal transmitting apparatus.

The RF signal 908 is received by a receiving antenna 909 and output to an RF demodulation device 910. The RF signal 908 in the figure represents the radio wave received by the receiving antenna 909, and it is identical to the RF signal 907 if no signal attenuation or noise superposition occurs in the propagation path.

The RF demodulation device 910 demodulates the speech encoded information from the RF signal output from the receiving antenna 909 and outputs it to a speech decoding apparatus 911. The speech decoding apparatus 911 implements the speech decoding apparatus 150 shown in FIG. 1, decodes a speech signal from the speech encoded information output from the RF demodulation device 910, and outputs it to a D/A conversion device 912. The D/A conversion device 912 converts the digital speech signal output from the speech decoding apparatus 911 into an analog electrical signal and outputs it to an output device 913.

The output device 913 converts the electrical signal into air vibration and outputs it as a sound wave audible to the human ear. In the figure, reference numeral 914 represents the output sound wave. The above is the configuration and operation of the speech signal receiving apparatus.

By equipping the base station apparatus and the communication terminal apparatus in a wireless communication system with the speech signal transmitting apparatus and speech signal receiving apparatus described above, a high-quality decoded signal can be obtained.

The present invention is suitable for use in speech encoding apparatuses and speech decoding apparatuses used in communication systems that encode and transmit speech and musical sound signals.

FIG. 1 is a block diagram showing the configuration of a speech encoding apparatus and a speech decoding apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the internal configuration of the base layer encoding unit according to the above embodiment.
FIG. 3 is a diagram for explaining the process by which the parameter determination unit in the base layer encoding unit according to the above embodiment determines the signal generated from the adaptive excitation codebook.
FIG. 4 is a block diagram showing the internal configuration of the base layer decoding unit according to the above embodiment.
FIG. 5 is a block diagram showing the internal configuration of the enhancement layer encoding unit according to the above embodiment.
FIG. 6 is a block diagram showing the internal configuration of the enhancement layer decoding unit according to the above embodiment.
FIG. 7 is a block diagram showing the internal configuration of the enhancement layer encoding unit according to Embodiment 2 of the present invention.
FIG. 8 is a block diagram showing the internal configuration of the enhancement layer decoding unit according to the above embodiment.
FIG. 9 is a block diagram showing the configuration of a speech signal transmitting apparatus and a speech signal receiving apparatus according to Embodiment 3 of the present invention.

Explanation of symbols

100 Speech encoding apparatus
101 Base layer encoding unit
102, 152 Base layer decoding unit
103, 154, 701, 705, 803 Addition unit
104 Enhancement layer encoding unit
105 Multiplexing unit
150 Speech decoding apparatus
151 Demultiplexing unit
153 Enhancement layer decoding unit
501, 601 Long-term prediction lag instruction unit
502, 602 Long-term prediction signal storage unit
503 Long-term prediction coefficient calculation unit
504 Long-term prediction coefficient encoding unit
505, 603 Long-term prediction coefficient decoding unit
506, 604 Long-term prediction signal generation unit
702 Long-term prediction residual signal encoding unit
703 Encoded information multiplexing unit
704 Long-term prediction residual signal decoding unit
801 Encoded information separation unit
802 Long-term prediction residual signal decoding unit

Claims

A speech encoding apparatus comprising: base layer encoding means for encoding an input signal to generate first encoded information; base layer decoding means for decoding the first encoded information to generate a first decoded signal and for generating long-term prediction information, which is information representing the long-term correlation of speech or musical sound; addition means for obtaining a residual signal that is the difference between the input signal and the first decoded signal; and enhancement layer encoding means for calculating a long-term prediction coefficient using the long-term prediction information and the residual signal and encoding the long-term prediction coefficient to generate second encoded information,
wherein the enhancement layer encoding means includes: means for obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; means for cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; means for calculating the long-term prediction coefficient by applying the residual signal and the long-term prediction signal to a predetermined calculation formula; means for generating enhancement layer encoded information by encoding the long-term prediction coefficient; means for decoding the enhancement layer encoded information to generate a decoded long-term prediction coefficient; and means for calculating a new long-term prediction signal using the decoded long-term prediction coefficient and the long-term prediction signal and updating the buffer using the new long-term prediction signal.

The speech encoding apparatus according to claim 1, wherein the enhancement layer encoding means further includes: means for obtaining a long-term prediction residual signal that is the difference between the residual signal and the new long-term prediction signal; means for generating long-term prediction residual encoded information by encoding the long-term prediction residual signal; means for decoding the long-term prediction residual encoded information to calculate a decoded long-term prediction residual signal; and means for adding the new long-term prediction signal and the decoded long-term prediction residual signal to obtain an addition result and updating the buffer using the addition result instead of the new long-term prediction signal.

The speech encoding apparatus according to claim 1 or 2, wherein the base layer decoding means uses, as the long-term prediction information, information indicating the cut-out position of an adaptive excitation vector cut out from a driving excitation signal sample.

The speech encoding apparatus according to any one of claims 1 to 3, wherein the means for calculating the long-term prediction coefficient receives residual signals e(n) to e(n+N−1) and long-term prediction signals s(n−T) to s(n−T+N−1) as input and calculates a long-term prediction coefficient β by the following formula.

Here, N is the number of samples constituting one frame, n is a sample located at the head of each frame, and T is a long-term prediction lag.

A speech decoding apparatus that receives the first encoded information and the second encoded information from the speech encoding apparatus according to claim 1 and decodes speech, comprising: base layer decoding means for decoding the first encoded information to generate a first decoded signal and for generating long-term prediction information, which is information representing the long-term correlation of speech or musical sound; enhancement layer decoding means for decoding the second encoded information using the long-term prediction information to generate a second decoded signal; and addition means for adding the first decoded signal and the second decoded signal and outputting the speech or musical sound signal that is the addition result,
wherein the enhancement layer decoding means includes: means for obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; means for cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; means for decoding the enhancement layer encoded information included in the second encoded information to obtain a decoded long-term prediction coefficient; and means for calculating a new long-term prediction signal using the decoded long-term prediction coefficient and the long-term prediction signal and updating the buffer using the new long-term prediction signal, and uses the new long-term prediction signal as the second decoded signal.

A speech decoding apparatus that receives the first encoded information and the second encoded information from the speech encoding apparatus according to claim 2 and decodes speech, comprising: base layer decoding means for decoding the first encoded information to generate a first decoded signal and for generating long-term prediction information, which is information representing the long-term correlation of speech or musical sound; enhancement layer decoding means for decoding the second encoded information using the long-term prediction information to generate a second decoded signal; and addition means for adding the first decoded signal and the second decoded signal and outputting the speech or musical sound signal that is the addition result,
wherein the enhancement layer decoding means includes: means for obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; means for cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; means for decoding the enhancement layer encoded information included in the second encoded information to obtain a decoded long-term prediction coefficient; means for calculating a new long-term prediction signal using the decoded long-term prediction coefficient and the long-term prediction signal; means for decoding the long-term prediction residual encoded information included in the second encoded information to obtain a decoded long-term prediction residual signal; addition means for adding the new long-term prediction signal and the decoded long-term prediction residual signal; and means for updating the buffer using the addition result of the addition means, and uses the addition result as the second decoded signal.

The speech decoding apparatus according to claim 5 or 6, wherein the base layer decoding means uses, as the long-term prediction information, information indicating the cut-out position of an adaptive excitation vector cut out from a driving excitation signal sample.

A speech signal transmitting apparatus comprising the speech encoding apparatus according to claim 1.

A speech signal receiving apparatus comprising the speech decoding apparatus according to any one of claims 5 to 7.

A base station apparatus comprising at least one of the speech signal transmitting apparatus according to claim 8 and the speech signal receiving apparatus according to claim 9.

A communication terminal apparatus comprising at least one of the speech signal transmitting apparatus according to claim 8 and the speech signal receiving apparatus according to claim 9.

A speech encoding method comprising: a step of encoding an input signal to generate first encoded information; a step of decoding the first encoded information to generate a first decoded signal and generating long-term prediction information, which is information representing the long-term correlation of speech or musical sound; a step of obtaining a residual signal that is the difference between the input signal and the first decoded signal; and an enhancement layer encoding step of calculating a long-term prediction coefficient using the long-term prediction information and the residual signal and encoding the long-term prediction coefficient to generate second encoded information,
wherein the enhancement layer encoding step includes: a step of obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; a step of cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; a step of calculating the long-term prediction coefficient by applying the residual signal and the long-term prediction signal to a predetermined calculation formula; a step of generating enhancement layer encoded information by encoding the long-term prediction coefficient; a step of decoding the enhancement layer encoded information to generate a decoded long-term prediction coefficient; and a step of calculating a new long-term prediction signal using the decoded long-term prediction coefficient and the long-term prediction signal and updating the buffer using the new long-term prediction signal.

A speech decoding method for decoding speech using the first encoded information and the second encoded information generated by the speech encoding method according to claim 12, comprising: a step of decoding the first encoded information to generate a first decoded signal and generating long-term prediction information, which is information representing the long-term correlation of speech or musical sound; an enhancement layer decoding step of decoding the second encoded information using the long-term prediction information to generate a second decoded signal; and a step of adding the first decoded signal and the second decoded signal and outputting the speech or musical sound signal that is the addition result,
wherein the enhancement layer decoding step includes: a step of obtaining a long-term prediction lag of the enhancement layer based on the long-term prediction information; a step of cutting out, from a past long-term prediction signal sequence stored in a buffer, a long-term prediction signal going back by the long-term prediction lag; a step of decoding the enhancement layer encoded information included in the second encoded information to obtain a decoded long-term prediction coefficient; and a step of calculating a new long-term prediction signal using the decoded long-term prediction coefficient and the long-term prediction signal and updating the buffer using the new long-term prediction signal, and uses the new long-term prediction signal as the second decoded signal.