JP2005003912A

JP2005003912A - Audio signal encoding system, audio signal encoding method, and program

Info

Publication number: JP2005003912A
Application number: JP2003166899A
Authority: JP
Inventors: Masanobu Funakoshi; 正伸船越
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-06-11
Filing date: 2003-06-11
Publication date: 2005-01-06

Abstract

PROBLEM TO BE SOLVED: To provide an audio encoding technique capable of producing a bit stream of good sound quality by suitably suppressing pre-echoes while maintaining encoding efficiency. SOLUTION: The audio signal encoding system has a frame dividing section (1) which divides an audio input signal into frames serving as processing units, an auditory psychology arithmetic section (3) which analyzes the frame-divided audio input signal and outputs an auditory entropy value, a block length decision section (4) which determines the transform block length of the frame based on the auditory entropy value output by the auditory psychology arithmetic section and an auditory entropy threshold, and a filter bank (2) which divides the frame into blocks according to the block length determined by the block length decision section and transforms the blocks into frequency spectra. The block length decision section determines the auditory entropy threshold according to the genre of the audio input signal. COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、オーディオ信号の符号化技術に関し、特に、変換ブロック長の変更が可能な変換符号化技術を利用したオーディオ信号符号化技術に関する。
【０００２】
【従来の技術】
近年、高音質、かつ高効率なオーディオ信号符号化技術は、ＤＶＤ−Ｖｉｄｅｏの音声トラックや、半導体メモリやＨＤＤなどを利用した携帯オーディオプレイヤー、インターネットを介した音楽配信、家庭内ＬＡＮにおけるホームサーバなどに広く利用され、幅広く普及するとともにその重要性も増している。
【０００３】
このようなオーディオ信号符号化技術の多くは、変換符号化技術を利用して時間周波数変換を行っている。例えば、ＭＰＥＧ２−ＡＡＣやＤｏｌｂｙＤｉｇｉｔａｌ（ＡＣ−３）などでは、ＭＤＣＴなどの直交変換単体でフィルタバンクを構成しており、ＭＰＥＧ１ＡｕｄｉｏＬａｙｅｒ３（ＭＰ３）やＡＴＲＡＣ（ＭＤ）では、ＱＭＦなどのサブバンド分割フィルタと直交変換を多段接続してフィルタバンクを構成している。
【０００４】
これらの高効率オーディオ符号化技術では、人間の聴覚特性を利用したマスキング分析を行うことによって、マスキングされると判断したスペクトル成分を取り除くことにより、圧縮効率を高めている。
【０００５】
これらの高効率オーディオ符号化技術で用いられているマスキング分析は、主に、静寂時の可聴周波数領域によるマスキングと、臨界帯域におけるマスカーによる周波数マスキングである。
【０００６】
上記マスキング分析により、人間に感知できないと判断される信号は主に高周波域の信号になるため、通常の場合、高周波成分の量子化誤差は多少大きくなってもマスキングされうる。
【０００７】
ところが、変換符号化方式では、オーディオ入力信号に急激な変化がある、いわゆる過渡状態の場合、急激な変化が起こっている部分の高周波成分の量子化誤差が、急激な変化の直前や直後の信号にまで影響を与えるため、リンギングノイズが生じる。
【０００８】
人間の聴覚特性として、大きな音が発生した場合、その直前と直後の時間は音が聞こえづらくなる。これを時間マスキング効果という。大きな音の後に聞こえなくなる時間は、個人差はあるが約１００ｍｓｅｃ程度と比較的長い。しかしながら、直前に働くマスキング効果の時間は約５〜６ｍｓｅｃと短い。従って、リンギングノイズが生じると、大きな音の前のノイズは感知されやすくなってしまう。これは一般にプリエコーと呼ばれる現象である。
【０００９】
以下、この現象を図を用いて説明する。
【００１０】
図１０は、急激に変化しているオーディオ入力信号の一例である。この信号を、ＭＰＥＧ−２ＡＡＣの通常のブロック長の場合の変換単位である２０４８サンプルで符号化・復号化したオーディオ信号の例を図１１に示している。図示したように急激な信号の変化の部分で生じている高周波域の量子化誤差が、ブロック全域に渡って影響している。
【００１１】
前述したように、振幅が急激に変化する部分の直前では、時間マスキング効果によって人間はノイズを感知できない。しかしながら、入力信号が音楽用ＣＤに用いられているＰＣＭ信号と同様な４４．１ＫＨｚサンプリング周波数を用いていると仮定して、変換単位を時間に換算すると、２０４８サンプルの時間は２０４８÷４４１００×１０００＝約４６．４４ｍｓとなるため、この前半の時間にノイズが生じているとしてもプリマスキング時間をはみだしてしまい、人間はプリエコーを感知してしまう。
【００１２】
これを抑制するための一方法として、種々のオーディオ符号化方式では、入力信号の急激な変化を検知して変換ブロック長を短くすることにより、急激な変化による高周波成分の量子化誤差が、変化直前の部分に及ばないようにすることで、プリエコーの発生を抑制している。
【００１３】
図１２では、ＭＰＥＧ−２ＡＡＣにおけるショートブロック長の場合の変換単位である２５６サンプルで図１０に示すオーディオ信号を符号化、復号化した場合の時間領域信号を示している。この場合、入力信号の急激な変化による高周波数域の量子化誤差の影響は、変化が発生している２５６サンプルブロックの中に閉じ込められてしまう。先ほどと同様に、このブロック長を４４．１ＫＨｚサンプリング周波数で時間に換算すると、約５．８０ｍｓとなるため、プリマスキング効果によりこのノイズを人間はほぼ感知できなくなり、結果としてプリエコーは消える。
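The block durations quoted in paragraphs 0011 and 0013 can be checked with a short calculation. The sketch below is illustrative only: the function name `block_duration_ms` and the 6 ms pre-masking bound (taken as the upper end of the 5 to 6 msec range cited in paragraph 0008) are assumptions made for this example, not values defined by the specification.

```python
SAMPLE_RATE_HZ = 44_100   # CD-style PCM sampling rate assumed in the text
PRE_MASKING_MS = 6.0      # assumed bound: upper end of the 5-6 msec range in [0008]


def block_duration_ms(block_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Return the duration of one transform block in milliseconds."""
    return block_samples / sample_rate_hz * 1000.0


for name, samples in (("long", 2048), ("short", 256)):
    duration = block_duration_ms(samples)
    hidden = duration <= PRE_MASKING_MS
    print(f"{name:5s} block: {samples:4d} samples = {duration:6.2f} ms, "
          f"fits pre-masking window: {hidden}")
```

Under these assumptions the 2048-sample block is far longer than the pre-masking window, while the 256-sample block fits inside it, which is the argument the text makes for switching to short blocks at transients.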
【００１４】
なお、実際の処理では、変換によって生じる折り返し歪みを除去するために、変換長単位のウィンドウ掛けを行った上で変換長の５０％ずつ入力サンプルをずらして変換を行ない、その結果の重ねあわせを行なうが、この手続きは本図では説明の便宜上省略している。
【００１５】
ところが、一般にブロック長を短くすると、周波数分解能が落ちることによりマスキング分析の精度が落ちるばかりでなく、量子化時に使用する各周波数帯域ごとの正規化係数（以下、スケールファクタ）がブロックの数だけ増大するために、そこで消費される情報量が増えてしまい、量子化時に本来ならスペクトル情報に割り当てるべきビットがスケールファクタに消費されてしまうため、符号化効率が低下する。その結果、特に低ビットレート時には量子化誤差が厳密にマスキングできなくなるため、ブロック長が長い場合に比較して、ノイズが感知されやすくなる恐れがある。
【００１６】
そこで、実際のブロック長を決定する場合は、プリエコーの抑制と符号化効率の低下によって発生するノイズとのバランスを適宜考慮して決める必要がある。
【００１７】
このブロック長選択方法として、ＭＰ３やＭＰＥＧ−２ＡＡＣなどのＭＰＥＧＡｕｄｉｏ符号化方式では、当該ブロック毎に聴覚エントロピー（以下ＰＥ）を算出し、予め定められたＰＥ閾値よりも大きい場合に、短いブロック長を選択することになっている。
【００１８】
なお、このようなオーディオ符号化データに関するメタ情報（楽曲名、アーティスト名、アルバム名、ジャンル、作成年、著作権情報など）を格納するためのデータ形式の一つとして、ＩＤ３タグが提案されている。ＩＤ３タグは、特にＭＰ３ファイルにおいて一般的に使用されており、ＭＰ３ビットストリームの末尾に付加される。
【００１９】
ＩＤ３タグの表示、編集は、一般的なオーディオ符号化・復号化アプリケーションの一機能として実装されているため、ユーザは容易にＩＤ３タグの内容を編集することが可能である。
【００２０】
一方、近年では、オーディオ信号そのものを分析し、上記ジャンルなどのメタ情報を自動的に抽出する技術の研究も盛んに行われている。下記の特許文献１では、音楽信号の低域の周波数分析結果と、予め音楽ジャンルごとに求められた複数の低周波成分パターンを比較することにより、当該音楽信号の音楽ジャンルを判定している。また、下記の特許文献２では、ジャンルごとのリズム、テンポ、調整などのパターンを学習したニューラルネットワークを用いて音楽信号の音楽ジャンルを判定している。
【００２１】
また、下記の特許文献３では、ＰＥの最大値から最小値を引いた差分と閾値を比較することによってブロック長の切り替えを判断する技術が開示されている。
【００２２】
また、下記の特許文献４では、聴覚エントロピーの計算は行わずに、入力ＰＣＭ信号をブロック長よりも短いセグメントに分割し、振幅の急激な変化を検知し、急激な変化を検知した場合、その前のセグメントにおける量子化ノイズとそれ以前のセグメントのマスキング値を推定し、マスキングできない量子化ノイズが発生する場合に短いブロック長を選択するという方式が開示されている。
【００２３】
【特許文献１】
特開平８−３７７００号公報
【特許文献２】
特開平１０−１６１６５４号公報
【特許文献３】
特開２０００−２７６１９８号公報
【特許文献４】
特開２００１−１４２４９３号公報
【００２４】
【発明が解決しようとする課題】
上記のように、例えばＭＰＥＧ−２ＡＡＣ規格書（ＩＳＯ／ＩＥＣ１３８１８−７：１９９７）に記載されているブロック長判定方法では、ブロック毎のＰＥと固定のＰＥ閾値との比較のみでブロック長を決定している。ところが、このＰＥ閾値は実際には機器の実装や入力信号により変化するため、固定のＰＥ閾値をあらゆる場合に適用すると、楽曲によってはブロック長の決定が正しくないことが度々生じ、往々にして符号化効率が下がったり、プリエコーが生じたりして音質劣化する場合がある。
【００２５】
例えば、従来の固定のＰＥ閾値によってあらゆるジャンルのオーディオ信号のブロック長選択を行うと、ハードロックのような変化が激しい音楽では、短いブロック長が多用されることになるため、符号化効率が落ち、結果としてマスクされないノイズが増え、音質劣化する。逆に、クラシックのような大人しい音楽では、適切に短いブロック長が選択されずに、プリエコーが生じてしまう。
【００２６】
本発明は、上記問題点に鑑みて考案されたものであり、入力オーディオ信号のジャンルによってＰＥ閾値を適宜変更することにより、適切なブロック選択を可能にすることで、必要以上に短いブロック長が選択されることを防止して、符号化効率を保ちながらプリエコーを適宜抑制し、音質の良いビットストリームを作成することができるオーディオ符号化技術を提供することを目的とする。
【００２７】
【課題を解決するための手段】
本発明の一観点によれば、オーディオ入力信号を処理単位のフレームに分割するフレーム分割部と、前記フレーム分割されたオーディオ入力信号を分析して聴覚エントロピー値を出力する聴覚心理演算部と、前記聴覚心理演算部が出力する聴覚エントロピー値と聴覚エントロピー閾値とを基に、前記フレームの変換ブロック長を決定するブロック長判定部と、前記ブロック長判定部が決定したブロック長に応じて、前記フレームをブロック化し、周波数スペクトルに変換するフィルタバンクとを有し、前記ブロック長判定部は、オーディオ入力信号のジャンルに応じて、前記聴覚エントロピー閾値を決定することを特徴とするオーディオ信号符号化装置が提供される。
本発明の他の観点によれば、オーディオ入力信号を処理単位のフレームに分割するフレーム分割部と、前記フレーム分割されたオーディオ入力信号を分析して聴覚エントロピー値を出力する聴覚心理演算部と、前記聴覚心理演算部が出力する聴覚エントロピー値と聴覚エントロピー閾値とを基に、前記フレームの変換ブロック長を決定するブロック長判定部と、前記ブロック長判定部が決定したブロック長に応じて、前記フレームをブロック化し、周波数スペクトルに変換するフィルタバンクと、前記聴覚心理演算部により出力される各フレームの聴覚エントロピー値の履歴を保存するための履歴保存部とを有し、前記ブロック長判定部は、前記履歴保存部に保存されている聴覚エントロピー値の履歴に応じて、前記聴覚エントロピー閾値を決定することを特徴とするオーディオ信号符号化装置が提供される。
本発明のさらに他の観点によれば、オーディオ入力信号を処理単位のフレームに分割するフレーム分割ステップと、前記フレーム分割されたオーディオ入力信号を分析して聴覚エントロピー値を出力する聴覚心理演算ステップと、前記出力された聴覚エントロピー値と聴覚エントロピー閾値とを基に、前記フレームの変換ブロック長を決定するブロック長判定ステップと、前記決定されたブロック長に応じて、前記フレームをブロック化し、周波数スペクトルに変換する変換ステップとを有し、前記ブロック長判定ステップは、オーディオ入力信号のジャンルに応じて、前記聴覚エントロピー閾値を決定することを特徴とするオーディオ信号符号化方法が提供される。
本発明のさらに他の観点によれば、オーディオ入力信号を処理単位のフレームに分割するフレーム分割ステップと、前記フレーム分割されたオーディオ入力信号を分析して聴覚エントロピー値を出力する聴覚心理演算ステップと、前記出力された聴覚エントロピー値と聴覚エントロピー閾値とを基に、前記フレームの変換ブロック長を決定するブロック長判定ステップと、前記決定されたブロック長に応じて、前記フレームをブロック化し、周波数スペクトルに変換する変換ステップと、前記聴覚心理演算ステップにより出力された各フレームの聴覚エントロピー値の履歴を履歴保存部に保存する履歴保存ステップとを有し、前記ブロック長判定ステップは、前記履歴保存部に保存されている聴覚エントロピー値の履歴に応じて、前記聴覚エントロピー閾値を決定することを特徴とするオーディオ信号符号化方法が提供される。
本発明のさらに他の観点によれば、オーディオ入力信号を処理単位のフレームに分割するフレーム分割ステップと、前記フレーム分割されたオーディオ入力信号を分析して聴覚エントロピー値を出力する聴覚心理演算ステップと、前記出力された聴覚エントロピー値と聴覚エントロピー閾値とを基に、前記フレームの変換ブロック長を決定するブロック長判定ステップと、前記決定されたブロック長に応じて、前記フレームをブロック化し、周波数スペクトルに変換する変換ステップとをコンピュータに実行させるためのプログラムであって、前記ブロック長判定ステップは、オーディオ入力信号のジャンルに応じて、前記聴覚エントロピー閾値を決定することを特徴とするプログラムが提供される。
本発明のさらに他の観点によれば、オーディオ入力信号を処理単位のフレームに分割するフレーム分割ステップと、前記フレーム分割されたオーディオ入力信号を分析して聴覚エントロピー値を出力する聴覚心理演算ステップと、前記出力された聴覚エントロピー値と聴覚エントロピー閾値とを基に、前記フレームの変換ブロック長を決定するブロック長判定ステップと、前記決定されたブロック長に応じて、前記フレームをブロック化し、周波数スペクトルに変換する変換ステップと、前記聴覚心理演算ステップにより出力された各フレームの聴覚エントロピー値の履歴を履歴保存部に保存する履歴保存ステップとをコンピュータに実行させるためのプログラムであって、前記ブロック長判定ステップは、前記履歴保存部に保存されている聴覚エントロピー値の履歴に応じて、前記聴覚エントロピー閾値を決定することを特徴とするプログラムが提供される。
【００２８】
本発明によれば、オーディオ入力信号のジャンルに応じて、変換ブロック長決定の際に参照する聴覚エントロピー閾値を決定するため、プリエコーの発生を避けながら符号化効率の良い符号化処理が可能となり、特に低ビットレート時に音質の良い符号化処理を実現することができる。
【００２９】
【発明の実施の形態】
（第１の実施形態）
以下図面を参照しながら本発明を詳細に説明する。
図１は、本発明の第１の実施形態におけるオーディオ符号化装置の一構成例である。
【００３０】
図示の構成において、１はオーディオ入力信号を処理単位であるフレームに分割するフレーム分割器である。ここで分割されたフレームは後述するフィルタバンク２と聴覚心理演算器３へ送出される。
【００３１】
２はフィルタバンクであり、フレームに分割された入力時間信号をブロック長判定器４によって指定された長さのブロック長で周波数スペクトルに変換する。
【００３２】
３は聴覚心理演算器であり、オーディオ入力信号をフレーム単位に分析し、聴覚エントロピー値の算出と、量子化単位となる分割周波数帯域ごとのマスキング計算を行う。この演算の結果、聴覚エントロピー（ＰＥ）値がブロック長判定器４に、また、各分割周波数帯域ごとの信号対マスク比（ＳｉｇｎａｌＭａｓｋＲａｔｉｏ：ＳＭＲ）がビット割当て器５にそれぞれ出力される。
【００３３】
４はブロック長判定器であり、外部から送出されるジャンル情報によりＰＥ閾値を変更して保持し、聴覚心理演算器３から送出されるＰＥとＰＥ閾値とを比較して変換ブロック長を判定し、フィルタバンク２に通知する。
【００３４】
５はビット割当て器であり、聴覚心理演算器３より送出される分割周波数帯域ごとのＳＭＲ値や周波数スペクトルを参考にして、各分割周波数帯域ごとに割り当てるビット量を決定する。
【００３５】
６は量子化器であり、フィルタバンク２が出力する周波数スペクトルの正規化係数（スケールファクタ）を各周波数帯域毎に算出し、ビット割当て器５が出力する、各周波数帯域ごとのビット量に従って周波数スペクトルを量子化する。
【００３６】
７はビット整形器であり、量子化器６が出力するスケールファクタと量子化スペクトルを適宜規定のフォーマットに整形してビットストリームを作成し、出力する。
【００３７】
上記構成によるオーディオ信号符号化装置におけるオーディオ信号の処理動作を以下に説明する。
【００３８】
なお、本実施形態では説明の便宜のために符号化方式としてＭＰＥＧ２−ＡＡＣを例にとって説明するが、ＭＰ３など、ＰＥによってブロック長切り替え判定を行うその他の符号化方式についても全く同様な方法で実現可能である。
【００３９】
まず、処理に先立ち、各部の初期化を行う。このとき、ブロック長判定器４では、外部から入力されたオーディオ信号のジャンル情報によって、ＰＥ閾値を変更する。ＰＥ閾値の変更は、ジャンルごとのＰＥ閾値を格納したテーブルを参照し、ジャンルによって選択することによって変更しても良いし、ジャンルごとのＰＥ補正値を格納したテーブルを参照して、ＰＥ閾値の基準となる値にジャンルごとの補正値を加えても良い。図３にＰＥ閾値テーブルの構成例、図４にＰＥ補正値テーブルの模式図を示す。なお、図４のＰＥ補正値テーブルを用いる場合のＰＥ閾値の基準値は１８００である。本実施形態において、これらのテーブルはブロック長判定器４に格納されている。
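The two ways of deriving the PE threshold described in paragraph 0039 (a direct per-genre table as in Fig. 3, or a base value plus a per-genre correction as in Fig. 4) can be sketched as follows. Only the base value 1800 and the rock threshold 2000 appear in the description; every other table entry, the genre keys, and the fallback to the base value for unknown genres are assumptions chosen for illustration.

```python
PE_BASE_THRESHOLD = 1800  # reference value stated for the correction-table variant

# Hypothetical contents: only "rock" -> 2000 is quoted in the description (step S2).
PE_THRESHOLD_TABLE = {"rock": 2000, "classical": 1600, "pops": 1800}

# Hypothetical per-genre offsets, chosen here so that both variants agree.
PE_CORRECTION_TABLE = {"rock": +200, "classical": -200, "pops": 0}


def threshold_from_table(genre: str) -> int:
    """Variant 1: read the threshold directly; unknown genres get the base value."""
    return PE_THRESHOLD_TABLE.get(genre, PE_BASE_THRESHOLD)


def threshold_from_correction(genre: str) -> int:
    """Variant 2: add a per-genre correction to the base value."""
    return PE_BASE_THRESHOLD + PE_CORRECTION_TABLE.get(genre, 0)


print(threshold_from_table("rock"), threshold_from_correction("rock"))
```

The correction-table form keeps a single tunable base value, which may be convenient if the base has to be re-tuned for a different psychoacoustic model.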
【００４０】
オーディオＰＣＭ信号などの入力オーディオ信号はフレーム分割器１によってフレーム単位に分割され、フィルタバンク２と聴覚心理演算器３に送出される。ＭＰＥＧ２−ＡＡＣＬＣ（Ｌｏｗ−Ｃｏｍｐｌｅｘｉｔｙ）プロファイルの場合、１フレームは１０２４サンプルのＰＣＭ信号で構成される。
【００４１】
入力オーディオ信号はフレーム毎に聴覚心理演算器３によって聴覚エントロピーと各周波数帯域ごとのマスキング計算が行われる。算出された聴覚エントロピー値はブロック長判定器４によって、予めジャンル情報によって変更されたＰＥ閾値と比較される。ここで、ＰＥ閾値よりも当該フレームのＰＥ値が大きい場合は、短いブロック長を使用することが判定され、そうでない場合は、長いブロック長を使用することが判定される。フィルタバンク２では、この判定に沿ったブロック長で、入力信号を周波数スペクトルへ変換する。
【００４２】
なお、ＭＰＥＧ２−ＡＡＣでは、フィルタバンク２において直交変換によるエイリアシングを除去するために、ＭＤＣＴによる重複変換が行われる。その都合上、時間周波数変換では、処理対象フレームとその直前のフレームを合わせた２０４８サンプルを一単位として入力し、１０２４個の周波数スペクトルを得る。このとき、長いブロック長を用いる場合は、入力信号の２０４８サンプルを一つのブロックとして直交変換を行い、１０２４個の周波数スペクトルを出力する。短いブロック長を用いる場合は、入力信号の２５６サンプルを一つのブロックとして１２８個の周波数スペクトルを出力する変換を、入力信号を１２８サンプルずつずらしながら都合８回の変換を行う。
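The block arithmetic in paragraph 0042 can be written out as a small layout function: one 2048-sample transform yielding 1024 coefficients for a long block, or eight 256-sample transforms shifted by 128 samples, each yielding 128 coefficients, for short blocks. This is a sketch of the bookkeeping only, not an MDCT implementation. The function name and the `first_offset` parameter are assumptions; the text does not say where the first short block starts inside the 2048-sample input.

```python
LONG_BLOCK = 2048    # previous frame + current frame
SHORT_BLOCK = 256
SHORT_HOP = 128      # shift between successive short transforms
SHORT_COUNT = 8


def block_layout(use_short: bool, first_offset: int = 0):
    """Return a list of (start, length, n_coefficients), one per transform."""
    if not use_short:
        return [(0, LONG_BLOCK, LONG_BLOCK // 2)]
    return [
        (first_offset + i * SHORT_HOP, SHORT_BLOCK, SHORT_BLOCK // 2)
        for i in range(SHORT_COUNT)
    ]


for start, length, coeffs in block_layout(use_short=True):
    print(f"samples {start:4d}..{start + length - 1:4d} -> {coeffs} coefficients")
```

Either way a frame produces 1024 spectral coefficients in total (1 × 1024 or 8 × 128), so the block-length choice trades time resolution against frequency resolution without changing the coefficient count.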
【００４３】
フィルタバンク２から出力された周波数スペクトルと、聴覚心理演算器３から出力されたＳＭＲ値によって、ビット割当て器５は各周波数帯域にビットを割り当て、量子化器６は各周波数帯域ごとにスケールファクタを算出し、各周波数帯域に割当てられたビットに従って周波数スペクトルを量子化する。
【００４４】
各周波数帯域ごとのスケールファクタと量子化スペクトルはビット整形器７によって定められた書式に従ってビットストリームに整形されて、出力される。
【００４５】
（第２の実施形態）
本発明の第２の実施形態は、汎用的なパーソナルコンピュータ（ＰＣ）上で動作するソフトウェアプログラムとして実施することができる。以下、この場合について図面を用いて説明する。
【００４６】
図５は、本発明の第２の実施形態におけるオーディオ信号符号化装置の構成例である。
【００４７】
図示の構成において、１００はＣＰＵであり、オーディオ信号符号化処理のための演算、論理判断等を行い、１０２のバスを介して、バスに接続された各構成要素を制御する。
【００４８】
１０１はメモリであり、本実施形態の構成例における基本Ｉ／Ｏプログラムや、実行しているプログラムコード、プログラム処理時に必要なデータなどを格納する。
【００４９】
１０２はバスであり、ＣＰＵ１００の制御の対象とする構成要素を指示するアドレス信号を転送し、ＣＰＵ１００の制御の対象とする各構成要素のコントロール信号を転送し、各構成機器相互間のデータ転送を行う。
【００５０】
１０３は端末であり、装置の起動、各種条件や入力信号の設定、符号化開始の指示を行う。
【００５１】
１０４は外部記憶装置であり、データやプログラム等を記憶するための外部記憶領域である。データやプログラム等は必要に応じて保管され、また、保管されたデータやプログラムは必要な時に呼び出される。
【００５２】
１０５はメディアドライブであり、記録媒体に記録されているプログラムやデータ、デジタルオーディオ信号などはこのメディアドライブ１０５が読み取ることにより本オーディオ信号符号化装置にロードされる。また、外部記憶装置１０４に蓄えられた各種データや実行プログラムを、記録媒体に書き込むことができる。
【００５３】
１０６はマイクであり、音を集音してオーディオ信号に変換する。
【００５４】
１０７はスピーカーであり、任意のオーディオ信号データを実際の音にして出力することができる。
【００５５】
１０８は通信網であり、ＬＡＮ、公衆回線、無線回線、放送電波などで構成されている。
【００５６】
１０９は通信インターフェースであり、通信網１０８に接続されている。本実施形態のオーディオ信号符号化装置はこの機器を介して通信網を経由し、外部機器と通信し、データやプログラムを送受信することができる。
【００５７】
かかる各構成要素からなる本実施形態のオーディオ信号符号化装置においては、端末１０３からの各種の入力に応じて作動するものであって、端末１０３からの入力が供給されると、インターラプト信号がＣＰＵ１００に送られることによって、ＣＰＵ１００がメモリ１０１内に記憶してある各種の制御信号を読出し、それらの制御信号に従って、各種の制御が行われる。
【００５８】
本実施形態の装置は、基本Ｉ／Ｏプログラム、ＯＳ、および本オーディオ信号符号化処理プログラムをＣＰＵ１００が実行することによって動作する。基本Ｉ／Ｏプログラムはメモリ１０１中に書き込まれており、ＯＳは外部記憶装置１０４に書き込まれている。そして、本装置の電源がＯＮにされると、基本Ｉ／Ｏプログラム中のＩＰＬ（イニシャルプログラムローディング）機能により外部記憶装置１０４からＯＳがメモリ１０１に読み込まれ、ＯＳの動作が開始される。
【００５９】
本オーディオ信号符号化処理プログラムは、図２に示されるオーディオ信号符号化処理手順のフローチャートに基づいてプログラムコード化されたものである。
【００６０】
図６は、本オーディオ信号符号化処理プログラムおよび関連データを記録媒体に記録したときの内容構成図である。
【００６１】
本実施形態において、本オーディオ信号符号化処理プログラムおよび関連データは記録媒体に記録されている。図示したように記録媒体の先頭領域には、この記録媒体のディレクトリ情報が記録されており、その後にこの記録媒体のコンテンツである本オーディオ信号符号化処理プログラムと、オーディオ信号符号化処理関連データがファイルとして記録されている。
【００６２】
図７は本オーディオ信号符号化装置に、本オーディオ信号符号化処理プログラムを導入する模式図である。記録媒体に記録されたオーディオ信号符号化処理プログラムおよび関連データは、図７に示したようにメディアドライブ１０５を通じて本装置にロードすることができる。この記録媒体１１０をパーソナルコンピュータのメディアドライブ１０５にセットすると、ＯＳ及び基本Ｉ／Ｏプログラムの制御のもとに本オーディオ信号符号化処理プログラムおよび関連データが記録媒体から読み出され、外部記憶装置１０４に格納される。その後、再起動時にこれらの情報がメモリ１０１にロードされて動作可能となる。
【００６３】
図８は、本オーディオ信号符号化装置処理プログラムがメモリ１０１にロードされ実行可能となった状態のメモリマップを示す。メモリ１０１には、基本Ｉ／Ｏプログラム、ＯＳ、オーディオ信号符号化処理プログラム、関連データ及びワークエリアが格納される。このとき、メモリ１０１のワークエリアには、ＰＥ閾値と、図３で示したＰＥ閾値テーブルが格納されている。
【００６４】
以下、本実施形態においてＣＰＵ１００で実行されるオーディオ信号符号化処理をフローに従って説明する。
【００６５】
図２は、本実施形態におけるオーディオ信号符号化処理のフローチャートである。
【００６６】
まず、ステップＳ１は、符号化する入力オーディオ信号とそのジャンルをユーザが端末１０３を用いて指定する処理である。本実施形態において、符号化するオーディオ信号は、外部記憶装置１０４に格納されているオーディオＰＣＭファイルでも良いし、マイク１０６で捉えたリアルタイムの音声信号をアナログ・デジタル変換した信号でも良い。処理を終えると、ステップＳ２へ進む。
【００６７】
ステップＳ２は、ステップＳ１で指定された符号化するオーディオ入力信号のジャンルによって、メモリ１０１上のＰＥ閾値テーブルを検索し、メモリ１０１上のＰＥ閾値を変更する処理である。例えば、ステップＳ１で入力されたジャンルが「ロック」の場合、図３で図示したＰＥ閾値テーブルを検索した結果、ＰＥ閾値として２０００がメモリ１０１上の所定領域に格納される。処理を終えると、ステップＳ３へ進む。
【００６８】
ステップＳ３は、符号化する入力オーディオ信号が終了したかどうかを判定する処理である。入力信号が終了している場合は、ステップＳ１２へ処理が進む。未終了の場合は、ステップＳ４へ処理が進む。
【００６９】
ステップＳ４は、入力信号をチャンネルごとに処理単位であるフレームに分割する処理である。第１の実施形態での説明同様、例えば、ＭＰＥＧ２−ＡＡＣの場合、チャンネルごとに１０２４サンプルのフレームに分割する。処理を終えると、ステップＳ５へ処理が進む。
【００７０】
ステップＳ５は、符号化対象となっているフレームの聴覚心理演算を行う処理である。この演算の結果、処理対象フレームの聴覚エントロピー（ＰＥ）と、量子化単位である分割周波数帯域ごとのＳＭＲ値が算出される。処理を終えると、ステップＳ６へ処理が進む。
【００７１】
ステップＳ６は、ステップＳ５で算出された処理対象フレームのＰＥ値と、ステップＳ２によって格納されたＰＥ閾値とを比較し、変換ブロック長を判定する処理である。処理対象フレームのＰＥ値がＰＥ閾値よりも大きい場合は、ステップＳ７へ処理が進む。そうでない場合は、ステップＳ８へ処理が進む。
【００７２】
ステップＳ７では、ステップＳ６で行われた判定に基づき、処理対象フレームに対してショートブロック（短いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１２８の周波数成分に分割されたスペクトルの組が８組得られる。処理を終えると、ステップＳ９に処理が進む。
【００７３】
ステップＳ８では、ステップＳ６で行われた判定に基づき、処理対象フレームに対してロングブロック（長いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１０２４の周波数成分に分割されたスペクトルの組が一組だけ得られる。処理を終えると、ステップＳ９に処理が進む。
【００７４】
ステップＳ９は、ステップＳ５で算出される分割周波数帯域ごとのＳＭＲ値と、ステップＳ７やステップＳ８で得られる周波数スペクトルと、符号化ビットレートより、各分割周波数帯域に割り当てるビット量を決定する処理である。このような処理は本実施形態のような変換符号化方法において一般的であるので、詳細は説明しない。処理を終えると、ステップＳ１０へ処理が進む。
【００７５】
ステップＳ１０は、各分割周波数帯域ごとのスケールファクタを算出するとともに、ステップＳ９で割り当てられたビット量に従って、周波数スペクトルを量子化する処理である。処理を終えると、ステップＳ１１へ処理が進む。
【００７６】
ステップＳ１１は、ステップＳ１０で算出されたスケールファクタと量子化スペクトルを、符号化方式によって定められたフォーマットに従って整形し、ビットストリームとして出力する処理である。本実施形態において、この処理によって出力されるビットストリームは、外部記憶装置１０４に格納されても良いし、あるいは、通信インターフェース１０９を介して回線網１０８に繋がっている外部機器に出力されても良い。処理を終えると、ステップＳ３へ処理が進む。
【００７７】
ステップＳ１２は、聴覚心理演算や直交変換などで生じる遅延によってまだ出力されていない量子化スペクトルがメモリ上に残っているため、それらをビットストリームに整形して出力する処理である。処理を終えると、オーディオ信号符号化処理を終了する。
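The control flow of steps S3 to S8 (frame splitting, PE computation, and the PE-versus-threshold decision) can be sketched as below. The psychoacoustic model is outside the scope of this sketch, so it is passed in as a callback; `toy_pe` is a deliberately crude stand-in and is not a perceptual entropy calculation. Zero-padding of the final frame and all function names are assumptions.

```python
from typing import Callable, Iterator, List, Tuple

FRAME_SAMPLES = 1024  # frame size given for MPEG2-AAC in the text


def split_frames(samples: List[float], frame_len: int = FRAME_SAMPLES) -> Iterator[List[float]]:
    """Step S4: cut the input into fixed-size frames (last frame zero-padded)."""
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        yield frame + [0.0] * (frame_len - len(frame))


def choose_block_types(
    samples: List[float],
    pe_threshold: float,
    estimate_pe: Callable[[List[float]], float],
) -> List[Tuple[float, str]]:
    """Steps S3-S6: compute PE for each frame and pick 'short' or 'long'."""
    decisions = []
    for frame in split_frames(samples):
        pe = estimate_pe(frame)                            # S5, model supplied by caller
        block = "short" if pe > pe_threshold else "long"   # S6
        decisions.append((pe, block))                      # S7/S8 would transform here
    return decisions


def toy_pe(frame: List[float]) -> float:
    """Stand-in activity measure (sum of absolute sample differences), NOT real PE."""
    return sum(abs(b - a) for a, b in zip(frame, frame[1:]))


signal = [0.0] * 1024 + [0.0, 1000.0] * 512   # one quiet frame, one busy frame
print([block for _, block in choose_block_types(signal, 2000, toy_pe)])
```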
【００７８】
以上説明したように、本実施形態におけるオーディオ信号符号化処理では、オーディオ入力信号のジャンルによって、ブロック長判定の際に参照するＰＥ閾値を適宜設定するため、プリエコーの発生を避けながら符号化効率の良い符号化処理が可能となり、特に低ビットレート時に音質の良い符号化処理を実現することができる。例えば、入力信号のジャンルがクラシックの場合、ＰＥ閾値が低めに設定されるため、短いブロック長を選択すべき部分で適切に短いブロック長が選択されるため、プリエコーの発生を低減することが可能となり、逆に、ジャンルがロックの場合、ＰＥ閾値を高めに設定することによって、短いブロック長が過多に適用されて符号化効率が落ち結果として音質が劣化することを防ぐことができる。
【００７９】
（第３の実施形態）
本発明の第３の実施形態では、ユーザが入力したＩＤ３タグ中のジャンルによってＰＥ閾値が決定される場合を説明する。
【００８０】
図１３は、本発明の第３の実施形態におけるオーディオ信号符号化装置の一構成例である。
【００８１】
図示したように、本実施形態におけるオーディオ信号符号化装置は、図１に示した構成例に、８のＩＤ３タグ入力器を付加したものである。
【００８２】
１から７までの構成要素は第１の実施形態で示したものと同様であるので、説明を省略する。
【００８３】
８は、ＩＤ３タグ入力器であり、ユーザのＩＤ３タグ情報の入力を受け付け、その内容を保持する。
【００８４】
上記構成によるオーディオ信号符号化装置におけるオーディオ信号の処理動作を、前述の第１の実施形態の動作と異なる部分を中心に以下に説明する。
【００８５】
まず、処理に先立ち、各部の初期化を行う。
【００８６】
次に、ユーザによって符号化するオーディオ入力信号が指定されるとともに、入力信号に関するＩＤ３タグ情報の入力がＩＤ３タグ入力器８に対して行われる。このとき、ブロック長判定器４では、入力されたＩＤ３タグのジャンルに従って、ＰＥ閾値を変更する。ＰＥ閾値の変更は、ジャンルごとのＰＥ閾値を格納したテーブルを参照し、ＩＤ３タグのジャンルＮｏ（番号）によって選択することによって変更しても良いし、ジャンルごとのＰＥ補正値を格納したテーブルをＩＤ３タグのジャンルＮｏによって検索し、ＰＥ閾値の基準となる値にジャンルごとの補正値を加えても良い。図１５にこの場合のＰＥ閾値テーブルの模式図、図１６にＰＥ補正値テーブルの模式図を示す。なお、本実施形態において、図１６のＰＥ補正値テーブルを用いる場合のＰＥ閾値の基準値は１８００である。本実施形態において、これらのテーブルはブロック長判定器４に格納されている。
【００８７】
その後は、第１の実施形態で前述したように、入力されたオーディオ信号はフレーム分割器１によって分割後、聴覚心理演算器３で分析されてＰＥ値とＳＭＲが算出される。ブロック長判定器４においてＰＥ値とＰＥ閾値が比較され、その結果によってブロック長が決定される。フレーム分割されたオーディオ入力信号は、フィルタバンク２において、決定されたブロック長で周波数スペクトルに変換され、ビット割当て器５が聴覚心理演算器３から出力されたＳＭＲ値にしたがって、各周波数帯域にビットを割り当て、量子化器６は各周波数帯域ごとに正規化係数を算出し、各周波数帯域に割当てられたビットに従って周波数スペクトルを量子化し、ビット整形器７によって定められた書式に応じてビットストリームに整形され、出力される。
【００８８】
全ての入力信号の符号化が終了すると、ＩＤ３タグ入力器８に保持されているＩＤ３タグの各情報が、ビット整形器７によってビットストリームの末尾に付加される。
【００８９】
以下、同様な処理を汎用的なＰＣ上で実装する場合について説明する。
【００９０】
図５に示すオーディオ符号化装置の構成によって、この場合においても前述した第２の実施形態と同様に実施することが可能である。なお、図５に示す構成において、図１５で図示したＰＥ閾値テーブルがメモリ１０１上のワークエリアに予め格納されている。
【００９１】
以下、ＣＰＵ１００において行われる処理の詳細を、図１４に示すフローに従って説明する。
【００９２】
まず、ステップＳ１０１は、符号化する入力オーディオ信号をユーザが端末１０３を用いて指定する処理である。本実施形態において、符号化するオーディオ信号は、外部記憶装置１０４に格納されているオーディオＰＣＭファイルでも良いし、マイク１０６で捉えたリアルタイムの音声信号をアナログ・デジタル変換した信号でも良い。処理を終えると、ステップＳ１０２へ進む。
【００９３】
ステップＳ１０２は、ＩＤ３タグをユーザが端末１０３を用いて指定する処理である。この処理の結果、ＩＤ３タグの一部として、入力オーディオ信号のＩＤ３ジャンルＮｏが指定される。処理を終えると、ステップＳ１０３へ進む。
【００９４】
ステップＳ１０３は、ステップＳ１０２で指定された符号化するオーディオ入力信号のＩＤ３ジャンルＮｏによって、メモリ１０１上のＰＥ閾値テーブルを検索し、メモリ１０１上のＰＥ閾値を変更する処理である。例えば、ステップＳ１０２で指定されたＩＤ３ジャンルＮｏが「２」の場合、図１５で図示したＰＥ閾値テーブルを検索した結果、ＰＥ閾値として１６００がメモリ１０１上の所定領域に格納される。処理を終えると、ステップＳ１０４へ進む。
【００９５】
ステップＳ１０４は、符号化する入力オーディオ信号が終了したかどうかを判定する処理である。入力信号が終了している場合は、ステップＳ１１３へ処理が進む。未終了の場合は、ステップＳ１０５へ処理が進む。
【００９６】
ステップＳ１０５は、入力信号をチャンネルごとに処理単位であるフレームに分割する処理である。第１の実施形態での説明同様、例えば、ＭＰＥＧ２−ＡＡＣの場合、チャンネルごとに１０２４サンプルのフレームに分割する。処理を終えると、ステップＳ１０６へ処理が進む。
【００９７】
ステップＳ１０６は、符号化対象となっているフレームの聴覚心理演算を行う処理である。この演算の結果、処理対象フレームの聴覚エントロピー（ＰＥ）と、量子化単位である分割周波数帯ごとのＳＭＲ値が算出される。処理を終えると、ステップＳ１０７へ処理が進む。
【００９８】
ステップＳ１０７は、ステップＳ１０６で算出された処理対象フレームのＰＥ値と、ステップＳ１０３によって格納されたＰＥ閾値とを比較し、変換ブロック長を判定する処理である。処理対象フレームのＰＥ値がＰＥ閾値よりも大きい場合は、ステップＳ１０８へ処理が進む。そうでない場合は、ステップＳ１０９へ処理が進む。
【００９９】
ステップＳ１０８では、ステップＳ１０７で行われた判定に基づき、処理対象フレームに対してショートブロック（短いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１２８個の周波数成分に分割されたスペクトルの組が８組得られる。処理を終えると、ステップＳ１１０に処理が進む。
【０１００】
ステップＳ１０９では、ステップＳ１０７で行われた判定に基づき、処理対象フレームに対してロングブロック（長いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１０２４の周波数成分に分割されたスペクトルの組が一組だけ得られる。処理を終えると、ステップＳ１１０に処理が進む。
【０１０１】
ステップＳ１１０は、ステップＳ１０６で算出される分割周波数帯域ごとのＳＭＲ値と、ステップＳ１０８もしくはステップＳ１０９で得られる周波数スペクトルと、符号化ビットレートより、各分割周波数帯域に割り当てるビット量を決定する処理である。処理を終えると、ステップＳ１１１へ処理が進む。
【０１０２】
ステップＳ１１１は、各分割周波数帯域ごとのスケールファクタを算出するとともに、ステップＳ１１０で割り当てられたビット量に従って、周波数スペクトルを量子化する処理である。処理を終えると、ステップＳ１１２へ処理が進む。
【０１０３】
ステップＳ１１２は、ステップＳ１１１で算出されたスケールファクタと量子化スペクトルを、符号化方式によって定められたフォーマットに従って整形し、ビットストリームとして出力する処理である。本実施形態において、この処理によって出力されるビットストリームは、外部記憶装置に格納されても良いし、あるいは、通信インターフェースを介して回線網に繋がっている外部機器に出力されても良い。処理を終えると、ステップＳ１０４へ処理が進む。
【０１０４】
ステップＳ１１３は、聴覚心理演算や直交変換などで生じる遅延によってまだ出力されていない量子化スペクトルがメモリ上に残っているため、それらをビットストリームに整形して出力する処理である。処理を終えると、ステップＳ１１４へ処理が進む。
【０１０５】
ステップＳ１１４は、作成された符号化ビットストリームの末尾に、ステップＳ１０２で入力されたＩＤ３タグ情報を適宜コード化して付加する処理である。これによって、この符号化データを復号化する際に、このデータのＩＤ３タグ情報にアクセスすることが可能となり、様々な情報を復号処理側で利用することが可能となる。処理を終えると、オーディオ信号符号化処理を終了する。
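Step S114 appends the tag to the end of the encoded bit stream. The description does not name an ID3 version; the sketch below assumes the fixed 128-byte ID3v1 layout ("TAG", title, artist, album, year, comment, one genre byte), which is the form that is conventionally placed at the end of a file. The helper names are hypothetical.

```python
import struct


def build_id3v1_tag(title: str, artist: str, album: str, year: str,
                    comment: str, genre_no: int) -> bytes:
    """Pack the fields into the fixed 128-byte ID3v1 layout (assumed here)."""
    def field(text: str, size: int) -> bytes:
        # Encode and cut to size; struct's "Ns" format NUL-pads shorter values.
        return text.encode("latin-1", errors="replace")[:size]

    return struct.pack(
        "<3s30s30s30s4s30sB",
        b"TAG",
        field(title, 30), field(artist, 30), field(album, 30),
        field(year, 4), field(comment, 30),
        genre_no & 0xFF,
    )


def append_tag(bitstream: bytes, tag: bytes) -> bytes:
    """Step S114: place the tag after the last encoded frame."""
    return bitstream + tag


tag = build_id3v1_tag("Title", "Artist", "Album", "2003", "", genre_no=2)
print(len(tag), tag[:3], tag[-1])
```

Because the genre number sits in the last byte of the tag, a decoder (or a re-encoder) can recover the same genre number that was used to select the PE threshold in step S103.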
【０１０６】
上記説明したように、本実施形態においては、オーディオ符号化データの標準的なメタデータであるＩＤ３タグを入力することによって、適宜ＰＥ閾値が変更されることにより、予め定められているジャンルからユーザはジャンルを選択すればよいため、符号化時のユーザの利便性が向上し、かつ音質の良い符号化データを得ることができる。
【０１０７】
（第４の実施形態）
本発明の第４の実施形態では、ＰＥ閾値をＰＥ履歴によって調整する場合の例を示す。
【０１０８】
図１７は、本発明の第４の実施形態におけるオーディオ信号符号化装置の一構成例である。
【０１０９】
図示したように、本実施形態におけるオーディオ信号符号化装置は、図１に示した構成例に、９のＰＥ履歴一時記憶器を付加したものである。
【０１１０】
１から７までの構成要素は第１の実施形態で示したものと同様であるので、説明を省略する。
【０１１１】
９はＰＥ履歴一時記憶器であり、聴覚心理演算器３によってオーディオ入力信号を解析した結果出力される各フレームのＰＥ値と、その結果ブロック長判定器４によって決定されたブロック長の組を、新しいものから予め決められた数だけ順次格納する。本実施形態においては、１００フレーム分のＰＥ履歴を格納することができる。ＰＥ履歴一時記憶器９に格納されるＰＥ履歴テーブルの模式図を図１８に示す。
【０１１２】
上記構成によるオーディオ信号符号化装置におけるオーディオ信号の処理動作を、前述の第１の実施形態の動作と異なる部分を中心にして以下に説明する。
【０１１３】
なお、本実施形態では説明の便宜のために符号化方式としてＭＰＥＧ２−ＡＡＣを例にとって説明するが、ＭＰ３など、ＰＥによってブロック長切り替え判定を行うその他の符号化方式についても全く同様な方法で実現可能である。
【０１１４】
まず、処理に先立ち、各部の初期化を行う。このとき、ブロック長判定器４では、外部から入力されたオーディオ信号のジャンル情報によって、ブロック長判定器４内部のＰＥ閾値テーブルを参照し、初期のＰＥ閾値を設定する。
【０１１５】
オーディオＰＣＭ信号などの入力オーディオ信号はフレーム分割器１によってフレーム単位に分割され、フィルタバンク２と聴覚心理演算器３に送出される。
【０１１６】
入力オーディオ信号はフレーム毎に聴覚心理演算器３によって聴覚エントロピー（ＰＥ）と各周波数帯域ごとのマスキング計算が行われる。算出された聴覚エントロピー値はブロック長判定器４に送出される。
【０１１７】
ブロック長判定器４は、ＰＥ値の入力をトリガとして、ＰＥ閾値の補正を開始する。本実施形態におけるＰＥ閾値の補正処理は、図２０のフローを用いて後述する。この間、ブロック長判定器４は、処理に必要な履歴情報を逐次ＰＥ履歴一時記憶器９にアクセスして取り出すことが可能である。
【０１１８】
ＰＥ閾値の補正が終了すると、ブロック長判定器４は入力ＰＥ値と補正後のＰＥ閾値とを比較する。ＰＥ閾値よりも当該フレームのＰＥ値が大きい場合は、短いブロック長を使用することが判定され、そうでない場合は、長いブロック長を使用することが判定される。ブロック長が決定されると、ブロック長判定器４はＰＥ値と選択ブロック長をＰＥ履歴一時記憶器９に送出する。ＰＥ履歴一時記憶器９は、最新のＰＥと決定ブロック長をＰＥ履歴テーブルに保存する。このとき、履歴情報の数が予め定められた数を超えた場合は、最も古い履歴を削除する。例えば、図１８に図示したＰＥ履歴テーブルの場合、最新の履歴を追加する場合は、最も古い履歴である、Ｎｏ．１００の行がテーブルより削除され、図中の最も上の行に新規履歴となるＰＥ値とブロック種別が格納され、Ｎｏ．が振りなおされる。
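The PE history table of paragraph 0118 (newest entry on top, at most 100 entries, oldest entry dropped when full) maps naturally onto a bounded double-ended queue. The sketch below uses Python's `collections.deque` with `maxlen`; the names `pe_history` and `record_frame` are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Tuple

HISTORY_SIZE = 100  # number of frames kept, as stated for this embodiment

# Index 0 is the newest entry, matching the "top row is the newest" table layout.
pe_history: Deque[Tuple[float, str]] = deque(maxlen=HISTORY_SIZE)


def record_frame(history: Deque[Tuple[float, str]], pe: float, block: str) -> None:
    """Store (PE, block type); a full deque discards its oldest (rightmost) entry."""
    history.appendleft((pe, block))


for frame_no in range(105):
    record_frame(pe_history, float(frame_no), "long")

print(len(pe_history), pe_history[0], pe_history[-1])
```

With this structure the renumbering of rows described in the text is implicit: the row number is simply the position in the queue.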
【０１１９】
その後は、第１の実施形態で前述したように、オーディオ入力信号はフィルタバンク２において、決定されたブロック長で、周波数スペクトルに変換され、ビット割当て器５が聴覚心理演算器３から出力されたＳＭＲ値にしたがって、各周波数帯域にビットを割り当て、量子化器６は各周波数帯域ごとに正規化係数を算出し、各周波数帯域に割当てられたビットに従って周波数スペクトルを量子化し、ビット整形器７によって定められた書式に応じてビットストリームに整形され、出力される。
【０１２０】
以下、同様な処理を汎用的なＰＣ上で実装する場合について説明する。
【０１２１】
図５に示すオーディオ符号化装置の構成によって、この場合においても前述した第２の実施形態と同様に実施することが可能である。なお、図５に示す構成において、図１８に示すＰＥ履歴テーブルは、メモリ１０１上のワークエリアに格納されている。
【０１２２】
以下、ＣＰＵ１００において行われる処理の詳細を、図１９に示すフローに従って説明する。
【０１２３】
ステップＳ２０１は、ユーザが符号化するオーディオ入力信号と、そのジャンルを指定する処理である。処理を終えると、ステップＳ２０２へ進む。
【０１２４】
ステップＳ２０２は、ステップＳ２０１で指定されたジャンルに従って、ＰＥ閾値テーブルを検索し、ＰＥ閾値を変更する処理である。処理を終えると、ステップＳ２０３へ進む。
【０１２５】
ステップＳ２０３は、オーディオ入力信号が終了したかどうかを判定する処理である。入力信号が終了している場合は、ステップＳ２１４へ進む。入力信号が未終了である場合は、ステップＳ２０４へ処理が進む。
【０１２６】
ステップＳ２０４は、符号化単位である１フレーム分のオーディオ入力信号をメモリ１０１上に読み出す処理である。処理を終えると、ステップＳ２０５へ進む。
【０１２７】
ステップＳ２０５は、符号化対象となっているフレームの聴覚心理演算を行う処理である。この演算の結果、処理対象フレームの聴覚エントロピー（ＰＥ）と、量子化単位である分割周波数帯ごとのＳＭＲ値が算出される。処理を終えると、ステップＳ２０６へ進む。
【０１２８】
ステップＳ２０６は、ステップＳ２０５で算出された処理対象フレームのＰＥ値と、ステップＳ２０２によって格納されたＰＥ閾値とを比較し、変換ブロック長を判定する処理である。処理対象フレームのＰＥ値がＰＥ閾値よりも大きい場合は、ステップＳ２０７へ処理が進む。そうでない場合は、ステップＳ２０８へ処理が進む。
【０１２９】
ステップＳ２０７では、ステップＳ２０６で行われた判定に基づき、処理対象フレームに対してショートブロック（短いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１２８の周波数成分に分割されたスペクトルの組が８組得られる。処理を終えると、ステップＳ２０９に処理が進む。
【０１３０】
ステップＳ２０８では、ステップＳ２０６で行われた判定に基づき、処理対象フレームに対してロングブロック（長いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１０２４の周波数成分に分割されたスペクトルの組が一組だけ得られる。処理を終えると、ステップＳ２０９に処理が進む。
【０１３１】
ステップＳ２０９は、ステップＳ２０５で算出される分割周波数帯域ごとのＳＭＲ値と、ステップＳ２０７やステップＳ２０８で得られる周波数スペクトルと、符号化ビットレートより、各分割周波数帯域に割り当てるビット量を決定する処理である。処理を終えると、ステップＳ２１０へ処理が進む。
【０１３２】
ステップＳ２１０は、各分割周波数帯域ごとのスケールファクタを算出するとともに、ステップＳ２０９で割り当てられたビット量に従って、周波数スペクトルを量子化する処理である。処理を終えると、ステップＳ２１１へ処理が進む。
【０１３３】
ステップＳ２１１は、ステップＳ２１０で算出されたスケールファクタと量子化スペクトルを、符号化方式によって定められたフォーマットに従って整形し、ビットストリームとして出力する処理である。本実施形態において、この処理によって出力されるビットストリームは、外部記憶装置１０４に格納されても良いし、あるいは、通信インターフェース１０９を介して回線網１０８に繋がっている外部機器に出力されても良い。処理を終えると、ステップＳ２１２へ処理が進む。
【０１３４】
ステップＳ２１２は、処理中のフレームのＰＥ値と選択ブロックを、メモリ１０１中のＰＥ履歴テーブルに格納する処理である。なお、ＰＥ履歴テーブルが一杯になっている場合、最も古い履歴を削除して、このフレームの履歴を格納する。処理を終えると、ステップＳ２１３へ進む。
【０１３５】
ステップＳ２１３は、ＰＥ履歴テーブルに格納されている履歴によって、ＰＥ閾値を補正する処理である。この処理の詳細は、図２０を用いて後述する。処理を終えると、ステップＳ２０３へ進む。
【０１３６】
ステップＳ２１４は、聴覚心理演算や直交変換などで生じる遅延によってまだ出力されていない量子化スペクトルがメモリ１０１上に残っているため、それらをビットストリームに整形して出力する処理である。処理を終えると、オーディオ信号符号化処理を終了する。
【０１３７】
図２０は、本実施形態におけるステップＳ２１３の履歴によるＰＥ補正処理を詳細化したフローである。
【０１３８】
ステップＳ３０１は、ＰＥ履歴テーブルにショートブロックがあるかどうかを検索する処理である。ＰＥ履歴テーブルにショートブロックが一つでもある場合は、ステップＳ３０２へ処理が進む。そうでない場合は、ステップＳ３１１へ処理が進む。
【０１３９】
ステップＳ３０２は、ＰＥ履歴テーブルを参照して、履歴に残っているロングブロックの平均ＰＥ値（以下、Ｌｏｎｇ＿ＰＥ）を算出する処理である。処理を終えると、ステップＳ３０３へ処理が進む。
【０１４０】
ステップＳ３０３は、ＰＥ履歴テーブルを参照してショートブロックの平均ＰＥ値（以下、Ｓｈｏｒｔ＿ＰＥ）を算出する処理である。処理を終えると、ステップＳ３０４へ処理が進む。
【０１４１】
ステップＳ３０４は、平均ＰＥ値、すなわち、（Ｌｏｎｇ＿ＰＥ＋Ｓｈｏｒｔ＿ＰＥ）／２を算出する処理である。本実施形態において、平均ＰＥ値としてこのような演算を行うのは、一つのオーディオ入力信号（楽曲）においても、ＰＥ値が高い部分とＰＥ値が低い部分、すなわち、ショートブロックとして処理するフレームの出現する頻度が異なる場合があるためである。このような場合、ＰＥ平均値を全てのフレームの平均値とすると、同じオーディオ入力信号を符号化する処理の途中でＰＥ平均値が乱高下して、その後の補正処理で妥当な補正ができなくなり、結果として作成されたビットストリームの音質が安定しない恐れがある。よって、このような計算方法を取っている。処理を終えると、ステップＳ３０５へ処理が進む。
【０１４２】
ステップＳ３０５は、ステップＳ３０４で求めた平均ＰＥ値が現在のＰＥ閾値よりも大きいかどうかを判定する処理である。この判定の結果、平均ＰＥ値の方が大きい場合はステップＳ３０６に処理が進む。そうでない場合は、ステップＳ３０７へ処理が進む。
【０１４３】
ステップＳ３０６は、現在のＰＥ閾値に１０を加えてＰＥ閾値を補正する処理である。すなわち、ステップＳ３０４で算出した平均ＰＥ値が現在のＰＥ閾値よりも大きい場合は、ＰＥ閾値をプラス（＋）方向に補正する。このようにすることで、ＰＥ値が高めに推移している楽曲に対して適切にＰＥ閾値を補正することができる。処理を終えると、ステップＳ３０９へ処理が進む。
【０１４４】
ステップＳ３０７は、ステップＳ３０４で求めた平均ＰＥ値が現在のＰＥ閾値よりも小さいかどうかを判定する処理である。この判定の結果、平均ＰＥ値の方が小さい場合は、ステップＳ３０８に処理が進む。そうでない場合は、ステップＳ３０９へ進む。
【０１４５】
ステップＳ３０８は、現在のＰＥ閾値から１０を引いてＰＥ閾値を補正する処理である。すなわち、ステップＳ３０４で算出した平均ＰＥ値が現在のＰＥ閾値よりも小さい場合は、ＰＥ閾値をマイナス方向に補正する。このようにすることで、ＰＥ値が低めに推移している楽曲に対して適切にＰＥ閾値を補正することができる。処理を終えると、ステップＳ３０９へ処理が進む。
【０１４６】
ステップＳ３０９は、ＰＥ履歴テーブル中にショートブロックとなった履歴の数をカウントし、２０以上であるかどうかを判定する処理である。判定の結果、ショートブロックの履歴が２０以上ある場合はステップＳ３１０に処理が進む。そうでない場合は、履歴によるＰＥ補正処理を終了し、リターンする。
【０１４７】
ステップＳ３１０は、ＰＥ閾値に１０を加えてＰＥ閾値を補正する処理である。すなわち、１００フレームの履歴中に２０以上ものショートブロックが存在する場合は、明らかにショートブロックを多用しすぎて符号化効率が低下していると考えられるため、ＰＥ閾値をプラス方向に補正する。このようにすることで、ＰＥ閾値が低いことによる符号化効率の低下を抑制する制御が可能になる。処理を終えると、履歴によるＰＥ補正処理を終了し、リターンする。
【０１４８】
ステップＳ３１１は、現在のＰＥ閾値が閾値下限に達しているかどうかを判定する処理である。判定の結果、ＰＥ閾値が閾値下限に達している場合は履歴によるＰＥ補正処理を終了し、リターンする。そうでない場合は、ステップＳ３１２に処理が進む。
【０１４９】
ステップＳ３１２は、ＰＥ閾値から１０を引いてＰＥ閾値を補正する処理である。すなわち、履歴が全てロングブロックで構成されている場合は、入力信号に対してＰＥ閾値が高めに設定され、本来ショートブロックで符号化されるべきフレームを見落としている恐れがあるため、マイナス方向に補正する。ただし、予め設定した閾値下限を超えない程度に補正を行う。これは、ほぼ変化のないような静かな楽曲の場合、ＰＥ閾値が極端に下がってしまうと本来ロングブロックで符号化すべき箇所までショートブロックで符号化され、符号化効率が下がってしまう現象を防ぐためである。処理を終えると、補正処理を終了してリターンする。
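The correction flow of Fig. 20 (steps S301 to S312) can be sketched as a single function. The step size 10 and the short-block count 20 come from the description; the lower limit value is not given there, so `LOWER_LIMIT` is a hypothetical placeholder. Two further points are assumptions of this sketch: the count test is taken as "20 or more", and when the history holds short blocks but no long blocks (a case the flow does not address) the short-block mean is used for both terms of the average.

```python
from statistics import mean
from typing import Iterable, Tuple

STEP = 10            # correction step used in S306, S308, S310 and S312
SHORT_LIMIT = 20     # short-block count checked in S309
LOWER_LIMIT = 1000   # hypothetical: the text does not state the lower bound


def correct_pe_threshold(threshold: float,
                         history: Iterable[Tuple[float, str]],
                         lower_limit: float = LOWER_LIMIT) -> float:
    """Return the PE threshold after one pass of the history-based correction."""
    entries = list(history)
    short_pe = [pe for pe, block in entries if block == "short"]
    long_pe = [pe for pe, block in entries if block == "long"]

    if not short_pe:                                   # S301 -> S311
        if threshold <= lower_limit:                   # S311: already at the limit
            return threshold
        return max(threshold - STEP, lower_limit)      # S312

    # S302-S304: average of the two per-class means.
    # Assumption: with no long blocks in the history, reuse the short mean.
    long_mean = mean(long_pe) if long_pe else mean(short_pe)
    average = (long_mean + mean(short_pe)) / 2

    if average > threshold:                            # S305 -> S306
        threshold += STEP
    elif average < threshold:                          # S307 -> S308
        threshold -= STEP

    if len(short_pe) >= SHORT_LIMIT:                   # S309 -> S310
        threshold += STEP
    return threshold


history = [(1500.0, "long")] * 70 + [(2500.0, "short")] * 30
print(correct_pe_threshold(1800, history))
```

In the usage line the class means are 1500 and 2500, so the average 2000 exceeds the threshold 1800 (one +10 step), and 30 short blocks out of 100 trigger the second +10 step.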
【０１５０】
以上説明したように、本実施形態では、ＰＥ履歴を用いてＰＥ閾値の補正を処理中に適宜行うことによって、同じジャンルの楽曲でも入力信号に応じたＰＥ閾値を適宜補正することができるため、更に適切なブロック判定を行うことが可能になり、その結果、更に音質の良いビットストリームを作成することができる。
【０１５１】
（第５の実施形態）
本発明の第５の実施形態では、入力信号を解析してジャンルを自動判定する場合の例を示す。
【０１５２】
図２１は、本発明の第５の実施形態におけるオーディオ信号符号化装置の一構成例である。
【０１５３】
図示したように、本実施形態におけるオーディオ信号符号化装置は、図１に示した構成例に、１０のジャンル自動判定器を付加したものである。
【０１５４】
１から７までの構成要素は第１の実施形態で示したものと同様であるので、説明を省略する。
【０１５５】
１０は、ジャンル判定器であり、オーディオ入力信号を解析して、そのジャンルを判定する。
【０１５６】
上記構成によるオーディオ信号符号化装置におけるオーディオ信号の処理動作を、前述の第１の実施形態の動作と異なる部分を中心に以下に説明する。
【０１５７】
まず、処理に先立ち、各部の初期化を行う。
【０１５８】
次に、ユーザによって符号化するオーディオ入力信号が指定される。
【０１５９】
次に、ジャンル判定器１０は、指定された入力信号を解析して、自動的にジャンルを判定し、ブロック長判定器４に送出する。このジャンル判定方法の詳細については、図２３を用いて後述する。ブロック長判定器４は、受け取ったジャンルに従って、ＰＥ閾値を変更する。ＰＥ閾値の変更は、ジャンルごとのＰＥ閾値を格納したテーブルを参照することによって変更しても良いし、ジャンルごとのＰＥ補正値を格納したテーブルを検索し、ＰＥ閾値の基準となる値にジャンルごとの補正値を加えても良い。図３にこの場合のＰＥ閾値テーブルの構成例、図４にＰＥ補正値テーブルの構成例を示す。なお、本実施形態において、図４の補正値テーブルを用いる場合のＰＥの基準値は１８００である。本実施形態において、これらのテーブルはブロック長判定器４に格納されている。
【０１６０】
その後は、第１の実施形態で前述したように、入力されたオーディオ信号はフレーム分割器１によって分割後、聴覚心理演算器３で分析されてＰＥ値とＳＭＲが算出される。ブロック長判定器４においてＰＥ値とＰＥ閾値が比較され、その結果によってブロック長が決定される。フレーム分割されたオーディオ入力信号は、フィルタバンク２において、決定されたブロック長で周波数スペクトルに変換され、ビット割当て器５が聴覚心理演算器３から出力されたＳＭＲ値にしたがって、各周波数帯域にビットを割り当て、量子化器６は各周波数帯域ごとに正規化係数を算出し、各周波数帯域に割当てられたビットに従って周波数スペクトルを量子化し、ビット整形器７によって定められた書式に応じてビットストリームに整形され、出力される。
【０１６１】
以下、同様な処理を汎用的なＰＣ上で実装する場合について説明する。
【０１６２】
図５に示すオーディオ符号化装置の構成によって、この場合においても前述した第２の実施形態と同様に実施することが可能である。
【０１６３】
以下、ＣＰＵ１００において行われる処理の詳細を、図２２に示すフローに従って説明する。
【０１６４】
図２２は、本実施形態におけるオーディオ信号符号化処理のフローである。
【０１６５】
まず、ステップＳ４０１は、符号化する入力オーディオ信号をユーザが端末１０３を用いて指定する処理である。本実施形態において、符号化するオーディオ信号は、外部記憶装置１０４に格納されているオーディオＰＣＭファイルでも良いし、マイク１０６で捉えたリアルタイムの音声信号をアナログ・デジタル変換した信号でも良い。処理を終えると、ステップＳ４０２へ進む。
【０１６６】
ステップＳ４０２は、入力オーディオ信号のジャンルを自動判定する処理である。この処理の詳細は図２３を用いて後述する。処理を終えると、ステップＳ４０３へ進む。
【０１６７】
ステップＳ４０３は、ステップＳ４０２で判定された符号化するオーディオ入力信号のジャンルによって、メモリ１０１上のＰＥ閾値テーブルを検索し、メモリ１０１上のＰＥ閾値を変更する処理である。処理を終えると、ステップＳ４０４へ進む。
【０１６８】
ステップＳ４０４は、符号化する入力オーディオ信号が終了したかどうかを判定する処理である。入力信号が終了している場合は、ステップＳ４１３へ処理が進む。未終了の場合は、ステップＳ４０５へ処理が進む。
【０１６９】
ステップＳ４０５は、入力信号をチャンネルごとに処理単位であるフレームに分割する処理である。第１の実施形態での説明同様、例えば、ＭＰＥＧ２−ＡＡＣの場合、チャンネルごとに１０２４サンプルのフレームに分割する。処理を終えると、ステップＳ４０６へ処理が進む。
【０１７０】
ステップＳ４０６は、符号化対象となっているフレームの聴覚心理演算を行う処理である。この演算の結果、処理対象フレームの聴覚エントロピー（ＰＥ）と、量子化単位である分割周波数帯ごとのＳＭＲ値が算出される。処理を終えると、ステップＳ４０７へ処理が進む。
【０１７１】
ステップＳ４０７は、ステップＳ４０６で算出された処理対象フレームのＰＥ値と、ステップＳ４０３によって格納されたＰＥ閾値とを比較し、変換ブロック長を判定する処理である。処理対象フレームのＰＥ値がＰＥ閾値よりも大きい場合は、ステップＳ４０８へ処理が進む。そうでない場合は、ステップＳ４０９へ処理が進む。
【０１７２】
ステップＳ４０８では、ステップＳ４０７で行われた判定に基づき、処理対象フレームに対してショートブロック（短いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１２８個の周波数成分に分割されたスペクトルの組が８組得られる。処理を終えると、ステップＳ４１０に処理が進む。
【０１７３】
ステップＳ４０９では、ステップＳ４０７で行われた判定に基づき、処理対象フレームに対してロングブロック（長いブロック長）による直交変換を行う。ＭＰＥＧ２−ＡＡＣの場合、この結果、１０２４の周波数成分に分割されたスペクトルの組が一組だけ得られる。処理を終えると、ステップＳ４１０に処理が進む。
【０１７４】
ステップＳ４１０は、ステップＳ４０６で算出される分割周波数帯域ごとのＳＭＲ値と、ステップＳ４０８もしくはステップＳ４０９で得られる周波数スペクトルと、符号化ビットレートより、各分割周波数帯域に割り当てるビット量を決定する処理である。このような処理は本実施形態のような変換符号化方法において一般的であるので、詳細は説明しない。処理を終えると、ステップＳ４１１へ処理が進む。
【０１７５】
ステップＳ４１１は、各分割周波数帯域ごとのスケールファクタを算出するとともに、ステップＳ４１０で割り当てられたビット量に従って、周波数スペクトルを量子化する処理である。処理を終えると、ステップＳ４１２へ処理が進む。
【０１７６】
ステップＳ４１２は、ステップＳ４１１で算出されたスケールファクタと量子化スペクトルを、符号化方式によって定められたフォーマットに従って整形し、ビットストリームとして出力する処理である。本実施形態において、この処理によって出力されるビットストリームは、外部記憶装置１０４に格納されても良いし、あるいは、通信インターフェース１０９を介して回線網１０８に繋がっている外部機器に出力されても良い。処理を終えると、ステップＳ４０４へ処理が進む。
【０１７７】
ステップＳ４１３は、聴覚心理演算や直交変換などで生じる遅延によってまだ出力されていない量子化スペクトルがメモリ上に残っているため、それらをビットストリームに整形して出力する処理である。処理を終えると、オーディオ信号符号化処理を終了する。
【０１７８】
図２３は、本実施形態におけるステップＳ４０２のジャンル自動判定処理を詳細化したフローである。
【０１７９】
ステップＳ５０１は、オーディオ入力信号の振幅の強弱と繰り返しパターンを検知することにより、オーディオ入力信号のリズムを抽出する処理である。処理を終えると、ステップＳ５０２へ処理が進む。
【０１８０】
ステップＳ５０２は、ステップＳ５０１で抽出されたリズムと、入力オーディオ信号のサンプリングレートより、入力信号のテンポを算出する処理である。処理を終えると、ステップＳ５０３へ処理が進む。
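Step S502 derives the tempo from the extracted rhythm and the sampling rate. The description does not give a formula; one plausible reading, assumed here, is that the rhythm extraction yields a beat period measured in samples, which converts to beats per minute as below. The function name and the beats-per-minute unit are assumptions.

```python
def tempo_bpm(beat_interval_samples: float, sample_rate_hz: int) -> float:
    """Convert a beat period measured in samples to beats per minute."""
    if beat_interval_samples <= 0:
        raise ValueError("beat interval must be positive")
    return 60.0 * sample_rate_hz / beat_interval_samples


# A beat every 22050 samples at 44.1 kHz is one beat every half second.
print(tempo_bpm(22050, 44100))
```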
【０１８１】
ステップＳ５０３は、ＦＦＴなどの直交変換を用いて、オーディオ入力信号を周波数スペクトルに変換し、純音成分の分析などを行う処理である。処理を終えると、ステップＳ５０４へ処理が進む。
【０１８２】
ステップＳ５０４は、ステップＳ５０３で行った周波数解析結果を基に、楽曲のメロディや和音構成、コード進行等を分析して曲調の抽出を行う処理である。処理を終えると、ステップＳ５０５へ処理が進む。
【０１８３】
ステップＳ５０５は、ここまでに抽出されたリズム、テンポ、曲調を綜合的に分析して、入力信号のジャンルを判定する処理である。この判定方法としては、例えばこれらの特徴とジャンルが格納されたデータベースを検索し、もっとも近似しているパターンを持つジャンルを選択することによってジャンルを判定する、あるいは、ジャンルごとの楽曲の特徴を学習したニューラルネットワーク回路によって判定するなどの手法があるが、これらは全て公知であるため、詳述しない。この判定の結果、ジャンルが判定不可能である場合はステップＳ５０６へ処理が進む。ジャンル判定ができた場合は、ステップＳ５０７へ処理が進む。
【０１８４】
ステップＳ５０６は、ジャンル情報としてデフォルトジャンルを選択する処理である。本実施形態において、デフォルトジャンルとは、ジャンル判定不可能な入力信号に与えられるジャンルであり、この情報がジャンルとして設定されると、図２２のステップＳ４０３において、ＰＥ閾値としてデフォルトの値が設定される。処理を終えると、ステップＳ５０７へ処理が進む。
【０１８５】
ステップＳ５０７は、ステップＳ５０５やステップＳ５０６で選択されたジャンル情報をメモリ１０１上のワークエリアに格納する処理である。処理を終えると、ジャンル自動判定処理を終了してリターンする。
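Steps S505 to S507 select the genre whose stored pattern is closest to the extracted features and fall back to a default genre when no decision can be made. A minimal nearest-pattern sketch follows. The feature set (tempo and a 0 to 1 rhythm-strength value), the reference patterns, the Euclidean distance on unnormalized features, and the cut-off `MAX_DISTANCE` are all hypothetical; the description only says that a database search or a trained neural network may be used.

```python
import math
from typing import Dict, Sequence

DEFAULT_GENRE = "default"   # genre assigned when no pattern matches (S506)

# Hypothetical reference patterns: (tempo in BPM, rhythm strength 0-1).
GENRE_PATTERNS: Dict[str, Sequence[float]] = {
    "rock": (140.0, 0.9),
    "classical": (80.0, 0.2),
}
MAX_DISTANCE = 40.0         # hypothetical cut-off for "genre cannot be decided"


def classify_genre(features: Sequence[float]) -> str:
    """S505-S507: pick the nearest stored pattern, else the default genre."""
    best_genre, best_dist = DEFAULT_GENRE, math.inf
    for genre, pattern in GENRE_PATTERNS.items():
        dist = math.dist(features, pattern)
        if dist < best_dist:
            best_genre, best_dist = genre, dist
    return best_genre if best_dist <= MAX_DISTANCE else DEFAULT_GENRE


print(classify_genre((135.0, 0.8)), classify_genre((300.0, 0.5)))
```

Returning the default genre for distant inputs corresponds to step S506, after which step S403 would load the default PE threshold.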
【０１８６】
以上説明したように、本実施形態においては、ユーザはオーディオ入力信号を指定するだけで、自動的にジャンルが判定され、適宜ジャンルに応じたＰＥ閾値が設定されるため、ユーザに余計な手間をかけることなく音質の良いビットストリームを得ることが可能となる。
【０１８７】
（その他の実施形態）
なお、本発明は上述した実施形態に限定されるものではない。
【０１８８】
上述の第２の実施形態では、図３のＰＥ閾値テーブルを検索することによって、ジャンルによるＰＥ閾値変更を行っているが、これは、図４のＰＥ閾値補正テーブルを検索することでも実現可能である。図９は、この場合のメモリマップ構成図である。このメモリマップは、基本的には図８のメモリマップと同じであるが、ワークエリアにはＰＥ閾値、ＰＥ基準値及びＰＥ閾値補正テーブルが格納される。
【０１８９】
また、上述の第２の実施形態では、メモリマップ上にＰＥ閾値を格納しているが、その代わりに、ＰＥ閾値テーブル上の値を指すポインタを格納しても同様の処理が実現可能である。
【０１９０】
また、上述の実施形態では、特に記録媒体に関して言及していないが、これは、ＦＤ、ＨＤＤ、ＣＤ、ＤＶＤ、ＭＯ、半導体メモリなど、どのような記録媒体を用いても適用可能である。
【０１９１】
また、上述の第４の実施形態では、ＰＥ履歴テーブルを通常の表形式のデータとして構成しているが、これは予め定められた数のリストを格納できるリングバッファを用いて構成しても適用可能である。
【０１９２】
また、上述の第４の実施形態では、１フレームの処理を行う度にＰＥ閾値の補正を行っているが、これは数フレーム、もしくは数十フレーム単位の頻度で行ってもよい。
【０１９３】
以上説明したように、上記実施形態によれば、オーディオ入力信号のジャンルによって、ブロック長判定の際に参照するＰＥ閾値を適宜設定するため、プリエコーの発生を避けながら符号化効率の良い符号化処理が可能となり、特に低ビットレート時に音質の良い符号化処理を実現することができる。例えば、入力信号のジャンルがクラシックの場合、ＰＥ閾値が低めに設定されるため、短いブロック長を選択すべき部分で適切に短いブロック長が選択されるため、プリエコーの発生を低減することが可能となり、逆に、ジャンルがロックの場合、短いブロック長が過多に適用されて符号化効率が落ち、結果として音質が劣化することを防ぐことができる。すなわち、入力信号の特性に応じた適切なブロック長選択が可能になり、符号化効率の低下を防止しつつ、プリエコーの発生を抑えて高音質なビットストリームを作成することができる。
【０１９４】
更に、上記実施形態によれば、符号化オーディオデータにおいて一般的に利用されているメタデータであるＩＤ３タグのジャンル情報によってＰＥ閾値を変更することによって、ユーザに意識させることなく適宜ＰＥ閾値を設定することが可能になり、ユーザに余計な手間をかけることなく音質の良いビットストリームを得ることができる。すなわち、ユーザに余計な手間をかけることなく入力信号のジャンルに応じた適切なブロック長の選択が可能になる。
【０１９５】
更に、上記実施形態によれば、過去のＰＥ値とブロック選択情報の履歴によって、ＰＥ閾値を適宜補正することによって、入力信号に応じたＰＥ閾値の補正が可能になり、ジャンル選択だけでは細かな調整が行えなかったブロック選択が適宜行われるようにすることが可能となり、更に音質の良いビットストリームが得られる。すなわち、入力信号に応じて適切にブロック長選択を制御することが可能になり、更に高品質なビットストリームを提供することができる。
【０１９６】
更に、上記実施形態によれば、入力信号のジャンルを自動的に判断し、それによってＰＥ閾値を変更することによって、ユーザに意識させることなく入力信号のジャンルに応じたＰＥ閾値の設定が可能となり、ユーザの利便性が向上する。
【０１９７】
本実施形態は、コンピュータがプログラムを実行することによって実現することができる。また、プログラムをコンピュータに供給するための手段、例えばかかるプログラムを記録したＣＤ−ＲＯＭ等のコンピュータ読み取り可能な記録媒体又はかかるプログラムを伝送するインターネット等の伝送媒体も本発明の実施形態として適用することができる。また、上記のプログラムを記録したコンピュータ読み取り可能な記録媒体等のコンピュータプログラムプロダクトも本発明の実施形態として適用することができる。上記のプログラム、記録媒体、伝送媒体及びコンピュータプログラムプロダクトは、本発明の範疇に含まれる。記録媒体としては、例えばフレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ等を用いることができる。
【０１９８】
なお、上記実施形態は、何れも本発明を実施するにあたっての具体化の例を示したものに過ぎず、これらによって本発明の技術的範囲が限定的に解釈されてはならないものである。すなわち、本発明はその技術思想、またはその主要な特徴から逸脱することなく、様々な形で実施することができる。
【０１９９】
【発明の効果】
以上説明したように、本発明によれば、オーディオ入力信号のジャンルに応じて、変換ブロック長決定の際に参照する聴覚エントロピー閾値を決定するため、プリエコーの発生を避けながら符号化効率の良い符号化処理が可能となり、特に低ビットレート時に音質の良い符号化処理を実現することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態におけるオーディオ信号符号化装置の一構成例を示す図である。
【図２】本発明の第２の実施形態におけるオーディオ信号符号化処理のフローチャートである。
【図３】本発明の第１の実施形態におけるＰＥ閾値テーブルの模式図である。
【図４】本発明の第１の実施形態におけるＰＥ補正値テーブルの模式図である。
【図５】本発明の第２の実施形態におけるオーディオ信号符号化装置の一構成例を示す図である。
【図６】本発明の第２の実施形態におけるオーディオ信号符号化処理プログラムを格納した記憶媒体の内容構成例を示す図である。
【図７】本発明の第２の実施形態におけるオーディオ信号符号化処理をパーソナルコンピュータに導入する模式図である。
【図８】本発明の第２の実施形態におけるメモリマップ構成図である。
【図９】本発明の他の実施形態におけるメモリマップ構成図である。
【図１０】オーディオ信号の模式図である。
【図１１】図１０で示されるオーディオ信号を２０４８サンプル単位で符号化、復号化した場合のオーディオ信号の模式図である。
【図１２】図１０で示されるオーディオ信号を２５６サンプル単位で符号化、復号化した場合のオーディオ信号の模式図である。
【図１３】本発明の第３の実施形態におけるオーディオ信号符号化装置の一構成例を示す図である。
【図１４】本発明の第３の実施形態におけるオーディオ信号符号化処理のフローチャートである。
【図１５】本発明の第３の実施形態におけるＰＥ閾値テーブルの模式図である。
【図１６】本発明の第３の実施形態におけるＰＥ補正値テーブルの模式図である。
【図１７】本発明の第４の実施形態におけるオーディオ信号符号化装置の一構成例を示す図である。
【図１８】本発明の第４の実施形態におけるＰＥ履歴テーブルの模式図である。
【図１９】本発明の第４の実施形態におけるオーディオ信号符号化処理のフローチャートである。
【図２０】本発明の第４の実施形態におけるＰＥ閾値補正処理のフローチャートである。
【図２１】本発明の第５の実施形態におけるオーディオ信号符号化装置の一構成例を示す図である。
【図２２】本発明の第５の実施形態におけるオーディオ信号符号化処理のフローチャートである。
【図２３】本発明の第５の実施形態におけるジャンル自動判定処理のフローチャートである。
【符号の説明】
１フレーム分割器
２フィルタバンク
３聴覚心理演算器
４ブロック長判定器
５ビット割当て器
６量子化器
７ビット整形器
８ＩＤ３タグ入力器
９ＰＥ履歴一時記憶器
１０ジャンル自動判定器
１００ＣＰＵ
１０１メモリ
１０２バス
１０３端末
１０４外部記憶装置
１０５メディアドライブ
１０６マイク
１０７スピーカー
１０８通信回線
１０９通信インターフェース
１１０記憶媒体

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an audio signal encoding technique, and more particularly, to an audio signal encoding technique using a transform coding technique capable of changing the transform block length.
[0002]
[Prior art]
In recent years, high-quality, high-efficiency audio signal encoding techniques have come to be widely used in DVD-Video audio tracks, portable audio players based on semiconductor memory or HDDs, music distribution over the Internet, home servers on home LANs, and the like. As they have spread, their importance has also grown.
[0003]
Many of these audio signal encoding techniques perform time-frequency conversion using a transform coding technique. For example, in MPEG2-AAC and Dolby Digital (AC-3), the filter bank consists of a single orthogonal transform such as the MDCT, whereas in MPEG1 Audio Layer 3 (MP3) and ATRAC (MD), the filter bank is configured by connecting a subband division filter such as a QMF and an orthogonal transform in multiple stages.
[0004]
In these high-efficiency audio encoding techniques, the compression efficiency is increased by removing the spectral components determined to be masked by performing masking analysis using human auditory characteristics.
[0005]
The masking analysis used in these high-efficiency audio coding techniques consists mainly of masking based on the audible frequency range in quiet and frequency masking by maskers within critical bands.
[0006]
Since the signal determined to be undetectable by humans by the masking analysis is mainly a signal in a high frequency range, in general, the quantization error of the high frequency component can be masked even if it becomes a little larger.
[0007]
However, in the transform coding method, in a so-called transient state in which the audio input signal changes suddenly, the quantization error of the high-frequency components in the part where the sudden change occurs affects the signal immediately before or immediately after the change, so ringing noise occurs.
[0008]
As a characteristic of human hearing, when a loud sound occurs, sounds immediately before and immediately after it become difficult to hear. This is called the temporal masking effect. The time for which sounds remain hard to hear after a loud sound is relatively long, about 100 msec, although there are individual differences. However, the masking effect that works immediately before the loud sound lasts only about 5 to 6 msec. Therefore, when ringing noise is generated, the noise preceding a loud sound is easily perceived. This phenomenon is generally called pre-echo.
[0009]
Hereinafter, this phenomenon will be described with reference to the drawings.
[0010]
FIG. 10 is an example of an audio input signal that changes rapidly. FIG. 11 shows an example of the audio signal obtained by encoding and decoding this signal in units of 2048 samples, which is the transform unit for the normal (long) block length in MPEG-2 AAC. As shown in the figure, the quantization error in the high frequency region that occurs at the sudden signal change affects the entire block.
[0011]
As described above, humans cannot perceive noise immediately before a portion where the amplitude changes abruptly, owing to the temporal masking effect. However, assuming that the input signal uses the same 44.1 kHz sampling frequency as the PCM signal used for music CDs, the duration of the 2048-sample transform unit is 2048 ÷ 44100 × 1000 = about 46.44 ms. Therefore, if noise occurs in the first half of the block, it falls outside the pre-masking time, and humans perceive the pre-echo.
[0012]
To suppress this, various audio encoding methods detect a sudden change in the input signal and shorten the transform block length, so that the quantization error of the high-frequency components caused by the sudden change does not reach the portion immediately preceding it. The occurrence of pre-echo is thereby suppressed.
[0013]
FIG. 12 shows the time domain signal obtained when the audio signal shown in FIG. 10 is encoded and decoded in units of 256 samples, which is the transform unit for the short block length in MPEG-2 AAC. In this case, the influence of the quantization error in the high frequency region caused by the rapid change of the input signal is confined to the 256-sample block in which the change occurs. As before, converting this block length to time at the 44.1 kHz sampling frequency gives approximately 5.80 ms, so humans can hardly perceive this noise because of the pre-masking effect, and as a result the pre-echo disappears.
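
The block durations quoted for the long and short block lengths can be checked with a short sketch. The function names and the 6 ms pre-masking bound (the upper end of the "about 5 to 6 msec" figure given earlier) are assumptions made only for illustration.

```python
# Illustrative check of the block durations quoted in the text.
SAMPLE_RATE_HZ = 44100   # music-CD sampling frequency, as in the text
PRE_MASKING_MS = 6.0     # assumed bound: upper end of "about 5 to 6 msec"


def block_duration_ms(samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a transform block in milliseconds."""
    return samples / sample_rate_hz * 1000.0


def fits_pre_masking(samples: int) -> bool:
    """True if the whole block lies within the assumed pre-masking window."""
    return block_duration_ms(samples) <= PRE_MASKING_MS


if __name__ == "__main__":
    for n in (2048, 256):
        print(f"{n} samples: {block_duration_ms(n):.2f} ms, "
              f"within pre-masking window: {fits_pre_masking(n)}")
```

Under these assumptions the 2048-sample block is far longer than the pre-masking window, while the 256-sample block fits inside it.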
[0014]
In actual processing, in order to remove the aliasing distortion caused by the transform, after a transform is performed on one transform-length unit, the input samples are shifted by 50% of the transform length, the transform is performed again, and the results are overlapped. This procedure is omitted from the figures for convenience of explanation.
[0015]
However, in general, shortening the block length not only reduces the accuracy of the masking analysis because of the lower frequency resolution, but also multiplies the number of normalization coefficients (hereinafter referred to as scale factors) for each frequency band used for quantization by the number of blocks. The amount of information consumed by the scale factors therefore increases, and bits that should be allocated to spectrum information at the time of quantization are consumed by the scale factors, so the coding efficiency is lowered. As a result, particularly at low bit rates, the quantization error cannot be strictly masked, and noise may be perceived more easily than when the block length is long.
[0016]
Therefore, when determining the actual block length, it is necessary to appropriately determine the balance between suppression of pre-echo and noise generated by a decrease in coding efficiency.
[0017]
As a method of selecting this block length, in MPEG Audio encoding schemes such as MP3 and MPEG-2 AAC, an auditory entropy (hereinafter referred to as PE) is calculated for each block, and the short block length is selected if the PE is larger than a predetermined PE threshold.
[0018]
Note that the ID3 tag has been proposed as one of the data formats for storing meta information (music title, artist name, album name, genre, creation year, copyright information, etc.) about such encoded audio data. The ID3 tag is commonly used with MP3 files in particular, and is appended to the end of the MP3 bit stream.
[0019]
Since the display and editing of the ID3 tag is implemented as a function of a general audio encoding / decoding application, the user can easily edit the contents of the ID3 tag.
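
As a minimal sketch of how the genre can be read from a tag appended to the end of an MP3 bit stream, the following assumes the ID3v1 layout: a 128-byte trailer that begins with the marker "TAG" and ends with a one-byte genre number. The function name is hypothetical, and later ID3 versions that are stored differently are not handled.

```python
from typing import Optional

ID3V1_SIZE = 128  # ID3v1 tag length in bytes


def read_id3v1_genre(data: bytes) -> Optional[int]:
    """Return the genre number from a trailing ID3v1 tag, or None if absent."""
    if len(data) < ID3V1_SIZE:
        return None
    tag = data[-ID3V1_SIZE:]
    if tag[:3] != b"TAG":   # the tag marker occupies the first three bytes
        return None
    return tag[127]          # the genre number is the last byte of the tag


if __name__ == "__main__":
    # Synthetic example: dummy audio bytes followed by a tag with genre 17.
    fake_tag = b"TAG" + b"\x00" * 124 + bytes([17])
    print(read_id3v1_genre(b"\xff\xfb-dummy-audio-" + fake_tag))
```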
[0020]
On the other hand, in recent years, research on techniques for analyzing audio signals themselves and automatically extracting meta information such as genres has been actively conducted. In Patent Document 1 below, the music genre of the music signal is determined by comparing the low-frequency analysis result of the music signal with a plurality of low-frequency component patterns obtained in advance for each music genre. In Patent Document 2 below, the music genre of a music signal is determined using a neural network that has learned patterns such as rhythm, tempo, and adjustment for each genre.
[0021]
Patent Document 3 below discloses a technique for determining block length switching by comparing a threshold value with a difference obtained by subtracting a minimum value from a maximum value of PE.
[0022]
Further, Patent Document 4 below discloses a method for the case where auditory entropy is not calculated: the input PCM signal is divided into segments shorter than the block length, sudden changes in amplitude are detected, the quantization noise in the segment preceding a detected sudden change and the masking value of that segment are estimated, and a short block length is selected when quantization noise that cannot be masked occurs.
[0023]
[Patent Document 1]
JP-A-8-37700
[Patent Document 2]
Japanese Patent Laid-Open No. 10-161654
[Patent Document 3]
JP 2000-276198 A
[Patent Document 4]
JP 2001-142493 A
[0024]
[Problems to be solved by the invention]
As described above, in the block length determination method described in, for example, the MPEG-2 AAC standard (ISO/IEC 13818-7:1997), the block length is determined only by comparing the PE of each block with a fixed PE threshold. However, the appropriate PE threshold actually varies with the device implementation and the input signal, so when a fixed PE threshold is applied in all cases, the block length is often determined incorrectly depending on the piece of music, and the sound quality may deteriorate because of reduced coding efficiency or pre-echo.
[0025]
For example, if the block length is selected for audio signals of all genres using a conventional fixed PE threshold, the short block length is used too frequently in music with violent changes, such as hard rock. As a result, unmasked noise increases and the sound quality deteriorates. Conversely, in quiet music such as classical music, an appropriate short block length is not selected and pre-echo occurs.
[0026]
The present invention has been devised in view of the above problems. An object of the present invention is to provide an audio encoding technique that, by appropriately changing the PE threshold according to the genre of the input audio signal, enables appropriate block length selection, prevents a block length shorter than necessary from being selected, appropriately suppresses pre-echo while maintaining coding efficiency, and creates a bit stream with good sound quality.
[0027]
[Means for Solving the Problems]
According to one aspect of the present invention, there is provided an audio signal encoding device comprising: a frame dividing unit that divides an audio input signal into frames of processing units; an auditory psychological operation unit that analyzes the frame-divided audio input signal and outputs an auditory entropy value; a block length determination unit that determines a transform block length of the frame based on the auditory entropy value output by the auditory psychological operation unit and an auditory entropy threshold; and a filter bank that divides the frame into blocks according to the block length determined by the block length determination unit and converts the blocks into a frequency spectrum, wherein the block length determination unit determines the auditory entropy threshold according to the genre of the audio input signal.
According to another aspect of the present invention, there is provided an audio signal encoding device comprising: a frame dividing unit that divides an audio input signal into frames of processing units; an auditory psychological operation unit that analyzes the frame-divided audio input signal and outputs an auditory entropy value; a block length determination unit that determines a transform block length of the frame based on the auditory entropy value output by the auditory psychological operation unit and an auditory entropy threshold; a filter bank that divides the frame into blocks according to the block length determined by the block length determination unit and converts the blocks into a frequency spectrum; and a history storage unit for storing a history of the auditory entropy values of each frame output by the auditory psychological operation unit, wherein the block length determination unit determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.
According to still another aspect of the present invention, there is provided an audio signal encoding method comprising: a frame dividing step of dividing an audio input signal into frames of processing units; an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value; a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold; and a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum, wherein the block length determination step determines the auditory entropy threshold according to the genre of the audio input signal.
According to still another aspect of the present invention, there is provided an audio signal encoding method comprising: a frame dividing step of dividing an audio input signal into frames of processing units; an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value; a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold; a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum; and a history storage step of storing, in a history storage unit, a history of the auditory entropy values of each frame output by the auditory psychological calculation step, wherein the block length determination step determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.
According to still another aspect of the present invention, there is provided a program for causing a computer to execute: a frame dividing step of dividing an audio input signal into frames of processing units; an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value; a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold; and a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum, wherein the block length determination step determines the auditory entropy threshold according to the genre of the audio input signal.
According to still another aspect of the present invention, there is provided a program for causing a computer to execute: a frame dividing step of dividing an audio input signal into frames of processing units; an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value; a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold; a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum; and a history storage step of storing, in a history storage unit, a history of the auditory entropy values of each frame output by the auditory psychological calculation step, wherein the block length determination step determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.
[0028]
According to the present invention, the auditory entropy threshold referred to when determining the transform block length is determined according to the genre of the audio input signal, so that encoding with good coding efficiency is possible while avoiding the occurrence of pre-echo, and encoding with good sound quality can be realized particularly at low bit rates.
[0029]
DETAILED DESCRIPTION OF THE INVENTION
(First embodiment)
Hereinafter, the present invention will be described in detail with reference to the drawings.
FIG. 1 is a configuration example of an audio encoding device according to the first embodiment of the present invention.
[0030]
In the illustrated configuration, reference numeral 1 denotes a frame divider that divides an audio input signal into frames as processing units. The divided frames are sent to a filter bank 2 and an auditory psychological calculator 3 described later.
[0031]
Reference numeral 2 denotes a filter bank, which converts an input time signal divided into frames into a frequency spectrum with a block length of a length designated by the block length determiner 4.
[0032]
Reference numeral 3 denotes an auditory psychological calculator, which analyzes the audio input signal for each frame, calculates an auditory entropy value, and performs a masking calculation for each divided frequency band serving as a quantization unit. As a result of this calculation, the auditory entropy (PE) value is output to the block length determiner 4, and the signal-to-mask ratio (SMR) for each divided frequency band is output to the bit allocator 5.
[0033]
Reference numeral 4 denotes a block length determiner, which changes and holds the PE threshold according to genre information sent from the outside, compares the PE sent from the auditory psychological calculator 3 with the PE threshold to determine the transform block length, and notifies the filter bank 2 of the result.
[0034]
A bit allocator 5 determines the bit amount to be allocated to each divided frequency band with reference to the SMR value and frequency spectrum for each divided frequency band transmitted from the auditory psychological calculator 3.
[0035]
Reference numeral 6 denotes a quantizer, which calculates a normalization coefficient (scale factor) of the frequency spectrum output from the filter bank 2 for each frequency band and quantizes the frequency spectrum according to the bit amount for each frequency band output from the bit allocator 5.
[0036]
A bit shaper 7 forms a bit stream by appropriately shaping the scale factor and quantized spectrum output from the quantizer 6 into a prescribed format and outputs the bit stream.
[0037]
An audio signal processing operation in the audio signal encoding apparatus having the above configuration will be described below.
[0038]
In the present embodiment, MPEG2-AAC is described as an example of the encoding method for convenience of explanation, but the invention can be realized in exactly the same manner with other encoding methods, such as MP3, that perform block length switching determination based on PE.
[0039]
First, prior to processing, each unit is initialized. At this time, the block length determiner 4 changes the PE threshold according to the genre information of the audio signal input from the outside. The PE threshold may be changed either by referring to a table that stores a PE threshold for each genre and selecting the entry for the genre, or by referring to a table that stores a PE correction value for each genre and adding the correction value for the genre to a reference PE threshold. FIG. 3 shows a configuration example of the PE threshold table, and FIG. 4 shows a schematic diagram of the PE correction value table. The reference value of PE when the PE correction value table of FIG. In the present embodiment, these tables are stored in the block length determiner 4.
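
The two ways of changing the PE threshold described above can be sketched as follows. The table contents are hypothetical, since the contents of FIG. 3 and FIG. 4 are not reproduced in this text; only the value 2000 for rock appears in the description of the second embodiment. The reference value 1800 and the fallback for an unknown genre are likewise assumptions.

```python
# Hypothetical tables: only the "rock" -> 2000 threshold appears in the text.
PE_REFERENCE = 1800                                       # assumed reference threshold
PE_THRESHOLD_TABLE = {"rock": 2000, "classical": 1600}    # per-genre thresholds
PE_CORRECTION_TABLE = {"rock": 200, "classical": -200}    # per-genre corrections


def threshold_from_table(genre: str, default: int = PE_REFERENCE) -> int:
    """Method 1: select the PE threshold stored for the genre."""
    # Falling back to a default for unknown genres is an assumption.
    return PE_THRESHOLD_TABLE.get(genre.lower(), default)


def threshold_from_correction(genre: str, reference: int = PE_REFERENCE) -> int:
    """Method 2: add the per-genre correction value to a reference threshold."""
    return reference + PE_CORRECTION_TABLE.get(genre.lower(), 0)


if __name__ == "__main__":
    for g in ("rock", "classical", "unknown"):
        print(g, threshold_from_table(g), threshold_from_correction(g))
```

With these made-up values both methods give the same result; a correction table has the practical advantage that changing the single reference value shifts the thresholds of all genres at once.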
[0040]
An input audio signal such as an audio PCM signal is divided into frames by the frame divider 1 and sent to the filter bank 2 and the auditory psychological calculator 3. In the case of the MPEG2-AAC LC (Low-Complexity) profile, one frame consists of 1024 PCM samples.
[0041]
For each frame, the auditory psychological calculator 3 calculates the auditory entropy of the input audio signal and performs the masking calculation for each frequency band. The block length determiner 4 compares the calculated auditory entropy value with the PE threshold previously changed according to the genre information. When the PE value of the frame is larger than the PE threshold, it decides to use the short block length; otherwise, it decides to use the long block length. The filter bank 2 converts the input signal into a frequency spectrum with the block length according to this decision.
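
The comparison performed by the block length determiner 4 amounts to a single threshold test per frame. The sketch below uses the MPEG2-AAC block lengths named in the text; the function names and the PE values in the example are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List

LONG_BLOCK = 2048   # samples per long transform block (MPEG2-AAC)
SHORT_BLOCK = 256   # samples per short transform block (MPEG2-AAC)


def choose_block_length(pe: float, pe_threshold: float) -> int:
    """Short block only when PE is strictly larger than the threshold."""
    return SHORT_BLOCK if pe > pe_threshold else LONG_BLOCK


def choose_block_lengths(pe_values: Iterable[float], pe_threshold: float) -> List[int]:
    """Apply the per-frame decision to a sequence of PE values."""
    return [choose_block_length(pe, pe_threshold) for pe in pe_values]


if __name__ == "__main__":
    frame_pe = [900.0, 1700.0, 2100.0, 1850.0]   # made-up PE values
    for threshold in (1600, 2000):               # low vs. high threshold
        print(threshold, Counter(choose_block_lengths(frame_pe, threshold)))
```

The same PE sequence yields more short blocks under the lower threshold, which is the effect the genre-dependent threshold is meant to control.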
[0042]
In MPEG2-AAC, in order to eliminate aliasing due to orthogonal transformation in the filter bank 2, overlapping transformation by MDCT is performed. For this reason, in the time-frequency conversion, 2048 samples including the processing target frame and the immediately preceding frame are input as one unit, and 1024 frequency spectra are obtained. At this time, when a long block length is used, 2048 samples of the input signal are orthogonally transformed as one block, and 1024 frequency spectra are output. When a short block length is used, the conversion for outputting 128 frequency spectra with 256 samples of the input signal as one block is performed 8 times while shifting the input signal by 128 samples.
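
The sample counts above can be illustrated by listing the input ranges that each transform covers within the 2048-sample unit. In this sketch the starting offset of 448 samples, which centers the eight short blocks in the unit, is an assumption not stated in the text; the function names are hypothetical, and the transform itself is not computed.

```python
from typing import List, Tuple

UNIT_LEN = 2048    # previous frame + current frame
SHORT_LEN = 256    # input samples per short block
SHORT_HOP = 128    # shift between consecutive short blocks
SHORT_COUNT = 8    # number of short blocks per frame


def transform_blocks(use_short: bool, offset: int = 448) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of each transform block in the unit.

    offset=448 centers the short blocks in the unit; this is an assumption.
    """
    if not use_short:
        return [(0, UNIT_LEN)]
    return [(offset + i * SHORT_HOP, offset + i * SHORT_HOP + SHORT_LEN)
            for i in range(SHORT_COUNT)]


def spectral_lines(blocks: List[Tuple[int, int]]) -> int:
    """A lapped transform of 2N inputs yields N coefficients per block."""
    return sum((end - start) // 2 for start, end in blocks)


if __name__ == "__main__":
    for use_short in (False, True):
        blocks = transform_blocks(use_short)
        print(len(blocks), "block(s),", spectral_lines(blocks), "spectral lines")
```

Either way one frame produces 1024 spectral values: one set of 1024 for the long block, or eight sets of 128 for the short blocks.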
[0043]
Based on the frequency spectrum output from the filter bank 2 and the SMR values output from the auditory psychological calculator 3, the bit allocator 5 allocates bits to each frequency band, and the quantizer 6 calculates the scale factor for each frequency band and quantizes the frequency spectrum according to the bits allocated to each frequency band.
[0044]
The scale factor and the quantized spectrum for each frequency band are shaped by the bit shaper 7 into a bit stream according to the prescribed format and output.
[0045]
(Second Embodiment)
The second embodiment of the present invention can be implemented as a software program that runs on a general-purpose personal computer (PC). Hereinafter, this case will be described with reference to the drawings.
[0046]
FIG. 5 is a configuration example of an audio signal encoding device according to the second embodiment of the present invention.
[0047]
In the configuration shown in the figure, reference numeral 100 denotes a CPU, which performs operations for audio signal encoding processing, logical determination, and the like, and controls each component connected to the bus via the bus 102.
[0048]
A memory 101 stores a basic I / O program in the configuration example of the present embodiment, a program code being executed, data necessary for program processing, and the like.
[0049]
Reference numeral 102 denotes a bus, which transfers an address signal indicating the component to be controlled by the CPU 100, transfers control signals for each component to be controlled by the CPU 100, and transfers data between the components.
[0050]
Reference numeral 103 denotes a terminal for instructing device activation, setting of various conditions and input signals, and encoding start.
[0051]
Reference numeral 104 denotes an external storage device, which is an external storage area for storing data, programs, and the like. Data, programs, and the like are stored as necessary, and the stored data and programs are called up when necessary.
[0052]
Reference numeral 105 denotes a media drive. Programs, data, digital audio signals, and the like recorded on a recording medium are loaded into the audio signal encoding apparatus when the media drive 105 reads them. In addition, various data and execution programs stored in the external storage device 104 can be written to the recording medium.
[0053]
A microphone 106 collects sound and converts it into an audio signal.
[0054]
Reference numeral 107 denotes a speaker, which can output arbitrary audio signal data as an actual sound.
[0055]
Reference numeral 108 denotes a communication network, which includes a LAN, a public line, a wireless line, a broadcast wave, and the like.
[0056]
Reference numeral 109 denotes a communication interface, which is connected to the communication network 108. The audio signal encoding apparatus according to the present embodiment can communicate with an external device via the communication network via this device and transmit / receive data and programs.
[0057]
In the audio signal encoding apparatus according to the present embodiment including these components, the apparatus operates in response to various inputs from the terminal 103. When an input is supplied from the terminal 103, an interrupt signal is sent to the CPU 100, whereupon the CPU 100 reads out various control signals stored in the memory 101 and performs various controls in accordance with those control signals.
[0058]
The apparatus according to the present embodiment operates when the CPU 100 executes the basic I / O program, the OS, and the audio signal encoding processing program. The basic I / O program is written in the memory 101, and the OS is written in the external storage device 104. When the power of the apparatus is turned on, the OS is read from the external storage device 104 into the memory 101 by the IPL (Initial Program Loading) function in the basic I / O program, and the operation of the OS is started.
[0059]
This audio signal encoding processing program is a program code based on the flowchart of the audio signal encoding processing procedure shown in FIG.
[0060]
FIG. 6 is a content configuration diagram when the audio signal encoding processing program and related data are recorded on a recording medium.
[0061]
In the present embodiment, the audio signal encoding processing program and related data are recorded on a recording medium. As shown in the figure, directory information of the recording medium is recorded in the head area of the recording medium, and after it, the audio signal encoding processing program and the audio signal encoding processing related data, which are the contents of the recording medium, are recorded as files.
[0062]
FIG. 7 is a schematic diagram of introducing the audio signal encoding processing program into the audio signal encoding apparatus. The audio signal encoding processing program and related data recorded on the recording medium can be loaded into this apparatus through the media drive 105, as shown in the figure. When the recording medium 110 is set in the media drive 105 of the personal computer, the audio signal encoding processing program and related data are read from the recording medium under the control of the OS and the basic I/O program and stored in the external storage device 104. Thereafter, this information is loaded into the memory 101 at restart and becomes operable.
[0063]
FIG. 8 shows a memory map in a state where the audio signal encoding processing program has been loaded into the memory 101 and is executable. The memory 101 stores a basic I/O program, an OS, the audio signal encoding processing program, related data, and a work area. At this time, the PE threshold and the PE threshold table shown in FIG. 3 are stored in the work area of the memory 101.
[0064]
Hereinafter, an audio signal encoding process executed by the CPU 100 in the present embodiment will be described according to a flow.
[0065]
FIG. 2 is a flowchart of audio signal encoding processing in the present embodiment.
[0066]
First, step S1 is a process in which the user designates, using the terminal 103, the input audio signal to be encoded and its genre. In the present embodiment, the audio signal to be encoded may be an audio PCM file stored in the external storage device 104 or a signal obtained by analog/digital conversion of a real-time audio signal captured by the microphone 106. When the process is finished, the process proceeds to step S2.
[0067]
Step S2 is a process of searching the PE threshold table in the memory 101 and changing the PE threshold in the memory 101 according to the genre of the audio input signal to be encoded, as specified in step S1. For example, when the genre input in step S1 is "rock", the PE threshold table shown in FIG. 3 is searched and, as a result, 2000 is stored as the PE threshold in a predetermined area of the memory 101. When the process is finished, the process proceeds to step S3.
[0068]
Step S3 is a process for determining whether or not the input audio signal to be encoded has been completed. If the input signal has ended, the process proceeds to step S12. If not completed, the process proceeds to step S4.
[0069]
Step S4 is a process of dividing the input signal into frames which are processing units for each channel. Similar to the description in the first embodiment, for example, in the case of MPEG2-AAC, each channel is divided into 1024 sample frames. When the process is finished, the process proceeds to step S5.
[0070]
Step S5 is a process of performing auditory psychological calculation of the frame to be encoded. As a result of this calculation, the auditory entropy (PE) of the processing target frame and the SMR value for each divided frequency band that is a quantization unit are calculated. When the process is finished, the process proceeds to step S6.
[0071]
Step S6 is a process of comparing the PE value of the processing target frame calculated in step S5 with the PE threshold stored in step S2 to determine the transform block length. If the PE value of the processing target frame is larger than the PE threshold, the process proceeds to step S7. Otherwise, the process proceeds to step S8.
[0072]
In step S7, based on the determination made in step S6, orthogonal transform using a short block (short block length) is performed on the processing target frame. In the case of MPEG2-AAC, as a result, eight sets of spectra divided into 128 frequency components are obtained. When the process is finished, the process proceeds to step S9.
[0073]
In step S8, orthogonal transform using a long block (long block length) is performed on the processing target frame based on the determination made in step S6. In the case of MPEG2-AAC, as a result, only one set of spectra divided into 1024 frequency components is obtained. When the process is finished, the process proceeds to step S9.
[0074]
Step S9 is a process of determining the bit amount to be allocated to each divided frequency band from the SMR value for each divided frequency band calculated in step S5, the frequency spectrum obtained in step S7 or step S8, and the encoding bit rate. Such processing is common in transform coding methods such as the one in the present embodiment, and therefore will not be described in detail. When the process is finished, the process proceeds to step S10.
[0075]
Step S10 is a process of calculating a scale factor for each divided frequency band and quantizing the frequency spectrum according to the bit amount assigned in step S9. When the process is finished, the process proceeds to step S11.
[0076]
Step S11 is a process of shaping the scale factor and the quantized spectrum calculated in step S10 according to the format determined by the encoding method and outputting the result as a bit stream. In the present embodiment, the bit stream output by this process may be stored in the external storage device 104, or may be output to an external device connected to the network 108 via the communication interface 109. When the process is finished, the process proceeds to step S3.
[0077]
Step S12 is a process of shaping the quantized spectrum that has not yet been output due to the delay caused by the psychoacoustic operation or orthogonal transformation, etc., into the bit stream and outputting it. When the process is finished, the audio signal encoding process is finished.
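
The control flow of steps S3 to S11 can be summarized in a skeleton such as the following. It is only a sketch of the loop structure: the psychoacoustic calculation, the transforms, and the quantization and bit shaping are passed in as placeholder callables with hypothetical names, the PE threshold is assumed to have been set beforehand (steps S1 and S2), and the final flush of step S12 is omitted.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple

FRAME_SIZE = 1024  # samples per frame per channel (MPEG2-AAC figure from the text)


def split_frames(pcm: Sequence[float], frame_size: int = FRAME_SIZE) -> Iterator[List[float]]:
    """S4: yield consecutive frames; the last frame is zero-padded (an assumption)."""
    for start in range(0, len(pcm), frame_size):
        frame = list(pcm[start:start + frame_size])
        frame += [0.0] * (frame_size - len(frame))
        yield frame


def encode(pcm: Sequence[float],
           pe_threshold: float,
           psychoacoustics: Callable[[List[float]], Tuple[float, Any]],
           transform_short: Callable[[List[float]], Any],
           transform_long: Callable[[List[float]], Any],
           quantize_and_pack: Callable[[Any, Any], Any]) -> List[Any]:
    """Run the per-frame loop and return one output item per frame."""
    stream = []
    for frame in split_frames(pcm):                       # S3, S4
        pe, smr = psychoacoustics(frame)                  # S5
        if pe > pe_threshold:                             # S6
            spectrum = transform_short(frame)             # S7
        else:
            spectrum = transform_long(frame)              # S8
        stream.append(quantize_and_pack(spectrum, smr))   # S9 to S11
    return stream


if __name__ == "__main__":
    # Stand-in callables: "PE" here is just the sum of magnitudes, not real PE.
    demo = encode([0.0] * 1024 + [3.0] * 1024, 2000,
                  lambda f: (sum(abs(x) for x in f), None),
                  lambda f: "short", lambda f: "long",
                  lambda spectrum, smr: spectrum)
    print(demo)
```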
[0078]
As described above, in the audio signal encoding process according to the present embodiment, the PE threshold referred to when determining the block length is set appropriately according to the genre of the audio input signal, so that encoding with good coding efficiency is possible, and encoding with good sound quality can be realized particularly at low bit rates. For example, when the genre of the input signal is classical, the PE threshold is set low, so that a short block length is properly selected in portions where a short block length should be selected, and the occurrence of pre-echo can be reduced. On the other hand, when the genre is rock, setting the PE threshold higher prevents the sound quality degradation that results when short block lengths are applied excessively and coding efficiency falls.
[0079]
(Third embodiment)
In the third embodiment of the present invention, the case where the PE threshold is determined by the genre in the ID3 tag input by the user will be described.
[0080]
FIG. 13 is a configuration example of an audio signal encoding device according to the third embodiment of the present invention.
[0081]
As shown in the figure, the audio signal encoding apparatus according to the present embodiment is obtained by adding an ID3 tag input device 8 to the configuration example shown in FIG.
[0082]
Since the constituent elements 1 to 7 are the same as those shown in the first embodiment, the description thereof will be omitted.
[0083]
Reference numeral 8 denotes an ID3 tag input device, which accepts ID3 tag information input by the user and holds its contents.
[0084]
An audio signal processing operation in the audio signal encoding apparatus having the above configuration will be described below with a focus on differences from the operation of the first embodiment.
[0085]
First, prior to processing, each unit is initialized.
[0086]
Next, the user designates an audio input signal to be encoded, and ID3 tag information related to the input signal is input to the ID3 tag input unit 8. At this time, the block length determiner 4 changes the PE threshold according to the genre in the input ID3 tag. The PE threshold may be changed by referring to a table that stores a PE threshold for each genre and selecting the entry for the genre number of the ID3 tag. Alternatively, a table that stores a PE correction value for each genre may be searched by the genre number of the ID3 tag, and the correction value for that genre may be added to a reference value for the PE threshold. FIG. 15 shows a schematic diagram of the PE threshold table in this case, and FIG. 16 shows a schematic diagram of the PE correction value table. In the present embodiment, the PE reference value when the PE correction value table of FIG. In the present embodiment, these tables are stored in the block length determiner 4.
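The two lookup methods described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the table contents, the reference value `PE_REFERENCE`, the fallback `DEFAULT_PE_THRESHOLD`, and the function names are all hypothetical. Only the mapping of genre number 2 to a threshold of 1600 is taken from the example given for step S103.

```python
# Hypothetical tables; only the entry 2 -> 1600 comes from the text.
PE_THRESHOLD_TABLE = {0: 2000, 1: 1800, 2: 1600}
PE_CORRECTION_TABLE = {0: 200, 1: 0, 2: -200}
PE_REFERENCE = 1800          # assumed reference value, not given in the source
DEFAULT_PE_THRESHOLD = 1800  # assumed fallback for unknown genre numbers


def threshold_from_table(genre_no: int) -> int:
    """Method 1: read the PE threshold for the genre number directly."""
    return PE_THRESHOLD_TABLE.get(genre_no, DEFAULT_PE_THRESHOLD)


def threshold_from_correction(genre_no: int) -> int:
    """Method 2: add the genre's correction value to the reference value."""
    return PE_REFERENCE + PE_CORRECTION_TABLE.get(genre_no, 0)
```

With these placeholder tables both methods give the same threshold for a given genre number; the correction table has the advantage that shifting the reference value moves every genre's threshold together.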
[0087]
Thereafter, as described above in the first embodiment, the input audio signal is divided into frames by the frame divider 1 and then analyzed by the psychoacoustic calculator 3 to calculate the PE value and the SMR. The block length determiner 4 compares the PE value with the PE threshold and determines the block length based on the result. The filter bank 2 converts the frame-divided audio input signal into a frequency spectrum using the determined block length, and the bit allocator 5 allocates bits to each frequency band according to the SMR value output from the psychoacoustic calculator 3. The quantizer 6 calculates a normalization coefficient for each frequency band and quantizes the frequency spectrum according to the bits allocated to each frequency band, and the bit shaper 7 formats the result into a bit stream according to the prescribed format and outputs it.
[0088]
When encoding of all input signals is completed, each information of the ID3 tag held in the ID3 tag input unit 8 is added to the end of the bit stream by the bit shaper 7.
[0089]
Hereinafter, a case where similar processing is implemented on a general-purpose PC will be described.
[0090]
The configuration of the audio encoding device shown in FIG. 5 can be implemented in this case as in the second embodiment described above. In the configuration shown in FIG. 5, the PE threshold table shown in FIG. 15 is stored in advance in the work area on the memory 101.
[0091]
Hereinafter, the details of the processing performed in the CPU 100 will be described according to the flow shown in FIG.
[0092]
First, step S101 is processing in which the user designates an input audio signal to be encoded using the terminal 103. In the present embodiment, the audio signal to be encoded may be an audio PCM file stored in the external storage device 104 or a signal obtained by analog-to-digital conversion of a real-time audio signal captured by the microphone 106. When the process is finished, the process proceeds to step S102.
[0093]
Step S102 is processing in which the user designates the ID3 tag using the terminal 103. As a result of this processing, the ID3 genre No. of the input audio signal is specified as a part of the ID3 tag. When the process is finished, the process proceeds to step S103.
[0094]
Step S103 is a process of searching the PE threshold table on the memory 101 based on the ID3 genre No. of the audio input signal specified in step S102 and changing the PE threshold on the memory 101. For example, when the ID3 genre No. specified in step S102 is “2”, as a result of searching the PE threshold table shown in FIG. 15, 1600 is stored as the PE threshold in a predetermined area on the memory 101. When the process is finished, the process proceeds to step S104.
[0095]
Step S104 is processing for determining whether or not the input audio signal to be encoded has ended. If the input signal has ended, the process proceeds to step S113. If it has not ended, the process proceeds to step S105.
[0096]
Step S105 is processing for dividing the input signal into frames, which are processing units, for each channel. Similar to the description in the first embodiment, for example, in the case of MPEG2-AAC, each channel is divided into 1024 sample frames. When the process is finished, the process proceeds to step S106.
[0097]
Step S106 is processing for performing psychoacoustic calculation of the frame to be encoded. As a result of this calculation, the auditory entropy (PE) of the processing target frame and the SMR value for each divided frequency band that is a quantization unit are calculated. When the process is finished, the process proceeds to step S107.
[0098]
Step S107 is a process of comparing the PE value of the processing target frame calculated in step S106 with the PE threshold stored in step S103 to determine the transform block length. If the PE value of the processing target frame is larger than the PE threshold, the process proceeds to step S108. Otherwise, the process proceeds to step S109.
[0099]
In step S108, based on the determination made in step S107, orthogonal transform using a short block (short block length) is performed on the processing target frame. In the case of MPEG2-AAC, as a result, eight sets of spectra divided into 128 frequency components are obtained. When the process is finished, the process proceeds to step S110.
[0100]
In step S109, based on the determination made in step S107, orthogonal transform using a long block (long block length) is performed on the processing target frame. In the case of MPEG2-AAC, as a result, only one set of spectra divided into 1024 frequency components is obtained. When the process is finished, the process proceeds to step S110.
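The branch taken in steps S107 to S109 amounts to a single comparison. The sketch below restates it for the MPEG2-AAC figures quoted above (eight spectra of 128 components for short blocks, one spectrum of 1024 components for long blocks). The function name and the returned tuple layout are assumptions made for illustration.

```python
from typing import Tuple

# (number of spectra, frequency components per spectrum) for MPEG2-AAC
BLOCK_LAYOUT = {"short": (8, 128), "long": (1, 1024)}


def select_block_layout(pe_value: float, pe_threshold: float) -> Tuple[str, int, int]:
    """Return (block type, number of spectra, components per spectrum)."""
    # Short blocks are chosen only when PE is strictly larger than the threshold.
    block_type = "short" if pe_value > pe_threshold else "long"
    count, length = BLOCK_LAYOUT[block_type]
    return (block_type, count, length)
```

Either way the frame yields 1024 spectral components in total; short blocks trade frequency resolution for time resolution.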
[0101]
Step S110 is a process of determining the bit amount to be allocated to each divided frequency band from the SMR value for each divided frequency band calculated in step S106, the frequency spectrum obtained in step S108 or step S109, and the encoding bit rate. When the process is finished, the process proceeds to step S111.
[0102]
Step S111 is a process of calculating a scale factor for each divided frequency band and quantizing the frequency spectrum according to the bit amount allocated in step S110. When the process is finished, the process proceeds to step S112.
[0103]
Step S112 is processing for shaping the scale factor and the quantized spectrum calculated in step S111 according to a format determined by the encoding method and outputting the result as a bit stream. In this embodiment, the bit stream output by this processing may be stored in an external storage device, or may be output to an external device connected to a line network via a communication interface. When the process is finished, the process proceeds to step S104.
[0104]
Step S113 is a process of shaping the quantized spectrum that has not yet been output due to the delay caused by the psychoacoustic calculation or orthogonal transformation, etc., into the bit stream and outputting it. When the process is finished, the process proceeds to step S114.
[0105]
Step S114 is processing for appropriately encoding and adding the ID3 tag information input in step S102 to the end of the generated encoded bitstream. As a result, when this encoded data is decoded, the ID3 tag information of this data can be accessed, and various information can be used on the decoding processing side. When the process is finished, the audio signal encoding process is finished.
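One way to picture step S114 is with the ID3v1 layout, a fixed 128-byte block that begins with the marker `TAG`, ends with a one-byte genre number, and is placed at the end of a file. The source does not say which ID3 version or encoding is used, so the choice of ID3v1 here is an assumption, and the function names are hypothetical.

```python
def build_id3v1_tag(title: str, artist: str, album: str, year: str,
                    comment: str, genre_no: int) -> bytes:
    """Build a 128-byte ID3v1 tag (assumed format; see the note above)."""
    def field(text: str, size: int) -> bytes:
        # Truncate or pad with NUL bytes to the fixed field width.
        return text.encode("latin-1", errors="replace")[:size].ljust(size, b"\x00")

    if not 0 <= genre_no <= 255:
        raise ValueError("genre number must fit in one byte")
    return (b"TAG" + field(title, 30) + field(artist, 30) + field(album, 30)
            + field(year, 4) + field(comment, 30) + bytes([genre_no]))


def append_tag(bitstream: bytes, tag: bytes) -> bytes:
    """Place the tag after the last byte of the encoded bit stream."""
    return bitstream + tag
```

Because the tag has a fixed size and sits at the end, a decoder can read the last 128 bytes and check for the `TAG` marker without parsing the audio data.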
[0106]
As described above, in this embodiment, the PE threshold is changed appropriately from the input ID3 tag, which is standard metadata for encoded audio data, so the user need only select a genre from the predetermined genres. This improves convenience for the user at encoding time, and encoded data with good sound quality can be obtained.
[0107]
(Fourth embodiment)
In the fourth embodiment of the present invention, an example in which the PE threshold is adjusted by the PE history is shown.
[0108]
FIG. 17 is a configuration example of an audio signal encoding device according to the fourth embodiment of the present invention.
[0109]
As shown in the figure, the audio signal encoding apparatus according to this embodiment is obtained by adding a PE history temporary storage 9 to the configuration example shown in FIG.
[0110]
Since the constituent elements 1 to 7 are the same as those shown in the first embodiment, the description thereof will be omitted.
[0111]
Reference numeral 9 denotes a PE history temporary storage. It stores a predetermined number of entries, newest first, each consisting of the PE value of a frame output by the psychoacoustic calculator 3 as a result of analyzing the audio input signal and the block length that the block length determiner 4 determined from that value. In the present embodiment, the PE history for 100 frames can be stored. A schematic diagram of the PE history table stored in the PE history temporary storage 9 is shown in FIG.
[0112]
An audio signal processing operation in the audio signal encoding apparatus having the above-described configuration will be described below with a focus on differences from the operation of the first embodiment.
[0113]
In the present embodiment, MPEG2-AAC is described as an example of an encoding method for convenience of explanation, but the embodiment can be realized in exactly the same manner with other encoding methods, such as MP3, that decide block length switching by PE.
[0114]
First, prior to processing, each unit is initialized. At this time, the block length determiner 4 sets an initial PE threshold by referring to the PE threshold table inside the block length determiner 4 according to the genre information of the audio signal input from the outside.
[0115]
An input audio signal such as an audio PCM signal is divided into frames by the frame divider 1 and sent to the filter bank 2 and the psychoacoustic calculator 3.
[0116]
The input audio signal is subjected to auditory entropy (PE) and masking calculation for each frequency band by the psychoacoustic calculator 3 for each frame. The calculated auditory entropy value is sent to the block length determiner 4.
[0117]
The block length determiner 4 starts correction of the PE threshold, triggered by the input of the PE value. The PE threshold correction processing in this embodiment will be described later using the flow of FIG. During this correction, the block length determiner 4 can retrieve the history information needed for processing by accessing the PE history temporary storage 9.
[0118]
When the correction of the PE threshold is completed, the block length determiner 4 compares the input PE value with the corrected PE threshold. When the PE value of the frame is larger than the PE threshold, it decides to use a short block length; otherwise, it decides to use a long block length. When the block length has been determined, the block length determiner 4 sends the PE value and the selected block length to the PE history temporary storage 9. The PE history temporary storage 9 stores the latest PE value and the determined block length in the PE history table. At this time, if the number of history entries exceeds the predetermined number, the oldest entry is deleted. For example, in the case of the PE history table shown in FIG., row 100 is deleted from the table, the PE value and block type of the new entry are stored in the top row in the figure, and the row numbers are reassigned.
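The history update described above (newest entry at the top, oldest entry dropped once 100 frames are held) can be sketched with a bounded double-ended queue. The names `new_history` and `record_frame` and the representation of the block type as a string are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Tuple

HISTORY_SIZE = 100  # frames kept, as in the embodiment

# Each entry is (PE value, block type); index 0 is the newest frame.
History = Deque[Tuple[float, str]]


def new_history() -> History:
    return deque(maxlen=HISTORY_SIZE)


def record_frame(history: History, pe_value: float, block_type: str) -> None:
    # appendleft on a full bounded deque discards the entry at the right end,
    # i.e. the oldest frame, so the row numbers are reassigned implicitly.
    history.appendleft((pe_value, block_type))
```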
[0119]
Thereafter, as described above in the first embodiment, the filter bank 2 converts the audio input signal into a frequency spectrum using the determined block length, and the bit allocator 5 allocates bits to each frequency band according to the SMR value output from the psychoacoustic calculator 3. The quantizer 6 calculates a normalization coefficient for each frequency band and quantizes the frequency spectrum according to the bits allocated to each frequency band, and the bit shaper 7 formats the result into a bit stream according to the prescribed format and outputs it.
[0120]
Hereinafter, a case where similar processing is implemented on a general-purpose PC will be described.
[0121]
The configuration of the audio encoding device shown in FIG. 5 can be implemented in this case as in the second embodiment described above. In the configuration shown in FIG. 5, the PE history table shown in FIG. 18 is stored in the work area on the memory 101.
[0122]
Hereinafter, the details of the processing performed in the CPU 100 will be described according to the flow shown in FIG.
[0123]
Step S201 is processing in which the user designates the audio input signal to be encoded and its genre. When the process is finished, the process proceeds to step S202.
[0124]
Step S202 is processing for searching the PE threshold value table and changing the PE threshold value according to the genre specified in step S201. When the process is finished, step S203 follows.
[0125]
Step S203 is a process for determining whether or not the audio input signal is finished. If the input signal has ended, the process proceeds to step S214. If the input signal is not completed, the process proceeds to step S204.
[0126]
Step S204 is processing for reading an audio input signal for one frame, which is an encoding unit, onto the memory 101. When the process is finished, the process proceeds to step S205.
[0127]
Step S205 is processing for performing auditory psychological calculation of a frame to be encoded. As a result of this calculation, the auditory entropy (PE) of the processing target frame and the SMR value for each divided frequency band that is a quantization unit are calculated. When the process is finished, step S206 follows.
[0128]
Step S206 is processing for comparing the PE value of the processing target frame calculated in step S205 with the PE threshold stored in step S202 to determine the transform block length. If the PE value of the processing target frame is larger than the PE threshold, the process proceeds to step S207. Otherwise, the process proceeds to step S208.
[0129]
In step S207, based on the determination made in step S206, orthogonal transform is performed on the processing target frame using a short block (short block length). In the case of MPEG2-AAC, as a result, eight sets of spectra divided into 128 frequency components are obtained. When the process is finished, the process proceeds to step S209.
[0130]
In step S208, based on the determination made in step S206, orthogonal transform using a long block (long block length) is performed on the processing target frame. In the case of MPEG2-AAC, as a result, only one set of spectra divided into 1024 frequency components is obtained. When the process is finished, the process proceeds to step S209.
[0131]
Step S209 is a process of determining the bit amount to be allocated to each divided frequency band from the SMR value for each divided frequency band calculated in step S205, the frequency spectrum obtained in step S207 or step S208, and the encoding bit rate. When the process is finished, the process proceeds to step S210.
[0132]
Step S210 is a process of calculating a scale factor for each divided frequency band and quantizing the frequency spectrum according to the bit amount allocated in step S209. When the process is finished, the process proceeds to step S211.
[0133]
Step S211 is processing for shaping the scale factor and the quantized spectrum calculated in step S210 according to a format determined by the encoding method and outputting the result as a bit stream. In the present embodiment, the bit stream output by this processing may be stored in the external storage device 104, or may be output to an external device connected to the network 108 via the communication interface 109. When the process is finished, the process proceeds to step S212.
[0134]
Step S212 is processing for storing the PE value and selected block of the frame being processed in the PE history table in the memory 101. When the PE history table is full, the oldest history is deleted and the history of this frame is stored. When the process is finished, step S213 follows.
[0135]
Step S213 is processing for correcting the PE threshold based on the history stored in the PE history table. Details of this processing will be described later with reference to FIG. When the process is finished, step S203 follows.
[0136]
Step S214 is a process of shaping the quantized spectrum that has not yet been output due to the delay caused by the psychoacoustic calculation or orthogonal transform, etc., on the memory 101, and outputting it after converting it into a bit stream. When the process is finished, the audio signal encoding process is finished.
[0137]
FIG. 20 is a detailed flow of the PE correction process based on the history in step S213 in the present embodiment.
[0138]
Step S301 is a process of searching for a short block in the PE history table. If there is even one short block in the PE history table, the process proceeds to step S302. Otherwise, the process proceeds to step S311.
[0139]
Step S302 is processing for calculating, with reference to the PE history table, the average PE value of the long blocks remaining in the history (hereinafter, Long_PE). When the process is finished, the process proceeds to step S303.
[0140]
Step S303 is processing for calculating an average PE value of short blocks (hereinafter, Short_PE) with reference to the PE history table. When the process is finished, the process proceeds to step S304.
[0141]
Step S304 is processing for calculating the average PE value, that is, (Long_PE + Short_PE) / 2. The average PE value is calculated in this way because, even within one audio input signal (piece of music), the frequency of frames with high PE values, that is, frames to be processed as short blocks, may differ from one portion to another. In such a case, if the average PE value were the average over all frames, it would fluctuate while the same audio input signal is being encoded, the subsequent correction process could not correct the threshold properly, and as a result the sound quality of the created bit stream might not be stable. When the process is finished, the process proceeds to step S305.
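A small numerical sketch shows why the mean of the two class averages is steadier than the mean over all frames. The PE values and frame counts below are made up for illustration, and `balanced_average_pe` is a hypothetical helper; it assumes the history contains at least one long and one short block.

```python
from statistics import mean


def balanced_average_pe(history):
    """(Long_PE + Short_PE) / 2 over (pe, block_type) history entries."""
    long_pe = mean(pe for pe, kind in history if kind == "long")
    short_pe = mean(pe for pe, kind in history if kind == "short")
    return (long_pe + short_pe) / 2


# 98 long frames at PE 1000 and 2 short frames at PE 3000 (illustrative values)
history = [(1000.0, "long")] * 98 + [(3000.0, "short")] * 2
balanced = balanced_average_pe(history)   # midpoint of the two class means
overall = mean(pe for pe, _ in history)   # dominated by the long frames
```

Here `balanced` is 2000.0 while `overall` is 1040.0. Changing the proportion of short frames moves `overall` but leaves `balanced` unchanged as long as the class means stay the same.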
[0142]
Step S305 is processing for determining whether or not the average PE value obtained in step S304 is larger than the current PE threshold value. As a result of the determination, if the average PE value is larger, the process proceeds to step S306. Otherwise, the process proceeds to step S307.
[0143]
Step S306 is processing for correcting the PE threshold by adding 10 to the current PE threshold. That is, when the average PE value calculated in step S304 is larger than the current PE threshold, the PE threshold is corrected in the positive direction. In this way, the PE threshold can be corrected appropriately for music whose PE value stays high. When the process is finished, the process proceeds to step S309.
[0144]
Step S307 is processing for determining whether or not the average PE value obtained in step S304 is smaller than the current PE threshold value. As a result of the determination, if the average PE value is smaller, the process proceeds to step S308. Otherwise, the process proceeds to step S309.
[0145]
Step S308 is processing for correcting the PE threshold by subtracting 10 from the current PE threshold. That is, when the average PE value calculated in step S304 is smaller than the current PE threshold, the PE threshold is corrected in the negative direction. In this way, the PE threshold can be corrected appropriately for music whose PE value stays low. When the process is finished, the process proceeds to step S309.
[0146]
Step S309 is a process of counting the history entries in the PE history table that were encoded as short blocks and determining whether or not the count is 20 or more. As a result of the determination, if there are 20 or more short block entries, the process proceeds to step S310. Otherwise, the PE correction process based on the history is terminated and the process returns.
[0147]
Step S310 is processing for correcting the PE threshold by adding 10 to the PE threshold. That is, when there are 20 or more short blocks in the history of 100 frames, encoding efficiency is considered to have fallen because of excessive use of short blocks, and therefore the PE threshold is corrected in the positive direction. This suppresses the loss of encoding efficiency caused by a PE threshold that is too low. When the process is finished, the PE correction process based on the history is finished and the process returns.
[0148]
Step S311 is processing for determining whether or not the current PE threshold value has reached the threshold lower limit. As a result of the determination, if the PE threshold value has reached the threshold lower limit, the PE correction process based on the history is terminated and the process returns. Otherwise, the process proceeds to step S312.
[0149]
Step S312 is a process of correcting the PE threshold by subtracting 10 from the PE threshold. In other words, when the history consists entirely of long blocks, the PE threshold may be set too high for the input signal, so that frames that should be encoded with short blocks may be overlooked; the threshold is therefore corrected downward. However, the correction is limited so that the threshold does not fall below the preset lower limit. This is to prevent the phenomenon in which, for a quiet song with almost no change, an extremely lowered PE threshold causes portions that should be encoded with long blocks to be encoded with short blocks, reducing encoding efficiency. When the process is finished, the correction process is finished and the process returns.
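Steps S301 to S312 can be collected into one function, sketched below under stated assumptions. The step size of 10 and the short-block count of 20 come from the text. The value of `lower_limit` is not given in the source and is left as a parameter, clamping with `max` in step S312 is an assumption, and so is the fallback used when the history holds short blocks but no long block, a case the text does not address.

```python
from typing import Iterable, Tuple

STEP = 10                # correction step used in S306, S308, S310 and S312
SHORT_COUNT_LIMIT = 20   # short-block count checked in S309


def correct_pe_threshold(history: Iterable[Tuple[float, str]],
                         threshold: float, lower_limit: float) -> float:
    """Return the corrected PE threshold for (pe, block_type) history entries."""
    entries = list(history)
    short = [pe for pe, kind in entries if kind == "short"]
    long_ = [pe for pe, kind in entries if kind == "long"]

    if not short:                                   # S301 -> S311
        if threshold <= lower_limit:                # S311: already at the floor
            return threshold
        return max(lower_limit, threshold - STEP)   # S312 (clamp is an assumption)

    short_pe = sum(short) / len(short)              # S303
    # S302; using Short_PE when no long block exists is an assumption.
    long_pe = sum(long_) / len(long_) if long_ else short_pe
    average_pe = (long_pe + short_pe) / 2           # S304

    if average_pe > threshold:                      # S305 -> S306
        threshold += STEP
    elif average_pe < threshold:                    # S307 -> S308
        threshold -= STEP

    if len(short) >= SHORT_COUNT_LIMIT:             # S309 -> S310
        threshold += STEP
    return threshold
```

Because the function is called once per frame (step S213), the threshold drifts by at most 20 per frame, which keeps the block length decision from changing abruptly.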
[0150]
As described above, in the present embodiment, the PE threshold can be corrected during processing by using the PE history, so that a PE threshold suited to the input signal is obtained. This allows more appropriate block length determination, and as a result a bit stream with better sound quality can be created.
[0151]
(Fifth embodiment)
The fifth embodiment of the present invention shows an example in which an input signal is analyzed to automatically determine a genre.
[0152]
FIG. 21 is a configuration example of an audio signal encoding device according to the fifth embodiment of the present invention.
[0153]
As shown in the figure, the audio signal encoding apparatus according to the present embodiment is obtained by adding an automatic genre determiner 10 to the configuration example shown in FIG.
[0154]
Since the constituent elements 1 to 7 are the same as those shown in the first embodiment, the description thereof will be omitted.
[0155]
Reference numeral 10 denotes a genre determiner, which analyzes an audio input signal and determines its genre.
[0156]
An audio signal processing operation in the audio signal encoding apparatus having the above configuration will be described below with a focus on differences from the operation of the first embodiment.
[0157]
First, prior to processing, each unit is initialized.
[0158]
Next, an audio input signal to be encoded is designated by the user.
[0159]
Next, the genre determiner 10 analyzes the designated input signal, automatically determines the genre, and sends it to the block length determiner 4. Details of this genre determination method will be described later with reference to FIG. The block length determiner 4 changes the PE threshold according to the received genre. The PE threshold may be changed by referring to a table that stores a PE threshold for each genre, or by searching a table that stores a PE correction value for each genre and adding the correction value for that genre to a reference value for the PE threshold. FIG. 3 shows a configuration example of the PE threshold table in this case, and FIG. 4 shows a configuration example of the PE correction value table. In the present embodiment, the PE reference value when the correction value table of FIG. In the present embodiment, these tables are stored in the block length determiner 4.
[0160]
Thereafter, as described above in the first embodiment, the input audio signal is divided into frames by the frame divider 1 and then analyzed by the psychoacoustic calculator 3 to calculate the PE value and the SMR. The block length determiner 4 compares the PE value with the PE threshold and determines the block length based on the result. The filter bank 2 converts the frame-divided audio input signal into a frequency spectrum using the determined block length, and the bit allocator 5 allocates bits to each frequency band according to the SMR value output from the psychoacoustic calculator 3. The quantizer 6 calculates a normalization coefficient for each frequency band and quantizes the frequency spectrum according to the bits allocated to each frequency band, and the bit shaper 7 formats the result into a bit stream according to the prescribed format and outputs it.
[0161]
Hereinafter, a case where similar processing is implemented on a general-purpose PC will be described.
[0162]
The configuration of the audio encoding device shown in FIG. 5 can be implemented in this case as in the second embodiment described above.
[0163]
Hereinafter, the details of the processing performed in the CPU 100 will be described according to the flow shown in FIG.
[0164]
FIG. 22 is a flow of audio signal encoding processing in the present embodiment.
[0165]
First, step S401 is processing in which the user designates an input audio signal to be encoded using the terminal 103. In the present embodiment, the audio signal to be encoded may be an audio PCM file stored in the external storage device 104 or a signal obtained by analog-to-digital conversion of a real-time audio signal captured by the microphone 106. When the process is finished, the process proceeds to step S402.
[0166]
Step S402 is processing for automatically determining the genre of the input audio signal. Details of this processing will be described later with reference to FIG. When the process is finished, the process proceeds to step S403.
[0167]
Step S403 is a process of searching the PE threshold table on the memory 101 according to the genre of the audio input signal determined in step S402 and changing the PE threshold on the memory 101. When the process is finished, the process proceeds to step S404.
[0168]
Step S404 is processing for determining whether or not the input audio signal to be encoded has ended. If the input signal has ended, the process proceeds to step S413. If it has not ended, the process proceeds to step S405.
[0169]
Step S405 is processing for dividing the input signal into frames which are processing units for each channel. Similar to the description in the first embodiment, for example, in the case of MPEG2-AAC, each channel is divided into 1024 sample frames. When the process is finished, the process proceeds to step S406.
[0170]
Step S406 is processing for performing psychoacoustic calculation of the frame to be encoded. As a result of this calculation, the auditory entropy (PE) of the processing target frame and the SMR value for each divided frequency band that is a quantization unit are calculated. When the process is finished, the process advances to step S407.
[0171]
Step S407 is processing for comparing the PE value of the processing target frame calculated in step S406 with the PE threshold stored in step S403 to determine the transform block length. If the PE value of the processing target frame is larger than the PE threshold, the process proceeds to step S408. Otherwise, the process proceeds to step S409.
[0172]
In step S408, based on the determination made in step S407, orthogonal transform using a short block (short block length) is performed on the processing target frame. In the case of MPEG2-AAC, as a result, eight sets of spectra divided into 128 frequency components are obtained. When the process is finished, the process proceeds to step S410.
[0173]
In step S409, based on the determination made in step S407, orthogonal transform using a long block (long block length) is performed on the processing target frame. In the case of MPEG2-AAC, as a result, only one set of spectra divided into 1024 frequency components is obtained. When the process is finished, the process proceeds to step S410.
[0174]
Step S410 is a process of determining the bit amount to be assigned to each divided frequency band from the SMR value for each divided frequency band calculated in step S406, the frequency spectrum obtained in step S408 or step S409, and the encoding bit rate. Such processing is common in transform coding methods such as the one in the present embodiment, and therefore will not be described in detail. When the process is finished, the process proceeds to step S411.
[0175]
Step S411 is a process of calculating the scale factor for each divided frequency band and quantizing the frequency spectrum according to the bit amount allocated in step S410. When the process is finished, the process advances to step S412.
[0176]
Step S412 is a process of shaping the scale factor and the quantized spectrum calculated in step S411 according to a format determined by the encoding method and outputting the result as a bit stream. In the present embodiment, the bit stream output by this processing may be stored in the external storage device 104, or may be output to an external device connected to the network 108 via the communication interface 109. When the process is finished, the process proceeds to step S404.
[0177]
Step S413 is a process of shaping the quantized spectrum that has not yet been output due to the delay caused by the psychoacoustic operation or orthogonal transformation, etc., into the bit stream and outputting it. When the process is finished, the audio signal encoding process is finished.
[0178]
FIG. 23 is a flow detailing the genre automatic determination process in step S402 in the present embodiment.
[0179]
Step S501 is processing for extracting the rhythm of the audio input signal by detecting the amplitude of the audio input signal and the repetition pattern. When the process is finished, the process proceeds to step S502.
[0180]
Step S502 is a process of calculating the tempo of the input signal from the rhythm extracted in step S501 and the sampling rate of the input audio signal. When the process is finished, the process advances to step S503.
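Steps S501 and S502 can be illustrated with a simple sketch. The source only says that the rhythm is extracted from the amplitude and its repetition pattern; using the autocorrelation of an amplitude envelope to find the repetition period is a technique substituted here for illustration, and the function names, the envelope hop size, and the lag range are assumptions.

```python
from typing import Optional, Sequence


def beat_period_frames(envelope: Sequence[float],
                       min_lag: int, max_lag: int) -> Optional[int]:
    """Lag (in envelope frames) with the strongest autocorrelation, or None."""
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, min(max_lag, len(envelope) - 1) + 1):
        score = sum(envelope[i] * envelope[i + lag]
                    for i in range(len(envelope) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tempo_bpm(period_frames: int, hop_samples: int, sample_rate: int) -> float:
    """Convert a beat period in envelope frames to beats per minute."""
    return 60.0 * sample_rate / (period_frames * hop_samples)


# Synthetic check: one pulse every 0.5 s at 44.1 kHz with a 441-sample hop.
envelope = [1.0 if i % 50 == 0 else 0.0 for i in range(400)]
period = beat_period_frames(envelope, min_lag=20, max_lag=100)
```

For this synthetic envelope the strongest repetition is at a lag of 50 frames, and `tempo_bpm(period, 441, 44100)` gives 120.0 beats per minute. The sampling rate enters only in the conversion from a period in samples to a tempo, which matches the role it plays in step S502.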
[0181]
Step S503 is processing for converting an audio input signal into a frequency spectrum using orthogonal transform such as FFT and analyzing a pure tone component. When the process is finished, the process advances to step S504.
[0182]
Step S504 is a process of extracting the music tone by analyzing the melody, chord composition, chord progression, and the like of the music based on the result of the frequency analysis performed in step S503. When the process is finished, the process proceeds to step S505.
[0183]
Step S505 is a process of comprehensively analyzing the rhythm, tempo, and music tone extracted so far to determine the genre of the input signal. Determination methods include, for example, searching a database that stores these features together with genres and selecting the genre with the most similar pattern, or using a neural network that has learned the musical features of each genre. These techniques are all known and will not be described in detail. As a result of this determination, if the genre cannot be determined, the process proceeds to step S506. If the genre can be determined, the process proceeds to step S507.
[0184]
Step S506 is processing for selecting a default genre as the genre information. In this embodiment, the default genre is the genre assigned to an input signal whose genre cannot be determined. When it is set as the genre, a default value is set as the PE threshold value in step S403. When the process is finished, the process advances to step S507.
[0185]
Step S507 is processing to store the genre information selected in step S505 or step S506 in the work area on the memory 101. When the process is finished, the genre automatic determination process is finished and the process returns.
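The flow of steps S501 to S507 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the two features (tempo and a pure-tone ratio), the reference patterns in GENRE_PATTERNS, the distance limit MAX_DISTANCE, and all function names are assumptions introduced here. Tempo is derived from a beat period in samples and the sampling rate as in step S502, the most similar stored pattern is chosen as in step S505, and a default genre is returned when no pattern is close enough, as in step S506.

```python
import math

# Hypothetical reference patterns: genre -> (tempo in BPM, pure-tone ratio 0..1).
# The values are illustrative only and do not come from the source.
GENRE_PATTERNS = {
    "classical": (70.0, 0.8),
    "jazz": (110.0, 0.6),
    "rock": (130.0, 0.3),
}
DEFAULT_GENRE = "default"  # used when the genre cannot be determined (S506)
MAX_DISTANCE = 0.5         # assumed limit beyond which no genre is accepted


def tempo_bpm(beat_period_samples, sample_rate_hz):
    """S502: convert a beat period in samples to beats per minute."""
    return 60.0 * sample_rate_hz / beat_period_samples


def determine_genre(tempo, tone_ratio):
    """S505/S506: pick the most similar stored pattern, else the default."""
    best_genre, best_dist = None, math.inf
    for genre, (ref_tempo, ref_tone) in GENRE_PATTERNS.items():
        # Scale the tempo difference so both features have similar weight.
        dist = math.hypot((tempo - ref_tempo) / 100.0, tone_ratio - ref_tone)
        if dist < best_dist:
            best_genre, best_dist = genre, dist
    if best_dist > MAX_DISTANCE:
        return DEFAULT_GENRE
    return best_genre
```

A neural network classifier, which the text also mentions, could replace determine_genre without changing the fallback to the default genre.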
[0186]
As described above, according to the present embodiment, the user simply specifies the audio input signal; the genre is determined automatically and the PE threshold corresponding to the genre is set appropriately. A bit stream with good sound quality can therefore be obtained without extra effort from the user.
[0187]
(Other embodiments)
The present invention is not limited to the embodiments described above.
[0188]
In the second embodiment described above, the PE threshold value is changed according to the genre by searching the PE threshold value table of FIG. 3, but the same result can also be achieved by searching a PE threshold value correction table. FIG. 9 is a memory map configuration diagram for this case. This memory map is basically the same as the memory map of FIG. 8, except that the PE threshold value, the PE reference value, and the PE threshold value correction table are stored in the work area.
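The correction-table variant can be sketched as a reference value plus a per-genre correction. The numeric values, the genre keys, and the names PE_REFERENCE, PE_CORRECTION, and pe_threshold are assumptions for illustration; the source only states that a correction value for each genre is added to or subtracted from a reference value.

```python
# Hypothetical PE reference value; not taken from the source.
PE_REFERENCE = 2000.0

# Hypothetical per-genre corrections. A negative correction lowers the
# threshold, which makes short blocks easier to select.
PE_CORRECTION = {
    "classical": -300.0,
    "jazz": -100.0,
    "rock": 200.0,
}


def pe_threshold(genre):
    """Return the reference value corrected for the given genre.

    Genres without an entry fall back to the uncorrected reference value.
    """
    return PE_REFERENCE + PE_CORRECTION.get(genre, 0.0)
```

Compared with a table of absolute thresholds, this layout lets a single reference value be retuned without editing every genre entry.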
[0189]
In the second embodiment described above, the PE threshold value itself is stored on the memory map, but the same processing can be realized by storing instead a pointer to a value in the PE threshold table.
[0190]
The above-described embodiments make no particular reference to the recording medium, but the invention is applicable to any recording medium such as FD, HDD, CD, DVD, MO, and semiconductor memory.
[0191]
In the fourth embodiment described above, the PE history table is configured as ordinary tabular data, but the invention is also applicable when the table is implemented as a ring buffer capable of storing a predetermined number of entries.
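A ring-buffer form of the PE history can be sketched with a bounded deque, which discards the oldest entry once the predetermined capacity is reached. The class name PEHistory, its methods, and the choice to store a (PE value, short-block flag) pair per frame are assumptions for illustration, not the layout of the PE history table in the embodiment.

```python
from collections import deque


class PEHistory:
    """Fixed-capacity history of (PE value, short-block flag) per frame."""

    def __init__(self, capacity):
        # A deque with maxlen behaves as a ring buffer: appending to a
        # full deque silently drops the oldest entry.
        self._entries = deque(maxlen=capacity)

    def record(self, pe_value, used_short_block):
        self._entries.append((pe_value, used_short_block))

    def average_pe(self, used_short_block):
        """Average PE over frames with the given block choice, or None."""
        values = [pe for pe, short in self._entries if short == used_short_block]
        return sum(values) / len(values) if values else None

    def short_block_count(self):
        return sum(1 for _, short in self._entries if short)

    def __len__(self):
        return len(self._entries)
```

The two averages and the short-block count are the quantities a threshold correction could draw on; how they are combined is left open here.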
[0192]
In the fourth embodiment described above, the PE threshold value is corrected every time one frame is processed, but the correction may instead be performed once every several frames or once every several tens of frames.
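Correcting the threshold at a reduced rate amounts to invoking the correction only on every N-th frame. In the sketch below the correction rule itself is passed in by the caller and is hypothetical; the function name thresholds_per_frame and its parameters are likewise assumptions. The function only shows the scheduling.

```python
def thresholds_per_frame(pe_values, initial_threshold, correct, interval):
    """Return the PE threshold in effect for each frame.

    `correct(threshold, recent_pe)` is a caller-supplied, hypothetical
    correction rule. It is invoked once every `interval` frames with the
    PE values of the frames processed since the previous correction.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    threshold = initial_threshold
    in_effect = []
    for count, _ in enumerate(pe_values, start=1):
        in_effect.append(threshold)
        if count % interval == 0:
            recent = pe_values[count - interval:count]
            threshold = correct(threshold, recent)
    return in_effect
```

Setting interval to 1 reproduces per-frame correction; larger values trade responsiveness for less computation.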
[0193]
As described above, according to the above-described embodiments, the PE threshold referred to when determining the block length is set appropriately according to the genre of the audio input signal, so encoding with good sound quality can be realized, particularly at low bit rates. For example, when the genre of the input signal is classical, the PE threshold is set low, so a short block length is selected where one is needed and the occurrence of pre-echo is reduced. Conversely, when the genre is rock, the short block length is prevented from being applied excessively, which would otherwise lower encoding efficiency and consequently degrade sound quality. In other words, an appropriate block length can be selected according to the characteristics of the input signal, and a high-quality bit stream can be created that suppresses pre-echo while preventing a decrease in encoding efficiency.
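The effect of the genre-dependent threshold on block selection can be sketched as a single comparison. The rule that a short block is chosen when the PE value exceeds the threshold is inferred from the statement that a low threshold leads to short blocks being selected. The block lengths of 2048 and 256 samples are those used in FIGS. 11 and 12, and the function name select_block_length is an assumption.

```python
LONG_BLOCK = 2048   # samples, as in the example of FIG. 11
SHORT_BLOCK = 256   # samples, as in the example of FIG. 12


def select_block_length(pe_value, pe_threshold):
    """Choose the short block when the frame's PE exceeds the threshold.

    The direction of the comparison is an assumption inferred from the
    text: lowering the threshold makes short blocks more likely.
    """
    return SHORT_BLOCK if pe_value > pe_threshold else LONG_BLOCK
```

For the same frame, a lower threshold (as for classical) yields the short block while a higher one keeps the long block, which is the behaviour the paragraph above describes.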
[0194]
Furthermore, according to the above embodiment, the PE threshold value is appropriately set without making the user aware of it by changing the PE threshold value according to the genre information of the ID3 tag which is metadata generally used in the encoded audio data. Therefore, it is possible to obtain a bit stream with good sound quality without extra user effort. In other words, it is possible to select an appropriate block length according to the genre of the input signal without extra user effort.
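Reading the genre from a tag can be sketched as below, assuming the ID3v1 layout: a 128-byte block that starts with "TAG" and ends with a one-byte genre index. The source says only "ID3 tag", so the tag version is an assumption, as are the function name id3v1_genre and the decision to map just four of the standard genre indices.

```python
# Subset of the standard ID3v1 genre indices, mapped to lowercase keys.
ID3V1_GENRES = {
    8: "jazz",
    13: "pop",
    17: "rock",
    32: "classical",
}


def id3v1_genre(tag_bytes):
    """Return the genre named by a 128-byte ID3v1 tag, or None.

    None is returned when the block is not an ID3v1 tag or when its
    genre index is not in the table above, so the caller can fall back
    to a default PE threshold.
    """
    if len(tag_bytes) != 128 or tag_bytes[:3] != b"TAG":
        return None
    return ID3V1_GENRES.get(tag_bytes[127])
```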
[0195]
Furthermore, according to the above-described embodiment, the PE threshold value is corrected appropriately based on the history of past PE values and block selection information, so that it follows the input signal. Blocks that could not otherwise be adjusted for can then be selected appropriately, and a bit stream with better sound quality is obtained. That is, block length selection can be controlled appropriately according to the input signal, and a higher-quality bit stream can be provided.
[0196]
Furthermore, according to the above-described embodiment, the genre of the input signal is determined automatically and the PE threshold is changed accordingly, so the PE threshold can be set according to the genre of the input signal without making the user aware of it, which improves user convenience.
[0197]
This embodiment can be realized by a computer executing a program. Means for supplying the program to a computer, for example a computer-readable recording medium such as a CD-ROM on which the program is recorded, or a transmission medium such as the Internet that transmits the program, can also be applied as embodiments of the present invention. A computer program product such as a computer-readable recording medium on which the program is recorded can likewise be applied as an embodiment of the present invention. The program, recording medium, transmission medium, and computer program product are included in the scope of the present invention. As the recording medium, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a magnetic tape, a nonvolatile memory card, a ROM, or the like can be used.
[0198]
The above-described embodiments are merely examples of implementation in carrying out the present invention, and the technical scope of the present invention should not be construed in a limited manner. That is, the present invention can be implemented in various forms without departing from the technical idea or the main features thereof.
[0199]
[Effect of the invention]
As described above, according to the present invention, the auditory entropy threshold referred to when determining the transform block length is determined according to the genre of the audio input signal, so encoding processing with good sound quality can be realized, particularly at low bit rates.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration example of an audio signal encoding device according to a first embodiment of the present invention.
FIG. 2 is a flowchart of an audio signal encoding process in the second embodiment of the present invention.
FIG. 3 is a schematic diagram of a PE threshold value table according to the first embodiment of the present invention.
FIG. 4 is a schematic diagram of a PE correction value table according to the first embodiment of the present invention.
FIG. 5 is a diagram illustrating a configuration example of an audio signal encoding device according to a second embodiment of the present invention.
FIG. 6 is a diagram showing a content configuration example of a storage medium storing an audio signal encoding processing program according to the second embodiment of the present invention.
FIG. 7 is a schematic diagram for introducing an audio signal encoding process according to the second embodiment of the present invention into a personal computer.
FIG. 8 is a configuration diagram of a memory map in a second embodiment of the present invention.
FIG. 9 is a configuration diagram of a memory map according to another embodiment of the present invention.
FIG. 10 is a schematic diagram of an audio signal.
FIG. 11 is a schematic diagram of an audio signal when the audio signal shown in FIG. 10 is encoded and decoded in units of 2048 samples.
FIG. 12 is a schematic diagram of an audio signal when the audio signal shown in FIG. 10 is encoded and decoded in units of 256 samples.
FIG. 13 is a diagram illustrating a configuration example of an audio signal encoding device according to a third embodiment of the present invention.
FIG. 14 is a flowchart of audio signal encoding processing according to the third embodiment of the present invention.
FIG. 15 is a schematic diagram of a PE threshold value table according to the third embodiment of the present invention.
FIG. 16 is a schematic diagram of a PE correction value table according to the third embodiment of the present invention.
FIG. 17 is a diagram illustrating a configuration example of an audio signal encoding device according to a fourth embodiment of the present invention.
FIG. 18 is a schematic diagram of a PE history table according to the fourth embodiment of the present invention.
FIG. 19 is a flowchart of audio signal encoding processing in the fourth embodiment of the present invention.
FIG. 20 is a flowchart of PE threshold value correction processing according to the fourth embodiment of the present invention.
FIG. 21 is a diagram illustrating a configuration example of an audio signal encoding device according to a fifth embodiment of the present invention.
FIG. 22 is a flowchart of audio signal encoding processing in the fifth embodiment of the present invention.
FIG. 23 is a flowchart of genre automatic determination processing in the fifth embodiment of the present invention.
[Explanation of symbols]
1 Frame divider
2 Filter bank
3 Auditory psychological calculator
4 Block length determiner
5 Bit allocator
6 Quantizer
7 Bit shaper
8 ID3 tag input device
9 PE history temporary storage
10 Genre automatic judgment device
100 CPU
101 Memory
102 Bus
103 Terminal
104 External storage device
105 Media drive
106 Microphone
107 Speaker
108 Communication line
109 Communication interface
110 Storage medium

Claims

1. An audio signal encoding apparatus comprising:
a frame dividing unit that divides an audio input signal into frames serving as processing units;
an auditory psychological calculation unit that analyzes the frame-divided audio input signal and outputs an auditory entropy value;
a block length determination unit that determines a transform block length of the frame based on the auditory entropy value output by the auditory psychological calculation unit and an auditory entropy threshold; and
a filter bank that divides the frame into blocks according to the block length determined by the block length determination unit and converts the blocks into a frequency spectrum,
wherein the block length determination unit determines the auditory entropy threshold according to a genre of the audio input signal.

2. The audio signal encoding apparatus according to claim 1, further comprising:
a bit allocation unit that divides the frequency spectrum converted by the filter bank into a plurality of frequency bands and determines a bit amount to be allocated to each frequency band according to a signal-to-mask ratio calculated by the auditory psychological calculation unit;
a quantization unit that quantizes the frequency spectrum converted by the filter bank according to the bit allocation determined by the bit allocation unit; and
a bit shaping unit that shapes the quantized spectrum produced by the quantization unit according to a prescribed format to generate a bit stream.

3. The audio signal encoding apparatus according to claim 1 or 2, wherein the block length determination unit determines the auditory entropy threshold by adding a correction value for each genre to, or subtracting it from, a reference value of the auditory entropy threshold.

4. The audio signal encoding apparatus according to claim 1, wherein the genre of the audio input signal is designated by a user.

5. The audio signal encoding apparatus according to any one of claims 1 to 4, further comprising an ID3 tag input unit for inputting an ID3 tag,
wherein the block length determination unit determines the auditory entropy threshold according to the genre in the ID3 tag input by the ID3 tag input unit.

6. The audio signal encoding apparatus according to any one of claims 1 to 5, further comprising a history storage unit that stores a history of the auditory entropy values of each frame output by the auditory psychological calculation unit,
wherein the block length determination unit determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.

7. The audio signal encoding apparatus according to claim 6, wherein the history storage unit stores a history of the transform block length together with the history of auditory entropy values, and
the block length determination unit determines the auditory entropy threshold according to the history of auditory entropy values and the history of the transform block length.

8. The audio signal encoding apparatus according to claim 7, wherein the block length determination unit selects and determines either a long block length or a short block length as the transform block length.

9. The audio signal encoding apparatus according to claim 8, wherein the block length determination unit refers to the history stored in the history storage unit and determines the auditory entropy threshold according to the average of the auditory entropy values for frames in which the long block length was selected and the average of the auditory entropy values for frames in which the short block length was selected.

10. The audio signal encoding apparatus according to claim 8 or 9, wherein the block length determination unit determines the auditory entropy threshold according to the number of short block length selections.

11. The audio signal encoding apparatus according to claim 1 or 2, further comprising a genre determiner that determines the genre of the audio input signal,
wherein the block length determination unit determines the auditory entropy threshold according to the genre determined by the genre determiner.

12. An audio signal encoding apparatus comprising:
a frame dividing unit that divides an audio input signal into frames serving as processing units;
an auditory psychological calculation unit that analyzes the frame-divided audio input signal and outputs an auditory entropy value;
a block length determination unit that determines a transform block length of the frame based on the auditory entropy value output by the auditory psychological calculation unit and an auditory entropy threshold;
a filter bank that divides the frame into blocks according to the block length determined by the block length determination unit and converts the blocks into a frequency spectrum; and
a history storage unit that stores a history of the auditory entropy values of each frame output by the auditory psychological calculation unit,
wherein the block length determination unit determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.

13. The audio signal encoding apparatus according to claim 12, further comprising:
a bit allocation unit that divides the frequency spectrum converted by the filter bank into a plurality of frequency bands and determines a bit amount to be allocated to each frequency band according to a signal-to-mask ratio calculated by the auditory psychological calculation unit;
a quantization unit that quantizes the frequency spectrum converted by the filter bank according to the bit allocation determined by the bit allocation unit; and
a bit shaping unit that shapes the quantized spectrum produced by the quantization unit according to a prescribed format to generate a bit stream.

14. The audio signal encoding apparatus according to claim 12 or 13, wherein the history storage unit stores a history of the transform block length together with the history of auditory entropy values, and
the block length determination unit determines the auditory entropy threshold according to the history of auditory entropy values and the history of the transform block length.

15. The audio signal encoding apparatus according to claim 14, wherein the block length determination unit selects and determines either a long block length or a short block length as the transform block length.

16. The audio signal encoding apparatus according to claim 15, wherein the block length determination unit refers to the history stored in the history storage unit and determines the auditory entropy threshold according to the average of the auditory entropy values for frames in which the long block length was selected and the average of the auditory entropy values for frames in which the short block length was selected.

17. The audio signal encoding apparatus according to claim 15 or 16, wherein the block length determination unit determines the auditory entropy threshold according to the number of short block length selections.

18. An audio signal encoding method comprising:
a frame dividing step of dividing an audio input signal into frames serving as processing units;
an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value;
a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold; and
a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum,
wherein the block length determination step determines the auditory entropy threshold according to a genre of the audio input signal.

19. An audio signal encoding method comprising:
a frame dividing step of dividing an audio input signal into frames serving as processing units;
an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value;
a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold;
a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum; and
a history storage step of storing, in a history storage unit, a history of the auditory entropy values of each frame output by the auditory psychological calculation step,
wherein the block length determination step determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.

20. A program for causing a computer to execute:
a frame dividing step of dividing an audio input signal into frames serving as processing units;
an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value;
a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold; and
a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum,
wherein the block length determination step determines the auditory entropy threshold according to a genre of the audio input signal.

21. A program for causing a computer to execute:
a frame dividing step of dividing an audio input signal into frames serving as processing units;
an auditory psychological calculation step of analyzing the frame-divided audio input signal and outputting an auditory entropy value;
a block length determination step of determining a transform block length of the frame based on the output auditory entropy value and an auditory entropy threshold;
a conversion step of dividing the frame into blocks according to the determined block length and converting the blocks into a frequency spectrum; and
a history storage step of storing, in a history storage unit, a history of the auditory entropy values of each frame output by the auditory psychological calculation step,
wherein the block length determination step determines the auditory entropy threshold according to the history of auditory entropy values stored in the history storage unit.