JP3858784B2

JP3858784B2 - Audio signal time axis companding device, method and program

Info

Publication number: JP3858784B2
Application number: JP2002233085A
Authority: JP
Inventors: 多伸近藤; ボナダジョルディ
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2002-08-09
Filing date: 2002-08-09
Publication date: 2006-12-20
Anticipated expiration: 2022-08-09
Also published as: JP2004070240A

Description

【０００１】
【発明の属する技術分野】
本発明は、原オーディオ信号のピッチ及び音質を変えずに原オーディオ信号を所望の圧伸率で時間軸圧伸するオーディオ信号の時間軸圧伸装置及び方法に関する。
【０００２】
【従来の技術】
時間軸圧伸の方法は、時間領域で処理を行うものと、周波数領域で処理を行うものとの２つに大別される。一般に、時間領域の処理は処理負荷が低く、リアルタイムでの処理が容易であるが、良い音質を得ることは難しい。一方、周波数領域の処理は良い音質を得やすいが、ＦＦＴなどにより時間領域表現を周波数領域表現に変換する場合でも、フェイズボコーダなどにより正弦波の組に分解する場合でも処理負荷が高くリアルタイムでの処理が難しい。
【０００３】
【発明が解決しようとする課題】
本発明は、この点に鑑みてなされたものであり、周波数領域でデータ処理を行う場合にデータ処理量を削減し、リアルタイムでの処理をも可能としたオーディオ信号の時間軸圧伸装置、方法及びプログラムを提供することを目的とする。
【０００４】
【課題を解決するための手段】
上記目的達成のため、本出願の第１の発明に係るオーディオ信号の時間軸圧伸装置は、フレームに区切られたオーディオ信号からその周波数スペクトルの振幅エンベロープの複数のローカルピークを検出し、該ローカルピークの振幅データと位相データを分析フレームとして前記フレーム毎に出力する検出部と、人間の聴覚心理特性曲線と前記分析フレームの振幅データとを比較して該聴覚心理特性曲線よりも小さいローカルピークの振幅データ及び位相データを削除した分析フレームを生成するデータ削減部と、前記分析フレームの単位時間当たりのフレーム数を所定の時間軸圧伸率に基づいて調整し、調整した分析フレームの振幅データと該調整した分析フレームの隣接したローカルピークを連携させる処理とにより求めた瞬時周波数から算出した位相データに基づいてオーディオ信号を合成する合成部とを備えたことを特徴とする。
【０００５】
この第１の発明に係る音声合成装置によれば、分析部で分析された各フレームのピークのデータが、データ削減部において、聴覚心理特性曲線と比較される。そして、この比較の結果に基づいてピークのデータが削減される。このため、その後のピーク連携部、位相生成部、合成部における負荷が大きく軽減され、オーディオ信号のリアルタイム処理が可能になる。
【０００６】
上記目的達成のため、本出願の第２の発明に係るオーディオ信号の時間軸圧伸方法は、原オーディオ信号を所望の圧縮率で時間軸圧伸する時間軸圧伸装置によるオーディオ信号の時間軸圧伸方法であって、フレームに区切られたオーディオ信号からその周波数スペクトルの振幅エンベロープの複数のローカルピークを検出し、該ローカルピークの振幅データと位相データを分析フレームとして前記フレーム毎に検出する検出ステップと、人間の聴覚心理特性曲線と前記分析フレームの振幅データとを比較して該聴覚心理特性曲線よりも小さいローカルピークの振幅データ及び位相データを削除した分析フレームを生成するデータ削減ステップと、前記分析フレームの単位時間当たりのフレーム数を所定の時間軸圧伸率に基づいて調整し、調整した分析フレームの振幅データと該調整した分析フレームの隣接したローカルピークを連携させる処理とにより求めた瞬時周波数から算出した位相データに基づいてオーディオ信号を合成する合成ステップとを備えたことを特徴とする。
【０００７】
上記目的達成のため、本出願の第３の発明に係るオーディオ信号の時間軸圧伸用プログラムは、フレームに区切られたオーディオ信号からその周波数スペクトルの振幅エンベロープの複数のローカルピークを検出し、該ローカルピークの振幅データと位相データを分析フレームとして前記フレーム毎に出力する検出ステップと、人間の聴覚心理特性曲線と前記分析フレームの振幅データとを比較して該聴覚心理特性曲線よりも小さいローカルピークの振幅データ及び位相データを削除した分析フレームを生成するデータ削減ステップと、前記分析フレームの単位時間当たりのフレーム数を所定の時間軸圧伸率に基づいて調整し、調整した分析フレームの振幅データと該調整した分析フレームの隣接したローカルピークを連携させる処理とにより求めた瞬時周波数から算出した位相データに基づいてオーディオ信号を合成する合成ステップとをコンピュータに実行させるように構成されたことを特徴とする。
【０００８】
【発明の実施の形態】
次に、本発明の実施の形態を図面に沿って詳細に説明する。
図１は、本発明の実施の形態に係るオーディオ信号の時間軸圧伸装置の全体構成を示している。図１に示すように、本発明の実施の形態に係るオーディオ信号の時間軸圧伸装置は、分析部１０、聴覚心理特性評価部２０、フレーム調整部２５、タイムスケーリング部３０、合成部４０とから大略構成されている。
【０００９】
分析部１０は、窓関数乗算部１１と、ＦＦＴ部１２と、スペクトルピーク検出部１３とを含んでいる。窓関数乗算部１１は、ハミング関数などの窓関数を生成すると共にこの窓関数を入力オーディオ信号に乗算して、これにより入力オーディオ信号をフレーム単位で切り出すためのものである。ＦＦＴ部１２は、窓関数乗算部１１からの入力に対し高速フーリエ変換（ＦＦＴ）を施して、振幅データと位相データとを含んだフレーム単位の周波数スペクトルデータを出力するものである。スペクトルピーク検出部１３は、設定した帯域での振幅データの最大値を求めることなどによるピーク検出アルゴリズムを使用して、ＦＦＴ部１２から出力された周波数スペクトルの振幅のエンベロープのローカルピークを検出し、検出したローカルピークの振幅データと位相データを分析フレームＡＦ_nとして出力する。このとき、ＦＦＴの結果のサンプル点だけによってピークを検出するのではなく、周波数が近接する数個のサンプル点を使って、スプライン補間や２次補間を用いてサンプル点間のピークとなるはずの周波数もピークとして検出する。
【００１０】
聴覚心理特性評価部２０は、聴覚心理特性曲線を記憶したテーブルを備えている。この聴覚心理特性曲線とは、人間の耳の聴神経で検知され得る音の特性を示したものであり、例えば、後述する最小可聴限特性曲線や、周波数マスキング特性曲線、ラウドネス特性曲線などがこれに該当する。聴覚心理特性評価部２０は、聴覚心理特性曲線に基づいて、前記スペクトルピーク検出部１３で検出されたローカルピークデータを削減する。
【００１１】
聴覚心理特性評価部２０のテーブルに最小可聴限特性曲線が記憶される場合について説明する。最小可聴限特性曲線とは、図２に点線ＡＳで示すグラフのように、人間の耳が音を聴く際に、聴こえる音の中で最も小さなレベルと周波数との関係を示すデータである。
聴覚心理特性評価部２０は、この最小可聴限特性曲線と、スペクトルピーク検出部１３で検出されたローカルピークとを比較して、分析フレームＡＦ_nから最小可聴限特性曲線ＡＳよりも小さい値のローカルピークのデータ（図２の黒丸印のデータ）を削除して、次段のフレーム数調整部２５に出力する。これにより、後段のタイムスケーリング部３０での処理するデータ量が減少するため、時間軸圧伸の処理量を削減することができる。
【００１２】
次に、聴覚心理特性評価部２０のテーブルに周波数マスキング特性曲線を記憶させる場合について説明する。
周波数マスキングとは、ある周波数の音声が感受される場合、その音声より振幅が小さく周波数が隣接する音声が聞こえにくくなる現象のことをいう。人間の耳は多数の聴神経を有しており、音の周波数により刺激を受ける聴神経が異なっており、また、ある周波数に対応する聴神経が刺激を受けた場合、それに隣接する周波数に対応する聴神経は逆に抑圧される。この抑圧の度合いを示したものが、周波数マスキング特性曲線である。
【００１３】
図３はこの周波数マスキング特性曲線の一例である。
スペクトルピーク検出部１３で検出されたローカルピークのうち、振幅の大きいものを複数個選択し、この選択されたローカルピークＰｍｉを頂点として右下方向、左下方向に伸びる直線Ｌｉ、Ｌｉ´を描く。そして、この複数のＬｉ、Ｌｉ´を接続した周波数マスキング曲線ＭＬを形成し、このマスキング曲線ＭＬよりも下にあるローカルピークのデータを分析フレームＡＦ_nから削除してデータ量を削減し、次段のフレーム数調整部２５に出力する。
【００１４】
この最小可聴限特性曲線、周波数マスキング特性曲線の両方に基づいてもローカルピークのデータを削減するようにすることもできる。これにより、データの削減量を更に大きくすることが出来る。
【００１５】
フレーム数調整部２５は、聴覚心理特性評価部２０から出力されたデータに対し、このデータが所望の圧伸率となるよう、分析フレームＡＦ_nを単位として間引き、繰り返しを行ってフレーム数の調整を行う。
タイムスケーリング部３０は、ピーク連携部３１と、位相生成部３２とを含んでいる。聴覚心理特性評価部２０から出力されたデータに対し、分析フレームを単位とした間引き、繰り返しにより所望の圧伸率になるような時間軸上のフレーム数の調整が行われた後、ピーク連携部３１は、図４に示すように、隣接する分析フレームＡＦ_n-1、ＡＦ_nにおいてそれぞれ検出されたローカルピークデータのうち、連続していると考えられるピークを選択して互いに連携させる処理を行う。すなわち、過去の分析フレームＡＦ_n-1のローカルピークｆ１、ｆ２、ｆ３・・・に対応するローカルピークが、現在の分析フレームＡＦ_n（ｆ1´、ｆ２´、ｆ３´・・・）にも存在するか否かをチェックし、存在する場合には、その対応するローカルピーク同士を連携させる。対応関係の判断は、両ローカルピークの周波数の差が所定値以内であるか否かにより判断し、所定値Δｆmax以上の差があるローカルピーク同士は連携の対象から除外する。この際、周波数の差が最も小さなローカルピークを連携させることで、過去の分析フレームＡＦ_n-1の複数のローカルピークと現在の分析フレームＡＦ_nの１つのローカルピークとが連携することを防止することができる。
【００１６】
この連携処理がなされた場合、この連携された２つのローカルピークの周波数の差を求め、これを分析フレームＡＦ_n-1、ＡＦ_nの間の時間で微分することにより、フレーム間の任意の位置での瞬時周波数ｆｒを求めることができるようになる。簡略的に、２つのローカルピークの平均周波数を瞬時周波数ｆｒとしてもよい。
【００１７】
位相生成部３２は、過去の分析フレームＡＦ_n-1の連携されたローカルピーク（周波数ｆ）での位相を初期位相Φ_AFn-1、_fと考え、この初期位相Φ_AFn-1、_fに瞬時周波数ｆｒとフレーム間の時間Δｔから求めた位相変化（２πｆｒ×Δｔ）を加算することにより、対応する現在のフレームＡＦ_nの対応するローカルピーク（周波数ｆ´）での正弦波成分の位相を求めることができる。更に過去の合成フレームＳＦ_n-1の連携されたローカルピークの位相に同じ位相変化を加算することで、合成フレームＳＦ_nの対応するローカルピークの位相を求める。連携するローカルピークが見つからないローカルピークの位相については、分析フレームの対応するローカルピークの位相がそのまま合成フレームの位相とされる。
なお、合成フレームＳＦ_nの振幅については、対応する分析フレームＡＦ_nの振幅がそのまま使われる。
【００１８】
合成部４０は、逆ＦＦＴ部４１と、窓関数重ね合わせ部４２とを含んでいる。逆ＦＦＴ部４１は、タイムスケーリング部３０で合成された合成フレームＳＦ_nに逆高速フーリエ変換（逆ＦＦＴ）を施して時間領域表現に変換する機能を有する。窓関数重ね合せ部４２は、得られた時間領域の出力オーディオ信号に窓関数を乗算すると共に、時間的に一部重複するように重ね合わせて外部に時間軸圧伸されたオーディオ信号として出力する部分である。
【００１９】
次に、この時間軸圧伸装置の作用を、図５に示すフローチャートに基づいて説明する。この時間軸圧伸装置に入力されるオーディオ信号は、まず窓関数乗算部１１に入力されて、窓関数と乗算される。これにより、入力オーディオ信号がフレーム単位で切り出される（Ｓ１）。このフレーム単位のオーディオ信号は、ＦＦＴ部１２において高速フーリエ変換（ＦＦＴ）されて、振幅データと位相データとを含むフレーム単位の周波数スペクトルデータが出力される（Ｓ２）。スペクトルピーク検出部１３は、ピーク検出アルゴリズムを使用して、ＦＦＴ部１２から出力された周波数スペクトルの振幅のエンベロープのローカルピークを検出し分析フレームＡＦ_nとして出力する（Ｓ３）。聴覚心理特性評価部２０は、この検出されたローカルピークと、図示しないテーブルに記憶された聴覚心理特性曲線とを比較してローカルピークのデータを削減する（Ｓ４）。
【００２０】
続いて、フレーム調整部２５において、所望の圧伸率に応じたフレーム数となるように、分析フレームＡＦ_nを単位として間引き、繰り返しが行われる（Ｓ５）。
次に、ピーク連携部３１において、隣接するフレームＡＦ_n-1、ＡＦ_nにおいて検出されたローカルピークデータのうち、対応関係にあるピークを選択して互いに連携させる。すなわち、過去のフレームＡＦ_n-1のローカルピークｆ１、ｆ２、ｆ３・・・に対応するローカルピークが、現在のフレームＡＦ_nにも存在するか否かをチェックし、存在する場合には、その対応するローカルピークｆ1´、ｆ２´、ｆ３´等をローカルピークｆ１、ｆ２、ｆ３等と連携させる（Ｓ６）。
【００２１】
次に、位相生成部３２において、過去の分析フレームＡＦ_n-1の連携されたローカルピーク（周波数ｆ）での位相位相Φ_AFn-1、_fと、連携された前後のローカルピークの周波数ｆ、ｆ´とに基づき、対応する合成フレームＳＦ_nのローカルピークでの位相を求める（Ｓ７）。
【００２２】
こうして、合成フレームＳＦ_nの振幅、位相データが求められると、これらのデータが逆ＦＦＴ部４１において逆高速フーリエ変換を施され、時間領域の信号に変換される。この時間領域に変換された各フレーム毎の信号が、窓関数乗算及重ね合せ部４２において重ね合わされ、時間軸圧伸されたオーディオ信号として出力される。
【００２３】
以上、実施の形態について説明したが、本発明はこれに限定されるものではなく、本発明の趣旨を逸脱しない範囲で様々な改変や追加が可能である。
例えば、分析部１０におけるピーク検出の手法はＦＦＴに限らず、その他の離散コサイン変換（ＤＣＴ）などの直交変換でもよく、切り出した各フレームのローカルピークが検出される手法であればよい。
また、上記実施の形態では、タイムスケーリング部３０での時間軸圧伸の処理で、振幅データは分析フレームＡＦ_nのデータをそのまま合成フレームＡＦ_nに用いることとしていたが、位相と同様に前後のフレームのデータの補間により求めるようにしてもよい。
【００２４】
【発明の効果】
以上説明したように、本発明に係るオーディオ信号の時間軸圧伸装置、方法及びプログラムによれば、周波数領域でデータ処理を行う場合にデータ処理量を削減し、オーディオ信号のリアルタイム処理が可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係るオーディオ信号の時間軸圧伸装置の全体構成を示すブロック図である。
【図２】聴覚心理特性評価部２０において、最小可聴限特性曲線を利用してデータ量の削減を行う手法を説明する概念図である。
【図３】聴覚心理特性評価部２０において、周波数マスキング特性曲線を利用してデータ量の削減を行う手法を説明する概念図である。
【図４】図１に示すピーク連携部３１の機能を説明するための概念図である。
【図５】図１に示す時間軸圧伸装置の作用を示すフローチャートである。
【符号の説明】
１０・・・分析部、１１・・・窓関数乗算部、１２・・・ＦＦＴ部、１３・・・スペクトルピーク検出部、２０・・・聴覚心理特性評価部、２５・・・フレーム数調整部、３０・・・タイムスケーリング部、３１・・・ピーク連携部、３２・・・位相生成部、４０・・・合成部、４１・・・逆ＦＦＴ部、４２・・・窓関数重ね合せ部
[0001]
[Technical field of the invention]
The present invention relates to an audio signal time axis companding device and method that compress or expand an original audio signal along the time axis at a desired companding ratio without changing the pitch or sound quality of the original audio signal.
[0002]
[Prior art]
Time axis companding methods fall broadly into two types: those that process in the time domain and those that process in the frequency domain. In general, time-domain processing has a low processing load and is easy to run in real time, but good sound quality is difficult to obtain. Frequency-domain processing, on the other hand, readily yields good sound quality, but whether the time-domain representation is converted into a frequency-domain representation by an FFT or the like, or the signal is decomposed into a set of sinusoids by a phase vocoder or the like, the processing load is high and real-time processing is difficult.
[0003]
[Problems to be solved by the invention]
The present invention has been made in view of this point. Its object is to provide an audio signal time axis companding device, method, and program that reduce the amount of data processing when processing is performed in the frequency domain and thereby also make real-time processing possible.
[0004]
[Means for Solving the Problems]
To achieve the above object, the audio signal time axis companding device according to the first invention of the present application comprises: a detection unit that detects a plurality of local peaks of the amplitude envelope of the frequency spectrum of an audio signal divided into frames, and outputs the amplitude data and phase data of the local peaks as an analysis frame for each frame; a data reduction unit that compares a human psychoacoustic characteristic curve with the amplitude data of the analysis frame and generates an analysis frame from which the amplitude data and phase data of local peaks smaller than the psychoacoustic characteristic curve have been deleted; and a synthesis unit that adjusts the number of analysis frames per unit time based on a predetermined time axis companding ratio, and synthesizes an audio signal based on the amplitude data of the adjusted analysis frames and on phase data calculated from an instantaneous frequency obtained by a process of linking local peaks of adjacent adjusted analysis frames.
[0005]
According to the device of the first invention, the peak data of each frame analyzed by the analysis unit is compared with the psychoacoustic characteristic curve in the data reduction unit, and the peak data is reduced based on the result of this comparison. The load on the subsequent peak linking unit, phase generation unit, and synthesis unit is therefore greatly reduced, which makes real-time processing of the audio signal possible.
[0006]
To achieve the above object, the audio signal time axis companding method according to the second invention of the present application is a method performed by a time axis companding device that compands an original audio signal along the time axis at a desired compression ratio, the method comprising: a detection step of detecting a plurality of local peaks of the amplitude envelope of the frequency spectrum of an audio signal divided into frames, and detecting the amplitude data and phase data of the local peaks as an analysis frame for each frame; a data reduction step of comparing a human psychoacoustic characteristic curve with the amplitude data of the analysis frame and generating an analysis frame from which the amplitude data and phase data of local peaks smaller than the psychoacoustic characteristic curve have been deleted; and a synthesis step of adjusting the number of analysis frames per unit time based on a predetermined time axis companding ratio, and synthesizing an audio signal based on the amplitude data of the adjusted analysis frames and on phase data calculated from an instantaneous frequency obtained by a process of linking local peaks of adjacent adjusted analysis frames.
[0007]
To achieve the above object, the audio signal time axis companding program according to the third invention of the present application is configured to cause a computer to execute: a detection step of detecting a plurality of local peaks of the amplitude envelope of the frequency spectrum of an audio signal divided into frames, and outputting the amplitude data and phase data of the local peaks as an analysis frame for each frame; a data reduction step of comparing a human psychoacoustic characteristic curve with the amplitude data of the analysis frame and generating an analysis frame from which the amplitude data and phase data of local peaks smaller than the psychoacoustic characteristic curve have been deleted; and a synthesis step of adjusting the number of analysis frames per unit time based on a predetermined time axis companding ratio, and synthesizing an audio signal based on the amplitude data of the adjusted analysis frames and on phase data calculated from an instantaneous frequency obtained by a process of linking local peaks of adjacent adjusted analysis frames.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 shows the overall configuration of an audio signal time axis companding device according to an embodiment of the present invention. As shown in FIG. 1, the device is broadly composed of an analysis unit 10, a psychoacoustic characteristic evaluation unit 20, a frame number adjustment unit 25, a time scaling unit 30, and a synthesis unit 40.
[0009]
The analysis unit 10 includes a window function multiplication unit 11, an FFT unit 12, and a spectrum peak detection unit 13. The window function multiplication unit 11 generates a window function such as a Hamming window and multiplies the input audio signal by it, thereby cutting the input audio signal into frames. The FFT unit 12 applies a fast Fourier transform (FFT) to the output of the window function multiplication unit 11 and outputs frame-by-frame frequency spectrum data containing amplitude data and phase data. The spectrum peak detection unit 13 uses a peak detection algorithm, for example one that finds the maximum of the amplitude data within a set band, to detect the local peaks of the amplitude envelope of the frequency spectrum output from the FFT unit 12, and outputs the amplitude data and phase data of the detected local peaks as an analysis frame AF_n. Peaks are not detected from the FFT sample points alone: using several sample points at neighboring frequencies, spline interpolation or quadratic interpolation is applied so that a frequency lying between sample points where the peak should fall is also detected as a peak.
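The interpolated peak detection described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `detect_local_peaks`, the tuple layout `(frequency_hz, amplitude_db, phase)`, and the choice of three-point quadratic (parabolic) interpolation on dB magnitudes are assumptions made for the example, and the phase is simply taken from the nearest FFT bin.

```python
def detect_local_peaks(magnitudes_db, phases, bin_hz):
    """Return a list of (frequency_hz, amplitude_db, phase) local peaks.

    magnitudes_db and phases hold one value per FFT bin; bin_hz is the bin spacing.
    All names here are hypothetical and chosen for illustration only.
    """
    peaks = []
    for k in range(1, len(magnitudes_db) - 1):
        left, mid, right = magnitudes_db[k - 1], magnitudes_db[k], magnitudes_db[k + 1]
        if mid > left and mid >= right:
            # Fit a parabola through the three bins around the maximum.
            curvature = left - 2.0 * mid + right  # strictly negative at a peak
            offset = 0.5 * (left - right) / curvature  # in bins, within [-0.5, 0.5]
            amplitude = mid - 0.25 * (left - right) * offset
            # Assumption: reuse the phase of the nearest bin without interpolation.
            peaks.append(((k + offset) * bin_hz, amplitude, phases[k]))
    return peaks
```

With a symmetric spectrum the interpolated peak stays on the bin center; when one neighbor is higher than the other, the estimated frequency shifts toward that neighbor.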
[0010]
The psychoacoustic characteristic evaluation unit 20 includes a table that stores a psychoacoustic characteristic curve. A psychoacoustic characteristic curve describes the characteristics of sound that can be detected by the auditory nerves of the human ear; examples include the minimum audible limit characteristic curve and the frequency masking characteristic curve described later, as well as a loudness characteristic curve. The psychoacoustic characteristic evaluation unit 20 reduces the local peak data detected by the spectrum peak detection unit 13 based on the psychoacoustic characteristic curve.
[0011]
First, the case where the minimum audible limit characteristic curve is stored in the table of the psychoacoustic characteristic evaluation unit 20 will be described. The minimum audible limit characteristic curve, shown by the dotted line AS in FIG. 2, is data indicating the relationship between frequency and the lowest level of sound that the human ear can hear.
The psychoacoustic characteristic evaluation unit 20 compares this minimum audible limit characteristic curve with the local peaks detected by the spectrum peak detection unit 13, deletes from the analysis frame AF_n the data of local peaks whose values are smaller than the minimum audible limit characteristic curve AS (the data marked with black circles in FIG. 2), and outputs the result to the frame number adjustment unit 25 in the next stage. Because this reduces the amount of data that the subsequent time scaling unit 30 has to process, the processing load of time axis companding is reduced.
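A minimal sketch of pruning against the minimum audible limit follows. The patent stores the curve in a table; as a stand-in, this example assumes the widely used Terhardt-style closed-form approximation of the threshold in quiet, and it assumes peak amplitudes are already calibrated in dB SPL. The function names and the `(frequency_hz, amplitude_db, phase)` tuple layout are hypothetical.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate absolute threshold of hearing in dB SPL (assumed stand-in for the table)."""
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


def drop_inaudible_peaks(peaks):
    """Keep only peaks whose amplitude is not below the threshold at their frequency."""
    return [p for p in peaks if p[0] > 0.0 and p[1] >= threshold_in_quiet_db(p[0])]
```

For instance, a 10 dB peak at 100 Hz falls below the curve and is removed, while a 10 dB peak at 1 kHz is kept, because the ear is far more sensitive in the mid range.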
[0012]
Next, the case where the frequency masking characteristic curve is stored in the table of the psychoacoustic characteristic evaluation unit 20 will be described.
Frequency masking is the phenomenon in which, when a sound of a certain frequency is perceived, sounds of smaller amplitude at adjacent frequencies become difficult to hear. The human ear has a large number of auditory nerves, and which nerves are stimulated depends on the frequency of the sound. When the auditory nerve corresponding to a certain frequency is stimulated, the auditory nerves corresponding to adjacent frequencies are conversely suppressed. The frequency masking characteristic curve indicates the degree of this suppression.
[0013]
FIG. 3 shows an example of this frequency masking characteristic curve.
Among the local peaks detected by the spectrum peak detection unit 13, several with large amplitude are selected, and straight lines Li and Li' extending toward the lower right and lower left are drawn with each selected local peak Pmi as their apex. These lines Li and Li' are then joined to form a frequency masking curve ML. The data of local peaks lying below the masking curve ML is deleted from the analysis frame AF_n to reduce the amount of data, and the result is output to the frame number adjustment unit 25 in the next stage.
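The construction of the masking curve ML from descending straight lines can be sketched as below. This is an illustration under assumptions: the number of maskers and the two slopes (in dB per Hz) are made-up parameters, since the patent does not state numeric values, and the curve is taken as the upper envelope of all lines Li and Li'.

```python
def prune_masked_peaks(peaks, num_maskers=2, lower_slope=0.10, upper_slope=0.05):
    """Drop peaks lying below a piecewise-linear masking curve.

    peaks is a list of (frequency_hz, amplitude_db, phase) tuples.
    num_maskers, lower_slope and upper_slope are hypothetical tuning values.
    """
    # Select the largest-amplitude peaks as the apexes Pmi of the masking lines.
    maskers = sorted(peaks, key=lambda p: p[1], reverse=True)[:num_maskers]

    def masking_level(freq_hz):
        levels = []
        for masker_freq, masker_amp, _ in maskers:
            slope = upper_slope if freq_hz >= masker_freq else lower_slope
            levels.append(masker_amp - slope * abs(freq_hz - masker_freq))
        return max(levels)  # upper envelope of all lines Li and Li'

    # A masker sits exactly on the curve, so it is kept by the >= comparison.
    return [p for p in peaks if p[1] >= masking_level(p[0])]
```

With a single 60 dB masker at 1000 Hz and the default upper slope, a 50 dB peak at 1100 Hz lies under the curve (55 dB there) and is removed, while a 30 dB peak at 2000 Hz lies above it (10 dB there) and survives.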
[0014]
It is also possible to reduce local peak data based on both the minimum audible limit characteristic curve and the frequency masking characteristic curve. Thereby, the amount of data reduction can be further increased.
[0015]
The frame number adjustment unit 25 adjusts the number of frames in the data output from the psychoacoustic characteristic evaluation unit 20 by thinning out or repeating frames, with the analysis frame AF_n as the unit, so that the data attains the desired companding ratio.
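Thinning and repetition of analysis frames can be sketched as a simple index mapping. This is a minimal example that assumes the companding ratio is defined as output duration divided by input duration (so a ratio above 1 repeats frames and a ratio below 1 skips them); the patent does not fix this convention, and the function name is hypothetical.

```python
def adjust_frame_count(frames, ratio):
    """Return frames thinned or repeated so the frame count scales by ratio.

    Assumption: ratio = output duration / input duration.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not frames:
        return []
    out_count = max(1, round(len(frames) * ratio))
    last = len(frames) - 1
    # Output frame i is taken from the analysis frame nearest below i / ratio.
    return [frames[min(last, int(i / ratio))] for i in range(out_count)]
```

For four frames a, b, c, d, a ratio of 2 yields a, a, b, b, c, c, d, d, and a ratio of 0.5 yields a, c.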
The time scaling unit 30 includes a peak linking unit 31 and a phase generation unit 32. After the number of frames on the time axis has been adjusted, by thinning out and repeating analysis frames in the data output from the psychoacoustic characteristic evaluation unit 20 so that the desired companding ratio is obtained, the peak linking unit 31 selects, as shown in FIG. 4, those peaks among the local peak data detected in the adjacent analysis frames AF_n-1 and AF_n that are considered to be continuous and links them to each other. That is, it checks whether local peaks corresponding to the local peaks f1, f2, f3, ... of the past analysis frame AF_n-1 also exist in the current analysis frame AF_n (f1', f2', f3', ...), and if so, links the corresponding local peaks. Correspondence is judged by whether the frequency difference between the two local peaks is within a predetermined value; local peaks whose difference is equal to or greater than the predetermined value Δfmax are excluded from linking. By linking the local peaks with the smallest frequency difference, a plurality of local peaks in the past analysis frame AF_n-1 are prevented from being linked to a single local peak in the current analysis frame AF_n.
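One way to realize the linking rule described above is a greedy one-to-one match ordered by frequency difference. The sketch below is an assumption about how the "smallest difference first" rule might be implemented, not the patent's stated algorithm; `link_peaks` and `max_diff_hz` (standing in for Δfmax) are hypothetical names.

```python
def link_peaks(prev_freqs, curr_freqs, max_diff_hz):
    """Map each linked current-peak index to its previous-peak index.

    Pairs whose frequency difference is max_diff_hz or more are never linked.
    """
    candidates = sorted(
        (abs(fc - fp), i, j)
        for i, fp in enumerate(prev_freqs)
        for j, fc in enumerate(curr_freqs)
        if abs(fc - fp) < max_diff_hz
    )
    links, used_prev, used_curr = {}, set(), set()
    for _, i, j in candidates:
        # Closest pairs are taken first, and each peak is used at most once.
        if i not in used_prev and j not in used_curr:
            links[j] = i
            used_prev.add(i)
            used_curr.add(j)
    return links
```

Because each index is consumed once, two past peaks at 100 Hz and 110 Hz cannot both link to a single current peak at 104 Hz; only the closer one (100 Hz) is linked.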
[0016]
Once peaks have been linked, the difference between the frequencies of the two linked local peaks is obtained and differentiated with respect to the time between the analysis frames AF_n-1 and AF_n, which makes it possible to obtain the instantaneous frequency fr at any position between the frames. As a simplification, the average frequency of the two local peaks may be used as the instantaneous frequency fr.
[0017]
The phase generation unit 32 takes the phase at the linked local peak (frequency f) of the past analysis frame AF_n-1 as the initial phase Φ_AFn-1,f. By adding to this initial phase the phase change (2πfr × Δt) obtained from the instantaneous frequency fr and the time Δt between frames, the phase of the sinusoidal component at the corresponding local peak (frequency f') of the current frame AF_n can be obtained. Furthermore, by adding the same phase change to the phase of the linked local peak of the past synthesized frame SF_n-1, the phase of the corresponding local peak of the synthesized frame SF_n is obtained. For a local peak for which no linked local peak is found, the phase of the corresponding local peak in the analysis frame is used unchanged as the phase in the synthesized frame.
For the amplitude of the synthesized frame SF_n, the amplitude of the corresponding analysis frame AF_n is used unchanged.
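The phase update for the synthesized frame can be sketched as follows, using the simplified instantaneous frequency (the mean of the two linked peak frequencies). The function names, the `links` mapping from current-peak index to previous-peak index, and the wrapping of the result to (-π, π] are assumptions made for the example.

```python
import math


def propagate_phase(prev_synth_phase, f_prev_hz, f_curr_hz, frame_dt_s):
    """Advance a linked peak's phase by 2*pi*fr*dt, with fr the mean frequency."""
    f_inst = 0.5 * (f_prev_hz + f_curr_hz)  # simplified instantaneous frequency fr
    phase = prev_synth_phase + 2.0 * math.pi * f_inst * frame_dt_s
    return math.atan2(math.sin(phase), math.cos(phase))  # wrap to (-pi, pi]


def synth_phases(curr_peaks, prev_freqs, prev_synth_phases, links, frame_dt_s):
    """Phase for each peak of SF_n; unlinked peaks keep their analysis phase."""
    phases = []
    for j, (freq_hz, _amplitude, analysis_phase) in enumerate(curr_peaks):
        if j in links:
            i = links[j]
            phases.append(propagate_phase(prev_synth_phases[i], prev_freqs[i],
                                          freq_hz, frame_dt_s))
        else:
            phases.append(analysis_phase)
    return phases
```

For a steady 100 Hz component and a frame interval of 2.5 ms, the phase advances by a quarter cycle (π/2) per frame, regardless of how many frames were repeated or skipped upstream.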
[0018]
The synthesis unit 40 includes an inverse FFT unit 41 and a window function superposition unit 42. The inverse FFT unit 41 applies an inverse fast Fourier transform (inverse FFT) to the synthesized frame SF_n produced by the time scaling unit 30 to convert it into a time-domain representation. The window function superposition unit 42 multiplies the resulting time-domain output audio signal by a window function, overlaps successive frames so that they partially coincide in time, and outputs the result as the time axis companded audio signal.
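The windowing and overlapping performed by the window function superposition unit 42 corresponds to the familiar overlap-add scheme, sketched below. The example assumes the frames have already been converted to the time domain, and it uses a periodic Hann window with a hop of half the frame length; the patent names neither the synthesis window nor the hop size, so both are assumptions.

```python
import math


def hann(size):
    """Periodic Hann window (assumed choice for the synthesis window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / size) for i in range(size)]


def overlap_add(frames, hop):
    """Window each equal-length time-domain frame and sum them at multiples of hop."""
    if not frames:
        return []
    size = len(frames[0])
    window = hann(size)
    out = [0.0] * (hop * (len(frames) - 1) + size)
    for m, frame in enumerate(frames):
        for i, sample in enumerate(frame):
            out[m * hop + i] += window[i] * sample
    return out
```

With a hop of half the frame length, the overlapping Hann windows sum to one, so constant-valued frames reconstruct a constant signal in the fully overlapped region.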
[0019]
Next, the operation of this time axis companding device will be described with reference to the flowchart shown in FIG. 5. The audio signal input to the device is first fed to the window function multiplication unit 11 and multiplied by the window function, which cuts the input audio signal into frames (S1). Each frame of the audio signal is subjected to a fast Fourier transform (FFT) in the FFT unit 12, which outputs frame-by-frame frequency spectrum data containing amplitude data and phase data (S2). The spectrum peak detection unit 13 uses the peak detection algorithm to detect the local peaks of the amplitude envelope of the frequency spectrum output from the FFT unit 12 and outputs them as the analysis frame AF_n (S3). The psychoacoustic characteristic evaluation unit 20 compares the detected local peaks with the psychoacoustic characteristic curve stored in a table (not shown) and reduces the local peak data (S4).
[0020]
Subsequently, the frame number adjustment unit 25 thins out or repeats frames, with the analysis frame AF_n as the unit, so that the number of frames corresponds to the desired companding ratio (S5).
Next, the peak linking unit 31 selects, from the local peak data detected in the adjacent frames AF_n-1 and AF_n, the peaks that correspond to each other and links them. That is, it checks whether local peaks corresponding to the local peaks f1, f2, f3, ... of the past frame AF_n-1 also exist in the current frame AF_n, and if so, links the corresponding local peaks f1', f2', f3', and so on with the local peaks f1, f2, f3, and so on (S6).
[0021]
Next, the phase generation unit 32 obtains the phase at each local peak of the corresponding synthesized frame SF_n based on the phase Φ_AFn-1,f at the linked local peak (frequency f) of the past analysis frame AF_n-1 and on the frequencies f and f' of the linked local peaks in the preceding and current frames (S7).
[0022]
Once the amplitude and phase data of the synthesized frame SF_n have been obtained in this way, the data is subjected to an inverse fast Fourier transform in the inverse FFT unit 41 and converted into a time-domain signal. The time-domain signals of the individual frames are multiplied by the window function and superimposed in the window function superposition unit 42, and the result is output as the time axis companded audio signal.
[0023]
Although the embodiment has been described above, the present invention is not limited to this, and various modifications and additions can be made without departing from the spirit of the present invention.
For example, the method of peak detection in the analysis unit 10 is not limited to FFT, but may be other orthogonal transforms such as discrete cosine transform (DCT), as long as a local peak of each cut frame is detected.
Furthermore, in the above embodiment, the time axis companding process in the time scaling unit 30 uses the amplitude data of the analysis frame AF_n unchanged for the synthesized frame SF_n; however, like the phase, the amplitude may instead be obtained by interpolating the data of the preceding and following frames.
[0024]
[Effects of the invention]
As described above, the audio signal time axis companding device, method, and program according to the present invention reduce the amount of data processing when processing is performed in the frequency domain and make real-time processing of audio signals possible.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the overall configuration of an audio signal time axis companding device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating a method for reducing the amount of data using a minimum audible limit characteristic curve in the psychoacoustic characteristic evaluation unit 20.
FIG. 3 is a conceptual diagram illustrating a method for reducing the amount of data using a frequency masking characteristic curve in the psychoacoustic characteristic evaluation unit 20.
FIG. 4 is a conceptual diagram for explaining the function of the peak linking unit 31 shown in FIG. 1.
FIG. 5 is a flowchart showing the operation of the time-axis companding device shown in FIG.
[Explanation of symbols]
10 ... analysis unit, 11 ... window function multiplication unit, 12 ... FFT unit, 13 ... spectrum peak detection unit, 20 ... psychoacoustic characteristic evaluation unit, 25 ... frame number adjustment unit, 30 ... time scaling unit, 31 ... peak linking unit, 32 ... phase generation unit, 40 ... synthesis unit, 41 ... inverse FFT unit, 42 ... window function superposition unit

Claims

An audio signal time axis companding device comprising:
a detection unit that detects a plurality of local peaks of the amplitude envelope of the frequency spectrum of an audio signal divided into frames, and outputs the amplitude data and phase data of the local peaks as an analysis frame for each frame;
a data reduction unit that compares a human psychoacoustic characteristic curve with the amplitude data of the analysis frame and generates an analysis frame from which the amplitude data and phase data of local peaks smaller than the psychoacoustic characteristic curve have been deleted; and
a synthesis unit that adjusts the number of analysis frames per unit time based on a predetermined time axis companding ratio, and synthesizes an audio signal based on the amplitude data of the adjusted analysis frames and on phase data calculated from an instantaneous frequency obtained by a process of linking local peaks of adjacent adjusted analysis frames.

The audio signal time axis companding device according to claim 1, wherein the psychoacoustic characteristic curve is a minimum audible limit characteristic curve indicating the relationship between frequency and the minimum sound pressure audible to the human ear.

The audio signal time axis companding device according to claim 1, wherein the psychoacoustic characteristic curve is a frequency masking characteristic curve indicating the degree to which, when a sound of a certain frequency is perceived by the human ear, sounds at nearby frequencies become difficult to hear.

An audio signal time axis companding method performed by a time axis companding device that compands an original audio signal along the time axis at a desired compression ratio, the method comprising:
a detection step of detecting a plurality of local peaks of the amplitude envelope of the frequency spectrum of an audio signal divided into frames, and detecting the amplitude data and phase data of the local peaks as an analysis frame for each frame;
a data reduction step of comparing a human psychoacoustic characteristic curve with the amplitude data of the analysis frame and generating an analysis frame from which the amplitude data and phase data of local peaks smaller than the psychoacoustic characteristic curve have been deleted; and
a synthesis step of adjusting the number of analysis frames per unit time based on a predetermined time axis companding ratio, and synthesizing an audio signal based on the amplitude data of the adjusted analysis frames and on phase data calculated from an instantaneous frequency obtained by a process of linking local peaks of adjacent adjusted analysis frames.

An audio signal time axis companding program configured to cause a computer to execute:
a detection step of detecting a plurality of local peaks of the amplitude envelope of the frequency spectrum of an audio signal divided into frames, and outputting the amplitude data and phase data of the local peaks as an analysis frame for each frame;
a data reduction step of comparing a human psychoacoustic characteristic curve with the amplitude data of the analysis frame and generating an analysis frame from which the amplitude data and phase data of local peaks smaller than the psychoacoustic characteristic curve have been deleted; and
a synthesis step of adjusting the number of analysis frames per unit time based on a predetermined time axis companding ratio, and synthesizing an audio signal based on the amplitude data of the adjusted analysis frames and on phase data calculated from an instantaneous frequency obtained by a process of linking local peaks of adjacent adjusted analysis frames.