JP6846822B2

JP6846822B2 - Audio signal processor, audio signal processing method, and audio signal processing program

Info

Publication number: JP6846822B2
Application number: JP2018514561A
Authority: JP
Inventors: 安藤 彰男
Original assignee: University of Toyama NUC
Current assignee: University of Toyama NUC
Priority date: 2016-04-27
Filing date: 2017-04-21
Publication date: 2021-03-24
Anticipated expiration: 2037-04-21
Also published as: WO2017188141A1; JPWO2017188141A1

Description

One aspect of the present invention relates to an audio signal processing device, an audio signal processing method, and an audio signal processing program.

Methods for changing the number of channels of an audio signal are conventionally known. Specifically, there is upmixing, which converts an M-channel audio signal into an N-channel audio signal (where N > M), and downmixing, which converts an N-channel audio signal into an M-channel audio signal. For example, converting a 2-channel (left and right) audio signal into a 5.1-channel audio signal is an example of upmixing, and converting a 5.1-channel audio signal into a 2-channel audio signal is an example of downmixing.

For example, Patent Document 1 below describes a surround playback device that renders stereo broadcasts of live television and radio sports programs with a powerful sense of presence and easy-to-hear announcements. The device has front left/right channel signal creation means, front center channel signal creation means, and rear left/right surround channel signal creation means. The front left/right channel signal creation means selectively adds reverberation to the front left/right channel audio signals obtained by applying matrix processing to the 2-channel audio input, adjusts the front volume, and outputs the results as the front left/right channel audio signals. The front center channel signal creation means takes the audio signal obtained by extracting the in-phase component from the 2-channel audio input, adjusts the center volume without adding reverberation, and outputs the result as the front center channel audio signal. The rear left/right surround channel signal creation means adds reverberation to the front left/right channel audio signals obtained by the matrix processing, adjusts the rear volume, and outputs the results as the rear left/right channel audio signals.

Non-Patent Documents 1 and 2 below both describe upmixing methods. Non-Patent Document 1 describes a method that divides a stereo signal into bands, splits the stereo signal in each band into a main signal and an ambience signal, and reproduces the ambience signal from the rear channels of a 5.1-channel system. Non-Patent Document 2 describes a method that divides a stereo signal into bands, splits it into a direct sound component and a reverberant sound component, and reproduces the reverberant sound component from the sides.

Non-Patent Documents 3 and 4 below both disclose methods that generate audio signals of three or more channels by dividing a multichannel audio signal into pairs of 2-channel audio signals.

Patent Document 1: JP-A-2007-28065

Non-Patent Document 1: C. Avendano and J-M Jot, "A Frequency-Domain Approach to Multichannel Upmix," J. Audio Eng. Soc., Vol. 52, No. 7/8, pp. 740-749, 2004
Non-Patent Document 2: C. Faller, "Multiple-Loudspeaker Playback of Stereo Signals," J. Audio Eng. Soc., Vol. 54, No. 11, pp. 1051-1064, 2006
Non-Patent Document 3: J. Thompson, B. Smith, A. Warmer and J-M Jot, "Direct-Diffuse Decomposition of Multichannel Signals Using a System of Pairwise Correlations," Proc. Audio Eng. Soc. 133rd Convention, Paper no. 8807, 2012
Non-Patent Document 4: C. Faller, L. Altmann, J. Levinson and M. Schmidt, "Multichannel Ring Upmix," Proc. Audio Eng. Soc. 134th Convention, Paper no. 8908, 2013

The surround playback device described in Patent Document 1 adds reverberation to the original sound, so the atmosphere of the reproduced sound (for example, its timbre) is altered or impaired relative to the original. The methods described in Non-Patent Documents 1 and 2, by contrast, do not add reverberation, but in principle they can be applied only to 2-channel audio signals (that is, stereo signals).

The methods described in Non-Patent Documents 3 and 4 extract, as the coherent component, the component that is highly correlated between two channels of audio signals, and therefore obtain information on sound located near the midpoint between two loudspeakers. Consequently, in an audio system with three or more channels, only information on sound near the midpoint of any two loudspeakers can be extracted as a coherent component; information on sound located in the central part of the region surrounded by all the loudspeakers cannot be extracted.

Therefore, a method is desired that preserves the atmosphere of the original sound as far as possible when changing the number of channels of an audio signal, regardless of the number of channels of the original sound.

An audio signal processing device according to one aspect of the present invention includes: a reception unit that receives audio signals of a plurality of channels; a division unit that executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, when one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting, as the field component of the target channel, the difference between the audio signal of the target channel and the coherent component of the target channel; and an output unit that outputs the coherent component and the field component of each channel extracted by the division unit.

An audio signal processing method according to one aspect of the present invention includes: a reception step in which an audio signal processing device receives audio signals of a plurality of channels; a division step in which the audio signal processing device executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, when one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting, as the field component of the target channel, the difference between the audio signal of the target channel and the coherent component of the target channel; and an output step in which the audio signal processing device outputs the coherent component and the field component of each channel extracted in the division step.

An audio signal processing program according to one aspect of the present invention causes a computer to execute: a reception step of receiving audio signals of a plurality of channels; a division step of executing, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, when one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting, as the field component of the target channel, the difference between the audio signal of the target channel and the coherent component of the target channel; and an output step of outputting the coherent component and the field component of each channel extracted in the division step.

In these aspects, the signal that is estimated using the audio signals of channels other than the target channel and that has the highest correlation with the actual audio signal of the target channel is extracted as the coherent component of the target channel. The difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel. A coherent component and a field component are obtained in this way for each channel. Because the coherent and field components of each channel are obtained from the original audio signals alone, without adding any sound, the atmosphere of the original sound can be preserved as far as possible. In addition, because coherent and field components can be obtained for as many channels as the original has, the method can be applied regardless of the number of channels of the original sound.

According to one aspect of the present invention, the atmosphere of the original sound can be preserved as far as possible when changing the number of channels of an audio signal, regardless of the number of channels of the original sound.

FIG. 1 is a diagram showing an example of audio signal processing according to the embodiment.
FIG. 2 is a diagram showing the hardware configuration of a computer that functions as the audio signal processing device according to the embodiment.
FIG. 3 is a diagram showing the functional configuration of the audio signal processing device according to the embodiment.
FIG. 4 is a diagram showing a block, which is the unit in which an audio signal is processed.
FIG. 5 is a diagram showing the processing for one channel.
FIG. 6 is a flowchart showing the operation of the audio signal processing device according to the embodiment.
FIG. 7 is a flowchart showing the details of the coherent component extraction shown in FIG. 6.
FIG. 8 is a diagram showing the configuration of the audio signal processing program according to the embodiment.
FIG. 9 is a diagram showing an example of coherent component extraction in a conventional method.
FIG. 10 is a diagram showing an example of coherent component extraction in the embodiment.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are given the same reference numerals, and duplicate descriptions are omitted.

The functions and configuration of the audio signal processing device 10 according to the embodiment are described with reference to FIGS. 1 to 5. The audio signal processing device 10 is a computer that divides each of the audio signals of a plurality of channels into a coherent component and a field component. An audio signal is a digital signal containing sound in the frequency band audible to humans (generally about 20 Hz to 20,000 Hz) and is converted into an analog signal as needed. Examples of sound represented by an audio signal include, but are not limited to, voice, music, sound accompanying video, natural sound, and any combination of these.

FIG. 1 shows an example of audio signal processing by the audio signal processing device 10; more specifically, it shows the processing of 2-channel (L channel and R channel) audio signals, that is, stereo audio signals. The audio signal processing device 10 divides the signal of each channel into a coherent component and a field component.

The coherent component of a channel is the component that is highly correlated with the audio signals of the other channels. The field component of a channel is the difference between the audio signal of that channel (that is, the original signal) and the coherent component of that channel. More specifically, the field component is obtained by subtracting the coherent component from the audio signal. The coherent component is sound with clear directionality, whereas the field component is diffuse sound that surrounds the listener (ambient sound). In the following, the sound corresponding to the field component is also called the "field sound".

FIG. 1 shows the audio signal processing device 10 dividing the L-channel audio signal into an L-channel coherent component Lγ and field component Lφ, and dividing the R-channel audio signal into an R-channel coherent component Rγ and field component Rφ. The coherent component Lγ is the component that is highly correlated with the R-channel audio signal, and the coherent component Rγ is the component that is highly correlated with the L-channel audio signal.

Although FIG. 1 shows the processing of 2-channel audio signals, the audio signal processing device 10 may process any number of audio signals. The audio signal processing device 10 may process audio signals of three or more channels; for example, it may process the 22.2-channel audio signals used for 8K Super Hi-Vision.

To achieve a stereophonic effect that can reproduce the direction, distance, and spread of sound in three-dimensional space, multichannel audio signals are recorded with a plurality of microphones distributed in three-dimensional space. Multichannel audio signals are recorded with a plurality of target sounds (object sounds) mixed with one another, or with target sounds mixed with field sound. Because the distance from a sound source generally differs from microphone to microphone, the arrival time of a given sound differs for each microphone, and as a result the coherence of the recorded audio signals is reduced. If the coherent component can be extracted from the audio signal of each channel, the clarity of the sound and the apparent source width (ASW) can be improved. In addition, extracting the field component and using it for upmixing makes it possible to produce a good ambience effect (the impression that sound surrounds the listener). In general, the coherent component corresponds to target sound emitted by a main sound source (for example, a singing voice, the sound of an instrument, or sound emitted by a loudspeaker), and the field component corresponds to sound whose directionality is not clear (for example, echoes and rumbling).

Let x_l(n) denote the audio signal of the l-th of N channels. This audio signal x_l(n) consists of M target sounds q_lm(n) (m = 1, ..., M) and a field sound v_l(n). That is, the audio signal x_l(n) is given by equation (1).

As equation (1) indicates, the target sounds and the field sound can be regarded as statistically independent of each other. The coherent component γ_l(n) of the audio signal x_l(n) is given by equation (2).

The field component φ_l(n) of the audio signal x_l(n) is given by equation (3).
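Equations (1) to (3) do not appear in this text. As a reconstruction from the surrounding description only, equations (1) and (3) can be written as follows. The notation is an assumption based on the symbols named in the prose, and equation (2) is left out because the text does not state its form.

```latex
% Reconstruction from the prose; not copied from the original equations.
\begin{align}
  x_l(n) &= \sum_{m=1}^{M} q_{lm}(n) + v_l(n) \tag{1} \\
  \phi_l(n) &= x_l(n) - \gamma_l(n) \tag{3}
\end{align}
```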

The specific way in which the audio signal processing device 10 is implemented is not limited. For example, the audio signal processing device 10 may be implemented by installing a predetermined program (for example, the audio signal processing program P1 described later) on a computer such as a personal computer, a server, or a mobile terminal. Alternatively, audio equipment such as an amplifier may function as the audio signal processing device 10.

FIG. 2 shows the general hardware configuration of a computer 100 that functions as the audio signal processing device 10. The computer 100 includes a processor (for example, a CPU) 101 that executes an operating system, application programs, and the like; a main storage unit 102 consisting of ROM and RAM; an auxiliary storage unit 103 consisting of a hard disk, flash memory, or the like; a communication control unit 104 consisting of a network card or a wireless communication module; an input device 105 such as a keyboard or mouse; and an output device 106 such as a monitor.

Each functional element of the audio signal processing device 10 is realized by loading predetermined software (for example, the audio signal processing program P1 described later) onto the processor 101 or the main storage unit 102 and executing it. In accordance with the software, the processor 101 operates the communication control unit 104, the input device 105, or the output device 106, and reads and writes data in the main storage unit 102 or the auxiliary storage unit 103. The data or databases required for processing are stored in the main storage unit 102 or the auxiliary storage unit 103.

The audio signal processing device 10 may consist of a single computer or of a plurality of computers. When a plurality of computers are used, they are connected via a communication network such as the Internet or an intranet, so that they logically form a single audio signal processing device 10.

FIG. 3 shows the functional configuration of the audio signal processing device 10. As shown in FIG. 3, the audio signal processing device 10 includes a reception unit 11, a division unit 12, and an output unit 13 as functional components.

The reception unit 11 is the functional element that receives audio signals of a plurality of channels. "Receiving an audio signal" means that the audio signal processing device 10 acquires an audio signal by any method. In other words, it means that an audio signal is input to the audio signal processing device 10. The specific method of receiving the audio signal of each channel is not limited. For example, the reception unit 11 may receive an audio signal by accessing a database or another device and reading a data file of the audio signal. Alternatively, the reception unit 11 may receive an audio signal sent from another device via a communication network. Alternatively, the reception unit 11 may acquire an audio signal input at the audio signal processing device 10 itself. In any case, the reception unit 11 outputs the received audio signal of each channel to the division unit 12.

The division unit 12 is the functional element that divides the audio signal of each channel into a coherent component and a field component. The following description assumes that the division unit 12 processes the N-channel audio signals {x_l(n) | l = 1, ..., N} given by equation (4).

First, the division unit 12 divides the audio signal of each channel into signals of a plurality of time intervals. Specifically, the division unit 12 uses a window function (for example, a Kaiser-Bessel window) to cut the audio signal into signals of short time intervals, called "frames". For example, if 1024 frequency points are used in the modified discrete cosine transform (MDCT) described later, the division unit 12 divides the audio signal into frames using a Kaiser-Bessel window 2048 points long. The number of samples in one frame is normally chosen to give an appropriate frequency resolution, but that number of samples is not sufficient for estimating the coherent component. The division unit 12 therefore treats a plurality of consecutive frames (for example, 24 frames) as the signal of one time interval, called a "block". FIG. 4 illustrates the concept of generating such blocks; more specifically, it shows the process of dividing each of the 2-channel (L channel and R channel) audio signals into a plurality of blocks.
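The framing and blocking described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the window length of 2048 and the 24 frames per block come from the text, while the hop of 1024 samples (50% overlap, as is usual with an MDCT), the Kaiser `beta` value, and the function names `split_into_frames` and `block_ranges` are assumptions made for the example.

```python
import numpy as np

FRAME_LEN = 2048        # window length from the text (for 1024 MDCT frequency points)
HOP = 1024              # assumption: 50% overlap between consecutive frames
FRAMES_PER_BLOCK = 24   # example value from the text


def split_into_frames(x, frame_len=FRAME_LEN, hop=HOP, beta=4.0):
    """Return windowed frames of x as an array of shape (n_frames, frame_len)."""
    window = np.kaiser(frame_len, beta)  # beta is an illustrative choice
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window


def block_ranges(n_samples, frame_len=FRAME_LEN, hop=HOP,
                 frames_per_block=FRAMES_PER_BLOCK):
    """Return one (start, stop) sample range per block of consecutive frames."""
    n_frames = 1 + (n_samples - frame_len) // hop
    ranges = []
    for first in range(0, n_frames - frames_per_block + 1, frames_per_block):
        start = first * hop
        stop = (first + frames_per_block - 1) * hop + frame_len
        ranges.append((start, stop))
    return ranges
```

With these assumed values, one block spans 23 × 1024 + 2048 = 25600 samples, far more than a single 2048-sample frame, which is the point of grouping frames before estimating the coherent component.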

After dividing the audio signal of each channel into a plurality of blocks, the division unit 12 executes the following processing for each block of each channel. In this specification, the channel whose audio signal is to be split into a coherent component and a field component (that is, the channel subject to the division process) is called the "target channel". The processing for one target channel is described here.

The division unit 12 extracts the coherent component of the target channel and then extracts the field component of the target channel. FIG. 5 shows the concept of coherent component extraction, which corresponds to the first half of this series of processes. Using a filter bank, the division unit 12 divides the audio signal x_l(n) of the l-th channel, which is the target channel, into signals of K frequency bands (subbands), called "subband signals". In each subband, the division unit 12 then extracts the coherent component γ_l^(k)(n) (k = 1, ..., K) using the audio signals of the channels other than the target channel. The division unit 12 uses the least squares method for this extraction. The division unit 12 then adds the coherent components of all the subbands to obtain the coherent component γ_l(n) of the target channel. After that, the division unit 12 extracts the field component φ_l(n) by subtracting the coherent component γ_l(n) from the original audio signal x_l(n).

The division unit 12 executes the following processing for each block of the audio signal of the target channel.

Using a filter bank, the division unit 12 divides the audio signal x_l(n) of each channel into K subband signals x_l^(k)(n). This division is given by equation (5).

The subband signal x_l^(k)(n) in equation (5) is a signal in the time domain, that is, a time-domain subband signal. Unlike the methods of Non-Patent Documents 1 to 4, which use frequency-domain signals, the audio signal processing device 10 uses time-domain subband signals, so it can lengthen the estimation interval by processing the signals of any number of consecutive frames as a single block signal. As a result, the audio signal of each channel can be processed without impairing the sound quality of the resulting coherent component.
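To make the idea of a time-domain subband signal concrete, the sketch below splits one block into K band-limited signals that are still functions of the sample index n. It is only a stand-in: it masks contiguous ranges of FFT bins rather than using the filter bank of the embodiment, and the function name `split_subbands` and the uniform band edges are assumptions for illustration.

```python
import numpy as np


def split_subbands(x, n_bands):
    """Split a real signal x into n_bands time-domain subband signals.

    Stand-in filter bank: each subband keeps one contiguous range of rFFT
    bins and is transformed back to the time domain, so the returned rows
    sum to the original signal.
    """
    spectrum = np.fft.rfft(x)
    edges = np.linspace(0, len(spectrum), n_bands + 1).astype(int)
    subbands = np.empty((n_bands, len(x)))
    for k in range(n_bands):
        masked = np.zeros_like(spectrum)
        masked[edges[k]:edges[k + 1]] = spectrum[edges[k]:edges[k + 1]]
        subbands[k] = np.fft.irfft(masked, n=len(x))
    return subbands
```

Because every row has the same length as the input block, the later estimation step can use all samples of the block in each subband, whatever number of frames the block contains.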

Next, the division unit 12 estimates this subband signal x_l^(k)(n) from a linear combination of the subband signals {x_m^(k)(n) | m = 1, ..., l-1, l+1, ..., N} of the same band (the same subband) of the N-1 channels other than the target channel. This linear combination for one block is given by equation (6).

The estimated signal can be regarded as the component that is highly correlated with the same-band signals of the other channels (the N-1 channels other than the target channel). The estimation error e_l^(k)(n) between the subband signal of the target channel and this estimated signal is given by equation (7).

The division unit 12 uses the least squares method to find the coefficients {a_m^(k) | m = 1, ..., l-1, l+1, ..., N} that minimize this estimation error. The error function to be minimized is given by equation (8).

Here, given the definition in the accompanying expression (not reproduced in this text), the optimal set of coefficients satisfies Eq. (9).

Writing Eq. (9) for each of m = 1, ..., l-1, l+1, ..., N and combining the results as simultaneous equations yields Eq. (10), whose symbols are defined in the accompanying expression (not reproduced in this text).

The coefficient vector $\hat{a}_l^{(k)}$ of the target channel in the k-th subband is obtained by Eq. (11).
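The equation images for Eqs. (6) to (11) are not reproduced in this text. As a reading aid only, a standard least-squares derivation consistent with the surrounding description would take roughly the following form. The hat on the estimate, the symbols $E_l^{(k)}$, $r_{ij}^{(k)}$, $R^{(k)}$ and $r_l^{(k)}$, and the summation over the samples $n$ of one block are assumptions and may differ from the notation of the original equations.

```latex
\begin{align}
\hat{x}_l^{(k)}(n) &= \sum_{m \neq l} a_m^{(k)}\, x_m^{(k)}(n)
  && \text{(cf. Eq. (6))} \\
e_l^{(k)}(n) &= x_l^{(k)}(n) - \hat{x}_l^{(k)}(n)
  && \text{(cf. Eq. (7))} \\
E_l^{(k)} &= \sum_{n} \bigl( e_l^{(k)}(n) \bigr)^2
  && \text{(cf. Eq. (8))} \\
\sum_{j \neq l} \hat{a}_j^{(k)}\, r_{mj}^{(k)} &= r_{ml}^{(k)},
  \qquad r_{ij}^{(k)} = \sum_{n} x_i^{(k)}(n)\, x_j^{(k)}(n)
  && \text{(cf. Eq. (9))} \\
R^{(k)}\, \hat{a}_l^{(k)} &= r_l^{(k)}
  && \text{(cf. Eq. (10))} \\
\hat{a}_l^{(k)} &= \bigl( R^{(k)} \bigr)^{-1} r_l^{(k)}
  && \text{(cf. Eq. (11))}
\end{align}
```

In this sketch, $R^{(k)}$ is the $(N-1) \times (N-1)$ matrix of the $r_{mj}^{(k)}$ with $m, j \neq l$, and $r_l^{(k)}$ is the vector of the $r_{ml}^{(k)}$. The fourth line follows from setting $\partial E_l^{(k)} / \partial a_m^{(k)} = 0$ for each $m \neq l$.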

The coherent component $\gamma_l^{(k)}(n)$ of the target channel in the k-th subband is obtained by Eq. (12). This coherent component corresponds to the estimated signal that, among the estimated signals calculated from the audio signals of channels other than the target channel, has the highest correlation with the audio signal of the target channel.
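The least-squares estimation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `solve_linear`, `dot` and `split_subband` are hypothetical, the inputs are assumed to be equal-length subband signals of one block, and the normal equations are solved by plain Gaussian elimination.

```python
from typing import List, Sequence, Tuple


def solve_linear(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length signals."""
    return sum(a * b for a, b in zip(u, v))


def split_subband(
    channels: Sequence[Sequence[float]], target: int
) -> Tuple[List[float], List[float], List[float]]:
    """Estimate channel `target` from the other channels in one subband.

    Returns (coefficients, coherent component, field component).
    """
    others = [x for m, x in enumerate(channels) if m != target]
    x_l = channels[target]
    # Normal equations R a = r built from inner products over the block.
    R = [[dot(xi, xj) for xj in others] for xi in others]
    r = [dot(xi, x_l) for xi in others]
    a = solve_linear(R, r)
    coherent = [
        sum(a_m * x[n] for a_m, x in zip(a, others)) for n in range(len(x_l))
    ]
    field = [x - g for x, g in zip(x_l, coherent)]
    return a, coherent, field
```

For example, with `x2 = [1, 0, 1, 0]`, `x3 = [0, 1, 0, -1]` and `x1 = 2*x2 + 0.5*x3 + [1, 1, -1, 1]`, calling `split_subband([x1, x2, x3], 0)` recovers the coefficients 2 and 0.5, because the added residual is orthogonal to both of the other channels and therefore ends up entirely in the field component.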

The dividing unit 12 obtains the coherent component for every subband, and then obtains the coherent component of the target channel by summing the coherent components of all subbands. This processing is given by Eq. (13).

Further, the dividing unit 12 obtains the field component of the target channel by subtracting the coherent component from the original audio signal of the target channel. This processing is given by Eq. (3) above.

Alternatively, the dividing unit 12 may obtain the field component in each subband by subtracting the coherent component from the audio signal, and then obtain the field component of the target channel by summing the field components of all subbands. Specifically, the field component $\phi_l^{(k)}(n)$ of the target channel in the k-th subband is obtained by Eq. (14).

The field component $\phi_l(n)$ of the target channel is obtained by Eq. (15).
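The equation images for Eqs. (12) to (15) are not reproduced in this text. A plausible reconstruction from the surrounding description is given below, together with the reason the two routes to the field component agree. The sum over all subbands $k$ and the perfect-reconstruction property $x_l(n) = \sum_k x_l^{(k)}(n)$ are assumptions of this sketch.

```latex
\begin{align}
\gamma_l^{(k)}(n) &= \sum_{m \neq l} \hat{a}_m^{(k)}\, x_m^{(k)}(n)
  && \text{(cf. Eq. (12))} \\
\gamma_l(n) &= \sum_{k} \gamma_l^{(k)}(n)
  && \text{(cf. Eq. (13))} \\
\phi_l^{(k)}(n) &= x_l^{(k)}(n) - \gamma_l^{(k)}(n)
  && \text{(cf. Eq. (14))} \\
\phi_l(n) &= \sum_{k} \phi_l^{(k)}(n)
  = \sum_{k} x_l^{(k)}(n) - \sum_{k} \gamma_l^{(k)}(n)
  = x_l(n) - \gamma_l(n)
  && \text{(cf. Eqs. (15) and (3))}
\end{align}
```

Under that assumption, subtracting per subband and then summing gives the same field component as subtracting the summed coherent component from the original signal.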

The dividing unit 12 performs the above processing on each block of the audio signal of the target channel. It then extracts the coherent component of the target channel by concatenating the coherent components of all blocks, and likewise generates the field component of the target channel by concatenating the field components of all blocks.

By setting each of the channels in turn as the target channel and performing the above processing, the dividing unit 12 generates a coherent component and a field component for every channel. The dividing unit 12 then outputs the coherent components and field components of all channels to the output unit 13.

In this way, the dividing unit 12 divides the audio signal of each channel into a coherent component and a field component without adding any other signal to it (that is, without adding any other sound to the original sound).

The output unit 13 is a functional element that outputs, as the processing result, the coherent component and field component of each channel generated by the dividing unit 12. This processing result can be regarded as realizing an upmix from N channels to 2N channels. The method of outputting the processing result is not limited in any way. For example, the output unit 13 may store the processing result in a storage device such as a memory or a database, or may transmit it to another device via a communication network. Alternatively, the output unit 13 may output the coherent component and field component of each channel to the corresponding speakers. In any case, the processing result of the audio signal processing device 10 makes it possible to use existing audio material to produce content with a larger number of channels, or to reproduce it on an audio system with more channels.

The audio signal processing device 10 may upmix an N-channel audio signal to more than 2N channels. Specifically, the audio signal processing device 10 decorrelates the extracted field components by the method described in the reference below, thereby generating signals whose inter-channel correlations differ from one another. This yields more than N field components. For example, stereo audio material can be converted into 5.1-channel audio material, or reproduced with a greater sense of presence on a 5.1-channel audio system. Likewise, 5.1-channel audio material can be converted into 22.2-channel audio material, or reproduced with a greater sense of presence on a 22.2-channel audio system.
(Reference) J. Breebaart and C. Faller, “Spatial Audio Processing - MPEG Surround and Other Applications,” Wiley, 2007.

The audio signal processing device 10 may also upmix an N-channel audio signal into J audio signals, where J is smaller than 2N and larger than N. Specifically, the audio signal processing device 10 realizes the upmix from N channels to J channels by mixing the N field components.

The processing result of the audio signal processing device 10 can be used not only for upmixing but also for downmixing.

Next, the operation of the audio signal processing device 10 and the audio signal processing method according to the present embodiment will be described with reference to FIGS. 6 and 7. In the audio signal processing device 10, the reception unit 11 first receives audio signals of a plurality of channels (reception step). Next, the dividing unit 12 executes, for each channel, a division process that divides the audio signal into a coherent component and a field component (division step). The output unit 13 then outputs the coherent component and field component of each channel (output step). The processing of the dividing unit 12 (division step), which is particularly important, is described in detail below.

FIG. 6 shows the process of generating the coherent component and field component of one target channel.

First, the dividing unit 12 divides the audio signal of each channel into a plurality of blocks (step S11). If the audio signals of each channel and each block obtained in step S11 are stored, step S11 can be omitted when the second and subsequent target channels are processed.

Next, the dividing unit 12 sets one of the blocks of the target channel as the processing target (step S12). The dividing unit 12 then extracts, as the coherent component of the target channel, the estimated signal that has the highest correlation with the audio signal of the target channel among the estimated signals calculated using the audio signals of channels other than the target channel (step S13). Next, the dividing unit 12 extracts the difference between the audio signal of the target channel and its coherent component as the field component of the target channel (step S14). Through this processing, the dividing unit 12 obtains the coherent component and field component for one block of the target channel.

After processing one block, the dividing unit 12 moves on to the next block (see step S15). That is, the dividing unit 12 sets the next block as the processing target (step S12) and generates the coherent component and field component of that block (steps S13 and S14). The dividing unit 12 executes steps S12 to S14 for all blocks, generating the coherent components and field components of all blocks (YES in step S15). The dividing unit 12 then concatenates the coherent components of all blocks to obtain the final coherent component of the target channel, and concatenates the field components of all blocks to obtain the final field component of the target channel.
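The block loop of FIG. 6 can be sketched as follows. This is a deliberately simplified illustration, not the embodiment itself: it assumes only two channels, so the least-squares estimate reduces to a single gain, it uses non-overlapping blocks of raw samples rather than blocks of windowed frames, and it omits the subband split. The names `split_into_blocks`, `estimate_coherent` and `split_channel` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_into_blocks(signal: Sequence[float], block_len: int) -> List[List[float]]:
    """Step S11: cut a signal into consecutive blocks of block_len samples."""
    return [list(signal[i:i + block_len]) for i in range(0, len(signal), block_len)]


def estimate_coherent(target_block: Sequence[float],
                      other_block: Sequence[float]) -> List[float]:
    """Step S13 (two-channel case): scale the other channel by the
    least-squares gain that best matches the target block."""
    energy = sum(v * v for v in other_block)
    if energy == 0.0:
        return [0.0] * len(target_block)
    gain = sum(t * v for t, v in zip(target_block, other_block)) / energy
    return [gain * v for v in other_block]


def split_channel(target: Sequence[float], other: Sequence[float],
                  block_len: int) -> Tuple[List[float], List[float]]:
    """Run steps S12 to S15 over every block and concatenate the results."""
    coherent: List[float] = []
    field: List[float] = []
    target_blocks = split_into_blocks(target, block_len)
    other_blocks = split_into_blocks(other, block_len)
    for t_blk, o_blk in zip(target_blocks, other_blocks):  # S12, repeated via S15
        c_blk = estimate_coherent(t_blk, o_blk)             # S13
        f_blk = [t - c for t, c in zip(t_blk, c_blk)]       # S14
        coherent.extend(c_blk)
        field.extend(f_blk)
    return coherent, field
```

Because the field component is defined as the difference in step S14, the two outputs always add back up to the target signal sample by sample, which mirrors the point that no sound is added to the original.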

FIG. 7 shows the details of step S13 in FIG. 6, that is, the details of the process of generating the coherent component of the target channel. The process shown in FIG. 7 is executed for each block of the audio signal of the target channel.

First, for each channel (the target channel and all other channels), the dividing unit 12 generates a plurality of subband signals by dividing the block signal into a plurality of subbands (step S131). Next, the dividing unit 12 sets one of the subbands as the processing target (step S132). The dividing unit 12 then extracts, as the coherent component of the target channel in the subband being processed, the estimated signal that has the highest correlation with the subband signal of the target channel among the estimated signals calculated using the subband signals of channels other than the target channel (step S133). The dividing unit 12 executes steps S132 and S133 for all subbands (see step S134). Once the coherent components of all subbands have been generated for the target channel (YES in step S134), the dividing unit 12 sums them to generate the coherent component of the target channel (more specifically, the coherent component for one block) (step S135).
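The subband loop of FIG. 7 can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: instead of an MDCT-based filter bank it uses a hypothetical two-band split (`two_band_split`) whose low and high bands add back up to the input exactly, and it again assumes two channels so that each subband needs only one least-squares gain (`ls_gain`).

```python
from typing import List, Sequence


def two_band_split(x: Sequence[float]) -> List[List[float]]:
    """Toy time-domain analysis: low = mean with the previous sample,
    high = half-difference. low[n] + high[n] == x[n] for every n."""
    prev = [0.0] + list(x[:-1])
    low = [(c + p) / 2 for c, p in zip(x, prev)]
    high = [(c - p) / 2 for c, p in zip(x, prev)]
    return [low, high]


def ls_gain(target: Sequence[float], other: Sequence[float]) -> float:
    """Least-squares gain g minimizing sum((target - g * other) ** 2)."""
    energy = sum(v * v for v in other)
    if energy == 0.0:
        return 0.0
    return sum(t * v for t, v in zip(target, other)) / energy


def coherent_for_block(target_block: Sequence[float],
                       other_block: Sequence[float]) -> List[float]:
    """Steps S131 to S135 for one block of a two-channel signal."""
    target_bands = two_band_split(target_block)   # S131
    other_bands = two_band_split(other_block)     # S131
    total = [0.0] * len(target_block)
    for t_band, o_band in zip(target_bands, other_bands):  # S132, S134 loop
        gain = ls_gain(t_band, o_band)                      # S133
        # S135: accumulate the per-subband coherent components.
        total = [acc + gain * v for acc, v in zip(total, o_band)]
    return total
```

Estimating a separate gain per subband lets the estimate follow inter-channel level differences that vary with frequency, which a single broadband gain cannot do.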

Next, an audio signal processing program P1 for causing a computer to function as the audio signal processing device 10 will be described with reference to FIG. 8.

The audio signal processing program P1 includes a main module P10, a reception module P11, a division module P12, and an output module P13. The main module P10 is the part that performs overall control of the audio signal processing. The functions realized by executing the reception module P11, the division module P12, and the output module P13 are the same as those of the reception unit 11, the dividing unit 12, and the output unit 13 described above, respectively.

The audio signal processing program P1 may be provided in a form fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the audio signal processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.

As described above, an audio signal processing device according to one aspect of the present invention includes: a reception unit that receives audio signals of a plurality of channels; a dividing unit that executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, where one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and an output unit that outputs the coherent component and field component of each channel extracted by the dividing unit.

In this aspect, the signal that is estimated using the audio signals of channels other than the target channel and that has the highest correlation with the actual audio signal of the target channel is extracted as the coherent component of the target channel. The difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel. These coherent and field components are obtained for each channel. By obtaining the coherent component and field component of each channel from the original audio signals alone, without adding any sound, the character of the original sound (for example, its original timbre) can be preserved as far as possible, or completely. In addition, because coherent and field components are obtained for as many channels as the original signal has, the method can be applied regardless of the number of channels of the original sound. For example, one aspect of the present invention can be applied to audio signals with any number of channels, such as 2, 3, 5.1, or 22.2 channels.

The advantage of the above aspect will be described with reference to FIGS. 9 and 10. FIG. 9 shows an example of coherent component extraction by a conventional method, and FIG. 10 shows an example of coherent component extraction according to the above aspect. Both FIGS. 9 and 10 show audio signals being output from three speakers 90 arranged in a triangle, so the example represents a three-channel audio system.

As shown in FIG. 9, the methods described in Non-Patent Documents 3 and 4 extract, as the coherent component 91, the component that is highly correlated between the audio signals of two channels (the broken line 92 indicates the field component). Such conventional methods can therefore acquire only the information of sounds located in the intermediate portion 93 between two speakers (channels) 90, and cannot extract the information of sounds located in the central portion 94 of the region surrounded by the three speakers (channels) 90.

In contrast, in the above aspect, the coherent component of one speaker (channel) 90 is estimated from the signals of the other speakers (channels) 90. As shown in FIG. 10, this makes it possible to extract the information of sounds located in the central portion 95 of the region surrounded by the three speakers (channels) 90. This central portion 95 can correspond to the combination of portions 93 and 94 in FIG. 9.

In an audio signal processing device according to another aspect, the division process may include: a step of executing, for each channel, a process of dividing the audio signal into a plurality of frames using a window function; a step of executing, for each channel, a process of generating a plurality of blocks by grouping at least two consecutive frames into one block across all of the frames; and a step of extracting the coherent component of the target channel in each of the blocks.

Using blocks composed of a plurality of frames increases the number of samples available for estimating the coherent component, so the coherent component can be extracted more accurately.

In an audio signal processing device according to another aspect, the processing by the dividing unit may include: a step of generating a plurality of subband signals for each channel by dividing the audio signal of each channel into a plurality of subbands; a step of extracting the coherent component of the target channel in each of the subbands; and a step of extracting the coherent component of the target channel by summing the coherent components of the subbands.

In general, some frequencies are often more important than others in audio processing. Processing each subband separately allows the coherent component to be extracted with the accuracy required in each frequency band, and consequently allows the coherent and field components to be extracted accurately.

The present invention is described below in more concrete terms on the basis of an example, but the present invention is in no way limited to it.

Seven stereo audio materials (that is, two-channel audio signals), listed in Table 1, were prepared. All of the materials were taken from commercially available CDs, with a sampling frequency of 44.1 kHz. In Table 1, the Name column gives the title or type of the piece, and the Description column gives the form of the performance. In the Mixing column, "Artifical" indicates material to which mixing was applied, and "Natural" indicates material to which no mixing was applied. The Length column gives the playback time.

To construct a filter bank capable of perfectly reconstructing the audio signal, an overlap-add method using the modified discrete cosine transform (MDCT) was adopted. A Kaiser-Bessel window was used as the window function for dividing the audio signal into frames. The frame length was 2048 points, which means that the MDCT yields 1024 frequency points. These frequency points were grouped into 23 subbands as shown in Table 2. With reference to the MPEG-2 AAC standard, these subbands were formed by merging the 69 subbands of the 48 kHz long FFT (fast Fourier transform) in groups of three consecutive subbands. One block consisted of 24 frames. At a sampling frequency of 44.1 kHz, this block length corresponds to 0.58 seconds.
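The figures quoted for the experiment can be checked with a few lines of arithmetic. The sketch below assumes a hop of half the frame length (50% overlap between consecutive frames), which is the usual arrangement for MDCT overlap-add but is not stated explicitly in the text; the constant and variable names are illustrative.

```python
FRAME_LEN = 2048          # samples per frame (from the text)
HOP = FRAME_LEN // 2      # assumption: 50% overlap between frames
FRAMES_PER_BLOCK = 24     # frames per block (from the text)
SAMPLE_RATE = 44100       # Hz (from the text)
FFT_SUBBANDS = 69         # subbands before merging (from the text)
MERGE = 3                 # consecutive subbands merged into one

# A 2048-point MDCT frame yields half as many coefficients.
mdct_bins = FRAME_LEN // 2

# Samples spanned by 24 overlapping frames: 23 hops plus one full frame.
block_samples = (FRAMES_PER_BLOCK - 1) * HOP + FRAME_LEN
block_seconds = block_samples / SAMPLE_RATE

merged_subbands = FFT_SUBBANDS // MERGE

print(mdct_bins, block_samples, round(block_seconds, 2), merged_subbands)
```

Under the 50% overlap assumption a block spans 25600 samples, or about 0.58 seconds at 44.1 kHz, which agrees with the block length stated above; the other two values reproduce the 1024 frequency points and the 23 merged subbands.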

The experimental results were evaluated using the cross-correlation coefficient between channels. Table 3 shows the cross-correlation coefficients of the original sound, the coherent component, and the field component. The coherent component showed a higher cross-correlation than the original sound. Such a coherent component produces the impression of a sound field narrower than that of the original sound. The field component, on the other hand, showed a negative cross-correlation for all but one material (“Quiet Night”). Reproducing a field component with negative cross-correlation through speakers placed to the side or rear gives a good ambience effect. As a result, sound with a strong sense of presence can be reproduced.
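An inter-channel cross-correlation coefficient of the kind used in this evaluation can be computed as sketched below. The exact definition used in the experiment is not given in the text, so this sketch assumes the zero-lag normalized form without mean removal; the function name `cross_correlation` and the toy signals are illustrative only and are not the experimental data.

```python
import math
from typing import Sequence


def cross_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Zero-lag normalized cross-correlation of two equal-length signals.

    Returns a value in [-1, 1], or 0.0 if either signal is silent.
    """
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0.0 else 0.0


# Toy stereo pair: a shared part S plus an anti-phase part A.
S = [1.0, 2.0, -1.0, -2.0]
A = [2.0, -1.0, 2.0, -1.0]
left = [s + a for s, a in zip(S, A)]
right = [s - a for s, a in zip(S, A)]

rho_original = cross_correlation(left, right)
rho_shared = cross_correlation(S, S)
rho_antiphase = cross_correlation(A, [-a for a in A])
```

In this constructed example the full left and right signals are uncorrelated, the shared part has a coefficient of +1, and the anti-phase part has a coefficient of -1. It only illustrates how a shared component can be more correlated, and a residual component more negatively correlated, than the original pair.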

The present invention has been described in detail above on the basis of its embodiment. However, the present invention is not limited to the above embodiment, and various modifications are possible without departing from its gist.

In the above embodiment, the dividing unit 12 estimates the coherent component of a given target channel using the audio signals of channels other than the target channel. As a modification, the dividing unit may estimate the coherent component of the target channel using the audio signals of the other channels together with at least one of the past audio signal of the target channel and the past audio signals of the other channels. Here, a "past audio signal" is the audio signal of a block that temporally precedes the block being processed. Estimating the audio signal of the target channel in the block being processed by also using the past audio signals of the target channel, the other channels, or both can be expected to extract the coherent component more accurately.

The procedure of the audio signal processing method executed by at least one processor is not limited to the example in the above embodiment. For example, the audio signal processing device may omit some of the steps (processes) described above, or may execute the steps in a different order. Any two or more of the steps described above may be combined, and some of the steps may be modified or deleted. Alternatively, the audio signal processing device may execute other steps in addition to the steps described above.

When comparing the magnitudes of two numerical values, the audio signal processing device may use either of the two criteria "greater than or equal to" and "greater than", and either of the two criteria "less than or equal to" and "less than". The choice of criterion does not change the technical significance of the process of comparing the two values.

10 ... audio signal processing device, 11 ... reception unit, 12 ... dividing unit, 13 ... output unit, el ... estimation error, P1 ... audio signal processing program, P10 ... main module, P11 ... reception module, P12 ... division module, P13 ... output module.

Claims

An audio signal processing device comprising:
a reception unit that receives audio signals of a plurality of channels;
a dividing unit that executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, where one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and
a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
an output unit that outputs the coherent component and the field component of each channel extracted by the dividing unit.

The audio signal processing device according to claim 1, wherein the division process includes:
a step of executing, for each channel, a process of dividing the audio signal into a plurality of frames using a window function;
a step of executing, for each channel, a process of generating a plurality of blocks by grouping at least two consecutive frames into one block across all of the plurality of frames; and
a step of extracting the coherent component of the target channel in each of the blocks.

The audio signal processing device according to claim 1 or 2, wherein the processing by the dividing unit includes:
a step of generating a plurality of subband signals for each channel by dividing the audio signal of each channel into a plurality of subbands;
a step of extracting the coherent component of the target channel in each of the plurality of subbands; and
a step of extracting the coherent component of the target channel by summing the coherent components of the plurality of subbands.

An audio signal processing method comprising:
a reception step in which an audio signal processing device receives audio signals of a plurality of channels;
a division step in which the audio signal processing device executes, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, where one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and
a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
an output step in which the audio signal processing device outputs the coherent component and the field component of each channel extracted in the division step.

An audio signal processing program that causes a computer to execute:
a reception step of receiving audio signals of a plurality of channels;
a division step of executing, for each channel, a division process of dividing the audio signal into a coherent component and a field component, the division process including, where one channel subject to the division process is taken as a target channel, a step of extracting, as the coherent component of the target channel, the estimated signal having the highest correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signals of channels other than the target channel, and
a step of extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
an output step of outputting the coherent component and the field component of each channel extracted in the division step.