JP2011239036A

JP2011239036A - Audio signal converter, method, program, and recording medium

Info

Publication number: JP2011239036A
Application number: JP2010106645A
Authority: JP
Inventors: Sumio Sato; 純生佐藤; Nagao Hattori; 永雄服部; Chan Bin Ni; 嬋斌倪
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2010-05-06
Filing date: 2010-05-06
Publication date: 2011-11-24

Abstract

PROBLEM TO BE SOLVED: To provide an audio signal converter that can convert an audio signal for a multi-channel system without generating noise caused by discontinuous points.

SOLUTION: An audio signal converter (exemplified by an audio signal processing part 113) comprises: a conversion part that applies a discrete Fourier transform to input audio signals of two channels; a correlation signal extraction part that extracts correlation signals from the audio signals of the two channels after the discrete Fourier transform by the conversion part, while ignoring the DC component; an inverse conversion part that applies an inverse discrete Fourier transform to the correlation signals extracted by the correlation signal extraction part, or to the correlation signals and uncorrelated signals, or to audio signals generated from the correlation signals or from the correlation signals and uncorrelated signals; and a noise removal part 122 that removes waveform discontinuities from the audio signals after the inverse discrete Fourier transform in the inverse conversion part.

Description

The present invention relates to an audio signal conversion apparatus, method, program, and recording medium for converting an audio signal for a multi-channel reproduction system.

Conventionally proposed sound reproduction systems include the stereo (2ch) system and the 5.1ch surround system (ITU-R BS.775-1), both of which are in wide consumer use. In the 2ch system, as schematically illustrated in FIG. 1, different audio data is output from a left speaker 11L and a right speaker 11R. In the 5.1ch surround system, as schematically illustrated in FIG. 2, different audio data is input to and output from each of a left front speaker 21L, a right front speaker 21R, a center speaker 22C disposed between them, a left rear speaker 23L, a right rear speaker 23R, and a subwoofer (not shown) dedicated to the low range (generally 20 Hz to 100 Hz).

In addition to the 2ch system and the 5.1ch surround system, various other sound reproduction systems such as 7.1ch, 9.1ch, and 22.2ch have been proposed. In all of these systems, the speakers are arranged on a circle or a sphere centered on the listener, and it is considered preferable to listen at the so-called sweet spot, a listening position that is ideally equidistant from every speaker. For example, it is preferable to listen at the sweet spot 12 in the 2ch system and at the sweet spot 24 in the 5.1ch surround system. At the sweet spot, the synthesized sound image produced by the balance of sound pressures is localized where the producer intended. Conversely, at positions other than the sweet spot, the sound image and sound quality generally deteriorate. Hereinafter, these systems are collectively referred to as multi-channel reproduction systems.

Apart from the multi-channel reproduction systems, there is also a sound source object-oriented reproduction system. This system treats every sound as a sound emitted by some sound source object, and each sound source object (hereinafter referred to as a "virtual sound source") contains its own position information and an audio signal. Taking music content as an example, each virtual sound source contains the sound of one musical instrument and the position information of where that instrument is placed.

The sound source object-oriented reproduction system is usually reproduced by a reproduction method in which the wavefront of the sound is synthesized by a group of speakers arranged in a line or a plane (that is, a wavefront synthesis reproduction system). Among such wavefront synthesis reproduction systems, the Wave Field Synthesis (WFS) system described in Non-Patent Document 1 has been actively studied in recent years as one practical implementation that uses a group of linearly arranged speakers (hereinafter referred to as a speaker array).

Unlike the multi-channel reproduction systems described above, such a wavefront synthesis reproduction system has the advantage that, as schematically illustrated in FIG. 3, it can present both a good sound image and good sound quality at the same time to a listener at any position in front of the arranged speaker group 31. That is, the sweet spot 32 in the wavefront synthesis reproduction system is wide, as shown in the figure.

In addition, a listener who faces the speaker array and listens within the acoustic space provided by the WFS system perceives the sound, which is actually radiated from the speaker array, as if it were radiated from a sound source that virtually exists behind the speaker array (a virtual sound source).

The wavefront synthesis reproduction system requires input signals that represent virtual sound sources. In general, one virtual sound source must contain an audio signal for one channel and the position information of that virtual sound source. In the music content example above, this corresponds to an audio signal recorded for each instrument together with the position information of that instrument. The audio signal of each virtual sound source does not necessarily have to correspond to a single instrument, but the direction of arrival and the loudness of each sound intended by the content producer must be expressed using the concept of virtual sound sources.

Since the stereo (2ch) system is the most widespread of the multi-channel systems described above, stereo music content is considered here. As shown in FIG. 4, two speakers 41L and 41R are used, and the L (left) channel and R (right) channel audio signals of the stereo music content are reproduced by the speaker 41L placed on the left and the speaker 41R placed on the right, respectively. With such reproduction, as shown in FIG. 4, the sound images are localized as the producer intended only when listening at a point equidistant from the speakers 41L and 41R, that is, at the sweet spot 43: the vocal and the bass are heard from the middle position 42b, the piano from the left position 42a, and the drums from the right position 42c.

Consider reproducing such content by the wavefront synthesis reproduction system so as to provide, to a listener at any position, the sound image localization intended by the content producer, which is the advantage of that system. To do so, the sound image that is heard inside the sweet spot 43 of FIG. 4 must be perceivable from any listening position, as with the sweet spot 53 shown in FIG. 5. That is, with the speaker group 51 arranged in a line or a plane, the sound images must be localized as the producer intended throughout the wide sweet spot 53: the vocal and the bass must be heard from the middle position 52b, the piano from the left position 52a, and the drums from the right position 52c.

To address this problem, consider, for example, the case where the L channel sound and the R channel sound are placed as virtual sound sources 62a and 62b, respectively, as shown in FIG. 6. In this case, the L and R channels do not each represent a single sound source on their own; rather, the two channels together generate a synthesized sound image. Therefore, even when they are reproduced by the wavefront synthesis reproduction system, a sweet spot 63 is still formed, and the sound image localization of FIG. 4 is obtained only at the position of the sweet spot 63. In other words, to realize the desired sound image localization, the 2ch stereo data must be separated by some means into the sound of each sound image, and virtual sound source data must be generated from each of those sounds.

To address this problem, the method described in Patent Document 1 separates 2ch stereo data, for each frequency band, into a correlation signal and an uncorrelated signal based on the correlation coefficient of the signal power, estimates the direction of the synthesized sound image for the correlation signal, and generates virtual sound sources from those results.
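As a rough illustration of a per-band correlation measure, the sketch below computes a normalized cross-correlation magnitude between the left and right spectra over a set of frequency bins. This is one common definition chosen for illustration and is an assumption; the exact formula used in Patent Document 1 is not given in this text. The function names `dft` and `band_correlation`, the test signals, and the band limits are all hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def band_correlation(left, right, band):
    """Normalized cross-correlation magnitude of two spectra over the bins in `band`.

    Returns a value in [0, 1]; values near 1 mean the channels are strongly
    correlated in that band. (Assumed definition, for illustration only.)
    """
    cross = sum(left[k] * right[k].conjugate() for k in band)
    power_l = sum(abs(left[k]) ** 2 for k in band)
    power_r = sum(abs(right[k]) ** 2 for k in band)
    if power_l == 0.0 or power_r == 0.0:
        return 0.0
    return abs(cross) / math.sqrt(power_l * power_r)

n = 32
shared = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]   # in both channels
only_l = [math.sin(2 * math.pi * 9 * t / n) for t in range(n)]   # left channel only
only_r = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]  # right channel only

spec_l = dft([s + a for s, a in zip(shared, only_l)])
spec_r = dft([0.5 * s + b for s, b in zip(shared, only_r)])

low_band = range(1, 6)    # bins 1..5; bin 0 (the DC component) is skipped
high_band = range(8, 13)  # bins 8..12

corr_low = band_correlation(spec_l, spec_r, low_band)    # shared component dominates
corr_high = band_correlation(spec_l, spec_r, high_band)  # channels differ here
```

In this toy input the low band contains the same component in both channels at different levels, so the measure is close to 1, while the high band contains different components in each channel, so the measure is close to 0.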

Patent Document 1: European Patent Application Publication No. 1761110

Non-Patent Document 1: A. J. Berkhout, D. de Vries, and P. Vogel, “Acoustic control by wave field synthesis”, J. Acoust. Soc. Am., Volume 93(5), United States, Acoustical Society of America, May 1993, pp. 2764-2778

However, in the method described in Patent Document 1, the DC components of the left and right channels after the discrete Fourier transform are ignored when the original audio signal is analyzed. FIG. 7 is a schematic diagram illustrating an example of the result of applying a discrete Fourier transform to an audio signal. In FIG. 7, the vertical axis represents the real part, the axis pointing toward the viewer represents the imaginary part, and reference numeral 71 denotes the DC component. Because the method described in Patent Document 1 ignores this DC component 71, the continuity of the waveform between segments after the inverse Fourier transform is not guaranteed, and the waveform becomes discontinuous at segment boundaries. Especially for content that contains many low-frequency signals, the generated audio signal waveform contains many discontinuities, which are perceived as noise.
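The effect can be reproduced with a small numerical experiment. The sketch below is only an illustration under assumed values (a segment length of 32 samples and a single slow sine spanning two segments), not the processing of Patent Document 1: each segment is transformed, its DC bin is set to zero, and the segment is transformed back. The helper names `dft`, `idft`, and `drop_dc` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform, returning the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def drop_dc(segment):
    """Transform a segment, ignore its DC component, and transform back."""
    spectrum = dft(segment)
    spectrum[0] = 0  # the DC bin is discarded
    return idft(spectrum)

seg_len = 32  # assumed segment length
# One slow cycle across two segments: each segment sees a large local offset.
signal = [math.sin(2 * math.pi * t / (2 * seg_len)) for t in range(2 * seg_len)]

first = drop_dc(signal[:seg_len])
second = drop_dc(signal[seg_len:])

# Sample-to-sample step across the segment boundary, before and after processing.
step_before = abs(signal[seg_len] - signal[seg_len - 1])
step_after = abs(second[0] - first[-1])
```

With these values the original waveform changes by less than 0.1 across the boundary, while the processed segments differ by more than 1.0 there, because each segment has lost a different mean value. A low-frequency component is used on purpose, since that is the case the text identifies as most affected.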

This noise is described using the example of the music content 80 shown in FIG. 8. When the left channel audio signal 81 and the right channel audio signal 82 of the music content 80 are converted into, for example, five channels using the method described in Patent Document 1, the result is the music content 90 shown in FIG. 9, which has audio signals 91 to 95 for the five channels. FIG. 10 is an enlarged view of the audio signal 93 of the third channel from the top of FIG. 9 in the vicinity of 9 seconds. In the audio signal 100 shown in FIG. 10, a discontinuity occurs near the center 101. Because many such discontinuities are included, they are perceived as harsh noise.

This problem is not limited to the case where an audio signal for a multi-channel system is converted into an audio signal for reproduction by the wavefront synthesis reproduction system. It can also arise when the signal is converted into another audio signal for a multi-channel system (with the same or a different number of channels), because such a conversion may likewise apply the discrete Fourier transform and inverse transform described above while ignoring the DC components of the left and right channels.

The present invention has been made in view of the circumstances described above, and its object is to provide an audio signal conversion apparatus, method, program, and recording medium capable of converting an audio signal for a multi-channel system such as 2ch or 5.1ch without generating noise caused by discontinuities.

To solve the problems described above, a first technical means of the present invention is an audio signal conversion apparatus that converts a multi-channel input audio signal into an audio signal to be reproduced by a speaker group, comprising: a conversion unit that applies a discrete Fourier transform to input audio signals of two channels; a correlation signal extraction unit that extracts a correlation signal from the audio signals of the two channels after the discrete Fourier transform by the conversion unit, while ignoring the DC component; an inverse conversion unit that applies an inverse discrete Fourier transform to the correlation signal extracted by the correlation signal extraction unit, or to the correlation signal and an uncorrelated signal, or to an audio signal generated from the correlation signal, or to an audio signal generated from the correlation signal and the uncorrelated signal; and a removal unit that removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform by the inverse conversion unit.

A second technical means is the first technical means, wherein the removal unit removes the discontinuities by adding a DC component to the audio signal after the inverse discrete Fourier transform so that the derivative of the waveform is maintained at the boundary between processing segments.

A third technical means is the second technical means, wherein the removal unit reduces the amplitude of the added DC component in proportion to the time elapsed since it was added.

A fourth technical means is the third technical means, wherein the removal unit changes the proportionality constant used for the reduction according to the amplitude of the DC component obtained for the addition.
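The second to fourth technical means can be sketched as follows. This is a minimal interpretation under stated assumptions, not the implementation of the invention: "maintaining the derivative" is taken to mean that the first corrected sample continues the slope of the last two output samples of the previous segment, the decay is a linear reduction of the bias magnitude per sample, and the dependence of the decay rate on the bias amplitude is modeled by a hypothetical `gain` parameter. The function name `remove_discontinuity` and the values of `base_rate` and `gain` are illustrative.

```python
import math

def remove_discontinuity(prev_tail, segment, base_rate=0.01, gain=0.0):
    """Add a decaying DC bias to `segment` so that it joins the previous segment.

    prev_tail: the last two output samples of the previous segment.
    segment:   samples of the current segment after the inverse DFT.
    base_rate: assumed reduction of the bias magnitude per sample.
    gain:      assumed factor that speeds up the decay for larger biases.
    """
    # Slope (discrete derivative) at the end of the previous segment.
    slope = prev_tail[1] - prev_tail[0]
    # Bias that lets the first corrected sample continue that slope.
    bias = prev_tail[1] + slope - segment[0]
    # Decay rate that may depend on the bias amplitude (fourth technical means).
    rate = base_rate * (1.0 + gain * abs(bias))
    corrected = []
    for t, sample in enumerate(segment):
        # Magnitude shrinks in proportion to elapsed time and is clamped at zero.
        magnitude = max(abs(bias) - rate * t, 0.0)
        corrected.append(sample + math.copysign(magnitude, bias))
    return corrected

prev_tail = (0.10, 0.12)   # previous segment ended while rising by 0.02 per sample
segment = [0.50] * 100     # current segment starts far from the previous one
out = remove_discontinuity(prev_tail, segment)
```

In this example the first output sample is about 0.14, which continues the slope of 0.02 from the previous segment, and the bias of about -0.36 has decayed completely by sample 36, after which the output equals the uncorrected segment. A real implementation would presumably also carry any bias remaining at the end of a segment over into the next one; that is omitted here.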

A fifth technical means is the fourth technical means, wherein the removal unit executes the addition of the DC component except at locations where the number of times the waveform of the audio signal after the inverse discrete Fourier transform crosses zero within a predetermined time is equal to or greater than a predetermined number.

A sixth technical means is any one of the second to fifth technical means, wherein the removal unit executes the addition of the DC component only when the amplitude of the DC component obtained for the addition is less than a predetermined value.

A seventh technical means is any one of the first to third technical means, wherein the removal unit executes the removal of the discontinuities except at locations where the number of times the waveform of the audio signal after the inverse discrete Fourier transform crosses zero within a predetermined time is equal to or greater than a predetermined number.
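The gating conditions of the fifth to seventh technical means can be sketched as a simple decision function. The window contents, the thresholds `max_crossings` and `max_bias`, and the function names are assumptions chosen for illustration; the text only states that a predetermined number and a predetermined value are used.

```python
def count_zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0.0) != (b < 0.0))

def should_correct(window, bias, max_crossings=4, max_bias=0.5):
    """Decide whether to apply discontinuity removal at this location.

    window:        samples covering the predetermined time around the boundary.
    bias:          DC component obtained for the addition.
    max_crossings: assumed threshold on the zero-crossing count.
    max_bias:      assumed threshold on the bias amplitude.
    """
    # Skip locations where the waveform crosses zero many times within the window.
    if count_zero_crossings(window) >= max_crossings:
        return False
    # Apply the bias only when its amplitude is below the threshold.
    return abs(bias) < max_bias

busy = [1.0, -1.0, 1.0, -1.0, 1.0]   # rapidly changing waveform
calm = [0.1, 0.2, 0.3, 0.2, 0.1]     # slowly changing waveform

skip_busy = should_correct(busy, bias=0.1)      # False: too many zero crossings
apply_calm = should_correct(calm, bias=0.1)     # True
skip_large = should_correct(calm, bias=0.9)     # False: bias too large
```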

An eighth technical means is any one of the first to seventh technical means, wherein the audio signal after the inverse discrete Fourier transform that is to be processed by the removal unit is an audio signal obtained by applying scaling processing, in the time domain or the frequency domain, to the correlation signal or to the correlation signal and the uncorrelated signal.

A ninth technical means is any one of the first to eighth technical means, wherein the multi-channel input audio signal consists of input audio signals of three or more channels; for any two of these input audio signals, the discontinuities are removed by the conversion unit, the correlation signal extraction unit, the inverse conversion unit, and the removal unit to generate an audio signal to be reproduced by the speaker group; and the audio signal conversion apparatus further comprises an addition unit that adds the input audio signals of the remaining channels to the generated audio signal.

A tenth technical means is any one of the first to ninth technical means, further comprising: a digital content input unit that inputs digital content including the multi-channel input audio signal; a decoder unit that decodes the digital content; an audio signal extraction unit that separates an audio signal from the digital content decoded by the decoder unit; and an audio signal processing unit that converts the audio signal extracted by the audio signal extraction unit into a multi-channel audio signal that has three or more channels and differs from the input audio signal, wherein the audio signal processing unit comprises the conversion unit, the correlation signal extraction unit, the inverse conversion unit, and the removal unit.

An eleventh technical means is the tenth technical means, wherein the digital content input unit inputs the digital content from a recording medium that stores digital content, from a server that distributes digital content via a network, or from a broadcasting station that broadcasts digital content.

A twelfth technical means is any one of the first to eleventh technical means, further comprising a switching unit that switches, in response to a user operation, whether or not the processing in the audio signal processing unit is executed.

A thirteenth technical means is an audio signal conversion method for converting a multi-channel input audio signal into an audio signal to be reproduced by a speaker group, comprising: a conversion step in which a conversion unit applies a discrete Fourier transform to input audio signals of two channels; an extraction step in which a correlation signal extraction unit extracts a correlation signal from the audio signals of the two channels after the discrete Fourier transform in the conversion step, while ignoring the DC component; an inverse conversion step in which an inverse conversion unit applies an inverse discrete Fourier transform to the correlation signal extracted in the extraction step, or to the correlation signal and an uncorrelated signal, or to an audio signal generated from the correlation signal, or to an audio signal generated from the correlation signal and the uncorrelated signal; and a removal step in which a removal unit removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step.

A fourteenth technical means is a program for causing a computer to execute: a conversion step of applying a discrete Fourier transform to input audio signals of two channels; an extraction step of extracting a correlation signal from the audio signals of the two channels after the discrete Fourier transform in the conversion step, while ignoring the DC component; an inverse conversion step of applying an inverse discrete Fourier transform to the correlation signal extracted in the extraction step, or to the correlation signal and an uncorrelated signal, or to an audio signal generated from the correlation signal, or to an audio signal generated from the correlation signal and the uncorrelated signal; and a removal step of removing waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step.

A fifteenth technical means is a computer-readable recording medium on which the program of the fourteenth technical means is recorded.

According to the present invention, an audio signal for a multi-channel system such as 2ch or 5.1ch can be converted without generating noise caused by discontinuities.

FIG. 1 is a schematic diagram for explaining the 2ch system.
FIG. 2 is a schematic diagram for explaining the 5.1ch surround system.
FIG. 3 is a schematic diagram for explaining the wavefront synthesis reproduction system.
FIG. 4 is a schematic diagram showing how music content in which vocal, bass, piano, and drum sounds are recorded in stereo is reproduced using two speakers, left and right.
FIG. 5 is a schematic diagram showing the ideal sweet spot when the music content of FIG. 4 is reproduced by the wavefront synthesis reproduction system.
FIG. 6 is a schematic diagram showing the actual sweet spot when the left and right channel audio signals of the music content of FIG. 4 are reproduced by the wavefront synthesis reproduction system with virtual sound sources set at the positions of the left and right speakers, respectively.
FIG. 7 is a schematic diagram showing an example of the result of applying a discrete Fourier transform to an audio signal.
FIG. 8 is a diagram showing an example of the waveform of music content consisting of left channel and right channel audio signals.
FIG. 9 is a diagram showing the waveforms that result from converting the music content of FIG. 8 into five channels using a conventional method.
FIG. 10 is an enlarged view of part of the audio signal of one channel of the music content of FIG. 9.
FIG. 11 is a block diagram showing a configuration example of an audio data reproduction apparatus provided with the audio signal conversion apparatus according to the present invention.
FIG. 12 is a block diagram showing a configuration example of the audio signal processing unit (the audio signal conversion apparatus according to the present invention) in the audio data reproduction apparatus of FIG. 11.
FIG. 13 is a flowchart for explaining an example of the audio signal processing in the audio signal separation and extraction unit and the noise removal unit of the audio signal processing unit of FIG. 12.
FIG. 14 is a diagram showing how audio data is stored in a buffer in the audio signal processing unit of FIG. 12.
FIG. 15 is a schematic diagram for explaining an example of the positional relationship among a listener, left and right speakers, and a synthesized sound image.
FIG. 16 is a schematic diagram for explaining an example of the positional relationship between the speaker group used in the wavefront synthesis reproduction system and virtual sound sources.
FIG. 17 is a schematic diagram for explaining an example of the positional relationship among the virtual sound sources of FIG. 16, a listener, and a synthesized sound image.
FIG. 18 is a schematic diagram for explaining the waveform discontinuities that occur at segment boundaries after the inverse discrete Fourier transform when the left and right channel audio signals are subjected to a discrete Fourier transform and the DC components of the left and right channels are ignored.
FIG. 19 is a schematic diagram for explaining an example of the discontinuity removal processing according to the present invention.
FIG. 20 is a diagram showing the waveforms that result from converting certain music content consisting of left and right channel audio signals into five channels by applying the discontinuity removal processing of FIG. 19.
FIG. 21 is a diagram showing the waveforms that result from converting the same music content as in FIG. 20 into five channels by applying another discontinuity removal processing according to the present invention.
FIG. 22 is a diagram showing the waveforms that result from applying the same discontinuity removal processing as in FIG. 21 to convert into five channels music content that differs from that of FIGS. 20 and 21 and whose audio signal waveform changes rapidly.
FIG. 23 is a diagram showing the waveforms that result from converting the same music content as in FIG. 22 into five channels by applying another discontinuity removal processing according to the present invention.
FIG. 24 is a diagram showing the waveforms that result from converting the music content of FIG. 8 into five channels by applying the discontinuity removal processing of FIG. 23.
FIG. 25 is an enlarged view of part of the audio signal of one channel of the music content of FIG. 24.
FIG. 26 is a schematic diagram for explaining an example of the positional relationship between the speaker group used and virtual sound sources when a 5.1ch audio signal is reproduced by the wavefront synthesis reproduction system.
FIG. 27 is a diagram showing a configuration example of a television apparatus provided with the audio data reproduction apparatus of FIG. 11.
FIG. 28 is a diagram showing another configuration example of a television apparatus provided with the audio data reproduction apparatus of FIG. 11.
FIG. 29 is a diagram showing another configuration example of a television apparatus provided with the audio data reproduction apparatus of FIG. 11.
FIG. 30 is a diagram showing a configuration example of a video projection system provided with the audio data reproduction apparatus of FIG. 11.
FIG. 31 is a diagram showing another configuration example of a video projection system provided with the audio data reproduction apparatus of FIG. 11.
FIG. 32 is a diagram showing a configuration example of a system consisting of a television apparatus and a television board provided with the audio data reproduction apparatus of FIG. 11.
FIG. 33 is a diagram showing an example of an automobile provided with the audio data reproduction apparatus of FIG. 11.
FIG. 34 is a diagram showing an example of the speakers targeted for reproduction by the audio data reproduction apparatus of FIG. 11.

The audio signal conversion apparatus according to the present invention converts an audio signal for a multi-channel reproduction system into, for example, an audio signal to be reproduced by a speaker group having the same or a different number of channels, or an audio signal for the wavefront synthesis reproduction method. It may also be called an audio signal processing apparatus or an audio data conversion apparatus, and it can be incorporated into an audio data reproducing apparatus. The term audio signal is, of course, not limited to a signal in which speech is recorded, and may equally be called an acoustic signal. The wavefront synthesis reproduction method is, as described above, a reproduction method in which the wavefront of sound is synthesized by a group of speakers arranged in a line or in a plane.

Configuration examples and processing examples of the audio signal conversion apparatus according to the present invention are described below with reference to the drawings. The description first takes up an example in which the audio signal conversion apparatus according to the present invention generates, by conversion, an audio signal for the wavefront synthesis reproduction method.
FIG. 11 is a block diagram showing a configuration example of an audio data reproducing apparatus provided with the audio signal conversion apparatus according to the present invention, and FIG. 12 is a block diagram showing a configuration example of the audio signal processing unit (the audio signal conversion apparatus according to the present invention) in the audio data reproducing apparatus of FIG. 11.

The audio data reproducing apparatus 110 illustrated in FIG. 11 consists of a decoder 111, an audio signal extraction unit 112, an audio signal processing unit 113, a D/A converter 114, an amplifier group 115, and a speaker group 116. The decoder 111 decodes content that is audio only or video with audio, converts it into a format on which signal processing can be performed, and outputs it to the audio signal extraction unit 112. The content is acquired by receiving digital broadcast content transmitted from a broadcasting station, by downloading it over the Internet from a server that distributes digital content via a network, or by reading it from a recording medium such as an external storage device. Thus, although not shown in FIG. 11, the audio data reproducing apparatus 110 includes a digital content input unit that inputs digital content containing a multi-channel input audio signal, and the decoder 111 decodes the digital content input there. The audio signal extraction unit 112 separates and extracts the audio signal from the obtained signal. Here the audio signal is assumed to be a 2ch stereo signal, and the signals of those two channels are output to the audio signal processing unit 113.

From the obtained two-channel signal, the audio signal processing unit 113 generates a multi-channel audio signal that has three or more channels and differs from the input audio signal (in the following example it is described as signals for the number of virtual sound sources). In other words, it converts the input audio signal into a different multi-channel audio signal. The audio signal processing unit 113 outputs that audio signal to the D/A converter 114. As long as the number of virtual sound sources is at or above a certain number, fixing it in advance causes no problem in terms of performance; however, the amount of computation grows as the number of virtual sound sources grows. It is therefore desirable to choose the number in view of the performance of the device on which the processing is implemented. In this example the number is taken to be 5.

The D/A converter 114 converts the obtained signals into analog signals and outputs each of them to the amplifiers 115. Each amplifier 115 amplifies its input analog signal and transmits it to the corresponding speaker 116, which outputs it into space as sound.

FIG. 12 shows the detailed configuration of the audio signal processing unit in FIG. 11. The audio signal processing unit 113 consists of an audio signal separation/extraction unit 121, a noise removal unit 122, and an audio output signal generation unit 123.

The audio signal separation/extraction unit 121 generates, from the two-channel signal, an audio signal corresponding to each virtual sound source and outputs it to the noise removal unit 122. The noise removal unit 122 removes from the obtained audio signal waveform the portions that are perceived as noise, and outputs the noise-removed audio signal to the audio output signal generation unit 123. The audio output signal generation unit 123 generates, from the obtained audio signal, the output audio signal waveform for each speaker. It applies processing such as wavefront synthesis reproduction processing; for example, it assigns the obtained audio signal for each virtual sound source to the speakers and generates an audio signal for each speaker. Part of the wavefront synthesis reproduction processing may be handled by the audio signal separation/extraction unit 121.

Next, an example of the audio signal processing in the audio signal separation/extraction unit 121 and the noise removal unit 122 is described with reference to FIG. 13. FIG. 13 is a flowchart for explaining an example of the audio signal processing in the audio signal separation/extraction unit and the noise removal unit of the audio signal processing unit of FIG. 12, and FIG. 14 is a diagram showing how audio data is stored in a buffer in the audio signal processing unit of FIG. 12.

First, the audio signal separation/extraction unit 121 reads audio data of half the length of one segment from the extraction result of the audio signal extraction unit 112 in FIG. 11 (step S131). Here, audio data means a discrete audio signal waveform sampled at a sampling frequency such as 48 kHz. A segment is an audio data section consisting of a group of sample points of a fixed length, and here it denotes the section length that is later subjected to the discrete Fourier transform. Its value is, for example, 1024. In this example, 512 points of audio data, half the length of one segment, are read.

The 512 points of audio data that were read are stored in a buffer 140 such as the one illustrated in FIG. 14. This buffer holds the audio signal waveform of the most recent one segment and discards older segments. The data of the preceding half segment and the data of the latest half segment are joined to form one segment of audio data, and processing proceeds to the window function operation (step S132). Consequently, every sample is read into the window function operation twice.
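The half-segment buffering described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `overlapping_segments` is hypothetical, and starting from a half segment of silence is an assumption made only so that the first read already yields a full segment.

```python
from typing import Iterable, Iterator, List

SEGMENT_LEN = 1024          # segment length used in the text
HALF = SEGMENT_LEN // 2     # 512 samples are read per step


def overlapping_segments(samples: Iterable[float]) -> Iterator[List[float]]:
    """Yield one full segment per half-segment read, overlapping by half."""
    previous_half: List[float] = [0.0] * HALF   # ASSUMPTION: start from silence
    current_half: List[float] = []
    for value in samples:
        current_half.append(value)
        if len(current_half) == HALF:
            # Join the previous half segment with the newest half segment.
            yield previous_half + current_half
            # Keep only the newest half; anything older is discarded.
            previous_half, current_half = current_half, []


segments = list(overlapping_segments(float(n) for n in range(4 * HALF)))
```

Each half segment appears once as the second half of one segment and once as the first half of the next, which is why every sample reaches the window function twice.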

In the window function operation of step S132, one segment of audio data is multiplied by the following conventionally proposed Hann window.
$$w(m) = \sin^2\left(\frac{m}{M}\pi\right) \tag{1}$$
Here, $m$ is a natural number and $M$ is the length of one segment, which is even. Letting the stereo input signals be $x_L(m)$ and $x_R(m)$, the audio signals after window multiplication, $x'_L(m)$ and $x'_R(m)$, are calculated as
$$x'_L(m) = w(m)\,x_L(m), \qquad x'_R(m) = w(m)\,x_R(m) \tag{2}$$
With this Hann window, the input signal $x_L(m_0)$ at a sample point $m_0$ (where $M/2 \le m_0 < M$), for example, is multiplied by $\sin^2((m_0/M)\pi)$. In the next read, the same sample point is read as $m_0 - M/2$, so it is multiplied by
$$\sin^2\left(\frac{m_0 - M/2}{M}\pi\right) = \cos^2\left(\frac{m_0}{M}\pi\right)$$
Since $\sin^2((m_0/M)\pi) + \cos^2((m_0/M)\pi) = 1$, adding the read signals with a shift of half a segment, without applying any modification, restores the original signal completely.
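The reconstruction property stated above can be checked numerically. The sketch below assumes the Hann window in the form sin²((m/M)π), as implied by the factors quoted in the text; it uses a short made-up test signal and a small segment length chosen only for illustration.

```python
import math

M = 16                      # small even segment length, for illustration only
HALF = M // 2


def hann(m: int, length: int) -> float:
    """Hann window written as sin^2((m / M) * pi)."""
    return math.sin(math.pi * m / length) ** 2


signal = [math.sin(0.3 * n) + 0.5 for n in range(6 * HALF)]
restored = [0.0] * len(signal)

# Window every half-overlapping segment and add it back at its position.
for start in range(0, len(signal) - M + 1, HALF):
    for m in range(M):
        restored[start + m] += hann(m, M) * signal[start + m]

# Samples covered by two segments (all but the first and last half segment)
# should match the input up to rounding error.
max_error = max(
    abs(restored[n] - signal[n]) for n in range(HALF, len(signal) - HALF)
)
```

Only the first and last half segments, which are covered by a single window, differ from the input; everywhere else the two window weights add up to 1.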

The audio data obtained in this way is subjected to the discrete Fourier transform as in the following equation (3) to obtain frequency-domain audio data (step S133). Here, DFT denotes the discrete Fourier transform and $k$ is a natural number with $0 \le k < M$. $X_L(k)$ and $X_R(k)$ are complex numbers.
$$X_L(k) = \mathrm{DFT}(x'_L(m)), \qquad X_R(k) = \mathrm{DFT}(x'_R(m)) \tag{3}$$

Next, the obtained frequency-domain audio data is divided into small bands, and the processing of steps S135 to S138 is executed for each of the divided bands (steps S134a, S134b). Each of these processes is described concretely below.

First, the division uses the Equivalent Rectangular Band (ERB): the range from 0 Hz to half the sampling frequency is divided by the ERB bandwidth. The number of divisions that the ERB produces up to a given upper frequency limit $f_{max}$ [Hz], that is, the maximum value $I$ of the index of the ERB-divided bands, is given by the following equation.
$$I = \mathrm{floor}(21.4 \log_{10}(0.00437 f_{max} + 1)) \tag{4}$$
Here, floor(a) is the floor function and denotes the largest integer not exceeding the real number $a$.
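Equation (4) is simple enough to evaluate directly. The following sketch is only an illustration of that formula; the function name `erb_band_count` is hypothetical, and the two frequencies are sample inputs rather than values prescribed by the source.

```python
import math


def erb_band_count(f_max_hz: float) -> int:
    """Equation (4): I = floor(21.4 * log10(0.00437 * f_max + 1))."""
    return math.floor(21.4 * math.log10(0.00437 * f_max_hz + 1.0))


count_24k = erb_band_count(24000.0)
count_48k = erb_band_count(48000.0)
```

With these constants, an upper limit of 24000 Hz evaluates to 43 and an upper limit of 48000 Hz evaluates to 49.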

The center frequency $F_c^{(i)}$ ($1 \le i \le I$) [Hz] of each ERB-wide band (hereinafter, small band) is given by equation (5).

The ERB bandwidth $b^{(i)}$ [Hz] at that center frequency is obtained by the following equation.
$$b^{(i)} = 24.7\,(0.00437 F_c^{(i)} + 1) \tag{6}$$
The boundary frequencies $F_L^{(i)}$ and $F_U^{(i)}$ on the two sides of the $i$-th small band can therefore be obtained by shifting from the center frequency toward the low side and the high side by a frequency width of ERB/2 each. The $i$-th small band thus contains the $K_L^{(i)}$-th through $K_U^{(i)}$-th line spectra, where $K_L^{(i)}$ and $K_U^{(i)}$ are given by the following equations (7) and (8), respectively.
$$K_L^{(i)} = \mathrm{ceil}(21.4 \log_{10}(0.00437 F_L^{(i)} + 1)) \tag{7}$$
$$K_U^{(i)} = \mathrm{floor}(21.4 \log_{10}(0.00437 F_U^{(i)} + 1)) \tag{8}$$
Here, ceil(a) is the ceiling function and denotes the smallest integer not less than the real number $a$. The line spectrum after the discrete Fourier transform is symmetric about $M/2$ (where $M$ is even), except for the DC component, for example $X_L(0)$. That is, $X_L(k)$ and $X_L(M-k)$ are complex conjugates of each other for $0 < k < M/2$. In what follows, therefore, the range $K_U^{(i)} \le M/2$ is taken as the object of analysis, and the range $k > M/2$ is treated in the same way as the symmetric line spectrum with which it is in a complex-conjugate relationship.
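The conjugate symmetry that justifies analysing only the lower half of the spectrum can be confirmed with a direct transform. This is a sketch under the usual DFT definition with a negative exponent; the function `dft` is a plain O(M²) illustration written for this example, and the input is a made-up windowed signal.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Direct O(M^2) discrete Fourier transform of a real sequence."""
    size = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * math.pi * k * m / size) for m in range(size))
        for k in range(size)
    ]


M = 16
windowed = [math.sin(math.pi * m / M) ** 2 * math.cos(0.7 * m) for m in range(M)]
X = dft(windowed)

# For real input, X(k) and X(M - k) are complex conjugates for 0 < k < M/2.
symmetry_error = max(abs(X[k] - X[M - k].conjugate()) for k in range(1, M // 2))
```

Because the upper half carries no independent information for a real input, processing the bins up to M/2 and mirroring the result is sufficient.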

A concrete example follows. When the sampling frequency is 48000 Hz, for example, $I = 49$, so the range is divided into 49 small bands. Line spectrum components corresponding to frequencies above the highest small band also exist, but they have almost no audible effect and usually have very small values, so they may simply be included in the highest small band.

Next, in each small band determined in this way, the correlation coefficient is obtained by calculating the normalized correlation coefficient of the left and right channels using equation (9) (step S135).

This normalized correlation coefficient $d^{(i)}$ expresses how strongly the left and right channel audio signals are correlated, and takes a real value between 0 and 1. It is 1 for identical signals and 0 for completely uncorrelated signals. When both the powers $P_L^{(i)}$ and $P_R^{(i)}$ of the left and right channel audio signals are 0, extraction of the correlation signal and the uncorrelated signals is regarded as impossible for that small band, and processing moves on to the next small band without processing this one. When only one of $P_L^{(i)}$ and $P_R^{(i)}$ is 0, equation (9) cannot be evaluated, but the normalized correlation coefficient is set to $d^{(i)} = 0$ and processing of that small band continues.
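The two zero-power cases described above can be sketched as a guard around the coefficient calculation. The formula used for the coefficient itself is an assumption: equation (9) is not reproduced in this text, so the sketch substitutes a common normalized cross-correlation, and the function name `normalized_correlation` is hypothetical.

```python
import math
from typing import List, Optional


def normalized_correlation(xl: List[complex], xr: List[complex]) -> Optional[float]:
    """Return d for one small band, or None when the band is skipped.

    ASSUMPTION: d = |sum Re(X_L * conj(X_R))| / sqrt(P_L * P_R). This form is
    illustrative only and is not taken from the source's equation (9).
    """
    p_left = sum(abs(v) ** 2 for v in xl)
    p_right = sum(abs(v) ** 2 for v in xr)
    if p_left == 0.0 and p_right == 0.0:
        return None          # both powers zero: skip this small band
    if p_left == 0.0 or p_right == 0.0:
        return 0.0           # one power zero: d is set to 0, processing continues
    cross = sum((a * b.conjugate()).real for a, b in zip(xl, xr))
    return abs(cross) / math.sqrt(p_left * p_right)


band = [1 + 2j, -0.5 + 0.25j, 3 - 1j]
d_same = normalized_correlation(band, band)
d_one_silent = normalized_correlation(band, [0j, 0j, 0j])
d_both_silent = normalized_correlation([0j, 0j], [0j, 0j])
```

Returning `None` for the skipped band keeps that case distinct from a genuine coefficient of 0, for which processing continues.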

Next, using this normalized correlation coefficient $d^{(i)}$, transform coefficients for separating and extracting the correlation signal and the uncorrelated signals from the left and right channel audio signals are obtained (step S136), and the transform coefficients obtained in step S136 are used to separate and extract the correlation signal and the uncorrelated signals from the left and right channel audio signals (step S137). Both the correlation signal and the uncorrelated signals may be extracted as estimated audio signals.

A processing example of steps S136 and S137 is described. As in Patent Document 1, a model is adopted in which the signal of each of the left and right channels consists of an uncorrelated signal and a correlation signal, and the same correlation signal is output from the left and right. The direction of the sound image synthesized by the correlation signal output from the left and right is assumed to be determined by the balance of the left and right sound pressures of that correlation signal. Under this model, the input signals $x_L(m)$ and $x_R(m)$ are expressed as
$$x_L(m) = s(m) + n_L(m), \qquad x_R(m) = \alpha s(m) + n_R(m) \tag{13}$$
Here, $s(m)$ is the correlation signal common to left and right; $n_L(m)$ is obtained by subtracting the correlation signal from the left-channel audio signal and can be defined as the (left-channel) uncorrelated signal; and $n_R(m)$ is obtained by subtracting the correlation signal from the right-channel audio signal and can be defined as the (right-channel) uncorrelated signal. $\alpha$ is a positive real number representing the degree of the left/right sound pressure balance of the correlation signal.

From equation (13), the audio signals $x'_L(m)$ and $x'_R(m)$ after window multiplication described in equation (2) are expressed by the following equation (14), where $s'(m)$, $n'_L(m)$ and $n'_R(m)$ are $s(m)$, $n_L(m)$ and $n_R(m)$ multiplied by the window function, respectively.
$$x'_L(m) = w(m)\{s(m) + n_L(m)\} = s'(m) + n'_L(m),$$
$$x'_R(m) = w(m)\{\alpha s(m) + n_R(m)\} = \alpha s'(m) + n'_R(m) \tag{14}$$

Applying the discrete Fourier transform to equation (14) gives the following equation (15), where $S(k)$, $N_L(k)$ and $N_R(k)$ are the discrete Fourier transforms of $s'(m)$, $n'_L(m)$ and $n'_R(m)$, respectively.
$$X_L(k) = S(k) + N_L(k), \qquad X_R(k) = \alpha S(k) + N_R(k) \tag{15}$$

Therefore, the audio signals $X_L^{(i)}(k)$ and $X_R^{(i)}(k)$ in the $i$-th small band are expressed as
$$X_L^{(i)}(k) = S^{(i)}(k) + N_L^{(i)}(k),$$
$$X_R^{(i)}(k) = \alpha^{(i)} S^{(i)}(k) + N_R^{(i)}(k),$$
$$\text{where } K_L^{(i)} \le k \le K_U^{(i)} \tag{16}$$
Here, $\alpha^{(i)}$ denotes $\alpha$ in the $i$-th small band. Hereafter, the correlation signal $S^{(i)}(k)$ and the uncorrelated signals $N_L^{(i)}(k)$ and $N_R^{(i)}(k)$ in the $i$-th small band are defined as
$$S^{(i)}(k) = S(k), \qquad N_L^{(i)}(k) = N_L(k), \qquad N_R^{(i)}(k) = N_R(k),$$
$$\text{where } K_L^{(i)} \le k \le K_U^{(i)} \tag{17}$$

From equation (16), the sound pressures $P_L^{(i)}$ and $P_R^{(i)}$ of equation (12) are expressed as
$$P_L^{(i)} = P_S^{(i)} + P_N^{(i)}, \qquad P_R^{(i)} = [\alpha^{(i)}]^2 P_S^{(i)} + P_N^{(i)} \tag{18}$$
Here, $P_S^{(i)}$ and $P_N^{(i)}$ are the powers of the correlation signal and the uncorrelated signal in the $i$-th small band, respectively, and are given by equation (19), in which the sound pressures of the left and right uncorrelated signals are assumed to be equal.

From equations (10) to (12), equation (9) can be expressed as equation (20). In this calculation it is assumed that $S(k)$, $N_L(k)$ and $N_R(k)$ are mutually orthogonal, so that the power of their cross products is 0.

Solving equations (18) and (20) yields equations (21) and (22).

Using these values, the correlation signal and the uncorrelated signals in each small band are estimated. Let the estimate $\mathrm{est}(S^{(i)}(k))$ of the correlation signal $S^{(i)}(k)$ in the $i$-th small band be written, using parameters $\mu_1$ and $\mu_2$, as
$$\mathrm{est}(S^{(i)}(k)) = \mu_1 X_L^{(i)}(k) + \mu_2 X_R^{(i)}(k) \tag{23}$$
Then the estimation error $\varepsilon$ is expressed as
$$\varepsilon = \mathrm{est}(S^{(i)}(k)) - S^{(i)}(k) \tag{24}$$
where est(A) denotes the estimate of A. Using the property that, when the squared error $\varepsilon^2$ is minimized, $\varepsilon$ is orthogonal to each of $X_L^{(i)}(k)$ and $X_R^{(i)}(k)$, the relationship
$$E[\varepsilon \cdot X_L^{(i)}(k)] = 0, \qquad E[\varepsilon \cdot X_R^{(i)}(k)] = 0 \tag{25}$$
holds. Using equations (16), (19) and (21) to (24), the following simultaneous equations can be derived from equation (25).
$$(1 - \mu_1 - \mu_2\alpha^{(i)})P_S^{(i)} - \mu_1 P_N^{(i)} = 0$$
$$\alpha^{(i)}(1 - \mu_1 - \mu_2\alpha^{(i)})P_S^{(i)} - \mu_2 P_N^{(i)} = 0 \tag{26}$$
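One way to solve the simultaneous equations (26) is by substitution, sketched below. The closed forms are derived here from (26) alone, under the added assumption that $P_N^{(i)} > 0$; they are offered as a worked step and are not copied from the source's own statement of the solution.

```latex
% Both equations in (26) share the factor (1 - \mu_1 - \mu_2\alpha^{(i)}) P_S^{(i)}.
% The first equation gives (1 - \mu_1 - \mu_2\alpha^{(i)}) P_S^{(i)} = \mu_1 P_N^{(i)}.
% Substituting this into the second equation:
\alpha^{(i)} \mu_1 P_N^{(i)} - \mu_2 P_N^{(i)} = 0
\;\Longrightarrow\;
\mu_2 = \alpha^{(i)} \mu_1 \qquad (P_N^{(i)} > 0)

% Inserting \mu_2 = \alpha^{(i)} \mu_1 back into the first equation:
\bigl(1 - \mu_1 (1 + [\alpha^{(i)}]^2)\bigr) P_S^{(i)} = \mu_1 P_N^{(i)}

% Hence
\mu_1 = \frac{P_S^{(i)}}{(1 + [\alpha^{(i)}]^2)\, P_S^{(i)} + P_N^{(i)}},
\qquad
\mu_2 = \frac{\alpha^{(i)} P_S^{(i)}}{(1 + [\alpha^{(i)}]^2)\, P_S^{(i)} + P_N^{(i)}}
```

When $P_N^{(i)} = 0$ the two equations in (26) become proportional and no longer determine $\mu_1$ and $\mu_2$ uniquely, which is why the derivation excludes that case.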

Solving equation (26) gives the parameters as in equation (27). The power $P_{\mathrm{est}(S)}^{(i)}$ of the estimate $\mathrm{est}(S^{(i)}(k))$ obtained in this way must satisfy the following equation, which is obtained by squaring both sides of equation (23):
$$P_{\mathrm{est}(S)}^{(i)} = (\mu_1 + \alpha^{(i)}\mu_2)^2 P_S^{(i)} + (\mu_1^2 + \mu_2^2)P_N^{(i)} \tag{28}$$
From this equation the estimate is therefore scaled as in equation (29), where est′(A) denotes the scaled estimate of A.
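The relationship between equations (26) and (28) can be checked with a few numbers. In this sketch the closed form for the weights is obtained by solving (26) by substitution and assumes a nonzero uncorrelated power; the rescaling factor is likewise an assumption (it simply restores the power $P_S$), and the function names and input values are made up for illustration.

```python
import math
from typing import Tuple


def correlation_weights(alpha: float, p_s: float, p_n: float) -> Tuple[float, float]:
    """Solution of the simultaneous equations (26) by substitution; needs p_n > 0."""
    denom = (1.0 + alpha * alpha) * p_s + p_n
    return p_s / denom, alpha * p_s / denom


def estimate_power(alpha: float, p_s: float, p_n: float, mu1: float, mu2: float) -> float:
    """Equation (28): power of the unscaled estimate est(S)."""
    return (mu1 + alpha * mu2) ** 2 * p_s + (mu1 ** 2 + mu2 ** 2) * p_n


alpha, p_s, p_n = 0.8, 2.0, 0.5          # made-up illustrative values
mu1, mu2 = correlation_weights(alpha, p_s, p_n)

# Residuals of the two equations in (26); both should vanish.
res1 = (1 - mu1 - mu2 * alpha) * p_s - mu1 * p_n
res2 = alpha * (1 - mu1 - mu2 * alpha) * p_s - mu2 * p_n

p_est = estimate_power(alpha, p_s, p_n, mu1, mu2)
# ASSUMPTION: rescale so that the estimate's power equals P_S.
scale = math.sqrt(p_s / p_est)
```

For these inputs the unscaled estimate has less power than the correlation signal it approximates, so the assumed scale factor is greater than 1.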

Likewise, let the estimates $\mathrm{est}(N_L^{(i)}(k))$ and $\mathrm{est}(N_R^{(i)}(k))$ of the left- and right-channel uncorrelated signals $N_L^{(i)}(k)$ and $N_R^{(i)}(k)$ in the $i$-th small band be written as
$$\mathrm{est}(N_L^{(i)}(k)) = \mu_3 X_L^{(i)}(k) + \mu_4 X_R^{(i)}(k) \tag{30}$$
$$\mathrm{est}(N_R^{(i)}(k)) = \mu_5 X_L^{(i)}(k) + \mu_6 X_R^{(i)}(k) \tag{31}$$
Then, in the same way as above, the parameters $\mu_3$ to $\mu_6$ can be obtained as in equations (32) and (33). The estimates $\mathrm{est}(N_L^{(i)}(k))$ and $\mathrm{est}(N_R^{(i)}(k))$ obtained in this way are also scaled, as above, by equations (34) and (35), respectively.

The parameters $\mu_1$ to $\mu_6$ given by equations (27), (32) and (33) and the scaling coefficients given by equations (29), (34) and (35) correspond to the transform coefficients obtained in step S136. In step S137, the correlation signal and the uncorrelated signals (the right-channel uncorrelated signal and the left-channel uncorrelated signal) are separated and extracted by estimating them through the operations that use these transform coefficients (equations (23), (30) and (31)).

Next, assignment to the virtual sound sources is performed (step S138). As preprocessing for this assignment, the direction of the synthesized sound image produced by the correlation signal estimated for each small band is estimated. This estimation is described with reference to FIGS. 15 to 17. FIG. 15 is a schematic diagram for explaining an example of the positional relationship between the listener, the left and right speakers, and the synthesized sound image; FIG. 16 is a schematic diagram for explaining an example of the positional relationship between the speaker group used in the wavefront synthesis reproduction method and the virtual sound sources; and FIG. 17 is a schematic diagram for explaining an example of the positional relationship between the virtual sound sources of FIG. 16, the listener, and the synthesized sound image.

Now, as in the positional relationship 150 shown in FIG. 15, let $\theta_0$ be the spread angle formed by the line drawn from the listener 153 to the midpoint of the left and right speakers 151L and 151R and the line drawn from the listener 153 to the center of either speaker 151L/151R, and let $\theta$ be the spread angle that the former line forms with the line drawn from the listener 153 to the position of the estimated synthesized sound image 152. When the same audio signal is output from the left and right speakers 151L and 151R with a changed sound pressure balance, it is generally known that the direction of the synthesized sound image 152 produced by that output can be approximated by equation (36), using the aforementioned parameter $\alpha$ that represents the sound pressure balance (hereinafter called the sine law of stereophonic sound).

Here, to allow a 2ch stereo audio signal to be reproduced by the wavefront synthesis reproduction method, the audio signal separation/extraction unit 121 shown in FIG. 12 converts the 2ch signal into a signal of multiple channels. For example, when the number of channels after conversion is five, they are regarded as virtual sound sources 162a to 162e in the wavefront synthesis reproduction method and are placed behind the speaker group (speaker array) 161, as in the positional relationship 160 shown in FIG. 16. The spacing between adjacent virtual sound sources among 162a to 162e is uniform. The conversion here therefore converts the 2ch audio signal into audio signals for the number of virtual sound sources. As already described, the audio signal separation/extraction unit 121 first separates the 2ch audio signal, for each small band, into one correlation signal and two uncorrelated signals. The unit must additionally have decided in advance how those signals are to be assigned to the virtual sound sources (here, five virtual sound sources). The assignment method may be made user-selectable from among several methods, and the methods offered to the user may be varied according to the number of virtual sound sources.

As one example of an assignment method, the following is adopted. First, the left and right uncorrelated signals are assigned to the two ends of the five virtual sound sources (virtual sound sources 162a and 162e), respectively. Next, the synthesized sound image produced by the correlation signal is assigned to two adjacent virtual sound sources among the five. As for which two adjacent virtual sound sources it is assigned to, it is first assumed as a premise that the synthesized sound image produced by the correlation signal lies inside the two ends (virtual sound sources 162a and 162e) of the five virtual sound sources; that is, the five virtual sound sources 162a to 162e are placed so as to fall within the spread angle formed by the two speakers in 2ch stereo reproduction. Then, from the estimated direction of the synthesized sound image, the two adjacent virtual sound sources that sandwich the image are determined, the sound pressure balance assigned to those two virtual sound sources is adjusted, and reproduction is performed so that the two virtual sound sources produce the synthesized sound image.

Accordingly, as in the positional relationship 170 shown in FIG. 17, let $\theta'_0$ be the spread angle formed by the line drawn from the listener 173 to the midpoint of the virtual sound sources 162a and 162e at both ends and the line drawn to the end virtual sound source 162e, and let $\theta'$ be the spread angle that the former line forms with the line drawn from the listener 173 to the synthesized sound image 171. Further, let $\varphi_0$ be the spread angle formed by the line drawn from the listener 173 to the midpoint of the two virtual sound sources 162c and 162d that sandwich the synthesized sound image 171 and the line drawn from the listener 173 to the midpoint of the end virtual sound sources 162a and 162e (that is, the line drawn from the listener 173 to the virtual sound source 162c), and let $\varphi$ be the spread angle that the former line forms with the line drawn from the listener 173 to the synthesized sound image 171. Here, $\varphi_0$ is a positive real number. The following describes how the synthesized sound image 152 of FIG. 15 (corresponding to the synthesized sound image 171 in FIG. 17), whose direction was estimated as described for equation (36), is assigned to the virtual sound sources using these variables.

First, scaling for the difference in spread angle is performed as follows:
$$\theta' = \frac{\theta'_0}{\theta_0}\,\theta \tag{37}$$
This accounts for the difference in spread angle caused by the placement of the virtual sound sources. The values of $\theta'_0$ and $\theta_0$ may be adjusted when the audio data reproduction apparatus is implemented, and no particular problem arises if they are not equal. In this example, $\theta_0 = \pi/6$ [rad] and $\theta'_0 = \pi/4$ [rad].

Next, suppose the direction $\theta^{(i)}$ of the i-th synthesized sound image has been estimated by equation (36) as, for example, $\theta^{(i)} = \pi/15$ [rad]. Equation (37) then gives $\theta'^{(i)} = \pi/10$ [rad]. With five virtual sound sources, the synthesized sound image 171 lies between the third virtual sound source 162c and the fourth virtual sound source 162d counted from the left, as shown in FIG. 17. Also with five virtual sound sources, $\theta'_0 = \pi/4$ [rad] gives $\phi_0 \approx 0.078\pi$ [rad] for the pair 162c and 162d, and writing $\phi$ in the i-th subband as $\phi^{(i)}$, we have $\phi^{(i)} = \theta'^{(i)} - \phi_0 \approx 0.022\pi$ [rad]. In this way, the direction of the synthesized sound image produced by the correlation signal in each subband is expressed as an angle relative to the directions of the two virtual sound sources that sandwich it. As described above, the two virtual sound sources 162c and 162d are then used to produce that synthesized sound image. To do so, it suffices to adjust the sound pressure balance of the output audio signals from 162c and 162d, and the adjustment again uses the sine law of stereophony that was used as equation (36).
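The worked numbers above can be checked with a short sketch. Equation (37) is taken directly from the text. The helper `pair_midpoint_angle` is hypothetical: it assumes the virtual sound sources are equally spaced along a straight line whose ends subtend ±θ′₀ at the listener, which is not stated in the text but yields values of about 0.078π and 0.022π rad.

```python
import math

def scale_direction(theta, theta0, theta0_prime):
    """Equation (37): rescale the estimated direction to the virtual-source spread angle."""
    return (theta0_prime / theta0) * theta

def pair_midpoint_angle(theta0_prime, num_sources, left_index):
    """Angle of the midpoint between adjacent sources left_index and left_index + 1 (0-based).

    ASSUMPTION (not from the text): sources are equally spaced on a straight line
    whose two ends are seen at +/- theta0_prime from the listener.
    """
    half_width = math.tan(theta0_prime)          # listener distance normalised to 1
    spacing = 2.0 * half_width / (num_sources - 1)
    x_mid = -half_width + (left_index + 0.5) * spacing
    return math.atan(x_mid)

theta_p = scale_direction(math.pi / 15, math.pi / 6, math.pi / 4)
phi0 = pair_midpoint_angle(math.pi / 4, 5, 2)    # pair 162c / 162d
phi = theta_p - phi0
print(theta_p / math.pi, phi0 / math.pi, phi / math.pi)
```

The three printed values are the angles expressed as multiples of π.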

Here, of the two virtual sound sources 162c and 162d that sandwich the synthesized sound image produced by the correlation signal in the i-th subband, let $g_1$ be the scaling coefficient for the third virtual sound source 162c and $g_2$ the scaling coefficient for the fourth virtual sound source 162d. The third virtual sound source 162c then outputs the audio signal $g_1 \cdot \mathrm{est}'(S^{(i)}(k))$, and the fourth virtual sound source 162d outputs $g_2 \cdot \mathrm{est}'(S^{(i)}(k))$. By the sine law of stereophony, $g_1$ and $g_2$ need only satisfy

[Equation (38)]

Meanwhile, normalizing $g_1$ and $g_2$ so that the total power from the third virtual sound source 162c and the fourth virtual sound source 162d equals the power of the original 2ch stereo correlation signal gives
$$g_1^2 + g_2^2 = 1 + \left[\alpha^{(i)}\right]^2 \tag{39}$$

Solving these simultaneously yields

[Equation (40)]

Substituting the above $\phi^{(i)}$ and $\phi_0$ into equation (40) gives $g_1$ and $g_2$. Based on the scaling coefficients calculated in this way, the audio signal $g_1 \cdot \mathrm{est}'(S^{(i)}(k))$ is assigned to the third virtual sound source 162c and the audio signal $g_2 \cdot \mathrm{est}'(S^{(i)}(k))$ to the fourth virtual sound source 162d, as described above. Also as described above, the uncorrelated signals are assigned to the virtual sound sources 162a and 162e at both ends: $\mathrm{est}'(N_L^{(i)}(k))$ is assigned to the first virtual sound source 162a and $\mathrm{est}'(N_R^{(i)}(k))$ to the fifth virtual sound source 162e.
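Equations (38) and (40) are not reproduced in this text, so the following is only a sketch under a stated assumption. It assumes the sine law takes the common form sin φ / sin φ₀ = (g₂ − g₁) / (g₂ + g₁), with φ measured from the pair midpoint toward the source scaled by g₂. It combines that form with the power constraint of equation (39). The function name `pair_gains` is hypothetical, and the patent's own equation (40) may be written differently.

```python
import math

def pair_gains(phi, phi0, alpha):
    """Gains (g1, g2) for the two virtual sources sandwiching the sound image.

    ASSUMED sine law: sin(phi) / sin(phi0) = (g2 - g1) / (g2 + g1).
    Power constraint, equation (39): g1^2 + g2^2 = 1 + alpha^2.
    """
    s, s0 = math.sin(phi), math.sin(phi0)
    # Common factor k chosen so that the power constraint holds.
    k = math.sqrt((1.0 + alpha * alpha) / (2.0 * (s0 * s0 + s * s)))
    return k * (s0 - s), k * (s0 + s)

# Example with the angles of the worked example and a hypothetical alpha of 1.0.
g1, g2 = pair_gains(0.022 * math.pi, 0.078 * math.pi, 1.0)
```

With a positive φ the image lies closer to the second source of the pair, so g₂ comes out larger than g₁.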

Unlike this example, if the estimated direction of the synthesized sound image lies between the first and second virtual sound sources, both $g_1 \cdot \mathrm{est}'(S^{(i)}(k))$ and $\mathrm{est}'(N_L^{(i)}(k))$ are assigned to the first virtual sound source. Likewise, if the estimated direction lies between the fourth and fifth virtual sound sources, both $g_2 \cdot \mathrm{est}'(S^{(i)}(k))$ and $\mathrm{est}'(N_R^{(i)}(k))$ are assigned to the fifth virtual sound source.
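The allocation rule for one subband, including the edge cases just described, can be sketched as follows. The function name, argument layout, and the use of plain Python lists of complex spectrum bins are illustrative assumptions, not the patent's implementation.

```python
def allocate_subband(pair_left, g1, g2, s_est, n_left, n_right, num_sources=5):
    """Distribute one subband's spectra over the virtual sources.

    pair_left : 0-based index of the left source of the sandwiching pair
    s_est     : correlation-signal bins, est'(S(k))
    n_left    : left uncorrelated bins, est'(N_L(k)), sent to the first source
    n_right   : right uncorrelated bins, est'(N_R(k)), sent to the last source
    """
    bins = len(s_est)
    out = [[0j] * bins for _ in range(num_sources)]
    for k in range(bins):
        out[pair_left][k] += g1 * s_est[k]
        out[pair_left + 1][k] += g2 * s_est[k]
        # Accumulating with += means an end source that is also part of the
        # pair receives both the scaled correlation signal and the uncorrelated one.
        out[0][k] += n_left[k]
        out[num_sources - 1][k] += n_right[k]
    return out
```

Calling this for every subband and summing the results per source would give the frequency-domain output of each virtual source.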

As described above, the left- and right-channel correlation signals and uncorrelated signals for the i-th subband are allocated in step S138. This is performed for all subbands by the loop of steps S134a and S134b. As a result, with $J$ the number of virtual sound sources, the frequency-domain output audio signals $Y_1(k), \ldots, Y_J(k)$ for the virtual sound sources (output channels) are obtained.

Then, for each output channel obtained, the processing of steps S140 to S142 is executed (steps S139a and S139b). Steps S140 to S142 are described below.

First, each output channel is subjected to the inverse discrete Fourier transform to obtain the time-domain output audio signal $y'_j(m)$ (step S140), where $\mathrm{DFT}^{-1}$ denotes the inverse discrete Fourier transform:
$$y'_j(m) = \mathrm{DFT}^{-1}\left(Y_j(k)\right) \quad (1 \le j \le J) \tag{41}$$
As explained for equation (3), the signal that was discrete-Fourier-transformed had already been multiplied by the window function, so the signal $y'_j(m)$ obtained by the inverse transform is also in a windowed state. The window function is the one given in equation (1), and the input was read while shifting by half a segment length each time. Therefore, as described earlier, the converted data are obtained by adding each result into the output buffer at an offset of half a segment length from the start of the previously processed segment.
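The half-segment overlap-add described here can be sketched as below. Equation (1) is not shown in this section, so the sketch assumes a Hann-type window sin²(πm/M); for that window, copies shifted by M/2 sum to one, which is what lets the windowed segments add back to the original level. The function names are illustrative.

```python
import math

def hann(M):
    """Assumed window: sin^2(pi * m / M), m = 0 .. M-1 (equation (1) is not shown here)."""
    return [math.sin(math.pi * m / M) ** 2 for m in range(M)]

def overlap_add(segments, M):
    """Add windowed segments of length M into an output buffer, hopping by M // 2."""
    hop = M // 2
    out = [0.0] * (hop * (len(segments) + 1))
    for l, seg in enumerate(segments):
        for m in range(M):
            out[l * hop + m] += seg[m]
    return out

# A constant input of 1.0, windowed and overlap-added, returns to 1.0
# wherever two segments overlap.
M = 8
w = hann(M)
out = overlap_add([w, w, w, w], M)
```

Only the first and last half segments, which are covered by a single window, deviate from the constant level.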

Left as is, however, the converted data contain many discontinuities such as the one shown near the center 101 of FIG. 10, as described above for the prior art, and these are perceived as noise during reproduction. As also noted earlier, these discontinuities arise because the line spectrum of the DC component is not taken into account. FIG. 18 is a waveform graph illustrating this schematically. More specifically, FIG. 18 is a schematic diagram for explaining the waveform discontinuity that occurs at a segment boundary after the inverse discrete Fourier transform when the left- and right-channel audio signals are discrete-Fourier-transformed and their DC components are ignored. In graph 180 of FIG. 18, the horizontal axis represents time; for example, the symbol $(M-2)^{(l)}$ denotes the $(M-2)$-th sample point of the $l$-th segment. The vertical axis is the output signal value at those sample points. As graph 180 shows, a discontinuity arises between the end of the $l$-th segment and the start of the $(l+1)$-th segment.

To solve the problem described with FIG. 18, the audio signal conversion apparatus according to the present invention is configured as follows. It comprises a conversion unit, a correlation signal extraction unit, an inverse conversion unit, and a removal unit. The conversion unit applies the discrete Fourier transform to the input audio signals of two channels. The correlation signal extraction unit extracts a correlation signal, ignoring the DC component, from the two-channel audio signals after the discrete Fourier transform by the conversion unit; in other words, it extracts the correlation signal of the two-channel input audio signals. The inverse conversion unit applies the inverse discrete Fourier transform to (a1) the correlation signal extracted by the correlation signal extraction unit, (a2) that correlation signal and the uncorrelated signals (the signals other than the correlation signal), (b1) an audio signal generated from that correlation signal, or (b2) an audio signal generated from that correlation signal and those uncorrelated signals. In the example here, discontinuities are removed from the audio signal after allocation to the virtual sound sources for the wavefront synthesis reproduction system, which is an example of the audio signal of (b2) above, but this is not a limitation. For example, discontinuities may be removed from the audio signal before allocation to the virtual sound sources, that is, from the extracted correlation signal or from the extracted correlation signal and uncorrelated signals (examples of (a1) or (a2) above), with the allocation performed afterward.

The removal unit then removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform by the inverse conversion unit. That is, for the correlation signal or the audio signal generated from it, the removal unit removes waveform discontinuities from the signal after the inverse discrete Fourier transform.
In the example of the audio signal processing unit 113 in FIG. 12, the conversion unit, correlation signal extraction unit, and inverse conversion unit are included in the audio signal separation and extraction unit 121, and the removal unit is exemplified by the noise removal unit 122.

This processing for solving the problem described with FIG. 18 is explained concretely with reference to FIG. 19. FIG. 19 is a schematic diagram for explaining an example of the discontinuity removal processing according to the present invention, namely a method of removing the waveform discontinuity that occurs at a segment boundary after the inverse discrete Fourier transform when the left- and right-channel audio signals are discrete-Fourier-transformed and their DC components are ignored.

In the discontinuity removal processing of the present invention, as graph 190 of FIG. 19 shows for the case of graph 180 of FIG. 18, the derivative of the waveform at the end of the $l$-th segment is made to match the derivative at the start of the $(l+1)$-th segment. Specifically, the noise removal unit 122 adds a DC component (bias) to the waveform of the $(l+1)$-th segment so that its first value maintains the slope defined by the last two points of the $l$-th segment. The processed output audio signal $y''_j(m)$ is therefore
$$y''_j(m) = y'_j(m) + B \tag{42}$$
where $B$ is a constant representing the bias. It is determined so that, after the previous output audio signal and the output audio signal of the current processing have been added in the output buffer, the waveform is continuous as in graph 190 of FIG. 19.
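A minimal sketch of how the bias B of equation (42) might be chosen is given below. It simplifies the situation to two consecutive, non-overlapping blocks of samples, whereas the text determines B after the overlap-add in the output buffer. The function names are hypothetical.

```python
def continuity_bias(prev_tail, first_new):
    """Bias B that makes the first new sample continue the slope of the last two old samples."""
    slope = prev_tail[-1] - prev_tail[-2]
    target = prev_tail[-1] + slope           # value the next sample should take
    return target - first_new

def apply_bias(segment, bias):
    """Equation (42): add the constant bias to every sample of the segment."""
    return [y + bias for y in segment]

# The previous block ends ..., 0.1, 0.2 (slope 0.1), and the new block starts at 0.5.
B = continuity_bias([0.0, 0.1, 0.2], 0.5)    # a negative bias of about -0.2
fixed = apply_bias([0.5, 0.6], B)            # now starts near 0.3
```

The sign of B depends only on whether the new segment starts above or below the extrapolated value.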

In this way, the noise removal unit 122 preferably removes discontinuities by adding a DC component to the audio signal after the inverse discrete Fourier transform (the correlation signal or an audio signal generated from it) so that the derivative of the waveform is maintained at the processing segment boundary. A negative bias is applied in this example, but a positive bias may of course be needed to match the derivatives.

Thus, according to the present invention, an audio signal for a multichannel system such as 2ch or 5.1ch can be converted into an audio signal for reproduction by the wavefront synthesis reproduction system without generating noise caused by discontinuities. This makes it possible to enjoy the characteristic benefit of the wavefront synthesis reproduction system: sound image localization as intended by the content creator for a listener at any position.

The audio signal after the inverse discrete Fourier transform that the noise removal unit 122 processes may also be, as illustrated by the equations above, an audio signal obtained by applying scaling processing in the time domain or the frequency domain to the correlation signal, or to the correlation signal and the uncorrelated signals. In other words, scaling may be applied to the correlation signal and uncorrelated signals, and discontinuities may then be removed from the scaled signals.

A more preferable example of the present invention is described with reference to FIGS. 20 and 21. FIG. 20 shows the waveforms obtained by applying the discontinuity removal processing of FIG. 19 to convert a piece of music content consisting of left- and right-channel audio signals into five channels. FIG. 21 shows the waveforms obtained by applying another discontinuity removal process according to the present invention to convert the same music content as in FIG. 20 into five channels. In other words, FIG. 21 illustrates a method of removing the waveform discontinuity that occurs at a segment boundary after the inverse discrete Fourier transform when the left- and right-channel audio signals are discrete-Fourier-transformed and their DC components are ignored.

With the discontinuity removal processing of FIG. 19 alone, the bias component can accumulate and the waveform amplitude can overflow. In the converted music content 200 illustrated in FIG. 20, among the five channel audio signals 201 to 205, considerable bias accumulation is seen in the audio signals 202 and 203 of the second and third channels from the top, and the audio signal 203 has overflowed.

Therefore, in the present invention, the bias component (DC component) to be added is preferably made to converge by reducing its amplitude over time, as in the following equation. Here, "reducing over time" means reducing in proportion to the time elapsed since the addition, for example the time elapsed from the start of each processing segment or from the start of the discontinuity.
$$y''_j(m) = y'_j(m) + B \times \frac{M - m\sigma}{M} \tag{43}$$
Here $\sigma$ is a parameter that adjusts the degree of reduction and is set to, for example, 0.5. For the reduction, both $B$ and $\sigma$ are taken to be positive. Furthermore, when the absolute value of the bias obtained for the addition reaches or exceeds a certain level, $\sigma$ may be dynamically increased or decreased according to that value; the change may take effect at the next processing segment. More generally, if $\sigma$, which corresponds to the proportionality constant of the reduction, is varied according to the absolute value of the bias (the amplitude of the DC component), a feedback function operates and a similar effect is obtained. These methods do not, however, guarantee that the amplitude of the audio waveform will not overflow.
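Equation (43) translates almost directly into code. The sketch below applies the linearly decaying bias to one segment; the function name and the list-based signal representation are illustrative choices.

```python
def apply_decaying_bias(segment, bias, sigma=0.5):
    """Equation (43): y''(m) = y'(m) + B * (M - m * sigma) / M, with M = len(segment)."""
    M = len(segment)
    return [y + bias * (M - m * sigma) / M for m, y in enumerate(segment)]

# With a silent 4-sample segment, B = 1 and sigma = 0.5, the added bias
# falls from 1.0 at m = 0 to 0.625 at m = 3.
decayed = apply_decaying_bias([0.0] * 4, 1.0, 0.5)
```

With σ = 0.5 the bias has shrunk to roughly half its initial value by the end of a long segment, so a residual offset still carries into the next boundary. This is consistent with the remark that these methods alone do not rule out overflow.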

Accordingly, a process that omits the bias term (the second term of equation (43)) when, for example, the bias value reaches or exceeds a certain (predetermined) value may be added as a safety valve. That is, the noise removal unit 122 preferably adds the DC component (performs discontinuity removal) only when the amplitude of the DC component obtained for the addition is less than a predetermined value. With this method, the output that appeared as the music content 200 of FIG. 20 becomes the music content 210 shown in FIG. 21, and the bias component no longer accumulates. In particular, no bias accumulation is seen in the audio signals 212 and 213 of the second and third channels from the top among the five channel audio signals 211 to 215 of the music content 210, which correspond to the audio signals 202 and 203 of the music content 200.
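The safety valve amounts to a simple gate on the bias amplitude. In the sketch below, `limit` stands for the predetermined value, whose actual magnitude the text does not specify.

```python
def gated_bias(bias, limit):
    """Return the bias only when its magnitude is below the limit; otherwise add nothing."""
    return bias if abs(bias) < limit else 0.0
```

The gated value can then be used as B in equation (42) or (43), so that a segment whose required bias is too large is passed through without correction.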

A further preferable example of the present invention is described with reference to FIGS. 22 and 23. FIG. 22 shows the waveforms obtained by applying the same discontinuity removal processing as in FIG. 21 to convert into five channels music content whose audio signal waveform changes rapidly, unlike the music content of FIGS. 20 and 21. FIG. 23 shows the waveforms obtained by applying yet another discontinuity removal process according to the present invention to convert the same music content as in FIG. 22 into five channels. In other words, FIG. 23 illustrates a method of removing the waveform discontinuity that occurs at a segment boundary after the inverse discrete Fourier transform when rapidly changing left- and right-channel audio signals are discrete-Fourier-transformed and their DC components are ignored.

When an audio signal is close to white noise, as in the consonant portions of speech, the waveform may change so rapidly that the original waveform is already nearly discontinuous. Applying the discontinuity removal processing of the present invention when converting such left- and right-channel audio signals into audio signals for the wavefront synthesis reproduction system can instead distort the waveform, because the processing tries to force continuity on a waveform that was nearly discontinuous to begin with. FIG. 22 is one example. In the converted music content 220 shown in FIG. 22, the distortion is especially large at the locations indicated by arrows in the first and fifth audio signals 221 and 225 of the five channel audio signals 221 to 225, and it is perceived as noise.

To eliminate this problem, the discontinuity removal processing in the audio signal conversion processing according to the present invention preferably adopts the following method. It exploits the fact that when a signal is close to white noise, as in the consonant portions of speech, the number of times the waveform of the input audio signal crosses 0 within a predetermined time (for example, within a processing segment or half of one) increases sharply compared with other portions. Where the 0 level is placed may be decided arbitrarily. The number of times the output audio signal (at least the audio signal after the inverse discrete Fourier transform) crosses 0 within half a segment length is therefore counted. If the count is equal to or greater than a certain value (a predetermined number), the next segment is regarded as a portion where crossings occur the predetermined number of times or more, and in the processing of that next segment the bias term, the second term on the right-hand side of equation (42) or (43), is not added. That is, discontinuity removal is performed only elsewhere. The count may be taken over an audio waveform of fixed duration regardless of segment boundaries, or over the audio waveform of several processed segments; in either case, the count determines whether the bias term is added in the next segment's processing.
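The zero-crossing test can be sketched as follows. The function names and the `max_crossings` threshold are illustrative; the text leaves the predetermined count and the choice of zero level open. Samples exactly at the zero level are treated here as not changing sign, which is one possible convention.

```python
def count_zero_crossings(samples, zero_level=0.0):
    """Count sign changes of the waveform relative to zero_level."""
    crossings = 0
    prev_sign = 0
    for y in samples:
        sign = (y > zero_level) - (y < zero_level)   # -1, 0 or +1
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                crossings += 1
            prev_sign = sign
    return crossings

def skip_bias_for_next_segment(half_segment, max_crossings):
    """True when the half segment looks noise-like, so the next segment gets no bias."""
    return count_zero_crossings(half_segment) >= max_crossings
```

A caller would evaluate `skip_bias_for_next_segment` on the most recent half segment of output and, when it returns True, set B to zero for the following segment.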

With this method, the locations indicated by arrows in the music content 220, particularly in the audio signals 221 and 225, become free of distortion and noise, as at the locations indicated by arrows in the audio signals 231 and 235 among the five channel audio signals 231 to 235 of the converted music content 230 shown in FIG. 23.

The effect of the more preferable discontinuity removal processing described with reference to FIG. 23 is now explained in comparison with FIG. 9. FIG. 24 shows the waveforms obtained by applying the discontinuity removal processing of FIG. 23 to convert the music content of FIG. 8 into five channels, and FIG. 25 is an enlarged view of part of the audio signal of one channel of the converted music content of FIG. 24. When the music content 80 shown in FIG. 8 is the input audio signal, the discontinuity removal processing (noise removal processing) described above converts it into the five channel audio signals 241 to 245 of the music content 240 shown in FIG. 24. In particular, at the location corresponding to FIG. 9 in the audio signal 243 of the third channel from the top, the discontinuity has been eliminated and the signal is continuous, as the audio signal 250 of FIG. 25 shows. Discontinuities are thus eliminated and the noise is removed. Although FIGS. 24 and 25 were described as the result of applying the preferable processing described with reference to FIG. 23, processing according to equation (42) or (43) likewise yields a continuous audio signal like the audio signal 250, with minor differences.

The audio signal conversion processing according to the present invention has been described above for a 2ch input audio signal. The following explains that it is also applicable to other multichannel audio signals. A 5.1ch input audio signal is taken as an example with reference to FIG. 26, but the processing applies equally to other multichannel input audio signals.

FIG. 26 is a schematic diagram for explaining an example of the positional relationship between the speaker group used and the virtual sound sources when a 5.1ch audio signal is reproduced by the wavefront synthesis reproduction system. Consider applying the audio signal conversion processing according to the present invention to 5.1ch input audio. 5.1ch speakers are generally arranged as in FIG. 2, with three speakers 21L, 22C, and 21R lined up in front of the listener. In content such as movies in particular, the front center channel is often used for purposes such as dialogue. Consequently, there are few passages in which the sound pressure is controlled so as to produce a synthesized sound image between the center and left channels or between the center and right channels.

Exploiting this property, as in the positional relationship 260 shown in FIG. 26, the input audio signals for the 5.1ch front left and right speakers 262a and 262c are converted by this method (the audio signal conversion processing according to the present invention) and assigned to, for example, five virtual sound sources 263a to 263e, after which the audio signal of the center channel (the channel for the center speaker) is added to the middle virtual sound source 263c. The output audio signals are then reproduced by the speaker array 261 using the wavefront synthesis reproduction system as sound images for the virtual sound sources. For the rear left and right channels, speakers 262d and 262e may be installed at the rear as in 5.1ch, and the input audio signals may be output from them unmodified.

In this way, on the premise that the multichannel input audio signal has three or more channels, the audio signal conversion processing described above may be applied to any two of the input audio signals to generate audio signals for reproduction by the wavefront synthesis reproduction system, and the input audio signals of the remaining channels may be added to the generated audio signals for output. This addition can be realized by, for example, providing an adder in the audio output signal generation unit 123.
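For the 5.1ch example, the addition step might look like the sketch below. It assumes an odd number of virtual sources so that a single middle source exists, and that the center-channel samples are already time-aligned with the converted front channels. The function name is hypothetical.

```python
def add_center_channel(virtual_outputs, center):
    """Add the center-channel samples to the middle virtual source.

    virtual_outputs : list of per-source sample lists (odd count assumed)
    center          : center-channel samples of the same length
    """
    mid = len(virtual_outputs) // 2
    mixed = [list(ch) for ch in virtual_outputs]      # leave the inputs untouched
    mixed[mid] = [v + c for v, c in zip(mixed[mid], center)]
    return mixed
```

The rear channels are not involved here, since the text outputs them unmodified from separate rear speakers.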

Next, implementations of the present invention will be briefly described. The present invention can be used in devices accompanied by video, such as a television. Various examples of devices to which the present invention can be applied will be described with reference to FIGS. 27 to 33. FIGS. 27 to 29 each show a configuration example of a television device provided with the audio data reproducing device of FIG. 11. FIGS. 30 and 31 each show a configuration example of a video projection system provided with the audio data reproducing device of FIG. 11. FIG. 32 shows a configuration example of a system consisting of a television device and a television stand provided with the audio data reproducing device of FIG. 11. FIG. 33 shows an example of an automobile provided with the audio data reproducing device of FIG. 11. In each of FIGS. 27 to 33, eight speakers LSP1 to LSP8 are arranged as the speaker array, but any number of speakers may be used as long as there are two or more.

The audio signal conversion device according to the present invention, and an audio data reproducing device provided with it, can be used in a television device. The placement of these devices within the television device may be chosen freely. As in the television device 270 shown in FIG. 27, a speaker group 272 in which the speakers LSP1 to LSP8 of the audio data reproducing device are arranged in a straight line may be provided below the television screen 271. As in the television device 280 shown in FIG. 28, a speaker group 282 in which the speakers LSP1 to LSP8 are arranged in a straight line may be provided above the television screen 281. As in the television device 290 shown in FIG. 29, a speaker group 292 in which transparent film-type speakers LSP1 to LSP8 are arranged in a straight line may be embedded in the television screen 291.

The audio signal conversion device according to the present invention, and an audio data reproducing device provided with it, can also be used in a video projection system. As in the video projection system 300 shown in FIG. 30, a speaker group 302 of speakers LSP1 to LSP8 may be embedded in the projection screen 301b onto which the video projection device 301a projects video. As in the video projection system shown in FIG. 31, a speaker group 312 in which speakers LSP1 to LSP8 are lined up may be placed behind a sound-transmitting screen 311b onto which the video projection device 311a projects video. The devices can also be embedded in a television stand (TV board). As in the system (home theater system) 320 shown in FIG. 32, a speaker group 322b in which speakers LSP1 to LSP8 are lined up may be embedded in the television stand 322a on which the television device 321 is mounted. Furthermore, the devices can be applied to car audio. As in the automobile 330 shown in FIG. 33, a speaker group 332 in which speakers LSP1 to LSP8 are arranged along a curve may be embedded in the dashboard of the vehicle.

When the audio signal conversion processing according to the present invention is applied to devices such as those described with reference to FIGS. 27 to 33, a switching unit may also be provided that switches whether or not this conversion processing (the processing in the audio signal processing unit 113 of FIGS. 11 and 12) is performed, in response to a user operation made by the listener with a button on the device body, a remote controller, or the like. When the conversion processing is not performed, 2ch audio data may be reproduced by the wavefront synthesis reproduction method with virtual sound sources arranged as shown in FIG. 6. Alternatively, as in the positional relationship 340 shown in FIG. 34, reproduction may use only the speakers 341L and 341R at both ends of the array speaker 341. Similarly, 5.1ch audio data may be assigned to three virtual sound sources, or may be reproduced using only the speakers at both ends and one or two speakers in the middle.

The wavefront synthesis reproduction method applicable to the present invention may be any method that, as described above, uses a speaker array (a plurality of speakers) and outputs sound from those speakers as sound images for virtual sound sources. Besides the WFS method described in Non-Patent Document 1, various methods are possible, such as one that uses the precedence effect (Haas effect), a phenomenon of human sound image perception. The precedence effect is the following: when the same sound is reproduced from a plurality of sound sources and the sounds arriving at the listener from the respective sources differ by a small time, the sound image is localized in the direction of the source whose sound arrives first. Using this effect, a sound image can be perceived at the virtual sound source position. However, it is difficult to make the sound image clearly perceptible with this effect alone. Humans also tend to perceive a sound image in the direction from which the sound pressure is felt to be highest. The audio data reproducing device can therefore combine the precedence effect with this perception of the direction of maximum sound pressure, which makes it possible for a sound image to be perceived in the direction of the virtual sound source even with a small number of speakers.
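One way to picture the combination of the precedence effect and maximum-sound-pressure perception is a per-speaker delay and gain derived from the distance to the virtual sound source. The following is only a sketch under stated assumptions: the geometric model (delay proportional to the extra distance, gain inversely proportional to distance), the function name `delays_and_gains`, and the constant `SPEED_OF_SOUND` are illustrative choices, not formulas taken from this document.

```python
import math
from typing import List, Tuple

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def delays_and_gains(source: Tuple[float, float],
                     speakers: List[Tuple[float, float]]
                     ) -> List[Tuple[float, float]]:
    """Return one (delay in seconds, gain) pair per speaker.

    Assumed model: the speaker nearest to the virtual source plays first
    (delay 0) and loudest (gain 1); farther speakers play later and quieter,
    so both cues point toward the virtual source.
    """
    distances = [math.hypot(x - source[0], y - source[1]) for x, y in speakers]
    nearest = min(distances)
    if nearest <= 0.0:
        raise ValueError("the virtual source must not coincide with a speaker")
    return [((d - nearest) / SPEED_OF_SOUND, nearest / d) for d in distances]
```

For a virtual source placed behind the center of a three-speaker line, the center speaker receives zero delay and unit gain, and the two outer speakers receive equal, later and quieter feeds.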

The above description assumed that the audio signal conversion device according to the present invention converts an audio signal for a multi-channel system into an audio signal for reproduction by the wavefront synthesis reproduction method. However, the present invention is equally applicable to, for example, conversion into another audio signal for a multi-channel system (the number of channels may be the same or different). The converted audio signal need only be an audio signal for reproduction by a speaker group consisting of a plurality of speakers, whatever their arrangement. This is because such a conversion may also apply the discrete Fourier transform and inverse transform described above and ignore the DC component in order to obtain the correlation signal. One conceivable way to reproduce an audio signal converted in this way is to assign one speaker to each signal extracted for a virtual sound source and output it in the ordinary manner rather than by the wavefront synthesis reproduction method. Various other reproduction methods are conceivable, such as assigning the uncorrelated signals on the two sides to separate speakers installed at the sides or rear.

Each component of the audio signal conversion device according to the present invention, such as the components of the audio signal processing unit 113 illustrated in FIG. 12, and each component of an audio data reproducing device provided with that device can be realized by hardware such as a microprocessor (or DSP: Digital Signal Processor), memory, a bus, interfaces, and peripheral devices, together with software executable on that hardware. Part or all of the hardware can be mounted as an integrated circuit (IC) chipset, in which case the software need only be stored in the memory. Alternatively, all of the components of the present invention may be implemented in hardware, and in that case too part or all of the hardware can be mounted as an integrated circuit (IC) chipset.

The object of the present invention is also achieved by supplying a recording medium, on which the program code of software that realizes the functions of the various configuration examples described above is recorded, to a device such as a general-purpose computer that serves as the audio signal conversion device, and having a microprocessor or DSP in that device execute the program code. In this case, the software program code itself realizes the functions of the configuration examples described above, and the present invention can be constituted by the program code itself, or by a recording medium (an external recording medium or an internal storage device) on which the program code is recorded, with the control side reading and executing the code. Examples of external recording media include optical discs such as CD-ROMs and DVD-ROMs and non-volatile semiconductor memories such as memory cards. Examples of internal storage devices include hard disks and semiconductor memories. The program code can also be downloaded from the Internet and executed, or received from a broadcast wave and executed.

The audio signal conversion device according to the present invention has been described above. As the flow charts illustrating the processing show, the present invention can also take the form of an audio signal conversion method for converting a multi-channel input audio signal into an audio signal for reproduction by a speaker group.

This audio signal conversion method has the following conversion step, extraction step, inverse conversion step, and removal step. In the conversion step, the conversion unit performs a discrete Fourier transform on the input audio signals of two channels. In the extraction step, the correlation signal extraction unit extracts a correlation signal, ignoring the DC component, from the audio signals of the two channels after the discrete Fourier transform in the conversion step. In the inverse conversion step, the inverse conversion unit performs an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, on the correlation signal and the uncorrelated signal, on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal. In the removal step, the removal unit removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step. Other application examples are the same as those described for the audio signal conversion device, and their description is omitted.
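The idea of ignoring the DC component when relating the two channels can be illustrated with a small sketch. It does not reproduce the extraction formula of the correlation signal extraction unit; it only computes a normalized correlation coefficient between the two channels over all DFT bins except bin 0. The names `dft` and `correlation_ignoring_dc` and the threshold `eps` are assumptions made for the example, and the naive O(N²) transform is used only to keep the code self-contained.

```python
import cmath
import math
from typing import List, Sequence


def dft(samples: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of one processing segment."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples))
            for k in range(n)]


def correlation_ignoring_dc(left: Sequence[float], right: Sequence[float],
                            eps: float = 1e-12) -> float:
    """Normalized correlation of two channels with the DC bin (k = 0) skipped.

    Assumed illustration only: a constant offset in either channel does not
    change the result, because it affects bin 0 alone.
    """
    if len(left) != len(right):
        raise ValueError("both channels must have the same segment length")
    spec_l = dft(left)[1:]   # drop bin 0, the DC component
    spec_r = dft(right)[1:]
    cross = sum((a * b.conjugate()).real for a, b in zip(spec_l, spec_r))
    power_l = sum(abs(a) ** 2 for a in spec_l)
    power_r = sum(abs(b) ** 2 for b in spec_r)
    if power_l < eps or power_r < eps:
        return 0.0  # one channel has no non-DC content in this segment
    return cross / math.sqrt(power_l * power_r)
```

With this definition, a channel and a copy of it shifted by a constant offset are treated as fully correlated, which is the practical effect of leaving the DC component out of the comparison.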

In other words, the program code itself is a program for causing a computer to execute this audio signal conversion method. That is, the program causes a computer to execute: a conversion step of performing a discrete Fourier transform on the input audio signals of two channels; an extraction step of extracting a correlation signal, ignoring the DC component, from the audio signals of the two channels after the discrete Fourier transform in the conversion step; an inverse conversion step of performing an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, on the correlation signal and the uncorrelated signal, on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal; and a removal step of removing waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step.
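The removal step can be sketched along the lines of the dependent claims, which add a DC component so that the derivative of the waveform is maintained at the boundary between processing segments and reduce that component in proportion to elapsed time. The code below is a minimal sketch under assumptions: the bias is obtained by linearly extrapolating the last two samples of the previous segment, and the names `remove_discontinuity` and `decay_per_sample` are hypothetical.

```python
from typing import List, Sequence


def remove_discontinuity(prev_segment: Sequence[float],
                         segment: Sequence[float],
                         decay_per_sample: float) -> List[float]:
    """Add a decaying DC bias to `segment` so it joins `prev_segment` smoothly.

    Assumed bias rule: the first sample of the new segment should continue
    the slope of the last two samples of the previous segment.
    """
    if len(prev_segment) < 2 or not segment:
        raise ValueError("need two previous samples and a non-empty segment")
    if decay_per_sample < 0.0:
        raise ValueError("decay_per_sample must be non-negative")
    # Value the waveform would reach if the previous slope simply continued.
    predicted = prev_segment[-1] + (prev_segment[-1] - prev_segment[-2])
    bias = predicted - segment[0]
    sign = 1.0 if bias >= 0.0 else -1.0
    corrected = []
    for t, sample in enumerate(segment):
        # Bias magnitude shrinks linearly with elapsed samples, never below 0.
        remaining = max(abs(bias) - decay_per_sample * t, 0.0)
        corrected.append(sample + sign * remaining)
    return corrected
```

For example, with a previous segment ending in 0.0, 1.0 and a new segment of constant 5.0, a decay of 1.0 per sample yields 2.0, 3.0, 4.0, 5.0: the slope at the boundary is preserved and the added bias has vanished after three samples.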

DESCRIPTION OF SYMBOLS: 110 ... audio data reproducing device, 111 ... decoder, 112 ... audio signal extraction unit, 113 ... audio signal processing unit, 114 ... D/A converter, 115 ... amplifier, 116 ... speaker, 121 ... audio signal separation and extraction unit, 122 ... noise removal unit, 123 ... audio output signal generation unit.

Claims

An audio signal conversion device for converting a multi-channel input audio signal into an audio signal for reproduction by a speaker group, comprising:
a conversion unit that performs a discrete Fourier transform on the input audio signals of two channels;
a correlation signal extraction unit that extracts a correlation signal, ignoring the DC component, from the audio signals of the two channels after the discrete Fourier transform by the conversion unit;
an inverse conversion unit that performs an inverse discrete Fourier transform on the correlation signal extracted by the correlation signal extraction unit, on the correlation signal and the uncorrelated signal, on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal; and
a removal unit that removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform by the inverse conversion unit.

The audio signal conversion device according to claim 1, wherein the removal unit removes the discontinuities by adding a DC component to the audio signal after the inverse discrete Fourier transform so that the derivative of the waveform is maintained at the boundary between processing segments.

The audio signal conversion device according to claim 2, wherein the removal unit reduces the amplitude of the added DC component in proportion to the time elapsed since the addition.

The audio signal conversion device according to claim 3, wherein the removal unit changes the proportionality constant of the reduction according to the amplitude of the DC component obtained for the addition.

The audio signal conversion device according to claim 4, wherein the removal unit adds the DC component except at locations where the waveform of the audio signal after the inverse discrete Fourier transform crosses zero at least a predetermined number of times within a predetermined time.

The audio signal conversion device according to any one of claims 2 to 5, wherein the removal unit adds the DC component only when the amplitude of the DC component obtained for the addition is less than a predetermined value.

The audio signal conversion device according to any one of claims 1 to 3, wherein the removal unit removes the discontinuities except at locations where the waveform of the audio signal after the inverse discrete Fourier transform crosses zero at least a predetermined number of times within a predetermined time.

The audio signal conversion device according to any one of claims 1 to 7, wherein the audio signal after the inverse discrete Fourier transform that is processed by the removal unit is an audio signal obtained by applying scaling processing in the time domain or the frequency domain to the correlation signal, or to the correlation signal and the uncorrelated signal.

The audio signal conversion device according to claim 1, wherein the multi-channel input audio signal is an input audio signal of three or more channels, wherein the conversion unit, the correlation signal extraction unit, the inverse conversion unit, and the removal unit process any two of the multi-channel input audio signals to generate an audio signal, with the discontinuities removed, for reproduction by the speaker group,
and wherein the device further comprises an adder that adds the input audio signals of the remaining channels to the generated audio signal.

The audio signal conversion device according to claim 1, further comprising: a digital content input unit that inputs digital content including the multi-channel input audio signal; a decoder unit that decodes the digital content; an audio signal extraction unit that separates the audio signal from the digital content decoded by the decoder unit; and an audio signal processing unit that converts the audio signal extracted by the audio signal extraction unit into a multi-channel audio signal that has three or more channels and differs from the input audio signal, wherein the audio signal processing unit includes the conversion unit, the correlation signal extraction unit, the inverse conversion unit, and the removal unit.

The audio signal conversion device, wherein the digital content input unit inputs digital content from a recording medium storing digital content, a server that distributes digital content via a network, or a broadcasting station that broadcasts digital content.

The audio signal conversion device according to any one of claims 1 to 11, further comprising a switching unit that switches, according to a user operation, whether the processing in the audio signal processing unit is executed.

An audio signal conversion method for converting a multi-channel input audio signal into an audio signal for reproduction by a speaker group, comprising:
a conversion step in which a conversion unit performs a discrete Fourier transform on the input audio signals of two channels;
an extraction step in which a correlation signal extraction unit extracts a correlation signal, ignoring the DC component, from the audio signals of the two channels after the discrete Fourier transform in the conversion step;
an inverse conversion step in which an inverse conversion unit performs an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, on the correlation signal and the uncorrelated signal, on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal; and
a removal step in which a removal unit removes waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step.

A program for causing a computer to execute:
a conversion step of performing a discrete Fourier transform on the input audio signals of two channels;
an extraction step of extracting a correlation signal, ignoring the DC component, from the audio signals of the two channels after the discrete Fourier transform in the conversion step;
an inverse conversion step of performing an inverse discrete Fourier transform on the correlation signal extracted in the extraction step, on the correlation signal and the uncorrelated signal, on an audio signal generated from the correlation signal, or on an audio signal generated from the correlation signal and the uncorrelated signal; and
a removal step of removing waveform discontinuities from the audio signal after the inverse discrete Fourier transform in the inverse conversion step.

A computer-readable recording medium on which the program according to claim 14 is recorded.