JP5754595B2

JP5754595B2 - Transaural system

Info

Publication number: JP5754595B2
Application number: JP2011254913A
Authority: JP
Inventors: 阪内 澄宇; 丹羽 健太; 羽田 陽一; 岡本 拓磨; 岩谷 幸雄; 鈴木 陽一
Original assignee: Tohoku University NUC; Nippon Telegraph and Telephone Corp
Current assignee: Tohoku University NUC; Nippon Telegraph and Telephone Corp
Priority date: 2011-11-22
Filing date: 2011-11-22
Publication date: 2015-07-29
Anticipated expiration: 2031-11-22
Also published as: JP2013110633A

Description

The present invention relates to a transaural system that can accommodate changes in the listener's posture during sound reproduction in communication and broadcasting.

A transaural system is a system that uses loudspeakers in the reproduction sound field to generate, at both ears of the listener, signals equivalent to those that would reach the listener's ears in the original sound field. Because a transaural system lets the listener hear auditory information from a virtual world superimposed on the ambient sound of the real world, it is regarded as promising in the field of virtual reality.

Humans have a sound localization ability that lets them perceive the direction and distance of a sound. This ability is acquired through experience by estimating, from the sounds at the left and right ears, the transfer functions from the sound source to each ear (head-related transfer functions). When head-related transfer functions are convolved with a monaural source and the results are presented to the two ears, a sound image is perceived at the position the head-related transfer functions represent. In a transaural system, sound from each speaker reaches both ears, so the crosstalk components must be removed by also taking into account the transfer functions from the speakers to both ears.
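The convolution step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the toy impulse responses (time-domain counterparts of head-related transfer functions) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from typing import List, Tuple


def convolve(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Direct-form linear convolution; output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def binauralize(mono: List[float],
                hrir_left: List[float],
                hrir_right: List[float]) -> Tuple[List[float], List[float]]:
    """Convolve one monaural source with a left and a right impulse response."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy data (hypothetical): the right ear hears the source one sample later and quieter.
mono = [1.0, 0.5]
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.5, 0.0]
left, right = binauralize(mono, hrir_left, hrir_right)
```

The delay and attenuation between the two outputs are the kind of interaural cues that let a listener localize the source.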

A transaural system that handles changes in head position is disclosed in Non-Patent Document 1. FIG. 6 shows the configuration of the transaural system 900 of Non-Patent Document 1. The transaural system 900 comprises a pair of speakers 91 and 92, a sensor 93, a head position detection unit 94, a binaural processing unit 95, a crosstalk processing unit 96, a D/A converter 97, and an amplifier 98.

The sensor 93 is worn on the head of a listener positioned in front of the pair of speakers 91 and 92 and detects changes in the position of the listener's head. The sensor 93 is, for example, a magnetic sensor that outputs position change information with three degrees of freedom.

The head position detection unit 94 derives, from the position change information output by the sensor 93, head position information describing how the head has moved from the position directly in front of the pair of speakers 91 and 92. The binaural processing unit 95 takes the audio signal (a digital signal) and the head position information as inputs and synthesizes a binaural signal from the audio signal. A binaural signal is a sound signal that is perceived three-dimensionally when heard through headphones; here it is a two-channel (left and right) signal synthesized by convolving the audio signal with the head-related transfer function from the right speaker to the listener's right ear and with the head-related transfer function from the left speaker to the listener's left ear. The head-related transfer functions are the transfer functions between the speakers 91 and 92 and the left and right ear positions of a mannequin standing in for the listener, measured with the pair of speakers and the mannequin placed in an anechoic chamber in the positional relationship shown in FIG. 6. In other words, the head-related transfer functions are the transfer functions, in the absence of reverberation, between the right speaker and the listener's right ear and between the left speaker and the listener's left ear.

Accordingly, the binaural processing unit 95 stores a plurality of head-related transfer functions indexed by head position information, selects the one corresponding to the current head position information, and convolves it with the audio signal, thereby synthesizing a binaural signal that makes the sound from the speakers 91 and 92 seem as if it were heard through headphones.

The crosstalk processing unit 96 removes spatial crosstalk, in which sound intended for only one ear also reaches the other ear. Spatial crosstalk degrades the binaural synthesis. The crosstalk processing unit 96 therefore cancels the crosstalk arising from the transfer functions G_91L(ω) and G_91R(ω) from the speaker 91 to the listener's left and right ears and the transfer functions G_92L(ω) and G_92R(ω) from the speaker 92 to the listener's left and right ears, all of which vary with the head position information. The structure of the filter H_i(ω) that cancels the crosstalk is given by equation (1), where i denotes the speaker number. All transfer functions are complex-valued when expressed in the frequency domain.

Here, H_R(ω) and H_L(ω) are the head-related transfer functions.
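Equation (1) itself does not survive in this text. As a hedged sketch only, and not a reproduction of the patent's equation, one standard way to write a two-speaker crosstalk canceller with the symbols defined above is given below. It assumes (an assumption made here) that speaker i is driven by the source filtered with H_i(ω), and that the goal is for the left-ear and right-ear signals to equal the source filtered with H_L(ω) and H_R(ω), respectively.

```latex
% Requirement: the two speaker paths must sum to the target response at each ear.
\begin{pmatrix} G_{91L}(\omega) & G_{92L}(\omega) \\ G_{91R}(\omega) & G_{92R}(\omega) \end{pmatrix}
\begin{pmatrix} H_{91}(\omega) \\ H_{92}(\omega) \end{pmatrix}
=
\begin{pmatrix} H_{L}(\omega) \\ H_{R}(\omega) \end{pmatrix}

% Solving the 2x2 system, with D = G_{91L} G_{92R} - G_{92L} G_{91R}:
H_{91}(\omega) = \frac{G_{92R}(\omega)\,H_{L}(\omega) - G_{92L}(\omega)\,H_{R}(\omega)}{D(\omega)},
\qquad
H_{92}(\omega) = \frac{G_{91L}(\omega)\,H_{R}(\omega) - G_{91R}(\omega)\,H_{L}(\omega)}{D(\omega)}
```

In this formulation the filters grow large wherever D(ω) approaches zero, which is one way to see why weak or notched speaker-to-ear paths make a crosstalk canceller ill-conditioned.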

The binaural signal with the crosstalk components canceled is converted to an analog signal by the D/A converter 97, amplified by the amplifier 98, and then supplied to the pair of speakers 91 and 92.

Non-Patent Document 1: Kenichiro Yamamoto, Takeshi Naemura, Hiroshi Harashima, "Sound image localization using a three-dimensional sensor and transaural processing," Transactions of the Virtual Reality Society of Japan, Vol. 5, No. 3, pp. 981-987, 2000.

However, the conventional method has a first problem: the listener must wear the sensor 93 so that the head position can be detected, which is cumbersome. In addition, when the head moves substantially, for example rotating 60 degrees or more toward the speaker 91 relative to the front center of the pair of speakers 91 and 92, the entrance of the listener's ear canal (hereinafter, the ear hole) falls in the shadow of the listener's pinna as seen from the speaker 92, and the level of the signal from the speaker 92 drops. At the same time, the number of zeros at high frequencies increases sharply, so the filter characteristics of the crosstalk canceller become unstable; this is the second problem.

The present invention was made in view of these problems. Its object is to provide a transaural system that requires no sensor worn by the listener and that, even when the listener moves substantially, lets the listener naturally hear a sound image with an absolute position unaffected by head movement, just as in the real world.

The transaural system of the present invention comprises three or more speakers, an imaging unit, a face posture analysis unit, a speaker selection unit, a binaural processing unit, a crosstalk processing unit, and a speaker drive unit. The three or more speakers are placed equidistant from the center of the listener's head with their sound-emitting sides facing the listener, and are used two at a time as a pair. The imaging unit captures an image of the listener's face and outputs face image information. The face posture analysis unit analyzes the posture of the listener's face from the face image information output by the imaging unit and outputs face posture information. The speaker selection unit takes the face posture information as input and outputs speaker selection information that selects, according to the face posture information, one pair of adjacent speakers from the three or more speakers. The binaural processing unit takes the audio signal, the face posture information, and the speaker selection information as inputs, convolves the audio signal with the head-related transfer function from the right speaker of the selected pair to the listener's right ear and with the head-related transfer function from the left speaker to the left ear, and outputs right-channel and left-channel binaural signals. The crosstalk processing unit takes the binaural signals, the face posture information, and the speaker selection information as inputs and, using the transfer functions from the selected pair of speakers to the listener's right ear and left ear, generates two-channel (left and right) speaker drive signals by removing the spatial crosstalk components from the two-channel binaural signals. The speaker drive unit takes the speaker selection information and the speaker drive signals as inputs and outputs the speaker drive signals to the pair of speakers indicated by the speaker selection information.

With the transaural system of the present invention, the posture of the listener's face is obtained by analyzing the face image captured by the imaging unit, so no sensor worn on the listener's body is needed, unlike the prior art. Moreover, according to the posture of the listener's face, a pair of speakers facing the listener is selected from the three or more speakers, that is, a pair for which the listener's ear holes are not in the shadow of the pinnae. The system can therefore handle large changes in the posture of the listener's face.

FIG. 1 is a diagram showing an example functional configuration of the transaural system 100 of the present invention.
FIG. 2 is a diagram showing an example functional configuration of the face posture analysis unit 20.
FIG. 3 is a diagram showing an example of the speaker selection information of the speaker selection unit 30.
FIG. 4 is a diagram showing a specific example of the speaker drive unit 40.
FIG. 5 is a diagram showing an example functional configuration of the transaural system 200 of the present invention.
FIG. 6 is a diagram showing the functional configuration of the conventional transaural system 900.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example functional configuration of the transaural system 100 of the present invention. The transaural system 100 comprises three or more speakers 1 to 6, an imaging unit 10, a face posture analysis unit 20, a speaker selection unit 30, a binaural processing unit 95, a crosstalk processing unit 96, a D/A converter 97, an amplifier 98, and a speaker drive unit 40. As their reference numerals indicate, the binaural processing unit 95, the crosstalk processing unit 96, the D/A converter 97, and the amplifier 98 are the same as those of the conventional transaural system 900. The arrangement of the speakers 1 to 6 in FIG. 1 is shown in plan view, centered on the listener's head.

The three or more speakers 1 to 6 are placed equidistant from the center of the listener's head with their sound-emitting sides facing the listener. In this example, six speakers 1 to 6 are placed at equal distances from the listener at 60-degree intervals of central angle. In stereo reproduction, the basic rule is to place the two speakers and the listener at the vertices of an equilateral triangle (see, for example, the reference "Introduction to Audio to Enjoy Ultimate Sound", Seibido Shuppan, 1998, p. 139). In this embodiment, the six speakers 1 to 6 are arranged so as to surround the listener. The speakers 1 to 6 need not lie in the same plane vertically; it suffices that their positions match those used when the transfer characteristics between the speakers and the listener were measured. The distance between the speakers and the listener depends on the sound pressure level the speakers output and is roughly in the range of 50 cm to 5 m.

The minimum speaker configuration for the transaural system 100 consists of three speakers: the two speakers 1 and 6 placed in front of the listener, drawn with solid lines in FIG. 1, and, for example, the speaker 2 placed to the listener's left. The third speaker may instead be the speaker 5 to the listener's right. With fewer speakers, the range of listener posture changes that can be handled becomes narrower.

The imaging unit 10 captures an image of the listener's face and outputs face image information. The imaging unit 10 is, for example, a digital camera that captures face images 16 times per second (16 Hz) and outputs them. The imaging unit 10 may also be a digital video camera, or a 3D stereo camera may be used.

The face posture analysis unit 20 analyzes the posture of the listener's face from the face image information and outputs face posture information. Many techniques for analyzing face posture have been studied, and any of them may be used in this embodiment. For example, a set of images of the face turned in different directions may be captured in advance, and the face posture information output by determining how well the current image matches each of them. FIG. 2 shows an example functional configuration of the face posture analysis unit 20 based on this approach. The face posture analysis unit 20 comprises a face posture determination unit 21 and a face posture data storage unit 22. The face posture data storage unit 22 stores in advance images of the listener's face at predetermined angular steps over one full horizontal rotation. The face posture determination unit 21 compares the current face image information input from the imaging unit 10 with the stored images, whose face directions are known, and outputs the face direction of the closest stored image as the face posture information. Alternatively, instead of comparing image data, the face direction may be computed directly by pattern recognition on the face image data, as is done in facial part detection and smile detection. The face posture information is given, for example, as an angle representing the direction the listener's face is pointing.
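The stored-image comparison performed by the face posture determination unit 21 could look roughly like the sketch below. The patent does not name a similarity measure; the sum of squared pixel differences used here, the function names, and the tiny 2x2 "images" are all assumptions made purely for illustration.

```python
from typing import Dict, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def squared_difference(a: Image, b: Image) -> float:
    """Sum of squared pixel differences between two equally sized images."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def estimate_face_angle(current: Image, stored: Dict[int, Image]) -> int:
    """Return the angle (degrees) of the stored image closest to the current one."""
    return min(stored, key=lambda angle: squared_difference(current, stored[angle]))


# Hypothetical stored templates keyed by known face direction in degrees.
templates = {
    0: [[1.0, 1.0], [0.0, 0.0]],
    60: [[0.0, 1.0], [0.0, 1.0]],
    -60: [[1.0, 0.0], [1.0, 0.0]],
}
current_image = [[0.1, 0.9], [0.0, 0.8]]
angle = estimate_face_angle(current_image, templates)
```

A practical system would normalize for lighting and face position before comparing, which this sketch omits.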

The speaker selection unit 30 takes the face posture information as input and outputs speaker selection information that selects, according to the face posture information, one pair of adjacent speakers from the three or more speakers. FIG. 3 shows an example of the speaker selection information. The angle α in the left-hand column represents the direction of the listener's face. The letters R and L in the speaker columns to the right indicate the channel of the binaural signal.

The angle α is defined as α = 0° when the listener faces the midpoint between the speakers 1 and 6, positive (+) clockwise and negative (−) counterclockwise. When α = 0°, the left (L) channel binaural signal is output from the speaker 1 and the right (R) channel binaural signal from the speaker 6. The speaker selection unit 30 is an encoder whose input is the face posture information (the angle α).

Suppose the listener's face rotates +60 degrees horizontally (α = 60°). The listener's left ear hole is then hidden in the shadow of the left pinna as seen from the speaker 1, and the filtering in the crosstalk processing unit 96 would become unstable. In that case, output from the speaker 1 is cut off (×), and speakers are selected so that the left (L) channel binaural signal is output from the speaker 6 and the right (R) channel binaural signal from the speaker 5. As a result, neither of the listener's ear holes is hidden by a pinna as seen from the speakers 6 and 5, and the filtering in the crosstalk processing unit 96 operates stably.

If the listener's face rotates a further +60 degrees horizontally, to α = +120°, speakers are selected so that the left (L) channel binaural signal is output from the speaker 5 and the right (R) channel binaural signal from the speaker 4. Because a pair of adjacent speakers is selected from the three or more speakers 1 to 6 according to the face posture information, a sound image with an absolute position unaffected by the listener's movement can be presented naturally even when the listener turns the face through a large angle.

As FIG. 3 shows, the speaker pair is the same for 0 and −360 degrees, +60 and −300 degrees, +120 and −240 degrees, +180 and −180 degrees, +240 and −120 degrees, and +300 and −60 degrees. With six speakers placed at 60-degree intervals of central angle around the listener, a natural sound image can thus be provided even if the listener's face turns through a full 360 degrees. If, for example, only the three speakers 6, 1, and 2 are used, a natural sound image can be provided over the range of +60 to −60 degrees, in which the listener's left and right ear holes are not in the shadow of the pinnae as seen from the speakers.
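The angle-to-pair mapping for the six-speaker layout can be sketched as follows. The pairs for α = 0°, +60°, +120°, and −60° (equivalently +300°) come from the text; the pairs for the remaining sectors are extrapolated here by continuing the same adjacent-pair pattern, and the rule for angles between multiples of 60° (truncation toward zero) is likewise an assumption of this sketch. The function name `select_speakers` is hypothetical.

```python
from typing import Tuple


def select_speakers(alpha_deg: float) -> Tuple[int, int]:
    """Return (left_channel_speaker, right_channel_speaker) for face angle alpha.

    Speakers are numbered 1 to 6; alpha is positive clockwise, in [-360, 360].
    """
    if not -360.0 <= alpha_deg <= 360.0:
        raise ValueError("alpha_deg must lie in [-360, 360]")
    sector = int(alpha_deg / 60.0)       # int() truncates toward zero (assumed rule)
    left = (-sector) % 6 + 1             # sector 0 -> 1, sector 1 -> 6, sector 2 -> 5
    right = (-sector - 1) % 6 + 1        # sector 0 -> 6, sector 1 -> 5, sector 2 -> 4
    return left, right


for alpha in (0, 60, 120, -60, 300):
    print(alpha, select_speakers(alpha))
```

Turning clockwise steps the pair through the speaker numbers in descending order, so the selected pair always stays in front of the listener's face.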

The binaural processing unit 95 takes the audio signal, the face posture information, and the speaker selection information as inputs, convolves the audio signal with the head-related transfer function from the right speaker of the selected pair to the listener's right ear and with the head-related transfer function from the left speaker to the left ear, and outputs right-channel and left-channel binaural signals. The binaural processing unit 95 stores a plurality of head-related transfer functions corresponding to face posture information, for example one for each angle stored in the face posture data storage unit 22. If the face posture information is finer than the angles stored in the binaural processing unit 95, for example in 1-degree steps, the head-related transfer function may be obtained by interpolation using an existing interpolation technique. When no head-related transfer function with a matching angle is available, a weighted average of the head-related transfer functions for the nearest stored angles on either side is computed and used. The same interpolation approach can be applied to the transfer functions in the crosstalk processing unit 96.
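The weighted average of the two nearest stored responses might be computed as in this sketch. The text does not specify the weights; weighting linearly by angular distance, operating on time-domain impulse responses, and the name `interpolate_hrir` are assumptions made here. Wrap-around at ±180° is not handled.

```python
from typing import Dict, List


def interpolate_hrir(angle: float, table: Dict[float, List[float]]) -> List[float]:
    """Linearly blend the stored responses at the nearest angles on either side."""
    if angle in table:
        return list(table[angle])        # exact match: no interpolation needed
    below = [a for a in table if a < angle]
    above = [a for a in table if a > angle]
    if not below or not above:
        raise ValueError("angle lies outside the stored range")
    lo, hi = max(below), min(above)
    weight_hi = (angle - lo) / (hi - lo)  # closer stored angle gets the larger weight
    return [(1.0 - weight_hi) * x_lo + weight_hi * x_hi
            for x_lo, x_hi in zip(table[lo], table[hi])]


# Hypothetical two-tap responses stored at 0 and 10 degrees.
stored = {0.0: [1.0, 0.0], 10.0: [0.0, 1.0]}
blended = interpolate_hrir(2.5, stored)
```

Blending impulse responses sample by sample is the simplest choice; interpolating magnitude and delay separately is a common refinement that this sketch does not attempt.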

The crosstalk processing unit 96 takes the binaural signals, the face posture information, and the speaker selection information as inputs and generates two-channel (left and right) speaker drive signals by removing the spatial crosstalk components from the two-channel binaural signals. The removal is a filtering process that uses the transfer functions from each speaker of the selected pair to the listener's right ear and left ear. These transfer functions, for each of the two speakers corresponding to the speaker selection information, are stored in the crosstalk processing unit 96 in advance. The filtering process that removes the spatial crosstalk components is itself the same as in the conventional transaural system 900.
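One common way to realize such a filter, shown here only as a hedged sketch and not as the patent's method, is to invert the 2x2 matrix of speaker-to-ear transfer functions at each frequency. The function below handles a single frequency bin; the argument names and the singularity threshold `eps` are hypothetical.

```python
from typing import Tuple


def cancel_crosstalk_bin(b_left: complex, b_right: complex,
                         g_ll: complex, g_rl: complex,
                         g_lr: complex, g_rr: complex,
                         eps: float = 1e-9) -> Tuple[complex, complex]:
    """Solve for speaker signals so each ear receives only its binaural channel.

    g_xy is the transfer function from speaker x to ear y at this frequency
    (for example, g_rl is right speaker to left ear).
    """
    det = g_ll * g_rr - g_rl * g_lr
    if abs(det) < eps:
        raise ZeroDivisionError("transfer matrix is nearly singular at this bin")
    s_left = (g_rr * b_left - g_rl * b_right) / det
    s_right = (g_ll * b_right - g_lr * b_left) / det
    return s_left, s_right


# Toy check: direct paths of 1.0, cross paths of 0.5, target is sound at the left ear only.
s_l, s_r = cancel_crosstalk_bin(1.0, 0.0, g_ll=1.0, g_rl=0.5, g_lr=0.5, g_rr=1.0)
ear_left = 1.0 * s_l + 0.5 * s_r    # what actually arrives at the left ear
ear_right = 0.5 * s_l + 1.0 * s_r   # what actually arrives at the right ear
```

The near-singular case corresponds to a speaker-to-ear path that is weak or notched, which is the situation the speaker selection is meant to avoid.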

The speaker drive unit 40 takes the speaker selection information and the speaker drive signals as inputs and outputs the speaker drive signals to the pair of speakers indicated by the speaker selection information. In other words, the speaker drive unit 40 routes the two-channel (left and right) binaural signals to the speakers 1 to 6 as shown in FIG. 3.

FIG. 4 shows a more specific example configuration of the speaker drive unit 40. The speaker drive unit 40 consists of relay pairs 41 to 46 connected to the speakers 1 to 6, respectively. The relay pair 41 comprises relays 41_R and 41_L: one end of relay 41_R receives the right (R) channel binaural signal amplified by the amplifier 98, one end of relay 41_L receives the amplified left (L) channel binaural signal, and the other ends of both relays are connected to the speaker 1. The relay pairs 42 to 46 are likewise connected to the speakers 2 to 6, respectively, and have the same configuration as the relay pair 41.

The 0 (−360) output terminal of the speaker selection unit 30 is connected to the control terminals of the relays 41_L and 46_R. This terminal outputs a selection signal that is "1" (logic level 1) when the angle α, the face direction represented by the face posture information output by the face posture analysis unit 20, is in the range −60° < α < 60°. When the selection signal at the 0 (−360) output terminal becomes "1", the relays 41_L and 46_R conduct, so the left (L) channel binaural signal is supplied to the speaker 1 and the right (R) channel binaural signal to the speaker 6. The supply of binaural signals to the other speakers 2 to 5 is cut off.

The 300 (−60) output terminal of the speaker selection unit 30 is connected to the control terminals of the relays 41_R and 42_L. This terminal is "1" when the angle α is in the range −120° < α ≤ −60°, so the right (R) channel binaural signal is supplied to the speaker 1 and the left (L) channel binaural signal to the speaker 2.
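The relay switching described above can be modeled as a lookup from the selected pair to the conduction state of each relay. This is an illustrative software model only; the string keys such as "41_L" mirror the reference numerals in the text, and the function name `relay_states` is hypothetical.

```python
from typing import Dict


def relay_states(left_speaker: int, right_speaker: int) -> Dict[str, bool]:
    """Return which of the twelve relays conduct for the selected speaker pair.

    Relay "4n_L" feeds the left channel to speaker n; "4n_R" feeds the right channel.
    """
    states: Dict[str, bool] = {}
    for speaker in range(1, 7):
        states[f"4{speaker}_L"] = (speaker == left_speaker)
        states[f"4{speaker}_R"] = (speaker == right_speaker)
    return states


front = relay_states(left_speaker=1, right_speaker=6)        # 0 (-360) terminal active
turned_left = relay_states(left_speaker=2, right_speaker=1)  # 300 (-60) terminal active
conducting = sorted(name for name, on in front.items() if on)
```

Exactly two relays conduct for any selection, so the four unselected speakers receive no signal.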

Table 1 shows the relationship between the range of the angle α and the numbers of the speakers to which the right (R) channel and left (L) channel binaural signals are supplied.

As Table 1 shows, selecting the speakers that receive the binaural signals according to the face posture information (the angle α) keeps the listener's ear holes out of the shadow of the pinnae even if the listener's face turns through a full 360 degrees, so a natural sound image can be provided.

The transaural system 100 has two features: first, it obtains the face posture information from the image of the listener's face captured by the imaging unit 10; second, it selects two speakers from the three or more speakers based on that face posture information. A transaural system according to the present invention may also be configured with only the second feature. A transaural system 200 with such a configuration is described next.

FIG. 5 shows an example functional configuration of the transaural system 200 of the present invention. The transaural system 200 differs from the transaural system 100 in that it has an input unit 50 in place of the imaging unit 10 and the face posture analysis unit 20. The other functional units can be realized on essentially the same principles.

The input unit 50 outputs externally supplied head position information, namely the listener's face direction or the relative positions of the listener's ears and the speakers, to the binaural processing unit 95, the crosstalk processing unit 96, and the speaker selection unit 30. This head position information is a signal with the same meaning as the face posture information described above.

For example, a user of the transaural system 200 may judge the listener's face direction by eye and manually enter the listener's head position information relative to the speakers into the input unit 50. The speaker selection unit 30 is configured to select a pair of speakers based on that head position information, and the binaural processing unit 95 and the crosstalk processing unit 96 are likewise configured to select the head-related transfer functions and transfer functions based on it. As with the transaural system 100, a sound image with an absolute position unaffected by the listener's movement can then be presented naturally even when the listener turns the face through a large angle.

Although the audio signal input to the binaural processing unit 95 has been described as a digital signal, it may instead be an analog signal. In that case, the binaural processing and the crosstalk processing are implemented by a plurality of analog filters corresponding to the face posture information. When the audio signal is supplied as an analog signal, the D/A converter 97 is unnecessary. Likewise, when the output level of the binaural signal is sufficiently high, the amplifier 98 is unnecessary. The D/A converter 97 and the amplifier 98 are therefore not features that characterize the present invention.
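
For the digital case, the removal of spatial crosstalk can be sketched for a single frequency bin as the inversion of a 2x2 matrix of speaker-to-ear transfer functions. This is the common textbook formulation and is offered only as an illustration; it is not necessarily the filter design used in the embodiments. The names `crosstalk_cancel_bin`, `ear_signals`, and the `h_xy` arguments (transfer function from speaker x to ear y) are hypothetical.

```python
def ear_signals(s_left, s_right, h_ll, h_lr, h_rl, h_rr):
    """Signals arriving at the ears when the speakers emit s_left and s_right.

    h_xy is the (complex) transfer function from speaker x to ear y at one
    frequency; h_lr and h_rl are the crosstalk paths.
    """
    e_left = h_ll * s_left + h_rl * s_right
    e_right = h_lr * s_left + h_rr * s_right
    return e_left, e_right


def crosstalk_cancel_bin(b_left, b_right, h_ll, h_lr, h_rl, h_rr):
    """Speaker drive signals that reproduce the binaural pair at the ears.

    Solves the 2x2 system  [h_ll h_rl; h_lr h_rr] [s_l; s_r] = [b_l; b_r]
    in closed form (assumed approach, no regularisation).
    """
    det = h_ll * h_rr - h_rl * h_lr
    if abs(det) < 1e-12:
        raise ZeroDivisionError("transfer matrix is singular at this frequency")
    s_left = (h_rr * b_left - h_rl * b_right) / det
    s_right = (h_ll * b_right - h_lr * b_left) / det
    return s_left, s_right


if __name__ == "__main__":
    # Illustrative values only; real transfer functions would be measured.
    h = dict(h_ll=1.0 + 0.0j, h_lr=0.3 + 0.0j, h_rl=0.5j, h_rr=1.0 + 0.0j)
    s_l, s_r = crosstalk_cancel_bin(1.0 + 1.0j, 2.0 + 0.0j, **h)
    print(ear_signals(s_l, s_r, **h))  # close to the binaural inputs
```

A practical design would repeat this per frequency bin and usually add regularisation where the matrix is nearly singular. In the analog case the same inverse response would be realized by the analog filters mentioned above.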

Although the description above used a functional configuration in which the speaker selection unit 30, which selects a set of speakers based on the face posture information, is provided independently, the function of the speaker selection unit 30 may instead be built into each of the binaural processing unit 95, the crosstalk processing unit 96, and the speaker driving unit 40. In that case, the speaker selection unit 30 is unnecessary. The transaural system of the present invention is thus not limited to the configurations of the embodiments described above.

The transaural systems 100 and 200 may be implemented by loading a predetermined program into a computer comprising, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

In that case, the program describing the processing can be recorded on any computer-readable recording medium. Examples of computer-readable recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories. More specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be implemented by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

A transaural system comprising:
six or more speakers, used in sets of two, arranged with their sound-emitting sides facing the listener at positions equidistant from the center of the listener's head and at equal central angles about that center;
an imaging unit that captures a face image of the listener and outputs face image information;
a face posture analysis unit that analyzes the posture of the listener's face from the face image information and outputs face posture information;
a speaker selection unit that receives the face posture information as input and outputs speaker selection information for selecting, from among the six or more speakers, a set of adjacent speakers corresponding to the face posture information;
a binaural processing unit that receives an audio signal, the face posture information, and the speaker selection information as inputs, convolves the audio signal with the head-related transfer function from the right speaker of the set of speakers indicated by the speaker selection information to the listener's right ear and with the head-related transfer function from the left speaker to the listener's left ear, and outputs right-channel and left-channel binaural signals;
a crosstalk processing unit that receives the binaural signals, the face posture information, and the speaker selection information as inputs and, using the transfer functions from the set of speakers indicated by the speaker selection information to the listener's right ear and to the listener's left ear, generates left and right two-channel speaker drive signals by removing spatial crosstalk components from the left and right two-channel binaural signals; and
a speaker driving unit that receives the speaker selection information and the speaker drive signals as inputs and outputs the speaker drive signals to the set of speakers indicated by the speaker selection information.

A transaural system comprising:
six or more speakers, used in sets of two, arranged with their sound-emitting sides facing the listener at positions equidistant from the center of the listener's head and at equal central angles about that center;
an input unit for inputting head position information representing the listener's face direction or the relative direction between the listener's ears and the speakers;
a speaker selection unit that receives the head position information as input and outputs speaker selection information for selecting, from among the six or more speakers, a set of adjacent speakers corresponding to the head position information;
a binaural processing unit that receives an audio signal, the head position information, and the speaker selection information as inputs, convolves the audio signal with the head-related transfer function from the right speaker of the set of speakers indicated by the speaker selection information to the listener's right ear and with the head-related transfer function from the left speaker to the listener's left ear, and outputs right-channel and left-channel binaural signals;
a crosstalk processing unit that receives the binaural signals, the head position information, and the speaker selection information as inputs and, using the transfer functions from the set of speakers indicated by the speaker selection information to the listener's right ear and to the listener's left ear, generates left and right two-channel speaker drive signals by removing spatial crosstalk components from the left and right two-channel binaural signals; and
a speaker driving unit that receives the speaker selection information and the speaker drive signals as inputs and outputs the speaker drive signals to the set of speakers indicated by the speaker selection information.