JP2011234139A - Three-dimensional audio signal generating device

Info

Publication number: JP2011234139A
Application number: JP2010102783A
Authority: JP
Inventors: Yasuaki Ohashi; 靖明大橋
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2010-04-28
Filing date: 2010-04-28
Publication date: 2011-11-17

Abstract

PROBLEM TO BE SOLVED: Even when a two-dimensional video signal is converted into a three-dimensional one, the audio remains unchanged, so the conventional audio signal lacks the impact viewers need to enjoy the realism and presence inherent in a three-dimensional moving image. The aim is to provide a three-dimensional audio signal, linked to position and depth, that enhances the sense of presence according to the position and depth of an object (particularly a person) in the video signal. SOLUTION: Depth information for the screen is obtained from the information used in converting a two-dimensional image into a three-dimensional image (extraction of motion vectors and high-frequency components). Pixels are shifted on the basis of this information to form the three-dimensional image, so object position information and depth information can be extracted. Using this object position information and depth information, the direction, phase, and volume of the sound are controlled, thereby generating a three-dimensional audio signal from a two-dimensional audio signal.

Description

The present invention relates to a device that extracts the position and depth information of a person in a video, either from information obtained in the process of generating 3D video from 2D video or from 3D video information, and that generates a three-dimensional audio signal from a two-dimensional audio signal on the basis of that depth information.

In recent years, with advances in video technology, 3D video devices that display content produced as 3D video, as well as programs produced as 2D video, as realistic 3D video are expected to become widespread. 3D video is based on binocular parallax: a right-eye image and a left-eye image are created separately. A 3D video device displays the right-eye and left-eye images alternately, and the viewer enjoys 3D video by watching, through 3D video viewing glasses, the images synchronized with the switching timing.

As a method for converting a 2D video signal into a 3D video signal, Patent Document 1, for example, discloses the following technique. For each of a plurality of parallax calculation regions set within each field of the 2D video signal, an image feature quantity related to the perspective of the video is extracted. On the basis of this image feature quantity, parallax information is generated for each unit region, and right-eye and left-eye image signals having a horizontal phase difference corresponding to the parallax information of the given unit region are generated and displayed as a 3D video signal.

Patent Document 2 discloses an information recording device that records audio information, image information, and the like together with added information on the positions of sound sources and subjects, and that uses this information to give the viewer audio with a sense of reality and three-dimensionality.

Patent Document 1: JP-A-10-51812
Patent Document 2: JP 2004-32726 A

FIG. 12 is a block diagram showing the overall configuration of the 2D/3D video conversion device described in Patent Document 1. From the input 2D video signal, the device calculates image feature quantities related to the perspective of the video (high-frequency component integrated value, luminance contrast, luminance integrated value, and saturation integrated value), each in its own circuit, and extracts parallax information from the calculated feature quantities. From the extracted parallax information it creates a right-eye image signal and a left-eye image signal, converts them into a 3D video signal, and outputs it to the screen. However, even though the 2D video signal becomes visually three-dimensional, the audio remains as before, so the result lacks the impact the viewer needs to experience the realism, presence, and power inherent in a 3D moving image.

FIG. 13 is a block diagram showing the configuration of the information recording device described in Patent Document 2. This device measures the position of each object with sensors or the like during shooting and records the sound with multiple microphones. Adding position information to the audio and image information imposes a workload, and the use of expensive equipment such as multiple microphones raises the cost. Furthermore, when a 2D moving image signal is converted into a 3D moving image signal, a 3D audio signal cannot be provided, because the original signal carries no position information.

The present invention has been made to solve these problems and has the following configuration.

A three-dimensional audio signal generation device according to the present invention comprises: a signal separation unit that separates an input moving image signal into a video signal and an audio signal; an object orientation/depth information extraction unit that extracts object orientation information and depth information from the video signal; and an object orientation addition/gain adjustment unit that, on the basis of the audio signal, the object orientation information, and the depth information, adds orientation information to the audio signal or adjusts its gain.

In the three-dimensional audio signal generation device according to the present invention, when the input video signal is a 2D video signal, the object orientation/depth information extraction unit is configured as a combination of at least one of a contour detection unit and a motion vector detection unit with at least one of a high-frequency component detection unit and a luminance component detection unit.

In the three-dimensional audio signal generation device according to the present invention, when the input video signal is a 3D video signal, the object orientation/depth information extraction unit comprises a contour detection unit and a pixel parallax detection unit.

In the three-dimensional audio signal generation device according to the present invention, the contour detection unit detects the contour of a person by correlation with known human contours and, when the contour has an area equal to or larger than a threshold, treats that person as the main subject of the screen.

In the three-dimensional audio signal generation device according to the present invention, when the contour detection unit detects the contours of a plurality of persons, it sets the depth information so that a person with a larger contour area is placed nearer to the viewer.

A three-dimensional audio signal generation device according to the present invention comprises: a signal separation unit that separates an input moving image signal into a video signal and an audio signal; a 2D/3D video detection unit that detects whether the video signal is a 2D video signal or a 3D video signal; a first object orientation/depth information extraction unit that extracts object orientation information and depth information when the video signal is a 2D video signal; a second object orientation/depth information extraction unit that extracts object orientation information and depth information when the video signal is a 3D video signal; and an object orientation addition/gain adjustment unit that, on the basis of the audio signal and the object orientation information and depth information extracted by the first or second object orientation/depth information extraction unit, adds orientation information to the audio signal or adjusts its gain.

The first object orientation/depth information extraction unit of the three-dimensional audio signal generation device according to the present invention is configured as a combination of at least one of a contour detection unit and a motion vector detection unit with at least one of a high-frequency component detection unit and a luminance component detection unit.

The second object orientation/depth information extraction unit of the three-dimensional audio signal generation device according to the present invention comprises a contour detection unit and a pixel parallax detection unit.

The contour detection unit of the three-dimensional audio signal generation device according to the present invention detects the contour of a person by correlation with known human contours and, when the contour has an area equal to or larger than a threshold, treats that person as the main subject of the screen.

When the contour detection unit of the three-dimensional audio signal generation device according to the present invention detects the contours of a plurality of persons, it sets the depth information so that a person with a larger contour area is placed nearer to the viewer.

The object orientation addition/gain adjustment unit of the three-dimensional audio signal generation device according to the present invention refers to a spatial transfer function database and can provide a 3D audio signal suited to the user's viewing setup.

The object orientation addition/gain adjustment unit of the three-dimensional audio signal generation device according to the present invention generates the 3D audio signal on the premise that it is listened to through speakers attached to the frame of 3D video viewing glasses.

The object orientation addition/gain adjustment unit of the three-dimensional audio signal generation device according to the present invention generates the 3D audio signal on the premise that it is listened to through inner speakers attached to the frame of 3D video viewing glasses, or through headphones.

The object orientation addition/gain adjustment unit of the three-dimensional audio signal generation device according to the present invention generates the 3D audio signal on the premise that it is listened to through surround speakers.

According to the present invention, the 2D video signal and 2D audio signal (stereo) of ordinary stereo content such as digital broadcasts and games can be converted into a 3D video signal and a 3D audio signal, so that video and audio signals with powerful realism and presence can be provided to the viewer.

When the input video signal is already a 3D video signal, extracting its binocular parallax information yields the approximate position of each object in the image and the depth information for each object, so the direction, phase, and volume of the input audio signal can be controlled to match the video signal.

FIG. 1 is a functional block diagram showing the main configuration of the three-dimensional audio signal generation device according to Embodiment 1.
FIG. 2 is a functional block diagram showing the main configuration of the object orientation/depth information extraction unit that extracts object orientation information and depth information from a 2D video signal according to Embodiment 1.
FIG. 3 is a diagram showing the parallax calculation regions obtained by dividing the image signal of one frame into m regions horizontally and n regions vertically according to Embodiment 1.
FIG. 4 is a diagram showing 3D video viewing glasses with speakers for watching a 3D video signal according to Embodiment 1.
FIG. 5 is a diagram showing an example of recording in advance, with a dummy head, spatial transfer functions that include the head-related transfer function according to Embodiment 1.
FIG. 6 is a functional block diagram showing the main configuration of the three-dimensional audio signal generation device according to Embodiment 2.
FIG. 7 is a functional block diagram showing the main configuration of the object orientation/depth information extraction unit that extracts object orientation information and depth information from a 3D video signal according to Embodiment 2.
FIG. 8 is a diagram showing the method of extracting depth information according to Embodiment 2.
FIG. 9 is a functional block diagram showing the main configuration of the three-dimensional audio signal generation device according to Embodiment 3.
FIG. 10 is an explanatory diagram of outputting the three-dimensional audio signal to surround speakers according to Embodiment 4.
FIG. 11 is an explanatory diagram of the contour detection unit detecting the contour of a person's face on the parallax calculation regions according to Embodiment 4.
FIG. 12 is a block diagram showing the overall configuration of the 2D/3D video conversion device of a conventional display device.
FIG. 13 is a block diagram showing the configuration of a conventional information recording device for recording three-dimensional sound.

The present invention is described in detail below with reference to the drawings showing its embodiments.

FIG. 1 is a functional block diagram showing the main configuration of the three-dimensional audio signal generation device according to Embodiment 1 of the present invention. The device can be built into equipment that handles video and audio signals, such as a television, a recorder/player, or a personal computer (PC), or it can be used as an external unit.

The 2D moving image signal S_00 of FIG. 1, a mixed signal of video and audio, is input to the video/audio processing unit 100, where it is separated into a video signal and an audio signal; a 3D video signal S_21 and a 3D audio signal S_22 are then generated from the 2D video signal and the 2D audio signal and output. The video/audio processing unit 100 comprises a signal separation unit 101, an object orientation/depth information extraction unit 102, an object orientation addition/gain adjustment unit 103, a 3D video generation unit 104, and a synchronization adjustment unit 105. The signal separation unit 101 decodes the moving image signal S_00 and separates it into a video signal S_11 and an audio signal S_12. The object orientation/depth information extraction unit 102 extracts object orientation information and depth information from the video signal S_11 separated by the signal separation unit 101. The 3D video generation unit 104 generates a 3D video signal on the basis of the object orientation information and depth information extracted by the object orientation/depth information extraction unit 102. The object orientation addition/gain adjustment unit 103, on the basis of the same object orientation information and depth information, adds the object orientation information to the audio signal S_12 separated by the signal separation unit 101, or adjusts its gain. The synchronization adjustment unit 105 synchronizes the video signal output from the 3D video generation unit 104 with the audio signal output from the object orientation addition/gain adjustment unit 103, and the video signal S_21 and the audio signal S_22 are output as the output signals of the video/audio processing unit 100.

FIG. 2 is a functional block diagram showing the main configuration of the object orientation/depth information extraction unit 102 shown in FIG. 1. The unit comprises a contour detection unit 110, a motion vector detection unit 111, a high-frequency component detection unit 112, and a luminance component detection unit 113. The luminance component detection unit 113 includes at least one of a luminance value integration means and a luminance contrast calculation means. First, consider the 2D video signal S_11 decoded and separated by the signal separation unit 101. This signal is input to the object orientation/depth information extraction unit 102, where, as shown in FIG. 2, the pieces of information needed to convert the 2D video signal into a 3D video signal are detected. Object orientation information is obtained from the contour information detected by the contour detection unit 110 and the motion vector information detected by the motion vector detection unit 111. Depth information may be derived from any one of the following: the contour information from the contour detection unit 110, the motion vector information from the motion vector detection unit 111, the output of the high-frequency component detection unit 112, or the output of the luminance component detection unit 113. Alternatively, to improve accuracy, several of these may be processed in any combination to extract the depth information. Although high-frequency and luminance components are described here for extracting depth information, a saturation component may be used instead of them or in combination with them.

In general, the subject of a video is in the foreground, the background is behind it, and the focus is usually on the subject. For this reason, the closer an object is to the camera, the higher its high-frequency components, contrast, luminance, and saturation are considered to be.

FIG. 3 shows the parallax calculation regions obtained by dividing the image signal of one frame into m regions horizontally and n regions vertically. The regions are labeled so that the top-left region is A_{1,1} and the bottom-right region is A_{n,m}. The device of the present invention assigns object orientation information and depth information to each of the divided regions. Alternatively, regions in which the high-frequency component value is relatively high, or regions that are occupied to a certain extent (not necessarily 100%) by pixels within a set luminance-difference threshold range, may be regarded as one group, and the division may be made in units of such groups.
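The grid division described above can be sketched as follows. This is a minimal illustration, assuming the frame is available as a list of pixel rows; the function name `split_into_regions` and the dictionary keyed by 1-based (row, column) pairs are hypothetical choices made to mirror the A_{1,1} ... A_{n,m} labeling, not part of the specification.

```python
def split_into_regions(frame, n_rows, m_cols):
    """Split a 2-D list of pixel values into n_rows x m_cols rectangular regions.

    Returns a dict keyed by 1-based (row, col), so (1, 1) is the top-left
    region and (n_rows, m_cols) is the bottom-right region.
    """
    height, width = len(frame), len(frame[0])
    regions = {}
    for r in range(n_rows):
        top, bottom = r * height // n_rows, (r + 1) * height // n_rows
        for c in range(m_cols):
            left, right = c * width // m_cols, (c + 1) * width // m_cols
            regions[(r + 1, c + 1)] = [row[left:right] for row in frame[top:bottom]]
    return regions


# Toy 4x6 "image" split into n = 2 rows and m = 3 columns of regions.
frame = [[x + 10 * y for x in range(6)] for y in range(4)]
regions = split_into_regions(frame, n_rows=2, m_cols=3)
```

Integer division keeps the region boundaries aligned to whole pixels even when the frame size is not a multiple of m or n.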

Using the contour detection unit 110 and the motion vector detection unit 111 shown in FIG. 2 and checking the correlation with known human contours, the orientation information of a person in the video signal can be obtained. If the contour of a person detected by the contour detection unit 110 has an area equal to or larger than a threshold, that person is regarded as the main person on the screen, and the region in which the contour was detected is recognized as the image nearest to the viewer. Accordingly, taking the television screen as the center of the depth range, with the viewer side as the front and the opposite direction as the back, the depth information is set to the front. When the contours of a plurality of persons are detected, the larger the area of a person's contour, the nearer to the front the depth information is set. The extent of the front and back ranges can be chosen freely.
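The area rule above can be illustrated with a short sketch. It assumes the contour areas have already been measured; the input format (a dict from a person identifier to an area in pixels), the function name, and the threshold value are all hypothetical.

```python
def assign_person_depths(contour_areas, main_threshold):
    """Rank detected persons by contour area: larger area means nearer the viewer.

    contour_areas: dict mapping a person id to a contour area in pixels
                   (hypothetical input format).
    Returns (main_ids, depth_rank); depth_rank[person_id] == 0 is the nearest.
    """
    ordered = sorted(contour_areas, key=lambda pid: contour_areas[pid], reverse=True)
    depth_rank = {pid: rank for rank, pid in enumerate(ordered)}
    # Persons whose contour area reaches the threshold are treated as main subjects.
    main_ids = [pid for pid in ordered if contour_areas[pid] >= main_threshold]
    return main_ids, depth_rank


main_ids, depth_rank = assign_person_depths(
    {"a": 1200, "b": 5400, "c": 300}, main_threshold=5000
)
```

How a rank is mapped onto an actual front/back distance is left open here, in line with the text's remark that the front and back ranges can be chosen freely.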

When the motion vector detection unit 111 shown in FIG. 2 is used, an object can be regarded as present in any region where a motion vector is detected. Because an object with larger motion is considered to be nearer the front, depth information is set so that regions showing objects with large motion vectors are toward the front and regions showing objects with small motion are toward the back.
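One simple way to turn this motion rule into numbers is to normalize the motion vector magnitude per region. The sketch below assumes one (dx, dy) vector per region; the normalization to the range 0 to 1 and the name `nearness_from_motion` are illustrative assumptions, since the text only states the ordering.

```python
import math


def nearness_from_motion(motion_vectors):
    """Map per-region motion vectors (dx, dy) to a nearness score in [0, 1].

    1.0 marks the region with the largest motion (treated as nearest);
    regions without motion get 0.0.
    """
    magnitudes = {key: math.hypot(dx, dy) for key, (dx, dy) in motion_vectors.items()}
    largest = max(magnitudes.values(), default=0.0)
    if largest == 0.0:
        return {key: 0.0 for key in magnitudes}
    return {key: mag / largest for key, mag in magnitudes.items()}


scores = nearness_from_motion({(1, 1): (0, 0), (1, 2): (3, 4), (2, 1): (6, 8)})
```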

The high-frequency component detection unit 112 shown in FIG. 2 integrates the high-frequency components (at or above a threshold) of each region divided as in FIG. 3 and, from the result, detects regions rich in high-frequency components. Images are generally focused on the subject, so the luminance difference between adjacent pixels is naturally large in the subject, and this can be identified as a high-frequency component. Regions with relatively high high-frequency component values are placed toward the front.
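A rough sketch of the per-region integration follows. It uses the absolute luminance difference between horizontally adjacent pixels as a stand-in for the high-frequency component, which is an assumption; the actual filter used by the detection unit is not given in the text, and the threshold value below is arbitrary.

```python
def high_frequency_sum(region, threshold):
    """Sum absolute luminance differences between horizontally adjacent pixels,
    counting only differences at or above `threshold`."""
    total = 0
    for row in region:
        for left, right in zip(row, row[1:]):
            diff = abs(right - left)
            if diff >= threshold:
                total += diff
    return total


sharp = [[0, 200, 0, 200], [200, 0, 200, 0]]            # in-focus, high detail
blurred = [[100, 104, 101, 103], [102, 100, 103, 101]]  # out-of-focus, smooth

sharp_score = high_frequency_sum(sharp, threshold=10)
blurred_score = high_frequency_sum(blurred, threshold=10)
```

Under the rule in the text, the region with the larger score (`sharp` here) would be placed toward the front.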

Like the high-frequency component detection unit 112, the luminance component detection unit 113 integrates luminance values or calculates luminance contrast for each region divided as in FIG. 3. Here, depth information is determined from the luminance values of the regions. Since a region with a higher luminance value is considered to be nearer the front, depth information is set so that regions with high luminance values are toward the front and regions with low luminance values are toward the back. The luminance contrast value may be used instead of the luminance value, or the two may be summed.
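The luminance rule can be sketched in the same style. The contrast measure used here (maximum minus minimum luminance within the region) is one simple choice among several and is an assumption, as are the function names.

```python
def luminance_features(region):
    """Return (integrated luminance, contrast) for a region of luminance values.

    Contrast is taken as max minus min, an illustrative definition.
    """
    pixels = [value for row in region for value in row]
    return sum(pixels), max(pixels) - min(pixels)


def order_front_to_back(regions, use_contrast=False):
    """Return region keys sorted from nearest (highest score) to farthest."""
    index = 1 if use_contrast else 0
    return sorted(
        regions,
        key=lambda key: luminance_features(regions[key])[index],
        reverse=True,
    )


regions = {
    (1, 1): [[10, 20], [30, 40]],
    (1, 2): [[200, 210], [220, 230]],
    (2, 1): [[0, 255], [0, 0]],
}
by_luminance = order_front_to_back(regions)
by_contrast = order_front_to_back(regions, use_contrast=True)
```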

Next, consider the audio signal S_12 separated from the 2D moving image signal S_00, which is a mixed signal of video and audio. Broadcast audio is in most cases a two-channel (left/right) stereo signal. In this embodiment, orientation information is convolved into this two-channel audio signal by the following method.

First, the two-channel audio signal S_12 is divided in the frequency domain. In this embodiment it is divided into three bands: a low-frequency band from 0 to 300 Hz, a voice band from 0.3 to 3.4 kHz, and a high-frequency band above 3.4 kHz. Because humans are generally insensitive to the direction of low-frequency sound, orientation information is added only to the voice band and the high-frequency band.
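The three-band split can be illustrated with a frequency-domain mask. The sketch below uses a naive O(N^2) discrete Fourier transform from the standard library purely for readability; the text does not say how the device performs the split, and a practical implementation would more likely use an FFT or a filter bank. The band edges come from the text; the function name and the choice to place exactly 300 Hz and 3.4 kHz in the voice band are assumptions.

```python
import cmath
import math

BAND_EDGES_HZ = (300.0, 3400.0)  # low | voice | high, band edges from the text


def split_bands(samples, sample_rate):
    """Split a real signal into low, voice and high bands with a naive DFT mask.

    O(N^2), for illustration only.
    """
    n = len(samples)
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    masked = {"low": [0j] * n, "voice": [0j] * n, "high": [0j] * n}
    for k, value in enumerate(spectrum):
        freq = min(k, n - k) * sample_rate / n  # fold negative-frequency bins
        if freq < BAND_EDGES_HZ[0]:
            name = "low"
        elif freq <= BAND_EDGES_HZ[1]:
            name = "voice"
        else:
            name = "high"
        masked[name][k] = value

    def inverse(spec):
        # Inverse DFT; the imaginary part is numerical noise for a symmetric mask.
        return [
            sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)
        ]

    return {name: inverse(spec) for name, spec in masked.items()}


# One channel with a 100 Hz, a 1 kHz and a 3.8 kHz tone, sampled at 8 kHz.
rate, length = 8000, 80
channel = [
    sum(math.sin(2 * math.pi * f * t / rate) for f in (100, 1000, 3800))
    for t in range(length)
]
bands = split_bands(channel, rate)
# Only bands["voice"] and bands["high"] would receive orientation information.
```

Because the three masks partition the spectrum, the three band signals add back up to the original channel.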

Next, an example of processing for the voice-band signal is described. When the contour detection unit 110 in FIG. 2 detects the contour of a person, it is likely that the person is speaking. To confirm the presence of speech, however, the kurtosis values of the voice-band signals x_L(t) and x_R(t) are used to determine whether speech is present.

Kurtosis is a dimensionless quantity that expresses how peaked or flat an amplitude distribution is relative to a Gaussian distribution. It generally shows the following tendencies.
・Speech signals take relatively high values.
・Signals close to Gaussian, such as music signals, take values close to 0.
・Sub-Gaussian signals, such as uniformly distributed noise, take negative values.

The kurtosis values K(x_L(t)) and K(x_R(t)) for the audio signals x_L(t) and x_R(t) are as follows.

A threshold is then set for the kurtosis values K(x_L(t)) and K(x_R(t)), and the presence or absence of speech is determined.
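The kurtosis test can be sketched as follows. The code uses the standard excess kurtosis, K(x) = E[(x − μ)^4] / (E[(x − μ)^2])^2 − 3, which matches the tendencies listed above (about 0 for Gaussian signals, negative for uniform noise); whether the specification uses exactly this form is an assumption. The threshold value and the function names are hypothetical, since the text gives no numeric threshold.

```python
def excess_kurtosis(samples):
    """Excess kurtosis: fourth central moment / variance^2 - 3 (0 for Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    second = sum(c ** 2 for c in centered) / n
    if second == 0.0:
        return 0.0  # constant (silent) frame: treated as non-speech here
    fourth = sum(c ** 4 for c in centered) / n
    return fourth / second ** 2 - 3.0


def contains_speech(samples, speech_threshold=1.0):
    """Declare speech when the kurtosis exceeds a (hypothetical) threshold."""
    return excess_kurtosis(samples) > speech_threshold


# A sparse, spiky frame (speech-like) versus evenly spread amplitudes (uniform-like).
spiky = [0.0] * 98 + [10.0, -10.0]
uniform_like = [i / 50.0 - 1.0 for i in range(101)]

spiky_is_speech = contains_speech(spiky)
uniform_is_speech = contains_speech(uniform_like)
```

In a two-channel setting, the same test would be applied to x_L(t) and x_R(t) separately.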

When the contour detection unit 110 detects the contours of several persons, it is also possible to recognize the face regions, check whether the mouth movements are synchronized with the audio signal, and thereby identify which person is speaking. Furthermore, when a contour allows male and female to be distinguished, the frequency spectrum of the audio signal can be analyzed to determine whether the voice is male or female and to associate it with the corresponding person.

Next, the addition of orientation information to the audio signal is described. 3D video is produced on the basis of binocular parallax and is enjoyed by viewing the right-eye and left-eye images through 3D video viewing glasses that, by polarization or active shutters, present each eye only with its own image.

FIG. 4 shows 3D video viewing glasses with speakers. FIG. 4(a) shows an example in which speakers are attached to both sides of the frame of the glasses. FIG. 4(b) shows an example in which inner speakers are attached to both sides of the frame of the glasses.

To listen to 3D audio through the speakers attached to the frame of the 3D video viewing glasses in FIG. 4(a), an inverse filter based on a spatial transfer function measured in advance must be convolved with the ordinary stereo signal, so that the generated audio signal is heard as coming from outside the head rather than inside it. When the inner speakers of FIG. 4(b) or headphones are used, the audio signal is heard inside the head, so only the head-related transfer function needs to be considered.

FIG. 5 shows an example in which a spatial transfer function, including the head-related transfer function, is recorded in advance using a dummy head. Alternatively, when a simulation such as the image method is used (a method that assumes a room of a given size and generates artificial reverberation by computation; see, for example, J. B. Allen and D. A. Berkley, "Image Method for Efficiently Simulating Small-Room Acoustics", J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943-950, 1979), a pseudo impulse response (spatial transfer function) is obtained by convolving the head-related transfer function. Using one of these methods, a spatial transfer function database is created in advance for each of the parallax calculation regions into which the frame is divided in FIG. 3. The database is held in the object orientation addition/gain adjustment unit 103 of FIG. 1, which uses the orientation information obtained from the object orientation/depth information extraction unit 102 to select the most suitable spatial transfer function from the database.

For example, let the spatial transfer functions selected for the orientation of the contour detected by the contour detection unit 110 of FIG. 2 be A_L(k, n, m) and A_R(k, n, m). Here, L and R denote transmission to the left and right ears, respectively; k is the frequency index of the discrete spectrum; m is the index, in the spatial transfer function database, of the horizontal arrival direction shown in FIG. 5; and n is the index of the vertical arrival direction. These m and n correspond to the m and n of the parallax calculation region A_{n,m} in FIG. 3. A spatial transfer function is calculated for each parallax calculation region and stored in the spatial transfer function database.

For the signals X_L(k) and X_R(k), obtained by transforming the voice-band signals x_L(t) and x_R(t) into the frequency domain, the following outputs are obtained (y_L(t) and y_R(t) once transformed back to the time domain).
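The output expressions are not reproduced in this text. One plausible reading, stated here as an assumption, is a per-bin product Y_L(k) = A_L(k, n, m) X_L(k) and Y_R(k) = A_R(k, n, m) X_R(k), followed by an inverse transform. The sketch below illustrates that reading with a naive DFT; the dictionary layout of the database and all function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Naive inverse DFT; only the real part is kept."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def spatialize(x_left, x_right, database, n, m):
    """Apply the transfer functions stored for region (n, m) to both channels.

    `database` maps (n, m) to a pair (A_L, A_R), each a list holding one
    complex gain per frequency bin k (assumed layout).
    """
    a_left, a_right = database[(n, m)]
    y_left = idft_real([a * v for a, v in zip(a_left, dft(x_left))])
    y_right = idft_real([a * v for a, v in zip(a_right, dft(x_right))])
    return y_left, y_right
```

A per-bin product corresponds to circular convolution in the time domain, so a practical implementation would process zero-padded, overlapping frames; that detail is omitted here.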

Next, gain adjustment is described. Since low-frequency sound generally tends to have high gain, gain adjustment of the low-frequency band is not considered; attention is paid to the signals in the voice frequency band and the high-frequency band.

For a signal in the voice frequency band, the device checks how much of the divided regions of FIG. 3 the detected person's contour occupies and changes the weighting coefficient according to the size of that area. The larger the area occupied by the contour, the nearer to the front of the screen the person is judged to be, so the weighting is set so that the sound becomes louder as the occupied area grows. For a signal in the high-frequency band, an object detected by the high-frequency component detection unit 112 and the luminance component detection unit 113 of FIG. 2 is regarded as the source of the high-frequency sound, and the weighting coefficient is changed according to the size of its detection region. In other words, the weighting coefficient is changed according to the depth information extracted by the object orientation/depth information extraction unit 102, and the gain of the person's voice-band signal is adjusted accordingly.
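The text does not state how the occupied area maps to a weighting coefficient. A minimal sketch, assuming a linear mapping from the fraction of parallax calculation regions covered by the contour to a gain between two illustrative limits, could be written as follows. The function name, the linear form, and the limits 0.8 and 1.2 are assumptions.

```python
def area_gain(occupied_regions, total_regions, min_gain=0.8, max_gain=1.2):
    """Map the fraction of regions covered by a contour to a gain (assumed linear).

    A larger contour is treated as nearer to the viewer and therefore louder.
    """
    if total_regions <= 0:
        raise ValueError("total_regions must be positive")
    fraction = min(max(occupied_regions / total_regions, 0.0), 1.0)
    return min_gain + (max_gain - min_gain) * fraction


def apply_gain(samples, gain):
    """Scale a block of voice-band samples by the weighting coefficient."""
    return [gain * s for s in samples]
```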

Next, because the video signal S_11 is processed by the object orientation/depth detection means 102 of FIG. 1, a delay occurs. Likewise, because the audio signal S_12 is processed by the object orientation/gain adjustment unit 103, a delay occurs. The outputs of the object orientation/depth detection means 102 and the object orientation/gain adjustment unit 103 are therefore synchronized by the synchronization adjustment unit 105 and output as the 3D video signal S_21 and the 3D audio signal S_22.

Because a 2D moving image signal can be converted into a 3D video signal and a 3D audio signal, the device can be provided as an adapter that attaches directly to the output terminal of an existing 2D video device, allowing viewers to easily enjoy the realism, presence, and powerful images inherent in a 3D moving image signal. In addition, a 3D audio signal can be generated from object orientation information extracted from contour detection and motion vectors of the video signal, together with depth information obtained from one or a combination of contour information, motion vectors, high-frequency components, and luminance. A 3D audio signal can therefore be provided to the viewer even when the original signal carries no position information and could not otherwise supply one. Furthermore, because the sound is heard through speakers attached to the frame of the 3D video viewing glasses, a comfortable 3D audio signal can be heard even when external noise is loud, the sound field does not change with location, and several people can enjoy the same sound field.

FIG. 6 is a functional block diagram showing the main configuration of a 3D audio signal generation device according to Embodiment 2 of the present invention. A 3D moving image signal S_01, which is a mixed signal of a video signal and an audio signal, is input to the video/audio processing unit 120, which applies various processing and then generates and outputs a 3D video signal S_21 and a 3D audio signal S_22. The video/audio processing unit 120 comprises a signal separation unit 101, an object orientation/depth information extraction unit 121, an object orientation addition/gain adjustment unit 103, and a synchronization adjustment unit 105. The signal separation unit 101 decodes the moving image signal S_01 and separates it into a video signal S'_11 and an audio signal S_12. The object orientation/depth information extraction unit 121 extracts object orientation information and depth information from the 3D video signal S'_11 separated by the signal separation unit 101. Based on the object orientation information and depth information extracted by the object orientation/depth information extraction unit 121, the object orientation addition/gain adjustment unit 103 adds the object orientation information to the audio signal S_12 separated by the signal separation unit 101, or adjusts its gain. The synchronization adjustment unit 105 synchronizes the 3D video signal S'_11 with the audio signal output from the object orientation addition/gain adjustment unit 103, and the 3D video signal S_21 and the 3D audio signal S_22 are output as the output signals of the video/audio processing unit 100.

FIG. 7 is a functional block diagram showing the main configuration of the object orientation/depth information extraction unit 121, which extracts object orientation information and depth information from a 3D video signal according to Embodiment 2. The object orientation/depth information extraction unit 121 consists of a contour detection unit 122 and a pixel parallax detection unit 123. The contour detection unit 122 is the same as the contour detection unit 110 of FIG. 2 and detects the contour of a person from the input video signal. The pixel parallax detection unit 123 then extracts depth information by exploiting the structure of the 3D signal.

The 3D video signal formats that are currently mainstream are the side-by-side format and the frame packing format. The two differ somewhat, but in both the decoded video signal outputs the left and right images sequentially. The left and right images can therefore be extracted from successive images, and depth information can be extracted from the pixel differences between them.

FIG. 8 illustrates the method of extracting depth information. In this case, the depth information a can be calculated by the following formula.

Here, b is the distance from the eyes of the viewer H to the output panel P; the parallax c is the distance between the left and right eyes of the viewer H; and the binocular parallax d is the distance on the panel P between a given pixel of the right-eye image R and the corresponding pixel of the left-eye image L. The distance b may be calculated by installing a distance sensor on the display device side of the panel P, measuring the distance between the 3D video viewing glasses and the panel, and adding the distance from the glasses to the eyes of the viewer H (about 1.5 to 2 cm). Alternatively, the viewing distance from the panel may be specified in advance and the calculation matched to that position.

When a is greater than 0 (a > 0), the pixel is displayed behind the screen of the panel P. Conversely, when a is less than 0 (a < 0), the pixel is displayed in front of the screen of the panel P. When a is 0 (a = 0), the pixel is displayed on the screen of the panel P.
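The formula for a is not reproduced in this text. A similar-triangles argument on the geometry described for FIG. 8 gives d / c = a / (a + b), that is, a = b d / (c - d), provided d is taken as signed: positive when the right-eye image point lies to the right of the left-eye image point, negative when the two are crossed. This sign convention is an assumption chosen so that the result agrees with the three cases above; the function name is hypothetical.

```python
def depth_from_parallax(b, c, d):
    """Depth a of a pixel relative to the screen plane (assumed reconstruction).

    b: distance from the viewer's eyes to the panel
    c: distance between the viewer's left and right eyes
    d: signed on-panel distance between the right-eye and left-eye image points
       (positive when uncrossed, negative when crossed)
    All three arguments must use the same length unit.
    """
    if d >= c:
        # The lines of sight would be parallel or diverging: no finite depth.
        raise ValueError("d must be smaller than the eye separation c")
    return b * d / (c - d)
```

For example, with b = 2.0 m and c = 0.065 m, an uncrossed parallax of d = 0.0325 m places the pixel about 2.0 m behind the screen, while a crossed parallax of d = -0.065 m places it about 1.0 m in front.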

Therefore, if depth information can be extracted for every one of the 1920×1080 pixels, 3D audio can be generated by applying orientation information and gain adjustment region by region, whether the regions are the parallax calculation regions of FIG. 3, obtained by dividing one frame of the video signal into m regions horizontally and n regions vertically, or groups of pixels with similar depth information. The voice-band signal can then be assigned to the region of the person's contour detected by the contour detection unit 122 of FIG. 7. Even in this case, however, determining the presence or absence of speech by kurtosis-based voice detection and using the result for the assignment reduces erroneous orientation assignments. The audio signal assigned to the person's contour region is rendered as a 3D audio signal according to the depth information a, so that audio with a sense of depth can be provided to the viewer.

In addition, because only the audio signal of a 3D moving image signal needs to be converted into a 3D audio signal, the device can be provided as an adapter that attaches directly to the output terminal of an existing 3D video device, allowing viewers to easily enjoy the realism, presence, and powerful images inherent in a 3D moving image signal.

FIG. 9 is a functional block diagram showing the main configuration of a 3D audio signal generation device according to Embodiment 3. A moving image signal S_02, which is a mixed signal of a video signal and an audio signal, is input to the video/audio processing unit 130, which applies various processing and then generates and outputs a 3D video signal S_21 and a 3D audio signal S_22. The video/audio processing unit 130 comprises a signal separation unit 101; a 2D/3D video detection unit 131; an object orientation/depth information extraction unit 102, which extracts object orientation information and depth information from a 2D video signal; an object orientation/depth information extraction unit 121, which extracts object orientation information and depth information from a 3D video signal; an object orientation addition/gain adjustment unit 103; a 3D video generation unit 104, which generates a 3D video signal from a 2D video signal; and a synchronization adjustment unit 105. The signal separation unit 101 decodes the moving image signal S_02 and separates it into a video signal S and an audio signal S_12. The 2D/3D video detection unit 131 automatically detects whether the video signal S is a 2D video signal or a 3D video signal. For a 2D video signal S_11, the object orientation/depth information extraction unit 102 extracts the object orientation information and depth information, and the 3D video generation unit 104 generates a 3D video signal. For a 3D video signal S'_11, the object orientation/depth information extraction unit 121 extracts the object orientation information and depth information. Based on the object orientation information and depth information extracted by either the object orientation/depth information extraction unit 102 or the object orientation/depth information extraction unit 121, the object orientation addition/gain adjustment unit 103 adds the object orientation information to the audio signal S_12 separated by the signal separation unit 101, or adjusts its gain. The synchronization adjustment unit 105 synchronizes either the video signal output from the 3D video generation unit 104 or the 3D video signal S'_11 with the audio signal output from the object orientation addition/gain adjustment unit 103, and outputs the video signal S_21 and the audio signal S_22 as the output signals of the video/audio processing unit 130. In FIG. 9, the object orientation/depth information extraction unit 102 has the configuration shown in FIG. 2, and the object orientation/depth information extraction unit 121 has the configuration shown in FIG. 7.

The device uses the 2D/3D video detection unit 131 to determine automatically whether the video signal S separated by the signal separation unit 101 of FIG. 9 is a 2D video signal or a 3D video signal, and generates 3D audio for either kind of signal. Because it places no restriction on the input signal, it can serve as a very convenient 3D audio signal generation device.

For the 3D audio signal generation device of Embodiment 1, the object orientation/gain adjustment unit 103 of FIG. 1 was described using the example of speakers attached to 3D video viewing glasses, in which orientation information and depth information are added by convolving an inverse filter based on the head-related transfer function to generate 3D audio. Embodiment 4 describes an example of generating a 3D audio signal for surround speakers, which are currently mainstream.

FIG. 10 illustrates the output of a 3D audio signal to surround speakers according to Embodiment 4 of the present invention. As shown in FIG. 10, a surround speaker system is a configuration in which multiple speakers placed around the viewer emit sound from all directions. Listening through surround speakers while watching a 3D video signal adds impact, but the audio signal accompanying a 3D video signal is in most cases a two-channel stereo signal. The object orientation information and depth information obtained by the object orientation/depth information extraction unit 102 of FIG. 1 are therefore used to convert the signal into a multichannel signal and generate a 3D audio signal.

First, as in Embodiment 1, the signal is divided into three bands: a low-frequency band from 0 to 300 Hz, a voice band from 0.3 to 3.4 kHz, and a high-frequency band above 3.4 kHz. The low-frequency sound is output from the subwoofer (SW) of FIG. 10, and by default the signals of the other bands are output with the same gain from each speaker of the L and R channels.
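The three-band split can be sketched as follows. The specification does not say how the bands are separated; this example assumes a simple frequency-domain mask on a naive DFT of one frame, and the function name and frame-based processing are illustrative choices.

```python
import cmath


def split_bands(x, fs, low_edge=300.0, high_edge=3400.0):
    """Split one frame x (sampled at fs Hz) into low, voice and high bands.

    Each DFT bin is assigned to exactly one band, so the three returned
    time-domain signals sum back to x.
    """
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
                for k in range(n)]
    masked = {"low": [0j] * n, "voice": [0j] * n, "high": [0j] * n}
    for k, value in enumerate(spectrum):
        freq = min(k, n - k) * fs / n  # bins above n/2 mirror negative frequencies
        if freq < low_edge:
            name = "low"
        elif freq <= high_edge:
            name = "voice"
        else:
            name = "high"
        masked[name][k] = value

    def inverse(bins):
        return [sum(bins[k] * cmath.exp(2j * cmath.pi * k * t / n)
                    for k in range(n)).real / n
                for t in range(n)]

    return {name: inverse(bins) for name, bins in masked.items()}
```

In this sketch the "low" output would feed the subwoofer, while "voice" and "high" would go on to the weighting stage.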

FIG. 11 illustrates the case in which the contour detection unit (110 or 122) detects the contour of a person's face on the parallax calculation regions of FIG. 3. For example, when a contour is detected to the left of center as in FIG. 11 and the kurtosis value is at or above the threshold, then, for the voice-band signals x_L(t) and x_R(t), the signal x_L(t) output from FL, SL, and RL is multiplied by a weighting coefficient k of 1.2. Conversely, the signal x_R(t) output from FR, SR, and RR is multiplied by a weighting coefficient k of 0.8. These values of k are only examples; they can be changed to suit the viewer's preference and set in other combinations.

Furthermore, when the depth information indicates that the object is in front, the weighting coefficient k applied to x_L(t) and x_R(t) output from speaker C is made greater than 1. Conversely, when the object is at the back, k is made less than 1. For signals in the high-frequency band as well, left/right information and depth information are available from the pixel parallax detection unit 123, so the weighting coefficient k can be linked to them to convert the signal into a multichannel signal and to generate and output a 3D audio signal.
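The weighting rules for the surround channels can be summarized in a small sketch. The values 1.2 and 0.8 for the left and right groups come from the example above; mirroring them for a contour on the right, reusing the same two values for speaker C, and using the sign convention of Embodiment 2 (a < 0 in front of the screen) are assumptions made for illustration, as is the function name.

```python
LEFT_CHANNELS = ("FL", "SL", "RL")
RIGHT_CHANNELS = ("FR", "SR", "RR")


def surround_weights(side, depth_a, speech_detected, boost=1.2, cut=0.8):
    """Return a weighting coefficient k for each surround channel.

    side: "left", "right" or "center", the position of the detected contour
    depth_a: depth relative to the screen (negative means in front)
    speech_detected: True when the kurtosis value reached the threshold
    """
    weights = {ch: 1.0 for ch in LEFT_CHANNELS + RIGHT_CHANNELS + ("C",)}
    if speech_detected and side in ("left", "right"):
        louder, quieter = ((LEFT_CHANNELS, RIGHT_CHANNELS) if side == "left"
                           else (RIGHT_CHANNELS, LEFT_CHANNELS))
        for ch in louder:
            weights[ch] = boost
        for ch in quieter:
            weights[ch] = cut
    if depth_a < 0:
        weights["C"] = boost  # object in front of the screen: k > 1
    elif depth_a > 0:
        weights["C"] = cut    # object behind the screen: k < 1
    return weights
```

Each channel's voice-band samples would then be multiplied by the corresponding entry of the returned dictionary.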

It is therefore also possible to listen to a powerful 3D audio signal through surround speakers while watching a 3D video signal.

The device for generating a 3D audio signal according to the present invention is not limited to what is described in Embodiments 1 to 4; it can be built into a television, a recorder/player, a personal computer (PC), or the like, or used as an external unit. The invention is also widely applicable to devices that handle video and audio signals, such as CATV and other set-top boxes, car navigation systems, mobile phones, electronic dictionaries, and e-book readers.

The 3D audio signal generation device according to the present invention relates to a device that extracts the position of a person in the video and depth information, either from information obtained in the process of generating 3D video from 2D video or from 3D video information, and generates a 3D audio signal from a 2D audio signal on the basis of that depth information.

100, 120, 130 Video/audio processing unit
101 Signal separation unit
102, 121 Object orientation/depth information extraction unit
103 Object orientation addition/gain adjustment unit
104 3D video generation unit
105 Synchronization adjustment unit
110, 122 Contour detection unit
111 Motion vector detection unit
112 High-frequency component detection unit
113 Luminance component detection unit
123 Pixel parallax detection unit
131 2D/3D video detection unit
S_00, S_01, S_02 Moving image signal
S, S_11, S'_11, S_21 Video signal
S_12, S_22 Audio signal

Claims

A three-dimensional audio signal generation device comprising:
a signal separation unit that separates an input moving image signal into a video signal and an audio signal;
an object orientation/depth information extraction unit that extracts object orientation information and depth information from the video signal; and
an object orientation addition/gain adjustment unit that adds orientation information to the audio signal, or adjusts its gain, based on the audio signal, the object orientation information, and the depth information.

The three-dimensional audio signal generation device according to claim 1, wherein, when the input video signal is a two-dimensional video signal, the object orientation/depth information extraction unit comprises a combination of at least one of a contour detection unit and a motion vector detection unit with at least one of a high-frequency component detection unit and a luminance component detection unit.

The three-dimensional audio signal generation device according to claim 1, wherein, when the input video signal is a three-dimensional video signal, the object orientation/depth information extraction unit comprises a contour detection unit and a pixel parallax detection unit.

The three-dimensional audio signal generation device according to claim 2, wherein the contour detection unit detects the contour of a person by correlation with existing person contours and treats a contour whose area is equal to or larger than a threshold as the main subject of the screen.

The three-dimensional audio signal generation device according to claim 4, wherein, when the contour detection unit detects the contours of a plurality of persons, the depth information is set so that a person whose contour has a larger area is nearer to the viewer.

A three-dimensional audio signal generation device comprising:
a signal separation unit that separates an input moving image signal into a video signal and an audio signal;
a 2D/3D video detection unit that detects whether the video signal is a two-dimensional video signal or a three-dimensional video signal;
a first object orientation/depth information extraction unit that extracts object orientation information and depth information when the video signal is a two-dimensional video signal;
a second object orientation/depth information extraction unit that extracts object orientation information and depth information when the video signal is a three-dimensional video signal; and
an object orientation addition/gain adjustment unit that adds orientation information to the audio signal, or adjusts its gain, based on the audio signal and on the object orientation information and depth information extracted by the first or second object orientation/depth information extraction unit.

The three-dimensional audio signal generation device according to claim 6, wherein the first object orientation/depth information extraction unit comprises a combination of at least one of a contour detection unit and a motion vector detection unit with at least one of a high-frequency component detection unit and a luminance component detection unit.

The three-dimensional audio signal generation device according to claim 6, wherein the second object orientation/depth information extraction unit comprises a contour detection unit and a pixel parallax detection unit.

The three-dimensional audio signal generation device, wherein the contour detection unit detects the contour of a person by correlation with existing person contours and treats a contour whose area is equal to or larger than a threshold as the main subject of the screen.

The three-dimensional audio signal generation device according to claim 9, wherein, when the contour detection unit detects the contours of a plurality of persons, the depth information is set so that a person whose contour has a larger area is nearer to the viewer.

The three-dimensional audio signal generation device according to claim 1, wherein the object orientation addition/gain adjustment unit refers to a spatial transfer function database and can thereby provide a three-dimensional audio signal suited to the user's viewing mode.

The three-dimensional audio signal generation device according to claim 11, wherein the object orientation addition/gain adjustment unit generates the three-dimensional audio signal on the premise that it is listened to through speakers attached to the frame of three-dimensional video viewing glasses.

The three-dimensional audio signal generation device according to claim 1, wherein the object orientation addition/gain adjustment unit generates the three-dimensional audio signal on the premise that it is listened to through inner speakers attached to the frame of three-dimensional video viewing glasses, or through headphones.

The three-dimensional audio signal generation device according to claim 1, wherein the object orientation addition/gain adjustment unit generates the three-dimensional audio signal on the premise that it is listened to through surround speakers.