JP5174527B2

JP5174527B2 - Acoustic signal multiplex transmission system, production apparatus and reproduction apparatus to which sound image localization acoustic meta information is added

Info

Publication number: JP5174527B2
Application number: JP2008127600A
Authority: JP
Inventors: Kaoru Watanabe (渡辺馨); Kimio Hamasaki (濱崎公男); Yasushige Nakayama (中山靖茂)
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2008-05-14
Filing date: 2008-05-14
Publication date: 2013-04-03
Anticipated expiration: 2028-05-14
Also published as: JP2009278381A

Description

The present invention relates to a technique for reproducing sound with a strong sense of presence when receiving and viewing an ultra-high-definition television system. In particular, it relates to an acoustic signal multiplex transmission system, a production apparatus, and a reproduction apparatus that use sound image localization acoustic meta information, transmitted from a broadcasting station together with the video, to identify sounds whose sound image position must match the video, and that can control the sound image localization of those sounds according to the size of the video display on the receiving side.

When a multichannel audio signal broadcast together with video is applied to the audio-visual system used on the receiving side (hereinafter also called the reception viewing environment), the multichannel signal is used for surround reproduction through several loudspeakers placed around the viewer. For digital surround reproduction, for example, the 5.1-channel format is known: front left (FL: Front Left), front center (C: Center), front right (FR: Front Right), surround left (SL: Surround Left) placed to the left rear and left side, surround right (SR: Surround Right) placed to the right rear and right side, and a subwoofer (SW: Sub Woofer) that reproduces low frequencies.

Also known, as an audio format for large-screen, high-definition video systems, is a format with five front channels plus five side/rear channels, which can reproduce sound with an even stronger sense of presence than the 5.1-channel surround format.

There is also a three-dimensional sound format that extends the channels hierarchically in the height direction, enabling sound image localization in height as well (see, for example, Non-Patent Document 1). As shown in FIG. 7, programs for this ultra-high-definition, highly realistic audio-visual system (hereinafter, the ultra-high-definition television system) are produced under standard production conditions with a large-screen video display (for example, 7680 × 4320 pixels) and many loudspeakers. Under these standard production conditions, with the large-screen display 18a in front, the audio signal is produced on 22 channels, not counting the low-frequency effects loudspeakers LFE1 and LFE2: upper layer U1 to U9, middle layer M1 to M10, and lower layer L1 to L3. The program is produced so that the position of the image on the screen matches the position of the sound image reproduced by the loudspeakers.
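
The channel arithmetic in the paragraph above (upper 9 + middle 10 + lower 3 = 22 full-band channels, plus LFE1 and LFE2) can be checked with a small Python sketch. The labels come from the text; the variable names and the ordering within each layer are assumptions for illustration.

```python
# Channel labels of the 22.2 layout named in the text. The variable names and
# the ordering inside each layer are illustrative assumptions.
UPPER = [f"U{i}" for i in range(1, 10)]   # upper layer: U1..U9
MIDDLE = [f"M{i}" for i in range(1, 11)]  # middle layer: M1..M10
LOWER = [f"L{i}" for i in range(1, 4)]    # lower layer: L1..L3
LFE = ["LFE1", "LFE2"]                    # low-frequency effects loudspeakers


def full_band_channels() -> list:
    """Return every channel except the two LFE channels."""
    return UPPER + MIDDLE + LOWER


if __name__ == "__main__":
    print(f"{len(full_band_channels())}.{len(LFE)}")  # prints "22.2"
```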

On the receiving side, by contrast, it is often difficult to install the video display and loudspeakers under the same standard production conditions used in program production. Instead, a video display suited to the reception environment (usually smaller than the one used in production, for example 1920 × 1080 pixels) and a given number of loudspeakers are installed (see FIG. 8).

Non-Patent Document 1: Akio Ando, "高臨場感音響" (Highly realistic sound), NHK Science & Technology Research Laboratories, retrieved April 20, 2008, <http://www.nhk.or.jp/strl/group/ningen_joho/ningen_joho06.html>

In a program produced under standard production conditions, the multichannel audio signal, which accounts for the sound pressure level and direction of the many objects appearing in the scene as sound sources (virtual sound sources), is provided together with the video signal. If the provided video and audio signals are reproduced unchanged in a reception viewing environment whose video display is smaller than under standard production conditions, the highly realistic sound intended by the producer may not be achieved. It is therefore desirable that the receiving side, according to its reception viewing environment, preserve the realistic reproduction as it is while controlling the sound image position of sounds that must match the video so that it fits the size of the video display, thereby achieving highly realistic sound reproduction for the ultra-high-definition television system.

Program sound includes not only "sounds whose sound image should coincide with the video", such as actors' dialogue and sounds radiated by objects, which are meant to be heard as coming from the picture on the screen, but also many "effect sounds not directly related to the video", such as sound effects that heighten the sense of presence of the program as a whole. To reproduce the video and audio intended by the producer more faithfully in the reception viewing environment, it is therefore desirable that the receiving side be able to distinguish these kinds of sound.

Thus, in conventional television viewing in a reception environment such as the home, programs are produced for the loudspeaker arrangement specified by the production side, and the transmitted audio channel signals are reproduced unchanged in the reception viewing environment from the loudspeakers assigned to each channel. As a result, the sound image position could not be controlled to match the video.

Likewise, conventional sound image localization acoustic meta information, which describes how sound images are localized across audio channels, was dedicated solely to reducing the number of transmitted channels and the transmission bit rate, and could not be used to control the match between the video and the sound image position.

An object of the present invention is to provide an acoustic signal multiplex transmission system, a production apparatus, and a reproduction apparatus that enable highly realistic sound reproduction for an ultra-high-definition television system even in a reception viewing environment whose video display is smaller than under standard production conditions: they preserve the highly realistic reproduction while controlling the sound image localization of sounds whose sound image position must match the video according to the size of the video display on the receiving side.

That is, the acoustic signal multiplex transmission system according to the present invention comprises a production apparatus that multiplexes predetermined sound image localization acoustic meta information onto an audio signal produced under standard production conditions and transmits it, and a reproduction apparatus that receives the audio signal, converts it into a new audio signal based on the meta information, and reproduces it.

The production apparatus has: means for generating, for each frame of the audio signal, sound image localization acoustic meta information indicating the attributes of each sound when a multichannel audio signal synchronized with the video shown on the display used in program production (the standard production conditions) is mixed for a given number of loudspeakers, the meta information including whether the sound requires its sound source position to match the video and, for such sounds, the sound image localization position under standard production conditions and the acoustic feature meta information related to that localization; and means for encoding the audio signal under standard production conditions together with the meta information and transmitting it.

The reproduction apparatus has: means for receiving the encoded audio signal and the encoded meta information from the production apparatus and decoding each; means for identifying, frame by frame and based on the decoded meta information, whether the decoded audio signal is a sound whose sound source position must match the video; means for determining, only for such sounds, a sound image position suited to the display and the given number of loudspeakers in the reception viewing environment, which is associated in advance with the standard production conditions, by coordinate conversion from the localization position under standard production conditions; means for converting, for each sound source and based on the acoustic feature meta information related to the localization, the audio signal suited to the converted localization position into the audio signals of the predetermined channels corresponding to the loudspeakers in the reception viewing environment; and means for combining the converted channel signals with the signals of the same channels that do not require matching, and reproducing the result from the loudspeakers corresponding to those channels. In a reception viewing environment whose display differs in size from that under standard production conditions, the sound image position of sounds that must match the video is thereby controlled to fit the size of the display.
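
The reproduction-side flow described above can be summarized as a per-frame dispatch. The Python sketch below is illustrative only: `FrameMeta`, `reproduce_frame`, and the `transform`/`render` callbacks are hypothetical names, and the separation of the image-matched source from the decoded channels is assumed to have been done already. It shows only the control flow stated in the text: frames that need no matching pass through unchanged, and only matched sources are coordinate-converted, rendered, and combined with the remaining channel signals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Signal = List[float]
Position = Tuple[float, float]


@dataclass
class FrameMeta:
    """Hypothetical per-frame meta information (a reduced subset)."""
    needs_match: bool                      # must the sound image coincide with the video?
    position: Optional[Position] = None    # localization under standard production conditions
    features: Dict[str, float] = field(default_factory=dict)  # acoustic feature meta information


def reproduce_frame(
    bed: Dict[str, Signal],
    source: Signal,
    meta: FrameMeta,
    transform: Callable[[Position], Position],
    render: Callable[[Signal, Position], Dict[str, Signal]],
) -> Dict[str, Signal]:
    """Produce one frame of loudspeaker feeds for the reception viewing environment.

    bed:       decoded channel signals that need no video matching
    source:    the image-matched source, assumed already separated (empty otherwise)
    transform: hypothetical coordinate conversion from the standard to the receiving display
    render:    hypothetical mapping from a source and a position to per-loudspeaker signals
    """
    out = {ch: list(sig) for ch, sig in bed.items()}
    if not meta.needs_match or meta.position is None:
        return out                          # conventional reproduction, no special processing
    target = transform(meta.position)       # only matched sounds are re-positioned
    for ch, sig in render(source, target).items():
        base = out.get(ch, [0.0] * len(sig))
        out[ch] = [a + b for a, b in zip(base, sig)]  # combine with the unmatched signals
    return out


if __name__ == "__main__":
    bed = {"FL": [0.1, 0.1], "C": [0.0, 0.0]}
    meta = FrameMeta(needs_match=True, position=(0.5, 0.0))
    feeds = reproduce_frame(bed, [1.0, 1.0], meta,
                            transform=lambda p: (0.25 * p[0], 0.25 * p[1]),
                            render=lambda s, p: {"C": s})
    print(feeds)  # {'FL': [0.1, 0.1], 'C': [1.0, 1.0]}
```

Passing the conversion and rendering steps in as callbacks keeps the sketch neutral about how they are actually done, which the claim leaves to the embodiments.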

In the acoustic signal multiplex transmission system according to the present invention, the sound image localization acoustic meta information includes, as the acoustic feature meta information, "transmission channel / main acoustic frequency components", "acoustic frequency contour", "inter-channel level difference (ICLD: Inter-Channel Level Differences)", and "inter-channel time difference (ICTD: Inter-Channel Time Differences)". The means for converting into the audio signals of the predetermined channels corresponding to the loudspeakers in the reception viewing environment determines the signal of the image-matched sound source (映音一致音源) for each channel from the combination of "transmission channel / main acoustic frequency components", "acoustic frequency contour", "ICTD", and "ICLD".
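
The text lists ICLD and ICTD as acoustic feature meta information but does not define how they are computed. A common definition, used in parametric spatial audio coding, takes ICLD as the ratio of the channel powers in dB and ICTD as the lag that maximizes the cross-correlation. The Python sketch below follows that definition (an assumption, not necessarily the one used in this patent) and also shows, with the hypothetical helper `apply_cues`, how a receiver could impose a level and time difference on a single source.

```python
import math
from typing import List, Tuple


def icld_db(x1: List[float], x2: List[float], eps: float = 1e-12) -> float:
    """Inter-channel level difference in dB: 10*log10(P1 / P2)."""
    p1 = sum(v * v for v in x1)
    p2 = sum(v * v for v in x2)
    return 10.0 * math.log10((p1 + eps) / (p2 + eps))


def _xcorr(x1: List[float], x2: List[float], lag: int) -> float:
    """sum_n x1[n] * x2[n + lag] over the overlapping samples (equal lengths assumed)."""
    n = len(x1)
    return sum(x1[i] * x2[i + lag] for i in range(max(0, -lag), min(n, n - lag)))


def ictd_samples(x1: List[float], x2: List[float], max_lag: int) -> int:
    """Inter-channel time difference: the lag maximizing the cross-correlation.
    A positive value means x2 lags behind x1."""
    return max(range(-max_lag, max_lag + 1), key=lambda d: _xcorr(x1, x2, d))


def apply_cues(src: List[float], level_db: float, delay: int) -> Tuple[List[float], List[float]]:
    """Hypothetical receiver step: derive a channel pair from one source, attenuating
    the second channel by level_db and delaying it by `delay` samples (0 <= delay <= len(src))."""
    g = 10.0 ** (-level_db / 20.0)
    ch2 = [0.0] * delay + [g * v for v in src[: len(src) - delay]]
    return list(src), ch2


if __name__ == "__main__":
    src = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    ch1, ch2 = apply_cues(src, level_db=6.0, delay=2)
    print(round(icld_db(ch1, ch2), 3), ictd_samples(ch1, ch2, max_lag=4))  # 6.0 2
```

Practical parametric coders usually compute these cues per frequency band; computing them on the full-band signal keeps the sketch short.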

In the acoustic signal multiplex transmission system according to the present invention, the sound image localization acoustic meta information may also include, as the acoustic feature meta information, "acoustic frequency contour", "sound image position", and "inter-channel correlation (ICC: Inter-Channel Coherence)". In this case, the means for converting into the audio signals of the predetermined channels corresponding to the loudspeakers in the reception viewing environment determines the signal of the image-matched sound source for each channel from the combination of "acoustic frequency contour", "sound image position", and "ICC".
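
ICC expresses how similar the waveforms of two channels are. The Python sketch below uses the normalized cross-correlation, maximized over a small lag range, which is one common definition; the text does not state which definition the meta information uses, so treat this as an assumption.

```python
import math
from typing import List


def icc(x1: List[float], x2: List[float], max_lag: int = 0) -> float:
    """Inter-channel coherence: normalized cross-correlation between two equal-length
    channels, maximized over lags |d| <= max_lag. Roughly, 1 means identical waveforms
    (a compact image) and values near 0 mean uncorrelated channels (a diffuse image)."""
    norm = math.sqrt(sum(v * v for v in x1) * sum(v * v for v in x2))
    if norm == 0.0:
        return 0.0  # silent channel: treat as uncorrelated
    n = len(x1)

    def xcorr(lag: int) -> float:
        return sum(x1[i] * x2[i + lag] for i in range(max(0, -lag), min(n, n - lag)))

    return max(xcorr(d) for d in range(-max_lag, max_lag + 1)) / norm


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    print(icc(a, a), icc(a, [-v for v in a]), icc([1.0, 0.0], [0.0, 1.0]))  # 1.0 -1.0 0.0
```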

In the acoustic signal multiplex transmission system according to the present invention, the sound image localization acoustic meta information may also include, as the acoustic feature meta information, "acoustic frequency contour", "transmission channel number information", and "inter-channel correlation (ICC: Inter-Channel Coherence)". In this case, the means for converting into the audio signals of the predetermined channels corresponding to the loudspeakers in the reception viewing environment determines the signal of the image-matched sound source for each channel from the combination of "acoustic frequency contour", "sound image position", "transmission channel number information", and "ICC".

Further, the production apparatus according to the present invention multiplexes predetermined sound image localization acoustic meta information onto an audio signal produced under standard production conditions and transmits it. It comprises: means for generating, for each frame of the audio signal, sound image localization acoustic meta information indicating the attributes of each sound when a multichannel audio signal synchronized with the video shown on the display used in program production (the standard production conditions) is mixed for a given number of loudspeakers, the meta information including whether the sound requires its sound source position to match the video and, for such sounds, the localization position under standard production conditions and the acoustic feature meta information related to that localization; and means for encoding the audio signal under standard production conditions together with the meta information and transmitting it.
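
As a rough illustration of per-frame meta information, the Python sketch below packs a hypothetical, heavily reduced record (frame index, match flag, normalized screen position, ICLD, ICTD) into a fixed-size byte string with the standard `struct` module. The field set and byte layout are assumptions: the text does not specify an encoding, and the meta information it describes also carries items such as the acoustic frequency contour.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed layout (little-endian, no padding): frame index (uint32),
# match flag (uint8), screen x and y (float32), ICLD in dB (float32), ICTD in samples (int16).
_LAYOUT = struct.Struct("<IBfffh")


@dataclass
class LocalizationMeta:
    frame_index: int
    needs_match: bool       # does this frame's sound need to match the video?
    x: float = 0.0          # normalized screen position under standard conditions
    y: float = 0.0
    icld_db: float = 0.0    # example acoustic feature meta information
    ictd: int = 0

    def encode(self) -> bytes:
        """Serialize one frame's record into a fixed-size byte string."""
        return _LAYOUT.pack(self.frame_index, int(self.needs_match),
                            self.x, self.y, self.icld_db, self.ictd)

    @classmethod
    def decode(cls, data: bytes) -> "LocalizationMeta":
        """Inverse of encode (floats come back at float32 precision)."""
        idx, flag, x, y, icld, ictd = _LAYOUT.unpack(data)
        return cls(idx, bool(flag), x, y, icld, ictd)


if __name__ == "__main__":
    meta = LocalizationMeta(7, True, 0.5, -0.25, 6.0, 3)
    print(len(meta.encode()), LocalizationMeta.decode(meta.encode()) == meta)  # 19 True
```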

Further, the reproduction apparatus according to the present invention receives an audio signal produced according to predetermined standard production conditions, converts it into a new audio signal based on the sound image localization acoustic meta information attached to it, and reproduces it. It comprises: means for receiving the encoded audio signal and the encoded meta information and decoding each; means for identifying, frame by frame and based on the decoded meta information, whether the decoded audio signal is a sound whose sound source position must match the video; means for coordinate-converting, only for such sounds, the localization position under standard production conditions into a sound image position suited to the display and the given number of loudspeakers in the reception viewing environment associated in advance with the standard production conditions; means for converting, for each sound source and based on the acoustic feature meta information related to the localization, the audio signal suited to the converted localization position into the audio signals of the predetermined channels corresponding to the loudspeakers in the reception viewing environment; and means for combining the converted channel signals with the signals of the same channels that do not require matching, and reproducing the result from the loudspeakers corresponding to those channels.

As a result, the sound image position of sounds that must match the video can be controlled appropriately to fit the video display.

According to the present invention, even in a reception viewing environment whose video display is smaller than under standard production conditions, highly realistic sound reproduction is preserved, while sounds whose sound image position must match the video are reproduced with their sound image position controlled to fit the display. This makes highly realistic sound reproduction possible for ultra-high-definition, high-presence television.

An acoustic signal multiplex transmission system according to one embodiment of the present invention is described below. The system of this embodiment consists of the production apparatus and the reproduction apparatus according to one embodiment of the present invention.

[System configuration]
The acoustic signal multiplex transmission system of this embodiment comprises a production apparatus that multiplexes predetermined sound image localization acoustic meta information onto an audio signal produced under standard production conditions and transmits it, and a reproduction apparatus that receives the audio signal, converts it into a new audio signal based on the meta information, and reproduces it. The system is intended for reception viewing environments whose video display is smaller than under standard production conditions: for sounds whose sound image position must match the video, it controls the sound image position to fit the size of the display, so that ultra-high-definition, high-presence television sound is reproduced with a strong sense of presence.

The production apparatus of this embodiment multiplexes and transmits an audio signal with sound image localization acoustic meta information attached: it encodes the audio signal and the meta information and multiplexes them into a single bitstream. This multiplexed audio signal, carrying the "sound image localization acoustic meta information", is transmitted to remote locations such as homes via radio waves, an IP line, or the like.
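
The multiplexing into a single bitstream can be pictured with a toy container in which every frame carries a header holding the lengths of its coded audio and coded meta information, followed by the two payloads. The Python sketch below is a hypothetical format for illustration only, not the transport format an actual broadcast system would use.

```python
import struct
from typing import Iterable, Iterator, List, Tuple

# Hypothetical frame header: byte lengths of the coded audio and the coded meta information.
HEADER = struct.Struct("<II")


def mux(frames: Iterable[Tuple[bytes, bytes]]) -> bytes:
    """Interleave (coded audio, coded meta information) pairs into one bitstream."""
    out = bytearray()
    for audio, meta in frames:
        out += HEADER.pack(len(audio), len(meta))
        out += audio
        out += meta
    return bytes(out)


def demux(stream: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Split the bitstream back into per-frame (audio, meta information) pairs."""
    pos = 0
    while pos < len(stream):
        audio_len, meta_len = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        audio = stream[pos:pos + audio_len]
        pos += audio_len
        meta = stream[pos:pos + meta_len]
        pos += meta_len
        yield audio, meta


if __name__ == "__main__":
    frames: List[Tuple[bytes, bytes]] = [(b"aud0", b"m0"), (b"aud1", b"")]
    stream = mux(frames)
    print(len(stream), list(demux(stream)) == frames)  # 26 True
```

Keeping both payloads of a frame adjacent means the receiver always has the meta information for the frame it is about to decode.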

That is, when the multichannel audio signal synchronized with the video shown on the display used in program production (the standard production conditions) is mixed for a given number of loudspeakers, the production apparatus of this embodiment generates, for each frame of the audio signal, sound image localization acoustic meta information indicating the attributes of each sound: whether the sound requires its sound source position to match the video and, for such sounds, the localization position under standard production conditions and the acoustic feature meta information related to that localization. It then encodes the audio signal under standard production conditions together with this meta information and transmits it.

During production, while the program sound is being produced and recorded, sound image localization acoustic meta information indicating the sound attributes is created: whether the sound requires its sound source position to match the video and, for such sounds, the localization position (or on-screen position) under standard production conditions and the acoustic feature meta information related to localization. Sounds that do not require the sound image position to match the video are likewise made identifiable through the meta information.

The reproduction apparatus of this embodiment receives the audio signal with the sound image localization acoustic meta information attached and, according to a reception viewing environment whose video display is smaller than under standard production conditions, moves each sound whose sound image position must match the video to a sound image position suited to that display.

That is, the reproduction apparatus of this embodiment receives the encoded audio signal and the encoded meta information from the production apparatus and decodes each. Based on the decoded meta information, it identifies frame by frame whether the decoded audio signal is a sound whose sound source position must match the video. Only for such sounds, it determines a sound image position suited to the display and the given number of loudspeakers in the reception viewing environment, which is associated in advance with the standard production conditions, by coordinate conversion from the localization position under standard production conditions (XYZ coordinates may be converted using sin/cos). Based on the acoustic feature meta information related to the localization, it converts, for each sound source, the audio signal suited to the converted position into the audio signals of the predetermined channels corresponding to the loudspeakers in the reception viewing environment. Finally, it combines these converted channel signals with the signals of the same channels that do not require matching, and reproduces the result from the loudspeakers corresponding to those channels.
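
The text leaves the coordinate conversion open beyond noting that sin/cos may be used for XYZ coordinates. One simple interpretation, sketched below in Python as an assumption, keeps the source at the same relative position on the screen: the standard-condition direction is mapped to normalized screen coordinates on the production display, re-aimed at a smaller receiving display at its own viewing distance, and finally converted to an XYZ unit vector with sin/cos. The screen sizes and viewing distances are hypothetical values.

```python
import math
from typing import Tuple

Screen = Tuple[float, float, float]  # assumed: (width, height, viewing distance) in metres


def screen_to_direction(u: float, v: float, screen: Screen) -> Tuple[float, float]:
    """Direction (azimuth, elevation) in radians of the point at normalized screen
    coordinates (u, v) in [-1, 1] (0 = screen centre) on a flat screen facing the viewer."""
    width, height, dist = screen
    x = u * width / 2.0                       # horizontal offset on the screen
    y = v * height / 2.0                      # vertical offset on the screen
    return math.atan2(x, dist), math.atan2(y, math.hypot(dist, x))


def direction_to_screen(az: float, el: float, screen: Screen) -> Tuple[float, float]:
    """Inverse of screen_to_direction (valid for |az|, |el| < 90 degrees)."""
    width, height, dist = screen
    x = dist * math.tan(az)
    y = math.hypot(dist, x) * math.tan(el)
    return x / (width / 2.0), y / (height / 2.0)


def convert(az: float, el: float, std: Screen, rcv: Screen) -> Tuple[float, float]:
    """Keep the same relative on-screen position but re-aim it at the receiving screen."""
    u, v = direction_to_screen(az, el, std)
    return screen_to_direction(u, v, rcv)


def to_xyz(az: float, el: float) -> Tuple[float, float, float]:
    """Unit vector via sin/cos: x to the right, y to the front, z upward."""
    return (math.cos(el) * math.sin(az), math.cos(el) * math.cos(az), math.sin(el))


if __name__ == "__main__":
    std = (4.0, 2.25, 3.0)      # hypothetical large production screen
    rcv = (1.0, 0.5625, 2.0)    # hypothetical home display
    az_std, el_std = screen_to_direction(1.0, 0.0, std)   # right edge of the screen
    az_rcv, el_rcv = convert(az_std, el_std, std, rcv)
    print(round(math.degrees(az_std), 1), round(math.degrees(az_rcv), 1))  # 33.7 14.0
```

In this example a source at the right edge of the production screen moves from about 33.7 degrees to about 14.0 degrees to stay at the right edge of the smaller display.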

As described later, in a reception viewing environment whose video display is smaller than under standard production conditions, the reproduction apparatus of this embodiment controls the position of each sound whose sound image position must match the video so that it fits the size of the display. It does so by applying, based on the "sound image localization acoustic meta information" separated from the received audio channel signals, processing such as the examples described later: "extraction of the main acoustic frequency components of the image-matched sound source" and "synthesis of the main acoustic frequency components and the acoustic frequency contour of the image-matched sound source". Sounds that do not require matching receive no special processing and are handled as in conventional reproduction. The output signal for each loudspeaker, generated according to the reception viewing environment, combines the matched and unmatched contributions and is reproduced from that loudspeaker.
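
The text does not say how the converted position is turned into loudspeaker signals. As a stand-in, the Python sketch below uses constant-power pairwise amplitude panning (tangent law) in the horizontal plane, a well-known technique that is not necessarily the one intended here, and then adds the panned source to the channels that need no matching. The loudspeaker names and angles are assumptions.

```python
import math
from typing import Dict, List


def pair_pan(az: float, speakers: Dict[str, float]) -> Dict[str, float]:
    """Constant-power tangent-law gains for a source at azimuth az (radians, positive
    to the right) using the two adjacent loudspeakers that enclose it."""
    names = sorted(speakers, key=speakers.get)
    angles = [speakers[n] for n in names]
    gains = {n: 0.0 for n in names}
    az = min(max(az, angles[0]), angles[-1])      # clamp to the loudspeaker arc
    if len(names) == 1:
        gains[names[0]] = 1.0
        return gains
    for i in range(len(names) - 1):
        lo, hi = angles[i], angles[i + 1]
        if lo <= az <= hi:
            if hi == lo:                          # coincident loudspeakers: use one
                gains[names[i]] = 1.0
                return gains
            half = (hi - lo) / 2.0
            r = math.tan(az - (lo + hi) / 2.0) / math.tan(half)   # tangent law
            g_lo, g_hi = 1.0 - r, 1.0 + r
            norm = math.hypot(g_lo, g_hi)                          # constant power
            gains[names[i]], gains[names[i + 1]] = g_lo / norm, g_hi / norm
            return gains
    return gains


def mix(bed: Dict[str, List[float]], source: List[float],
        gains: Dict[str, float]) -> Dict[str, List[float]]:
    """Add the panned image-matched source to the channels that need no matching."""
    out = {ch: list(sig) for ch, sig in bed.items()}
    for ch, g in gains.items():
        base = out.get(ch, [0.0] * len(source))
        out[ch] = [b + g * s for b, s in zip(base, source)]
    return out


if __name__ == "__main__":
    spk = {"FL": math.radians(-30), "C": 0.0, "FR": math.radians(30)}
    g = pair_pan(math.radians(15), spk)
    print({k: round(v, 3) for k, v in g.items()})  # {'FL': 0.0, 'C': 0.707, 'FR': 0.707}
```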

FIG. 1 shows a production apparatus that attaches sound image localization acoustic meta information to the audio signal to be transmitted and multiplexes them for transmission. It is well known that, in digital broadcasting for example, the audio signal is multiplexed and transmitted in synchronization with the corresponding video signal; FIG. 1 therefore shows only the audio signal to be multiplexed. The production apparatus comprises a sound recording/reproduction device 1, a mixing console 2, an audio signal production-to-transmission channel converter 3, an audio signal encoder 4, an acoustic localization acoustic meta information multiplexer 5, an audio signal/meta information multiplexer 6, a mixing-control score data generator 7, and a microphone 8 that picks up dialogue for the scene.

The sound recording/reproduction device 1 records given sounds, or plays back recorded multichannel audio signals, according to a control signal (audio signal start/stop control signal) that instructs the start or end of an audio signal based on the score data generated by the mixing-control score data generator 7, and outputs them to the mixing console 2.

Following a control signal (mixing control signal) that instructs mixing of the audio signals based on the score data generated by the mixing-control score data generator 7, the mixing console 2 mixes the dialogue signal recorded through the microphone 8 with the audio signals played back by the sound recording/reproduction device 1, and outputs the produced audio signal for each channel, produced under the standard production conditions described above, to the production-to-transmission channel converter 3.

The production-to-transmission channel converter 3 converts the produced audio signal into frame-based audio signals for the transmission channels and outputs the converted audio signal for transmission (transmission audio signal) to the audio signal encoder 4.
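
Conversion into frame units can be sketched in Python as splitting each channel into fixed-length frames and grouping frame k of every channel together. The frame length is an assumption (the text gives none), and zero-padding the final frame is only one possible choice.

```python
from typing import Dict, List


def to_frames(samples: List[float], frame_len: int) -> List[List[float]]:
    """Split one channel into fixed-length frames, zero-padding the last one."""
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def frame_channels(channels: Dict[str, List[float]],
                   frame_len: int) -> List[Dict[str, List[float]]]:
    """Group all channels frame by frame so that frame k of every channel travels together."""
    per_ch = {ch: to_frames(sig, frame_len) for ch, sig in channels.items()}
    n_frames = max((len(f) for f in per_ch.values()), default=0)
    silence = [0.0] * frame_len
    return [
        {ch: (f[k] if k < len(f) else list(silence)) for ch, f in per_ch.items()}
        for k in range(n_frames)
    ]


if __name__ == "__main__":
    frames = frame_channels({"C": [1.0, 2.0, 3.0], "FL": [4.0]}, frame_len=2)
    print(frames)  # [{'C': [1.0, 2.0], 'FL': [4.0, 0.0]}, {'C': [3.0, 0.0], 'FL': [0.0, 0.0]}]
```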

The audio signal encoder 4 encodes the transmission audio signal according to a predetermined audio coding scheme and outputs the encoded transmission audio signal, frame by frame, to the audio signal/meta information multiplexer 6.

The acoustic localization meta-information multiplexer 5 uses the audio signals from the audio recording/playback device 1 and the production audio signals from the audio mixing console 2. From these it creates, frame by frame, sound image localization acoustic meta-information that describes the attributes of each sound (described later). This includes whether the sound's image must coincide in position with the video. For sounds that must coincide, it also includes the sound image position under the standard production conditions (or the position on the screen) and the acoustic feature meta-information related to localization. The multiplexer encodes this meta-information with a predetermined coding scheme and sends it to the audio signal/meta-information multiplexer 6.

The audio signal/meta-information multiplexer 6 combines two frame-based streams: the encoded transmission audio signals, and the encoded sound image localization acoustic meta-information output frame by frame from the acoustic localization meta-information multiplexer 5. It sends the multiplexed result to the playback apparatus over the air or over an IP line or the like.

The mixing-control score data generator 7 generates score data from the operations of a sound engineer. It outputs the audio start/stop control signal to the audio recording/playback device 1 and the mixing control signal to the audio mixing console 2.

In this way, the production apparatus of this embodiment follows the intent of the program producer, that is, the operations of the sound engineer. It produces and records program sound with attached information indicating the attributes of each sound. This information states whether the sound's image must coincide in position with the video. If it must, the information also gives the sound image position under the standard production conditions and the related sound image localization acoustic meta-information. The apparatus then encodes the audio signals and the meta-information, multiplexes them into a single bit stream, and transmits it.

FIG. 2 shows a playback apparatus that uses the sound image localization acoustic meta-information to convert the received audio into audio signals suited to the receiving environment and then plays them back. In digital broadcasting, for example, the audio signal is known to be reproduced in synchronization with the corresponding video signal, so FIG. 2 shows only the audio path. The playback apparatus of this embodiment comprises:
- an audio signal/meta-information demultiplexer 11,
- an acoustic localization meta-information separator 12,
- an audio signal decoder 13,
- an audio signal localization converter 14,
- an audio signal localization conversion controller 15,
- an audio signal reproduction device 16, and
- a speaker set 17.

The audio signal/meta-information demultiplexer 11 receives from the production apparatus the encoded transmission audio signals and the sound image localization acoustic meta-information, multiplexed frame by frame. It separates the two and sends the audio to the audio signal decoder 13 and the meta-information to the acoustic localization meta-information separator 12.

The acoustic localization meta-information separator 12 checks each frame for sound image localization acoustic meta-information. When it is present, the separator decodes it, splits it into the individual items described later, and sends them to the audio signal localization conversion controller 15.

The audio signal decoder 13 decodes the encoded transmission audio signals. It sends the per-channel audio signals, in their production layout, to the audio signal localization converter 14.

The audio signal localization converter 14 works under the control signal from the audio signal localization conversion controller 15. It converts the per-channel audio signals of the production layout into per-channel audio signals suited to the receiving environment and sends them to the audio signal reproduction device 16.

The audio signal localization conversion controller 15 analyzes the itemized meta-information received from the acoustic localization meta-information separator 12. It then sends a control signal to the audio signal localization converter 14. This signal tells the converter how to turn the production audio signals of the corresponding frame into per-channel audio signals suited to the receiving environment, including the size of the video display. This conversion control is described in detail later. The operator can freely choose how the analysis results are assigned to speakers according to their position coordinates in the receiving environment.

The audio signal reproduction device 16 sends the per-channel audio signals for the receiving environment to the speakers of the speaker set 17 as playback signals.

The speaker set 17 reproduces the playback signal for each channel of the receiving environment through speakers Sp1 to Sp7. The seven speakers in FIG. 2 are only an example. The speaker set could instead be a 5.1-channel set, a set for a system with five front channels plus five side/rear channels, or a set for a three-dimensional sound system. The speakers used for sound image localization conversion are preferably placed at position coordinates that correspond to the sound image localization acoustic meta-information. To support this, the position coordinates corresponding to the production-side meta-information are set in advance in the audio signal localization conversion controller 15 on the playback side.

Thus the playback apparatus uses the sound image localization acoustic meta-information to convert the audio into signals suited to the receiving environment and reproduces them. The goal is to preserve highly realistic sound in the receiving environment. A "video-matched sound source" is a sound whose image the producer intended to coincide with the video. For each such source, the apparatus uses the meta-information to move the production-side sound image position to a position appropriate to the size of the video display. It then generates the corresponding channel signals and reproduces them from the speakers.

The sound image position conversion control performed in the audio signal localization conversion controller 15 is described below.

FIG. 3 shows an example of the middle-layer speaker arrangement in a receiving environment whose display is smaller than that of the standard production conditions. The receiving environment is assumed to use the same speaker arrangement as the standard production conditions. However, the video is shown on a display 18b that is smaller than the standard-condition display 18a.

A smaller display calls for sound image position conversion control even when the speaker arrangement matches the standard production conditions. This applies to any sound the producer provided as video-matched. The reason lies in how the multichannel audio is made: it is delivered together with the video signal, and its sound pressure levels and directions are set to match the many objects that appear in the scene as sound sources (virtual sound sources). Those levels and directions were chosen for the larger production display, so they no longer line up with the objects on a smaller screen.

During production, therefore, the program sound is produced and recorded in the usual way. In addition, sound image localization acoustic meta-information indicating each sound's attributes is attached, following the program producer's intent and the sound engineer's operations.

This section describes the sound image localization acoustic meta-information attached to a video-matched sound source. A frame is a block of audio time samples divided into fixed time units. On the transmitting side, the production-side meta-information is recorded so that it can be transmitted with each frame. On the receiving side, the meta-information is extracted frame by frame. The receiver uses it in the computations that adapt the video and audio to its own audiovisual system for playback. The transmitting side also attaches a flag to each frame indicating whether meta-information is present (for example, 1 if present and 0 if absent). The receiving side monitors this flag and extracts the meta-information of the corresponding frame.
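The per-frame presence flag can be sketched as follows. This is a minimal illustration only. The byte layout (a 1-byte flag, then, when the flag is 1, a 2-byte length and the encoded meta-information ahead of the audio payload) and the names `pack_frame` and `unpack_frame` are hypothetical, not the bit-stream format used in the source.

```python
import struct
from typing import Optional, Tuple

# Hypothetical frame layout (an assumption, not the source's bit-stream format):
#   flag (1 byte) | [meta length (2 bytes, big-endian) | meta bytes] | audio bytes
# The bracketed part is present only when flag == 1.


def pack_frame(audio: bytes, meta: Optional[bytes]) -> bytes:
    """Attach the meta-information presence flag (1 = present, 0 = absent)."""
    if meta is None:
        return struct.pack(">B", 0) + audio
    return struct.pack(">BH", 1, len(meta)) + meta + audio


def unpack_frame(frame: bytes) -> Tuple[Optional[bytes], bytes]:
    """Check the flag and split a frame into (meta or None, audio)."""
    (flag,) = struct.unpack_from(">B", frame, 0)
    if flag == 0:
        return None, frame[1:]
    (length,) = struct.unpack_from(">H", frame, 1)
    return frame[3:3 + length], frame[3 + length:]


meta, audio = unpack_frame(pack_frame(b"\x01\x02", b"meta"))
print(meta, audio)  # b'meta' b'\x01\x02'
```

With this layout, a receiver can skip the meta-information path entirely after reading a single byte whenever the flag is 0.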

[Sound Image Localization Acoustic Meta-Information]
The sound image localization acoustic meta-information always contains at least two items: the "used audio channel numbers" and the "video-matched source flag". It may also include one or more of the following items, as the receiver requires:
- "source continuity flag"
- "sound image position"
- "transmission channel number information"
- "acoustic frequency contour"
- "transmission channel main frequency components"
- "inter-channel level difference (ICLD: Inter-Channel Level Differences)"
- "inter-channel time difference (ICTD: Inter-Channel Time Differences)"
- "inter-channel correlation (ICC: Inter-Channel Coherence)"

The "acoustic frequency contour", the "transmission channel main frequency components", ICLD, ICTD, and ICC are created for each auditory critical band or each half critical band. They are classed as acoustic feature meta-information created for each video-matched sound source.

The "used audio channel numbers" item lists the channel numbers of the audio signals that are used (not silent) in the program sound. The receiving environment may use a speaker arrangement that differs from the standard production conditions. Even then, the playback apparatus can use this item to convert these channel numbers into channel numbers associated with world coordinates in the receiving environment.

The "video-matched source flag" indicates whether a sound source that requires the video and sound image positions to coincide is present. This lets the receiver identify the sources that were matched to the video during production. For example, the flag is set to 1 at the start of such a source and to 0 at its end.

A "video-matched sound source" is a numbered sound source whose image must coincide in position with the video. The "source continuity flag" indicates that such a source continues across adjacent frames. It can serve as reference information when the video-matched source flag is evaluated frame by frame. It is useful, for example, when the receiver applies a smoothing filter to make the sound flow more smoothly between adjacent frames. For example, the flag is 1 while the source continues across adjacent frames and 0 when it ends.

The "sound image position" item gives the position of the sound image in program production. It is one of the acoustic feature meta-information items created for each video-matched sound source. It can be expressed, for example, as world coordinates X, Y, Z in the producer's production environment. That environment description includes the screen size, the placement of each speaker type, and the designated listening position.

The "transmission channel number information" item gives the speaker channel numbers that carry the source's audio. It is one of the acoustic feature meta-information items created for each video-matched sound source. When the sound image lies between speakers, the item indicates the audio channels surrounding the production screen. The transmission channels can generally be determined from the "sound image position" alone. This item is sent anyway, for reference and/or convenience: with it, the receiver can identify the speaker channel numbers immediately, which speeds up channel assignment.

The "acoustic frequency contour" item gives the level of the source at each frequency. It is one of the acoustic feature meta-information items created for each video-matched sound source.

The "transmission channel main frequency components" item gives the main frequency components of the video-matched source in the channels that carry its audio. It is one of the acoustic feature meta-information items created for each video-matched sound source.

The "inter-channel level difference (ICLD)" gives the level difference between the channels that carry the audio.

The "inter-channel time difference (ICTD)" gives the time difference between the channels that carry the audio. It is one of the acoustic feature meta-information items created for each video-matched sound source.

The "inter-channel correlation (ICC)" gives the correlation between the channels that carry the audio. It is one of the acoustic feature meta-information items created for each video-matched sound source. For example, suppose the channels are numbered 0 to 24 and a sound image is formed between two adjacent channels. ICC then expresses the ratio of the signal levels of those adjacent channels.
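To make the three inter-channel cues concrete, the sketch below computes them for one band-limited channel pair. It uses the conventions common in parametric spatial audio coding: ICLD is an energy ratio in dB, ICTD is the lag of the peak of the normalized cross-correlation, and ICC is the value at that peak. These definitions are an assumption. In particular, the text above describes ICC as a ratio of adjacent-channel signal levels, so the source's exact definition may differ. The function name and the 1 ms lag limit are illustrative.

```python
import numpy as np


def inter_channel_cues(x1, x2, fs, max_lag_s=1e-3):
    """Return (ICLD [dB], ICTD [s], ICC) for two equal-length, band-limited signals.

    A positive ICTD means x2 lags x1. The definitions follow common parametric
    stereo practice (an assumption, not taken from the source).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    e1, e2 = np.sum(x1 ** 2), np.sum(x2 ** 2)
    eps = 1e-12
    icld_db = 10.0 * np.log10((e1 + eps) / (e2 + eps))

    max_lag = int(round(max_lag_s * fs))
    norm = np.sqrt(e1 * e2) + eps
    lags = np.arange(-max_lag, max_lag + 1)
    # c[lag] = sum_n x1[n] * x2[n + lag], normalized by the channel energies
    corr = np.array([
        np.dot(x1[max(0, -lag):n - max(0, lag)], x2[max(0, lag):n - max(0, -lag)])
        for lag in lags
    ]) / norm
    best = int(np.argmax(corr))
    return icld_db, lags[best] / fs, corr[best]
```

For a signal copied to a second channel at half amplitude and delayed by 5 samples, this yields an ICLD of about 6 dB, an ICTD of 5 samples, and an ICC close to 1.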

ICLD, ICTD, and ICC can be derived on the playback side from the "used audio channel numbers", so they are not strictly required. They are useful when that derivation would place a heavy processing load on the playback apparatus. Given the processing cost on the playback side, however, it is preferable for the transmitting side to send them.
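The item list above maps naturally onto a record with two mandatory fields and several optional ones. The following is a hypothetical in-memory representation, assuming that band-wise items are stored as one value per critical band. The field names and types are illustrative and are not defined by the source.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class SoundImageMeta:
    """Hypothetical container for one video-matched source's meta-information."""
    # Mandatory items
    used_channels: List[int]                 # "used audio channel numbers"
    av_match: bool                           # "video-matched source flag"
    # Optional items (None = not transmitted); band-wise lists hold one value per band
    continuous: Optional[bool] = None        # "source continuity flag"
    image_position: Optional[Tuple[float, float, float]] = None  # world X, Y, Z
    transmission_channels: Optional[List[int]] = None
    frequency_contour: Optional[List[float]] = None
    main_frequency_bands: Optional[List[int]] = None
    icld_db: Optional[List[float]] = None
    ictd_s: Optional[List[float]] = None
    icc: Optional[List[float]] = None

    def present_optional_items(self) -> List[str]:
        """Names of the optional items actually carried in this frame."""
        return [f.name for f in fields(self)
                if f.default is None and getattr(self, f.name) is not None]


meta = SoundImageMeta([0, 1, 2], True, image_position=(0.5, 0.0, 1.2))
print(meta.present_optional_items())  # ['image_position']
```

Using `None` for untransmitted items lets a receiver check which optional items are present before choosing a conversion method.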

Next, the audio channel signal processing performed by the playback apparatus is described.

[Processing Method of the Playback Apparatus]
FIG. 4 is a flowchart of the audio channel signal processing performed by the playback apparatus. In the receiving environment, the apparatus preserves highly realistic sound reproduction. At the same time, for sounds that must coincide with the video, it adjusts the sound image position to match the display size.

In this description, a sound source that requires the video and sound image positions to coincide is called a "video-matched sound source". A source that does not require this is called an "effect sound source".

The sound image position control in the playback apparatus is explained using the case of moving a sound image in the horizontal plane toward the position of the video display. A sound image can be moved in the vertical (up/down) direction by a similar method.

In step S1, the acoustic localization meta-information separator 12 checks each frame for sound image localization acoustic meta-information. A flag value of 1, for example, marks a frame that carries meta-information and must be processed. A flag value of 0 marks a frame that carries none and needs no processing. When meta-information is present, the separator decodes it and splits it into the individual items, and the process proceeds to step S2.

In step S2, the audio signal localization conversion controller 15 identifies the channels in use from the "used audio channel numbers" item. It then uses the "video-matched source flag" item to determine whether a source is present that requires the video and sound image positions to coincide.
- For an effect sound source, the controller directs the audio signal localization converter 14 to output the audio signals of the identified channels without localization conversion.
- For a video-matched sound source, the converter 14 must be directed to apply localization conversion, and the process proceeds to step S3.

In step S3, the audio signal localization conversion controller 15 handles a source identified as video-matched. It first identifies the production sound image position from the "sound image position" or "transmission channel number information" item. It then extracts the main frequency components of the source using the "transmission channel main frequency components" item.

In step S4, the audio signal localization conversion controller 15 generates the video-matched sound source. It does so by combining the extracted main frequency components of the source with the per-frequency level information taken from the "acoustic frequency contour" item.

In step S5, the controller checks whether all sound sources requiring the video and sound image positions to coincide have been processed. If they have, the process moves to step S6. If any remain, it returns to step S3.

In step S6, the audio signal localization conversion controller 15 directs the audio signal localization converter 14 to mix the generated video-matched sound sources with the effect sound sources. The effect sound sources are sounds that do not require the video and sound image positions to coincide.

In step S7, the audio signal localization converter 14 outputs the mixed sound, via the audio signal reproduction device 16, as the channel audio signals to be emitted by each speaker.
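Steps S1 to S7 can be summarized as one loop per frame. The sketch below makes several assumptions. The decoded production-layout signals are a NumPy array of shape (channels, samples). Each source's meta-information is a dictionary with an `av_match` key. The extraction and regeneration of steps S3 and S4 are delegated to a caller-supplied `relocate` function. Input and output are assumed to use the same channel layout, as in the FIG. 3 case. All names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

import numpy as np

# relocate(signals, meta) -> (component removed at the old position,
#                             component regenerated at the new position)
Relocate = Callable[[np.ndarray, dict], Tuple[np.ndarray, np.ndarray]]


def process_frame(channels: np.ndarray, metas: Optional[List[dict]],
                  relocate: Relocate) -> np.ndarray:
    """Apply the S1-S7 flow to one frame and return receiver-layout signals."""
    # S1: no meta-information in this frame -> nothing to convert
    if not metas:
        return channels
    effect = channels.copy()              # what remains after extraction
    generated = np.zeros_like(channels)   # relocated video-matched sources
    for meta in metas:                    # S5: repeat for every source
        # S2: effect sources are passed through without conversion
        if not meta.get("av_match", False):
            continue
        # S3/S4: remove the source at its original position and regenerate it
        removed, regenerated = relocate(effect, meta)
        effect = effect - removed
        generated = generated + regenerated
    # S6: mix the generated video-matched sources with the effect sound
    # S7: the caller hands the result to the reproduction stage
    return effect + generated
```

Keeping `relocate` pluggable lets either of the signal-level calculation methods described later be used inside the same loop.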

Examples of the methods used to extract the video-matched sound source in step S3 and to generate it in step S4 are described next.

To extract the sound image, the "transmission channel main frequency components" item of the video-matched source's meta-information is used. The frequency bands that this item indicates are set to zero or suppressed. This extracts the sound image component at its original position.

As part of generating the video-matched sound source, its frequency bands are suppressed. The signal of the source's main frequency components is reduced to a level no higher than the signal level of the adjacent frequency bands.
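The extraction and suppression described above can be illustrated with a simple FFT-domain operation on one channel. This is a minimal sketch under stated assumptions. The main frequency component is given as a single band [f_lo, f_hi] in Hz. The "adjacent bands" are taken as equal-width bands directly below and above it. The whole frame is processed with one FFT, without windowing or overlap. The function name is hypothetical.

```python
import numpy as np


def suppress_band(x, fs, f_lo, f_hi, to_zero=False):
    """Suppress [f_lo, f_hi] Hz in x; return (suppressed, removed_component).

    With to_zero=False, each bin in the band is scaled down to at most the mean
    magnitude of the adjacent bands (phase kept). With to_zero=True, it is zeroed.
    """
    x = np.asarray(x, dtype=float)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    width = f_hi - f_lo
    adjacent = (((freqs >= f_lo - width) & (freqs < f_lo))
                | ((freqs > f_hi) & (freqs <= f_hi + width)))
    if to_zero:
        scale = 0.0
    else:
        ref = np.mean(np.abs(spec[adjacent])) if np.any(adjacent) else 0.0
        # Never amplify: bins already below the reference level stay as they are
        scale = np.minimum(1.0, ref / np.maximum(np.abs(spec[band]), 1e-12))
    out = spec.copy()
    out[band] = out[band] * scale
    y = np.fft.irfft(out, n=len(x))
    return y, x - y
```

The second return value is the component removed from the original position. In the flow above, this is what would be relocated and regenerated at the new sound image position.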

The next task is to generate a sound image matched to the display size of the receiving environment. First, the per-channel signal of the video-matched sound source is determined from its meta-information, using one of two combinations of items:
- the "transmission channel main frequency components", "acoustic frequency contour", ICTD, and ICLD; or
- the "acoustic frequency contour", "sound image position", and ICC.

Next, the new sound image position in the receiving environment is determined. The positions of the receiving environment's speakers and display can be freely set, in world coordinates X, Y, Z, in the audio signal localization conversion controller 15 on the playback side. The production-side world coordinates X, Y, Z and the playback-side world coordinates X, Y, Z are defined in advance to correspond to each other.
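One simple way to obtain the new sound image position from the world coordinates is to keep the source at the same relative place on the screen. The sketch assumes that X is horizontal, Y is depth (toward the screen), and Z is vertical. It also assumes that the sound image of a video-matched source lies on the production screen plane. The source does not give the conversion formula, so the proportional mapping itself and the `Display` type are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Display:
    """Screen described in world coordinates (hypothetical helper type)."""
    center: Tuple[float, float, float]   # (X, Y, Z) of the screen center, in meters
    width: float                         # meters
    height: float                        # meters


def map_image_position(pos, prod: Display, recv: Display):
    """Map a production sound image position onto the receiving display.

    The image keeps its position relative to the screen center, scaled by the
    ratio of the screen sizes, and is placed on the receiving screen plane.
    """
    x, _, z = pos
    u = (x - prod.center[0]) / prod.width    # normalized horizontal offset
    v = (z - prod.center[2]) / prod.height   # normalized vertical offset
    return (recv.center[0] + u * recv.width,
            recv.center[1],
            recv.center[2] + v * recv.height)
```

For example, an image 1 m right of center on a 4 m wide production screen maps to 0.25 m right of center on a 1 m wide receiving screen.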

For example, FIG. 5(a) shows what the producer intended. The middle-layer speakers M1 to M5, placed around the production display 18a, output audio signals at the levels given by sound image localization functions 100M1 to 100M5. These form video-matched sound sources P_1 and P_2 and an effect sound source E_1. The display 18b in the receiving environment is smaller than the production display 18a. There, the middle-layer speakers M3 to M5 output audio signals at the levels given by sound image localization functions 100M3 to 100M5, forming video-matched sound sources P_1' and P_2' at the new sound image positions. The effect sound source E_1' is identified as not requiring the video and source positions to coincide, so it is formed in the same way as E_1. FIG. 5 illustrates moving a sound image within the horizontal plane.

In practice, the positions of the video-matched sound sources P_1' and P_2' in the receiving environment are determined mainly from the "sound image position" or "transmission channel main frequency components" item. The apparatus then calculates the per-frequency signal level of each video-matched sound source to be output from speakers M3 to M5 (localization functions 100M3 to 100M5 in FIG. 5).
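The per-speaker signal levels for the new image position could, for example, be obtained with pairwise amplitude panning between the two speakers that enclose the target direction. The source does not specify its sound image localization functions. The tangent panning law with constant-power normalization used here is therefore a swapped-in, commonly used technique, and the speaker azimuths in the usage line are hypothetical.

```python
import math
from typing import Dict


def pair_gains(target_deg: float, left_deg: float, right_deg: float):
    """Tangent-law gains (g_left, g_right) for a target between two speakers.

    Azimuths are in degrees, positive to the left. The gains satisfy
    (gL - gR) / (gL + gR) = tan(target - center) / tan(half_aperture)
    and are normalized to constant power.
    """
    center = (left_deg + right_deg) / 2.0
    half = (left_deg - right_deg) / 2.0
    r = math.tan(math.radians(target_deg - center)) / math.tan(math.radians(half))
    g_left, g_right = 1.0 + r, 1.0 - r
    norm = math.hypot(g_left, g_right)
    return g_left / norm, g_right / norm


def panning_gains(target_deg: float, speakers: Dict[str, float]) -> Dict[str, float]:
    """Gain per speaker name; only the enclosing pair gets nonzero gains."""
    ordered = sorted(speakers.items(), key=lambda kv: kv[1])  # right to left
    gains = {name: 0.0 for name in speakers}
    for (right, az_r), (left, az_l) in zip(ordered, ordered[1:]):
        if az_r <= target_deg <= az_l:
            gains[left], gains[right] = pair_gains(target_deg, az_l, az_r)
            return gains
    raise ValueError("target direction is outside the speaker span")


# Hypothetical azimuths for M3 to M5; M3 and M4 each get about 0.707, M5 gets 0
print(panning_gains(7.5, {"M3": 15.0, "M4": 0.0, "M5": -15.0}))
```

In a per-frequency implementation, the same gains could be applied band by band to the regenerated source spectrum.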

There are several ways to calculate the per-frequency signal level of a video-matched sound source.

(Example 1)
The per-channel signal of the video-matched sound source is determined from four items of its meta-information for the channels carrying the audio: the "acoustic frequency contour", ICTD, ICLD, and the "transmission channel main frequency components". The procedure is as follows:
1. Multiply each frequency component of the transmission channel by the source's "acoustic frequency contour" to obtain the frequency components of the source in each channel.
2. Calculate ICTD and ICLD for each frequency between channels.
3. Compare these values with the ICTD and ICLD in the meta-information and correct the frequency levels.

This determines the per-channel signal of the video-matched sound source. With this method, the production sound image position can be reliably converted to the receiving-environment sound image position frame by frame.
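The level part of Example 1 can be sketched per band for one channel pair. This sketch makes three assumptions: band powers are used in place of full spectra; the "acoustic frequency contour" is applied as a per-band power weight; and the ICLD correction redistributes each band's total power so that the pair matches the transmitted ICLD. The ICTD comparison, which would adjust delay rather than level, is omitted. The function name is hypothetical.

```python
import numpy as np


def example1_band_levels(p1, p2, contour, target_icld_db):
    """Per-band powers of a video-matched source in two channels after
    contour weighting and ICLD correction (sketch; ICTD handling omitted).

    p1, p2: per-band source power in each channel; contour: per-band power
    weights; target_icld_db: transmitted ICLD per band (10*log10(P1/P2)).
    """
    a = np.asarray(p1, dtype=float) * np.asarray(contour, dtype=float)
    b = np.asarray(p2, dtype=float) * np.asarray(contour, dtype=float)
    total = a + b                                    # keep each band's power unchanged
    ratio = 10.0 ** (np.asarray(target_icld_db, dtype=float) / 10.0)  # target P1/P2
    return total * ratio / (1.0 + ratio), total / (1.0 + ratio)
```

Preserving each band's total power means the correction changes only the balance between the two channels, not the loudness of the source.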

(Example 2)
The signal of the video-matched sound source is determined from the “acoustic frequency contour”, “sound image position”, and “ICC” items of the source's sound image localization acoustic meta information for the channels that carry the acoustic signal. First, each frequency component of each transmitted channel is multiplied by the source's “acoustic frequency contour”, which gives the source's frequency components in every channel. Next, the transmission channel numbers are identified from the “sound image position” information. A mixing ratio is then calculated for each channel from the “sound image position” information and these channel numbers, and the signal of the video-matched sound source is determined from these mixing ratios. The frequency levels of this signal are then corrected by comparing it with the “ICC” value in the meta information. The result is the signal of the video-matched sound source for each channel. With this method, the calculation only needs to be performed sequentially for each frame based on the preceding acoustic signal, which reduces the processing load on the reproduction device.
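The patent does not say how the channel numbers are identified from the “sound image position” or how the mixing ratio is computed. The sketch below rests on three assumptions: the position is an azimuth, the speakers of one layer are sorted by azimuth, and a sine/cosine constant-power pan law gives the mixing ratio between the two speakers that bracket the source. The function names and speaker angles are hypothetical, and the ICC-based level correction is not modelled.

```python
import math


def bracketing_pair(azimuth_deg, speaker_azimuths):
    """Return indices (i, i + 1) of the adjacent speakers whose azimuths enclose the
    source azimuth; speakers are assumed to be sorted by azimuth."""
    for i in range(len(speaker_azimuths) - 1):
        if speaker_azimuths[i] <= azimuth_deg <= speaker_azimuths[i + 1]:
            return i, i + 1
    raise ValueError("source azimuth lies outside the speaker arc")


def constant_power_gains(azimuth_deg, az_a, az_b):
    """Sine/cosine pan law: g_a**2 + g_b**2 == 1 for any position between a and b."""
    t = (azimuth_deg - az_a) / (az_b - az_a)  # 0 at speaker a, 1 at speaker b
    return math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)


def mix_source(source_bins, azimuth_deg, speaker_azimuths):
    """Distribute one source's frequency bins over all channels by the mixing ratio."""
    i, j = bracketing_pair(azimuth_deg, speaker_azimuths)
    g_i, g_j = constant_power_gains(azimuth_deg, speaker_azimuths[i], speaker_azimuths[j])
    channels = [[0j] * len(source_bins) for _ in speaker_azimuths]
    channels[i] = [g_i * b for b in source_bins]
    channels[j] = [g_j * b for b in source_bins]
    return channels


speaker_azimuths = [-30.0, 0.0, 30.0]  # hypothetical angles of three middle-layer speakers
channels = mix_source([1 + 0j, 2 + 0j], 15.0, speaker_azimuths)
```

A constant-power law keeps the perceived loudness roughly steady as the image moves between two speakers, which is why it is a common default for pairwise panning.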

(Example 3)
The signal of the video-matched sound source is determined from the “acoustic frequency contour”, “sound image position”, “transmission channel number information”, and “ICC” items of the source's sound image localization acoustic meta information for the channels that carry the acoustic signal. First, each frequency component of each transmitted channel is multiplied by the source's “acoustic frequency contour”, which gives the source's frequency components in every channel. Next, instead of identifying the transmission channel numbers from the “sound image position” information, the transmitted “transmission channel number information” is used directly. A mixing ratio is then calculated for each channel from the “sound image position” information and the transmission channel number information, and the signal of the video-matched sound source is determined from these mixing ratios. The frequency levels of this signal are then corrected by comparing it with the “ICC” value in the meta information. The result is the signal of the video-matched sound source for each channel. With this method, the calculation only needs to be performed sequentially for each frame based on the preceding acoustic signal, which reduces the processing load on the reproduction device even further.
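In Example 3 the channel pair arrives as “transmission channel number information”, so the search over speaker positions drops out and only the mixing ratio and the ICC comparison remain. The sketch below assumes real-valued time-domain frames, a sine/cosine pan law (hypothetical), and an ICC defined as the maximum normalized cross-correlation over a small lag range. That ICC definition is common in parametric spatial audio coding, but the patent does not state it. The function names and the ICC value 0.8 are illustrative only.

```python
import math


def icc(x, y, max_lag=0):
    """Inter-channel correlation: maximum normalized cross-correlation of two
    real-valued frames over lags -max_lag..max_lag (assumed definition)."""
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        return 0.0
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        acc = sum(x[k] * y[k + lag] for k in range(len(x)) if 0 <= k + lag < len(y))
        best = max(best, acc / norm)
    return best


def mix_pair(frame, azimuth_deg, pair, pair_azimuths, num_channels):
    """The channel pair is taken directly from the transmission channel number
    information, so no speaker search is needed; gains follow a sine/cosine pan law."""
    t = (azimuth_deg - pair_azimuths[0]) / (pair_azimuths[1] - pair_azimuths[0])
    gains = (math.cos(t * math.pi / 2), math.sin(t * math.pi / 2))
    out = [[0.0] * len(frame) for _ in range(num_channels)]
    for ch, g in zip(pair, gains):
        out[ch] = [g * s for s in frame]
    return out


frame = [1.0, -2.0, 3.0, 0.5]
out = mix_pair(frame, 10.0, (3, 4), (0.0, 30.0), num_channels=6)
deviation = icc(out[3], out[4]) - 0.8  # 0.8 stands in for a transmitted ICC value
```

Two channels that carry scaled copies of one source have an ICC of 1. A transmitted ICC below 1 therefore indicates that the production mix was partly decorrelated. The patent does not state how the level correction uses this difference, so the sketch stops at measuring it.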

These three examples suggest that other combinations of the sound image localization acoustic meta information can also achieve sound image localization adapted to the reception viewing environment. Each example uses only some of the meta information items to reproduce, in the reception viewing environment, the sound image position intended by the production side. In some cases an acoustic signal is not needed in the reception viewing environment, for example because the reproduction device has no speaker corresponding to one used in production. The reproduction device can then be configured to reconstruct the sound from the acoustic signals of any desired channels instead. In that case, the reproduction device may selectively use only the sound image localization acoustic meta information that this reconstruction requires.
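One way a reproduction device might use only the meta information it needs is to check which items each of the three examples requires and to decode only the items of the method it selects. The item names below follow the description. The function name and the preference order are assumptions. Example 3 is tried first only because the description says it has the lightest processing load.

```python
# Meta information items used by each example in the description.
REQUIRED_ITEMS = {
    "example3": {"acoustic frequency contour", "sound image position",
                 "transmission channel number information", "ICC"},
    "example2": {"acoustic frequency contour", "sound image position", "ICC"},
    "example1": {"acoustic frequency contour", "ICTD", "ICLD",
                 "transmission channel / acoustic main frequency component"},
}


def choose_method(available_items, preference=("example3", "example2", "example1")):
    """Pick the first method (in an assumed order of preference) whose required meta
    information items are all available, and return it with the subset to decode."""
    available = set(available_items)
    for name in preference:
        needed = REQUIRED_ITEMS[name]
        if needed <= available:
            return name, needed
    return None, set()


method, items_to_decode = choose_method(
    {"acoustic frequency contour", "sound image position", "ICC", "ICLD"})
```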

The per-channel signals of the video-matched sound source determined in this way are output from the respective speakers as acoustic signals with new signal levels. FIG. 5 shows an example. At production time, the middle-layer speakers M1 to M5 around the production display 18a output acoustic signals whose signal levels are represented by the sound image localization functions 100M1 to 100M5, respectively. In the reception environment, the middle-layer speakers M3 to M5 output these signals with new signal levels, so that the sound image moves within this horizontal plane.

FIG. 5 covers only the speakers of certain channels. Under the standard production conditions shown in FIG. 7, however, a program is produced with 22 channels arranged three-dimensionally around a large display 18a in front of the listener: the upper layer U1 to U9, the middle layer M1 to M10, and the lower layer L1 to L3, not counting the low-frequency effect speakers LFE₁ and LFE₂. FIG. 8 shows a reception viewing environment with a display 18b smaller than display 18a and speakers arranged in the same world coordinates as at production time. In such an environment, sound image localization conversion is applied frame by frame to each of the 22 channel signals (see FIG. 6).
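The 22-channel arrangement described above can be captured in a small data structure. The sketch below encodes only the labels and layer counts given in the text, because the text gives no coordinates. It treats the two LFE speakers as excluded from the frame-by-frame localization conversion, which is an assumption drawn from the wording “not counting the low-frequency effect speakers”. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    label: str
    layer: str  # "upper", "middle", "lower" or "lfe"


def layout_22_2():
    """Speaker set of the standard production conditions as described for FIG. 7."""
    speakers = [Speaker(f"U{i}", "upper") for i in range(1, 10)]
    speakers += [Speaker(f"M{i}", "middle") for i in range(1, 11)]
    speakers += [Speaker(f"L{i}", "lower") for i in range(1, 4)]
    speakers += [Speaker(f"LFE{i}", "lfe") for i in range(1, 3)]
    return speakers


def convertible_channels(speakers):
    """The 22 full-band channels that undergo frame-by-frame localization conversion."""
    return [s for s in speakers if s.layer != "lfe"]


speakers = layout_22_2()
print(len(speakers), len(convertible_channels(speakers)))  # prints: 24 22
```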

The acoustic signal multiplex transmission system of this embodiment consists of the production device and reproduction device of this embodiment. The video display and speaker conditions at program production may differ greatly from the reception viewing environment. Even then, the system keeps the sound image of each video-matched sound source properly aligned with the picture, which enables sound reproduction with a stronger sense of presence.

The embodiment above has been described as a representative example. It will be apparent to those skilled in the art that many changes and substitutions can be made within the spirit and scope of the invention. For example, the spread angle of the three front center speakers in the reception viewing environment may differ from the angle used at program production (e.g., 60 degrees). In that case, placing the left and right speakers on both sides of the video display allows approximately high-quality reproduction with the same processing as in the embodiment. More generally, the reception viewing environment may not arrange its speakers in the same world coordinates as at production time. It will be apparent to those skilled in the art that the speakers to drive and the acoustic signals to output can then be derived from the production-time speaker arrangement information (“sound image position”) by conversion in world coordinates. In addition, although the embodiment handles acoustic signals with a fixed frame length, a variable frame length may be used instead. In that case, additional information identifying each frame length is either sent in advance or added to each frame header. Accordingly, the invention should not be construed as limited by the embodiment described above, but only by the claims.
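The variable-frame-length option can be handled by carrying each frame's length in its header. The sketch below assumes a hypothetical header that consists only of a 4-byte big-endian payload length. A real header would also carry the meta information and other fields that the description does not specify.

```python
import struct

# Hypothetical frame header: 4-byte big-endian payload length in bytes.
HEADER = struct.Struct(">I")


def pack_frames(payloads):
    """Prefix each variable-length frame with its length so the receiver can split them."""
    return b"".join(HEADER.pack(len(p)) + p for p in payloads)


def unpack_frames(stream):
    """Split a byte stream back into frames using the length in each frame header."""
    frames, pos = [], 0
    while pos < len(stream):
        if pos + HEADER.size > len(stream):
            raise ValueError("truncated frame header")
        (length,) = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        if pos + length > len(stream):
            raise ValueError("truncated frame payload")
        frames.append(stream[pos:pos + length])
        pos += length
    return frames


stream = pack_frames([b"frame-1", b"f2", b""])
```

The alternative mentioned in the text, sending the frame lengths in advance, would replace the per-frame header with a length table transmitted before the frames.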

The present invention enables sound reproduction with a high sense of presence for ultra-high-definition television systems. It is therefore useful for video and audio signal multiplex transmission systems.

FIG. 1 is a diagram showing the production device in the acoustic signal multiplex transmission system with attached sound image localization acoustic meta information according to an embodiment of the present invention.
FIG. 2 is a diagram showing the reproduction device in the acoustic signal multiplex transmission system with attached sound image localization acoustic meta information according to the embodiment.
FIG. 3 is a diagram showing an example of the middle-layer speaker arrangement in a reception viewing environment whose display is smaller than under the standard production conditions, according to the embodiment.
FIG. 4 is a flowchart of the acoustic channel signal processing method performed by the reproduction device of the embodiment.
FIG. 5 is a schematic diagram of sound image localization conversion based on the sound image localization functions of the embodiment.
FIG. 6 is a schematic diagram of the frame-unit acoustic signal of the embodiment.
FIG. 7 is a diagram showing an example arrangement of the display and speakers under the standard production conditions at production time.
FIG. 8 is a diagram showing an example arrangement of the display and speakers in a reception viewing environment.

Explanation of symbols

1 Sound recording and reproducing device
2 Sound mixing console
3 Acoustic signal production-to-transmission channel converter
4 Acoustic signal encoder
5 Sound image localization acoustic meta information multiplexer
6 Acoustic signal / meta information multiplexer
7 Sound mixing control score data generator
8 Microphone
11 Acoustic signal / meta information demultiplexer
12 Sound image localization acoustic meta information separator
13 Acoustic signal decoder
14 Acoustic signal localization converter
15 Acoustic signal localization conversion control unit
16 Acoustic signal reproduction device
17 Speaker set
18a Display at production time
18b Display in the reception environment
LFE₁, LFE₂ Low-frequency effect speakers
U1 to U9 Upper-layer speakers
M1 to M10 Middle-layer speakers
L1 to L3 Lower-layer speakers
100M3 to 100M5 Sound image localization functions
P₁, P₂ Sound image positions of the “video-matched sound sources” at production time
E₁ Sound image position of the “effect sound source” at production time
P₁′, P₂′ Sound image positions of the “video-matched sound sources” in the reception environment
E₁′ Sound image position of the “effect sound source” in the reception environment

Claims

An acoustic signal multiplex transmission system comprising: a production device that multiplexes predetermined sound image localization acoustic meta information onto an acoustic signal produced under standard production conditions and transmits the result; and a reproduction device that receives the acoustic signal, converts it into a new acoustic signal based on the sound image localization acoustic meta information, and reproduces the new acoustic signal, wherein
the production device comprises:
means for generating, in frame units, sound image localization acoustic meta information for multi-channel acoustic signals that are synchronized with the video shown on the display during program production under the standard production conditions and are mixed for a predetermined number of speakers, the meta information indicating a sound attribute that tells whether the sound requires its sound source position to match the video and, for a sound that does, including its sound image localization position under the standard production conditions and acoustic feature meta information related to that sound image localization; and
means for encoding the acoustic signal produced under the standard production conditions and transmitting it together with the sound image localization acoustic meta information; and
the reproduction device comprises:
means for receiving the encoded acoustic signal and the sound image localization acoustic meta information from the production device and decoding them;
means for identifying, frame by frame and based on the decoded sound image localization acoustic meta information, whether the decoded acoustic signal is a sound that requires its sound source position to match the video;
means for determining, only for a sound that requires such matching, a sound image position suited to the display and the predetermined number of speakers of the reception viewing environment, which are associated in advance with the standard production conditions, by coordinate conversion of the sound image localization position information under the standard production conditions;
means for converting, for each sound source and based on the acoustic feature meta information related to the sound image localization, the acoustic signal for the coordinate-converted sound image localization position into acoustic signals of predetermined channels corresponding to the speakers of the reception viewing environment; and
means for combining the converted acoustic signals of the predetermined channels with the acoustic signals of the predetermined channels that do not require matching between the video and the sound source position, and reproducing the result from the speakers corresponding to the predetermined channels,
whereby, in a reception viewing environment with a display whose size differs from that under the standard production conditions, the sound image position of a sound that requires matching between the video and the sound image is controlled to suit the size of that display.

The acoustic signal multiplex transmission system according to claim 1, wherein the sound image localization acoustic meta information includes, as the acoustic feature meta information, information on “transmission channel / acoustic main frequency component”, “acoustic frequency contour”, “inter-channel level differences (ICLD)”, and “inter-channel time differences (ICTD)”, and
the means for converting into acoustic signals of predetermined channels corresponding to the speakers of the reception viewing environment determines the signal of the video-matched sound source for each channel using the combined information of the “transmission channel / acoustic main frequency component”, “acoustic frequency contour”, “ICTD”, and “ICLD”.

The acoustic signal multiplex transmission system according to claim 1, wherein the sound image localization acoustic meta information includes, as the acoustic feature meta information, information on “acoustic frequency contour”, “sound image position”, and “inter-channel correlation (ICC)”, and
the means for converting into acoustic signals of predetermined channels corresponding to the speakers of the reception viewing environment determines the signal of the video-matched sound source for each channel using the combined information of the “acoustic frequency contour”, “sound image position”, and “ICC”.

The acoustic signal multiplex transmission system according to claim 1, wherein the sound image localization acoustic meta information includes, as the acoustic feature meta information, information on “acoustic frequency contour”, “sound image position”, “transmission channel number information”, and “inter-channel correlation (ICC)”, and
the means for converting into acoustic signals of predetermined channels corresponding to the speakers of the reception viewing environment determines the signal of the video-matched sound source for each channel using the combined information of the “acoustic frequency contour”, “sound image position”, “transmission channel number information”, and “ICC”.

A production device that multiplexes predetermined sound image localization acoustic meta information onto an acoustic signal produced under standard production conditions and transmits the result, the production device comprising:
means for generating, in frame units, sound image localization acoustic meta information for multi-channel acoustic signals that are synchronized with the video shown on the display during program production under the standard production conditions and are mixed for a predetermined number of speakers, the meta information indicating a sound attribute that tells whether the sound requires its sound source position to match the video and, for a sound that does, including its sound image localization position under the standard production conditions and acoustic feature meta information related to that sound image localization; and
means for encoding the acoustic signal produced under the standard production conditions and transmitting it together with the sound image localization acoustic meta information.

A reproduction device that receives an acoustic signal produced under predetermined standard production conditions, converts it into a new acoustic signal based on the sound image localization acoustic meta information attached to it, and reproduces the new acoustic signal, the reproduction device comprising:
means for receiving the encoded acoustic signal and the encoded sound image localization acoustic meta information and decoding each of them;
means for identifying, frame by frame and based on the decoded sound image localization acoustic meta information, whether the decoded acoustic signal is a sound that requires its sound source position to match the video;
means for converting, only for a sound that requires such matching, the sound image localization position information under the standard production conditions into a sound image position suited to the display and a predetermined number of speakers of the reception viewing environment, which are associated in advance with the standard production conditions;
means for converting, for each sound source and based on the acoustic feature meta information related to the sound image localization, the acoustic signal for the coordinate-converted sound image localization position into acoustic signals of predetermined channels corresponding to the speakers of the reception viewing environment; and
means for combining the converted acoustic signals of the predetermined channels with the acoustic signals of the predetermined channels that do not require matching between the video and the sound source position, and reproducing the result from the speakers corresponding to the predetermined channels.