JP2007158396A - Video/audio synchronization transmission apparatus

Info

Publication number: JP2007158396A
Application number: JP2005346474A
Authority: JP
Inventors: Ayako Nemoto; 亞矢子根本; Hideki Inomata; 英樹猪股
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-11-30
Filing date: 2005-11-30
Publication date: 2007-06-21

Abstract

PROBLEM TO BE SOLVED: To provide a video/audio synchronization transmission apparatus that clarifies the causal relationship between a subject image on the screen and its corresponding audio, so that a viewer can identify the sound source on the screen, and that transmits the audio signal in synchronization with the subject image in the video signal. SOLUTION: The video/audio synchronization transmission apparatus includes: an identification/position generating means for generating, from the video signal, an identifier used to identify a subject image that emits sound, together with positional information denoting the position of the subject image on the screen; an identification information attaching means for attaching the identifier of the subject image and the positional information to the video signal; and a synchronization signal synthesis means for multiplexing the audio signal detected for that subject image onto the video signal at the position of the subject image, on the basis of the positional information, and outputting the resulting signal. COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a video/audio synchronous transmission apparatus that, when transmitting separately generated video and audio signals, transmits a subject image and its sound in synchronization.

In recent years, theaters, movie theaters, and similar venues have come to display images by projecting them onto large screens based on a video signal. Screen sizes are also increasing in home AV systems with flat-panel displays and the like. Meanwhile, for the sound accompanying such large-screen video, systems using multi-channel speakers, such as stereo or 5.1-channel surround, have been applied to create a sense of presence. For example, in the 2-channel stereo system, the audio signals transmitted in a television program or recorded on a recording medium are transmitted or recorded separately for each of the two channels, and outputting each channel from the left and right speakers makes it possible to convey a left-right spread of sound, albeit a planar one. However, if the viewer's position relative to the screen is off-center, the sound from either the left or the right speaker becomes dominant, and the spread of sound is hardly perceived. To improve on this, the 5.1-channel surround system realizes a three-dimensional, realistic sound environment using six speakers: speakers placed at the front, right front, left front, right rear, and left rear, plus a subwoofer for low-frequency output that is usually placed at the front.

However, these systems pursue the spread of sound in the horizontal direction and give little consideration to its spread in the vertical direction. When they are used in a video display system with a very large screen, the sound source on the screen (the position in the picture where the sound originates, the position of the subject image emitting the sound, and so on) and the actual sound source (the speaker) become separated, so the sound is reproduced at an unnatural position relative to the video and the sense of presence is impaired. To address this problem, a method has been proposed for providing a realistic video display screen by arranging a plurality of speakers in a matrix behind the display screen and providing control means that matches the position of the sound source on the screen with the position of a speaker (see, for example, Patent Document 1). However, Patent Document 1 does not disclose in detail how the audio signal is controlled to specify the position of the sound source on the screen when reproducing an audio signal recorded on a recording medium.

Meanwhile, when encoding the video and audio signals of television programs and the like for storage on recording media such as HDDs and DVDs, there is a method of improving perceived image quality by allocating a larger amount of information to the video scenes and subjects that draw particular attention on the screen, thereby using the fixed encoding rate efficiently. One approach within this method is to detect persons as points of interest. However, when several persons are present, the information allocated to them is spread out, and little improvement in perceived image quality can be expected. In such a case, being able to determine which subject is emitting sound would provide useful information for allocating the amount of information within the screen.

Patent Document 1: JP-A-2-224495

Video and audio are reproduced in a large-screen video display system such as that disclosed in Patent Document 1 as described above, but this approach has the following problem. In general, when synchronizing the video and audio of television programs and the like, the video signal is captured by a camera and the audio signal by a microphone as separate signals, and they are multiplexed onto the transmission path while remaining separate. In this case, whether for playback on an acoustic system for a large screen or for encoding control that allocates the amount of information efficiently within the screen, the causal relationship between a subject image on the screen and its sound is unclear, and the viewer cannot identify the sound source on the screen.

The present invention has been made to solve the above problem. Its object is to obtain a video/audio synchronous transmission apparatus that clarifies the causal relationship between a subject image on the screen and its corresponding sound, and transmits the audio signal in synchronization with the subject image in the video signal, so that the viewer can identify the sound source on the screen.

The video/audio synchronous transmission apparatus according to the present invention synchronizes and transmits, to a utilization device, a video signal and audio signals that were created separately during video content production, each audio signal carrying an identifier associated with one of the sound-emitting subject images that appear on the screen of the video signal. The apparatus comprises: identification/position generation means for generating, from the input video signal, an identification signal representing an identifier for identifying a subject image that emits sound, and for generating position information representing the position of that subject image on the screen; identification information adding means for adding the generated identification signal and position information to the video signal; and synchronization signal synthesis means for detecting, from among the input audio signals and based on the identification signal added to the video signal, the audio signal whose identifier corresponds to the identifier of the subject image, multiplexing the detected audio signal at the position of the subject image in the video signal based on the position information added to the video signal, and outputting the result as an AV multiplexed signal.

According to the present invention, when video is displayed on a large-screen video display system, the audio signal separated from the AV composite signal can be supplied to the acoustic system and, using the position information of the sound source on the screen, made to sound as if it were emitted from the corresponding subject image on the playback screen.

Embodiment 1.
FIG. 1 is a block diagram showing the functional configuration of a video/audio synchronous transmission apparatus according to Embodiment 1 of the present invention.
First, when a television program or the like is recorded, each subject that emits sound (a person, an animal, or anything else that emits sound) is associated with the sound it emits. As shown in FIG. 2, when there are, for example, three persons as sound-emitting subjects, marks a1, a2, and a3 that can be identified by image processing are attached to the actual subjects in advance. A barcode, a light emitter, or the like is used as the mark. The video signal 101 obtained from the camera therefore contains subject images bearing these marks. When an animation is produced by frame-by-frame shooting with a camera, the parts that emit sound can be handled in the same way as the above subjects if identifiable marks are attached to them.
Meanwhile, as shown in FIG. 1, the audio signals (encoded digital data) 104 to 106, which are input during post-recording from microphones provided for the respective persons (subjects), are associated so that each person is matched with the audio signal that person emits. Specifically, identifier b1 is added to the audio signal 104 of the subject bearing mark a1, identifier b2 to the audio signal 105 of the subject bearing mark a2, and identifier b3 to the audio signal 106 of the subject bearing mark a3.
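
The association between marks and audio identifiers can be pictured as a pair of lookup tables. The following is a minimal sketch only: the mark names, identifiers, and signal numerals come from the description, while the dictionary layout and the function name `audio_signal_for_mark` are assumptions made for illustration and are not part of the disclosure.

```python
# Illustrative association tables. The mark names (a1..a3), identifiers
# (b1..b3) and signal numerals (104..106) come from the description; the
# dict layout and the function name are assumptions made for this sketch.
MARK_TO_IDENTIFIER = {"a1": "b1", "a2": "b2", "a3": "b3"}
IDENTIFIER_TO_AUDIO_SIGNAL = {"b1": 104, "b2": 105, "b3": 106}


def audio_signal_for_mark(mark: str) -> int:
    """Return the reference numeral of the audio signal tied to a mark."""
    identifier = MARK_TO_IDENTIFIER[mark]
    return IDENTIFIER_TO_AUDIO_SIGNAL[identifier]
```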

In FIG. 1, the subject recognition unit (identification/position generation means) 1 recognizes, in the input video signal (analog signal) 101, the marks attached in advance to the actual subjects as described above from the corresponding subject images. It generates an identification signal 102 consisting of an identifier that identifies the subject image corresponding to each mark, and also generates position information 103 indicating where on the screen that mark is located. The AD conversion/encoding unit (identification information adding means) 10 converts the video signal (analog signal) 101 to digital form, encodes it by a predetermined method, adds the identification signal 102 and the position information 103, and outputs the result as a video signal (digital data) 111. The synchronization signal synthesis unit 2 detects, from among the input audio signals 104 to 106 and based on the identification signal 102 and position information 103 input from the subject recognition unit 1, the audio signal whose identifier corresponds to the identifier of the subject image. Based on the position information added to the video signal, it multiplexes the detected audio signal at the position of the subject image in the video signal 111 and outputs the result as an AV multiplexed signal.

Next, the operation will be described.
First, the subject recognition unit 1 identifies, in the video signal (analog signal) 101 input from the camera, the marks attached to the actual subjects as shown in FIG. 2, and generates an identification signal 102 consisting of an identifier indicating that a subject image corresponds to a given mark. The identifier of the subject image is generated in association with the identifier of the audio signal input from the microphone. At the same time, the subject recognition unit 1 generates position information (address) 103 that expresses, in pixel units, the position of the mark on the screen (or the position of the subject image bearing the mark). The video signal (analog signal) 101 input from the camera is converted to digital form by the AD conversion/encoding unit 10, encoded by a predetermined method, given the identification signal 102 and the position information 103, and output as the video signal (digital signal) 111.
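
The recognition step described above produces, for each mark found in a frame, an identifier and a pixel address. A rough sketch of that output follows. It assumes that mark detection itself (for example, barcode reading) has already been done by some hypothetical detector, so that the frame is given as a grid whose cells hold either `None` or a mark name. Taking the first pixel in raster order as the address is also an assumption, since the description does not say which pixel of a mark is used.

```python
def recognize_subjects(frame, mark_to_identifier):
    """Return {identifier: (x, y)} for every known mark present in the frame.

    `frame` is a list of rows; each cell is None or a mark name such as "a1".
    The address reported for a mark is the first pixel, in raster order,
    where the mark appears (an assumption made for this sketch).
    """
    found = {}
    for y, row in enumerate(frame):
        for x, cell in enumerate(row):
            identifier = mark_to_identifier.get(cell)
            if identifier is not None and identifier not in found:
                found[identifier] = (x, y)
    return found
```

Here the keys of the returned dictionary play the role of the identification signal 102 and the values play the role of the position information 103.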

The video signal 111 and the post-recording audio signals 104 to 106 from the microphones are input separately to the synchronization signal synthesis unit 2. The synchronization signal synthesis unit 2 first detects, from among the audio signals 104 to 106 and based on the identification signal 102 added to the video signal 111, the audio signal whose identifier corresponds to the identifier of the subject image, and extracts it in association with a single pixel or a group of pixels of the video. Next, the synchronization signal synthesis unit 2 multiplexes the audio signal, thus associated with a pixel or a group of pixels, at the position in the video signal 111 where the mark indicated by the position information 103 corresponding to the identification signal 102 is located, and outputs the result as an AV multiplexed signal 107.

FIG. 3 illustrates, as an image on the screen, this method of multiplexing the video and audio signals. The audio signal that has a causal relationship with a subject image is assigned, in units of one pixel or a group of pixels, to the position on the screen of the subject image that is its sound source, and the audio signal is thereby multiplexed onto the video signal. In this case, the audio signal is no longer carried at a fixed rate.
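
The matching and multiplexing performed by the synchronization signal synthesis unit 2 can be sketched as follows. This is a simplified illustration under stated assumptions: the `VideoFrame` and `AVFrame` containers, their field names, and the representation of audio as a list of samples are all hypothetical, and the actual bitstream format of the AV multiplexed signal 107 is not given in the description.

```python
from dataclasses import dataclass, field


@dataclass
class VideoFrame:
    pixels: list    # encoded picture data, treated as opaque here
    subjects: dict  # identifier -> (x, y): identification signal + position info


@dataclass
class AVFrame:
    pixels: list
    audio_at: dict = field(default_factory=dict)  # (x, y) -> audio samples


def synthesize(frame, audio_by_identifier):
    """Attach each matching audio signal at its subject's pixel position."""
    out = AVFrame(pixels=frame.pixels)
    for identifier, position in frame.subjects.items():
        samples = audio_by_identifier.get(identifier)
        if samples is not None:  # only audio whose identifier matches is used
            out.audio_at[position] = samples
    return out
```

Because audio is attached only where a sound-emitting subject is present in the frame, the amount of audio data per frame varies, which is consistent with the remark that the audio signal is no longer at a fixed rate.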

As described above, according to Embodiment 1, the video and audio signals of a television program or the like are not transmitted separately. Instead, the audio signal corresponding to a subject image is assigned, in units of one pixel or a group of pixels, to the position on the screen of the subject image that is its sound source, so that the causally related video and audio signals are multiplexed in complete synchronization and transmitted as an AV composite signal. Therefore, when video is displayed on a large-screen video display system, the audio signal separated from the AV composite signal can be supplied to the acoustic system and, using the position information (address) of the sound source on the screen, made to sound as if it were emitted from the corresponding subject image on the playback screen. In encoding, efficient allocation of the amount of information within the screen also becomes possible. Furthermore, because a mark identifiable by image processing is attached to the subject itself when it is shot with the camera, and recognizing that mark fixes the on-screen position of the subject that is the sound source of the audio signal, the AV composite signal can be created automatically.
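
On the playback side, one way the position information could be used is to select a speaker from a matrix of speakers arranged behind the screen, as in the arrangement of Patent Document 1. The sketch below is an assumption about how a utilization device might do this; the description does not specify the playback algorithm, and the function name and parameters are hypothetical.

```python
def speaker_for_position(x, y, width, height, cols, rows):
    """Map a pixel address to the (row, col) of the nearest speaker cell.

    The screen of `width` x `height` pixels is divided evenly into a grid of
    `rows` x `cols` speakers; this even division is an illustrative choice.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("position lies outside the picture")
    col = x * cols // width
    row = y * rows // height
    return row, col
```

For a 1920 x 1080 picture and a 4 x 3 speaker matrix, a sound source at pixel (960, 540) would be routed to the speaker in row 1, column 2.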

Embodiment 2.
FIG. 4 is a block diagram showing the functional configuration of a video/audio synchronous transmission apparatus according to Embodiment 2 of the present invention. In the figure, parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted in principle. Embodiment 2 includes an audio signal multiplexing unit 3 in place of the synchronization signal synthesis unit 2 of Embodiment 1.
The audio signal multiplexing unit 3 associates each subject image contained in the video signal 101 input from the camera with the audio signals 104 to 106 input from the post-recording microphones assigned to the respective subjects, based on the identification signal 102, which consists of the identifier of the subject image, and the position information 103 of that subject image, both input from the subject recognition unit (identification/position generation means) 1. It multiplexes onto each audio signal the identifier of the causally related subject image, and generates and outputs an audio multiplexed signal 108. In Embodiment 2, the AD conversion/encoding unit 10 may add only the identification signal to the video signal 111.

Next, the operation will be described.
When the audio signals 104 to 106 from the post-recording microphones assigned to the respective subjects are input, the audio signal multiplexing unit 3 first detects, from among the audio signals 104 to 106 and based on the identification signal 102 generated by the subject recognition unit 1, the audio signal whose identifier corresponds to the identifier of the subject image. It then multiplexes onto the detected audio signal the identifier of the subject image and the position information (address) 103 of that subject image obtained from the subject recognition unit 1, and outputs the result as the audio multiplexed signal 108. In other words, position information indicating where on the screen the sound-source subject image is located is multiplexed onto each audio signal. The audio multiplexed signal 108 generated in this way, carrying the position information (address) of each subject image, is transmitted to the utilization device as a signal separate from the video signal 111.
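
A simple way to picture the audio multiplexed signal 108 is as an audio payload preceded by a small header carrying the subject image identifier and its pixel address. The sketch below uses Python's standard `struct` module. The header layout (a 4-byte ASCII identifier followed by two 16-bit coordinates, big-endian) is purely an assumption for illustration, as the description does not define a packet format.

```python
import struct

# Hypothetical header: 4-byte identifier, then x and y as unsigned 16-bit
# integers, big-endian with no padding (8 bytes in total).
HEADER = struct.Struct(">4sHH")


def mux_audio(identifier, position, payload):
    """Prefix an audio payload with its subject identifier and position."""
    x, y = position
    return HEADER.pack(identifier.encode("ascii"), x, y) + payload


def demux_audio(packet):
    """Recover (identifier, (x, y), payload) from a multiplexed packet."""
    raw_id, x, y = HEADER.unpack_from(packet)
    identifier = raw_id.rstrip(b"\0").decode("ascii")
    return identifier, (x, y), packet[HEADER.size:]
```

A utilization device receiving such packets alongside the separately transmitted video signal could read the position from the header without decoding the video.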

As described above, according to Embodiment 2, the video and audio signals of a television program or the like are transmitted as separate signals, as in the conventional approach, but the position information of the sound source on the screen is multiplexed onto the audio signal. Therefore, when video is displayed on a large-screen video display system, the audio signal can be supplied to the acoustic system and, using the sound-source position information detected from the audio multiplexed signal, made to sound as if it were emitted from the corresponding subject image on the playback screen. As in Embodiment 1, efficient allocation of the amount of information within the screen also becomes possible in encoding.

Embodiment 3.
In Embodiment 2, the audio multiplexed signal is generated by multiplexing onto the audio signal the position information (address) of the on-screen subject image of the subject that is its sound source. Alternatively, when generating the audio multiplexed signal, the position information (address) may be embedded in the audio signal using digital watermarking. In this case, the audio signal multiplexing unit 3 in FIG. 4 is given a function for embedding the position information (address) of the on-screen subject image by digital watermarking. This provides the same effects as Embodiment 2.
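
The description does not name a particular watermarking technique. As one illustrative possibility only, the sketch below hides the pixel address in the least significant bits (LSBs) of integer audio samples. LSB embedding is chosen here for brevity and is an assumption; it is not robust to lossy encoding, and a practical system would likely need a more resilient scheme.

```python
def embed_position(samples, x, y, bits=16):
    """Write x then y (`bits` bits each) into the LSBs of the first samples."""
    if not (0 <= x < (1 << bits) and 0 <= y < (1 << bits)):
        raise ValueError("coordinate does not fit in the chosen bit width")
    n = 2 * bits
    if len(samples) < n:
        raise ValueError("not enough samples to carry the position")
    payload = (x << bits) | y
    out = list(samples)
    for i in range(n):
        bit = (payload >> (n - 1 - i)) & 1
        out[i] = (out[i] & ~1) | bit  # replace only the lowest bit
    return out


def extract_position(samples, bits=16):
    """Read back (x, y) from the LSBs of the first 2 * bits samples."""
    payload = 0
    for sample in samples[: 2 * bits]:
        payload = (payload << 1) | (sample & 1)
    return payload >> bits, payload & ((1 << bits) - 1)
```

Each carrier sample changes by at most one quantization step, so the audible effect is small while the address remains recoverable from the audio alone.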

Embodiment 4.
FIG. 5 is a block diagram showing the functional configuration of a video/audio synchronous transmission apparatus according to Embodiment 4 of the present invention. In the figure, parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted in principle. Embodiment 4 includes a video signal multiplexing unit 4 in place of the synchronization signal synthesis unit 2 of Embodiment 1.
The video signal multiplexing unit 4 detects, from among the input audio signals 104 to 106 and based on the identification signal 102 and position information 103 input from the subject recognition unit 1, the audio signal whose identifier corresponds to the identifier of the subject image. It embeds the identifier added to the detected audio signal, as digital watermark data, in the video signal 111 from the identification information adding means, and outputs the result as a video multiplexed signal.

Next, the operation will be described.
When the video signal 111 is input, the video signal multiplexing unit 4 detects, from among the audio signals input from the microphones and based on the identification signal 102 contained in the video signal, the audio signal whose identifier corresponds to the identifier of the subject image. It then extracts the identifier added to the detected audio signal, embeds it in the video signal 111 as digital watermark data, and outputs the result as a video multiplexed signal 109. The video multiplexed signal 109 generated by the video signal multiplexing unit 4 and the audio signals 104 to 106 of the microphones are transmitted to the utilization device as separate signals.
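
Embedding the audio identifier in the video can be illustrated in a similar spirit. The sketch below writes the ASCII bytes of an identifier into the least significant bits of 8-bit pixel values, starting at a chosen offset. The flat pixel list, the `start` offset, and the LSB technique itself are all assumptions for illustration, since the description specifies only that the identifier is embedded as digital watermark data.

```python
def embed_identifier(pixels, identifier, start=0):
    """Hide an ASCII identifier in the LSBs of 8-bit pixels from `start`."""
    bits = [(byte >> (7 - k)) & 1
            for byte in identifier.encode("ascii")
            for k in range(8)]
    if start < 0 or start + len(bits) > len(pixels):
        raise ValueError("identifier does not fit at this offset")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[start + i] = (out[start + i] & 0xFE) | bit
    return out


def extract_identifier(pixels, length, start=0):
    """Read back `length` ASCII characters embedded from `start`."""
    data = bytearray()
    for b in range(length):
        byte = 0
        for k in range(8):
            byte = (byte << 1) | (pixels[start + b * 8 + k] & 1)
        data.append(byte)
    return data.decode("ascii")
```

With the identifier recovered from the video, a utilization device could pick out the matching audio signal from the separately transmitted signals 104 to 106.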

As described above, according to Embodiment 4, the video and audio signals of a television program or the like are transmitted as separate signals, as in the conventional approach, but the identifier of the audio signal corresponding to each sound source is multiplexed onto the video signal, to which the position information of the sound source on the screen has been added. Therefore, when video is displayed on a large-screen video display system, the position information of the sound-source subject image detected from the video multiplexed signal can be used to detect the corresponding audio signal and supply it to the acoustic system, so that the sound seems to be emitted from the corresponding subject image on the playback screen. In encoding, efficient allocation of the amount of information within the screen also becomes possible.

Embodiment 5.
FIG. 6 is a block diagram showing the functional configuration of a video/audio synchronous transmission apparatus according to Embodiment 5 of the present invention. In the figure, parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted in principle. Embodiment 5 includes a subject identification data generation unit (identification/position generation means) 5 in place of the subject recognition unit 1 of Embodiment 1.
In Embodiment 5, the method of associating a sound-emitting subject (a person, an animal, or anything else that emits sound) with the sound it emits differs from that of the preceding embodiments. Here, the video of the video signal (analog) 101 input from the camera or a video storage device is ordinary video in which the sound-emitting subjects do not bear the marks a1, a2, and a3 identifiable by image processing.
The subject identification data generation unit 5 reproduces the input video signal 101 on a monitor and, when an input pen is brought into contact with a sound-source subject image on that screen, generates an identification signal consisting of an identifier for identifying that subject image and also generates position information 103 representing the position of that subject image on the screen.

Next, the operation will be described.
When the video signal 101 is input from the camera or the video storage device, the subject identification data generation unit 5 reproduces it on the monitor. The program producer touches the input pen to a sound-source subject image on the screen. When there are several subject images, either the pen's subject-selection mode is switched for each subject image, or several people use input pens set to different modes. The subject identification data generation unit 5 thereby generates an identifier representing the subject image and produces the identification signal 102. At the same time, it generates position information 103 indicating where on the screen the subject image is located, and inputs it to the AD conversion/encoding unit 10 together with the identification signal 102. The subsequent processing is the same as in Embodiment 1.
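
The pen-based designation can be sketched as an event handler that turns a pen mode and a touch coordinate into an identifier and a pixel address. The mode numbers, the mapping table `PEN_MODE_TO_IDENTIFIER`, and the function name are hypothetical; the description says only that different pen modes distinguish different subject images.

```python
# Hypothetical mapping from input-pen mode to subject image identifier.
PEN_MODE_TO_IDENTIFIER = {1: "b1", 2: "b2", 3: "b3"}


def on_pen_touch(mode, x, y, width, height):
    """Return the identification signal and position info for one pen touch."""
    if mode not in PEN_MODE_TO_IDENTIFIER:
        raise ValueError("unknown pen mode")
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("touch lies outside the picture")
    return {"identifier": PEN_MODE_TO_IDENTIFIER[mode], "position": (x, y)}
```

The returned pair corresponds to the identification signal 102 and the position information 103 that are passed on to the AD conversion/encoding unit 10.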

As described above, according to Embodiment 5, the identification signal and position information of a sound-source subject image on the screen are generated from an ordinary video signal, so the apparatus can also handle video in which the subjects themselves are not marked, as well as the dubbing of video content obtained from abroad. In addition, with the method of marking the subject itself used in the preceding embodiments, a mark may fail to be recognized depending on the distance or orientation of the subject. In Embodiment 5, the designation is applied to the part of the subject image that actually appears on the screen, so although manual work is involved, an identifier that reliably specifies the target subject can be assigned.
Applying the subject identification data generation unit 5 to any one of Embodiments 2 to 4 provides the same respective effects.

FIG. 1 is a block diagram showing the functional configuration of the video/audio synchronous transmission apparatus according to Embodiment 1 of the present invention.
FIG. 2 is an explanatory diagram showing the method of associating subjects with sound according to Embodiment 1 of the present invention.
FIG. 3 is an explanatory diagram illustrating the method of multiplexing the video signal of a specified image on the screen with its corresponding audio signal according to Embodiment 1 of the present invention.
FIG. 4 is a block diagram showing the functional configuration of the video/audio synchronous transmission apparatus according to Embodiment 2 of the present invention.
FIG. 5 is a block diagram showing the functional configuration of the video/audio synchronous transmission apparatus according to Embodiment 4 of the present invention.
FIG. 6 is a block diagram showing the functional configuration of the video/audio synchronous transmission apparatus according to Embodiment 5 of the present invention.

Explanation of symbols

1 subject recognition unit, 2 synchronization signal synthesis unit, 3 audio signal multiplexing unit, 4 video signal multiplexing unit, 5 subject identification data generation unit, 10 AD conversion/encoding unit.

Claims

1. A video/audio synchronous transmission apparatus that synchronizes a video signal with audio signals and transmits them to a user device, the video signal and the audio signals being created separately during video content production, and each audio signal carrying an identifier associated with a subject image that appears on the screen of the video signal and emits sound, the apparatus comprising:
identification/position generating means for generating, from an input video signal, an identification signal representing an identifier that identifies a subject image emitting sound, and position information representing the position of that subject image on the screen;
identification information adding means for adding the generated identification signal and position information to the video signal; and
synchronization signal synthesizing means for detecting, from the input audio signals and on the basis of the identification signal added to the video signal, the audio signal whose identifier corresponds to the identifier of the subject image, multiplexing the detected audio signal at the position of the subject image in the video signal on the basis of the position information added to the video signal, and outputting the result as an AV multiplexed signal.

2. The video/audio synchronous transmission apparatus according to claim 1, wherein, when multiplexing the audio signal at the position of the subject image in the video signal, the synchronization signal synthesizing means multiplexes the audio signal in units of one pixel or a plurality of pixels of the video.

3. A video/audio synchronous transmission apparatus that synchronizes a video signal with audio signals and transmits them to a user device, the video signal and the audio signals being created separately during video content production, and each audio signal carrying an identifier associated with a subject image that appears on the screen of the video signal and emits sound, the apparatus comprising:
identification/position generating means for generating, from an input video signal, an identification signal representing an identifier that identifies a subject image emitting sound, and position information representing the position of that subject image on the screen;
identification information adding means for adding the generated identification signal to the video signal; and
audio signal multiplexing means for detecting, from the input audio signals and on the basis of the identification signal generated by the identification/position generating means, the audio signal whose identifier corresponds to the identifier of the subject image, multiplexing onto the detected audio signal the position information of the subject image generated by the identification/position generating means, and outputting the result as an audio multiplexed signal,
wherein the video signal from the identification information adding means and the audio multiplexed signal from the audio signal multiplexing means are transmitted separately.

4. The video/audio synchronous transmission apparatus according to claim 3, wherein the audio signal multiplexing means embeds the position information of the subject image in the audio signal as digital watermark data.

5. A video/audio synchronous transmission apparatus that synchronizes a video signal with audio signals and transmits them to a user device, the video signal and the audio signals being created separately during video content production, and each audio signal carrying an identifier associated with a subject image that appears on the screen of the video signal and emits sound, the apparatus comprising:
identification/position generating means for generating, from an input video signal, an identification signal representing an identifier that identifies a subject image emitting sound, and position information representing the position of that subject image on the screen;
identification information adding means for adding the generated identification signal and position information to the video signal; and
video signal multiplexing means for detecting, from the input audio signals and on the basis of the identification signal added to the video signal by the identification information adding means, the audio signal whose identifier corresponds to the identifier of the subject image, embedding the identifier added to the detected audio signal in the video signal from the identification information adding means as digital watermark data, and outputting the result as a video multiplexed signal,
wherein the video multiplexed signal from the video signal multiplexing means and the input audio signals are transmitted separately.

6. The video/audio synchronous transmission apparatus according to claim 1, wherein the identification/position generating means is subject recognition means that recognizes, in the corresponding subject image of the input video signal, a mark attached in advance to the actual subject, generates an identification signal comprising an identifier that identifies the subject image corresponding to the mark, and generates, from the position of the mark, position information representing the position of the subject image on the screen.

7. The video/audio synchronous transmission apparatus according to claim 1, wherein the identification/position generating means is subject identification data generating means that reproduces the input video signal on a monitor and, when an input pen is brought into contact with a subject image serving as a sound source on the screen, generates an identification signal comprising an identifier that identifies the subject image and position information representing the position of the subject image on the screen.