JP3377463B2

JP3377463B2 - Video / audio gap correction system, method and recording medium

Info

Publication number: JP3377463B2
Application number: JP37770898A
Authority: JP
Inventors: 章中村; 智史山口; 昭彦児野; 定晴岡田
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1998-12-29
Filing date: 1998-12-29
Publication date: 2003-02-17
Anticipated expiration: 2018-12-29
Also published as: JP2000196917A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a video/audio shift correction system, method, and recording medium that detect a shift between the moving image and the sound (speech, acoustic sound, or the like) in video/audio content and correct that shift. More specifically, it relates to a video/audio shift correction system, method, and recording medium suited to media that handle moving images, for example television, video/audio recording media such as VTR and DVD, video/audio processing computers, medical equipment, and communication equipment such as videophones.

[0002]

[Prior Art] When video, that is, content having a moving image and sound, is processed by analog equipment or digital processing equipment and, for example, only the moving image undergoes image processing, a shift arises between the moving image and the sound. Proposals for correcting such a shift include Japanese Patent Application No. 8-296216. In that proposal, reference signals for both the moving image and the audio are superimposed before the program starts, and the shift between the moving image and the audio is corrected.

[0003]

[Problems to Be Solved by the Invention] In the above proposal, the reference signals are superimposed on the main-line moving image and audio before the program is broadcast, so it is difficult to measure and correct the shift during the broadcast.

[0004] Further, when editing material (video) recorded on a VTR (video recording/reproducing apparatus), a reference signal must be superimposed on each fragmentary piece of material, so synchronizing the moving image and the audio demands considerable labor from the user.

[0005] In view of the above, an object of the present invention is to provide a video/audio shift correction system, method, and recording medium with which the moving image and the sound (speech or acoustic sound) in video/audio can be synchronized by a simple operation even during broadcasting.

[0006]

[Means for Solving the Problems] To achieve this object, the invention of claim 1 is a video/audio shift correction system that corrects a temporal shift between an input moving image and an input sound, characterized by comprising: first detection means for detecting a first sound generation timing of a sound source from the motion of the sound source contained in the input moving image; second detection means for detecting a second sound generation timing of the sound source from the input sound; measuring means for measuring the temporal shift between the first sound generation timing detected by the first detection means and the second sound generation timing detected by the second detection means; and signal processing means for synchronizing the input moving image and the input audio on the basis of the measured temporal shift.

[0007] The invention of claim 2 is the video/audio shift correction system of claim 1, characterized in that the first detection means calculates a motion vector of the motion of the sound source and detects the first sound generation timing in the moving image on the basis of the calculated motion vector.

[0008] The invention of claim 3 is the video/audio shift correction system of claim 2, characterized in that the motion vector is calculated for at least one of the vertical direction and the horizontal direction.

[0009] The invention of claim 4 is the video/audio shift correction system of any one of claims 1 to 3, characterized in that the sound source is lips.

[0010] The invention of claim 5 is the video/audio shift correction system of claim 4, characterized in that the second detection means detects the second sound generation timing from the unvoiced/voiced and silent sections of the speech.

[0011] The invention of claim 6 is the video/audio shift correction system of any one of claims 1 to 3, characterized in that the sound source is a plurality of objects that generate acoustic sound by colliding.

[0012] The invention of claim 7 is the video/audio shift correction system of claim 6, characterized in that the second detection means detects the second sound generation timing from the sounding and silent sections of the acoustic signal.

[0013] The invention of claim 8 is the video/audio shift correction system of any one of claims 1 to 7, characterized in that the first sound generation timing and the second sound generation timing are each detected a plurality of times, and the measuring means determines the temporal shift on the basis of the plurality of detected first and second sound generation timings.

[0014] The invention of claim 9 is a video/audio shift correction method for correcting a temporal shift between an input moving image and an input sound, characterized by: detecting a first sound generation timing of a sound source from the motion of the sound source contained in the input moving image; detecting a second sound generation timing of the sound source from the input sound; measuring the temporal shift between the detected first sound generation timing and the detected second sound generation timing; and synchronizing the input moving image and the input audio on the basis of the measured temporal shift.

[0015] The invention of claim 10 is a recording medium on which is recorded a program, executed by a computer, for correcting a temporal shift between an input moving image and an input sound, characterized in that the program comprises: a first detection step of detecting a first sound generation timing of a sound source from the motion of the sound source contained in the input moving image; a second detection step of detecting a second sound generation timing of the sound source from the input sound; a measuring step of measuring the temporal shift between the first sound generation timing detected in the first detection step and the second sound generation timing detected in the second detection step; and a signal processing step of synchronizing the input moving image and the input audio on the basis of the measured temporal shift.

[0016]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0017] FIG. 1 shows the processing performed in an embodiment of the present invention.

[0018] (a) A moving image signal, for example in HDTV (Hi-Vision), NTSC, or PAL form, and the sound accompanying it, in this case a speech signal, are input to the video/audio shift correction system. This example describes a moving image that captures lip movement during speech (for example, a bust shot such as in a news program). A moving image of a person playing a percussion instrument such as a tambourine or castanets, or more generally a moving image in which sound is produced by striking something, can also be input. This embodiment is described using an HDTV signal as an example.

[0019] (b) In block 1, the portion of the moving image (the detection window) in which motion vectors are to be detected, which is needed to measure the shift between the moving image and the audio, is designated. The inventors found that the motion of the sound source, here the lips, includes a characteristic movement at the moment sound is produced, and this characteristic movement is detected by analyzing the moving image. The analysis uses the concept of a motion vector (described later).

[0020] The shift is measured by comparing the sound generation timing obtained from the moving image analysis with the generation timing of the speech (or acoustic sound) obtained from the audio signal. The two are then synchronized by delaying either the moving image or the audio according to the measured shift.

[0021] (c) In this embodiment, to correct the shift of the video (moving image + sound) in real time, the moving image is displayed on the touch-panel display of an analog image processing apparatus, and the apparatus cuts out the part of the displayed moving image near the sound source, for example a block 200 dots high by 350 dots wide. In block 2, the cut-out moving image is converted from analog to digital by an A/D converter in RGB 4:4:4 format. Because this cut-out removes the background around the sound source, the number of dots (pixels) of moving image data to be measured decreases, which shortens the moving image analysis described below.
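The cut-out in step (c) amounts to slicing a fixed window out of each frame. The following minimal Python sketch assumes a frame is stored as a list of pixel rows; the function name `crop_window` and its position arguments are illustrative and do not come from the patent.

```python
def crop_window(frame, top, left, height=200, width=350):
    """Return the height x width detection window whose upper-left corner is (top, left).

    `frame` is assumed to be a list of rows, each row a list of pixel values.
    The default size follows the 200 x 350 dot example in the text.
    """
    if top < 0 or left < 0:
        raise ValueError("window must start inside the frame")
    if top + height > len(frame) or left + width > len(frame[0]):
        raise ValueError("window extends past the frame")
    return [row[left:left + width] for row in frame[top:top + height]]


# A small synthetic 6 x 8 frame whose pixel value encodes its position.
frame = [[10 * y + x for x in range(8)] for y in range(6)]
window = crop_window(frame, top=2, left=3, height=2, width=4)
print(window)  # [[23, 24, 25, 26], [33, 34, 35, 36]]
```

With the default size, each frame contributes 200 × 350 = 70,000 pixels to the later analysis, whatever the size of the full frame.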

[0022] (d) Block 3 performs preprocessing on the moving image A/D-converted in block 2 in order to detect the motion vector of the lips, the motion vector of a percussion instrument, or, for a moving image in which sound is produced by striking, the motion vector of the striking part (for example, a hand). In this embodiment, the movements by which these various sound sources in the moving image produce sound are collectively called sound generation patterns. Specifically, the preprocessing uses a signal processor or the like to create, by a matrix operation, a luminance signal for motion vector detection.

[0023] (e) Next, blocks 4 and 5 obtain the motion vector for the sound generation pattern of the sound source by the iterative gradient method (for example, maximum number of iterations: 2; block size: 8 pixels by 8 lines), using initial displacement vectors based on the gradient method (for example, candidate vectors: 8; block size: 8 pixels by 8 lines). For the gradient method and the iterative gradient method, see Japanese Patent Application No. 9-249134; Journal of the Institute of Television Engineers of Japan, Vol. 48, No. 1, pp. 84-94 (1994); and NHK Giken R&D, No. 26, September 1993.
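The cited references, not this text, define the iterative gradient method and its candidate initial displacement vectors. Purely as orientation, the sketch below shows a single generic gradient-method step in plain Python: a least-squares solution of the brightness constancy constraint over one block. It is an illustration under stated assumptions, not the method of the cited application; the function name `gradient_motion` and the synthetic block are hypothetical.

```python
def gradient_motion(prev, curr):
    """Estimate one displacement (vx, vy) for a block with a single gradient-method step.

    Solves, in the least-squares sense over the block interior,
        gx * vx + gy * vy + gt = 0
    where gx, gy are spatial gradients of `prev` and gt = curr - prev.
    """
    h, w = len(prev), len(prev[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # horizontal gradient
            gy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # vertical gradient
            gt = curr[y][x] - prev[y][x]                  # frame difference
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
            sxt += gx * gt
            syt += gy * gt
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return (0.0, 0.0)  # gradients too uniform to resolve a direction
    vx = (syt * sxy - sxt * syy) / det
    vy = (sxt * sxy - syt * sxx) / det
    return (vx, vy)


# Synthetic 6 x 6 block with luminance x * y, then the same pattern moved one pixel right.
prev = [[x * y for x in range(6)] for y in range(6)]
curr = [[(x - 1) * y for x in range(6)] for y in range(6)]
print(gradient_motion(prev, curr))
```

For this synthetic block the estimate is one pixel to the right with no vertical motion. An iterative variant would repeat the step starting from the previous estimate, which is where an iteration limit such as the one mentioned in the text would apply.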

[0024] (f) The motion vector amount calculation block 6 obtains the amounts of the vertical and horizontal motion vectors of the sound source. When the cut-out moving image shows a movement that produces sound, at least one of these amounts is at or above a certain value. The vertical and horizontal vectors are therefore summed into a single motion vector, whose magnitude is used to decide whether sound is being produced. The signal processor compares the calculated magnitude with the fixed value to determine whether the cut-out moving image shows a sound generation pattern.
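The threshold decision in step (f) can be sketched as follows. The sketch assumes that the vertical amount a and the horizontal amount b are orthogonal components, so the magnitude of their sum is the Euclidean norm; the patent does not spell this out. The function names, the sample values, and the threshold are hypothetical.

```python
import math


def motion_magnitude(a, b):
    """Magnitude of the combined vector, treating a (vertical) and b (horizontal) as orthogonal."""
    return math.hypot(a, b)


def motion_onsets(a_series, b_series, threshold):
    """Return frame indices where the magnitude first reaches `threshold` after being below it."""
    onsets = []
    was_active = False
    for i, (a, b) in enumerate(zip(a_series, b_series)):
        active = motion_magnitude(a, b) >= threshold
        if active and not was_active:
            onsets.append(i)
        was_active = active
    return onsets


# Hypothetical per-frame lip motion amounts (arbitrary units).
a_series = [0.0, 0.1, 3.0, 4.0, 0.2, 0.0, 2.5, 0.1]
b_series = [0.0, 0.0, 0.5, 3.0, 0.1, 0.0, 2.0, 0.0]
print(motion_onsets(a_series, b_series, threshold=2.0))  # [2, 6]
```

Reporting only the frame at which the magnitude crosses the threshold from below gives one candidate sound generation timing per movement rather than one per active frame.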

[0025] (g) In block 7, the input audio signal is converted from analog to digital by an A/D converter. More specifically, the audio signal is sampled with 16-bit quantization at a sampling frequency of 48 kHz.

[0026] (h) The signal processor divides the A/D-converted speech into unvoiced/voiced and silent sections. Well-known digital speech processing techniques may be used for this segmentation. For acoustic signals such as percussion, it is simplest to detect that the loudness has reached or exceeded a threshold; for greater precision, detect that a sound at a frequency characteristic of the instrument has occurred.

[0027] (i) Next, block 9 detects the voice (sound) onset within each sounding section obtained by the division. By performing steps (h) and (i), the signal processor distinguishes sounding sections (unvoiced/voiced speech, or acoustic sound) from silent sections and detects the timing of each onset.
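The text leaves the segmentation in steps (h) and (i) to well-known techniques. As a minimal stand-in, the sketch below uses only the simple loudness threshold mentioned for acoustic signals: a frame is treated as sounding when its mean absolute amplitude reaches a threshold, and an onset is a silent frame followed by a sounding one. It does not classify unvoiced versus voiced speech. The function name, frame length, and threshold are assumptions.

```python
def audio_onsets(samples, rate, frame_len, threshold):
    """Return onset times in seconds where a silent frame is followed by a sounding frame.

    A frame counts as sounding when its mean absolute amplitude reaches `threshold`.
    """
    onsets = []
    was_sounding = False
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / frame_len
        sounding = level >= threshold
        if sounding and not was_sounding:
            onsets.append(start / rate)
        was_sounding = sounding
    return onsets


# Synthetic 16-bit style signal at 48 kHz: 0.5 s silence, 0.25 s burst, 0.25 s silence.
rate = 48000
samples = [0] * 24000 + [8000 if i % 2 else -8000 for i in range(12000)] + [0] * 12000
print(audio_onsets(samples, rate, frame_len=480, threshold=500))  # [0.5]
```

The time resolution of this sketch is one analysis frame (480 samples, or 10 ms, in the example), which is shorter than one video frame.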

[0028] (j) The shift between the moving image signal and the audio signal is calculated (measured) as the time difference between the sound generation timing obtained from the moving image in block 6 and the sound generation timing obtained from the audio signal in block 9. The signal processor performs this calculation in block 10.
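The difference calculation in step (j) can be sketched as below. The pairing rule (each audio onset is matched to the nearest video onset, and pairs further apart than `max_shift` seconds are discarded) is an assumption added for illustration; the patent only states that the time difference between the two timings is calculated. The function name and sample values are hypothetical.

```python
def measure_shifts(video_onsets, audio_onsets, max_shift=0.5):
    """Pair each audio onset with the nearest video onset and return video - audio differences.

    Pairs further apart than `max_shift` seconds are ignored as unrelated events.
    Positive values mean the image lags the audio.
    """
    if not video_onsets:
        return []
    shifts = []
    for t_audio in audio_onsets:
        nearest = min(video_onsets, key=lambda t_video: abs(t_video - t_audio))
        if abs(nearest - t_audio) <= max_shift:
            shifts.append(nearest - t_audio)
    return shifts


video_onsets = [1.10, 4.10, 9.00]   # seconds, from the motion analysis
audio_onsets = [1.00, 4.00, 7.00]   # seconds, from the audio analysis
print([round(s, 3) for s in measure_shifts(video_onsets, audio_onsets)])  # [0.1, 0.1]
```

The audio onset at 7.00 s has no video onset within 0.5 s, so it contributes no measurement.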

[0029] (k) In block 11, the input audio signal is delayed, by a delay element or the like, by the amount of shift calculated in block 10. Alternatively, block 10 can display the measured value so that the delay of block 11 is set manually from it. The moving image signal and the audio signal are thereby synchronized and output from the video/audio shift correction system.
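For digitized audio, the delay in step (k) can be approximated by prepending silence. The sketch below is one simple way to do this on a buffer of samples, not a model of the delay element itself; the function name is hypothetical, and the default rate follows the 48 kHz sampling frequency given in step (g).

```python
def delay_audio(samples, delay_s, rate=48000):
    """Delay `samples` by `delay_s` seconds by prepending silence, keeping the original length."""
    n = round(delay_s * rate)
    if n <= 0:
        return list(samples)
    return ([0] * n + list(samples))[:len(samples)]


print(delay_audio([1, 2, 3, 4, 5, 6], delay_s=0.5, rate=4))  # [0, 0, 1, 2, 3, 4]
```

Truncating to the original length drops the samples pushed past the end of the buffer. A streaming implementation would instead carry them over into the next buffer.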

[0030] FIG. 2 shows how the motion vector mentioned above, specifically the vertical motion vector of the lips, is defined. In the silent state, when nothing is being uttered, the mouth is closed. Taking the position of the center of the lips at that time as the starting point, the vector in the direction in which the upper lip rises (the mouth opens) is called the "vertical motion vector" and is denoted a. With the starting point as the zero vector, a is positive (+) upward and negative (-) downward.

[0031] FIG. 3 shows how the horizontal motion vector of the lips is defined. Taking the left or right end of the lips in the silent state as the starting point, the vector in the direction in which this point extends outward is called the "horizontal motion vector" and is denoted b. With the starting point as the zero vector, b is positive (+) outward and negative (-) inward.

[0032] Here, the motion vector S(t) at each time t is defined as

[0033]

[Equation 1] S(t) = a(t) + b(t)

[0034] FIG. 4 shows, for an announcer reading a news item with the moving image and the audio in sync, the relationship between the voice onset points of the speech and the magnitude |S(t)| of the lip motion vector. As FIG. 4 shows, the magnitude of the lip motion vector starts to rise from zero at, for example, point A, and the speech waveform correspondingly starts at the voice onset point.

[0035] FIG. 5 shows a case where a shift has arisen between the moving image and the audio. Image processing generally takes longer than audio processing, so, as FIG. 5 shows, the (moving) image lags the audio by a shift time T. In normal speech the interval from one voice onset point A to the next voice onset point B is several seconds, whereas the shift between the moving image and the audio is only a few frames (pictures) long. One frame corresponds to 1/30 second.
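To make the time scales concrete, the short sketch below converts a shift expressed in frames into seconds and into audio samples. It assumes the 1/30 second frame period stated here and the 48 kHz sampling frequency from step (g); the function names are hypothetical.

```python
def frames_to_seconds(frames, fps=30):
    """Duration of `frames` video frames in seconds."""
    return frames / fps


def frames_to_samples(frames, fps=30, rate=48000):
    """Number of audio samples spanned by `frames` video frames."""
    return frames * rate // fps


print(frames_to_samples(1))   # 1600
print(frames_to_samples(3))   # 4800
print(frames_to_seconds(3))   # 0.1
```

A shift of three frames is therefore 0.1 second, far shorter than the several seconds between successive voice onset points.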

[0036] The shift time between video and audio can be measured from a single voice onset point in FIG. 5, but depending on how the speech is uttered, a single-point measurement may not be accurate enough.

[0037] A processing method that improves the measurement accuracy is therefore described with reference to FIG. 6.

[0038] FIG. 6 shows the relationship between the speech waveform and the motion vector when a shift is present. For the voice onset points A, B, C, D, and E, the shift times T_A, T_B, T_C, T_D, and T_E relative to the moving image are measured by the method described above. If the measured shift times are ordered as

[0039]

[Equation 2] T_A ≤ T_E ≤ T_B ≤ T_C ≤ T_D

then the signal processor calculates the average T_AVE of T_B, T_C, and T_E, that is, of the values remaining after the maximum T_D and the minimum T_A are excluded, and this average is taken as the shift time between the moving image and the audio. Here,

[0040]

[Equation 3] T_AVE = (T_B + T_C + T_E) / 3

Adopting the average T_AVE as the measured shift and delaying the input audio signal by it with the delay element in correction block 11 synchronizes the moving image and the audio.
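The averaging rule of Equations 2 and 3 (drop the largest and the smallest measurement, then average the rest) can be sketched as follows. The function name and the sample values are hypothetical; the values are chosen to satisfy the ordering T_A ≤ T_E ≤ T_B ≤ T_C ≤ T_D.

```python
def trimmed_mean_shift(shifts):
    """Average the measured shifts after dropping one maximum and one minimum."""
    if len(shifts) < 3:
        raise ValueError("need at least three measurements")
    ordered = sorted(shifts)
    kept = ordered[1:-1]
    return sum(kept) / len(kept)


# Hypothetical shift times in frames for onset points A, B, C, D, E.
t = {"A": 2, "B": 4, "C": 5, "D": 7, "E": 3}
print(trimmed_mean_shift(list(t.values())))  # 4.0
```

With these values the result equals (T_B + T_C + T_E) / 3 = (4 + 5 + 3) / 3 = 4 frames.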

[0041] In the example above, the system is built from an analog image processing apparatus, A/D converters, a signal processor, a delay element, and so on. When the moving image and the audio (acoustic) signals are handled in digital form, however, the video/audio shift correction system can be realized on a general-purpose computer such as a personal computer or workstation. Because the configuration of a general-purpose computer is well known, only the processing procedure is shown, in FIGS. 7 and 8. For convenience, FIGS. 7 and 8 describe the procedure in functional terms; in practice it is stored on a recording medium in a programming language executable by the CPU, and executed. A hard disk, ROM, or the like can be used as the recording medium for storage. The program may also be installed from a portable recording medium such as a floppy disk or CD-ROM onto such a storage device in the general-purpose computer. In FIG. 7, the general-purpose computer receives the moving image and audio signals. Signals in analog form are converted to digital by a video card or the like. The moving image and audio signals for a plurality of frames are temporarily stored in the memory of the apparatus (step S10). The temporarily stored moving image, or the still image at its head, is shown on the display, and the region to cut out is designated with a mouse or the like (step S20). The image of the designated region is cut out of the temporarily stored moving image (a plurality of still images) and stored in another area of memory. The preprocessing described above is applied to the cut-out image signal (steps S30 → S40). The motion vector and its magnitude are calculated for the preprocessed moving image signal (step S50). The resulting vertical and horizontal motion vectors are analyzed to detect the points corresponding to A and B in FIG. 5. Data indicating these points is temporarily stored in the memory of the apparatus, and the timing of the sound generation shown by the image is detected (steps S60 → S70).

[0042] Next, the audio signal for the plurality of frames temporarily stored in memory is used to detect the unvoiced/voiced and silent sections, and then the voice onset points, that is, the points at which speech begins, are detected (steps S80 → S90).

[0043] The time shift between the sound generation point detected from the moving image stored in memory and the generation point detected from the audio is calculated (measured) (step S100). The temporarily stored input audio is delayed by the calculated time, and the temporarily stored moving image signal and the audio signal are output externally (step S110).

[0044] If necessary, the delayed audio signal and the moving image signal may be saved to the hard disk in the apparatus (step S120).

[0045] Thereafter, the above processing is executed each time a moving image is input in units of a plurality of frames. Step S20 alone is executed only initially and is skipped once the region to cut out (the detection window) has been set.
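The control flow described here, in which the detection window is set once and then reused for every later group of frames, can be sketched as below. The class name, the callables, and the window tuple are hypothetical stand-ins for steps S20 and S30 to S110; they are not defined in the patent.

```python
class ShiftCorrectionLoop:
    """Processes successive chunks of frames; the detection window is chosen only once."""

    def __init__(self, select_window, process_chunk):
        self.select_window = select_window    # stands in for step S20 (region designation)
        self.process_chunk = process_chunk    # stands in for steps S30 to S110
        self.window = None

    def feed(self, frames, audio):
        if self.window is None:               # S20 runs only for the first chunk
            self.window = self.select_window(frames[0])
        return self.process_chunk(frames, audio, self.window)


calls = []


def select_window(first_frame):
    calls.append("S20")
    return (0, 0, 200, 350)  # hypothetical (top, left, height, width)


def process_chunk(frames, audio, window):
    return len(frames), window


loop = ShiftCorrectionLoop(select_window, process_chunk)
results = [loop.feed(["frame"] * 30, [0] * 48000) for _ in range(3)]
print(calls)  # ['S20']
```

Three chunks are processed, but the window selection runs only for the first.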

[0046] In addition to the embodiment described above, the following variations are possible.

[0047] 1) The embodiment above was described mainly for speech, but for any object whose sound production involves movement, the sound generation timing can be detected from its moving image.

[0048] 2) The video/audio shift correction system may be configured as appropriate for the form of the signal. It may of course also be built as a card or an IC chip and incorporated into another apparatus, for example a communication apparatus.

[0049] 3) In the embodiment above, motion vectors are detected in both the vertical and horizontal directions. If the direction of the sound source's motion is known in advance to be vertical or horizontal, only the motion vector in that direction needs to be calculated.

[0050] 4) The movement by which a sound source produces sound differs with the type of source. If the change in the motion vector is prepared as a feature pattern for each type of sound source, the type of source can be identified by comparing these feature patterns with the feature pattern extracted from the moving image to be synchronized. This also allows an unspecified variety of moving images to be synchronized.
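The patent does not say how the feature patterns are compared. One simple possibility, shown here only as an assumption, is to pick the prepared pattern with the smallest sum of squared differences from the extracted one. The function name, the pattern contents, and the source labels are hypothetical.

```python
def classify_source(pattern, templates):
    """Return the name of the template with the smallest sum of squared differences."""
    def distance(template):
        return sum((p - q) ** 2 for p, q in zip(pattern, template))
    return min(templates, key=lambda name: distance(templates[name]))


# Hypothetical feature patterns: motion magnitude over five frames around an onset.
templates = {
    "lips": [0.0, 1.0, 2.0, 2.0, 1.0],
    "clap": [3.0, 4.0, 0.0, 4.0, 3.0],
}
print(classify_source([0.2, 1.1, 1.8, 2.2, 0.9], templates))  # lips
```

The sketch assumes all patterns have the same length and are aligned in time; a practical comparison would need to handle differences in speed and amplitude.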

[0051] 5) In the embodiment above, the image (data) of the designated region is cut out of the still images that make up the moving image. The image data of a moving sound source can also be obtained by calculating the difference between two adjacent still images.
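A frame-difference approach of this kind can be sketched as follows: pixels whose value changes by more than a threshold between two adjacent frames are treated as belonging to the moving source, and their bounding box is returned. The function name and the threshold are assumptions for illustration.

```python
def moving_region(prev, curr, threshold):
    """Return (top, left, bottom, right) bounding the pixels that changed by more than `threshold`.

    Returns None when no pixel changed enough. Bounds are inclusive.
    """
    rows, cols = [], []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(c - p) > threshold:
                rows.append(y)
                cols.append(x)
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


prev = [[0] * 6 for _ in range(5)]
curr = [row[:] for row in prev]
curr[2][3] = 100   # two pixels change between the adjacent frames
curr[3][4] = 120
print(moving_region(prev, curr, threshold=50))  # (2, 3, 3, 4)
```

Unlike a manually designated window, this region is derived from the frames themselves, so it follows the source if the source moves within the picture.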

[0052] 6) For a single object as the sound source, the sound generation timing can be detected from the moving image if the movement that produces the sound has a distinctive feature. When sound is produced by two objects colliding, as with castanets, a tambourine and a hand, a drum and a drumstick, or clapping hands, the direction in which the objects move reverses, so the sound generation timing can be detected by analyzing this feature of the movement in the image.
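The reversal of direction at a collision can be detected, in the simplest case, as a sign change in the frame-to-frame displacement of a tracked point. The sketch below assumes a one-dimensional position per frame is already available; the function name and the sample track are hypothetical.

```python
def reversal_frames(positions):
    """Return frame indices at which the frame-to-frame motion changes sign."""
    hits = []
    for i in range(1, len(positions) - 1):
        before = positions[i] - positions[i - 1]
        after = positions[i + 1] - positions[i]
        if before * after < 0:
            hits.append(i)
    return hits


# Hypothetical vertical position of a drumstick tip: it descends, strikes at frame 3, rebounds.
positions = [10, 7, 4, 1, 3, 6, 8]
print(reversal_frames(positions))  # [3]
```

The frame at which the motion reverses serves as the candidate sound generation timing for the collision.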

[0053]

[Effects of the Invention] As described above, according to the inventions of claims 1, 9, and 10, the (first) sound generation timing is detected from the motion of the sound source contained in the moving image of the video/audio signal and compared with the (second) sound generation timing detected from the sound itself, so the shift between the moving image and the sound in the video/audio can be measured. The shift between the moving image and the sound can therefore be corrected in real time even for video (moving image + sound) that is being broadcast. In addition, the series of processes according to the inventions of claims 1, 9, and 10 can be automated, which greatly reduces the user's operating labor.

[0054] According to the invention of claim 2, calculating the motion vector of the sound source makes it possible to detect the characteristic movement of the source as it produces sound, and thus to detect the sound generation timing from the moving image.

[0055] In the inventions of claims 3 to 7, for speech, sound generation is detected from the movement of the lips; for objects that produce sound, such as percussion instruments, sound generation is detected from images showing the motion of the colliding objects; and for the input sound, the sound generation timing is identified by distinguishing sound sections from silent sections. No human operation is therefore required, and the processing can be automated.
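The distinction between sound sections and silent sections can be sketched as follows. This is only an illustration that assumes a fixed short-time energy threshold; the patent text here does not specify how the sections are separated, and the frame length, threshold and function name are hypothetical.

```python
from typing import List


def sound_onsets(samples: List[float], sample_rate: int,
                 frame_len: int = 160, threshold: float = 0.01) -> List[float]:
    """Return the start times (seconds) of frames where silence turns into sound.

    The signal is cut into non-overlapping frames of frame_len samples. A frame
    counts as a sound section when its mean squared amplitude is at least
    threshold (an assumed criterion), otherwise as a silent section.
    """
    onsets = []
    was_sounding = False
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        is_sounding = energy >= threshold
        if is_sounding and not was_sounding:
            onsets.append(start / sample_rate)
        was_sounding = is_sounding
    return onsets


# Hypothetical 8 kHz signal: silence, a burst, a short pause, a second burst.
signal = [0.0] * 320 + [0.5] * 320 + [0.0] * 160 + [0.5] * 160
print(sound_onsets(signal, sample_rate=8000))
```

Each returned time is a candidate for the second sound generation timing, to be compared with the timing detected from the moving image.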

[0056] In the invention of claim 8, the shift time between the moving image and the sound is measured a plurality of times, and the delay time actually used for shift correction is determined from those measurements, which improves the accuracy of the delay-time measurement.
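One way to combine several measurements into a single delay time is sketched below. The use of the median is an assumption chosen because it tolerates an occasional false detection; this passage does not state which rule is used, and the function name and sample values are hypothetical.

```python
from statistics import median
from typing import Sequence


def estimate_delay(video_onsets_s: Sequence[float], audio_onsets_s: Sequence[float]) -> float:
    """Combine paired onset measurements into one delay time in seconds.

    video_onsets_s[i] and audio_onsets_s[i] are assumed to belong to the same
    sound event. The median of the individual shifts is returned (assumed rule).
    """
    if not video_onsets_s or len(video_onsets_s) != len(audio_onsets_s):
        raise ValueError("need the same non-zero number of video and audio onsets")
    shifts = [v - a for v, a in zip(video_onsets_s, audio_onsets_s)]
    return median(shifts)


# Hypothetical measurements: two agree on 0.5 s, one outlier reports 1.0 s.
print(estimate_delay([1.0, 2.5, 4.0], [0.5, 2.0, 3.0]))
```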

[Brief description of drawings]

FIG. 1 is a block diagram showing the processing performed in an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing motion vectors of the lips in the vertical direction.

FIG. 3 is an explanatory diagram showing motion vectors of the lips in the horizontal direction.

FIG. 4 is a waveform diagram showing the relationship between voice onset and the motion vector when the moving image and the sound are synchronized.

FIG. 5 is a waveform diagram showing the relationship between voice onset and the motion vector when the moving image is delayed.

FIG. 6 is a waveform diagram showing the relationship between voice onset and the motion vector when there is a shift between the moving image and the sound.

FIG. 7 is a flowchart showing a processing procedure of the general-purpose computer according to the embodiment of the present invention.

FIG. 8 is a flowchart showing a processing procedure of the general-purpose computer according to the embodiment of the present invention.

[Explanation of symbols]

1 Processing for designating the motion vector detection window (block)
2, 7 A/D conversion processing (block)
3 Preprocessing (block)
4 Motion detection processing (block)
5 Vector detection and allocation processing (block)
6 Motion vector amount detection processing (block)
8 Processing for dividing the sound into unvoiced, voiced and silent sections (block)
9 Voice onset detection processing (block)
10 Moving image / sound shift detection (measurement) processing (block)
11 Audio delay processing (block)

Continuation of the front page
(72) Inventor: Sadaharu Okada, c/o Japan Broadcasting Corporation Broadcasting Center, 2-2-1 Jinnan, Shibuya-ku, Tokyo
(56) References: JP-A-9-163333 (JP, A); JP-A-8-23530 (JP, A)
(58) Fields investigated (Int. Cl.⁷, DB name): H04N 5/222

Claims

(57) [Claims]

1. A video / audio shift correction system for correcting a temporal shift between an input moving image and an input sound, comprising: first detecting means for detecting a generation timing of a first sound of a sound source from a motion of the sound source included in the input moving image; second detecting means for detecting a generation timing of a second sound of the sound source from the input sound; measuring means for measuring a temporal shift between the generation timing of the first sound detected by the first detecting means and the generation timing of the second sound detected by the second detecting means; and signal processing means for synchronizing the input moving image and the input sound on the basis of the measured temporal shift.

2. The video / audio shift correction system according to claim 1, wherein the first detecting means calculates a motion vector of the motion of the sound source and detects the generation timing of the first sound on the basis of the calculation result of the motion vector.

3. The video / audio shift correction system according to claim 2, wherein the motion vector is calculated for at least one of a vertical direction and a horizontal direction.

4. The video / audio shift correction system according to claim 1, wherein the sound source is a lip.

5. The video / audio shift correction system according to claim 4, wherein the second detecting means detects the generation timing of the second sound from unvoiced, voiced and silent sections of the voice.

6. The video / audio shift correction system according to claim 1, wherein the sound source is a plurality of objects that generate sound by colliding with each other.

7. The video / audio shift correction system according to claim 6, wherein the second detecting means detects the generation timing of the second sound from sound sections and silent sections.

8. The video / audio shift correction system according to claim 1, wherein the generation timing of the first sound and the generation timing of the second sound are each detected a plurality of times, and the measuring means determines the temporal shift on the basis of the plurality of detected generation timings of the first and second sounds.

9. A video / audio shift correction method for correcting a temporal shift between an input moving image and an input sound, comprising: detecting a generation timing of a first sound of a sound source from a motion of the sound source included in the input moving image; detecting a generation timing of a second sound of the sound source from the input sound; measuring a temporal shift between the detected generation timing of the first sound and the detected generation timing of the second sound; and synchronizing the input moving image and the input sound on the basis of the measured temporal shift.

10. A recording medium on which is recorded a program, executed by a computer, for correcting a temporal shift between an input moving image and an input sound, the program comprising: a first detecting step of detecting a generation timing of a first sound of a sound source from a motion of the sound source included in the input moving image; a second detecting step of detecting a generation timing of a second sound of the sound source from the input sound; a measuring step of measuring a temporal shift between the generation timing of the first sound detected in the first detecting step and the generation timing of the second sound detected in the second detecting step; and a signal processing step of synchronizing the input moving image and the input sound on the basis of the measured temporal shift.