JP2015012557A

JP2015012557A - Video audio processor, video audio processing system, video audio synchronization method, and program

Info

Publication number: JP2015012557A
Application number: JP2013138699A
Authority: JP
Inventors: 智彦村上; Tomohiko Murakami; 要内藤; Kaname Naito
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-07-02
Filing date: 2013-07-02
Publication date: 2015-01-19

Abstract

PROBLEM TO BE SOLVED: To perform lip sync of video and audio without using additional information such as a time stamp. SOLUTION: A video/audio processor includes: a video processing part that receives video; an audio processing part that receives audio; a video conversation determination part that detects a face in the video input to the video processing part and determines whether the lip area of the detected face is moving or still, thereby detecting that a conversation has started in the video; an audio conversation determination part that performs a volume determination on the audio input to the audio processing part, thereby detecting that a conversation has started in the audio; and a video/audio synchronization part that adjusts the output timing of the video and the audio on the basis of the difference between the video conversation start timing, at which the video conversation determination part detected that the conversation started in the video, and the audio conversation start timing, at which the audio conversation determination part detected that the conversation started in the audio.

Description

The present invention relates to a technique for synchronizing video and audio (lip sync).

In recent years, customers who use video conferencing have expressed a need to use a satellite communication network (bandwidth: 64 kbps) for specific agenda items, as part of a BCP (Business Continuity Plan) for disaster preparedness. However, the bandwidth of the satellite communication network was clearly insufficient to meet the communication quality the customers required for both video and audio.

This problem can be solved by providing separate networks for transmitting and receiving video and audio. For example, video could use the satellite communication network as originally planned, while audio could use a satellite telephone network.

However, when separate networks are provided for video and audio, the communication delays on the two networks differ, so on the receiving side a time difference may arise between the video and the audio arriving from the respective networks.

Therefore, when separate networks are provided for video and audio in a video conference, the receiving side must synchronize the video and audio (lip sync) so that the mouth movements of the speaker in the video match the audio.

Related techniques for synchronizing video and audio are described in Patent Documents 1 to 3.

In the technique described in Patent Document 1, time stamps added to the video and to the audio are compared to synchronize their playback timing. Using additional information such as time stamps for synchronization in this way is a common technique.

In the technique described in Patent Document 2, when mouth movement is detected in an image capturing the face of a video conference participant, an audio signal is generated based on the signal from the microphone corresponding to that participant, and the image of that participant is displayed enlarged.

In the technique described in Patent Document 3, the transmitting side transmits the audio data and utterance information, and changes the video data encoding method, when the audio data is valid, movement of a predetermined magnitude is detected in the lip video data, and the speaker holds the right to speak. The receiving side decodes and displays the video data, and selectively synthesizes and outputs only the audio data from the terminal device indicated by the utterance information.

Patent Document 1: JP 2000-324163 A
Patent Document 2: JP 2004-118314 A
Patent Document 3: JP 2002-158983 A

However, because the technique described in Patent Document 1 must compare the time stamps of the video and the audio, either a single processing device must process both video and audio, or separate processing devices must be provided for video and for audio and must exchange time stamps with each other in real time.

Furthermore, the technique described in Patent Document 1 cannot be applied to a satellite telephone network, in which time stamps cannot be added to the audio.

The technique described in Patent Document 2 synchronizes the timing at which a participant speaks with the timing at which that participant's image is displayed enlarged; it does not perform lip sync between video and audio.

In addition, in the technique described in Patent Document 2, synchronization takes place before the video and audio are sent to the network, so it cannot absorb communication delays that occur on the network.

The technique described in Patent Document 3 synchronizes the timing at which a speaker holding the right to speak talks with the timing at which the video data encoding method is changed; it does not perform lip sync between video and audio.

As described above, the technique of Patent Document 1 has the problem that additional information such as time stamps must be used, and the techniques of Patent Documents 2 and 3 have the problem that they cannot perform lip sync between video and audio.

Accordingly, an object of the present invention is to solve the above problems and provide a technique capable of performing lip sync between video and audio without using additional information such as time stamps.

The video/audio processing apparatus of the present invention comprises:
a video processing unit that receives video;
an audio processing unit that receives audio;
a video conversation determination unit that detects a face in the video input to the video processing unit and determines whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video;
an audio conversation determination unit that performs a volume determination on the audio input to the audio processing unit, thereby detecting that a conversation has started in the audio; and
a video/audio synchronization unit that adjusts the output timing of the video and the audio based on the difference between the video conversation start timing, at which the video conversation determination unit detected that a conversation started in the video, and the audio conversation start timing, at which the audio conversation determination unit detected that a conversation started in the audio.

The video/audio processing system of the present invention comprises:
a video processing unit that receives video;
an audio processing unit that receives audio;
a video conversation determination unit that detects a face in the video input to the video processing unit and determines whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video;
an audio conversation determination unit that performs a volume determination on the audio input to the audio processing unit, thereby detecting that a conversation has started in the audio; and
a video/audio synchronization unit that adjusts the output timing of the video and the audio based on the difference between the video conversation start timing, at which the video conversation determination unit detected that a conversation started in the video, and the audio conversation start timing, at which the audio conversation determination unit detected that a conversation started in the audio.

The video/audio synchronization method of the present invention is a video/audio synchronization method performed by a video/audio processing apparatus, comprising:
a video processing step of receiving video;
an audio processing step of receiving audio;
a video conversation determination step of detecting a face in the video input in the video processing step and determining whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video;
an audio conversation determination step of performing a volume determination on the audio input in the audio processing step, thereby detecting that a conversation has started in the audio; and
a video/audio synchronization step of adjusting the output timing of the video and the audio based on the difference between the video conversation start timing, at which the start of a conversation in the video was detected in the video conversation determination step, and the audio conversation start timing, at which the start of a conversation in the audio was detected in the audio conversation determination step.

The program of the present invention causes a computer to execute:
a video processing procedure of receiving video;
an audio processing procedure of receiving audio;
a video conversation determination procedure of detecting a face in the video input in the video processing procedure and determining whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video;
an audio conversation determination procedure of performing a volume determination on the audio input in the audio processing procedure, thereby detecting that a conversation has started in the audio; and
a video/audio synchronization procedure of adjusting the output timing of the video and the audio based on the difference between the video conversation start timing, at which the start of a conversation in the video was detected in the video conversation determination procedure, and the audio conversation start timing, at which the start of a conversation in the audio was detected in the audio conversation determination procedure.

According to the present invention, lip sync between video and audio can be performed without using additional information such as time stamps.

FIG. 1 is a diagram showing the configuration of the video/audio processing system according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the video conference transceiver 200.
FIG. 3 is a sequence diagram showing the operation of the video conference transceiver 200.
FIG. 4 is a diagram showing the synchronization operation in the video/audio synchronization unit 233 when the audio input and the video input are simultaneous.
FIG. 5 is a diagram showing the synchronization operation in the video/audio synchronization unit 233 when the delay of the video input relative to the audio input falls within the buffer length.
FIG. 6 is a diagram showing the synchronization operation in the video/audio synchronization unit 233 when the delay of the video input relative to the audio input does not fall within the buffer length.
FIG. 7 is a diagram showing the synchronization operation in the video/audio synchronization unit 233 when the delay of the video input relative to the audio input does not fall within the buffer length.
FIG. 8 is a diagram showing the synchronization operation in the video/audio synchronization unit 233 when the delay of the video input relative to the audio input does not fall within the buffer length.

Embodiments of the present invention are described below with reference to the drawings.
(1) First Embodiment
This embodiment is an example in which the present invention is applied to a video conference.
(1-1) Configuration of the First Embodiment
First, the configuration of this embodiment is described.

FIG. 1 shows an overview of the entire video/audio processing system of this embodiment.

The video/audio processing system of this embodiment includes video conference transceivers 100 and 200 as video/audio processing apparatuses.

The video conference transceivers 100 and 200 transmit and receive the video and audio of a video conference bidirectionally in real time, using the video network 1 for video and the audio network 2 for audio.

For example, when video and audio are transmitted from the video conference transceiver 100 to the video conference transceiver 200, the video conference transceiver 200 detects that a time difference may have arisen between the video input from the video network 1 and the audio input from the audio network 2, and adjusts the output timing of the video and audio.

In this embodiment, the audio network 2 is a fixed telephone network with a relatively small communication delay, and the video network 1 is an IP (Internet Protocol) network with a wide bandwidth but a potentially large communication delay. Under these conditions, video is expected to tend to arrive later than audio.

Also, in this embodiment, audio responsiveness is considered important in a video conference. When the video input lags far behind the audio input, the output timing of the video and audio is therefore not adjusted, and the audio alone is output first.

FIG. 2 shows the configuration of the video conference transceiver 200. FIG. 2 shows only those components of the video conference transceiver 200 that handle the input of video and audio from the networks and the subsequent processing; other components are omitted. The video conference transceiver 100 is assumed to have the same configuration as the video conference transceiver 200.

The video conference transceiver 200 includes a video processing unit 210, an audio processing unit 220, and a video/audio synchronization processing unit 230.

The video processing unit 210 terminates communication with the video conference transceiver 100 over the video network 1 and receives, from the video network 1, the video of the video conference transmitted from the video conference transceiver 100.

The audio processing unit 220 terminates communication with the video conference transceiver 100 over the audio network 2 and receives, from the audio network 2, the audio of the video conference transmitted from the video conference transceiver 100.

The video/audio synchronization processing unit 230 has three components: a conversation determination unit (video) 231, a conversation determination unit (audio) 232, and a video/audio synchronization unit 233.

The conversation determination unit (video) 231 detects that a conversation has started in the video input to the video processing unit 210.

The conversation determination unit (audio) 232 detects that a conversation has started in the audio input to the audio processing unit 220.

The video/audio synchronization unit 233 adjusts the output timing of the video and audio so that the two conversation start timings coincide, based on the difference between the timing at which the conversation determination unit (video) 231 detected that a conversation started in the video (video conversation start timing) and the timing at which the conversation determination unit (audio) 232 detected that a conversation started in the audio (audio conversation start timing).
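The adjustment described here comes down to comparing two detection times observed on the receiving side. The following is a minimal sketch, assuming each detection is stamped with the receiver's own clock at the moment it occurs (neither stream carries a time stamp). The function names, the sign convention, and the idea of expressing the correction as "hold one stream back" are illustrative assumptions, not taken from the source.

```python
def start_timing_gap(video_start_s: float, audio_start_s: float) -> float:
    """Return how much later (in seconds) the conversation start was seen in video.

    Both arguments are readings of the receiver's local clock (an assumption
    of this sketch). Positive means video lags audio; negative means the reverse.
    """
    return video_start_s - audio_start_s


def delay_to_apply(gap_s: float) -> tuple:
    """Pick which stream to hold back, and for how long, so the starts coincide."""
    if gap_s > 0:
        return ("audio", gap_s)   # video was detected later: hold audio
    if gap_s < 0:
        return ("video", -gap_s)  # audio was detected later: hold video
    return ("none", 0.0)
```

For instance, a conversation start seen at 2.5 s in video and 2.0 s in audio gives a gap of 0.5 s, so in this sketch the audio would be held for 0.5 s.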

To adjust the output timing, the video/audio synchronization unit 233 has a video buffer and an audio buffer for temporarily storing the video and the audio, respectively.

The video/audio synchronization unit 233 outputs video to a video playback device such as a monitor and outputs audio to an audio playback device such as a speaker.
(1-2) Operation of the First Embodiment
Next, the operation of this embodiment is described.
(A) Operation Sequence of the Video Conference Transceiver 200
First, the operation sequence of the video conference transceiver 200 is described.

FIG. 3 shows the operation sequence of the video conference transceiver 200. FIG. 3 shows the operation when the video is input later than the audio.
<a: Video>
First, the operation from the input of video until the detection that a conversation has started in the video is described.

Step a-1:
First, the video processing unit 210 receives, from the video network 1, the video of the video conference transmitted from the video conference transceiver 100.

Step a-2:
Next, the video processing unit 210 passes the received video to the conversation determination unit (video) 231.

Step a-3:
Next, the conversation determination unit (video) 231 performs a conversation determination on the content of the video passed from the video processing unit 210 and detects that a conversation has started in the video.

The conversation determination is implemented by detecting a face in the video and determining whether the lip region of the detected face is moving or still. A state in which the lip region shows movement exceeding a threshold is regarded as a state in which the speaker is talking, and the conversation is detected as having started when the threshold is exceeded.
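The moving/still determination can be illustrated with a simple frame-difference measure. This is a minimal sketch, assuming the lip region has already been located and cropped by some face detector (not shown) and is given as rows of grayscale pixel values. The mean-absolute-difference metric, the function names, and the threshold value are assumptions for illustration; the source does not specify how motion is measured.

```python
from typing import Optional, Sequence

# A cropped lip region: rows of grayscale pixel values in the range 0..255.
Region = Sequence[Sequence[int]]


def mean_abs_diff(prev: Region, curr: Region) -> float:
    """Average absolute pixel change between two equally sized lip regions."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def first_moving_frame(lip_regions: Sequence[Region],
                       threshold: float) -> Optional[int]:
    """Index of the first frame whose lip motion exceeds the threshold.

    Returns None if the lips never move by more than the threshold.
    """
    for i in range(1, len(lip_regions)):
        if mean_abs_diff(lip_regions[i - 1], lip_regions[i]) > threshold:
            return i
    return None
```

The returned index plays the role of the conversation start video frame; a real detector would likely smooth over several frames to avoid reacting to a single noisy difference.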

Step a-4:
When the conversation determination unit (video) 231 detects that a conversation has started in the video, it notifies the video/audio synchronization unit 233 of this. The notification is expressed as one parameter (a 0/1 flag) of the function that is called when video is passed from the conversation determination unit (video) 231 to the video/audio synchronization unit 233. Regardless of whether this notification is given, the conversation determination unit (video) 231 continuously outputs video to the video/audio synchronization unit 233.
<b: Audio>
Next, the operation from the input of audio until the detection that a conversation has started in the audio is described.

Step b-1:
First, the audio processing unit 220 receives, from the audio network 2, the audio of the video conference transmitted from the video conference transceiver 100.

Step b-2:
Next, the audio processing unit 220 passes the received audio to the conversation determination unit (audio) 232.

Step b-3:
Next, the conversation determination unit (audio) 232 performs a conversation determination on the content of the audio passed from the audio processing unit 220 and detects that a conversation has started in the audio.

The conversation determination performs a volume determination on the audio: a state in which the volume exceeds a threshold is regarded as a state in which the speaker is talking, and the conversation is detected as having started when the threshold is exceeded.

In addition, to avoid falsely detecting ambient sound or noise as speech, the human voice is first extracted using the frequency band of human speech, and the above conversation determination is then performed on the extracted speech.
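The two stages above (restrict to the voice band, then compare volume against a threshold) can be sketched as follows. This is only an illustration: the 300 to 3400 Hz band is the conventional telephone voice band and is an assumption here, as are the function names, the naive DFT used in place of a real band-pass filter, and the threshold value.

```python
import cmath
import math
from typing import Optional, Sequence


def band_rms(samples: Sequence[float], sample_rate: int,
             low_hz: float = 300.0, high_hz: float = 3400.0) -> float:
    """RMS level of the part of `samples` lying between low_hz and high_hz.

    Uses a naive O(n^2) DFT, which is adequate for a short illustrative frame.
    """
    n = len(samples)
    if n == 0:
        return 0.0
    mean_square = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        # Bins below Nyquist have a mirror image at n - k, so count them twice.
        weight = 1.0 if 2 * k == n else 2.0
        mean_square += weight * abs(coeff) ** 2 / (n * n)
    return math.sqrt(mean_square)


def first_voiced_frame(frames: Sequence[Sequence[float]], sample_rate: int,
                       threshold: float) -> Optional[int]:
    """Index of the first audio frame whose in-band volume exceeds the threshold."""
    for i, frame in enumerate(frames):
        if band_rms(frame, sample_rate) > threshold:
            return i
    return None
```

With this sketch, a 100 Hz hum contributes almost nothing to the in-band level, while a 1000 Hz tone of the same amplitude does, which is the effect the pre-extraction step is meant to achieve.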

Step b-4:
When the conversation determination unit (audio) 232 detects that a conversation has started in the audio, it notifies the video/audio synchronization unit 233 of this. The notification is expressed as one parameter (a 0/1 flag) of the function that is called when audio is passed from the conversation determination unit (audio) 232 to the video/audio synchronization unit 233. Regardless of whether this notification is given, the conversation determination unit (audio) 232 continuously outputs audio to the video/audio synchronization unit 233.
<c: Video/audio synchronization>
Next, the operation of synchronizing the video and audio is described.

Step c-1:
The video/audio synchronization unit 233 continuously receives audio from the conversation determination unit (audio) 232 and stores it in the audio buffer, and continuously receives video from the conversation determination unit (video) 231 and stores it in the video buffer.

The video/audio synchronization unit 233 adjusts the output timing of the video and audio when the delay of the video input relative to the audio input falls within a preset allowable time (for example, T seconds).

Accordingly, the buffer lengths of the audio buffer and the video buffer are set to a length corresponding to this allowable time (for example, 6 frames at 60 fps).
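The relation between the allowable time and the buffer length is a single multiplication: frames = frame rate x allowable time. The source's example of 6 frames at 60 fps therefore corresponds to an allowable time of 6 / 60 = 0.1 seconds. A minimal sketch, with a hypothetical function name and the tolerance given in integer milliseconds to avoid floating-point rounding:

```python
def buffer_length_frames(fps: int, tolerance_ms: int) -> int:
    """Number of whole frames that fit in the allowable delay.

    `tolerance_ms` is the allowable time T expressed in milliseconds
    (a convention of this sketch, not of the source).
    """
    return (fps * tolerance_ms) // 1000
```

Calling `buffer_length_frames(60, 100)` gives 6, matching the example in the text.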

The video/audio synchronization unit 233 also receives, from the conversation determination unit (audio) 232, the notification that a conversation has started in the audio, and, from the conversation determination unit (video) 231, the notification that a conversation has started in the video.

When the delay of the video input relative to the audio input (that is, the delay of the video conversation start timing relative to the audio conversation start timing) falls within the buffer length, the video/audio synchronization unit 233 adjusts the output timing of the video and audio so that the video of the video frame at which the conversation started in the video (the conversation start video frame) and the audio of the audio frame at which the conversation started in the audio (the conversation start audio frame) are output simultaneously.

On the other hand, when the delay of the video input relative to the audio input does not fall within the buffer length, the video/audio synchronization unit 233 does not adjust the output timing and outputs the audio of the conversation start audio frame first.
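The two branches above amount to a single comparison between the measured video delay and the buffer length. A minimal sketch follows, with the delay expressed in frames. The function name, the returned labels, and the choice of a strict `<` comparison (a start frame sitting `d` positions behind the head fits in a buffer of length `L` only if `d <= L - 1`) are assumptions of this sketch; the source only says "falls within the buffer length".

```python
def plan_output(video_delay_frames: int, buffer_len: int) -> str:
    """Decide how to output, given how many frames video lags behind audio.

    Returns "sync" when the conversation start frames can be output together,
    or "audio_first" when the delay is too large and audio is output at once.
    """
    if video_delay_frames < 0:
        raise ValueError("this sketch only covers video arriving at or after audio")
    if video_delay_frames < buffer_len:
        return "sync"
    return "audio_first"
```

With a 6-frame buffer, a delay of 0 or 3 frames yields "sync", while a delay of 6 frames or more yields "audio_first".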

Step c-2:
When the delay of the video input relative to the audio input falls within the buffer length, the video/audio synchronization unit 233 adjusts the output timing of the video and audio and outputs the audio of the conversation start audio frame and the video of the conversation start video frame simultaneously. At this time, the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame are skipped.

On the other hand, when the delay of the video input relative to the audio input does not fall within the buffer length, the video/audio synchronization unit 233 does not adjust the output timing and outputs the audio of the conversation start audio frame first.
(B) Specific Example of the Video and Audio Synchronization Operation
Next, the video and audio synchronization operation in the video/audio synchronization unit 233 is described using a specific example.

FIGS. 4 to 8 show the contents of the video buffer and the audio buffer, together with the synchronization operation performed in each of those states.

Here, the video/audio synchronization unit 233 treats video and audio as having the same number of frames per second (performing frame conversion to equalize the frame counts if necessary). Both the video frames and the audio frames start from frame 1, and the conversation starts at frame N.
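The source does not say how the frame conversion is done. One plausible reading is that the audio samples are regrouped into chunks, one per video frame period, so that audio frame i and video frame i cover the same time span. The sketch below illustrates that reading only; the function name and the integer boundary rule are assumptions.

```python
from typing import List, Sequence


def chunk_audio(samples: Sequence[float], sample_rate: int,
                fps: int) -> List[Sequence[float]]:
    """Split audio samples into consecutive chunks, one per video frame period.

    Chunk i covers samples [i * sample_rate // fps, (i + 1) * sample_rate // fps),
    so chunk sizes differ by at most one sample when the rates do not divide evenly.
    """
    chunks = []
    i = 0
    while True:
        start = i * sample_rate // fps
        end = (i + 1) * sample_rate // fps
        if start >= len(samples):
            break
        chunks.append(samples[start:end])
        i += 1
    return chunks
```

For one second of 8000 Hz audio at 60 fps this produces 60 chunks of 133 or 134 samples each, covering all 8000 samples.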

In FIGS. 4 to 8, the upper row shows the contents of the audio buffer and the lower row the contents of the video buffer. Each buffer is a FIFO (First In First Out) buffer into which data is input (stored) from the left and from which data is output from the right. Each square in a buffer represents the data of one frame. The label on each item indicates its frame number; for example, data labeled N is the Nth frame. The Nth frame, at which the conversation starts, is shaded, and the frame being output at that point is drawn with a thick border.
(B-1) When the Audio Input and the Video Input Are Simultaneous
FIG. 4 shows the synchronization operation when the audio input and the video input are simultaneous.

In FIG. 4, both the audio buffer and the video buffer hold, at their head, the Nth-frame data at which the conversation started.

The video/audio synchronization unit 233 therefore detects that there is no video input delay.

The video/audio synchronization unit 233 then outputs the Nth-frame data at the head of the buffer for both audio and video.

In this output, the audio and video are synchronized.
(B-2) When the Video Input Lags the Audio Input
(B-2-1) When the Video Input Delay Falls Within the Buffer Length
When the video input lags the audio input but the delay falls within the buffer length, the audio and video can be synchronized.

図５に、この場合の同期動作を示す。 FIG. 5 shows the synchronization operation in this case.

図５では、音声用バッファの先頭に会話が開始された時のＮフレーム目の音声データが蓄積されているが、映像用バッファでは、先頭以外にＮフレーム目の映像データが蓄積されている。 In FIG. 5, the Nth-frame audio data, at which the conversation started, is stored at the head of the audio buffer, whereas in the video buffer the Nth-frame video data is stored at a position other than the head.

そのため、映像音声同期部２３３は、映像の入力遅れを検知する。 Therefore, the video / audio synchronization unit 233 detects a video input delay.

そこで、映像音声同期部２３３は、音声用バッファの先頭のＮフレーム目の音声データを出力し、また、映像用バッファの先頭のＮ−３フレーム目からＮ−１フレーム目までの映像データをスキップし、Ｎフレーム目の映像データを出力する。 Therefore, the video / audio synchronization unit 233 outputs the Nth-frame audio data at the head of the audio buffer, skips the video data from the (N-3)th frame at the head of the video buffer through the (N-1)th frame, and outputs the Nth-frame video data.
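As an illustration only, the skip operation described for FIG. 5 could be sketched in Python as follows. The function name `sync_on_conversation_start`, the use of `collections.deque` as the FIFO buffer, the representation of each frame by its frame number, and the value N = 100 are all assumptions made for this sketch; they do not appear in the embodiment.

```python
from collections import deque


def sync_on_conversation_start(audio_buf, video_buf, start_frame):
    """Return the (audio, video) frame pair to output, or None.

    Both buffers are FIFO queues whose head is index 0. Frames are
    represented only by their frame numbers (an assumption of this sketch).
    """
    if not audio_buf or audio_buf[0] != start_frame:
        return None  # the audio conversation-start frame is not at the head
    if start_frame not in video_buf:
        return None  # the video delay exceeds the buffer length (case B-2-2)
    # Skip every video frame that precedes the conversation-start frame.
    while video_buf[0] != start_frame:
        video_buf.popleft()
    return audio_buf.popleft(), video_buf.popleft()


N = 100  # hypothetical frame number of the conversation start
audio = deque([N, N + 1, N + 2, N + 3])
video = deque([N - 3, N - 2, N - 1, N])
print(sync_on_conversation_start(audio, video, N))  # (100, 100)
```

In this sketch the frames N-3 to N-1 are simply dropped from the video queue, so the pair that is output carries the same frame number on both sides.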

この出力結果では、音声と映像が同期されている。 In this output result, audio and video are synchronized.
（Ｂ−２−２）映像の入力遅れがバッファ長の範囲内に収まらない場合 (B-2-2) When the video input delay does not fall within the buffer length
音声の入力に対して映像の入力が遅れており、映像の入力遅れがバッファ長の範囲内に収まらない場合、映像の入力遅れが許容時間を越え、音声と映像の同期ができない。 When video input is delayed with respect to audio input and the delay does not fall within the buffer length, the video input delay exceeds the allowable time, and audio and video cannot be synchronized.

図６〜図８に、この場合の同期動作を示す。 6 to 8 show the synchronization operation in this case.

図６は、音声上では会話が開始されたが、映像上ではまだ会話が開始されていない時の動作を示している。 FIG. 6 shows an operation when the conversation is started on the voice but the conversation is not yet started on the video.

映像音声同期部２３３は、音声用バッファの先頭に会話が開始された時のＮフレーム目の音声データが蓄積されているため、音声上では会話が開始されていることがわかるが、映像用バッファにはＮフレーム目の映像データが未だ蓄積されていない。 Because the Nth-frame audio data, at which the conversation started, is stored at the head of the audio buffer, the video / audio synchronization unit 233 can tell that the conversation has started in the audio; however, the Nth-frame video data has not yet been stored in the video buffer.

そのため、映像音声同期部２３３は、映像の入力遅れを検知しない。 Therefore, the video / audio synchronization unit 233 does not detect a video input delay.

そこで、映像音声同期部２３３は、音声用バッファの先頭のＮフレーム目の音声データを出力し、また、映像用バッファの先頭のＮ−８フレーム目の映像データを出力する。 Therefore, the video / audio synchronization unit 233 outputs the Nth-frame audio data at the head of the audio buffer and the (N-8)th-frame video data at the head of the video buffer.

ここで、映像音声同期部２３３が、映像の遅れを検知しない理由は、音声上での会話の開始を誤って検知した場合や、映像上での会話の開始を誤って検知できなかった場合に備えるためである。 The reason the video / audio synchronization unit 233 does not detect a video delay here is to guard against cases where the start of a conversation in the audio was detected erroneously, or where the start of a conversation in the video was erroneously missed.

この出力結果では、音声と映像が同期されず、映像が音声よりも８フレーム分遅れている。 In this output result, the audio and the video are not synchronized, and the video is delayed by 8 frames from the audio.

図７は、図６の動作の後に、映像上でも会話が開始された時（映像上で会話が開始された時のＮフレーム目の映像データが映像音声同期部２３３に入力された時）の動作を示している。 FIG. 7 shows the operation when, after the operation of FIG. 6, the conversation has also started in the video (that is, when the Nth-frame video data, at which the conversation started in the video, has been input to the video / audio synchronization unit 233).

映像音声同期部２３３は、会話判定部（映像）２３１からＮフレーム目の映像データが入力されたタイミングで映像の入力遅れを検知する。 The video / audio synchronization unit 233 detects the video input delay at the timing when the video data of the Nth frame is input from the conversation determination unit (video) 231.

そこで、映像音声同期部２３３は、音声用バッファの先頭のＮ＋３フレーム目の音声データを出力し、また、映像用バッファの先頭のＮ−５フレーム目からＮ−１フレーム目までの映像データをスキップし、Ｎフレーム目の映像データを出力する。 Therefore, the video / audio synchronization unit 233 outputs the (N+3)th-frame audio data at the head of the audio buffer, skips the video data from the (N-5)th frame at the head of the video buffer through the (N-1)th frame, and outputs the Nth-frame video data.

この出力結果では、音声と映像を同期しきれず、映像が音声よりも３フレーム分遅れている（ただし、図６よりも５フレーム分の遅れが改善されている）。 In this output result, audio and video are not fully synchronized, and the video lags the audio by 3 frames (although this is an improvement of 5 frames over FIG. 6).

図８は、図７の動作の後に、複数の映像フレームがまとまって映像音声同期部２３３に入力され、それを用いて映像の遅れを改善する時の動作を示している。 FIG. 8 shows an operation when a plurality of video frames are collected and input to the video / audio synchronization unit 233 after the operation of FIG. 7 to improve the video delay.

映像音声同期部２３３は、映像の入力遅れを検知しているため、極力映像の入力遅れを改善しようとする。 Since the video / audio synchronization unit 233 detects the video input delay, it tries to improve the video input delay as much as possible.

そこで、映像音声同期部２３３は、音声用バッファの先頭のＮ＋１０フレーム目の音声データを出力し、また、映像用バッファの先頭のＮ＋７フレーム目からＮ＋８フレーム目までの映像データをスキップし、Ｎ＋９フレーム目の映像データを出力する。 Therefore, the video / audio synchronization unit 233 outputs the (N+10)th-frame audio data at the head of the audio buffer, skips the video data from the (N+7)th frame at the head of the video buffer through the (N+8)th frame, and outputs the (N+9)th-frame video data.

この出力結果では、音声は映像が同期されておらず、映像が音声よりも１フレーム分遅れている（ただし、図７よりも２フレーム分の遅れが改善されている）。 In this output result, audio and video are still not synchronized, and the video lags the audio by 1 frame (although this is an improvement of 2 frames over FIG. 7).

なお、映像音声同期部２３３は、次回、複数の映像フレームがまとまって入力された場合、その複数の映像フレームのうちの先頭の映像フレームの映像データをスキップする。これにより、音声と映像が同期される。 The video / audio synchronization unit 233 skips the video data of the first video frame among the plurality of video frames when the plurality of video frames are input next time. Thereby, the sound and the video are synchronized.
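The gradual catch-up across FIGS. 6 to 8 can be traced with a small Python sketch. It rests on one assumption that the embodiment does not state explicitly: that the number of skipped frames is the smaller of the current lag and one less than the number of video frames available, so that at least one frame is always left to output. The function name `frames_to_skip` and the frame counts 6, 3, and 2 are illustrative; the first two are read off the skip ranges given for FIGS. 7 and 8, and the last is a hypothetical next burst of two frames.

```python
def frames_to_skip(lag, available):
    """Number of video frames to skip when `available` frames can be consumed.

    Assumption of this sketch: skip min(lag, available - 1) frames, so that
    one frame always remains to be output.
    """
    if available < 2:
        return 0  # nothing can be skipped without leaving the output empty
    return min(lag, available - 1)


lag = 8  # FIG. 6: the video is 8 frames behind the audio
for available in (6, 3, 2):  # FIG. 7, FIG. 8, and a hypothetical next burst
    skipped = frames_to_skip(lag, available)
    lag -= skipped
    print(f"{available} frames available: skip {skipped}, remaining lag {lag}")
```

Under this assumption the lag goes from 8 to 3, then to 1, then to 0, which matches the frame counts given for FIGS. 6 to 8 and the subsequent skip of one frame.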

このように、映像音声同期部２３３は、映像と音声が同期するまでの間に、Ｘ（Ｘは、２以上の自然数）個の映像フレームがまとまって入力され、そのＸ個の映像フレームのうちの先頭の映像フレームの映像が映像用バッファの先頭に蓄積された場合、映像用バッファの先頭からＹ（Ｙは、Ｘ−１以下の自然数）個分の映像フレームの映像をスキップし、残りの映像フレームのうちの先頭の映像フレームの映像を出力する。 In this way, when X video frames (X being a natural number of 2 or more) are input together before the video and audio are synchronized, and the first of those X video frames is stored at the head of the video buffer, the video / audio synchronization unit 233 skips Y video frames (Y being a natural number not greater than X-1) from the head of the video buffer and outputs the first of the remaining video frames.
（１−３）第１の実施形態の効果 (1-3) Effects of the first embodiment
次に、本実施形態の効果について説明する。 Next, the effects of this embodiment will be described.
（１−３−１）タイムスタンプ等の付加情報を用いる必要がない点 (1-3-1) No need for additional information such as time stamps
本実施形態においては、映像音声同期処理部２３０は、映像から顔を検出し、検出した顔の唇領域の動静判定を行うことで、映像上で会話が開始されたことを検知する会話判定部（映像）２３１と、音声の音量判定を行うことで、音声上で会話が開始されたことを検知する会話判定部（音声）２３２と、会話判定部（映像）２３１が映像上で会話が開始されたことを検知した映像上会話開始タイミングと会話判定部（音声）２３２が音声上で会話が開始されたことを検知した音声上会話開始タイミングとの差に基づいて、映像と音声の出力タイミングを調整する映像音声同期部２３３と、を有している。 In this embodiment, the video / audio synchronization processing unit 230 includes: a conversation determination unit (video) 231 that detects a face in the video and determines whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video; a conversation determination unit (audio) 232 that performs a volume determination on the audio, thereby detecting that a conversation has started in the audio; and a video / audio synchronization unit 233 that adjusts the output timing of the video and audio based on the difference between the video conversation start timing, at which the conversation determination unit (video) 231 detected the start of the conversation in the video, and the audio conversation start timing, at which the conversation determination unit (audio) 232 detected the start of the conversation in the audio.

このように、本実施形態においては、映像と音声そのものから、会話開始タイミングを検知するため、タイムスタンプ等の付加情報は不要である。 As described above, in this embodiment, since the conversation start timing is detected from the video and audio itself, additional information such as a time stamp is not necessary.
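To make the idea concrete, the following Python sketch derives a conversation start position from each stream and takes their difference, without any time stamps. It is a simplified illustration under several assumptions: the volume determination is modeled as an RMS threshold, the lip-region motion determination is modeled as a threshold on a per-frame motion score that is assumed to be computed elsewhere, and the names `rms` and `first_active_frame`, the thresholds, and the sample values are all hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def first_active_frame(levels, threshold):
    """Index of the first frame whose level exceeds threshold, or None."""
    for index, level in enumerate(levels):
        if level > threshold:
            return index
    return None


# Hypothetical inputs, indexed by arrival order at the synchronization unit.
audio_frames = [[0.0, 0.01], [0.02, -0.01], [0.5, -0.4], [0.6, 0.3]]
lip_motion = [0.0, 0.1, 0.0, 0.1, 0.9, 0.8]  # assumed lip-region motion scores

audio_start = first_active_frame([rms(f) for f in audio_frames], threshold=0.1)
video_start = first_active_frame(lip_motion, threshold=0.5)
print(audio_start, video_start, video_start - audio_start)  # 2 4 2
```

Here the conversation start reaches the unit two frame slots later in the video than in the audio, and that difference is the quantity on which the output timing adjustment would be based.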

したがって、付加情報を用いることなく、映像と音声を同期（リップシンク）させて、映像上の話者の口の動きと音声を合わせることができる。 Therefore, without using additional information, the video and audio can be synchronized (lip sync), and the movement of the mouth of the speaker on the video can be matched with the audio.

また、付加情報を用いないため、利用するネットワークや処理装置は種別を問わず適用可能である。 In addition, because no additional information is used, the invention can be applied regardless of the type of network or processing device used.

例えば、災害時などに、利用するネットワークや処理装置に障害が発生しても、処理装置を変更するだけで、リップシンクを実現可能である。 For example, even if a failure occurs in the network or processing device in use, such as during a disaster, lip sync can be achieved simply by changing the processing device.
（１−３−２）既存の処理装置を無改造で流用可能である点 (1-3-2) Existing processing devices can be reused without modification
本実施形態においては、上述の通り、映像処理部２１０と音声処理部２２０以外の第３者である映像音声同期処理部２３０がリップシンクを実現している。 In this embodiment, as described above, lip sync is realized by the video / audio synchronization processing unit 230, which is a third party separate from the video processing unit 210 and the audio processing unit 220.

そのため、映像処理部２１０と音声処理部２２０を互いに物理的に分離可能であり、かつ、両者が互いを意識する必要がない。 Therefore, the video processing unit 210 and the audio processing unit 220 can be physically separated from each other, and both need not be aware of each other.

これにより、既存の処理装置（固定電話・携帯電話など）を無改造で利用可能である。 As a result, it is possible to use an existing processing device (a fixed phone, a mobile phone, etc.) without modification.

また、既存の処理装置を利用することで、ネットワークの利用認可の取得等を新たに行う必要もない。 In addition, by using existing processing devices, there is no need to newly obtain network usage authorization or the like.
（１−３−３）ネットワークや処理装置の組み合わせが自在である点 (1-3-3) Networks and processing devices can be combined freely
本実施形態においては、上述の通り、既存の処理装置を流用可能であるため、特定のネットワークを利用可能な既存の処理装置を組み合わせて利用することにより、複数のネットワーク種別に対応可能である。 In this embodiment, as described above, existing processing devices can be reused; therefore, by combining existing processing devices that can each use a specific network, multiple network types can be supported.

例えば、電話網、ＩＳＤＮ（Integrated Services Digital Network）、ＮＧＮ（Next Generation Network）、衛星電話網などでは、プロトコルや通信手順が制限されるが、これらのネットワークにも対応可能である。 For example, in a telephone network, ISDN (Integrated Services Digital Network), NGN (Next Generation Network), satellite telephone network, etc., protocols and communication procedures are limited, but these networks can also be supported.

また、ＩＰネットワークでは、プロトコル・フォーマットが一般化されているが、ＩＰネットワークにも対応可能である。 Further, although the protocol format is generalized in the IP network, it can be applied to the IP network.

また、複数のネットワーク種別に対応可能であるため、システムの可用性も向上する。 In addition, because multiple network types can be supported, system availability also improves.
（１−３−４）ネットワークのリアルタイムな遅延の変動に対して同期が可能な点 (1-3-4) Synchronization can follow real-time fluctuations in network delay
本実施形態においては、会話の開始を同期の契機とし、同期の契機が起こる度に、再度同期を取るため、ネットワーク上でのリアルタイムな遅延の変動にも対応可能である。 In this embodiment, the start of a conversation serves as the trigger for synchronization, and synchronization is performed again each time the trigger occurs, so real-time fluctuations in delay on the network can also be handled.

例えば、モバイル通信や海外との通信では、ネットワークの品質上、遅延量に変動がある場合があるが、このような遅延量の変動に対しても対応可能である。 For example, in mobile communications or communications with overseas locations, the amount of delay may fluctuate depending on network quality, and such fluctuations can also be handled.
（２）他の実施形態 (2) Other embodiments
以上、実施形態を参照して本発明を説明したが、本発明は上記実施形態に限定されるものではない。本発明の構成や詳細には、本発明の範囲内で当業者が理解し得る様々な変更をすることができる。 The present invention has been described above with reference to the embodiment, but the present invention is not limited to the above embodiment. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

例えば、上記実施形態においては、本発明をテレビ会議に適用する場合を例に挙げたが、これには限定されない。本発明が適用可能な例として以下が挙げられる。 For example, in the above embodiment, the case where the present invention is applied to a video conference has been described as an example, but the present invention is not limited to this. The following is an example to which the present invention can be applied.

・ウェブ会議 Web conferencing
・映像配信（ＶＯＤ（Video On Demand）など） Video distribution (such as VOD (Video On Demand))
・マルチメディア処理装置での動画編集 Video editing on a multimedia processing device
マルチメディア処理装置（映像プレイヤ、ＰＣ（Personal Computer）等）においては、動画ファイルの編集（つなぎやテレビのＣＭ（Commercial Message）カット等）を行う際に、動画ファイルのタイムスタンプの一部もしくは全部が失われると、再生時に、タイムスタンプを利用して映像と音声の同期を取ることができなくなる。このような場合でも、本発明により映像と音声の同期を取ることが可能となる。 In a multimedia processing device (a video player, a PC (Personal Computer), etc.), if some or all of the time stamps in a video file are lost when the file is edited (for example, by splicing clips or cutting TV commercials (CM: Commercial Message)), the time stamps can no longer be used to synchronize video and audio during playback. Even in such a case, the present invention makes it possible to synchronize video and audio.

このように、ネットワークを利用しない処理装置にも本発明は有用である。 Thus, the present invention is also useful for a processing apparatus that does not use a network.

また、上記実施形態においては、音声用ネットワーク２には固定電話網を利用し、映像用ネットワーク１にはＩＰネットワークを利用する例を挙げたが、本発明はネットワーク種別を問わずに適用可能である。 In the above embodiment, an example was given in which a fixed telephone network is used for the audio network 2 and an IP network is used for the video network 1; however, the present invention can be applied regardless of the network type.

一例として、帯域保障型のネットワークや、オープンフロー（OpenFlow）などの経路制御機能を持つネットワークも利用可能である。これらの同一のネットワークで音声と映像を通信しても、優先度制御やルーティングの自律制御等により、より高いリアルタイム性が求められる音声は、映像よりも優先して扱われ、映像との間で時間差が生じる可能性がある。この場合も、本発明により映像と音声を同期させることが可能である。また、この場合の構成と動作は、上記実施形態と全く同様なので、ここでは詳細な説明を割愛する。 As one example, a bandwidth-guaranteed network or a network with a path control function such as OpenFlow can also be used. Even when audio and video are carried over the same such network, priority control, autonomous routing control, and the like may cause audio, which requires higher real-time performance, to be handled with priority over video, so that a time difference arises between the two. In this case as well, the present invention can synchronize video and audio. Because the configuration and operation in this case are exactly the same as in the above embodiment, a detailed description is omitted here.

また、上記実施形態においては、映像処理部２１０、音声処理部２２０、および映像音声同期処理部２３０を同一の処理装置内に配置したが、これらの処理部はそれぞれ別々の処理装置として設けても良い。 In the above embodiment, the video processing unit 210, the audio processing unit 220, and the video / audio synchronization processing unit 230 are arranged in the same processing device; however, these processing units may each be provided as a separate processing device.

例えば、音声処理部２２０には既存の固定電話や携帯電話等を利用することができ、これにより、既存の映像・音声関連の処理装置を有効活用することが可能となる。 For example, an existing fixed telephone, a mobile phone, or the like can be used for the audio processing unit 220, which makes it possible to effectively use an existing video / audio related processing device.

また、上記実施形態においては、会話の開始を契機に映像と音声の同期を行っていたが、これに加えて、会話の終了時や文、単語の切れ目などを解析し、これらも契機として映像と音声の同期を行っても良い。 In the above embodiment, video and audio are synchronized when a conversation starts; in addition to this, the end of a conversation and breaks between sentences or words may be analyzed and also used as triggers for synchronizing video and audio.

このように同期の契機を増加させることで、ネットワークのゆらぎ等により、一度取った同期が時間の経過と共にずれていくことを抑止することができる。 Increasing the number of synchronization triggers in this way makes it possible to keep synchronization, once established, from drifting over time due to network jitter or the like.

また、上記実施形態においては、１対１の通信を例に挙げたが、本発明は１対多や多対多の通信にも適用可能である。 In the above embodiment, one-to-one communication has been described as an example. However, the present invention can also be applied to one-to-many and many-to-many communication.

また、上記実施形態においては、テレビ会議の映像と音声を終端する処理装置を例に挙げたが、テレビ会議のＭＣＵ（Multi point control Unit：多地点会議装置）のような、映像と音声を中継する処理装置にも、本発明は適用可能である。 In the above embodiment, the processing device that terminates video and audio of a video conference is taken as an example, but video and audio are relayed as in a video conference MCU (Multi point control unit). The present invention can also be applied to a processing apparatus.

また、上記実施形態においては、映像と音声のそれぞれで利用するネットワークを分けた例を挙げたが、映像と音声で同一のネットワークを利用する場合にも、本発明は適用可能である。 In the above embodiment, an example was given in which separate networks are used for video and audio; however, the present invention is also applicable when the same network is used for both video and audio.

また、上記実施形態においては、図３のステップｃ−２において、映像フレームをスキップする例を挙げたが、その代わりに、映像フレームを破棄したり、映像フレームを一時的に早送りしたりしても良い。 In the above embodiment, an example was given in which video frames are skipped in step c-2 of FIG. 3; instead, the video frames may be discarded, or the video frames may be temporarily fast-forwarded.

また、上記実施形態においては、図３のステップａ−４，ｂ−４において、呼び出し関数においてフラグを立てているが、その代わりに、映像や音声のデータ（Ｈ．２６４やＡＡＣ等）の一領域を利用して、特定の箇所のデータをフラグとして扱っても良い。 In the above embodiment, a flag is set in the calling function in steps a-4 and b-4 of FIG. 3; instead, a region of the video or audio data (H.264, AAC, etc.) may be used so that the data at a specific location is treated as the flag.

また、上記実施形態においては、図４〜図８において、映像と音声の１秒あたりのフレーム数を同一とみなしているが、必ずしてもそうである必要はない。すなわち、フレームを基に出力タイミングを調整するのではなく、動画開始からの時間長を基に出力タイミングを調整することにより、より細かい単位での調整も可能となる。 In the above embodiment, the number of frames per second for video and audio is considered to be the same in FIGS. 4 to 8, but this need not necessarily be the case. That is, the output timing is not adjusted based on the frame, but the output timing is adjusted based on the time length from the start of the moving image, so that finer adjustment is possible.
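As a rough illustration of adjusting by elapsed time rather than by frame count, the Python sketch below converts each stream's conversation start frame into a time from the start of the stream and takes the difference. The function names, the frame rates (50 audio frames and 30 video frames per second), and the frame numbers are assumptions chosen for this example; frames are numbered from 1, following the convention used for FIGS. 4 to 8.

```python
from fractions import Fraction


def start_time(frame_index, frames_per_second):
    """Time in seconds at which a 1-based frame begins."""
    return Fraction(frame_index - 1, frames_per_second)


def video_delay_seconds(audio_frame, audio_fps, video_frame, video_fps):
    """Delay of the video conversation start relative to the audio one."""
    return start_time(video_frame, video_fps) - start_time(audio_frame, audio_fps)


# Hypothetical streams with different frame rates.
delay = video_delay_seconds(audio_frame=51, audio_fps=50,
                            video_frame=34, video_fps=30)
print(float(delay))  # 0.1
```

Exact fractions are used so that frame rates that do not divide evenly introduce no rounding error into the computed delay.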

また、本発明の映像音声処理装置にて行われる方法は、コンピュータに実行させるためのプログラムに適用しても良い。また、そのプログラムを記憶媒体に格納することも可能であり、ネットワークを介して外部に提供することも可能である。 The method performed by the video / audio processing apparatus of the present invention may be applied to a program for causing a computer to execute the method. In addition, the program can be stored in a storage medium and can be provided to the outside via a network.

なお、上記の実施形態の一部または全部は、以下の付記のようにも記載されうるが、以下には限られない。
（付記１）
映像を入力する映像処理部と、
音声を入力する音声処理部と、
前記映像処理部に入力された映像から顔を検出し、検出した顔の唇領域の動静判定を行うことで、該映像上で会話が開始されたことを検知する映像会話判定部と、
前記音声処理部に入力された音声の音量判定を行うことで、該音声上で会話が開始されたことを検知する音声会話判定部と、
前記映像会話判定部が映像上で会話が開始されたことを検知した映像上会話開始タイミングと前記音声会話判定部が音声上で会話が開始されたことを検知した音声上会話開始タイミングとの差を基に、前記映像と前記音声の出力タイミングを調整する映像音声同期部と、を有する映像音声処理装置。
（付記２）
前記映像音声同期部は、
前記映像上会話開始タイミングと前記音声上会話開始タイミングとの差が、予め設定された許容時間の範囲内に収まる場合に、前記映像と前記音声の出力タイミングを調整する、付記１に記載の映像音声処理装置。
（付記３）
前記映像音声同期部は、
前記許容時間に相当するバッファ長に設定され、前記映像を一時的に蓄積する映像用バッファと、
前記許容時間に相当するバッファ長に設定され、前記音声を一時的に蓄積する音声用バッファと、を含み、
前記音声用バッファの先頭に前記音声上会話開始タイミングに対応する会話開始音声フレームの音声が蓄積され、前記映像用バッファの先頭以外に前記映像上会話開始タイミングに対応する会話開始映像フレームの映像が蓄積されている場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力する、付記２に記載の映像音声処理装置。
（付記４）
前記映像音声同期部は、
前記音声用バッファの先頭に前記会話開始音声フレームの音声が蓄積され、前記映像用バッファに前記会話開始映像フレームの映像が蓄積されていない場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭の映像フレームの映像を出力し、
前記会話開始音声フレームの音声を出力した後に、前記映像用バッファに前記会話開始映像フレームの映像が蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力し、
以降、映像と音声が同期するまでの間に、Ｘ（Ｘは、２以上の自然数）個の映像フレームが入力され、前記Ｘ個の映像フレームのうちの先頭の映像フレームの映像が前記映像用バッファの先頭に蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭からＹ（Ｙは、Ｘ−１以下の自然数）個分の映像フレームの映像をスキップ、削除、または早送りし、残りの映像フレームのうちの先頭の映像フレームの映像を出力する、付記３に記載の映像音声処理装置。
（付記５）
映像を入力する映像処理部と、
音声を入力する音声処理部と、
前記映像処理部に入力された映像から顔を検出し、検出した顔の唇領域の動静判定を行うことで、該映像上で会話が開始されたことを検知する映像会話判定部と、
前記音声処理部に入力された音声の音量判定を行うことで、該音声上で会話が開始されたことを検知する音声会話判定部と、
前記映像会話判定部が映像上で会話が開始されたことを検知した映像上会話開始タイミングと前記音声会話判定部が音声上で会話が開始されたことを検知した音声上会話開始タイミングとの差を基に、前記映像と前記音声の出力タイミングを調整する映像音声同期部と、を有する映像音声処理システム。
（付記６）
前記映像音声同期部は、
前記映像上会話開始タイミングと前記音声上会話開始タイミングとの差が、予め設定された許容時間の範囲内に収まる場合に、前記映像と前記音声の出力タイミングを調整する、付記５に記載の映像音声処理システム。
（付記７）
前記映像音声同期部は、
前記許容時間に相当するバッファ長に設定され、前記映像を一時的に蓄積する映像用バッファと、
前記許容時間に相当するバッファ長に設定され、前記音声を一時的に蓄積する音声用バッファと、を含み、
前記音声用バッファの先頭に前記音声上会話開始タイミングに対応する会話開始音声フレームの音声が蓄積され、前記映像用バッファの先頭以外に前記映像上会話開始タイミングに対応する会話開始映像フレームの映像が蓄積されている場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力する、付記６に記載の映像音声処理システム。
（付記８）
前記映像音声同期部は、
前記音声用バッファの先頭に前記会話開始音声フレームの音声が蓄積され、前記映像用バッファに前記会話開始映像フレームの映像が蓄積されていない場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭の映像フレームの映像を出力し、
前記会話開始音声フレームの音声を出力した後に、前記映像用バッファに前記会話開始映像フレームの映像が蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力し、
以降、映像と音声が同期するまでの間に、Ｘ（Ｘは、２以上の自然数）個の映像フレームが入力され、前記Ｘ個の映像フレームのうちの先頭の映像フレームの映像が前記映像用バッファの先頭に蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭からＹ（Ｙは、Ｘ−１以下の自然数）個分の映像フレームの映像をスキップ、削除、または早送りし、残りの映像フレームのうちの先頭の映像フレームの映像を出力する、付記７に記載の映像音声処理システム。
（付記９）
映像音声処理装置による映像音声同期方法であって、
映像を入力する映像処理ステップと、
音声を入力する音声処理ステップと、
前記映像処理ステップにて入力された映像から顔を検出し、検出した顔の唇領域の動静判定を行うことで、該映像上で会話が開始されたことを検知する映像会話判定ステップと、
前記音声処理ステップにて入力された音声の音量判定を行うことで、該音声上で会話が開始されたことを検知する音声会話判定ステップと、
前記映像会話判定ステップにて映像上で会話が開始されたことを検知した映像上会話開始タイミングと前記音声会話判定ステップにて音声上で会話が開始されたことを検知した音声上会話開始タイミングとの差を基に、前記映像と前記音声の出力タイミングを調整する映像音声同期ステップと、を有する映像音声処理方法。
（付記１０）
前記映像音声同期ステップでは、
前記映像上会話開始タイミングと前記音声上会話開始タイミングとの差が、予め設定された許容時間の範囲内に収まる場合に、前記映像と前記音声の出力タイミングを調整する、付記９に記載の映像音声処理方法。
（付記１１）
前記映像音声処理装置は、
前記許容時間に相当するバッファ長に設定され、前記映像を一時的に蓄積する映像用バッファと、
前記許容時間に相当するバッファ長に設定され、前記音声を一時的に蓄積する音声用バッファと、を含むものであり、
前記映像音声同期ステップでは、
前記音声用バッファの先頭に前記音声上会話開始タイミングに対応する会話開始音声フレームの音声が蓄積され、前記映像用バッファの先頭以外に前記映像上会話開始タイミングに対応する会話開始映像フレームの映像が蓄積されている場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力する、付記１０に記載の映像音声処理方法。
（付記１２）
前記映像音声同期ステップでは、
前記音声用バッファの先頭に前記会話開始音声フレームの音声が蓄積され、前記映像用バッファに前記会話開始映像フレームの映像が蓄積されていない場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭の映像フレームの映像を出力し、
前記会話開始音声フレームの音声を出力した後に、前記映像用バッファに前記会話開始映像フレームの映像が蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力し、
以降、映像と音声が同期するまでの間に、Ｘ（Ｘは、２以上の自然数）個の映像フレームが入力され、前記Ｘ個の映像フレームのうちの先頭の映像フレームの映像が前記映像用バッファの先頭に蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭からＹ（Ｙは、Ｘ−１以下の自然数）個分の映像フレームの映像をスキップ、削除、または早送りし、残りの映像フレームのうちの先頭の映像フレームの映像を出力する、付記１１に記載の映像音声処理方法。
（付記１３）
コンピュータに、
映像を入力する映像処理手順と、
音声を入力する音声処理手順と、
前記映像処理手順にて入力された映像から顔を検出し、検出した顔の唇領域の動静判定を行うことで、該映像上で会話が開始されたことを検知する映像会話判定手順と、
前記音声処理手順にて入力された音声の音量判定を行うことで、該音声上で会話が開始されたことを検知する音声会話判定手順と、
前記映像会話判定手順にて映像上で会話が開始されたことを検知した映像上会話開始タイミングと前記音声会話判定手順にて音声上で会話が開始されたことを検知した音声上会話開始タイミングとの差を基に、前記映像と前記音声の出力タイミングを調整する映像音声同期手順と、とを実行させるためのプログラム。
（付記１４）
前記映像音声同期手順では、
前記映像上会話開始タイミングと前記音声上会話開始タイミングとの差が、予め設定された許容時間の範囲内に収まる場合に、前記映像と前記音声の出力タイミングを調整する、付記１３に記載のプログラム。
（付記１５）
前記コンピュータは、
前記許容時間に相当するバッファ長に設定され、前記映像を一時的に蓄積する映像用バッファと、
前記許容時間に相当するバッファ長に設定され、前記音声を一時的に蓄積する音声用バッファと、を含むものであり、
前記映像音声同期手順では、
前記音声用バッファの先頭に前記音声上会話開始タイミングに対応する会話開始音声フレームの音声が蓄積され、前記映像用バッファの先頭以外に前記映像上会話開始タイミングに対応する会話開始映像フレームの映像が蓄積されている場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力する、付記１４に記載のプログラム。
（付記１６）
前記映像音声同期手順では、
前記音声用バッファの先頭に前記会話開始音声フレームの音声が蓄積され、前記映像用バッファに前記会話開始映像フレームの映像が蓄積されていない場合、前記会話開始音声フレームの音声を出力すると共に、前記映像用バッファの先頭の映像フレームの映像を出力し、
前記会話開始音声フレームの音声を出力した後に、前記映像用バッファに前記会話開始映像フレームの映像が蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭から前記会話開始映像フレームの直前までの映像フレームの映像をスキップ、削除、または早送りし、前記会話開始映像フレームの映像を出力し、
以降、映像と音声が同期するまでの間に、Ｘ（Ｘは、２以上の自然数）個の映像フレームが入力され、前記Ｘ個の映像フレームのうちの先頭の映像フレームの映像が前記映像用バッファの先頭に蓄積された場合、前記音声用バッファの先頭の音声フレームの音声を出力すると共に、前記映像用バッファの先頭からＹ（Ｙは、Ｘ−１以下の自然数）個分の映像フレームの映像をスキップ、削除、または早送りし、残りの映像フレームのうちの先頭の映像フレームの映像を出力する、付記１５に記載のプログラム。
Some or all of the above embodiment can also be described as in the following appendices, but is not limited to them.
(Appendix 1)
A video / audio processing apparatus comprising:
a video processing unit that inputs video;
an audio processing unit that inputs audio;
a video conversation determination unit that detects a face in the video input to the video processing unit and determines whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video;
an audio conversation determination unit that performs a volume determination on the audio input to the audio processing unit, thereby detecting that a conversation has started in the audio; and
a video / audio synchronization unit that adjusts the output timing of the video and the audio based on the difference between a video conversation start timing, at which the video conversation determination unit detected that the conversation started in the video, and an audio conversation start timing, at which the audio conversation determination unit detected that the conversation started in the audio.
(Appendix 2)
The video / audio processing apparatus according to appendix 1, wherein the video / audio synchronization unit
adjusts the output timing of the video and the audio when the difference between the video conversation start timing and the audio conversation start timing falls within a preset allowable time.
(Appendix 3)
The video / audio processing apparatus according to appendix 2, wherein the video / audio synchronization unit includes:
a video buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the video; and
an audio buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the audio, and
when the audio of a conversation start audio frame corresponding to the audio conversation start timing is stored at the head of the audio buffer and the video of a conversation start video frame corresponding to the video conversation start timing is stored at a position other than the head of the video buffer, outputs the audio of the conversation start audio frame, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame.
(Appendix 4)
The video / audio processing apparatus according to appendix 3, wherein the video / audio synchronization unit:
when the audio of the conversation start audio frame is stored at the head of the audio buffer and the video of the conversation start video frame is not stored in the video buffer, outputs the audio of the conversation start audio frame and outputs the video of the video frame at the head of the video buffer;
when the video of the conversation start video frame is stored in the video buffer after the audio of the conversation start audio frame has been output, outputs the audio of the audio frame at the head of the audio buffer, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame; and
thereafter, until the video and the audio are synchronized, when X video frames (X being a natural number of 2 or more) are input and the video of the first of the X video frames is stored at the head of the video buffer, outputs the audio of the audio frame at the head of the audio buffer, skips, deletes, or fast-forwards the video of Y video frames (Y being a natural number not greater than X-1) from the head of the video buffer, and outputs the video of the first of the remaining video frames.
(Appendix 5)
A video / audio processing system comprising:
a video processing unit that inputs video;
an audio processing unit that inputs audio;
a video conversation determination unit that detects a face in the video input to the video processing unit and determines whether the lip region of the detected face is moving or still, thereby detecting that a conversation has started in the video;
an audio conversation determination unit that performs a volume determination on the audio input to the audio processing unit, thereby detecting that a conversation has started in the audio; and
a video / audio synchronization unit that adjusts the output timing of the video and the audio based on the difference between a video conversation start timing, at which the video conversation determination unit detected that the conversation started in the video, and an audio conversation start timing, at which the audio conversation determination unit detected that the conversation started in the audio.
(Appendix 6)
The video / audio processing system according to appendix 5, wherein the video / audio synchronization unit
adjusts the output timing of the video and the audio when the difference between the video conversation start timing and the audio conversation start timing falls within a preset allowable time.
(Appendix 7)
The video / audio processing system according to appendix 6, wherein the video / audio synchronization unit includes:
a video buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the video; and
an audio buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the audio, and
when the audio of a conversation start audio frame corresponding to the audio conversation start timing is stored at the head of the audio buffer and the video of a conversation start video frame corresponding to the video conversation start timing is stored at a position other than the head of the video buffer, outputs the audio of the conversation start audio frame, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame.
(Appendix 8)
The video / audio processing system according to appendix 7, wherein the video / audio synchronization unit:
when the audio of the conversation start audio frame is stored at the head of the audio buffer and the video of the conversation start video frame is not stored in the video buffer, outputs the audio of the conversation start audio frame and outputs the video of the video frame at the head of the video buffer;
when the video of the conversation start video frame is stored in the video buffer after the audio of the conversation start audio frame has been output, outputs the audio of the audio frame at the head of the audio buffer, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame; and
thereafter, until the video and the audio are synchronized, when X video frames (X being a natural number of 2 or more) are input and the video of the first of the X video frames is stored at the head of the video buffer, outputs the audio of the audio frame at the head of the audio buffer, skips, deletes, or fast-forwards the video of Y video frames (Y being a natural number not greater than X-1) from the head of the video buffer, and outputs the video of the first of the remaining video frames.
(Appendix 9)
A video / audio synchronization method performed by a video / audio processing apparatus, comprising:
A video processing step of inputting video;
A voice processing step of inputting voice;
A video conversation determination step of detecting a face from the video input in the video processing step and detecting the start of a conversation on the video by performing a motion determination of the lip region of the detected face;
A voice conversation determination step of detecting the start of a conversation on the voice by performing a volume determination of the voice input in the voice processing step; and
A video / audio synchronization step of adjusting the output timing of the video and the audio based on the difference between the on-video conversation start timing, at which the start of the conversation on the video is detected in the video conversation determination step, and the on-voice conversation start timing, at which the start of the conversation on the voice is detected in the voice conversation determination step.
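The steps of appendix 9 can be illustrated with a minimal Python sketch. It is not the implementation described in this document: it assumes that a per-frame lip-motion score and a per-frame volume level have already been computed, that both streams use the same frame duration, and that a simple threshold stands in for the motion and volume determinations. The names `first_onset`, `start_offset_ms`, and all parameters are hypothetical.

```python
from typing import Optional, Sequence


def first_onset(levels: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first frame whose level exceeds the threshold."""
    for index, level in enumerate(levels):
        if level > threshold:
            return index
    return None  # no conversation start detected


def start_offset_ms(
    lip_motion: Sequence[float],   # assumed per-frame lip-motion scores
    volume: Sequence[float],       # assumed per-frame audio volume levels
    motion_threshold: float,
    volume_threshold: float,
    frame_ms: float,               # assumed common frame duration
) -> Optional[float]:
    """Difference between on-video and on-voice conversation start timings.

    A positive value means the conversation starts later on the video
    than on the voice, i.e. the video lags behind the audio.
    """
    video_start = first_onset(lip_motion, motion_threshold)
    audio_start = first_onset(volume, volume_threshold)
    if video_start is None or audio_start is None:
        return None
    return (video_start - audio_start) * frame_ms
```

The returned difference is the quantity on which the synchronization step bases its adjustment of the output timing.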
(Appendix 10)
The video / audio processing method according to appendix 9, wherein, in the video / audio synchronization step,
The output timing of the video and the audio is adjusted when the difference between the on-video conversation start timing and the on-voice conversation start timing falls within a preset allowable time range.
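The condition of appendix 10 amounts to a simple gate on the timing difference. The sketch below assumes timings expressed in milliseconds and treats the boundary of the allowable range as inclusive; both are assumptions, and `should_adjust` is a hypothetical name.

```python
def should_adjust(video_start_ms: float, audio_start_ms: float,
                  allowable_ms: float) -> bool:
    """True when the two conversation start timings differ by no more than
    the preset allowable time (inclusive boundary is an assumption)."""
    return abs(video_start_ms - audio_start_ms) <= allowable_ms
```

A difference larger than the allowable time suggests that the two detections do not belong to the same utterance, so no adjustment is attempted in that case.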
(Appendix 11)
The video / audio processing method according to appendix 10, wherein the video / audio processing apparatus includes:
A video buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the video; and
An audio buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the audio, and
In the video / audio synchronization step,
When the audio of the conversation start audio frame corresponding to the on-voice conversation start timing is stored at the head of the audio buffer and the video of the conversation start video frame corresponding to the on-video conversation start timing is stored at a position other than the head of the video buffer, the audio of the conversation start audio frame is output, the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame is skipped, deleted, or fast-forwarded, and the video of the conversation start video frame is output.
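The buffer handling of appendix 11 can be sketched as follows. This is an illustration under stated assumptions, not the apparatus itself: frames are represented by opaque labels, both buffers are `collections.deque` objects, and only the "delete" option is shown (skipping and fast-forwarding are not modeled). The function name `output_synced` is hypothetical.

```python
from collections import deque


def output_synced(audio_buf: deque, video_buf: deque,
                  audio_start, video_start):
    """Align the buffers at the conversation start frames.

    Returns (audio_frame, video_frame, dropped_video_frames), or None when
    the conversation start audio frame is not at the head of the audio
    buffer or the conversation start video frame is not buffered yet.
    """
    if not audio_buf or audio_buf[0] != audio_start:
        return None
    if video_start not in video_buf:
        return None  # the case handled separately in appendix 12
    dropped = []
    # Discard every video frame that precedes the conversation start frame.
    while video_buf[0] != video_start:
        dropped.append(video_buf.popleft())
    return audio_buf.popleft(), video_buf.popleft(), dropped
```

For example, with audio frames `a5, a6` and video frames `v2` to `v6` buffered, aligning on `a5` and `v5` drops `v2`, `v3`, and `v4` and outputs `a5` together with `v5`.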
(Appendix 12)
The video / audio processing method according to appendix 11, wherein, in the video / audio synchronization step,
When the audio of the conversation start audio frame is stored at the head of the audio buffer and the video of the conversation start video frame is not yet stored in the video buffer, the audio of the conversation start audio frame is output and the video of the first video frame in the video buffer is output;
When the video of the conversation start video frame is stored in the video buffer after the audio of the conversation start audio frame has been output, the audio of the first audio frame in the audio buffer is output, the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame is skipped, deleted, or fast-forwarded, and the video of the conversation start video frame is output; and
Thereafter, until the video and the audio are synchronized, when X video frames (X is a natural number of 2 or more) are input and the video of the first of the X video frames is stored at the head of the video buffer, the audio of the first audio frame in the audio buffer is output, the video of Y video frames (Y is a natural number equal to or less than X-1) from the head of the video buffer is skipped, deleted, or fast-forwarded, and the video of the first of the remaining video frames is output.
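The final catch-up stage of appendix 12 can be sketched as one step that is repeated until the streams are aligned. The sketch assumes that the X newly input video frames are exactly the frames currently held in the video buffer and models only the "delete" option; `catch_up_step` is a hypothetical name.

```python
from collections import deque


def catch_up_step(audio_buf: deque, video_buf: deque, y: int):
    """Output one audio frame and one video frame while dropping Y video frames.

    X is taken to be the number of video frames currently buffered
    (an assumption); the constraints X >= 2 and 1 <= Y <= X - 1 follow
    the wording of the appendix.
    """
    x = len(video_buf)
    if not audio_buf or x < 2 or not 1 <= y <= x - 1:
        raise ValueError("requires audio, X >= 2 and 1 <= Y <= X - 1")
    for _ in range(y):
        video_buf.popleft()  # drop Y frames from the head of the video buffer
    return audio_buf.popleft(), video_buf.popleft()
```

Because at most X-1 frames are dropped, at least one video frame always remains to be displayed, so the video catches up with the audio step by step rather than freezing.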
(Appendix 13)
A program for causing a computer to execute:
A video processing procedure of inputting video;
A voice processing procedure of inputting voice;
A video conversation determination procedure of detecting a face from the video input in the video processing procedure and detecting the start of a conversation on the video by performing a motion determination of the lip region of the detected face;
A voice conversation determination procedure of detecting the start of a conversation on the voice by performing a volume determination of the voice input in the voice processing procedure; and
A video / audio synchronization procedure of adjusting the output timing of the video and the audio based on the difference between the on-video conversation start timing, at which the start of the conversation on the video is detected in the video conversation determination procedure, and the on-voice conversation start timing, at which the start of the conversation on the voice is detected in the voice conversation determination procedure.
(Appendix 14)
The program according to appendix 13, wherein, in the video / audio synchronization procedure,
The output timing of the video and the audio is adjusted when the difference between the on-video conversation start timing and the on-voice conversation start timing falls within a preset allowable time range.
(Appendix 15)
The program according to appendix 14, wherein the computer includes:
A video buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the video; and
An audio buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the audio, and
In the video / audio synchronization procedure,
When the audio of the conversation start audio frame corresponding to the on-voice conversation start timing is stored at the head of the audio buffer and the video of the conversation start video frame corresponding to the on-video conversation start timing is stored at a position other than the head of the video buffer, the audio of the conversation start audio frame is output, the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame is skipped, deleted, or fast-forwarded, and the video of the conversation start video frame is output.
(Appendix 16)
The program according to appendix 15, wherein, in the video / audio synchronization procedure,
When the audio of the conversation start audio frame is stored at the head of the audio buffer and the video of the conversation start video frame is not yet stored in the video buffer, the audio of the conversation start audio frame is output and the video of the first video frame in the video buffer is output;
When the video of the conversation start video frame is stored in the video buffer after the audio of the conversation start audio frame has been output, the audio of the first audio frame in the audio buffer is output, the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame is skipped, deleted, or fast-forwarded, and the video of the conversation start video frame is output; and
Thereafter, until the video and the audio are synchronized, when X video frames (X is a natural number of 2 or more) are input and the video of the first of the X video frames is stored at the head of the video buffer, the audio of the first audio frame in the audio buffer is output, the video of Y video frames (Y is a natural number equal to or less than X-1) from the head of the video buffer is skipped, deleted, or fast-forwarded, and the video of the first of the remaining video frames is output.

The present invention is applicable to the multimedia field in general, wherever video and audio are used.

The present invention can also be used regardless of whether bidirectional communication or real-time performance is required.

DESCRIPTION OF SYMBOLS
1 Video network
2 Audio network
100, 200 Video conference transmitter / receiver
210 Video processing unit
220 Audio processing unit
230 Video / audio synchronization processing unit
231 Conversation determination unit (video)
232 Conversation determination unit (voice)
233 Video / audio synchronization unit

Claims

A video / audio processing apparatus comprising:
A video processing unit that inputs video;
A voice processing unit that inputs voice;
A video conversation determination unit that detects a face from the video input to the video processing unit and detects the start of a conversation on the video by performing a motion determination of the lip region of the detected face;
A voice conversation determination unit that detects the start of a conversation on the voice by performing a volume determination of the voice input to the voice processing unit; and
A video / audio synchronization unit that adjusts the output timing of the video and the audio based on the difference between the on-video conversation start timing, at which the video conversation determination unit detects that the conversation has started on the video, and the on-voice conversation start timing, at which the voice conversation determination unit detects that the conversation has started on the voice.

The video / audio processing apparatus according to claim 1, wherein the video / audio synchronization unit
Adjusts the output timing of the video and the audio when the difference between the on-video conversation start timing and the on-voice conversation start timing falls within a preset allowable time range.

The video / audio processing apparatus according to claim 2, wherein the video / audio synchronization unit includes:
A video buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the video; and
An audio buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the audio, and
When the audio of the conversation start audio frame corresponding to the on-voice conversation start timing is stored at the head of the audio buffer and the video of the conversation start video frame corresponding to the on-video conversation start timing is stored at a position other than the head of the video buffer, outputs the audio of the conversation start audio frame, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame.

The video / audio processing apparatus according to claim 3, wherein the video / audio synchronization unit:
When the audio of the conversation start audio frame is stored at the head of the audio buffer and the video of the conversation start video frame is not yet stored in the video buffer, outputs the audio of the conversation start audio frame and outputs the video of the first video frame in the video buffer;
When the video of the conversation start video frame is stored in the video buffer after the audio of the conversation start audio frame has been output, outputs the audio of the first audio frame in the audio buffer, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame; and
Thereafter, until the video and the audio are synchronized, when X video frames (X is a natural number of 2 or more) are input and the video of the first of the X video frames is stored at the head of the video buffer, outputs the audio of the first audio frame in the audio buffer, skips, deletes, or fast-forwards the video of Y video frames (Y is a natural number equal to or less than X-1) from the head of the video buffer, and outputs the video of the first of the remaining video frames.

A video / audio processing system comprising:
A video processing unit that inputs video;
A voice processing unit that inputs voice;
A video conversation determination unit that detects a face from the video input to the video processing unit and detects the start of a conversation on the video by performing a motion determination of the lip region of the detected face;
A voice conversation determination unit that detects the start of a conversation on the voice by performing a volume determination of the voice input to the voice processing unit; and
A video / audio synchronization unit that adjusts the output timing of the video and the audio based on the difference between the on-video conversation start timing, at which the video conversation determination unit detects that the conversation has started on the video, and the on-voice conversation start timing, at which the voice conversation determination unit detects that the conversation has started on the voice.

The video / audio processing system according to claim 5, wherein the video / audio synchronization unit
Adjusts the output timing of the video and the audio when the difference between the on-video conversation start timing and the on-voice conversation start timing falls within a preset allowable time range.

The video / audio processing system according to claim 6, wherein the video / audio synchronization unit includes:
A video buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the video; and
An audio buffer that is set to a buffer length corresponding to the allowable time and temporarily stores the audio, and
When the audio of the conversation start audio frame corresponding to the on-voice conversation start timing is stored at the head of the audio buffer and the video of the conversation start video frame corresponding to the on-video conversation start timing is stored at a position other than the head of the video buffer, outputs the audio of the conversation start audio frame, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame.

The video / audio processing system according to claim 7, wherein the video / audio synchronization unit:
When the audio of the conversation start audio frame is stored at the head of the audio buffer and the video of the conversation start video frame is not yet stored in the video buffer, outputs the audio of the conversation start audio frame and outputs the video of the first video frame in the video buffer;
When the video of the conversation start video frame is stored in the video buffer after the audio of the conversation start audio frame has been output, outputs the audio of the first audio frame in the audio buffer, skips, deletes, or fast-forwards the video of the video frames from the head of the video buffer up to the frame immediately before the conversation start video frame, and outputs the video of the conversation start video frame; and
Thereafter, until the video and the audio are synchronized, when X video frames (X is a natural number of 2 or more) are input and the video of the first of the X video frames is stored at the head of the video buffer, outputs the audio of the first audio frame in the audio buffer, skips, deletes, or fast-forwards the video of Y video frames (Y is a natural number equal to or less than X-1) from the head of the video buffer, and outputs the video of the first of the remaining video frames.

A video / audio synchronization method performed by a video / audio processing apparatus, comprising:
A video processing step of inputting video;
A voice processing step of inputting voice;
A video conversation determination step of detecting a face from the video input in the video processing step and detecting the start of a conversation on the video by performing a motion determination of the lip region of the detected face;
A voice conversation determination step of detecting the start of a conversation on the voice by performing a volume determination of the voice input in the voice processing step; and
A video / audio synchronization step of adjusting the output timing of the video and the audio based on the difference between the on-video conversation start timing, at which the start of the conversation on the video is detected in the video conversation determination step, and the on-voice conversation start timing, at which the start of the conversation on the voice is detected in the voice conversation determination step.

A program for causing a computer to execute:
A video processing procedure of inputting video;
A voice processing procedure of inputting voice;
A video conversation determination procedure of detecting a face from the video input in the video processing procedure and detecting the start of a conversation on the video by performing a motion determination of the lip region of the detected face;
A voice conversation determination procedure of detecting the start of a conversation on the voice by performing a volume determination of the voice input in the voice processing procedure; and
A video / audio synchronization procedure of adjusting the output timing of the video and the audio based on the difference between the on-video conversation start timing, at which the start of the conversation on the video is detected in the video conversation determination procedure, and the on-voice conversation start timing, at which the start of the conversation on the voice is detected in the voice conversation determination procedure.