JP6016277B2

JP6016277B2 - Audiovisual processing system, audiovisual processing method, and program

Info

Publication number: JP6016277B2
Application number: JP2014094976A
Authority: JP
Inventors: 井上　晃; 野村　俊之
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-05-02
Filing date: 2014-05-02
Publication date: 2016-10-26
Anticipated expiration: 2029-09-25
Also published as: JP2014195267A

Description

The present invention relates to an audiovisual processing system, an audiovisual processing method, and a program.

In video communication devices and similar equipment, viewers often want to focus on a specific object (such as a person or a thing) in an audiovisual signal. In what follows, attention processing applied to the video signal is called video attention processing, and attention processing applied to the audio signal is called acoustic attention processing.

An example of video attention processing will be described with reference to FIG. 18. Assume that the original video frame 71 contains four objects (object A 74, object B 75, object C 76, and object D 77). An object is a physical entity that makes up the captured scene, such as a person, a car, or a building. In the original video frame 71, dotted rectangles indicate the positions of these objects. The attention area within the original video frame 71 is the solid rectangle labeled attention area 73. The attention-processed video 72 is an example of the result of applying video attention processing to the attention area 73: the attention area is enlarged and displayed so that it fills the maximum display width.

One example of acoustic attention processing is to reproduce only the acoustic signal that corresponds to the object of interest. In this case, when the attention area 73 is designated in the original video frame 71, only the sound of object D 77, which lies inside that area, is reproduced.

Attention processing of this kind lets the viewer observe in detail only the area of interest.

To realize attention processing for a specific sound-emitting object in a video, it has been necessary to extract the video objects that act as sound sources and to separate the sound and the video produced by each object. An example of a technique of this kind is described in Patent Document 1.

Patent Document 1 describes a TV conference system that uses a plurality of cameras and a plurality of microphones. When the video from a specific camera is selected as the attention area, only the microphone located near that camera's view is activated, so that audio suited to the attention area is recorded and reproduced. In other words, a specific camera and a specific microphone are fixed to each object, and the video object and the acoustic object are thereby associated and separated.

Another related technique is an object separation method that uses sound source direction detection. FIG. 19 shows a related object separation apparatus.

This object separation apparatus comprises a video object separation unit 901 and a sound source direction detection unit 902. The video signal is input to the video object separation unit 901, and the acoustic signal is input to the sound source direction detection unit 902.

The sound source direction detection unit 902 detects the sound source direction from a multi-channel acoustic signal. One detection method compares the signals of a plurality of directional microphones and takes the direction in which the loudest microphone is pointing as the sound source direction. Another is acoustic beamforming, a known technique that estimates the sound source direction by signal processing, regarding the direction in which the phase difference between the microphone signals is smallest as the sound source direction. Note that the sound source direction detection unit 902 obtains only one sound source direction.
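The first detection method above (take the facing direction of the loudest directional microphone) can be sketched as follows. This is a minimal illustration, not the related apparatus itself: the function name, the use of RMS level as the loudness measure, and the sample values are all assumptions made for the example.

```python
import math


def loudest_microphone_direction(channels, directions_deg):
    """Return the facing direction of the microphone with the highest RMS level.

    channels: one list of samples per directional microphone (hypothetical layout).
    directions_deg: the direction each microphone faces, in degrees.
    """
    def rms(samples):
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    levels = [rms(ch) for ch in channels]
    loudest = max(range(len(levels)), key=levels.__getitem__)
    return directions_deg[loudest]


# Three microphones facing -45, 0 and +45 degrees; the middle one is loudest.
channels = [
    [0.01, -0.02, 0.01, 0.00],
    [0.50, -0.40, 0.45, -0.50],
    [0.10, -0.10, 0.05, -0.05],
]
direction = loudest_microphone_direction(channels, [-45.0, 0.0, 45.0])
```

Like the unit 902 it illustrates, this sketch returns a single direction, so it cannot represent several sources sounding at once.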

The video object separation unit 901 uses the direction information obtained by the sound source direction detection unit 902 to separate a video object within the video frame from the video signal. Examples of video objects are person objects such as object A 74, object B 75, and object C 76 in FIG. 18. Other things that make up the scene, such as automobiles, buildings, and plants, can also be regarded as video objects.

One way to implement the video object separation unit 901 is object detection based on pattern recognition. A template of the video object image is created in advance and matched against the entire video frame. If the correlation value with the template is equal to or greater than a threshold, the desired video object is judged to be present. From the object candidates detected by pattern recognition or similar means, the video object separation unit 901 selects and outputs the one object that lies in the direction obtained by the sound source direction detection unit 902. Because the video is a projection of a limited region of space, strictly speaking the unit selects and outputs the video object closest to the sound source direction.
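The selection step described above (keep candidates whose template correlation reaches the threshold, then output the one closest to the sound source direction) might look like the sketch below. The linear pixel-to-angle mapping, the field-of-view parameter, and every name and value here are assumptions for illustration only.

```python
def pixel_to_direction(x_center, frame_width, horizontal_fov_deg):
    """Map a horizontal pixel position to an angle in degrees.

    Assumes (for illustration) a linear mapping across the camera's field of view,
    with 0 degrees at the frame centre.
    """
    return (x_center / frame_width - 0.5) * horizontal_fov_deg


def select_object(candidates, source_direction_deg, frame_width,
                  horizontal_fov_deg, threshold):
    """Return the detected candidate closest to the sound source direction."""
    # Keep only candidates whose template correlation reaches the threshold.
    detected = [c for c in candidates if c["score"] >= threshold]
    if not detected:
        return None
    return min(
        detected,
        key=lambda c: abs(
            pixel_to_direction(c["x_center"], frame_width, horizontal_fov_deg)
            - source_direction_deg
        ),
    )


# Hypothetical detections in a 640-pixel-wide frame with a 60-degree field of view.
candidates = [
    {"name": "A", "x_center": 160, "score": 0.91},
    {"name": "B", "x_center": 320, "score": 0.88},
    {"name": "C", "x_center": 520, "score": 0.95},
    {"name": "poster", "x_center": 500, "score": 0.40},  # below threshold
]
selected = select_object(candidates, 20.0, 640, 60.0, 0.8)
```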

In this way, by separating the video object that lies in the sound source direction, the object separation apparatus of FIG. 19 can separate an object signal in which sound and video are associated.

Patent Document 1: JP 2005-45779 A

However, the technique of Patent Document 1 requires that a camera and a microphone be appropriately placed for each object when the audiovisual signal is acquired. As a result, the costs of video production, storage, and transmission are high.

The other related technique can obtain only one sound source direction in sound source direction detection. As a result, when a plurality of sound sources emit sound at the same time, the sounds cannot be matched to the video objects.

The present invention was made in view of the above problems. Its object is to provide an audiovisual processing system, an audiovisual processing method, and a program that use correlation to associate at least one acoustic object with a video object from among a plurality of coexisting objects.

The present invention that solves the above problems is an audiovisual processing system comprising: a video object separation unit that separates video objects from an input video signal; an acoustic object separation unit that separates acoustic objects from an input acoustic signal; and a correlation association unit that obtains the correlation between the video objects and the acoustic objects and associates at least one video object with an acoustic object.

The present invention that solves the above problems is also an audiovisual processing method that separates video objects from an input video signal, separates acoustic objects from an input acoustic signal, obtains the correlation between the video objects and the acoustic objects, and associates at least one video object with an acoustic object.

The present invention that solves the above problems is also a program that causes an information processing apparatus to execute: a process of separating video objects from an input video signal; a process of separating acoustic objects from an input acoustic signal; and a process of obtaining the correlation between the video objects and the acoustic objects and associating at least one video object with an acoustic object.

According to the present invention, video objects and acoustic objects can be associated with each other even when a plurality of objects are mixed in the signals.

FIG. 1 is a block diagram of the present embodiment.
FIG. 2 is a diagram for explaining the present embodiment.
FIG. 3 is a block diagram of the first embodiment.
FIG. 4 is a block diagram of the second embodiment.
FIG. 5 is a block diagram of the correlation association unit 32 of the second embodiment.
FIG. 6 is a diagram for explaining the operation of the category association unit 323.
FIG. 7 is a block diagram of the third embodiment.
FIG. 8 is a block diagram of the video object separation unit 13 of the third embodiment.
FIG. 9 is a block diagram of the correlation association unit 33 of the third embodiment.
FIG. 10 is a diagram for explaining the operation of the AV signal correlation association unit 333.
FIG. 11 is a diagram for explaining the operation of the AV signal correlation association unit 333.
FIG. 12 is a diagram for explaining the operation of the AV signal correlation association unit 333.
FIG. 13 is a diagram for explaining the operation of the AV signal correlation association unit 333.
FIG. 14 is a diagram for explaining the operation of the AV signal correlation association unit 333.
FIG. 15 is a block diagram of the fourth embodiment.
FIG. 16 is a block diagram of the correlation association unit 34 of the fourth embodiment.
FIG. 17 is a diagram for explaining the operation of the AV signal correlation association unit 343.
FIG. 18 is a diagram for explaining the related art.
FIG. 19 is a diagram for explaining the related art.

An outline of an embodiment of the present invention will be described.

Referring to FIG. 1, the present invention comprises a video object separation unit 1, an acoustic object separation unit 2, and a correlation association unit 3.

The video object separation unit 1 separates the video objects within a video frame from the video signal. Examples of video objects are person objects such as object A 74, object B 75, and object C 76 in FIG. 2. Other things that make up the scene, such as automobiles, buildings, and plants, can also be regarded as video objects. A plurality of video objects may be separated.

The acoustic object separation unit 2 separates the input acoustic signal into a plurality of sound source signals. A separated sound source signal is called an acoustic object.

The correlation association unit 3 receives a plurality of video objects and a plurality of acoustic objects, obtains the correlation between the video objects and the acoustic objects, and identifies which video object, at which position in the video frame, each acoustic object corresponds to.

Embodiments of the present invention will be described below in detail with reference to the drawings.
<First Embodiment>
A first embodiment will be described.

Referring to FIG. 3, the first embodiment comprises a video object separation unit 11, an acoustic object separation unit 21, and a correlation association unit 31.

The video object separation unit 11 separates the video objects within a video frame from the video signal. Examples of video objects are person objects such as object A 74, object B 75, and object C 76 in FIG. 2. Other things that make up the scene, such as automobiles, buildings, and plants, can also be regarded as video objects. One way to implement the video object separation unit 1 is object detection based on pattern recognition. A template of the video object image is created in advance and matched against the entire video frame. If the correlation value with the template is equal to or greater than a threshold, the desired video object is judged to be present, and the corresponding partial area is separated as a video object signal. A plurality of video objects may be separated.

The acoustic object separation unit 21 separates an input multi-channel acoustic signal into a plurality of sound source signals. A separated sound source signal is called an acoustic object. To generate the object separation information, the acoustic object separation unit 2 can use the techniques known as blind source separation and independent component analysis. Techniques related to these methods are disclosed in Non-Patent Document 1 (Speech Enhancement, Springer, 2005, pp. 271-369). With appropriate parameter settings, the acoustic object separation unit 2 can automatically separate the input audio signal into sound source signals.

The correlation association unit 31 receives a plurality of video objects and a plurality of acoustic objects, obtains the correlation between the video objects and the acoustic objects, and identifies which video object, at which position in the video frame, each acoustic object corresponds to. In other words, it determines where in the video frame each acoustic object (sound source) originates. The association is realized by extracting a feature vector from each video object and each acoustic object, correlating them, and finding the combination with the highest correlation value. Examples of feature vectors include time-frequency features and category membership degrees.

In this way, the video objects and the acoustic objects are associated with each other.
<Second Embodiment>
A second embodiment will be described.

Referring to FIG. 4, the second embodiment comprises a video object separation unit 12, an acoustic object separation unit 22, and a correlation association unit 32.

The video object separation unit 12 and the acoustic object separation unit 22 are the same as the video object separation unit 11 and the acoustic object separation unit 21 of the first embodiment, so a detailed description is omitted.

As shown in FIG. 5, the correlation association unit 32 comprises a video category determination unit 321, an acoustic category determination unit 322, and a category association unit 323.

The video category determination unit 321 identifies the category of a video object or calculates its degree of membership in each category. Examples of object categories include a male face, a female face, a child's face, a male whole body, a female whole body, a child's whole body, a car, a train, a PC, and a display. The category determined for the object is used in later processing to identify the acoustic objects present in the video frame.

An example of the operation of the video category determination unit 321 is as follows. Several video categories are decided in advance, and a group of typical images for each category is prepared as templates. Pattern matching is performed between the pixels of the video object region and the templates, and the video object is assigned to the category with the highest similarity; this identifies the category to which it belongs. Alternatively, the similarities can be output as the degrees of membership in the respective categories. A known technique such as normalized correlation can be used for the pattern matching.
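As a rough sketch of this step, the code below scores a video object region against one template per category with normalized correlation, then reports both the best category and the per-category similarities (usable as membership degrees). The flat pixel lists, the single template per category, and the function names are simplifying assumptions, not details from the source.

```python
import math


def normalized_correlation(patch, template):
    """Zero-mean normalized correlation between two equally sized pixel lists."""
    n = len(patch)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = math.sqrt(
        sum((p - mean_p) ** 2 for p in patch)
        * sum((t - mean_t) ** 2 for t in template)
    )
    return num / den if den else 0.0


def classify(patch, templates):
    """Return (best category, similarity per category) for one object region."""
    memberships = {
        name: normalized_correlation(patch, template)
        for name, template in templates.items()
    }
    best = max(memberships, key=memberships.get)
    return best, memberships


# Toy 2x2 "images" flattened to lists; real templates would be full image patches.
templates = {
    "male face": [10, 200, 10, 200],
    "car": [200, 200, 10, 10],
}
best, memberships = classify([20, 180, 30, 190], templates)
```

Normalized correlation is insensitive to overall brightness and contrast, which is one reason it is a common choice for template matching.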

The acoustic category determination unit 322 identifies the category of an acoustic object or calculates its degree of membership in each category. Examples of object categories include a male voice, a female voice, a child's voice, car sound, train sound, air-conditioning sound, keyboard sound, mouse click sound, and ambient noise.

An example of the operation of the acoustic category determination unit 322 is as follows. Several acoustic categories are decided in advance, and typical sound source data for each category is prepared. The waveform of the acoustic object is matched against the waveforms of the sound source data, and the acoustic object is assigned to the category with the highest similarity; this identifies the category to which it belongs. Alternatively, the similarities can be output as the degrees of membership in the respective categories.

The category association unit 323 associates the categories of the video objects with the categories of the acoustic objects, and thereby associates the video objects with the acoustic objects. An example of the operation of the category association unit 323 will be described with reference to FIG. 6.
In the video frame 111, a male face 112, a female face 113, and a car 114 are present as video object categories. The group of video objects is shown in the object list 115.

For the video frame 111, car sound, a female voice, a male voice, and noise have been classified as acoustic object categories. The group of acoustic objects is shown in the acoustic object list 116. It is easy to determine that the car corresponds to the car sound, the male voice to the male face, and the female voice to the female face.

Only the noise acoustic object, however, has no corresponding video object.

The above processing produces the object correspondence table 117. From the object correspondence table 117, it is possible to determine which video object each acoustic object corresponds to, together with the coordinates of that object in the video frame.
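A correspondence table of this kind could be built as in the sketch below. The category-to-category mapping, the dictionary layout, and the coordinates are hypothetical; the sketch also assumes at most one video object per category, as in the FIG. 6 example.

```python
# Hypothetical mapping from acoustic categories to video categories.
CATEGORY_MAP = {
    "car sound": "car",
    "male voice": "male face",
    "female voice": "female face",
}


def build_correspondence_table(video_objects, audio_objects):
    """Pair each acoustic object with the video object of the matching category."""
    by_category = {v["category"]: v for v in video_objects}
    table = []
    for audio in audio_objects:
        video = by_category.get(CATEGORY_MAP.get(audio["category"]))
        table.append({
            "audio": audio["category"],
            "video": video["category"] if video else None,
            "position": video["position"] if video else None,
        })
    return table


# Positions are illustrative (x, y) coordinates in the video frame.
video_objects = [
    {"category": "male face", "position": (120, 80)},
    {"category": "female face", "position": (300, 90)},
    {"category": "car", "position": (500, 200)},
]
audio_objects = [
    {"category": "car sound"},
    {"category": "female voice"},
    {"category": "male voice"},
    {"category": "noise"},
]
table = build_correspondence_table(video_objects, audio_objects)
```

The noise entry ends up with no video object and no position, matching the case described in the text.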

The object correspondence table 117 is an example in which each category is identified uniquely. As another way to implement the category association unit 323, the category membership degrees can be used as feature values, and objects can be associated by correlating their feature values. The category membership degree is a feature vector made up of the similarities to the individual categories and is expressed by values such as (male, female, car) = (1.0, 0.5, 0.2). Pairing the objects whose feature vectors are closest associates the video objects with the acoustic objects.
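The membership-vector variant can be sketched as a nearest-neighbour search. Euclidean distance as the measure of closeness, the independent per-object pairing, and all sample vectors below are assumptions chosen for illustration; the source does not fix a particular distance.

```python
import math


def associate_by_membership(video_vectors, audio_vectors):
    """Pair each acoustic object with the video object whose vector is nearest."""
    pairs = {}
    for audio_name, audio_vec in audio_vectors.items():
        pairs[audio_name] = min(
            video_vectors,
            key=lambda name: math.dist(video_vectors[name], audio_vec),
        )
    return pairs


# Membership vectors over the categories (male, female, car); values are made up.
video_vectors = {
    "A": (1.0, 0.5, 0.2),
    "B": (0.4, 0.9, 0.1),
    "C": (0.1, 0.1, 1.0),
}
audio_vectors = {
    "s1": (0.2, 0.1, 0.9),
    "s2": (0.9, 0.4, 0.1),
    "s3": (0.5, 1.0, 0.2),
}
pairs = associate_by_membership(video_vectors, audio_vectors)
```

Because each acoustic object is matched independently, two of them can map to the same video object; a one-to-one assignment would need an additional constraint.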

After the correlation association unit 32 has associated video and sound, the video object signals and the acoustic object signals are output.
<Third Embodiment>
A third embodiment will be described.

Referring to FIG. 7, the third embodiment comprises a video object separation unit 13, an acoustic object separation unit 23, and a correlation association unit 33.

The acoustic object separation unit 23 operates in the same way as the acoustic object separation unit 21, separating and outputting the acoustic objects.

The video object separation unit 13 has a person detection unit 131. The person detection unit 131 uses pattern recognition to extract a person area as a video object. One way to extract a person area is to detect the person's face area, as described, for example, in Non-Patent Document 2 (M. Turk, A. Pentland, "Face Recognition Using Eigenfaces," Proceedings of IEEE, CVPR91, pp. 586-591 (1991)).

As shown in FIG. 8, the video object separation unit 13 detects a person area with the person detection unit 131 and separates the person area as a video object.

As shown in FIG. 9, the correlation association unit 33 comprises a motion detection unit 331, a voice section detection unit 332, and an AV signal correlation association unit 333.

The motion detection unit 331 focuses on a partial area within the person area, such as the lips, obtains the inter-frame difference of that partial area over the past period of length t, and outputs a video motion pattern. The video motion pattern represents the change over time of the pixel values within the area.

For each acoustic object, the voice section detection unit 332 determines whether voice sections are present over the past period of length t and outputs a voice section pattern.
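The source does not say how voice sections are detected. Purely as an illustration, the sketch below uses a short-time power threshold, one common and simple approach; the frame length, the threshold, and the function name are assumptions.

```python
def voice_section_pattern(samples, frame_len, threshold):
    """Return a 0/1 pattern with one entry per non-overlapping frame.

    A frame is marked 1 when its mean power reaches the threshold. This
    power-threshold rule is an assumed stand-in for the unspecified detector.
    """
    pattern = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        pattern.append(1 if power >= threshold else 0)
    return pattern


# Silence, a burst of speech-like energy, then low-level background.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.01] * 4
voice_pattern = voice_section_pattern(samples, frame_len=4, threshold=0.01)
```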

The AV signal correlation association unit 333 compares the video motion patterns from the motion detection unit 331 with the voice section patterns from the voice section detection unit 332 and finds the highly correlated combinations, thereby identifying the acoustic object that corresponds to each person's video object.

The operation of the AV signal correlation association unit 333 will be described concretely with reference to FIG. 10.

For example, in the video frame 121, the person detection unit 131 has detected object J 122 and object K 123. The inter-frame differences in the lip portions, which are partial areas within these person areas, are shown as the lip area inter-frame difference 124. Binarizing the lip area inter-frame difference 124 with an appropriate threshold yields the motion pattern 125.
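The step from lip-area frames to a binarized motion pattern can be sketched as below. Summing absolute pixel differences over the region, the flat pixel lists, and the threshold value are assumptions made for the example.

```python
def motion_pattern(lip_frames, threshold):
    """Binarize the inter-frame difference of a lip region into a 0/1 pattern.

    lip_frames: successive lip-region images, each a flat list of pixel values.
    Returns one value per pair of consecutive frames.
    """
    pattern = []
    for previous, current in zip(lip_frames, lip_frames[1:]):
        # Integrate the absolute pixel differences over the region.
        difference = sum(abs(c - p) for p, c in zip(previous, current))
        pattern.append(1 if difference >= threshold else 0)
    return pattern


# The lips open in the third frame and close again in the fourth.
lip_frames = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 60, 60, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
lip_pattern = motion_pattern(lip_frames, threshold=50)
```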

Let the result of voice section detection on the acoustic objects be the voice section pattern 126. Comparing the motion pattern 125 with the voice section pattern 126 shows a high correlation between the motion pattern of object J and the first voice section pattern 127, so these are judged to belong to the same object. Likewise, there is a high correlation between the motion pattern of object K and the second voice section pattern 128, so these are also judged to belong to the same object. It follows that the sound source signal of object J is the first acoustic object and the sound source signal of object K is the second acoustic object.

After the AV signal correlation association unit 333 has associated video and sound, the video object signals and the acoustic object signals are output.

Next, a concrete method of calculating the correlation value in the AV signal correlation association unit 333 will be described.

FIG. 11 shows an example of a time-series motion pattern bx 201, obtained by binarizing the integrated inter-frame difference of the video into 0 and 1, and a voice section pattern by 202, likewise binarized into 0 and 1. The motion pattern bx 201 corresponds to the motion pattern 125 described above, and the voice section pattern by 202 corresponds to the voice section pattern 126 described above.

Using a predetermined time interval T, the correlation value S over the period of length T starting at time a can be calculated using Equation 1.

The association is then made by selecting, from among the combinations of sound and video, the combination with the largest correlation value S.
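Equation 1 is not reproduced in this text, so the sketch below substitutes an assumed correlation measure: the number of time steps in the window from a to a+T at which the binarized motion pattern and the binarized voice section pattern agree. The pairing rule (largest value wins) follows the text; the measure itself, the names, and the data are illustrative only.

```python
def binary_correlation(bx, by, a, T):
    """Count the time steps in [a, a+T) where the two 0/1 patterns agree.

    This agreement count is an assumed stand-in for Equation 1.
    """
    return sum(1 for t in range(a, a + T) if bx[t] == by[t])


def associate(motion_patterns, voice_patterns, a, T):
    """For each video object, pick the acoustic object with the largest S."""
    return {
        video: max(
            voice_patterns,
            key=lambda audio: binary_correlation(pattern, voice_patterns[audio], a, T),
        )
        for video, pattern in motion_patterns.items()
    }


# Hypothetical patterns for two people and two separated sound sources.
motion_patterns = {
    "J": [0, 1, 1, 0, 0, 0, 1, 1],
    "K": [0, 0, 0, 1, 1, 0, 0, 0],
}
voice_patterns = {
    "first": [0, 1, 1, 0, 0, 0, 1, 0],
    "second": [0, 0, 0, 1, 1, 1, 0, 0],
}
pairing = associate(motion_patterns, voice_patterns, a=0, T=8)
```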

Another method of calculating the correlation value in the AV signal correlation association unit 333 will now be described.

FIG. 12 shows a time-series motion pattern bx 211, obtained by binarizing the integrated inter-frame difference of the video into 0 and 1, and a voice section pattern by 212, likewise binarized into 0 and 1. The motion pattern bx 211 corresponds to the motion pattern 125 described above, and the voice section pattern by 212 corresponds to the voice section pattern 126 described above.

Let t1x be the time at which the motion pattern bx starts (changes from 0 to 1) and t2x the time at which it ends (changes from 1 to 0). Similarly, let t1y be the time at which the voice section pattern by rises (changes from 0 to 1) and t2y the time at which it ends (changes from 1 to 0). The time difference Td is then calculated by Equation 2. Among the combinations of sound and video, a smaller time difference Td is taken to indicate a better match, and sound and video are associated accordingly.

The association can also be made by comparing only the start times. In this case, the time difference is calculated as the time difference Td2 shown in Expression 103.
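The formulas for Td and Td2 are not reproduced in this text. The sketch below assumes the natural reading of the definitions: Td as the sum of the absolute differences of the start times and of the end times, and Td2 as the absolute difference of the start times alone. Both formulas, and the helper that finds the first active interval, are assumptions for illustration.

```python
def first_interval(pattern):
    """Return (start, end) of the first run of 1s in a 0/1 pattern, or None.

    start is the index where the pattern changes from 0 to 1;
    end is the index where it changes back from 1 to 0.
    """
    if 1 not in pattern:
        return None
    start = pattern.index(1)
    end = start
    while end < len(pattern) and pattern[end] == 1:
        end += 1
    return start, end


def time_differences(bx, by):
    """Return (Td, Td2) under the assumed formulas described above."""
    t1x, t2x = first_interval(bx)
    t1y, t2y = first_interval(by)
    td = abs(t1x - t1y) + abs(t2x - t2y)  # assumed form of Td
    td2 = abs(t1x - t1y)                  # assumed form of Td2 (start times only)
    return td, td2


bx = [0, 0, 1, 1, 1, 0, 0]  # motion starts at index 2, ends at index 5
by = [0, 0, 0, 1, 1, 1, 0]  # voice starts at index 3, ends at index 6
td, td2 = time_differences(bx, by)
```

The pair of video and acoustic objects with the smallest difference would then be associated.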

The AV signal correlation association unit 333 receives the video motion pattern (inter-frame difference) from the motion detection unit 331 and the voice section pattern from the voice section detection unit 332. It then obtains a time-series motion pattern M 221, which is the integrated inter-frame difference of the video, and the audio signal power J 222 of the acoustic object. FIG. 13 shows an example of the time-series motion pattern M 221 and the audio signal power J 222.

このとき、映像オブジェクトと音響オブジェクトとの時刻aからＴ時間における相関値S２は、予め決められた時間間隔Ｔを用いて、数３を用いて算出することができる。 At this time, the correlation value S2 between the video object and the acoustic object over the interval of length T starting at time a can be calculated by Equation 3, where T is a predetermined time interval.

また、数４のＳ３のように、ＭとＪとの相関係数を相関値として算出することもできる。 Alternatively, the correlation coefficient between M and J can be used as the correlation value, as in S3 of Equation 4.
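The two correlation values can be illustrated with a short sketch. Equations 3 and 4 are not reproduced in this text, so both forms below are assumptions: `windowed_product_sum` treats S2 as the sum of products of M and J over the window starting at time a with length T, and `correlation_coefficient` uses the standard Pearson correlation coefficient for S3. The function names and sample values are hypothetical.

```python
import math


def windowed_product_sum(m, j, a, t):
    """Assumed S2-style score: sum of M(i) * J(i) for i in [a, a + t)."""
    return sum(m[i] * j[i] for i in range(a, a + t))


def correlation_coefficient(m, j):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(m)
    mean_m = sum(m) / n
    mean_j = sum(j) / n
    cov = sum((x - mean_m) * (y - mean_j) for x, y in zip(m, j))
    var_m = sum((x - mean_m) ** 2 for x in m)
    var_j = sum((y - mean_j) ** 2 for y in j)
    if var_m == 0 or var_j == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_m * var_j)


# Hypothetical motion pattern M and two candidate audio power sequences J.
M = [0.0, 2.0, 4.0, 2.0, 0.0, 0.0]
J_same = [0.1, 1.1, 2.1, 1.1, 0.1, 0.1]    # rises and falls together with M
J_other = [2.0, 0.0, 0.0, 0.0, 2.0, 2.0]   # active when M is quiet
s3_same = correlation_coefficient(M, J_same)
s3_other = correlation_coefficient(M, J_other)
s2 = windowed_product_sum(M, J_same, 1, 3)
```

One reason to prefer the coefficient form is that it is insensitive to the differing scales of pixel differences and audio power, whereas a raw product sum grows with signal amplitude.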

尚、上述したＡＶ信号相関対応付け部３３３における相関値の計算方法において、動きパターンを映像オブジェクトの動きベクトルから算出するようにしても良い。 In the above-described correlation value calculation method in the AV signal correlation association unit 333, the motion pattern may be calculated from the motion vector of the video object.
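As a rough illustration of deriving a motion value from a motion vector, the sketch below finds the displacement of an object's rectangle between two frames by exhaustive template matching. The similarity score (sum of absolute differences), the full-frame search range, and the names `sad` and `motion_vector` are assumptions for illustration; frames are plain lists of pixel rows.

```python
import math


def sad(frame, template, top, left):
    """Sum of absolute differences between the template and a frame patch."""
    return sum(
        abs(frame[top + r][left + c] - template[r][c])
        for r in range(len(template))
        for c in range(len(template[0]))
    )


def motion_vector(prev_frame, next_frame, top, left, height, width):
    """Locate the patch taken at (top, left) in prev_frame inside next_frame."""
    template = [row[left:left + width] for row in prev_frame[top:top + height]]
    candidates = [
        (r, c)
        for r in range(len(next_frame) - height + 1)
        for c in range(len(next_frame[0]) - width + 1)
    ]
    best_r, best_c = min(
        candidates, key=lambda p: sad(next_frame, template, p[0], p[1])
    )
    return best_r - top, best_c - left   # (dy, dx) displacement


# Hypothetical 4x5 frames: a 2x2 object moves one pixel down and right.
frame_t = [[0, 0, 0, 0, 0], [0, 9, 8, 0, 0], [0, 7, 6, 0, 0], [0, 0, 0, 0, 0]]
frame_t1 = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 9, 8, 0], [0, 0, 7, 6, 0]]
dy, dx = motion_vector(frame_t, frame_t1, 1, 1, 2, 2)
length = math.hypot(dy, dx)   # vector length, usable as the motion value
```

Collecting `length` for each consecutive frame pair would give a time series that can stand in for the inter-frame difference integral when computing the correlation value.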

この場合、図１４に示すように、時刻ｔから時刻ｔ+1までの、映像オブジェクトの動きベクトルを求める。動きベクトルの算出方法として、テンプレートマッチング法などがある。これは時刻ｔでオブジェクトが占める部分領域画像をテンプレートとし、t+1の映像中から類似パターンが存在する位置をテンプレートマッチングで探索する方法である。これにより、映像オブジェクトの時刻ｔからｔ＋１の動きベクトルを算出することができる。次に、動きベクトル２３３の長さを求める。本実施の形態では、動きベクトルの長さを、上述したフレーム間差分の積分値に置き換えて動きパターンを生成し、相関値を算出する。 In this case, as shown in FIG. 14, the motion vector of the video object from time t to time t+1 is obtained. One method of calculating the motion vector is template matching: the partial region image occupied by the object at time t is used as a template, and the video at time t+1 is searched for the position where a similar pattern exists. This yields the motion vector of the video object from time t to time t+1. Next, the length of the motion vector 233 is obtained. In the present embodiment, the motion pattern is generated by using the length of the motion vector in place of the integral value of the inter-frame difference described above, and the correlation value is then calculated.

＜第４の実施の形態＞
<Fourth embodiment>
第４の実施の形態を説明する。 A fourth embodiment will be described.

図１５を参照すると、第４の実施の形態は、映像オブジェクト分離部１４と、音響オブジェクト分離部２４と、相関対応付け部３４とから構成されている。 Referring to FIG. 15, the fourth embodiment includes a video object separation unit 14, an acoustic object separation unit 24, and a correlation association unit 34.

映像オブジェクト分離部１４は、映像オブジェクト分離部１１と同様に動作して、映像オブジェクトを分離して出力する。音響オブジェクト分離部２４は、音響オブジェクト分離部２１と同様に動作して、音響オブジェクトを分離して出力する。 The video object separation unit 14 operates in the same manner as the video object separation unit 11 and separates and outputs the video object. The acoustic object separation unit 24 operates in the same manner as the acoustic object separation unit 21, and separates and outputs the acoustic object.

相関対応付け部３４は、図１６に示す如く、映像動作検出部３４１と、動作音区間検出部３４２と、ＡＶ信号相関対応付け部３４３とから構成されている。 As shown in FIG. 16, the correlation association unit 34 includes a video motion detection unit 341, a motion sound section detection unit 342, and an AV signal correlation association unit 343.

映像動作検出部３４１は、映像オブジェクトが存在する部分領域に着目し、前記部分領域のフレーム間差分を過去ｔ時間にわたって求め、動きパターンを出力する。動きパターンは領域内の画素値の時間変化を表す。 The video motion detection unit 341 focuses on the partial area where a video object exists, obtains the inter-frame difference of that partial area over the past period of length t, and outputs a motion pattern. The motion pattern represents the temporal change of the pixel values in the area.

動作音区間検出部３４２は、音響オブジェクトごとに、過去ｔ時間にわたって動作音が存在するかどうかを求め、動作音区間パターンを出力する。動作音の一例として、自動車のエンジン音や、人物の歩く足音などがある。 For each acoustic object, the motion sound section detection unit 342 determines whether a motion sound is present over the past period of length t, and outputs a motion sound section pattern. Examples of motion sounds include the engine sound of a car and the footsteps of a person walking.

ＡＶ信号相関対応付け部３４３は、前記映像動きパターンと前記動作音区間パターンとを比較して相関の高い組み合わせを求め、映像オブジェクトに対応した音響オブジェクトを同定する。 The AV signal correlation associating unit 343 compares the video motion pattern with the motion sound section pattern to find highly correlated combinations, and thereby identifies the acoustic object corresponding to each video object.

図１７を参照してＡＶ信号相関対応付け部３４３の動作を説明する。 The operation of the AV signal correlation association unit 343 will be described with reference to FIG. 17.

映像フレーム１３１において、映像オブジェクト分離部１４によってオブジェクトＬ１３２と、オブジェクトＭ１３３が検出されている。映像動作検出部３４１は、これらのオブジェクトが存在する部分領域のフレーム間差分を算出し（図１７中、オブジェクト領域フレーム間差分１３４）、オブジェクト領域フレーム間差分１３４に対し、適当なしきい値によって２値化することによって動きパターン１３５を算出する。 In the video frame 131, the video object separation unit 14 has detected an object L132 and an object M133. The video motion detection unit 341 calculates the inter-frame difference of the partial areas where these objects exist (the object area inter-frame difference 134 in FIG. 17), and binarizes the object area inter-frame difference 134 with an appropriate threshold to obtain the motion pattern 135.
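The binarization step can be sketched as follows. The difference measure (sum of absolute pixel differences inside the object's rectangle), the threshold value, and the names `region_difference` and `motion_pattern` are assumptions for illustration; the source only says that an appropriate threshold is used.

```python
def region_difference(prev_frame, next_frame, top, left, height, width):
    """Sum of absolute pixel differences inside one object's rectangle."""
    return sum(
        abs(next_frame[r][c] - prev_frame[r][c])
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def motion_pattern(frames, region, threshold):
    """Binarize the per-frame region difference: 1 = moving, 0 = still."""
    top, left, height, width = region
    return [
        1 if region_difference(a, b, top, left, height, width) > threshold else 0
        for a, b in zip(frames, frames[1:])
    ]


# Hypothetical 2x2 frames: the object changes once and then returns.
still = [[5, 5], [5, 5]]
moved = [[5, 9], [5, 5]]
frames = [still, still, moved, still, still]
pattern = motion_pattern(frames, (0, 0, 2, 2), threshold=2)
```

For these five frames the pattern has four entries, one per consecutive frame pair, and is 1 only for the two pairs that involve the changed frame.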

また、動作音区間検出部３４２は、音響オブジェクトに対して動作音区間検出を行った結果を動作音区間パターン１３６とする。 The motion sound section detection unit 342 outputs the result of motion sound section detection on the acoustic objects as the motion sound section pattern 136.

ＡＶ信号相関対応付け部３４３は、動きパターン１３５と、動作音区間パターン１３６とを比較し、オブジェクトＬの動きパターンと、第一の音声区間パターン１３７との間に高い相関があることが分かるので、これらのオブジェクトが同一であると判断する。また、オブジェクトＭの動きパターンと、第二の音声区間パターン１３８との間に高い相関があるので、同様にこれらのオブジェクトが同一であると判断する。このようにして、オブジェクトＬの音源信号が第一の音響オブジェクトであり、オブジェクトＭの音源信号が第二の音響オブジェクトであることが分かる。 The AV signal correlation associating unit 343 compares the motion pattern 135 with the motion sound section pattern 136. Because the motion pattern of the object L is highly correlated with the first audio section pattern 137, the unit determines that the two belong to the same object. Likewise, because the motion pattern of the object M is highly correlated with the second audio section pattern 138, it determines that these also belong to the same object. In this way, the sound source signal of the object L is identified as the first acoustic object, and the sound source signal of the object M as the second acoustic object.
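The comparison above can be sketched as building a table that maps each acoustic object to its best-matching video object. The similarity measure used here (the fraction of time steps where two binary patterns agree) is an assumption chosen for brevity; the source leaves the choice open and refers to the correlation values of the third embodiment. The names `agreement` and `correspondence_table` and the sample patterns are hypothetical.

```python
def agreement(a, b):
    """Fraction of time steps where two binary patterns take the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def correspondence_table(motion_patterns, sound_patterns):
    """Map each acoustic object to the video object whose pattern agrees best."""
    return {
        sound_id: max(
            motion_patterns,
            key=lambda vid: agreement(motion_patterns[vid], pattern),
        )
        for sound_id, pattern in sound_patterns.items()
    }


# Hypothetical patterns for objects L and M and two separated sound sources.
motion_patterns = {
    "L": [1, 1, 0, 0, 0, 1, 1, 0],
    "M": [0, 0, 1, 1, 1, 0, 0, 0],
}
sound_patterns = {
    "first": [1, 1, 0, 0, 0, 1, 0, 0],
    "second": [0, 0, 0, 1, 1, 0, 0, 0],
}
table = correspondence_table(motion_patterns, sound_patterns)
```

Each acoustic object is matched independently in this sketch, so two sounds could map to the same video object; a one-to-one assignment would need an extra constraint that the source does not describe.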

ＡＶ信号相関対応付け部３４３は、映像と音との対応付けを行った後に、映像オブジェクト信号と、音響オブジェクト信号とを出力する。 The AV signal correlation associating unit 343 outputs the video object signal and the audio object signal after associating the video with the sound.

また、相関値の算出は、上記第３の実施の形態で説明した相関値の計算方法を用いることができる。 The correlation value can be calculated using the correlation value calculation method described in the third embodiment.

尚、上述した実施の形態では各部をハードウェアで構成したが、プログラムで動作するＣＰＵ等の情報処理装置で構成しても良い。この場合、プログラムは、上述した動作をＣＰＵ等に実行させる。 In the above-described embodiment, each unit is configured by hardware, but may be configured by an information processing apparatus such as a CPU that operates by a program. In this case, the program causes the CPU or the like to execute the above-described operation.

以上好ましい実施の形態をあげて本発明を説明したが、本発明は必ずしも上記実施の形態に限定されるものではなく、その技術的思想の範囲内において様々に変形し実施することが出来る。 Although the present invention has been described with reference to the preferred embodiments, the present invention is not necessarily limited to the above-described embodiments, and various modifications can be made within the scope of the technical idea.

DESCRIPTION OF SYMBOLS
１映像オブジェクト分離部 1 Video object separation unit
２音響オブジェクト分離部 2 Acoustic object separation unit
３相関対応付け部 3 Correlation association unit
１３１人物検出部 131 Person detection unit
３２１映像カテゴリ判別部 321 Video category discrimination unit
３２３音響カテゴリ判別部 323 Acoustic category discrimination unit
３２３カテゴリ対応付け部 323 Category association unit
３３１動き検出部 331 Motion detection unit
３３２音声区間検出部 332 Audio section detection unit
３３３ＡＶ信号相関部 333 AV signal correlation unit
３４１映像動作検出部 341 Video motion detection unit
３４２動作音区間検出部 342 Motion sound section detection unit
３４３ＡＶ信号相関対応付け部 343 AV signal correlation association unit

Claims

An audiovisual processing system comprising:
a video object separation unit that separates the video of each object from an input video signal as a video object;
an acoustic object separation unit that separates each sound source signal from an input acoustic signal as an acoustic object using blind signal source separation;
a correlation association unit that obtains a correlation between each of the video objects and each of the acoustic objects, and associates at least the video objects with the acoustic objects; and
an output unit that outputs a signal of the video object and a signal of the acoustic object after the correlation association unit associates the video object with the acoustic object,
wherein the video object is a partial area of the input video, and
the output unit outputs the signal of the video object corresponding to the partial area and the signal of the acoustic object corresponding to the video object.

The audiovisual processing system according to claim 1, wherein the correlation association unit generates an object correspondence table indicating to which video object each acoustic object separated from the input acoustic signal corresponds.

An audiovisual processing method comprising:
separating the video of each object from an input video signal as a video object;
separating each sound source signal from an input acoustic signal as an acoustic object using blind signal source separation;
obtaining a correlation between each of the video objects and each of the acoustic objects, and associating at least the video objects with the acoustic objects; and
outputting a signal of the video object and a signal of the acoustic object after associating the video object with the acoustic object,
wherein the video object is a partial area of the input video, and
the outputting outputs the signal of the video object corresponding to the partial area and the signal of the acoustic object corresponding to the video object.

The audiovisual processing method according to claim 3, wherein an object correspondence table is generated indicating to which video object each acoustic object separated from the input acoustic signal corresponds.

A program that causes an information processing apparatus to execute:
a process of separating the video of each object from an input video signal as a video object;
a process of separating each sound source signal from an input acoustic signal as an acoustic object using blind signal source separation;
an association process of obtaining a correlation between each of the video objects and each of the acoustic objects, and associating at least the video objects with the acoustic objects; and
an output process of outputting a signal of the video object and a signal of the acoustic object after the video object and the acoustic object are associated by the association process,
wherein the video object is a partial area of the input video, and
the output process outputs the signal of the video object corresponding to the partial area and the signal of the acoustic object corresponding to the video object.

The program according to claim 5, wherein the association process generates an object correspondence table indicating to which video object each acoustic object separated from the input acoustic signal corresponds.