JP2015177490A - Image/sound processing system, information processing apparatus, image/sound processing method, and image/sound processing program

Info

Publication number: JP2015177490A
Application number: JP2014054476A
Authority: JP
Inventors: 酉華木原; Yuka Kihara
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2014-03-18
Filing date: 2014-03-18
Publication date: 2015-10-05

Abstract

PROBLEM TO BE SOLVED: To accurately generate a profile by using images and sounds. SOLUTION: An image/sound processing apparatus includes: detection means for detecting a region of a detection target from an image signal captured at a predetermined position; sound signal separation means for separating sound signals collected at the predetermined position into a sound signal for each detection target corresponding to the detection target region detected by the detection means; sample data selection means for selecting, from the image signal or the sound signal, sample data for generating an image or sound profile, on the basis of the detection target region and the arrival direction of the sound signal for each detection target; and profile generation means for generating the image or sound profile by using the sample data selected by the sample data selection means.

Description

The present application relates to a video/audio processing system, an information processing apparatus, a video/audio processing method, and a video/audio processing program.

For example, a technique is known for generating minutes by converting audio recorded at a meeting or the like into text. Such a technique must record who said what during the meeting, identify the speaker even when several participants speak at the same time, and recognize speech accurately. To recognize speech accurately, an audio profile that stores the characteristics of each user's speech is created in advance, and the audio information is calibrated against that profile before it is converted into text.

A technique is also known in which a meeting is recorded with a camera or the like and the captured video signal is used to associate speakers with their voices. For example, a method is known that separates video objects from an input video signal captured by a camera or the like, separates acoustic objects from an input acoustic signal, obtains the correlation between the video objects and the acoustic objects, and associates them with each other (see, for example, Patent Document 1).

However, in the method of Patent Document 1, the extraction of video objects from the input video signal and the extraction of acoustic objects from the input acoustic signal are performed independently of each other. Therefore, when noise, occlusion, or the like occurs in the video signal or the audio signal, an appropriate correlation between the two cannot be obtained and the object detection accuracy decreases.

In one aspect, an object of the present invention is to generate a profile with high accuracy using video and audio.

In one aspect, the system includes: detection means for detecting a region of a detection target from a video signal captured at a predetermined position; audio signal separation means for separating audio signals collected at the predetermined position into an audio signal for each detection target corresponding to the region detected by the detection means; sample data selection means for selecting, from the video signal or the audio signal, sample data for generating a video or audio profile, based on the region of the detection target and the arrival direction of the audio signal for each detection target; and profile generation means for generating a video or audio profile using the sample data selected by the sample data selection means.

A profile can be generated with high accuracy using video and audio.

A diagram showing an example of the schematic configuration of the video/audio processing system.
A diagram showing an example of the functional configuration of the video/audio processing device.
A diagram showing an example of the hardware configuration of the video/audio processing device.
A flowchart showing the flow of the profile generation process.
A flowchart showing the flow of the process of association with a video object.
A flowchart showing the flow of the sample data selection process.
A diagram for explaining the association with a video object.
A diagram for explaining the correlation between a video object region and an estimated region.

Embodiments are described in detail below.

<Video/audio processing system: schematic configuration>
FIG. 1 is a diagram illustrating an example of the schematic configuration of a video/audio processing system. The video/audio processing system 1 illustrated in FIG. 1 includes a conference recording device 10 and a video/audio processing device 20, which is an example of an information processing apparatus. The video/audio processing system 1 acquires, for example, video and audio during a conference and uses the acquired information to generate a profile with high accuracy.

The conference recording device 10 is installed at a predetermined position, for example in a conference room, and records video signals and audio signals (sound source signals) of its surroundings. The video signal and the audio signal are assumed to be associated with each other based on time information (that is, synchronized). The conference recording device 10 includes, for example, a wide-angle camera and a microphone array. The conference recording device 10 captures the conference participants in the room with the wide-angle camera to generate a video signal, and records the participants' speech with the microphone array to generate a multi-channel audio signal.

So that the sound source position can be identified, the microphone array of the conference recording device 10 preferably has, for example, a plurality of microphones arranged radially at predetermined intervals, and the audio signal recorded by each microphone is preferably stored independently. The camera of the conference recording device 10 is not limited to a wide-angle camera, and its microphone is not limited to a microphone array.

The video/audio processing device 20 detects the region of an object serving as a detection target (for example, a person such as a conference participant) from the video signal obtained from the conference recording device 10, separates the multi-channel audio signal into an audio signal for each object (for example, each person), and collects sample data from the recorded video signal or multi-channel audio signal. The video/audio processing device 20 then generates a video or audio profile using the collected sample data of the video signal or of the audio signal.

The video/audio processing device 20 is implemented by, for example, a general-purpose PC (Personal Computer) or a server, and its characteristic functions are provided mainly by software processing. The video/audio processing device 20 includes, for example, an audio processing device 21, a video processing device 22, a CPU (Central Processing Unit) 23, a main storage device 24, and an auxiliary storage device 25.

The audio processing device 21 receives, via a signal line, the audio signal recorded by the microphone array or the like of the conference recording device 10. The audio input interface from the microphone array or the like may be, for example, a USB (Universal Serial Bus) interface, depending on the connection specification with a mixer or the like, but is not limited to this.

The input audio signal may consist of a plurality of analog signals. In that case, the audio processing device 21 converts the analog signals into digital data by A/D conversion and stores the converted data in the auxiliary storage device 25.

The video processing device 22 receives, via a signal line, the video signal captured by the wide-angle camera or the like of the conference recording device 10 and stores it in the auxiliary storage device 25. When the input video signal is an analog signal, the video processing device 22 may convert it into digital data by A/D conversion.

The CPU 23 executes the various processes of the video/audio processing device 20. The main storage device 24 is a high-speed storage device; when the video/audio processing device 20 is powered on, all or part of the software code stored in the auxiliary storage device 25 is copied into it and used for control by the CPU 23.

The auxiliary storage device 25 is a hard disk, a flash memory, an optical disk, or the like, and stores the software required for this embodiment. The main storage device 24 and the auxiliary storage device 25 may be configured as a single storage device.

The configuration of the video/audio processing system 1 described above is not limited to the example of FIG. 1. For example, the video/audio processing device 20 may be incorporated into the conference recording device 10. The conference recording device 10 and the video/audio processing device 20 may also be connected by a communication network such as a LAN (Local Area Network) or the Internet. The communication network may be wired, wireless, or a combination of the two.

In this embodiment, audio profiles and video profiles are generated automatically and accurately so that, for example, the voices and images of conference participants can be acquired more precisely. Using a microphone array makes it possible to extract a specific person's voice, but merely separating the audio signals does not reveal whose voice each separated signal is.

This is not a problem if each speaker stays in a fixed position throughout the conference. In practice, however, a person may stand up and walk to the whiteboard, or someone may enter partway through and people may change seats. In such cases, conventional audio separation can separate the speakers' audio signals but cannot determine which signal belongs to whom.

In this embodiment, an audio profile is generated by, for example, complementing the separated audio signals with information from the video signal. It is also possible to generate a video profile by complementing the video signal with information from the audio signal.

<Video/audio processing device>
FIG. 2 is a diagram illustrating an example of the functional configuration of the video/audio processing device. As shown in FIG. 2, the video/audio processing device 20 includes audio signal separation means 31, object detection means 32 as an example of detection means, and profile acquisition means 33.

The profile acquisition means 33 includes initial image profile generation means 40, arrival direction detection means 41, video object association means 42 as an example of association means, sample data selection means 43, and profile generation means 44.

The audio signal separation means 31 separates the multi-channel audio signal collected by the microphone array of the conference recording device 10 into an audio signal for each object (for example, each person) serving as a detection target. Methods such as blind source separation (Blind Source Separation) or independent component analysis (Independent Component Analysis) can be used for this separation.

Blind source separation recovers the individual audio signals from, for example, a plurality of measured sequences obtained by mixing a plurality of unknown signal sequences through an unknown linear mixing system. Independent component analysis is a multivariate analysis method that analyzes the input signals under the criterion of statistical independence and separates them into the individual audio signals. The separation method is not limited to these.
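
As a compact illustration of the linear mixing model described above, the relation can be written as follows. The symbols are assumptions introduced here for illustration and do not appear in the original: s(t) is the vector of unknown source signals, x(t) the vector of microphone observations, A the unknown mixing matrix, and W the unmixing matrix that the separation method estimates. This is the instantaneous (non-convolutive) form; recordings in a real room generally also involve delays and reverberation.

```latex
\mathbf{x}(t) = A\,\mathbf{s}(t), \qquad
\mathbf{y}(t) = W\,\mathbf{x}(t), \qquad
W A \approx P D
```

Here P is a permutation matrix and D a diagonal scaling matrix: the separated outputs y(t) match the sources only up to ordering and scale, which is one reason separation alone cannot say which output belongs to which person.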

The object detection means 32 extracts, for example, the region of an object serving as a detection target from the video signal. The object region is, for example, the face region of a conference participant, but is not limited to this and may be a preset physical object or the like.

The object detection means 32 can extract the object region by, for example, image matching against an initial image profile built from one or more image samples. The object region can also be extracted by, for example, extracting features of a face (or physical object) through image analysis.

The profile acquisition means 33 collects sample data from the recorded video or audio based on the object region and the arrival direction of the audio signal for each object, and generates a video profile or an audio profile from the collected samples. An audio profile is, for example, an acoustic model (such as a collection of sound waveforms or sound feature values), and a video profile is, for example, identification information for detecting each individual (such as a conference participant), but neither is limited to these.

The initial image profile generation means 40 acquires an image feature value for each individual (for example, a feature value of the face image) from the video signal obtained from the conference recording device 10 and generates an initial image profile from the acquired information.

The initial image profile generation means 40 can use, for example, Haar-like features to characterize face images, but is not limited to this. A Haar-like feature is a scalar obtained, for example, as the difference between the average brightness values of rectangular regions, and its value represents the strength of the brightness gradient.
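
The Haar-like feature described above can be sketched in a few lines. This is a minimal illustration, not the implementation in the patent: it assumes NumPy, a grayscale image given as a 2-D array, and a single two-rectangle (left minus right) feature. The function names are hypothetical.

```python
import numpy as np

def integral_image(gray):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_mean(ii, top, left, height, width):
    """Mean brightness of a rectangle, read in O(1) from the integral image."""
    total = (ii[top + height, left + width] - ii[top, left + width]
             - ii[top + height, left] + ii[top, left])
    return total / (height * width)

def haar_two_rect_horizontal(gray, top, left, height, width):
    """Two-rectangle feature: mean of the left half minus mean of the right half."""
    ii = integral_image(np.asarray(gray, dtype=np.float64))
    half = width // 2
    return (rect_mean(ii, top, left, height, half)
            - rect_mean(ii, top, left + half, height, half))
```

A large absolute value indicates a strong horizontal brightness gradient inside the window, which matches the description of the feature as a measure of gradient strength.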

For the audio signal of each object (for example, each person) separated by the audio signal separation means 31, the arrival direction detection means 41 detects, for example, the arrival direction of the sound (sound source direction) at the predetermined position at predetermined time intervals.

The arrival direction detection means 41 detects the arrival direction of sound by, for example, using the differences in the time the sound takes to reach each microphone of the microphone array. The arrival direction can be detected by, for example, calculating the phase difference between audio signals, but the method is not limited to this. As one example of the calculation, the method described in JP 2011-71683 A can be used.
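
The relation between an inter-microphone time difference and an arrival angle can be sketched as follows. This is a textbook far-field, two-microphone approximation and is not the method of JP 2011-71683 A; the speed of sound, the function names, and the parameters are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at roughly 20 degrees Celsius

def arrival_angle_deg(delay_s, mic_spacing_m, speed=SPEED_OF_SOUND_M_S):
    """Angle from broadside for a far-field source seen by two microphones.

    delay_s is the arrival time at the second microphone minus the first.
    """
    ratio = speed * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))

def delay_from_phase(phase_diff_rad, freq_hz):
    """Convert a phase difference at a single frequency into a time delay."""
    return phase_diff_rad / (2.0 * math.pi * freq_hz)
```

The phase-to-delay conversion is unambiguous only while the microphone spacing is less than half the wavelength; beyond that, the phase wraps and several delays become consistent with the same measurement.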

The video object association means 42 estimates which object (detection target) in the video corresponds to each audio signal whose arrival direction (sound source direction) was detected by the arrival direction detection means 41 and, based on the estimate, associates a video object (for example, an object in the video) with an audio object (for example, an audio signal or a sound source direction).

The video object association means 42 treats the video captured by the wide-angle camera of the conference recording device 10 as, for example, an image of three-dimensional objects projected onto a projection plane, and narrows down the estimated position of the object by projecting the arrival direction of the audio signal onto that plane.
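
One way to picture the projection step above is to map an arrival azimuth, plus or minus a tolerance, onto a range of image columns. The sketch below assumes an ideal pinhole camera with a horizontal field of view under 180 degrees whose centre coincides with the microphone array and whose optical axis is azimuth 0. A real wide-angle lens would need its own distortion model, and the source does not specify one; the function name and parameters are hypothetical.

```python
import math

def estimated_column_range(azimuth_deg, image_width, hfov_deg, tolerance_deg=5.0):
    """Project an arrival direction (with a tolerance) onto image columns.

    Returns (left, right) in pixels, or None if the direction lies
    entirely outside the camera's horizontal field of view.
    """
    half_fov = hfov_deg / 2.0
    lo = max(azimuth_deg - tolerance_deg, -half_fov)
    hi = min(azimuth_deg + tolerance_deg, half_fov)
    if lo >= hi:
        return None
    # Focal length in pixels follows from the image width and field of view.
    focal_px = (image_width / 2.0) / math.tan(math.radians(half_fov))
    centre = image_width / 2.0

    def to_column(deg):
        return centre + focal_px * math.tan(math.radians(deg))

    return to_column(lo), to_column(hi)
```

Only object regions that overlap the returned column range would then need to be matched against the initial image profile.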

The video object association means 42 also performs pattern matching on the objects in the video (video objects) located at the estimated position, using the initial image profile generated by the initial image profile generation means 40, and detects the object in that region (for example, a specific person).

The sample data selection means 43 selects, from the video and audio signals obtained from the conference recording device 10, the sample data (sample video signals, sample audio signals, and so on) used to generate a profile.

For audio sample data, the sample data selection means 43 uses, for example, at least one of the position of the audio signal, its direction, and the start or end time of the speech to collect samples that correlate strongly with the object region. For video sample data, it uses, for example, at least one of the position of the video object, its direction, and the start or end time of its motion to collect samples that correlate strongly with the audio signal.

For example, for a single speaker in a conference, the position and utterance timing of the video object should correlate with the position and utterance timing of the audio signal. By using these conditions to improve detection accuracy, the sample data selection means 43 can collect good audio and image samples at the same time.

When, for example, one or more objects are detected, the sample data selection means 43 decides whether to extract them as sample data according to the degree of matching of the face image (also called detection reliability or similarity).

Suppose, for example, that persons A and B are both located in a given sound source direction. In this case, the sample data selection means 43 matches the image feature values stored in the initial image profiles of A and B against the image feature values of A and of B extracted as video objects, and associates the one with the higher similarity as sample data.

In this embodiment, when neither A nor B has similar image features (for example, when the similarity is at or below a certain value), no sample data need be extracted.
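
The selection rule in the A/B example can be sketched as a best-match search with a rejection threshold. This is a minimal illustration under stated assumptions: feature values are plain vectors, similarity is cosine similarity, and the threshold of 0.8 is arbitrary. The source does not specify any of these, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_sample(candidate_features, initial_profiles, min_similarity=0.8):
    """Return (person, similarity) for the best match, or None.

    candidate_features: feature vectors of video objects found in the
    sound source direction. initial_profiles: dict of person -> vector.
    None is returned when the best similarity is at or below the threshold.
    """
    best = None
    for person, profile in initial_profiles.items():
        for feature in candidate_features:
            score = cosine_similarity(feature, profile)
            if best is None or score > best[1]:
                best = (person, score)
    if best is None or best[1] <= min_similarity:
        return None
    return best
```

Rejecting weak matches trades sample quantity for sample quality: a profile built from fewer but reliably labelled samples avoids absorbing another speaker's data.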

The profile generation means 44 uses the sample data obtained by the sample data selection means 43 to generate a video or audio profile for each individual (for example, each conference participant) or for a group of several people.

When an audio profile is generated, the utterances of the target individual can be accumulated, and both an acoustic model and a language model can be generated from the accumulated speech. A language model resolves ambiguities that arise in acoustic processing and thereby enables high recognition accuracy. Training a language model normally requires large-scale digitized text that matches the target domain and speaking style.

However, language model training can also be performed by adopting a method that learns a language model and a word dictionary from continuous speech data and an acoustic model, as described in, for example, "Language Model Learning from Continuous Speech Using Bayesian Inference", IPSJ SIG Technical Report, SLP (Spoken Language Processing), 2010-SLP-82(16), 1-6, 2010-07-15, http://www.phontron.com/paper/neubig10slp82.pdf.

<Hardware configuration>
Next, an example of the hardware configuration of the video/audio processing device 20 and related devices according to this embodiment is described with reference to the drawings. FIG. 3 is a diagram illustrating an example of the hardware configuration.

As shown in FIG. 3, the video/audio processing device 20, which is an example of a computer, includes a CPU 50, a RAM (Random Access Memory) 51, a ROM (Read Only Memory) 52, an HDD 53, an I/F unit 54, an LCD (Liquid Crystal Display) 55, and an operation unit 56.

The CPU 50, RAM 51, ROM 52, HDD 53, and I/F unit 54 are connected via a bus 57. The LCD 55 and the operation unit 56 are connected to the I/F unit 54.

The CPU 50 is a computing unit and controls the overall operation of the video/audio processing device 20. The RAM 51 is a volatile storage medium that supports high-speed reading and writing of information and is used as a work area when the CPU 50 processes information.

The ROM 52 is a read-only nonvolatile storage medium that stores programs such as firmware. The HDD 53 is a readable and writable nonvolatile storage medium that stores an OS (Operating System), various control programs, application programs, and the like.

The I/F unit 54 connects to various hardware, networks, and the like. The LCD 55 is, for example, a display unit that shows the screens used to execute each process of this embodiment. The operation unit 56 is, for example, a keyboard, a mouse, hardware buttons, or a touch panel, and serves as a user interface for receiving information entered by the user.

In the computer hardware configuration described above, a program stored in a storage medium such as the ROM 52, the HDD 53, or an optical disk is read into the RAM 51, and the CPU 50 performs operations according to that program, thereby forming a software control unit. The combination of this software control unit and the hardware realizes the functions of each device in the system according to this embodiment.

<Profile generation process>
FIG. 4 is a flowchart showing the flow of the profile generation process. As shown in FIG. 4, the video/audio processing device 20 uses the initial image profile generation means 40 to acquire at least one face image of each user for whom a profile is to be created, and generates an initial image profile by calculating the image feature value of each individual (S10). The Haar-like features described above, for example, can be used as the image feature value, but the feature is not limited to this.

In the process of S10, the face image may be captured with, for example, the wide-angle camera of the conference recording device 10. A common way to detect a person's face region in an image is to scan the image in order from one edge (window scanning) and match each window against a template prepared in advance; here, features extracted from the first acquired image can be used as the initial image profile.

When a face region is detected in an input image using the initial image profile, simple matching lacks robustness, so the decision at matching time is normally made with a classifier. For example, sample images may be collected automatically, and a classifier built from the finally obtained images may serve as the initial image profile.

Next, the video/audio processing device 20 uses the audio signal separation means 31 to separate the multi-channel audio signal input from the microphone array of the conference recording device 10 into a plurality of audio signals (S11).

Next, the video/audio processing device 20 uses the arrival direction detection means 41 to detect, for each of the audio signals separated in S11, the arrival direction of the sound (sound source direction) at the predetermined position (S12). The video/audio processing device 20 then uses the video object association means 42 to estimate which object in the video each audio signal separated in S11 corresponds to, and associates the signal with that object (video object) (S13).

次に、映像音声処理装置２０は、Ｓ１３の処理で、対象オブジェクトが１以上検出された場合には、サンプルデータ選定手段４３により、顔画像のマッチング度（検出信頼度）に応じてサンプルデータとして抽出するか判定する（サンプルデータを選定する）（Ｓ１４）。 Next, when one or more target objects are detected in S13, the video/audio processing apparatus 20 uses the sample data selection means 43 to determine, according to the matching degree (detection reliability) of the face image, whether to extract the data as sample data (that is, it selects sample data) (S14).

次に、映像音声処理装置２０は、例えば広角カメラ等で撮影された映像の最後のフレームか判断し（Ｓ１５）、最後のフレームではないと判断した場合（Ｓ１５において、ＮＯ）、Ｓ１１の処理に戻り、処理を続ける。また、映像音声処理装置２０は、最後のフレームであると判断すると（Ｓ１５において、ＹＥＳ）、プロファイル生成手段４４により、例えば音声プロファイルを生成し（Ｓ１６）、処理を終了する。 Next, the video/audio processing apparatus 20 determines whether the current frame is the last frame of the video captured by, for example, the wide-angle camera (S15). If it is not the last frame (NO in S15), the process returns to S11 and continues. If it is the last frame (YES in S15), the profile generation means 44 generates, for example, an audio profile (S16), and the process ends.

上述したプロファイル生成処理は、例えば撮影された映像信号が終了するまで、サンプルデータの選定を行い、映像信号が終了した時点で、それまでに選定されたサンプルデータに基づいて音声プロファイルが生成される。なお、本実施形態におけるプロファイル生成処理は、これに限定されるものではなく、例えば映像プロファイルを生成しても良い。 In the profile generation process described above, sample data is selected until, for example, the captured video signal ends, and when the video signal ends, an audio profile is generated from the sample data selected up to that point. Note that the profile generation process of the present embodiment is not limited to this; for example, a video profile may be generated instead.

＜映像オブジェクトとの対応付け＞ <Association with video objects>
図５は、映像オブジェクトとの対応付け処理の流れを示すフローチャートである。なお、図５の処理では、上述した図４に示すＳ１３の処理を、映像と音声の解析結果を相補的に使って互いの検出精度を向上させる。 FIG. 5 is a flowchart showing the flow of the process of association with video objects. In the process of FIG. 5, the process of S13 shown in FIG. 4 is performed using the video and audio analysis results in a complementary manner, so that each improves the detection accuracy of the other.

図５に示すように、映像オブジェクト対応付け手段４２は、到来方向検出手段４１から時刻ごとの音源方向（音の到来方向）を取得する（Ｓ２０）。到来方向検出手段４１は、上述したように会議収録装置１０から得られるマルチチャンネルの音声入力を解析し、時刻ごとに音の到来方向を推定する。音の到来方向を推定する上で必要となる音源定位（左右の耳に達する音の音圧や位相、時間等の差によって音源の位置が定位されること）の取得は、周知の一般的な手法を用いることが可能である。 As shown in FIG. 5, the video object association means 42 acquires the sound source direction (arrival direction of the sound) at each time from the arrival direction detection means 41 (S20). As described above, the arrival direction detection means 41 analyzes the multi-channel audio input obtained from the conference recording apparatus 10 and estimates the arrival direction of the sound at each time. The sound source localization needed for this estimate (locating the sound source from differences in sound pressure, phase, arrival time, and so on of the sound reaching the left and right ears) can be obtained with well-known general methods.
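The source leaves the localization method open. As one illustration of a well-known approach, the following minimal sketch converts the arrival-time difference at a single microphone pair into an arrival angle under a far-field assumption. The function name, the speed-of-sound constant, and the microphone spacing are assumptions introduced here and do not appear in the source.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def doa_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Arrival angle in degrees, measured from the broadside of a two-microphone pair.

    Far-field model (an assumption): sin(theta) = c * delay / spacing.
    """
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    # Clamp so that measurement noise cannot push the ratio outside asin's domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A delay of zero corresponds to a source directly in front of the pair; the sign of the delay indicates on which side of the pair the source lies.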

次に、映像オブジェクト対応付け手段４２は、音源となる話者の存在範囲を推定する（Ｓ２１）。Ｓ２１の処理では、例えば検出された時刻ごとの音のピーク値のデータに基づいて、音の空間スペクトルのピークを会議全体、又は会議中の必要な区間に対して集積してヒストグラムデータを生成する。 Next, the video object association means 42 estimates the range in which each speaker acting as a sound source is present (S21). In S21, for example, based on the sound peak data detected at each time, the peaks of the spatial spectrum of the sound are accumulated over the entire conference, or over the required section of the conference, to generate histogram data.

また、Ｓ２１の処理では、生成したヒストグラムデータに対し、ｋ−ｍｅａｎｓ法等のクラスタリングを行う。クラスタ数は、例えば会議に参加している人数等とし、得られたクラスタ中心から適当な角度のマージンを持たせた空間を話者の存在範囲として推定することが可能となる。なお、Ｓ２１の処理については、これに限定されるものではない。 In the process of S21, clustering such as a k-means method is performed on the generated histogram data. The number of clusters is, for example, the number of participants in the conference, and a space having a margin with an appropriate angle from the obtained cluster center can be estimated as the speaker existence range. In addition, about the process of S21, it is not limited to this.
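The processing of S21 can be sketched as follows, assuming that each spatial-spectrum peak is represented as a single azimuth angle in degrees. The bin width, the margin, the deterministic initialization, and all function names are illustrative assumptions; wrap-around at 360 degrees is ignored for brevity.

```python
from collections import Counter


def build_histogram(peak_angles_deg, bin_deg=5.0):
    """Accumulate peak angles into bins; keys are bin centers, values are counts."""
    counts = Counter(int(a // bin_deg) for a in peak_angles_deg)
    return {(b + 0.5) * bin_deg: n for b, n in counts.items()}


def weighted_kmeans_1d(hist, k, iterations=50):
    """Plain k-means over bin centers, weighted by bin counts."""
    lo, hi = min(hist), max(hist)
    # Deterministic initialization: centers spread evenly over the observed span.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        sums = [0.0] * k
        weights = [0] * k
        for angle, count in hist.items():
            nearest = min(range(k), key=lambda i: abs(angle - centers[i]))
            sums[nearest] += angle * count
            weights[nearest] += count
        updated = [sums[i] / weights[i] if weights[i] else centers[i] for i in range(k)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def speaker_ranges(peak_angles_deg, num_speakers, margin_deg=15.0):
    """Speaker presence ranges: each cluster center plus or minus an angular margin."""
    hist = build_histogram(peak_angles_deg)
    return [(c - margin_deg, c + margin_deg) for c in weighted_kmeans_1d(hist, num_speakers)]
```

Here the number of clusters is passed in as the number of conference participants, as the text suggests.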

次に、映像オブジェクト対応付け手段４２は、Ｓ２０の処理で得られた音源方向と、Ｓ２１の処理で得られた話者の存在範囲の情報とに基づき、各時刻の話者の発話状態（どの話者が発話しているか）を推定する（Ｓ２２）。Ｓ２２の処理では、ある時刻において、空間スペクトルのピークがある話者の範囲に入っていれば、当該話者が発話していたものとして推定することができるが、これに限定されるものではない。 Next, based on the sound source direction obtained in S20 and the speaker presence ranges obtained in S21, the video object association means 42 estimates the utterance state at each time (which speaker is speaking) (S22). In S22, if a peak of the spatial spectrum at a given time falls within the range of a certain speaker, that speaker can be estimated to have been speaking at that time, although the process is not limited to this.
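The range test of S22 can be illustrated with a short sketch. It assumes the peaks at one time instant are given as angles in degrees and the speaker ranges as (low, high) pairs; both representations and the function name are assumptions made for this example.

```python
def active_speakers(peak_angles_deg, ranges):
    """Indices of speakers whose presence range contains at least one peak."""
    active = set()
    for angle in peak_angles_deg:
        for index, (low, high) in enumerate(ranges):
            if low <= angle <= high:
                active.add(index)
    return sorted(active)
```

When two speakers talk at once, two peaks fall into two different ranges and both indices are returned, which marks the overlapping portion handled in S23.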

次に、映像オブジェクト対応付け手段４２は、発話が重畳している部分について発話分離を行い、目的話者に対する位置ベクトルを推定する（Ｓ２３）。次に、映像オブジェクト対応付け手段４２は、目的話者（音声信号）の位置ベクトルと映像オブジェクトの位置情報との相関度を計算する（Ｓ２４）。 Next, the video object association unit 42 performs utterance separation on the portion where the utterance is superimposed, and estimates a position vector for the target speaker (S23). Next, the video object association unit 42 calculates the degree of correlation between the position vector of the target speaker (audio signal) and the position information of the video object (S24).

Ｓ２４の処理では、Ｓ２３の処理で得られた目的話者の位置ベクトルを画像上の位置に変換するため、会議収録装置１０から得られた映像を、例えば３次元物体を投影平面に投影した映像と考え、目的話者の位置ベクトルをこの投影平面に投影し、推定領域を算出する。 In S24, to convert the position vector of the target speaker obtained in S23 into a position on the image, the video obtained from the conference recording apparatus 10 is regarded as, for example, the projection of three-dimensional objects onto a projection plane; the position vector of the target speaker is projected onto this plane, and an estimated region is calculated.
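The source does not specify a camera model. As one possible reading of the projection step, the following sketch uses an ideal pinhole model with the position vector expressed in camera coordinates (z pointing forward). The focal length in pixels, the principal point, and the assumed physical half-size of the region are hypothetical parameters; a real wide-angle camera would also require lens distortion correction.

```python
def project_point(position, focal_px, center_px):
    """Project a 3D point (x, y, z) in camera coordinates onto the image plane."""
    x, y, z = position
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_px * x / z + center_px[0], focal_px * y / z + center_px[1])


def estimated_region(position, focal_px, center_px, half_size_m=0.15):
    """Axis-aligned box (left, top, right, bottom) around the projected point.

    half_size_m is an assumed physical half-width of the region, scaled by depth.
    """
    u, v = project_point(position, focal_px, center_px)
    half_px = focal_px * half_size_m / position[2]
    return (u - half_px, v - half_px, u + half_px, v + half_px)
```

The resulting box plays the role of the estimated region A that is later compared with the detected video object region B.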

また、映像オブジェクト対応付け手段４２は、パターンマッチングにより得られた映像オブジェクトの領域と、推定領域との相関を、Ａを推定領域、Ｂを事前に得られた映像オブジェクト領域として、「相関度＝（Ａ∩Ｂ）／（Ａ∪Ｂ）」と定義する。 The video object association means 42 also defines the correlation between the video object region obtained by pattern matching and the estimated region as "correlation degree = (A∩B) / (A∪B)", where A is the estimated region and B is the video object region obtained in advance.

次に、映像オブジェクト対応付け手段４２は、映像オブジェクトの信頼度を算出し（Ｓ２５）、処理を終了する。Ｓ２５の処理では、Ｓ２４の処理で得られた相関度から、既に得られているオブジェクト信頼度Ｃｏｎｆとすると、ＡとＢの相関は、例えば、「ＡとＢの相関＝Ｃｏｎｆ×（Ａ∩Ｂ）／（Ａ∪Ｂ）」となる。 Next, the video object association means 42 calculates the reliability of the video object (S25) and ends the process. In S25, using the correlation degree obtained in S24 and letting Conf be the object reliability already obtained, the correlation between A and B is, for example, "correlation between A and B = Conf × (A∩B) / (A∪B)".
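The two formulas of S24 and S25 can be written out as a minimal sketch, assuming that both regions are axis-aligned rectangles given as (left, top, right, bottom). The rectangle representation and the function names are assumptions; the source does not restrict the shape of the regions.

```python
def box_area(box):
    """Area of a (left, top, right, bottom) box; zero if the box is empty."""
    left, top, right, bottom = box
    return max(0.0, right - left) * max(0.0, bottom - top)


def correlation_degree(a, b):
    """(A intersect B) / (A union B) for two boxes, as in S24."""
    overlap = box_area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - overlap
    return overlap / union if union > 0 else 0.0


def weighted_correlation(conf, a, b):
    """Conf x (A intersect B) / (A union B), as in S25."""
    return conf * correlation_degree(a, b)
```

Identical regions give a correlation degree of 1, disjoint regions give 0, and the detector's own reliability Conf scales the result.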

これにより、映像オブジェクトを映像信号に含まれる画像単独で検出するのではなく、音声信号の位置ベクトルと映像オブジェクトの位置情報との相関をみることで、検出されたオブジェクトの信頼度を算出し、オクルージョンやノイズ等に耐性のある検出が可能となる。 Thus, the reliability of the detected object is calculated by looking at the correlation between the position vector of the audio signal and the position information of the video object, rather than detecting the video object alone by the image included in the video signal, Detection with resistance to occlusion, noise, and the like becomes possible.

なお、本実施形態では、一旦映像オブジェクトと音源信号とが対応付けされたら、それ以降のフレームでは映像上で顔画像を追跡し、映像中のオブジェクトの位置や方向から音源方向を精度良く推定することが可能である。 In this embodiment, once a video object and a sound source signal are associated with each other, a face image is tracked on the video in subsequent frames, and the sound source direction is accurately estimated from the position and direction of the object in the video. It is possible.

例えば、映像から音声信号の到来のあるべき方向を事前に推定することが可能である。例えば３個以上のマイクを使用している場合、音源の到来方向を算出する際に始めから特定のチャンネルのペアから到来方向を算出すれば、効率的かつ精度良く音源の到来方向を抽出することが可能となる。また、本実施形態では、生成された音声プロファイルを使用して発話音声の確からしさを評価することも可能である。 For example, the direction from which an audio signal should arrive can be estimated in advance from the video. When, for example, three or more microphones are used, computing the arrival direction from a specific pair of channels from the outset makes it possible to extract the arrival direction of the sound source efficiently and accurately. In the present embodiment, the generated audio profile can also be used to evaluate the likelihood of the uttered speech.

なお、逆に、音声信号を利用して映像オブジェクトの検出精度を向上することも可能である。例えば、画像から話者Ａの顔を検出する際、「急激な照明変化（カメラ内部パラメータが自動で変更される場合も急激な照明変化には対応できない）」、「話者の顔の向き」、「遮蔽物」等の要因で検出精度が劣化することが知られている。 Conversely, the audio signal can also be used to improve the detection accuracy of video objects. For example, when the face of speaker A is detected from an image, detection accuracy is known to degrade due to factors such as "abrupt illumination changes (which cannot be handled even when the camera's internal parameters are adjusted automatically)", "the orientation of the speaker's face", and "occluding objects".

従来の画像追跡技術を用いれば、前フレーム情報を用いてカバーすることができるが、例えばフレーム中のどこかのタイミングで見失ってしまうと、追跡が中断してしまう場合がある。 Conventional image tracking techniques can compensate for this using information from previous frames, but if the target is lost at some point during the frames, tracking may be interrupted.

したがって、本実施形態では、まず、音声信号から時刻毎の音源方向を推定し、音源となる話者の存在範囲を推定し、発話時間等の相関をみることで目的話者の位置ベクトルを推定する。音声信号から目的話者の発話を分離する処理自体は、例えば特開２００７−２３３２３９号公報等に記載のある従来手法を用いることが可能である。 Therefore, in the present embodiment, the sound source direction at each time is first estimated from the audio signal, the presence range of each speaker acting as a sound source is estimated, and the position vector of the target speaker is estimated by examining correlations such as utterance time. The separation of the target speaker's utterance from the audio signal can itself be performed with a conventional method such as the one described in Japanese Patent Application Laid-Open No. 2007-233239.

＜サンプルデータ選定処理＞ <Sample data selection process>
図６は、サンプルデータ選定処理の流れを示すフローチャートである。図６（Ａ）は、音声のサンプルデータ収集処理の流れを示すフローチャートである。図６（Ａ）の例では、初期データとして、初期画像プロファイル生成手段４０等から得られる顔画像プロファイルが与えられている場合、音声プロファイルの初期データを生成する。 FIG. 6 is a flowchart showing the flow of the sample data selection process. FIG. 6A is a flowchart showing the flow of the audio sample data collection process. In the example of FIG. 6A, when a face image profile obtained from the initial image profile generation means 40 or the like is given as initial data, initial data for the audio profile is generated.

図６（Ａ）に示すように、サンプルデータ選定手段４３は、顔画像プロファイルとのマッチングにより、例えば対象人物の顔画像を示す映像オブジェクトを検出し、映像オブジェクトと対応付けされた音声信号を取得する（Ｓ３０）。 As shown in FIG. 6A, the sample data selection unit 43 detects, for example, a video object indicating a face image of the target person by matching with the face image profile, and acquires an audio signal associated with the video object. (S30).

サンプルデータ選定手段４３は、映像オブジェクトと対応付けする際の相関（例えば図５のＳ２４の処理で得られた相関度）を取得し、相関が予め設定された閾値（第１の閾値）よりも高いか判断する（Ｓ３１）。サンプルデータ選定手段４３は、相関が閾値（Ｔｈ_１）よりも高いと判断した場合（Ｓ３１において、ＹＥＳ）、サンプルデータとして採用し（Ｓ３２）、相関が閾値（Ｔｈ_１）よりも高いと判断しなかった場合（Ｓ３１において、ＮＯ）、処理を終了する。 The sample data selection means 43 acquires the correlation obtained when the audio signal was associated with the video object (for example, the correlation degree obtained in S24 of FIG. 5) and determines whether it is higher than a preset threshold (first threshold) (S31). If the correlation is higher than the threshold (Th_1) (YES in S31), the data is adopted as sample data (S32); otherwise (NO in S31), the process ends.

図６（Ｂ）は、映像のサンプルデータ収集処理の流れを示すフローチャートである。図６（Ｂ）に示すように、サンプルデータ選定手段４３は、初期画像プロファイル生成手段４０等から得られた顔画像プロファイルとのマッチングにより当人の顔画像を示す映像オブジェクトを取得する（Ｓ４０）。 FIG. 6B is a flowchart showing the flow of the video sample data collection process. As shown in FIG. 6B, the sample data selection means 43 acquires a video object representing the person's face image by matching against the face image profile obtained from the initial image profile generation means 40 or the like (S40).

サンプルデータ選定手段４３は、音声信号と対応付けする際の相関を取得して、相関が予め設定された閾値（第２の閾値）よりも高いか判断する（Ｓ４１）。サンプルデータ選定手段４３は、相関が閾値（Ｔｈ_２）よりも高いと判断した場合（Ｓ４１において、ＹＥＳ）、ポジティブサンプルとして採用する（Ｓ４２）。 The sample data selection means 43 acquires the correlation obtained when the video object was associated with the audio signal and determines whether it is higher than a preset threshold (second threshold) (S41). If the correlation is higher than the threshold (Th_2) (YES in S41), the data is adopted as a positive sample (S42).

また、サンプルデータ選定手段４３は、相関が閾値（Ｔｈ_２）よりも高いと判断しなかった場合（Ｓ４１において、ＮＯ）、ネガティブサンプルとして採用し（Ｓ４３）、処理を終了する。本実施形態では、ネガティブサンプルを利用することで、例えば誤検知する確率を下げることができる。例えばＡさんと、Ｂさんとを取り違える確率を下げることが可能である。 If the correlation is not higher than the threshold (Th_2) (NO in S41), the sample data selection means 43 adopts the data as a negative sample (S43) and ends the process. In the present embodiment, using negative samples can lower the probability of false detection, for example the probability of mistaking person A for person B.
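The two selection flows of FIG. 6A and FIG. 6B reduce to threshold comparisons and can be sketched as follows. Each candidate is assumed to be a (data, correlation) pair; that representation, the function names, and the threshold values used in any call are illustrative assumptions.

```python
def select_audio_samples(candidates, th_1):
    """FIG. 6A flow: keep audio segments whose correlation is higher than Th_1."""
    return [segment for segment, correlation in candidates if correlation > th_1]


def label_video_samples(candidates, th_2):
    """FIG. 6B flow: split face images into positive and negative samples by Th_2."""
    positives, negatives = [], []
    for image, correlation in candidates:
        if correlation > th_2:
            positives.append(image)
        else:
            negatives.append(image)
    return positives, negatives
```

Unlike the audio flow, which simply discards low-correlation segments, the video flow retains them as negative samples for later classifier training.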

なお、上述した顔画像を示す映像オブジェクトを取得する際、対象となる個人の顔画像を蓄積していき、識別関数を構成し、この識別関数を用いて検出を行うことが可能である。具体的には、映像に含まれる画像を基準として、画像（画面）内をＷｉｎｄｏｗの端（角）から順番に走査し、各Ｗｉｎｄｏｗ内の特徴量を識別器に与え、識別器がＴＲＵＥかＦＡＬＳＥか判定する。ＴＲＵＥの場合、Ｗｉｎｄｏｗ内の画像が本人の顔画像である可能性が高いと判断する。 When the video object representing the face image described above is acquired, face images of the target individual can be accumulated to construct a discriminant function, which is then used for detection. Specifically, the image (frame) contained in the video is scanned window by window, starting from one edge (corner); the feature values within each window are given to a classifier, which returns TRUE or FALSE. When the result is TRUE, the image in the window is judged likely to be the face image of the person in question.

上述した識別器には、ＳＶＭ（サポート・ベクタ・マシン）等があり、これは「教師あり学習」と呼ばれ、予め画像サンプルをネガティブ（誤例）とポジティブ（正例）に分けて記録し、このサンプルを使用してネガティブとポジティブとを判別することが可能である。 Classifiers of this kind include the SVM (support vector machine). This approach is called "supervised learning": image samples are recorded in advance, divided into negatives (incorrect examples) and positives (correct examples), and these samples make it possible to discriminate between negative and positive.
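The window scan described above can be sketched as follows. The image is assumed to be a list of rows of grayscale values, and the classifier is any callable returning True or False for a window; the brightness-based stand-in shown here is purely hypothetical and takes the place of a trained SVM, which the source mentions but does not detail.

```python
def window_positions(width, height, win, step):
    """Yield the top-left corner of each window, scanning row by row from the corner."""
    for top in range(0, height - win + 1, step):
        for left in range(0, width - win + 1, step):
            yield left, top


def detect(image, win, step, classifier):
    """Return (left, top, right, bottom) for every window the classifier accepts."""
    height, width = len(image), len(image[0])
    hits = []
    for left, top in window_positions(width, height, win, step):
        patch = [row[left:left + win] for row in image[top:top + win]]
        if classifier(patch):  # TRUE: window likely contains the target face
            hits.append((left, top, left + win, top + win))
    return hits


def mean_brightness_classifier(patch):
    """Hypothetical stand-in for a trained classifier: TRUE if the mean exceeds 0.5."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values) > 0.5
```

In practice the positive and negative samples gathered in FIG. 6B would be used to train the classifier that replaces the stand-in.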

＜映像オブジェクトとの対応付けについて＞ <Association with video objects>
図７は、映像オブジェクトとの対応付けを説明するための図である。なお、図７は、図４のＳ１３の処理で、音声信号と映像オブジェクトとの対応付けを行うときに用いられる方法を説明するものである。 FIG. 7 is a diagram for explaining the association with video objects. FIG. 7 illustrates the method used when audio signals and video objects are associated in S13 of FIG. 4.

図７（Ａ）の例では、会議収録装置１０が有するカメラ（例えば、広角カメラ等）によって撮影される映像（カメラ撮影映像）に対し、検出されたオブジェクト（３次元物体）を投影平面に投影した例を示している。映像オブジェクト対応付け手段４２は、この投影平面に、図４のＳ１２の処理で得られた音源方向（音声信号の到来方向）を投影することで、撮影された映像の全領域から映像オブジェクトの推定位置（推定領域）を絞り込むことが可能となる。 In the example of FIG. 7A, a detected object (three-dimensional object) is projected onto a projection plane with respect to a video (camera shot video) shot by a camera (for example, a wide-angle camera) included in the conference recording apparatus 10. An example is shown. The video object associating means 42 estimates the video object from the entire area of the captured video by projecting the sound source direction (the direction of arrival of the audio signal) obtained by the processing of S12 in FIG. 4 onto this projection plane. It is possible to narrow down the position (estimated region).

また、映像オブジェクト対応付け手段４２は、推定した音源方向に一番近い映像オブジェクトを基準として、マイクアレイで取得した各音声とを音源を対応付ける。 The video object association means 42 also associates each sound acquired by the microphone array with a sound source, taking the video object closest to the estimated sound source direction as the reference.
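Choosing the video object closest to an estimated sound source direction can be sketched as a nearest-angle lookup. The sketch assumes each video object's position has already been converted to an azimuth in degrees in the same coordinate system as the sound source direction; that conversion, the dictionary layout, and the function names are assumptions.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def nearest_object(source_deg, object_directions):
    """Name of the video object whose direction is closest to the sound source."""
    return min(
        object_directions,
        key=lambda name: angular_distance(source_deg, object_directions[name]),
    )
```

For example, with speaker A at 30 degrees and speaker B at 120 degrees, a sound arriving from 35 degrees is associated with speaker A.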

図７（Ｂ）では、図７（Ａ）に示すようにカメラ撮影映像で話者Ａと話者Ｂとが正対したアングルの映像でなかった場合、話者Ａと話者Ｂの映像信号を変換して画面に正対したアングルの映像とした例を示している。 FIG. 7B shows an example in which, when speaker A and speaker B are not captured facing the camera in the camera video as in FIG. 7A, the video signals of speaker A and speaker B are converted into video at an angle directly facing the screen.

図７（Ｂ）に示すように人物（話者）の顔が画面に正対したアングルの映像に変換した映像を用いることで、顔の向きに対する制御（補正）を行ってから対応付けを行うことができるため、音源方向と映像オブジェクトとの対応付けを適切に行うことが可能となる。図７に示すように、映像信号を実世界の位置に変換し、音声信号の絶対方向の位置と比較することで音声信号と映像オブジェクトの対応付けを適切に行うことが可能となる。 As shown in FIG. 7B, by using an image in which the face of a person (speaker) is converted into an image of an angle facing the screen, association is performed after controlling (correcting) the orientation of the face. Therefore, it is possible to appropriately associate the sound source direction with the video object. As shown in FIG. 7, it is possible to appropriately associate the audio signal with the video object by converting the video signal into a position in the real world and comparing it with the position in the absolute direction of the audio signal.

＜映像オブジェクト領域と推定領域との相関について＞ <Correlation between video object region and estimated region>
図８は、映像オブジェクト領域と推定領域との相関を説明するための図である。なお、図８（Ａ）、図８（Ｂ）は、図５のＳ２４の処理で目的話者（音声信号）の位置ベクトルと映像オブジェクトの位置情報との相関度を求める方法を説明するための図である。 FIG. 8 is a diagram for explaining the correlation between the video object region and the estimated region. FIGS. 8A and 8B illustrate the method of obtaining, in S24 of FIG. 5, the correlation degree between the position vector of the target speaker (audio signal) and the position information of the video object.

すなわち、図８（Ａ）、図８（Ｂ）に示すように、Ｓ２４の処理で算出された推定領域Ａと、パターンマッチングにより得られた映像オブジェクトの領域（事前に得られた映像オブジェクト領域Ｂ）とに基づき、音声信号の位置ベクトルと映像オブジェクトの位置情報との相関度を求める。 That is, as shown in FIGS. 8A and 8B, the correlation degree between the position vector of the audio signal and the position information of the video object is obtained from the estimated region A calculated in S24 and the video object region obtained by pattern matching (the video object region B obtained in advance).

なお、図８（Ａ）は、Ａ∩Ｂの領域（斜線部分）を示し、図８（Ｂ）は、Ａ∪Ｂの領域（斜線部分）を示している。このとき、相関度は、「相関度＝（Ａ∩Ｂ）／（Ａ∪Ｂ）」と定義する。また、Ｓ２５の処理で映像オブジェクトの信頼度Ｃｏｎｆとすると、ＡとＢの相関は、「Ｃｏｎｆ×（Ａ∩Ｂ）／（Ａ∪Ｂ）」となる。 FIG. 8A shows the region A∩B (hatched portion), and FIG. 8B shows the region A∪B (hatched portion). The correlation degree is defined as "correlation degree = (A∩B) / (A∪B)". Letting Conf be the reliability of the video object from S25, the correlation between A and B is "Conf × (A∩B) / (A∪B)".

つまり、図８（Ａ）、図８（Ｂ）に示すように、推定領域と、事前に得られた映像オブジェクト領域の重なり領域が大きいほど、相関度が高くなる。したがって、相関度が予め設定された閾値（例えば、上述したＴｈ_１、Ｔｈ_２）より大きい場合に、サンプルデータとして選定し、選定したサンプルデータを用いて映像プロファイルや音声プロファイルを生成することが可能となる。 In other words, as shown in FIGS. 8A and 8B, the larger the overlap between the estimated region and the video object region obtained in advance, the higher the correlation degree. Accordingly, when the correlation degree exceeds a preset threshold (for example, Th_1 or Th_2 described above), the data is selected as sample data, and a video profile or audio profile can be generated from the selected sample data.

上述した実施形態によれば、映像と音声とを用いて精度良くプロファイルを生成することが可能となる。例えば、会議等の収録映像や音声から映像プロファイルや音声プロファイルを生成する際、映像内のオブジェクト（人物の顔）と音声内のオブジェクト（人物の話し声の音源）を相補的に検出する。これにより、映像、音声信号の一方でノイズやオクルージョンが発生していても、他方の情報に基づき検出を補正する。そのため、精度良くオブジェクトの検出することが可能となる。 According to the above-described embodiment, it is possible to generate a profile with high accuracy using video and audio. For example, when a video profile or audio profile is generated from recorded video or audio such as a meeting, an object in the video (a person's face) and an object in the audio (a person's speaking voice source) are detected in a complementary manner. Thereby, even if noise or occlusion occurs in one of the video and audio signals, the detection is corrected based on the other information. Therefore, it is possible to detect an object with high accuracy.

また、本実施形態では、検出した映像内や音声内の各オブジェクト情報等から相関度に応じて最小限のサンプルデータ（サンプル画像信号やサンプル音声信号等）を選定し、そのサンプルデータを用いて効率良くプロファイルを生成する。これにより、例えば会議等の収録画像から画像及び音声プロファイルを生成する際、個人プロファイル作成に必要な画像サンプル及び音声サンプルのみを相補的に収集することで、自動で精度良くプロファイルを生成することが可能となる。 In addition, in the present embodiment, the minimum necessary sample data (sample image signals, sample audio signals, and so on) is selected according to the correlation degree from the information on each object detected in the video and audio, and a profile is generated efficiently from that sample data. As a result, when image and audio profiles are generated from, for example, recordings of a conference, only the image samples and audio samples needed to create a personal profile are collected in a complementary manner, so that profiles can be generated automatically and accurately.

以上、開示の技術の好ましい実施形態について詳述したが、開示の技術に係る特定の実施形態に限定されるものではなく、特許請求の範囲に記載された開示の技術の要旨の範囲内において、種々の変形、変更が可能である。 The preferred embodiments of the disclosed technology have been described in detail above, but the invention is not limited to the specific embodiments according to the disclosed technology, and within the scope of the disclosed technology described in the claims, Various modifications and changes are possible.

１映像音声処理システム
１０会議収録装置
２０映像音声処理装置（情報処理装置の一例）
２１音声処理装置
２２映像処理装置
２３，５０ＣＰＵ
２４主記憶装置
２５補助記憶装置
３１音声信号分離手段
３２オブジェクト検出手段（検出手段の一例）
３３プロファイル取得手段
４０初期画像プロファイル生成手段
４１到来方向検出手段
４２映像オブジェクト対応付け手段（対応付け手段の一例）
４３サンプルデータ選定手段
４４プロファイル生成手段
５１ＲＡＭ
５２ＲＯＭ
５３ＨＤＤ
５４Ｉ／Ｆ部
５５ＬＣＤ
５６操作部
５７バス

1 Video/audio processing system
10 Conference recording apparatus
20 Video/audio processing apparatus (an example of an information processing apparatus)
21 Audio processing device
22 Video processing device
23, 50 CPU
24 Main storage device
25 Auxiliary storage device
31 Audio signal separation means
32 Object detection means (an example of detection means)
33 Profile acquisition means
40 Initial image profile generation means
41 Arrival direction detection means
42 Video object association means (an example of association means)
43 Sample data selection means
44 Profile generation means
51 RAM
52 ROM
53 HDD
54 I/F unit
55 LCD
56 Operation unit
57 Bus

特開２０１１−７１６８４号公報 JP 2011-71684 A

Claims

1. A video/audio processing system comprising:
detection means for detecting a detection target region from a video signal captured at a predetermined position;
audio signal separation means for separating an audio signal collected at the predetermined position into audio signals for each detection target corresponding to the detection target region detected by the detection means;
sample data selection means for selecting, from the video signal or the audio signal, sample data for generating a video or audio profile, based on the detection target region and the arrival direction of the audio signal for each detection target; and
profile generation means for generating a video or audio profile using the sample data selected by the sample data selection means.

2. The video/audio processing system according to claim 1, further comprising association means for estimating, based on the detection target region and the arrival direction of the audio signal for each detection target, which detection target included in the video signal the audio signal whose arrival direction has been detected corresponds to, and for associating the detection target included in the video signal with the audio signal based on the estimation result.

3. The video/audio processing system according to claim 1, wherein the sample data selection means
collects, as sample data of the audio signal, sample data for which the correlation with the detection target region, in at least one of the position, direction, and start or end time of the sound, is greater than a threshold.

4. The video/audio processing system according to any one of claims 1 to 3, wherein the sample data selection means
collects, as sample data of the video signal, sample data for which the correlation with the audio signal, in at least one of the position, direction, and start or end time of the detection target included in the video signal, is greater than a threshold.

5. The video/audio processing system according to claim 1, wherein the detection means
extracts the detection target region by performing image matching using an initial profile of one or more image samples.

6. An information processing apparatus comprising:
detection means for detecting a detection target region from a video signal captured at a predetermined position;
audio signal separation means for separating an audio signal collected at the predetermined position into audio signals for each detection target corresponding to the detection target region detected by the detection means;
sample data selection means for selecting, from the video signal or the audio signal, sample data for generating a video or audio profile, based on the detection target region and the arrival direction of the audio signal for each detection target; and
profile generation means for generating a video or audio profile using the sample data selected by the sample data selection means.

7. A video/audio processing method executed by a video/audio processing system, the method comprising:
a detection procedure for detecting a detection target region from a video signal captured at a predetermined position;
an audio signal separation procedure for separating an audio signal collected at the predetermined position into audio signals for each detection target corresponding to the detection target region detected by the detection procedure;
a sample data selection procedure for selecting, from the video signal or the audio signal, sample data for generating a video or audio profile, based on the detection target region and the arrival direction of the audio signal for each detection target; and
a profile generation procedure for generating a video or audio profile using the sample data selected by the sample data selection procedure.

8. A video/audio processing program for causing a computer to function as:
detection means for detecting a detection target region from a video signal captured at a predetermined position;
audio signal separation means for separating an audio signal collected at the predetermined position into audio signals for each detection target corresponding to the detection target region detected by the detection means;
sample data selection means for selecting, from the video signal or the audio signal, sample data for generating a video or audio profile, based on the detection target region and the arrival direction of the audio signal for each detection target; and
profile generation means for generating a video or audio profile using the sample data selected by the sample data selection means.