JP4015018B2

JP4015018B2 - Recording apparatus, recording method, and recording program

Info

Publication number: JP4015018B2
Application number: JP2002377255A
Authority: JP
Inventors: 青木　　伸; 憲彦村田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-12-26
Filing date: 2002-12-26
Publication date: 2007-11-28
Anticipated expiration: 2022-12-26
Also published as: JP2004208188A

Description

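[0001]
[Technical field to which the invention belongs]
The present invention relates to a recording apparatus, recording method, and recording program that record audio with the aid of image recording, in particular video capture, and more specifically to a recording apparatus, recording method, and recording program that can play back the audio while distinguishing its sound sources and speakers.
[0002]
[Prior art]
There is a growing need to record, as audio and video, what participants say and present in a conference room or similar setting, in order to support reviewing the conference and preparing minutes. In particular, there is a growing need for apparatus and methods that make it easy to find a section of interest within long recordings of audio, or of video that includes moving images. Various techniques have been proposed for this purpose.
[0003]
For example, JP 2000-125274 A discloses a technique that uses a microphone array to estimate the direction of a sound source from the difference in arrival time of the sound, and records each speaker's statements separately. A video camera is placed near the microphone array and is turned toward the sound source direction determined for each statement, and the image in that direction is displayed, which makes the audio easier to search (see Patent Document 1).
[0004]
JP 2002-247489 A discloses a method in which the sound source direction is estimated with a microphone array while an image of the whole scene is recorded, and a position corresponding to the sound source direction and a name are entered on that image so that statements are associated with names (see Patent Document 2).
[0005]
Furthermore, JP 2002-251393 A discloses a technique in which each conference participant registers himself or herself while audio and video are being recorded, and viewing of the recorded information is restricted to the registered participants. Registration is performed by each participant passing a personal magnetic card through a card reader. The publication also describes preparing one terminal device per participant, each equipped with a video camera, a microphone, and a card reader, and recording audio and video separately for each participant (see Patent Document 3).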
[0006]
[Patent Document 1]
JP 2000-125274 A (pages 4-7, FIGS. 1-4)
[Patent Document 2]
JP 2002-247489 A (pages 5-8, FIGS. 1, 9, 12, 14, 15)
[Patent Document 3]
JP 2002-251393 A (pages 4-7, FIGS. 1-4, 6-8)
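[0007]
[Problems to be solved by the invention]
The invention of JP 2000-125274 A described above is a system that indexes the conference content by having a person observe the displayed image and judge who is speaking. The apparatus itself, however, only acquires the sound source direction and the corresponding image, and does not recognize whose statement each utterance is. If the same speaker moves, for example, that speaker may be treated as a different person. Statements from a particular direction can be collected, but their speaker is not necessarily the same person, so statements may be recorded under the wrong speaker. Paragraph [0061] of that publication states that a moving person can be tracked (timelines merged) "by using video pattern recognition and/or speaker voice identification techniques", but paragraph [0003] of the same publication states that video pattern recognition and voice recognition are not sufficiently reliable. From these statements, sufficiently accurate tracking appears difficult with that invention.
[0008]
The invention of JP 2002-247489 A described above attempts to solve this by having a position corresponding to the sound source direction and a name entered on the image, so that statements can be associated with names. In that invention, however, a user must enter positions and names separately from the audio and image recording, so the input operation becomes very cumbersome, particularly when a single user enters the positions and names of all participants.
[0009]
The invention of JP 2002-251393 A requires one terminal per participant in order to register and record participation individually, which makes it costly.
[0010]
The present invention has been made in view of these problems. Its object is to provide a recording apparatus, recording method, and recording program for audio and video recording that need no dedicated terminal equipment for participants (users) to register their participation, that reliably identify each participant's statements and track the participant during recording, and that make it easy, after recording, to designate a speaker and search for that speaker's statements.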
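[0011]
[Means for solving the problems]
To achieve the above object, the invention according to claim 1 is a recording apparatus connected to a terminal device operated by a user, comprising: sound acquisition means for acquiring the sound of a sound source as audio data; sound source position acquisition means for acquiring position information of the sound source as sound source position data; image acquisition means for acquiring image data capturing an area that includes a plurality of users; output means for outputting the image data to the terminal device; input receiving means for receiving, from the terminal device, an input position designated on the output image data as the position at which a user is present within the area represented by the image data; user authentication means for, when the input position has been received, authenticating by user name and password that the user is the one present at the position in the area corresponding to the input position; association means for, when the user authentication succeeds, associating the input position in the image data with the position information of the sound source in the sound source position data that is judged to be the position of that user, based on a predetermined correspondence between the direction of the sound source and the directional position within the area displayed by the image data; identification information output means for, when the user authentication succeeds, adding identification information identifying the user at the input position on the image data and outputting the result to the terminal device; and change receiving means for, when the position of the identification information added to the image data is changed on the terminal device, receiving from the terminal device the changed position of the identification information, wherein the association means further associates the changed position on the image data with the position information of the sound source in the sound source position data that is judged to be the position to which the user has moved, and the identification information output means adds the identification information at the changed position on the image data and outputs the result to the terminal device.
[0012]
According to the invention of claim 1, the configuration recited in paragraph [0011] makes a dedicated per-user terminal solely for user authentication unnecessary, allows position data to be acquired efficiently and accurately on the basis of each user's authentication operation, and allows the speaker to be identified reliably after recording. A low-cost recording apparatus can therefore be provided that produces recordings in which statements can be searched accurately by speaker. In addition, because a user's position can be tracked and recorded even if it changes over time during recording, a recording apparatus can be provided that enables accurate speaker identification and statement search.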
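[0017]
The invention according to claim 2 is the recording apparatus according to claim 1, further comprising: selection receiving means for receiving, from the terminal device, a selection of the identification information added to the image data; and playback output means for outputting to the terminal device the audio data acquired from the position of the user indicated by the selected identification information and the position information of the sound source associated by the association means.
[0018]
According to the invention of claim 2, in addition to the operation of the invention of claim 1, the apparatus further comprises selection receiving means for receiving, from the terminal device, a selection of the identification information added to the image data, and playback output means for outputting to the terminal device the audio data acquired from the position of the user indicated by the selected identification information and the position information of the sound source associated by the association means. A recording apparatus can therefore be provided that records data in a form in which a speaker's statements can be found quickly at playback time.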
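[0019]
The invention according to claim 3 is a recording method performed by an apparatus connected to a terminal device operated by a user, comprising: a sound acquisition step of acquiring the sound of a sound source as audio data; a sound source position acquisition step of acquiring position information of the sound source as sound source position data; an image acquisition step of acquiring image data capturing an area that includes a plurality of users; an output step of outputting the image data to the terminal device; an input receiving step of receiving, from the terminal device, an input position designated on the output image data as the position at which a user is present within the area represented by the image data; a user authentication step of, when the input position has been received, authenticating by user name and password that the user is the one present at the position in the area corresponding to the input position; an association step of, when the user authentication succeeds, associating the input position in the image data with the position information of the sound source in the sound source position data that is judged to be the position of that user, based on a predetermined correspondence between the direction of the sound source and the directional position within the area displayed by the image data; an identification information output step of, when the user authentication succeeds, adding identification information identifying the user at the input position on the image data and outputting the result to the terminal device; and a change receiving step of, when the position of the identification information added to the image data is changed on the terminal device, receiving from the terminal device the changed position of the identification information, wherein the association step further associates the changed position on the image data with the position information of the sound source in the sound source position data that is judged to be the position to which the user has moved, and the identification information output step adds the identification information at the changed position on the image data and outputs the result to the terminal device.
[0020]
According to the invention of claim 3, the steps recited in paragraph [0019] make a dedicated terminal solely for authentication unnecessary, allow user position data to be acquired efficiently and accurately on the basis of each user's authentication operation, and allow a participant's statements to be found reliably after recording by designating that user. A low-cost recording method can therefore be provided that enables statements to be searched accurately by speaker. In addition, because a user's position can be tracked and recorded even if it changes over time during recording, a recording method can be provided that enables accurate speaker identification and statement search.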
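[0025]
The invention according to claim 4 is the recording method according to claim 3, further comprising: a selection receiving step of receiving, from the terminal device, a selection of the identification information added to the image data; and a playback output step of outputting to the terminal device the audio data acquired from the position of the user indicated by the selected identification information and the position information of the sound source associated by the association means.
[0026]
According to the invention of claim 4, in addition to the operation of the recording method of claim 3, the method further comprises a selection receiving step of receiving, from the terminal device, a selection of the identification information added to the image data, and a playback output step of outputting to the terminal device the audio data acquired from the position of the user indicated by the selected identification information and the position information of the sound source associated by the association means. A recording method can therefore be provided that records data in a form in which a speaker's statements can be found quickly at playback time.
[0027]
The invention according to claim 5 is a program that causes a computer to execute the method according to claim 3 or 4, so the method according to claim 3 or 4 can be executed by a computer.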
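[0028]
[Embodiments of the invention]
Preferred embodiments of the recording apparatus, recording method, and recording program according to the present invention are described in detail below with reference to the accompanying drawings.
[0029]
(1. Network configuration of the recording and retrieval apparatus)
FIG. 1 is a network configuration diagram of the recording and retrieval apparatus according to an embodiment of the present invention. The recording and retrieval apparatus of the embodiment consists of a video camera 1, a microphone array 2, a recording apparatus 3 to which the video camera 1 and the microphone array 2 are connected and which records audio and video, and user computers (hereinafter "user PCs") 4 that the users, i.e. the conference participants, each bring to the meeting, all connected via a network 5.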
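[0030]
FIG. 2 is a hardware configuration diagram of the recording apparatus 3 used by the recording and retrieval apparatus of the embodiment. The recording apparatus 3 used here is a general-purpose computer (the recording PC) in which a central processing unit (CPU) 11, random access memory (RAM) 12, a hard disk 13, a keyboard 14, and a monitor 15 are interconnected by a system bus 20. The recording apparatus of this embodiment need not, however, be a general-purpose computer.
[0031]
In the recording apparatus 3, two microphones 17 and 18 are connected to the system bus 20 via an audio interface 16. The microphones 17 and 18 make up the microphone array 2 of FIG. 1. Likewise, the video camera 1 is connected to the system bus 20 via a video interface 19. Through these connections, the audio data from the microphones 17 and 18 and the image data from the video camera 1 are captured and stored on the hard disk 13. The image data is preferably moving-image data, but this is not essential; depending on the situation it may be a series of still images. The system bus 20 is also connected to the network 5 via a network interface 21 and can communicate with external devices.
[0032]
The recording apparatus 3 runs a recording unit 31 (described later), which records audio and video data and stores it on the hard disk 13. In response to the operation of a user position input unit 33 (described later), which runs on the user PC 4 and is used to enter a user's position, the recording apparatus 3 also acquires the position data of the participating users and stores it on the hard disk 13. In addition, the recording apparatus 3 distributes the recorded data to the user PC 4 by means of a distribution unit 32, and the user PC 4, acting as a computer terminal, runs a playback unit 34 to play back the recording. The operation of each of these units is described later.
[0033]
Here, the microphones 17 and 18 and the recording unit 31 constitute the sound acquisition means of the present invention. The video camera 1 and the recording unit 31 constitute the image acquisition means of the present invention.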
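[0034]
FIG. 3 is a schematic perspective view of the video camera 1 and the microphone array 2 connected to the recording apparatus 3 of the embodiment. The direction of the sound source at each point in time can be detected from the audio input to two or more microphones, here the microphones 17 and 18. The video camera 1 has a lens 26 with an optical system. The lens 26 may be made to rotate toward the sound source.
[0035]
The sound source direction is detected using, for example, the technique disclosed in JP 2000-125274 A. When the microphones 17 and 18 and the video camera 1 are fixed in place, the sound source direction determined from the microphones 17 and 18 and the horizontal position in the image captured by the video camera 1 can be mapped to each other in advance.
[0036]
FIG. 4 is a hardware configuration diagram of the user PC 4 used in the recording and retrieval apparatus of this embodiment. The user PC 4 is a general-purpose computer in which a central processing unit (CPU) 41, random access memory (RAM) 42, a hard disk 43, a keyboard 44, a monitor 45, and a mouse 46 are interconnected by a system bus 48. The user PC 4 may be a small notebook computer. The system bus 48 is connected to the external network 5 via a network interface 47, and the user PC 4 communicates with the recording apparatus 3 over the network 5.
[0037]
The user PC 4 runs a general-purpose operating system, for example Microsoft Windows. By executing application programs on this operating system, the user PC 4 can run the user position input unit 33, which is used to enter the user's position, and the playback unit 34, both described later. The user PC 4 can also be used during the conference to run other programs, for example to consult reference material. A personal digital assistant (PDA) can be used as the user PC 4 provided the programs are compatible with it.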
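[0038]
(2. Functional configuration of the recording and retrieval apparatus)
FIG. 5 is a block diagram showing the functional configuration of the recording and retrieval apparatus according to the embodiment of the present invention. Within it, the recording apparatus 3 runs the recording unit 31 and the distribution unit 32. The distribution unit 32 does not have to run on the recording apparatus 3, however; it may run on a separate distribution server.
[0039]
The user PC 4 runs the user position input unit 33 and the playback unit 34. The playback unit 34 does not have to run on the user PC 4, however; it may run on a separate playback device or playback client.
[0040]
The recording unit 31 and the other units that perform the above operations can be implemented as computer programs, named after their functions and stored on the recording PC 3 or the user PC 4, for example as a recording program, but they are not limited to that form. The recorded data is stored on the hard disks 13 and 43, but it may be stored on other storage media instead.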
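[0041]
(2.1 Recording unit)
The recording unit 31 consists of an audio input unit 51, a video input unit 52, a user position acquisition unit 53, a sound source direction estimation unit 54, an audio/sound source direction recording unit 55, a video compression unit 56, a video transmission unit 57, a video recording unit 58, a user position recording unit 59, and a list recording unit 60. It may also include a sound source/position association unit 61 and an association recording unit 62.
[0042]
The audio input unit 51 receives two-channel audio data from the two microphones 17 and 18 via the audio interface 16 of the recording apparatus 3. The video input unit 52 receives moving-image data from the video camera 1 via the video interface 19. The audio and video data are stored on the hard disk 13.
[0043]
The user position acquisition unit 53 communicates with the user PC 4 and acquires and registers the user name, password, and position data sent from it. It reads from the hard disk 13 a password list in which user names and passwords have been recorded in advance. If the data received from the user PC 4 matches an entry in the password list, it judges the user authentication to have succeeded; otherwise it judges it to have failed. In either case it returns the result to the user PC 4.
[0044]
If viewing does not need to be strictly restricted, however, passwords need not be set. The unit can also be configured so that a user name that is not yet in the user list is added to the user list as a new user and the user list file 72 is rewritten. Here, the user position acquisition unit 53 constitutes the user authentication means of the present invention.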
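The check described in paragraphs [0043] and [0044] can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `UserRegistry`, its fields, and the in-memory dictionary standing in for the password list on hard disk 13 are all assumptions, and passwords are compared in plain text only to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class UserRegistry:
    """Hypothetical stand-in for the password list and user list file 72."""
    passwords: dict                      # user name -> password, loaded in advance
    user_list: list = field(default_factory=list)
    require_password: bool = True        # False models the "no strict restriction" case
    allow_new_users: bool = False        # True models adding unknown names as new users

    def authenticate(self, name, password=None):
        """Return True on success, False on failure, as sent back to the user PC."""
        if name not in self.passwords:
            if not self.allow_new_users:
                return False
            self.passwords[name] = password or ""   # register as a new user
        elif self.require_password and self.passwords[name] != password:
            return False
        if name not in self.user_list:              # each name is listed only once
            self.user_list.append(name)
        return True
```

With `require_password=False` the sketch accepts any known user name, which corresponds to the relaxed operation mentioned in [0044].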
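[0045]
A user taking part in a conference either uses a user PC 4 installed at the venue or connects a user PC 4 that he or she has brought to the communication line. From the user PC 4, the user enters the information for user authentication and, at the same time, sets his or her position, and both are sent to the recording apparatus 3. A "user" here is a participant who attends the conference and speaks. There are also other kinds of user: the recorder, who uses the recording and retrieval apparatus to record the conference, and the viewer, who browses the conference recording. The user position acquisition unit 53, which acquires the user's position, also constitutes the user position acquisition means of the present invention.
[0046]
The sound source direction estimation unit 54 estimates the sound source direction from the two-channel audio data, as in the prior art. For example, every 0.1 seconds it computes the cross-correlation between the channels, takes the time lag at which the correlation is largest as the arrival time difference, and estimates the sound source direction from that difference and the distance between the two microphones 17 and 18. In this way it outputs direction (angle) data every 0.1 seconds. When the audio is below a certain level, the sound source direction estimation unit 54 judges it to be silence and outputs silence data. Here, the microphones 17 and 18 and the sound source direction estimation unit 54 constitute the sound source position acquisition means of the present invention.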
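The estimation in paragraph [0046] can be illustrated with a small pure-Python sketch, under stated assumptions: a far-field source, a speed of sound of 343 m/s, and the sign convention that a positive angle points toward the right microphone. The function names and the `silence_level` threshold are hypothetical; the patent only says that the lag with the largest cross-correlation is taken as the arrival time difference. One call would handle one 0.1 second window.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for room-temperature air


def best_lag(ref, sig, max_lag):
    """Lag (in samples) of `sig` relative to `ref` that maximizes their cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            ref[i] * sig[i + lag]
            for i in range(len(ref))
            if 0 <= i + lag < len(sig)
        )
        if score > best_score:
            best, best_score = lag, score
    return best


def estimate_direction(left, right, sample_rate, mic_distance, silence_level=1e-3):
    """Return the source angle in radians from broadside, or None for silence."""
    mean_power = sum(x * x for x in left + right) / (len(left) + len(right))
    if mean_power < silence_level:
        return None                                   # treated as silence data
    # Physically possible lags are bounded by the microphone spacing.
    max_lag = math.ceil(mic_distance / SPEED_OF_SOUND * sample_rate)
    lag = best_lag(right, left, max_lag)              # > 0 when the left channel is delayed
    sin_theta = lag / sample_rate * SPEED_OF_SOUND / mic_distance
    return math.asin(max(-1.0, min(1.0, sin_theta)))  # clamp against rounding
```

A production implementation would more likely use an FFT-based correlation, but the direct sum keeps the relationship between lag, microphone distance, and angle visible.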
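[0047]
The audio/sound source direction recording unit 55 records the sound source direction data obtained at each time by the sound source direction estimation unit 54 as the position of the speaker, i.e. the participant who is speaking, and stores it on the hard disk 13. Time can be measured with the timer that computers generally provide.
[0048]
The speaker position is recorded, for example, as a relative position on the image captured by the video camera and shown on the display screen: the left edge of the image is 0 and the right edge is 1, and silence and directions outside the screen range are represented by -1. FIG. 6 is an example of the sound source direction data recorded by the recording apparatus of the embodiment. The table shows that as time advanced through times 1, 2, and 3, the speaker position moved through 0.1, 0.12, and 0.8.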
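The recording convention of paragraph [0048] can be sketched as below. The patent only states that direction and horizontal image position are mapped to each other in advance; the pinhole-camera mapping used here, the `half_fov` parameter, and the function name are assumptions chosen for illustration.

```python
import math

SILENT_OR_OFFSCREEN = -1.0  # value used for silence and out-of-frame directions


def direction_to_relative_x(angle, half_fov):
    """Map a source angle (radians, 0 = camera axis, positive = right) to 0..1.

    `half_fov` is half of the camera's horizontal field of view in radians.
    The mapping assumes an ideal pinhole camera mounted at the microphone array.
    """
    if angle is None or abs(angle) >= math.pi / 2:
        return SILENT_OR_OFFSCREEN
    x = 0.5 + 0.5 * math.tan(angle) / math.tan(half_fov)
    return x if 0.0 <= x <= 1.0 else SILENT_OR_OFFSCREEN


# The FIG. 6 example expressed as (time, relative position) pairs.
FIG6_TRACK = [(1, 0.1), (2, 0.12), (3, 0.8)]
```

Storing positions as fractions of the image width means the same track can later be compared directly with user positions entered on a display of any size.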
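[0049]
The video compression unit 56 compresses the moving-image data received by the video input unit 52 and the audio data received by the audio input unit, and generates the data for recording and for transmission. The well-known MPEG-1 compression algorithm, for example, can be used.
[0050]
On request from the user position acquisition unit 53 of the recording apparatus 3, the video transmission unit 57 sends the video currently being captured to the user PC 4, so that the users' positions are shown on the monitor 45 of the user PC 4. This is not a transmission for playback after recording; it is a transmission of the image while the conference is in progress. The well-known RTSP protocol, for example, can be used for the transmission. Because the transmitted data is used for entering participant positions, audio is not strictly needed, and only moving-image or still-image data may be sent, separately from the recorded audio.
[0051]
The video recording unit 58 stores the compressed audio or image data generated by the video compression unit 56 on the hard disk 13 of the recording apparatus 3 as an audio/image file 73. The start and end times of the audio/image recording are also recorded on the hard disk 13.
[0052]
The user position recording unit 59 records the user position data acquired by the user position acquisition unit 53, together with the time at which it was entered, on the hard disk 13 as a user position recording file 74. The user position is the position of a participant in the conference. Times are in seconds and positions are relative positions on the screen.
[0053]
FIG. 7 shows an example of the user position data recorded by the recording and retrieval apparatus of the embodiment. In FIG. 7, user "aoki" is registered at position 0.1 at 3 seconds after the start of recording, user "mura" is registered at position 0.8 at 5 seconds, and user "aoki" moves to position 0.5 at 600 seconds.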
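The user position recording of paragraphs [0052] and [0053] amounts to an append-only log of (time, user, position) entries. The sketch below is an assumption-laden illustration: the names `UserPositionRecord` and `UserPositionLog` are hypothetical, and an in-memory list stands in for the user position recording file 74.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserPositionRecord:
    time_s: int        # seconds since the start of recording
    user: str          # authenticated user name
    position: float    # relative horizontal position, 0 = left edge, 1 = right edge


class UserPositionLog:
    """Hypothetical in-memory stand-in for the user position recording file 74."""

    def __init__(self):
        self.records = []

    def record(self, time_s, user, position):
        if not 0.0 <= position <= 1.0:
            raise ValueError("position must be a relative position in [0, 1]")
        self.records.append(UserPositionRecord(time_s, user, position))


# The FIG. 7 example: two registrations followed by one move.
log = UserPositionLog()
log.record(3, "aoki", 0.1)
log.record(5, "mura", 0.8)
log.record(600, "aoki", 0.5)
```

A move is stored as a new entry rather than an overwrite, so the position that was valid at any earlier time can still be recovered.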
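[0054]
The list recording unit 60 appends the user names acquired by the user position acquisition unit 53 to the user list file 72 on the hard disk 13 of the recording apparatus 3. Before writing, it searches the list, and a user name that has already been recorded is not added again. The user list file 72 thus records the users who took part in the conference, and the distribution unit 32, described later, can use it when deciding whether to permit viewing.
[0055]
The recording apparatus 3 can also be provided with the sound source/position association unit 61 and the association recording unit 62. With these units, users and speakers are associated with each other at the time the audio and/or images are recorded, so no association is needed at playback time. As detailed below, however, it is also possible to associate users and speakers at playback time and search on that basis. The sound source/position association unit 61 associates the sound source direction data obtained by the sound source direction estimation unit 54 with the user positions obtained by the user position acquisition unit 53. Through this association, the speaker whose position was estimated is linked to a user who is a conference participant, and statements can be recorded with their speaker identified. When browsing the recording, designating a user is then enough to find the speaker, and the user's statements can be found quickly. In this case the user section selection means of the playback unit 34 (described later) also becomes unnecessary. The association recording unit 62 stores the associations between speakers and participants made by the sound source/position association unit 61 on the hard disk 13 of the recording apparatus 3. Here, the sound source/position association unit 61 constitutes the association means of the present invention.
[0056]
(2.2 Distribution unit)
The distribution unit 32 is run by the recording apparatus 3 and distributes the recorded data to the user PC 4. After distribution, the recorded data is played back and viewed on the user PC 4. The distribution unit 32 uses, for example, the technique described in paragraphs 0020 to 0023 of JP 2002-251393 A. That is, it communicates with the playback unit 34 of the user PC 4 to receive a user name and password, authenticates the user by checking them against the recorded user list and the password list, and sends the recorded data only to users recorded in the list. In this embodiment, however, the sound source position data and the user position data are sent in addition to the video data. The distribution unit 32 may also be run on a computer other than the recording apparatus 3 that can read the recorded data, for example after the recorded data has been copied to it.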
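[0057]
(2.3 User position input unit)
The user position input unit 33 runs on the user PC 4 while the recording unit 31 is recording, and sends the user authentication input and the user position input from the user PC 4 to the recording unit 31 of the recording apparatus 3. FIG. 8 is an operation flowchart of the user position input unit of the user PC 4 in the embodiment. FIG. 9 shows examples of the display screen on the user PC 4 during user position input: (a) is the user position image sent to the user PC 4 for input, (b) is the dialog for user position input, and (c) is the user position image, marked with a user name label, sent to the user PC 4 after user authentication.
[0058]
With reference to FIG. 8, the position input and authentication of a user, i.e. a conference participant, on the user PC 4 are described. When the user position input unit 33 starts on the user PC 4, it begins communicating with the recording unit 31 on the recording apparatus 3 and requests video data from the recording apparatus 3. The recording apparatus 3 then sends the video image to the user PC 4, and video display begins (step S801). The video data received by the user PC 4 is displayed, for example, as in FIG. 9(a). From this point until the operation ends, the monitor of the user PC 4 continues to receive and display video data.
[0059]
The state of the mouse button on the user PC 4 is monitored while waiting for the button to be pressed (N at step S802). When the button is pressed (Y at step S802), the mouse cursor position at that moment is acquired (step S803). If that position is inside the video display, it is treated as a position designation (Y at step S803) and processing proceeds to the user name input step. The dialog shown in FIG. 9(b) is displayed on the screen of the user PC 4, and the user PC 4 reads the user name and password typed by the user on the keyboard and stores them in memory (not shown) (step S804). If passwords are not used, no password needs to be entered.
[0060]
Next, the entered user name (and password) and the mouse click position are sent to the recording apparatus 3 (step S805). The mouse click position is expressed as a relative position within the video display, so that it still identifies the same point in the image when the size of the video display changes.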
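The relative click position of paragraph [0060] can be computed as in the sketch below. The function names and the pixel-based parameters (`video_left`, `video_width`) are hypothetical; they stand for wherever the video display happens to sit inside the application window.

```python
def to_relative_x(click_x, video_left, video_width):
    """Convert a click's pixel x coordinate to a relative position in [0, 1].

    Returns None when the click falls outside the video display.
    """
    if video_width <= 0:
        raise ValueError("video_width must be positive")
    rel = (click_x - video_left) / video_width
    return rel if 0.0 <= rel <= 1.0 else None


def to_pixel_x(rel, video_left, video_width):
    """Inverse mapping, e.g. for drawing a user name label at a stored position."""
    return video_left + round(rel * video_width)
```

For example, a click at pixel 330 in a 640-pixel-wide display whose left edge is at pixel 10 yields 0.5, and the same 0.5 maps back to the center of the display at any other size.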
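[0061]
When the recording apparatus 3 sends an authentication success message and the user PC 4 receives it (Y at step S806), a user name label is displayed at the horizontal position of the mouse click in the video display, as shown in FIG. 9(c) (step S807). The user name label identifies the user in the image and marks the user's position.
[0062]
If, in step S803, the mouse cursor position when the button was pressed is inside a user name label (N at step S803), processing proceeds to the position-move determination (step S808) and then to the mark-moving step.
[0063]
Until the mouse button is released (N at step S809), the mouse position is acquired and the user name label is moved according to its horizontal position (step S810). When the mouse button is released (Y at step S809), that position, together with the user name and password stored in memory, is sent to the recording apparatus 3 in the same way as for a position designation (step S805). If the mouse cursor position when the button was pressed is on the end button (N at step S808), it is determined whether to stop (step S811), and if so (Y at step S811) the operation ends.
[0064]
In this embodiment, user position data is thus acquired on the basis of the user authentication operation. With a simple operation, the user's position is obtained reliably and efficiently together with, and on the basis of, the authentication. This makes the per-user audio record accurate, and therefore makes the search for each user's statements accurate as well. In this embodiment authentication is performed by having the user enter a user name and password, but other methods may be used, such as using the user name or network address of the user PC 4, or using an authentication server.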
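[0065]
(2.4 Playback unit)
The playback unit 34 is run to play back the recorded data on the user PC 4 (or on another PC). The device running it thereby acts as a playback device, and as a retrieval device when a search is performed during playback. If the same device also serves as the recording apparatus, it is a recording and retrieval apparatus.
[0066]
Like an ordinary video player, the playback unit 34 reads mouse clicks on the monitor screen of the user PC 4, starts video playback when the play button is pressed, and stops playback when the stop button is pressed. The video display can change with the window size: when the user resizes the window by dragging with the mouse, the playback unit obtains the new width and height and resizes the video display to fit the area left after the space needed for buttons and other controls.
[0067]
After start-up, the playback unit 34 of the embodiment reads a user name and password from the keyboard and communicates with the distribution unit 32 for user authentication. It then reads the sound source direction data and the user position data in addition to the video data, and can play back only the sections in which a particular speaker is talking. If the users (conference participants) and the sound source directions (speakers) were associated at recording time by the sound source/position association unit 61, simply designating a user (speaker) is enough to find the statements from the corresponding sound source and play back only that user. In other words, only the speaking sections of a particular speaker can be found and played back.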
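[0068]
(2.4.1 User designation mode at playback)
In the more general case, where users and speakers' statements were not associated at recording time, the playback unit 34 associates the statements with users. Playback in the playback unit 34 can be switched between a normal mode and a user designation mode. In normal mode the recording is played back without designating a speaker (user). In user designation mode a particular user is designated, and playback proceeds while searching the conference recording for that user's statements only. The playback unit has a mode storage area in memory (not shown) that takes one of two values, "normal (0)" or "user designation (1)". It also has a designated-user-name storage area in memory, which holds the user name in user designation mode. The hard disk 43 of the user PC 4 may be used as this memory. The designation mode of the playback unit 34 constitutes the user designation means of the present invention.
[0069]
FIG. 10 shows an example of a screen played back by the playback unit 34 of the recording and retrieval apparatus of the embodiment. For example, clicking the "aoki" or "mura" label marked above the image of a user (speaker) on the screen of the user PC 4 selects and plays back only the statements of the user carrying that label.
[0070]
The two operating modes are switched by clicking a user name label with the mouse. The playback unit 34 monitors the mouse button, and when the click position is a user name label display position (described later), it stores the user name shown on that label as the designated user name and writes "user designation (1)" to the mode storage area. It also inverts the display color of the clicked label to show on screen that user designation mode is active. If the click position is the label of the user name that is already designated, it writes "normal (0)" to the mode storage area and returns to normal mode.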
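The mode switching of paragraphs [0068] and [0070] is a small state machine. The sketch below is illustrative only: the class `PlaybackMode` and its attribute names are hypothetical, while the two mode values follow the "normal (0)" and "user designation (1)" values given in the text.

```python
NORMAL = 0           # "normal (0)"
USER_DESIGNATED = 1  # "user designation (1)"


class PlaybackMode:
    """Holds the mode storage area and the designated-user-name storage area."""

    def __init__(self):
        self.mode = NORMAL
        self.designated_user = None

    def click_label(self, user_name):
        """Handle a mouse click on the user name label for `user_name`."""
        if self.mode == USER_DESIGNATED and self.designated_user == user_name:
            # Clicking the already-designated label returns to normal mode.
            self.mode, self.designated_user = NORMAL, None
        else:
            self.mode, self.designated_user = USER_DESIGNATED, user_name

    def is_highlighted(self, user_name):
        """True if this label should be drawn with inverted colors."""
        return self.mode == USER_DESIGNATED and self.designated_user == user_name
```

Clicking a different label while a user is already designated switches the designation directly to the new user, which is how this sketch reads the description in [0070].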
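[0071]
(2.4.2 Timer interrupt processing at playback)
FIG. 11 is a flowchart showing the operation of the timer interrupt processing unit 63 of the playback unit 34 in the recording and retrieval apparatus of the embodiment. In this embodiment a timer is set for a fixed interval (for example, 1 second) to generate timer interrupts, and an interrupt handling routine is executed at each interval. The search operation is driven from these interrupts.
[0072]
The operation of the timer interrupt processing unit 63 is described with reference to FIG. 11. The interrupt handler first obtains the current video display time and searches the user position data for the user names and user positions that are valid at that time. If there is valid user data, a user name label is displayed at each corresponding position along the top of the video display, as shown in FIG. 10 (step S1101). A user name and position are treated as valid from the time they are set until they are changed or the recording ends. In the case of FIG. 7, for example, user "aoki" is judged valid at position 0.1 from 3 seconds to 600 seconds and at position 0.5 from 600 seconds onward, and user "mura" is judged valid at position 0.8 from 5 seconds onward.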
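The validity rule in paragraph [0072] (an entry stays valid until the same user's next entry) can be sketched as a lookup over the user position data. The function name `valid_positions` and the tuple layout are assumptions; the sample data is the FIG. 7 example.

```python
def valid_positions(records, t):
    """Return {user: position} for the entries that are valid at time `t`.

    `records` is an iterable of (time_s, user, position) tuples. For each user,
    the latest entry whose time is not after `t` wins.
    """
    current = {}
    for time_s, user, position in sorted(records):
        if time_s <= t:
            current[user] = position
    return current


FIG7 = [(3, "aoki", 0.1), (5, "mura", 0.8), (600, "aoki", 0.5)]
```

Called once per timer interrupt with the current video display time, the result tells the player which labels to draw and where.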
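[0073]
Next, it is determined whether the current display mode is user designation mode (step S1102). If it is (Y at step S1102), processing continues. If the mode is normal (N at step S1102), the interrupt handling routine ends.
[0074]
From the designated user name, the current time, the valid user position, and the sound source position data, it is determined whether the current time falls within a speaking section of the designated user (step S1103). This determination establishes whether the speaker and the designated user are the same person. If the absolute difference between the sound source position at a given time and the position of the designated user is within a predetermined threshold, for example 0.1, the speaker at that time is judged to be the designated user (Y at step S1103). If the current time is within a speaking section of the designated user, the interrupt handling routine ends.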
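The test in step S1103 reduces to a distance check between two relative positions. A minimal sketch, assuming the -1 convention for silence from paragraph [0048] and the example threshold of 0.1; the function name is hypothetical:

```python
def is_designated_user_speaking(source_pos, user_pos, threshold=0.1):
    """True if the sound source at this time is judged to be the designated user.

    `source_pos` is the recorded sound source position (negative for silence or
    an off-screen direction); `user_pos` is the user's valid position, or None
    if the user has no valid position at this time.
    """
    if user_pos is None or source_pos < 0:
        return False
    return abs(source_pos - user_pos) <= threshold
```

With the FIG. 6 and FIG. 7 values, a source at 0.12 matches "aoki" at 0.1, while a source at 0.8 does not.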
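[0075]
Otherwise, when the current speaker is judged not to be the designated user (N at step S1103), the first time at or after the current display time at which the designated user speaks is searched for, using the designated user name, the current time, the valid user positions, and the sound source position data (step S1104).
[0076]
The video display time is then moved to the speaking time that was found (step S1105). FIG. 10 shows only the video display, but a slide bar indicating the playback time, or, as in the prior art, a time chart of the statements with time and direction on its two axes, may also be displayed. Here, the timer interrupt processing unit 63 of the playback unit 34 constitutes the user speaking section selection means of the present invention.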
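Steps S1104 and S1105 can be sketched as a forward scan over the sound source track. Everything here is illustrative: the function name, the tuple layouts, and the sample tracks are assumptions, and the scan is deliberately simple rather than efficient.

```python
def next_utterance_time(source_track, user_track, user, now, threshold=0.1):
    """Return the first time >= `now` at which `user` is judged to be speaking.

    source_track: list of (time, source_position), negative position = silence.
    user_track:   list of (time, user_name, position) entries, as in FIG. 7.
    Returns None if the user does not speak again.
    """
    for t, source_pos in sorted(source_track):
        if t < now or source_pos < 0:
            continue
        # Position of `user` that is valid at time t (latest entry not after t).
        user_pos = None
        for entry_time, name, position in sorted(user_track):
            if name == user and entry_time <= t:
                user_pos = position
        if user_pos is not None and abs(source_pos - user_pos) <= threshold:
            return t
    return None


source_track = [(1, 0.1), (2, 0.12), (3, 0.8), (4, -1.0), (5, 0.11)]
user_track = [(0, "aoki", 0.1), (0, "mura", 0.8)]
```

The player would then set its display time to the returned value, or leave it unchanged (or stop) when `None` is returned; the patent does not say what happens when no later statement exists, so that part is left open here.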
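[0077]
In this embodiment, selection of the playback sections for each speaker is implemented in the playback unit 34 on the user PC 4, but the search for playback sections may instead be performed by the distribution unit 32, which then sends only the video data for the designated sections.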
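[0078]
(3. Example of user operation)
FIG. 12 is a flowchart showing the operation of the user position input unit of the recording and retrieval apparatus of the embodiment. The operation of the apparatus is now described from the point of view of a user who is a conference participant, with reference to FIG. 12. First, at the start of the conference, the user in charge of recording (the recorder) operates the keyboard 14 of the recording apparatus 3 to start the recording unit 31. The recording unit 31 captures and records audio and video data from the video camera 1 and the microphone array 2 connected to the recording apparatus 3, and sends it to the user PC 4 (step S1201).
[0079]
Each participant runs the user position input unit 33 on his or her own user PC 4. After start-up, the user position input unit 33 displays the image currently being captured by the recording apparatus 3, as in FIG. 9(a) (step S1251). Each user clicks the position in the captured image where he or she appears, which brings up the identification information input dialog of FIG. 9(b), and types a user name and (if required) a password on the keyboard. The user position input unit 33 sends the user name, the password, and the position on the screen to the recording apparatus 3 (step S1252). The recording unit 31 receives the user name and password (step S1202), and if the user is authenticated (Y at step S1203), it sends the user name to the user PC 4 (step S1204), and the user name is displayed at the corresponding position on the screen as in FIG. 9(c) (step S1253).
[0080]
If a user actually moves during the conference (Y at step S1254), his or her position on the screen shifts as well, so the user drags the user name label of FIG. 9(c) with the mouse. The user position input unit 33 sends the new position (step S1255), the user position acquisition unit 53 of the recording unit 31 acquires it, and the user position recording unit 59 records the new position (the changed position) and the time of the move (step S1205). Here, the user position acquisition unit 53 and the user position recording unit 59 constitute the user position change acquisition means of the present invention, and the user position acquisition unit 53, the timer (not shown), and the user position recording unit 59 constitute the user position change time acquisition means of the present invention.
[0081]
In this way, even if a speaker moves, the new position is entered reliably by the speaker himself or herself, which remedies the prior-art problem of speakers being misidentified. The prior-art burden of a single user entering the positions of many users is also removed. When a user moves while speaking, however, the movement can be tracked with the microphones 17 and 18, so the apparatus may be configured to track the user automatically and update the position in that case.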
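[0082]
As described above, the only operations a user (participant) performs for position recording are entering his or her own position at the initial authentication (a mouse click) and, after moving, entering the new position (a drag). No one needs to enter the positions of all participants. The laborious task of specifying every participant's position over the whole duration of the meeting is therefore avoided.
[0083]
If the user PC 4 is an ordinary computer, the participant can also use it during the conference for other purposes, such as consulting reference material or taking notes.
[0084]
The user position sent by the user position input unit 33 is preferably expressed as a relative position with respect to the width of the video display. This has the advantage that the display can be resized freely while positions remain correct relative to the image. For example, while another application is in use, the window can be made small and placed in a corner of the screen.
[0085]
(4. Example of playback in user designation mode)
FIG. 13 is a flowchart showing playback in user designation mode in the recording and retrieval apparatus of the embodiment. Playback in user designation mode makes it possible to designate a particular user, i.e. a participant, search for that participant, and play back only that participant's statements. The playback operation in user designation mode is described with reference to FIG. 13. After the conference, a participant starts the playback unit 34 on the user PC 4 or on a playback device in order to play back the recorded data. The playback unit 34 communicates with the recording apparatus 3 over the network 5, and the user name and password are entered and sent (step S1351). The recording apparatus 3 receives the user name and password (step S1301) and authenticates the user (step S1302). If authentication fails, it sends a rejection to the user PC 4 (N at step S1302), and the user PC 4 handles the error on receiving it (step S1352). If authentication succeeds (Y at step S1302), the recording apparatus 3 reads the audio and video data, the speaking direction data, and the participant position data and sends them to the user PC 4 (step S1303). The user PC 4 receives the data and, as shown in FIG. 10, displays the video together with each user name at its corresponding position (step S1353).
[0086]
Whereas in position input each user enters data only for himself or herself, at playback time the user names of all participants recorded by the recording unit 31 are displayed. When the user selects and clicks a user name label such as those in FIG. 10, the display switches to user designation mode and information about the designated user is sent to the recording apparatus 3 (step S1354). The recording apparatus 3 receives the designation (step S1304) and sends only the recorded video for the sections in which that user spoke (step S1305), and the user PC 4 receives and plays it back.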
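Sending only the sections in which the designated user spoke (step S1305) requires turning per-step speaker decisions into contiguous sections. The sketch below is an assumption about how that grouping might be done: the function name and the per-step boolean input are hypothetical, and indices are in units of the direction-data step (0.1 seconds in paragraph [0046]).

```python
def utterance_sections(flags):
    """Group consecutive True flags into (start_index, end_index) sections.

    `flags[i]` says whether the designated user is judged to be speaking at
    step i. `end_index` is exclusive.
    """
    sections, start = [], None
    for i, speaking in enumerate(flags):
        if speaking and start is None:
            start = i                        # a new section begins
        elif not speaking and start is not None:
            sections.append((start, i))      # the current section ends
            start = None
    if start is not None:                    # section still open at the end
        sections.append((start, len(flags)))
    return sections
```

For example, the flags `[False, True, True, False, True]` give the sections `[(1, 3), (4, 5)]`. A practical system would probably also merge sections separated by very short pauses, which the patent does not discuss.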
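[0087]
The participant position information is entered by hand by each participant and does not depend on uncertain techniques such as image recognition or voice recognition, so speaking sections can be cut out accurately. In particular, the position is first entered at the moment of user authentication, on the spot, so it can be specified reliably. Moreover, because the position input is performed as part of the user authentication, the two are done in a single operation, which is both efficient and reliable.
[0088]
The recording unit and the recording/retrieval unit executed by the recording apparatus and the recording and retrieval apparatus of this embodiment are provided as files in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, floppy (R) disk (FD), or DVD.
[0089]
The recording unit and the recording/retrieval unit of this embodiment may also be stored on a computer connected to a network such as the Internet and provided and distributed by download over the network.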
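[0090]
[Effects of the invention]
The invention according to claim 1 needs no dedicated per-user terminal solely for user authentication, acquires user position data efficiently and accurately on the basis of each user's authentication operation, obtaining the position at the same time as the authentication, and allows the speaker to be identified reliably after recording. It therefore has the effect of providing a low-cost recording apparatus that produces recordings in which statements can be searched accurately by speaker. Furthermore, because a user's position can be tracked and recorded even if it changes over time during recording, it has the effect of providing a recording apparatus that enables accurate speaker identification and statement search.
[0093]
The invention according to claim 2 has, in addition to the effects of the invention of claim 1, the effect of providing a recording apparatus that records data in a form in which a speaker's statements can be found quickly at playback time.
[0094]
The invention according to claim 3 needs no dedicated terminal solely for authentication, acquires user position data efficiently and accurately on the basis of each user's authentication operation, and allows a participant's statements to be found reliably after recording by designating that user. It therefore has the effect of providing a low-cost recording method that enables accurate statement search. Furthermore, because a user's position can be tracked and recorded even if it changes over time during recording, it has the effect of providing a recording method that enables accurate speaker identification and statement search.
[0097]
The invention according to claim 4 has, in addition to the effects of the recording method of claim 3, the effect of providing a recording method that records data in a form in which a speaker's statements can be found quickly at playback time.
[0098]
The invention according to claim 5 has the effect that the method according to claim 3 or 4 can be executed by a computer.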
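[Brief description of the drawings]
[FIG. 1] A network configuration diagram of the recording and retrieval apparatus according to an embodiment of the present invention.
[FIG. 2] A hardware configuration diagram of the recording apparatus used by the recording and retrieval apparatus of the embodiment.
[FIG. 3] A schematic perspective view of the microphone array and video camera connected to the recording apparatus of the embodiment.
[FIG. 4] A hardware configuration diagram of the user PC used in the recording and retrieval apparatus of the embodiment.
[FIG. 5] A block diagram showing the functional configuration of the recording and retrieval apparatus of the embodiment.
[FIG. 6] A diagram showing an example of the sound source direction data recorded by the recording apparatus of the embodiment.
[FIG. 7] A diagram showing an example of the user position data recorded by the recording apparatus of the embodiment.
[FIG. 8] An operation flowchart of the user position input unit of the user PC in the embodiment.
[FIG. 9] Diagrams showing examples of the display screen for user position input in the recording and retrieval apparatus of the embodiment: (a) is the user position image sent to the user PC for input, (b) is the dialog for user position input on the user PC 4, and (c) is the user position image, marked with a user name label, sent to the user PC after user authentication.
[FIG. 10] A diagram showing an example of a screen played back by the playback unit of the recording and retrieval apparatus of the embodiment.
[FIG. 11] A flowchart showing the operation of the timer interrupt processing unit of the playback unit in the recording and retrieval apparatus of the embodiment.
[FIG. 12] A flowchart showing the operation of the user position input unit of the recording and retrieval apparatus of the embodiment.
[FIG. 13] A flowchart showing playback in user designation mode in the recording and retrieval apparatus of the embodiment.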
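[Explanation of reference numerals]
1 Video camera
2 Microphone array
3 Recording apparatus (recording PC)
4 User computer (user PC)
5 Network
11, 41 Central processing unit (CPU)
12, 42 Random access memory (RAM)
13, 43 Hard disk
14, 44 Keyboard
15, 45 Monitor
16 Audio interface
17, 18 Microphone
19 Video interface
20, 48 System bus
21, 47 Network interface
26 Lens
31 Recording unit
32 Distribution unit
33 User position input unit
34 Playback unit
51 Audio input unit
52 Video input unit
53 User position acquisition unit
54 Sound source direction estimation unit
55 Audio/sound source direction recording unit
56 Video compression unit
57 Video transmission unit
58 Video recording unit
59 User position recording unit
60 List recording unit
61 Sound source/position association unit
62 Association recording unit
63 Timer interrupt processing unit
72 User list file
73 Audio/image file
74 User position recording file
[0001]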
BACKGROUND OF THE INVENTION
The present invention relates to a recording apparatus, recording method, and recording program that record audio with the aid of image recording, in particular video capture, and more specifically to a recording apparatus, recording method, and recording program that can play back the audio while distinguishing its sound sources and speakers.
[0002]
[Prior art]
There is a growing need to record, as audio and video, what participants say and present in a conference room or similar setting, in order to support reviewing the conference and preparing minutes. In particular, there is a growing need for apparatus and methods that make it easy to find a section of interest within long recordings of audio, or of video that includes moving images. Various techniques have been proposed for this purpose.
[0003]
For example, JP 2000-125274 A discloses a technique that uses a microphone array to estimate the direction of a sound source from the difference in arrival time of the sound, and records each speaker's statements separately. A video camera is placed near the microphone array and is turned toward the sound source direction determined for each statement, and the image in that direction is displayed, which makes the audio easier to search (see Patent Document 1).
[0004]
JP 2002-247489 A discloses a method in which the sound source direction is estimated with a microphone array while an image of the whole scene is recorded, and a position corresponding to the sound source direction and a name are entered on that image so that statements are associated with names (see Patent Document 2).
[0005]
Furthermore, JP 2002-251393 A discloses a technique in which each conference participant registers himself or herself while audio and video are being recorded, and viewing of the recorded information is restricted to the registered participants. Registration is performed by each participant passing a personal magnetic card through a card reader. The publication also describes preparing one terminal device per participant, each equipped with a video camera, a microphone, and a card reader, and recording audio and video separately for each participant (see Patent Document 3).
[0006]
[Patent Document 1]
JP 2000-125274 A (pages 4-7, FIGS. 1-4)
[Patent Document 2]
JP 2002-247489 A (pages 5 to 8, FIGS. 1, 9, 12, 14, 15)
[Patent Document 3]
JP 2002-251393 A (pages 4-7, FIGS. 1-4, 6-8)
[0007]
[Problems to be solved by the invention]
The invention of JP 2000-125274 A described above is a system that indexes the conference content by having a person observe the displayed image and judge who is speaking. The apparatus itself, however, only acquires the sound source direction and the corresponding image, and does not recognize whose statement each utterance is. If the same speaker moves, for example, that speaker may be treated as a different person. Statements from a particular direction can be collected, but their speaker is not necessarily the same person, so statements may be recorded under the wrong speaker. Paragraph [0061] of that publication states that a moving person can be tracked (timelines merged) "by using video pattern recognition and/or speaker voice identification techniques", but paragraph [0003] of the same publication states that video pattern recognition and voice recognition are not sufficiently reliable. From these statements, sufficiently accurate tracking appears difficult with that invention.
[0008]
The invention of JP 2002-247489 A described above attempts to solve this by having a position corresponding to the sound source direction and a name entered on the image, so that statements can be associated with names. In that invention, however, a user must enter positions and names separately from the audio and image recording, so the input operation becomes very cumbersome, particularly when a single user enters the positions and names of all participants.
[0009]
The invention of JP 2002-251393 A requires one terminal per participant in order to register and record participation individually, which makes it costly.
[0010]
The present invention has been made in view of these problems. Its object is to provide a recording apparatus, recording method, and recording program for audio and video recording that need no dedicated terminal equipment for participants (users) to register their participation, that reliably identify each participant's statements and track the participant during recording, and that make it easy, after recording, to designate a speaker and search for that speaker's statements.
[0011]
[Means for Solving the Problems]
To achieve the above object, the invention according to claim 1 is a recording apparatus connected to a terminal device operated by a user, comprising: sound acquisition means for acquiring the sound of a sound source as audio data; sound source position acquisition means for acquiring position information of the sound source as sound source position data; image acquisition means for acquiring image data capturing an area that includes a plurality of users; output means for outputting the image data to the terminal device; input receiving means for receiving, from the terminal device, an input position designated on the output image data as the position at which a user is present within the area represented by the image data; user authentication means for, when the input position has been received, authenticating by user name and password that the user is the one present at the position in the area corresponding to the input position; association means for, when the user authentication succeeds, associating the input position in the image data with the position information of the sound source in the sound source position data that is judged to be the position of that user, based on a predetermined correspondence between the direction of the sound source and the directional position within the area displayed by the image data; identification information output means for, when the user authentication succeeds, adding identification information identifying the user at the input position on the image data and outputting the result to the terminal device; and change receiving means for, when the position of the identification information added to the image data is changed on the terminal device, receiving from the terminal device the changed position of the identification information, wherein the association means further associates the changed position on the image data with the position information of the sound source in the sound source position data that is judged to be the position to which the user has moved, and the identification information output means adds the identification information at the changed position on the image data and outputs the result to the terminal device.
[0012]
According to the invention of claim 1, the configuration recited in paragraph [0011] makes a dedicated per-user terminal solely for user authentication unnecessary, allows position data to be acquired efficiently and accurately on the basis of each user's authentication operation, and allows the speaker to be identified reliably after recording. A low-cost recording apparatus can therefore be provided that produces recordings in which statements can be searched accurately by speaker. In addition, because a user's position can be tracked and recorded even if it changes over time during recording, a recording apparatus can be provided that enables accurate speaker identification and statement search.
[0017]
  The invention according to claim 2 is the recording apparatus according to claim 1, further comprising: selection accepting means for accepting, from the terminal device, selection of the identification information added to the image data; and reproduction output means for outputting to the terminal device the audio data acquired from the position information of the sound source that the association means has associated with the position of the user indicated by the selected identification information.
[0018]
  According to the invention of claim 2, in addition to the action of the invention described in claim 1, the apparatus further comprises selection accepting means for accepting, from the terminal device, selection of the identification information added to the image data, and reproduction output means for outputting to the terminal device the audio data acquired from the position information of the sound source that the association means has associated with the position of the user indicated by the selected identification information. Therefore, it is possible to provide a recording device that records data in a form that allows a speaker's utterances to be searched quickly during reproduction.
[0019]
  The invention according to claim 3 is a recording method performed by a device connected to a terminal device operated by a user, the method comprising: a sound acquisition step of acquiring the sound of a sound source as sound data; a sound source position acquisition step of acquiring position information of the sound source as sound source position data; an image acquisition step of acquiring image data obtained by imaging an area including a plurality of users; an output step of outputting the image data to the terminal device; an input receiving step of receiving, from the terminal device, an input position designated on the output image data as a position where a user exists in the area represented by the image data; a user authentication step of performing, when the input position is received, user authentication with a user name and password for the user existing at the position in the area corresponding to the input position; an association step of associating, when the user authentication is successful, the input position in the image data with the position information of the sound source in the sound source position data determined to be the position of the user, based on a predetermined correspondence between the direction of the sound source and the directional position in the area displayed by the image data; an identification information output step of adding, when the user authentication is successful, identification information identifying the user to the input position on the image data and outputting it to the terminal device; and a change accepting step of accepting from the terminal device, when the position of the identification information added to the image data is changed at the terminal device, the changed position of the identification information, wherein the association step further associates the changed position on the image data with the position information of the sound source in the sound source position data determined to be the position of the user's movement destination, and the identification information output step adds the identification information to the changed position on the image data and outputs it to the terminal device.
[0020]
  According to the invention of claim 3, the recording method performed by a device connected to a terminal device operated by a user comprises: a sound acquisition step of acquiring the sound of a sound source as sound data; a sound source position acquisition step of acquiring position information of the sound source as sound source position data; an image acquisition step of acquiring image data obtained by imaging an area including a plurality of users; an output step of outputting the image data to the terminal device; an input receiving step of receiving, from the terminal device, an input position designated on the output image data as a position where a user exists in the area represented by the image data; a user authentication step of performing, when the input position is received, user authentication with a user name and password for the user existing at the position in the area corresponding to the input position; an association step of associating, when the user authentication is successful, the input position in the image data with the position information of the sound source in the sound source position data determined to be the position of the user, based on a predetermined correspondence between the direction of the sound source and the directional position in the area displayed by the image data; an identification information output step of adding, when the user authentication is successful, identification information identifying the user to the input position on the image data and outputting it to the terminal device; and a change accepting step of accepting from the terminal device, when the position of the identification information added to the image data is changed at the terminal device, the changed position of the identification information. The association step further associates the changed position on the image data with the position information of the sound source in the sound source position data determined to be the position of the user's movement destination, and the identification information output step adds the identification information to the changed position on the image data and outputs it to the terminal device. Therefore, a dedicated terminal solely for authentication is unnecessary, user position data can be acquired efficiently and accurately based on the user's authentication operation, and after recording, utterances can be searched for reliably by specifying a user who was a participant in the conference, so that it is possible to provide a low-cost recording method that allows utterances to be searched accurately by speaker. In addition, even if the position of a user changes during recording, the user can be tracked and recorded, so that it is possible to provide a recording method that enables accurate identification of a speaker and retrieval of that speaker's utterances.
[0025]
  The invention according to claim 4 is the recording method according to claim 3, further comprising: a selection receiving step of receiving, from the terminal device, selection of the identification information added to the image data; and a reproduction output step of outputting to the terminal device the audio data acquired from the position information of the sound source that has been associated, in the association step, with the position of the user indicated by the selected identification information.
[0026]
  According to the invention of claim 4, in addition to the operation of the recording method described in claim 3, the method further comprises a selection receiving step of receiving, from the terminal device, selection of the identification information added to the image data, and a reproduction output step of outputting to the terminal device the audio data acquired from the position information of the sound source that has been associated, in the association step, with the position of the user indicated by the selected identification information. Therefore, it is possible to provide a recording method that records data in a form that allows a speaker's utterances to be searched quickly during reproduction.
[0027]
  The invention according to claim 5 is a program for causing a computer to execute the method described in claim 3 or 4. As a result, the method described in claim 3 or 4 can be executed by a computer.
[0028]
DETAILED DESCRIPTION OF THE INVENTION
  Preferred embodiments of the recording apparatus, recording method, and recording program according to the present invention will be described in detail below with reference to the accompanying drawings.
[0029]
(1. Network configuration of record retrieval device)
FIG. 1 is a network configuration diagram of a record retrieval apparatus according to an embodiment of the present invention. The record retrieval apparatus according to the embodiment is configured by connecting, to a network 5, a recording device 3 that records audio and moving images and to which a video camera 1 and a microphone array 2 are connected, and computers 4 (hereinafter abbreviated as user PCs 4) used by the users who are the participants in a conference.
[0030]
FIG. 2 is a hardware configuration diagram of the recording device 3 used in the record retrieval device according to the embodiment. The recording device 3 used here is a general computer (recording PC) in which a central processing unit (CPU) 11, a random access memory (RAM) 12, a hard disk 13, a keyboard 14, and a monitor 15 are connected to each other via a system bus 20. However, the recording apparatus in the present embodiment does not have to be a general computer.
[0031]
In the recording apparatus 3, two microphones 17 and 18 are connected to the system bus 20 via the audio interface 16. The microphones 17 and 18 constitute the microphone array 2 of FIG. Similarly, the video camera 1 is connected to the system bus 20 via the video interface 19. When the microphones 17 and 18 and the video camera 1 are connected to the system bus 20, audio data and image data are captured and stored in the hard disk 13, respectively. The image data is preferably moving image data here, but is not necessarily moving image data, and may be a plurality of still image data depending on the situation. The system bus 20 is connected to the network 5 via the network interface 21 and can communicate with the outside.
[0032]
The recording device 3 operates a recording unit 31 (described later) for recording audio / video data, records audio / video data, and stores it in the hard disk 13. Further, the recording device 3 acquires the position data of the user who is a participant and stores it in the hard disk 13 in response to the operation of a user position input unit 33 (described later) for inputting the position of the user executed by the user PC 4. The recording device 3 distributes the recorded data to the user PC 4 by the distribution unit 32, and the computer terminal that is the user PC 4 operates the reproduction unit 34 to reproduce the recording. The operation of these units will be described later.
[0033]
Here, the microphones 17 and 18 and the recording unit 31 constitute sound acquisition means in the present invention. The video camera 1 and the recording unit 31 constitute image acquisition means in the present invention.
[0034]
FIG. 3 is a schematic perspective view of the video camera 1 and the microphone array 2 connected to the recording apparatus 3 according to the embodiment. The direction of the sound source at each time can be detected from the audio data input to the two or more microphones 17 and 18. The video camera 1 has a lens 26 forming its optical system. The lens 26 may be rotated toward the direction of the sound source.
[0035]
For detection of the sound source direction, for example, the technique disclosed in Japanese Patent Laid-Open No. 2000-125274 is used. When the microphones 17 and 18 and the video camera 1 are fixed, the sound source direction determined from the microphones 17 and 18 and the horizontal position on the image captured by the video camera 1 can be associated with each other in advance.
[0036]
FIG. 4 is a hardware configuration diagram of the user PC 4 used in the record retrieval apparatus according to the present embodiment. Here, the user PC 4 is a general computer in which a central processing unit (CPU) 41, a random access memory (RAM) 42, a hard disk 43, a keyboard 44, a monitor 45, and a mouse 46 are connected to each other via a system bus 48. The user PC 4 may be a small, so-called notebook personal computer. The system bus 48 is also connected to the external network 5 via the network interface 47, and the user PC 4 communicates with the recording device 3 via the network 5.
[0037]
The user PC 4 runs a general operating system, for example, Microsoft Windows. By executing an application program on the operating system, the user PC 4 can operate the user position input unit 33, which inputs the user position, and the playback unit 34 (both described later). The user PC 4 can also be used to run other programs, for example to refer to materials during a meeting. A so-called personal digital assistant (PDA) can also be used as the user PC 4, provided that it can run the program. The operation of each unit will be described later.
[0038]
(2. Functional configuration of record retrieval device)
FIG. 5 is a block diagram showing a functional configuration of the record retrieval apparatus according to the embodiment of the present invention. Within the record retrieval device, the recording device 3 operates the recording unit 31 and the distribution unit 32. However, the distribution unit 32 does not necessarily have to run on the recording device 3 and may instead be executed on a separate distribution server.
[0039]
The user PC 4 operates the user position input unit 33 and the playback unit 34. However, the playback unit 34 does not necessarily have to run on the user PC 4 and may instead be executed on a separate playback device or playback client.
[0040]
Here, each unit that performs the above operations, such as the recording unit 31, can be implemented as a program, for example as a recording program named after its function and stored in the recording PC 3 or the user PC 4, but the implementation is not limited to this. The various recorded data are stored on the hard disks 13 and 43, but may be stored on other storage media.
[0041]
(2.1 Recording unit)
The recording unit 31 includes an audio input unit 51, a moving image input unit 52, a user position acquisition unit 53, a sound source direction estimation unit 54, an audio / sound source direction recording unit 55, a video compression unit 56, a video transmission unit 57, a video recording unit 58, A user position recording unit 59 and a list recording unit 60 are included. Further, a sound source / position correspondence unit 61 and a correspondence recording unit 62 may be provided.
[0042]
The audio input unit 51 inputs 2-channel audio data from the two microphones 17 and 18 via the audio interface 16 of the recording device 3. The moving image input unit 52 inputs moving image data from the video camera 1 via the video interface 19. The audio and moving image data is stored in the hard disk 13.
[0043]
The user position acquisition unit 53 communicates with the user PC 4 and acquires and registers the user name, password, and position data sent from the user PC 4. It then reads from the hard disk 13 a password list in which user names and passwords are recorded in advance. If the data transmitted from the user PC 4 matches an entry in the password list, the user authentication is determined to be successful; if not, it is determined to have failed. In either case, the result is returned to the user PC 4.
[0044]
However, the password does not need to be set if access is not to be strictly restricted. Further, if the user name does not exist in the recorded list, the user can be added to the user list as a new user by rewriting the user list file 72. Here, the user position acquisition unit 53 constitutes the user authentication means of the present invention.
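The authentication check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the in-memory dictionary standing in for the password list, the list standing in for the user list file 72, and the `require_password` flag for relaxed operation are all assumptions.

```python
def authenticate(user_name, password, password_list, require_password=True):
    """Return True when the submitted credentials match the password list.

    password_list is a hypothetical dict of user name -> password that
    stands in for the list read from the hard disk 13.
    """
    if not require_password:
        # Relaxed operation: no password is set, so nothing is checked.
        return True
    return user_name in password_list and password_list[user_name] == password


def handle_registration(user_name, password, password_list, user_list,
                        require_password=True):
    """Return "success" or "failure", the result sent back to the user PC."""
    if not authenticate(user_name, password, password_list, require_password):
        return "failure"
    if user_name not in user_list:
        # A user name that is not yet listed is added as a new user.
        user_list.append(user_name)
    return "success"
```

With strict operation, an unknown user or a wrong password yields "failure"; with `require_password=False`, any user name is accepted and appended to the list once.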
[0045]
A user who participates in a meeting or the like uses a user PC 4 installed at the venue, or connects a user PC 4 that the user has brought in to the communication line. Then, together with the input for user authentication, the user sets and inputs his or her position at the meeting on the user PC 4, and this position is transmitted to the recording device 3. Here, the user is a participant who takes part in a conference or the like and speaks. Other kinds of users include a recorder, who records a meeting or the like using the record retrieval device, and a viewer, who views the record of the meeting. Here, the user position acquisition unit 53, which acquires the user's position, also constitutes the user position acquisition means of the present invention.
[0046]
The sound source direction estimation unit 54 estimates the sound source direction from the two-channel audio data, as in the prior art. For example, the cross-correlation between the channels is calculated every 0.1 seconds, the time difference at which the correlation is maximum is taken as the arrival time difference, and the sound source direction is estimated from this difference and the distance between the two microphones 17 and 18. By this operation, direction (angle) data is output every 0.1 seconds. However, if the sound is below a certain level, the sound source direction estimating unit 54 determines that there is no sound and outputs silence data. Here, the microphones 17 and 18 and the sound source direction estimation unit 54 constitute sound source position acquisition means of the present invention.
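The estimation from the arrival time difference can be illustrated with the following sketch. It assumes a far-field source, a speed of sound of 343 m/s, a mean-power silence test, and a brute-force cross-correlation over one block of samples; the function names, the sign convention, and these parameters are illustrative assumptions and are not taken from the embodiment.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def arrival_time_difference(left, right, sample_rate, max_lag):
    """Lag in seconds at which the cross-correlation of the channels peaks.

    A positive result means the right channel lags the left channel.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def direction_deg(delay_s, mic_distance_m):
    """Far-field angle from broadside; positive toward the left microphone."""
    s = SPEED_OF_SOUND * delay_s / mic_distance_m
    s = max(-1.0, min(1.0, s))  # guard against rounding outside asin's domain
    return math.degrees(math.asin(s))


def estimate_block(left, right, sample_rate, mic_distance_m, silence_level):
    """Angle for one block (e.g. 0.1 s of samples), or None for silence."""
    mean_power = sum(x * x for x in left) / len(left)
    if mean_power < silence_level:
        return None  # corresponds to outputting silence data
    # Only lags that are physically possible for this spacing are searched.
    max_lag = math.ceil(mic_distance_m / SPEED_OF_SOUND * sample_rate)
    delay = arrival_time_difference(left, right, sample_rate, max_lag)
    return direction_deg(delay, mic_distance_m)
```

Calling `estimate_block` once per 0.1-second block would yield one direction value, or a silence marker, per block. A practical system would more likely use an FFT-based correlation than this quadratic loop.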
[0047]
The sound source direction recording unit 55 records the sound source direction data at each time obtained by the sound source direction estimating unit 54 as the position of the speaker, that is, of the participant in the conference or the like who is speaking, and stores the data in the hard disk 13. A timer generally provided in a computer can be used for measuring the time.
[0048]
For example, the speaker position is recorded as a relative position on the image captured by the video camera and displayed on the display screen. That is, the left end of the image is 0, the right end is 1, and silence or a direction outside the screen range is represented by -1. FIG. 6 is an example of sound source direction recording data recorded by the recording apparatus according to the embodiment. The table shows that the speaker position moves to 0.1, 0.12, and 0.8 as time passes through times 1, 2, and 3.
[0049]
The video compression unit 56 compresses the moving image data input by the moving image input unit 52 and the audio data input by the audio input unit, and generates recording data and transmission data. As the compression method, for example, a well-known MPEG1 compression algorithm can be used.
[0050]
Upon receiving a request from the user position acquisition unit 53 of the recording device 3, the video transmission unit 57 transmits the video data currently being shot to the user PC 4 so that the users' positions are displayed on the monitor 45 of the user PC 4. The transmission at this time is not transmission for reproduction after recording, but transmission of an image of the conference in progress. For the transmission, for example, the well-known RTSP protocol can be used. Note that the transmission data here is used for the position input of the participant, so the audio data is not necessarily required, and only the moving image or the image data may be transmitted separately from the audio recording data.
[0051]
The video recording unit 58 stores the compressed audio or image data generated by the video compression unit 56 as the audio image file 73 in the hard disk 13 of the recording device 3. The hard disk 13 also records and stores audio image recording start time and end time.
[0052]
The user position recording unit 59 records the user position data acquired by the user position acquisition unit 53 and the input time thereof as a user position recording file 74 on the hard disk 13. Here, the user position is the position of the participant who participates in the conference. The time is in seconds and the position is the relative position of the screen.
[0053]
FIG. 7 is an example showing user position data recorded by the record retrieval apparatus according to the embodiment. In FIG. 7, the user “aoki” is registered at the position “0.1” 3 seconds after the start of recording, and then the user “mura” is registered at the position “0.8” 5 seconds later. The case where the user “aoki” moves to the position “0.5” at 600 seconds is shown.
[0054]
The list recording unit 60 appends the user name acquired by the user position acquisition unit 53 to the user list file 72 on the hard disk 13 of the recording device 3. However, the list is searched before the name is recorded, and a user name that has already been recorded is not added again. The user list file 72 records the users who have participated in the conference, and can be used when the distribution unit 32 described later decides whether to permit browsing.
[0055]
Here, the recording device 3 can be provided with a sound source / position association unit 61 and an association recording unit 62. With these units, the user and the speaker can be associated and recorded at the stage of recording the voice and/or the image, so there is no need to associate them at playback. However, as will be described in detail below, a search method in which the user and the speaker are associated at the playback stage is also possible. The sound source / position associating unit 61 associates the sound source direction position data obtained by the sound source direction estimating unit 54 with the user position obtained by the user position obtaining unit 53. Through this association, the speaker whose position has been estimated is matched to a user who is a conference participant, so that speech can be recorded while the speaker is identified. By designating a user, the speaker can then be found easily and that user's speech can be searched quickly. In this case, the user speech section selection means in the playback unit 34 (described later) is not required. The association recording unit 62 stores the association between the speaker and the participant made by the sound source / position association unit 61 in the hard disk 13 of the recording device 3. Here, the sound source / position associating unit 61 constitutes the associating means of the present invention.
[0056]
(2.2 Distribution unit)
The distribution unit 32 runs on the recording device 3 and distributes the recorded data to the user PC 4. After distribution, the recorded data is reproduced and viewed on the user PC 4. The distribution unit 32 uses, for example, the technique described in paragraphs 0020 to 0023 of JP-A-2002-251393. That is, it communicates with the reproduction unit 34 of the user PC 4 to receive the user name and password, performs user authentication by checking them against the recorded user list and the password list, and transmits the recorded data only to users recorded in the list. However, in this embodiment, sound source position data and user position data are transmitted in addition to the video data. Further, by copying the recording data, the distribution unit 32 may be executed on a computer other than the recording device 3 that can read the recording data.
[0057]
(2.3 User position input unit)
The user position input unit 33 is executed on the user PC 4 while the recording unit 31 is recording, and transmits the user authentication input data and user position input data from the user PC 4 to the recording unit 31 of the recording apparatus 3. FIG. 8 is an operation flowchart of the user position input unit of the user PC 4 according to the embodiment. FIG. 9 is an example of display screens during user position input on the user PC 4: (a) is the user position image transmitted to the user PC 4 for input, (b) is the dialog for user position input, and (c) is the user position image marked with a user name label, shown on the user PC 4 after user authentication.
[0058]
With reference to FIG. 8, the position input of the user who is a participant in the conference in the user PC 4 and the user authentication operation will be described. When the user position input unit 33 starts operating in the user PC 4, it starts communication with the recording unit 31 on the recording device 3 side and requests the recording device 3 to transmit video data. Then, a video display image is transmitted from the recording device 3 to the user PC 4, and video display starts (step S801). The video data received by the user PC 4 is displayed on the screen as shown in FIG. Thereafter, the monitor screen of the user PC 4 continues to receive and display video data until the end of the operation.
[0059]
The state of the mouse button of the user PC 4 is monitored, and the process waits for the button to be pressed (N in step S802). When the button is pressed (Y in step S802), the position of the mouse cursor at that moment is acquired and examined (step S803). If the mouse cursor position is within the video display screen, it is determined that a position has been designated (Y in step S803), and the process proceeds to the user name input step. The dialog shown in FIG. 9B is then displayed on the screen of the user PC 4, and the user PC 4 reads the user name and password that the user enters from the keyboard and records them in a memory (not shown) (step S804). If passwords are not used, the password need not be entered.
[0060]
Next, the input user name (and password) and the mouse click position are transmitted to the recording device 3 (step S805). The mouse click position is a relative position within the video screen. As a result, the position is displayed at the same place relative to the image even if the size of the video screen is changed.
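The relative click position can be computed as in the sketch below, which assumes the 0-to-1 horizontal scale used for the speaker position (left end 0, right end 1). The function and parameter names are hypothetical, and pixel coordinates are assumed to be measured in the window that contains the video display screen.

```python
def relative_click_position(click_x, video_left, video_width):
    """Horizontal click position as a fraction of the video width.

    Returns a value from 0.0 (left end) to 1.0 (right end), or None when
    the click falls outside the video display screen horizontally.
    """
    if video_width <= 0:
        raise ValueError("video_width must be positive")
    rel = (click_x - video_left) / video_width
    if not 0.0 <= rel <= 1.0:
        return None
    return rel


def label_pixel_x(rel, video_left, video_width):
    """Pixel column at which to draw a label for a stored relative position."""
    return video_left + round(rel * video_width)
```

Because only the fraction is stored, a label registered at 0.1 on a 640-pixel-wide screen is drawn at pixel 128 when the screen is later enlarged to 1280 pixels.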
[0061]
When the user PC 4 receives authentication success information transmitted from the recording device 3 (Y in step S806), a user name label is displayed on the video display screen at the horizontal position where the mouse was clicked, as shown in FIG. 9C (step S807). The user name label identifies the user and specifies the user's position in the image.
[0062]
If the mouse cursor position when the button is pressed is within a user name label (N in step S803, the step that checks for position designation), the process proceeds to the position movement determination (step S808) and then to the label move step.
[0063]
Until the mouse button is released (N in step S809), the mouse position is acquired and the user name label is moved in accordance with its horizontal position (step S810). When the mouse button is released (Y in step S809), the position, together with the user name and password stored in the memory, is transmitted to the recording device 3 in the same manner as for position designation (step S805). If the mouse cursor position when the button is pressed is on the end button (N in step S808), it is determined whether or not to stop (step S811); if so (Y in step S811), the operation ends.
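The branching on the button-press position in steps S803, S808, and S811 might be organized as in the following sketch. The rectangle representation, the `Action` names, and the assumption that the video area, the labels, and the end button do not overlap are all illustrative choices, not details given in the flowchart.

```python
from enum import Enum


class Action(Enum):
    DESIGNATE_POSITION = "designate"  # continue with steps S804 to S807
    MOVE_LABEL = "move"               # continue with steps S809 and S810
    STOP = "stop"                     # continue with step S811
    IGNORE = "ignore"                 # keep waiting in step S802


def contains(rect, x, y):
    """True when point (x, y) lies inside rect = (left, top, width, height)."""
    left, top, width, height = rect
    return left <= x < left + width and top <= y < top + height


def classify_press(x, y, video_rect, label_rects, end_button_rect):
    """Return (action, user_name); user_name is set only for MOVE_LABEL."""
    if contains(video_rect, x, y):                # Y in step S803
        return Action.DESIGNATE_POSITION, None
    for user_name, rect in label_rects.items():   # step S808
        if contains(rect, x, y):
            return Action.MOVE_LABEL, user_name
    if contains(end_button_rect, x, y):           # N in step S808
        return Action.STOP, None
    return Action.IGNORE, None
```

The checks follow the order of the flowchart, so if labels were drawn on top of the video area they would have to be tested first instead.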
[0064]
Thus, in the present embodiment, user position data is acquired based on the user authentication operation. With this method, the user's position can be acquired efficiently and accurately, at the same time as authentication and with a simple operation, so that the voice recorded for each user is accurate and, consequently, so is the retrieval of each user's utterances. In the present embodiment, authentication is realized by the user inputting a user name and password. However, other methods may be used, such as using the user name or network address set in the user PC 4, or using an authentication server.
[0065]
(2.4 Playback unit)
The reproduction unit 34 is executed on the user PC 4 (or another PC) to reproduce the recorded data. The PC thereby serves as a playback device, and as a search device when a search operation is performed during playback. If it also serves as the recording device, it is a record retrieval device.
[0066]
The playback unit 34 basically works like an ordinary video playback unit: it reads mouse clicks on the monitor screen of the user PC 4, starts video playback when the playback button is pressed, and stops playback when the stop button is pressed. The video display screen can be resized according to the window size. That is, when the user changes the size of the window by dragging the mouse, the vertical and horizontal sizes are acquired, and the video display screen size is changed so as to fit in the portion that remains after excluding the area needed to display buttons and the like.
[0067]
After starting, the playback unit 34 according to the embodiment accepts a user name and password from the keyboard, communicates with the distribution unit 32 to authenticate the user, reads the sound source direction data and the user position data in addition to the video data, and has a function of playing back only the speech sections of a specific speaker. If the user who is a conference participant and the sound source direction of the speaker were associated by the sound source / position associating unit 61 at the time of recording, then simply specifying the user (speaker) causes speech from the sound source corresponding to that speaker to be searched, so that only the speech of the specified user (speaker) is extracted and reproduced. That is, only the speech sections of the specific speaker can be searched for and reproduced.
[0068]
(2.4.1 User specified mode during playback)
On the other hand, in the general case where the user and the speaker's speech were not recorded in association with each other at recording time, the playback unit 34 associates the speaker's speech with the user. For playback by the playback unit 34, one of two operation modes, normal and user designation, can be selected. The normal mode plays back without designating a speaker (user); the user designation mode designates a specific user in the conference recording and searches for and plays back only that user's speech. The playback unit has a mode storage area in a memory (not shown), and the stored value is one of two types, “normal (0)” and “user designation (1)”. The memory also has a designated user name storage area, which stores the user name in the user designation mode. Here, the memory may be the hard disk 43 of the user PC 4. The designation mode in the reproduction unit 34 constitutes the user designation means of the present invention.
[0069]
FIG. 10 is a diagram illustrating an example of a screen reproduced by the reproduction unit 34 of the recording / retrieval device according to the embodiment. Here, for example, clicking the aoki or mura label marked above the user (speaker) image displayed on the screen of the user PC 4 causes only the utterances of the user labeled aoki or mura, respectively, to be selected and played.
[0070]
The two operation modes are switched by clicking a user name label with the mouse. That is, the playback unit 34 monitors the mouse button, and when the click position is at the display position of a user name label (described later), it stores the user name displayed on the label as the designated user name and writes “user designation (1)” to the mode storage area. Further, the display color of the clicked label is inverted to show on the screen that the mode is the user designation mode. If the clicked label is that of the already designated user name, “normal (0)” is written to the mode storage area to return to the normal mode.
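The mode switching can be summarized by the small state holder below. The class and attribute names are hypothetical, and clearing the stored user name on return to the normal mode is an assumption made for tidiness; the description only states that “normal (0)” is written to the mode storage area.

```python
NORMAL = 0            # "normal (0)"
USER_DESIGNATION = 1  # "user designation (1)"


class PlaybackState:
    """Stand-in for the mode storage area and designated user name area."""

    def __init__(self):
        self.mode = NORMAL
        self.designated_user = None

    def click_label(self, user_name):
        """Update the mode when a user name label is clicked."""
        if self.mode == USER_DESIGNATION and self.designated_user == user_name:
            # Clicking the label of the already designated user returns
            # to the normal mode.
            self.mode = NORMAL
            self.designated_user = None  # assumption: the name is cleared
        else:
            self.mode = USER_DESIGNATION
            self.designated_user = user_name

    def label_inverted(self, user_name):
        """True when the label should be drawn in inverted colors."""
        return self.mode == USER_DESIGNATION and self.designated_user == user_name
```

Clicking a different label while in the user designation mode simply switches the designated user, which follows from the first rule in the description.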
[0071]
(2.4.2 Timer interrupt handling during playback)
FIG. 11 is a flowchart showing the operation of the timer interrupt processing unit 63 of the reproducing unit 34 in the recording / retrieval device according to the embodiment. In this embodiment, a timer is set to generate a timer interrupt at fixed intervals (for example, every 1 second), and an interrupt processing routine is executed at each interval. The search operation is performed in this interrupt processing.
[0072]
The operation of the timer interrupt unit 63 will be described with reference to FIG. In the timer interrupt process, first, the current video display time is acquired, and the currently valid user names and user positions are looked up in the user position data. If there is valid user data, a user name label is displayed at each corresponding position in the upper part of the video display screen as shown in FIG. 10 (step S1101). Once set, a user name and position are treated as valid until they are changed or the recording ends. For example, in the case of FIG. 7, the user “aoki” has a position “0.1” from 1 second to 600 seconds, a position “0.5” after 600 seconds, and a user “mura” has a position “0.8” after 5 seconds, and these are determined to be valid.
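The lookup of the currently valid positions in step S1101 can be sketched as follows. The record layout of (time in seconds, user name, relative position) and the function name are assumptions, and the sample values in the usage note are hypothetical values modeled on the FIG. 7 description.

```python
def valid_positions(records, current_time):
    """Return a dict of user name -> position valid at current_time.

    records is a list of (time_s, user_name, position) tuples. For each
    user, the most recent record at or before current_time wins, so a
    position stays valid until that user registers a new one.
    """
    latest = {}
    for time_s, user_name, position in sorted(records):
        if time_s <= current_time:
            latest[user_name] = position
    return latest
```

For records `[(3, "aoki", 0.1), (5, "mura", 0.8), (600, "aoki", 0.5)]`, a display time of 10 seconds gives aoki at 0.1 and mura at 0.8, while a display time of 600 seconds or later gives aoki at 0.5.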
[0073]
Next, it is determined whether or not the current display mode is the user designation mode (step S1102). If the current display mode is the user designation mode (Y in step S1102), the process further proceeds. If it is the normal mode (N in step S1102), the interrupt processing routine is terminated.
[0074]
Whether or not the current time falls within an utterance section of the designated user is determined from the designated user name, the current time, the valid user position, and the sound source position data (step S1103). This determination establishes whether or not the user and the speaker are associated with each other. If the absolute value of the difference between the sound source position and the designated user position at a given time is within a predetermined threshold, for example, 0.1, the speaker at that time is determined to be the designated user (Y in step S1103). If the current time is within a designated user utterance section, the interrupt processing routine is terminated.
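The threshold comparison of step S1103 can be written as a short predicate. This is a minimal sketch: the function name `is_designated_user_speaking` is hypothetical, and representing "no sound source detected" as `None` is an assumption not stated in the text.

```python
from typing import Optional


def is_designated_user_speaking(
    sound_source_position: Optional[float],
    designated_user_position: Optional[float],
    threshold: float = 0.1,
) -> bool:
    """Return True when the sound source lies within the threshold of the user.

    Positions are relative positions across the image width (0.0 to 1.0).
    None stands for "no sound source" or "no valid user position" (assumption).
    """
    if sound_source_position is None or designated_user_position is None:
        return False
    return abs(sound_source_position - designated_user_position) <= threshold
```

For example, a sound source at 0.15 is attributed to a user registered at 0.1, whereas a sound source at 0.8 is not.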
[0075]
Otherwise, if it is determined that the current speaker is not the designated user (N in step S1103), the first utterance time of the designated user after the current display time is searched for from the designated user name, the current time, the valid user position, and the sound source position data (step S1104).
[0076]
Then, the video display time is moved to the retrieved utterance time (step S1105). Although only the video display screen is shown in FIG. 10, a slide bar indicating the playback time and a time chart indicating the state of speech, with time and direction on the vertical and horizontal axes, may also be displayed as in the related art. Here, the timer interrupt processing unit 63 in the playback unit 34 constitutes the user speech section selection means in the present invention.
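The search of step S1104 and the resulting seek target of step S1105 can be sketched as follows. The data layouts are assumptions: the user position data is taken to be a list of (time, user name, position) tuples, and the sound source position data a list of (time, position) tuples. The function names and the sample sound source values are hypothetical; only the user positions follow the FIG. 7 example.

```python
from typing import List, Optional, Tuple

UserRecord = Tuple[float, str, float]   # (time, user name, position), assumed layout
SourceRecord = Tuple[float, float]      # (time, sound source position), assumed layout


def position_at(user_records: List[UserRecord], user: str, t: float) -> Optional[float]:
    """Return the position of the user that is valid at time t, or None."""
    position = None
    for time, name, p in sorted(user_records):
        if name == user and time <= t:
            position = p
    return position


def next_utterance_time(
    user: str,
    now: float,
    user_records: List[UserRecord],
    source_records: List[SourceRecord],
    threshold: float = 0.1,
) -> Optional[float]:
    """Return the first time after `now` at which the designated user speaks."""
    for t, source_position in sorted(source_records):
        if t <= now:
            continue
        user_position = position_at(user_records, user, t)
        if user_position is not None and abs(source_position - user_position) <= threshold:
            return t  # the video display time would be moved here (step S1105)
    return None  # no later utterance by this user


users = [(1, "aoki", 0.1), (5, "mura", 0.8), (600, "aoki", 0.5)]
sources = [(10, 0.82), (20, 0.12), (30, 0.79), (610, 0.48), (620, 0.1)]
```

With these sample values, searching for aoki from time 20 skips the utterance at 30 seconds (which matches mura) and lands on 610 seconds, where the sound source matches aoki's new position of 0.5. The sound at 620 seconds near position 0.1 is not attributed to aoki, because aoki's valid position has changed by then.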
[0077]
In this embodiment, the playback section for each speaker is selected by the playback unit 34 on the user PC 4. However, a configuration is also possible in which the search for the playback section is executed on the distribution unit 32 side and only the video data for the specified section is transmitted.
[0078]
(3. User operation example)
FIG. 12 is a flowchart showing the operation of the user position input unit of the record search apparatus according to the embodiment. The operation of the record search apparatus will now be described with reference to FIG. 12. First, at the start of the conference, a user (recorder) who is in charge of recording using the recording device 3 operates the keyboard 14 of the recording device 3 to start the recording unit 31. The recording unit 31 captures and records the audio / video data from the video camera 1 and the microphone array 2 connected to the recording device 3, and transmits them to the user PC 4 (step S1201).
[0079]
Each participant starts the user position input unit 33 on his / her user PC 4. After being activated, the user position input unit 33 displays the image currently being shot by the recording device 3 as shown in FIG. 9A (step S1251). Here, each user clicks the position where he or she appears in the captured image to display the dialog box for inputting user identification information shown in FIG. 9B, and enters the user name and, if necessary, the password from the keyboard. The user position input unit 33 transmits the user name, password, and position information on the screen to the recording device 3 (step S1252). When the recording unit 31 receives the transmitted user name and password (step S1202) and the user is authenticated (Y in step S1203), the user name is transmitted to the user PC 4 (step S1204) and, as shown in FIG. 9C, is displayed at the corresponding position on the screen (step S1253).
[0080]
When a user actually changes seats during the conference (Y in step S1254), the user's position on the screen also shifts, so the user, as a participant, drags the user name display portion of FIG. 9C with the mouse. The user position input unit 33 transmits the moved position (step S1255), the user position acquisition unit 53 of the recording unit 31 acquires the position, and the user position recording unit 59 records the new position (changed position) together with the time of the move (step S1205). Here, the user position acquisition unit 53 and the user position recording unit 59 constitute the user position change acquisition unit of the present invention, and the user position acquisition unit 53, a timer (not shown), and the user position recording unit 59 constitute the user position change time acquisition unit of the present invention.
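The recording-side handling of the initial position input with authentication (steps S1202 to S1204) and of a later position change (step S1205) can be sketched as follows. The class name `UserPositionLog`, its methods, and the in-memory user list holding plain-text passwords are all assumptions for illustration only; an actual system would store password hashes rather than plain text.

```python
import hmac
from typing import Dict, List, Tuple


class UserPositionLog:
    """Hypothetical sketch of the user position record kept by the recording unit."""

    def __init__(self, user_list: Dict[str, str]) -> None:
        # Assumption: the user list maps user name -> password (illustration only).
        self._user_list = user_list
        self.records: List[Tuple[float, str, float]] = []  # (time, user, position)

    def register(self, t: float, user: str, password: str, position: float) -> bool:
        """Authenticate the user and, on success, record the clicked position."""
        expected = self._user_list.get(user)
        if expected is None:
            return False
        if not hmac.compare_digest(expected.encode(), password.encode()):
            return False
        self.records.append((t, user, position))
        return True

    def move(self, t: float, user: str, new_position: float) -> bool:
        """Record a changed position and the time of the change (label drag)."""
        if not any(name == user for _, name, _ in self.records):
            return False  # only already registered users can be moved
        self.records.append((t, user, new_position))
        return True
```

A user who registers at position 0.1 at 1 second and drags the label to 0.5 at 600 seconds produces two records, which reproduces the aoki entries of the FIG. 7 example.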
[0081]
Thus, even if the speaker moves, the position where the speaker moves can be surely input by the input of the speaker himself, so that the drawback of the conventional example of erroneously recognizing the speaker can be improved. In addition, the complexity of the conventional example in which one user inputs the positions of many users is improved. However, when the user moves while speaking, the moved position can be tracked by the microphones 17 and 18, and in this case, the position may be automatically tracked to change the position.
[0082]
As described above, each user (participant) performs only the position input (mouse click) at the time of the first authentication and the position movement input (drag) when moving, and does not need to enter the positions of all participants. Therefore, it is possible to avoid the complicated work of designating the positions of all participants over the entire time.
[0083]
Here, when the user PC 4 is a normal computer, the participant can use the user PC 4 for other purposes such as referring to materials and recording memos during the conference.
[0084]
Here, the user position transmitted by the user position input unit 33 is preferably a relative position based on the width of the video display screen. This has the advantage that even if the size of the display screen is freely changed, the label is displayed at the corresponding relative position. For example, while another application is in use, the window can be reduced in size and displayed in a corner of the screen.
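The use of a relative position can be illustrated with a pair of conversion helpers. This is a sketch under the assumption that the position is the horizontal click coordinate divided by the current window width; the function names `to_relative` and `to_pixels` are hypothetical.

```python
def to_relative(click_x_px: float, window_width_px: float) -> float:
    """Convert a horizontal click coordinate into a position relative to the width."""
    if window_width_px <= 0:
        raise ValueError("window width must be positive")
    return click_x_px / window_width_px


def to_pixels(relative_x: float, window_width_px: float) -> int:
    """Convert a relative position back into a pixel column for label display."""
    return round(relative_x * window_width_px)
```

A click at pixel 64 in a 640-pixel-wide window is stored as 0.1; if the window is later reduced to 320 pixels, the label is drawn at pixel 32, so it stays over the same participant.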
[0085]
(4. Example of playback in user specified mode)
FIG. 13 is a flowchart showing an operation of reproducing in the user designation mode in the record retrieval apparatus according to the embodiment. Reproduction in the user designation mode makes it possible to designate and search for a specific user, that is, a participant, and to reproduce only the speech of that participant. The reproduction operation in the user designation mode will be described with reference to FIG. 13. After the conference is over, a conference participant activates the playback unit 34 on the user PC 4 or the playback device in order to play back the recorded data. The user PC 4 or the playback device activates the playback unit 34, communicates with the recording device 3 via the network 5, and accepts and transmits the user name and password (step S1351). The recording device 3 receives the user name and password (step S1301) and performs user authentication (step S1302). If the user cannot be authenticated (N in step S1302), an authentication failure is transmitted to the user PC 4, and the user PC 4 that has received it performs an error reception process (step S1352). If the user is authenticated (Y in step S1302), the audio video data, speech direction data, and participant position data are read and transmitted to the user PC 4 (step S1303). The user PC 4 receives the data and displays the moving image based on the data, with the respective user names at the corresponding positions, as shown in FIG. 10 (step S1353).
[0086]
In the user position input operation, each user performs input for only one person, namely himself or herself, but during reproduction the user names of all the participants recorded by the recording unit 31 are displayed. When the user selects a user name display label as shown in FIG. 10 and clicks it with the mouse, the display enters the user designation mode, and information about the designated user is transmitted to the recording device 3 (step S1354). The recording device 3 receives the user's designation input (step S1304) and transmits the recorded video for the sections in which the designated user spoke (step S1305), and the user PC 4 receives and reproduces it.
[0087]
Here, the position information of the participants is data manually input by each participant of the conference and does not rely on uncertain techniques such as image recognition or voice recognition, so accurate segmentation of speech can be realized. In particular, since the position information is input first when user authentication is performed on the spot, the position can be specified with certainty. Further, if user position input is performed based on user authentication, the operation is performed simultaneously with the authentication, so that the operation is efficient and reliable.
[0088]
Note that the recording unit and the record search unit executed by the recording device and the record search device according to this embodiment are provided as files in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, floppy (R) disk (FD), or DVD.
[0089]
Further, the recording unit and the record search unit of the present embodiment may be configured to be provided and distributed by storing them on a computer connected to a network such as the Internet and downloading them via the network.
[0090]
[Effects of the invention]
  The invention according to claim 1 does not require a terminal unique to each user solely for user authentication, can acquire user position data efficiently and accurately based on the user authentication operation, and acquires the user position simultaneously with the authentication, so that the speaker can be reliably identified after recording. It is thus possible to provide a low-cost recording apparatus capable of producing a record in which utterances can be accurately searched by speaker. Further, since the user can be tracked and recorded even if the position of the user changes during recording, it is possible to provide a recording apparatus capable of accurately identifying a speaker and searching for the speech.
[0093]
  The invention according to claim 2, in addition to the effects of the invention according to claim 1, can provide a recording device capable of recording data in which a speaker's speech can be quickly searched during reproduction.
[0094]
  The invention according to claim 3 does not require a unique terminal solely for authentication, can acquire user position data efficiently and accurately based on the authentication operation of the user, and allows a search to be reliably performed after recording by designating a user who is a participant of the conference. It therefore has the effect of providing a low-cost recording method that yields records which can be searched accurately. Further, since the user can be tracked and recorded even if the position of the user changes during recording, it is possible to provide a recording apparatus capable of accurately identifying a speaker and searching for the speech.
[0097]
  The invention according to claim 4, in addition to the effects of the recording method according to claim 3, has the effect of providing a recording device that can record data in which a speaker's speech can be quickly searched during reproduction.
[0098]
  The invention according to claim 5 has the effect that the method according to claim 3 or 4 can be executed by a computer.
[Brief description of the drawings]
FIG. 1 is a network configuration diagram of a record retrieval apparatus according to an embodiment of the present invention.
FIG. 2 is a hardware configuration diagram of a recording apparatus used by the record search apparatus according to the embodiment.
FIG. 3 is a schematic perspective view of a microphone array and a video camera connected to the recording apparatus according to the embodiment.
FIG. 4 is a hardware configuration diagram of a user PC used in the record search apparatus according to the embodiment.
FIG. 5 is a block diagram showing a functional configuration of a record retrieval apparatus according to an embodiment.
FIG. 6 is a diagram showing an example of sound source direction recording data recorded by the recording apparatus according to the embodiment.
FIG. 7 is a diagram illustrating an example of user position data recorded by the recording apparatus according to the embodiment.
FIG. 8 is an operation flowchart of the user position input unit of the user PC of the recording apparatus according to the embodiment.
FIG. 9 is a diagram showing an example of a display screen related to user position input of the recording / retrieval device according to the embodiment, where (a) is a user position image transmitted to the user PC for input, (b) is a dialog screen for user position input in the user PC 4, and (c) is a user position image marked with a user name label transmitted to the user PC after user authentication.
FIG. 10 is a diagram illustrating an example of a screen reproduced by a reproduction unit of the recording / retrieval device according to the embodiment.
FIG. 11 is a flowchart showing the operation of the timer interrupt processing unit of the playback unit in the recording / retrieval device according to the embodiment.
FIG. 12 is a flowchart showing an operation of a user position input unit of the record search device according to the embodiment.
FIG. 13 is a flowchart showing an operation of reproducing in a user designation mode in the record search apparatus according to the embodiment.
[Explanation of symbols]
1 Video camera
2 Microphone array
3 Recording device (Recording PC)
4 User computer (User PC)
5 Network
11, 41 Central processing unit (CPU)
12, 42 Random access memory (RAM)
13, 43 Hard disk
14, 44 Keyboard
15, 45 Monitor
16 Voice interface
17, 18 Microphone
19 Video interface
20, 48 System bus
21, 47 Network interface
26 Lens
31 Recording unit
32 Distribution unit
33 User position input unit
34 Playback unit
51 Voice input unit
52 Video input unit
53 User position acquisition unit
54 Sound source direction estimation unit
55 Voice / sound source direction recording unit
56 Video compression unit
57 Video transmission unit
58 Video recording unit
59 User position recording unit
60 List recording unit
61 Sound source / position matching unit
62 Association recording unit
63 Timer interrupt processing unit
72 User list file
73 Audio image file
74 User location record file

Claims

A recording device connected to a terminal device operated by a user,
Sound acquisition means for acquiring sound of the sound source as sound data;
Sound source position acquisition means for acquiring position information of the sound source as sound source position data;
Image acquisition means for acquiring image data obtained by imaging an area including a plurality of users;
Output means for outputting the image data to the terminal device;
Input receiving means for receiving input from the terminal device as an input position specified on the output image data as a user's position in the area represented by the image data;
A user authentication means for performing user authentication by a user name and a password, indicating that the user is present at a position in the area corresponding to the input position when receiving an input of the input position;
An associating means for associating, if the user authentication is successful, the input position in the image data with the position information of the sound source in the sound source position data determined to be the position of the user, based on a predetermined correspondence between the direction of the sound source and the directional position in the area displayed by the image data;
When the user authentication is successful, identification information output means for adding identification information for identifying the user to the input position on the image data and outputting the identification information to the terminal device;
A change receiving means for receiving, from the terminal device, the changed position of the identification information when the position of the identification information added to the image data is changed in the terminal device,
The association means further associates the changed position on the image data with the position information of the sound source of the sound source position data determined to be the position of the user's destination,
The identification information output means adds the identification information to the change position on the image data and outputs the identification information to the terminal device;
A recording apparatus.

  Selection accepting means for accepting selection of the identification information added to the image data from the terminal device;
  A reproduction output means for outputting the audio data acquired from the position of the user indicated by the identification information that has received a selection and the position information of the sound source associated with the association means to the terminal device;
  The recording apparatus according to claim 1, further comprising:

  A recording method performed by a device connected to a terminal device operated by a user,
  An audio acquisition step for acquiring the sound of the sound source as audio data;
  A sound source position acquisition step of acquiring position information of the sound source as sound source position data;
  An image acquisition step of acquiring image data obtained by imaging an area including a plurality of users;
  Outputting the image data to the terminal device;
  An input receiving step of receiving input from the terminal device as an input position specified on the output image data as a position where a user exists in the area represented by the image data;
  A user authentication step of performing user authentication by a user name and a password, that is, a user existing at a position in the area corresponding to the input position when receiving an input of the input position;
  An associating step of associating, if the user authentication is successful, the input position in the image data with the position information of the sound source in the sound source position data determined to be the position of the user, based on a predetermined correspondence between the direction of the sound source and the directional position in the area displayed by the image data;
  An identification information output step of adding identification information for identifying the user to the input position on the image data and outputting the identification information to the terminal device when the user authentication is successful;
  A change receiving step of receiving, from the terminal device, the changed position of the identification information when the position of the identification information added to the image data is changed in the terminal device,
  The associating step further associates the change position on the image data with the position information of the sound source of the sound source position data determined to be the position of the user's destination,
  The identification information output step adds the identification information to the change position on the image data and outputs the identification information to the terminal device;
  A recording method characterized by the above.

  A selection receiving step for receiving selection of the identification information added to the image data from the terminal device;
  A reproduction output step of outputting to the terminal device the audio data acquired from the position of the user indicated by the identification information that has received a selection, and the position information of the sound source associated by the association means;
  The recording method according to claim 3, further comprising:

A program that causes a computer to execute the method according to claim 3 or 4.