JP3617937B2

JP3617937B2 - Image monitoring method and image monitoring apparatus

Info

Publication number: JP3617937B2
Application number: JP18586299A
Authority: JP
Inventors: 真由美湯浅
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-06-30
Filing date: 1999-06-30
Publication date: 2005-02-09
Anticipated expiration: 2019-06-30
Also published as: JP2001016579A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、テレビ付きインターホンや、ビデオメールやテレビ電話などにも用いることのできる画像監視装置に関し、特に、人物を含む動画像から人物等の被写体を監視するための画像監視装置に関する。
【０００２】
【従来の技術】
マンションの入口や家庭の玄関に設置されるテレビ付きインターホンは来訪者の様子をモニターで確認することができるが、留守番電話のように留守中にも映像や音声を記録することが望まれている。このような機能を実現するためには、画像を動画で保存するためには、ＶＴＲ装置等を用いると装置が大型で扱いにくいものとなり、また、ハードディスク等の記憶媒体を用いる場合には記憶容量の問題から、小型化、低価格化が困難であった。
【０００３】
一方、これらの問題点を解決するために、静止画で保存する場合には、必ずしも人物の識別が可能な画像が保存されるとは限らず、後で見たときに識別不可能となる可能性が存在する。
【０００４】
また、例え識別が可能な画像が存在したとしても、知人以外の集金、配達人や不審者の識別は必ずしも容易ではない。また、知人であっても人物の名前を失念することは今後高齢化の到来とともに多くなってくると思われる。
【０００５】
また、最近ではビデオカメラ付きパソコンの低価格化により、家庭においてもパソコンを利用したビデオメールやテレビ電話等が容易に実現できるようになりつつある。このように動画像が日常的にパソコン上で使われるようになると、ハードディスク等の記憶装置には限度があり、一度保存したデータの読みだしや検索にも時間がかかるという問題が新たに生じる。
【０００６】
【発明が解決しようとする課題】
そこで、本発明は上記問題点に鑑みなされたもので、動画像、静止画像から抽出される人物等の被写体の認識が各ユーザの都合やニーズに合わせて容易にしかも確実に行える画像監視方法およびそれを用いた画像監視装置を提供することを目的とする。
【０００７】
【課題を解決するための手段】
本発明の画像監視方法は、入力された画像から抽出された被写体を認識するための、各カテゴリ毎の特徴量を登録した辞書を生成し、入力された画像から抽出された被写体の特徴量と前記辞書に登録された各カテゴリの特徴量とを比較して該被写体のカテゴリを認識し、認識結果を呈示するとともに、該被写体の特徴量を基に、前記辞書を更新することにより、動画像、静止画像から抽出される人物等の被写体の認識が各ユーザの都合やニーズに合わせて容易にしかも確実に行える。
【０００８】
本発明の画像監視装置は、入力された画像から抽出された被写体を認識するための、各カテゴリ毎の特徴量を登録した辞書を生成し、入力された画像から抽出された被写体の特徴量と前記辞書に登録された各カテゴリの特徴量とを比較して該被写体のカテゴリを認識し、認識結果を呈示するとともに、該被写体の特徴量を基に前記辞書を更新し、入力された音声メッセージを前記認識結果または前記辞書に関連付けて記憶手段に記憶することにより、動画像、静止画像から抽出される人物等の被写体の認識が各ユーザの都合やニーズに合わせて容易にしかも確実に行えるとともに、ユーザのニーズに合わせて、よりインテリジェントな対応が可能となる。
【０００９】
好ましくは、入力された時系列の複数の画像のうち、前記辞書に登録されたカテゴリのいずれかとその特徴量が最も類似する被写体が抽出された画像のみを記憶手段に記憶する。これにより、動画像、静止画像から人物等の被写体の認識に適したもののみを選別して記憶することができるので、記憶容量の低減化が図れる。
【００１０】
本発明の画像監視装置は、入力された画像から抽出された被写体を認識するための、各カテゴリ毎の特徴量を登録した辞書を生成する辞書生成手段と、
入力された画像から抽出された被写体の特徴量と前記辞書に登録された各カテゴリの特徴量とを比較して該被写体のカテゴリを認識する認識手段と、
この認識手段での認識結果を呈示する呈示手段と、
前記被写体の特徴量を基に、前記辞書を更新する更新手段と、
を具備することにより、動画像、静止画像から抽出される人物等の被写体の認識が各ユーザの都合やニーズに合わせて容易にしかも確実に行える。
【００１１】
好ましくは、入力された時系列の複数の画像のうち、前記辞書に登録されたカテゴリのいずれかとその特徴量が最も類似する被写体が抽出された画像のみを記憶する記憶手段をさらに具備する。これにより、動画像、静止画像から人物等の被写体の認識に適したもののみを選別して記憶することができるので、記憶容量の低減化とそれに伴う装置の小型化、低価格化が図れる。
【００１２】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して説明する。
【００１３】
（第１の実施形態）
第１の実施形態では、マンションや家庭の玄関などに設置されたテレビカメラ付きインターホンなどで留守中に来訪者の伝言や画像を記録したり、在宅中でも来訪者の名前や予め登録したカテゴリに属するかどうかなどを表示することでスムーズな対応を可能にする来訪者監視装置について説明する。
【００１４】
図１は、本実施形態にかかる来訪者監視装置の構成例を示したもので、画像入力部１と音声入力部２と人物検知部３と情報抽出部４と記憶部５とモニタ部６とユーザ情報入力部７とから構成される。
【００１５】
画像入力部１は、例えばビデオカメラ等の監視カメラにて取得された画像を入力するためのものである。音声入力部２は、例えばインターホンのマイク等から取得された音声を入力するためのものである。
【００１６】
人物検知部３は、例えば、呼鈴やブザーなどの来訪者が自発的に来訪を知らせるもの、もしくは、カメラ等から取得された画像から変化を検出して人物を検知するもの、ノックや呼びかけなどの物理的な動作を検出するもの等、特に限定しない。
【００１７】
情報抽出部４は、入力された画像もしくは音声から記録あるいは認識に必要な情報を抽出するものである。
【００１８】
記憶部５は、辞書情報や抽出された画像情報、録音された音声情報などを記憶する磁気テープ、磁気ディスク、光磁気ディスク、光ディスク、半導体メモリ等の記録媒体から構成されている。
【００１９】
モニタ部６は、ディスプレイ装置、スピーカなどから構成され、抽出された各種情報やカメラやマイクからの直接入力された信号を表示するためのものである。
【００２０】
ユーザ情報入力部７は、来訪者の氏名やカテゴリを入力あるいは修正したり、記録情報の検索や整理を行うものである。
【００２１】
図２は、情報抽出部４の構成例を示したもので、顔検出部８と辞書生成部９と顔認識部１０と記録用画像決定部１１とから構成される。
【００２２】
顔検出部８は、入力された画像から人物の顔領域を検出する。例えば、予め作成した顔領域のテンプレートを利用して画像中で該テンプレートとの相関値が最も高い領域を切り出す。
【００２３】
辞書生成部９は、切り出した複数枚の顔領域画像から顔認識用の辞書を作成するようになっている。例えば、顔認識を従来からある部分空間法を用いて行う場合には、従来と同様、顔領域画像から瞳や鼻孔特徴点などの特徴点を検出し、検出された特徴点を利用して顔領域の正規化を行う。正規化をされた画像情報を特徴空間内の特徴ベクトルに変換し、それらの特徴ベクトルから部分空間を生成し、それを顔認識用の辞書とする。そして、その部分空間をカテゴリに分類して記憶部５に顔認識用の辞書として登録する。
【００２４】
但し、この操作（辞書の新規登録）は、当該来訪者が予め登録されたいずれのカテゴリにも属さない場合であり、属する場合においては当該カテゴリの辞書更新を行うのが望ましい。この辞書更新の操作を行う場合、辞書として保存するものは部分空間のみならず該部分空間の固有値や相関行列といった統計情報も同時に保存しておくことが望ましい。そして、辞書更新の際には、当該来訪者の顔領域画像から抽出された特徴量を基に、当該部分空間の統計情報を更新すればよい。
【００２５】
顔認識部１０では、予め登録したカテゴリに属するかどうかを入力画像系列と各カテゴリの辞書との類似度を計算する（例えば、各部分空間における特徴ベクトルの内積を求める等、従来からある手法でもよい）ことにより判定する。登録されたどのカテゴリにも属さない場合には先に述べた辞書を作成しておく。
【００２６】
記録用画像決定部１１では、画像入力部１から動画像が入力する度に、その入力画像系列から最も当該来訪者の特徴を表す画像を選別して、それを記憶部５に記憶する。選別方法は具体的には、入力画像系列の複数の画像フレームから当該来訪者の辞書との類似度が最も高い画像を選ぶ。入力された動画をそのまま記憶部５に記憶しておくのでは、記憶容量が多く必要であるし、例えばその中で一枚だけ残しておく場合にどの画像を残すかをユーザが選ぶのも面倒である。また、機械的に選ぶと顔が正しく写ってなかったりする可能性が大きい。しかし、入力画像系列中の画像が辞書にあるカテゴリのいずれにも属さないもの（いずれかのカテゴリに属すると判定するには類似度が低すぎるもの）であっても、そのうちの最も類似度が高い１枚のみを選定して記憶部５に記憶しておけば、上記問題は解決できるであろう。
【００２７】
次に、留守中に来訪者がやってきた場合を想定して、図１の来訪者対応装置の動作例を説明する。
【００２８】
ユーザは、予め留守状態であることをセットしておく。このときユーザは在宅していても差し支えない。この操作は、通常の留守番電話と同様であってもよい。来訪者がユーザ宅の玄関先に設定されたカメラ前にきて、呼鈴を鳴らすか、あるいは、集合住宅の場合は部屋番号を入力すると、その動作をトリガーとしてカメラから入力された画像とマイクから入力された音声とをそれぞれ画像入力部１、音声入力部２にて取り込みを開始し、室内に設置されたモニタ部６に表示する。
【００２９】
一方、情報抽出部４では、まず、顔検出部８で入力された画像から人物の顔領域を検出し、辞書生成部９を経て抽出された特徴ベクトルと記憶部５に記憶されている顔認識用の辞書を参照して、顔認識部１０で画像入力部１から入力した画像から来訪者の顔を認識し、例えば、当該来訪者のカテゴリを判定する。そして、認識結果としてのカテゴリ名をモニタ部６にカメラから取り込まれた画像とともに表示する。
【００３０】
カテゴリ名は予め登録されており、個人名に限らない。例えば、新聞集金、宅配、郵便配達などは毎度同じ人物が来る確率が高く、その個人名には意味がないので、個人名でないカテゴリ名、例えば、この場合、「新聞集金」「宅配」「郵便配達」といったカテゴリ名でこれら認識結果を分類することも可能である。同様に、特に分類する必要のないカテゴリに対しては、「不審者」というカテゴリ名であってもよいし、。
【００３１】
また、あらかじめ登録したカテゴリに属さない場合には、前述のように、入力画像系列から辞書を生成する。属する場合には、前述のように、既に存在する辞書を更新する。ただし、分類されたカテゴリが誤っていた場合に、ユーザが後で修正ができるように、当該来訪者のみの辞書も作成し、当該カテゴリのもとの辞書も消さずに保存しておくことが望ましい。
【００３２】
記録用画像決定部１１では、画像入力部１からの入力画像系列の中で作成された辞書との類似度が最も高い画像一枚を選別する。カテゴリの認識された画像およびカテゴリの認識されなかった画像であっても、その一枚のみを保存する。辞書は動画から生成するが、画像は一枚のみ残すことにより、記憶容量を節約する。さらなる記憶容量の削減のためには、顔領域のみを切り出した画像のみを残してもよい。
【００３３】
また、音声入力部１から入力した来訪者の音声メッセージを顔画像、認識結果と共に記憶部５に記録する。
【００３４】
メッセージはキーワードスポットにより、キーワードを音声認識により認識し、その結果を残すことで、後に新たな辞書登録へのカテゴリ名づけが容易になる。それを簡単にするために例えば、「お名前をどうぞ」などといったプロンプトを音声もしくは文字情報として出すことも有効である。
【００３５】
ユーザは帰宅時等、好きな時に伝言の確認を行なう。その際認識されたカテゴリの修正や、新たなカテゴリ名の入力、いらないメッセージの削除を行なう。
【００３６】
ユーザが在宅しているが、来訪者を確認してから応対したい場合にも、モニタに画像が表示され、認識結果も表示されるため、応対したい来訪者の場合だけに応対することが可能となる。このとき、同じ会いたくない来訪者があった時にいちいちその顔を覚えなくてもカテゴリ名のみで対応できるので便利である。
【００３７】
（第２の実施形態）
第２の実施形態では、カメラ付きパソコンなどでビデオメール（例えば、一般に、動画像を所定の通信手段を用いて送信することであってもよい）やビデオコンファレンスを通じて相手から送られてきた画像を利用して、ビデオメールの保存容量を減らしたり、相手を顔認識によって確認したり、顔認識用辞書を作成したりするものである。
【００３８】
第１の実施形態で説明した来訪者監視装置は、主に、セキュリティに利用するものであったが、第２の実施形態では、例えば、パソコン等で気軽に画像内容をチェックするためにも用いることのできる画像監視装置について説明する。
【００３９】
図３は、第２の実施形態にかかる画像監視装置の構成例を示したもので、画像入力部１２、顔検出部１３、辞書生成部１４、顔認識部１５、表示部１６、記憶部１７、記録用画像決定部１８から構成される。図３に示した来訪者監視装置は、パーソナルコンピュータ（パソコン）上に構成されていてもよい。すなわち、例えば、パソコンの有するハードウエアを用いて、上記各部の機能をコンピュータに実行させることのできるプログラムとして、磁気ディスク（フロッピーディスク、ハードディスクなど）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤなど）、半導体メモリなどの記録媒体に格納して頒布することもできる。
【００４０】
画像入力部１２は、あらかじめパソコンのハードディスク等に記録された動画像を読み込んだり、ＬＡＮや電話などの通信回線を通して送られて来た動画像を読み込んだりする。
【００４１】
顔検出部１３、辞書生成部１４、顔認識部１５、記憶部１７、記録用画像決定部１８については、第１の実施形態で説明した顔検出部８、辞書生成部９、顔認識部１０、記憶部５、記録用画像決定部１１と同様である。
【００４２】
次にビデオメールの容量削減の場合を例にして、実際の動作例について説明する。
【００４３】
相手から送られて来たビデオメールは通常パソコンのハードディスクに保存される。画像入力部１２は、そのデータを読み込んで、動画像の部分のみを取り出し、さらに顔検出部１３で顔領域を検出する。検出された顔領域からあらかじめ登録された人物カテゴリに属するかどうか顔認識部１５において判定する。判定結果をディスプレイなどの表示装置１６に画像とともに表示する。
【００４４】
また、同時にもしくはユーザの指示で当該動画像から辞書生成部１４において辞書を作成する。その際、新しいカテゴリの場合はユーザがカテゴリ名を新たに入力することができるが、通常の場合はカテゴリ名をメールアドレスとすると便利である。すでに存在するカテゴリの場合には、新たに辞書を作成する代わりに、存在する辞書を更新する。
【００４５】
記録用画像決定部１８において、上記動画像から記録用の画像を決める。具体的には、第１の実施形態で説明したように、入力画像系列中で、辞書との類似度が最も高い画像を選択する。
【００４６】
ビデオメールは動画像情報であるため、そのまま記憶すると記憶容量が多く必要であるが、例えばその中で一枚だけ残しておく場合にどの画像を残すかをユーザが選ぶのも面倒であるし、機械的に選ぶと顔が正しく写ってなかったりする可能性が大きいが、このようにすることでその問題が解決される。
【００４７】
なお、上記第１および第２の実施形態では、監視対象が人物である場合を例にとり説明したが、この場合に限るものではなく、監視対象は何でもよく、その場合も上記説明と同様である。
【００４８】
さらに、本発明はこれらの例に限定されるものではなく、種々変形して応用可能である。
【００４９】
【発明の効果】
以上説明したように、本発明によれば、動画像、静止画像から抽出される人物等の被写体の認識が各ユーザの都合やニーズに合わせて容易にしかも確実に行える。
【図面の簡単な説明】
【図１】本発明の第１の実施形態にかかる来訪者監視装置の構成例を示した図。
【図２】情報抽出部の構成例を示した図。
【図３】本発明の第２の実施形態にかかる画像監視装置の構成例を示した図。
【符号の説明】
１…画像入力部
２…音声入力部
３…人物検知部
４…情報抽出部
５…記憶部
６…モニタ部
７…ユーザ情報入力部
８…顔検出部
９…辞書生成部
１０…顔認識部
１１…記録用画像決定部
１２…画像入力部
１３…顔検出部
１４…辞書生成部
１５…顔認識部
１６…表示部
１７…記憶部
１８…記録用画像決定部
[0001]
[Technical field of the invention]
The present invention relates to an image monitoring apparatus that can be used in, for example, an interphone with a TV, video mail, or a videophone, and more particularly to an image monitoring apparatus for monitoring a subject such as a person in a moving image that includes a person.
[0002]
[Prior art]
A TV interphone installed at the entrance of an apartment building or at the front door of a home lets the resident check visitors on a monitor, but it is also desirable to record video and audio while the resident is away, as an answering machine does. To realize such a function by saving images as moving images, using a VTR device or the like makes the apparatus large and difficult to handle, and using a storage medium such as a hard disk raises storage capacity problems, so that reducing size and cost has been difficult.
[0003]
On the other hand, if still images are saved in order to solve these problems, an image in which the person can be identified is not necessarily the one that is saved, and the person may turn out to be unidentifiable when the image is viewed later.
[0004]
Moreover, even if an identifiable image exists, it is not always easy to identify people other than acquaintances, such as bill collectors, delivery persons, or suspicious persons. In addition, even with acquaintances, forgetting a person's name is expected to become more common as the population ages.
[0005]
Recently, with the reduction in the price of personal computers with video cameras, video mail and videophones using personal computers can be easily realized at home. Thus, when moving images are used on a personal computer on a daily basis, there is a limit to storage devices such as hard disks, and a new problem arises that it takes time to read and search data once stored.
[0006]
[Problems to be solved by the invention]
Therefore, the present invention has been made in view of the above problems, and its object is to provide an image monitoring method, and an image monitoring apparatus using the method, with which a subject such as a person extracted from a moving image or a still image can be recognized easily and reliably in accordance with the convenience and needs of each user.
[0007]
[Means for Solving the Problems]
The image monitoring method of the present invention generates a dictionary in which feature amounts for each category are registered for recognizing a subject extracted from an input image, compares the feature amount of a subject extracted from an input image with the feature amount of each category registered in the dictionary to recognize the category of the subject, presents the recognition result, and updates the dictionary based on the feature amount of the subject. As a result, a subject such as a person extracted from a moving image or a still image can be recognized easily and reliably in accordance with the convenience and needs of each user.
[0008]
The image monitoring apparatus according to the present invention generates a dictionary in which feature amounts for each category are registered for recognizing a subject extracted from an input image, compares the feature amount of a subject extracted from an input image with the feature amount of each category registered in the dictionary to recognize the category of the subject, presents the recognition result, updates the dictionary based on the feature amount of the subject, and stores an input voice message in the storage means in association with the recognition result or the dictionary. As a result, a subject such as a person extracted from a moving image or a still image can be recognized easily and reliably in accordance with the convenience and needs of each user, and a more intelligent response suited to the user's needs becomes possible.
[0009]
Preferably, among the plurality of input time-series images, only the image from which a subject whose feature amount is most similar to any of the categories registered in the dictionary was extracted is stored in the storage means. As a result, only images suitable for recognizing a subject such as a person are selected from moving images and still images and stored, so that the required storage capacity can be reduced.
[0010]
The image monitoring apparatus of the present invention comprises:
dictionary generating means for generating a dictionary in which feature amounts for each category are registered for recognizing a subject extracted from an input image;
recognizing means for recognizing the category of the subject by comparing the feature amount of the subject extracted from the input image with the feature amount of each category registered in the dictionary;
presenting means for presenting the recognition result of the recognizing means; and
updating means for updating the dictionary based on the feature amount of the subject.
With these means, a subject such as a person extracted from a moving image or a still image can be recognized easily and reliably in accordance with the convenience and needs of each user.
[0011]
Preferably, the apparatus further comprises storage means for storing, among the plurality of input time-series images, only the image from which a subject whose feature amount is most similar to any of the categories registered in the dictionary was extracted. As a result, only images suitable for recognizing a subject such as a person are selected from moving images and still images and stored, so that the storage capacity can be reduced, and the size and cost of the apparatus can be reduced accordingly.
[0012]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
[0013]
(First embodiment)
The first embodiment describes a visitor monitoring apparatus that, using an interphone with a TV camera installed at the entrance of an apartment building or a home, records visitors' messages and images while the user is away and, even while the user is at home, enables a smooth response by displaying the visitor's name or whether the visitor belongs to a pre-registered category.
[0014]
FIG. 1 shows a configuration example of the visitor monitoring apparatus according to the present embodiment, which comprises an image input unit 1, a voice input unit 2, a person detection unit 3, an information extraction unit 4, a storage unit 5, a monitor unit 6, and a user information input unit 7.
[0015]
The image input unit 1 is for inputting an image acquired by a surveillance camera such as a video camera. The voice input unit 2 is for inputting voice acquired from, for example, an interphone microphone.
[0016]
The person detection unit 3 is not particularly limited. It may be, for example, a device with which the visitor voluntarily announces the visit, such as a doorbell or buzzer; a unit that detects a person by detecting changes in images acquired from a camera or the like; or a unit that detects a physical action such as a knock or a call.
[0017]
The information extraction unit 4 extracts information necessary for recording or recognition from the input image or sound.
[0018]
The storage unit 5 includes a recording medium such as a magnetic tape, a magnetic disk, a magneto-optical disk, an optical disk, and a semiconductor memory that stores dictionary information, extracted image information, recorded audio information, and the like.
[0019]
The monitor unit 6 is composed of a display device, a speaker, and the like, and is for displaying various extracted information and signals directly input from a camera and a microphone.
[0020]
The user information input unit 7 inputs or corrects the name and category of a visitor, and searches and organizes recorded information.
[0021]
FIG. 2 shows a configuration example of the information extraction unit 4, which includes a face detection unit 8, a dictionary generation unit 9, a face recognition unit 10, and a recording image determination unit 11.
[0022]
The face detection unit 8 detects a human face area from the input image. For example, using a face area template created in advance, an area having the highest correlation value with the template is cut out from the image.
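The template-based detection described in this paragraph can be sketched as follows. This is a minimal illustration only, assuming grayscale images held as nested lists and zero-mean normalized correlation as the "correlation value"; the patent does not specify the correlation measure, and the function names `normalized_correlation` and `detect_face` are hypothetical.

```python
from math import sqrt


def normalized_correlation(patch, template):
    """Zero-mean normalized correlation between two equal-sized 2D lists."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = sqrt(sum((a - mean_p) ** 2 for a in p)
               * sum((b - mean_t) ** 2 for b in t))
    # A flat patch or flat template has no defined correlation; treat it as 0.
    return num / den if den else 0.0


def detect_face(image, template):
    """Return (row, col, score) of the window with the highest correlation."""
    th, tw = len(template), len(template[0])
    best = (0, 0, -2.0)  # correlation is always >= -1, so this is a safe floor
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = normalized_correlation(patch, template)
            if score > best[2]:
                best = (r, c, score)
    return best
```

The window at the returned position would then be cut out as the face area. An exhaustive scan like this is quadratic in image size, so a practical system would likely restrict the search region or use a coarse-to-fine search.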
[0023]
The dictionary generation unit 9 creates a dictionary for face recognition from a plurality of cut-out face area images. For example, when face recognition is performed using the conventional subspace method, feature points such as the pupils and nostrils are detected from the face area image in the conventional manner, and the face area is normalized using the detected feature points. The normalized image information is converted into feature vectors in a feature space, and a subspace is generated from those feature vectors. The subspace is then classified into a category and registered in the storage unit 5 as a dictionary for face recognition.
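As a rough sketch of how a subspace might be generated from feature vectors, the example below takes the leading eigenvectors of the autocorrelation matrix of the vectors, which is one common form of the subspace method. The patent does not state how the subspace is computed, so this choice, the power-iteration solver, and the names `correlation_matrix`, `leading_eigenpairs`, and `build_dictionary_entry` are all assumptions made for illustration. The sketch also assumes that `k` does not exceed the rank of the correlation matrix.

```python
import random
from math import sqrt


def correlation_matrix(vectors):
    """Autocorrelation matrix (1/n) * sum(x x^T) of the feature vectors."""
    n, d = len(vectors), len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) / n for j in range(d)]
            for i in range(d)]


def leading_eigenpairs(matrix, k, iterations=500, seed=0):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric PSD matrix.

    Uses power iteration with deflation; assumes k <= rank(matrix).
    """
    rng = random.Random(seed)
    d = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v gives its eigenvalue.
        value = sum(v[i] * work[i][j] * v[j]
                    for i in range(d) for j in range(d))
        pairs.append((value, v))
        # Deflate so the next pass converges to the next eigenvector.
        for i in range(d):
            for j in range(d):
                work[i][j] -= value * v[i] * v[j]
    return pairs


def build_dictionary_entry(feature_vectors, k):
    """Build one category's dictionary entry: subspace basis plus statistics."""
    corr = correlation_matrix(feature_vectors)
    pairs = leading_eigenpairs(corr, k)
    return {
        "basis": [vec for _, vec in pairs],
        "eigenvalues": [val for val, _ in pairs],
        "correlation": corr,
        "count": len(feature_vectors),
    }
```

In practice a numerical library's symmetric eigensolver would replace the hand-written power iteration; it is used here only to keep the sketch free of dependencies.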
[0024]
However, this operation (new registration of a dictionary) applies when the visitor does not belong to any pre-registered category; when the visitor does belong to a category, it is desirable to update the dictionary of that category instead. To allow such updates, it is desirable to store as the dictionary not only the subspace but also statistical information such as the eigenvalues and correlation matrix of the subspace. When the dictionary is updated, the statistical information of the subspace is updated based on the feature amounts extracted from the visitor's face area images.
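One reason to keep the correlation matrix alongside the subspace is that it can be updated incrementally without the original images. The sketch below assumes the stored statistics are a sample count and an averaged autocorrelation matrix; this storage layout and the name `update_statistics` are illustrative assumptions, not details given in the patent.

```python
def update_statistics(entry, new_vectors):
    """Fold new feature vectors into a stored, averaged correlation matrix.

    `entry` is assumed to hold {"count": n, "correlation": C}, where C is
    (1/n) * sum(x x^T) over the n vectors seen so far.
    """
    n_old = entry["count"]
    total = n_old + len(new_vectors)
    d = len(entry["correlation"])
    updated = [
        [(entry["correlation"][i][j] * n_old
          + sum(v[i] * v[j] for v in new_vectors)) / total
         for j in range(d)]
        for i in range(d)
    ]
    return {"count": total, "correlation": updated}
```

The updated subspace would then be re-derived from the new matrix by eigen-decomposition. The result matches what a batch computation over all old and new vectors would give, which is why the old images need not be retained.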
[0025]
The face recognition unit 10 determines whether the visitor belongs to a pre-registered category by calculating the similarity between the input image series and the dictionary of each category (a conventional technique may be used, for example, obtaining the inner products of the feature vectors in each subspace). If the visitor does not belong to any registered category, the dictionary described above is created.
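The similarity calculation can be illustrated with the usual subspace-method score: the squared length of the projection of a feature vector onto a category's subspace, normalized by the vector's own squared length. The sketch assumes each dictionary entry is an orthonormal basis and uses an arbitrary threshold of 0.8 to decide that no category matches; the threshold value and the names `subspace_similarity` and `recognize` are assumptions for illustration.

```python
def subspace_similarity(vector, basis):
    """Squared cosine between a vector and the subspace of an orthonormal basis."""
    norm_sq = sum(x * x for x in vector)
    if norm_sq == 0.0:
        return 0.0
    projections = (sum(a * b for a, b in zip(vector, u)) for u in basis)
    return sum(p * p for p in projections) / norm_sq


def recognize(vector, dictionary, threshold=0.8):
    """Return (category, score); category is None if nothing is similar enough.

    `dictionary` maps a category name to an orthonormal basis (list of vectors).
    """
    best_name, best_score = None, 0.0
    for name, basis in dictionary.items():
        score = subspace_similarity(vector, basis)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

For an image series rather than a single frame, the per-frame scores could be averaged or the maximum taken before thresholding; the patent leaves this open.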
[0026]
Each time a moving image is input from the image input unit 1, the recording image determination unit 11 selects from the input image series the image that best represents the characteristics of the visitor and stores it in the storage unit 5. Specifically, it selects, from the plurality of image frames in the input image series, the image with the highest similarity to the visitor's dictionary. Storing the input moving image as is in the storage unit 5 requires a large storage capacity, and if, for example, only one image is to be kept, having the user choose which one is troublesome. If the image is instead chosen mechanically, there is a high possibility that the face is not captured properly. However, even when the images in the input image series do not belong to any category in the dictionary (that is, the similarity is too low to judge that they belong to any category), these problems can be solved by selecting only the one image with the highest similarity and storing it in the storage unit 5.
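The selection rule in this paragraph amounts to an argmax over per-frame similarity scores. A minimal sketch, assuming the caller supplies a `similarity` function that scores one frame against the visitor's dictionary (the function name `select_frame_to_store` and this interface are hypothetical):

```python
def select_frame_to_store(frames, similarity):
    """Return (index, frame) of the frame with the highest similarity score.

    `similarity` is any callable mapping a frame to a numeric score, for
    example the similarity between the frame's feature vector and the
    visitor's dictionary. The first frame wins on ties.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    best_index = max(range(len(frames)), key=lambda i: similarity(frames[i]))
    return best_index, frames[best_index]
```

Note that no threshold is applied here: as the paragraph explains, the best frame is kept even when no category was recognized, so that one representative image is always stored.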
[0027]
Next, an operation example of the visitor monitoring apparatus of FIG. 1 is described, assuming that a visitor arrives while the user is away.
[0028]
The user sets the absent state in advance. The user may in fact be at home at this time. This operation may be the same as for an ordinary answering machine. When a visitor comes in front of the camera installed at the entrance of the user's home and rings the doorbell or, in the case of an apartment building, enters the room number, that action triggers the image input unit 1 and the voice input unit 2 to start capturing the image input from the camera and the voice input from the microphone, respectively, and these are displayed on the monitor unit 6 installed indoors.
[0029]
Meanwhile, in the information extraction unit 4, the face detection unit 8 first detects the person's face area from the input image. Then, referring to the feature vectors extracted via the dictionary generation unit 9 and the face recognition dictionary stored in the storage unit 5, the face recognition unit 10 recognizes the visitor's face from the image input from the image input unit 1 and determines, for example, the visitor's category. The category name obtained as the recognition result is displayed on the monitor unit 6 together with the image captured from the camera.
[0030]
Category names are registered in advance and are not limited to personal names. For example, for newspaper bill collection, home delivery, and postal delivery, the same person is likely to come each time and the person's own name carries no meaning, so the recognition results can be classified under category names that are not personal names, in this case “newspaper collection”, “home delivery”, and “postal delivery”. Similarly, for a category that does not particularly need to be classified, the category name “suspicious person” may be used.
[0031]
If the visitor does not belong to any pre-registered category, a dictionary is generated from the input image series as described above. If the visitor does belong to a category, the existing dictionary is updated as described above. However, so that the user can make a correction later if the assigned category turns out to be wrong, it is desirable to also create a dictionary for that visitor alone and to keep the original dictionary of the category without deleting it.
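The register-or-update flow, including the retained copies that make a later correction possible, might be organized as in the sketch below. To keep it short, each dictionary entry is simplified to a plain list of feature vectors rather than a subspace, and the placeholder naming scheme, the `history` list, and the function names are hypothetical.

```python
import copy


def register_or_update(dictionary, category, visit_features, history):
    """Register a new category or update an existing one.

    `category` is the recognition result, or None if no category matched.
    Before an update, the original entry and a visitor-only entry are
    appended to `history` so the user can correct a wrong assignment later.
    """
    if category is None:
        # Placeholder name until the user assigns a proper category name.
        name = f"unnamed-{len(dictionary) + 1}"
        dictionary[name] = [list(v) for v in visit_features]
        return name
    history.append({
        "category": category,
        "original": copy.deepcopy(dictionary[category]),
        "visitor_only": [list(v) for v in visit_features],
    })
    dictionary[category] = dictionary[category] + [list(v) for v in visit_features]
    return category


def undo_last_update(dictionary, history, corrected_category=None):
    """Restore the original entry; optionally file the visit under another category."""
    record = history.pop()
    dictionary[record["category"]] = record["original"]
    if corrected_category is not None:
        dictionary.setdefault(corrected_category, []).extend(record["visitor_only"])
    return record
```

Keeping both copies costs extra storage per visit, so a real device would presumably discard them once the user has confirmed or corrected the category.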
[0032]
The recording image determination unit 11 selects, from the input image series supplied by the image input unit 1, the one image with the highest similarity to the created dictionary. Whether or not the category of the image was recognized, only that one image is saved. The dictionary is generated from the moving image, but keeping only one image saves storage capacity. To reduce storage further, only the cut-out face area may be kept.
[0033]
The visitor's voice message input from the voice input unit 2 is recorded in the storage unit 5 together with the face image and the recognition result.
[0034]
Keywords in the message are recognized by speech recognition using keyword spotting, and keeping the result makes it easier to name the category later when a new dictionary is registered. To make this simpler, it is also effective to issue a prompt such as “Please state your name” as voice or text information.
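As a simple illustration of how spotted keywords could feed category naming, the sketch below assumes a speech recognizer has already produced a text transcript of the message and looks up words in a small keyword table. The table contents, the English keywords, and the name `suggest_category_names` are hypothetical; real keyword spotting operates on the audio signal itself.

```python
import re

# Hypothetical keyword table mapping spotted words to candidate category names.
KEYWORD_TO_CATEGORY = {
    "newspaper": "newspaper collection",
    "parcel": "home delivery",
    "delivery": "home delivery",
    "mail": "postal delivery",
}


def suggest_category_names(transcript, keyword_map=KEYWORD_TO_CATEGORY):
    """Return candidate category names, in order of first appearance."""
    suggestions = []
    for word in re.findall(r"[a-z]+", transcript.lower()):
        category = keyword_map.get(word)
        if category is not None and category not in suggestions:
            suggestions.append(category)
    return suggestions
```

The suggestions would be stored with the message so that the user only has to confirm or edit a proposed name when reviewing it later.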
[0035]
The user checks the messages at any convenient time, such as on returning home. At that time, the user corrects recognized categories, enters new category names, and deletes unneeded messages.
[0036]
When the user is at home but wants to check who the visitor is before answering, the image and the recognition result are displayed on the monitor, so the user can answer only the visitors he or she wishes to see. This is convenient because, when the same unwelcome visitor comes again, the user can decide by the category name alone without having to remember the face.
[0037]
(Second Embodiment)
The second embodiment uses images sent from another party through video mail (which may, for example, generally be the transmission of moving images using predetermined communication means) or a video conference on a personal computer with a camera or the like, in order to reduce the storage capacity required for video mail, confirm the other party by face recognition, and create face recognition dictionaries.
[0038]
The visitor monitoring apparatus described in the first embodiment is used mainly for security, whereas the second embodiment describes an image monitoring apparatus that can also be used, for example, to casually check image contents on a personal computer or the like.
[0039]
FIG. 3 shows a configuration example of the image monitoring apparatus according to the second embodiment, which comprises an image input unit 12, a face detection unit 13, a dictionary generation unit 14, a face recognition unit 15, a display unit 16, a storage unit 17, and a recording image determination unit 18. The image monitoring apparatus shown in FIG. 3 may be configured on a personal computer (PC). That is, for example, the functions of the above units can be stored, as a program that a computer executes using the hardware of the PC, on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory, and distributed.
[0040]
The image input unit 12 reads a moving image recorded in advance on a hard disk of a personal computer, or reads a moving image sent via a communication line such as a LAN or a telephone.
[0041]
The face detection unit 13, dictionary generation unit 14, face recognition unit 15, storage unit 17, and recording image determination unit 18 are the same as the face detection unit 8, dictionary generation unit 9, face recognition unit 10, storage unit 5, and recording image determination unit 11 described in the first embodiment.
[0042]
Next, an actual operation example will be described by taking the case of reducing the capacity of video mail as an example.
[0043]
Video mail sent from the other party is usually stored on the hard disk of a personal computer. The image input unit 12 reads the data, extracts only the moving image portion, and the face detection unit 13 detects the face area. The face recognition unit 15 determines whether or not the detected face area belongs to a person category registered in advance. The determination result is displayed on the display device 16 such as a display together with the image.
[0044]
At the same time, or on the user's instruction, the dictionary generation unit 14 creates a dictionary from the moving image. For a new category, the user can enter a new category name, but in the normal case it is convenient to use the e-mail address as the category name. If the category already exists, the existing dictionary is updated instead of creating a new dictionary.
[0045]
The recording image determination unit 18 determines a recording image from the moving image. Specifically, as described in the first embodiment, an image having the highest similarity with the dictionary in the input image series is selected.
[0046]
Since video mail is moving image information, storing it as is requires a large storage capacity. If, for example, only one image is kept, having the user choose which one is troublesome, and choosing mechanically carries a high risk that the face is not captured properly. Selecting the image in the manner described above solves these problems.
[0047]
In the first and second embodiments, the case where the monitoring target is a person has been described as an example. However, the present invention is not limited to this case; any monitoring target may be used, and the above description applies equally in that case.
[0048]
Furthermore, the present invention is not limited to these examples, and can be applied with various modifications.
[0049]
[Effects of the invention]
As described above, according to the present invention, it is possible to easily and reliably recognize a subject such as a person extracted from a moving image or a still image according to the convenience and needs of each user.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a visitor monitoring apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating a configuration example of an information extraction unit.
FIG. 3 is a diagram showing a configuration example of an image monitoring apparatus according to a second embodiment of the present invention.
[Explanation of symbols]
1 ... Image input unit
2 ... Voice input unit
3 ... Person detection unit
4 ... Information extraction unit
5 ... Storage unit
6 ... Monitor unit
7 ... User information input unit
8 ... Face detection unit
9 ... Dictionary generation unit
10 ... Face recognition unit
11 ... Recording image determination unit
12 ... Image input unit
13 ... Face detection unit
14 ... Dictionary generation unit
15 ... Face recognition unit
16 ... Display unit
17 ... Storage unit
18 ... Recording image determination unit

Claims

Image input means for inputting a plurality of time-series images;
First storage means for storing a dictionary in which feature quantities for each category are registered for recognizing a subject extracted from an input image;
Means for obtaining a feature amount of a subject extracted from a plurality of input time-series images;
Recognizing means for comparing the feature amount of the subject obtained from the plurality of images with the feature amount of each category registered in the dictionary, and recognizing the category to which the subject belongs;
Means for selecting, from the plurality of images, when the category to which the subject belongs is recognized by the recognizing means, the image from which the subject most similar to the feature amount of that category was extracted, and for selecting, from the plurality of images, when the recognizing means cannot recognize the category to which the subject belongs, the image from which the subject most similar to the feature amount of the subject obtained from the plurality of images was extracted;
Second storage means for storing the selected image;
Presenting means for presenting the selected image together with its category name when the category to which the subject belongs is recognized by the recognizing means;
Means for registering a feature amount obtained from the input time-series images together with a new category name in the dictionary when the recognition means cannot recognize the category to which the subject belongs;
Means for updating the feature quantity of the dictionary corresponding to the category name with the feature quantity obtained from the input time-series images when the recognition means recognizes the category to which the subject belongs;
An image monitoring apparatus comprising:

The image monitoring apparatus according to claim 1, further comprising voice input means for inputting a voice message of the subject of the plurality of input time-series images, wherein the voice message is stored in the second storage means together with the selected image.

In an image monitoring method in a computer provided with first storage means for storing a dictionary in which feature amounts for each category are registered for recognizing a subject extracted from a plurality of input time-series images, and second storage means for storing one of the plurality of images,
A first step of obtaining a feature amount of a subject extracted from a plurality of input time-series images;
A second step of comparing the feature quantity of the subject obtained from the plurality of images with the feature quantity of each category registered in the dictionary to recognize the category to which the subject belongs;
A third step of selecting, from the plurality of images, when the category to which the subject belongs is recognized in the second step, the image from which the subject most similar to the feature amount of that category was extracted, and of selecting, from the plurality of images, when the category to which the subject belongs cannot be recognized in the second step, the image from which the subject most similar to the feature amount of the subject obtained from the plurality of images was extracted;
A fourth step of storing the selected image in the second storage means;
A fifth step of presenting the selected image together with the category name when the category to which the subject belongs is recognized in the second step;
A sixth step of registering a feature amount obtained from the input time-series images together with a new category name in the dictionary when the category to which the subject belongs cannot be recognized in the second step;
A seventh step of updating the feature amount of the dictionary corresponding to the category name with the feature amount obtained from the plurality of input time-series images when the category to which the subject belongs is recognized in the second step;
An image monitoring method including:

For a computer provided with first storage means for storing a dictionary in which feature amounts for each category are registered for recognizing a subject extracted from a plurality of input time-series images, and second storage means for storing one of the plurality of images,
A first step of obtaining a feature amount of a subject extracted from a plurality of input time-series images;
A second step of comparing the feature quantity of the subject obtained from the plurality of images with the feature quantity of each category registered in the dictionary to recognize the category to which the subject belongs;
A third step of selecting, from the plurality of images, when the category to which the subject belongs is recognized in the second step, the image from which the subject most similar to the feature amount of that category was extracted, and of selecting, from the plurality of images, when the category to which the subject belongs cannot be recognized in the second step, the image from which the subject most similar to the feature amount of the subject obtained from the plurality of images was extracted;
A fourth step of storing the selected image in the second storage means;
A fifth step of presenting the selected image together with the category name when the category to which the subject belongs is recognized in the second step;
A sixth step of registering a feature amount obtained from the input time-series images together with a new category name in the dictionary when the category to which the subject belongs cannot be recognized in the second step;
A seventh step of updating the feature amount of the dictionary corresponding to the category name with the feature amount obtained from the plurality of input time-series images when the category to which the subject belongs is recognized in the second step;
A machine-readable recording medium having recorded thereon a program for executing processing including: