JP4993678B2 - Interactive moving image monitoring method, interactive moving image monitoring apparatus, and interactive moving image monitoring program

Info

Publication number: JP4993678B2
Application number: JP2006273352A
Authority: JP
Inventors: 富士雄堤
Original assignee: Central Research Institute of Electric Power Industry
Current assignee: Central Research Institute of Electric Power Industry
Priority date: 2006-10-04
Filing date: 2006-10-04
Publication date: 2012-08-08
Anticipated expiration: 2026-10-04
Also published as: JP2008092471A

Description

The present invention relates to an interactive moving image monitoring method, an interactive moving image monitoring apparatus, and an interactive moving image monitoring program. More specifically, it relates to an interactive moving image monitoring method, apparatus, and program suited to long-duration moving image monitoring systems.

Image monitoring technology that uses IT equipment such as digital video cameras and computers, which have advanced rapidly in recent years, is being put into practical use. Its core is image recognition technology, in which a computer recognizes captured images to determine, for example, whether an intruder is present or to detect equipment failures.

For example, electric power companies are putting image monitoring technology into practical use to detect damaged parts of power-related equipment in order to reduce maintenance and management costs. Image monitoring technology for detecting intruders into power-related facilities is also being put into practical use.

Supervised machine learning techniques such as the SVM (support vector machine) and boosting are conventional techniques for achieving high-accuracy image recognition and are already used in various fields. In supervised machine learning, a computer learns from cases that a human has taught it should be recognized (positive examples) and cases that should not be recognized (negative examples), and then performs appropriate recognition even on cases it was not given. For example, Patent Document 1 describes a technique that uses an SVM to determine human faces.

Furthermore, when detecting failures in power-related equipment and the like, failures rarely occur, and selecting the portions of the monitoring video in which an abnormality appears is not easy. To address such problems in case teaching, interactive machine learning (hereinafter, IML) has been proposed, which uses a direct-manipulation user interface to draw on human cognitive abilities. In IML, for example, the analysis results for the cases (hereinafter also called data) are visualized at a glance using colors and symbols, so that the user can easily grasp how good or bad the case teaching (hereinafter also called labeling) was and where labels need to be corrected.

For example, Non-Patent Document 1 proposes a system that, through an interface similar to drawing software, trains a machine by interactive processing rather than batch processing and thereby achieves image recognition of still images. Non-Patent Document 2 proposes a system that simultaneously analyzes information from multiple sensors, such as sound, images, and RFID, to determine whether a person is present.

Patent Document 1: JP 2006-4003 A
Non-Patent Document 1: Jerry Alan Fails and Dan R. Olsen Jr. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pp. 39-45, 2003.
Non-Patent Document 2: Anind K. Dey, Raffay Hamid, Chris Beckmann, Ian Li, and Daniel Hsu. a CAPpella: programming by demonstration of context-aware applications. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 33-40, 2004.

The case teaching required for image recognition consists of two tasks: selecting the data to be taught from a huge amount of data (case selection), and attaching information on whether or not each item is data to be recognized, that is, a positive or negative label (labeling). Both case selection and labeling require human labor, and reducing that labor is desired.

However, although supervised machine learning as described in Patent Document 1 achieves high recognition accuracy, training the learning system until accurate recognition becomes possible requires an enormous amount of human case teaching and takes a great deal of time.

For example, outdoor monitoring video requires recognizing various phenomena (rust, scratches, peeling, electrical discharge, intruders, animals and plants, and so on) whose apparent color and shape change under diverse lighting conditions (morning, daytime, evening, rain, cloud, snow, artificial lighting, and so on). Training the computer sufficiently therefore requires teaching a large number of cases, which takes so much effort that it is not practical. Constructing a practical image monitoring system thus requires reducing the effort of case teaching.

In addition, recognition processing in supervised machine learning is performed as batch processing, which takes time. Because it is batch processing, the user cannot run it interactively, for example by checking the learning status or the recognition accuracy as needed partway through the processing. If the recognition result is unsatisfactory, the user must teach cases again after the batch processing ends and then run the batch processing once more, so obtaining a satisfactory recognition result takes time.

The technique described in Non-Patent Document 1 addresses this problem by immediately visualizing and displaying the recognition result so that the user can see whether it is correct. This approach is applicable to still images, but when tracking an object in a moving image, the amount of information handled is tens of thousands of times that of a still image, so processing takes time and immediate feedback cannot be returned. That is, when the technique is applied to moving images, the teaching result is not reflected immediately even though the processing is nominally interactive, and the same problem as with batch processing arises.

Because processing time grows with the amount of information, it was believed that moving image monitoring, which handles an enormous amount of information, could only be processed in batch and that an interactive machine learning system would be difficult to realize.

The technique described in Non-Patent Document 2 does target continuous images, but it monitors persons by combining sound information and wireless ID tag information with image information; it does not enable monitoring of moving images using image information alone. Moreover, when something other than a human is the monitoring target, or when intruders are to be monitored, sound information, wireless IC tag information, and the like may be unavailable, in which case the technique of Non-Patent Document 2 cannot perform the monitoring.

Furthermore, with the supervised machine learning described in Patent Document 1 and the IML-based techniques of Non-Patent Documents 1 and 2, a user who is adding case teachings has no way of knowing how each added teaching affects the identification accuracy for the data as a whole. In other words, the user cannot tell how effective the case teaching currently being performed is.

Accordingly, for moving image monitoring, an interactive processing system is desired in which the user can immediately grasp the case teaching status while teaching cases and select the most suitable way to teach, and which is faster than conventional batch processing while achieving good recognition accuracy.

An object of the present invention is therefore to provide an interactive moving image monitoring method, an interactive moving image monitoring apparatus, and an interactive moving image monitoring program that monitor and track an object in moving image data on the basis of a minimal number of cases taught by the user and that visualize the monitoring and tracking results in an easily understood form.

To achieve this object, the interactive moving image monitoring method according to claim 1 is a method for monitoring and tracking a monitoring target in a moving image, and performs the following processes: a label registration process of providing one or more image clips within a processing target region, which is the part of a frame image of the moving image that is subject to image processing, and registering labeling data indicating whether the monitoring target appears in each image clip; a tagging process of obtaining an image feature amount for each image clip from the color information values of the pixels in that clip and, when the image feature amount of an image clip with no registered labeling data lies within a predetermined range of the image feature amount of an image clip with registered labeling data, in the high-dimensional space whose axes are the elements of the image feature amount, tagging the unlabeled image clip in accordance with that labeling data; and a visualization process of displaying the result of the tagging process together with the frame image of the moving image on the basis of tag importance.

The interactive moving image monitoring apparatus according to claim 6 is an apparatus for monitoring and tracking a monitoring target in a moving image, and comprises: label registration means that reads a frame image of the moving image, sets one or more image clips within a processing target region subject to image processing, and registers in a database labeling data, specified in advance, indicating whether the monitoring target appears in each image clip; tagging means that obtains an image feature amount for each image clip from the color information values of the pixels in that clip and, when the image feature amount of an image clip with no labeling data registered in the database lies within a predetermined range of the image feature amount of an image clip with registered labeling data, in the high-dimensional space whose axes are the elements of the image feature amount, stores a tag in accordance with that labeling data in association with the unlabeled image clip; and visualization means that displays the result of the tagging process on an output device together with the frame image of the moving image on the basis of tag importance.

The interactive moving image monitoring program according to claim 11 monitors and tracks a monitoring target in a moving image by causing a computer to execute: a target region setting process of setting one or more image clips within a processing target region, which is the part of a frame image of the moving image subject to image processing, and storing them in a main storage device; and an image recognition and analysis process of reading labeling data, registered in advance in a storage device, indicating whether the monitoring target appears in each image clip, obtaining an image feature amount for each image clip from the color information values of the pixels in that clip, tagging an image clip with no registered labeling data in accordance with the labeling data of a labeled image clip when the feature amount of the former lies within a predetermined range of the feature amount of the latter in the high-dimensional space whose axes are the elements of the image feature amount, and displaying the tagging result on an output device together with the frame image of the moving image on the basis of tag importance.

Accordingly, for each frame image of the moving image in which the monitoring target was captured, the part to be subjected to image processing is first set as the processing target region, and that region is divided into at least one image clip. The label selected by the user for each image clip is then registered in the database as labeling data. Specifically, this is either the label attached to an image clip in which the monitoring target appears (target label) or the label attached to an image clip in which it does not appear (non-target label). Next, for each image clip of each frame, an image feature amount is calculated from color information values, such as RGB or HSV, of the pixels in the clip. In the high-dimensional space whose axes are the elements of the image feature amount, an image clip that has no registered labeling data but whose image feature amount lies within a preset range of an image clip already registered in the database is given a target tag if the reference label is a target label and a non-target tag if it is a non-target label. Finally, the tag information of all labeled and tagged frames is displayed, on the basis of tag importance, on the same screen as the frame image being displayed. In this specification, a tag is a label that the computer estimates from the labels that have been attached. An image feature amount is information, such as color or pattern, that characterizes an image and that the computer calculates from the image data concerned.
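As a rough illustration of how an image feature amount might be derived from pixel color values, the following Python sketch builds a normalized per-channel color histogram for one image clip. The choice of a histogram, the function name `clip_feature`, and the `bins` parameter are assumptions made for this example; the text above only states that the feature is computed from color information values such as RGB or HSV.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def clip_feature(pixels: List[Pixel], bins: int = 4) -> List[float]:
    """Return a normalized per-channel color histogram for one image clip.

    Assumption: a histogram is only one possible feature; the source does
    not specify the exact feature computation.
    """
    if not pixels:
        raise ValueError("clip contains no pixels")
    width = 256 // bins  # value range covered by each histogram bin
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            index = min(value // width, bins - 1)
            hist[channel * bins + index] += 1.0
    count = float(len(pixels))
    return [h / count for h in hist]
```

Each element of the returned list corresponds to one axis of the high-dimensional feature space, so two clips with similar colors yield nearby points in that space.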

The invention according to claim 2 is the interactive moving image monitoring method according to claim 1, in which a locality-sensitive hashing algorithm is used when registering and searching labeling data. The invention according to claim 7 is the interactive moving image monitoring apparatus according to claim 2, in which a locality-sensitive hashing algorithm is used when registering and searching labeling data.

Accordingly, instead of a tree-based search algorithm, the invention uses locality-sensitive hashing (LSH), an approximate nearest neighbor (ANN) search algorithm based on hash functions.
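The idea of registering and retrieving labeled feature vectors through hashing can be sketched as follows. This is a minimal single-table example that uses quantized random projections as the hash family; the class name `LSHIndex`, its parameters, and the choice of hash family are assumptions for illustration, since the text here does not specify which LSH variant is used.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class LSHIndex:
    """Single-table LSH using quantized random projections (illustrative)."""

    def __init__(self, dim: int, num_hashes: int = 4,
                 bucket_width: float = 1.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.width = bucket_width
        # One random direction and one random offset per hash function.
        self.projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                            for _ in range(num_hashes)]
        self.offsets = [rng.uniform(0.0, bucket_width)
                        for _ in range(num_hashes)]
        self.buckets: Dict[Tuple[int, ...],
                           List[Tuple[List[float], str]]] = defaultdict(list)

    def _key(self, vector: Sequence[float]) -> Tuple[int, ...]:
        # Nearby vectors tend to fall into the same quantized cell.
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, vector)) + b)
                       / self.width)
            for proj, b in zip(self.projections, self.offsets)
        )

    def add(self, vector: Sequence[float], label: str) -> None:
        """Register a labeled feature vector."""
        self.buckets[self._key(vector)].append((list(vector), label))

    def query(self, vector: Sequence[float]) -> List[Tuple[List[float], str]]:
        """Return labeled vectors that share the query's bucket."""
        return list(self.buckets.get(self._key(vector), []))
```

Registration and lookup both cost one hash computation regardless of how many vectors are stored, which is the property that distinguishes this approach from a tree whose structure grows with the data. A practical index would normally use several tables to reduce missed neighbors.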

The invention according to claim 3 is the interactive moving image monitoring method according to claim 1 or 2, in which the tagging process works in the high-dimensional space whose axes are the elements of the image feature amount and assigns: a target tag to an image clip that lies within a predetermined range of an image clip labeled as showing the monitoring target; a non-target tag to an image clip that lies within a predetermined range of an image clip labeled as not showing the monitoring target; and an unknown tag to an image clip that lies within neither range.

The invention according to claim 8 is the interactive moving image monitoring apparatus according to claim 6 or 7, in which the tagging means works in the high-dimensional space whose axes are the elements of the image feature amount and assigns: a target tag to an image clip that lies within a predetermined range of an image clip labeled as showing the monitoring target; a non-target tag to an image clip that lies within a predetermined range of an image clip labeled as not showing the monitoring target; and an unknown tag to an image clip that lies within neither range.

Accordingly, data (the image feature amount of an image clip) that cannot be judged from the labeling data is given an unknown tag. Such data does not lie within the fixed distance of any label registered as a positive or negative example, in the high-dimensional space whose axes are the elements of the image feature amount.
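The three-way tagging rule can be sketched as a distance test against the labeled examples. In this Python example the Euclidean metric, the tag strings, the function name `tag_clip`, and the rule that the nearest labeled example wins when both kinds lie within range are all assumptions; the text only states that clips within a predetermined range inherit the label and the rest are tagged unknown.

```python
import math
from typing import List, Sequence, Tuple

TARGET, NON_TARGET, UNKNOWN = "target", "non_target", "unknown"


def tag_clip(feature: Sequence[float],
             labeled: List[Tuple[Sequence[float], str]],
             radius: float) -> str:
    """Tag an unlabeled clip from labeled examples within `radius`.

    Assumption: Euclidean distance, and the nearest labeled example decides
    when several lie within the radius.
    """
    best_tag, best_dist = UNKNOWN, radius
    for example, label in labeled:
        dist = math.dist(feature, example)
        if dist <= best_dist:
            best_tag, best_dist = label, dist
    return best_tag
```

In an interactive loop, the clips returned as `UNKNOWN` are the ones the user would be prompted to label next.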

The invention according to claim 4 is the interactive moving image monitoring method according to any one of claims 1 to 3, in which the visualization process compresses the results of the tagging process for all frame images of the moving image and displays them on a single screen. The invention according to claim 9 is the interactive moving image monitoring apparatus according to any one of claims 6 to 8, in which the visualization means compresses the tagging results for all frame images of the moving image and displays them on a single screen.

Accordingly, the tagging results for all frames of the moving image concerned are compressed to fit within the number of display pixels of the output device and are displayed on a single screen.

The invention according to claim 5 is the interactive moving image monitoring method according to claim 4, in which the unknown tag has the highest tag importance and the non-target tag has the lowest. The invention according to claim 10 is the interactive moving image monitoring apparatus according to claim 9, in which the unknown tag has the highest tag importance and the non-target tag has the lowest.

Accordingly, when the tagging results are compressed for display, unknown tags are displayed with the highest priority, followed by target tags and then non-target tags.
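The priority-based compression can be illustrated with a short Python sketch. It simplifies the display to one tag per frame along the horizontal axis and merges consecutive frames into one display column, keeping the most important tag in each group. The function name `compress_tags`, the tag strings, and the equal-width grouping are assumptions for this example.

```python
from typing import List

# Importance order stated in the text: unknown > target > non-target.
IMPORTANCE = {"unknown": 2, "target": 1, "non_target": 0}


def compress_tags(frame_tags: List[str], display_width: int) -> List[str]:
    """Reduce per-frame tags to at most `display_width` columns.

    Each column shows the most important tag among the frames it covers.
    """
    if display_width <= 0:
        raise ValueError("display_width must be positive")
    total = len(frame_tags)
    if total <= display_width:
        return list(frame_tags)  # no compression needed
    columns = []
    for col in range(display_width):
        start = col * total // display_width
        end = (col + 1) * total // display_width
        group = frame_tags[start:end]
        columns.append(max(group, key=lambda tag: IMPORTANCE[tag]))
    return columns
```

Because an unknown tag outranks the others, a single unresolved frame remains visible even when hundreds of frames share one display column.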

As described above, the interactive moving image monitoring method according to claim 1, the interactive moving image monitoring apparatus according to claim 6, and the interactive moving image monitoring program according to claim 11 make it possible to analyze a monitoring target in a moving image through interactive processing, confirming the effect of case teaching while improving monitoring and tracking accuracy.

In addition, by viewing the tagging results visualized alongside playback of the moving image, the user can grasp the temporal and spatial changes of the tags at a glance. That is, the user can immediately see the monitoring and tracking accuracy under the current teaching (labeling) status and can judge what labeling to perform to improve that accuracy further.

Placing tagging under user control in this way keeps the amount of information processing to the necessary minimum and makes it possible to realize an interactive monitoring system for moving images, which was previously impossible.

Because the amount of information processing is reduced, the user's labeling is fed back immediately, so the user can check each time how a newly taught label has affected the monitoring status of the whole image. The user can thereby judge what labeling to add next and how to label so as to improve monitoring and tracking accuracy with few labels, that is, in little time.

The interactive moving image monitoring method according to claim 2 and the interactive moving image monitoring apparatus according to claim 7 achieve faster data registration and data search than tree-based algorithms, because adding data does not make a tree structure more complex.

With the interactive moving image monitoring method according to claim 3 and the interactive moving image monitoring apparatus according to claim 8, an unknown tag is given to data for which a target or non-target tag cannot be assigned from the labeling data, that is, data that cannot be judged to be either the monitoring target or not. Visualizing these unknown tags shows the user where to focus case teaching and enables accurate monitoring and tracking of the target with a minimal number of teachings.

With the interactive moving image monitoring method according to claim 4 and the interactive moving image monitoring apparatus according to claim 9, the tagging results are compressed and visualized on a single screen even when the number of frames of the moving image exceeds the number of pixels the output device can display (for example, along the horizontal axis), so the user can continue case teaching while confirming the effect of labeling.

With the interactive moving image monitoring method according to claim 5 and the interactive moving image monitoring apparatus according to claim 10, the user can concentrate labeling on the unknown tags, for which a tag has not yet been estimated, even when the tagging results are displayed in compressed form.

The configuration of the present invention is described in detail below on the basis of the embodiment shown in FIGS. 1 to 7.

FIG. 1 shows an example of the configuration of the interactive image monitoring apparatus 1 of the present invention. The apparatus includes an output device 2 such as a display; an input device 3 such as a keyboard and mouse; a central processing unit (CPU) 4 that performs arithmetic processing; a main storage device (memory, RAM) 5 that stores data, parameters, and the like during computation; a hard disk serving as an auxiliary storage device 6 in which various data such as calculation results are recorded; and an input interface 7 to which captured moving images are input. Hereinafter, the main storage device 5 and the auxiliary storage device 6 are also referred to collectively as the storage device. These hardware resources are electrically connected through, for example, a bus 8.

The input interface 7 has a function of converting a signal, either input from imaging means 9 such as a video camera or read from a storage medium 10 such as a DVD or video tape on which video is recorded, into data that a computer can process, and a function of recording each frame image constituting the video in the auxiliary storage device 6 as video data 14. An existing NTSC-RGB converter, a frame grabber, or an image capture board for personal computers, for example, may be used as the input interface 7. The output device 2 displays the user interface screen and the like. The interactive image monitoring program of the present invention is recorded in the auxiliary storage device 6, and the computer functions as the interactive image monitoring apparatus 1 when the CPU 4 reads and executes the program.

The interactive image monitoring apparatus 1 also includes label registration means 11 that executes the label registration process, tagging means 12 that executes the tagging process, and visualization means 13 that executes the visualization process. The label registration means 11, tagging means 12, and visualization means 13 can be implemented as software executed by the CPU 4.

The data needed for execution is loaded into the RAM 5. The auxiliary storage device 6 stores the video data 14 and holds two databases: an LSH database 15, which stores the data taught by the user together with their labels and enables registration and retrieval of the data, and an ICP (Image Clip Profile) database 16, which stores the image clip identification number, frame number I, coordinate position xy in the grid, label, tag, and so on. In the RAM 5, a data map 17 of the visualization region is formed. It is a memory area, sized as the horizontal pixels × vertical pixels of the visualization region 37, that holds labels or tags. At initialization, every entry of the data map 17 is the unknown tag 53. The auxiliary storage device 6 need not be internal to the computer; an external hard disk or an external storage device accessible over a network may of course be used. The configuration of the interactive image monitoring apparatus 1 described above is an example, and the apparatus is not limited to it.
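The stored items listed above can be pictured as a simple record type and a two-dimensional map. The following Python sketch is only an assumed in-memory representation: the field names of `ImageClipProfile`, the tag string, and the helper `new_data_map` are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

UNKNOWN = "unknown"


@dataclass
class ImageClipProfile:
    """One ICP record: clip id, frame number, grid position, label, tag."""
    clip_id: int
    frame: int
    x: int
    y: int
    label: Optional[str] = None  # set only when the user teaches this clip
    tag: str = UNKNOWN           # label estimated by the computer


def new_data_map(width: int, height: int) -> List[List[str]]:
    """Create a visualization data map with every cell set to unknown."""
    return [[UNKNOWN for _ in range(width)] for _ in range(height)]
```

Starting every cell as unknown matches the initialization described above and means that any region the user has not yet taught is shown as needing attention.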

本発明の対話型画像監視方法は、動画像中の監視対象物を監視及び追跡する方法において、動画像の各フレーム画像のうち画像処理の対象となる処理対象領域内に一以上の画像クリップを設け、画像クリップ内に監視対象物が撮影されているかどうかについてのラベル付けデータを登録するラベル登録処理と、画像クリップについて該画像クリップに含まれる画素の色情報値に基づいて画像特徴量を求め、ラベル付けデータが登録されていない画像クリップの画像特徴量が、ラベル付けデータが登録された画像クリップの画像特徴量の各要素を軸とする高次元空間において、予め定めた一定の範囲にあるものについては、該ラベル付けデータに従ったタグ付けをラベル付けデータが登録されていない画像クリップに対し行うタグ付け処理と、タグ付け処理の結果をタグの重要度に基づいて動画像のフレーム画像と併せて表示する可視化処理とを行うものである。以下、本実施形態における対話型画像監視方法及び対話型画像監視プログラムについて述べる。 The interactive image monitoring method of the present invention is a method for monitoring and tracking a monitoring object in a moving image, and performs three processes. The label registration process sets one or more image clips within the processing target area of each frame image of the moving image and registers labeling data indicating whether the monitoring object appears in an image clip. The tagging process obtains an image feature amount for each image clip from the color information values of the pixels in the clip; when the feature amount of an image clip with no registered labeling data lies within a predetermined range of the feature amount of a labeled image clip, in the high-dimensional space whose axes are the elements of the feature amount, the unlabeled clip is tagged according to that labeling data. The visualization process displays the tagging results together with the frame images of the moving image, based on the importance of each tag. The interactive image monitoring method and the interactive image monitoring program of this embodiment are described below.

本発明の対話型画像監視方法では、情報探索手法としてLSHを用いることにより、ツリー型の探索手法に比べてデータ探索の速度向上を図っているが、動画像は静止画像の数万倍という情報量となるため、単にLSHを用いただけでは、フィードバックが遅くなり対話型での処理は不可能である。 In the interactive image monitoring method of the present invention, LSH is used as the information search method to make data search faster than with tree-type search methods. However, a moving image carries tens of thousands of times as much information as a still image, so simply using LSH leaves feedback too slow for interactive processing.

そこで本発明の対話型画像監視方法では、ユーザが容易かつ迅速にラベル付け可能な直接操作型のグラフィカルインタフェースを用い、更にタグ付け処理をユーザ制御下に置くことで、フィードバック遅れの問題を解消し、画像認識結果を即座に表示することができる動画像での対話型監視を実現している。 The interactive image monitoring method of the present invention therefore uses a direct-manipulation graphical interface with which the user can label easily and quickly, and additionally places the tagging process under user control. This eliminates the feedback delay and achieves interactive monitoring of moving images in which image recognition results are displayed immediately.

尚、直接操作とはウィンドウシステムで用いられるユーザインタフェース技法をいい、具体的には、マウスによるファイル移動や、スライダーでの画面スクロールを指す。これはファイルアイコンや、スライダーコントロールといった、計算機内のリソースを視覚的に表現した画像シンボルを、直接触って動かしているかのように操作できる点に特徴がある。また、直接操作が効果的に働くのは、画像データを瞬時に把握できるという視覚的認知能力、50msec以内のコマ送り画像を動いていると認識する仮現運動、さらに200msec以内に起こった現象に関しては、変化を鋭敏に感知したり、因果関係を感じるという変化検知能力が人間に備わっているためである。直接操作では、変化がすぐにフィードバックされるため、ユーザは自身のラベル付けが及ぼす影響を感覚的に理解でき、さらに計算機が間違っていて、修正が必要な事例を容易に識別・指定できる。 Direct manipulation is a user interface technique used in window systems; concrete examples are moving a file with the mouse and scrolling the screen with a slider. Its distinguishing feature is that image symbols that visually represent resources inside the computer, such as file icons and slider controls, can be operated as if they were being touched and moved directly. Direct manipulation is effective because of three human abilities: visual cognition, which lets image data be grasped at a glance; apparent motion, by which images advanced frame by frame within 50 msec are perceived as moving; and change detection, by which changes occurring within 200 msec are sensed acutely and perceived as causally related. Because changes are fed back immediately, the user can intuitively understand the effect of his or her labeling and can easily identify and specify cases where the computer is wrong and correction is needed.

例えば、表１の視覚認知特性に示すように、システムの反応時間に応じて、ユーザが処理可能な処理内容は変化する。
For example, as shown in the visual perception characteristics of Table 1, the processing content that can be processed by the user varies depending on the reaction time of the system.

従来のIMLでは、主に50ms以下の直接操作可能な範囲を対象としていた。しかし、連続操作以上に時間のかかる遅い応答しか返せない場合も存在する。そこで、本発明では、動画像の再生モードとして、「コマ送り」、「通常再生」、「バッチ処理」の３つのモードをユーザに選択的に利用可能とさせることにより、ユーザの操作性を向上させて処理時間の短縮化を図り、対話型の画像監視システムを実現することを特徴としている。 Conventional IML mainly targeted the range in which direct manipulation is possible, that is, responses of 50 ms or less. In some cases, however, only slower responses can be returned, taking longer than continuous operation allows. The present invention therefore lets the user choose among three moving image playback modes, “frame advance”, “normal playback”, and “batch processing”, which improves operability, shortens processing time, and realizes an interactive image monitoring system.

「コマ送り」モードとは、ユーザが画像を１フレームずつ確認しながら、ラベル付けを可能とするモードである。新たに指定されたラベル付けされたデータは、LSHにより瞬時に登録される。「通常再生」モードとは、既にタグ付け処理がなされた画像をＴＶレート（29.97fps）で再生しながらユーザがタグ付け状況を確認することができるモードである。「バッチ処理」モードとは、すべての画像にタグ付け処理を行いながらバッチ処理を行うものである。本実施形態では、タグ付け処理においては、常に途中経過を可視化表示し、また、バッチ処理中でもいつでも一時停止及び任意の時点から処理を可能とすることで、操作性を向上させている。 The “frame advance” mode lets the user label while checking the images one frame at a time. Newly labeled data is registered instantly by LSH. The “normal playback” mode lets the user check the tagging status while images that have already been tagged are played back at the TV rate (29.97 fps). In the “batch processing” mode, tagging is applied to all images as a batch. In this embodiment, the tagging process always visualizes its progress, and even during batch processing the user can pause at any time and resume processing from any point, which improves operability.

次に、図９に示す本発明の対話型画像監視プログラムが行う処理全体を示すフローチャートを用いて説明する。 Next, description will be made with reference to a flowchart showing the entire processing performed by the interactive image monitoring program of the present invention shown in FIG.

先ず、初期設定（Ｓ１）を行う。初期設定（Ｓ１）では、対象フレーム番号iの初期化を行い(i=1)、再生モードを「コマ送り」とする。更に、LSHデータベース１５、ICPデータベース１６が既にある場合は、補助記憶装置６から読み込み、存在しない場合は、新規にデータベースの作成を行う。また、可視化領域のデータマップ１７の初期化を行う。 First, initial setting (S1) is performed. In the initial setting (S1), the target frame number i is initialized (i = 1), and the playback mode is set to “frame advance”. Further, if the LSH database 15 and the ICP database 16 already exist, they are read from the auxiliary storage device 6, and if they do not exist, a new database is created. The visualization area data map 17 is initialized.

次に、対象領域設定（Ｓ２）を行う。対象領域設定（Ｓ２）では、先ず、読み込んだ動画像の１フレーム目の画像を出力装置２に表示させる。尚、１フレーム目の画像に監視対象物４０（以下、対象物、ターゲットともいう）が撮影されていない場合等は、ユーザが対象物４０の撮影が開始されるまで、フレームを早送り機能により動画像を進めれば良い。また、認識、解析の対象とする映像は、補助記憶装置６に予め記憶されている映像データ１４から読み出しても、または撮像手段９及び記録媒体１０から直接キャプチャ処理を行うようにしても良い。尚、本実施形態では、映像データ１４は、例えばＴＶレート（29.97fps）の画像としているが、フレームレートは特に限られるものではない。 Next, target area setting (S2) is performed. First, the first frame of the loaded moving image is displayed on the output device 2. If the monitoring object 40 (hereinafter also called the object or the target) does not appear in the first frame, the user can advance the moving image with the fast-forward function until the object 40 appears. The video to be recognized and analyzed may be read from the video data 14 stored in advance in the auxiliary storage device 6, or may be captured directly from the imaging means 9 and the recording medium 10. In this embodiment the video data 14 is, for example, TV-rate (29.97 fps) video, but the frame rate is not particularly limited.

図２に本発明の対話型画像監視プログラムのユーザインタフェース画面の一例を示す。ウィンドウ２１に動画像を表示するのが映像表示領域２２である。また、画面に表示されているグリッド２３の最外辺に囲まれる領域が、ユーザにより指定された処理対象領域２４を示す。以降のラベル付けやタグ付け等の画像処理は、すべてこの処理対象領域２４に対して行われる。このようなマスク処理を前提とすることで、計算量を減らしかつノイズの混入を制限できるため、高速かつ高精度な処理が可能となる。例えば、ＤＶ−ＮＴＳＣの場合であれば映像の画素数は７２０×４８０ピクセルであるが、マスク処理をすることによって処理対象領域２４を限定することができる。 FIG. 2 shows an example of a user interface screen of the interactive image monitoring program of the present invention. A video display area 22 displays a moving image on the window 21. An area surrounded by the outermost side of the grid 23 displayed on the screen indicates the processing target area 24 specified by the user. All subsequent image processing such as labeling and tagging is performed on the processing target area 24. By presuming such mask processing, the amount of calculation can be reduced and the mixing of noise can be restricted, so that high-speed and high-precision processing is possible. For example, in the case of DV-NTSC, the number of pixels of the video is 720 × 480 pixels, but the processing target area 24 can be limited by performing mask processing.

尚、画面全体に変化が生じたか否かのみを監視したい場合等においては、例えば、処理対象領域２４を指定せず、映像表示領域２２に表示された画像全体を処理の対象としても良い。 In the case where it is desired to monitor only whether or not the entire screen has changed, for example, the processing target area 24 may not be specified, and the entire image displayed in the video display area 22 may be the processing target.

処理対象領域２４は、対象物４０が撮影された映像表示領域２２に表示されたフレーム画像に対し、ユーザがマウス操作等で処理対象領域２４を選択し、更に、処理対象領域２４を区切るグリッド２３のサイズを指定することにより指定される。また、グリッド２３の設定は必須ではなく、例えば、処理対象領域２４に対象物４０が存在するか否かのみを確認したい場合等では、グリッド２３を設定する必要はない。 The processing target area 24 is specified as follows: on the frame image shown in the video display area 22, in which the object 40 appears, the user selects the processing target area 24 with the mouse or the like and then specifies the size of the grid 23 that divides the area. Setting the grid 23 is not mandatory; for example, when the user only wants to check whether the object 40 is present in the processing target area 24, no grid 23 needs to be set.

本実施形態では、フレーム画像中のグリッド２３に囲まれる画像単位を「画像クリップ」と呼び、ラベル付けやタグ付けの単位とする。 In the present embodiment, an image unit surrounded by the grid 23 in the frame image is referred to as an “image clip” and is a unit for labeling or tagging.

尚、本実施形態では、グリッド２３を矩形としているが、グリッド２３は必ずしも矩形である必要はない。また、背景差分や各種フィルタ処理等の前処理を行って、画像クリップを切り出し、処理対象領域２４内に表示するようにしても良い。例えば、背景差分を前処理として行えば、フレーム間で動きのあった画像クリップだけを以降の処理の対象とすることができる。また、フィルタ処理を前処理として行えば、あらかじめ監視対象物の色の特徴を登録しておくことで、その色の特徴を有する画像クリップだけ以降の処理の対象とすることができる。 In the present embodiment, the grid 23 is rectangular, but the grid 23 is not necessarily rectangular. Further, preprocessing such as background difference and various filter processes may be performed to cut out the image clip and display it in the processing target area 24. For example, if background difference is performed as preprocessing, only image clips that have moved between frames can be targeted for subsequent processing. Further, if the filtering process is performed as a pre-process, the color feature of the monitoring target object is registered in advance, so that only the image clip having the color feature can be set as a target for subsequent processing.

ウィンドウ２１の右に並ぶボタン類は、映像の再生やバッチ処理などを指定するユーザインタフェースである。尚、以下に説明するユーザ補助としての機能は、必要に応じて、実装するようにすれば良く、必ずしも必須のものではない。また以下に説明する機能に限られず、他のユーザ補助機能を備えるようにしても良いのは勿論である。 Buttons arranged on the right side of the window 21 are user interfaces for designating video reproduction, batch processing, and the like. It should be noted that the user assistance function described below may be implemented as necessary and is not necessarily essential. Of course, the present invention is not limited to the functions described below, and other user assistance functions may be provided.

「<1S」ボタン２５ａは、１秒前の画像フレームに戻るものである。同様に、「<10f」ボタン２５ｂは、１０フレーム前の画像に戻るものである。「コマ送り」ボタン２６は、クリックすることにより、次のフレームが表示される。「このフレームを検査」ボタン２７は、現在ウィンドウ２１に表示されているフレーム画像に対して、タグ付け処理を行い、その結果を表示するものである。 The “<1S” button 25a returns to the image frame one second before. Similarly, the “<10f” button 25b returns to the image 10 frames before. When the “frame advance” button 26 is clicked, the next frame is displayed. The “inspect this frame” button 27 performs a tagging process on the frame image currently displayed in the window 21 and displays the result.

例えば、映像を解析するタグ付けは、１フレームあたりのデータ数２５個(画像クリップ数）という後述の実験の場合でも、１秒あたり７５０個、一分で４５０００個ものデータをタグ付けしなければならないため、ラベル付けされる度に映像全体のタグ付けを行うのでは、処理に時間がかかり対話型での処理は不可能である。しかし、ユーザが現在見ている表示フレームに限れば、タグ付けすべきデータ数は少ない（最大２５個）ため、「このフレームを検査」ボタン２７がクリックされた場合に、即時に当該フレームに対しタグ付け処理を行いフィードバックを返すことができる。 For example, tagging for video analysis must handle 750 data items per second and as many as 45,000 per minute, even in the experiment described later, which uses only 25 data items (image clips) per frame. Re-tagging the entire video every time a label is added would therefore take too long for interactive processing. If tagging is restricted to the frame the user is currently viewing, however, the number of items is small (at most 25), so when the “inspect this frame” button 27 is clicked, that frame can be tagged and feedback returned immediately.
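
The workload figures above follow from simple arithmetic, sketched here in Python. The constant and function names are hypothetical; the values 25 clips per frame and 29.97 fps are the ones quoted in the text.

```python
# Rough tagging workload, using the figures quoted in the text.
CLIPS_PER_FRAME = 25   # image clips per frame in the experiment described later
FRAME_RATE = 29.97     # TV rate, frames per second


def clips_to_tag(seconds: float) -> float:
    """Number of image clips that must be tagged for `seconds` of video."""
    return CLIPS_PER_FRAME * FRAME_RATE * seconds


per_second = clips_to_tag(1)     # roughly 750 clips
per_minute = clips_to_tag(60)    # roughly 45,000 clips
single_frame = CLIPS_PER_FRAME   # work triggered by "inspect this frame"

print(round(per_second), round(per_minute), single_frame)
```

Tagging one displayed frame is thus about three orders of magnitude less work than re-tagging a minute of video, which is why per-frame inspection can respond immediately.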

「未指定対象時停止」３０のチェックは、通常再生モードの場合の機能である。通常再生モードでは、タグ付け処理はせずに、これまでの最新のタグ付け結果をウィンドウ２１にオーバーレイ表示しながら映像再生を行う。ここで、「未指定対象時停止」３０のチェックを外すと、タグの種類を問わず再生を続けるが、チェックを入れると、不明タグが現れた場合に、自動的に一時停止するものである。「未指定対象時停止」３０のチェックを入れておけば、不明タグを優先的にラベル指定する場合に、不明タグを探索する手間を省くことができる。 The “stop when not specified” check box 30 applies to the normal playback mode. In normal playback mode, no tagging is performed; the video is played back with the latest tagging results overlaid on the window 21. When the check box 30 is cleared, playback continues regardless of the tag type; when it is checked, playback pauses automatically whenever an unknown tag appears. Checking the box saves the effort of searching for unknown tags when the user wants to label them preferentially.
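
The pause-on-unknown behavior can be sketched as a short loop. This is a minimal illustration, not the program's actual implementation: the `UNKNOWN` marker, the `play` function, and the list-of-lists layout of stored tags are all assumptions made for the example.

```python
UNKNOWN = "unknown"  # hypothetical marker standing in for the unknown tag


def play(frame_tags, start, stop_on_unknown):
    """Play from frame `start`; return the index where playback pauses,
    or None if the last frame is reached.

    frame_tags[i] is the list of tags already stored for the clips of
    frame i. No tagging happens here: normal playback only overlays
    results that already exist.
    """
    for i in range(start, len(frame_tags)):
        if stop_on_unknown and UNKNOWN in frame_tags[i]:
            return i   # pause so the user can label this frame
    return None        # played to the end without pausing


tags = [["target"], ["non-target", UNKNOWN], ["target"]]
paused_at = play(tags, 0, stop_on_unknown=True)
```

With the check box cleared (`stop_on_unknown=False`) the same call runs through every frame and returns `None`.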

「通常再生」ボタン３１は、現在ウィンドウ２１に表示されているフレーム画像以降の動画像を通常再生モードで再生するためのものである。「<<10s」ボタン３２ａは、１０秒前の画像に戻るものである。同様に、「10s>>」ボタン３２ｂは１０秒後の、「30s>>」ボタン３２ｃは３０秒後の画像を表示するものである。「バッチ検査」ボタン３３は、現在ウィンドウ２１に表示されている画像以降のフレームに対してバッチ処理を行うものである。「clear tag map」ボタン３４は、ウィンドウ２１下の可視化領域３７を初期化（クリア）するものである。「近傍距離閾値」ボタン３５は、タグ付け処理の際に用いる近傍距離閾値ｒを指定するためのものである。尚、近傍距離閾値ｒは、スライダ３５ａで決定するとフィールド３５ｂに表示される。また、フィールド３５ｃに、スライダ３５ａで設定できる値の範囲の最大値を入力することで、スライダ３５ａで設定できる値の範囲を変更できる。「InitializeDB」ボタン３６は、当該画像についてのLSHデータベース１５及びICPデータベース１６の初期化を行うものである。 The “normal playback” button 31 is for playing back the moving image after the frame image currently displayed in the window 21 in the normal playback mode. The “<< 10s” button 32a returns to the image 10 seconds before. Similarly, the “10s >>” button 32b displays an image after 10 seconds, and the “30s >>” button 32c displays an image after 30 seconds. The “batch inspection” button 33 performs batch processing on the frames after the image currently displayed in the window 21. The “clear tag map” button 34 initializes (clears) the visualization area 37 under the window 21. The “neighbor distance threshold value” button 35 is used to designate a neighborhood distance threshold value r used in the tagging process. The neighborhood distance threshold r is displayed in the field 35b when determined by the slider 35a. Further, by inputting the maximum value of the range of values that can be set with the slider 35a in the field 35c, the range of values that can be set with the slider 35a can be changed. The “InitializeDB” button 36 is used to initialize the LSH database 15 and the ICP database 16 for the image.

また、ウィンドウ２１下に表示されている模様が可視化領域３７、即ち、処理対象の映像に対しタグ付け処理を行った結果を示す領域である。可視化圧縮方向３８は、後述の可視化圧縮を横軸方向に対して行うのか、縦軸方向に対して行うのかを選択可能としている。 In addition, the pattern displayed under the window 21 is a visualization region 37, that is, a region indicating a result of performing tagging processing on a processing target video. The visualization compression direction 38 can select whether visualization compression described later is performed in the horizontal axis direction or the vertical axis direction.

また、ウィンドウ２１下の画像スライダ３９は、スクロールさせることで動画像の任意の地点の画像フレームを操作することができる。即ち、画像スライダ３９を最左端にすると最初のフレームを表示し、最右端にすると最終フレームを表示するものである。 The image slider 39 below the window 21 can be scrolled to move to the image frame at any point in the moving image. Moving the image slider 39 to the far left displays the first frame, and moving it to the far right displays the last frame.

対象領域設定（Ｓ２）までが終了すると、指定された再生モードにより画像認識、解析処理（Ｓ３〜Ｓ１２）が進行する。尚、画像認識、解析処理は、原則としてフレーム番号i毎にループ処理が行われるものである。 When the process up to the target area setting (S2) is completed, the image recognition and analysis process (S3 to S12) proceeds according to the designated reproduction mode. In principle, the image recognition and analysis processing is performed for each frame number i.

画像認識、解析処理では、先ず、映像設定（Ｓ３）を行う。映像設定（Ｓ３）は、i番目のフレーム画像をメモリへ読み込むものである。また、同時に、映像表示領域２２内に対象領域設定（Ｓ２）で設定されたグリッド２３が表示される。 In the image recognition and analysis process, first, video setting (S3) is performed. In the video setting (S3), the i-th frame image is read into the memory. At the same time, the grid 23 set in the target area setting (S2) is displayed in the video display area 22.

本実施形態では、映像表示領域２２に表示されたフレーム画像についてラベル付けを行う場合について述べる。尚、ラベル付けは必ずしも始めに表示されたフレーム画像について行う必要はなく、上述の画像のスキップ機能（画像スライダ３９、「<<10s」ボタン３２ａ等）でラベル付けに適した（例えば、対象物が１つの画像クリップ内に撮影されている）フレームを選択すればよい。 This embodiment describes the case where labeling is performed on the frame image displayed in the video display area 22. Labeling need not be performed on the first frame image displayed; the user can use the image skip functions described above (the image slider 39, the “<<10s” button 32a, and so on) to select a frame suited to labeling, for example one in which the object fits within a single image clip.

まず、画像認識、解析処理の基本となるラベル付けについて説明する。 First, labeling that is the basis of image recognition and analysis processing will be described.

ラベル付けはユーザにより行われる。具体的には、対象物４０が撮影されている画像クリップを選択し、「指定ラベルを登録」ボタン２９を押すことで、ラベル付けがされる。尚、ラベル付けには、特にルールはなく任意のフレームの任意の画像クリップに任意の数のラベル付けを行えばよい。 Labeling is performed by the user. Specifically, labeling is performed by selecting an image clip in which the object 40 is photographed and pressing a “register specified label” button 29. There is no particular rule for labeling, and any number of labels may be attached to any image clip in any frame.

本実施形態では、画像クリップをクリックする度に、その画像クリップの枠（グリッド２３）の色が、赤 → グレー → 赤 → グレー → 赤 ...と繰り返し変化する。ここで、赤い枠の状態で「指定ラベルを登録」ボタン２９をクリックすると、当該画像クリップは正例（ターゲット）としてターゲットラベルが登録される。逆にグレー枠の状態で「指定ラベルを登録」ボタン２９をクリックすると、当該画像クリップは負例（非ターゲット）として非ターゲットラベルが登録される。 In this embodiment, each time an image clip is clicked, the color of the frame (grid 23) of the image clip changes repeatedly from red → gray → red → gray → red. Here, when the “register specified label” button 29 is clicked in the state of a red frame, the target label is registered as a positive example (target) for the image clip. Conversely, when the “register specified label” button 29 is clicked in a gray frame state, a non-target label is registered as a negative example (non-target) for the image clip.

即ち、ユーザは、対象物４０が撮影されている画像クリップの場合は、赤い枠の状態で、「指定ラベルを登録」ボタン２９をクリックすればよい。また、対象物４０が撮影されていない画像クリップであれば、枠がグレーの状態で「指定ラベルを登録」ボタン２９をクリックすればよい。尚、本発明の対話型画像監視方法は、ユーザが指定する最小限のラベル付けされたデータを基に、タグ付け処理を行いラベル付けのされていない画像クリップをタグ付け処理を行い対象物４０が撮影されているかどうかを判断するものであるので、当該フレームにおけるすべての画像クリップについてラベルを登録する必要はなく、必要に応じて行うだけでよい。 In other words, for an image clip in which the object 40 appears, the user clicks the “register specified label” button 29 while the frame is red; for an image clip in which the object 40 does not appear, the user clicks the button 29 while the frame is gray. The interactive image monitoring method of the present invention uses the minimal labeled data specified by the user as the basis for tagging the unlabeled image clips and judging whether the object 40 appears in them. Labels therefore need not be registered for every image clip in the frame; it is enough to register them as needed.

更に、本実施形態では、表示されたフレーム画面のいずれの画像クリップにも対象物４０が映っていない場合は、いずれの画像クリップも選択しない状態で、「指定ラベルを登録」ボタン２９をクリックすることで、当該フレームにおけるすべての画像クリップについては、非ターゲットラベルをラベル付けすることができる。 Further, in the present embodiment, when the object 40 is not shown in any image clip on the displayed frame screen, the “register specified label” button 29 is clicked without selecting any image clip. Thus, for all image clips in the frame, a non-target label can be labeled.
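
The click-and-register behavior described above can be modeled with a small amount of state. The sketch below is one possible reading, not the patented implementation: the names `toggle` and `register`, the string values, and the rule that only clicked clips are registered (unless nothing was clicked) are assumptions made for illustration.

```python
def toggle(states, clip_id):
    """A click flips the frame color of a clip: red -> gray -> red -> ...

    `states` maps clip ids to "red" or "gray"; a clip that has never been
    clicked is absent and turns red on its first click.
    """
    states[clip_id] = "gray" if states.get(clip_id) == "red" else "red"


def register(states, all_clip_ids):
    """Return the labels stored when the register button is pressed.

    Red clips become positive examples (target), gray clips negative
    examples (non-target). If no clip was selected at all, every clip in
    the frame is registered as non-target.
    """
    if not states:
        return {cid: "non-target" for cid in all_clip_ids}
    return {cid: "target" if color == "red" else "non-target"
            for cid, color in states.items()}


states = {}
toggle(states, 3)   # clip 3: red
toggle(states, 7)   # clip 7: red
toggle(states, 7)   # clip 7: gray
labels = register(states, range(25))
```

Here `labels` holds a target label for clip 3 and a non-target label for clip 7, while the other 23 clips stay unlabeled and are left to the tagging process.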

尚、ユーザによるラベル付けの方法は特に限られるものではなく、例えば、「正例として登録」、「負例として登録」の２つの登録ボタンを設け、画像クリップを選択していずれかの登録ボタンを選択することでラベル登録を行うようにしても良い。 The labeling method used by the user is not particularly limited. For example, two registration buttons, “register as positive example” and “register as negative example”, may be provided, and a label may be registered by selecting an image clip and then pressing one of the two buttons.

このようにして、指定されたラベルはLSHデータベース１５に登録され、タグ付け処理の際の基準データとなる。以下に、LSHによるラベル登録について説明する。 In this way, the designated label is registered in the LSH database 15 and becomes reference data for the tagging process. Hereinafter, label registration by LSH will be described.

従来型のIMLでの探索では、決定木（DT ; Decision Tree）の一種を用いて、高速なフィードバックを実現している。しかし、決定木は、事例データ全体を見て、良い分岐点を探す手法であるため、インタラクティブなラベル付けに利用すると、木がアンバランスになり速度が低下するという問題がある。高速な登録速度を維持するには木の再構成をしなければならず、これには時間がかかる。 In conventional IML search, a kind of decision tree (DT) is used to achieve high-speed feedback. However, the decision tree is a method for searching for a good branch point by looking at the entire case data, and therefore, when used for interactive labeling, there is a problem that the tree becomes unbalanced and the speed decreases. To maintain a high registration speed, the tree must be reconstructed, which takes time.

これに対し、データ同士の類似性を直接用いる最近傍探索（NN；Nearest Neighbor）は、分岐点を探す必要がないため逐次的なラベル追加に適している。しかし、一般に事例数に比例して探索時間が増えるという欠点がある。また、近年、探索時間の短縮を図る技術として近似最近傍探索（ANN;Approximate Nearest Neighbor)が提案されている。ANNは、完全ではないが、高い確率でNNを可能とすることで、高い探索精度を維持したまま探索時間の探索を図るものである。従来の近似最近傍探索には、kd-treeをはじめとするtree型の探索手法が良く用いられている。しかしながら、ツリー型の探索手法は、探索対象のデータの増大に伴い、ツリー構築に時間がかかり、迅速な探索が行えなくなるという問題点を有していた。 In contrast, nearest neighbor (NN) search, which uses the similarity between data items directly, does not need to search for branch points and is therefore suited to adding labels incrementally. Its drawback is that search time generally grows in proportion to the number of examples. In recent years, approximate nearest neighbor (ANN) search has been proposed as a way to shorten search time. ANN does not guarantee the exact result, but it finds the nearest neighbor with high probability, which shortens search time while maintaining high search accuracy. Conventional approximate nearest neighbor search often uses tree-type methods such as the kd-tree. Tree-type methods, however, take longer to build the tree as the amount of data grows, and fast search becomes impossible.

このANNを高速に実現する汎用性の高い手法として局所性鋭敏型ハッシュ（LSH;Locality-Sensitive Hashing）が提案されている。LSHは、代表的な高次元データ用kd-treeの４０倍の速度向上が実験的に示されており、最近傍探索の代表的手法の一つである。 Locality-sensitive hashing (LSH) has been proposed as a highly versatile method for realizing this ANN at high speed. LSH has been experimentally shown to be 40 times faster than typical high-dimensional data kd-trees, and is one of the representative methods of nearest neighbor search.

本発明の対話型画像監視方法では、LSHを用いて、ユーザが指定したラベルを記憶し、データ認識（タグ付け）に利用する。これによりユーザが映像の任意の箇所をラベル付けすると、そのラベル情報は即座にデータベースに反映される。映像データの認識はユーザが指定したすべてのラベルを使って、その場で行えるため、迅速なフィードバックが可能となる。尚、本実施形態では、画像類似性の判定で一般的な「ユークリッド距離」でのANNを実現するため、p安定分布を用いたLSH（p-LSH)を用いているが、他のLSHを用いても良い。 In the interactive image monitoring method of the present invention, a label specified by a user is stored using LSH and used for data recognition (tagging). Thus, when the user labels an arbitrary portion of the video, the label information is immediately reflected in the database. Since video data can be recognized on the spot using all labels specified by the user, quick feedback is possible. In this embodiment, LSH (p-LSH) using a p-stable distribution is used in order to realize ANN at a general “Euclidean distance” in image similarity determination, but other LSHs are used. It may be used.

以下に、p-LSHについて簡単に説明する（p-LSHの詳細は、Mayur Datar,Nicole Immorlica,Piotr Indyk,and Vahab S.Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry,pp.253-262,2004参照）。 p-LSH is described briefly below. For details, see Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253-262, 2004.

p-LSHでは、先ず、扱うデータ（本明細書では、画像特徴量）をd次元の実数ベクトルvとし、このd次元データをk次元に写像する（但し、k＜d）。 In p-LSH, first, data to be handled (in this specification, an image feature amount) is a d-dimensional real vector v, and this d-dimensional data is mapped to k dimensions (where k <d).

そのためにp安定分布（ユークリッド距離の場合は２安定分布、即ち正規分布）に従う独立な値をd個用意し、それぞれを要素とするd次元ベクトルaを、k個作成する。 To do this, d independent values drawn from a p-stable distribution (for Euclidean distance, the 2-stable distribution, i.e., the normal distribution) are prepared to form a d-dimensional vector a, and k such vectors are created.

更に、数式１で示される関数を用いて、h_a,b(v)を要素とするk次元整数ベクトルgを生成する。これによりd次元ベクトルｖは、k次元整数ベクトルに写像される。 Further, a k-dimensional integer vector g whose elements are h_{a,b}(v) is generated using the function given by Equation 1. As a result, the d-dimensional vector v is mapped to a k-dimensional integer vector.

ここで、bは[0,ω]の範囲の実数パラメータである。 Here, b is a real-valued parameter in the range [0, ω].
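
Equation 1 itself is not reproduced in this text. For reference, the hash function defined in the Datar et al. paper cited above has the following form, which matches the parameters a, b, and ω used here; it is given on the assumption that Equation 1 follows that paper.

```latex
h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{\omega} \right\rfloor,
\qquad
g(v) = \bigl( h_{a_1,b_1}(v),\; h_{a_2,b_2}(v),\; \dots,\; h_{a_k,b_k}(v) \bigr)
```

Here a · v is the inner product of the random vector a with the data vector v, ω is the width of one hash bin, and the floor operation turns each projected value into an integer bin index.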

ここで、あるベクトルv1,v2があった場合、写像後の差(a・v1-a・v2)は、‖v1-v2‖_p×Xに分布する。尚、‖v‖_p はp-ノルム、Xはp安定分布である。これにより、v1,v2がr以内にあると高い確率で同じgが得られる。 Here, for two vectors v1 and v2, the difference after projection, (a·v1 - a·v2), is distributed as ‖v1-v2‖_p × X, where ‖v‖_p is the p-norm and X is a random variable following the p-stable distribution. Consequently, when v1 and v2 are within distance r of each other, the same g is obtained with high probability.

p-LSHは、k個のaの組をL個用意し、それぞれとの内積計算によりL個のgを生成し、それぞれを別のテーブル（バケット）に格納する。即ち、あるベクトルvからL個のｋ次元整数ベクトルが生成され、それぞれをバケットに格納するものである。即ち、L個の写像空間を用意して、vをそれぞれの空間に写像しているといえる。これにより、それぞれで近傍が発見される確率がp(c)であっても、L個のバケットを全て探索すると1-(1-p(c))^Lの確率で発見できることになるので、最近傍探索を精度良く求めることができる。 p-LSH prepares L sets of k vectors a, generates L vectors g by inner-product computation with each set, and stores each g in a separate table (bucket). That is, L k-dimensional integer vectors are generated from a vector v, and each is stored in its own bucket; in other words, L projection spaces are prepared and v is mapped into each of them. Even if the probability of finding a neighbor in any one of them is p(c), searching all L buckets finds it with probability 1-(1-p(c))^L, so the nearest neighbor can be found with high accuracy.
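
The registration and lookup scheme described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the LSH database 15 itself: the class name `PStableLSH`, its methods, and the parameter `w` (the bin width ω) are hypothetical, and each hash is taken to be floor((a·v + b)/w) as in the cited Datar et al. paper.

```python
import math
import random
from collections import defaultdict


class PStableLSH:
    """Toy p-stable LSH index for Euclidean distance (2-stable = Gaussian)."""

    def __init__(self, d, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # L groups of k pairs (a, b): a is a d-dimensional Gaussian vector,
        # b is drawn uniformly from [0, w].
        self.groups = [
            [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]  # one table per group

    def _g(self, group, v):
        """k-dimensional integer vector g(v) for one group."""
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in group
        )

    def register(self, v, label):
        """Store a labeled vector in all L tables."""
        for group, table in zip(self.groups, self.tables):
            table[self._g(group, v)].append((tuple(v), label))

    def query(self, v, r):
        """Return sorted (distance, label) pairs for stored vectors that
        share a bucket with v in any table and lie within distance r."""
        seen, hits = set(), []
        for group, table in zip(self.groups, self.tables):
            for u, label in table.get(self._g(group, v), []):
                if (u, label) in seen:
                    continue
                seen.add((u, label))
                dist = math.dist(u, v)
                if dist <= r:
                    hits.append((dist, label))
        return sorted(hits)


def hit_probability(p, L):
    """Probability of finding a neighbor in at least one of L tables."""
    return 1.0 - (1.0 - p) ** L


index = PStableLSH(d=4, k=2, L=5, w=4.0)
index.register([1.0, 2.0, 3.0, 4.0], "target")
found = index.query([1.0, 2.0, 3.0, 4.0], r=0.5)
```

Registration only appends to L hash tables, with no tree to rebalance, which is what makes incremental label addition cheap. `hit_probability(0.5, 3)` evaluates the expression 1-(1-p)^L from the text for p = 0.5 and L = 3.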

ここで、Lの数を多くすれば、探索精度は向上するが、探索に時間がかかるようになる。また、写像空間の次元数kの値も、同様に精度と時間に影響を与える。kを大きくすると探索時間は減るが、内積の計算時間が増え、かつ同じrに対して探索精度が下がることになる。よって、L及びkの値は、必要な精度と時間の制限を考慮して選択すべきものである。以上でLSHについての説明を終了する。 Here, if the number of L is increased, the search accuracy is improved, but the search takes time. Also, the value of the dimension number k of the mapping space similarly affects the accuracy and time. If k is increased, the search time decreases, but the calculation time of the inner product increases, and the search accuracy decreases for the same r. Thus, the values of L and k should be selected taking into account the required accuracy and time limitations. This is the end of the description of LSH.

上述のようにユーザは、最初に一部のデータにラベル付けをすることが必要となるが、本発明の対話型画像監視方法は、ユーザの負担を最小限に減らし、さらに対話型処理により、ユーザが現在行っているラベル付けがどのように解析結果に反映しているか、即ち、効果的なラベル付けを考えながらラベル付けができるようにすることで、ラベル付けに要する時間を最小限にすることが可能となる。 As described above, the user needs to label some data first, but the interactive image monitoring method of the present invention minimizes the burden on the user, and further, by interactive processing, Minimize the time required for labeling by allowing users to label while considering how the current labeling is reflected in the analysis results, that is, considering effective labeling It becomes possible.

Ｓ４以降の処理は、選択されている再生モードにより異なる処理が行われる。次に、タグ付け処理について説明する。 The processes after S4 are different depending on the selected playback mode. Next, tagging processing will be described.

図３にタグ付け処理を表す模式図を示す。ここで、四角形の枠４１は、画像特徴量を軸とする空間を示すとする。この特徴量によってデータは識別される。この空間内で近ければ特徴量が似ていること、言い換えれば画像が似ていることを意味する。本実施形態では、画像特徴量は、各フレームの各画像クリップ毎に１つのd次元ベクトルとして与えられる。以下、タグ付け処理によるタグ付けは、この画像特徴量を基準になされる。 FIG. 3 is a schematic diagram showing the tagging process. Here, it is assumed that the rectangular frame 41 indicates a space having the image feature amount as an axis. Data is identified by this feature amount. If it is close in this space, it means that the feature amount is similar, in other words, that the image is similar. In the present embodiment, the image feature amount is given as one d-dimensional vector for each image clip of each frame. Hereinafter, tagging by the tagging process is performed based on this image feature amount.
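
The idea of tagging by proximity in feature space can be sketched with a brute-force search. The function below is only an illustration: the names are hypothetical, the exhaustive loop stands in for the LSH lookup, and the rule that the nearest labeled example within the threshold r decides the tag is an assumption rather than something this passage specifies.

```python
import math

UNKNOWN = "unknown"  # hypothetical marker standing in for the unknown tag


def tag_clip(feature, labeled, r):
    """Tag an unlabeled clip from labeled examples within distance r.

    `labeled` is a list of (feature_vector, label) pairs. The label of the
    nearest example within r is used; with no example within r, the clip
    keeps the unknown tag.
    """
    best_label, best_dist = UNKNOWN, r
    for vec, label in labeled:
        dist = math.dist(feature, vec)   # Euclidean distance in feature space
        if dist <= best_dist:
            best_label, best_dist = label, dist
    return best_label


labeled = [((0.0, 0.0), "target"), ((10.0, 10.0), "non-target")]
near_target = tag_clip((1.0, 0.0), labeled, r=2.0)
far_from_all = tag_clip((5.0, 5.0), labeled, r=2.0)
```

A larger r tags more clips at the risk of more mistakes; a smaller r leaves more clips unknown for the user to label.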

図４に画像特徴量の指定インタフェース画面の一例を示す。本実施形態では、画像特徴量となる基準を予め選択的に設定することが可能である。特徴量４７としては、一般的な縮小（スケーリング）またはヒストグラムのいずれかを選択可能としている。 FIG. 4 shows an example of an image feature amount designation interface screen. In the present embodiment, it is possible to selectively set a reference as an image feature amount in advance. As the feature amount 47, either general reduction (scaling) or a histogram can be selected.

また、表色系４８としては、RGB、HSV、グレースケール、CIE Yxy、CIE L*a*b*から選択可能としている、また、縮小方法４９としては、最近傍法、双線形補完法、双三次補完法、平均化のいずれかのアルゴリズムを選択可能としている。尚、いずれのアルゴリズムも公知のアルゴリズムであるので説明は省略する。また、データの次元５０には、特徴量データの次元数dを入力する。次元数を大きくすると細かな特徴を考慮した類似性の判定が行うことが可能となるが、タグ付けの処理時間が長くなり、次元数を小さくすると、類似性判定が粗くなるが、処理時間は短くなる。このため、要求される精度、処理時間の制限等の制約条件に応じて次元数を設定すればよい。尚、画像特徴量として用いることが可能な基準は上述の例に限られない。 The color system 48 can be selected from RGB, HSV, grayscale, CIE Yxy, and CIE L*a*b*. The reduction method 49 can be selected from the nearest neighbor method, bilinear interpolation, bicubic interpolation, and averaging. All of these are well-known algorithms, so their description is omitted. The number of dimensions d of the feature data is entered in the data dimension field 50. A larger number of dimensions allows similarity to be judged with finer features taken into account but lengthens tagging time; a smaller number makes the similarity judgment coarser but shortens processing time. The number of dimensions should therefore be set according to constraints such as the required accuracy and the processing time limit. The criteria usable as image feature amounts are not limited to the examples above.

以下に、表色形にRGB、３次元ヒストグラム特徴を用いた場合を例に画像特徴量の算出方法を示す。各チャンネルの色量子化数をnとすると、３次元ヒストグラムはn×n×nの値を有するヒストグラムとなる。例えばn=4の場合は、4×4×4=64の６４次元ベクトルとなる。 The calculation of the image feature amount is shown below for the case of the RGB color system with a three-dimensional histogram feature. If the number of color quantization levels per channel is n, the three-dimensional histogram has n×n×n values. For example, n=4 gives a 64-dimensional vector (4×4×4=64).

次に、各チャンネルの色値の取り得る最大値をmaxR,maxG,maxBとし（0〜255）、画像クリップ内のある画素の色値の値を（R，G，B）とする。また、r' = (maxR+1)/n, g' = (maxG+1)/n, b' = (maxB+1)/nとし、r = floor(R/r'), g = floor(G/g'), b = floor(B/b')とすると、r, g, bは0〜n-1の整数となる。 Next, let maxR, maxG, and maxB be the maximum values each channel can take (channel values range from 0 to 255), and let (R, G, B) be the color value of a pixel in the image clip. With r' = (maxR+1)/n, g' = (maxG+1)/n, b' = (maxB+1)/n and r = floor(R/r'), g = floor(G/g'), b = floor(B/b'), the values r, g, and b are integers from 0 to n-1.

画像クリップ内のすべての画素に対して、上記計算を行って、異なる(r,g,b)毎に画素数を集計する。k = r×n×n + g×n + bとし、k番目の要素を(r, g, b)の集計画素数とし、集計画素数を並べてヒストグラムを表現するベクトルとする。尚、画像クリップ内に該当する色値の画素が無い場合は集計画素数は０となる。 The above calculation is performed for all the pixels in the image clip, and the number of pixels is counted for each different (r, g, b). k = r × n × n + g × n + b, the kth element is the total number of pixels of (r, g, b), and the total number of pixels is arranged as a vector expressing the histogram. Note that the total number of pixels is 0 when there is no corresponding color value pixel in the image clip.
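
The histogram calculation above can be sketched as follows. The function name `rgb_histogram` and its arguments are hypothetical, and the sketch assumes all three channels share the same maximum value (255), so that r' = g' = b'.

```python
def rgb_histogram(pixels, n=4, max_value=255):
    """Return the n*n*n color histogram of an iterable of (R, G, B) pixels.

    Each channel is quantized into n levels of width (max_value + 1) / n,
    so every quantized value falls in 0..n-1 and the bin index
    r*n*n + g*n + b falls in 0..n**3 - 1.
    """
    step = (max_value + 1) / n   # r' = g' = b' under the shared-maximum assumption
    hist = [0] * (n * n * n)     # bins with no matching pixel stay 0
    for R, G, B in pixels:
        r = int(R // step)
        g = int(G // step)
        b = int(B // step)
        hist[r * n * n + g * n + b] += 1
    return hist


clip = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
feature = rgb_histogram(clip, n=4)
```

For n=4 the result is the 64-dimensional vector mentioned in the text; the black pixel lands in bin 0, the white pixel in bin 63, and the pure red pixel in bin 3×16 = 48.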

Next, computation of the image feature quantity is described for the case of the RGB color system with a 5 × 5 scaling feature computed by bilinear interpolation. A scaling feature divides the image into a checkerboard of blocks, computes a representative value for each block by the method specified as the reduction method 49, and uses those representative values as the elements of the vector. Bilinear interpolation computes the value of each pixel of the reduced image by linear interpolation from the values of the four pixels surrounding the corresponding coordinates in the image before reduction.

Specifically, when a 32 × 32 pixel image I₁ is converted into a 5 × 5 scaling feature I₂, the reduction ratio is 5/32, so the value at coordinate (1, 1) of I₂ corresponds to the value of I₁ at (32/5, 32/5), that is, at (6.4, 6.4). Since I₁ has no value at (6.4, 6.4), the value is linearly interpolated from the values of I₁ at (6, 6), (6, 7), (7, 6), and (7, 7) and used as the value of I₂ at (1, 1). The same is done for every pixel of I₂ and for each of the R, G, and B values to obtain the image feature quantity.
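The interpolation step for a single channel can be illustrated as below. This is only a sketch under stated assumptions: `bilinear_sample` and `scaling_feature` are hypothetical names, the image is a list of rows indexed as `image[y][x]`, the coordinates of the text are used directly as indices, and coordinates falling outside the image are clamped to its edge (the text does not say how the border is handled).

```python
import math


def bilinear_sample(image, x, y):
    """Interpolate a single-channel image at real-valued coordinates (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp to the image area (border handling is an assumption).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two surrounding rows, then along y.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def scaling_feature(image, size=5):
    """Return the size*size scaling feature of a square single-channel image."""
    scale = len(image) / size  # 32 / 5 = 6.4 in the example of the text
    return [bilinear_sample(image, u * scale, v * scale)
            for v in range(1, size + 1) for u in range(1, size + 1)]


# Hypothetical 32 x 32 test image whose value is x + 10*y.
image = [[x + 10 * y for x in range(32)] for y in range(32)]
feature = scaling_feature(image)
```

For an RGB clip the same function would be applied to each of the three channels and the results concatenated, giving 5 × 5 × 3 values.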

FIG. 3(a) is a schematic diagram of the state after the user has applied labels. White circles 42 are data labeled as to be recognized (hereinafter, target label data 42), and black circles 43 are data labeled as not to be recognized (hereinafter, non-target label data 43). FIG. 3 is drawn in two dimensions for simplicity, but this embodiment uses a high-dimensional space of several tens to several hundreds of dimensions.

FIG. 3(b) is a schematic diagram showing the result of the tagging process of the present invention applied according to that labeling. The tagging process tags data lying within the neighborhood distance threshold r of the labeled data, that is, of the target label data 42 and the non-target label data 43. Data within distance r of target label data 42 is tagged as target tag data 44, and data within distance r of non-target label data 43 is tagged as non-target tag data 45. The neighborhood distance threshold r is a freely chosen parameter determined with reference to the value shown in the "inspection distance" field 28 (see FIG. 2).

An example of how to set the neighborhood distance threshold r is as follows. When the "inspect this frame" button 27 is pressed to run the tagging process, each image clip of the displayed frame image is compared with the labeled data already registered. The comparison targets are the registered labeled data lying within the neighborhood distance threshold r of the d-dimensional vector representing the image feature quantity of the image clip. Let D be the distance to the nearest of these comparison targets. After D has been computed for every image clip on the screen, the maximum of these values, maxD, is displayed in the "inspection distance" field 28. If no data lies within the neighborhood distance threshold r, D is set to 9999.
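The computation of the displayed inspection distance can be sketched as follows. The function name `inspection_distance`, the use of Euclidean distance, and the exhaustive scan over the registered data are assumptions made for illustration; the embodiment retrieves neighbors through the LSH database.

```python
import math

NO_NEIGHBOR = 9999  # value assigned to D when nothing lies within r


def inspection_distance(clip_vectors, labeled_vectors, r):
    """Return maxD, the largest nearest-neighbor distance over all clips."""
    max_d = 0
    for v in clip_vectors:
        # Only registered data within the threshold r are comparison targets.
        within = [d for d in (math.dist(v, u) for u in labeled_vectors) if d <= r]
        D = min(within) if within else NO_NEIGHBOR
        max_d = max(max_d, D)
    return max_d


# Hypothetical two-dimensional feature vectors.
clips = [(0.0, 0.0), (3.0, 0.0)]
labeled = [(1.0, 0.0)]
wide = inspection_distance(clips, labeled, r=5.0)
narrow = inspection_distance(clips, labeled, r=1.5)
```

In this example the wide threshold yields 2, while the narrow threshold leaves the second clip without any neighbor, so 9999 is displayed.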

The value displayed in the "inspection distance" field 28 can serve as a reference for setting the neighborhood distance threshold r. For example, if the screen contains an image clearly similar to data already registered with a label and tag estimation nevertheless fails, the threshold r is set too small. The remedy is to increase r, but if r is made too large, even dissimilar data is wrongly estimated to be similar. The threshold r is therefore set to a value slightly larger than the inspection distance observed when similar images are tagged correctly.

One object of the interactive image monitoring method of the present invention is to enable rapid processing by minimizing the amount of labeling the user performs. The labeled data is therefore only a very small part of the whole video. If target tag data 44 and non-target tag data 45 are forcibly inferred from the few labeled data 42 and 43, for example by increasing r, the estimation accuracy drops and misjudgments increase. Moreover, the user can no longer judge how the labeling affects the analysis result.

そこで本発明の対話型画像監視方法では、図３（ｂ）に示すようにターゲットタグデータ４４、非ターゲットタグデータ４５のいずれにもならないものを不明タグデータ４６としている。 Therefore, in the interactive image monitoring method of the present invention, as shown in FIG. 3 (b), the unknown tag data 46 is data that cannot be the target tag data 44 or the non-target tag data 45.
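The three-way tagging rule (target, non-target, unknown) can be sketched as follows. The function name `assign_tag`, the Euclidean distance, and the linear scan over the labeled data are assumptions for illustration only; the embodiment performs the neighbor search with LSH.

```python
import math


def assign_tag(vector, labeled, r):
    """Tag a feature vector from labeled data given as (vector, label) pairs.

    Each label is either "target" or "non-target".
    """
    best_dist, best_label = math.inf, None
    for feat, label in labeled:
        d = math.dist(vector, feat)
        if d < best_dist:
            best_dist, best_label = d, label
    # Nothing within the neighborhood distance threshold r: leave it unknown
    # rather than forcing a guess.
    if best_dist > r:
        return "unknown"
    return best_label


# Hypothetical labeled data in a two-dimensional feature space.
labeled = [((0.0, 0.0), "target"), ((10.0, 0.0), "non-target")]
tags = [assign_tag(v, labeled, r=2.0)
        for v in [(1.0, 0.0), (9.0, 0.0), (5.0, 5.0)]]
```

Keeping the third outcome explicit is what lets the user see which regions the current labels do not yet explain.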

In this embodiment, as described above, the grid 23 of an image clip with a target label is displayed in red, and the grid 23 of an image clip with a non-target label is displayed in gray. When the tagging process is executed, each image clip without a label receives one of the target tag 51, the non-target tag 52, or the unknown tag 53. The grid 23 of an image clip with the target tag 51 is displayed in orange, that of an image clip with the non-target tag 52 in blue, and that of an image clip with the unknown tag 53 in white.

また、ユーザがいくつかのラベル付けを行った後、「バッチ検査」ボタン３３をクリックすると、現在画面に表示されているフレーム以降のフレーム画像に対して、連続してタグ付け処理が開始され、途中で停止が指示されない限り、動画像の最終フレームまでタグ付け処理を行う（バッチ処理）。尚、タグ付けの速度は、特徴量データの次元数、LSHの各種パラメータ、および画面内で区切られた画像クリップの個数などに影響される。 Further, when the user clicks the “batch inspection” button 33 after performing some labeling, the tagging process is continuously started for the frame images after the frame currently displayed on the screen, Unless stop is instructed in the middle, tagging processing is performed up to the final frame of the moving image (batch processing). The tagging speed is affected by the number of dimensions of the feature data, the various parameters of LSH, the number of image clips divided in the screen, and the like.

In the interactive moving image monitoring method of the present invention, batch processing can be stopped even while it is running when the user clicks the "frame advance" button 26 or the "normal playback" button 31. The user can then add new labels or correct existing ones according to the analysis status shown concurrently in the visualization region 37. When the tagging process is restarted after labels have been added or corrected, the subsequent frame images are tagged with the new labels taken into account.

このタグ付け処理の状況は、図２に示すように可視化領域３７上に可視化表示されていく。 The status of this tagging process is visualized and displayed on the visualization area 37 as shown in FIG.

図５（ａ）に、ある動画像のフレーム画像の一例と図５（ｂ）にその映像に対し自動タグ付け処理を行った場合に表示される可視化領域３７の拡大図の一例を示す。 FIG. 5A shows an example of a frame image of a certain moving image, and FIG. 5B shows an example of an enlarged view of the visualization region 37 displayed when the automatic tagging process is performed on the video.

可視化領域３７の横軸は時間軸であり、左端が最初のフレームの情報を表し、映像中の時間の推移とともに、右方に移り、最右端が最後のフレームを表す。模様はそれぞれのフレームでの、画像クリップのタグを表し、上述したグリッド２３での枠の色と同様である。本実施形態では、オレンジの枠（図中薄いグレー）がターゲットタグ５１、青の枠（図中濃いグレー）が非ターゲットタグ５２、白い枠が不明タグ５３を表す。 The horizontal axis of the visualization region 37 is a time axis, the left end represents information of the first frame, moves to the right along with the transition of time in the video, and the rightmost end represents the last frame. The pattern represents the tag of the image clip in each frame, and is the same as the frame color in the grid 23 described above. In the present embodiment, the orange frame (light gray in the figure) represents the target tag 51, the blue frame (dark gray in the figure) represents the non-target tag 52, and the white frame represents the unknown tag 53.

この可視化結果は、時間の推移とともに画面内のタグが、どのように変化するかを表すものである。不明タグ５３を示す白い領域は、これまでにユーザによりラベル付けられた画像情報では、ターゲット５１、非ターゲット５２のいずれにもタグ付けできない箇所を意味している。 This visualization result represents how the tag in the screen changes with time. The white area indicating the unknown tag 53 means a portion where neither the target 51 nor the non-target 52 can be tagged in the image information labeled by the user so far.

例えば、図５（ｂ）では、最上段はすべて青色（図中濃いグレー）になっており、画面の上段には全くターゲット（対象物４０）が現れていない事がわかる。また中段は、オレンジ色（図中薄いグレー）の帯が現れており、中段に時々ターゲットが現れることが読み取れる。 For example, in FIG. 5B, it can be seen that the uppermost stage is all blue (dark gray in the figure), and no target (object 40) appears at the upper stage of the screen. In the middle row, an orange (light gray in the figure) band appears, and it can be seen that the target sometimes appears in the middle row.

ここで、静止画像のタグ付けと異なり、映像全体のタグはビットマップディスプレイが高解像度になったとはいえ、一画面に表示できる量ではない。即ち、長時間の映像の場合、フレーム総数は、アプリケーション可視化領域の画素数よりも、はるかに大きいため一つの画素が複数フレームの情報を表示しなければならないこととなる。つまり、可視化領域３７上の一点は、空間的・時間的な多数のタグの重なった表示領域ということになる。 Here, unlike the tagging of still images, the tag for the entire video is not an amount that can be displayed on one screen even though the bitmap display has a high resolution. That is, in the case of a long-time video, the total number of frames is much larger than the number of pixels in the application visualization area, so one pixel must display information of a plurality of frames. That is, one point on the visualization region 37 is a display region where a large number of spatial and temporal tags overlap.

また、解像度が高くても、表示が稠密になれば、人間の視力限界を越えて見えなくなってしまう。もちろんズーム機能を設けたり、可視化マップをスクロール可能にすることで問題を軽減はできるが、広い範囲を一度に見るという要求と、部分を拡大するズームとは両立しない。 Even if the resolution is high, if the display becomes dense, it will be beyond the human eyesight limit. Of course, providing a zoom function or making the visualization map scrollable can alleviate the problem, but the requirement to view a wide area at once and zooming to enlarge the part are not compatible.

そのため、限られた画素数の範囲にタグ付け結果を一覧表示するには、複数のタグの情報を同じ場所に表示する必要がある。本実施形態では、タグに重要度を割り当て、重要な情報を優先的に表示するようにしている。 Therefore, in order to display a list of tagging results within a limited range of the number of pixels, it is necessary to display information on a plurality of tags at the same place. In this embodiment, importance is assigned to tags, and important information is preferentially displayed.

In this embodiment, the importance of the tags is defined as follows.
Importance high: unknown tag
Importance medium: target tag
Importance low: non-target tag
The high-importance "unknown tag" shows the user which points example teaching should focus on, and should be displayed with the highest priority so that objects can be tracked accurately with a minimum number of taught examples. The medium-importance "target tag" represents unknown data closely resembling the few target labels specified by the user. When the estimate is correct, it helps the user check the progress of the labeling work; when it appears in a place the user did not expect, it is important as an indication of a possible misestimate. The "non-target tag" represents data closely resembling the large amount of non-target data specified by the user. Unless the user has made a labeling mistake, there is little need to look at such data, so its importance is the lowest.

また、上記重要度付けは、以下の理由により最適である。例えば、「ターゲットタグ」を「不明タグ」より優先するようにすると、未だタグ付けがなされていない箇所を見過ごすことにつながる。また、同様に「非ターゲットタグ」を「ターゲットタグ」より優先するようにすると、タグ付けの誤推定を見過ごしてしまう。したがって、上述の重要度に基づき可視化処理を行うことで、ラベル付け作業に必要な情報を見落とす可能性を減らすことができる。 Moreover, the above importance ranking is optimal for the following reasons. For example, giving priority to “target tag” over “unknown tag” leads to overlooking a portion that has not yet been tagged. Similarly, if “non-target tag” is given priority over “target tag”, an erroneous estimation of tagging is overlooked. Therefore, by performing the visualization process based on the above-described importance, it is possible to reduce the possibility of overlooking information necessary for the labeling operation.

尚、タグの重要度は上述の例に限られるものではない。例えば、不明タグの不明度により更に重要度を細かく表示してもよい。ここで不明度とは、最近傍のターゲットラベルデータ４２、非ターゲットラベルデータ４３までの距離の大小や近傍距離閾値r以内のターゲットラベルデータ４２、非ターゲットラベルデータ４３の個数等を基準に設定することができる。例えば、最近傍のターゲットラベルデータ４２、非ターゲットラベルデータ４３までの距離が大きい順に不明度を設定し、不明度の大きいものから優先的に表示させるようにしても良い。この場合には、不明タグをその不明度により更に細かく色分けをして表示させるようにすれば良い。 The importance level of the tag is not limited to the above example. For example, the importance may be displayed more finely according to the unknown degree of the unknown tag. Here, the degree of unknown is set based on the distance to the nearest target label data 42 and non-target label data 43, the number of target label data 42 and non-target label data 43 within the neighborhood distance threshold r, and the like. be able to. For example, the degree of unknown may be set in descending order of the distance to the nearest target label data 42 and non-target label data 43, and may be displayed preferentially in descending order of the degree of unknown. In this case, the unknown tag may be displayed with finer color coding depending on the degree of unknown.

Next, the visualization process is described with reference to FIG. 6. FIG. 6(a) shows the image of the i-th frame, where F is the total number of frames of the moving image being handled.

Let Xg be the number of grid columns, Yg the number of grid rows, and (xg, yg) the grid position of the image clip to be visualized. The region R on the visualization region 37 corresponding to that image clip is obtained by Equation 2. The upper-left coordinates of region R are denoted (xv, yv).
<Equation 2>
xv = Xv × (i/F)
w = Xv/F (if w < 1, then w = 1)
Here w is the width of region R.

As shown in FIG. 2, the interface of the interactive image monitoring program of this embodiment lets the user choose, with the "visualization compression method" radio buttons 38, whether projection is performed along the horizontal axis or along the vertical axis. This choice is needed because, as described above, the number of tags attached to a moving image is enormous and they cannot all be displayed on the screen. The height h of region R is obtained by Equation 3 when projecting along the horizontal axis and by Equation 4 when projecting along the vertical axis. Xv is the number of horizontal pixels of the visualization region 37, and Yv is the number of vertical pixels of the visualization region 37.
<Equation 3>
yv = Yv × (yg/Yg)
h = Yv/Yg (if h < 1, then h = 1)
<Equation 4>
yv = Yv × (yg/Xg)
h = Yv/Xg (if h < 1, then h = 1)
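As a worked illustration of Equations 2 and 3 (projection along the horizontal axis), the region R for one image clip could be computed as below. The function name `region_for_clip` is hypothetical, and zero-based indices for the frame number i and the grid row yg are an assumption; the text does not state the indexing convention.

```python
def region_for_clip(i, F, yg, Yg, Xv, Yv):
    """Return (xv, yv, w, h) of region R for projection along the horizontal axis."""
    xv = Xv * (i / F)        # Equation 2: horizontal position follows time
    w = max(Xv / F, 1)       # width is at least one pixel
    yv = Yv * (yg / Yg)      # Equation 3: vertical position follows the grid row
    h = max(Yv / Yg, 1)      # height is at least one pixel
    return xv, yv, w, h


# Hypothetical 800 x 120 pixel visualization region, 100000 frames, 6 grid rows.
region = region_for_clip(i=50000, F=100000, yg=3, Yg=6, Xv=800, Yv=120)
```

Here Xv/F is far below 1, so the width is clamped to one pixel and 125 consecutive frames share each pixel column, which is why several tags must be merged into one displayed value.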

Next, projection along the horizontal axis is described with reference to FIG. 7. As noted above, the number of horizontal pixels of the visualization region 37 is limited, so when the number of frames exceeds the number of pixels, the image information of several frames must be displayed in a single pixel column. The following explains how row j of the k+1 frame images (frames i to i+k) is compressed into one pixel column for display.

First, attention is paid to row j of the frame images (i to i+k), and the importance of the image clips in row j is compared. In this embodiment, the row is represented by the tag of highest importance as defined above. In the example shown, an unknown tag 53 is present, so the corresponding area of the visualization region 37 becomes white, the color of the unknown tag 53.

If the row contains no unknown tag 53 but at least one target tag 51, the corresponding area of the visualization region 37 is mapped to orange, the color of the target tag 51. If the row contains only non-target tags 52, it is mapped to blue, the color of the non-target tag 52. When projecting along the vertical axis, the same processing is performed with column j in place of row j.
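The compression of row j over several frames into one displayed value can be sketched as follows. The names `IMPORTANCE`, `COLOR`, and `compress_row` and the nested-list representation of the frames are assumptions for illustration; the importance order and the colors are those given in the text.

```python
# Importance order of the embodiment: unknown > target > non-target.
IMPORTANCE = {"non-target": 0, "target": 1, "unknown": 2}
COLOR = {"unknown": "white", "target": "orange", "non-target": "blue"}


def compress_row(frames, j):
    """Return (tag, color) representing row j of all given frames.

    frames[f][row][col] holds the tag of one image clip in frame f.
    """
    tags = [tag for grid in frames for tag in grid[j]]
    winner = max(tags, key=lambda t: IMPORTANCE[t])
    return winner, COLOR[winner]


# Two hypothetical frames with a 2 x 2 grid each.
frames = [
    [["non-target", "non-target"], ["non-target", "target"]],
    [["non-target", "non-target"], ["unknown", "non-target"]],
]
row0 = compress_row(frames, 0)
row1 = compress_row(frames, 1)
```

For projection along the vertical axis, the same function would be applied to columns instead of rows.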

このように、限られた可視化領域３７に重要な情報を集約して表示させ、ユーザのラベル付け支援、ひいては少ない教示による高精度の画像監視の実現支援を行うものである。 In this way, important information is gathered and displayed in the limited visualization region 37, so that user labeling is supported, and thus high-precision image monitoring with less teaching is supported.

As for the resulting tag visualization in the visualization region 37, an object 40 that does not change position in the moving image, for example, appears in the visualization region as a straight orange line indicating the target tag (see Example 1, FIG. 15).

これに対し、動画像中で移動する対象物４０の監視であれば、可視化領域に現れるオレンジの線の軌跡により対象物４０の追跡を行うことが可能となる（実施例４、図２２参照）。この場合は、対象物４０の移動が画面の横方向に移動することが多いのか、画面の縦方向に移動することが多いのかにより、射影を行う方向の選択を行えばよい。 On the other hand, if the object 40 moving in the moving image is monitored, the object 40 can be tracked by the locus of the orange line appearing in the visualization region (see Example 4, FIG. 22). . In this case, the direction of projection may be selected depending on whether the movement of the object 40 often moves in the horizontal direction of the screen or the vertical direction of the screen.

In image monitoring, the location and time of the events to be detected are almost never known in advance, and the footage to be detected is only a very small part of the whole. For example, in video shot to investigate nighttime discharge on insulators, most of the footage shows an insulator string in complete darkness, and it is difficult to teach discharge images in advance (see Example 3). That is, there are cases in which teaching the target is difficult.

Even for a moving image in which teaching the target is difficult, the interactive moving image monitoring method of the present invention can, once the user has taught non-targets, attach unknown tags to whatever differs from them and thereby extract it from the video. The target therefore does not necessarily have to be taught in advance.

即ち、ユーザは不明タグが付けられた画像を確認すれば、ターゲットの絞込みを行うことができ、映像全体を注意深く見続けなくても、ターゲットを確実に教示することができる。 That is, the user can narrow down the target by checking the image with the unknown tag, and can reliably teach the target without continuing to watch the entire video carefully.

この場合のタグ付け方法の模式図を、図８に示す。ユーザは、まず映像の中で容易に教示できる非ターゲットをラベル付けする。例えば、放電映像の場合であれば放電の発生していない通常の状態の画像を教示する（図８（ａ））。 A schematic diagram of the tagging method in this case is shown in FIG. The user first labels non-targets that can be easily taught in the video. For example, in the case of a discharge image, an image in a normal state where no discharge is generated is taught (FIG. 8A).

その状態でタグ付け処理を行うと、教示した非ターゲットに類似する画像（放電の無い画像）には、自動的に非ターゲットタグ５２が付けられ、それ以外のすべてに不明タグが付けられる（図８（ｂ））。即ち、真っ暗な状態のままであれば、非ターゲットタグ５２が付されるので、不明タグ５３が付けられたデータには、何らかの現象が発生している可能性がある。 When the tagging process is performed in that state, a non-target tag 52 is automatically attached to an image similar to a taught non-target (an image without discharge), and an unknown tag is attached to all other images (see FIG. 8 (b)). In other words, since the non-target tag 52 is attached if it remains in a completely dark state, there is a possibility that some kind of phenomenon has occurred in the data to which the unknown tag 53 is attached.

よって、この不明タグデータ４６を画像で確認し、問題がなければ非ターゲットラベルを付け、ターゲット（この場合、放電）が映っていればターゲットラベルを付ける（図８（ｃ））。 Therefore, the unknown tag data 46 is confirmed by an image, and if there is no problem, a non-target label is attached, and if the target (in this case, discharge) is reflected, a target label is attached (FIG. 8C).

それからタグ付け処理を行うと、さらに絞り込んだタグ付けがなされる。その後、タグ推定が十分になるまで繰り返す（図８（ｄ））。 Then, when tagging processing is performed, further narrowed tagging is performed. Thereafter, the process is repeated until the tag estimation is sufficient (FIG. 8D).

上述のタグ付け方法は、映像全体に占める通常状態の割合が多く、検出すべき現象の発生頻度が低いほど効率的な方法であり、監視映像に適した方法である。 The tagging method described above is a more efficient method as the proportion of the normal state in the entire video is larger and the occurrence frequency of the phenomenon to be detected is lower, and is a method suitable for monitoring video.

また、上述の例のように、処理開始時にターゲットを教示することが困難な動画像についての近傍距離閾値rの設定は、例えば以下のように行う。 In addition, as in the above-described example, the neighborhood distance threshold r is set for a moving image that is difficult to teach a target at the start of processing, for example, as follows.

The threshold is set to the smallest distance at which the computer estimates a non-target tag for image clips that have no label and contain no target. If the distance r is too small, the tag estimates are almost all unknown (white). Conversely, if the distance is too large, targets that should be detected are missed.

Accurate tag estimation is therefore obtained by setting r to the smallest distance at which every image clip that clearly contains no target becomes blue (non-target tag), that is, the distance below which such clips would change from blue to white (unknown tag).
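In the embodiment the user finds this distance by observing the display. Purely as an illustration of what that distance is, it equals the largest nearest-neighbor distance from the target-free clips to the taught non-target data. The sketch below assumes that a set of clips known to contain no target is available, uses Euclidean distance, and uses the hypothetical function name `minimal_threshold`; none of these are stated in the text.

```python
import math


def minimal_threshold(clean_clips, non_target_labels):
    """Smallest r at which every target-free clip is tagged non-target."""
    # Each clip needs r >= distance to its nearest non-target label;
    # the largest of these requirements is the smallest r that satisfies all.
    return max(min(math.dist(v, u) for u in non_target_labels)
               for v in clean_clips)


# Hypothetical feature vectors of clips that clearly contain no target.
clean = [(0.0, 0.0), (4.0, 0.0)]
taught = [(1.0, 0.0)]
r_min = minimal_threshold(clean, taught)
```

Any r below this value would turn at least one of these clips from blue (non-target tag) to white (unknown tag).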

以下、図９〜１３に示すフローチャートを用いて、本発明の対話型動画像監視プログラムが行う画像認識、解析処理について説明する。 Hereinafter, image recognition and analysis processing performed by the interactive moving image monitoring program of the present invention will be described with reference to flowcharts shown in FIGS.

どの再生モードが選択されているかによって、画像認識、解析処理の内容は異なる。 The contents of the image recognition and analysis processing differ depending on which playback mode is selected.

再生モードが「バッチ処理」モードの場合（Ｓ４；Ｙｅｓ）、バッチ処理（Ｓ５）を行う。 When the reproduction mode is the “batch process” mode (S4; Yes), the batch process (S5) is performed.

図１０のフローチャートを用いて、バッチ処理（Ｓ５）について説明する。 The batch process (S5) will be described with reference to the flowchart of FIG.

先ず、グリッド中の最左、最上の画像クリップを処理対象とする（Ｓ５０１）。 First, the leftmost and uppermost image clips in the grid are processed (S501).

次に、処理対象の画像クリップを予めユーザによって指定された方法（特徴量４７，表色系４８，縮小方法４９，データの次元数５０）でd次元実数ベクトルv（画像特徴量）に変換し（Ｓ５０２）、LSHデータベース１５に対しvをキーとした質問を行ってvに類似するデータの検索を行う（Ｓ５０３）。 Next, the image clip to be processed is converted into a d-dimensional real vector v (image feature amount) by a method designated in advance by the user (feature amount 47, color system 48, reduction method 49, data dimension number 50). (S502) A question using v as a key is made to the LSH database 15 to search for data similar to v (S503).

If the distance between the retrieved data (image feature quantity) and the query v is greater than the preset neighborhood distance threshold r, the tag of the image clip is set to "unknown tag". If the distance is smaller than r, the tag is set to "target tag" when the label of the nearest data is "target label" and to "non-target tag" when that label is "non-target label" (S504).

当該画像クリップのタグをICPデータベースに登録する（Ｓ５０５）。当該画像クリップのタグに応じた色の枠を、ビデオ映像の当該画像クリップの位置に表示する（Ｓ５０６）。 The tag of the image clip is registered in the ICP database (S505). A color frame corresponding to the tag of the image clip is displayed at the position of the image clip of the video image (S506).

可視化領域３７の制御機構に対し、当該フレーム番号、グリッド内の位置、タグを通知し可視化結果の更新を行う（Ｓ５０７）。 The control unit of the visualization area 37 is notified of the frame number, the position in the grid, and the tag, and the visualization result is updated (S507).

図１１に、Ｓ５０７の処理を詳細化したフローチャートを示す。当該フレーム番号とグリッド内の位置から可視化領域上の該当領域Rを計算する（Ｓ５０７−１）。 FIG. 11 shows a flowchart detailing the processing of S507. The corresponding region R on the visualization region is calculated from the frame number and the position in the grid (S507-1).

次に、可視化領域上の該当領域Rに対応するデータマップに登録されたタグTregistの重要度と、当該画像クリップのタグTnewの重要度を比較する（Ｓ５０７−２）。 Next, the importance of the tag Tregist registered in the data map corresponding to the corresponding region R on the visualization region is compared with the importance of the tag Tnew of the image clip (S507-2).

次に、再生モードが「バッチ処理」であり、且つ、他のモードから「バッチ処理」に変更された後、該当領域Rにアクセスするのが最初である場合（Ｓ５０７−３；Ｙｅｓ）は、Ｓ５０７−５へ移る。それ以外の場合（Ｓ５０７−３；Ｎｏ）は、Ｓ５０７−４へ移る。Tnewのタグ重要度がTregistのタグ重要度より大きい場合（Ｓ５０７−４；Ｙｅｓ）は、Ｓ５０７−５へ移る。一方、Tnewのタグ重要度がTregistのタグ重要度と同じまたは小さい場合は、Ｓ５０７の処理は終了する。 Next, when the playback mode is “batch processing” and after changing from another mode to “batch processing”, it is the first time to access the region R (S507-3; Yes). The process moves to S507-5. In other cases (S507-3; No), the process proceeds to S507-4. When the tag importance of Tnew is greater than the tag importance of Tregist (S507-4; Yes), the process proceeds to S507-5. On the other hand, when the tag importance of Tnew is the same as or smaller than the tag importance of Tregist, the process of S507 ends.

Ｓ５０７−５では、可視化領域のデータマップ１７にTnewを登録し、可視化領域３７の該当領域RをTnewに対応する色で塗りつぶして表示し、Ｓ５０７の処理は終了する。 In S507-5, Tnew is registered in the data map 17 of the visualization area, and the corresponding area R of the visualization area 37 is displayed in a color corresponding to Tnew, and the process of S507 ends.
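The update rule of S507-2 to S507-5 can be sketched as follows. The function name `update_data_map`, the dictionary used as the data map, and the treatment of a region with no registered tag as always overwritable are assumptions for illustration.

```python
# Importance order of the embodiment: unknown > target > non-target.
IMPORTANCE = {"non-target": 0, "target": 1, "unknown": 2}


def update_data_map(data_map, region, t_new, first_access_in_batch):
    """Register t_new for a region if allowed; return True when repainted."""
    t_regist = data_map.get(region)
    if (first_access_in_batch                 # S507-3: first access after switching to batch
            or t_regist is None               # nothing registered yet (assumption)
            or IMPORTANCE[t_new] > IMPORTANCE[t_regist]):  # S507-4
        data_map[region] = t_new              # S507-5: register and repaint
        return True
    return False


# Hypothetical data map keyed by the upper-left coordinates of region R.
data_map = {}
painted = [
    update_data_map(data_map, (10, 0), "target", False),
    update_data_map(data_map, (10, 0), "non-target", False),
    update_data_map(data_map, (10, 0), "unknown", False),
    update_data_map(data_map, (10, 0), "non-target", True),
]
```

The first-access override lets a rerun of batch processing replace a stale, more important tag such as an unknown tag that the newly added labels have since resolved.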

図１０のフローチャートの説明に戻る。未処理の画像クリップがグリッド内に存在するかどうか判断し、存在する場合（Ｓ５０８；Ｙｅｓ）は、未処理の画像クリップを処理対象として（Ｓ５０９）、Ｓ５０２の処理へ戻る。すべての画像クリップについて処理が終了したら（Ｓ５０８；Ｎｏ）、バッチ処理は終了し、Ｓ８へ移る。 Returning to the flowchart of FIG. It is determined whether or not an unprocessed image clip exists in the grid. If it exists (S508; Yes), the unprocessed image clip is set as a processing target (S509), and the process returns to S502. When the processing is completed for all the image clips (S508; No), the batch processing ends, and the process proceeds to S8.

次に、再生モードが「通常再生」の場合（Ｓ６；Ｙｅｓ）は、通常再生処理（Ｓ７）を行う。図１２のフローチャートを用いて、通常再生処理（Ｓ７）について説明する。 Next, when the reproduction mode is “normal reproduction” (S6; Yes), normal reproduction processing (S7) is performed. The normal reproduction process (S7) will be described using the flowchart of FIG.

グリッド中の最左、最上の画像クリップを処理対象とし（Ｓ７０１）、処理対象の画像クリップのフレーム番号、グリッド内の位置をキーとして、ICPデータベース１６から、登録済みのラベルもしくはタグを検索する（Ｓ７０２）。 The leftmost and uppermost image clip in the grid is set as the processing target (S701), and the registered label or tag is searched from the ICP database 16 using the frame number of the processing target image clip and the position in the grid as keys (S701). S702).

当該画像クリップのタグもしくはラベルに応じた色の枠を、当該画像クリップの枠の色として表示する（Ｓ７０３）。 A color frame corresponding to the tag or label of the image clip is displayed as the color of the frame of the image clip (S703).

可視化領域の制御機構に対し、当該フレーム番号、グリッド内の位置、タグを通知し、可視化結果を更新する処理（Ｓ７０４）を行う。尚、Ｓ７０４の処理は、上述のＳ５０７の処理（図１１参照）と同じであるので説明は省略する。 The control unit of the visualization area is notified of the frame number, the position in the grid, and the tag, and processing for updating the visualization result is performed (S704). Note that the processing in S704 is the same as the processing in S507 described above (see FIG. 11), and thus the description thereof is omitted.

未処理の画像クリップがグリッド内に存在するかどうか判断し、存在する場合（Ｓ７０５；Ｙｅｓ）は、未処理の画像クリップを処理対象として（Ｓ７０６）、Ｓ７０２の処理へ戻る。すべての画像クリップについて処理が終了したら（Ｓ７０５；Ｎｏ）、通常再生処理は終了し、Ｓ８へ移る。 It is determined whether or not an unprocessed image clip exists in the grid. If it exists (S705; Yes), the unprocessed image clip is set as a processing target (S706), and the process returns to S702. When the processing is completed for all the image clips (S705; No), the normal reproduction processing is terminated, and the process proceeds to S8.

次に、再生モードが「コマ送り」の場合（Ｓ８；Ｙｅｓ）は、コマ送り処理（Ｓ９）を行う。図１３のフローチャートを用いて、コマ送り処理（Ｓ９）について説明する。 Next, when the playback mode is “frame advance” (S8; Yes), a frame advance process (S9) is performed. The frame advance process (S9) will be described with reference to the flowchart of FIG.

First, when a position (xg, yg) in the grid is selected (S901; Yes), the tag or label of that image clip is retrieved by searching the ICP database, and the result is denoted Tp (S902).

Tpが不明タグ、非ターゲットタグ、非ターゲットラベルのいずれかである場合（Ｓ９０３；Ｙｅｓ）、当該画像クリップのラベルを「ターゲット」に変更し、ICPデータベース１６に登録、グリッドの枠の色をターゲットに対応する色に変更（Ｓ９０４）し、Ｓ９０１に戻る。 If Tp is an unknown tag, a non-target tag, or a non-target label (S903; Yes), the label of the image clip is changed to “target” and registered in the ICP database 16, the color of the grid frame is changed to the color corresponding to the target (S904), and the process returns to S901.

一方、Tpがターゲットタグ、ターゲットラベルである場合（Ｓ９０５）は、当該画像クリップのラベルを「非ターゲット」に変更し、ICPデータベース１６に登録、グリッドの枠の色を非ターゲットに対応する色に変更（Ｓ９０６）し、Ｓ９０１に戻る。 On the other hand, if Tp is a target tag or a target label (S905), the label of the image clip is changed to “non-target” and registered in the ICP database 16, the color of the grid frame is changed to the color corresponding to the non-target (S906), and the process returns to S901.

グリッド内の位置（xg,xy）が選択されない場合（Ｓ９０１；Ｎｏ）は、「このフレームを検査」ボタン２７がクリックされたかどうかを判断する（Ｓ９０７）。「このフレームを検査」ボタン２７がクリックされた場合（Ｓ９０７；Ｙｅｓ）は、バッチ処理（Ｓ５）をおこなってからＳ９０１に戻る。 When the position (xg, xy) in the grid is not selected (S901; No), it is determined whether or not the “inspect this frame” button 27 has been clicked (S907). When the “inspect this frame” button 27 is clicked (S907; Yes), the batch processing (S5) is performed, and the process returns to S901.

「このフレームを検査」ボタン２７クリックされていない場合（Ｓ９０７；Ｎｏ）は、「指定ラベルを登録」ボタン２９がクリックされたかどうかを判断し（Ｓ９０８）、クリックされた場合（Ｓ９０８；Ｙｅｓ）は、「コマ送り処理」中に更新された全てのラベルをLSHデータベース１５に登録し（Ｓ９０９）、バッチ処理（Ｓ５）をおこなってからＳ９０１に戻る。 If the “inspect this frame” button 27 has not been clicked (S907; No), it is determined whether the “register specified label” button 29 has been clicked (S908). If it has been clicked (S908; Yes), all labels updated during the frame advance process are registered in the LSH database 15 (S909), the batch processing (S5) is performed, and the process returns to S901.

「指定ラベルを登録」ボタン２９がクリックされていない場合（Ｓ９０８；Ｎｏ）は、「コマ送り」ボタン２６がクリックされたかどうかを判断し（Ｓ９１０）、クリックされた場合（Ｓ９１０；Ｙｅｓ）は、コマ送り処理（Ｓ９）は終了する。 If the “register specified label” button 29 has not been clicked (S908; No), it is determined whether or not the “frame advance” button 26 has been clicked (S910). If it has been clicked (S910; Yes), The frame advance process (S9) ends.

「コマ送り」ボタン２６がクリックされていない場合（Ｓ９１０；Ｎｏ）は、「バッチ検査」ボタン３３がクリックされたかどうかを判断し（Ｓ９１１）、クリックされた場合（Ｓ９１１；Ｙｅｓ）は、再生モードを「バッチ処理」に変更して（Ｓ９１２）、コマ送り処理（Ｓ９）は終了する。 If the “frame advance” button 26 has not been clicked (S910; No), it is determined whether the “batch inspection” button 33 has been clicked (S911). If it has been clicked (S911; Yes), the playback mode is changed to “batch processing” (S912), and the frame advance process (S9) ends.

「バッチ検査」ボタン３３がクリックされていない場合（Ｓ９１１；Ｎｏ）は、「通常再生」ボタン３１がクリックされたかどうかを判断し（Ｓ９１３）、クリックされた場合（Ｓ９１３；Ｙｅｓ）は、再生モードを「通常再生」に変更して（Ｓ９１４）、コマ送り処理（Ｓ９）は終了する。一方、クリックされていない場合（Ｓ９１３；Ｎｏ）は、Ｓ９０１に戻る。 If the “batch inspection” button 33 has not been clicked (S911; No), it is determined whether the “normal playback” button 31 has been clicked (S913). If it has been clicked (S913; Yes), the playback mode is changed to “normal playback” (S914), and the frame advance process (S9) ends. If it has not been clicked (S913; No), the process returns to S901.
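The label toggling performed when a grid cell is clicked (S902 to S906) can be sketched as below. The function name `click_clip`, the string values used for tags and labels, and the `pending` dictionary that collects labels for later registration in the LSH database 15 (S909) are hypothetical and chosen only for illustration; treating a clip with no entry as an unknown tag is likewise an assumption.

```python
TO_TARGET = {"unknown tag", "non-target tag", "non-target label"}
TO_NON_TARGET = {"target tag", "target label"}


def click_clip(icp_db, pending, key):
    """Toggle the label of the clicked clip and remember it for later registration."""
    tp = icp_db.get(key, "unknown tag")      # S902: current tag or label, Tp
    if tp in TO_TARGET:                      # S903; Yes
        new_label = "target label"           # S904
    elif tp in TO_NON_TARGET:                # S905
        new_label = "non-target label"       # S906
    else:
        raise ValueError(f"unexpected tag or label: {tp!r}")
    icp_db[key] = new_label                  # register in the stand-in ICP database
    pending[key] = new_label                 # to be registered in the LSH database at S909
    return new_label
```

Clicking the same clip twice therefore switches it from target back to non-target, which matches the two branches described above.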

図９のフローチャートの説明に戻る。Ｓ５、Ｓ７、Ｓ９のいずれかの処理が終了すると、対象フレーム番号i を i+1に更新する(Ｓ１０）。 Returning to the flowchart of FIG. 9, when any one of the processes S5, S7, and S9 is completed, the target frame number i is updated to i + 1 (S10).

この際に、画像スライダ３９、「<<10s」ボタン３２ａ等によりフレーム番号i'への移動が指示されている場合は対象フレーム番号iをi'に変更する（Ｓ１１）。 At this time, if movement to the frame number i ′ is instructed by the image slider 39, the “<< 10s” button 32a, etc., the target frame number i is changed to i ′ (S11).

最後にシステム終了が指示されているかどうかを判断する（Ｓ１２）。以上で本発明の対話型動画像監視プログラムが実行する処理が終了する。 Finally, it is determined whether or not the system termination is instructed (S12). This completes the processing executed by the interactive moving image monitoring program of the present invention.

尚、上述の実施形態は本発明の好適な実施の例ではあるがこれに限定されるものではなく、本発明の要旨を逸脱しない範囲において種々変形実施可能である。また、上述の演算式は一例であり、本発明の要旨を逸脱しない範囲において種々変形実施可能である。 The above-described embodiment is a preferred embodiment of the present invention, but is not limited thereto, and various modifications can be made without departing from the gist of the present invention. Moreover, the above arithmetic expression is an example, and various modifications can be made without departing from the scope of the present invention.

例えば、本実施形態では、情報探索手法としてLSHを用いているが、他の手法、例えば、分類木やSVM等の他の学習器を用いても良い。その場合は、LSHデータベース１５に替えて、対応するデータベースを構成し処理を行えばよい。 For example, in the present embodiment, LSH is used as an information search method, but other methods such as other learning devices such as a classification tree and SVM may be used. In that case, instead of the LSH database 15, a corresponding database may be configured and processed.

また、可視化領域３７は必ずしも圧縮して一画面上に表示する必要はない。この場合、可視化領域３７をスクロールバーによりスクロール可能とすればよい。また、圧縮表示し、該当箇所を選択することで該当箇所が圧縮前の状態をズームイン表示するようにしても良い。 Further, the visualization area 37 is not necessarily compressed and displayed on one screen. In this case, the visualization area 37 may be scrolled by a scroll bar. Alternatively, the state before compression may be displayed in a zoomed-in display by compressing and selecting the corresponding part.

（実施例１） (Example 1)
本発明の対話型動画像監視プログラムを用いて、照明変動などを模擬した人工的な試験映像への適用実験を行った。 Using the interactive moving image monitoring program of the present invention, an experiment was conducted applying it to an artificial test video simulating illumination fluctuations and the like.

本実験では、図１４に示すように１枚の紙に印刷されたロゴ４０を撮影した動画像（実験映像１）を用いた。ターゲット４０は、ロゴ４０である。紙は、中央に固定した状態で撮影を行ったため、ロゴ４０の位置は変化しない。ロゴ４０を検出することには、特段の困難性はないが、本実験は本発明の対話型動画像監視プログラムが、照明変動に対応することが可能であるか否かを目的とした。 In this experiment, as shown in FIG. 14, a moving image (experimental video 1) obtained by photographing a logo 40 printed on one sheet of paper was used. The target 40 is a logo 40. Since the image was taken with the paper fixed in the center, the position of the logo 40 does not change. Although there is no particular difficulty in detecting the logo 40, the purpose of this experiment was to determine whether or not the interactive moving image monitoring program of the present invention can cope with illumination fluctuations.

本実験における照明は、以下の３つの条件とした。 The illumination in this experiment was set to the following three conditions.
（１）蛍光灯による人工照明 (1) Artificial lighting with fluorescent lamps
（２）室内のブラインドをおろして照明を消した状態 (2) Room blinds lowered and the lights turned off
（３）室内のブラインドを開けて照明を消した状態 (3) Room blinds opened and the lights turned off
上記３状態を連続変化させ、明るさ、色、コントラスト等を変動させながら撮影を行った。尚、実験映像１は４７秒（1,424フレーム）であった。 The video was shot while switching continuously among these three states so that brightness, color, contrast, and so on varied. Experimental video 1 was 47 seconds long (1,424 frames).

本実験では、当該映像を用いて、３人の被験者により、本発明の対話型動画像監視プログラムによりロゴ認識を行った。 In this experiment, logo recognition was performed by the three subjects using the interactive video monitoring program of the present invention using the video.

被験者は、実験映像１を見て、特定のフレームを選択し、最小限の画像クリップに対し、ラベル付けを行った後、本発明の対話型動画像監視プログラムを実行した。これにより、ラベル付けがされていない他のすべての画像クリップにタグ付け処理がなされ、可視化領域３７に監視結果を表示される。 The subject watched the experimental video 1, selected a specific frame, labeled the minimum image clip, and then executed the interactive video monitoring program of the present invention. As a result, all other unlabeled image clips are tagged, and the monitoring result is displayed in the visualization area 37.

更に、被験者は、監視結果を見て必要なフレームの必要な画像クリップに対しラベル付けを再度行い、再度プログラムを実行させる処理ことを繰り返し実行した。本実験では、被験者が正確な監視ができたと判断した時点で処理を終了し、実験を終了した。 Further, the test subject repeatedly performed the process of re-labeling the necessary image clip of the necessary frame by viewing the monitoring result and executing the program again. In this experiment, the process was terminated when the subject determined that accurate monitoring was possible, and the experiment was terminated.

本実験において、計算機の学習度合い、即ち、画像の監視精度を判定するために表２に示す指標を用いた。
In this experiment, the index shown in Table 2 was used to determine the learning level of the computer, that is, the image monitoring accuracy.

学習度合いは、本来付されるべきラベル（以下、正解ラベル）に対して、どのようなタグ付けがされたかで判断することができる。例えば、正解ラベルに対しターゲットタグが付されている数である。 The degree of learning can be determined by what kind of tagging is performed on a label that should be attached (hereinafter, correct label). For example, the number of target tags attached to correct labels.

表２では、TPとTNが多いことが望まれ、FNとFPが多いことはタグの推定精度が低いことを意味する。また、FUは、ターゲットとすべき所を不明としている、いわば見落とし箇所であり、これを減らすことが目的となる。 In Table 2, large TP and TN counts are desirable, while large FN and FP counts mean that the tag estimation accuracy is low. FU counts locations that should have been tagged as target but were tagged unknown, that is, overlooked locations, and the aim is to reduce it.

本実験では、画像特徴量及びLSHには以下のパラメータを用いた。画像特徴量には、表色系はRGBとし、双線形補完法による５×５次元のスケーリング特徴を用いた。また、LSHのパラメータとしては、L=20 , k=10 ,ω=0.4とした。尚、このパラメータの設定は、距離r以内のデータを９０％の確率で正しく検索し、距離r外のデータを５％の確率で誤検出する設定値である。また、近傍距離閾値r=0.15とした。 In this experiment, the following parameters were used for the image feature amount and LSH. For the image feature amount, the color system is RGB, and a 5 × 5 dimensional scaling feature by a bilinear interpolation method is used. The LSH parameters were L = 20, k = 10, and ω = 0.4. This parameter setting is a setting value for correctly searching data within the distance r with a probability of 90% and erroneously detecting data outside the distance r with a probability of 5%. Also, the neighborhood distance threshold r = 0.15.
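The parameters L, k, and ω are the ones commonly used in p-stable (Euclidean) locality-sensitive hashing, so the lookup can be sketched under that assumption. The class name `EuclideanLSH`, the random seed, the three-dimensional toy vectors, and the rule of returning the label of the nearest registered vector within r are all illustrative choices, not details taken from this description.

```python
import math
import random


class EuclideanLSH:
    """Toy p-stable LSH index: L hash tables, each keyed by k quantized projections."""

    def __init__(self, dim, L=20, k=10, w=0.4, seed=0):
        rng = random.Random(seed)
        self.w = w
        # One (direction, offset) pair per hash function: h(v) = floor((a.v + b) / w)
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [{} for _ in range(L)]

    def _key(self, table_no, v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.funcs[table_no]
        )

    def add(self, v, label):
        """Register a labeled feature vector in every table."""
        for t, table in enumerate(self.tables):
            table.setdefault(self._key(t, v), []).append((tuple(v), label))

    def tag(self, v, r):
        """Return the label of the nearest registered vector within distance r, else 'unknown'."""
        best = None
        for t, table in enumerate(self.tables):
            for u, label in table.get(self._key(t, v), []):
                d = math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))
                if d <= r and (best is None or d < best[0]):
                    best = (d, label)
        return best[1] if best is not None else "unknown"
```

Candidates are collected only from buckets that collide with the query in at least one of the L tables and are then filtered by the distance threshold r, which is why a query far from every registered label comes back as unknown.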

実験結果を図１５に示す。（ａ）は、映像中でターゲットを確実にターゲットとして検出できているか否かを示す再現率（recall）、（ｂ）は、非ターゲットを確実に非ターゲットとして検出できているか否かを示す非ターゲット検出率(TNR)が、ラベルの登録数に応じてどのように変化したかを示すグラフである。尚、再現率（recall）は数式５で、非ターゲット検出率(TNR)は数式６で示され、１に近ければ近いほど精度が高いことを意味している。 The experimental results are shown in FIG. 15. (a) is a graph of recall, which indicates whether targets in the video are reliably detected as targets, and (b) is a graph of the non-target detection rate (TNR), which indicates whether non-targets are reliably detected as non-targets, each plotted against the number of registered labels. Recall is given by Equation 5 and the non-target detection rate (TNR) by Equation 6; the closer a value is to 1, the higher the accuracy.
＜数５＞ <Equation 5>
recall = TP / (TP + FN + FU)
＜数６＞ <Equation 6>
TNR = TN / (TN + FP + TU)

また、被験者との比較のため、フレームの選択はランダムに行って、当該フレームでのラベル付けは正確に行うラベル付け作業（以下、ランダム選択という）を行った。本実験における動画像では、対象物であるロゴ４０は、画面の中心のまま動かないので、ロゴ４０が出現しているフレームを与えれば、自動的にランダム選択が可能となる。このランダム選択と被験者との再現率（recall）、非ターゲット検出率(TNR)を比較することにより被験者がフレームの選択を効率よく行うことができたかを確認できる。即ち、被験者の結果とランダム選択の結果が同等であれば、被験者はフレーム選択を無作為に行っていたといえ、被験者の結果がランダム選択の結果より良ければ、被験者は効率的にフレーム選択を行ったといえることになる。 In addition, for comparison with the subject, selection of a frame was performed at random, and a labeling operation (hereinafter referred to as random selection) was performed in which the labeling in the frame was accurately performed. In the moving image in this experiment, the logo 40 as the object does not move at the center of the screen, and therefore, if a frame in which the logo 40 appears is given, a random selection can be automatically made. By comparing this random selection with the recall rate (recall) and non-target detection rate (TNR) of the subject, it can be confirmed whether or not the subject was able to select the frame efficiently. That is, if the result of the random selection is equal to the result of the subject, it can be said that the subject has randomly selected the frame. If the result of the subject is better than the result of the random selection, the subject efficiently selects the frame. It can be said that.

図１５に示されるように、３人の被験者の結果はランダム選択と比較して、早い段階で１に近づいていることがわかる。FP及びFNは、いずれの例でもほぼ０であるので、数式７で示される精度(precision)及び数式８で示される誤検出率(FPR)は、精度≒１、誤検出率≒０であった。 As shown in FIG. 15, the results of the three subjects approach 1 at an earlier stage than random selection. Since FP and FN were almost 0 in every case, the precision given by Equation 7 and the false detection rate (FPR) given by Equation 8 were precision ≈ 1 and false detection rate ≈ 0.
＜数７＞ <Equation 7>
precision = TP / (TP + FP)
＜数８＞ <Equation 8>
FPR = FP / (FP + FU + TN)
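Equations 5 to 8 translate directly into code. The sketch below follows the equations exactly as written (including FU in the denominator of Equation 8); the function name `monitoring_metrics` is hypothetical, and any counts passed to it in an example would be made-up values, not results from the experiments.

```python
def monitoring_metrics(tp, tn, fp, fn, fu, tu):
    """Compute the four indices of Equations 5-8 from tag counts.

    Each denominator is assumed to be non-zero.
    """
    return {
        "recall": tp / (tp + fn + fu),      # Equation 5
        "tnr": tn / (tn + fp + tu),         # Equation 6
        "precision": tp / (tp + fp),        # Equation 7
        "fpr": fp / (fp + fu + tn),         # Equation 8, as written in the text
    }
```

With FP = 0 the function returns precision 1 and FPR 0 regardless of the other counts, which is the situation reported for the experiments.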

また、ラベル登録数が増加するにつれ、可視化領域３７がどのように変化したのかを図１６に示す。図１６は、上からラベル登録数の増加に伴う、可視化領域３７の変化の様子を示すものである。 Also, FIG. 16 shows how the visualization region 37 changes as the number of label registrations increases. FIG. 16 shows how the visualization region 37 changes as the number of registered labels increases from above.

登録ラベルの少ない初期段階では、白い領域（不明タグ）６１が多く、対象物の監視を行えていないことを示しているが、ラベル付けが進むにつれて不明タグが減少し、中心にオレンジ（図中ではグレー）のライン６２が現れる。本実験では、対象物のロゴ４０は画面の中心にあるので、横軸方向に射影した本実験では、可視化領域３７の中心にオレンジのライン６２が現れれば、ロゴ４０の追跡に成功していることを示す。 In the initial stage, when few labels are registered, white areas (unknown tags) 61 dominate, showing that the object is not yet being monitored. As labeling progresses, the unknown tags decrease and an orange line 62 (gray in the figure) appears at the center. Because the logo 40 is at the center of the screen and the visualization is projected in the horizontal axis direction, the appearance of the orange line 62 at the center of the visualization region 37 indicates that the logo 40 is being tracked successfully.

本実験から、照明変動による対象物の色の変化に対応することが可能であることが確認できた。 From this experiment, it was confirmed that it was possible to cope with changes in the color of the object due to illumination fluctuations.

（実施例２） (Example 2)
同様に、図２に示す画像により実験を行った。尚、特に記載のない限り実験は、実施例１と同様の条件下である。 Similarly, an experiment was performed using the image shown in FIG. 2. Unless otherwise noted, the experimental conditions were the same as in Example 1.

本実験では、画像中の缶６３が回転し、側面に貼り付けられたロゴ４０の追跡を行った。即ち、ロゴ４０が缶６３の回転に合わせて見えたり見えなくなったりを繰り返す動画像（実験映像２）である。尚、実験映像２は、１２０秒（3,596フレーム）であった。 In this experiment, the can 63 in the image was rotated and the logo 40 attached to the side surface was tracked. That is, it is a moving image (experimental video 2) in which the logo 40 repeatedly appears and disappears as the can 63 rotates. Experimental video 2 was 120 seconds (3,596 frames).

実施例１と同様に３人の被験者に本発明の対話型画像監視プログラムを実行してもらった結果を図１７に示す。 FIG. 17 shows the result of having three subjects execute the interactive image monitoring program of the present invention in the same manner as in the first embodiment.

実験映像２でもFPとFNは極めて小さい値であり、精度≒１、誤検出率≒０であった。試験映像２では照明変動がないため、背景の変動がなく、最初に数フレーム分に背景を非ターゲットラベルとして登録することで、背景を除外することができた。 Even in the experimental video 2, FP and FN are extremely small values, and accuracy ≈ 1 and false detection rate ≈ 0. Since there was no illumination fluctuation in Test Video 2, there was no background fluctuation, and the background could be excluded by first registering the background as a non-target label for several frames.

また、実験映像２では、缶６３が繰り返し４回転し、その位置も同じであるため、１回転分に適切にラベル付けすることで、残りの映像についても適切にタグ付けを行うことができた。よって、実験映像１に比して、少ないラベル登録数で高い再現率を達成できた。 In experimental video 2, the can 63 rotates four times in the same position, so labeling one rotation appropriately allowed the rest of the video to be tagged appropriately as well. As a result, high recall was achieved with fewer registered labels than for experimental video 1.

（実施例３） (Example 3)
本実験では、碍子の漏れ電流の監視を行った。実験に用いた映像（実験映像３）は、直流送電線の放電騒音防止のための暴露試験として、試験場に設置された直流碍子連を、数ケ月に渡って長期撮影した映像の一部である。 In this experiment, leakage current of insulators was monitored. The video used (experimental video 3) is part of footage of a DC insulator string installed at a test site, recorded over several months as an exposure test for preventing discharge noise on DC transmission lines.

実験映像３の総再生時間は４８分４２秒、総フレーム数８７，５７５フレーム、放電が確認できる夜間の映像である。尚、碍子連の昼間の撮影例を図１８に示す。当該映像でのターゲットは碍子の放電現象であり、映像中から放電が起きた時刻やその頻度を正確に検出する必要がある。 Experimental video 3 has a total playback time of 48 minutes 42 seconds and 87,575 frames, and is nighttime footage in which discharges can be seen. FIG. 18 shows an example of the insulator string photographed in the daytime. The target in this video is the discharge phenomenon on the insulators, and the times at which discharges occur and their frequency must be detected accurately from the video.

実験映像３では、映像のほとんどの時間は放電がなく、変化のない単調な画面が続く。また一回の放電時間は極めて短い（33msec以内)。そのため、例えば検査員が、放電箇所を探しながら注意力を維持して見続けるのはかなりの労力であり、また見落としも多くなることが考えられる。 In experimental video 3, there is no discharge for most of the video, and a monotonous, unchanging picture continues. In addition, a single discharge is extremely short (within 33 msec). Consequently, it takes considerable effort for an inspector, for example, to stay attentive while watching for discharges, and many discharges are likely to be overlooked.

図１９に示すように、最右列の碍子連に対して、横２マス、縦２０マスのグリッド２３を設定した。尚、夜間であるため画面は真っ黒である。 As shown in FIG. 19, a grid 23 of 2 cells horizontally by 20 cells vertically was set on the insulator string in the rightmost column. Because the footage is nighttime, the screen is completely black.

本実験では、画像特徴量はグレースケール（輝度は[0,1]の実数）、平均化による４×４次元のスケーリング特徴とした。LSHのパラメータは実施例１及び２と同じとし、近傍距離閾値r = 0.24とした。 In this experiment, the image feature amount is gray scale (luminance is a real number of [0,1]), and 4 × 4 dimensional scaling feature by averaging. The LSH parameters were the same as those in Examples 1 and 2, and the neighborhood distance threshold r = 0.24.

本実験では、先ず放電の映っていない最初のフレーム画像の４０個の画像クリップすべてに非ターゲットラベルをつけた。そして、このラベル付けのみの状態でタグ付け処理を行った時の可視化領域３７を図２０（ａ）に示す。図２０は、図１９の横方向に射影して得られた可視化領域３７を示すものである。尚、実験映像３の４８分の映像のタグ付けには、Pentium（登録商標）4 3.6GHzの計算機で３２分を要した。 In this experiment, non-target labels were first attached to all 40 image clips of the first frame image, in which no discharge appears. FIG. 20(a) shows the visualization region 37 obtained when the tagging process was run with only these labels. FIG. 20 shows the visualization region 37 obtained by projecting in the horizontal direction of FIG. 19. Tagging the 48-minute experimental video 3 took 32 minutes on a Pentium (registered trademark) 4 3.6 GHz computer.

本実験では、可視化領域３７の横画素数を７２０としたため、横１画素には１２２フレーム、２４４個のタグ情報が集約されている(=87,575×2 / 720）。 In this experiment, since the number of horizontal pixels in the visualization region 37 is 720, 122 frames and 244 pieces of tag information are aggregated in one horizontal pixel (= 87,575 × 2/720).
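The arithmetic above, and one way a single pixel could summarize the tags aggregated into it, can be sketched as follows. Rounding up to 122 frames is an assumption that reproduces the stated figure, and the importance ordering (unknown highest, non-target lowest, target assumed to sit in between) follows the ordering given in the claims; the names `frames_per_pixel`, `pixel_tag`, and `IMPORTANCE` are hypothetical.

```python
import math

# Assumed ranking: unknown > target > non-target.
IMPORTANCE = {"unknown": 2, "target": 1, "non-target": 0}


def frames_per_pixel(total_frames, width_px):
    """Number of frames folded into one pixel column, rounded up."""
    return math.ceil(total_frames / width_px)


def pixel_tag(tags):
    """Pick the tag to display for one pixel: the most important one present."""
    return max(tags, key=IMPORTANCE.__getitem__)


frames = frames_per_pixel(87_575, 720)   # 87,575 / 720 = 121.6..., rounded up
tags_per_pixel = frames * 2              # 2 grid cells per row are projected together
```

Under this ranking a single unknown tag among hundreds of non-target tags still shows up in the compressed display, which is what lets the user find the few frames worth inspecting.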

図２０（ａ）では、非ターゲットタグを表す青色（図中では濃いグレー）がほとんどを占め、不明タグを表す白い領域がところどころに見られる。即ち、ほとんどは放電の無い映像であったということである。また、白い領域には、非ターゲットとは似ていない何かが撮影されている可能性があることを示す。ユーザは、映像全体を見る必要はなく、この白い不明タグのついた画像のみを検査すればよいことになる。尚、ターゲットラベルは一つも登録していないため、オレンジ色で示されるターゲットタグは一切見られない。 In FIG. 20A, the blue color representing the non-target tag (dark gray in the figure) occupies most, and a white region representing the unknown tag is seen in some places. That is, most of the images were images without discharge. The white area indicates that something that is not similar to the non-target may have been shot. The user does not need to see the entire video and only needs to inspect the image with the white unknown tag. Since no target label is registered, no target tag shown in orange is seen.

不明タグのいくつかを映像で確認したところ、図２０（ａ）中の符号６５で示す不明タグの集まりは、いずれも、ビデオテープのノイズであることがわかった。ノイズはビデオテープの傷及びビデオデッキのヘッドが原因であった。 When some of the unknown tags were confirmed by video, it was found that any of the unknown tag collections indicated by reference numeral 65 in FIG. 20A was video tape noise. The noise was caused by scratches on the videotape and the head of the video deck.

これに対し、最下段に途中から現れて、映像の最後付近まで連続して現れている不明タグの連続６６で示す箇所は、放電現象を捉えていることがわかった。 On the other hand, it was found that the portion indicated by a series 66 of unknown tags that appeared from the middle in the lowermost stage and continued to the vicinity of the end of the video captured the discharge phenomenon.

そこで、映像を確認しつつ、ビデオノイズには非ターゲットラベルを、放電箇所にはターゲットラベルを付ける作業を行った。２９７個のラベルをつけた後のタグ付け状態を図２０（ｂ）に、さらに２６６フレーム、４１４個までラベルのラベルをつけた後のタグ付け状態を図２０（ｃ）に示す。 Therefore, while confirming the video, an operation was performed to attach a non-target label to the video noise and a target label to the discharge part. FIG. 20B shows a tagging state after attaching 297 labels, and FIG. 20C shows a tagging state after labeling up to 266 frames and 414 labels.

ラベルが増えるに従って、ノイズが消えて行き、最下段にオレンジ（図中では薄いグレー）のターゲットタグの連続６７が増えていることがわかる。 It can be seen that as the number of labels increases, the noise disappears and the target tag sequence 67 of orange (light gray in the figure) increases at the bottom.

以上のように、本発明の対話型動画像監視プログラムによりビデオ映像の解析を行うと、ユーザによる、ほんの少数の非ターゲット情報を教示するだけで、監視映像中の代表的な放電パターンなど、注目すべき箇所を適切に見出せることが確認できた。 As described above, it was confirmed that when video is analyzed with the interactive moving image monitoring program of the present invention, the user can properly find noteworthy locations in the monitoring video, such as representative discharge patterns, merely by teaching a small amount of non-target information.

更に、事例画像の選択、ラベルの教示作業を容易かつ確実にできことが確認できた。本発明の対話型動画像監視プログラムによれば、碍子の放電映像に限らず、発生頻度が低く、計算機への教示事例を見出すのが難しい長時間監視においても事例教示の作業労力を大幅に低減することができる。 Furthermore, it was confirmed that selecting example images and teaching labels can be done easily and reliably. The interactive moving image monitoring program of the present invention can greatly reduce the effort of teaching examples, not only for insulator discharge video but also in other long-duration monitoring where events are infrequent and examples to teach the computer are hard to find.

（実施例４） (Example 4)
本実験では、ラジコンの自動車の監視、追跡を行った。本実験では、図２１に示すように監視対象物としてラジコンの自動車４０が床を左右に横断して走行する映像（実験映像４）を用いた。尚、実験映像４の総再生時間は６７秒、総フレーム数２，００８フレームである。 In this experiment, a radio-controlled car was monitored and tracked. As shown in FIG. 21, the video used (experimental video 4) shows a radio-controlled car 40, the monitoring object, driving left and right across the floor. Experimental video 4 has a total playback time of 67 seconds and 2,008 frames.

本実験では、画像特徴量をRGB３×３×３=27次元のヒストグラム特徴とし、近傍距離閾値をr = 0.47844とした。 In this experiment, the image feature amount was an RGB 3 × 3 × 3 = 27-dimensional histogram feature, and the neighborhood distance threshold was r = 0.47844.

グリッド２３は横２１×縦１３の細かな碁盤目状とし、可視化圧縮方向３８を横軸方向に設定し、本発明の対話型動画像監視プログラムを実行した。その結果を図２２に示す。尚、ラベルは、ターゲットタグ、非ターゲットタグをあわせて２６個だけ登録した。 The grid 23 has a fine grid pattern of 21 × 13 in the horizontal direction, the visualization compression direction 38 is set in the horizontal axis direction, and the interactive moving image monitoring program of the present invention is executed. The result is shown in FIG. In addition, only 26 labels including target tags and non-target tags were registered.

可視化領域３７には、非ターゲットを示す青色のバック７０にターゲットを示すオレンジ色の傾きをもったライン７１が表示されている。オレンジのライン７１は、自動車４０の追跡結果を示している。尚、不明タグはほとんど存在せず、自動車４０の動きに合わせて追跡結果が表示されており、追跡に成功したことを示している。 In the visualization region 37, a line 71 having an orange inclination indicating a target on a blue back 70 indicating a non-target is displayed. An orange line 71 indicates the tracking result of the automobile 40. There are almost no unknown tags, and the tracking result is displayed in accordance with the movement of the automobile 40, indicating that the tracking is successful.

ここで、オレンジのライン７１が、左下から右上に伸びている場合は、自動車４０は、画面の右から左へ移動したことを表し、左上から右下に伸びている場合は、画面の左から右へ移動したことを表している。また、ライン７１の傾きは、自動車４０の速度や走行コースで変化する。具体的には、自動車４０の画面横方向速度成分が、遅いと傾きが大きくなり、高速だと傾きが小さくなる。 Here, when the orange line 71 extends from the lower left to the upper right, it indicates that the automobile 40 has moved from the right to the left of the screen. When the orange line 71 extends from the upper left to the lower right, from the left of the screen. Indicates that it has moved to the right. Further, the inclination of the line 71 changes depending on the speed of the automobile 40 and the traveling course. Specifically, when the screen horizontal speed component of the automobile 40 is slow, the inclination increases, and when it is high, the inclination decreases.

本実験により、可視化領域３７に表示されるタグ付け状況から監視対象物がどのような動きをしたか、即ち監視対象物の追跡を、ごく少ないユーザによる教示で実現できることが確認できた。 This experiment confirmed that how the monitoring object moved can be read from the tagging status displayed in the visualization region 37, that is, that tracking of the monitoring object can be achieved with very little teaching by the user.

本発明の対話型画像監視装置の一例を示す概略構成図である。 FIG. 1 is a schematic configuration diagram showing an example of the interactive image monitoring apparatus of the present invention.
本発明の対話型動画像監視プログラムのインタフェース画面の一例である。また、実験映像２のフレーム画像の一例である。 FIG. 2 is an example of the interface screen of the interactive moving image monitoring program of the present invention, and is also an example of a frame image of experimental video 2.
タグ付け処理の概念図であり、（ａ）は２次元の画像特徴量空間でのターゲットラベルデータ及び非ターゲットラベルデータを示し、（ｂ）は（ａ）に示したターゲットラベルデータ及び非ターゲットラベルデータに基づいてターゲットタグデータ、非ターゲットタグデータ及び不明タグデータがタグ付けされる様子を示す。 FIG. 3 is a conceptual diagram of the tagging process; (a) shows target label data and non-target label data in a two-dimensional image feature space, and (b) shows how target tag data, non-target tag data, and unknown tag data are tagged based on the target label data and non-target label data shown in (a).
本発明の対話型動画像監視プログラムの画像特徴指定のインタフェース画面の一例である。 FIG. 4 is an example of the image feature specification interface screen of the interactive moving image monitoring program of the present invention.
（ａ）は、対象となる動画像のフレーム画像を、（ｂ）は、当該動画像について可視化処理を行った後に表示される可視化領域を示す。 FIG. 5(a) shows a frame image of the target moving image, and FIG. 5(b) shows the visualization region displayed after the visualization process has been performed on that moving image.
フレーム番号及びグリッド内の位置から可視化領域での該当領域Rを求める方法を説明するための図である。 FIG. 6 is a diagram for explaining a method of obtaining the corresponding region R in the visualization region from the frame number and the position in the grid.
可視化領域を横軸方向に射影して表示する場合の処理方法を説明するための図である。 FIG. 7 is a diagram for explaining the processing method used when the visualization region is displayed projected in the horizontal axis direction.
タグ付け処理の概念図の他の例であり、（ａ）は２次元の画像特徴量空間での非ターゲットラベルデータを示し、（ｂ）は（ａ）に示した非ターゲットラベルデータに基づいて非ターゲットタグデータ及び不明タグデータがタグ付けされる様を示し、（ｃ）は更にターゲットラベルデータがラベル付けされた様子を示し、（ｄ）は（ｃ）に示したターゲットラベルデータ及び非ターゲットラベルデータに基づいてターゲットタグデータ、非ターゲットタグデータ及び不明タグデータがタグ付けされる様子を示す。 FIG. 8 is another example of a conceptual diagram of the tagging process; (a) shows non-target label data in a two-dimensional image feature space, (b) shows how non-target tag data and unknown tag data are tagged based on the non-target label data shown in (a), (c) shows the state after target label data has additionally been labeled, and (d) shows how target tag data, non-target tag data, and unknown tag data are tagged based on the target label data and non-target label data shown in (c).
本発明の対話型動画像監視プログラムが実行する処理全体を示すフローチャートである。 FIG. 9 is a flowchart showing the overall processing executed by the interactive moving image monitoring program of the present invention.
本発明の対話型動画像監視プログラムが実行するバッチ処理の詳細を示すフローチャートである。 FIG. 10 is a flowchart showing details of the batch processing executed by the interactive moving image monitoring program of the present invention.
Ｓ５０７及びＳ７０４の処理の詳細を示すフローチャートである。 FIG. 11 is a flowchart showing details of the processing of S507 and S704.
本発明の対話型動画像監視プログラムが実行する通常再生処理の詳細を示すフローチャートである。 FIG. 12 is a flowchart showing details of the normal playback processing executed by the interactive moving image monitoring program of the present invention.
本発明の対話型動画像監視プログラムが実行するコマ送り処理の詳細を示すフローチャートである。 FIG. 13 is a flowchart showing details of the frame advance processing executed by the interactive moving image monitoring program of the present invention.
実験映像１のフレーム画像の一例である。 FIG. 14 is an example of a frame image of experimental video 1.
実施例１での実験結果を示すグラフであり、（ａ）は登録ラベル数と再現率との関係を示すグラフである。（ｂ）は登録ラベル数と非ターゲット検出率との関係を示すグラフである。 FIG. 15 shows graphs of the experimental results of Example 1; (a) is a graph of the relationship between the number of registered labels and recall, and (b) is a graph of the relationship between the number of registered labels and the non-target detection rate.
実施例１におけるラベル登録数の増加に伴う可視化領域の変化の様子を示す図である。 FIG. 16 is a diagram showing how the visualization region changes as the number of registered labels increases in Example 1.
実施例２での実験結果を示すグラフであり、（ａ）は登録ラベル数と再現率との関係を示すグラフである。（ｂ）は登録ラベル数と非ターゲット検出率との関係を示すグラフである。 FIG. 17 shows graphs of the experimental results of Example 2; (a) is a graph of the relationship between the number of registered labels and recall, and (b) is a graph of the relationship between the number of registered labels and the non-target detection rate.
碍子連の昼間の撮影画像の一例である。 FIG. 18 is an example of an image of the insulator string photographed in the daytime.
碍子連に対して、横２マス縦２０マスのグリッドを設定した画像の一例である。 FIG. 19 is an example of an image in which a grid of 2 cells horizontally by 20 cells vertically is set on the insulator string.
実験映像３に対し、（ａ）非ターゲットラベルのみ４０ラベル、（ｂ）更に、ターゲット又は非ターゲットラベルを２９７ラベル、（ｃ）更に、４１４ラベルをラベル付けしてタグ付け処理を行った場合の、可視化領域を示す。 FIG. 20 shows the visualization region obtained when the tagging process was performed on experimental video 3 after labeling (a) 40 non-target labels only, (b) additionally 297 target or non-target labels, and (c) additionally 414 labels.
実験映像４のフレーム画像の一例である。 FIG. 21 is an example of a frame image of experimental video 4.
実験映像４にグリッドの設定をしたウィンドウ及びタグ付けされた可視化領域を示す。 FIG. 22 shows the window in which the grid was set for experimental video 4 and the tagged visualization region.

Explanation of symbols

１ 対話型映像解析装置 1 Interactive video analysis apparatus
１１ ラベル付け手段 11 Labeling means
１２ タグ付け手段 12 Tagging means
１３ 可視化手段 13 Visualization means
１４ 映像データ 14 Video data
２４ 処理対象領域 24 Processing target area
５１ ターゲットタグ 51 Target tag
５２ 非ターゲットタグ 52 Non-target tag
５３ 不明タグ 53 Unknown tag

Claims

1. An interactive moving image monitoring method for monitoring and tracking a monitoring object in a moving image, the method comprising: a label registration process of setting one or more image clips in a processing target area, within a frame image of the moving image, that is to be subjected to image processing, and registering labeling data indicating whether or not the monitoring object appears in each image clip; a tagging process of obtaining an image feature amount based on color information values of the pixels contained in each image clip and, when the image feature amount of an image clip for which no labeling data is registered lies within a predetermined range, in a high-dimensional space, around each element of the image feature amount of an image clip for which labeling data is registered, tagging the unregistered image clip in accordance with that labeling data; and a visualization process of displaying the result of the tagging process, based on the importance of the tags, together with the frame image of the moving image.

2. The interactive moving image monitoring method according to claim 1, wherein a locality-sensitive hashing (LSH) algorithm is used when registering and searching the labeling data.

3. The interactive moving image monitoring method according to claim 1, wherein the tagging process assigns a target tag to an image clip whose image feature amount lies within a predetermined range, in the high-dimensional space, around each element of the image feature amount of an image clip labeled as showing the monitoring object; assigns a non-target tag to an image clip whose image feature amount lies within a predetermined range, in the high-dimensional space, around each element of the image feature amount of an image clip labeled as not showing the monitoring object; and assigns an unknown tag to an image clip that lies in neither range.

4. The interactive moving image monitoring method according to any one of claims 1 to 3, wherein the visualization process compresses the result of the tagging process for all frame images of the moving image and displays it on one screen.

5. The interactive moving image monitoring method according to claim 4, wherein importance of the tag is highest for the unknown tag and lowest for the non-target tag.

6. An interactive moving image monitoring apparatus for monitoring and tracking a monitoring object in a moving image, the apparatus comprising: label registration means for reading out a frame image of the moving image, setting one or more image clips in a processing target area to be subjected to image processing, and registering in a database labeling data, designated in advance, indicating whether or not the monitoring object is photographed in each image clip; tagging means for obtaining, for each image clip, an image feature amount based on color information values of pixels included in the image clip and, when the image feature amount of an image clip for which the labeling data is not registered in the database lies within a predetermined range of the image feature amount of an image clip for which the labeling data is registered, in a high-dimensional space whose axes are the elements of the image feature amount, performing tagging according to that labeling data and storing the tag in association with the unregistered image clip; and visualization means for displaying the result of the tagging on an output device together with the frame image of the moving image based on the importance of the tag.

7. The interactive moving image monitoring apparatus according to claim 6, wherein a locality-sensitive hashing algorithm is used when registering and searching the labeling data.

8. The interactive moving image monitoring apparatus according to claim 6 or 7, wherein the tagging means assigns a target tag to an image clip whose image feature amount lies within the predetermined fixed range, in the high-dimensional space whose axes are the elements of the image feature amount, of an image clip labeled as showing the monitoring object, assigns a non-target tag to an image clip whose image feature amount lies within the predetermined fixed range of an image clip labeled as not showing the monitoring object, and assigns an unknown tag to an image clip that lies in neither range.

9. The interactive moving image monitoring apparatus according to claim 6, wherein the visualization means compresses the tagging results for all frame images of the moving image and displays them on one screen.

10. The interactive moving image monitoring apparatus according to claim 9, wherein the importance of the tag is highest for the unknown tag and lowest for the non-target tag.

11. An interactive moving image monitoring program for monitoring and tracking a monitoring object in a moving image, the program causing a computer to execute image recognition and analysis processing comprising: setting one or more image clips in a processing target area, which is the area of a frame image of the moving image to be subjected to image processing, and storing the image clips in a main storage device; reading out labeling data, registered in advance in a storage device, indicating whether or not the monitoring object is photographed in each image clip; obtaining an image feature amount based on color information values of pixels included in the image clip; when the image feature amount of an image clip for which the labeling data is not registered lies within a predetermined range of the image feature amount of an image clip for which the labeling data is registered, in a high-dimensional space whose axes are the elements of the image feature amount, tagging the unregistered image clip according to that labeling data; and displaying the result of the tagging on an output device together with the frame image of the moving image based on the importance of the tag.