JP2009289210A

JP2009289210A - Device and method for recognizing important object and program thereof

Info

Publication number: JP2009289210A
Application number: JP2008143743A
Authority: JP
Inventors: Junei Kin; 順暎金; Masakatsu Ota; 昌克太田; Mitsuo Teramoto; 光生寺元
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-05-30
Filing date: 2008-05-30
Publication date: 2009-12-10

Abstract

PROBLEM TO BE SOLVED: To provide an important object recognition device that can easily extract an arbitrary object from an image without the object to be extracted having to be defined beforehand.
SOLUTION: An important object present in a moving image is extracted, and a representative image that summarizes the video is created using that object. In addition, important objects in the user's surroundings are recognized from captured moving images, the state of each object is obtained, and the user is alerted when, for example, the user is inattentive.
COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an important object recognition device, an important object recognition method, and a program therefor that can automatically extract salient objects present in image information.

Conventionally, developing technology that automatically determines a user's situation and provides services suited to that situation has been one of the needs in ubiquitous computing. At present, because few instruments can measure a "situation" directly, most research judges the situation from information such as the position of an object that characterizes the situation, the time, or presence information about the object estimated from a combination of these. However, the correspondence between place and situation is not one-to-one. In a one-room apartment, for example, a person's situation changes in many ways within the same room: the person eats, sleeps, or enjoys entertainment. A more advanced way of judging the "situation" is therefore required.
In recent years, with the development of image processing technology, techniques for automatically detecting people, automobiles, and the like in moving images have appeared. One is disclosed in Patent Document 1, which describes a system that automatically extracts people appearing in a moving image and uses them in a customer database.
Non-Patent Document 1 discloses a technique for creating an image dictionary by preparing a large number of pairs of images and word strings and learning the relationship between them.
Patent Document 1: JP 2004-258764 A
Non-Patent Document 1: Hideki Nakayama and three others, "Ultra-high-speed image recognition and retrieval method using probabilistic structure learning of image-word concept correspondence", The Institute of Electronics, Information and Communication Engineers, IEICE Technical Report, PRMU 2007-147, pp. 65-70, December 2007

The technique of Patent Document 1 performs template matching that assumes a particular target, such as a human face. Template matching is very effective when the target to be detected is clearly defined, but it cannot be extended to other, general objects. It cannot be used when the main object cannot be defined in advance, as in scenes of nature or of a crowded city street.
The technique of Non-Patent Document 1 requires great effort to create the pairs of images and word strings, so it is difficult to apply it to the automatic extraction of a large number of different objects.

Accordingly, an object of the present invention is to provide an important object recognition device, an important object recognition method, and a program therefor that can easily extract an arbitrary object from an image without the object to be extracted having to be defined in advance.

To achieve the above object, the present invention is an important object recognition device comprising an object extraction device, an importance calculation device, and a learning device. The object extraction device comprises: still image extraction means for extracting still images at a predetermined interval from a moving image received as input; feature amount calculation means for calculating image feature amounts of feature points in each of the extracted still images; image division means for dividing each still image into a plurality of image fragments using an image segmentation algorithm; and intra-cluster image fragment feature amount calculation means for clustering the image fragments obtained by the division of each still image on the basis of the calculated image feature amounts, and for recording in a storage unit, for each cluster of image fragments obtained by the clustering, per-fragment feature amount information that associates each clustered image fragment with the image feature amounts of the feature points in that fragment. The importance calculation device comprises: saliency value calculation means for calculating a saliency value indicating the degree of saliency of each pixel in each still image extracted at the predetermined interval; image fragment importance calculation means for calculating the importance of each image fragment on the basis of the saliency values of the pixels in the still image that correspond to that fragment; and gaze information calculation means for calculating, for each image fragment, the user's gaze count or gaze time on the basis of data on the user's gaze points in the moving image and the coordinates of the image fragment in the still image. The learning device comprises: cluster importance calculation means for calculating the importance of each cluster on the basis of the importance of the image fragments belonging to that cluster; per-cluster gaze information calculation means for calculating the gaze count or gaze time of each cluster on the basis of the gaze counts or gaze times of the image fragments belonging to that cluster; and important-object cluster identification means for identifying, among the objects represented by the plurality of clusters, the cluster representing an important object on the basis of the importance of each cluster and the gaze count or gaze time of each cluster.

In the present invention, the important object recognition device described above further comprises an object summary generation device. The object summary generation device comprises: per-cluster feature point list generation means for generating, on the basis of the image feature amounts of the feature points in each still image, a per-cluster feature point list indicating the image feature amounts of the feature points of the image fragments belonging to each cluster; feature point extraction means for extracting, on the basis of the image feature amounts of the feature points of the image fragments belonging to each cluster, a plurality of feature points that are present in a plurality of different image fragments at a ratio equal to or greater than a threshold; and representative image generation means for generating a representative image by combining, on the basis of the extracted feature points, the image fragments that have those feature points.

In the present invention, the important object recognition device described above further comprises a summary display device. The summary display device comprises representative image display means for determining important clusters on the basis of the importance of each cluster and displaying the representative image of each cluster determined to be important on a display unit.

In the present invention, the important object recognition device described above further comprises a user preference analysis device. The user preference analysis device comprises: important image fragment determination means for determining, on the basis of the importance of each cluster and a threshold for that importance, whether each still image extracted at the predetermined interval from the moving image contains an important image fragment; gazed image fragment determination means for determining, on the basis of the gaze count of each cluster and a threshold for the gaze count, or the gaze time of each cluster and a threshold for the gaze time, the image fragment that the user is gazing at in each still image extracted at the predetermined interval from the moving image; and alert information output means for outputting alert information when the important image fragment differs from the image fragment that the user is gazing at.

The present invention is also an important object recognition method for an important object recognition device comprising an object extraction device, an importance calculation device, and a learning device, in which: the feature amount calculation means of the object extraction device extracts still images at a predetermined interval from a moving image received as input; the feature amount calculation means of the object extraction device calculates image feature amounts of feature points in each of the extracted still images; the image division means of the object extraction device divides each still image into a plurality of image fragments using an image segmentation algorithm; the intra-cluster image fragment feature amount calculation means of the object extraction device clusters the image fragments obtained by the division of each still image on the basis of the calculated image feature amounts, and records in a storage unit, for each cluster of image fragments obtained by the clustering, per-fragment feature amount information that associates each clustered image fragment with the image feature amounts of the feature points in that fragment; the saliency value calculation means of the importance calculation device calculates a saliency value indicating the degree of saliency of each pixel in each still image extracted at the predetermined interval; the image fragment importance calculation means of the importance calculation device calculates the importance of each image fragment on the basis of the saliency values of the pixels in the still image that correspond to that fragment; the gaze information calculation means of the importance calculation device calculates, for each image fragment, the user's gaze count or gaze time on the basis of data on the user's gaze points in the moving image and the coordinates of the image fragment in the still image; the cluster importance calculation means of the learning device calculates the importance of each cluster on the basis of the importance of the image fragments belonging to that cluster; the per-cluster gaze information calculation means of the learning device calculates the gaze count or gaze time of each cluster on the basis of the gaze counts or gaze times of the image fragments belonging to that cluster; and the important-object cluster identification means of the learning device identifies, among the objects represented by the plurality of clusters, the cluster representing an important object on the basis of the importance of each cluster and the gaze count or gaze time of each cluster.

The present invention is also a program that causes a computer of an important object recognition device to function as: still image extraction means for extracting still images at a predetermined interval from a moving image received as input; feature amount calculation means for calculating image feature amounts of feature points in each of the extracted still images; image division means for dividing each still image into a plurality of image fragments using an image segmentation algorithm; intra-cluster image fragment feature amount calculation means for clustering the image fragments obtained by the division of each still image on the basis of the calculated image feature amounts, and for recording in a storage unit, for each cluster of image fragments obtained by the clustering, per-fragment feature amount information that associates each clustered image fragment with the image feature amounts of the feature points in that fragment; saliency value calculation means for calculating a saliency value indicating the degree of saliency of each pixel in each still image extracted at the predetermined interval; image fragment importance calculation means for calculating the importance of each image fragment on the basis of the saliency values of the pixels in the still image that correspond to that fragment; gaze information calculation means for calculating, for each image fragment, the user's gaze count or gaze time on the basis of data on the user's gaze points in the moving image and the coordinates of the image fragment in the still image; cluster importance calculation means for calculating the importance of each cluster on the basis of the importance of the image fragments belonging to that cluster; per-cluster gaze information calculation means for calculating the gaze count or gaze time of each cluster on the basis of the gaze counts or gaze times of the image fragments belonging to that cluster; and important-object cluster identification means for identifying, among the objects represented by the plurality of clusters, the cluster representing an important object on the basis of the importance of each cluster and the gaze count or gaze time of each cluster.

According to the present invention, important objects present in a moving image can be extracted, and a representative image corresponding to a representative object in the moving image can be created. In addition, important objects around the user can be recognized from moving images captured in daily life, the state of those objects can be grasped, and the user can be alerted when inattentive.

Hereinafter, an important object recognition device according to a first embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the important object recognition device according to the first embodiment.
In this figure, reference numeral 1 denotes the important object recognition device. The important object recognition device 1 includes an image capturing device 11, an object extraction device 12, an importance calculation device 13, a learning device 14, and a recording device 15. The functions of these devices may all be provided in a single computer device, or they may be distributed across a plurality of computer devices connected to one another by network cables or the like to constitute the important object recognition device 1.

The image capturing device 11 has an environment capturing unit 111 and a line-of-sight measurement unit 112 as functional units, and the object extraction device 12 has a feature point calculation unit 121, a region extraction unit 122, and an object learning unit 123. The importance calculation device 13 has an importance calculation unit 131 and a line-of-sight information conversion unit 132, and the learning device 14 has an importance analysis unit 141, a line-of-sight information analysis unit 142, and a correlation analysis unit 143.

The important object recognition device 1 according to this embodiment first performs processing to learn characteristic objects from images alone, in a situation where no objects have been defined in advance. In this embodiment, "learning" means extracting important objects. In this learning processing, the object extraction device 12 first divides the moving image into frames, and the feature point calculation unit 121 calculates the feature points of each frame. Any feature point may be used as long as it 1) is obtained locally for a specific pixel or part of the image, 2) retains its value after the image is subdivided, and 3) is unaffected by scaling, rotation, and the like.
The region extraction unit 122 then divides each frame into image fragments, each representing a single object or part of an object, on the basis of luminance, saturation, edge information, and the like.
The object learning unit 123 then clusters the image fragments obtained by the region extraction unit 122 on the basis of the feature points contained in the fragments. A clustering algorithm that does not require the number of clusters to be specified, such as hierarchical clustering, is used.
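As an illustration of clustering without a predefined number of clusters, the following is a minimal sketch of threshold-based agglomerative (hierarchical) clustering. The function name `agglomerative_clusters`, the single-linkage rule, and the `max_distance` stopping threshold are assumptions made for this example; the description only states that a hierarchical method or similar is used.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def agglomerative_clusters(
    items: Sequence[T],
    distance: Callable[[T, T], float],
    max_distance: float,
) -> List[List[int]]:
    """Single-linkage agglomerative clustering (illustrative assumption).

    Starts with one cluster per item and repeatedly merges the two closest
    clusters until the closest pair is farther apart than max_distance.
    Returns clusters as lists of item indices.
    """
    clusters: List[List[int]] = [[i] for i in range(len(items))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Single linkage: distance between the closest members of two clusters.
        return min(distance(items[i], items[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > max_distance:
            break  # no remaining pair is close enough to merge
        _, p, q = best
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters
```

The number of clusters falls out of the threshold rather than being fixed in advance, which is the property the description asks for. In practice the `distance` argument would be a fragment-to-fragment distance such as D(a, b).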

Each cluster is a set of images showing a single object or part of an object. With a chair, for example, the backrest and the armrests may be learned as separate objects. Conversely, although a book and a bookshelf are different objects, for recognizing the environment it is sometimes better to recognize them together as a single object, the bookshelf. To absorb this hierarchical structure within and between objects, the clusters are integrated hierarchically using the co-occurrence relationships of the image fragments they contain.

FIG. 2 is a diagram showing the processing flow of the important object recognition device according to the first embodiment.
Next, the processing flow of the important object recognition device according to the first embodiment is described.
First, in the image capturing device 11 of the important object recognition device, the environment capturing unit 111 captures the environment around the user (step S1), and the line-of-sight measurement unit 112 measures the user's gaze position (step S2). The environment capturing unit 111 is, for example, a head-mounted view camera worn on the user's head. The user also wears an eye camera that captures the user's eye movements, and the correspondence between eye movements and coordinates in the view camera image is learned in advance, so that the line-of-sight measurement unit 112 can calculate the user's gaze point in the image captured at each moment. The image of each frame captured by the view camera and the gaze point information at each time obtained from the eye camera are recorded sequentially over time in the acquired data storage unit of the recording device 15.

The feature point calculation unit 121 (still image extraction means, feature amount calculation means) of the object extraction device 12 divides the moving image data received from the image capturing device 11 into frames and extracts the still images to be analyzed at a fixed interval (step S3). The feature point calculation unit 121 then calculates the feature amounts at the local feature points of each extracted still image (step S4) and records, for each still image, a list of the feature amounts of its feature points in the per-still-image feature point list storage unit of the recording device 15. Any local feature amount may be used as long as its value is retained in each of the sub-images after the still image is subdivided. For example, a known technique such as the SIFT feature, a two-dimensional invariant feature widely used in robot vision and panoramic image generation, can be used. A SIFT feature is given as a 128-dimensional numerical vector for each characteristic pixel in the image.
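Step S3 can be sketched as follows. This is a minimal illustration only: the function name `sample_frames` and the choice to start sampling at frame 0 are assumptions, and frames are treated as opaque objects so that any video decoder could supply them.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frames(
    frames: Iterable[Frame], interval: int
) -> Iterator[Tuple[int, Frame]]:
    """Yield (frame_index, frame) for every `interval`-th frame.

    Sampling starts at frame 0 (an assumption; the description only says
    that still images are extracted at a fixed interval).
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    for index, frame in enumerate(frames):
        if index % interval == 0:
            yield index, frame
```

Each sampled frame would then be passed to a local feature extractor (step S4), for example a SIFT implementation that returns one 128-dimensional descriptor per keypoint.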

FIG. 3 is a diagram showing an example of SIFT feature data.
This figure shows still images cut out every n frames and the list of feature points present in each image. When SIFT features are calculated, several hundred to several thousand feature points are identified in an image, and a 128-dimensional numerical vector is given for each feature point. The pair of numbers at the upper left of each entry in the feature point list gives the x and y coordinates of the feature point, with the upper left corner of the image as the origin (for the first feature point, x = 136.15 and y = 500.70). SIFT features are calculated by a method such as that described in D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, 60(2), pp. 91-110 (2004) <http://citeseer.ist.psu.edu/cache/papers/cs/30631/http:zSzzSzwww.cs.ubc.cazSz~lowezSzpaperszSzijcv04.pdf/lowe04distinctive.pdf>.

The region extraction unit 122 (image division means) of the object extraction device 12 then divides each extracted still image into image fragments, each representing a single object or part of an object in the still image, using the image pyramid method, the mean shift method, the watershed algorithm, or the like, on the basis of luminance, saturation, edge information (for example, line segments formed by consecutive pixels at locations where the luminance or saturation differs by a threshold or more), and so on (step S5).

FIG. 4 is a diagram showing an example of image fragments.
As this figure shows, the region extraction unit 122 divides the still image cut out for each frame into image fragments, each representing a single object or part of an object.

Next, the object learning unit 123 (intra-cluster image fragment feature amount calculation means) clusters the image fragments obtained by the region extraction unit 122 on the basis of the feature points contained in the fragments (step S6). A known technique such as hierarchical clustering may be used as the clustering algorithm. In this hierarchical clustering, when SIFT feature points are used, the distance D(a, b) between image fragments a and b can be defined by the following equation (1).

In equation (1), n is the number of feature points matched between image fragment a and image fragment b; nAll_a and nAll_b are the numbers of feature points in image fragments a and b, respectively; a_i and b_i are the i-th pair of matched feature points in image fragments a and b; and D(a_i, b_i) is the Euclidean distance between the feature vectors of a_i and b_i.

The corresponding pairs of feature points are determined as follows:
1) The Euclidean distance is calculated for every pair of feature points in image fragments a and b, and pairs whose distance is at or below a threshold are taken as candidate pairs.
2) For any three candidate pairs of feature points (x_1, y_1), (x_2, y_2), (x_3, y_3), when a single affine transformation exists that maps x_1 to y_1, x_2 to y_2, and x_3 to y_3, these three pairs are regarded as corresponding pairs.
The clustering result can be generated in this way.
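The two matching steps can be sketched as follows. This is an illustrative reading, not the patented implementation: the names `candidate_pairs` and `affine_from_three_pairs` are hypothetical, and treating "a single affine transformation exists" as "the three source points are not collinear, so the transformation is uniquely determined" is an assumption.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def candidate_pairs(
    desc_a: Sequence[Sequence[float]],
    desc_b: Sequence[Sequence[float]],
    max_distance: float,
) -> List[Tuple[int, int, float]]:
    """Step 1: keep every (i, j) whose descriptor distance is <= max_distance."""
    pairs = []
    for i, u in enumerate(desc_a):
        for j, v in enumerate(desc_b):
            d = euclidean(u, v)
            if d <= max_distance:
                pairs.append((i, j, d))
    return pairs


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def affine_from_three_pairs(
    src: Sequence[Point], dst: Sequence[Point]
) -> Optional[Tuple[float, float, float, float, float, float]]:
    """Step 2: solve u = a*x + b*y + c, v = d*x + e*y + f by Cramer's rule.

    Returns (a, b, c, d, e, f), or None when the source points are collinear
    and no unique affine transformation is determined (assumed reading).
    """
    (x1, y1), (x2, y2), (x3, y3) = src
    det = _det3([[x1, y1, 1.0], [x2, y2, 1.0], [x3, y3, 1.0]])
    if abs(det) < 1e-12:
        return None

    def solve(t: Sequence[float]) -> Tuple[float, float, float]:
        t1, t2, t3 = t
        p = _det3([[t1, y1, 1.0], [t2, y2, 1.0], [t3, y3, 1.0]]) / det
        q = _det3([[x1, t1, 1.0], [x2, t2, 1.0], [x3, t3, 1.0]]) / det
        r = _det3([[x1, y1, t1], [x2, y2, t2], [x3, y3, t3]]) / det
        return p, q, r

    a, b, c = solve([p[0] for p in dst])
    d, e, f = solve([p[1] for p in dst])
    return (a, b, c, d, e, f)
```

Here `candidate_pairs` compares descriptors (for SIFT, 128-dimensional vectors), while `affine_from_three_pairs` works on the image coordinates of the keypoints in the three candidate pairs.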

Next, the object learning unit 123 of the object extraction device 12 integrates the clusters hierarchically. When hierarchical clustering is used in the clustering processing described above, its result can be used as is. Alternatively, the co-occurrence relationships of the image fragments can be determined, and clusters that appear together at a rate equal to or greater than a threshold can be integrated into a single cluster.
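The co-occurrence-based integration might look like the sketch below. The description does not define the co-occurrence rate, so this example assumes it is the number of still images in which both clusters appear divided by the number in which either appears; the function name `merge_cooccurring` and the input layout (cluster ID mapped to a set of frame indices) are likewise hypothetical.

```python
from typing import Dict, List, Set


def merge_cooccurring(
    cluster_frames: Dict[str, Set[int]], threshold: float
) -> List[Set[str]]:
    """Group clusters whose pairwise co-occurrence rate is >= threshold.

    cluster_frames maps a cluster ID to the set of still-image (frame)
    indices in which a fragment of that cluster appears. The rate used here
    (intersection over union of frame sets) is an assumption.
    """
    ids = sorted(cluster_frames)
    parent = {c: c for c in ids}

    def find(c: str) -> str:
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            union = cluster_frames[a] | cluster_frames[b]
            if not union:
                continue  # neither cluster appears in any frame
            rate = len(cluster_frames[a] & cluster_frames[b]) / len(union)
            if rate >= threshold:
                parent[find(b)] = find(a)

    groups: Dict[str, Set[str]] = {}
    for c in ids:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())
```

With this rule, a "book" cluster and a "shelf" cluster that nearly always appear in the same still images would be merged into one group, matching the bookshelf example in the description.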

FIG. 5 is a diagram showing an outline of the clustering processing.
As this figure shows, the feature point calculation unit 121 generates feature point lists A and B from two still images A and B, respectively. The feature points of the image fragments of still images A and B are then compared, and clustering is performed by the method described above. As the result of clustering, the object learning unit 123 records in the intra-cluster per-fragment feature point list storage unit of the recording device 15 a feature point list for each image fragment in a cluster (per-fragment feature amount information), which associates the identification information (ID) of each still-image fragment belonging to the cluster with its feature points.

Next, the importance calculation device 13 calculates, from the images and line-of-sight information received from the image capturing device 11, the importance of each image fragment and the gaze time of the object corresponding to each image fragment. First, the importance calculation unit 131 (saliency value calculation means, image fragment importance calculation means) generates a saliency map for each extracted still image, in the same way as the object extraction device 12, and calculates the saliency value at each pixel of the still image. A saliency map is a technique for calculating an index of how readily each part of an image attracts human visual attention. The underlying theory is disclosed in C. Koch and S. Ullman, "Shifts in selective visual attention: Towards the underlying neural circuitry", Human Neurobiology, Vol. 4, pp. 219-227, 1985, and a computational implementation of that theory is disclosed in Laurent Itti and Christof Koch, "Computational Modeling of Visual Attention", Nature Neuroscience Review, Vol. 2, pp. 194-204, 2001.
This embodiment uses the saliency map calculation method described in the latter document. Human vision is known to be drawn to regions where luminance or hue varies sharply within a local area. In the method of Itti and Koch, a plurality of images with different scales are generated from a single image, and the luminance, hue, and direction of luminance change are calculated for each pixel of each image. Then, for each pair of images of different scales, the weighted sum of the differences in luminance, hue, and direction of luminance change at corresponding pixels is accumulated, and the resulting value is taken as the saliency value of the pixel.
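
The idea of accumulating weighted differences between pairs of scales can be shown with a heavily simplified sketch. It is not the Itti and Koch model: it uses luminance only, and it replaces the image pyramid with box blurs of different radii as a stand-in for different scales. The function names, the radii, and the weights are all assumptions made for illustration.

```python
def box_blur(img, radius):
    """Mean filter over a (2*radius+1) square window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def luminance_saliency(img, scale_pairs=((0, 2), (1, 3)), weights=None):
    """Sum weighted |fine - coarse| differences per pixel, normalized to 0..1.

    img is a 2-D list of luminance values; each scale pair is a pair of
    blur radii standing in for a (fine, coarse) pair of scales.
    """
    h, w = len(img), len(img[0])
    weights = weights or [1.0] * len(scale_pairs)
    sal = [[0.0] * w for _ in range(h)]
    for (fine, coarse), wt in zip(scale_pairs, weights):
        f, c = box_blur(img, fine), box_blur(img, coarse)
        for y in range(h):
            for x in range(w):
                sal[y][x] += wt * abs(f[y][x] - c[y][x])
    peak = max(max(row) for row in sal)
    if peak > 0:
        sal = [[v / peak for v in row] for row in sal]
    return sal
```

A single bright pixel on a dark background receives the highest saliency, and a uniform image receives zero everywhere, which matches the intuition that local variation attracts attention.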

FIG. 6 is a diagram showing an outline of the process for calculating the importance of image fragments.
A saliency map is generated as one grayscale image for each still image (color or monochrome), and the luminance value of each pixel (for example, 0 to 255) represents the importance of that pixel. As shown in FIG. 4, a grayscale image is generated from the still image cut out for each frame. In this image, locations with a value close to 1, that is, a color close to white, are locations that readily attract human visual attention. After calculating the saliency value for each pixel of an image fragment, the importance calculation unit 131 calculates and outputs the maximum or the mean of those values as the importance of the image fragment (step S7). As shown in FIG. 6, the output is a table that associates the ID of each image fragment, the ID of the still image to which the fragment belongs, the number assigned to the fragment within the still image, and the importance value of the fragment.
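
Step S7 reduces the per-pixel saliency values of a fragment to a single importance value. A minimal sketch follows, assuming that a fragment is an axis-aligned rectangle given as (left, top, right, bottom) with exclusive right and bottom edges; the description does not fix the fragment shape, and the function name is hypothetical.

```python
def fragment_importance(saliency, box, mode="max"):
    """Importance of one image fragment: max or mean saliency inside its box.

    saliency is a 2-D list indexed as saliency[y][x]; box is an assumed
    (left, top, right, bottom) rectangle with exclusive right/bottom edges.
    """
    left, top, right, bottom = box
    values = [saliency[y][x] for y in range(top, bottom) for x in range(left, right)]
    if not values:
        raise ValueError("fragment box covers no pixels")
    if mode == "max":
        return max(values)
    if mode == "mean":
        return sum(values) / len(values)
    raise ValueError("mode must be 'max' or 'mean'")
```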

Next, the line-of-sight information conversion unit 132 (gaze information calculation means) calculates the number of times each image fragment in each still image is gazed at before the next still image is extracted at the following interval. For example, when the interval is 15 frames, it calculates the gaze count of the corresponding image fragment in the still image from time t to time t+14 (step S8). Because the gaze point captured by the eye camera at each time is recorded in the recording device 15, the line-of-sight information conversion unit 132 reads the user's gaze point for each of the 15 frames in the interval from the recording device 15, determines whether the fragment was gazed at according to whether the gaze point falls on the coordinates of the image fragment in the still image, and calculates the gaze count accordingly.

FIG. 7 is a diagram showing an example of the gaze points in the still image of each frame and the calculated gaze counts.
As shown in FIG. 7, the line-of-sight information conversion unit 132 reads the user's gaze point for each of the 15 frames in the interval from the recording device 15 and generates a list of them; the table on the left is an example of this list. The line-of-sight information conversion unit 132 then determines whether each image fragment was gazed at according to whether the gaze point falls on the coordinates of the fragment in the still image, generates data that associates the ID of the image fragment, the ID of the still image to which it belongs, the number of the fragment within the still image, and the gaze count, and records the data in the per-fragment gaze count storage unit of the recording device 15.
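
The gaze counting of step S8 can be sketched as a point-in-rectangle test over the gaze points of one interval. The rectangle representation of a fragment and the function name are assumptions for illustration.

```python
def count_gazes(gaze_points, fragments):
    """Count, per fragment, how many gaze points of one interval fall inside it.

    gaze_points is a list of (x, y) gaze coordinates, one per frame of the
    interval (15 in the example in the text). fragments maps a fragment ID
    to an assumed (left, top, right, bottom) box with exclusive right/bottom.
    """
    counts = {fid: 0 for fid in fragments}
    for gx, gy in gaze_points:
        for fid, (left, top, right, bottom) in fragments.items():
            if left <= gx < right and top <= gy < bottom:
                counts[fid] += 1
    return counts
```

If the gaze time is preferred over the gaze count, the same counts can be multiplied by the frame duration, assuming a constant frame rate.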

Next, the learning device 14 combines the image fragment and clustering result information obtained by the object extraction device 12 with the importance and gaze count of each image fragment calculated by the importance calculation device 13, and analyzes their correlation. First, in the learning device 14, the importance analysis unit 141 (cluster importance calculation means) calculates the importance of each cluster (step S9). This importance is, for example, the mean of the importance values of the image fragments constituting the cluster after removing values that deviate greatly from their normal distribution. The unit then generates a cluster importance table that associates the ID of each cluster, the importance calculated for the cluster, and the IDs of the image fragments constituting the cluster, and records it in the cluster importance table storage unit of the recording device 15.
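
One way to read "removing values that deviate greatly from the normal distribution" in step S9 is a z-score filter. The sketch below assumes that reading; the cut-off of two standard deviations and the function name are illustrative choices, not values given in the description.

```python
import statistics


def cluster_importance(fragment_importances, z_limit=2.0):
    """Mean fragment importance after dropping values far from the mean.

    A value is dropped when it lies more than z_limit population standard
    deviations from the mean (an assumed outlier rule).
    """
    mean = statistics.mean(fragment_importances)
    spread = statistics.pstdev(fragment_importances)
    if spread == 0:
        return mean
    kept = [v for v in fragment_importances if abs(v - mean) <= z_limit * spread]
    # Fall back to all values if the limit is so tight that nothing survives.
    return statistics.mean(kept or list(fragment_importances))
```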

FIG. 8 is a diagram showing an example of the per-cluster importance table.
The table on the left of this figure shows the importance of each image fragment calculated by the importance calculation unit 131, as in FIG. 6. Using this information, the learning device 14 generates the per-cluster importance table shown on the right.

Also in the learning device 14, the line-of-sight information analysis unit 142 (per-cluster gaze information calculation means) calculates the gaze count for each cluster (step S10). Because the per-cluster gaze count is strongly affected by the user's characteristics and by time, a time-series analysis technique such as trend analysis is required. For example, the maximum gaze count in consecutive scenes (the 15 frames within an interval) is obtained for each image fragment constituting the cluster, and the mean, maximum, minimum, variance, and similar statistics of those values are calculated as the gaze count of the cluster.
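
The example statistics of step S10 can be sketched as follows. The input layout (a list of per-scene gaze counts for each fragment) and the function name are assumptions; the population variance is used because the description does not say which variance is meant.

```python
import statistics


def cluster_gaze_stats(scene_counts_by_fragment):
    """Summarize a cluster's gaze counts from per-fragment peak values.

    scene_counts_by_fragment maps a fragment ID to its gaze counts over
    consecutive scenes; the peak (maximum) of each fragment is taken first,
    then the peaks are summarized across the cluster.
    """
    peaks = [max(counts) for counts in scene_counts_by_fragment.values()]
    return {
        "mean": statistics.mean(peaks),
        "max": max(peaks),
        "min": min(peaks),
        "variance": statistics.pvariance(peaks),
    }
```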

FIG. 9 is a diagram showing an example of the per-cluster gaze count table.
The table on the left of this figure shows the gaze count of each image fragment calculated by the line-of-sight information conversion unit 132, as in FIG. 7. Using this information, the line-of-sight information analysis unit 142 of the learning device 14 generates the per-cluster gaze count table shown on the right.

Also in the learning device 14, the correlation analysis unit 143 (important object cluster identification means) examines the results obtained by the importance analysis unit 141 and the line-of-sight information analysis unit 142 (the per-cluster importance table and the per-cluster gaze count table). It identifies any cluster for which the importance, or the mean or maximum of the peak gaze counts, is equal to or greater than a predefined threshold as a cluster representing image fragments of an important object, and records it in the important object storage unit of the recording device 15 (step S11).
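
Step S11 amounts to a threshold test over the two per-cluster tables. The sketch below assumes a separate threshold for each of the three quantities, which the description leaves open, and uses hypothetical names for the function and the table layout.

```python
def select_important_clusters(importance, gaze_stats, imp_thr, mean_thr, max_thr):
    """Return IDs of clusters that meet at least one of the three thresholds.

    importance maps cluster ID -> importance; gaze_stats maps cluster ID ->
    {"mean": ..., "max": ...} of the peak gaze counts. A cluster missing
    from gaze_stats is judged on importance alone.
    """
    selected = []
    for cid in sorted(importance):
        stats = gaze_stats.get(cid, {})
        if (importance[cid] >= imp_thr
                or stats.get("mean", 0) >= mean_thr
                or stats.get("max", 0) >= max_thr):
            selected.append(cid)
    return selected
```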

FIG. 10 is a diagram showing a list of clusters representing image fragments of important objects.
As shown in this figure, the list associates each cluster ID with the importance of the cluster, the mean, variance, maximum, and minimum of the peak gaze counts, and the image fragments contained in the cluster. The correlation analysis unit 143 records this list in the recording device 15. This completes the processing of the important object recognition apparatus according to the first embodiment.

The processing according to the first embodiment makes it possible to detect objects that are likely to be important from a captured moving image without defining important objects in advance. Although the processing described above uses the gaze count, it may instead use the gaze time, that is, how long the user gazed at an object. The gaze time is the time during which each image fragment in each still image is gazed at before the next still image is extracted at the following interval. As above, when the interval is 15 frames, for example, the gaze time of the corresponding image fragment in the still image from time t to time t+14 is calculated. Because the gaze point captured by the eye camera at each time is recorded in the recording device 15, the line-of-sight information conversion unit 132 only needs to total, for each image fragment, the time during which the gaze point falls on that fragment.

FIG. 11 is a block diagram showing the configuration of the important object recognition apparatus according to the second embodiment.
As shown in this figure, the important object recognition apparatus according to the second embodiment adds an object summary generation device 16 to the apparatus of the first embodiment. The object summary generation device 16 is connected to the object extraction device 12 and the recording device 15. The functions of the image capturing device 11, object extraction device 12, importance calculation device 13, learning device 14, recording device 15, and object summary generation device 16 may be provided in a single computer device, or may be distributed across a plurality of computer devices connected to one another by network cables or the like to constitute the important object recognition apparatus 1. The object summary generation device 16 has two processing units: a feature point learning unit 161 and a representative image generation unit 162.

In the important object recognition apparatus according to the second embodiment, the object summary generation device 16 generates, for each cluster of image fragments calculated by the object extraction device 12, the set of feature points that appear at a rate equal to or higher than a threshold, and creates a representative image by compositing image fragments on the basis of those feature points.

FIG. 12 is a diagram showing the processing flow of the important object recognition apparatus according to the second embodiment.
Next, the processing flow of the important object recognition apparatus according to the second embodiment is described.
In the processing of the important object recognition apparatus according to the embodiment of FIG. 1, the object extraction device 12 has already generated a feature point list for each still image. The feature point learning unit 161 (per-cluster feature point list generation means, feature point extraction means) of the object summary generation device 16 acquires the feature point list of each still image from the object extraction device 12 (step S21) and, based on the clustering result described above, generates a feature point list of the image fragments belonging to each cluster (per-cluster feature point list) (step S22). Then, for each cluster, the feature point learning unit 161 extracts, from the feature points of the image fragments in the cluster, the feature points that are present in a proportion of the image fragments equal to or higher than a threshold (step S23).

As in the first embodiment, this feature point extraction proceeds as follows.
1) The Euclidean distance is calculated for every pair of feature points in image fragments a and b, and pairs whose distance is at or below a threshold are taken as candidate pairs.
2) For any three candidate pairs of feature points (x₁, y₁), (x₂, y₂), (x₃, y₃), when a single affine transformation exists that maps x₁ to y₁, x₂ to y₂, and x₃ to y₃, these three pairs are regarded as corresponding pairs. This operation is applied to every pair of image fragments in the cluster to generate a list of feature point correspondences.
Next, the feature points that appear in the cluster at a rate equal to or higher than the threshold are enumerated as follows.
a) For the j-th feature point of the i-th image fragment, check from the result of step 2) whether it corresponds to a feature point in another image fragment.
b) Repeat step a) while incrementing i and j one at a time, so that the same check is performed for every feature point of every image fragment.
c) For each feature point, count the number of image fragments across which it has correspondences with feature points in other image fragments.
d) A feature point whose correspondences with feature points in other image fragments span a number of image fragments equal to or greater than a threshold is taken as a representative feature point.
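
Steps a) to d) above can be sketched as a counting pass over the correspondence list produced by step 2). The sketch assumes that each feature point is identified by a (fragment index, feature index) pair and that the threshold is expressed as a ratio of the other fragments in the cluster; both are illustrative choices, as is the function name.

```python
def representative_features(correspondences, num_fragments, min_ratio):
    """Pick feature points whose correspondences span enough other fragments.

    correspondences is a list of ((frag_a, feat_a), (frag_b, feat_b)) pairs.
    A feature point is kept when the number of distinct other fragments it
    corresponds to, divided by (num_fragments - 1), is at least min_ratio.
    """
    linked = {}  # feature point -> set of other fragments it matches in
    for (fa, pa), (fb, pb) in correspondences:
        if fa == fb:
            continue  # correspondences inside one fragment do not count
        linked.setdefault((fa, pa), set()).add(fb)
        linked.setdefault((fb, pb), set()).add(fa)
    others = num_fragments - 1
    if others <= 0:
        return []
    return sorted(pt for pt, frags in linked.items()
                  if len(frags) / others >= min_ratio)
```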

Next, the representative image generation unit 162 (representative image generation means) takes the image fragments that contain the extracted feature points, superimposes the extracted feature points, and composites those image fragments (step S24). This produces a representative image for one cluster. To generate the representative image, the image fragments that contain the representative feature points in a number or proportion equal to or greater than a threshold are first extracted. Each fragment is then transformed by rotation, translation, reduction, and enlargement so that the coordinates of the common feature points are aligned, and the mean value of each pixel is calculated to generate the representative image. The object summary generation device 16 records the generated representative image in the representative image storage unit of the recording device 15 in association with the cluster ID.
Through the above processing, a representative image can be generated from the image fragments obtained as a clustering result.

FIG. 13 is a block diagram showing the configuration of the important object recognition apparatus according to the third embodiment.
As shown in this figure, the important object recognition apparatus according to the third embodiment adds a summary display device 17 to the apparatus of the second embodiment. The summary display device 17 is connected to the recording device 15. The functions of the image capturing device 11, object extraction device 12, importance calculation device 13, learning device 14, recording device 15, object summary generation device 16, and summary display device 17 may be provided in a single computer device, or may be distributed across a plurality of computer devices connected to one another by network cables or the like to constitute the important object recognition apparatus 1.

In the important object recognition apparatus according to the third embodiment, the summary display device 17 (representative image display means) uses the importance calculated by the importance calculation device 13 as the criterion to acquire information on the clusters whose image fragments correspond to images of important objects, reads their representative images from the recording device 15, and outputs them to the display unit. For example, for each moving image, the summary display device 17 acquires a fixed number of clusters in descending order of importance and outputs their representative images to the display unit. This processing requires only the clusters of image fragments, the representative images that the object summary generation device 16 generated from the cluster information, and the importance of each cluster. Line-of-sight information is therefore not essential, and representative images can also be output for images captured by an image capturing device 11 without the line-of-sight measurement unit 112 (that is, an ordinary camera or the like) and for artificially created images such as animation. If line-of-sight information is available, clusters may instead be selected on the basis of gaze count or gaze time.
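
Selecting "a fixed number of clusters in descending order of importance" is a plain top-N query. A minimal sketch follows; the function name is hypothetical, and ties are broken by cluster ID as an arbitrary but deterministic choice.

```python
def top_clusters(importance, n):
    """Return the IDs of the n most important clusters, highest first.

    importance maps cluster ID -> importance value. Ties are broken by
    ascending cluster ID so the result is deterministic.
    """
    ranked = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
    return [cid for cid, _ in ranked[:n]]
```

The same function works for selection by gaze count or gaze time if the corresponding per-cluster values are passed in place of the importance values.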

FIG. 14 is a block diagram showing the configuration of the important object recognition apparatus according to the fourth embodiment.
As shown in this figure, the important object recognition apparatus according to the fourth embodiment adds a user preference analysis device 18 to the apparatus of the second embodiment. The user preference analysis device 18 is connected to the image capturing device 11 and the recording device 15. The functions of the image capturing device 11, object extraction device 12, importance calculation device 13, learning device 14, recording device 15, object summary generation device 16, and user preference analysis device 18 may be provided in a single computer device, or may be distributed across a plurality of computer devices connected to one another by network cables or the like to constitute the important object recognition apparatus 1. The user preference analysis device 18 has two functional units: a user preference learning unit 181 and an abnormality detection unit 182.

The important object recognition apparatus according to the fourth embodiment focuses on the fact that the way people direct their gaze differs from person to person and depends heavily on the person's state of mind at the time (concentrating on something, absent-minded, and so on). It learns what kinds of objects the user normally gazes at and issues a warning when, in a given situation, the user fails to gaze at them.

FIG. 15 is a diagram showing the processing flow of the important object recognition apparatus according to the fourth embodiment.
Next, the processing flow of the important object recognition apparatus according to the fourth embodiment is described.
The user preference analysis device 18 analyzes the tendency of the user's gaze direction and alerts the user when it determines that an object that should be looked at is not being gazed at. The processing is divided into a learning phase and an operation phase, and both phases require the function of the line-of-sight information analysis unit 142 of the learning device 14.

First, in the learning phase, the user preference learning unit 181 (important image fragment determination means, gazed image fragment determination means) learns the user's preferences from the importance and the gaze count (or gaze time). Because importance can be regarded as an index of how readily gaze is drawn in general, the discrepancy between the importance and the actual gaze count (or gaze time) can be regarded as the individual difference. Accordingly, when there is a cluster with a low gaze count (or short gaze time) despite high importance, the user preference learning unit 181 determines that the user should be alerted to the presence of an important object. Alternatively, in a service scenario with multiple users in which the information of other users can be used, the learning results of the users can be superimposed to extract a group of objects to which attention should generally be paid.

Specifically, the user preference learning unit 181 reads from the recording device 15 the importance of each cluster from the per-cluster importance table shown in FIG. 8 (step S31) and the gaze count of each cluster from the per-cluster gaze count table shown in FIG. 9 (step S32). When a cluster whose importance is higher than the importance threshold is present in the image but the gaze count of the image fragments represented by that cluster is lower than the gaze count threshold, the unit determines that the user should be alerted to the presence of the object represented by that cluster.
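
The learning-phase rule can be sketched as a filter over the two per-cluster tables. The dictionary layout and the function name are assumptions made for illustration; a cluster that is absent from the gaze count table is treated as having a gaze count of zero.

```python
def clusters_needing_alert(importance, gaze_counts, importance_thr, gaze_thr):
    """Clusters that look important but that the user rarely gazes at.

    importance maps cluster ID -> importance; gaze_counts maps cluster ID ->
    gaze count. A cluster qualifies when its importance is above
    importance_thr and its gaze count is below gaze_thr.
    """
    return sorted(
        cid for cid, value in importance.items()
        if value > importance_thr and gaze_counts.get(cid, 0) < gaze_thr
    )
```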

In the operation phase, the abnormality detection unit 182 (alert information output means) receives moving image data from the image capturing device 11 and determines whether an image region matching an important object exists within the user's field of view. For example, the abnormality detection unit 182 compares the user's gaze point shown in FIG. 7 with the coordinates at which the image fragments of a cluster detected as an important object are located in the still image. If they do not match and the user is wearing a head-mounted display, an alert is shown on the display. If the user is not wearing a head-mounted display, the alert may instead be given by, for example, a warning sound or vibration.

In other words, based on the importance of each cluster and the importance threshold, the user preference analysis device 18 determines whether each still image extracted from the moving image at a predetermined interval contains an important image fragment (step S33). Based on the gaze count of each cluster and its threshold, or the gaze time and its threshold, it also determines which image fragment the user gazes at in each of those still images (step S34). When the important image fragment differs from the image fragment the user gazes at, it outputs alert information (step S35).
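
The comparison in steps S33 to S35 can be sketched for a single still image as follows. The sketch assumes that important fragments are given as rectangles and that an alert is raised when at least one important fragment received no gaze point during the interval; the description only says that the important fragment and the gazed fragment "differ", so this criterion and the function name are assumptions.

```python
def should_alert(important_boxes, gaze_points):
    """True when some important fragment was not gazed at during the interval.

    important_boxes is a list of assumed (left, top, right, bottom) boxes
    with exclusive right/bottom edges; gaze_points is a list of (x, y)
    gaze coordinates for the frames of the interval.
    """
    for left, top, right, bottom in important_boxes:
        gazed = any(left <= gx < right and top <= gy < bottom
                    for gx, gy in gaze_points)
        if not gazed:
            return True
    return False
```

With no important fragment in the image the function returns False, so no alert is raised in scenes that contain nothing to look at.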

The embodiments of the present invention have been described above. The processing described makes it possible to extract important objects present in a moving image and to create representative images corresponding to the representative objects in that moving image. It also makes it possible to recognize important objects around the user from moving images captured in daily life, to grasp the state of those objects, and to alert the user at times of inattention.

Each device in the important object recognition apparatus described above contains a computer system. The steps of each process described above are stored in program form on a computer-readable recording medium, and the processing is performed when a computer reads and executes the program. Here, a computer-readable recording medium means a magnetic disk, magneto-optical disk, CD-ROM, DVD-ROM, semiconductor memory, or the like. The computer program may also be distributed to a computer over a communication line, and the computer that receives it may execute the program.

The "computer-readable recording medium" also refers to portable media such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to storage devices such as a hard disk built into a computer system. It further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that acts as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

A block diagram showing the configuration of the important object recognition device.
A diagram showing the processing flow of the important object recognition device according to the first embodiment.
A diagram showing example data of SIFT feature amounts.
A diagram showing an example of image fragments.
A diagram showing an overview of the clustering process.
A diagram showing an overview of the process for calculating the importance of an image fragment.
A diagram showing an example of the calculated gaze points and numbers of times gazed at in a still image.
A diagram showing an example of the per-cluster importance table.
A diagram showing an example of the per-cluster gazed-count table.
A diagram showing a list of the clusters that represent image fragments of important objects.
A block diagram showing the configuration of the important object recognition device.
A diagram showing the processing flow of the important object recognition device.
A block diagram showing the configuration of the important object recognition device.
A block diagram showing the configuration of the important object recognition device.
A diagram showing the processing flow of the important object recognition device.

Explanation of symbols

1 ... Important object recognition device
11 ... Image capturing device
12 ... Object extraction device
13 ... Importance calculation device
14 ... Learning device
15 ... Recording device
16 ... Object summary generation device
17 ... Summary display device
18 ... User preference analysis device

Claims

An important object recognition device comprising an object extraction device, an importance calculation device, and a learning device, wherein
the object extraction device comprises:
still image extracting means for extracting still images at predetermined intervals from a moving image received as input;
feature amount calculating means for calculating image feature amounts of feature points in each of the extracted still images;
image dividing means for dividing each still image into a plurality of image fragments using an image fragment algorithm; and
intra-cluster image fragment feature amount calculating means for clustering the image fragments obtained by dividing each of the still images, based on the calculated image feature amounts, and for recording in a storage unit, for each cluster of image fragments obtained by the clustering, image fragment feature amount information that associates each clustered image fragment with the image feature amounts at the feature points in that image fragment,
the importance calculation device comprises:
saliency value calculating means for calculating a saliency value indicating the degree of saliency of each pixel in each still image extracted at the predetermined interval;
image fragment importance calculating means for calculating the importance of each image fragment based on the saliency values of the pixels in the still image that correspond to the image fragment; and
gaze information calculating means for calculating, for each image fragment, the number of gazes or the gaze time by the user, based on data on the user's gaze point in the moving image and the coordinates of the image fragment in the still image, and
the learning device comprises:
cluster importance calculating means for calculating the importance of each cluster based on the importance of the image fragments belonging to the cluster;
per-cluster gaze information calculating means for calculating the number of gazes or the gaze time for each cluster based on the number of gazes or the gaze time of the image fragments belonging to the cluster; and
important object corresponding cluster specifying means for specifying, among the objects indicated by the plurality of clusters, a cluster indicating an important object, based on the importance for each cluster and the number of gazes or the gaze time for each cluster.
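
As an illustration of how the importance calculation and learning steps recited above fit together, the following sketch computes a fragment's importance from per-pixel saliency values, aggregates values per cluster, and selects the clusters treated as important objects. The claim does not fix the aggregation or the selection rule: the use of the mean, the rule that both thresholds must be met, and all function names are assumptions made for this example.

```python
from statistics import mean


def fragment_importance(saliency_map, pixels):
    """Importance of one fragment: mean saliency over its (row, col) pixels.

    Assumption: the mean is used; the source only says the importance is
    based on the saliency values of the corresponding pixels.
    """
    return mean(saliency_map[r][c] for r, c in pixels)


def per_cluster_mean(value_by_fragment, cluster_of):
    """Aggregate a per-fragment value (importance or gaze count) per cluster."""
    grouped = {}
    for fragment_id, value in value_by_fragment.items():
        grouped.setdefault(cluster_of[fragment_id], []).append(value)
    return {cluster_id: mean(values) for cluster_id, values in grouped.items()}


def important_clusters(cluster_importance, cluster_gaze,
                       importance_threshold, gaze_threshold):
    """Clusters whose importance and gaze value both reach their thresholds.

    Assumption: a simple conjunction of two thresholds stands in for the
    "based on the importance and the number of gazes or gaze time" rule.
    """
    return sorted(
        cluster_id
        for cluster_id, importance in cluster_importance.items()
        if importance >= importance_threshold
        and cluster_gaze.get(cluster_id, 0) >= gaze_threshold
    )
```

For example, with a 2x2 saliency map `[[0.0, 1.0], [0.5, 0.5]]`, a fragment covering pixels `(0, 1)` and `(1, 0)` would get an importance of 0.75 under this sketch.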

The important object recognition device according to claim 1, further comprising an object summary generation device, wherein
the object summary generation device comprises:
per-cluster feature point list generating means for generating, based on the image feature amounts of the feature points in each of the still images, a feature point list for each cluster that indicates the image feature amounts of the feature points of the image fragments belonging to that cluster;
feature point extracting means for extracting, based on the image feature amounts of the feature points of the image fragments belonging to each cluster, a plurality of feature points that are present in a plurality of different image fragments at a ratio equal to or higher than a threshold; and
representative image generating means for generating a representative image by combining, based on the extracted feature points, the image fragments that have those feature points.
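
The feature point extraction recited above can be sketched as a frequency count over the fragments of one cluster. The sketch assumes that feature points have already been matched across fragments and reduced to hashable keys (for instance, quantized descriptors); that matching step, the function names, and the choice to return fragment identifiers rather than a composited image are assumptions of this example.

```python
from collections import Counter


def shared_feature_points(fragment_features, ratio_threshold):
    """Feature keys present in at least `ratio_threshold` of a cluster's fragments.

    fragment_features maps a fragment id to the set of feature keys
    (hypothetical, already matched across fragments) found in it.
    """
    total = len(fragment_features)
    if total == 0:
        return set()
    # Count each key once per fragment, however often it occurs there.
    counts = Counter(
        key for keys in fragment_features.values() for key in set(keys)
    )
    return {key for key, n in counts.items() if n / total >= ratio_threshold}


def fragments_for_representative(fragment_features, shared):
    """Fragments that contain at least one of the shared feature points.

    These are the candidates to be combined into the representative image;
    the actual image compositing is outside this sketch.
    """
    return sorted(
        fragment_id
        for fragment_id, keys in fragment_features.items()
        if shared & set(keys)
    )
```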

The important object recognition device according to claim 1 or claim 2, further comprising a summary display device, wherein
the summary display device comprises representative image display means for determining an important cluster based on the importance for each cluster and displaying, on a display unit, the representative image of the cluster determined to be the important cluster.

The important object recognition device according to claim 1, further comprising a user preference analysis device, wherein
the user preference analysis device comprises:
important image fragment determining means for determining, based on the importance for each cluster and a threshold of the importance, whether an important image fragment is included in each still image extracted from the moving image at the predetermined interval;
gazed image fragment determining means for determining, based on the number of gazes for each cluster and a threshold of the number of gazes, or on the gaze time and a threshold of the gaze time, the image fragment that the user gazes at in each still image extracted from the moving image at the predetermined interval; and
alert information output means for outputting alert information when the important image fragment differs from the image fragment that the user gazes at.

An important object recognition method in an important object recognition device comprising an object extraction device, an importance calculation device, and a learning device, wherein
still image extracting means of the object extraction device extracts still images at predetermined intervals from a moving image received as input,
feature amount calculating means of the object extraction device calculates image feature amounts of feature points in each of the extracted still images,
image dividing means of the object extraction device divides each still image into a plurality of image fragments using an image fragment algorithm,
intra-cluster image fragment feature amount calculating means of the object extraction device clusters the image fragments obtained by dividing each of the still images, based on the calculated image feature amounts, and records in a storage unit, for each cluster of image fragments obtained by the clustering, image fragment feature amount information that associates each clustered image fragment with the image feature amounts at the feature points in that image fragment,
saliency value calculating means of the importance calculation device calculates a saliency value indicating the degree of saliency of each pixel in each still image extracted at the predetermined interval,
image fragment importance calculating means of the importance calculation device calculates the importance of each image fragment based on the saliency values of the pixels in the still image that correspond to the image fragment,
gaze information calculating means of the importance calculation device calculates, for each image fragment, the number of gazes or the gaze time by the user, based on data on the user's gaze point in the moving image and the coordinates of the image fragment in the still image,
cluster importance calculating means of the learning device calculates the importance of each cluster based on the importance of the image fragments belonging to the cluster,
per-cluster gaze information calculating means of the learning device calculates the number of gazes or the gaze time for each cluster based on the number of gazes or the gaze time of the image fragments belonging to the cluster, and
important object corresponding cluster specifying means of the learning device specifies, among the objects indicated by the plurality of clusters, a cluster indicating an important object, based on the importance for each cluster and the number of gazes or the gaze time for each cluster.
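
The gaze information step of the method (gaze points plus fragment coordinates giving a per-fragment gaze count or gaze time) can be sketched as below. The representation of a fragment's coordinates as an axis-aligned bounding box, the fixed sampling period used to convert counts into time, and the function names are assumptions of this example; the source does not specify them.

```python
def count_gazes(gaze_points, fragment_boxes):
    """Count how many gaze points fall inside each fragment.

    gaze_points: iterable of (x, y) gaze coordinates in the still image.
    fragment_boxes: fragment id -> (x0, y0, x1, y1), a hypothetical
    bounding box with x0 <= x < x1 and y0 <= y < y1 counted as inside.
    """
    counts = {fragment_id: 0 for fragment_id in fragment_boxes}
    for x, y in gaze_points:
        for fragment_id, (x0, y0, x1, y1) in fragment_boxes.items():
            if x0 <= x < x1 and y0 <= y < y1:
                counts[fragment_id] += 1
    return counts


def gaze_time(counts, sample_period_s):
    """Convert gaze counts to gaze time, assuming one gaze sample per period."""
    return {fragment_id: n * sample_period_s for fragment_id, n in counts.items()}
```

Per-cluster values would then follow by summing or averaging these per-fragment results over the fragments that belong to each cluster.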

A program for causing a computer of an important object recognition device to function as:
still image extracting means for extracting still images at predetermined intervals from a moving image received as input;
feature amount calculating means for calculating image feature amounts of feature points in each of the extracted still images;
image dividing means for dividing each still image into a plurality of image fragments using an image fragment algorithm;
intra-cluster image fragment feature amount calculating means for clustering the image fragments obtained by dividing each of the still images, based on the calculated image feature amounts, and for recording in a storage unit, for each cluster of image fragments obtained by the clustering, image fragment feature amount information that associates each clustered image fragment with the image feature amounts at the feature points in that image fragment;
saliency value calculating means for calculating a saliency value indicating the degree of saliency of each pixel in each still image extracted at the predetermined interval;
image fragment importance calculating means for calculating the importance of each image fragment based on the saliency values of the pixels in the still image that correspond to the image fragment;
gaze information calculating means for calculating, for each image fragment, the number of gazes or the gaze time by the user, based on data on the user's gaze point in the moving image and the coordinates of the image fragment in the still image;
cluster importance calculating means for calculating the importance of each cluster based on the importance of the image fragments belonging to the cluster;
per-cluster gaze information calculating means for calculating the number of gazes or the gaze time for each cluster based on the number of gazes or the gaze time of the image fragments belonging to the cluster; and
important object corresponding cluster specifying means for specifying, among the objects indicated by the plurality of clusters, a cluster indicating an important object, based on the importance for each cluster and the number of gazes or the gaze time for each cluster.