JP2008199136A

JP2008199136A - Similar video search apparatus and similar video search method

Info

Publication number: JP2008199136A
Application number: JP2007029905A
Authority: JP
Inventors: Masayuki Nakazawa (中沢正幸)
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2007-02-09
Filing date: 2007-02-09
Publication date: 2008-08-28
Anticipated expiration: 2027-02-09
Also published as: JP5009638B2

Abstract

PROBLEM TO BE SOLVED: To compare a reference video or image with a search-target video or image after converting both into suitable feature values, and to obtain search results that agree with the human sense of similarity. SOLUTION: The apparatus decomposes the decoded video into brightness and hue. For brightness, it applies a correction based on a spatio-temporal filter and then classifies the result by light-dark contrast. For hue, it applies a spatio-temporal filter correction where needed and then classifies the result by color tone. Feature values are calculated from these classification results, and similar videos are retrieved based on the degree of agreement between feature values. COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to video search technology, and more particularly to a similar video search apparatus and a similar video search method for finding similar scenes among a plurality of videos.

In recent years, advances in network technology and growth in data storage capacity have made it possible to hold large amounts of video content and to view content from such collections. Video content is expected to keep growing: in addition to the large archives held by content holders such as broadcasters and production companies, there is the large amount of content stored on widely used devices such as hard disk recorders, as well as the enormous amount of video content on the Internet. Efficient video search technology is therefore increasingly in demand.

At present, the most common form of video search is keyword search. For example, keywords are assigned in advance to each scene of the videos to be searched, and a desired scene is then retrieved using those keywords as clues. This method requires keywords to be assigned to scenes beforehand, and that work is generally done by hand. Assigning keywords to a large volume of video data takes enormous effort. Moreover, even if keywords are assigned automatically, the keywords differ depending on which information in the scene is focused on, so it is difficult to assign them according to a consistent standard.

Another video search approach uses a scene in a video, or an image contained in that scene, as the search clue itself. In this approach, feature values are extracted from the individual image data of the scenes in the search-target videos, and similar scenes, or the images contained in them, are extracted as search results by comparing those feature values.

In general, information such as shading, color, texture, and the shape, arrangement, and content of the objects that make up an image shows characteristic patterns that depend on the scene and on the images it contains. Such information about color and edges is therefore an important clue for searching, classifying, and identifying scenes, the images contained in them, and the objects they show.

As conventional video search methods that use this kind of information about the images in a video scene (shading, color, texture, and the shape, arrangement, and content of objects), various techniques have been proposed that search for images similar to a reference image. These techniques use each image's shading layout information, edge pattern information, color histograms, and the like in a color system such as RGB (red, green, and blue as the three primary colors) or HSV (hue, saturation, and luminance in the Munsell color system).
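
As a rough illustration of a color-histogram comparison of the kind mentioned above, the following sketch quantizes RGB pixels into coarse bins and compares two normalized histograms. The source does not name a specific measure; histogram intersection, the bin count, and the function names `color_histogram` and `histogram_intersection` are assumptions chosen only for illustration.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram over coarse RGB bins.

    `pixels` is an iterable of (r, g, b) tuples with 8-bit channels.
    `bins_per_channel` is an assumed setting and should divide 256.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_id: n / total for bin_id, n in counts.items()}

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(v, h2.get(bin_id, 0.0)) for bin_id, v in h1.items())

# Hypothetical pixel data for two small images.
sky = [(10, 20, 200)] * 3 + [(250, 250, 250)]
sea = [(10, 30, 210)] * 2 + [(0, 120, 0)] * 2
similarity = histogram_intersection(color_histogram(sky), color_histogram(sea))
```

A measure like this compares only the overall color distribution, which is one reason such signal-level similarity can disagree with how a person judges two scenes.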

For example, Patent Document 1 (JP 2000-48181 A) discloses a method that automatically weights each of a plurality of feature values of an image and compares the similarity between images based on the weighted feature values. The method extracts a plurality of feature values from both the reference image and each search-target image. The mean saturation C_0 of the reference image, the peak value P_0 of its spectrum image, and its edge strength E_0 are given as example feature values, and the weight applied to each feature value is determined from the feature values extracted from the reference image. For example, a weight W_c for the color feature and a weight W_t for the texture feature are determined. The larger the mean saturation C_0 of the reference image, the larger the color weight W_c is set; the larger the peak value P_0 of the spectrum image and the edge strength E_0 of the reference image, the larger the texture weight W_t is set.

The color similarity R_c and the texture similarity R_t are each obtained separately by known methods. For example, R_c is obtained from the mean saturation C_0 of the key image and the mean saturation C_j of the search-target image. R_t is obtained from the peak value P_0 and edge strength E_0 of the key image's spectrum image and the peak value P_j and edge strength E_j of the search-target image's spectrum image. The overall similarity R_a is defined as

$$R_a = \frac{\sum_k W_k R_k}{\sum_k W_k}$$

where W_k is the weight for the k-th feature value of the key image, R_k is the similarity between the key image and the search-target image for the k-th feature value, and Σ denotes the sum over the index k.

In this method, the overall similarity is calculated from the color weight W_c, the texture weight W_t, the color similarity R_c, and the texture similarity R_t as R_a = (W_c R_c + W_t R_t) / (W_c + W_t). It is then determined whether the overall similarity exceeds a predetermined threshold; if it does, the key image and the search-target image are judged to be similar.
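
The weighted combination and threshold test described for Patent Document 1 can be sketched as follows. This is a minimal illustration, not the cited patent's implementation; the function names and the numeric weights, similarities, and threshold are hypothetical.

```python
from typing import Sequence

def overall_similarity(weights: Sequence[float], similarities: Sequence[float]) -> float:
    """Weighted mean R_a = sum_k(W_k * R_k) / sum_k(W_k)."""
    if len(weights) != len(similarities):
        raise ValueError("weights and similarities must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the sum of the weights must be positive")
    return sum(w * r for w, r in zip(weights, similarities)) / total_weight

def is_similar(weights: Sequence[float], similarities: Sequence[float], threshold: float) -> bool:
    """True when the overall similarity exceeds the threshold."""
    return overall_similarity(weights, similarities) > threshold

# Hypothetical values: a highly saturated key image, so color outweighs texture.
w_c, w_t = 3.0, 1.0   # color and texture weights (assumed)
r_c, r_t = 0.5, 0.25  # color and texture similarities (assumed)
r_a = overall_similarity([w_c, w_t], [r_c, r_t])  # (1.5 + 0.25) / 4.0
```

Dividing by the sum of the weights keeps R_a in the same range as the individual similarities, so one threshold can be used regardless of how the weights are scaled.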

By using feature values weighted according to the characteristics of the reference image in this way, the search can emphasize color for a reference image that has weak texture periodicity and edge strength but high saturation. Conversely, it can emphasize texture for a reference image that has low saturation but strong texture periodicity and edges.

Patent Document 2 (JP H05-174072 A) and Patent Document 3 (JP H07-46517 A) propose similar video search techniques that use the shot length or scene length of video content as the feature value. In these methods, a partial video delimited by scene changes (cut points) is defined as one shot, and its duration is used as the feature value. Matching is performed by comparing sequences of shot lengths element by element.
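
The shot-length approach can be sketched as below, assuming the cut points are already known. The matching tolerance `tol`, the sliding-window search, and the function names are assumptions for illustration; the cited documents' exact matching procedure is not given in this text.

```python
from typing import List, Sequence

def shot_lengths(cut_times: Sequence[float]) -> List[float]:
    """Durations (in seconds) between consecutive cut points."""
    return [b - a for a, b in zip(cut_times, cut_times[1:])]

def find_matches(query: Sequence[float], target: Sequence[float], tol: float = 0.1) -> List[int]:
    """Offsets in `target` where every shot length matches `query` within `tol`.

    `tol` is an assumed tolerance in seconds, not a value from the source.
    """
    n = len(query)
    if n == 0:
        return []
    return [
        i
        for i in range(len(target) - n + 1)
        if all(abs(q - t) <= tol for q, t in zip(query, target[i:i + n]))
    ]

# Hypothetical cut points (seconds) of a search-target video.
target = shot_lengths([0.0, 2.0, 5.0, 6.5, 10.5, 12.5, 15.5])
query = [1.5, 4.0, 2.0]                # shot lengths of the reference clip
matches = find_matches(query, target)  # offsets of matching shot sequences
```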

In the technique disclosed in Patent Document 4 (JP 2000-123173 A), image data is divided into a plurality of small regions and image shading information is detected in each region. A shading layout based on the detected information is then quantified, and the quantified shading layout is extracted as an image feature of the whole image or of each small region. Using this image feature, the document proposes a method for judging whether image data stored in an image database is similar to the image data being compared. It also provides various service menus so that similar images can be searched by a wider range of criteria, improving convenience.
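
A block-wise shading layout of the kind described for Patent Document 4 might look like the following sketch. The use of the mean intensity per cell, the grid size, and the mean-absolute-difference comparison are assumptions; the text does not specify how the layout is quantified or compared.

```python
def shading_layout(image, rows, cols):
    """Mean intensity of each cell in a rows x cols grid over a 2-D grayscale image.

    `image` is a list of equal-length rows of numbers; it must have at least
    `rows` rows and `cols` columns so that no grid cell is empty.
    """
    h, w = len(image), len(image[0])
    if h < rows or w < cols:
        raise ValueError("image is smaller than the requested grid")
    layout = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        layout_row = []
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            layout_row.append(sum(cell) / len(cell))
        layout.append(layout_row)
    return layout

def layout_distance(a, b):
    """Mean absolute difference between two layouts of the same shape (assumed measure)."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

# Hypothetical 4x4 images: dark top half, bright bottom half.
img_a = [[0] * 4, [0] * 4, [200] * 4, [200] * 4]
img_b = [[0] * 4, [0] * 4, [180] * 4, [180] * 4]
layout_a = shading_layout(img_a, 2, 2)
distance = layout_distance(layout_a, shading_layout(img_b, 2, 2))
```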

Patent Document 5 (JP 2002-63208 A) proposes an image search apparatus and an image search method that can efficiently find images similar to a reference image on the basis of human visual characteristics. The method provides hue filters that each respond to a different hue. It obtains the degree of similarity by summing, over the filters, the differences between the mean and standard deviation of each filter's response to the reference image and the mean and standard deviation of its response to the comparison image. The method was devised for the following reasons.
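
The mean and standard-deviation comparison described for Patent Document 5 can be sketched as follows. The triangular filter shape, the six filter centers, the 60-degree width, and the function names are all assumptions made for illustration; the text states only that the filters respond to different hues.

```python
from statistics import mean, pstdev

def hue_response(hue_deg, center_deg, width_deg=60.0):
    """Triangular tuning curve on the hue circle (an assumed filter shape)."""
    d = abs(hue_deg - center_deg) % 360.0
    d = min(d, 360.0 - d)  # circular distance in degrees
    return max(0.0, 1.0 - d / width_deg)

def hue_signature(hues, centers=(0, 60, 120, 180, 240, 300)):
    """(mean, standard deviation) of each filter's response over all pixel hues."""
    signature = []
    for center in centers:
        responses = [hue_response(h, center) for h in hues]
        signature.append((mean(responses), pstdev(responses)))
    return signature

def hue_dissimilarity(sig_a, sig_b):
    """Sum of mean and standard-deviation differences; smaller means more similar."""
    return sum(abs(ma - mb) + abs(sa - sb) for (ma, sa), (mb, sb) in zip(sig_a, sig_b))

# Hypothetical per-pixel hues (degrees) for three small images.
reddish = hue_signature([0, 10, 350, 5])
also_reddish = hue_signature([5, 15, 355, 0])
bluish = hue_signature([240, 250, 230, 245])
```

Because the sum grows as the images differ, the value is named a dissimilarity here; a similarity score could be derived from it, for example by thresholding.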

First, image coding methods that use the DCT or similar transforms compress and encode image data from a signal-processing standpoint, and of the human visual characteristics they take only spatial frequency sensitivity into account. Second, image search methods based on color spaces such as RGB and HSV find similar images using a degree of similarity measured from a signal-processing standpoint. The results therefore often differ from the visual impression a person has when actually viewing the images, and images similar to the reference image cannot be extracted on the basis of human visual characteristics.

Patent Document 6 (JP H08-194714 A) proposes two techniques for searching compressed video. The first reduces the processing needed to extract motion information from compressed video, which shortens the time between inputting the compressed video to the search apparatus and being able to search it, and so improves the convenience of the apparatus. The second automatically extracts not only the motion of objects appearing in the video but also the size and shape of the moving objects, and uses these as search information.

This technique targets video compressed with the MPEG-1 and MPEG-2 schemes, in which the compressed video carries motion vector data generated by a known motion compensation algorithm to improve compression along the time axis. Normally, motion information is obtained by decoding the compressed video into uncompressed video and then extracting motion information from it. Decoding compressed video requires a great deal of computation, and extracting motion information from uncompressed video also requires a great deal of computation because it is performed for each block into which a frame is divided. Extracting motion information from compressed video therefore takes considerable processing, so there is a delay between inputting the compressed video and being able to search by motion information, and the video cannot be searched immediately. The invention described in Patent Document 6 skips the step of decoding the compressed video into uncompressed video before extracting motion information. Instead, it extracts motion vectors directly from the compressed video and generates shape data and search information from them, which reduces the amount and time of computation and improves the convenience of the compressed-video search apparatus.

Patent Document 1: JP 2000-48181 A
Patent Document 2: JP H05-174072 A
Patent Document 3: JP H07-46517 A
Patent Document 4: JP 2000-123173 A
Patent Document 5: JP 2002-63208 A
Patent Document 6: JP H08-194714 A
Non-Patent Document 1: Mitsuo Ikeda (池田光男), Visual Psychophysics, Morikita Publishing, 1975
Non-Patent Document 2: The Vision Society of Japan (ed.), Handbook of Visual Information Processing, Asakura Publishing, 2000
Non-Patent Document 3: Yuukou Horita (堀田裕弘), Evaluation Methods and International Standards for Digital Images, Triceps, 2006
Non-Patent Document 4: Tadasu Oyama (大山正) et al. (eds.), New Handbook of Sensory and Perceptual Psychology, Seishin Shobo, 1994
Non-Patent Document 5: Hiroyuki Matsushima (松島洋之) (ed.), Fundamentals of Television Program Production Technology, Motion Picture and Television Engineering Society of Japan, 1996
Non-Patent Document 6: Toshio Morita (森田敏夫) et al. (supervising eds.), Television Program Production Technology: From Basics to Know-How, Kenrokukan Publishing, 1991

However, as described above, techniques for finding similar scenes in video, and similar images within those scenes, have been developed from the standpoint of whether the feature values inherent to the videos and images agree mathematically, not from the standpoint of finding results that agree with the human sense of similarity. With the techniques in the patent documents above, videos and images that match exactly in a mathematical comparison of feature values are retrieved, but when a person compares those videos, the results do not necessarily agree with the human sense of similarity. In other words, the degree of similarity obtained by conventional similar image search differs from how similar people actually perceive the images to be.

For example, consider a scene from live typhoon coverage in which a reporter holding an umbrella describes the situation in heavy rain, and a scene from a rainy-day street interview in which a member of the public, also holding an umbrella, answers questions in lighter rain. From the standpoint of a mathematical comparison of the feature values inherent to the videos and images, that is, whether the videos or images match as a whole, the degree of agreement between the two scenes is low.

From the standpoint of the human sense of similarity, however, the two scenes are considered to agree closely. When people judge two scenes to be visually similar, what they attend to here is that a person holding an umbrella is shown, and on that point the two scenes match well.

More specifically, whether the rain is heavy or light matters little to the human sense of similarity. A mathematical comparison, by contrast, compares how hard it is raining, whether a person holding an umbrella is shown, what the surroundings of that person look like, and so on, all as feature values of the video and images.

As another example, take typhoon coverage in which a reporter holding an umbrella describes the situation in heavy rain, and suppose that in one video the reporter is standing beside a large rock. A mathematical comparison of the inherent image feature values judges the videos with and without the rock to be dissimilar, because the presence of the large rock becomes a major factor in the similarity judgment.

A person, on the other hand, judges the videos to be similar. As in the previous example, the reason is that the human sense of similarity attends to the person holding the umbrella rather than to differences in that person's surroundings, such as whether there is a large rock nearby.

These examples show that when the mathematical comparison of inherent image feature values is emphasized, all feature values of the videos and images are taken into account, so the result may not agree with the human sense of similarity.

To solve this problem, the full set of feature values of videos and images must be linked to a sense of similarity that takes human visual characteristics into account. Similar-video search also raises an issue that did not arise for still images: human vision does not give equal weight to fast-moving, slow-moving, and still video. The feature values therefore need to be converted into video and image feature values that reflect knowledge of physiological visual characteristics before they are compared mathematically. These physiological visual characteristics apply not only to the speed of motion in the video but also to the color and size of the objects shown.

This is equivalent to reflecting the spatial and temporal frequency characteristics of human vision, as established by physiological study, in the spatial and temporal frequency characteristics of the image feature values inherent to videos and images.

Thus, simply comparing the inherent image feature values mathematically makes it difficult to retrieve, with good accuracy, videos and images that agree with human similarity judgments. A means is therefore needed that converts the feature values into suitable ones that take human physiological visual characteristics into account and then extracts similar videos and images.

Before considering a method and apparatus for extracting similar videos and images in a way that takes human physiological visual characteristics into account, the brightness perception, spatial frequency, and temporal frequency characteristics of vision are described. Human physiological visual characteristics relate to spatial extent (spatial frequency characteristics) and to change over time (temporal frequency characteristics). The brightness perception characteristic for the illumination at the time is also related to these two frequency characteristics.

The Broca-Sulzer effect is a well-known physiological visual characteristic relating to brightness perception. It is the phenomenon in which perceived brightness changes (is amplified or reduced) with the duration of the stimulus light. For example, for a stimulus of 200 trolands (td), perceived brightness is enhanced most at a duration of about 30 ms. That is, for the same luminance there is a duration at which brightness is perceived most efficiently. The effect grows as the stimulus becomes stronger, and the most effective duration becomes shorter. As the stimulus becomes weaker, the effect shrinks and the most effective duration becomes longer. The effect has been confirmed to be more pronounced in binocular viewing, although it has also been suggested that it weakens for very strong stimuli. In other words, there is a brightness at which video is perceived best, and the effect weakens if the video is either too dark or too bright (Non-Patent Document 1: Broca-Sulzer effect, p. 166; Non-Patent Document 3: temporal frequency characteristics of vision, p. 15; Non-Patent Document 4: p. 338).

The spatial frequency characteristics of vision are often expressed by the contrast sensitivity function (CSF), which is obtained from the contrast threshold measured when the light stimulus is presented as a sine wave (a pattern of alternating light and dark). The contrast used when measuring the threshold is the ratio of the amplitude of the sinusoidal stimulus to its mean luminance (Michelson contrast). The contrast threshold is the minimum contrast the subject can perceive as the amplitude of the stimulus is varied.
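
For a sinusoidal grating, the amplitude-to-mean ratio equals the more general Michelson form (L_max - L_min) / (L_max + L_min). The sketch below checks this numerically and expresses contrast sensitivity as the reciprocal of the contrast threshold, which is the usual convention. The luminance values, the threshold, and the function names are hypothetical.

```python
import math

def michelson_contrast(luminances):
    """(L_max - L_min) / (L_max + L_min) for a sequence of luminance samples."""
    lo, hi = min(luminances), max(luminances)
    if hi + lo == 0:
        raise ValueError("luminances must not sum to zero")
    return (hi - lo) / (hi + lo)

def sine_grating(mean_lum, amplitude, cycles, samples=360):
    """One row of a sinusoidal luminance grating sampled at `samples` points."""
    return [
        mean_lum + amplitude * math.sin(2.0 * math.pi * cycles * i / samples)
        for i in range(samples)
    ]

def contrast_sensitivity(threshold_contrast):
    """Sensitivity as the reciprocal of the threshold contrast."""
    return 1.0 / threshold_contrast

# Hypothetical grating: mean luminance 100, amplitude 20, so contrast = 20 / 100.
grating = sine_grating(mean_lum=100.0, amplitude=20.0, cycles=3)
contrast = michelson_contrast(grating)
# Hypothetical threshold: the subject just detects the grating at 0.5 % contrast.
sensitivity = contrast_sensitivity(0.005)
```

Plotting sensitivity measured this way against spatial frequency gives the contrast sensitivity function.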

At 9 td (trolands) and above, contrast sensitivity shows a band-pass characteristic with peak sensitivity around 3 to 5 cycles/°. At lower illuminance, contrast sensitivity falls and the characteristic changes from band-pass to low-pass. The band-pass characteristic in bright conditions means that image edges are enhanced (frequency components around 3 to 5 cycles/° are emphasized). The low-pass characteristic in dark conditions, by contrast, corresponds to gathering signal over a wide area, a mechanism that prioritizes raising sensitivity (Non-Patent Document 2: contrast threshold, p. 193; Non-Patent Document 3: spatial frequency characteristics of vision, p. 12; Non-Patent Document 4: spatio-temporal interaction, p. 579).

The temporal frequency characteristics of the human visual system have been studied extensively through flicker experiments. These show that when the mean luminance of the stimulus as a whole is low, the characteristic is that of a low-pass filter. In flicker photometry (stimulus light whose phase is inverted by 180°), sensitivity to contrast begins to fall sharply at a temporal frequency of about 5 Hz, and at 20 Hz flicker is no longer visible however large the contrast is made. As the mean luminance of the stimulus is gradually increased, sensitivity begins to rise in the mid-frequency region, reaches a peak, and then falls, giving a band-pass characteristic. The peak frequency of contrast sensitivity depends on the mean luminance of the stimulus but lies between about 10 and 30 Hz.

Furthermore, the flicker threshold (the threshold at which flicker is perceived) is always determined by the luminance difference from the mean luminance, regardless of the mean luminance of the stimulus as a whole. That is, the flicker threshold changes not with the magnitude of the mean luminance but with the absolute difference between the stimulus light and the mean luminance. In other words, the flicker threshold depends not on the brightness of the whole screen but on the state (combination) of light-dark contrast within the screen. This also indicates that there is a speed of motion best suited to perception, and that videos, or objects in a video, that move too slowly or too quickly are hard to perceive. (There is, of course, also the question of which objects in the video a person is interested in and looking at.) (Non-Patent Document 1: temporal frequency characteristics, p. 180; Non-Patent Document 2: contrast sensitivity characteristics, p. 220; Non-Patent Document 3: temporal frequency characteristics of vision, p. 15; Non-Patent Document 4: temporal characteristics of the visual system, p. 551)

Next, the spatio-temporal frequency characteristics of color vision are described. The contrast sensitivity function for color differs from that for luminance in two respects. First, on the high-spatial-frequency side, the sensitivity roll-off and cutoff of the color contrast sensitivity function occur at lower spatial frequencies than those of the luminance function. Second, the color contrast sensitivity function shows no loss of sensitivity at low spatial frequencies. In other words, the luminance contrast sensitivity function is band-pass, whereas the color contrast sensitivity function is low-pass. For example, according to Mullen et al., sensitivity to color contrast is higher in the low-frequency band at or below 0.5 cycle/°, and sensitivity to luminance contrast is higher above that. In addition, the high-frequency roll-off and cutoff of the blue-yellow contrast sensitivity function occur at lower spatial frequencies than those of the red-green function.

In the temporal frequency domain, too, the chromatic contrast sensitivity function differs from the luminance contrast sensitivity function in the same way: there is no loss of sensitivity at low temporal frequencies, and the behavior resembles that of the spatial frequency characteristics (Non-Patent Document 2: spatiotemporal frequency characteristics of color vision, p. 232; Non-Patent Document 4: spatiotemporal characteristics of the visual system for chromaticity information, p. 592).

Next, the problem with applying visual characteristics as they are is described. Consider converting the image features inherent in video and images so as to take account of human physiological visual characteristics such as temporal frequency characteristics, spatial frequency characteristics, and color perception characteristics. Suppose, for example, that a method which searches for images similar to a reference image using each image's light/dark layout information, edge pattern information, color histograms and the like is applied to those features. The physiological visual characteristics then reduce to a simple band-pass filter for the temporal and spatial frequency characteristics and to a simple low-pass filter for the color perception characteristics. The result is equivalent to a feature that merely extracts part of the features inherent in the video or image, so similar video and image search performance actually gets worse.

In other words, only objects of a certain size moving at a certain speed end up being extracted as image features, so reflecting the visual characteristics makes the similarity search less accurate rather than more. Here, "a certain size" means a size whose spatial frequency is neither too high nor too low: in terms of the picture, neither an object so large that it fills the screen nor one so small that it cannot be seen without moving closer. Likewise, "a certain speed" means neither so slow that one cannot tell whether the object is moving nor so fast that the motion cannot be followed.

From this standpoint of human visual characteristics, the problems of typical similar-video and similar-image search methods are described below.

Patent Document 2 (Japanese Patent Laid-Open No. 5-174072) and Patent Document 3 (Japanese Patent Laid-Open No. 7-46517) use the length of a shot or of a scene in the video content as the feature. Neither specifies a way of applying human visual characteristics, so it is unclear how visual characteristics would be applied.

The method described in Patent Document 1 (Japanese Patent Laid-Open No. 2000-48181) makes the weights of color and texture (peak values of the spectrum image and edge strength) variable. It targets still images, however, and does not specify a way of applying human visual characteristics, so it is unclear how it would be applied to moving images.

Patent Document 4 (Japanese Patent Laid-Open No. 2000-123173) searches for similar images using, as the feature, a light/dark layout based on the light/dark information of the image. It targets still images, however, and does not specify a way of applying human visual characteristics, so it is unclear how it would be applied to moving images.

Patent Document 5 (Japanese Patent Laid-Open No. 2002-63208) searches for images similar to a reference image on the basis of human visual characteristics, using the mean and standard deviation of the hue filter outputs for the reference image as they are. This method targets still images and does not use the luminance distribution of the reference image. It is also unclear how it would be applied to moving images.

Some compressed video formats, such as MPEG, carry inside the compressed stream motion vector data generated by a known motion compensation algorithm in order to improve compression along the time axis. Patent Document 6 (Japanese Patent Laid-Open No. 8-194714) is an example that uses this motion vector data for similar-video search. It extracts motion vectors directly from the compressed video and generates and searches shape data. The motion vectors serve to generate shape data; no method of applying human visual characteristics to video search is specified, and it is unclear how they would be applied to moving images.

As described above, when building a method or apparatus that searches for similar video and images with human visual characteristics taken into account, merely adding a visual-characteristic correction means that implements a filter reflecting those characteristics often yields features equivalent to a partial extraction of the features inherent in the video or image, and similar video and image search performance deteriorates instead.

In view of these problems of the prior art, an object of the present invention is to provide a video and image search apparatus, a video and image search method, and a video and image search program that compare a reference video or image with a search-target video or image, convert them into appropriate features, and obtain search results that match the human sense of similarity.

To solve the above problem, video is first considered from the standpoint of what the video that people watch actually is. This is a consideration of how video and images are produced: how the producers and editors of dramas, variety shows and the like compose scenes and shots to create video.

A picture is composed of three elements: 1) the shutter chance (timing of the shot), 2) composition, and 3) camera angle. A good picture is considered to be one in which these three elements are in balance. Of these, 1) the shutter chance gives the picture impact and force, while 2) composition and 3) camera angle convey the theme (the subject of the story) itself.

Themes include comedy, serious drama, suspense, documentary and so on. Visual expression suited to the theme uses changes of light and shadow to impress on the viewer the atmosphere of the scene, the psychology of the characters and the like. The way lighting is composed for the content of each scene therefore comes to be imagined in the same way, quite naturally, by both the creator and the viewer. This can be attributed to daily viewing of broadcast video: the eye for drama and for video has become the same for producers and viewers, and implicit rules about how video for a given theme is made have formed. In other words, communication between producer and viewer follows rules that, while not absolute, obey certain regularities, and these rules allow the producer's intent to reach the viewer. For example, a drama with cheerful content shot in dark images is likely to feel wrong to the viewer. If a serious, tense scene is shown with gentle or refreshing images, the tension is unlikely to reach the viewer.

There is also the question of whether reproducing the natural environment (the way light actually falls) is the only route to realistic expression. "Real" in television means that the picture created with lighting looks plausible through the screen; in other words, that the information the viewer wants to see is expressed without feeling unnatural. In drama lighting it is important to capture and convey the acting and facial expressions accurately within the plausible atmosphere of the scene, and depending on the content of the scene a dramatic picture with emphasized lighting may be created. Giving the viewer impact and showing things symbolically is thus also an indispensable element.

What determines the impression of video that producers and viewers have thus formed as an implicit rule is the tone of the video.

Video tone comprises tone by light/dark contrast and tone by color. Tone by light/dark contrast derives from the area ratio of light and dark in the picture and can be broadly classified as follows.

・Low key (dark tone)
1) An expression method in which dark areas occupy most of the picture and bright areas add accents.
2) A tone in which most of the screen is dark; common in night scenes and serious dramas.
Readily evokes feelings such as gruesomeness, mystery, and fear.
・Medium key (average tone)
1) An expression method intermediate between low key and high key, in which the area ratios of the light, dark, and middle areas are balanced.
・Flat key (flat tone)
1) An expression method with little shadow, in which light and dark areas occupy a small share of the area and middle areas a large share.
・High key (bright tone)
2) A method in which bright areas predominate overall and brightness is adjusted by adding dark areas.
3) Bright areas fill most of the screen; seen in cheerful home dramas, quiz shows and the like.
・Soft key (soft tone)
1) An expression method in which light and dark areas occupy a small share of the area, the middle areas are emphasized, and the boundary between light and dark changes softly.
2) The opposite of hard key: a tone with many middle areas in which the boundary between light and dark changes softly.
・Hard key (hard tone)
1) An expression method with few middle areas that emphasizes a sharp contrast between light and dark areas.
Color tones, on the other hand, are classified as follows.
・Pale tone
A tone unified by colors of high lightness and low saturation. A tone with a bright, sweet atmosphere.
・Bright tone
Clear hues, lower in lightness and higher in saturation than the pale tone. A bright, lively tone. Composed of colors from the same family it feels very soft, but it tends to lack accent.
・Vivid tone
A tone centered on highly saturated colors. It creates strong tension, but a poorly balanced color scheme becomes garish and crude.
・Dull tone
A subdued tone with restrained hues. It is quiet and weighty, but hard to create with colored light alone.
・Dark tone
A dark tone with hues more restrained than the dull tone.
・Monotone
A nearly achromatic tone, close to black-and-white video.
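The light/dark classification above is defined by area ratios, which can be sketched as a rule over the fractions of dark, middle, and bright pixels. The following is a minimal sketch under stated assumptions, not the classification used by the invention: the function name `classify_key`, the luminance boundaries (1/3 and 2/3), and the ratio thresholds (0.6, 0.3, 0.2) are all hypothetical. Soft key is left out because it also depends on how gradually the light/dark boundary changes, which area ratios alone do not capture.

```python
from typing import Sequence


def classify_key(luma: Sequence[float],
                 dark_max: float = 1 / 3,
                 bright_min: float = 2 / 3) -> str:
    """Classify a frame by the area ratio of dark, middle, and bright pixels.

    luma holds per-pixel luminance values in [0, 1]. Every threshold in this
    function is a hypothetical illustration value.
    """
    if not luma:
        raise ValueError("luma must not be empty")
    n = len(luma)
    dark_count = sum(1 for v in luma if v < dark_max)
    bright_count = sum(1 for v in luma if v > bright_min)
    dark = dark_count / n
    bright = bright_count / n
    mid = (n - dark_count - bright_count) / n

    if dark >= 0.6:
        return "low key"      # mostly dark, bright areas act as accents
    if bright >= 0.6:
        return "high key"     # mostly bright
    if mid <= 0.2 and dark >= 0.3 and bright >= 0.3:
        return "hard key"     # few middle areas, strong light/dark contrast
    if mid >= 0.6:
        return "flat key"     # mostly middle areas, little shadow
    return "medium key"       # balanced mix


if __name__ == "__main__":
    night_scene = [0.1] * 8 + [0.9] * 2
    print(classify_key(night_scene))
```

A real implementation would take the luminance plane after the visual-characteristic correction rather than raw pixel values, and would need tuned thresholds.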

In drama, the tone of the video is designed through this light/dark contrast. Keeping the same tone throughout, for example, produces monotonous video that leaves little impression. Introducing a serious (hard key) tone into a comical drama (high key) can bring change to the flow of the video and produce a striking effect. Combining the individual tones in this way creates different flows. (Non-Patent Document 5: what drama lighting is, p. 53; Non-Patent Document 6: tone of video, p. 92)

As described above, it is essential to consider video from the standpoint of production, that is, how video is created. By correcting what the producer intended according to visual characteristics, the impression that the viewer receives from the video's expression can be extracted as a video feature.

According to the present invention, processing means are provided that take into account, as human visual characteristics, the temporal frequency characteristics of vision, the spatial frequency characteristics, the brightness perception characteristics, and the spatiotemporal characteristics of color vision, so that similar video and images are retrieved in agreement with the human sense of similarity. More specifically, the invention comprises an encoded video decoding unit that decodes compressed video such as MPEG, a video feature calculation unit that reflects human physiological visual characteristics, and a video feature comparison unit that compares the features of pre-recorded video with the features of the input video.

To make use of human physiological visual characteristics, the video feature calculation unit comprises:
1) a luminance/hue separation unit that separates the images contained in the video into luminance and hue components;
2) a luminance temporal frequency filter unit that corrects the luminance component of the images contained in the video using, among human physiological visual characteristics, the temporal visual characteristics for luminance;
3) a luminance spatial frequency filter unit that corrects the luminance component of the images contained in the video using, among human physiological visual characteristics, the spatial visual characteristics for luminance;
4) a hue (red-green) temporal frequency filter unit that corrects the hue (red-green) component of the images contained in the video using, among human physiological visual characteristics, the temporal frequency characteristics for the red-green hue;
5) a hue (red-green) spatial frequency filter unit that corrects the hue (red-green) component of the images contained in the video using, among human physiological visual characteristics, the spatial frequency characteristics for the red-green hue;
6) a hue (yellow-blue) temporal frequency filter unit that corrects the hue (yellow-blue) component of the images contained in the video using, among human physiological visual characteristics, the temporal frequency characteristics for the yellow-blue hue;
7) a hue (yellow-blue) spatial frequency filter unit that corrects the hue (yellow-blue) component of the images contained in the video using, among human physiological visual characteristics, the spatial frequency characteristics for the yellow-blue hue;
8) a light/dark contrast classification unit that classifies the image corrected using the human physiological visual characteristics for luminance into a plurality of tones according to the proportions of luminance; and
9) a color tone classification unit that classifies the image corrected using the human physiological visual characteristics for hue into a plurality of tones according to the proportions of lightness and saturation;
and is characterized in that
10) the result of the light/dark contrast classification unit and the result of the color tone classification unit are combined to form the feature used for video feature comparison.
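Item 10 combines the two classification results into one feature, and the abstract states that similar videos are found by agreement of the feature values. The following is a minimal sketch of that idea under stated assumptions: representing a shot as one (light/dark key, color tone) pair per frame and scoring agreement as the fraction of matching positions are both illustrative choices, and the names `combine` and `agreement` are hypothetical.

```python
from typing import List, Sequence, Tuple

# One feature per frame: (light/dark key label, color tone label).
Feature = Tuple[str, str]


def combine(keys: Sequence[str], tones: Sequence[str]) -> List[Feature]:
    """Pair the per-frame outputs of the two classification units."""
    if len(keys) != len(tones):
        raise ValueError("keys and tones must have the same length")
    return list(zip(keys, tones))


def agreement(a: Sequence[Feature], b: Sequence[Feature]) -> float:
    """Fraction of positions at which two feature sequences agree.

    Only the overlapping prefix is compared; this scoring rule is an
    assumption made for illustration.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


if __name__ == "__main__":
    reference = combine(["low key", "low key", "high key"],
                        ["dark tone", "dark tone", "pale tone"])
    candidate = combine(["low key", "hard key", "high key"],
                        ["dark tone", "dark tone", "pale tone"])
    print(f"agreement = {agreement(reference, candidate):.2f}")
```

A candidate would then be reported as similar when its agreement with the reference exceeds some threshold, which would also have to be chosen.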

Furthermore, the video feature calculation unit receives, from the encoded video decoding unit that decodes the compressed video, the DCT coefficients (spatial frequency characteristics), the motion vectors (temporal frequency characteristics), and the decoded video.

This makes it possible to calculate video features that take human physiological visual characteristics into account, and hence to search for video on the basis of the impression the viewer receives from it.

In addition, taking the DCT coefficients (spatial frequency characteristics) and motion vectors (temporal frequency characteristics) from the encoded video decoding unit as inputs reduces the amount of computation, so that search results can be obtained from a large volume of video data in a short time.

The present invention may also be a program that causes a computer to execute the steps corresponding to each unit, or a computer-readable recording medium on which that program is recorded.

As described above, the similar image search technique according to the present invention has the advantage that similar images can be retrieved in a way that reflects human physiological visual characteristics.

Furthermore, because video features are calculated from the light/dark contrast and color tones and similar video is identified on the basis of these features, shots that differ in light/dark contrast and color tone can be separated automatically.

Sequences of several shots (scenes) whose successive combinations of light/dark contrast and color tone are similar can also be separated automatically.

Furthermore, for video played back at high speed, the value of the high-speed playback rate is added to the motion vectors for normal playback speed, and the visual characteristics are applied by this method to the motion vectors corresponding to that playback speed, which makes similar-video search possible during high-speed playback.

The similar video search technique according to an embodiment of the present invention is described below with reference to the drawings. FIGS. 1 to 8 relate to the similar video search technique according to one embodiment of the present invention, and parts given the same reference numeral in different drawings denote the same component. The basic configuration of FIG. 2 is the same as the general configuration shown in FIG. 9. It differs, as FIG. 2 shows, in that the DCT coefficients (spatial frequency characteristics) and motion vectors (temporal frequency characteristics) are taken as inputs from the encoded video decoding unit, and in that the video feature calculation unit calculates video features reflecting human physiological visual characteristics. FIG. 9 is a block diagram of the encoding processing of compressed video such as MPEG, which is a known technique.

FIG. 1 is a functional block diagram showing a configuration example of the similar image search apparatus according to this embodiment. In the configuration of FIG. 1, video recorded on the hard disk HD 101, or on an optical recording medium 102 such as an MO, CD, or DVD, is read in through the I/O interface IF 103. The similar video extraction processing unit 107 extracts similar video, and the extracted video is shown on the display 110 connected to the display interface IF 109. If the video includes audio, the audio is output from a speaker connected to the AV interface IF 108. Video can also be input from a communication port through the communication interface IF 106, or directly from a video source, microphone or the like 113 connected to the AV interface IF 108. The input video and the program for similar video extraction are stored, for example, in the memory 104, and the CPU 105 carries out processing based on the program. This processing may instead be done in hardware; FIG. 1 provides the similar video extraction processing unit 107 as an example in which similar video extraction is performed by hardware. The video extracted by the similar video extraction processing unit 107 is stored temporarily in the memory 104 and output to the display 110 through the display interface IF 109.

Next, FIG. 2 shows the processing of the similar video extraction processing unit 107 according to this embodiment as a functional block diagram that applies whether the processing is implemented in software or in hardware. The configuration of FIG. 2 is basically the same as that of the conventional similar video processing unit shown in FIG. 9. To make the differences from FIG. 2 clear, FIG. 9 is described first.

In FIG. 9, encoded compressed video (encoded video) such as MPEG is input. MPEG is encoded using the known technique of entropy coding and is decoded by a known technique in the variable length decoding unit 901. Of the decoded signals, the coding mode and the motion vectors are input to the motion compensation unit 904, where motion compensation is performed by a known technique. The inverse quantization unit 902 receives the DCT coefficients that were quantized during encoding and performs inverse quantization by a known technique. The IDCT unit 903 then generates the images contained in the video by inverse discrete cosine transform (a known technique). The images obtained at this point have not yet been motion compensated. Next, the motion compensation unit 904 performs motion compensation using the motion vectors by a known technique, and the result is sent to the picture storage unit 905. The motion compensation unit 904 also uses images previously stored in the picture storage unit 905 when compensating for motion. The compensated image is sent to the picture rearrangement unit 906 as well as to the picture storage unit 905. Encoded video such as MPEG achieves a higher compression ratio through the known compression schemes called intra-frame coding and inter-frame coding, and MPEG has three picture types: intra pictures (I pictures), predictive coded pictures (P pictures), and bidirectionally predictive coded pictures (B pictures). The three types are used together to improve both compression ratio and image quality. The picture rearrangement unit 906 rearranges these three types of picture back into the order they had before compression.
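The reordering step can be illustrated with a small example. The sketch below assumes that each decoded picture carries its position in presentation order; the `Picture` type and its `display_index` field are hypothetical names introduced here, and a real decoder reorders incrementally with a small buffer rather than sorting a complete list.

```python
from typing import List, NamedTuple


class Picture(NamedTuple):
    kind: str           # "I", "P", or "B"
    display_index: int  # position in presentation order (hypothetical field)


def to_display_order(coded: List[Picture]) -> List[Picture]:
    """Return pictures sorted from coded (transmission) order to display order."""
    return sorted(coded, key=lambda p: p.display_index)


if __name__ == "__main__":
    # B pictures are coded after the I/P pictures they reference,
    # so the coded order differs from the display order.
    coded = [
        Picture("I", 0), Picture("P", 3), Picture("B", 1), Picture("B", 2),
        Picture("P", 6), Picture("B", 4), Picture("B", 5),
    ]
    print(" ".join(f"{p.kind}{p.display_index}" for p in to_display_order(coded)))
```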

The decoded video, that is, the video as it was before compression, is input to the video feature calculation unit 911. In a typical similar video search system, the video features are calculated here. In Patent Document 2 (Japanese Patent Laid-Open No. 5-174072), for example, shot length information or scene length is calculated as the feature. The features stored in the video feature DB 913 and the features of the input video are then compared in the video feature comparison unit 912, and video judged similar as a result of the comparison is output as similar video. Reference numeral 900 denotes the video decoding processing unit, and reference numeral 910 denotes the similar video extraction processing unit as a whole.

Next, the differences between FIG. 2 and FIG. 9 are described. In the configuration of FIG. 2, the variable length decoding unit 201 corresponds to the variable length decoding unit 901 of FIG. 9 and performs the same processing by a known technique. Likewise, the inverse quantization unit 202 corresponds to the inverse quantization unit 902, the IDCT unit 203 to the IDCT unit 903, the motion compensation unit 205 to the motion compensation unit 904, the picture storage unit 204 to the picture storage unit 905, and the picture rearrangement unit 206 to the picture rearrangement unit 906, each performing the same processing by a known technique.

The similar video search processing unit of this embodiment shown in FIG. 2 differs from the one shown in FIG. 9 in that the motion vectors (temporal frequency characteristics) from the variable length decoding unit 201 and the DCT coefficients (spatial frequency characteristics) from the inverse quantization unit 202 are sent to the video feature calculation unit 211. The video feature calculation unit 211 uses the DCT coefficients (spatial frequency characteristics), the motion vectors (temporal frequency characteristics), and the decoded video to calculate video features that reflect human physiological visual characteristics. The features stored in the video feature DB 213 and the features of the input video are compared in the video feature comparison unit 212, and video judged to be similar is output.

Here, reference numeral 200 denotes the video decoding processing unit, and reference numeral 210 denotes the similar video extraction processing unit.

Next, the method by which the video feature quantity calculation unit 211 calculates the feature quantity of a video is described with reference to FIG. 3. The video feature quantity calculation unit 211 receives the decoded video, the DCT coefficients, and the motion vectors, and sends the calculated feature quantity to the video feature quantity comparison unit 212. The decoded video sent from the picture rearrangement unit 206 is decomposed by the luminance/hue separation unit 301 into luminance, hue (red-green), and hue (yellow-blue). The luminance is sent to the temporal frequency filter 303, the hue (red-green) to the temporal frequency filter 304, and the hue (yellow-blue) to the temporal frequency filter 305.

For example, if the color space of the decoded video is RGB, it is converted from RGB to the L*a*b* color system via the XYZ color system. If the color space of the decoded video is YUV, it is first converted from YUV to RGB and then to the L*a*b* color system via the XYZ color system. YUV may of course also be converted directly to the L*a*b* color system. The color space conversion formulas are, for example, as follows.

Conversion formula from YUV to RGB:

Conversion formula from RGB to XYZ:

Conversion formula from XYZ to L*a*b*:

In the XYZ color system, the distance between two colors correlates poorly with the perceived color difference, so the system is unsuitable for expressing color differences as humans perceive them. The L*a*b* color system was therefore defined for evaluating color differences, and the conversion formula is as follows.

Here, Xn, Yn, and Zn are the tristimulus values of the illuminant (the perfect reflecting diffuser under the D65 light source). The formula assumes that X/Xn, Y/Yn, and Z/Zn take values of 0.008856 or more. When X/Xn, Y/Yn, or Z/Zn is 0.008856 or less, the following formula is used instead.
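
The conversion formulas themselves are not reproduced in this text. For reference, the standard CIE 1976 definition, which uses the same 0.008856 threshold, can be written with a helper function $f$ (a symbol introduced here for compactness, not taken from the original):

$$
L^{*} = 116\, f\!\left(\frac{Y}{Y_n}\right) - 16, \qquad
a^{*} = 500\left[ f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right) \right], \qquad
b^{*} = 200\left[ f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right) \right]
$$

$$
f(t) =
\begin{cases}
t^{1/3} & t > 0.008856 \\
7.787\, t + \dfrac{16}{116} & t \le 0.008856
\end{cases}
$$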

The L*a*b* color system is one of the uniform color spaces defined by the CIE in 1976 (color spaces in which color differences perceived as equal in magnitude are intended to correspond to equal distances in the space). It can be used to measure perceived color differences and color differences arising from differences between devices. This color system was recommended in 1976 and is specified in Japan in JIS Z 8729. CIE 1976 L*a*b* is based on CIE XYZ, and its non-linear relationships are intended to mimic the logarithmic response of the eye. L* represents luminance (lightness), and a* and b* are chromaticity values that together express hue and saturation. That is, a* and b* indicate color directions: +a* is the red direction, -a* the green direction, +b* the yellow direction, and -b* the blue direction.
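
The conversion chain described above (YUV to RGB, RGB to XYZ, XYZ to L*a*b*) might be sketched as follows. The matrix coefficients are assumptions, since the original formulas are not reproduced here: the sketch uses the common BT.601 analog YUV coefficients and the sRGB/D65 RGB-to-XYZ matrix, and it treats RGB values as linear in the range 0 to 1. All function names are hypothetical.

```python
# Sketch of YUV -> RGB -> XYZ -> L*a*b*; coefficients are assumed, not from the patent.

D65_WHITE = (0.95047, 1.0, 1.08883)  # Xn, Yn, Zn for the D65 illuminant


def yuv_to_rgb(y, u, v):
    """Assumed BT.601 analog YUV to RGB conversion."""
    r = y + 1.13983 * v
    g = y - 0.39465 * u - 0.58060 * v
    b = y + 2.03211 * u
    return r, g, b


def rgb_to_xyz(r, g, b):
    """Assumed linear sRGB (D65) to CIE XYZ conversion."""
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    return x, y, z


def _f(t):
    """Cube root above the 0.008856 threshold, linear segment below it."""
    return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0


def xyz_to_lab(x, y, z, white=D65_WHITE):
    """Standard CIE 1976 XYZ to L*a*b* conversion."""
    fx, fy, fz = (_f(c / w) for c, w in zip((x, y, z), white))
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def yuv_to_lab(y, u, v):
    return xyz_to_lab(*rgb_to_xyz(*yuv_to_rgb(y, u, v)))
```

In this sketch the L* output plays the role of the luminance channel, a* the hue (red-green) channel, and b* the hue (yellow-blue) channel handed to the three filter paths.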

The temporal frequency filter 303 corrects the luminance for the temporal frequency component of human physiological visual characteristics, based on the motion vectors (temporal frequency characteristics) sent from the variable length decoding unit 201. The spatial frequency filter 306 then corrects it for the spatial frequency component of those characteristics, based on the DCT coefficients (spatial frequency characteristics) sent from the inverse quantization unit 202. The luminance information corrected for human physiological visual characteristics is then sent to the light/dark contrast classification unit 309.

The light/dark contrast classification unit 309 creates a luminance histogram and, based on the light/dark contrast tone classifications (a) to (e) shown in FIG. 4, classifies the video into one of the keys Low Key, Medium Key, Soft/Flat Key, High Key, and Hard Key. Known techniques can be used for this classification, such as clustering by the k-means method or the SOM (self-organizing map), a self-organizing algorithm based on neural networks. It is also possible to use a clustering method based on nonlinear discrimination with a support vector machine, which is also a known technique, or to define rules on the ratios of luminance corresponding to white, gray, and black and classify according to those rules.
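
The rule-based option mentioned above could look roughly like the following sketch. The luminance boundaries for black, gray, and white and every ratio threshold are hypothetical values chosen for illustration; the actual criteria of FIG. 4 are not given in this text.

```python
# Rule-based key classification from luminance values (L* in 0..100).
# All thresholds below are hypothetical, for illustration only.

def classify_key(luminances, dark_max=33.0, bright_min=66.0):
    """Classify a frame by the area ratios of black, gray, and white."""
    n = len(luminances)
    black = sum(1 for v in luminances if v < dark_max) / n
    white = sum(1 for v in luminances if v > bright_min) / n
    gray = 1.0 - black - white

    if gray < 0.2 and black >= 0.3 and white >= 0.3:
        return "Hard Key"        # strong black and white, little gray
    if black >= 0.5:
        return "Low Key"         # mostly dark
    if white >= 0.5:
        return "High Key"        # mostly bright
    if gray >= 0.6:
        return "Soft/Flat Key"   # mostly mid-tones
    return "Medium Key"
```

A learned classifier such as k-means or an SVM would replace the fixed thresholds with boundaries estimated from labeled or clustered histograms.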

The temporal frequency filter 304 corrects the hue (red-green) for the temporal frequency component of human physiological visual characteristics, based on the motion vectors (temporal frequency characteristics) sent from the variable length decoding unit 201. The spatial frequency filter 307 then corrects it for the spatial frequency component of those characteristics, based on the DCT coefficients (spatial frequency characteristics) sent from the inverse quantization unit 202. The hue (red-green) information corrected for human physiological visual characteristics is then sent to the color tone classification unit 310.

The temporal frequency filter 305 corrects the hue (yellow-blue) for the temporal frequency component of human physiological visual characteristics, based on the motion vectors (temporal frequency characteristics) sent from the variable length decoding unit 201. The spatial frequency filter 308 then corrects it for the spatial frequency component of those characteristics, based on the DCT coefficients (spatial frequency characteristics) sent from the inverse quantization unit 202. The hue (yellow-blue) information corrected for human physiological visual characteristics is then sent to the color tone classification unit 310.

To correct for the temporal and spatial frequency characteristics, the respective characteristic values can be stored in tables. For the temporal frequency correction, the motion vector value is treated as the temporal frequency, a correction value is selected from the luminance and temporal frequency values, and the correction is applied, for example, by multiplying by it as a weighting coefficient. For the spatial frequency correction, the DCT coefficient value is treated as the spatial frequency, a correction value is selected from the luminance and spatial frequency values, and the correction is applied in the same way. In compressed video encoded with MPEG or similar, luminance and DCT coefficients are normally obtained per block of, for example, 8 × 8 or 16 × 16 pixels. In that case, the temporal and spatial frequency corrections may be performed block by block.
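
A minimal sketch of the table-based temporal correction for one block is shown below. The bin edges and weights are hypothetical placeholders rather than measured sensitivity data, and using the motion vector magnitude as the temporal frequency is one possible reading of the text. A spatial correction could use an analogous table keyed on a frequency derived from the DCT coefficients.

```python
import bisect
import math

# Hypothetical bin edges and weights; real values would come from a
# measured contrast sensitivity characteristic.
LUMA_EDGES = [33.0, 66.0]          # dark / mid / bright
FREQ_EDGES = [2.0, 8.0, 20.0]      # four temporal-frequency bins
TEMPORAL_TABLE = [
    [0.7, 1.0, 0.5, 0.1],          # dark blocks
    [0.8, 1.0, 0.6, 0.2],          # mid-luminance blocks
    [0.9, 1.0, 0.7, 0.3],          # bright blocks
]


def table_weight(table, luma, freq):
    """Select the correction value from luminance and frequency."""
    row = bisect.bisect_right(LUMA_EDGES, luma)
    col = bisect.bisect_right(FREQ_EDGES, freq)
    return table[row][col]


def correct_block(luma, motion_vector, table=TEMPORAL_TABLE):
    """Weight a block's luminance by the table entry for its motion."""
    freq = math.hypot(motion_vector[0], motion_vector[1])
    return luma * table_weight(table, luma, freq)
```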

To classify by lightness and saturation, the color tone classification unit 310 first converts from the L*a*b* color system to the HSV color system and then classifies the video into a tone based on the lightness and saturation values, for example according to the color tone classification shown in FIG. 5. Known techniques can be used for the tone classification, such as clustering by the k-means method or the SOM (self-organizing map), a self-organizing algorithm based on neural networks. A clustering method based on nonlinear discrimination with a support vector machine, which is also a known technique, can be used as well. Methods that a person skilled in the art could readily conceive and implement are also possible, for example storing combinations of lightness and saturation values in a table and classifying according to the tone classification value found there. The HSV color space is a known representation that expresses color as H (Hue), S (Saturation), and V (Value, i.e. lightness).
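
The table-style approach could be approximated by simple threshold rules on saturation and value, as in the sketch below. The boundaries are hypothetical and only loosely follow the PCCS idea that tones are regions in the lightness/saturation plane; the actual regions of FIG. 5 are not given in this text.

```python
# Tone classification from HSV saturation and value, both in 0..1.
# All boundaries are hypothetical, for illustration only.

def classify_tone(saturation, value):
    if saturation < 0.15:
        return "Monotone"                      # almost no chroma
    if saturation >= 0.75 and value >= 0.5:
        return "Vivid tone"                    # strongly saturated
    if value < 0.4:
        return "Dark tone"
    if value >= 0.7:
        return "Pale tone" if saturation < 0.4 else "Bright tone"
    return "Dull tone"
```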

The video feature quantities obtained in this way from the light/dark contrast classification unit 309 and the color tone classification unit 310 are sent to the video feature quantity comparison unit 212 in FIG. 2 and compared with the video feature quantities in the video feature quantity DB 213, and images with a high degree of similarity are output. In general, the feature quantity used here depends on the comparison method used in the video feature quantity comparison unit 212. For example, the comparison may be defined as the distance between feature vectors, using known measures such as the Euclidean distance or the direction cosine distance between vectors, or the Mahalanobis distance, which takes the variance between vectors into account.

As a specific example, suppose the feature quantity format is defined as below. If the light/dark contrast classification unit 309 classifies the video as Low Key and the color tone classification unit 310 classifies it as Monotone, the feature quantity vector is as follows.

Feature quantity format = {light/dark contrast feature vector, color tone feature vector}
= {"Low Key", "Medium Key", "Soft/Flat Key", "High Key", "Hard Key", "Monotone", "Pale tone", "Dark tone", "Bright tone", "Dull tone", "Vivid tone"}
Feature quantity vector = {light/dark contrast feature vector, color tone feature vector}
= {1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0}

The above is an example in which the light/dark contrast classification sets Low Key = 1 and all other keys to 0, and the color tone classification sets Monotone = 1 and all other tones to 0. The distance between this vectorized feature quantity and the feature vector of an image in a video recorded in the video feature quantity DB 213 shown in FIG. 2 can then be calculated by a known technique. Depending on the purpose of the application, weighting coefficients can be applied when vectorizing the light/dark contrast classification result and the color tone classification result, to control which of the two is emphasized. For example, to weight the light/dark contrast result and the color tone result at a ratio of 2 to 1, the light/dark contrast feature vector can be created at that ratio (that is, with its feature quantity doubled). In that case, the feature quantity vector is as follows.

Feature quantity vector = {2 * (light/dark contrast feature vector), color tone feature vector}
= {2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0}
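
The two example vectors above can be reproduced with a short sketch. The label order follows the feature quantity format given in the text; the function and parameter names are hypothetical.

```python
CONTRAST_KEYS = ["Low Key", "Medium Key", "Soft/Flat Key", "High Key", "Hard Key"]
COLOR_TONES = ["Monotone", "Pale tone", "Dark tone",
               "Bright tone", "Dull tone", "Vivid tone"]


def one_hot(label, labels, weight=1):
    """Put `weight` at the position of `label`, zeros elsewhere."""
    if label not in labels:
        raise ValueError(f"unknown label: {label}")
    return [weight if name == label else 0 for name in labels]


def feature_vector(contrast_key, color_tone, contrast_weight=1, tone_weight=1):
    """Concatenate the weighted contrast and color tone one-hot vectors."""
    return (one_hot(contrast_key, CONTRAST_KEYS, contrast_weight)
            + one_hot(color_tone, COLOR_TONES, tone_weight))


print(feature_vector("Low Key", "Monotone"))
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(feature_vector("Low Key", "Monotone", contrast_weight=2))
# [2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```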

For example, when the Euclidean distance between vectors is used, let a be the feature vector of the input video and b be the feature vector of one video recorded in the video feature quantity DB. The Euclidean distance r is then calculated as follows.
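
The equation is not reproduced in this text. The standard Euclidean distance, consistent with the element-wise description of the vectors, is given below, where $n$ (a symbol assumed here) is the number of vector elements:

$$
r = \sqrt{\sum_{k=1}^{n} \left( a_k - b_k \right)^2}
$$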

Here, a_k and b_k denote the k-th elements of the vectors. In other words, the distance is determined by the sum of the squared differences between corresponding elements of the two vectors. The direction cosine distance r is calculated as follows.
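
The equation is not reproduced in this text. A standard form based on the inner product and the vector magnitudes is the cosine of the angle $\theta$ between the two vectors:

$$
r = \cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|}
$$

Whether the original uses this value directly or a derived quantity such as $1 - \cos\theta$ cannot be determined from the text.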

Here, a・b denotes the inner product, and |a| and |b| denote the magnitudes of vector a and vector b, respectively. In other words, the angle formed by vector a and vector b is expressed as a distance. To use the similarity obtained for individual images as the similarity of a video, the similarities of all the images that make up the video can be summed. That is, if r_i is the similarity of the i-th image (i being an integer of 1 or more) of one video recorded in the video feature quantity DB, the similarity R between the input video and that video can be calculated as follows.
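
The equation is not reproduced in this text. Following the description of summing the per-image similarities, it would take the form below, where $N$ (a symbol assumed here) is the number of images in the video:

$$
R = \sum_{i=1}^{N} r_i
$$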

This R represents the similarity between one entire video recorded in the video feature quantity DB and the entire input video.
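
The per-image measures and their summation into a video-level value might be sketched as follows. Pairing the i-th image of the input video with the i-th image of the stored video is an assumption made for this sketch; the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared element differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def direction_cosine(a, b):
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def video_similarity(query_frames, db_frames, frame_measure=euclidean_distance):
    """Sum the per-image values r_i over index-aligned image pairs."""
    return sum(frame_measure(q, d) for q, d in zip(query_frames, db_frames))
```

With the Euclidean distance, a smaller sum means the two videos are more alike.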

Furthermore, to determine whether a part of one video recorded in the video feature quantity DB is similar to the input video, a known technique such as dynamic programming can be used (a mathematical programming method in which the effect of a multi-stage decision process unfolding over time is modeled mathematically to find the optimal condition). By applying dynamic programming with a free starting point to the feature vectors of the images in the input video and the feature vectors of the images in one video recorded in the video feature quantity DB, a partial segment of the recorded video can be identified as a similar video. A threshold is normally used to judge similarity in this case. Because the similarity is computed as a distance, a value smaller than the threshold is judged to indicate high similarity, and a value larger than the threshold is judged to indicate low similarity. This technique is widely known in the field of pattern recognition.
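
One common way to realize dynamic programming with a free starting point is subsequence alignment in the style of dynamic time warping, sketched below. This is an illustrative choice, not necessarily the formulation the original has in mind; the function names and the allowed step pattern are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_match(query, target, dist=euclidean):
    """Find the segment of `target` that best aligns with all of `query`.

    Returns (cost, start_index, end_index). The alignment may start at any
    target position at no cost (free start) and end at any position.
    """
    m = len(target)
    # First query image: every target position is a possible start.
    prev = [dist(query[0], target[j]) for j in range(m)]
    prev_start = list(range(m))

    for i in range(1, len(query)):
        cur = [0.0] * m
        cur_start = [0] * m
        for j in range(m):
            # Predecessors: same target image, or the previous target image
            # reached from the previous or the current query image.
            candidates = [(prev[j], prev_start[j])]
            if j > 0:
                candidates.append((prev[j - 1], prev_start[j - 1]))
                candidates.append((cur[j - 1], cur_start[j - 1]))
            best_cost, best_start = min(candidates)
            cur[j] = best_cost + dist(query[i], target[j])
            cur_start[j] = best_start
        prev, prev_start = cur, cur_start

    end = min(range(m), key=lambda j: prev[j])
    return prev[end], prev_start[end], end


def is_similar(cost, threshold):
    """A distance below the threshold counts as similar."""
    return cost < threshold
```

The returned start and end indices delimit the partial video that would be reported as similar when the cost falls below the threshold.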

Next, the flow of processing in the video feature quantity calculation unit 211 is described using a flowchart. Reference numerals 200, 212, and 213 in FIG. 2 are not characteristic parts of the present invention, so their description is omitted. Reference numeral 200 in FIG. 2 has the same configuration as reference numeral 900 in FIG. 9 and is a functional block that decodes compressed video such as MPEG. The differences from the known technique are that motion vectors are extracted from the variable length decoding unit 201 in FIG. 2 and DCT coefficients are extracted from the inverse quantization unit 202. In FIG. 9 this information is already produced by the variable length decoding unit 901 and the inverse quantization unit 902, so it only needs to be passed to the video feature quantity calculation unit 211 in FIG. 2. A person skilled in the art can readily see how to extract it.

The data structure of an MPEG bitstream consists of a sequence header and sequence extension, GOP headers, picture headers, picture data, and extension/user data. The picture data contains the specific coding parameters needed to decode each picture. Decoding includes processes such as decoding the DCT coefficients within each block and decoding the motion vectors. This data is variable-length coded and is decoded by the variable length decoding unit 201 in FIG. 2. At this point the motion vectors are obtained and sent to the motion compensation unit 205. The DCT coefficients obtained by the inverse quantization unit 202 are sent to the video feature quantity calculation unit 211.

The flow of processing in the video feature quantity calculation unit 211 in FIG. 2 is described with reference to FIGS. 2 and 8. As shown in FIG. 8, in step S801 the decoded video, which is the output of the decoding process applied to the encoded video, is input. In step S802, the motion vectors from the variable length decoding unit 201 in FIG. 2 are read. In step S803, the DCT coefficients from the inverse quantization unit 202 in FIG. 2 are read.

Next, in step S804, the color system of the input decoded video is determined. If it is the RGB color system, processing proceeds to step S805, where the video signal is converted from the RGB color system to the L*a*b* color system. If the decoded video is found in step S804 not to be in the RGB color system, step S806 further determines whether it is in the YUV color system. If it is (Yes), processing proceeds to step S807, where the video signal is converted from the YUV color system to the L*a*b* color system.

If it is determined in step S806 that the video is not in the YUV color system (No), processing returns to step S801.

Steps S804 and S806 check for the RGB and YUV color systems, but the processing is not limited to these. If other color systems such as YCrCb are also to be supported, then when step S806 determines that the video is not in the YUV color system, processing does not return to step S801. Instead, a check for the YCrCb color system is performed, and if the result is Yes, the video signal is converted from the YCrCb color system to the L*a*b* color system.
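
The branching in steps S804 to S807, including the extension to further color systems, can be expressed as a lookup in a table of converters. The sketch below is one hypothetical arrangement: the converter functions are supplied by the caller, and returning None stands in for going back to step S801.

```python
def convert_to_lab(frame, color_system, converters):
    """Dispatch to the converter registered for `color_system`.

    `converters` maps a color system name (e.g. "RGB", "YUV", "YCrCb")
    to a function that converts a frame to L*a*b*. Returns None when the
    color system is unsupported, which corresponds to returning to S801.
    """
    convert = converters.get(color_system)
    if convert is None:
        return None
    return convert(frame)


# Usage with placeholder converters (real ones would perform the
# actual color conversion for each pixel of the frame):
converters = {
    "RGB": lambda frame: ("lab-from-rgb", frame),
    "YUV": lambda frame: ("lab-from-yuv", frame),
}
```

Supporting YCrCb then only requires registering one more converter rather than adding another decision step.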

By performing the processing of steps S805 and S807 as described above, the decoded video input in step S801 can be converted to the L*a*b* color system regardless of its input format. Next, in step S808, the luminance signal of the decoded video converted to the L*a*b* color system is corrected according to the temporal frequency characteristic, using the motion vector information input in step S802. In step S809, it is corrected according to the spatial frequency characteristic, using the DCT coefficient information input in step S803. Then, in step S810, tone classification by light/dark contrast is performed, for example with FIG. 4 as the reference.

Of the decoded video converted to the L*a*b* color system, the hue (red-green) signal is corrected in step S811 according to the temporal frequency characteristic, using the motion vector information input in step S802. In step S812, it is corrected according to the spatial frequency characteristic, using the DCT coefficient information input in step S803.

Similarly, the hue (yellow-blue) signal of the decoded video converted to the L*a*b* color system is corrected in step S815 according to the temporal frequency characteristic, using the motion vector information input in step S802. In step S816, it is corrected according to the spatial frequency characteristic, using the DCT coefficient information input in step S803.

The hue (red-green) information corrected in step S812 and the hue (yellow-blue) information corrected in step S816 are then converted in step S813 from the L*a*b* color system to the HSV color system to obtain lightness and saturation information. In step S814, the lightness and saturation information obtained by the conversion in step S813 is used to classify the video by color tone, with FIG. 5 as the reference.

Then, in step S817, the feature quantity is generated from the light/dark contrast classification result of step S810 and the color tone classification result of step S814.

Through the above processing, a feature quantity is calculated from the decoded video input to the video feature quantity calculation unit 211 in FIG. 2, and this feature quantity becomes the input to the video feature quantity comparison unit 212 in FIG. 2.

In this way, the feature quantity of the encoded video input in FIG. 2 is compared by the video feature quantity comparison unit 212 with the feature quantities of the videos recorded in the video feature quantity DB 213, and similar videos are output.

FIGS. 4 to 7 are described below. FIG. 4 shows tone classification by light/dark contrast, one of the two elements that make up the tone of a video. The light/dark contrast tone is determined by the area ratio of light and dark in the image and is a known technique used as a guideline in program production (Document 5, page 54; Document 6, page 92). Because this area-ratio tone classification is not precisely defined, it is specified in terms of the approximate area ratios of white, gray, and black. For this reason, the present embodiment treats the soft key and the flat key as the same.

FIG. 5 shows tone (color tone) classification by color, the other of the two elements that make up the tone of a video. The color tone is determined by lightness and saturation and is a known technique (Document 6, page 93) that applies the PCCS (Practical Color Coordinate System; the Japan Color Research Institute color system). Color tones are not limited to the classification shown in FIG. 5 and can be divided more finely. For example, the boundary between Monotone and Bright tone is called the light grayish tone, and the boundary between Monotone and Dull tone is called the grayish tone. Because the present embodiment targets display devices such as televisions rather than the tones of printed matter, the classification shown in FIG. 5 can be used as is.

FIG. 6 illustrates human physiological visual characteristics for luminance, with spatial frequency ν on the horizontal axis and temporal frequency f on the vertical axis. The curves connect points of equal contrast sensitivity, taking the maximum contrast sensitivity as the reference, and represent a known characteristic (Non-Patent Document 3: spatio-temporal frequency characteristics of vision; page 17). The spatial frequency characteristic is band-pass when f is small but becomes low-pass when f is large (about 20 Hz or more). Likewise, the temporal frequency characteristic is band-pass when ν is small but becomes low-pass when ν is large (about 5 CPD or more).

It is also known that for equal-contrast-sensitivity curves at values below the maximum contrast sensitivity, both the spatial and temporal frequency characteristics are low-pass (Document 1: temporal frequency characteristics, page 180; Document 2: contrast threshold, pages 193 and 220; Document 3: spatial frequency characteristics of vision, pages 12 and 15; Document 4: spatio-temporal interaction, pages 551 and 579). The present embodiment makes use of these known human physiological visual characteristics for luminance. For example, the spatio-temporal characteristics of luminance shown in FIG. 6 can be stored as a table and used to correct the luminance.

FIG. 7 shows the relative contrast sensitivity of color vision as a function of spatial frequency. The white-black curve is the luminance characteristic and is included for comparison with the color vision characteristics. Experiments to date have shown that the contrast sensitivity function of color vision differs from that of luminance in two respects. First, for the color contrast sensitivity function, sensitivity attenuation and cutoff on the high spatial frequency side occur at lower spatial frequencies than for the luminance contrast sensitivity function. Second, the color contrast sensitivity function shows no loss of sensitivity at low spatial frequencies. In other words, the luminance contrast sensitivity function is band-pass, whereas the color contrast sensitivity function is low-pass. In the low frequency band, sensitivity to color contrast is higher, while at higher frequencies sensitivity to luminance contrast is higher. In addition, for the blue-yellow contrast sensitivity function, sensitivity attenuation and cutoff at high spatial frequencies occur at lower spatial frequencies than for the red-green function.

The color contrast sensitivity function likewise differs from the luminance contrast sensitivity function in its temporal frequency characteristics: there is no loss of sensitivity at low temporal frequencies, which resembles the spatial frequency behavior. (Reference 2: Spatiotemporal frequency characteristics of color vision, page 232; Reference 4: Spatiotemporal characteristics of the visual system for chromaticity information, page 592)

For example, the spatial characteristics of the hues (yellow-blue, red-green) shown in FIG. 7 can be stored as a table and used to perform hue correction.
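
The table-based hue correction described above can be sketched as follows. This is only an illustration under stated assumptions: the table name `RED_GREEN_TABLE`, its values, the linear interpolation, and the multiplicative weighting are all hypothetical and are not taken from FIG. 7 or from the embodiment.

```python
from bisect import bisect_right

# Hypothetical table for one hue channel: (spatial frequency, relative sensitivity).
# The values are placeholders with a low-pass shape, NOT values read from FIG. 7.
RED_GREEN_TABLE = [(0.1, 1.0), (1.0, 0.9), (4.0, 0.4), (10.0, 0.05)]


def sensitivity(table, freq):
    """Linearly interpolate the tabulated sensitivity, clamping at both ends."""
    freqs = [f for f, _ in table]
    if freq <= freqs[0]:
        return table[0][1]
    if freq >= freqs[-1]:
        return table[-1][1]
    i = bisect_right(freqs, freq)
    (f0, s0), (f1, s1) = table[i - 1], table[i]
    return s0 + (s1 - s0) * (freq - f0) / (f1 - f0)


def correct_hue(components, table):
    """Weight each (frequency, amplitude) pair by the tabulated sensitivity.

    Assumption: the correction is a simple per-frequency gain.
    """
    return [(f, a * sensitivity(table, f)) for f, a in components]
```

With this placeholder table, a hue component at a high spatial frequency is attenuated far more strongly than one at a low spatial frequency, which mirrors the low-pass behavior described for color vision.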

The similar video search apparatus and method according to the present embodiment are not limited to the illustrated examples described above, and various modifications can of course be made without departing from the gist of the present invention.

For example, in FIG. 3 the spatial frequency filter 306 is applied after the temporal frequency filter 303, but this order is not essential: the spatial frequency filter 306 may be applied first, followed by the temporal frequency filter 303. Nor is it necessary to split the processing into the two stages of the temporal frequency filter 303 and the spatial frequency filter 306; both may be applied at once. In that case, the correction uses a spatio-temporal frequency characteristic that combines the characteristics of the temporal frequency filter 303 and the spatial frequency filter 306. The same applies to the temporal frequency filter 304 and the spatial frequency filter 307, and to the temporal frequency filter 305 and the spatial frequency filter 308.

Similarly, in FIG. 8 the correction by the spatial frequency characteristic in step S809 is performed after the correction by the temporal frequency characteristic in step S808, but this order may be changed: the correction of step S809 may be performed first, followed by the correction of step S808. Nor is it necessary to split the processing into the two stages of step S808 and step S809; both may be performed at once. In that case, the correction uses a spatio-temporal frequency characteristic that combines the temporal frequency characteristic of step S808 and the spatial frequency characteristic of step S809. The same applies to steps S811 and S812, and to steps S815 and S816.
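
The interchangeability of the two stages can be illustrated with a small sketch. It assumes, purely for illustration, that each correction is a multiplicative gain on a component described by a temporal frequency, a spatial frequency, and an amplitude, and that the combined characteristic is the product of the two gains. The gain functions `temporal_gain` and `spatial_gain` are hypothetical placeholders, not the characteristics of FIG. 6 or FIG. 7.

```python
import math


def temporal_gain(ft, fs):
    """Hypothetical temporal sensitivity (placeholder roll-off); ignores fs."""
    return 1.0 / (1.0 + (ft / 8.0) ** 2)


def spatial_gain(ft, fs):
    """Hypothetical spatial sensitivity (placeholder roll-off); ignores ft."""
    return 1.0 / (1.0 + (fs / 4.0) ** 2)


def correct_two_stage(components, first, second):
    """Apply two gain stages in sequence to (ft, fs, amplitude) components."""
    staged = [(ft, fs, a * first(ft, fs)) for ft, fs, a in components]
    return [(ft, fs, a * second(ft, fs)) for ft, fs, a in staged]


def correct_combined(components):
    """Apply a single combined spatio-temporal gain in one pass."""
    return [
        (ft, fs, a * temporal_gain(ft, fs) * spatial_gain(ft, fs))
        for ft, fs, a in components
    ]


components = [(2.0, 1.0, 10.0), (16.0, 8.0, 4.0)]
time_first = correct_two_stage(components, temporal_gain, spatial_gain)
space_first = correct_two_stage(components, spatial_gain, temporal_gain)
one_pass = correct_combined(components)
```

Under these assumptions the three results agree up to floating-point rounding, which is why the order of the stages, or merging them into one, does not change the corrected values.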

In the present embodiment, the video feature amount is calculated using human physiological visual characteristics. Measurements of these characteristics can differ somewhat between experimenters. For example, some studies report that, in the temporal frequency characteristics of color perception, a cutoff on the high spatial frequency side occurs at low temporal frequencies. The present embodiment applies human physiological visual characteristics; the characteristics themselves are not what characterizes it. Practitioners may therefore perform the correction based on human physiological visual characteristics in whatever form suits the application in use.

The present embodiment describes an apparatus and method that search for similar videos taking compressed video encoded by MPEG or the like as input, but the input video does not have to be encoded compressed video. An encoding process such as MPEG normally takes unencoded video as input and generates an encoded compressed moving image. In the course of generating that compressed moving image, the motion vectors and DCT coefficients of FIG. 2 are obtained and embedded in the encoded video signal. For unencoded video, therefore, the motion vectors and DCT coefficients can be obtained by performing the same processing as a known encoding process such as MPEG. Accordingly, when unencoded video is input in the present embodiment, processing equivalent to a known encoding process such as MPEG is performed to obtain the motion vectors and DCT coefficients, after which the processing by the video feature amount calculation unit 211 of FIG. 2 is performed. Because encoding processes such as MPEG are known techniques, they are described in many documents.
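
As a rough illustration of obtaining DCT coefficients from unencoded video, the following sketch computes an orthonormal two-dimensional DCT-II of one square pixel block in plain Python. The function name `dct_2d` and the sample block are assumptions made for this example; a real encoder also performs motion estimation, quantization, and entropy coding, none of which is shown here.

```python
import math


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        # Normalization factor for frequency index k.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# A flat 8x8 block has energy only in the DC coefficient coeffs[0][0].
flat_block = [[10.0] * 8 for _ in range(8)]
flat_coeffs = dct_2d(flat_block)
```

The coefficient indices correspond to spatial frequencies within the block, which is the property a spatial-frequency correction would rely on.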

Summary

The similar video search technique according to the present embodiment stores tables or the like of the spatio-temporal characteristics (temporal frequency characteristics and spatial frequency characteristics) in memory (or uses tables stored in a database on a network) and refers to them to perform correction. To look up the tables, the input video is decomposed into its luminance and hue components. The correction values in the table are then applied to the luminance information. The luminance information corrected by the spatio-temporal characteristics is classified by light/dark contrast, and the hue is classified by color tone. The results are then vectorized as a video feature amount and compared with that of another input video (a feature vector of the same form) to determine whether they match.

Accordingly, it is preferable to have:
(a) a table of the spatio-temporal characteristics of luminance;
(b) correction means based on the spatio-temporal characteristics of luminance;
(c) classification means using a light/dark contrast classification table applied to the corrected luminance information;
(d) classification means using a color tone classification table applied to hue;
(e) means for converting the input video into feature amounts (light/dark contrast, color tone); and
(f) matching degree calculation means that compares video feature amounts with each other.
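
Items (c) through (f) can be sketched as a small pipeline. The class boundaries, the per-frame scalar inputs, and the matching measure below are hypothetical choices made for illustration; the embodiment's actual classification tables (FIGS. 4 and 5) and comparison rule are not reproduced here.

```python
def classify(value, boundaries):
    """Return the index of the class that value falls into.

    boundaries is an ascending tuple of thresholds (hypothetical table).
    """
    return sum(value >= b for b in boundaries)


def frame_feature(luma_ratio, tone_value,
                  contrast_bounds=(0.33, 0.66),
                  tone_bounds=(0.25, 0.5, 0.75)):
    """Build a (contrast class, tone class) feature for one corrected frame."""
    return (classify(luma_ratio, contrast_bounds),
            classify(tone_value, tone_bounds))


def match_degree(features_a, features_b):
    """Fraction of aligned frames whose class pairs coincide (assumed measure)."""
    n = min(len(features_a), len(features_b))
    if n == 0:
        return 0.0
    hits = sum(a == b for a, b in zip(features_a, features_b))
    return hits / n


reference = [frame_feature(0.2, 0.6), frame_feature(0.5, 0.3), frame_feature(0.9, 0.8)]
candidate = [frame_feature(0.1, 0.7), frame_feature(0.5, 0.1), frame_feature(0.8, 0.9)]
score = match_degree(reference, candidate)
```

A caller could then compare `score` against an application-specific threshold to decide whether two videos are treated as similar; that threshold is likewise an assumption of this sketch.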

Hue is corrected in the same way as luminance. As one example of the correction means, characteristics such as those in FIGS. 6 and 7 may be stored as tables, as is done for luminance.

As described above, the similar video search technique according to the present embodiment has the advantage that it can search for similar videos in a way that reflects human physiological visual characteristics. Furthermore, because the video feature amount is calculated from light/dark contrast and color tone, and similar videos are identified on the basis of this feature amount, video can be divided automatically into scenes that differ in light/dark contrast and color tone. Multiple shots (a scene) whose combinations of light/dark contrast and color tone are continuously similar can also be segmented automatically. In a news program, for example, studio footage and on-location relay footage can be separated automatically. In a drama, flashback scenes can be separated from the other scenes. In addition, ordinary video is assumed to be played back at the speed at which it was recorded. With this method, for video played back at high speed, the value of the high-speed playback rate is incorporated into the motion vectors of normal-speed playback, and the similarity determination based on visual characteristics is applied to the motion vectors corresponding to that playback speed, which makes similar video search during high-speed playback possible.

With respect to the characteristic for normal playback speed (FIG. 6), at double-speed playback (fast forward), for example, the temporal frequency values also double. At double-speed playback, therefore, performing the correction with the values at the corresponding temporal frequencies reproduces the similarity that humans perceive during high-speed playback and makes similar video search for high-speed playback possible (the search results for normal playback and for high-speed playback differ).
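
The effect of playback speed on the lookup can be sketched as follows. The table `LUMA_TEMPORAL_TABLE`, its values, and the nearest-entry lookup are hypothetical placeholders with a band-pass shape; they are not values read from FIG. 6.

```python
# Hypothetical luminance table: (temporal frequency in Hz, relative sensitivity).
LUMA_TEMPORAL_TABLE = [(1.0, 0.5), (8.0, 1.0), (16.0, 0.6), (32.0, 0.1)]


def effective_temporal_frequency(freq_hz, playback_speed):
    """Temporal frequency seen by the viewer at the given playback speed."""
    return freq_hz * playback_speed


def lookup_sensitivity(table, freq_hz):
    """Return the sensitivity of the table entry nearest to freq_hz."""
    return min(table, key=lambda entry: abs(entry[0] - freq_hz))[1]


def speed_aware_gain(freq_hz, playback_speed, table=LUMA_TEMPORAL_TABLE):
    """Correction gain for content of frequency freq_hz at a playback speed."""
    return lookup_sensitivity(table, effective_temporal_frequency(freq_hz, playback_speed))


normal_gain = speed_aware_gain(8.0, 1.0)   # looked up at 8 Hz
double_gain = speed_aware_gain(8.0, 2.0)   # looked up at 16 Hz
```

Because the same content is weighted differently at the two speeds, the resulting feature amounts, and hence the search results, can differ between normal and high-speed playback.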

The present invention can be used as a similar video search apparatus.

FIG. 1 is a functional block diagram showing a configuration example of a similar video search apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the similar video extraction processing unit according to the present embodiment.
FIG. 3 is a functional block diagram of the video feature amount calculation processing unit according to the present embodiment.
FIG. 4 is a diagram showing an example of the light/dark contrast classification used in the present embodiment.
FIG. 5 is a diagram showing an example of the color tone classification used in the present embodiment.
FIG. 6 is a diagram showing the physiological spatio-temporal frequency characteristics of vision with respect to luminance used in the present embodiment.
FIG. 7 is a diagram showing an example of the spatial frequency characteristics with respect to color used in the present embodiment.
FIG. 8 is a flowchart showing the flow of the video feature amount calculation processing according to the present embodiment.
FIG. 9 is a functional block diagram of a general similar video extraction processing unit.

Explanation of symbols

101 ... HD, 102 ... MO, CD, DVD, 104 ... memory, 105 ... CPU, 106 ... communication IF unit, 107 ... similar video extraction processing unit, 108 ... AV IF unit, 109 ... display IF unit, 200 ... encoded video decoding unit, 201 ... variable length decoding unit, 202 ... inverse quantization unit, 203 ... IDCT unit, 204 ... picture storage unit, 205 ... motion compensation unit, 206 ... picture rearrangement unit, 210 ... similar video detection unit, 211 ... video feature amount calculation unit, 212 ... video feature amount comparison unit, 213 ... video feature amount DB, 301 ... luminance/hue separation unit, 303 ... luminance information temporal frequency characteristic filter unit, 304 ... hue (red-green) information temporal frequency characteristic filter unit, 305 ... hue (yellow-blue) information temporal frequency characteristic filter unit, 306 ... luminance information spatial frequency characteristic filter unit, 307 ... hue (red-green) spatial frequency characteristic filter unit, 308 ... hue (yellow-blue) spatial frequency characteristic filter unit, 309 ... light/dark contrast classification unit, 310 ... color tone classification unit, 900 ... encoded video decoding unit, 901 ... variable length decoding unit, 902 ... inverse quantization unit, 903 ... IDCT unit, 904 ... motion compensation unit, 905 ... picture storage unit, 906 ... picture rearrangement unit, 910 ... similar video detection unit, 911 ... video feature amount calculation unit, 912 ... video feature amount comparison unit, 913 ... video feature amount DB.

Claims

1. A similar video search apparatus that searches stored videos for a video similar to an input video, comprising:
a video correction unit that corrects the input video based on human physiological visual characteristics;
a first classification unit that classifies the video based on light/dark contrast in the video corrected by the correction unit;
a second classification unit that classifies the video based on color tone in the corrected video;
a feature amount generation unit that generates a video feature amount based on the classification result of the first classification unit and the classification result of the second classification unit; and
a search unit that searches for similar videos based on the video feature amount.

2. The similar video search apparatus according to claim 1, wherein the video correction unit decomposes the input video into luminance and hue, and corrects the luminance and the hue using the visual characteristics of both the temporal frequency characteristics and the spatial frequency characteristics for luminance and hue.

3. The similar video search apparatus according to claim 2, wherein each of luminance and hue is corrected by referring to a table of the temporal frequency characteristic and the spatial frequency characteristic.

4. The similar video search apparatus according to claim 3, wherein the luminance information corrected by the spatio-temporal characteristics is classified by light/dark contrast and the hue is classified by color tone, the results are then vectorized as a video feature amount, and the similarity of the videos is calculated by comparison with another input video.

5. The similar video search apparatus according to claim 1, wherein the classification based on light/dark contrast and the classification based on color tone are performed after the correction based on human physiological visual characteristics.

6. The similar video search apparatus according to claim 1, wherein a motion vector and a DCT coefficient are extracted in the course of decoding the input encoded video, the motion vector is used for the correction by temporal frequency characteristics, and the DCT coefficient is used for the correction by spatial frequency characteristics.

7. The similar video search apparatus according to claim 1, wherein the correction means that performs the correction based on human physiological visual characteristics decomposes the input video into luminance, hue (red-green), and hue (yellow-blue), and performs correction based on the visual characteristics of at least one of the temporal frequency characteristic and the spatial frequency characteristic.

8. A similar video search apparatus comprising:
a video feature amount calculation unit that calculates, for a video, a video feature amount reflecting human physiological visual characteristics;
video feature amount comparison means for comparing a pre-recorded video feature amount with the video feature amount of a newly input video; and
a similar video search unit that searches for similar videos based on the comparison of the feature amounts,
wherein the video feature amount calculation unit comprises:
a luminance/hue separation unit that separates the luminance component and the hue component of an image included in the video;
a luminance temporal frequency filter unit that corrects the luminance component of an image included in the video using the temporal frequency visual characteristics for luminance among human physiological visual characteristics;
a luminance spatial frequency filter unit that corrects the luminance component of an image included in the video using the spatial frequency visual characteristics for luminance among human physiological visual characteristics;
a hue (red-green) temporal frequency filter unit that corrects the hue (red-green) component of an image included in the video using the temporal frequency visual characteristics for the red-green hue among human physiological visual characteristics;
a hue (red-green) spatial frequency filter unit that corrects the hue (red-green) component of an image included in the video using the spatial frequency visual characteristics for the red-green hue among human physiological visual characteristics;
a hue (yellow-blue) temporal frequency filter unit that corrects the hue (yellow-blue) component of an image included in the video using the temporal frequency visual characteristics for the yellow-blue hue among human physiological visual characteristics;
a hue (yellow-blue) spatial frequency filter unit that corrects the hue (yellow-blue) component of an image included in the video using the spatial frequency visual characteristics for the yellow-blue hue among human physiological visual characteristics;
a light/dark contrast classification unit that classifies an image corrected using the human physiological visual characteristics for luminance into a plurality of tones according to a luminance ratio; and
a color tone classification unit that classifies an image corrected using the human physiological visual characteristics for hue into a plurality of tones according to a ratio of lightness and saturation,
and wherein the classification result of the light/dark contrast classification unit and the classification result of the color tone classification unit are combined into the feature amount used for the video feature amount comparison.

9. The similar video search apparatus according to claim 8, wherein the video feature amount calculation unit receives DCT coefficients (spatial frequency characteristics), motion vectors (temporal frequency characteristics), and decoded video from an encoded video decoding unit that decodes a compressed moving image.

10. The similar video search apparatus according to claim 8, further comprising means for dividing a scene based on the video feature amount.

11. The similar video search apparatus according to claim 8, further comprising means for grouping a plurality of scenes having similar combinations of light/dark contrast and color tone based on the continuity of the video feature amount.

12. A similar video search method for searching stored videos for a video similar to an input video, comprising the steps of:
performing correction based on human physiological visual characteristics;
classifying the input video by light/dark contrast;
classifying the input video by color tone; and
generating a video feature amount based on the classification result of the light/dark contrast and the classification result of the color tone.

13. The similar video search method according to claim 12, wherein, in the correction based on human physiological visual characteristics, the input video is decomposed into luminance and hue and corrected based on the visual characteristics of at least one of the temporal frequency characteristic and the spatial frequency characteristic.

14. The similar image search method according to claim 12, wherein classification based on contrast between light and dark and classification based on color tone are performed after correction based on human physiological visual characteristics.

15. The similar video search method according to claim 12, further comprising a step of performing the correction based on human physiological visual characteristics by extracting a motion vector and a DCT coefficient in the course of decoding the input encoded video, using the motion vector for the correction by temporal frequency characteristics and the DCT coefficient for the correction by spatial frequency characteristics.

16. The similar video search method according to claim 12, wherein, when the correction based on human physiological visual characteristics is performed, the input video is decomposed into luminance, hue (red-green), and hue (yellow-blue), and the correction is performed based on the visual characteristics of at least one of the temporal frequency characteristic and the spatial frequency characteristic.

17. A program for causing a computer to execute the steps according to any one of claims 12 to 16.

18. A computer-readable recording medium on which the program according to claim 17 is recorded.