JP2013009248A

JP2013009248A - Video detection method, video detection device, and video detection program

Info

Publication number: JP2013009248A
Application number: JP2011141777A
Authority: JP
Inventors: Takeshi Irie; 豪入江; Takashi Sato; 隆佐藤; Akira Kojima; 明小島; Ushio Shibusawa; 潮渋沢; Katsushi Shindo; 勝志進藤; Koichi Maruyama; 剛一丸山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-06-27
Filing date: 2011-06-27
Publication date: 2013-01-10

Abstract

PROBLEM TO BE SOLVED: To detect advertisement video accurately and efficiently. SOLUTION: A video detection device 1 comprises: a structure analysis unit 101 that analyzes the section structure of an input video and divides it into partial sections; a candidate section detection unit 102 that detects, on the basis of features of the partial sections, one or more candidate sections containing an advertisement video and obtains their start and end times; a shot feature extraction unit 103 that analyzes at least one of the image and the sound of the partial sections lying between the start time and end time of a candidate section and extracts a feature quantity; an encoding unit 104 that expresses the partial sections as codes on the basis of the feature quantity; a recurring code sequence detection unit 105 that detects, from the code sequences formed by one or more coded partial sections, groups of partial sections whose code sequences appear repeatedly; and a filter unit 106 that outputs, as advertisement video sections, only those detected groups of partial sections that satisfy a predetermined condition.

Description

The present invention relates to a video detection method, a video detection device, and a video detection program that analyze an input video to be processed and detect advertisement video.

As exemplified by VOD (Video On Demand), IPTV (Internet Protocol Television), and retransmission of digital terrestrial broadcasts, video distribution services that use communication networks have come into widespread use, and a huge amount of video content is now delivered over many channels.

For producers, in addition to conventional broadcasting and theatrical screening, it has become possible to sell content through multiple channels using these video distribution services. This brings not merely an increase in sales channels but also new business opportunities, of which video advertising is a representative example. Video advertising is important promotional material that attracts interest in the advertised product and motivates consumers to purchase. Video advertisements are therefore required to be timely and matched to market trends and user needs.

However, this does not necessarily work well with the new video distribution services described above. In broadcasting and movie theaters, which have been the mainstream until now, all users normally viewed the same content at the same time (or in the same period). Because it was possible to know when, where, and in some cases what kind of users would view the content, appropriate advertisement video could be placed. In the new video distribution services, by contrast, each user normally views content at whatever time and place they like. Depending on when the user views the content, the inserted advertisement video may therefore fail to attract any interest or may be entirely meaningless. To address this problem, an adaptive advertisement video distribution system is desired, for example one that replaces the advertisement video with an appropriate one according to when the user views the content.

A wide variety of technologies are needed to realize such a system, but in order to replace old advertisement video with new, it is necessary, at the very least, to first know where in the original video the advertisement video is inserted. If this information is available in advance there is no difficulty, but now that large amounts of video data are in circulation, it is not easy to know and manage the positions of the advertisement video in all of that data. There is therefore a demand for technology that automatically detects only the advertisement video in a large amount of video data.

Many inventions have been made concerning advertisement video detection. Older devices such as digital video recorders exploited the fact that the number of audio channels differed between advertisement video sections and other video sections, but this difference has recently disappeared, which makes detection by such methods difficult. For this reason, inventions have been made concerning techniques that analyze the video directly to detect advertisement video. The existing prior art falls broadly into the following two categories.

(1) Cases where reference data is available
(2) Cases where reference data is not available

The former assumes that the advertisement video to be detected is determined and has been obtained (this data is called reference data), and is a technique for finding the video sections in the video data that match the reference data. The latter is a technique that assumes no such reference data is available.

As a technique belonging to the former category, Patent Document 1 discloses holding a still image as reference data and detecting video frames similar to that still image.

As a technique belonging to the latter category, Non-Patent Document 1 discloses modeling the distinctive features of advertisement video with a statistical model such as a Support Vector Machine (SVM) or a Hidden Markov Model (HMM), and performing detection based on this “advertisement video model”. The video features used are the similarity of image frames between adjacent shots (a shot being a continuous video section between two cut points), the mean and variance of the edge change rate, the type of sound (speech, music, silence, and so on), and the temporal differences of these. These video features are modeled with an SVM, and sections that are likely to be advertisement video sections are detected.

Also in the latter category, Non-Patent Document 2 discloses a technique that uses no model and performs detection based only on shot similarity. In this technique, all video is first divided into shots. Next, the mean and variance of the color of the image frames in each shot, the mean and variance of the directional luminance, edges, and the position and size of faces in the image are extracted as shot feature quantities, and the similarity between all pairs of shots is calculated from these feature quantities. Finally, only shots whose distribution of similarity to other shots shows a particular tendency are detected as advertisement video sections.

Further in the latter category, Patent Document 2 discloses a technique that detects one or more shots and, among those whose start time and end time are silent, detects as advertisement video sections those whose length is either 15 seconds or 30 seconds.

Like Patent Document 2, Patent Document 3 discloses a technique that uses the duration of advertisement video. In this technique, a time at which the average volume value stays below 80 continuously for 100 milliseconds or more is detected as a blank, and when the interval between blanks matches a duration of advertisement video (for example, 15 seconds or 30 seconds), that section is detected as advertisement video.

Patent Document 1: Japanese Patent No. 4421527
Patent Document 2: Japanese Patent No. 3407840
Patent Document 3: Japanese Patent No. 3513424

Non-Patent Document 1: X.-S. Hua, L. Lu, and H.-J. Zhang, “Robust Learning-based TV Commercial Detection,” in Proceedings of IEEE International Conference on Multimedia & Expo., 2005.
Non-Patent Document 2: P. Duygulu, M.-Y. Chen, and A. Hauptmann, “Comparison and Combination of Two Novel Commercial Detection Methods,” in Proceedings of IEEE International Conference on Multimedia & Expo., pp. 1267-1270, 2004.

Techniques that presuppose reference data, such as the one described in Patent Document 1, require the advertisement video to be detected to be known in advance. However, it is practically impossible to obtain and store all past advertisement videos beforehand, including newly released ones, and to compare against them.

Techniques that do not presuppose reference data, on the other hand, do not require the advertisement video to be detected to be known. They therefore avoid the above problem, but have the following problems.

The technique described in Non-Patent Document 1 uses a model of advertisement video. To obtain a general-purpose, highly accurate model, data covering every kind of advertisement video must be obtained in advance. Because obtaining such a model requires a large amount of advertisement video data, it is difficult to achieve practical accuracy.

The technique of Non-Patent Document 2 uses no model and therefore does not require a large amount of data in advance. The amount of computation, however, is a problem. One hour of video data typically contains about 1,000 shots, and calculating the similarity between all pairs of these shots requires 499,500 similarity calculations. With multi-channel distribution and the like, the total duration of video distributed per hour is now far greater than this and is expected to keep growing. A technique with such a large amount of computation is therefore not practical.
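The figure of 499,500 is the number of unordered pairs among 1,000 shots, n(n-1)/2. The following minimal sketch reproduces that arithmetic; the function name `pairwise_comparisons` is hypothetical and is not taken from the cited document.

```python
from math import comb


def pairwise_comparisons(num_shots: int) -> int:
    """Number of unordered shot pairs, i.e. n * (n - 1) / 2."""
    return comb(num_shots, 2)


# About 1,000 shots per hour of video, as stated in the text.
print(pairwise_comparisons(1000))
# Doubling the number of shots roughly quadruples the number of comparisons.
print(pairwise_comparisons(2000))
```

Because the count grows quadratically with the number of shots, the cost rises quickly as more channels and longer recordings are processed together.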

The techniques of Patent Documents 2 and 3 are extremely simple: they perform detection solely on the basis of whether the time interval between boundaries matches the length of an advertisement video. They therefore produce many false detections (over-detection) and have low accuracy.

As described above, no video detection technique that detects advertisement video efficiently and accurately from a large amount of video has been realized so far.

The present invention has been made to solve the above problems, and its object is to provide a video detection method, a video detection device, and a video detection program capable of detecting advertisement video accurately and efficiently.

To achieve the above object, the invention according to a first aspect is a video detection method for detecting advertisement video in an input video to be processed, comprising: a structure analysis process of analyzing the section structure of the input video and dividing it into partial sections; a candidate section detection process of detecting, based on features of the partial sections, one or more candidate sections containing the advertisement video and obtaining their start and end times; a feature extraction process of analyzing at least one of the image and the sound of the partial sections lying between the start time and end time of a candidate section and extracting a feature quantity; an encoding process of expressing the partial sections as codes based on the feature quantity; a recurring code sequence detection process of detecting, from the code sequences formed by one or more coded partial sections, groups of partial sections whose code sequences appear repeatedly; and a filter process of outputting, as advertisement video sections, only those detected groups of partial sections that satisfy a predetermined condition.

The invention according to a second aspect is the invention according to the first aspect, wherein the structure analysis process analyzes cut points, which are breaks in the input video, and obtains, among these cut points, at least the times of points that lie in neither a speech section nor a music section and the time length to the next point satisfying the same condition.

The invention according to a third aspect is the invention according to the first aspect, wherein the feature extraction process extracts, as the feature quantity, at least one of the luminance mean and variance for each partial region, the color mean and variance for each partial region, a luminance histogram, a color histogram, and MFCC.

The invention according to a fourth aspect is the invention according to the first aspect, wherein the recurring code sequence detection process detects recurring code sequences by repeatedly detecting partial code sequences that occur at least a certain number of times, and the filter process outputs only recurring code sequences having a predetermined time length.

To achieve the above object, the invention according to a fifth aspect is a video detection device for detecting advertisement video in an input video to be processed, comprising: a structure analysis unit that analyzes the section structure of the input video and divides it into partial sections; a candidate section detection unit that detects, based on features of the partial sections, one or more candidate sections containing the advertisement video and obtains their start and end times; a feature extraction unit that analyzes at least one of the image and the sound of the partial sections lying between the start time and end time of a candidate section and extracts a feature quantity; an encoding unit that expresses the partial sections as codes based on the feature quantity; a recurring code sequence detection unit that detects, from the code sequences formed by one or more coded partial sections, groups of partial sections whose code sequences appear repeatedly; and a filter unit that outputs, as advertisement video sections, only those detected groups of partial sections that satisfy a predetermined condition.

The invention according to a sixth aspect is the invention according to the fifth aspect, wherein the structure analysis unit analyzes cut points, which are breaks in the input video, and obtains, among these cut points, at least the times of points that lie in neither a speech section nor a music section and the time length to the next point satisfying the same condition.

The invention according to a seventh aspect is the invention according to the fifth aspect, wherein the feature extraction unit extracts, as the feature quantity, at least one of the luminance mean and variance for each partial region, the color mean and variance for each partial region, a luminance histogram, a color histogram, and MFCC.

The invention according to an eighth aspect is the invention according to the fifth aspect, wherein the recurring code sequence detection unit detects recurring code sequences by repeatedly detecting partial code sequences that occur at least a certain number of times, and the filter unit outputs only recurring code sequences having a predetermined time length.

To achieve the above object, the invention according to a ninth aspect is a video detection program that causes a computer to execute each process of any one of the first to fourth aspects.

According to the present invention, it is possible to provide a video detection method, a video detection device, and a video detection program capable of detecting advertisement video accurately and efficiently.

FIG. 1 is a diagram showing a configuration example of the video detection device in an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the video detection device in the embodiment of the present invention.
FIG. 3 is a diagram showing an example of the structure analysis process in the embodiment of the present invention.
FIG. 4 is a diagram showing an example of the candidate section detection process in the embodiment of the present invention.
FIG. 5 is a diagram showing an example of feature quantity extraction positions in the embodiment of the present invention.
FIG. 6 is a diagram showing an example of information stored in the storage device in the embodiment of the present invention.
FIG. 7 is a diagram showing an example of the recurring code sequence detection process in the embodiment of the present invention.
FIG. 8 is a diagram showing an example of information stored in the storage device in the embodiment of the present invention.

Embodiments of the present invention will be described in detail below with reference to the drawings.

FIG. 1 is a diagram showing a configuration example of the video detection device 1 in an embodiment of the present invention. The video detection device 1 takes video stored in a video database 108 as input, detects advertisement video in the input video, and stores it in a storage device 107. It comprises a structure analysis unit 101, a candidate section detection unit 102, a shot feature extraction unit 103, an encoding unit 104, a recurring code sequence detection unit 105, a filter unit 106, and the storage device 107. These processing units are realized by the hardware of a computer, consisting of a CPU, memory, an external storage device, and the like, together with software programs. In this embodiment, the video database 108 is outside the video detection device 1 and is connected to it through a communication network. The storage device 107 is illustrated as being inside the video detection device 1. Of course, the video database 108 may instead be inside the video detection device 1, and the storage device 107 may be outside it.

FIG. 2 is a flowchart showing the operation of the video detection device 1.

First, the structure analysis unit 101 executes a video structure analysis process, including at least shot analysis, on video registered in the video database 108 and divides the video into meaningful partial sections (hereinafter simply “sections”) (step S201). Next, the candidate section detection unit 102 receives the result from the structure analysis unit 101, detects as candidate sections only those sections assumed to contain the advertisement video to be detected, and stores them in the storage device 107 together with the result of the structure analysis unit 101 (step S202). Next, the shot feature extraction unit 103 extracts a feature quantity for each shot obtained by the structure analysis unit 101 within the candidate sections of the video registered in the video database 108 (step S203). Next, the encoding unit 104 executes an encoding process that expresses each shot as a code based on the shot feature quantity extracted by the shot feature extraction unit 103, and stores the result in the storage device 107 (step S204). Next, the recurring code sequence detection unit 105 reads the shot code sequence stored in the storage device 107 and enumerates the code sequences that recur in it, that is, occur two or more times (step S205). Finally, the filter unit 106 extracts, from the enumerated recurring code sequences, those that satisfy the conditions for advertisement video and stores them in the storage device 107 (step S206). In this way, the video detection device 1 can detect advertisement video accurately and efficiently from a large amount of video without using reference data, an advertisement video model, or similarity between shots.
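The enumeration in step S205 can be illustrated with a minimal sketch, assuming that each shot has already been reduced to a hashable code (here a single letter) and that a code sequence means a contiguous run of shot codes. The function name `recurring_sequences`, the parameter `min_count`, and the choice to count overlapping occurrences are assumptions made for this illustration; the procedure actually used by the recurring code sequence detection unit 105 may differ.

```python
from collections import Counter


def recurring_sequences(codes, min_count=2, max_len=None):
    """Return {run: count} for contiguous runs of codes seen >= min_count times.

    Runs are examined in order of increasing length. If no run of length n
    is frequent, no run of length n + 1 can be, so the search stops early.
    Overlapping occurrences are counted (an assumption of this sketch).
    """
    max_len = max_len or len(codes)
    found = {}
    for n in range(1, max_len + 1):
        counts = Counter(
            tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)
        )
        frequent = {run: c for run, c in counts.items() if c >= min_count}
        if not frequent:
            break
        found.update(frequent)
    return found


# Hypothetical shot codes: the run A, B, C appears twice.
shot_codes = ["A", "B", "C", "D", "A", "B", "C", "E"]
for run, count in sorted(recurring_sequences(shot_codes).items()):
    print(run, count)
```

Comparing codes for equality in this way avoids computing a similarity for every pair of shots, which is the cost the text attributes to similarity-based methods.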

An example of each process executed by the video detection device 1 is described in detail below. Each of the following processes may be executed on all video registered in the video database 108 or on only a specific part of it. The video database 108 may carry metadata indicating what kind of video each item is, but this embodiment detects advertisement video without assuming any metadata. Without loss of generality, the video to be processed is assumed to be a single continuous stream of data.

[Step S201: Structure analysis process]
The structure analysis unit 101 executes a structure analysis process on video registered in the video database 108. Structure analysis here means dividing the video into meaningful sections. In this embodiment its purpose is to find shots, each lying between two cut points, and candidate boundaries between advertisement video sections and other video sections (such as the main program). It is therefore necessary to analyze at least the shots, that is, the breaks between image frames (cut points). To obtain the boundary candidates, it is also preferable to analyze breaks in the sound, as described below.

First, various known methods can be used for shot analysis. For example, the shot detection method described in Reference 1 may be used.

[Reference 1]: Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, “Structured Video Computing”, IEEE Multimedia, vol.1, no.3, pp.34-43, 1994.
Using only this shot information, the cut points could be taken directly as boundary candidates. However, a cut point is not necessarily a boundary between an advertisement video section and another video section. Since advertisement video sections are normally sparsely distributed over the whole video, this would in fact lead to a considerable amount of over-detection. It is therefore preferable to narrow down the boundary candidates using breaks in the sound as well. This is explained with reference to FIG. 3.

As shown in FIG. 3, breaks in the video can be found from the shots 302 obtained for the original video 301. Meanwhile, an advertisement video section and the other video sections are not continuous as video and are unrelated to each other, so the sound is always discontinuous (interrupted) at their boundary. Boundary candidates can therefore be narrowed down by also using information about breaks in the sound.

As a method of analyzing breaks in the sound, the audio in the video is analyzed and, for example, audio frames that belong to neither a speech section nor a music section are treated as breaks. Speech sections and music sections can be detected with a known music/speech section detection technique such as that described in Reference 2, and sections that are neither are taken as boundaries.

[Reference 2]: K. Minami, A. Akutsu, H. Hamada, and Y. Tonomura, “Video Handling with Music and Speech Detection”, IEEE Multimedia, vol.5, no.3, pp.17-25, 1998.
Boundary candidates are found from the shots 302 obtained in this way and the speech or music sections 303. That is, a candidate boundary is a point that is a shot boundary (cut point) and lies in neither a speech section nor a music section, which yields the boundary candidates 304. As the figure shows, using the sound break information narrows down the boundary candidates 304 more efficiently than using the boundaries given by the shots 302 alone. The times of the boundary candidates found in this way and the start or end time of each shot are output to the candidate section detection unit 102.
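The selection of boundary candidates can be sketched as follows, assuming that cut points are given as times in seconds and that the speech and music sections have already been detected and are given as half-open (start, end) intervals. The function names and the sample values are hypothetical and serve only as an illustration.

```python
def boundary_candidates(cut_points, audio_segments):
    """Keep the cut points that lie in no speech or music section.

    cut_points: cut times in seconds.
    audio_segments: (start, end) intervals of speech or music, treated as
    half-open [start, end) (an assumption of this sketch).
    """
    def in_audio(t):
        return any(start <= t < end for start, end in audio_segments)

    return [t for t in cut_points if not in_audio(t)]


def candidate_intervals(candidates):
    """Pair each candidate time with the time length to the next candidate."""
    return [(a, b - a) for a, b in zip(candidates, candidates[1:])]


# Hypothetical example: six cut points, three speech or music sections.
cuts = [0.0, 4.2, 15.0, 21.7, 30.0, 45.0]
segments = [(0.5, 14.8), (15.3, 29.6), (31.0, 50.0)]

candidates = boundary_candidates(cuts, segments)
print(candidates)
print(candidate_intervals(candidates))
```

Only the cut points at 0.0, 15.0, and 30.0 seconds survive in this example, because the others fall inside a speech or music section.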

At this point, as in Patent Document 2 or Patent Document 3, one could find sections in which the interval between detected boundary candidates is 15 seconds or 30 seconds and output them as advertisement video. With this condition alone, however, sections that are not advertisement video are also detected, and the accuracy is markedly low. As described below, this embodiment achieves highly accurate detection by introducing effective extraction and encoding of shot feature quantities and efficient detection of recurring code sequences.

The above is an example of the details of the structure analysis processing.

[Step S202: Candidate Section Detection Processing]
Subsequently, the candidate section detection unit 102 detects, as candidate sections, only those sections that are expected to contain the advertisement videos to be detected, and stores them in the storage unit 107 together with the result of the structure analysis processing in step S201. This processing keeps the sections of the video that may be advertisement videos and discards the rest.

In general, advertisement videos are designed to attract the viewer's attention. For example, they are often cut into short shots to increase the number of screen changes, or distinctive sound effects and music are inserted into the audio. A video section without such characteristics is therefore unlikely to be an advertisement video. An example of candidate section detection processing based on this observation is described below with reference to FIG. 4. The example detects candidate sections based on the frequency of shots (cuts).

As shown in FIG. 4, the video and the corresponding shots obtained by the structure analysis processing in step S201 are available. First, a window with window length W and shift WS is placed over the video. The window length W is the size of the window (the unit time range), and the shift WS is the length of time by which the window is moved. W and WS may be set to arbitrary values; FIG. 4 illustrates the case where W is 30 seconds and WS is 30 seconds. The number of shots #S is counted for each window, and the shots contained in a window whose #S is at or above a given value are detected as candidate sections. In the example of FIG. 4, the shots contained, even partially, in a window with #S of 4 or more are detected as candidate sections.

In the example of FIG. 4, W and WS have the same length, so the windows do not overlap and #S is computed without overlap. W and WS do not have to be equal, however. If W is 30 seconds and WS is 15 seconds, adjacent windows overlap by 15 seconds. In either case, #S is still obtained by counting the shots contained in the range of the window length W. It is preferable, though, that adjacent windows leave no gap between them, so W ≥ WS is recommended.
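The window-based counting described above might be sketched as follows. This is a minimal sketch under stated assumptions: the function name `candidate_shots` and the sample shot times are hypothetical, and a shot is counted toward #S whenever it overlaps the window at all, which is one possible reading of the counting rule.

```python
def candidate_shots(shots, video_length, w=30.0, ws=30.0, min_shots=4):
    """Return indices of shots lying in a window with at least min_shots shots.

    shots: list of (start, end) times in seconds.
    Assumption: a shot counts toward a window if it overlaps it at all.
    """
    selected = set()
    k = 0
    while k * ws < video_length:
        win_start, win_end = k * ws, k * ws + w
        inside = [i for i, (s, e) in enumerate(shots)
                  if s < win_end and e > win_start]
        if len(inside) >= min_shots:      # #S at or above the threshold
            selected.update(inside)       # keep partially contained shots too
        k += 1
    return sorted(selected)

shots = [(0, 20), (20, 40), (40, 45), (45, 50), (50, 55), (55, 62), (62, 90)]
print(candidate_shots(shots, video_length=90))           # [1, 2, 3, 4, 5]
print(candidate_shots(shots, video_length=90, ws=15.0))  # [1, 2, 3, 4, 5, 6]
```

With WS = 15 the windows overlap, and the extra window from 45 to 75 seconds pulls one more shot into the candidate set.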

If candidate sections are obtained from the per-window #S alone, noise may prevent accurate detection. In such cases, a smoothing process, that is, a noise suppression process such as the k-nearest neighbor method or majority voting, may be introduced.
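One way to picture majority voting is to smooth the per-window decisions over neighboring windows. The sketch below assumes that each window has already been given a true/false flag (#S at or above the threshold or not); the function name `smooth_flags` and the `radius` parameter are hypothetical.

```python
def smooth_flags(flags, radius=1):
    """Majority vote over each window and its neighbors within `radius`."""
    smoothed = []
    for i in range(len(flags)):
        lo, hi = max(0, i - radius), min(len(flags), i + radius + 1)
        neighborhood = flags[lo:hi]
        # True only if strictly more than half of the neighborhood is True
        smoothed.append(sum(neighborhood) * 2 > len(neighborhood))
    return smoothed

flags = [False, True, False, False, True, True, True, False, True, True]
print(smooth_flags(flags))
```

In this sample the isolated positive at index 1 is removed and the isolated negative at index 7 is filled in.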

The storage unit 107 stores the times of the boundary candidates obtained in step S201, the start (or end) time and duration of each shot that became part of a candidate section, and information indicating whether the start (or end) time of each shot is a boundary candidate.

The above is an example of the details of the processing in step S202.

[Step S203: Shot Feature Extraction Processing]
Subsequently, the shot feature extraction unit 103 extracts a feature amount for each shot in the candidate sections of the video registered in the video database 108. Each candidate section detected in step S202 contains one or more shots, and this processing extracts from each shot a feature amount that characterizes it. As image information, statistics of luminance values, statistics of color, and the like can be used as the feature amount. Alternatively, the Bag-of-Visual-Words histogram described in Reference 3 or GIST described in Reference 4 may be used.

[Reference 3]: J. Sivic and A. Zisserman. “Video Google: A Text Retrieval Approach to Object Matching in Videos”, In Proc. International Conference on Computer Vision (ICCV), pp. 1470-1477, 2003.
[Reference 4]: A. Oliva and A. Torralba. “Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope”, International Journal of Computer Vision (IJCV), vol. 42, no. 3, pp. 145-175, 2001.

As sound information, the fundamental frequency, the volume, Mel Frequency Cepstral Coefficients (MFCC), or the like may be used. One or more of the above image and sound feature amounts may also be used in combination.

On the other hand, the more feature amounts are used, the more computation time is required. Especially when a large amount of video is processed, the computation time should be as short as possible. In that case it is preferable to adopt, for example, any one of the following as the feature amount: (1) the mean and variance of luminance for each partial region, (2) the mean and variance of color for each partial region, (3) a luminance histogram, (4) a color histogram, and (5) MFCC. Here, a partial region is one of the regions obtained by dividing a single image into, for example, 3 × 3 parts. These feature amounts can be extracted quickly and are also effective in that they give high accuracy in the subsequent detection.
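Feature (1), the per-region luminance mean and variance, could be computed along the following lines. The sketch assumes the frame is given as a two-dimensional list of luminance values with at least `grid` rows and columns; the function name `region_luminance_stats` and the synthetic test frame are hypothetical.

```python
from statistics import fmean, pvariance

def region_luminance_stats(frame, grid=3):
    """Return [mean, variance, mean, variance, ...] over a grid x grid split.

    frame: 2-D list of luminance values (rows of equal length).
    """
    h, w = len(frame), len(frame[0])
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * h // grid, (gy + 1) * h // grid)
            cols = range(gx * w // grid, (gx + 1) * w // grid)
            values = [frame[y][x] for y in rows for x in cols]
            feature += [fmean(values), pvariance(values)]
    return feature

# Synthetic 6 x 6 frame in which every 2 x 2 block has one constant value.
frame = [[float(10 * (3 * (y // 2) + x // 2)) for x in range(6)] for y in range(6)]
feature = region_luminance_stats(frame)
print(len(feature))    # 18 values: 9 regions x (mean, variance)
print(feature[0::2])   # region means
```

A 3 × 3 split gives an 18-dimensional vector per frame, which keeps the later encoding step cheap.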

In addition, because a shot itself contains multiple image frames and a sound signal of a certain length, there is freedom in choosing which image frames or which part of the sound signal the feature amount is extracted from. In this embodiment, the shot is divided into n equal parts along the time axis, and the feature amount is extracted from the image frame or sound signal at the middle of each part. FIG. 5 shows the positions at which image features are extracted when the shot is divided into two equal parts. In this example the shot contains 14 image frames F1 to F14. Dividing it into two equal parts gives a first part containing the seven image frames F1 to F7 and a second part containing the seven image frames F8 to F14. The feature amounts are extracted from the frames F4 and F11 at the middle positions. The value of n can be chosen freely; for example, n = 1 may be used to shorten the computation time, in which case the feature amount is extracted from the single frame at the middle of the shot. The extracted feature amount is output to the encoding unit 104 as the shot feature amount.
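The choice of frames in FIG. 5 can be reproduced with a small index calculation. In this sketch the function name `sample_frames` is hypothetical, and when a part contains an even number of frames the earlier of the two middle frames is taken, which is an assumption the embodiment does not spell out.

```python
def sample_frames(num_frames, n):
    """Return 1-based indices of the middle frame of each of n equal parts.

    Assumes num_frames >= n.
    """
    picks = []
    for k in range(n):
        start = k * num_frames // n        # 0-based, inclusive
        end = (k + 1) * num_frames // n    # 0-based, exclusive
        picks.append(start + (end - start - 1) // 2 + 1)
    return picks

print(sample_frames(14, 2))  # [4, 11], i.e. frames F4 and F11
print(sample_frames(14, 1))  # [7]
```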

The above is an example of the details of the processing in step S203.

[Step S204: Encoding Processing]
Subsequently, the encoding unit 104 executes encoding processing that represents each shot as a code based on the shot feature amount extracted in step S203, and stores the result in the storage unit 107. There are several ways to perform the encoding; two are described here. One is based on vector quantization, and the other uses a hash.

In the method based on vector quantization, a codebook for encoding shot feature amounts is prepared in advance, and the shot feature amounts are encoded with it. Various known methods can be used to create the codebook; for example, it can be created by applying a clustering method such as k-means. At encoding time, the distance between the center vector of each cluster and the shot feature amount is computed, and the id of the nearest cluster is assigned as the code.
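The encoding step of the vector quantization method amounts to a nearest-centroid lookup, as in the sketch below. The codebook here is a hand-written toy example rather than the output of k-means, Euclidean distance is assumed as the distance measure, and the name `encode_shot` is hypothetical.

```python
import math

def encode_shot(feature, codebook):
    """Return the id of the codebook entry whose center is nearest to `feature`."""
    return min(codebook, key=lambda cid: math.dist(feature, codebook[cid]))

# Toy codebook: cluster id -> center vector (would normally come from clustering).
codebook = {"A": (0.1, 0.1), "B": (0.9, 0.2), "C": (0.5, 0.8)}
print(encode_shot((0.8, 0.3), codebook))  # B
```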

The method using a hash follows, for example, the procedure below. First, the vector of the shot feature amount (or of a part of it) is written as f = (f1, f2, …, fd). Multiple thresholds f’i (i = 1, 2, …, M) are then given, and the vector is converted to another value b according to whether the value of each dimension is greater than or equal to each threshold.

For example, suppose f is three-dimensional with f = (0.8, 0.1, 0.5), and three thresholds are prepared: f’1 = 0.2, f’2 = 0.5, and f’3 = 0.7. Suppose further that 0001 is assigned to values less than f’1, 0010 to values of at least f’1 and less than f’2, 0100 to values of at least f’2 and less than f’3, and 1000 to values of at least f’3. Then f1 = 0.8 is at least f’3 and gives 1000, f2 = 0.1 is less than f’1 and gives 0001, and f3 = 0.5 is at least f’2 and less than f’3 and gives 0100. Concatenating these gives the hash value b = 100000010100, and the code is obtained from this value. The thresholds may take arbitrary values; for example, statistics of each vector element, or values obtained by applying a coordinate transformation to them, can be used.
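The worked example above can be checked with a few lines of Python. The sketch assumes the thresholds are sorted in ascending order and shared by all dimensions, as in the example; the function name `threshold_hash` is hypothetical.

```python
from bisect import bisect_right

def threshold_hash(feature, thresholds):
    """Map each dimension to a one-hot bit group and concatenate the groups."""
    width = len(thresholds) + 1
    groups = []
    for value in feature:
        level = bisect_right(thresholds, value)  # number of thresholds <= value
        groups.append(format(1 << level, "0{}b".format(width)))
    return "".join(groups)

print(threshold_hash((0.8, 0.1, 0.5), (0.2, 0.5, 0.7)))  # 100000010100
```

Because two shots receive the same code only when every dimension falls into the same threshold interval, the choice of thresholds controls how coarse the codes are.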

Once the processing of steps S201 to S204 is complete, the sequence of shots in each candidate section has been converted into a sequence of codes, and these are stored in the storage unit 107. FIG. 6 shows an example of the information stored in the storage unit 107 at the end of step S204. As shown in FIG. 6(a), table 61 stores the candidate section information (candidate section id, section start time, section end time). As shown in FIG. 6(b), table 62 stores the shot information (shot id, id of the candidate section to which the shot belongs, shot start time, shot duration, information indicating whether the shot is a boundary candidate, and the code representing the shot). Representing shots by codes such as A, D, E, and so on rather than by numerical vectors greatly reduces the amount of computation compared with the conventional approach of computing similarities between shots.
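The rows of tables 61 and 62 could be modeled as simple records, for example as below. The class names, field names, and sample values are hypothetical; only the list of stored items follows the description of FIG. 6.

```python
from dataclasses import dataclass

@dataclass
class CandidateSection:      # one row of table 61
    section_id: int
    start: float             # section start time in seconds
    end: float               # section end time in seconds

@dataclass
class Shot:                  # one row of table 62
    shot_id: int
    section_id: int          # candidate section the shot belongs to
    start: float             # shot start time in seconds
    length: float            # shot duration in seconds
    is_boundary: bool        # whether the shot is a boundary candidate
    code: str                # code assigned in step S204

def code_sequence(shots, section_id):
    """Codes of the shots in one candidate section, in time order."""
    members = sorted((s for s in shots if s.section_id == section_id),
                     key=lambda s: s.start)
    return [s.code for s in members]

shots = [
    Shot(2, 1, 5.0, 4.0, False, "D"),
    Shot(1, 1, 0.0, 5.0, True, "A"),
    Shot(3, 1, 9.0, 6.0, False, "E"),
    Shot(4, 2, 60.0, 3.0, True, "C"),
]
print(code_sequence(shots, 1))  # ['A', 'D', 'E']
```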

The above is an example of the details of the processing in step S204.

[Step S205: Recurring Code Sequence Detection Processing]
Subsequently, the recurring code sequence detection unit 105 reads the shot code sequences stored in the storage unit 107 and enumerates the recurring code sequences in them.

The same advertisement video is used repeatedly across multiple channels and times. Sections that are likely to be advertisement videos can therefore be obtained by extracting the code sequences that appear repeatedly across all the candidate sections.

On the other hand, finding such recurring code sequences is known to require a great deal of computation time. Several known techniques address this kind of problem; for example, the techniques described in Reference 5, Reference 6, and Reference 7 can be used.

[Reference 5]: Japanese Patent No. 3759438
[Reference 6]: R. Agrawal and R. Srikant. “Mining sequential patterns”, In Proc. International Conference on Data Engineering (ICDE), pp. 3-14, 1995.
[Reference 7]: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. “Prefixspan: Mining sequential patterns by prefix-projected growth”, In Proc. International Conference on Data Engineering (ICDE), pp. 215-224, 2001.

All of the above techniques are effective, but none is optimized for detecting advertisement videos. This embodiment therefore achieves more efficient processing by taking the following points into account.

(1) Only contiguous code sequences are enumerated.
(2) Advertisement videos have standard lengths (15 seconds, 30 seconds, 60 seconds, and so on).

It is preferable to use an efficient procedure that takes the depth-first PrefixSpan described in Reference 7 as its basis and modifies it by combining it with the filter processing of step S206 described later. A depth-first search requires less memory than a breadth-first search. An example of the processing for finding recurring code sequences is described below with reference to FIG. 7.

First, the code sequences, in which the codes of the shots contained in each candidate section are arranged in time order, are obtained from the storage unit 107. In FIG. 7, four candidate sections 1 to 4 have been obtained, with the following code sequences.

Candidate section 1: {A, C, D, E, F, G, A, C, F}
Candidate section 2: {A, B, C, E, F, G, F, D}
Candidate section 3: {C, E, F, B, Y, E, B, C}
Candidate section 4: {A, D, E, H, G, E}

First, the number of occurrences of each code in these candidate sections is counted. In this example the counts are A: 4, B: 3, C: 5, D: 3, E: 6, F: 5, G: 3, H: 1, Y: 1. The codes whose number of occurrences is at or above a given value are stored as processing targets and are then processed in turn. Because the processing here is based on depth-first search, attention is first directed at E, which has the largest number of occurrences. For each candidate section, a new (partial) candidate section is generated from the code sequence that follows E. In the example of FIG. 7, candidate section 1 is {A, C, D, E, F, G, A, C, F}, so the part that follows E, {F, G, A, C, F}, becomes candidate section 1a. The following new (partial) candidate sections are generated in the same way.

Candidate section 1a: {F, G, A, C, F}
Candidate section 2a: {F, G, F, D}
Candidate section 3a: {F, B, Y, E, B, C}
Candidate section 3b: {B, C}
Candidate section 4a: {H, G, E}

Next, counting the occurrences of the code that appears first in each of candidate sections 1a to 4a gives F: 3, B: 1, H: 1. Attention is directed at F, which has the largest number of occurrences, and partial candidate sections are generated again. At the same time F is stored, and together with the previously stored E, the recurring code sequence {E, F} is found.

Recurring code sequences can be enumerated by repeating the above processing. Because this example assumes depth-first search, processing started with E and F, which have the largest numbers of occurrences, but other codes such as C are processed in the same way. It is preferable to repeat the same processing for every code whose number of occurrences within a candidate section or partial candidate section is at or above the given value.
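The walkthrough above can be approximated by the following depth-first sketch, which grows contiguous code sequences one code at a time and keeps only continuations that occur often enough. It is a simplified illustration and not PrefixSpan itself: the function name `recurring_sequences`, the `min_count` parameter, and the decision to report only sequences of two or more codes are assumptions, and the duration-based pruning of step S206 is left out.

```python
from collections import defaultdict

def recurring_sequences(sections, min_count=2):
    """Enumerate contiguous code sequences occurring at least min_count times.

    Returns {sequence tuple: [(section index, start position), ...]}.
    """
    found = {}

    def grow(pattern, ends):
        # ends: (section index, position of the last code) for each occurrence
        if len(pattern) >= 2:
            n = len(pattern)
            found[pattern] = [(s, p - n + 1) for s, p in ends]
        following = defaultdict(list)
        for s, p in ends:
            if p + 1 < len(sections[s]):
                following[sections[s][p + 1]].append((s, p + 1))
        # Depth-first: extend with the most frequent next code first.
        for code, occ in sorted(following.items(), key=lambda kv: -len(kv[1])):
            if len(occ) >= min_count:
                grow(pattern + (code,), occ)

    first = defaultdict(list)
    for s, seq in enumerate(sections):
        for p, code in enumerate(seq):
            first[code].append((s, p))
    for code, occ in sorted(first.items(), key=lambda kv: -len(kv[1])):
        if len(occ) >= min_count:
            grow((code,), occ)
    return found

sections = [list("ACDEFGACF"), list("ABCEFGFD"), list("CEFBYEBC"), list("ADEHGE")]
print(recurring_sequences(sections, min_count=3))
# {('E', 'F'): [(0, 3), (1, 3), (2, 1)]}
```

With a threshold of three occurrences, the only sequence reported for the four candidate sections of FIG. 7 is {E, F}, in agreement with the walkthrough.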

Noise may also cause the code sequence to fluctuate even for the same advertisement video. Robustness against this can be improved by allowing up to a given number of codes to be substituted, or by introducing a threshold so that two code sequences are regarded as the same when their edit distance is at or below a given value. This also brings the following benefit.

That is, an advertisement video usually exists in several versions, such as 15-second, 30-second, and 60-second versions, and it is sometimes desirable to manage these versions as advertisement videos for the same product. Such versions are not entirely different videos: they share parts with one another and are often produced by inserting, deleting, or replacing some shots, so the differences can be quantified and absorbed by code substitution or edit distance.
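The edit-distance tolerance could be realized, for instance, with the standard Levenshtein distance over code sequences. The embodiment does not say which edit distance is intended, so the choice of Levenshtein distance, the function names, and the `max_distance` threshold are assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two code sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def same_sequence(a, b, max_distance=1):
    """Treat two code sequences as the same if they differ by few edits."""
    return edit_distance(a, b) <= max_distance

print(edit_distance("EFGAC", "EFHAC"))  # 1 (one substituted code)
print(same_sequence("EFGAC", "EFHAC"))  # True
print(same_sequence("EFG", "EFGAC"))    # False (two inserted codes)
```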
Through the above processing, all recurring code sequence sections that were found are output to the filter unit 106.

The above is an example of the details of the processing in step S205.

[Step S206: Filter Processing]
Finally, the filter unit 106 extracts, from the enumerated recurring code sequences, those that satisfy the conditions of an advertisement video and stores them in the storage unit 107.

In Japan, advertisement videos usually have a predetermined duration such as 15 seconds, 30 seconds, or 60 seconds. Therefore, among the extracted recurring code sequences, only those with a predetermined duration are kept (adopted) and are finally detected as advertisement videos. The duration of a detected recurring code sequence can be obtained by referring to the shot durations stored in the storage unit 107. Once the duration is obtained, it is checked against the predetermined durations, and only the sequences that match are adopted. However, a recurring code sequence that starts or ends at a shot boundary (cut) judged not to be a boundary candidate in step S201 is not adopted. In addition, the duration may not be exactly 15 seconds or the like because of timing deviations in the original advertisement video, errors in the structure analysis processing, rounding errors, and so on. It is therefore preferable to set a tolerance (for example, ±0.5 seconds) and accept recurring code sequences whose duration falls within it.
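The filter conditions can be summarized in a short predicate. The sketch below assumes that the shot durations of one recurring code sequence and two flags for its start and end boundaries are already available; the function name `is_advertisement` and its parameters are hypothetical.

```python
def is_advertisement(shot_lengths, starts_at_candidate, ends_at_candidate,
                     standard_lengths=(15.0, 30.0, 60.0), tolerance=0.5):
    """Decide whether one recurring code sequence passes the filter.

    shot_lengths: durations in seconds of the shots in the sequence.
    starts_at_candidate / ends_at_candidate: whether the first and last cut
    of the sequence were judged to be boundary candidates in step S201.
    """
    if not (starts_at_candidate and ends_at_candidate):
        return False
    duration = sum(shot_lengths)
    return any(abs(duration - t) <= tolerance for t in standard_lengths)

print(is_advertisement([5.0, 4.8, 4.9], True, True))   # True  (14.7 s)
print(is_advertisement([5.0, 5.0, 6.0], True, True))   # False (16.0 s)
print(is_advertisement([10.0, 20.2], True, False))     # False (bad end boundary)
```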

In the flowchart shown in FIG. 2, all recurring code sequences are enumerated first (step S205), and the filter processing (step S206) is then executed on all of them. In practice, however, the filter processing may be applied each time a recurring code sequence is found, deciding on the spot whether to adopt it. Unneeded recurring code sequences can then be discarded as the search proceeds, which reduces memory usage.

The information on the recurring code sequences that pass the filter is stored in the storage unit 107 as advertisement video sections, and the processing ends. For example, as shown in FIG. 8, table 63 stores the start time or end time, the duration, and the recurring code sequence. This advertisement video section information may be output as is so that the user can refer to it.

The above is an example of the details of the processing in step S206.

As described above, the video detection device 1 according to the embodiment of the present invention obtains code sequences by analyzing the image and sound information of the video and detects the groups of partial sections whose code sequences appear recurrently. Advertisement videos can therefore be detected accurately and efficiently from a large amount of video without using reference data, an advertisement video model, or similarities between shots.

In the structure analysis processing S201, the cut points, which are breaks in the input video, are analyzed, and for at least those cut points that lie in neither a speech section nor a music section, the time of the point and the time length to the next point satisfying the same condition are obtained. The boundaries that separate advertisement videos from other video can therefore be narrowed down efficiently.

In the feature extraction processing S203, at least one of the mean and variance of luminance for each partial region, the mean and variance of color for each partial region, a luminance histogram, a color histogram, and MFCC is extracted as the feature amount. Feature amounts can therefore be extracted faster than with conventional feature extraction processing, and advertisement videos can be detected with high accuracy.

In the recurring code sequence detection processing S205, recurring code sequences are detected by repeatedly detecting partial code sequences whose number of occurrences is at or above a given value, and in the filter processing S206, only the recurring code sequences with a predetermined duration are output. Recurring code sequences can therefore be detected far more efficiently than with conventional recurring code sequence detection processing.

The recurring code sequence detection processing S205 detects groups of partial sections whose code sequences appear recurrently, but when code sequences (reference data) are stored in advance, groups of partial sections whose code sequence is identical to a stored one may be detected instead. That is, when an advertisement video that has already been obtained is to be detected in another, new video, its duration and code sequence are already known, so they are stored in the storage device 107 as reference data. In this case not all of steps S201 to S206 need to be executed. It suffices to execute at least steps S201, S203, and S204 to convert the video into code sequences and then, based on the stored reference data, scan for video sections that match the duration and code sequence of each advertisement video. Storing advertisement videos as reference data once they have been obtained thus makes the processing more efficient.
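Scanning a newly encoded video against stored reference data could look like the sketch below. It assumes the new video has already been converted into parallel lists of shot codes and shot durations; the function name `scan_for_reference`, the exact-match comparison of codes, and the 0.5 second tolerance are assumptions made for illustration.

```python
def scan_for_reference(codes, lengths, ref_codes, ref_duration, tolerance=0.5):
    """Return start indices where the reference code sequence and duration match.

    codes: shot codes of the new video in time order.
    lengths: duration in seconds of each shot, parallel to `codes`.
    """
    n = len(ref_codes)
    hits = []
    for i in range(len(codes) - n + 1):
        if codes[i:i + n] == ref_codes:
            duration = sum(lengths[i:i + n])
            if abs(duration - ref_duration) <= tolerance:
                hits.append(i)
    return hits

codes = list("ACDEFGACF")
lengths = [3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0]
print(scan_for_reference(codes, lengths, list("EFG"), 15.0))  # [3]
print(scan_for_reference(codes, lengths, list("EFG"), 30.0))  # []
```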

The video detection device 1 according to the embodiment of the present invention has been described in detail above. Needless to say, this video detection method can be executed on a computer by means of a software program. The present invention is not limited to the example embodiment described, and various modifications can be made within the technical scope set forth in the claims. For example, the present invention can be used in various video distribution and communication services such as IPTV, digital signage, VOD (Video on Demand), and retransmission of terrestrial digital broadcasts.

DESCRIPTION OF SYMBOLS
1 ... Video detection device
101 ... Structure analysis unit
102 ... Candidate section detection unit
103 ... Shot feature extraction unit
104 ... Encoding unit
105 ... Recurring code sequence detection unit
106 ... Filter unit
107 ... Storage device
108 ... Video database

Claims

A video detection method for detecting advertisement video from input video to be processed,
Analyzing the section structure of the input video and dividing it into partial sections;
Candidate section detection processing for detecting one or more candidate sections including the advertisement video based on the characteristics of the partial sections and obtaining the start time and end time;
A feature extraction process of analyzing at least one of the image and sound of the partial section existing inside the start time and end time of the candidate section, and extracting a feature amount;
An encoding process for expressing the partial section by a code based on the feature amount;
A recursive code sequence detection process for detecting a partial section group to be a recurring code sequence from a code sequence composed of one or more partial sections expressed by the code;
A filtering process for outputting only those satisfying a predetermined condition among the detected partial section groups as an advertising video section;
A video detection method comprising:

2. The video detection method according to claim 1, wherein in the structure analysis process, a cut point that is a break of the input video is analyzed, and a time of a point that is not at least an utterance section or a music section among these cut points, and A video detection method characterized in that a time length to the next point satisfying the same condition is obtained.

The video detection method according to claim 1, wherein in the feature extraction process, at least one of luminance average value, variance, color average value, variance, luminance histogram, color histogram, and MFCC for each partial region. A video detection method characterized by extracting one as a feature amount.

The video detection method according to claim 1, wherein in the restart code sequence detection process, a restart code sequence is detected by repeatedly detecting a partial code sequence having a certain number of occurrences, and in the filter process, a predetermined code sequence is detected. A video detection method characterized by outputting only a restart code sequence having a time length.

5. A video detection apparatus for detecting an advertisement video from an input video to be processed,
A structure analysis unit that analyzes a section structure of the input video and divides the input video into partial sections;
A candidate section detecting unit that detects one or more candidate sections including the advertisement video based on the characteristics of the partial section and obtains a start time and an end time;
A feature extraction unit that analyzes at least one of the image and sound of the partial section existing inside the start time and end time of the candidate section, and extracts a feature amount;
An encoding unit for expressing the partial section by a code based on the feature amount;
A recursive code sequence detection unit that detects a partial section group that becomes a recurring code sequence from a code sequence composed of one or more partial sections expressed by the code;
A filter unit that outputs, as advertisement video sections, only those of the detected partial section groups that satisfy a predetermined condition;
A video detection apparatus comprising the above units.

6. The video detection apparatus according to claim 5, wherein the structure analysis unit analyzes cut points that are breaks in the input video, and obtains, among these cut points, the time of each point that is at least neither in an utterance section nor in a music section, and the time length from that point to the next point satisfying the same condition.

7. The video detection apparatus according to claim 5, wherein the feature extraction unit extracts, as the feature amount for each partial section, at least one of a luminance average value and variance, a color average value and variance, a luminance histogram, a color histogram, and MFCC.

8. The video detection apparatus according to claim 5, wherein the recursive code sequence detection unit detects a recursive code sequence by repeatedly detecting partial code sequences having a predetermined number of occurrences, and the filter unit outputs only recursive code sequences having a predetermined time length.

9. A video detection program for causing a computer to execute each of the processes of the video detection method according to any one of claims 1 to 4.