JP2017112448A

JP2017112448A - Video scene division device and video scene division program

Info

Publication number: JP2017112448A
Application number: JP2015244026A
Authority: JP
Inventors: Hideki Sumiyoshi; Yoshihiko Kawai
Original assignee: Nippon Hoso Kyokai NHK
Current assignee: Japan Broadcasting Corp
Priority date: 2015-12-15
Filing date: 2015-12-15
Publication date: 2017-06-22
Anticipated expiration: 2035-12-15
Also published as: JP6557592B2

Abstract

PROBLEM TO BE SOLVED: To provide a video scene division device and a video scene division program capable of dividing video content appropriately into scenes.
SOLUTION: A video scene division device 1 includes a shot boundary detection unit 11 for detecting shot boundaries, i.e., discontinuous points of a video, based on differences in image data between the frames of the video; a still image extraction unit 12 for extracting a plurality of still images for each shot divided by the shot boundaries; a subject recognition unit 13 for recognizing the subjects in each still image; a histogram generation unit 14 for generating, for each shot, a histogram indicating the appearance frequency of the subjects; and a scene boundary determination unit 15 for determining scene boundaries, i.e., discontinuous points of the histograms, based on the similarity of the histograms.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a device and a program for dividing video content into scenes.

Conventionally, applications such as image search have been realized by using statistical techniques to recognize the subjects in an image by software and attach metadata to the image.

Video content such as television programs is often represented as a hierarchical structure, as illustrated in FIG. 7. Specifically, the units are called, from smallest to largest, the frame, the shot, the scene, and the content (program video).

A shot is bounded by the camera switching points at the time of shooting and is short, lasting from several seconds to several tens of seconds. A typical program of about one hour contains roughly 100 to 1000 shots, so it is not easy to grasp the structure of the entire program from a list of shots. For example, in a program such as a drama that conveys meaning through a combination of multiple shots, the meaning of the video often cannot be understood from a single shot alone. For this reason, users of video search often feel that the shot is too fine a unit of video division.
In addition, when search results are presented in video search, functions are desired that show the content structurally and play back the required video in semantic units.

In this situation, methods have been proposed that focus on the continuity of, for example, color, pattern, or sound, and divide the video at the points where this continuity is interrupted (see, for example, Patent Documents 1 and 2).

Patent Document 1: Japanese Patent Laid-Open No. 2004-280669
Patent Document 2: Japanese Patent Laid-Open No. 2008-5167

A scene is a section composed of multiple shots to which a video editor has given meaning, and the shots in a scene often share the same place or time depicted in the content. Consequently, the sections indicated by the continuity of the video or audio signal used in conventional methods diverge substantially from the semantic sections a person has in mind, and the video is often not divided at the boundaries the user wants.
Thus, it has been difficult to automatically detect scene boundaries, which are the boundaries of semantic video content.

An object of the present invention is to provide a video scene division device and a video scene division program that can appropriately divide video content into scenes.

A video scene division device according to the present invention includes: a shot boundary detection unit that detects shot boundaries, which are discontinuous points of a video, based on differences in image data between frames of the video; a still image extraction unit that extracts a plurality of still images for each shot divided by the shot boundaries; a subject recognition unit that recognizes the subjects in each still image; a histogram generation unit that generates, for each shot, a histogram indicating the appearance frequency of the subjects; and a scene boundary determination unit that determines scene boundaries, which are discontinuous points of the histograms, based on the similarity of the histograms.

The subject recognition unit may distinguish a plurality of the subjects by clustering based on predetermined feature quantities contained in the still images.

The histogram generation unit may generate, as the histogram, a distribution of frequencies obtained by normalizing the number of still images in which each subject was recognized during the shot period, or a distribution of frequencies obtained by normalizing the time within the shot period corresponding to the still images in which each subject was recognized.

The histogram generation unit may generate the histogram for groups of the subjects.

The scene boundary determination unit may determine the scene boundary based on a partial histogram consisting only of a predetermined number of subjects with the highest frequencies in the histogram.

The scene boundary determination unit may adjust the scene boundary determination result based on appearance patterns, stored in advance, concerning the time axis and position of the subjects in a scene.

A video scene division program according to the present invention causes a computer to function as the video scene division device.

According to the present invention, video content can be appropriately divided into scenes.

FIG. 1 is a block diagram showing the functional configuration of the video scene division device according to the embodiment.
FIG. 2 is a diagram showing a specific example of shot boundary detection and still image extraction according to the embodiment.
FIG. 3 is a schematic diagram showing the scene division procedure according to the embodiment.
FIG. 4 is a flowchart showing the processing performed by the control unit according to the embodiment.
FIG. 5 is a diagram showing an example of weighting subjects based on program genre according to the embodiment.
FIG. 6 is a diagram showing an example of scene division based on video editing knowledge according to the embodiment.
FIG. 7 is a diagram illustrating the hierarchical units of video content.

An example of an embodiment of the present invention is described below.
FIG. 1 is a block diagram showing the functional configuration of the video scene division device 1 according to the present embodiment.

The video scene division device 1 is an information processing apparatus (computer) comprising a storage unit 20 and a control unit 10 that includes a shot boundary detection unit 11, a still image extraction unit 12, a subject recognition unit 13, a histogram generation unit 14, and a scene boundary determination unit 15.

The control unit 10 controls the video scene division device 1 as a whole. By reading and executing the various programs stored in the storage unit 20 as appropriate, it cooperates with the hardware to realize the functions of the present embodiment. The control unit 10 may be a CPU (Central Processing Unit).

The storage unit 20 stores various programs for causing the hardware to function as the video scene division device 1, programs for causing the control unit 10 to execute the functions of the present embodiment, and various data. The data stored in the storage unit 20 includes the video data to be processed, the scene division data produced by the processing, and the data for the scene division criteria described later.

The shot boundary detection unit 11 detects shot boundaries, which are discontinuous points where continuously recorded video is interrupted, based on differences in image data between the frames constituting the video data.
Specifically, for each frame, the shot boundary detection unit 11 obtains the difference in image data from the immediately preceding frame. When this difference value exceeds a first threshold, the shot boundary detection unit 11 detects a shot boundary between the two frames on either side of the point where the threshold was exceeded. The image data difference is set as appropriate to a value that evaluates the degree of change in the image between frames, such as the sum or average of the changes in the pixel values of the image data, or the amount of change in the luminance histogram.
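
The thresholding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are represented as flat sequences of pixel values, the difference measure is the mean absolute pixel change (one of the options the text lists), and the names `frame_difference`, `detect_shot_boundaries`, and `first_threshold` are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[int]  # assumption: a frame is a flat sequence of pixel values


def frame_difference(prev: Frame, curr: Frame) -> float:
    """Mean absolute change in pixel value between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def detect_shot_boundaries(frames: List[Frame], first_threshold: float) -> List[int]:
    """Return every index i such that a shot boundary lies between frame i-1 and frame i."""
    boundaries = []
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > first_threshold:
            boundaries.append(i)
    return boundaries


# Four tiny made-up frames: the content changes abruptly at frame 2.
frames = [[10, 10, 10, 10], [11, 10, 9, 10], [200, 210, 190, 205], [201, 209, 191, 205]]
print(detect_shot_boundaries(frames, first_threshold=50.0))  # prints [2]
```

A luminance-histogram difference could be substituted for `frame_difference` without changing the surrounding loop.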

The still image extraction unit 12 extracts one or more still images for each shot divided by the shot boundaries.
Specifically, each time the accumulated image data difference exceeds a second threshold, the still image extraction unit 12 extracts the frame at which the threshold was exceeded as a still image.

FIG. 2 is a diagram showing a specific example of shot boundary detection and still image extraction according to the present embodiment.
The shot boundary detection unit 11 compares the temporally consecutive frames constituting the video content in order and calculates the image data differences.

When a calculated difference value exceeds the shot boundary threshold (first threshold) X, the shot boundary detection unit 11 detects a shot boundary between that frame and the immediately preceding frame.

The still image extraction unit 12 extracts the frames immediately after and before the shot boundaries, that is, the first and last frames P1 and P2 of a shot, as still images (thumbnails) representing the shot.

In addition, the still image extraction unit 12 may, for example, accumulate the inter-frame difference values calculated by the shot boundary detection unit 11 for each shot and, each time the accumulated value exceeds a thumbnail output threshold (second threshold) Y1, Y2, Y3, Y4, ..., further extract the frame at that time, P3, P4, P5, P6, ..., as a still image that has changed relatively greatly within the shot.
The still image extraction unit 12 may also extract still images at fixed intervals (for example, every 10 frames or every second).

The still image extraction unit 12 stores each extracted still image in the storage unit 20 together with a shot number, assigned in order from the beginning of the video, and time information identifying the frame.
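
As a rough sketch of the thumbnail selection described above, the following Python function picks the first and last frames of a shot plus every frame at which the accumulated difference crosses the second threshold. The function name, the input layout, and in particular the choice to reset the accumulator after each thumbnail (rather than compare against increasing thresholds Y1, Y2, ...) are assumptions made for illustration.

```python
from typing import List


def extract_still_indices(diffs: List[float], second_threshold: float) -> List[int]:
    """Pick representative frame indices within one shot.

    diffs[i] is the difference between frame i and frame i+1 of the shot,
    so a shot with n frames has n-1 entries.
    """
    n_frames = len(diffs) + 1
    selected = {0, n_frames - 1}       # first and last frames of the shot
    accumulated = 0.0
    for i, d in enumerate(diffs):
        accumulated += d
        if accumulated > second_threshold:
            selected.add(i + 1)        # frame at which the threshold was crossed
            accumulated = 0.0          # assumption: restart accumulation afterwards
    return sorted(selected)


# Made-up difference values for an 8-frame shot.
print(extract_still_indices([1, 1, 4, 1, 1, 5, 1], second_threshold=5))  # prints [0, 3, 6, 7]
```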

The subject recognition unit 13 recognizes the subjects in each extracted still image.
Specifically, the subject recognition unit 13 is trained in advance on a specific set of anticipated subjects and determines whether these subjects are contained in each still image.

The subjects to be learned in advance are selected as appropriate according to the content or field of the target video content. For example, subjects that appear frequently in the video content to be processed, such as the faces of the characters in a drama, are learned in advance.
It is desirable to learn a large number of subjects so that the wide range of subjects appearing in video content can be recognized. When the number of recognizable subjects is limited, however, subjects expected to appear frequently in the program are selected.

The subject recognition unit 13 identifies the persons appearing in the video content (person A, person B, ...) using a technique that enables individuals to be identified from feature quantities in the image data, such as the bag-of-visual-words method.
Alternatively, the subject recognition unit 13 may, by clustering based on predetermined feature quantities contained in the still images, distinguish the subjects by provisional labels (cluster A, cluster B, ...) without identifying who or what each of them is.

For each shot, the histogram generation unit 14 counts the number of appearances of each subject, that is, the number of still images containing the subject, and generates a histogram indicating the appearance frequency of the subjects.
The histogram generation unit 14 expresses the confidence that each subject appeared within the shot period as a distribution of frequencies normalized, for example, as in (1) or (2) below. As a result, the maximum frequency in each shot's histogram is aligned to a constant value.

(1) The histogram generation unit 14 generates, as the histogram, a distribution of frequencies obtained by normalizing the number of still images in which each subject was recognized during the shot period.
(2) The histogram generation unit 14 generates, as the histogram, a distribution of frequencies obtained by normalizing the time within the shot period corresponding to the still images in which each subject was recognized.
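
The two normalizations (1) and (2) might be sketched as follows. The text does not fix the normalization constant, so this sketch assumes division by the number of still images in the shot for (1) and by the total duration for (2); it also assumes each still image carries a duration it "covers" within the shot. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def histogram_by_count(stills: List[Set[str]]) -> Dict[str, float]:
    """(1) Fraction of the shot's still images in which each subject was recognized."""
    counts = Counter(subject for still in stills for subject in still)
    return {s: c / len(stills) for s, c in counts.items()}


def histogram_by_time(stills: List[Set[str]], durations: List[float]) -> Dict[str, float]:
    """(2) Fraction of the shot's duration covered by still images showing each subject."""
    total = sum(durations)
    hist: Dict[str, float] = {}
    for subjects, seconds in zip(stills, durations):
        for s in subjects:
            hist[s] = hist.get(s, 0.0) + seconds / total
    return hist


# Made-up recognition results for the four still images of one shot.
stills = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]
print(histogram_by_count(stills))                        # A: 0.75, B: 0.5, C: 0.25
print(histogram_by_time(stills, [1.0, 1.0, 2.0, 4.0]))   # A: 0.5, B: 0.625, C: 0.25
```

With either variant, a subject present throughout the shot reaches the same maximum value of 1.0, which is what makes histograms of shots with different numbers of still images comparable.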

The histogram generation unit 14 may also generate histograms for groups consisting of multiple subjects, depending on the type of video content. For example, in a drama the characters can be treated as groups such as families or clubs, while in a travel program, where people are not the central subjects, all persons can be combined into a single group so that "person" is treated at the same level as other subjects such as mountains and the sea.

The scene boundary determination unit 15 determines scene boundaries, which are discontinuous points in the series of histograms, based on the similarity of the generated histograms.
Specifically, to judge the continuity of the subjects appearing in a scene, the scene boundary determination unit 15 obtains the similarity between the histograms generated for each shot and determines, for example, that continuity has been broken and the scene has changed when the similarity is at or below a certain value.
A technique such as histogram intersection may be used to determine the similarity between histograms. Alternatively, a simpler method may be used, such as determining a scene division point where a fixed number of the top-ranked subjects change simultaneously, or where at least a certain proportion of them change.
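
For reference, histogram intersection sums the bin-wise minima of two histograms. The sketch below represents each histogram as a dictionary from subject label to frequency and divides by the smaller total so the score lies in [0, 1]; that normalization and the function name are assumptions, not something the text specifies.

```python
from typing import Dict


def histogram_intersection(h1: Dict[str, float], h2: Dict[str, float]) -> float:
    """Sum of bin-wise minima, divided by the smaller total mass (score in [0, 1])."""
    overlap = sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))
    smaller = min(sum(h1.values()), sum(h2.values()))
    return overlap / smaller if smaller > 0 else 0.0


# Shots sharing subjects score high; shots with disjoint subjects score zero.
print(histogram_intersection({"A": 1.0, "B": 1.0}, {"A": 1.0, "B": 0.5}))  # prints 1.0
print(histogram_intersection({"A": 1.0, "B": 0.5}, {"C": 1.0, "D": 0.5}))  # prints 0.0
```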

FIG. 3 is a schematic diagram showing the scene division procedure according to the present embodiment.
After dividing the input video into shots 1 to 4, the video scene division device 1 extracts a plurality of still images (thumbnails) from each shot.

Next, the video scene division device 1 recognizes subjects A to D in each still image and generates normalized histograms based on the number of appearances.
The video scene division device 1 calculates the similarity between chronologically adjacent histograms and determines the boundary between shot 3 and shot 4, where the similarity falls below the threshold, to be a scene boundary.

If continuity is judged on a single subject, the video tends to be over-divided. Moreover, if a subject with a small number of recognitions (appearances) is selected, the result is often affected by false detections in the subject recognition process.
The scene boundary determination unit 15 therefore determines scene boundaries based on a partial histogram consisting only of a predetermined number of subjects with the highest frequencies in the histogram. For example, the similarity is calculated by focusing on a predetermined number (for example, 3) of the top-ranked subjects appearing in the histogram, or on a certain proportion (for example, 50%) of them.
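
Restricting a histogram to its top-ranked subjects could look like the following sketch. The function name, the dictionary representation, and the alphabetical tie-breaking are illustrative assumptions; only the fixed-number variant (not the proportion variant) is shown.

```python
from typing import Dict


def top_n_partial_histogram(hist: Dict[str, float], n: int = 3) -> Dict[str, float]:
    """Keep only the n subjects with the highest frequency."""
    # Sort by descending frequency; ties are broken by label so the result is deterministic.
    ranked = sorted(hist.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


# Subject D, seen only rarely (possibly a misrecognition), is dropped.
print(top_n_partial_histogram({"A": 0.9, "B": 0.6, "C": 0.5, "D": 0.1}, n=3))
```

The partial histograms of adjacent shots can then be compared with the same similarity measure as the full histograms.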

FIG. 4 is a flowchart showing the processing performed by the control unit 10 according to the present embodiment.
In step S1, the shot boundary detection unit 11 detects shot boundaries in the series of frames contained in the video content and divides the video into a plurality of shots.

In step S2, the still image extraction unit 12 extracts a plurality of still images (thumbnails) for each of the shots obtained in step S1.

In step S3, the subject recognition unit 13 recognizes the subjects in each still image extracted in step S2.

In step S4, the histogram generation unit 14 generates histograms representing the appearance frequency of the subjects recognized in step S3.

In step S5, the scene boundary determination unit 15 selects the histograms generated in step S4 one at a time in chronological order.

In step S6, the scene boundary determination unit 15 calculates the similarity between the histogram selected in step S5 and the histogram selected immediately before it.

In step S7, the scene boundary determination unit 15 determines whether the similarity calculated in step S6 is smaller than a predetermined threshold. If the determination is YES, the process proceeds to step S8; if NO, it proceeds to step S9.

In step S8, the scene boundary determination unit 15 determines the shot boundary whose similarity was judged to be small in step S7 to be a scene boundary.

In step S9, the scene boundary determination unit 15 determines whether the histograms have been selected through to the last one, that is, whether the video has ended. If the determination is YES, the process ends; if NO, it returns to step S5.
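
The loop formed by steps S5 to S9 can be sketched as below, with the similarity function passed in as a parameter. The function names, the unnormalized `overlap` measure, and the example histograms (chosen to resemble the four-shot situation of FIG. 3, whose actual values are not given in the text) are all assumptions for illustration.

```python
from typing import Callable, Dict, List

Histogram = Dict[str, float]


def find_scene_boundaries(
    histograms: List[Histogram],
    similarity: Callable[[Histogram, Histogram], float],
    threshold: float,
) -> List[int]:
    """Return every index i such that a scene boundary lies between shot i-1 and shot i."""
    boundaries = []
    for i in range(1, len(histograms)):                       # S5: select in time order
        score = similarity(histograms[i - 1], histograms[i])  # S6: compare with previous
        if score < threshold:                                 # S7: below the threshold?
            boundaries.append(i)                              # S8: mark a scene boundary
    return boundaries                                         # S9: all histograms visited


def overlap(h1: Histogram, h2: Histogram) -> float:
    """Unnormalized histogram intersection, used here only as a stand-in similarity."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))


# Hypothetical histograms for four shots (0-based indices 0 to 3).
shots = [
    {"A": 1.0, "B": 0.5},
    {"A": 1.0, "B": 1.0},
    {"A": 0.5, "B": 1.0},
    {"C": 1.0, "D": 0.5},
]
print(find_scene_boundaries(shots, overlap, threshold=0.5))  # prints [3]
```

Index 3 here corresponds to the boundary between the third and fourth shots.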

<Modification>
The continuity criterion described above, based on the statistical similarity of histograms, may be replaced by, for example, criterion (A) or (B) below, both of which draw on the conventions of video content production and editing. Alternatively, the scene boundaries determined by the method described above may be adjusted according to these criteria.
Although scenes are a structure created by hand, the editing of program video follows certain conventions, and using knowledge about program video reduces scene detection errors.

(A) Continuity criterion that takes program genre into account
The histogram generation unit 14 and the scene boundary determination unit 15 may adjust the weighting of subjects based on the program genre.
For example, the weights given to persons and natural objects are varied, and separate continuity criteria are provided, between cases such as dramas, where persons are the central subjects and individuals are important, and cases such as travel programs, where changes between persons and other subjects, rather than specific individuals, are what divide scenes.

Specifically, in a drama or the like, individuals are recognized and the characters are then handled as sets (group AB, group BCD, and so on), and a histogram giving the distribution over these sets is used. By contrast, in video content such as travel programs where persons are not the central subjects, all persons are combined into one group and more weight is placed on the boundaries with other subjects (mountains, the sea, and so on).

FIG. 5 is a diagram showing an example of weighting subjects based on program genre according to the present embodiment.
In the drama case (a), the boundary between a shot in which persons A and B appear and a shot in which persons C and D appear is determined to be a scene boundary.
When persons C and D belong to the same group, a shot in which persons C and D appear and a shot in which only person D appears are determined to be the same scene, because their subjects belong to the same group.

The scene boundary determination unit 15 may determine shots in which all of the subjects belonging to a group appear to be the same scene, or may determine shots in which at least a certain number of them, or any one of them, appear to be the same scene. These criteria may be set as appropriate according to the program genre, the type of group, and so on.

In the travel program case (b), the transition from a shot whose subject is scenery to a shot whose subject is person E is determined to be a scene boundary. Similarly, the transition from a shot whose subject is person G to a shot whose subject is an animal is determined to be a scene boundary.
The shots in which person E, F, or G appears are judged to have subjects in the same group and are determined to be the same scene.
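
One simple way to realize the grouping illustrated in FIG. 5 is to map each recognized subject to a group label before building the histograms. The mapping tables and function name below are hypothetical, and the sketch assumes each subject belongs to at most one group, which the text does not require.

```python
from typing import Dict, Set

# Hypothetical genre-specific mappings from recognized subject to group label.
DRAMA_GROUPS = {"A": "AB", "B": "AB", "C": "CD", "D": "CD"}
TRAVEL_GROUPS = {"E": "person", "F": "person", "G": "person"}


def group_subjects(subjects: Set[str], groups: Dict[str, str]) -> Set[str]:
    """Replace each subject by its group label; ungrouped subjects keep their own label."""
    return {groups.get(s, s) for s in subjects}


# Drama: a shot with C and D and a shot with only D map to the same group.
print(group_subjects({"C", "D"}, DRAMA_GROUPS) == group_subjects({"D"}, DRAMA_GROUPS))  # prints True
# Travel: persons E and G collapse into "person", while "animal" stays distinct.
print(group_subjects({"E"}, TRAVEL_GROUPS), group_subjects({"animal"}, TRAVEL_GROUPS))
```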

(B) Continuity criterion using video editing knowledge
The scene boundary determination unit 15 may adjust the scene boundary determination result based on appearance patterns, stored in advance, concerning the time axis and position of the subjects in a scene.
For example, in a program such as a drama where people's faces are shown alternately, focusing on individual faces can fragment the scenes. By incorporating knowledge of dialogue scenes, namely the common editing technique called montage in which two people are shown alternately, a run of shots in which the faces alternate as the subject is correctly determined to be a single dialogue scene. This suppresses over-division of scenes.

FIG. 6 is a diagram showing an example of scene division based on video editing knowledge according to the present embodiment.
In case (a), where only the statistical method based on histogram similarity is used, the points between shots in which person A appears and shots in which person B appears are determined to be scene boundaries, and the video is finely divided into scenes 1 to 4.

In case (b), where the alternating appearance of the persons in a dialogue scene is used as a criterion, the shots in which person A and person B appear alternately are determined to be one scene, and a scene boundary is determined between these shots and the shot in which persons A and B both appear.

When the subject recognition unit 13 recognizes a person's face, it preferably also stores the position of the face (for example, its center position) together with the shot number, frame time, person ID, and so on. This allows the scene boundary determination unit 15 to divide scenes accurately by taking into account, for example, where person A and person B are shown in the dialogue scene described above (FIG. 6), such as person A toward the left of the screen and person B toward the right.
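
The dialogue-scene rule can be approximated by suppressing boundaries that fall inside an X, Y, X alternation of single-person shots. The sketch below is one possible reading under stated assumptions: each shot is reduced to the set of persons recognized in it, face positions are ignored, and the function names and shot sequence are hypothetical rather than taken from FIG. 6.

```python
from typing import FrozenSet, List


def naive_boundaries(shots: List[FrozenSet[str]]) -> List[int]:
    """Boundary wherever the set of recognized persons changes between adjacent shots."""
    return [i for i in range(1, len(shots)) if shots[i] != shots[i - 1]]


def dialogue_aware_boundaries(shots: List[FrozenSet[str]]) -> List[int]:
    """Drop boundaries that fall inside an X, Y, X alternation of single-person shots."""
    kept = []
    for i in naive_boundaries(shots):
        prev, curr = shots[i - 1], shots[i]
        single = len(prev) == 1 and len(curr) == 1
        # The pattern alternates if the shot two back equals the current one,
        # or the next shot returns to the previous subject.
        alternates = (i >= 2 and shots[i - 2] == curr) or (
            i + 1 < len(shots) and shots[i + 1] == prev
        )
        if not (single and alternates):
            kept.append(i)
    return kept


# Hypothetical sequence: A, B, A alternate, then a shot showing both A and B.
shots = [frozenset("A"), frozenset("B"), frozenset("A"), frozenset("AB")]
print(naive_boundaries(shots))           # prints [1, 2, 3]
print(dialogue_aware_boundaries(shots))  # prints [3]
```

Stored face positions (for example, A on the left and B on the right) could be added as a further condition on `alternates` to make the rule stricter.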

As described above, according to the present embodiment, the video scene division device 1 judges the continuity of subject appearance between shots based on the similarity of histograms indicating the appearance frequency of the subjects in the still images extracted for each shot, and determines that the scene changes at the discontinuous points.
The video scene division device 1 can therefore use the recognition results for the subjects, which represent what the video shows, to divide the video content appropriately into scenes delimited by the breaks between semantic sections, in a way that is closer to the semantic content.
As a result, when video is searched for or reused, it can be displayed and played back in units of scenes, which are semantic units closer to human perception. In addition, when video search results are presented per content item, organizing them by scene, that is, by semantic section, makes it easier to grasp an overview of the entire content, which facilitates secondary uses such as video search or metadata assignment.

Because the video scene division device 1 generates the histogram of subjects from multiple still images extracted from the shot period, it can reduce noise caused by misrecognition or missed detections in subject recognition, which improves the accuracy of scene division.

The video scene division device 1 detects a shot boundary when the difference between frames exceeds the first threshold, and extracts a still image (thumbnail) when the accumulated inter-frame difference within a shot exceeds the second threshold.
The video scene division device 1 can therefore divide the video into shots efficiently based on simple rules and extract still images that characterize the content of each shot.

The video scene division device 1 can distinguish multiple subjects by clustering based on predetermined feature quantities contained in the still images. This allows it to distinguish unknown subjects from one another without being trained in advance to identify them.

The video scene dividing device 1 generates, as the histogram, either a distribution of frequencies obtained by normalizing the number of still images in which each subject is recognized in the shot period, or a distribution of frequencies obtained by normalizing the time within the shot period that corresponds to the still images in which each subject is recognized.
With these normalization methods, the video scene dividing device 1 expresses the reliability with which each subject appears within the shot period. As a result, the maximum frequency in the histogram of every shot is aligned to a constant value, and histograms can be compared between shots more accurately.
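
The two normalization options can be sketched as follows. This is an illustration under assumptions, not the embodiment's implementation: the count is assumed to be divided by the number of still images in the shot, the time by the total duration covered by the stills, and the function names and input layout are hypothetical. Either way, each frequency lies between 0 and 1.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def count_histogram(stills: List[Set[str]]) -> Dict[str, float]:
    """stills: for each still image of one shot, the set of recognized subjects.

    Frequency = (number of stills containing the subject) / (number of stills).
    """
    if not stills:
        return {}
    counts = Counter(label for labels in stills for label in labels)
    return {label: n / len(stills) for label, n in counts.items()}


def time_histogram(stills: List[Tuple[float, Set[str]]]) -> Dict[str, float]:
    """stills: (duration represented by the still, recognized subjects).

    Frequency = (time covered by stills containing the subject) / (total time).
    """
    total = sum(duration for duration, _ in stills)
    if total <= 0:
        return {}
    covered: Dict[str, float] = {}
    for duration, labels in stills:
        for label in labels:
            covered[label] = covered.get(label, 0.0) + duration
    return {label: t / total for label, t in covered.items()}


by_count = count_histogram(
    [{"person", "desk"}, {"person"}, {"person", "map"}, {"desk"}]
)
by_time = time_histogram(
    [(2.0, {"person"}), (1.0, {"person", "desk"}), (1.0, set())]
)
```

Because both variants are bounded by 1 regardless of shot length, a long shot and a short shot produce histograms on the same scale.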

By generating the histogram for groups of subjects, the video scene dividing device 1 can use a histogram better suited to the program genre and can determine appropriate scene boundaries, which reduces over-division into scenes.
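
One possible reading of grouping is sketched below: subject labels are mapped to group names before counting, and a still image counts once for a group if any member of that group is recognized in it. The mapping, the function name, and this counting rule are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set


def group_histogram(
    stills: List[Set[str]], groups: Dict[str, str]
) -> Dict[str, float]:
    """Histogram over subject groups for one shot.

    groups maps a subject label to a hypothetical group name; labels
    without an entry form their own group.
    """
    if not stills:
        return {}
    counts: Counter = Counter()
    for labels in stills:
        # A set ensures each group is counted at most once per still.
        counts.update({groups.get(label, label) for label in labels})
    return {group: n / len(stills) for group, n in counts.items()}


hist = group_histogram(
    [{"anchor", "reporter"}, {"anchor"}, {"map"}],
    groups={"anchor": "person", "reporter": "person"},
)
```

With this mapping, a cut from the anchor to the reporter no longer changes the histogram, which is one way grouping can suppress over-division.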

The video scene dividing device 1 determines scene boundaries by comparing partial histograms that consist only of a predetermined number of subjects with the highest frequencies in the histogram. This reduces noise caused by subjects with a low appearance frequency or by misrecognition, so that scenes can be divided accurately.
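
The comparison of partial histograms can be sketched as follows. The similarity measure (histogram intersection), the comparison of adjacent shots only, and the fixed similarity threshold are assumptions; the text specifies only that the top-ranked subjects are compared.

```python
from typing import Dict, List


def top_n(hist: Dict[str, float], n: int) -> Dict[str, float]:
    """Keep only the n subjects with the highest frequencies."""
    ranked = sorted(hist.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def intersection_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Histogram intersection: sum of per-subject minimum frequencies."""
    return sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b))


def scene_boundaries(
    shot_hists: List[Dict[str, float]], n: int, threshold: float
) -> List[int]:
    """Return indices i such that a scene boundary lies before shot i."""
    boundaries: List[int] = []
    for i in range(1, len(shot_hists)):
        sim = intersection_similarity(
            top_n(shot_hists[i - 1], n), top_n(shot_hists[i], n)
        )
        if sim < threshold:
            boundaries.append(i)
    return boundaries


shots = [
    {"person": 1.0, "desk": 0.5, "cup": 0.1},
    {"person": 0.8, "desk": 0.6},
    {"car": 1.0, "road": 0.7, "person": 0.1},
]
result = scene_boundaries(shots, n=2, threshold=0.5)
```

In this example the low-frequency entries ("cup" in the first shot, "person" in the third) are dropped before comparison, and a boundary is reported only before the third shot.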

Based on appearance patterns stored in advance regarding the time axis and position of subjects in a scene, the video scene dividing device 1 can determine scene boundaries using criteria that draw on knowledge of how program video is edited. This reduces the influence of subject recognition errors and the over-division into scenes.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. Further, the effects described in the present embodiment merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the present embodiment.

In the present embodiment, the configuration and operation of the video scene dividing device have been described. However, the present invention is not limited to this, and may be configured as a method or a program that includes each of the components and divides a video into scenes.

Further, the present invention may be realized by recording a program for implementing the functions of the video scene dividing device on a computer-readable recording medium, and causing a computer system to read and execute the program recorded on the recording medium.

The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into the computer system.

Furthermore, the "computer-readable recording medium" may also include a medium that holds a program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or via a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as a volatile memory inside a computer system serving as the server or client in that case. Further, the program may implement only some of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Video scene dividing device
10 Control unit
11 Shot boundary detection unit
12 Still image extraction unit
13 Subject recognition unit
14 Histogram generation unit
15 Scene boundary determination unit
20 Storage unit

Claims

A video scene dividing device comprising:
a shot boundary detection unit that detects a shot boundary, which is a discontinuous point of a video, based on a difference in image data between frames of the video;
a still image extraction unit that extracts a plurality of still images for each shot divided by the shot boundary;
a subject recognition unit that recognizes a subject in each of the still images;
a histogram generation unit that generates, for each of the shots, a histogram indicating the appearance frequency of the subject; and
a scene boundary determination unit that determines a scene boundary, which is a discontinuous point of the histogram, based on the similarity of the histograms.

The video scene dividing device according to claim 1, wherein the subject recognition unit identifies a plurality of the subjects by clustering based on a predetermined feature amount included in the still images.

The video scene dividing device according to claim 1, wherein the histogram generation unit generates, as the histogram, a distribution of frequencies obtained by normalizing the number of the still images in which the subject is recognized in the shot period, or a distribution of frequencies obtained by normalizing the time within the shot period corresponding to the still images in which the subject is recognized.

The video scene dividing device according to any one of claims 1 to 3, wherein the histogram generation unit generates the histogram for a group of the subjects.

The video scene dividing device according to any one of claims 1 to 4, wherein the scene boundary determination unit determines the scene boundary based on a partial histogram consisting only of a predetermined number of the subjects having the highest frequencies in the histogram.

The video scene dividing device according to claim 1, wherein the scene boundary determination unit adjusts the determination result of the scene boundary based on an appearance pattern stored in advance with respect to a time axis and a position of the subject in the scene.

A video scene division program for causing a computer to function as the video scene dividing device according to any one of claims 1 to 6.