JP2012010265A

JP2012010265A - Summary video generating apparatus and summary video generating program

Info

Publication number: JP2012010265A
Application number: JP2010146443A
Authority: JP
Inventors: Yoshihiko Kawai; 吉彦河合; Masato Fujii; 真人藤井
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2010-06-28
Filing date: 2010-06-28
Publication date: 2012-01-12
Anticipated expiration: 2030-06-28
Also published as: JP5537285B2

Abstract

PROBLEM TO BE SOLVED: To provide a summary video generating apparatus and a summary video generating program capable of generating summary videos for a wide range of content, without special conditions or restrictions on program genre. SOLUTION: A summary video generating apparatus 1 for generating, from an original video, a summary video whose time length is shorter than that of the original video comprises: video division means 4 for generating, from the original video, divided videos that serve as the basic video unit; video analysis means 5 for classifying block areas in a key frame of each divided video into types according to their characteristic amounts; video evaluation means 6 for calculating, for each divided video, a score used to identify whether the characteristic amounts of its block areas are characteristic in the original video; and video section selection means 8 for selecting divided videos from the original video in descending order of score.

Description

The present invention relates to techniques for producing summary video in television, and in particular to a summary video generation apparatus and a summary video generation program that automatically generate a summary video from video alone, or from video together with its accompanying audio.

Video summarization techniques that generate, from a program video, a summary video condensing the program's content have been proposed. Here, video summarization refers to the operation of generating a video of shorter duration from the original program video while retaining its semantic content. Examples of summary video in television broadcasting include program introduction videos of several tens of seconds that are broadcast before the program itself, and digest videos of professional baseball or soccer shown in news programs.

Most conventional techniques for generating summary video are specialized to a very narrow range of video genres and types. Because their extraction of important scenes depends strongly on domain knowledge about the genre (knowledge specific to that genre), they often cannot be applied as-is to video from other fields.

For example, some techniques limited to the baseball genre use knowledge such as that an important play, for example a scoring scene, is likely to precede a slow-motion replay, or that superimposed text appearing at a predetermined position on the screen likely indicates that a run has been scored. Others use audio information, such as the observation that in sports a video section with loud cheering is an important scene. Techniques of this kind, designed for sports such as baseball, cannot be applied to dramas or documentary programs, which lack patterns peculiar to scoring scenes.
As a more general-purpose technique, there is one that uses the introductory text in an electronic program guide (see Patent Document 1).
In addition, a technique has been proposed that selects scenes based on keywords matching the user's preferences (see Patent Document 2).

Patent Document 1: Japanese Patent No. 4456573
Patent Document 2: JP 2004-289513 A

However, the conventional apparatuses for producing summary video described above have the following problems.
The apparatus described in Patent Document 1 cannot use an electronic program guide at all when that guide, which is external information, is unavailable (for example, for past programs) or when no guide information was ever produced for the program, and in those cases it cannot generate a summary video.
The apparatus described in Patent Document 2 likewise cannot cope with programs of entirely different genres.

The present invention was devised in view of the problems described above, and its object is to provide a summary video generation apparatus and a summary video generation program capable of generating summary videos for a wide range of content without special conditions or restrictions on program genre.

To solve the above problem, the summary video generation apparatus according to claim 1 of the present invention generates, from an original video, a summary video whose time length is shorter than that of the original video, and comprises video dividing means, video analysis means, video evaluation means, and video section selection means.

With this configuration, the summary video generation apparatus uses the video dividing means to receive an original video, such as a program video or a DVD video, detect the shot positions that mark scene breaks in the input video, and generate divided videos with the shot as the video unit. The apparatus then uses the video analysis means to extract key frames from each divided video under preset conditions and to detect block regions that represent features of each extracted key frame. Each detected block region is regarded here as a visual word.

Further, the summary video generation apparatus uses the video analysis means to analyze the types of feature quantities by classifying the detected block regions (local regions) according to their feature quantities. For example, a gradient histogram (such as a luminance gradient histogram) is analyzed as the feature quantity that determines the type of a visual word (block region). The video analysis means can also use texture, color, and the like in small regions of the key frame as feature quantities for classifying block regions into types.
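As an illustration of the luminance gradient histogram mentioned above, the following is a minimal sketch only. The function name `gradient_histogram`, the choice of 8 orientation bins, the central-difference gradient, and the magnitude weighting are all assumptions made for this example; the text does not specify how the histogram is computed.

```python
import math

def gradient_histogram(block, bins=8):
    """Orientation histogram of luminance gradients for one block region.

    block: 2-D list of luminance values (rows of equal length).
    bins:  number of orientation bins (assumed value, not from the text).
    Returns a list of `bins` floats that sums to 1.0, or all zeros
    when the block has no gradient at all.
    """
    hist = [0.0] * bins
    height, width = len(block), len(block[0])
    # Central differences on interior pixels only.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = block[y][x + 1] - block[y][x - 1]
            gy = block[y + 1][x] - block[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = min(int(angle / (2 * math.pi) * bins), bins - 1)
            hist[index] += magnitude  # weight each vote by gradient strength
    total = sum(hist)
    return [v / total for v in hist] if total else hist

# A block whose luminance rises from left to right votes only for bin 0.
ramp = [[x for x in range(4)] for _ in range(4)]
print(gradient_histogram(ramp))
```

Normalizing the histogram makes blocks of different contrast comparable, which is one plausible way to let similar-looking regions fall into the same visual word type.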

The summary video generation apparatus then uses the video evaluation means to compute, for each type of visual word (feature quantity of a block region), a score that serves as the evaluation criterion, using a predetermined index that identifies whether the feature quantity is characteristic in the original video, for example a method such as TF-IDF or TF-S. That is, the video evaluation means calculates scores over all key frames extracted from a divided video, which gives a criterion for evaluating, per divided video, whether that divided video is characteristic in the original video. Further, the apparatus uses the video section selection means to generate the summary video by selecting video sections of divided videos from the original video, based on the evaluated scores, in descending order of score and arranged from the earliest video time, until the time length of the summary video is reached.
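The scoring step can be sketched as follows. This is a hedged illustration, not the formula of the invention: it treats each divided video (shot) as a "document" and each visual word type as a "term", and applies the textbook TF-IDF definition. The function name `shot_scores` and the representation of a shot as a list of integer visual word IDs are assumptions for this example.

```python
import math
from collections import Counter

def shot_scores(shots):
    """Score each shot by the summed TF-IDF of its visual words.

    shots: list of lists; each inner list holds the visual word IDs
           found in the key frames of one shot (assumed representation).
    Returns one float per shot; higher means more characteristic.
    """
    n_shots = len(shots)
    # Document frequency: in how many shots does each word appear?
    df = Counter()
    for words in shots:
        df.update(set(words))
    scores = []
    for words in shots:
        tf = Counter(words)
        total = len(words)
        score = 0.0
        for word, count in tf.items():
            # Words that occur in every shot get log(1) = 0 and add nothing.
            score += (count / total) * math.log(n_shots / df[word])
        scores.append(score)
    return scores

# Word 1 appears everywhere, so only the rarer words 2 and 3 contribute.
print(shot_scores([[1, 1, 2], [1, 3], [1, 1]]))
```

Under this weighting, a shot made up only of visual words that occur throughout the program scores zero, while a shot containing rare visual words scores higher.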

The summary video generation apparatus according to claim 2 of the present invention generates, from an original video, a summary video whose time length is shorter than that of the original video, and comprises stream separation means, video dividing means, video analysis means, video evaluation means, character information extraction means, morphological analysis means, audio evaluation means, and video section selection means.

With this configuration, the summary video generation apparatus uses the stream separation means to separate the original video into a video stream and an audio stream. The video dividing means then generates, from the video stream, divided videos split by shot as the video unit. The video analysis means extracts key frames from each divided video under preset conditions, detects block regions that represent features of each extracted key frame, and regards the detected block regions as visual words. The video evaluation means then classifies and analyzes the types of feature quantities of the block regions (visual words) and evaluates each visual word with an evaluation score for each divided video.

Further, the summary video generation apparatus uses the character information extraction means to recognize speech in the audio stream and extract character data, and the morphological analysis means performs morphological analysis on the extracted character data. The audio evaluation means then computes, for the morphologically analyzed character data, a score that serves as the evaluation criterion, using a predetermined index that identifies whether the character data is characteristic in the original video, for example a method such as TF-IDF or TF-S. Finally, the video section selection means adds the score obtained by the video evaluation means to the score obtained by the audio evaluation means and, based on the summed score, generates the summary video by selecting video sections of divided videos from the original video in descending order of score and arranged from the earliest video time, until the time length of the summary video is reached.
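The summation of the two scores can be sketched as below. The text only states that the scores are added; the optional weights `w_video` and `w_audio` are an assumption introduced here for illustration, and with their default value of 1.0 the function reduces to the plain sum described above.

```python
def integrate_scores(video_scores, audio_scores, w_video=1.0, w_audio=1.0):
    """Combine per-shot video and audio scores into one score per shot.

    The weights are hypothetical tuning knobs; the defaults give a plain sum.
    """
    if len(video_scores) != len(audio_scores):
        raise ValueError("both score lists must cover the same shots")
    return [w_video * v + w_audio * a
            for v, a in zip(video_scores, audio_scores)]

# Shot 0 is visually distinctive, shot 1 is distinctive in its speech.
print(integrate_scores([0.5, 0.25], [0.25, 0.5]))
```

In practice the two scores may lie on different scales, so some normalization before summing may be needed; the text does not say whether any is applied.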

The summary video generation apparatus according to claim 3 of the present invention is the apparatus according to claim 1 or claim 2, wherein the video analysis means further comprises clustering means for clustering the block regions by type according to their feature quantities, and the video evaluation means calculates the score for the feature quantity types clustered by the clustering means.

With this configuration, when the video analysis means analyzes the video, each distinct feature quantity of a block region (visual word) is treated as a type, and the clustering means clusters these feature quantities so that the number of feature quantity types to be evaluated by the video evaluation means is reduced before the score is calculated.
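One way to reduce the number of visual word types is sketched below using k-means. The text does not name a clustering algorithm, so k-means, the function name `kmeans`, the fixed iteration count, and the deterministic initialization from the first `k` points are all assumptions for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Cluster feature vectors into k visual word types.

    points: list of equal-length numeric sequences (block feature vectors).
    Returns (labels, centroids); labels[i] is the visual word ID of points[i].
    Initial centroids are simply the first k points (an assumed, simplistic
    choice that keeps the sketch deterministic).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

labels, centroids = kmeans([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
print(labels, centroids)
```

The resulting labels can serve as the visual word IDs passed to the scoring step, so that near-identical block features count as the same word rather than each forming its own rarely seen type.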

The summary video generation apparatus according to claim 4 of the present invention is the apparatus according to any one of claims 1 to 3, wherein the video analysis means calculates a motion vector amount for each frame image of the divided video and sequentially extracts, as key frames, the frame images at which the cumulative motion vector amount reaches each of the cumulative values obtained by dividing the total motion vector amount of the divided video equally by a predetermined number.

With this configuration, the video analysis means selects key frames so that they divide the cumulative motion vector amount of the divided video into equal parts. As a result, more key frames are extracted from portions where the motion vector amount changes greatly than from portions where it changes little.
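The key frame selection by cumulative motion can be sketched as follows. This is an illustration under stated assumptions: the function name `extract_key_frames` is hypothetical, the thresholds are placed at i/n of the total for i = 1..n, and a shot with no motion at all falls back to its middle frame. None of these details is fixed by the text above.

```python
def extract_key_frames(motion, n_key):
    """Pick key frame indices at equal steps of cumulative motion.

    motion: per-frame motion vector amounts for one shot (non-negative).
    n_key:  the predetermined number of equal divisions.
    Returns a list of frame indices in increasing order.
    """
    if not motion or n_key <= 0:
        return []
    total = 0.0
    for m in motion:
        total += m
    if total == 0:
        # Assumed fallback for a completely static shot.
        return [len(motion) // 2]
    keys = []
    cumulative = 0.0
    next_step = 1
    for index, m in enumerate(motion):
        cumulative += m
        # Cross-multiplied form of: cumulative >= next_step * total / n_key
        while next_step <= n_key and cumulative * n_key >= next_step * total:
            if not keys or keys[-1] != index:
                keys.append(index)  # one frame is added at most once
            next_step += 1
    return keys

# All motion sits in frames 4 to 7, so all four key frames come from there.
print(extract_key_frames([0, 0, 0, 0, 10, 10, 10, 10, 0, 0], 4))
```

With uniform motion the key frames are evenly spaced in time; when motion is concentrated in part of the shot, the key frames cluster there, which matches the effect described above.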

The summary video generation apparatus according to claim 5 of the present invention is the apparatus according to claim 1, wherein the video section selection means comprises divided video selection means, adjustment means, and divided video connection means.

With this configuration, the divided video selection means selects video sections of divided videos from the original video, based on the calculated scores, until the time length of the summary video is reached. The adjustment means lowers the scores of the divided videos that lie within a preset range of divided videos temporally before and after each selected divided video. The divided video connection means then generates the summary video by connecting the divided videos selected on the basis of the adjusted scores, starting from the earliest video time.
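The interplay of selection, score adjustment, and connection can be sketched as a greedy loop. The function name `select_shots`, the multiplicative `penalty`, the default `neighbor_range` of one shot on each side, and the rule of skipping a shot that would exceed the target length are assumptions for this example; the text only says that neighboring scores are lowered.

```python
def select_shots(durations, scores, target_length,
                 neighbor_range=1, penalty=0.5):
    """Greedily pick shots for the summary, penalizing neighbors of picks.

    durations:      length of each shot in seconds.
    scores:         score of each shot (assumed non-negative).
    target_length:  desired summary length in seconds.
    neighbor_range: how many shots before and after a pick are penalized.
    penalty:        factor applied to neighbor scores (hypothetical value).
    Returns the chosen shot indices in chronological order.
    """
    scores = list(scores)  # work on a copy; the adjustment mutates it
    remaining = set(range(len(scores)))
    selected = []
    used = 0.0
    while remaining:
        # Highest adjusted score first; ties go to the earlier shot.
        best = max(remaining, key=lambda i: (scores[i], -i))
        remaining.discard(best)
        if used + durations[best] > target_length:
            continue  # assumed policy: skip shots that do not fit
        selected.append(best)
        used += durations[best]
        for j in range(best - neighbor_range, best + neighbor_range + 1):
            if j in remaining:
                scores[j] *= penalty
    return sorted(selected)  # connect from the earliest video time

durations = [5, 5, 5, 5, 5]
scores = [0.9, 0.8, 0.1, 0.5, 0.45]
print(select_shots(durations, scores, target_length=10))
print(select_shots(durations, scores, target_length=10, penalty=1.0))
```

Without the adjustment (penalty of 1.0) the two adjacent top-scoring shots are chosen; with it, the second pick moves to a later part of the program, which is the balancing effect the adjustment means is intended to provide.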

The summary video generation apparatus according to claim 6 of the present invention is the apparatus according to claim 2, wherein the video section selection means comprises score integration means, divided video selection means, adjustment means, and divided video connection means.

With this configuration, the score integration means adds the score calculated by the video evaluation means and the score calculated by the audio evaluation means. The divided video selection means then selects video sections of divided videos from the original video, based on the summed scores, until the time length of the summary video is reached. The adjustment means lowers the scores of the divided videos that lie within a preset range of divided videos temporally before and after each selected divided video. The divided video connection means then generates the summary video by connecting the divided videos selected on the basis of the adjusted scores, starting from the earliest video time.

The summary video generation apparatus according to claim 7 of the present invention is the apparatus according to claim 2, wherein the character information extraction means further comprises character data detection means.

With this configuration, when character data such as closed captions exists in the data stream, the character data detection means detects the closed captions attached to the video stream as character data and outputs them to the morphological analysis means. When no character data exists in the data stream, the speech recognition means performs speech recognition on the audio stream, generates character data, and outputs it to the morphological analysis means.

The summary video generation program according to claim 8 of the present invention causes a computer to function as video dividing means, video analysis means, video evaluation means, and video section selection means in order to generate, from an original video, a summary video whose time length is shorter than that of the original video.

With this configuration, the summary video generation program uses the video dividing means to receive the original video and generate divided videos with the shot as the video unit. The video analysis means extracts key frames from each divided video, detects block regions that represent features of each extracted key frame, and analyzes the detected block regions (visual words) by classifying their types according to feature quantity, for example using a preset gradient histogram. The video evaluation means then evaluates, per divided video, each feature quantity type of the block regions (visual words), using as the score a predetermined index that identifies whether the feature quantity is characteristic in the original video. Finally, the video section selection means generates the summary video by selecting video sections of divided videos from the original video, based on the evaluated scores, in descending order of score and arranged from the earliest video time, until the time length of the summary video is reached.

The summary video generation program according to claim 9 of the present invention causes a computer to function as stream separation means, video dividing means, video analysis means, video evaluation means, character information extraction means, morphological analysis means, audio evaluation means, and video section selection means in order to generate, from an original video, a summary video whose time length is shorter than that of the original video.

With this configuration, the summary video generation program uses the stream separation means to separate the original video into a video stream and an audio stream. The video dividing means generates, from the video stream, divided videos in units of shots. The video analysis means extracts key frames from the generated divided videos, detects the block regions (visual words) that characterize them, and analyzes those block regions by classifying them according to feature quantity so that their types are represented by, for example, a preset gradient histogram. Further, the video evaluation means calculates, for each feature quantity of the block regions (visual words), a score based on a predetermined index that identifies whether the block region is characteristic in the original video. The program calculates this score using, for example, a method such as TF-IDF or TF-S.

The summary video generation program then uses the character information extraction means to perform speech recognition on the audio stream separated by the stream separation means and extract character data, and the morphological analysis means performs morphological analysis on the extracted character data. Further, the audio evaluation means calculates, for each video unit, a score based on a predetermined index that identifies whether the words characterizing that unit are characteristic in the original video. A method such as TF-IDF or TF-S can be used to calculate the score. Finally, the video section selection means generates the summary video by selecting video sections of divided videos from the original video, based on the sum of the score obtained by the video evaluation means and the score obtained by the audio evaluation means, in descending order of score and arranged from the earliest video time, until the time length of the summary video is reached.

The present invention provides the following advantageous effects.
According to the inventions of claims 1 and 8, block regions that represent features of the video are handled as if they were types of words (visual words), so that a recording device such as a hard disk recorder can automatically generate a summary video from the original video, and the user can judge from the summary video and quickly select the desired video.
In addition, video producers can automatically generate sample videos for video sales, which reduces their workload. A large benefit can be expected in particular when processing large volumes of video.

According to the inventions of claims 2 and 9, a summary video can be generated from both the video and the audio information of the divided videos, so that more appropriate divided videos can be selected for the summary.
According to the invention of claim 3, the clustering means appropriately limits the number of types into which the feature quantities of block regions (visual words) are classified, which keeps the distribution of visual words from becoming sparse and allows suitably characteristic divided videos to be selected.

According to the invention of claim 4, frame images with large motion in a divided video can be extracted as key frames, so that each divided video can be evaluated more appropriately.

According to the inventions of claims 5 and 6, the adjustment means in the video section selection means allows the selected divided videos to be drawn from the whole of the original video in a balanced manner.
According to the invention of claim 7, when character data such as closed captions exists in the data stream, using that character data reduces the load of performing speech recognition.

FIG. 1 is a block diagram showing the configuration of a summary video generation apparatus according to the present invention.
FIG. 2 is a block diagram showing the configuration of the summary video generation apparatus according to the present invention in detail.
FIG. 3 (a) and (b) are explanatory diagrams showing the extraction method used by the key frame extraction means in the summary video generation apparatus according to the present invention.
FIG. 4 is a flowchart showing the overall operation of the summary video generation apparatus according to the present invention.
FIG. 5 is a flowchart showing the operation of generating divided videos, which serve as the basic video unit, in the summary video generation apparatus according to the present invention.
FIG. 6 is a flowchart showing the operation of analyzing divided videos in the summary video generation apparatus according to the present invention.
FIG. 7 is a flowchart showing the operation of selecting divided videos and connecting them while adjusting scores in the summary video generation apparatus according to the present invention.
FIG. 8 is a block diagram showing the configuration of another summary video generation apparatus according to the present invention.
FIG. 9 is a block diagram showing the configuration of the other summary video generation apparatus according to the present invention in detail.
FIG. 10 is a flowchart showing the overall operation of the other summary video generation apparatus according to the present invention.
FIG. 11 is a flowchart showing the operation of the character information extraction means of the other summary video generation apparatus according to the present invention.
FIG. 12 is a flowchart showing the operation of selecting divided videos in the other summary video generation apparatus according to the present invention.

The summary video generation apparatus according to the present invention is described below with reference to the drawings. The configuration and operation for generating a summary video from video data alone are described first with reference to FIGS. 1 to 7. The configuration and operation for generating a summary video from a video comprising video data (a video stream), an audio stream, and a data stream are then described with reference to FIGS. 8 to 12.

[Configuration of the summary video generation apparatus]
First, the configuration of the summary video generation apparatus 1 according to an embodiment of the present invention is described with reference to FIG. 1 (and FIG. 2). The summary video generation apparatus 1 receives video of a digital broadcast program, such as an MPEG-2 stream, and generates and outputs a summary video whose time length is shorter than that of the input video. The apparatus comprises input means 2, video dividing means 4, video analysis means 5, video evaluation means 6, video section selection means 8, and output means 9, and further comprises storage means 7 for storing the divided videos and target length setting means 11 for setting the time length of the summary video via external input means 10.

The input means 2 receives, from outside the apparatus, the video data (video stream) that serves as the original video. Here, the input means 2 is assumed to receive broadcast program video from outside via broadcast waves or communication. The input video may be, for example, video received from a network such as the Internet, video received via broadcast waves, or video from a DVD, CD, or the like. The video received by the input means 2 is output to the video dividing means 4.

The video division means 4 divides the input video data into shots (cuts). Here, a shot is a sequence of frames (a video section) captured continuously by a single camera. Because the picture changes sharply at the boundary between such sections, the video division means 4 can, for example, take the color difference between consecutive frame images of the video extracted by a section video extraction unit (not shown) and split the video where the difference is large, thereby dividing the extracted video into shots. Each divided video produced by the video division means 4 serves as the basic unit of video (video basic unit) from which the summary video is generated.
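
The color-difference division described above can be sketched as follows. This is only an illustrative sketch, not the apparatus's actual implementation: the frame representation (a list of (r, g, b) pixels), the coarse histogram, the L1 distance, and the names `color_histogram`, `split_into_shots`, and `threshold` are all assumptions introduced for the example.

```python
def color_histogram(frame, bins=4):
    """Coarse RGB histogram of a frame given as a list of (r, g, b) pixels in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in frame:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]


def split_into_shots(frames, threshold=0.5):
    """Return (start, end) frame-index pairs, end inclusive, one pair per detected shot."""
    if not frames:
        return []
    shots = []
    start = 0
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        # L1 distance between normalized histograms lies in [0, 2].
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            # Large color change: close the current shot and start a new one.
            shots.append((start, i - 1))
            start = i
        prev = cur
    shots.append((start, len(frames) - 1))
    return shots
```

The returned index pairs play the role of the start and end time information attached to each divided video; the threshold value would have to be tuned for real material.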

Alternatively, the video division means 4 may take the difference in frequency features between frame images and split the video where the difference is large. It may also divide each frame image into a plurality of small regions, perform block matching to find where each small region has moved in the next frame image, and split the video when the number of small regions whose destination could not be identified exceeds a predetermined value. Each divided video, whose unit is a shot, is tagged with time information giving the start and end of the shot.

Where necessary, each divided video may also be tagged with its time length and with information giving its position in the original video in order of start time. The divided videos produced by the video division means 4 are output to the video analysis means 5 and the storage means 7.

The video analysis means 5 extracts key frames from each divided video produced by the video division means 4 and analyzes them by regarding the characteristic block regions of each key frame image as visual words. In other words, by treating the characteristic block regions of a key frame like words, the video analysis means 5, together with the video evaluation means 6, can apply techniques developed for words to calculate the importance of each visual word and thereby evaluate the video.

Here, a key frame is a frame image extracted from part of a divided video in order to analyze that divided video.
The video analysis means 5 extracts key frames from each divided video produced by the video division means 4 under preset conditions. It then regards the characteristic block regions in each extracted key frame image as visual words, analyzes the video using preset gradient histograms so that the types of visual words (block regions) are classified by feature amount, and outputs to the video evaluation means 6 a score evaluating the key frames of each analyzed divided video. As shown in FIG. 2, the video analysis means 5 comprises key frame extraction means 5a, feature block region extraction means 5b, and clustering means 5c.

The key frame extraction means 5a extracts key frames from the input divided video under preset conditions. For example, it extracts key frames on the basis of motion vectors so that more key frames are taken from the parts of the divided video where motion is intense. The key frames extracted by the key frame extraction means 5a are output to the feature block region extraction means 5b.

Here, the method by which the key frame extraction means 5a extracts key frames on the basis of motion vectors will be described with reference to FIG. 3.
In FIG. 3(a), the horizontal axis shows time (in frame images) within a divided video (video basic unit), and the vertical axis shows the motion vector amount of each frame image. In FIG. 3(b), the horizontal axis again shows time (in frame images) within the divided video, and the vertical axis shows the cumulative motion vector amount. FIG. 3(b) illustrates an example in which the cumulative motion vector amount of the video basic unit is divided equally by a predetermined number of key frames and the frame images corresponding to the equal divisions are selected as key frames.

As shown in FIG. 3(a), the key frame extraction means 5a calculates a motion vector amount for each frame image of the divided video. Specifically, for each predetermined block (for example, a macroblock) of a frame image, it takes the motion difference (motion vector) from the previous frame image, sums these over the frame image, and uses the sum as the motion vector amount of that frame image.

Then, as shown in FIG. 3(b), the key frame extraction means 5a accumulates the motion vector amounts and divides the total motion vector amount of the divided video (video basic unit) equally by the predetermined number of key frames. It then extracts, one after another, the frame image at which the cumulative motion vector amount reaches each equal division as a key frame.
As a result, as shown in FIG. 3(b), more key frames are extracted from time sections in which the motion vector amount changes greatly (motion is intense), so that more of the characteristic frame images within the video basic unit are selected.
The number of key frames to extract can be set freely. For example, the number of key frames may be determined in advance according to the time length of the divided video (video basic unit). A larger number of key frames allows a finer evaluation but slows processing, so it is desirable to determine statistically in advance a number of frames that balances processing speed against evaluation accuracy.
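
The equal-division procedure of FIG. 3(b) can be illustrated with a short sketch. It assumes the per-frame motion vector amounts have already been computed; the function name `select_key_frames`, the rule that a frame crossing several levels is recorded only once, and the evenly spaced fallback for a shot with no motion are assumptions made for this example and are not stated in the description.

```python
def select_key_frames(motion_amounts, num_key_frames):
    """Pick frame indices where cumulative motion crosses equal shares of the total."""
    if num_key_frames <= 0 or not motion_amounts:
        return []
    total = sum(motion_amounts)
    if total <= 0:
        # No motion at all: fall back to evenly spaced frames (an assumption).
        n = len(motion_amounts)
        return sorted({(k * n) // num_key_frames for k in range(num_key_frames)})
    step = total / num_key_frames  # size of one equal share of the total motion
    key_frames = []
    cumulative = 0.0
    k = 1  # index of the next equal-share level to be reached
    for index, amount in enumerate(motion_amounts):
        cumulative += amount
        crossed = False
        # Small tolerance guards against floating-point round-off at the last level.
        while k <= num_key_frames and cumulative >= k * step - 1e-9:
            crossed = True
            k += 1
        if crossed:
            # A frame that crosses several levels is still recorded only once.
            key_frames.append(index)
    return key_frames
```

For example, with motion amounts `[0, 0, 1, 5, 5, 1, 0, 0]` and three key frames, the selected frames cluster in the middle of the shot, where the motion is concentrated.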

Returning to FIG. 1 (and FIG. 2), the description of the configuration of the summary video generation apparatus 1 continues.
The feature block region extraction means 5b detects block regions that characterize the frame image of each extracted key frame, regards each detected block region as a visual word, classifies the visual words by type, and analyzes how many visual words of each type each divided video contains. The visual words in the key frames of each analyzed divided video are output to the video evaluation means 6. Here, the feature block region extraction means 5b comprises key frame image feature detection means 5b_1 and gradient histogram generation means 5b_2.

The key frame image feature detection means 5b_1 detects features in the frame image of each input key frame and regards (treats) each detected feature portion (block region) as a visual word. For example, it detects local features of the frame image (such as corners) and identifies the local region around each feature point as a block region (visual word). It outputs information identifying each block region (for example, image coordinates) to the gradient histogram generation means 5b_2.

The gradient histogram generation means 5b_2 generates the type of each visual word as a gradient histogram over a preset feature such as luminance or color. Here, for example, it generates a gradient histogram of the luminance in each block region (visual word) detected by the key frame image feature detection means 5b_1 and uses it as the type of that visual word.
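
As a rough illustration of a luminance gradient histogram for one block region, the sketch below bins gradient orientations weighted by gradient magnitude. It is a deliberately simplified stand-in: the single 8-bin histogram, the central-difference gradients, and the name `gradient_histogram` are assumptions for this example and do not reproduce the descriptor layout of any particular published method.

```python
import math


def gradient_histogram(block, bins=8):
    """Orientation histogram of luminance gradients, weighted by gradient magnitude.

    block is a 2-D list of luminance values (rows of equal length).
    """
    height = len(block)
    width = len(block[0]) if height else 0
    hist = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the luminance gradient.
            dx = block[y][x + 1] - block[y][x - 1]
            dy = block[y + 1][x] - block[y - 1][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            index = int(angle / (2.0 * math.pi) * bins) % bins
            hist[index] += magnitude
    total = sum(hist)
    # Normalize so that blocks of different contrast are comparable.
    return [h / total for h in hist] if total > 0 else hist
```

A block whose luminance increases only from left to right places all of its weight in the first bin, while a uniform block yields an all-zero histogram.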

Feature points can be detected and gradient histograms calculated using existing techniques such as SIFT (David G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999) or SURF (Herbert Bay, Tinne Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," in Proc. European Conference on Computer Vision, vol. 3951, pp. 404-417, 2006).

By regarding each detected feature portion (block region, that is, local region) as a visual word and classifying it by feature amount into a visual word type, the feature block region extraction means 5b, which comprises the key frame image feature detection means 5b_1 and the gradient histogram generation means 5b_2, makes it possible to apply techniques that operate on words.

When gradient histograms are used as image features, the dimensionality is high and the distribution of block regions (visual words) may become sparse. For this reason, the features of the block regions (visual words) extracted by the feature block region extraction means 5b are here always clustered by the clustering means 5c described below.

The clustering means 5c classifies (clusters) the feature types of the block regions of the frame images, which are regarded as visual words, under preset clustering conditions. This prevents the distribution of block regions (visual words) from becoming sparse. The clustering means 5c clusters the block region features into a preset number of clusters and outputs the clustered visual words of each divided video to the video evaluation means 6. A known method such as k-means can be used for the clustering.
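
A minimal sketch of this clustering step, assuming plain k-means over descriptor vectors, is given below. Seeding the centroids with the first k points, the fixed iteration count, and the names `k_means`, `nearest`, and `visual_word_counts` are simplifications chosen for the example; a practical system would use a more careful initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to point (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i], point))


def k_means(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid when a cluster is empty
                centroids[i] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids


def visual_word_counts(descriptors, centroids):
    """Count how many descriptors of one divided video fall into each cluster."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest(centroids, d)] += 1
    return counts
```

Each cluster index then serves as one visual word type, and the per-video counts are the occurrence frequencies that a later scoring step can consume.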

The video evaluation means 6 calculates, for each divided video, a score that serves as the evaluation criterion: a predetermined index that identifies how characteristic each visual word type (feature amount) is within the original video. It assigns a score to each divided video (video basic unit) on the basis of the occurrence tendency of each visual word analyzed by the video analysis means 5. Because each divided video is represented as a set of visual words, techniques from conventional language processing can be applied. For example, the video evaluation means 6 can use TF-IDF (term frequency-inverse document frequency), a representative technique in language processing. TF-IDF is used in fields such as information retrieval and text mining and is an index of how characteristic a particular word appearing in a document is.

In Equation (1), tf(t, d) is the frequency with which visual word t occurs in video basic unit (divided video) d, N is the total number of video basic units in the program (original video), and df(t) is the number of video basic units that contain visual word t. The score tfidf(t, d) given by Equation (1) combines the index TF, the occurrence frequency (occurrence count) of visual word t, with IDF, which acts as a kind of common-word filter: IDF lowers the importance of words that occur in many video basic units (common words) and raises the importance of words (visual words) that occur only in particular video basic units.
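
Equation (1) itself is not reproduced in this text. Assuming it takes the standard TF-IDF form, which is consistent with the definitions of tf(t, d), N, and df(t) given above, it could be written as follows; the use of a logarithm and the absence of any smoothing term are assumptions.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)} \tag{1}
```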

Here, the video evaluation means 6 calculates the score of each video basic unit (divided video) by summing the TF-IDF values of the visual word types contained in the video basic unit and normalizing the sum by the length of the video basic unit or by the total number of visual words occurring in it. This allows the video evaluation means 6 to give high scores to visually distinctive video basic units. The score calculated for each divided video is output to the video section selection means 8.
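
The scoring described above can be sketched as follows, under two assumptions that the text does not fix: that the IDF term is log(N / df(t)), and that normalization is by the number of visual words in the divided video (the other option mentioned, normalizing by shot length, is not shown). The name `score_shots` and the input layout are hypothetical.

```python
import math
from collections import Counter


def score_shots(shots):
    """Score each divided video by its normalized sum of TF-IDF values.

    shots is a list of lists of visual-word ids, one inner list per divided video.
    """
    n = len(shots)
    # df[t]: number of divided videos that contain visual word t at least once.
    df = Counter()
    for words in shots:
        df.update(set(words))
    scores = []
    for words in shots:
        if not words:
            scores.append(0.0)
            continue
        tf = Counter(words)  # tf[t]: occurrences of t in this divided video
        total = sum(count * math.log(n / df[t]) for t, count in tf.items())
        scores.append(total / len(words))  # normalize by visual-word count
    return scores
```

A visual word that occurs in every divided video contributes log(1) = 0, so only words confined to a few divided videos raise a score.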

The storage means 7 stores the divided videos and is a general means for storing data such as video, for example a hard disk or magneto-optical disk. Here, each divided video is stored together with its time length and time information giving its start and end times.

The video section selection means 8 sequentially selects video sections (divided videos) from the original video, in descending order of the score given by the video evaluation means 6 and from the earliest video time, until the time length of the summary video is reached. So as to meet the target length of the summary video set in advance through the target length setting means 11 via the external input means 10, it selects high-scoring divided videos from those stored in the storage means 7 on the basis of the scores from the video evaluation means 6, reorders the selected divided videos, and concatenates them to generate the summary video. As shown in FIG. 2, the video section selection means 8 comprises divided video selection means 8b, divided video concatenation means 8c, adjustment means 8d, and memory 8e.

The divided video selection means 8b refers to the score calculated for each divided video by the video evaluation means 6 and selects divided videos from the storage means 7 in descending order of score until the preset time length of the summary video is reached. When it selects a high-scoring divided video, it outputs information indicating the position of that divided video in the original video (for example, division time information or divided video number information) to the adjustment means 8d. It then selects the divided video with the next highest value from the per-divided-video scores of the original video stored in the memory 8e, and again outputs the result (division time information, divided video number information, or the like) to the adjustment means 8d.

The divided video selection means 8b continues selecting divided videos on the basis of their scores until the sum of their time lengths reaches the preset time length of the summary video. Information identifying each selected divided video (division time information, divided video number information, or the like) is output to the divided video concatenation means 8c until that time length is reached.

Taking as a reference the position in the original video of the divided video selected by the divided video selection means 8b (for example, its division time information or divided video number information), the adjustment means 8d lowers the scores of the divided videos within a predetermined range before and after the reference in the time direction. Both the range of divided videos whose scores are to be lowered and the amount by which the scores are lowered are preset in the adjustment means 8d.

For example, the adjustment means 8d halves the scores of the ten divided videos before and the ten divided videos after the selected divided video in the time direction. The lowered scores are sent to the memory 8e, where they overwrite the stored scores. The divided video selection means 8b then selects the divided video with the next highest score overall on the basis of the rewritten scores.

The memory 8e stores the score of each divided video (divided video number) held in the storage means 7 and is a general storage medium such as a semiconductor memory. The contents of the memory 8e are rewritten and updated with the scores adjusted by the adjustment means 8d.

The divided video concatenation means 8c arranges the divided videos selected by the divided video selection means 8b in order of video time, earliest first, and concatenates them. It orders them by their start times T_1 to T_n (or by divided video number information), concatenates them, and outputs the resulting summary video via the output means 9. The output summary video is stored, linked to the original video, on a hard disk or the like (not shown).
Configuring the summary video generation apparatus 1 as described above makes it possible to generate a summary video from the visual features contained in the video, regardless of the genre of the video.

[Operation of the summary video generation apparatus]
Next, the operation of the summary video generation apparatus 1 according to the present invention will be described. An outline of the overall operation is given first, followed by the detailed operation of the individual means.

(Overall operation)
First, the overall operation of the summary video generation apparatus 1 will be described with reference to FIG. 4 (see FIG. 2 as appropriate for the configuration).
The summary video generation apparatus 1 receives the original video via the input means 2 and divides it, by the video division means 4, into divided videos that serve as video basic units (step S1). Here, the video division means 4 divides the video at scene boundaries to form the video basic units.

The summary video generation apparatus 1 then uses the video analysis means 5 to extract key frames from each divided video produced in step S1, detect the block regions that are features (feature points) of each key frame image, regard the detected block regions as visual words, and analyze them so that the visual word types are classified by gradient histogram (step S2). The video analysis means 5 performs this analysis on the key frames of all divided videos (video basic units) (step S3).

The summary video generation apparatus 1 then uses the video evaluation means 6 to calculate and assign a score to each divided video from the visual words analyzed in step S2 (step S4). The processing of the video evaluation means 6 is repeated until scores have been calculated for all divided videos (video basic units) (step S5).
Further, the summary video generation apparatus 1 uses the video section selection means 8 to select high-scoring video sections (divided videos) from the storage means 7, in which the divided videos of the original video are stored, on the basis of the scores calculated in step S4, adjusting the scores as it does so. It arranges the selected divided videos in order of video time, earliest first, and concatenates them until the preset time length is reached, thereby generating the summary video (step S6).
Through the above operation, the summary video generation apparatus 1 generates a summary video from the visual features contained in the video.

(Video basic unit division operation)
Next, the operation by which the video division means 4 of the summary video generation apparatus 1 generates divided videos (video basic units) will be described in detail with reference to FIG. 5 (see FIG. 2 as appropriate for the configuration). This operation corresponds to step S1 of the overall operation of the summary video generation apparatus 1 described with reference to FIG. 4.

First, when video is input to the video division means 4 frame image by frame image (step S11), the summary video generation apparatus 1 uses the video division means 4 to detect shots in the original video (step S12). For example, the video division means 4 takes the color difference between frame images and, when the difference exceeds a preset threshold, detects that point as the boundary of a shot.
The summary video generation apparatus 1 then uses the video division means 4 to divide the video, for each shot detected in step S12, into divided videos that serve as video basic units (step S13). The video division means 4 tags each divided video with time information giving the start and end of the shot. If not all frames have been processed (No in step S14), the process returns to step S12 and repeats until every frame of the original video has been processed (Yes in step S14).
Through this operation, the summary video generation apparatus 1 can generate, from the original video, divided videos in units of shots that serve as video basic units.

(Divided video analysis operation)
Next, the operation by which the video analysis means 5 of the summary video generation apparatus 1 analyzes the visual word types in the divided videos (video basic units) will be described in detail with reference to FIG. 6 (see FIG. 2 as appropriate for the configuration). This operation corresponds to step S2 of the overall operation of the summary video generation apparatus 1 described with reference to FIG. 4.

First, when a divided video is input to the video analysis means 5, the summary video generation apparatus 1 uses the key frame extraction means 5a to extract key frames under preset conditions (step S21). For example, the key frame extraction means 5a extracts more key frames from the parts where motion is intense.
The summary video generation apparatus 1 then uses the key frame image feature detection means 5b_1 of the feature block region extraction means 5b to detect the block regions that characterize the frame image of each key frame extracted in step S21 (step S22). The block regions detected in step S22 are regarded and treated as visual words.

After the block regions have been detected and regarded as visual words in step S22, the summary video generation apparatus 1 uses the gradient histogram generation means 5b_2 to generate a gradient histogram for each visual word (step S23). Here, the gradient histogram generation means 5b_2 generates a luminance gradient histogram, which is treated as the type of the visual word.

Once luminance gradient histograms have been generated for the block region features in step S23, the summary video generation apparatus 1 uses the clustering means 5c to narrow the classification of the luminance gradient histograms down to a preset number of types (number of clusters) and thereby generate the visual word types (step S24). The apparatus repeats the processing until visual word types have been assigned for all key frames (No in step S25); when steps S22 to S24 have been completed for all key frames (Yes in step S25), the divided video analysis ends.

(Video section selection and concatenation operation)
Next, the operation by which the video section selection means 8 of the summary video generation apparatus 1 selects and concatenates video sections will be described in detail with reference to FIG. 7 (see FIG. 2 as appropriate for the configuration). This operation corresponds to step S6 of the overall operation of the summary video generation apparatus 1 described with reference to FIG. 4.

First, when the score of each divided video is input from the video evaluation means 6 to the video section selection means 8, the summary video generation apparatus 1 uses the divided video selection means 8b of the video section selection means 8 to select, from the divided videos stored in the storage means 7, the divided video with the highest score (step S61). All of the per-divided-video scores are also sent to and stored in the memory 8e. When a divided video is selected, the divided video selection means 8b outputs to the adjustment means 8d information indicating which divided video was selected (any information that identifies the selected divided video, such as its start time or divided video number, will do).

その後、要約映像生成装置１は、調整手段８ｄにより、メモリ８ｅに記憶されている各分割映像のスコアに対して、選択された分割映像から時間方向に前後となる所定範囲、例えば、１０ずつの分割映像のスコアを１／２となるように調整する（ステップＳ６２）。この調整手段８ｄは、どの範囲の分割映像をどれだけスコアを引き下げるかが外部から予め設定されているものとする。これによって、次に、分割映像を選択する際に、すでに選択された分割画像と時間的に近い分割画像が選択されにくくなる。 After that, the summary video generation device 1 uses the adjusting unit 8d to adjust the scores stored in the memory 8e: the scores of the divided videos within a predetermined range before and after the selected divided video in the time direction, for example 10 divided videos on each side, are reduced to 1/2 (step S62). The range of divided videos affected and the amount by which their scores are lowered are assumed to be set in the adjusting unit 8d in advance from outside. As a result, divided videos that are temporally close to an already selected divided video become less likely to be chosen the next time a divided video is selected.

そして、要約映像生成装置１は、分割映像選択手段８ｂにより、ステップＳ６１で順次選択された分割映像の時間長の合計が、予め設定されている目標長さであるか否かが判定され（ステップＳ６３）、まだ目標長さに達していない場合（ステップＳ６３でＮｏ）、ステップＳ６１に戻ってステップＳ６１，Ｓ６２の動作を繰り返す。
一方、選択された分割映像の時間長の合計が、目標長さを越えた場合（ステップＳ６３でＹｅｓ）、要約映像生成装置１は、分割映像連結手段８ｃにより、ステップＳ６１で選択された分割映像を映像時間の早い方から並べて連結する（ステップＳ６４）。なお、このとき、目標長さを超過した映像については、カットすることとしてもよい。 Then, the summary video generation device 1 uses the divided video selection unit 8b to determine whether the total time length of the divided videos selected so far in step S61 has reached the preset target length (step S63). If the target length has not yet been reached (No in step S63), the process returns to step S61 and the operations of steps S61 and S62 are repeated.
On the other hand, if the total time length of the selected divided videos exceeds the target length (Yes in step S63), the summary video generating device 1 uses the divided video connecting means 8c to arrange the divided videos selected in step S61 in order of earliest video time and connect them (step S64). At this point, the portion of video that exceeds the target length may be cut.
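The selection loop of steps S61 to S64 can be sketched as follows. This is a minimal illustration rather than the apparatus's actual implementation: the `Segment` type, the function name, and the parameter names `radius` and `damping` are hypothetical, and the defaults (10 neighbours on each side, scores halved) simply follow the example values given for step S62.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    index: int       # position of the divided video in the original video
    start: float     # start time in seconds
    duration: float  # length in seconds
    score: float     # score supplied by the evaluation stage


def select_segments(segments, target_length, radius=10, damping=0.5):
    """Greedy selection (S61 to S63) followed by temporal reordering (S64)."""
    by_index = {s.index: s for s in segments}
    scores = {s.index: s.score for s in segments}  # working copy, like memory 8e
    chosen, total = [], 0.0
    while scores and total < target_length:        # S63: stop once target is reached
        best = max(scores, key=scores.get)         # S61: highest remaining score
        chosen.append(by_index[best])
        total += by_index[best].duration
        del scores[best]                           # a segment is picked at most once
        for i in range(best - radius, best + radius + 1):
            if i in scores:
                scores[i] *= damping               # S62: lower scores of neighbours
    return sorted(chosen, key=lambda s: s.start)   # S64: earliest first
```

Because neighbouring scores are lowered after every pick, two adjacent high-scoring segments are unlikely to both be selected, which spreads the summary across the whole program. Trimming the part that overruns the target length is left out of this sketch.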

以上説明したように、要約映像生成装置１では、映像を映像基本単位にしてキーフレームから特徴を分析して視覚単語とし、その視覚単語の種類に対して、元映像において特徴的であることを識別する予め定めた指標を評価基準となるスコアを算出して特徴のある分割映像を選択している。つまり、要約映像生成装置１では、映像の特徴のあるブロック領域を含む分割映像を選択して要約映像を生成するので、映像のジャンル等によることなく汎用的に要約映像を生成することができる。 As described above, the summary video generation apparatus 1 analyzes the features from the key frame using the video as a basic video unit to obtain a visual word, and the type of the visual word is characteristic in the original video. A characteristic divided video is selected by calculating a score that serves as an evaluation criterion for a predetermined index to be identified. That is, the summary video generation apparatus 1 generates a summary video by selecting a divided video including a block region having a video feature, and therefore, the summary video can be generated universally without depending on the video genre or the like.

［要約映像生成装置の他の構成］
次に、本発明の他の実施形態に係る要約映像生成装置の構成について説明する。図１（図２）で説明した要約映像生成装置１は、映像のデータ（映像ストリーム等）から要約映像を生成するものであったが、映像及び音声のデータから要約映像を生成するように構成してもよい。 [Other configuration of summary video generator]
Next, the configuration of a summary video generation apparatus according to another embodiment of the present invention will be described. The summary video generation apparatus 1 described with reference to FIG. 1 (FIG. 2) generates a summary video from video data (a video stream, etc.), but the apparatus may instead be configured to generate a summary video from video and audio data.

ここで、図８（図９）を参照して、本発明の他の実施形態に係る要約映像生成装置１Ａの構成について説明する。なお、図１（図２）で説明した要約映像生成装置１と同一の構成については、同じ符号を付して説明を省略する。 Here, with reference to FIG. 8 (FIG. 9), the configuration of the summary video generating apparatus 1A according to another embodiment of the present invention will be described. Components identical to those of the summary video generation apparatus 1 described with reference to FIG. 1 (FIG. 2) are given the same reference numerals, and their description is omitted.

要約映像生成装置１Ａは、ＭＥＰＧ２ストリームのようなデジタル放送番組映像の映像、映像の音声及び映像の文字データを入力してその入力した情報から、映像及び音声の特徴ある部分を選択して、当該映像より映像時間が短い時間長となる要約映像を生成して出力するものである。この要約映像生成装置１Ａは、入力手段２と、ストリーム分離手段３と、映像分割手段４と、映像解析手段５と、映像評価手段６と、映像区間選択手段８Ａと、出力手段９と、文字情報抽出手段１５と、形態素解析手段１６と、音声評価手段１７と、目標長さ設定手段１１と、を備えている。
入力される映像となる元映像には、少なくとも映像データである映像ストリームと、音声データである音声ストリームがあればよく、さらに、データストリームが文字情報を含んでいれば、要約映像生成装置１Ａでは、その文字情報が使用される。 The summary video generation apparatus 1A receives the video, the audio, and the character data of a digital broadcast program video such as an MPEG2 stream, selects characteristic portions of the video and audio from that input, and generates and outputs a summary video whose time length is shorter than that of the original video. The summary video generation apparatus 1A includes an input unit 2, a stream separation unit 3, a video division unit 4, a video analysis unit 5, a video evaluation unit 6, a video section selection unit 8A, an output unit 9, a character information extraction unit 15, a morpheme analysis unit 16, an audio evaluation unit 17, and a target length setting unit 11.
The original video that serves as input only needs to contain at least a video stream (video data) and an audio stream (audio data). If the data stream also contains character information, the summary video generation apparatus 1A uses that character information.

入力手段２を介して入力された映像は、ストリーム分離手段３により、映像ストリームと、音声ストリームと、データストリームとに分離され、映像ストリームが映像分割手段４に出力され、音声ストリーム及びデータストリームが文字情報抽出手段１５に出力される。ストリーム分離手段３は、例えば、映像がＭＰＥＧ２ＴＳの場合であれば、コンポーネントタグの値により各ストリームを分離する。 The video input via the input means 2 is separated by the stream separation means 3 into a video stream, an audio stream, and a data stream, the video stream is output to the video division means 4, and the audio stream and data stream are It is output to the character information extraction means 15. For example, if the video is MPEG2TS, the stream separation unit 3 separates each stream based on the value of the component tag.

ストリーム分離手段３により分離された映像ストリームは、映像分割手段４に入力される。そして、すでに説明したように、要約映像生成装置１Ａは、フレームからショット等を検出して映像基本単位となる分割映像を生成し、映像解析手段５及び映像評価手段６により処理されて分割映像ごとのスコアを算出する。そして、算出された分割映像ごとのスコアは、映像区間選択手段８Ａに出力される。
一方、ストリーム分離手段３により分離された音声ストリームとデータストリームは、文字情報抽出手段１５に入力される。 The video stream separated by the stream separation unit 3 is input to the video division unit 4. Then, as already described, the summary video generation apparatus 1A detects a shot or the like from the frame to generate a divided video as a basic video unit, and is processed by the video analysis unit 5 and the video evaluation unit 6 for each divided video. Calculate the score. Then, the calculated score for each divided video is output to the video section selection means 8A.
On the other hand, the audio stream and the data stream separated by the stream separation unit 3 are input to the character information extraction unit 15.

文字情報抽出手段１５は、入力された音声ストリーム又はデータストリームから文字データを抽出するものである。この文字情報抽出手段１５で抽出された文字データは、形態素解析手段１６に出力される。ここでは、文字情報抽出手段１５は、データストリームから文字データ検索し、あるいは、音声データから文字データを生成する。この文字情報抽出手段１５は、文字データ検出手段１５ａと、音声認識手段１５ｂとを備えている。 The character information extraction unit 15 extracts character data from the input audio stream or data stream. The character data extracted by the character information extraction unit 15 is output to the morpheme analysis unit 16. Here, the character information extraction means 15 searches for character data from the data stream or generates character data from the audio data. The character information extraction means 15 includes character data detection means 15a and voice recognition means 15b.

文字データ検出手段１５ａは、ストリーム分離手段３で分離されたデータストリームから、クローズドキャプションの文字データを検出するものである。なお、文字データ検出手段１５ａは、データストリーム中にクローズドキャプションがない場合に、音声認識手段１５ｂに信号出力して音声認識を行わせ、また、データストリーム中にクローズドキャプションがある場合に、当該クローズドキャプションから文字データを検出する。この文字データ検出手段１５ａは、クローズドキャプションが検出できたか否かについて、音声認識手段１５ｂに信号を出力して、検出したときには、音声認識処理を行わず、検出できなかったときには、音声認識処理を行わせるようにしている。文字データ検出手段１５ａで検出された文字データは、形態素解析手段１６に出力される。 The character data detection means 15a detects closed caption character data in the data stream separated by the stream separation means 3. When there is no closed caption in the data stream, the character data detection means 15a outputs a signal to the voice recognition means 15b to make it perform voice recognition; when there is a closed caption in the data stream, it detects character data from that closed caption. That is, the character data detection means 15a signals the voice recognition means 15b whether or not a closed caption was detected, so that voice recognition is skipped when one was detected and performed when none was. The character data detected by the character data detection means 15a is output to the morpheme analysis means 16.

音声認識手段１５ｂは、ストリーム分離手段３で分離された音声ストリームを音声認識してテキストデータ等の文字データを生成するものである。なお、音声認識手段１５ｂは、文字データ検出手段１５ａからの音声認識を行う旨の信号により、音声ストリームを音声認識する。この音声認識手段１５ｂは、一般的な音声認識装置を用いればよい。音声認識手段１５ｂにより認識された文字データは、形態素解析手段１６に出力される。
文字情報抽出手段１５は、文字データ検出手段１５ａで検出した文字データ（テキストデータ）か、あるいは、音声認識手段１５ｂで音声認識して生成した文字データ（テキストデータ）かのいずれかを形態素解析手段１６に出力する。 The voice recognition means 15b recognizes the voice stream separated by the stream separation means 3 and generates character data such as text data. Note that the voice recognition means 15b recognizes the voice stream by a signal indicating voice recognition from the character data detection means 15a. The voice recognition means 15b may use a general voice recognition device. The character data recognized by the voice recognition unit 15 b is output to the morpheme analysis unit 16.
The character information extraction means 15 outputs to the morpheme analysis means 16 either the character data (text data) detected by the character data detection means 15a or the character data (text data) generated through voice recognition by the voice recognition means 15b.

形態素解析手段１６は、入力した文字データを形態素解析し、テキストデータを単語（形態素）へ分割するものである。この形態素解析手段１６は、テキストデータを単語（言語で意味を持つ最小単位）に分割する。この形態素解析手段１６で解析された文字データは、音声評価手段１７に出力される。なお、形態素解析手段１６で形態素に分割された文字データは、映像ストリームにおけるどのタイミングで表示あるいは発音されるかの時間情報を付された状態で音声評価手段１７に出力される。また、形態素解析手段１６は、予め設定されている記号や不用語について、除去した後に音声評価手段１７に文字データを出力する。 The morpheme analyzing means 16 performs morphological analysis on the input character data and divides the text data into words (morphemes), that is, into the smallest units that carry meaning in the language. The character data analyzed by the morpheme analysis unit 16 is output to the audio evaluation unit 17. The character data divided into morphemes is output to the audio evaluation unit 17 together with time information indicating when in the video stream each morpheme is displayed or spoken. The morpheme analyzing unit 16 also removes preset symbols and unneeded words (stop words) before outputting the character data to the audio evaluation unit 17.

音声評価手段１７は、入力した文字データの形態素に対して、映像分割手段４から入力される分割映像ごとの単位となる情報により分割映像ごとのスコアを算出するものである。この音声評価手段１７は、映像分割手段４から分割された分割映像の時間情報を入力すると、その分割映像の時間情報ごとに形態素の範囲を区分けして、その区分けした分割映像の単位ごとに特定の文字（形態素）が、元映像において特徴的であることを識別する予め定めた指標を評価基準となるスコアとして算出される。音声評価手段１７は、例えば、すでに説明した式（１）で示すＴＦ−ＩＤＦにより各形態素のスコアを分割映像ごとの単位で算出して評価基準を付す。この音声評価手段１７により算出される分割映像ごとのスコアにより文字データで示されるナレーションや台詞に特徴的な単語が出現する映像基本単位（分割映像）に高いスコアを与えることが可能となる。この音声評価手段１７で算出された分割映像ごとのスコアは、映像区間選択手段８Ａに出力される。 The voice evaluation unit 17 calculates a score for each divided video based on information serving as a unit for each divided video input from the video dividing unit 4 with respect to the morpheme of the input character data. When the audio evaluation unit 17 inputs time information of the divided video divided from the video division unit 4, the audio evaluation unit 17 divides the range of the morpheme for each time information of the divided video, and specifies the unit of the divided divided video. A predetermined index for identifying that the character (morpheme) is characteristic in the original video is calculated as a score serving as an evaluation criterion. For example, the voice evaluation unit 17 calculates the score of each morpheme by the unit of each divided video by the TF-IDF represented by the formula (1) already described, and attaches an evaluation criterion. It is possible to give a high score to a basic video unit (divided video) in which a narration or a word characteristic in dialogue appears in the character data based on the score for each divided video calculated by the audio evaluation means 17. The score for each divided video calculated by the audio evaluation unit 17 is output to the video section selection unit 8A.
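The per-segment scoring performed by the audio evaluation unit 17 might look like the sketch below. It assumes the common form of TF-IDF (term frequency normalised within the divided video, multiplied by log(N / df)) and assumes that a divided video's score is the sum over its morphemes; the description refers to equation (1) for the exact definition, so both choices, along with the function name, are illustrative.

```python
import math
from collections import Counter


def tfidf_segment_scores(segment_terms):
    """Score each divided video from the morphemes that fall inside it.

    segment_terms: one list of morphemes per divided video, in temporal order.
    Returns one score per divided video (sum of TF-IDF over its morphemes).
    """
    n = len(segment_terms)
    df = Counter()
    for terms in segment_terms:
        df.update(set(terms))             # count a term once per divided video
    scores = []
    for terms in segment_terms:
        counts = Counter(terms)
        total = sum(counts.values())
        score = 0.0
        for term, count in counts.items():
            tf = count / total            # frequency normalised within the segment
            idf = math.log(n / df[term])  # rarer across segments -> larger weight
            score += tf * idf
        scores.append(score)              # empty segments score 0.0
    return scores
```

A divided video whose narration repeats a word that appears in few other divided videos receives a high score, while one containing only words spread across the whole program scores close to zero.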

映像区間選択手段８Ａは、映像評価手段６で算出した分割映像ごとのスコアと、音声評価手段１７で算出した分割映像ごとのスコアとを統合して、元映像から分割映像の映像区間を選択する。この映像区間選択手段８は、統合手段８ａと、分割映像選択手段８ｂと、分割映像連結手段８ｃと、調整手段８ｄと、メモリ８ｅとを備えている。なお、分割映像選択手段８ｂ、分割映像連結手段８ｃ、調整手段８ｄ及びメモリ８ｅは、すでに図２で説明したものと同じである。 The video section selection unit 8A integrates the score for each divided video calculated by the video evaluation unit 6 and the score for each divided video calculated by the audio evaluation unit 17, and selects the video section of the divided video from the original video. . The video section selecting means 8 includes an integrating means 8a, a divided video selecting means 8b, a divided video connecting means 8c, an adjusting means 8d, and a memory 8e. The divided video selection means 8b, the divided video connection means 8c, the adjustment means 8d, and the memory 8e are the same as those already described with reference to FIG.

統合手段８ａは、音声評価手段１７で算出した音声のスコアと、映像評価手段６で算出した映像のスコアとを統合（合計）して、映像基本単位である分割映像ごとのスコアを算出するものである。この統合手段８ａは、スコアの統合の方法として、例えば、重み付き和により統合する方法や、あるいは、重み付き積により統合する方法により統合スコアを算出することができる。
例えば、音声のスコアをｆａ（ｓ）、映像のスコアをｆｖ（ｓ）、重みをα（予め設定された値）としたとき、式（２）により、重み付き和による統合スコアｆ（ｓ）を算出することができ、式（３）により、重み付き積による統合スコアｆ（ｓ）を算出することができる。 The integration unit 8a integrates (totals) the audio score calculated by the audio evaluation unit 17 and the video score calculated by the video evaluation unit 6, and calculates a score for each divided video that is a video basic unit. It is. The integration means 8a can calculate an integrated score by, for example, a method of integrating by a weighted sum or a method of integrating by a weighted product as a score integration method.
For example, when the audio score is fa (s), the video score is fv (s), and the weight is α (preset value), the integrated score f (s) by the weighted sum is calculated according to Equation (2). And an integrated score f (s) based on a weighted product can be calculated according to equation (3).
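As a rough illustration of the two integration methods, the sketch below assumes that the weighted sum is a convex combination and that the weighted product raises each score to its weight. These forms are assumptions made for the example; equations (2) and (3) should be taken as the authoritative definitions, and the function names are hypothetical.

```python
def weighted_sum(fa, fv, alpha):
    """Assumed form: f(s) = alpha * fa(s) + (1 - alpha) * fv(s)."""
    return alpha * fa + (1.0 - alpha) * fv


def weighted_product(fa, fv, alpha):
    """Assumed form: f(s) = fa(s) ** alpha * fv(s) ** (1 - alpha)."""
    return (fa ** alpha) * (fv ** (1.0 - alpha))


def integrate(audio_scores, video_scores, alpha=0.5, method=weighted_sum):
    """Combine per-segment audio and video scores into one list of scores."""
    return [method(fa, fv, alpha) for fa, fv in zip(audio_scores, video_scores)]
```

Under these assumed forms, the sum lets a strong score in one modality compensate for a weak one, whereas the product stays low unless both the audio score and the video score are high.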

この統合手段８ａにより式（２）あるいは式（３）で統合したスコアｆ（ｓ）に基づいて、分割映像選択手段８ｂと、分割映像連結手段８ｃと、調整手段８ｄと、メモリ８ｅとによりスコアｆ（ｓ）を調整しながら分割映像を選択し、選択した分割映像を結合して予め設定した時間長となるように要約映像を生成する。 Based on the score f (s) integrated by Expression (2) or Expression (3) by the integration means 8a, the score is obtained by the divided video selection means 8b, the divided video connection means 8c, the adjustment means 8d, and the memory 8e. The divided videos are selected while adjusting f (s), and the selected divided videos are combined to generate a summary video so as to have a preset time length.

なお、図２（図９）では、記憶手段７に分割映像を記憶することとし、分割映像選択手段８ｂが記憶手段７から分割映像を選択し、分割映像連結手段８ｃがそれらを連結することとしたが、要約映像生成装置１Ａでは、映像区間選択手段８Ａは、分割映像を示す情報（分割時間情報、分割映像番号情報等）から、入力手段２を介して外部から、対応する分割映像を入力することとすることで、記憶手段７を省略する構成とした。
以上のように要約映像生成装置１Ａを構成することで、映像及び音声によって、特徴のある映像区間を抽出することができ、映像のジャンル等にかかわらず、要約映像を生成することができる。 In FIG. 2 (FIG. 9), the divided video is stored in the storage means 7, the divided video selecting means 8b selects the divided video from the storage means 7, and the divided video connecting means 8c connects them. However, in the summary video generation device 1A, the video segment selection unit 8A inputs the corresponding divided video from the outside via the input unit 2 from information indicating the divided video (divided time information, divided video number information, etc.). Therefore, the storage unit 7 is omitted.
By configuring the summary video generation apparatus 1A as described above, it is possible to extract a characteristic video section by video and audio, and it is possible to generate a summary video regardless of the video genre and the like.

［要約映像生成装置の動作］
次に、本発明に係る要約映像生成装置１Ａの動作について説明する。ここでは、要約映像生成装置１Ａの全体動作の概略について先に説明し、個別の手段における詳細動作についてはその後に説明することとする。なお、すでに説明したステップは、同じ符号を付してその説明を省略する。 [Operation of summary video generator]
Next, the operation of the summary video generating apparatus 1A according to the present invention will be described. Here, the outline of the overall operation of the summary video generation apparatus 1A will be described first, and the detailed operation of the individual means will be described afterwards. Steps that have already been described are given the same reference signs, and their description is omitted.

（全体動作）
まず、図１０を参照（構成については、適宜図８，図９参照）して、要約映像生成装置１Ａの全体動作について説明する。
まず、要約映像生成装置１Ａは、ＭＥＰＧ２ストリームのようなデジタル放送番組映像が入力手段２を介して入力されると、ストリーム分離手段３により、映像ストリームと、音声ストリームと、データストリームとに分離する（ステップＳ１Ａ）。そして、ストリーム分離手段３は、分離した映像ストリームを映像分割手段４に出力し、分離した音声ストリーム及びデータストリームを文字情報抽出手段１５に出力する。また、要約映像生成装置１Ａは、すでに説明したように、映像分割手段４により、映像ストリームから、映像基本単位である、例えばショットの単位となる分割映像を生成する（ステップＳ１）。 (Overall operation)
First, the overall operation of the summary video generating apparatus 1A will be described with reference to FIG. 10 (see FIGS. 8 and 9 as appropriate for the configuration).
First, when a digital broadcast program video such as an MPEG2 stream is input via the input unit 2, the summary video generation apparatus 1A uses the stream separation unit 3 to separate it into a video stream, an audio stream, and a data stream (step S1A). The stream separating unit 3 then outputs the separated video stream to the video dividing unit 4 and outputs the separated audio stream and data stream to the character information extracting unit 15. In addition, as already described, the summary video generation apparatus 1A uses the video dividing unit 4 to generate, from the video stream, divided videos that serve as video basic units, for example in units of shots (step S1).

さらに、要約映像生成装置１Ａは、文字情報抽出手段１５により、データストリームに映像の文字データであるクローズドキャプションがなければ音声ストリームを音声認識手段１５ｂにより音声認識して文字データを生成し、クローズドキャプションがあれば、文字データとして抽出する（ステップＳ２Ａ）。そして、要約映像生成装置１Ａは、ステップＳ２Ａで抽出された文字データが形態素解析手段１６に送られ、形態素解析手段１６により、文字データ（テキストデータ）の形態素解析が行われて文字データが形態素になるように解析する（ステップＳ２Ｂ）。一方、ステップＳ１で分割された分割映像は、映像解析手段５により、すでに説明したようにステップＳ２、Ｓ３の処理が行われ、特徴があるブロック領域を視覚単語とみなすような解析がなされる。 Further, the summary video generating apparatus 1A uses the character information extracting unit 15 as follows: if the data stream contains no closed caption (the character data of the video), the voice recognition unit 15b performs voice recognition on the audio stream to generate character data; if a closed caption is present, it is extracted as the character data (step S2A). The character data extracted in step S2A is then sent to the morpheme analysis unit 16, which performs morphological analysis on the character data (text data) to split it into morphemes (step S2B). Meanwhile, the divided videos produced in step S1 are processed by the video analysis means 5 in steps S2 and S3 as already described, so that characteristic block areas are treated as visual words.

また、要約映像生成装置１Ａは、音声評価手段１７により、ステップＳ２Ｂで解析された文字データの形態素について分割映像ごとにＴＦ−ＩＤＦ等の評価手法を用いて音声の音声スコアを算出し、さらに、映像評価手段６により、ステップＳ２で解析された視覚単語について分割映像ごとにＴＦ−ＩＤＦ等の評価手法を用いて映像の映像スコアを算出する（ステップＳ４α）。そして、映像評価手段６は、映像基本単位となる分割映像におけるスコアの全てが終了するまで（ステップＳ５αでＹｅｓ）、音声スコア及び映像スコアを算出する。 In addition, the summary video generation device 1A uses the voice evaluation unit 17 to calculate the voice score of the voice using an evaluation method such as TF-IDF for each divided video for the morpheme of the character data analyzed in step S2B. The video evaluation means 6 calculates the video score of the video using an evaluation method such as TF-IDF for each divided video for the visual word analyzed in step S2 (step S4α). Then, the video evaluation unit 6 calculates the audio score and the video score until all of the scores in the divided video serving as the video basic unit are completed (Yes in step S5α).

そして、要約映像生成装置１Ａは、音声スコア及び映像スコアが算出されると、映像区間選択手段８Ａにより、音声スコア及び映像スコアを統合手段８ａにより統合したスコアが算出される。そして、要約映像生成装置１Ａは、分割映像選択手段８ｂにより統合したスコアに基づいて、元映像からスコアの高い映像区分となる分割映像が選択される。このとき、要約映像生成装置１Ａは、調整手段８ｄによって、すでに説明したようにスコアの調整が行われ、分割映像連結手段８ｃによって、選択された分割映像が映像時間の早い順に並びかえられて連結され、予め設定された時間長となる要約映像が生成される（ステップＳ６α）。
以上の動作によって、要約映像生成装置１Ａは、映像や音声に含まれる特徴によって、要約映像を生成する。 In the summary video generation apparatus 1A, when the audio score and the video score are calculated, the video section selection unit 8A calculates a score obtained by integrating the audio score and the video score by the integration unit 8a. Then, based on the score integrated by the divided video selection unit 8b, the summary video generation apparatus 1A selects the divided video that is the video segment with the higher score from the original video. At this time, the summary video generating apparatus 1A adjusts the score as described above by the adjusting unit 8d, and the divided video connecting unit 8c rearranges the selected divided videos in order of the video time and connects them. Then, a summary video having a preset time length is generated (step S6α).
Through the above operation, the summary video generating apparatus 1A generates a summary video based on the characteristics included in the video and audio.

（文字抽出動作）
次に、図１１を参照（構成については、適宜図９参照）して、要約映像生成装置１Ａの文字情報抽出手段１５において、文字データを抽出する動作について詳細に説明する。なお、この動作は、図１０で説明した要約映像生成装置１Ａの全体動作のうちのステップＳ２Ａの動作に相当する。 (Character extraction operation)
Next, with reference to FIG. 11 (refer to FIG. 9 as appropriate for the configuration), the operation of extracting character data in the character information extraction means 15 of the summary video generation apparatus 1A will be described in detail. This operation corresponds to the operation of step S2A in the overall operation of the summary video generating apparatus 1A described with reference to FIG.

まず、要約映像生成装置１Ａは、音声ストリーム及びデータストリームが文字情報抽出手段１５に入力されると、文字データ検出手段１５ａにより、データストリーム中に文字データであるクローズドキャプションが存在するか否かを判定する（ステップＳ２Ａａ）。ここで、データストリーム中にクローズキャプションがある場合（ステップＳ２ＡａでＹｅｓ）、クローズドキャプションを文字データとして検出する（ステップＳ２Ａｂ）。このとき、文字データ検出手段１５ａは、音声認識手段１５ｂに対して、音声認識を行わない旨の指示（信号）を通知する。 First, when the audio stream and the data stream are input to the character information extraction unit 15, the summary video generation apparatus 1A determines whether or not a closed caption that is character data exists in the data stream by the character data detection unit 15a. Determine (step S2Aa). If there is a closed caption in the data stream (Yes in step S2Aa), the closed caption is detected as character data (step S2Ab). At this time, the character data detection means 15a notifies the voice recognition means 15b of an instruction (signal) not to perform voice recognition.

一方、データストリーム中にクローズドキャプションがない場合（ステップＳ２ＡａでＮｏ）、音声認識手段１５ｂに音声認識を行う旨の指示（信号）を通知し、音声認識手段１５ｂは、その信号により音声ストリームを音声認識して文字データを生成する（ステップＳ２Ａｃ）。
このように、要約映像生成装置１Ａは、データストリーム中に映像で使用されるクローズドキャプションのような文字データがあった場合、音声認識を行わないため、動作の負荷を軽減することができる。 On the other hand, when there is no closed caption in the data stream (No in step S2Aa), the character data detection means 15a notifies the voice recognition means 15b of an instruction (signal) to perform voice recognition, and the voice recognition means 15b, in response to that signal, performs voice recognition on the audio stream and generates character data (step S2Ac).
As described above, the summary video generation apparatus 1A does not perform voice recognition when there is character data such as closed caption used in the video in the data stream, and thus can reduce the operation load.
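The branch of steps S2Aa to S2Ac amounts to a simple fallback, sketched below. The function name and the `recognize` callback are hypothetical stand-ins for the character data detection means 15a and the voice recognition means 15b; no particular speech recognition library is assumed.

```python
from typing import Callable, Optional


def extract_text(captions: Optional[str],
                 audio: bytes,
                 recognize: Callable[[bytes], str]) -> str:
    """Return caption text when available, otherwise run speech recognition."""
    if captions:                 # S2Aa: closed caption present?
        return captions          # S2Ab: use it, recognition is never invoked
    return recognize(audio)      # S2Ac: fall back to speech recognition
```

Passing the recogniser as a callback keeps the costly step lazy: it runs only when the data stream carries no closed caption, which mirrors the load reduction described above.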

（映像区間選択・連結動作）
次に、図１２を参照（構成については、適宜図９参照）して、要約映像生成装置１Ａの映像区間選択手段８Ａにおいて、映像区間を選択し、連結する動作について詳細に説明する。なお、この動作は、図１０で説明した要約映像生成装置１Ａの全体動作のうちのステップＳ６αの動作に相当する。 (Video section selection / connection operation)
Next, referring to FIG. 12 (refer to FIG. 9 as appropriate for the configuration), the operation of selecting and connecting video sections in the video section selection means 8A of the summary video generation apparatus 1A will be described in detail. This operation corresponds to the operation of step S6α in the overall operation of the summary video generating apparatus 1A described in FIG.

まず、映像評価手段６から出力される分割映像の映像に対する映像のスコアと、音声評価手段１７から出力される音声に対応する文字データである音声のスコアとが映像区間選択手段８Ａに入力されると、要約映像生成装置１Ａは、統合手段８ａにより、両スコアを統合したスコアを算出する（ステップＳ１６ａ）。そして、要約映像生成装置１Ａは、映像基本単位である分割映像について統合したスコアを算出し、全ての分割映像の処理が行われていない場合には（ステップＳ１６ｂでＮｏ）、繰り返しステップＳ１６ａの動作を行う。 First, the video score for the divided video image output from the video evaluation unit 6 and the audio score which is character data corresponding to the audio output from the audio evaluation unit 17 are input to the video section selection unit 8A. Then, the summary video generating apparatus 1A calculates a score obtained by integrating both scores by the integrating unit 8a (step S16a). Then, the summary video generating apparatus 1A calculates an integrated score for the divided video that is the basic video unit, and when all the divided videos have not been processed (No in Step S16b), the operation of Step S16a is repeated. I do.

そして、要約映像生成装置１Ａは、全映像基本単位となる全ての分割映像についての統合したスコアの算出が終了したら（ステップ１６ｂでＹｅｓ）、分割映像選択手段８ｂにより、元映像からスコアの高い順に映像区間に対応する分割映像を選択する。要約映像生成装置１Ａでは、分割映像を選択するステップＳ１６ｃから、調整手段８ｄによりメモリ８ｅに記憶されている分割映像のスコアの調整を行うステップＳ１６ｄ、要約映像の時間長となるまで繰り返し処理するステップ１６ｅ（Ｙｅｓ、Ｎｏ）、ならびに、分割映像連結手段８ｃにより分割映像を連結するステップＳ１６ｆについては、すでに図７で説明したステップＳ６１〜Ｓ６４と同等の動作を行って要約映像を生成する。 Then, when the summary video generation apparatus 1A finishes calculating the integrated score for all the divided videos that are all video basic units (Yes in step 16b), the divided video selection unit 8b causes the score to increase in descending order from the original video. Select the divided video corresponding to the video section. In the summary video generating apparatus 1A, from step S16c for selecting a split video, step S16d for adjusting the score of the split video stored in the memory 8e by the adjusting means 8d, and a step of repeatedly processing until the time length of the summary video is reached. 16e (Yes, No) and step S16f for connecting the divided images by the divided image connecting means 8c, the same operation as steps S61 to S64 already described in FIG. 7 is performed to generate the summary image.

要約映像生成装置１Ａは、以上説明した各ステップにより、映像の特徴と音声の特徴の両方から分割映像となる映像区間を、元映像から選択して生成するので、より特徴を正確に表わす要約映像を生成することが可能となる。 The summary video generation apparatus 1A selects and generates a video section that becomes a split video from both video features and audio features by the above-described steps, so that the summary video more accurately represents the features. Can be generated.

［要約映像生成装置の変形例］
すでに説明した図１，図２で示す要約映像生成装置１と、図９，図１０で示す要約映像生成装置１Ａについて、以下のような構成としてもよい例を説明する。 [Variation of summary video generation device]
An example in which the following configuration may be used for the summary video generation device 1 shown in FIGS. 1 and 2 and the summary video generation device 1A shown in FIGS. 9 and 10 will be described.

すなわち、要約映像生成装置１，１Ａは、映像分割手段４において、ショットの単位を映像分割手段４で元映像あるいは映像ストリームから映像基本単位である分割映像として説明したが、映像基本単位は、元映像から均等な時間ごとに分割した映像区分を映像基本単位とする分割映像としても構わない。このような元映像から時間的に均等な時間位置で均等な時間長さの分割映像とする場合には、生成される要約映像の状態がショットを映像基本単位にしたものと比較した場合、結合部分に違和感があるようになる可能性があるが、元映像を選択するための内容を示すレベルにおいては使用できるものとなる。また、映像分割手段４は、カメラの動き（パン、チルト等）が変化する時間長さ方向の点を検出し、そのカメラの動きのある時間長さ方向の点を区切りとした映像区間を分割映像としても構わない。つまり、映像分割手段４は、予め設定された映像区間を映像基本単位として分割映像を生成する構成としても構わない。 That is, the summary video generation apparatuses 1 and 1A have been described with the video dividing unit 4 taking shots from the original video or video stream as the divided videos that serve as video basic units. However, the video basic unit may instead be a video segment obtained by dividing the original video at equal time intervals. When the original video is divided in this way into segments of equal length at evenly spaced positions, the joins in the resulting summary video may feel less natural than when shots are used as the video basic unit, but the summary remains usable for showing the content well enough to select an original video. The video dividing unit 4 may also detect the points in time at which the camera motion (pan, tilt, etc.) changes and use the video sections delimited by those points as divided videos. In short, the video dividing unit 4 may be configured to generate divided videos using any preset kind of video section as the video basic unit.
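The equal-interval variant of the video dividing unit 4 can be illustrated with a few lines. The function name and the use of seconds as the time unit are assumptions made for this sketch.

```python
import math


def split_uniform(total_duration, unit_length):
    """Divide [0, total_duration] into consecutive (start, end) units.

    All units have length unit_length except possibly the last one,
    which is shortened to end exactly at total_duration.
    """
    if unit_length <= 0:
        raise ValueError("unit_length must be positive")
    count = math.ceil(total_duration / unit_length)
    return [(i * unit_length, min((i + 1) * unit_length, total_duration))
            for i in range(count)]
```

For instance, a 10-second video split into 4-second units yields three divided videos, the last of which is 2 seconds long.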

なお、映像分割手段４では、ショット長の単位となる映像基本単位の分割映像に分割する例を説明したが、ショット長に対して閾値を設定し、その設定した閾値より長いショット長となる分割映像について、映像の動きに基づいてさらに分割するようにしても構わない。そして、閾値は、予め設定された値（例えば、これまでに生成されたスポット映像において使用されているカットの平均長）であってもよいし、外部入力手段１０から入力されたスポット映像の長さの情報等に基づいて算出された値としてもよい。 In addition, although the video dividing means 4 explained the example of dividing into the divided video of the basic video unit which is the unit of the shot length, the threshold is set for the shot length, and the division becomes the shot length longer than the set threshold. The video may be further divided based on the motion of the video. The threshold value may be a preset value (for example, the average length of cuts used in spot images generated so far) or the length of a spot image input from the external input means 10. It is good also as a value calculated based on the information of length.

また、要約映像生成装置１，１Ａは、映像解析手段５において、キーフレームのフレーム画像について特徴となるブロック領域を抽出し、そのブロック領域（視覚単語）の特徴量の種類を区分するため、輝度勾配ヒストグラムを一例として説明した。ただし、映像解析手段５は、ブロック領域（視覚単語）の特徴量の種類を区分することができるように、特徴量の分布、度数、レンジ等を予め設定した範囲ごとに区画することで、当該特徴量の種類を区分することができれば、特徴量の種類について特に限定されるものではない。 In addition, the summary video generation apparatuses 1 and 1A extract the block area that is characteristic of the frame image of the key frame in the video analysis unit 5 and classify the types of feature quantities of the block area (visual word). The gradient histogram has been described as an example. However, the video analysis means 5 divides the feature amount distribution, frequency, range, and the like into predetermined ranges so that the types of feature amounts of the block area (visual word) can be classified. As long as the types of feature quantities can be classified, the types of feature quantities are not particularly limited.

そして、要約映像生成装置１，１Ａは、映像解析手段５において、キーフレーム抽出手段５ａが、動きの激しいフレームを多く含むように抽出する構成の例として説明したが、キーフレーム抽出手段５ａにより、先頭から最後まで予め設定された均等な時間区間からキーフレームを抽出する構成としても構わない。あるいは、要約映像生成装置１，１Ａは、キーフレーム抽出手段５ａが乱数により映像区間から無作為にキーフレームを選択するようにしても構わない。つまり、要約映像生成装置１，１Ａは、映像解析手段５において、キーフレーム抽出手段５ａが、予め設定された条件により分割映像の区間からキーフレームを抽出するように構成しても構わない。 The summary video generation device 1 or 1A has been described as an example of a configuration in which the key frame extraction unit 5a extracts a large number of frames with high motion in the video analysis unit 5, but the key frame extraction unit 5a A configuration may be adopted in which key frames are extracted from equal time intervals set in advance from the beginning to the end. Alternatively, the summary video generation apparatuses 1 and 1A may be configured such that the key frame extraction unit 5a randomly selects key frames from the video section using random numbers. That is, the summary video generation apparatuses 1 and 1A may be configured such that in the video analysis unit 5, the key frame extraction unit 5a extracts a key frame from a segmented video segment according to a preset condition.
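The two alternative key frame strategies, evenly spaced and random, could be sketched as below. Frame indices are assumed to be integers with an inclusive range, and taking the centre of each equal interval is only one possible reading of "equal time intervals"; the function names are hypothetical.

```python
import random


def keyframes_uniform(first_frame, last_frame, count):
    """Pick one frame at the centre of each of `count` equal intervals."""
    span = last_frame - first_frame + 1
    return [first_frame + (2 * k + 1) * span // (2 * count)
            for k in range(count)]


def keyframes_random(first_frame, last_frame, count, seed=None):
    """Pick up to `count` distinct frames at random, returned in temporal order."""
    rng = random.Random(seed)
    population = range(first_frame, last_frame + 1)
    return sorted(rng.sample(population, min(count, len(population))))
```

Fixing `seed` makes the random variant reproducible, which is convenient when comparing summaries generated from the same divided video.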

また、要約映像生成装置１，１Ａは、特徴ブロック領域抽出手段５ｂが、フレーム画像の特徴点（オブジェクトのコーナー部分等）となる局所特徴量を検出する例として説明したが、フレーム画像の特徴となる大域特徴、あるいは、局所特徴及び大域特徴の両方を組み合わせた特徴部分を検出するようにしても構わない。なお、大域特徴を検出する手法としては、フレーム画像の小領域におけるテクスチャの色等の情報を利用することができる。例えば、特徴ブロック領域抽出手段５ｂが、キーフレーム画像における予め設定したブロック領域を視覚単語とみなして、その視覚単語としたブロック領域を特徴量ごとの種類に区分するようにしても構わない。 The summary video generation apparatuses 1 and 1A have been described as an example in which the feature block region extraction unit 5b detects a local feature amount that is a feature point (such as a corner portion of an object) of a frame image. It is also possible to detect a global feature or a feature portion that combines both a local feature and a global feature. As a technique for detecting the global feature, information such as the texture color in a small area of the frame image can be used. For example, the feature block region extracting unit 5b may regard a preset block region in the key frame image as a visual word and classify the block region used as the visual word into types for each feature amount.

Furthermore, the summary video generation apparatuses 1 and 1A have been described as including the clustering means 5c in the video analysis means 5. However, the clustering means 5c may be omitted if the gradient bins used by the gradient histogram generation means 5b_2 are preset so that the gradient histogram has a low number of dimensions and a dense distribution.
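One way to read this variation is that a sufficiently coarse gradient histogram can serve directly as the type of a visual word, so that no clustering step is needed. The sketch below illustrates that idea only; the number of orientation bins, the number of quantization levels, and the function name are assumptions, not values taken from this document.

```python
import math

def coarse_gradient_word(gradients, bins=4, levels=3):
    """Map a block's gradients to a small tuple that acts as its visual-word type.

    `gradients` is a list of (dx, dy) pairs for the pixels of one block
    (hypothetical input format). Orientations are unsigned, in [0, pi).
    """
    hist = [0.0] * bins
    for dx, dy in gradients:
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue
        angle = math.atan2(dy, dx) % math.pi
        idx = min(int(angle / math.pi * bins), bins - 1)
        hist[idx] += mag                      # magnitude-weighted vote
    total = sum(hist)
    if total == 0:
        return (0,) * bins                    # flat block: no gradient at all
    # Quantize each normalized bin to a few levels so similar blocks collide.
    return tuple(min(int(h / total * levels), levels - 1) for h in hist)
```

With 4 bins and 3 levels there are at most 81 possible types, so blocks with similar gradient structure fall into the same type without any clustering.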

In the video evaluation means 6, the summary video generation apparatuses 1 and 1A have been shown calculating the TF-IDF value, that is, a score computed for each type (feature amount) of visual word using a predetermined index that identifies whether the type is characteristic in the original video. Instead of IDF, a TF-S value (signal) calculated from entropy (S) can also be used. The TF-S value is calculated as follows.

Furthermore, the video evaluation means 6 of the summary video generation apparatuses 1 and 1A can also use feature amounts based on the co-occurrence of image feature vectors. The importance based on a co-occurrence t of image feature vectors can be calculated as in the following equation.

In equation (5), t in parentheses is t = {t_1, …, t_i, …, t_n}, where t_i denotes one image feature vector. tf(t, d) is the frequency with which the co-occurrence t of image feature vectors appears in the video basic unit d, N is the total number of video basic units in the program (original video), and df(t) is the number of video basic units in the program (original video) that contain the co-occurrence t. tf(t, d) is normalized by dividing by the total number of co-occurrences contained in the video basic unit d.
Using co-occurrence in this way makes it possible to capture the characteristics of each video basic unit more accurately.
Although equation (5) uses all co-occurrences, the importance may instead be calculated using only those co-occurrences whose image feature vectors satisfy a certain positional relationship (for example, appearing near each other or appearing in a particular relative position).
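Equation (5) itself is not reproduced in this text. As a minimal sketch, assuming it takes the standard TF-IDF form tf(t, d) · log(N / df(t)) applied to co-occurrences, the importance could be computed as below. Restricting co-occurrences to pairs, representing visual words by string identifiers, and summing the per-pair importance into one score per video basic unit are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_scores(units):
    """Score each video basic unit by the TF-IDF of its co-occurring word pairs.

    `units` is a list with one entry per video basic unit; each entry is the
    list of visual-word ids found in that unit (hypothetical input format).
    """
    n = len(units)
    pair_counts = []
    df = Counter()                      # number of units containing each pair
    for words in units:
        pairs = Counter(combinations(sorted(words), 2))
        pair_counts.append(pairs)
        df.update(pairs.keys())
    scores = []
    for pairs in pair_counts:
        total = sum(pairs.values())     # all co-occurrences in this unit
        score = 0.0
        for pair, count in pairs.items():
            tf = count / total          # normalized as described for tf(t, d)
            score += tf * math.log(n / df[pair])
        scores.append(score)
    return scores
```

In this sketch a unit whose pairs occur in few other units receives a higher score, and a unit with fewer than two visual words receives a score of zero.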

Furthermore, the video evaluation means 6 of the summary video generation apparatuses 1 and 1A may calculate the importance of an image feature vector not only from within the program but also from a variety of past broadcast programs. For example, the importance can be calculated by the following equation (6).

Here, pf(t) is the number of past programs in which the image feature vector t appears, and M is the total number of past programs. tfidf(t, d) is the same importance as in equation (1) described above. Using pf(t) makes it possible to calculate an importance that takes into account not only how the image feature vector appears within the program but also how it tends to appear in other programs. Equation (6) can give a large weight to image feature vectors that appear only in specific programs.
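Equation (6) is not reproduced in this text. The sketch below assumes one plausible form, in which the in-program tfidf(t, d) is multiplied by log(M / pf(t)), by analogy with the IDF term. Both that form and the function names are assumptions; the text only fixes the meaning of pf(t), M, and tfidf(t, d).

```python
import math

def tfidf(tf, n_units, df):
    """Standard TF-IDF for one feature in one video basic unit (assumed form)."""
    return tf * math.log(n_units / df)

def tfidf_with_program_prior(tf, n_units, df, n_programs, pf):
    """Weight the in-program TF-IDF by the feature's rarity across past programs.

    n_programs corresponds to M and pf to pf(t); the log(M / pf) weighting
    is an assumption, not the equation from the source.
    """
    return tfidf(tf, n_units, df) * math.log(n_programs / pf)
```

Under this assumed form, a feature seen in 1 of 100 past programs is weighted far more heavily than one seen in 50 of them, and a feature seen in every past program receives a weight of zero.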

The summary video generation apparatuses 1 and 1A represent characteristic regions of an image as visual words, and the video evaluation means 6 evaluates the importance of those regions, so that the divided videos suitable for the summary video can be selected from the original video. Any value that can serve as an index for evaluating the importance of a divided video may be used, including equations (1) and (4) to (6) described above.

In the summary video generation apparatuses 1 and 1A, the video section selection means 8 and 8A select divided videos of characteristic video sections from the original video while the adjustment means 8d adjusts the calculated scores. However, the adjustment means 8d may be omitted, in which case the divided videos selected by the divided video selection means 8b are connected by the divided video connection means 8c to generate the summary video.
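Selection with and without the score adjustment can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the behavior of the video section selection means 8: the window size, the penalty value, and the rule of skipping a divided video that would exceed the target length are all hypothetical choices.

```python
def select_segments(segments, target_length, window=1, penalty=0.5, adjust=True):
    """Greedily pick divided videos by score until the target length is filled.

    `segments` is a list of (duration_seconds, score) in temporal order.
    When `adjust` is True, the scores of neighbors within `window` positions
    of each pick are lowered by `penalty`, which spreads picks over the program.
    Returns the chosen indices in temporal order, ready to be concatenated.
    """
    scores = [score for _, score in segments]
    chosen, total = set(), 0.0
    while True:
        candidates = [i for i in range(len(segments))
                      if i not in chosen and total + segments[i][0] <= target_length]
        if not candidates:
            break
        # Highest score first; ties go to the earliest divided video.
        best = max(candidates, key=lambda i: (scores[i], -i))
        chosen.add(best)
        total += segments[best][0]
        if adjust:
            lo = max(0, best - window)
            hi = min(len(segments), best + window + 1)
            for j in range(lo, hi):
                if j != best:
                    scores[j] -= penalty
    return sorted(chosen)
```

For five 5-second divided videos with scores 1.0, 0.9, 0.2, 0.6, and 0.8 and a 10-second target, the adjusted run picks the first and last divided videos, while the unadjusted run picks the first two, which are adjacent.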

Furthermore, the adjustment means 8d of the summary video generation apparatuses 1 and 1A has been described as lowering the scores within a preset video section by a preset value, but it may instead adjust the scores using entropy, as follows.

That is, when the divided videos (video basic units) already selected are V = {v_1, …, v_i, …, v_n}, the dispersion of the selected divided videos within the program can be calculated from entropy by equation (7).

In equation (7), p(v) is the position of the video basic unit (divided video) v from the beginning of the program, expressed in units such as seconds or frames. The larger H(V) is, the more widely the selected divided videos are spread over different positions in the program.
When a divided video v_{n+1} is to be added to V as a new video section, the divided video is selected based on a score that integrates the increase in H(V) with the importance (score) of v_{n+1}. That is, the divided video selection means 8b selects divided videos by the integrated score while updating the scores stored in the memory 8e. The integrated importance δ can be calculated by the following equation (8).

In equation (8), imp(v) is the importance of the video section (divided video) v, calculated from the TFIDF value described above (or from TFS, TFIDFP, or the like). Adjusting the scores of the divided videos by entropy causes divided videos to be selected from a variety of positions in the program.
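Equations (7) and (8) are not reproduced in this text, so the following is only a sketch of the general idea. It assumes that H(V) is the Shannon entropy of the selected positions after they are binned into equal stretches of the program, and that δ is imp(v) plus a weighted entropy gain. The binning, the additive combination, the `weight` parameter, and the function names are all assumptions.

```python
import math
from collections import Counter

def position_entropy(positions, program_length, bins=10):
    """Shannon entropy of selected positions over equal-width program bins."""
    if not positions:
        return 0.0
    counts = Counter(min(int(p / program_length * bins), bins - 1)
                     for p in positions)
    n = len(positions)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def integrated_score(importance, position, selected_positions,
                     program_length, weight=1.0, bins=10):
    """Combine a candidate's importance with the entropy gain from adding it."""
    before = position_entropy(selected_positions, program_length, bins)
    after = position_entropy(list(selected_positions) + [position],
                             program_length, bins)
    return importance + weight * (after - before)
```

With two divided videos already selected near the start of a 1000-second program, a candidate at 900 seconds receives a higher integrated score than an equally important candidate at 15 seconds, because only the former increases the entropy.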

The summary video generation apparatus 1 stores the divided videos in the storage means so that divided videos serving as video sections can be selected from the original video based on the scores. However, like the summary video generation apparatus 1A, it may be configured to select high-scoring divided videos from the original video input through the input means, without using storage means for the divided videos.
In the summary video generation apparatus 1A, when the audio evaluation means 17 calculates scores for morphemes, the scores may also be calculated using the values of equations (4) to (6) described above.

As described above, the summary video generation apparatuses 1 and 1A extract characteristic block regions from key frames, regard the extracted block regions as visual words, generate a gradient histogram for each visual word to determine its type, and calculate, as a score serving as an evaluation criterion, a predetermined index that identifies whether each type of visual word in a divided video is characteristic in the original video. The summary video generation apparatuses 1 and 1A can therefore generate an appropriate summary video of the original video regardless of genre.

The summary video generation apparatuses 1 and 1A can be built from a general CPU, RAM, ROM, and the like, and can be realized by a program (summary video generation program) that causes a computer to function as each of the means described above in order to output the summary video.

Description of Symbols
1 Summary video generation apparatus
1A Summary video generation apparatus
2 Input means
3 Stream separation means
4 Video dividing means
5 Video analysis means
5a Key frame extraction means
5b Feature block region extraction means
5b_1 Key frame image feature detection means
5b_2 Gradient histogram generation means
5c Clustering means
6 Video evaluation means
7 Storage means
8 Video section selection means
8a Integration means
8b Divided video selection means
8c Divided video connection means
8d Adjustment means
8e Memory
9 Output means
10 External input means
11 Target length setting means
15 Character information extraction means
15a Character data detection means
15b Speech recognition means
16 Morphological analysis means
17 Audio evaluation means

Claims

A summary video generation apparatus that generates, from an original video, a summary video whose video time length is shorter than that of the original video, comprising:
video dividing means for generating, from the input original video, divided videos divided into video units, each video unit being a shot;
video analysis means for extracting key frames from each divided video produced by the video dividing means, detecting block regions that indicate features of the extracted key frames, and classifying the block regions into types according to their feature amounts;
video evaluation means for calculating, for each type of feature amount of the block regions analyzed by the video analysis means, a predetermined index that identifies whether the feature amount is characteristic in the original video, as a score serving as an evaluation criterion; and
video section selection means for sequentially selecting video sections of the divided videos from the original video, in descending order of the scores evaluated by the video evaluation means and taking earlier video times first, until the time length of the summary video is reached.

A summary video generation apparatus that generates, from an original video, a summary video whose video time length is shorter than that of the original video, the original video having a video stream and an audio stream, comprising:
stream separation means for separating the video stream and the audio stream from the input original video;
video dividing means for generating, from the video stream separated by the stream separation means, divided videos divided into video units, each video unit being a shot;
video analysis means for extracting key frames from each divided video produced by the video dividing means, detecting block regions that indicate features of the extracted key frames, and classifying the block regions into types according to their feature amounts;
video evaluation means for calculating, for each type of feature amount of the block regions analyzed by the video analysis means, a predetermined index that identifies whether the feature amount is characteristic in the original video, as a score serving as an evaluation criterion;
character information extraction means for recognizing speech in the audio stream and extracting character data;
morpheme analysis means for performing morphological analysis, word by word, on the character data extracted by the character information extraction means;
audio evaluation means for calculating, for the character data analyzed by the morpheme analysis means, a predetermined index that identifies whether the words in each video unit of the divided videos produced by the video dividing means are characteristic in the original video, as a score serving as an evaluation criterion; and
video section selection means for sequentially selecting video sections of the divided videos from the original video, in descending order of the sum of the score calculated by the video evaluation means and the score calculated by the audio evaluation means and taking earlier video times first, until the time length of the summary video is reached, and for selecting the audio stream of those video sections from the original video.

The summary video generation apparatus according to claim 1, wherein the video analysis means further comprises clustering means for clustering the block regions by type according to their feature amounts, and
the video evaluation means calculates the score for the types of feature amounts clustered by the clustering means.

The summary video generation apparatus according to claim 1, wherein the video analysis means calculates a motion vector amount for each frame image of the divided video and sequentially extracts, as the key frames, the frame images at which the accumulated motion vector amount reaches each value obtained by equally dividing the total motion vector amount of the divided video by a predetermined number.

The summary video generation apparatus according to claim 1, wherein the video section selection means comprises:
divided video selection means for sequentially selecting video sections of the divided videos from the original video, based on the calculated scores, until the time length of the summary video is reached;
adjustment means for lowering the scores of the divided videos within a preset divided video range set before and after the divided video selected by the divided video selection means; and
divided video connection means for generating the summary video by connecting the divided videos selected by the divided video selection means after adjustment by the adjustment means.

The summary video generation apparatus according to claim 2, wherein the video section selection means comprises:
score integration means for adding the score calculated by the video evaluation means and the score calculated by the audio evaluation means;
divided video selection means for selecting video sections of the divided videos from the original video, based on the score calculated by the score integration means, in descending order of score and taking earlier video times first, until the time length of the summary video is reached;
adjustment means for lowering the scores of the divided videos within a preset divided video range set before and after the divided video selected by the divided video selection means; and
divided video connection means for generating the summary video by connecting the divided videos selected by the divided video selection means after adjustment by the adjustment means.

The summary video generation apparatus according to claim 2, wherein the original video includes the video stream, the audio stream, and a data stream, and the stream separation means separates the original video into the video stream, the audio stream, and the data stream,
the character information extraction means further comprises character data detection means for detecting whether the data stream contains character data for the video stream, and
when the data stream contains no character data, the character data is extracted by the speech recognition means and output to the morpheme analysis means, and when the data stream contains character data, the character data is detected by the character data detection means and output to the morpheme analysis means.

A summary video generation program that, in order to generate from an original video a summary video whose video time length is shorter than that of the original video, causes a computer to function as:
video dividing means for generating, from the input original video, divided videos divided into video units, each video unit being a shot;
video analysis means for extracting key frames from each divided video produced by the video dividing means, detecting block regions that indicate features of the extracted key frames, and classifying the block regions into types according to their feature amounts;
video evaluation means for calculating, for each type of feature amount of the block regions analyzed by the video analysis means, a predetermined index that identifies whether the feature amount is characteristic in the original video, as a score serving as an evaluation criterion; and
video section selection means for sequentially selecting video sections of the divided videos from the original video, in descending order of the scores evaluated by the video evaluation means and taking earlier video times first, until the time length of the summary video is reached.

A summary video generation program that, in order to generate from an original video a summary video whose video time length is shorter than that of the original video, the original video having a video stream and an audio stream, causes a computer to function as:
stream separation means for separating the video stream and the audio stream from the input original video;
video dividing means for generating, from the video stream separated by the stream separation means, divided videos divided into video units, each video unit being a shot;
video analysis means for extracting key frames from each divided video produced by the video dividing means, detecting block regions that indicate features of the extracted key frames, and classifying the block regions into types according to their feature amounts;
video evaluation means for calculating, for each type of feature amount of the block regions analyzed by the video analysis means, a predetermined index that identifies whether the feature amount is characteristic in the original video, as a score serving as an evaluation criterion;
character information extraction means for recognizing speech in the audio stream and extracting character data;
morpheme analysis means for performing morphological analysis, word by word, on the character data extracted by the character information extraction means;
audio evaluation means for calculating, for the character data analyzed by the morpheme analysis means, a predetermined index that identifies whether the words in each video unit of the divided videos produced by the video dividing means are characteristic in the original video, as a score serving as an evaluation criterion; and
video section selection means for sequentially selecting video sections of the divided videos from the original video, in descending order of the sum of the score calculated by the video evaluation means and the score calculated by the audio evaluation means and taking earlier video times first, until the time length of the summary video is reached, and for selecting the audio stream of those video sections from the original video.