JPH10232884A

JPH10232884A - Method and device for processing video software

Info

Publication number: JPH10232884A
Application number: JP9304862A
Authority: JP
Inventors: Toru Fukuda (福田 徹); Haruki Tsuchiya (槌屋 治紀)
Original assignee: MEDIA RINKU SYST KK
Current assignee: MEDIA RINKU SYST KK
Priority date: 1996-11-29
Filing date: 1997-10-20
Publication date: 1998-09-02

Abstract

PROBLEM TO BE SOLVED: To let a viewer accurately grasp an entire piece of video software by detecting positions at which the state of the video software's constituent data changes and, based on that information, extracting several representative images from the video software to generate a summary of it. SOLUTION: A subtitle discrimination and storage unit 4 identifies and extracts subtitles (captions). Information on the position at which each subtitle appears (timing information) is sent to a cut discrimination and storage unit 1. The subtitles are saved on a hard disk 6 (or in RAM) together with this position information. The cut discrimination and storage unit 1 receives the information on the position at which a subtitle appears and extracts a still image from a predetermined position in the corresponding cut (scene). The hard disk 6 stores the subtitles, the still images, and the cut video that the viewer designates for clipping during playback of the representative images. A CPU 2 executes the processing for generating the summary in accordance with programs stored in a ROM 3 and on the hard disk 6.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a video software processing method and a video software processing device. More specifically, it relates to a method and device that extract apt representative images from, for example, a two-hour drama recorded on a DVD (digital video disc) and display them in sequence as stills, for example for a few seconds per screen, so that the whole of the video software can be grasped quickly and accurately in a short time such as five minutes.

[0002] First, a note on terminology. In this specification, "video software" means DVD-recorded video, digital TV (television) broadcasts, terrestrial TV broadcasts, video files used on the Internet and on personal computers, computer graphics, movies, and the like. Regardless of data format or medium (DVD, hard disk, magnetic tape, film, and so on), it covers anything that uses a sequence of a certain number of images (frames) to convey to the viewer (user) the movement of a subject, a scene, the atmosphere of a place, and the like. ("A certain number" is meant to exclude things such as four-panel comic strips.)

[0003] In several places in this specification, including the claims, pairs of terms are used that differ only in whether the particle "no" (の) appears between the words, for example "information on the position" (位置の情報) and "position information" (位置情報). The variants were chosen only for rhythm and readability in context; they mean the same thing. This may be obvious, but it is stated here for the avoidance of doubt. The same applies to "scene" and "cut", "viewer" and "user", and "image", "still image", and "screen". Terms are also abbreviated where appropriate. For example, video software (映像ソフトウェア) may be shortened to 映像ソフト, an operation that includes the processing according to the present invention may be called "essence play", and the summary produced by that processing may be called the "essence", "video essence", or "VE".

[0004]

[Prior Art] Modern society is heading into the multimedia age. The information industries that support this age include the following.
Primary information industry (information production): writers, newspaper reporters, photographers, etc.
Secondary information industry (information processing): publishers and newspaper companies, record companies, broadcasters, etc.
Tertiary information industry (information distribution): bookstores, newspaper delivery, video rental, etc.
Automation is advancing in the processing and distribution of information. As a result, the number of people engaged in information production is growing.

[0005] These information industries are supported by the electronics industry, which is supplying a variety of hardware and software that contribute to their advancement. As a result, the following situation arises.
・ The number of information producers grows, and content of widely varying quality is produced in large quantities.
・ New media such as multi-channel TV and DVD bring a flood of video information.
・ The time an individual can spend taking in information is limited, so individuals can no longer keep up.

[0006] Television itself is also changing, from something to enjoy to something to search.
・ When there were few channels, the whole family enjoyed the same programs together.
・ Today, in the multi-channel age, viewing is becoming personalized: each viewer searches for the video he or she wants to watch, and program selection using TV magazines, zapping (frequent channel switching with a remote control), and multi-screen viewing are becoming common.

[0007] Video is watched along the flow of time, so watching it takes a corresponding amount of time. Against the background described above, viewers currently save time on video in ways such as the following.
・ Time shifting with video: watching a recording when one wants to (when one has free time).
・ Time shortening with video: getting the main points through fast-forward or digest play.
・ Two-screen video: enjoying sports and music at the same time, or playing back one program while receiving another.
("Video" originally means moving pictures, but here it refers to a video deck or a recording, or to the use of these.)

[0008] As noted above, video is time-consuming, but a person's time is limited. If the main points of a video could be grasped easily, anywhere, and in a short time, viewers could pick out from the many videos available only those (works, programs, and so on) they really want to see and then watch them unhurriedly at their own pace. This wish existed even before the term "multimedia age" came into use. Conventionally, the fast-forward function mentioned above has been used for this purpose. In recent years, video decks with the digest play function mentioned above have also come onto the market: they play a picture fast-forwarded at about 1.5 to 2 times normal speed together with sound at normal playback speed, so that the content can be followed in roughly half the usual time.

[0009]

[Problem to Be Solved by the Invention] However, conventional techniques such as fast-forward and digest play are all cumbersome and highly unsatisfactory for the viewer (user). With digest play, for example, the sound can be played at normal speed, but the picture is fast-forwarded throughout. The picture is therefore blurred and hard to follow, and the viewer must personally judge whether a given picture represents a key point of the video software. The viewer thus has to watch the screen intently (and, with digest play, listen closely as well) while frequently operating fast-forward, play, review, and pause. This may be acceptable for finding one particular scene (cut), but it is quite tiring for both the eyes and the nerves when the aim is to grasp the video software as a whole.

[0010] Moreover, even though large amounts of video information are becoming available and the age of advanced technology is said to be arriving, human abilities and free time do not grow substantially as a result. New demands have therefore emerged, such as the following.
・ I am too busy to sit down and watch video software at leisure.
・ I want to quickly pick out what I want to watch from a large amount of video software.
・ I want to view a lot of video software freely, at my own pace, and in a short time.

[0011] An object of the present invention is to meet this need to save people's time by providing a new technique, different from the conventional ones, that makes it possible to accurately grasp video software as a whole.

[0012]

[Means for Solving the Problem] To achieve the above object, in the present invention, positions at which the state changes are detected for at least one of the image, sound, subtitles, and other constituent data of the video software; based on information on the detected positions, several representative images are extracted from the video software; and a summary of the video software is generated (claim 1). Alternatively, positions at which the state changes are detected for at least one of the image, sound, subtitles, and other constituent data of the video software, and information on the detected positions is added to the video software as position information for generating a summary of the video software composed of several representative images (claim 2).

[0013] Alternatively, position information that has been added to the video software as indicating positions at which the state of at least one of the image, sound, subtitles, and other constituent data of the video software changes is read out; based on that position information, several representative images are extracted from the video software; and a summary of the video software is generated (claim 3). Alternatively, positions at which the state changes are detected for at least one of the image, sound, subtitles, and other constituent data of the video software; several representative images relating to the detected positions are extracted; and these are added to the video software as a summary (claim 4).

[0014] Alternatively, a summary that was obtained by detecting positions at which the state of at least one of the image, sound, subtitles, and other constituent data of the video software changes and extracting representative images relating to those positions, and that has been added to the video software, is read out and played back in sequence (claim 5). Further, in the video software processing method according to any one of claims 1, 3, and 5, when a command is received from the viewer during playback of the generated summary or of the summary that had been added, the video software is played back in the normal state from near the position from which the representative image being played back was extracted (claim 6). Further, in the video software processing method according to any one of claims 1, 3, 5, and 6, when a command is received from the viewer during playback of the video software in the normal state, summary generation for the video software is executed from near that position (claim 7).

[0015] The device comprises position detecting means for detecting positions at which the state changes for at least one of the image, sound, subtitles, and other constituent data of the video software, and summary generating means for extracting several representative images from the video software based on information on the detected positions and generating a summary of the video software (claim 8). Alternatively, it comprises position detecting means for detecting positions at which the state changes for at least one of the image, sound, subtitles, and other constituent data of the video software, and position information adding means for adding information on the detected positions to the video software as position information for generating a summary of the video software (claim 9).

[0016] Alternatively, it comprises position information reading means for reading out position information that has been added to the video software as indicating positions at which the state of at least one of the image, sound, subtitles, and other constituent data of the video software changes, and summary generating means for generating a summary of the video software based on that position information (claim 10). Alternatively, it comprises position detecting means for detecting positions at which the state changes for at least one of the image, sound, subtitles, and other constituent data of the video software, and image adding means for extracting several representative images from the video software based on information on the detected positions and adding those representative images to the video software as a summary (claim 11).

[0017] Alternatively, it comprises playback means for sequentially playing back representative images that were extracted from the video software, based on information on detected positions at which the state of at least one of the image, sound, subtitles, and other constituent data of the video software changes, and that have been added to the video software as a summary (claim 12). Further, the video software processing device according to any one of claims 8, 10, and 12 comprises playback means that, when a command is received from the viewer during playback of the generated summary or of the summary that had been added, plays back the video software in the normal state from near the position from which the representative image being played back was extracted (claim 13). Further, the video software processing device according to any one of claims 8, 10, 12, and 13 comprises summarizing means that, when a command is received from the viewer during playback of the video software in the normal state, executes the summary generation from near the position being played back (claim 14).

[0018] Further, in the video software processing method according to any one of claims 1 to 7, the positions at which the state of the constituent data changes are detected as follows: for the histograms of the cut representative images, each of which represents one cut, distances in a multidimensional space are calculated; images whose distances are small are gathered into groups; and a position representing each group is extracted (claim 15).

[0019] Likewise, in the video software processing device according to any one of claims 8 to 14, the position detecting means is realized by distance calculating means for calculating distances in a multidimensional space for the histograms of the images that each represent one cut, grouping means for gathering images whose distances are small into groups, and group representative detecting means for detecting a position representing each group (claim 16).
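
The grouping described in claims 15 and 16 might be sketched as below. This is an illustration only: the specification does not state the distance measure, the grouping rule, or how a group's representative is chosen, so the Euclidean distance, the `threshold` parameter, the greedy "join the first close group" rule, and the use of each group's first member as its representative are all assumptions, as are the function names.

```python
import math


def histogram_distance(h1, h2):
    """Distance between two equal-length histograms, treated as points in a
    multidimensional space. Euclidean distance is an assumed choice."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def group_cuts(cut_histograms, threshold):
    """Greedy grouping (assumed rule): each cut joins the first group whose
    leading member lies within `threshold`; otherwise it starts a new group.
    Returns a list of groups, each a list of cut indices."""
    groups = []
    for idx, hist in enumerate(cut_histograms):
        for group in groups:
            leader = cut_histograms[group[0]]
            if histogram_distance(leader, hist) <= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def group_representatives(groups, cut_positions):
    """Pick one position per group; here simply the position of the group's
    first member (an assumption)."""
    return [cut_positions[group[0]] for group in groups]


# Four cuts: the first two and the last two have similar histograms.
histograms = [[10, 0, 0], [9, 1, 0], [0, 10, 0], [0, 9, 1]]
positions = [100, 250, 400, 520]  # hypothetical frame numbers of the cuts
groups = group_cuts(histograms, threshold=3.0)
representatives = group_representatives(groups, positions)
```

With these sample values the four cuts fall into two groups, and one position per group is kept as a candidate for the summary.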

[0020] (Operation) The present invention focuses on the positions at which the constituent data of the video software, such as images, sound, and subtitle codes, change. The content of, for example, a two-hour drama recorded on a DVD is expressed as a set (summary) of images, each representing the vicinity of one such change position. Specifically, these images are displayed one after another as stills, a few seconds per screen (frame-by-frame display), so that the overall picture of the drama can be accurately grasped in, for example, five minutes. The invention makes varied use of features of sound, text, and images, together with computer image processing technology, to create this summary. With this system, for example, the 30-minute video software VS of "Chibi Maruko-chan" in FIG. 2 is condensed into the summary shown to its right, consisting of several images VD and subtitles CP. (The pictures in FIG. 2 are quoted from the manga "Chibi Maruko-chan".)

[0021]

[Embodiments of the Invention] The present invention is described in detail below. To make it easier to understand, the essence player 10 shown in FIG. 1, which is one example of an embodiment, is described first, followed by various developments of the invention. The essence player 10 is used connected to a digital video player 20 and includes a cut discrimination and storage unit 1, a subtitle discrimination and storage unit 4, a screen composition unit 5, a hard disk 6, a CPU 2, a ROM 3, an operation unit 7, a display 8, and so on.

[0022] The function of each unit is as follows.
Subtitle discrimination and storage unit 4: identifies and extracts subtitles (captions). It sends information on the position at which each subtitle appears (timing information) to the cut discrimination and storage unit 1. Each subtitle is stored on the hard disk 6 (or in RAM) together with this position information.
Cut discrimination and storage unit 1: receives the information on the position at which a subtitle appears and takes a still image from a predetermined position in the corresponding cut (scene). (A still image is a single screen, that is, one frame, and corresponds to the "representative image" of the claims.) Each still image is given an address and stored on the hard disk (or in RAM).

[0023] The predetermined position is, for example, a position such as PC in FIG. 3 (the center of the scene). The center of the scene is given here as the predetermined position on the assumption that the screen (image) representative of a scene often appears a short while after the scene changes. However, the position at which a characteristic image appears differs by genre, so the predetermined position may be set arbitrarily. A viewer or the producer of the video software may run summary extraction once as a trial and decide the predetermined position after looking at the result.
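
As a minimal sketch of choosing the predetermined position within a cut, the function below expresses the position as a fraction of the cut's length, with 0.5 giving the scene center. The function name, the `ratio` parameter, and the representation of a cut by its first and last frame numbers are assumptions made for illustration; the specification only says that the position may be set arbitrarily.

```python
def representative_frame(cut_start, cut_end, ratio=0.5):
    """Return the frame number at a given fraction of the way through a cut.

    cut_start and cut_end are frame numbers counted from the beginning of the
    video. ratio=0.5 corresponds to the scene center; other values let the
    position be tuned, for example per genre.
    """
    if cut_end < cut_start:
        raise ValueError("cut_end must not precede cut_start")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return cut_start + int((cut_end - cut_start) * ratio)


center = representative_frame(100, 200)          # scene center
early = representative_frame(100, 200, 0.25)     # a quarter of the way in
```

A trial run of the kind mentioned above would amount to extracting stills with a few different `ratio` values and keeping the one whose results look most representative.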

[0024] Hard disk 6: stores the subtitles, the still images, and the cut video that the viewer designates for clipping during playback of the representative images.
CPU 2, ROM 3: the CPU 2 executes the processing according to the present invention in accordance with programs stored in the ROM 3 and on the hard disk 6.

[0025] Operation unit 7: provides the following buttons.
EP/NP: switches between essence play and normal play
FWD: during essence play, advances to the next screen
BK: during essence play, returns to the previous screen
CUT: during essence play, clips and saves to the hard disk 6
Display 8: displays the video software or the representative screens (summary).

[0026] Each still image and subtitle is tagged with the position information of the video from which it was taken and with information indicating how the stills and subtitles relate to one another, so that they are linked (the pieces of information are associated with each other). The position information is expressed as, for example, a DVD storage address, the number of the image (frame) counted from the beginning, or the normal playback time from the beginning to that point. This information is used, for example, to switch between essence play and normal play with the EP/NP button, and to clip and save a single screen, or a continuous run of screens of a desired length, near the position from which the still image was extracted.
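
One possible way to hold the linked stills and subtitles is sketched below. The class and field names, the string form of the storage address, and the frame rate of 30 frames per second are all assumptions introduced for illustration; the specification names only the kinds of position information (storage address, frame number, normal playback time).

```python
from dataclasses import dataclass, field

FRAME_RATE = 30.0  # assumed frames per second; not given in the specification


@dataclass
class Subtitle:
    frame: int  # frame number, counted from the beginning, where it appears
    text: str


@dataclass
class EssenceStill:
    frame: int     # frame number from which the still was extracted
    address: str   # hypothetical storage address of the saved still
    subtitles: list = field(default_factory=list)  # subtitles linked to it

    def playback_seconds(self):
        """Normal playback time from the beginning to this still."""
        return self.frame / FRAME_RATE


still = EssenceStill(frame=900, address="still_0001")
still.subtitles.append(Subtitle(frame=900, text="first line"))
still.subtitles.append(Subtitle(frame=960, text="second line"))
```

Because each record carries the frame number of its source position, a player can move from a still back to the original video, or clip frames around that position, without any further lookup.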

[0027] Switching between essence play and normal play works, for example, as follows. During playback of stills plus subtitles (essence play), when the user finds a cut he or she wants to watch, the player moves, on the user's command, from that cut position to normal play (ordinary playback). Conversely, during normal play, the player moves, on the user's command, from the current position to playback of stills plus subtitles. The essence user can also adjust the screen advance speed to match his or her own subtitle reading speed.

[0028] When the CUT button is pressed during essence play, the corresponding single screen, or a scene (cut) of a desired length, is saved to the hard disk 6 as video data. This data is given a file name. If text can be extracted from the subtitle, it may be used as the file name, which makes later searching easier. It is also useful to let the user specify files in order so that the saved scenes can be freely joined and edited.

[0029] Subtitles may be attached to the video software as character codes, or they may be embedded in the image data as part of the picture itself. In the former case, the character codes can simply be read out according to the storage format of the medium. The appearance position (timing data) can also be readily determined from the storage position. Because the story can be followed from the subtitles alone, it is also interesting to render each extracted subtitle as its own still image and produce a subtitle-only summary.

[0030] When subtitles are embedded in the picture itself, the presence or absence of a subtitle is detected using kanji OCR, whose accuracy has improved in recent years. It is enough to recognize that some text is present on the screen. For example, the device can look at the lower part or the two side edges of the image, where subtitles are generally displayed, capture that region at coarse resolution, and judge whether a subtitle is present by whether any characters can be read from it.
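
The presence check described above might be sketched as follows. The sketch covers only the lower band of the image, not the side edges, and it does not implement OCR: the caller supplies `recognize_text`, a hypothetical function that takes a region and returns whatever text it can read. The frame representation (a list of rows of grayscale values), the `band_ratio` and `step` parameters, and both function names are assumptions.

```python
def subtitle_region(frame, band_ratio=0.25, step=2):
    """Return the lower band of a grayscale frame at coarse resolution.

    frame is a list of rows, each a list of pixel values. band_ratio is the
    fraction of the height treated as the subtitle area; step keeps every
    step-th row and column to coarsen the resolution.
    """
    height = len(frame)
    top = height - max(1, int(height * band_ratio))
    return [row[::step] for row in frame[top::step]]


def has_subtitle(frame, recognize_text, band_ratio=0.25, step=2):
    """Report whether any text can be read from the lower band.

    recognize_text is a caller-supplied OCR function (hypothetical here);
    any non-blank result counts as "a subtitle is present".
    """
    region = subtitle_region(frame, band_ratio, step)
    return bool(recognize_text(region).strip())


def fake_ocr(region):
    """Stand-in for a real OCR engine, for demonstration only: it 'reads'
    text whenever the region contains a bright pixel."""
    return "A" if any(p > 200 for row in region for p in row) else ""


blank = [[0] * 8 for _ in range(8)]
with_text = [[0] * 8 for _ in range(8)]
with_text[6][0] = 255  # bright pixel inside the lower band
top_only = [[0] * 8 for _ in range(8)]
top_only[0][0] = 255   # bright pixel outside the lower band
```

Passing the recognizer in as a parameter keeps the sketch independent of any particular OCR engine, which the specification does not name.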

[0031] The relationships among video, cuts, and subtitles are illustrated below for each type of video.
[1] Conventional subtitled movies
In a conventional subtitled movie, the subtitles are burned into the film. The processing method is therefore as described above. Let C = cut (scene), S = subtitle, and G = still image or short moving image; the relationships are then, for example, as follows.
Video = C1 + C2 + C3 + ... + Cx
A cut lasts from a few seconds to a few tens of seconds and may or may not contain subtitles. A cut may have one subtitle or several.
Cut    Subtitles (superimposed)
C1     S11 + S12
C2     0 (no subtitles)
C3     S31 + S32 + S33
...    ...
Cx     Sx1 + Sx2 + Sx3 + Sx4

[0032] In this case, the essence images are extracted, for example, as follows.
1. Identify and extract the subtitles.
2. Arrange the subtitles in display order: S11 + S12, S31 + S32 + S33, ..., Sx1 + Sx2 + Sx3 + Sx4
3. For each cut, clip one still image at the position where a subtitle is first displayed.
Cut    Still image
C1     G1
C2     -
C3     G3
...    ...
Cx     Gx
4. Give each still image and each subtitle an address and save them to the hard disk 6.
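
Steps 2 and 3 above can be sketched as below, assuming that the cut boundaries and the subtitle positions are already known. The input formats (cuts as `(start_frame, end_frame)` pairs with an exclusive end, subtitles as `(frame, text)` pairs), the function name, and the dictionary layout of the result are assumptions made for illustration.

```python
def extract_essence(cuts, subtitles):
    """Build one essence entry per cut that contains subtitles.

    cuts: list of (start_frame, end_frame) pairs, end exclusive.
    subtitles: list of (frame, text) pairs in any order.
    Returns a list of dicts with the cut number (1-based), the frame of the
    still (where the cut's first subtitle appears), and the cut's subtitles
    in display order. Cuts without subtitles produce no entry.
    """
    ordered = sorted(subtitles)  # step 2: display order
    essence = []
    for cut_no, (start, end) in enumerate(cuts, start=1):
        in_cut = [(f, t) for f, t in ordered if start <= f < end]
        if not in_cut:
            continue  # e.g. C2 in the table above
        essence.append({
            "cut": cut_no,
            "still_frame": in_cut[0][0],  # step 3: first subtitle position
            "subtitles": [t for _, t in in_cut],
        })
    return essence


cuts = [(0, 100), (100, 180), (180, 300)]
subtitles = [(200, "S31"), (10, "S11"), (60, "S12"), (250, "S32")]
essence = extract_essence(cuts, subtitles)
```

In this sample the second cut has no subtitles, so the result has entries for cuts 1 and 3 only, matching the pattern C1 G1, C2 -, C3 G3.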

[0033] These relate to essence play as follows.
1. While ordinary playback video is being watched from a DVD or digital video, pressing the essence play button (EP/NP button) displays a still image and a subtitle.
2. For example, pressing the essence play button while cut C3 is playing displays G3 + S31. Pressing the next-page button (FWD button) then displays the essence screens in order: G3 + S32, G3 + S33, ..., Gx + Sx1, Gx + Sx2, .... Pressing the return button (BK) goes back one screen.
3. Pressing the normal play button (EP/NP button) during essence play shows the ordinary video from the beginning of that cut.
4. Pressing the clip button (CUT) during essence play saves the video of that cut to the hard disk 6.
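
The button behavior in items 1 to 3 can be sketched as a small state holder. Everything here is illustrative: the class and method names, the dictionary layout of the input, and two behaviors the specification leaves open, namely that FWD and BK stop at the first and last screen, and that pressing EP/NP during a cut without subtitles jumps to the next cut that has them. The sketch also assumes at least one essence screen exists.

```python
class EssencePlayer:
    """Minimal model of the EP/NP, FWD, and BK buttons."""

    def __init__(self, essence):
        # One screen per (still, subtitle) pair, e.g. G3+S31, G3+S32, ...
        self.screens = [
            (entry["cut"], entry["still_frame"], text)
            for entry in essence
            for text in entry["subtitles"]
        ]
        self.index = 0
        self.mode = "normal"

    def press_ep_np(self, current_cut=1):
        """Toggle modes. Entering essence play returns the screen shown;
        leaving it returns the cut whose beginning normal play resumes from."""
        if self.mode == "normal":
            for i, (cut, _, _) in enumerate(self.screens):
                if cut >= current_cut:  # assumed rule for subtitle-less cuts
                    self.index = i
                    break
            self.mode = "essence"
            return self.screens[self.index]
        self.mode = "normal"
        return self.screens[self.index][0]

    def fwd(self):
        """Advance one screen, stopping at the last one (assumed)."""
        self.index = min(self.index + 1, len(self.screens) - 1)
        return self.screens[self.index]

    def bk(self):
        """Go back one screen, stopping at the first one (assumed)."""
        self.index = max(self.index - 1, 0)
        return self.screens[self.index]


essence = [
    {"cut": 1, "still_frame": 10, "subtitles": ["S11", "S12"]},
    {"cut": 3, "still_frame": 200, "subtitles": ["S31", "S32", "S33"]},
]
player = EssencePlayer(essence)
first = player.press_ep_np(current_cut=3)  # pressed while C3 is playing
```

Here `first` is the G3 + S31 screen; further `fwd()` calls step through G3 + S32 and G3 + S33, and a second `press_ep_np()` returns the cut number from which normal play would resume.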

[0034] [2] Video + subtitles (subtitles in image format)
(1) Structure of video and subtitles
In some video software, the video and the subtitles are held as separate screens and are composited for display at playback. The relationships in this case are as follows.
Video = C1 + C2 + C3 + ... + Cx
Subtitles = S1 + S2 + S3 + ... + Sy
Cut    Subtitles (superimposed)
C1     S1 + S2
C2     0 (no subtitles)
C3     S3 + S4 + S5
...    ...
Cx     Sy

[0035] (2) Method of extracting essence images
1. Arrange the subtitles in display order: S1, S2, S3, S4, ..., Sy
2. For each cut, clip one still image at the point where a subtitle is first displayed.
Cut    Still image
C1     G1
C2     -
C3     G3
...    ...
Cx     Gx
3. Give each still image and each subtitle an address and save them to the storage device.

[0036] (3) Essence play
1. While video is being watched from a DVD or digital video, pressing the essence play button displays a still image and a subtitle.
2. For example, pressing the essence play button while cut C3 is playing displays G3 + S3. Pressing the next-page button then steps through the essence screens G3 + S4, G3 + S5, and so on. Pressing the return button goes back one screen.
3. Pressing the normal play button during essence play shows the ordinary video from the beginning of that cut.
4. Pressing the clip button during essence play saves the video of that cut to the storage device.

【００３７】[3] Video + subtitles (text data)
When the subtitles are text data rather than image data, the text data is first rendered into an image to form the subtitle, which is composited with the video at playback. In this case too, the cut in which each subtitle is displayed is specified, so there is no need to detect subtitles from the video. Essence play processing and operation can be performed in the same way as in [2].

【００３８】The embodiment has been described above. The present invention can be implemented in a still wider variety of forms, which are described in detail below. Item numbers restart from here. First, a VE is extracted from the original video. The questions are therefore how to grasp the characteristics of the original video and what kind of summary video can be created from it. This point is explained first.

【００３９】1. Elements of the original video (multimedia data)
The original video from which a VE is extracted is multimedia video used on DVDs, television, video, personal computers, and the like. It falls broadly into the following two types.
1) Stream video: conventional video such as movies and videos, which changes along a fixed flow.
2) Link video (hyper-jump video): video used in recent years in games and the like, in which the link destination, and hence the story development, changes according to the user's selection.
VE covers both types, but because stream video is currently the more common, it is the target for the time being.

【００４０】Such video software contains the following elements.
1) Character code (TC): Character codes serve as a description of the screen and are essential for hearing-impaired viewers. They will be included in much of the multimedia data produced from now on.
2) Image characters (TI): Things that appear as characters but are not held as character codes. Characters of this kind can be used if it can be determined that they are characters rather than natural imagery. As described earlier, this can be done with existing character recognition techniques.

【００４１】3) Still image (S): Something displayed on the screen as a picture that stays still for a certain time. It is either a portion of moving video on TV or the like in which nothing changes, or a still image file used on a personal computer or the like.
4) Moving image (D): The video itself of a movie, video, or the like, which represents the motion of a subject with many consecutive frames. For TV this is 30 images per second. When motion on the screen is fast, a single still image may have reduced resolution. In that case a moving image of about one second is cut out instead of a still image. This is also included in the "representative image" referred to in the claims.

【００４２】5) Sound (A)
・Sound consists of voice and other sounds.
・Voice is the human voice and carries clear meaning. (In a broad sense "voice" includes every "sound", but here it is used in the narrow sense just given, although the distinction is not strict.)
・"Other sounds" include, for example, music, sound effects, and ambient sound. In multimedia data, voice and other sounds are often stored separately in advance.

【００４３】2. Composition of a VE
All of the multimedia elements above can be used as components of a VE. To view it in a short time, however, it is effective to use still images and characters, with short moving images where needed.
1) Still images: The user views these as if turning pages.
2) Characters: If character code data exists, it is displayed. If image characters exist, they are pasted onto the screen. They may be held still on the screen or displayed so that they scroll, for example horizontally.
3) Moving images: Playback starts when the user turns to the page.
4) Sound: Output starts when the user turns to the page. When pages are turned quickly, sound need not be output.

【００４４】3. User operations when viewing a VE
1) The user views the VE as if turning the pages of a book.
2) Each page is a still image or a moving image.
3) A page has sound attached; when the user opens the page, the sound is output as described above (it can also be repeated).
4) An automatic page-turning mode should also be provided. For still images, for example, the next page may be shown automatically at an interval set by the viewer, such as every few seconds. A moving image may start as soon as its page is opened and advance to the next page when it ends. Because voice is hard to follow when the playback speed is raised, sound output is better centered on sound effects.
5) Link to the original video (hyper jump): When the user wants to see the original video while viewing the VE, the user presses a button. This alone links to the original video, and pressing it again returns to the VE.

【００４５】4. Parameters the user can set
Parameters that the user can set when using a VE include, for example, the following.
1) The summarization rate
2) Automatic or manual page turning
3) The summarization method (the motive for extracting representative images)
A default summarization method may be defined, and the viewer should also be able to select others. Several motives may be combined; for example, representative images may be extracted both at positions where a human voice is present and at positions where cheers rise.

【００４６】[Examples of motives]
・Scene changes
・Periodic extraction at fixed intervals
・Extraction of sound climaxes such as applause and cheers
If a VE button is provided on a video device or the like, the user can bring up the VE video immediately by pressing it. On a personal computer, clicking a VE button lets the user flip through the VE video.

【００４７】5. VE extraction methods
VE extraction uses the overall relationships among the information in the multimedia data. Information such as the state of the moving image, sound, and characters is used to find the timing at which a still image or moving image should be taken out. Although this partly repeats what was said earlier, for the sake of organization the possible methods of extracting a VE are listed again below.

【００４８】1) First-screen extraction: The screen at the start of the original video is simply cut out (extracted) as the title (representative screen).
2) Periodic extraction: Screens are cut out periodically at fixed intervals (for example every 10 seconds, depending on the summarization rate).
3) Cut extraction: One scene is taken and one still image is cut out of it.
4) Subtitle extraction: If part of the screen contains a subtitle, that screen is cut out as a still image.

【００４９】5) Extraction of scenes with voice: Scenes without voice are skipped and only scenes with voice are cut out.
6) Character pattern screens: A scene in which character information is shown explicitly on the screen (a pattern display, for example) is cut out as one representative screen.
7) Climax: A short moving image is cut out at a climax where the sound becomes loud, such as applause or loud cheering.

【００５０】6. Expected targets and VE extraction
For various kinds of stream video, specific motives for VE extraction (positions where an element changes) are, for example, as follows.

Video genre          VE extraction motive
News                 Scenes with a pattern (flip)
Drama                Scenes with subtitles; scenes with voice
Documentary          Scenes with voice
English conversation Scenes with subtitles
Sports               Scenes where applause or cheers rise, and their surroundings (sound climax)
Animation            Scenes with subtitles; scenes that stay still for a long time; scenes with voice
TV shopping          Scenes with subtitles (information such as prices is visible)
Song program         Scenes where music starts (determined from the sound)
Educational program  Scenes with a pattern
Variety show         Scenes where cheers rise
Orchestra            Scenes where music starts (periodic extraction)
Weather forecast     Still scenes

(Pattern (flip) = a board on which characters or pictures are drawn, used by presenters and others on TV.)

【００５１】7. Summarization rate
Summarization rate = (time needed to view the VE with automatic page turning at the standard speed) / (time needed to view the original video).
With a VE, the content can be viewed in 1/10 to 1/30 of the display time of the original video (the summarization rate).

【００５２】The time needed to view a VE depends on how the user operates it. The user may turn pages slowly, skip through quickly, or use automatic page turning. With automatic page turning, the VE extracted from a 2-hour (120-minute) original video takes roughly 4 to 12 minutes to view.
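The 4-to-12-minute figure follows directly from the summarization rates of 1/30 and 1/10. A small worked sketch is given below; the function names are hypothetical, and the 3-second page interval is an assumed setting (the text only says "every few seconds").

```python
from fractions import Fraction


def summary_minutes(original_minutes, rate):
    """Viewing time of the VE under automatic page turning."""
    return original_minutes * rate


def pages_in_summary(original_minutes, rate, seconds_per_page):
    """Number of still pages that fit in the summary time."""
    return summary_minutes(original_minutes, rate) * 60 / seconds_per_page


low = summary_minutes(120, Fraction(1, 30))    # shortest summary of a 2-hour video
high = summary_minutes(120, Fraction(1, 10))   # longest summary of a 2-hour video
pages = pages_in_summary(120, Fraction(1, 30), 3)  # 3 s per page is an assumption
```

Under these assumptions a 120-minute original yields a summary of between `low` and `high` minutes, and `pages` is the number of stills that such a summary can hold at the shorter length.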

【００５３】8. Forms of implementing VE
The VE technology can be implemented in, for example, the following forms.
1) Addition to video software: Once VE is standardized, a VE will generally be attached as standard to any video. The producer of the video software adds the VE data at the production stage.

【００５４】2) Incorporation into hardware: VE can also be built into various hardware (devices). It may be produced and sold as a dedicated LSI chip.
3) Use of VE as software: The VE technology is implemented as computer software and run on a personal computer or the like. Through options, the user can apply VE differently to different kinds of video. It can also be used, for example, when retrieving video from the Internet.

【００５５】The present invention can also be applied to existing video software other than multimedia, as described below. First, the summary video that is finally produced consists of the following.
(1) Still images
(2) Short moving images
(3) Characters (corresponding to movie subtitles)
(4) Sound

【００５６】(1) Input device for the original video
An ordinary video capture device is used, for example a video deck or a laser disc deck.
(2) Summarization methods
1) Simple method
・If the original video consists of N images and M images are to be extracted from it to generate the summary, one image is taken out as a still for every N/M images. This is the simplest method.
・For sound, silent portions are removed and the portions in which the voice forms words are taken out. Spoken portions can be distinguished reasonably well from music and noise (roaring sounds) by frequency analysis. Only the voice near each extracted still image is taken out.
・If character data is provided, the characters displayed near each still image are taken out and superimposed on the still image.
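The frame selection of the simple method can be sketched as follows. The function name `sample_indices` is hypothetical, and starting from frame 0 is an assumption; the text only specifies one still per N/M images.

```python
def sample_indices(n_frames, m_stills):
    """Pick m_stills frame indices spaced N/M apart, starting at frame 0."""
    if not 0 < m_stills <= n_frames:
        raise ValueError("need 0 < M <= N")
    # Integer arithmetic avoids rounding drift when N/M is not a whole number.
    return [(i * n_frames) // m_stills for i in range(m_stills)]


# A 2-hour video at 30 images per second, reduced to 80 stills.
indices = sample_indices(2 * 60 * 60 * 30, 80)
```

Here consecutive stills are 2,700 frames (90 seconds) apart.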

【００５７】2) Difference method (finding scene-change timing)
This method finds images that change greatly over time (scene changes and the like) and takes out images with reference to them.
・When a scene change is found, the time until the next scene change is measured, and the screen at the midpoint is taken out as a still image (position PC in FIG. 3). Alternatively, a moving image covering a certain time span in the middle portion between the scene changes is taken out as the summary image.
・Handling of characters: For multimedia data, the characters near the extracted still or moving image are taken from the "character data" and superimposed on the still image. Otherwise, subtitles are detected and displayed.
・Handling of sound: The sound near the extracted still or moving image is taken from the "sound data" and is output once, or repeatedly, while the still image is shown.

【００５８】The following techniques can be used for image processing.
1) Pixel aggregation
An image is a two-dimensional array of points, for example 300 x 260, each with a color. Television video requires 30 such images per second. This pixel resolution is usually more than the subsequent processing needs, so the pixels are aggregated. Aggregating over 4 x 4 points, for example, reduces the original image to 1/16 of its data volume; aggregating over 8 x 8 reduces it to 1/64.

【００５９】This also has the effect of discarding small changes in images that change slowly, as during zooming in or out. After this preprocessing, the value a(t, x, y) of each point is taken out, where
t: time
x, y: coordinates in the aggregated image
a: color value at the point (x, y), generally taken as a = R + G + B or the like (R, G, and B are the values of the three primary colors).
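The aggregation step and the point value a = R + G + B can be sketched as follows. Averaging each block is an assumption (the text says only that the points are aggregated), and the names `point_value` and `aggregate` are hypothetical.

```python
def point_value(r, g, b):
    """Color value of one point, a = R + G + B."""
    return r + g + b


def aggregate(frame, block=4):
    """Average non-overlapping block x block tiles of a 2-D list of values.

    Rows and columns that do not fill a whole tile are dropped.
    """
    height, width = len(frame), len(frame[0])
    out = []
    for y0 in range(0, height - height % block, block):
        row = []
        for x0 in range(0, width - width % block, block):
            tile = [frame[y][x]
                    for y in range(y0, y0 + block)
                    for x in range(x0, x0 + block)]
            row.append(sum(tile) / len(tile))
        out.append(row)
    return out


frame = [[x + y for x in range(8)] for y in range(8)]   # toy 8 x 8 image
small = aggregate(frame, block=4)                        # 2 x 2 result
```

The 64 original points become 4, which is the 1/16 reduction mentioned for 4 x 4 aggregation.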

【００６０】2) Aggregation in the time direction
Video data that changes over time is redundant and wasteful to handle as it is. One of the following processes is therefore applied to the preprocessed data a(t, x, y) of each point.
(1) Periodic sampling
(2) Temporal difference comparison of images
For the data a(t, x, y) of each point making up one image, the temporal difference is obtained:
$d(t, x, y) = a(t, x, y) - a(t - \Delta t, x, y)$
where Δt is a suitable time interval.

【００６１】This value is totaled over the whole image (x, y). Expressed as a formula, this gives

[Equation 1]

This represents the amount of change (difference) between two images that are Δt apart in time.
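The totaling step can be sketched as follows. Equation 1 itself is not available in this text, so the use of absolute values is an assumption (made so that positive and negative differences do not cancel); the function names are hypothetical.

```python
def frame_difference(curr, prev):
    """Total of |a(t,x,y) - a(t-dt,x,y)| over all points of two aggregated images."""
    return sum(abs(c - p)
               for row_c, row_p in zip(curr, prev)
               for c, p in zip(row_c, row_p))


def difference_series(frames, dt=1):
    """Da(t) for t = dt, dt+1, ...; frames[t] is the aggregated image at time t."""
    return [frame_difference(frames[t], frames[t - dt])
            for t in range(dt, len(frames))]


frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 1]],   # almost unchanged
    [[9, 9], [9, 9]],   # large change, as at a scene change
]
series = difference_series(frames)
```

The second value of `series` is much larger than the first, which is the kind of peak the text associates with a scene change.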

【００６２】The N largest values of Da(t) are then taken out. FIG. 3 shows an example of the N values of Da(t) obtained in this way. The positions CS1 and CS2, where the value is large and hence the difference between images is large, indicate that the image changes greatly there, that is, that a scene change is likely there. The interval between CS1 and CS2 is therefore regarded as one scene, and one representative image is extracted from it. Although it varies by genre, the still image that best represents a scene is generally found near the middle of the scene. In the example of FIG. 3, the midpoint PC of the scene is therefore used as the position from which the representative screen is taken.
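Selecting the N largest values of Da(t) and taking the midpoint of each resulting scene can be sketched as follows. The function names are hypothetical, and rounding the midpoint down to a whole frame index is an assumption.

```python
def scene_boundaries(da, n):
    """Time indices of the n largest values of Da(t), returned in time order."""
    top = sorted(range(len(da)), key=lambda t: da[t], reverse=True)[:n]
    return sorted(top)


def representative_positions(boundaries):
    """Midpoint PC between each pair of consecutive boundaries (CS1, CS2, ...)."""
    return [(a + b) // 2 for a, b in zip(boundaries, boundaries[1:])]


da = [1, 50, 2, 3, 1, 60, 2, 2, 2, 55, 1]      # toy difference series
boundaries = scene_boundaries(da, n=3)
midpoints = representative_positions(boundaries)
```

With this toy series the three peaks bound two scenes, and one representative position is produced for each.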

【００６３】For each frame of the video software, the degree of change between images (scene changes) may instead be detected from, for example, the image data of a single horizontal scanning line near the center of the frame. Specifically, the image data of that scanning line is divided into N sections and the average value of each section is obtained. For each section, the difference from the average of the same section in the preceding image is then obtained. These differences are summed for each frame, and the positions where the sum is large, corresponding to CS1 or CS2 in FIG. 3, are taken as scene-change positions in the same way as above.
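The scanning-line variant can be sketched as follows. Summing absolute differences is an assumption (the text says only that the differences are summed), as is dropping any samples that do not fill a whole section; the function names are hypothetical.

```python
def section_means(line, n_sections):
    """Average value of each of n_sections equal-length sections of one scan line."""
    size = len(line) // n_sections
    return [sum(line[i * size:(i + 1) * size]) / size for i in range(n_sections)]


def scanline_change(curr_line, prev_line, n_sections):
    """Sum over sections of |mean in current frame - mean in previous frame|."""
    curr = section_means(curr_line, n_sections)
    prev = section_means(prev_line, n_sections)
    return sum(abs(c - p) for c, p in zip(curr, prev))


previous = [0, 0, 0, 0, 0, 0, 0, 0]
current = [0, 0, 0, 0, 8, 8, 8, 8]
change = scanline_change(current, previous, n_sections=2)
```

Because only one line per frame is examined, this costs far less than comparing whole images, at the price of missing changes that do not cross that line.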

【００６４】The summary video has the following features.
・The user can view it as if turning pages, at whatever timing the user likes.
・For still images this is straightforward.
・For moving images, a portion is shown as a short moving image within one page and can be viewed repeatedly on instruction.

【００６５】Modifications are described below. First, during summary playback, the representative screens may be shown on the display four at a time (four representative screens composited into one screen). The representative images all originate from a single piece of video software. They are therefore related to one another (they share a story), and the viewer can readily grasp their content even when several are displayed at once. If, for example, N representative images are displayed on the screen at a time, the summary can be played back in simply 1/N of the time needed when they are displayed one at a time. Moreover, given that video software has a story, displaying several representative images at once may actually make the content easier to grasp than playing them back one by one.

【００６６】The present invention may be realized in a dedicated device, or it may be stored on a hard disk or the like as an application program and called up and run on a computer as needed. It can also be stored on a floppy disk or the like and distributed. A dedicated chip may be produced and built into a personal computer, DVD player, game player, or VTR. When the invention is run on a personal computer or the like, the "EP/NP" button and other buttons of the embodiment may be assigned to the screen or the keyboard.

【００６７】Several summaries may be attached to one piece of video software so that the viewer can choose among them. For example, a summary motivated by the positions where subtitles appear and a summary motivated by the positions where cheers rise may both be attached. This also applies to the embodiment in which position information is attached to the video software. Both the position information and the summary images may also be attached. Summary playback then becomes possible even on an ordinary video software playback device (DVD player, video deck, or the like) that lacks the function of generating the summary from position information.

【００６８】Video software to which representative screens (a summary) or position information has been attached can be delivered on standalone media such as DVDs and videotapes, and also via radio waves, networks, individual communication lines, and other media. Various data (subtitles, sound, and so on) near the positions where the representative screens were extracted may be delivered as well and played back at the viewer's choice. Representative images may also be extracted from video software according to a motive chosen by the viewer or an administrator and accumulated on a suitable storage medium such as a hard disk or videotape. In a video library or the like, this makes searching for the desired video software accurate and easy. It can also be used to manage video software owned by individuals.

【００６９】Another embodiment 30 is shown in FIGS. 4 to 10. Embodiment 30 is one embodiment of the video software processing method of claim 15 and the video software processing device of claim 16, and is realized mainly on the CPU 2 and the ROM 3.

【００７０】In a drama, for example, acting often takes place at one location (background), the cut changes and the acting continues at one or several other locations, and then the action returns to the original location. A summary conveys the content of the video software more readily when the number of representative images is appropriate and the images depict different scenes from various angles.

【００７１】Embodiment 30, shown in FIGS. 4 to 10, meets this need. A histogram is created for the cut representative image taken from each cut. The similarity of the histograms, and hence of the cut representative images, is evaluated as the distance between the histograms in a multidimensional space. Similar images, that is, those whose distance is small, are repeatedly merged into the same group until the many cut representative images are consolidated into a suitable number of groups, for example several tens. A representative image is then taken from each group.

【００７２】As a result, the representative images are varied, each depicting a different scene of the video software, and their number is reduced to a suitable figure. The similarity judgment applied to each screen here relies on a property that the screens of video software generally have, namely that the distribution of the histogram (its waveform or envelope) changes visibly when the cut changes. It therefore has the advantage that it can be applied uniformly and mechanically regardless of the genre of the video software.

【００７３】Embodiment 30 shown in FIGS. 4 to 10 is described below. Heading numbers restart for this description.
1) Cut segmentation.
Video software is made up of many cuts. One cut lasts from a few seconds to a few tens of seconds. At a boundary where the cut changes, the data of the frames changes greatly. The discrimination method of the embodiment described earlier is one way to detect this, but here a method that uses the histogram of each frame (image), the video histogram (H), is used, as it is effective for this purpose.

【００７４】FIG. 5 shows an example of a histogram. The histogram is created by the histogram creation unit 31. The horizontal axis plots the color information (R, G, B) of the pixels of the image, or their luminance and chrominance (Y, U, V), and the vertical axis plots how often pixels of that color or value occur in the image (the number of such pixels). In the example of FIG. 5, the horizontal axis is the luminance Y in 256 levels and the vertical axis is the frequency (number of pixels) of pixels with that luminance. To simplify processing, it is convenient to divide the horizontal axis into 10 to 20 classes and express the histogram in consolidated form as shown in FIG. 6.
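A consolidated luminance histogram of the kind shown in FIG. 6 can be sketched as follows. The choice of 16 equal-width classes is an assumption within the stated range of 10 to 20, and the function name is hypothetical.

```python
def luminance_histogram(pixels, n_classes=16, levels=256):
    """Count pixels per class; pixels are luminance values Y in 0..levels-1."""
    hist = [0] * n_classes
    for y in pixels:
        hist[y * n_classes // levels] += 1   # map 0..255 onto 0..n_classes-1
    return hist


hist = luminance_histogram([0, 15, 16, 255])
```

With 16 classes each class covers 16 luminance levels, so the values 0 and 15 fall in the first class, 16 in the second, and 255 in the last.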

【００７５】Because a histogram expresses the characteristics of an image concisely, it can be put to many uses. Here it is first used, as stated above, to detect cuts. The difference calculation unit 32 obtains the difference Ha between successive histograms H(t) (written ΔHa in the formula below) and detects where the change is large. A position where this difference is large is a cut-change position.

【００７６】Expressed as a formula,
$H(t) = H(t, y_1, y_2, y_3, \ldots, y_N)$
where
H(t): histogram of the frame at time t
y_n: frequency of the color, or of the color or brightness information, in the n-th class (n = 1, 2, ..., N; N = number of classes on the horizontal axis).

【００７７】$\Delta H_a = |H(t+1) - H(t)| = \sum_{n=1}^{N} |y_n(t+1) - y_n(t)|$
(the sum of the absolute values of the differences between corresponding components). Plotted, this looks like FIG. 7, for example. Taking the positions where the change is at or above a certain value detects the cut changes.

【００７８】2) Extraction of cut representative histograms.
The continuous video, for example the successive images of a one-hour drama, has now been divided into many cuts. Next, the representative histogram extraction unit 33 creates a representative histogram (Hs) for each cut. For this, histograms are created from, for example, the frame at the start of the cut, a frame near its middle, and the frame at its end. Alternatively, the time average of all the histograms in the cut may be computed and used as the representative.

【００７９】Expressed as a formula,
$H_s = H_s(y_{s1}, y_{s2}, y_{s3}, \ldots, y_{sN})$
where
Hs: representative histogram of cut s (s = 1, 2, ..., S)
y_sn: frequency of the color, or of the color or brightness information, in the n-th class (n = 1, 2, ..., N), taken from the representative screen of the cut or averaged over the duration of the cut.
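Both ways of obtaining a representative histogram for one cut can be sketched as follows. The function names are hypothetical, and picking the middle frame by integer division is an assumption.

```python
def key_frame_histograms(cut_histograms):
    """Histograms of the first, middle, and last frames of one cut."""
    return [cut_histograms[0],
            cut_histograms[len(cut_histograms) // 2],
            cut_histograms[-1]]


def time_average_histogram(cut_histograms):
    """Class-by-class average of all frame histograms in one cut."""
    n_frames = len(cut_histograms)
    return [sum(column) / n_frames for column in zip(*cut_histograms)]


cut = [[4, 0], [3, 1], [2, 2], [2, 2], [4, 0]]   # toy histograms of one cut
keys = key_frame_histograms(cut)
hs = time_average_histogram(cut)
```

The time average is less sensitive to a single unusual frame, while the key-frame variant needs only three histograms per cut.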

【００８０】３）カット代表ヒストグラムの近似度算
出。次に、代表ヒストグラム間距離算出部３４で、各カット
代表ヒストグラムＨｓ間の距離Ｄｓｔを算出する（ｓ，
ｔ＝１，２，３，…，Ｓ）。距離Ｄｓｔは、或るカット
ｓの代表ヒストグラムと、別のカットｔの代表ヒストグ
ラムとの間の距離であり、この値（Ｄｓｔ）は、夫々の
ヒストグラムの対応する各区分ｙｓｎ同士間の差の絶対
値又は二乗値の合計として求められる。式で表わすと、3) Calculation of the similarity between representative cut histograms. Next, the representative histogram distance calculating unit 34 calculates the distance Dst between each pair of representative cut histograms Hs (s, t = 1, 2, 3, ..., S). The distance Dst is the distance between the representative histogram of one cut s and the representative histogram of another cut t. Its value is obtained as the sum of the absolute values, or of the squares, of the differences between the corresponding divisions ysn of the two histograms. Expressed as a formula,

【数２】である。カット代表ヒストグラムの数がｓ個の場合、こ
のようにして求められるそれら各代表ヒストグラム間の
距離Ｄｓｔの数は、リーグ戦での試合数と同じで、ｓ×
（ｓ−１）÷２個である。(Equation 2). When there are S representative cut histograms, the number of distances Dst obtained in this way between them is the same as the number of matches in a round-robin league: S × (S − 1) ÷ 2.
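Written out from the verbal description above, the distance would read $D_{st} = \sum_{n} |y_{sn} - y_{tn}|$, or with squares in place of absolute values. The following Python sketch computes it for every pair of cuts and confirms the pair count. The function names and sample histograms are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def distance(hs: Sequence[float], ht: Sequence[float], squared: bool = False) -> float:
    """Sum of absolute (or squared) differences between corresponding divisions."""
    if squared:
        return sum((a - b) ** 2 for a, b in zip(hs, ht))
    return sum(abs(a - b) for a, b in zip(hs, ht))


def pairwise_distances(
    reps: Sequence[Sequence[float]], squared: bool = False
) -> Dict[Tuple[int, int], float]:
    """Distance for every unordered pair (s, t) with s < t."""
    return {
        (s, t): distance(reps[s], reps[t], squared)
        for s, t in combinations(range(len(reps)), 2)
    }


# Hypothetical representative histograms for S = 4 cuts.
reps = [[5, 0, 0], [4, 1, 0], [0, 0, 5], [0, 1, 4]]
d = pairwise_distances(reps)
print(len(d))     # 6, i.e. 4 * 3 / 2
print(d[(0, 1)])  # 2
print(d[(0, 2)])  # 10
```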

【００８１】４）グループ化。つぎに、カット間距離Ｄｓｔの小さなもの同士をとりま
とめて、適当な数（Ｍ）のグループにする。この処理
は、グループ化処理部３６で行なう。この処理には各種
数学的手法が利用できるが、最も単純なのは以下のよう
な方法である。先ずＤｓｔのうちで最小の距離にあるカ
ット代表ヒストグラムｓとｔとを求める。そしてこのｓ
とｔを一つのグループにする。次に、残りの距離Ｄｓｔ
の中から最小の距離を持つｓとｔの組合わせを見つけて
グループにする。4) Grouping. Next, cuts whose mutual distance Dst is small are gathered together to form a suitable number (M) of groups. This processing is performed by the grouping processing unit 36. Various mathematical methods can be used for it, but the simplest is as follows. First, the pair of representative cut histograms s and t with the smallest distance Dst is found, and s and t are put into one group. Next, among the remaining distances Dst, the combination of s and t with the smallest distance is found and grouped.

【００８２】当初は全カットの数に対応したｓ個のヒス
トグラムが独立に存在し、グループもこれと同じ数のｓ
個存在する。ここで、グループ化の操作を１回行なう
と、グループが一つ減少する。従って、このグループ化
操作をＳ−Ｍ回繰り返せば、グループの数はＭ個にな
る。これにより、各カットの代表ヒストグラムｓ個を図
８に示すような適切な数（Ｍ）のグループに纏めること
が出来る。Initially there are S independent histograms, one for each cut, and therefore S groups as well. Each grouping operation reduces the number of groups by one. Therefore, if the grouping operation is repeated S − M times, the number of groups becomes M. In this way, the S representative histograms of the cuts can be gathered into a suitable number (M) of groups as shown in FIG. 8.
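One way to read the simple procedure above is as a greedy merge over pairs sorted by distance, stopping after S − M merges. The Python below is a sketch under that reading. The union-find bookkeeping, the rule of skipping pairs already in the same group, and the sample data are assumptions, not details given in the text.

```python
from itertools import combinations
from typing import List, Sequence


def group_cuts(reps: Sequence[Sequence[float]], m: int) -> List[List[int]]:
    """Merge the closest pairs of cuts until only m groups remain."""
    s_count = len(reps)
    if not 1 <= m <= s_count:
        raise ValueError("m must be between 1 and the number of cuts")

    parent = list(range(s_count))  # each cut starts as its own group

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def dist(pair) -> float:
        s, t = pair
        return sum(abs(a - b) for a, b in zip(reps[s], reps[t]))

    groups_left = s_count
    for s, t in sorted(combinations(range(s_count), 2), key=dist):
        if groups_left == m:
            break
        root_s, root_t = find(s), find(t)
        if root_s != root_t:  # assumption: skip pairs already grouped together
            parent[root_t] = root_s
            groups_left -= 1  # each merge removes exactly one group

    groups = {}
    for i in range(s_count):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical representative histograms for S = 5 cuts, grouped into M = 2.
reps = [[5, 0, 0], [4, 1, 0], [0, 0, 5], [0, 1, 4], [2, 2, 1]]
print(group_cuts(reps, 2))  # [[0, 1, 4], [2, 3]]
```

With S = 5 and M = 2 the loop performs exactly three merges, matching the S − M count in the text.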

【００８３】５）各グループの代表画像抽出。グループの区分けが完了したら各グループから代表画像
を抽出する。この処理は代表画像抽出部３７で実行す
る。この場合、夫々のグループから一つランダムにグル
ープ代表画像を選んでも構わないし、厳密にするなら、
グループの重心Ｈｇ（図８のｘ記号）を求め、この重心
Ｈｇに最も近いものを代表画像とする。5) Extraction of a representative image for each group. Once the grouping is complete, a representative image is extracted from each group. This processing is executed by the representative image extracting unit 37. Here, one group representative image may simply be chosen at random from each group; for a stricter choice, the center of gravity Hg of the group (the x symbol in FIG. 8) is obtained, and the image closest to this center of gravity Hg is taken as the representative image.

【００８４】重心Ｈｇを求めるには、先ずそのグループ
に属することになった各ヒストグラムについて夫々のｙ
ｓｎの値を平均する。式で表わせば、であり、To determine the center of gravity Hg, the values of ysn are first averaged, division by division, over all the histograms that belong to the group. Expressed as formulas:

【数３】 (Equation 3)

【数４】と表わされる。(Equation 4) It is expressed as

【００８５】そして各々のヒストグラムｊとグループ重
心Ｈｇとの間の距離を求める。この距離が最小になるヒ
ストグラム、即ちもっともグループ重心に近いヒストグ
ラムの画像を、そのグループの代表画像として取り出
す。これら代表画像は、その儘集合させ要約にしてもよ
い。ここまでの流れを纏めれば図９のようになる。Next, the distance between each histogram j and the group center of gravity Hg is computed. The histogram with the smallest distance, that is, the one closest to the group center of gravity, is identified, and its image is taken out as the representative image of that group. These representative images may simply be collected as they are to form the summary. The flow up to this point is summarized in FIG. 9.
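The centroid-based choice can be sketched as follows. This is only an illustration: the function names are hypothetical, the sum of absolute differences is assumed as the distance, and ties are broken by taking the first minimum.

```python
from typing import List, Sequence


def centroid(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Average each division over all histograms in the group."""
    count = len(histograms)
    return [sum(h[n] for h in histograms) / count for n in range(len(histograms[0]))]


def representative_index(histograms: Sequence[Sequence[float]]) -> int:
    """Index of the histogram closest to the group's center of gravity."""
    hg = centroid(histograms)

    def dist_to_centroid(j: int) -> float:
        return sum(abs(a - b) for a, b in zip(histograms[j], hg))

    return min(range(len(histograms)), key=dist_to_centroid)


# Hypothetical group of three representative cut histograms.
group = [[6, 0, 0], [4, 2, 0], [2, 4, 0]]
print(centroid(group))              # [4.0, 2.0, 0.0]
print(representative_index(group))  # 1
```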

【００８６】この実施の形態例ではグループ化処理を行
なった。従って、その処理結果を代表画像の抽出に反映
させると、一層的確に原映像ソフトの内容を表現する要
約が生成出来る。具体的には要約に含まれる夫々の画像
に優先度を付け、この優先度に応じて下記の如く代表画
像の選択を行なう。In this embodiment, grouping has been performed. If its result is reflected in the extraction of representative images, a summary that expresses the contents of the original video software more accurately can be generated. Specifically, each image included in the summary is assigned a priority, and the representative images are selected according to this priority as described below.

【００８７】１）グループに含まれるカット（代表ヒス
トグラム）の数の反映。グループを構成するカットの数が多い、即ち一本の映像
ソフトの中で類似の画像なりカットなりが何度も出て来
る場合、そのグループはその映像ソフトの中で存在意義
が大きいと考えられる。そこで、このようなものについ
ては、代表画像を一つだけ取り出すのではなく、頻度に
比例した数の代表画像を取り出すことにするのが効果的
と考えられる。この場合、取り出されたものが近似して
いては代表画像を複数にした意味が薄れる感じがする。
従ってこのときは、なるべく距離の遠いものを選んで複
数取り出すと良い。1) Reflecting the number of cuts (representative histograms) included in a group. When a group consists of many cuts, that is, when similar images or cuts appear many times within one piece of video software, the group is considered to be significant within that video software. For such a group it is therefore considered effective to take out not just one representative image but a number of representative images proportional to the frequency. If the images taken out resemble one another, however, there is little point in having several of them. In this case it is therefore better to choose images that are as far apart from one another as possible.

【００８８】２）カットの合計時間の反映。グループを構成しているヒストグラムの母体である各カ
ットの持続時間の合計が長い場合も、そのグループも映
像ソフトの中で存在意義が大きいと考えられる。そこで
これらの優先度を高くするために、グループ毎のカット
の合計時間数に比例して幾つかの代表画像を取り出すの
も効果的である。３）頻度の低いカットの無視。グループを構成するカットの数が少なければそのグルー
プを無視し、その代表画像を要約に採用しないことも考
えられる。2) Reflecting the total duration of the cuts. When the combined duration of the cuts underlying the histograms in a group is long, that group too is considered significant within the video software. To give such groups a higher priority, it is also effective to take out a number of representative images proportional to the total duration of the cuts in each group. 3) Ignoring infrequent cuts. If a group consists of only a few cuts, the group may be ignored and its representative image left out of the summary.
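The three priority rules above could be combined roughly as in the following Python sketch. Everything here is an assumption made for illustration: the weight may be either the cut count or the total duration, the rounding rule and the `min_weight` cutoff are arbitrary choices, and the greedy farthest-point selection is one possible way to pick images that lie far apart.

```python
from typing import List, Sequence


def allocate_counts(weights: Sequence[float], total: int, min_weight: float = 0) -> List[int]:
    """Number of representative images per group, roughly proportional to weight.

    Groups whose weight is below min_weight get none (rule 3). Because of
    rounding, the counts are not guaranteed to add up to exactly `total`.
    """
    kept = [w if w >= min_weight else 0 for w in weights]
    kept_sum = sum(kept)
    if kept_sum == 0:
        return [0] * len(weights)
    return [max(1, round(total * w / kept_sum)) if w > 0 else 0 for w in kept]


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def pick_far_apart(histograms: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick up to k histograms that lie far from one another."""
    if not histograms or k <= 0:
        return []
    chosen = [0]  # assumption: start from the first histogram in the group
    while len(chosen) < min(k, len(histograms)):
        rest = [j for j in range(len(histograms)) if j not in chosen]
        chosen.append(
            max(rest, key=lambda j: min(l1(histograms[j], histograms[c]) for c in chosen))
        )
    return chosen


# Hypothetical groups with 6, 3 and 1 cuts; ten images wanted in total.
print(allocate_counts([6, 3, 1], total=10, min_weight=2))  # [7, 3, 0]

# Hypothetical histograms inside one group.
hists = [[5, 0], [4, 1], [0, 5], [3, 2]]
print(pick_far_apart(hists, 2))  # [0, 2]
```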

【００８９】代表画像の表示についても二つほど工夫が
ある。先ず、これらは大きな画面上に分散して配置して
同時に多数表示するようにしても良い。言わば「鳥獣戯
画」の如き表現形式である。こうすると、要約を時間的
にではなく、空間的に把握することが出来るようにな
る。そして、この空間的に配置された代表画像の一つを
クリックすると、その代表画像が動画として動きだすよ
うにするのも良く、そこで更にダブルクリックすれば、
そのカットから原映像ソフトが再生されるようにするの
も良い。There are also a couple of ideas for displaying the representative images. First, many of them may be laid out across a large screen and displayed at the same time, a form of presentation reminiscent of the "Choju-giga" picture scrolls. This allows the summary to be grasped spatially rather than temporally. Clicking one of these spatially arranged representative images may then start it moving as a moving picture, and a further double-click may start playback of the original video software from that cut.

【００９０】このとき、上記１），２），３）のような
観点から、各代表画像の表示の大きさに重み付けを行な
うと判りやすい。即ち図１０に示すように、そのグルー
プに属するヒストグラムが多いもの、或いはそのカット
の持続時間の合計が多いものは表示面積を大きくし、且
つ中央に配置する。こうすると、視聴者はその映像ソフ
トの言わばさわりを一瞥で感得出来る。なお、最初の実
施の形態例についての変形例は、ここに説明した実施例
にも当てはまる。Here, the display becomes easier to understand if the display size of each representative image is weighted from the viewpoints of 1), 2) and 3) above. That is, as shown in FIG. 10, a group to which many histograms belong, or one whose cuts have a long total duration, is given a larger display area and placed at the center. This lets the viewer take in the highlights of the video software at a glance. The modifications described for the first embodiment also apply to the embodiment described here.
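A minimal sketch of the size weighting follows. It assumes that the display area is simply proportional to a group's weight and that tiles are handed to the layout step largest first, so that the heaviest group can be placed at the center. The function name and the numbers are illustrative only.

```python
from typing import List, Sequence, Tuple


def weighted_tiles(weights: Sequence[float], screen_area: float) -> List[Tuple[int, float]]:
    """Return (group index, display area) pairs, largest area first."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one group must have a positive weight")
    tiles = [(g, screen_area * w / total) for g, w in enumerate(weights)]
    # The first tile is the one intended for the center of the screen.
    return sorted(tiles, key=lambda tile: -tile[1])


# Hypothetical weights (cut counts or total durations) for three groups.
print(weighted_tiles([1, 6, 3], screen_area=1000.0))
# [(1, 600.0), (2, 300.0), (0, 100.0)]
```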

【００９１】[0091]

【発明の効果】以上説明したように、本発明によれば、
視聴者が欲する映像ソフト構成データに着目し、映像の
要約を自動的に抽出することができる。視聴者は、それ
を自由に自分のペースで短時間のうちに見ることがで
き、当該映像ソフトの全体像を、簡単、かつ正確に把握
することが出来る。これにより、簡単に大量の映像にア
クセスできるようになる。As described above, according to the present invention, a summary of the video can be extracted automatically by focusing on the video software configuration data that the viewer is interested in. The viewer can watch the summary freely, at his or her own pace and in a short time, and can grasp the overall picture of the video software easily and accurately. This makes it easy to access a large amount of video.

【００９２】また、要約映像は、原映像に比較すればか
なり小さいデータ量になり、通信や携帯端末にも適する
ものと成り得る。将来、光ファイバー高速通信網、大容
量メモリー、高速ＣＰＵのおかげでデータ量そのものは
問題にならなくなるとも思われるが、情報量自体が小さ
いことは、同じ通信手段でより沢山の種類の情報を伝達
することが出来るという利点がある。また高速通信網へ
の移行期でも、容易に実施出来るという利点もある。Further, the summary video has a considerably smaller data volume than the original video, and can therefore be suitable for communication and for portable terminals. In the future, data volume itself may cease to be a problem thanks to high-speed optical fiber networks, large-capacity memory, and fast CPUs, but a small amount of information still has the advantage that more kinds of information can be conveyed over the same communication means. There is also the advantage that the invention can easily be put into practice even during the transition to high-speed communication networks.

[Brief description of the drawings]

【図１】実施の形態例を示すブロック図。FIG. 1 is a block diagram illustrating an example of an embodiment.

【図２】原映像ソフトウェアと抽出した要約（代表画
面）の例を示す説明図。FIG. 2 is an explanatory diagram showing an example of original video software and an extracted abstract (representative screen).

【図３】各画像データ間の差分の例を示すグラフ。FIG. 3 is a graph showing an example of a difference between image data.

【図４】他の実施の形態例を示すブロック図。FIG. 4 is a block diagram showing another embodiment.

【図５】１枚の画像のヒストグラムの例を示すグラフ。FIG. 5 is a graph showing an example of a histogram of one image.

【図６】簡素化したヒストグラムの例を示すグラフ。FIG. 6 is a graph showing an example of a simplified histogram.

【図７】ヒストグラムの差分の例を示すグラフ。FIG. 7 is a graph showing an example of a histogram difference.

【図８】ｎ次元空間での各カット（ヒストグラム）のグ
ループ化の例を示す概念図。FIG. 8 is a conceptual diagram showing an example of grouping each cut (histogram) in an n-dimensional space.

【図９】他の実施の形態例に於ける要約生成のプロセス
を示す概念図。FIG. 9 is a conceptual diagram showing a summary generation process in another embodiment.

【図１０】空間配置の要約表示例を示す平面図。FIG. 10 is a plan view showing a summary display example of a spatial arrangement.

[Explanation of symbols]

１…カット判別記憶部２…ＣＰＵ３…ＲＯＭ４…字幕判別記
憶部５…画面合成部６…ハードディ
スク７…操作ユニット８…ディスプレ
イ１０…エッセンスプレーヤ２０…ＤＶＤプ
レーヤ３０…他の実施の形態例３１…ヒストグ
ラム作成部３２…差分算出部３３…代表ヒス
トグラム抽出部３４…代表ヒストグラム間距離算出部３６…グループ
化処理部３７…代表画像抽出部DESCRIPTION OF SYMBOLS 1 ... Cut discrimination storage part 2 ... CPU 3 ... ROM 4 ... Subtitle discrimination storage part 5 ... Screen synthesis part 6 ... Hard disk 7 ... Operation unit 8 ... Display 10 ... Essence player 20 ... DVD player 30 ... Other embodiment 31 ... Histogram creation unit 32 ... Difference calculation unit 33 ... Representative histogram extraction unit 34 ... Representative histogram distance calculation unit 36 ... Grouping processing unit 37 ... Representative image extraction unit

Claims

[Claims]

1. A video software processing method comprising: detecting a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; extracting several representative images from the video software based on the information on the detected position; and generating a summary of the video software.

2. A video software processing method comprising: detecting a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; and adding the information on the detected position to the video software as position information for generating a summary of the video software consisting of several representative images.

3. A video software processing method comprising: reading information on a position that has been added to the video software as a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; extracting several representative images from the video software based on that position information; and generating a summary of the video software.

4. A video software processing method comprising: detecting a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; extracting several representative images related to the detected position; and adding them to the video software as a summary.

5. A video software processing method comprising reading out and sequentially reproducing a summary that was extracted as representative images related to positions at which the state of at least one of image, sound, subtitle, and other video software configuration data changes, and that was added to the video software.

6. When reproducing the generated summary or the added summary, if there is a command from a viewer, the video software is normally operated from the vicinity of the position where the reproduced representative image is extracted. 6. The video software processing method according to claim 1, wherein the video software is reproduced.

7. The video software according to claim 1, wherein when the video software is reproduced in a normal state, when a viewer gives an instruction, a summary of the video software is generated from the vicinity of the position. The video software processing method according to claim 5.

8. A video software processing device comprising: position detecting means for detecting a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; and summary generating means for extracting several representative images from the video software based on the information on the detected position and generating a summary of the video software.

9. A video software processing device comprising: position detecting means for detecting a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; and position information adding means for adding the information on the detected position to the video software as position information for generating a summary of the video software.

10. Position information reading means for reading information of a position added to the video software as a position at which a state of at least one of image, sound, subtitle, and other video software configuration data is changed; A video software processing device, comprising: a summary generating unit configured to generate a video software summary based on the position information.

11. A video software processing device comprising: position detecting means for detecting a position at which the state of at least one of image, sound, subtitle, and other video software configuration data changes; and image adding means for extracting several representative images from the video software based on the information on the detected position and adding the representative images to the video software as a summary.

12. A video software processing device comprising reproducing means for sequentially reproducing representative images that were extracted from the video software based on information on detected positions at which the state of at least one of image, sound, subtitle, and other video software configuration data changes, and that were added to the video software as a summary.

13. At the time of reproducing the generated summary or the added summary, if there is a command from a viewer, the video software is placed in a normal state from a position near the position where the reproduced representative image is extracted. 10. A reproducing means for reproducing in the step (c).
3. The video software processing device according to any one of 2.

14. A reproducing apparatus according to claim 8, further comprising a summarizing means for executing the summarization from the vicinity of the position where the video software is reproduced when a command is received from a viewer during reproduction of the video software in a normal state. The video software processing device according to any one of claims 10, 12, and 13.

15. The video software processing method according to any one of claims 1 to 7, wherein the position at which the state of the configuration data changes is detected by calculating distances in a multidimensional space between the histograms of the images representing each cut, gathering those whose distances are short into groups, and extracting a position representative of each group.

16. A distance calculating means for calculating a distance in a multidimensional space with respect to a histogram of each image representing each cut, a grouping means for grouping together those having a short distance, 15. The video software processing apparatus according to claim 8, wherein the position detection unit is realized by a group representative detection unit that detects a position representative of a group.