JP7486871B1

JP7486871B1 - Scene extraction system, scene extraction method, and scene extraction program

Info

Publication number: JP7486871B1
Application number: JP2024048196A
Authority: JP
Inventors: 学吉田; 空也西坂; 久佳中岸
Original assignee: Star Ai
Current assignee: Star Ai
Priority date: 2024-03-25
Filing date: 2024-03-25
Publication date: 2024-05-20
Anticipated expiration: 2044-03-25

Abstract

[Problem]
To provide a new technique for extracting scenes that match a user's preferences.
[Solution]
The scene extraction system 0 includes a storage unit, a division unit, a similarity calculation unit, a preferred scene extraction unit, and a similar scene extraction unit. The division unit divides a video into scenes. The storage unit stores the video importance level for the video feature, the sound importance level for the sound feature, and the speech importance level for the speech feature for each user in the divided split scenes, linked to the user ID. The similarity calculation unit calculates the video similarity, sound similarity, and speech similarity between the split scenes based on the video feature, sound feature, and speech feature of the divided split scenes. The preferred scene extraction unit extracts preferred scenes preferred by the user from the multiple split scenes based on the user's viewing history. The similar scene extraction unit extracts similar scenes from the split scenes based on the preferred scene, the similarity, and the importance level linked to the user ID.
[Selected Figure] Figure 1

Description

本発明は、シーン抽出システム、シーン抽出方法及びシーン抽出プログラムに関する。 The present invention relates to a scene extraction system, a scene extraction method, and a scene extraction program.

従来、シーンにおける複数の特徴に基づいて、ユーザの嗜好に合うシーンの抽出を行う技術が存在する。 Conventionally, there exists technology that extracts scenes that match a user's preferences based on multiple features in the scene.

例えば、特許文献１には、シーンにおける単語、画像、音等の特徴に基づいて、ユーザの嗜好に合うシーンの抽出を行う技術が開示されている。 For example, Patent Document 1 discloses a technology that extracts scenes that match a user's preferences based on the characteristics of the words, images, sounds, etc. in the scene.

特許５３３８４５０号公報 Japanese Patent No. 5338450

何れの特徴を重視して動画を視聴しているかはユーザごとに異なる。よって、ユーザの嗜好に合うシーンを提供するためには、何れの特徴を重視してシーンを抽出するかをユーザごとに設定するのが好ましい。しかしながら特許文献１の技術では、複数の特徴に基づいてシーンを抽出できる一方、ユーザが重視する特徴をユーザごとに設定してシーンを抽出することはできない。 Which features a user emphasizes when watching a video differs from user to user. To provide scenes that match a user's preferences, it is therefore preferable to set, for each user, which features to emphasize when extracting scenes. However, although the technology of Patent Document 1 can extract scenes based on multiple features, it cannot extract scenes with the emphasized features set individually for each user.

本発明は、上述したような事情に鑑みてなされたものであって、ユーザの嗜好に合うシーンを抽出する新たな技術を提供することを解決すべき課題とする。 The present invention was made in consideration of the above-mentioned circumstances, and the problem to be solved is to provide a new technology for extracting scenes that match the user's preferences.

上記課題を解決するために、本発明は動画からシーンを抽出するシーン抽出システムであって、
前記シーン抽出システムは、記憶部、分割部、類似度算出部、嗜好シーン抽出部、類似シーン抽出部、を備え、
前記分割部は、前記動画をシーンごとに分割し、
前記記憶部は、前記分割された分割シーンにおける、ユーザごとの、映像に関する特徴に対する映像重視度、音に関する特徴に対する音重視度、発話に関する特徴に対する発話重視度、をユーザＩＤに紐づけて格納し、
前記類似度算出部は、前記分割された分割シーンの映像特徴量、音特徴量、発話特徴量、に基づいて、前記分割シーン同士の映像類似度、音類似度、発話類似度、を算出し、
前記嗜好シーン抽出部は、前記ユーザの視聴履歴に基づいて、複数の分割シーンから当該ユーザが嗜好する嗜好シーンを抽出し、
前記類似シーン抽出部は、前記嗜好シーン、前記映像類似度、前記音類似度、前記発話類似度、前記ユーザＩＤに紐づく重視度、に基づいて、前記分割シーンから類似シーンを抽出する。 To solve the above problem, the present invention provides a scene extraction system that extracts scenes from a video, wherein:
the scene extraction system includes a storage unit, a division unit, a similarity calculation unit, a preferred scene extraction unit, and a similar scene extraction unit;
the division unit divides the video into scenes;
the storage unit stores, for each user and in association with a user ID, a video importance level for video-related features, a sound importance level for sound-related features, and a speech importance level for speech-related features in the split scenes;
the similarity calculation unit calculates a video similarity, a sound similarity, and a speech similarity between the split scenes based on a video feature, a sound feature, and a speech feature of the split scenes;
the preferred scene extraction unit extracts, from a plurality of split scenes, preferred scenes that the user prefers based on a viewing history of the user; and
the similar scene extraction unit extracts similar scenes from the split scenes based on the preferred scenes, the video similarity, the sound similarity, the speech similarity, and the importance levels linked to the user ID.

また、本発明は、動画からシーンを抽出するシーン抽出システムが実行するシーン抽出方法であって、
前記シーン抽出システムは、記憶部、分割部、類似度算出部、嗜好シーン抽出部、類似シーン抽出部、を備え、
前記分割部が、前記動画をシーンごとに分割するステップと、
前記記憶部が、前記分割された分割シーンにおける、ユーザごとの、映像に関する特徴に対する映像重視度、音に関する特徴に対する音重視度、発話に関する特徴に対する発話重視度、をユーザＩＤに紐づけて格納するステップと、
前記類似度算出部が、前記分割された分割シーンの映像特徴量、音特徴量、発話特徴量、に基づいて、前記分割シーン同士の映像類似度、音類似度、発話類似度、を算出するステップと、
前記嗜好シーン抽出部が、前記ユーザの視聴履歴に基づいて、複数の分割シーンから当該ユーザが嗜好する嗜好シーンを抽出するステップと、
前記類似シーン抽出部が、前記嗜好シーン、前記映像類似度、前記音類似度、前記発話類似度、前記ユーザＩＤに紐づく重視度、に基づいて、前記分割シーンから類似シーンを抽出するステップと、を含む。 The present invention also provides a scene extraction method executed by a scene extraction system that extracts scenes from a video, wherein:
the scene extraction system includes a storage unit, a division unit, a similarity calculation unit, a preferred scene extraction unit, and a similar scene extraction unit, and the method includes:
a step in which the division unit divides the video into scenes;
a step in which the storage unit stores, for each user and in association with a user ID, a video importance level for video-related features, a sound importance level for sound-related features, and a speech importance level for speech-related features in the split scenes;
a step in which the similarity calculation unit calculates a video similarity, a sound similarity, and a speech similarity between the split scenes based on a video feature, a sound feature, and a speech feature of the split scenes;
a step in which the preferred scene extraction unit extracts, from a plurality of split scenes, preferred scenes that the user prefers based on a viewing history of the user; and
a step in which the similar scene extraction unit extracts similar scenes from the split scenes based on the preferred scenes, the video similarity, the sound similarity, the speech similarity, and the importance levels linked to the user ID.

また、本発明は、動画からシーンを抽出するシーン抽出プログラムであって、
コンピュータを、記憶部、分割部、類似度算出部、嗜好シーン抽出部、類似シーン抽出部、として機能させ、
前記分割部は、前記動画をシーンごとに分割し、
前記記憶部は、前記分割された分割シーンにおける、ユーザごとの、映像に関する特徴に対する映像重視度、音に関する特徴に対する音重視度、発話に関する特徴に対する発話重視度、をユーザＩＤに紐づけて格納し、
前記類似度算出部は、前記分割された分割シーンの映像特徴量、音特徴量、発話特徴量、に基づいて、前記分割シーン同士の映像類似度、音類似度、発話類似度、を算出し、
前記嗜好シーン抽出部は、前記ユーザの視聴履歴に基づいて、複数の分割シーンから当該ユーザが嗜好する嗜好シーンを抽出し、
前記類似シーン抽出部は、前記嗜好シーン、前記映像類似度、前記音類似度、前記発話類似度、前記ユーザＩＤに紐づく重視度、に基づいて、前記分割シーンから類似シーンを抽出する。 The present invention also provides a scene extraction program for extracting scenes from a video, wherein:
the program causes a computer to function as a storage unit, a division unit, a similarity calculation unit, a preferred scene extraction unit, and a similar scene extraction unit;
the division unit divides the video into scenes;
the storage unit stores, for each user and in association with a user ID, a video importance level for video-related features, a sound importance level for sound-related features, and a speech importance level for speech-related features in the split scenes;
the similarity calculation unit calculates a video similarity, a sound similarity, and a speech similarity between the split scenes based on a video feature, a sound feature, and a speech feature of the split scenes;
the preferred scene extraction unit extracts, from a plurality of split scenes, preferred scenes that the user prefers based on a viewing history of the user; and
the similar scene extraction unit extracts similar scenes from the split scenes based on the preferred scenes, the video similarity, the sound similarity, the speech similarity, and the importance levels linked to the user ID.

このような構成にすることで、ユーザごとの重視する特徴に基づいて、ユーザの嗜好する嗜好シーンを抽出することが可能となり、ユーザの嗜好に合う類似シーンを抽出することができる。 This configuration makes it possible to extract scenes that are preferred by the user based on the features that each user values, and to extract similar scenes that match the user's preferences.
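The way per-user importance levels can change which scenes are extracted may be easier to see in a small sketch. The text above does not fix a scoring formula, so the weighted sum below is an assumption, as are the names `Importance`, `weighted_score`, and `extract_similar` and all the numeric values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Importance:
    """Per-user weights for the three feature types (hypothetical field names)."""
    video: float
    sound: float
    speech: float


def weighted_score(video_sim, sound_sim, speech_sim, w):
    """Assumed scoring rule: importance-weighted sum of the three similarities."""
    return w.video * video_sim + w.sound * sound_sim + w.speech * speech_sim


def extract_similar(preferred_id, sims, w, top_k=2):
    """Rank candidate split scenes by weighted similarity to one preferred scene.

    sims maps (preferred_id, candidate_id) -> (video, sound, speech) similarities.
    """
    scored = [
        (cand, weighted_score(*triple, w))
        for (pref, cand), triple in sims.items()
        if pref == preferred_id and cand != preferred_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [cand for cand, _ in scored[:top_k]]


# Illustrative similarities between preferred scene M1_C1 and three candidates.
sims = {
    ("M1_C1", "M1_C2"): (0.9, 0.2, 0.1),
    ("M1_C1", "M1_C3"): (0.2, 0.9, 0.3),
    ("M1_C1", "M2_C1"): (0.1, 0.3, 0.9),
}
video_fan = Importance(video=0.8, sound=0.1, speech=0.1)
speech_fan = Importance(video=0.1, sound=0.1, speech=0.8)

print(extract_similar("M1_C1", sims, video_fan, top_k=1))   # ['M1_C2']
print(extract_similar("M1_C1", sims, speech_fan, top_k=1))  # ['M2_C1']
```

With identical similarity data, the two users receive different similar scenes solely because their stored importance levels differ.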

本発明の好ましい形態では、前記シーン抽出システムは、変化重視度作成部、を備え、
前記変化重視度作成部は、前記重視度及び所定条件に基づいて、当該重視度の全部又は一部を変化させた変化重視度を作成し、
前記類似シーン抽出部は、前記嗜好シーン、前記映像類似度、前記音類似度、発話類似度、前記変化重視度、に基づいて、前記分割シーンから変化重視度類似シーンを抽出する。 In a preferred embodiment of the present invention, the scene extraction system includes a change importance level creation unit,
the change importance level creation unit creates a change importance level by changing all or part of the importance level based on the importance level and a predetermined condition; and
the similar scene extraction unit extracts change importance similar scenes from the split scenes based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and the change importance level.

このような構成にすることで、ユーザＩＤに紐づく重視度を変化させた変化重視度に基づく類似シーンである変化重視度類似シーンを抽出することが可能となり、変化重視度及びユーザＩＤに紐づく重視度を用いてユーザの嗜好に合う重視度を探ることができる。 This configuration makes it possible to extract change importance similar scenes, which are similar scenes based on the change importance obtained by changing the importance associated with the user ID, and it is possible to find an importance that matches the user's preferences using the change importance and the importance associated with the user ID.
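As a rough illustration of creating a change importance level, the sketch below shifts one weight and renormalizes. The text only says that all or part of the importance level is changed under a predetermined condition, so the shift-and-renormalize rule, the function name `make_changed_importance`, and the dictionary keys are assumptions.

```python
def make_changed_importance(importance, feature, delta):
    """Return new weights with one feature shifted by delta, then renormalized to sum to 1.

    The input mapping is not modified. Shifting a single weight is an assumed
    stand-in for the "predetermined condition" mentioned in the text.
    """
    changed = dict(importance)
    changed[feature] = max(0.0, changed[feature] + delta)  # keep weights non-negative
    total = sum(changed.values())
    if total == 0:
        raise ValueError("all weights are zero after the change")
    return {name: value / total for name, value in changed.items()}


base = {"video": 0.6, "sound": 0.3, "speech": 0.1}
variant = make_changed_importance(base, "speech", 0.2)
print(variant)  # speech now carries about a quarter of the total weight
```

Scenes extracted with `variant` instead of `base` would play the role of the change importance similar scenes.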

本発明の好ましい形態では、前記シーン抽出システムは、重視度更新部、を備え、
前記重視度更新部は、前記変化重視度類似シーンに関するユーザの視聴履歴に基づいて、当該変化重視度類似シーンに関連する変化重視度を当該ユーザの重視度としてユーザＩＤに紐づけて更新する。 In a preferred embodiment of the present invention, the scene extraction system includes an importance updating unit,
Based on the user's viewing history for a change importance similar scene, the importance level update unit updates the user's importance level, stored in association with the user ID, to the change importance level associated with that scene.

このような構成にすることで、様々な重視度による類似シーンの視聴履歴を用いることが可能となり、ユーザの嗜好に合う重視度に更新することができる。 This configuration makes it possible to use viewing histories of similar scenes with various levels of importance, and update the level of importance to match the user's preferences.
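One possible update rule can be sketched as follows. The text does not say how the viewing history is evaluated, so the use of watch ratios (fraction of each presented scene actually watched), the `margin` threshold, and the name `update_importance` are all assumptions made for illustration.

```python
from statistics import mean


def update_importance(current, changed, base_watch_ratios, changed_watch_ratios, margin=0.05):
    """Adopt the changed weights when scenes chosen with them were watched clearly more.

    base_watch_ratios: watch ratios for scenes extracted with the current weights.
    changed_watch_ratios: watch ratios for scenes extracted with the changed weights.
    """
    if not base_watch_ratios or not changed_watch_ratios:
        return current  # not enough history to compare
    if mean(changed_watch_ratios) > mean(base_watch_ratios) + margin:
        return changed
    return current


current = {"video": 0.6, "sound": 0.3, "speech": 0.1}
changed = {"video": 0.5, "sound": 0.25, "speech": 0.25}

# The user watched the variant-selected scenes much further through.
new_weights = update_importance(current, changed, [0.4, 0.5], [0.9, 0.8])
print(new_weights is changed)  # True
```

The margin keeps small, noisy differences in viewing behaviour from flipping the stored weights back and forth.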

本発明の好ましい形態では、前記シーン抽出システムは、提示部、を備え、
前記提示部は、前記類似シーンと比べて低い割合の変化重視度類似シーンを提示する。 In a preferred embodiment of the present invention, the scene extraction system includes a presentation unit,
The presentation unit presents change importance similar scenes in a lower proportion than the similar scenes.

このような構成にすることで、類似シーンと少量の変化重視度類似シーンをユーザに対して同時に提示することが可能となり、それぞれの視聴履歴を比較してユーザの嗜好に合う重視度を探ることができる。 This configuration makes it possible to present similar scenes together with a small number of change importance similar scenes to the user, and to search for an importance level that suits the user's preferences by comparing the viewing histories of the two groups.
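A minimal sketch of such a mixed presentation is shown below. The `ratio` parameter, the source tags `"base"` and `"changed"`, and the function name `build_presentation` are hypothetical; the text only requires that the change importance similar scenes make up the smaller share.

```python
def build_presentation(similar, changed_similar, ratio=0.25):
    """Append a small number of change importance similar scenes to the similar scenes.

    ratio is the number of variant scenes relative to the number of similar scenes;
    keeping it below 1 guarantees the variant scenes stay in the minority.
    Each entry is tagged with its source so later viewing history can be compared.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    seen = set(similar)
    extras = [scene for scene in changed_similar if scene not in seen]  # skip duplicates
    n_changed = min(len(extras), int(len(similar) * ratio))
    tagged = [(scene, "base") for scene in similar]
    tagged += [(scene, "changed") for scene in extras[:n_changed]]
    return tagged


similar = ["M1_C2", "M1_C3", "M2_C1", "M2_C4"]
changed_similar = ["M1_C3", "M3_C1", "M3_C2"]
for scene_id, source in build_presentation(similar, changed_similar):
    print(scene_id, source)
```

Here four similar scenes are shown with one variant scene (`M3_C1`); `M1_C3` is skipped because it is already in the base list.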

本発明の好ましい形態では、前記シーン抽出システムは、ダイジェスト動画作成部、を備え、
前記ダイジェスト動画作成部は、複数の前記類似シーンを用いて、ダイジェスト動画を作成する。 In a preferred embodiment of the present invention, the scene extraction system includes a digest movie creation unit,
The digest movie creation unit creates a digest movie using a plurality of the similar scenes.

このような構成にすることで、複数の類似シーンを用いるダイジェスト動画を作成することが可能となり、ユーザに対して追加のコンテンツを提供することができる。 This configuration makes it possible to create a digest video that uses multiple similar scenes, providing additional content to users.

本発明の好ましい形態では、前記シーン抽出システムは、フレーム抽出部、フレーム類似度算出部、を備え、
前記フレーム抽出部は、複数の前記類似シーンの最初と最後のフレームを抽出し、
前記フレーム類似度算出部は、前記抽出した最初のフレームと最後のフレームのフレーム類似度を算出し、
前記ダイジェスト動画作成部は、前記フレーム類似度に基づいて、前記ダイジェスト動画を作成する。 In a preferred embodiment of the present invention, the scene extraction system includes a frame extraction unit and a frame similarity calculation unit,
The frame extraction unit extracts the first and last frames of the plurality of similar scenes;
the frame similarity calculation unit calculates frame similarity between the extracted first frame and last frame;
The digest movie creation section creates the digest movie based on the frame similarities.

このような構成にすることで、類似シーンの最初のフレームと最後のフレームの類似度に基づいてダイジェスト動画を作成することが可能となり、類似シーンごとの切れ目をユーザに感じさせないダイジェスト動画を作成することができる。 This configuration makes it possible to create a digest video based on the similarity between the first and last frames of similar scenes, making it possible to create a digest video in which the user does not notice the gaps between similar scenes.

本発明の好ましい形態では、前記シーン抽出システムは、嗜好スコア算出部、を備え、
前記嗜好スコア算出部は、前記嗜好シーン、前記映像類似度、前記音類似度、前記発話類似度、前記重視度、に基づいて、嗜好スコアを算出し、
前記ダイジェスト動画作成部は、前記類似シーンにおいて前記嗜好スコアが最も高いものを前記ダイジェスト動画の先頭とし、前記先頭の類似シーンを除く類似シーンの最初のフレームの中で、前記先頭の類似シーンの最後のフレームとのフレーム類似度が最大である類似シーンを前記先頭の類似シーンの次のシーンとして前記ダイジェスト動画を作成する。 In a preferred embodiment of the present invention, the scene extraction system includes a preference score calculation unit,
the preference score calculation unit calculates a preference score based on the preference scene, the video similarity, the sound similarity, the speech similarity, and the importance level;
The digest movie creation unit creates the digest movie by placing the similar scene with the highest preference score at the head of the digest movie and then, among the remaining similar scenes, placing next the scene whose first frame has the highest frame similarity to the last frame of the head scene.

このような構成にすることで、ダイジェスト動画の先頭をユーザの嗜好に最も合う類似シーンとすることが可能となり、ユーザの興味を引くダイジェスト動画を作成することができる。 By configuring it in this way, it is possible to start the digest video with a similar scene that best suits the user's preferences, making it possible to create a digest video that will capture the user's interest.
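The ordering rule can be sketched as a greedy chain. The text specifies only the head scene and the scene that follows it; repeating the same rule for every later position is an assumption here, as are the frame vectors, cosine similarity as the frame similarity measure, and the names `cosine` and `order_digest`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def order_digest(scenes):
    """Order similar scenes for a digest.

    scenes maps scene_id -> {"score": preference score,
                             "first": first-frame feature vector,
                             "last": last-frame feature vector}.
    """
    if not scenes:
        return []
    remaining = dict(scenes)
    # Head of the digest: the scene with the highest preference score.
    head = max(remaining, key=lambda sid: remaining[sid]["score"])
    order = [head]
    current = remaining.pop(head)
    # Each next scene: the one whose first frame best matches the current last frame.
    while remaining:
        nxt = max(remaining, key=lambda sid: cosine(current["last"], remaining[sid]["first"]))
        order.append(nxt)
        current = remaining.pop(nxt)
    return order


scenes = {
    "A": {"score": 0.9, "first": [1.0, 0.0], "last": [0.0, 1.0]},
    "B": {"score": 0.5, "first": [1.0, 0.1], "last": [1.0, 0.0]},
    "C": {"score": 0.7, "first": [0.1, 1.0], "last": [1.0, 0.2]},
}
print(order_digest(scenes))  # ['A', 'C', 'B']
```

Scene A leads because of its score; C follows because its first frame is closest to A's last frame, which is what smooths the cut between scenes.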

本発明によれば、ユーザの嗜好に合うシーンを抽出する新たな技術を提供することができる。 The present invention provides a new technology for extracting scenes that match a user's preferences.

本実施形態におけるシーン抽出システムの構成を示すブロック図。 FIG. 1 is a block diagram showing the configuration of a scene extraction system according to the present embodiment.
本実施形態におけるハードウェア構成図。 FIG. 2 is a hardware configuration diagram according to the present embodiment.
本実施形態における記憶部に格納されたデータ構成の一例。 FIG. 3 shows an example of a data configuration stored in the storage unit in the present embodiment.
本実施形態における記憶部に格納されたデータ構成の一例。 FIG. 4 shows an example of a data configuration stored in the storage unit in the present embodiment.
本実施形態におけるシーン抽出処理のフローチャート。 FIG. 5 is a flowchart of the scene extraction process in the present embodiment.
本実施形態における重視度更新処理のフローチャート。 FIG. 6 is a flowchart of the importance level update process in the present embodiment.
本実施形態におけるダイジェスト動画作成処理のフローチャート。 FIG. 7 is a flowchart of the digest movie creation process in the present embodiment.

以下、図面を用いて、本発明のシーン抽出システムについて説明する。図面には好ましい実施形態が示されている。しかし、本発明は多くの異なる形態で実施されることが可能であり、本明細書に記載される実施形態に限定されない。 The scene extraction system of the present invention will be described below with reference to the drawings. The drawings show a preferred embodiment. However, the present invention can be embodied in many different forms and is not limited to the embodiment described in this specification.

例えば、本実施形態ではシーン抽出システムの構成、動作等について説明するが、実行される方法（ステップ）、装置、コンピュータプログラム等によっても、同様の作用効果を奏することができる。本実施形態におけるプログラムは、コンピュータが読み取り可能な非一過性の記録媒体として提供されても良いし、外部のサーバからダウンロード可能に提供されても良いし、クライアント端末でその機能を実施するために外部のコンピュータにおいて当該プログラムを起動させても良い（いわゆるクラウドコンピューティング）。 For example, this embodiment describes the configuration, operation, and so on of the scene extraction system, but similar effects can also be achieved by a method (steps) to be executed, a device, a computer program, and the like. The program in this embodiment may be provided on a non-transitory computer-readable recording medium, may be provided so as to be downloadable from an external server, or may be started on an external computer so that its functions are carried out on a client terminal (so-called cloud computing).

また、本実施形態において「部」とは、例えば、広義の回路によって実施されるハードウェア資源と、これらハードウェア資源によって具体的に実現され得るソフトウェアの情報処理とを合わせたものも含み得る。本実施形態において「情報」とは、例えば電圧・電流を表す信号値の物理的な値、０又は１で構成される２進数のビット集合体としての信号値の高低、又は量子的な重ね合わせ（いわゆる量子ビット）によって表され、広義の回路上で通信・演算が実行され得る。 In addition, in this embodiment, a "unit" may include, for example, a combination of hardware resources implemented by a circuit in the broad sense and software information processing that can be specifically realized by these hardware resources. In this embodiment, "information" is represented, for example, by the physical value of a signal value representing a voltage or current, the high or low value of a signal value as a binary bit collection consisting of 0 or 1, or a quantum superposition (so-called quantum bit), and communication and calculations can be performed on a circuit in the broad sense.

広義の回路とは、回路（Circuit）、回路類（Circuitry）、プロセッサ（Processor）及びメモリ（Memory）等を適宜組み合わせることによって実現される回路である。即ち、ＣＰＵ（Central Processing Unit）、ＧＰＵ（Graphics Processing Unit）、ＬＳＩ（Large Scale Integration）、ＡＳＩＣ（Application Specific Integrated Circuit）、ＦＰＧＡ（Field-Programmable Gate Array）等を含むものである。 In the broad sense, a circuit is a circuit that is realized by appropriately combining a circuit, circuitry, processor, memory, etc. In other words, it includes a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), etc.

＜システム概要＞
図１は、本実施形態におけるシーン抽出システムの構成を示すブロック図である。図１に示すように、シーン抽出システム０は、シーン抽出装置１、ユーザ端末２、を備える。シーン抽出装置１は、ネットワークＮＷを介してユーザ端末２と通信可能に構成される。 <System Overview>
Fig. 1 is a block diagram showing the configuration of a scene extraction system according to the present embodiment. As shown in Fig. 1, the scene extraction system 0 includes a scene extraction device 1 and a user terminal 2. The scene extraction device 1 is configured to be able to communicate with the user terminal 2 via a network NW.

シーン抽出装置１は、動画、シーン、ユーザ、に関する情報等に基づいて、動画からユーザの嗜好に合うシーンを抽出する。シーン抽出装置１は、１つの動画から抽出したユーザの嗜好シーンに基づいて、その１つの動画中の別の分割シーンから類似シーンを抽出しても良い。また、シーン抽出装置１は、抽出した嗜好シーンに基づいて、複数の動画の複数の分割シーンから類似シーンを抽出しても良い。 The scene extraction device 1 extracts scenes that match the preferences of a user from a video, based on information about the video, the scene, the user, etc. The scene extraction device 1 may extract a similar scene from another split scene in one video, based on a user's preferred scene extracted from the one video. The scene extraction device 1 may also extract similar scenes from multiple split scenes of multiple videos, based on the extracted preferred scene.

シーン抽出装置１としては、汎用のサーバ向けのコンピュータやパーソナルコンピュータ等を利用することが可能である。また、複数のコンピュータを用いてシーン抽出装置１を構成することも可能である。 A general-purpose server computer, a personal computer, or the like can be used as the scene extraction device 1. It is also possible to configure the scene extraction device 1 using multiple computers.

ユーザは、ユーザ端末２を介して、動画やシーンを視聴する。さらにユーザは、ユーザ端末２を介して、重視度等を入力してシーン抽出装置１に送信しても良い。ユーザ端末２としては、スマートフォンやタブレット端末、パーソナルコンピュータ等の端末装置を利用することができる。 The user watches videos and scenes via the user terminal 2. Furthermore, the user may input the importance level and the like via the user terminal 2 and transmit it to the scene extraction device 1. As the user terminal 2, a terminal device such as a smartphone, tablet terminal, or personal computer can be used.

ネットワークＮＷは、本実施形態では、ＩＰ（Internet Protocol）ネットワークであるが、通信プロトコルの種類に制限はなく、更に、ネットワークの種類、規模にも制限はない。 In this embodiment, the network NW is an IP (Internet Protocol) network, but there are no restrictions on the type of communication protocol, and there are also no restrictions on the type or size of the network.

＜ハードウェア構成＞
図２は、ハードウェア構成図である。図２（ａ）に示すように、情報処理装置１０（シーン抽出装置１）は、制御部１０１、記憶部１０２、及び通信部１０３を有し、各部及び各工程の作用発揮に用いられる。 <Hardware Configuration>
Fig. 2 is a hardware configuration diagram. As shown in Fig. 2(a), the information processing device 10 (scene extraction device 1) has a control unit 101, a storage unit 102, and a communication unit 103, which are used to perform the functions of each unit and each process.

制御部１０１は、ＣＰＵ（Central Processing Unit）等の１又は２以上のプロセッサを含み、本発明に係るシーン抽出プログラム、ＯＳ（Operating System）やブラウザソフト、その他のアプリケーションを実行することで、情報処理装置１０の動作処理全体を制御する。 The control unit 101 includes one or more processors such as a CPU (Central Processing Unit), and controls the overall operation and processing of the information processing device 10 by executing the scene extraction program according to the present invention, an OS (Operating System), browser software, and other applications.

記憶部１０２は、ＨＤＤ（Hard Disk Drive）、ＳＳＤ（Solid State Drive）、ＲＯＭ（Read Only Memory）、ＲＡＭ（Random Access Memory）等であって、本発明に係るシーン抽出プログラム及び、制御部１０１がプログラムに基づき処理を実行する際に利用するデータ等を記憶する。制御部１０１が、記憶部１０２に記憶されているシーン抽出プログラムに基づき処理を実行することによって、後述する機能構成が実現される。 The storage unit 102 is a hard disk drive (HDD), a solid state drive (SSD), a read only memory (ROM), a random access memory (RAM), etc., and stores the scene extraction program according to the present invention and data used when the control unit 101 executes processing based on the program. The control unit 101 executes processing based on the scene extraction program stored in the storage unit 102, thereby realizing the functional configuration described below.

通信部１０３は、ネットワークＮＷとの通信制御を実行して、情報処理装置１０を動作させるために必要な入力や、動作結果に係る出力を行う。 The communication unit 103 controls communication with the network NW, and performs input necessary to operate the information processing device 10 and outputs related to the results of the operation.

図２（ｂ）のように、端末装置９（ユーザ端末２）は、制御部９１、記憶部９２、通信部９３、入力部９４、及び出力部９５を有し、各部及び各工程の作用発揮に用いられる。 As shown in Fig. 2(b), the terminal device 9 (user terminal 2) has a control unit 91, a storage unit 92, a communication unit 93, an input unit 94, and an output unit 95, which are used to perform the functions of each unit and each process.

端末装置９の制御部９１は、ＣＰＵ等の１以上のプロセッサを含み、端末装置９の動作処理全体を制御する。端末装置９の記憶部９２は、ＨＤＤ、ＳＳＤ、ＲＯＭ、ＲＡＭ等であって、上述のアプリケーション及び、制御部９１がプログラムに基づき処理を実行する際に利用するデータ等を記憶する。 The control unit 91 of the terminal device 9 includes one or more processors such as a CPU, and controls the overall operation and processing of the terminal device 9. The storage unit 92 of the terminal device 9 is an HDD, SSD, ROM, RAM, etc., and stores the above-mentioned applications and data used when the control unit 91 executes processing based on a program.

端末装置９の通信部９３は、ネットワークＮＷとの通信を制御する。端末装置９の入力部９４は、マウス及びキーボード等であって、利用者／提供者による操作要求を制御部９１に入力する。端末装置９の出力部９５は、ディスプレイ等であって、制御部９１の処理の結果等を表示する。 The communication unit 93 of the terminal device 9 controls communication with the network NW. The input unit 94 of the terminal device 9 is a mouse, keyboard, or the like, and inputs operation requests from the user/provider to the control unit 91. The output unit 95 of the terminal device 9 is a display, or the like, and displays the results of processing by the control unit 91, etc.

＜機能構成要素＞
図２に示すように、シーン抽出装置１は、分割部１１、特徴量生成部１２、類似度算出部１３、嗜好シーン抽出部１４、嗜好スコア算出部１５、類似シーン抽出部１６、提示部１７、変化重視度作成部１８、重視度更新部１９、フレーム抽出部１ａ、フレーム類似度算出部１ｂ、ダイジェスト動画作成部１ｃを備える。 <Functional components>
As shown in FIG. 2, the scene extraction device 1 includes a division unit 11, a feature generation unit 12, a similarity calculation unit 13, a preferred scene extraction unit 14, a preference score calculation unit 15, a similar scene extraction unit 16, a presentation unit 17, a change importance level creation unit 18, an importance level update unit 19, a frame extraction unit 1a, a frame similarity calculation unit 1b, and a digest movie creation unit 1c.

これら機能構成要素の配置は一例であり、シーン抽出装置１の備えた機能構成の一部が、シーン抽出装置１やユーザ端末２と通信可能に構成された１又は複数の装置に配置されても良い。 The arrangement of these functional components is an example, and some of the functional components of the scene extraction device 1 may be arranged in one or more devices configured to be able to communicate with the scene extraction device 1 and the user terminal 2.

＜データ構成＞
図３及び４は、本実施形態における記憶部に格納されたデータ構成の一例である。シーン抽出装置１の記憶部は、分割シーン情報、映像類似度情報、音類似度情報、発話類似度情報、視聴履歴情報、重視度情報、類似度スコア情報、嗜好スコア情報、類似シーン情報、類似シーン類似度情報を格納する。 <Data structure>
Figs. 3 and 4 are examples of data configurations stored in the storage unit in this embodiment. The storage unit of the scene extraction device 1 stores split scene information, video similarity information, sound similarity information, speech similarity information, viewing history information, importance information, similarity score information, preference score information, similar scene information, and similar scene similarity information.

各データの配置も一例であり、シーン抽出装置１の記憶部に格納されたデータの一部又は全部が、シーン抽出装置１やユーザ端末２と通信可能に構成された１又は複数の装置に格納されても良い。 The arrangement of each piece of data is also an example, and some or all of the data stored in the storage unit of the scene extraction device 1 may be stored in one or more devices configured to be able to communicate with the scene extraction device 1 and the user terminal 2.

＜分割シーンの作成＞
図５は、本実施形態におけるシーン抽出処理のフローチャートである。まず、ステップＳ５０１において、分割部１１は、動画をシーンごとに分割して分割シーン（例えば、チャプターごと、野球の打者ごと、サッカーの試合のシュートとそれ以外のシーン等）を作成する。分割部１１が分割する動画の数は１つでも複数でも良い。本実施形態において分割部１１は、管理している（記憶部に格納した、ユーザ端末を介して受け付けた）すべての動画をシーンごとに分割する。 <Creating split scenes>
Fig. 5 is a flowchart of the scene extraction process in this embodiment. First, in step S501, the division unit 11 divides a video into scenes to create split scenes (for example, by chapter, by batter in baseball, or into shots on goal and other scenes in a soccer game). The number of videos divided by the division unit 11 may be one or more. In this embodiment, the division unit 11 divides all videos that it manages (stored in the storage unit or received via a user terminal) into scenes.

分割部１１は、一定時間（例えば、５分）ごとの等間隔の時間で動画を分割しても良いし、動画の映像から変化点を検出して自動で分割しても良い。また、分割部１１は、動画の映像を時間方向にクラスタリングを行うことによって類似する映像を１クラスタとして分割しても良いし、動画に対して分割するためのタグ（例えば、チャプター）が付与されている場合はそのタグによって分割しても良い。 The division unit 11 may divide the video at equal fixed intervals (e.g., every 5 minutes), or may detect change points in the video footage and divide it automatically. The division unit 11 may also cluster the footage along the time axis and split the video so that similar footage forms one cluster, or, if the video carries tags for division (e.g., chapters), divide it according to those tags.

分割部１１は、分割したシーンが何れの動画から分割されたものか、及び、分割元の動画の何れの区間のものか等に関する分割シーン情報を記憶部に格納する。分割シーン情報は、図３（ａ）のように、動画ＩＤ、分割シーンＩＤ、動画区間、を含む。これによってシーン抽出装置１は、例えば、分割シーンＩＤ「Ｍ１＿Ｃ２」が動画ＩＤ「Ｍ１」の動画の分割シーンであって、その動画の１０～２０分の区間のシーンであることを、参照することができる。 The division unit 11 stores in the storage unit split scene information relating to which video the split scene was split from and which section of the original video the scene belongs to. As shown in FIG. 3(a), the split scene information includes a video ID, a split scene ID, and a video section. This allows the scene extraction device 1 to refer to, for example, that split scene ID "M1_C2" is a split scene of a video with video ID "M1", and is a scene from the 10-20 minute section of that video.
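The fixed-interval option and the split scene information record can be sketched together. Only the three fields (video ID, split scene ID, video section) and the ID pattern `M1_C2` come from the text; the class name `SplitScene`, the function `split_fixed`, and the use of minutes as the unit are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitScene:
    """One row of split scene information: source video, scene ID, and section."""
    video_id: str
    scene_id: str
    start_min: float
    end_min: float


def split_fixed(video_id, duration_min, interval_min):
    """Divide a video of the given length into consecutive fixed-length scenes.

    The final scene is shorter when the duration is not a multiple of the interval.
    """
    if interval_min <= 0:
        raise ValueError("interval_min must be positive")
    scenes = []
    start = 0.0
    index = 1
    while start < duration_min:
        end = min(start + interval_min, duration_min)
        scenes.append(SplitScene(video_id, f"{video_id}_C{index}", start, end))
        start = end
        index += 1
    return scenes


for scene in split_fixed("M1", duration_min=25, interval_min=10):
    print(scene)
```

With a 10-minute interval, the second record is `M1_C2` covering minutes 10 to 20, matching the example in the text.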

＜特徴量の生成＞
ステップＳ５０２において、特徴量生成部１２は、分割部１１が作成した分割シーンの映像特徴量、音特徴量、発話特徴量を生成する。例えば、動画又はシーンを複数の要素に分解し、それぞれの要素に関する特徴量を生成することができる。本実施形態では、動画又はシーンを映像、音、発話の内容（会話の意味）、の３つの要素に分解し、特徴量生成部１２は、それぞれの特徴量を映像特徴量、音特徴量、発話特徴量として生成する。 <Feature Generation>
In step S502, the feature generator 12 generates a video feature, a sound feature, and a speech feature of the split scene created by the splitter 11. For example, a video or a scene can be decomposed into a plurality of elements, and a feature for each element can be generated. In this embodiment, a video or a scene is decomposed into three elements, namely, video, sound, and speech content (meaning of the conversation), and the feature generator 12 generates each feature as a video feature, a sound feature, and a speech feature.

本実施形態で映像特徴量は、動画又はシーンに現れるものやその動き等（例えば、物体の色、形、大きさ、動き、位置関係等）に関する特徴量である。本実施形態で音特徴量は、動画又はシーンにおける音（例えば、大きさ、速さ、周波数、波長等）に関する特徴量である。本実施形態で発話特徴量は、動画又はシーンにおける発話（音声）の内容（人間が理解可能な意味のある内容）に関する特徴量であって、例えば、人間の発話を抽出して自然言語処理をして生成する。 In this embodiment, video features are features related to things that appear in a video or scene and their movements (e.g., the color, shape, size, movement, and positional relationship of an object). In this embodiment, sound features are features related to sounds in a video or scene (e.g., volume, speed, frequency, wavelength, etc.). In this embodiment, speech features are features related to the content of speech (audio) in a video or scene (meaningful content that can be understood by humans), and are generated, for example, by extracting human speech and performing natural language processing.

特徴量生成部１２は、分割シーンが含む映像（動きのあるもの、動きのないもの）や音（音声、非音声）をベクトル化することによって、特徴量を生成することが考えられる。特徴量生成部１２は、図３（ａ）のように、生成した特徴量を分割シーンＩＤに紐づけて記憶部に格納する。 The feature generator 12 may generate features by vectorizing the video (moving and still) and sound (audio and non-audio) contained in the split scene. As shown in FIG. 3(a), the feature generator 12 associates the generated features with the split scene ID and stores them in the storage unit.

例えば、特徴量生成部１２は、分割シーンをフレーム（静止画）ごとに分解して画像ファイル群を作成することによって、映像特徴量を生成する。 For example, the feature generator 12 generates video features by breaking down the split scenes into frames (still images) and creating a group of image files.

画像ファイル群からの映像特徴量生成は、独自に学習されたエンコード用のモデルを用いても良いし、オープンソースソフトウェアとして公開されている学習済モデルを用いても良い。公開されているものとしては、例えばＶｉｓｉｏｎＴｒａｎｓｆｏｒｍｅｒベースのモデルを用いても良く、４００次元の特徴量を生成することができる。 To generate video features from a group of image files, a proprietary trained encoding model may be used, or a trained model published as open source software may be used. For example, a publicly available model based on Vision Transformer may be used, which can generate 400-dimensional features.

また、画像の模様を自然言語で説明するような学習済モデルも公開されており、画像から抽出される自然言語の情報をエンベディングする（埋め込む）ことで特徴量を生成することもできる。自然言語の情報をエンベディングする（埋め込む）モデルも独自に学習しても良いし、公開されている学習済モデルを用いても良い。公開されているものとしては、例えばＯｐｅｎＡＩ社がＡＰＩ（Application Programming Interface）として公開しているエンコードを適用しても良く、この場合は１５３６次元の特徴量を生成することができる。 Additionally, trained models that explain the patterns in images in natural language are also publicly available, and features can also be generated by embedding natural language information extracted from images. A model that embeds natural language information can be trained independently, or a publicly available trained model can be used. For example, one publicly available encoding published by OpenAI as an API (Application Programming Interface) can be applied, in which case 1,536-dimensional features can be generated.

例えば、特徴量生成部１２は、分割シーンから音に関する音情報のみを抽出して音ファイルを作成することによって、音特徴量を生成する。抽出する音情報は、人間の音声を含んでいても良いし、人間の音声を含んでいなくても良い。 For example, the feature generator 12 generates sound features by extracting only sound information related to sound from the split scenes and creating a sound file. The extracted sound information may or may not include human voice.

音ファイルからの特徴量生成は、独自に学習されたエンコード用のモデルを用いても良いし、オープンソースソフトウェアとして公開されている学習済モデルを用いても良い。公開されているものとしては、例えばＶｉｓｉｏｎＴｒａｎｓｆｏｒｍｅｒベースのモデルを用いても良く、３８３２３２次元の特徴量を生成することができる。 To generate features from audio files, a proprietary trained encoding model may be used, or a trained model published as open source software may be used. For example, a publicly available model based on the Vision Transformer may be used, which can generate 383232-dimensional features.

例えば、特徴量生成部１２は、分割シーンから人間の発話（音声）に関する発話情報を抽出して発話特徴量を生成する。特徴量生成部１２は、分割シーンから抽出した音声情報を用いて、音声認識を行い自然言語に変換することで発話言語ファイルを作成することが考えられる。 For example, the feature generation unit 12 extracts speech information related to human speech (voice) from the split scenes to generate speech features. The feature generation unit 12 may use the voice information extracted from the split scenes to perform voice recognition and convert it into natural language to create a speech language file.

発話言語ファイルからの特徴量は、独自に学習されたエンコード用のモデルを用いても良いし、オープンソースソフトウェアとして公開されている学習済モデルを用いても良い。公開されているものとしては、例えばＯｐｅｎＡＩ社がＡＰＩとして公開しているエンコードを適用しても良く、この場合は１５３６次元の特徴量を生成することができる。 The features from the spoken language file may use a proprietary trained encoding model, or a trained model published as open source software. For example, an encoding model published as an API by OpenAI may be used, in which case 1536-dimensional features can be generated.

＜類似度の算出＞ <Calculation of similarity>
ステップＳ５０３において、類似度算出部１３は、分割シーン同士の類似度を算出する。類似度算出部１３は、分割された分割シーンの映像特徴量、音特徴量、発話特徴量、に基づいて、分割シーン同士の映像類似度、音類似度、発話類似度を算出する。 In step S503, the similarity calculation unit 13 calculates the similarity between the split scenes. The similarity calculation unit 13 calculates the video similarity, sound similarity, and speech similarity between the split scenes based on the video feature amount, sound feature amount, and speech feature amount of the divided split scenes.

類似度算出部１３は、すべての分割シーン同士のそれぞれの特徴量の類似度（映像類似度、音類似度、発話類似度）を算出し、図３（ｂ）～（ｄ）のように、記憶部に格納することが考えられる。分割シーンがＮ個存在する場合、類似度算出部１３は、Ｎ×Ｎ回の計算をそれぞれの特徴量の類似度を算出するために行う。類似度算出部１３は、ベクトル同士の類似度を算出するコサイン類似度によって、類似度を算出することが考えられる。 The similarity calculation unit 13 may calculate the similarity of each feature (video similarity, sound similarity, speech similarity) between all pairs of split scenes and store the results in the storage unit as shown in FIGS. 3(b) to 3(d). When there are N split scenes, the similarity calculation unit 13 performs N × N calculations for each type of feature. The similarity calculation unit 13 may calculate the similarity using cosine similarity, a measure of similarity between vectors.

また、類似度算出部１３は、同じ分割シーン同士の類似度（例えば、分割シーンＩＤ「Ｍ１＿Ｃ１」と「Ｍ１＿Ｃ１」の類似度）を算出してもしなくても良い。ユーザが嗜好する嗜好シーンに類似する類似シーンとして、ユーザが一度視聴したものを抽出しても良い場合、類似度算出部１３は、同じ分割シーン同士の類似度が高くなるように算出することが考えられる。一方、ユーザが一度視聴したものを抽出しない場合、類似度算出部１３は、同じ分割シーン同士の類似度が低くなるように算出、又は、図３（ｂ）～（ｄ）のように類似度を算出しないことが考えられる。 The similarity calculation unit 13 may or may not calculate the similarity between the same split scenes (for example, the similarity between split scene IDs "M1_C1" and "M1_C1"). If it is acceptable to extract scenes that the user has viewed once as similar scenes that are similar to the user's preferred scenes, the similarity calculation unit 13 may calculate the similarity between the same split scenes to be high. On the other hand, if it is not acceptable to extract scenes that the user has viewed once, the similarity calculation unit 13 may calculate the similarity between the same split scenes to be low, or may not calculate the similarity as in Figures 3(b) to (d).
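The pairwise calculation and the optional skipping of same-scene pairs can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the `include_self` flag, the handling of zero vectors, and the three-dimensional feature values are all assumptions made for the example.

```python
import math
from itertools import product


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def similarity_table(features, include_self=False):
    """Return {(scene_a, scene_b): similarity} over all ordered scene pairs."""
    table = {}
    for a, b in product(features, repeat=2):
        if a == b and not include_self:
            continue  # same-scene pairs are left uncomputed
        table[(a, b)] = cosine_similarity(features[a], features[b])
    return table


# Hypothetical video features for three split scenes.
video_features = {
    "M1_C1": [1.0, 0.0, 0.0],
    "M1_C2": [0.0, 1.0, 0.0],
    "M2_C1": [1.0, 1.0, 0.0],
}
video_similarity = similarity_table(video_features)
```

The same function could be applied separately to the sound features and the speech features to obtain the other two tables.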

＜嗜好シーンの抽出＞ <Extraction of preferred scenes>
ステップＳ５０４において、嗜好シーン抽出部１４は、ユーザの視聴履歴に基づいて、複数の分割シーンからユーザが嗜好する嗜好シーンを抽出する。嗜好シーン抽出部１４は、ユーザごとの分割シーンの視聴履歴に関する視聴履歴情報を記憶部に格納する。視聴履歴情報は、図３（ｅ）のように、ユーザＩＤ、分割シーンＩＤ、累積視聴時間、を含む。 In step S504, the preferred scene extraction unit 14 extracts the user's preferred scenes from the multiple split scenes based on the user's viewing history. The preferred scene extraction unit 14 stores viewing history information on each user's viewing of the split scenes in the storage unit. The viewing history information includes a user ID, a split scene ID, and a cumulative viewing time, as shown in FIG. 3(e).

嗜好シーン抽出部１４は、ユーザごとに、それぞれの分割シーンに対する視聴時間を集計し、累積視聴時間を算出する。これによってシーン抽出装置１は、例えば、ユーザＩＤ「Ｕ１」が過去に分割シーンＩＤ「Ｍ２＿Ｃ１」の分割シーンを合計２８０秒間視聴したことを、参照することができる。 The preferred scene extraction unit 14 counts the viewing time for each split scene for each user and calculates the cumulative viewing time. This allows the scene extraction device 1 to refer to, for example, the fact that user ID "U1" previously viewed the split scene with split scene ID "M2_C1" for a total of 280 seconds.

嗜好シーン抽出部１４は、累積視聴時間に基づいて、ユーザが嗜好する嗜好シーンを抽出する。例えば、嗜好シーン抽出部１４は、分割シーンの累積視聴時間が一定時間（閾値）を超えた場合、その分割シーンを嗜好シーンとして抽出しても良い。嗜好シーン抽出部１４は、嗜好シーンを抽出した場合、図３（ｅ）のように（その分割シーンＩＤに嗜好フラグ「１」を紐づけて）記憶部に格納しても良い。 The preferred scene extraction unit 14 extracts the user's preferred scenes based on the cumulative viewing time. For example, if the cumulative viewing time of a split scene exceeds a certain time (threshold), the preferred scene extraction unit 14 may extract that split scene as a preferred scene. When the preferred scene extraction unit 14 extracts a preferred scene, it may store it in the storage unit as shown in FIG. 3(e) (associating a preference flag "1" with the split scene ID).

また、嗜好シーン抽出部１４は、累積視聴時間がその分割シーンの再生時間に占める割合を用いて、嗜好シーンを抽出しても良い。分割シーンの累積視聴時間がそのシーンの再生時間の所定の割合（例えば、２分の１等）以上である場合、嗜好シーン抽出部１４は、その分割シーンを嗜好シーンとして抽出する。本実施形態ではこの割合を２分の１とするが、これに限られず管理者等が端末を介して割合を設定することができる。 The preferred scene extraction unit 14 may also extract preferred scenes using the proportion of the cumulative viewing time to the playback time of the split scene. If the cumulative viewing time of a split scene is equal to or greater than a predetermined proportion (e.g., half) of the playback time of the scene, the preferred scene extraction unit 14 extracts the split scene as a preferred scene. In this embodiment, this proportion is set to half, but is not limited to this and the administrator, etc., can set the proportion via the terminal.

例えば、分割シーンＩＤ「Ｍ１＿Ｃ１」は、再生時間が６００秒であるため、ユーザ「Ｕ１」の累積視聴時間が３００秒の場合、嗜好シーン抽出部１４は、分割シーンＩＤ「Ｍ１＿Ｃ１」をユーザ「Ｕ１」の嗜好シーンとして抽出する。一方、分割シーンＩＤ「Ｍ２＿Ｃ２」は、再生時間が６３０秒であるため、ユーザ「Ｕ１」の累積視聴時間が３００秒の場合、嗜好シーン抽出部１４は、分割シーンＩＤ「Ｍ２＿Ｃ２」をユーザ「Ｕ１」の嗜好シーンとして抽出しない。 For example, since the playback time of split scene ID "M1_C1" is 600 seconds, if the cumulative viewing time of user "U1" is 300 seconds, the preferred scene extraction unit 14 extracts split scene ID "M1_C1" as a preferred scene of user "U1". On the other hand, since the playback time of split scene ID "M2_C2" is 630 seconds, if the cumulative viewing time of user "U1" is 300 seconds, which is less than half of the playback time, the preferred scene extraction unit 14 does not extract split scene ID "M2_C2" as a preferred scene of user "U1".
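The ratio test in this example can be sketched as follows. The playback times and viewing times are the ones given above; the function name `is_preferred` and the `min_ratio` parameter are illustrative assumptions.

```python
def is_preferred(cumulative_seconds, playback_seconds, min_ratio=0.5):
    """True when cumulative viewing time is at least min_ratio of the playback time."""
    if playback_seconds <= 0:
        return False
    return cumulative_seconds / playback_seconds >= min_ratio


# Values from the example: playback times and user U1's cumulative viewing times.
playback = {"M1_C1": 600, "M2_C2": 630}
viewed_by_u1 = {"M1_C1": 300, "M2_C2": 300}

preferred = [
    scene_id
    for scene_id, seconds in viewed_by_u1.items()
    if is_preferred(seconds, playback[scene_id])
]
```

Here 300/600 reaches the one-half ratio while 300/630 does not, so only "M1_C1" is kept.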

この他にも、嗜好シーンの数量が限定されていても良い。例えば、嗜好シーン抽出部１４は、累積視聴時間が上位Ｍ個（Ｍ＝１、２、・・・）の分割シーンを嗜好シーンとして抽出しても良い。また、嗜好シーン抽出部１４は、累積視聴時間がその分割シーンの再生時間に占める割合が上位Ｍ個の分割シーンを嗜好シーンとして抽出しても良い。 In addition, the number of preferred scenes may be limited. For example, the preferred scene extraction unit 14 may extract the top M (M = 1, 2, ...) split scenes in terms of cumulative viewing time as preferred scenes. The preferred scene extraction unit 14 may also extract the top M split scenes in terms of the proportion of the cumulative viewing time to the playback time of the split scene as preferred scenes.
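A minimal sketch of the top-M selection, under stated assumptions: the function name `top_m_scenes` is hypothetical, and of the data below only the 280 seconds for "M2_C1" and the playback times of "M1_C1" and "M2_C2" appear in the description; the remaining numbers are invented for illustration.

```python
def top_m_scenes(cumulative_seconds, m, playback_seconds=None):
    """Return the m scene IDs ranked highest by viewing time, or by viewing ratio
    when playback_seconds is supplied."""
    def rank_key(scene_id):
        seconds = cumulative_seconds[scene_id]
        if playback_seconds is None:
            return seconds
        return seconds / playback_seconds[scene_id]

    return sorted(cumulative_seconds, key=rank_key, reverse=True)[:m]


# Partly hypothetical data for user U1.
viewed = {"M1_C1": 300, "M2_C1": 280, "M2_C2": 310}
playback = {"M1_C1": 600, "M2_C1": 400, "M2_C2": 630}

by_time = top_m_scenes(viewed, m=2)
by_ratio = top_m_scenes(viewed, m=2, playback_seconds=playback)
```

The two rankings can differ: a long scene watched for many seconds may rank high by time but low by ratio.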

嗜好シーン抽出部１４は、更に、分割シーンに紐づくジャンルを用いて嗜好シーンを抽出しても良い。記憶部が、動画ＩＤ又は分割シーンＩＤにジャンルを紐づけて格納することによって、嗜好シーン抽出部１４は、このジャンルを用いて嗜好シーンを抽出することが考えられる。 The preferred scene extraction unit 14 may further extract preferred scenes using a genre associated with the split scene. If the storage unit stores a genre in association with the video ID or split scene ID, the preferred scene extraction unit 14 can use this genre to extract preferred scenes.

例えば、ユーザ端末を介して、予めユーザの好みのジャンルを受け付け、記憶部がユーザＩＤに紐づけてそのジャンルを格納することによって、嗜好シーン抽出部１４は、ユーザの累積視聴時間及び受け付けたジャンルに基づいて、嗜好シーンを抽出することができる。 For example, the user's preferred genres are received in advance via the user terminal and stored in the storage unit in association with the user ID, so that the preferred scene extraction unit 14 can extract preferred scenes based on the user's cumulative viewing time and the received genres.

＜嗜好スコアの算出＞ <Calculation of preference score>
ステップＳ５０５において、嗜好スコア算出部１５は、嗜好シーン、映像類似度、音類似度、発話類似度、ユーザＩＤに紐づく重視度、に基づいて、嗜好スコアを算出する。記憶部は、図４（ｆ）のように、ユーザごとの映像重視度、音重視度、発話重視度、を格納する。それぞれのユーザの映像重視度、音重視度、発話重視度、の合計は１であっても良い。また、映像重視度、音重視度、発話重視度、の合計はユーザごとに異なっていても良い。ユーザ端末２を介して、嗜好スコア算出部１５がユーザごとの重視度を受け付けても良い。 In step S505, the preference score calculation unit 15 calculates a preference score based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and the importance associated with the user ID. The storage unit stores the video importance, the sound importance, and the speech importance for each user, as shown in FIG. 4(f). The sum of the video importance, the sound importance, and the speech importance for each user may be 1. The sum of the video importance, the sound importance, and the speech importance may be different for each user. The preference score calculation unit 15 may receive the importance for each user via the user terminal 2.

類似シーンを抽出するうえで、映像、音、発話の何れを重視して抽出するかはユーザによって異なる。映像重視度は、ユーザが類似シーンを抽出するうえで映像という要素をどの程度重視するかに関する情報である。音重視度は、ユーザが類似シーンを抽出するうえで音という要素をどの程度重視するかに関する情報である。発話重視度は、ユーザが類似シーンを抽出するうえで発話（自然言語）という要素をどの程度重視するかに関する情報である。 Which of the three - video, sound, or speech - is emphasized when extracting similar scenes varies from user to user. Video importance is information about how much importance the user places on the video element when extracting similar scenes. Sound importance is information about how much importance the user places on the sound element when extracting similar scenes. Speech importance is information about how much importance the user places on the speech (natural language) element when extracting similar scenes.

嗜好スコア算出部１５は、類似度算出部１３が算出したそれぞれの類似度に重視度（重み）を適用することによって、ユーザごとに適した類似シーンを抽出するための嗜好スコアを算出することができる。 The preference score calculation unit 15 can calculate a preference score for extracting similar scenes suitable for each user by applying an importance (weight) to each similarity calculated by the similarity calculation unit 13.

まず、嗜好スコア算出部１５は、嗜好シーン、映像類似度、音類似度、発話類似度、に基づいて、図４（ｇ）のような類似度スコア情報を算出する。図４（ｇ）のそれぞれのスコアのカッコ内は、それぞれのスコアを算出するための計算を表す（図３（ｂ）～（ｄ）の類似度との対応関係を示すために便宜上記載する）。例えば、分割シーンＩＤ「Ｍ１＿Ｃ２」の映像スコアであれば、「Ｍ１＿Ｃ２」と嗜好シーン「Ｍ１＿Ｃ１」の映像類似度０．１２、「Ｍ１＿Ｃ２」と嗜好シーン「Ｍ２＿Ｃ１」の映像類似度０．２２を計算（合計）した結果が映像スコア０．３４となる。 First, the preference score calculation unit 15 calculates similarity score information as shown in FIG. 4(g) based on the preferred scene, video similarity, sound similarity, and speech similarity. The brackets next to each score in FIG. 4(g) indicate the calculation for calculating each score (written for convenience to show the correspondence with the similarities in FIGS. 3(b) to (d)). For example, in the case of the video score of split scene ID "M1_C2", the video similarity of 0.12 between "M1_C2" and the preferred scene "M1_C1" and the video similarity of 0.22 between "M1_C2" and the preferred scene "M2_C1" are calculated (summed up) to result in a video score of 0.34.

嗜好スコア算出部１５は、すべての嗜好シーン（嗜好フラグが１）とそれぞれの分割シーンのそれぞれの類似度に基づいて、分割シーンごとに嗜好シーンとの類似度を加算した類似度スコア（映像スコア、音スコア、発話スコア）を算出する。 For each split scene, the preference score calculation unit 15 calculates similarity scores (video score, sound score, speech score) by adding up the similarities between that split scene and every preferred scene (every scene whose preference flag is 1).

例えば、図３（ｅ）によると、ユーザＩＤ「Ｕ１」の嗜好シーンが分割シーンＩＤ「Ｍ１＿Ｃ１」と「Ｍ２＿Ｃ１」であるため、それぞれの類似度を加算して類似度スコアを算出する。図４（ｇ）の分割シーンＩＤ「Ｍ１＿Ｃ２」の映像スコアは、分割シーンＩＤ「Ｍ１＿Ｃ１」と「Ｍ１＿Ｃ２」の映像類似度０．１２、及び、分割シーンＩＤ「Ｍ２＿Ｃ１」と「Ｍ１＿Ｃ２」の映像類似度０．２２、を合計した値０．３４となる。このようにすることによって、複数の嗜好シーンを考慮した類似度スコアを算出し、類似シーンを抽出することができる。 For example, in FIG. 3(e), the preferred scenes for user ID "U1" are split scene IDs "M1_C1" and "M2_C1", so the similarities for each are added together to calculate a similarity score. The video score for split scene ID "M1_C2" in FIG. 4(g) is 0.34, which is the sum of the video similarity of split scene IDs "M1_C1" and "M1_C2" (0.12) and the video similarity of split scene IDs "M2_C1" and "M1_C2" (0.22). In this way, a similarity score that takes multiple preferred scenes into consideration can be calculated, and similar scenes can be extracted.
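The summation in this example can be sketched as follows. The two similarity values and the scene IDs are taken from the text; the function name `similarity_score` and the dictionary layout keyed by (preferred scene, split scene) are assumptions made for illustration.

```python
def similarity_score(scene_id, preferred_ids, similarity):
    """Sum the similarities between one split scene and every preferred scene."""
    return sum(
        similarity[(preferred_id, scene_id)]
        for preferred_id in preferred_ids
        if preferred_id != scene_id  # a scene is not compared with itself
    )


# Video similarities between U1's preferred scenes and split scene M1_C2.
video_similarity = {("M1_C1", "M1_C2"): 0.12, ("M2_C1", "M1_C2"): 0.22}
video_score = similarity_score("M1_C2", ["M1_C1", "M2_C1"], video_similarity)
```

The sound score and speech score would be obtained the same way from the sound and speech similarity tables.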

この他にも、嗜好スコア算出部１５は、分割シーンＩＤ「Ｍ１＿Ｃ２」の映像スコアとして、分割シーンＩＤ「Ｍ１＿Ｃ１」と「Ｍ１＿Ｃ２」の映像類似度０．１２、及び、分割シーンＩＤ「Ｍ２＿Ｃ１」と「Ｍ１＿Ｃ２」の映像類似度０．２２、をそれぞれ映像スコアとしても良い。 Alternatively, instead of summing them, the preference score calculation unit 15 may treat the video similarity of 0.12 between split scene IDs "M1_C1" and "M1_C2" and the video similarity of 0.22 between split scene IDs "M2_C1" and "M1_C2" each as a separate video score for split scene ID "M1_C2".

嗜好スコア算出部１５は、類似度スコア及び重視度に基づいて、嗜好スコアを算出する。嗜好スコア算出部１５は、例えば、式（１）によってそれぞれの分割シーンに対して嗜好スコアを算出し、図４（ｈ）のようにユーザＩＤ及び分割シーンＩＤに紐づけて記憶部に格納する。図４（ｈ）の嗜好スコアのカッコ内は、式（１）を用いた計算を表す（図４（ｇ）のスコアとの対応関係を示すために便宜上記載する）。例えば、分割シーンＩＤ「Ｍ２＿Ｃ２」の嗜好スコアは１．０２２であって、カッコ内はそのスコアを導出するための計算である。 The preference score calculation unit 15 calculates a preference score based on the similarity scores and the importance levels. For example, the preference score calculation unit 15 calculates a preference score for each split scene using formula (1), and stores it in the storage unit in association with the user ID and split scene ID as shown in FIG. 4(h). The parentheses next to each preference score in FIG. 4(h) show the calculation using formula (1) (written for convenience to show the correspondence with the scores in FIG. 4(g)). For example, the preference score for split scene ID "M2_C2" is 1.022, and the parenthesized expression is the calculation that derives it.
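Formula (1) is not reproduced in this text. One reading consistent with the description of applying an importance (weight) to each similarity is a weighted sum of the three similarity scores, sketched below. The weighted-sum form, the function name, and every numeric value except the video score 0.34 are assumptions, so this sketch does not reproduce the 1.022 shown in FIG. 4(h).

```python
def preference_score(scores, importance):
    """Weighted sum of the video, sound, and speech similarity scores.

    Assumed form of formula (1); the actual formula is not shown in the text.
    """
    return sum(importance[key] * scores[key] for key in ("video", "sound", "speech"))


# Hypothetical importance levels for user U1 (they sum to 1).
importance_u1 = {"video": 0.5, "sound": 0.3, "speech": 0.2}
# Only the video score 0.34 comes from the description; the rest are invented.
scores_m1_c2 = {"video": 0.34, "sound": 0.40, "speech": 0.10}

score_m1_c2 = preference_score(scores_m1_c2, importance_u1)
```

With these values the score is 0.5 × 0.34 + 0.3 × 0.40 + 0.2 × 0.10.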

＜類似シーンの抽出＞ <Extraction of similar scenes>
ステップＳ５０６において、類似シーン抽出部１６は、嗜好シーン、映像類似度、音類似度、発話類似度、ユーザＩＤに紐づく重視度、に基づいて、分割シーンからユーザの嗜好シーンに類似する類似シーンを抽出する。類似シーン抽出部１６は、嗜好スコア算出部１５が算出した嗜好スコアに基づいて、分割シーンからユーザの嗜好シーンに類似する類似シーンを抽出する。 In step S506, the similar scene extraction unit 16 extracts similar scenes, that is, scenes similar to the user's preferred scenes, from the split scenes based on the preferred scenes, the video similarity, the sound similarity, the speech similarity, and the importance associated with the user ID. Specifically, the similar scene extraction unit 16 extracts the similar scenes based on the preference scores calculated by the preference score calculation unit 15.

例えば、類似シーン抽出部１６は、嗜好スコアがある一定の数値（閾値）以上の分割シーンを類似シーンとして抽出することが考えられる。また、類似シーン抽出部１６は、分割シーンの中から嗜好スコアが上位Ｌ個（Ｌ＝１、２、・・・）を類似シーンとして抽出することが考えられる。 For example, the similar scene extraction unit 16 may extract split scenes with preference scores equal to or greater than a certain numerical value (threshold value) as similar scenes. The similar scene extraction unit 16 may also extract the top L scenes (L=1, 2, ...) with the highest preference scores from among the split scenes as similar scenes.
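Both selection rules can be sketched with one helper. The function name and parameters are hypothetical, and apart from the 1.022 given for "M2_C2" the preference scores below are invented for illustration.

```python
def extract_similar_scenes(preference_scores, threshold=None, top_l=None):
    """Rank scenes by preference score, then apply a threshold and/or a top-L cut."""
    ranked = sorted(preference_scores, key=preference_scores.get, reverse=True)
    if threshold is not None:
        ranked = [sid for sid in ranked if preference_scores[sid] >= threshold]
    if top_l is not None:
        ranked = ranked[:top_l]
    return ranked


# Partly hypothetical preference scores for one user.
preference_scores = {"M1_C2": 0.31, "M2_C2": 1.022, "M3_C1": 0.75}

above_threshold = extract_similar_scenes(preference_scores, threshold=0.5)
best_one = extract_similar_scenes(preference_scores, top_l=1)
```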

＜類似シーンの提示（レコメンド）＞ <Recommendation of similar scenes>
ステップＳ５０７において、提示部１７は、類似シーン抽出部１６が抽出した類似シーンに基づいて、類似シーンを提示（レコメンド）する。提示部１７は、更に、動画ＩＤ又は類似シーンＩＤに紐づくジャンルに基づいて、ジャンルごとに類似シーンを提示しても良い。 In step S507, the presentation unit 17 presents (recommends) similar scenes based on the similar scenes extracted by the similar scene extraction unit 16. The presentation unit 17 may further present similar scenes for each genre based on the genre associated with the video ID or the similar scene ID.

＜変化重視度の作成＞ <Creation of the change importance level>
図６は、本実施形態における重視度更新処理のフローチャートである。上述の処理と同様の処理については、同様の符号を付してその説明を省略する。ステップＳ６０１において、変化重視度作成部１８は、重視度及び所定条件に基づいて、重視度の全部又は一部を変化させた変化重視度を作成する。 FIG. 6 is a flowchart of the importance update process in this embodiment. Processes similar to those described above are denoted by the same reference numerals and their description is omitted. In step S601, the change importance level creation unit 18 creates a change importance level by changing all or a part of the importance level based on the importance level and a predetermined condition.

変化重視度は、所定条件に基づいてユーザＩＤに紐づく重視度を変化させた重視度である。もとの重視度（ユーザＩＤに紐づく重視度）に基づいて抽出した類似シーンとは異なるシーン（変化重視度類似シーン）を抽出するために、変化重視度作成部１８は、ユーザＩＤに紐づく重視度に変化を与えた重視度（変化重視度）を作成する。 The change importance level is an importance level obtained by changing the importance level associated with the user ID based on a predetermined condition. In order to extract a scene (a change importance level similar scene) that is different from the similar scene extracted based on the original importance level (importance level associated with the user ID), the change importance level creation unit 18 creates an importance level (change importance level) obtained by changing the importance level associated with the user ID.

変化重視度作成部１８は、例えば、式（２）～（４）によって、それぞれの変化重視度（変化映像重視度、変化音重視度、変化発話重視度）を作成する。変化重視度作成部１８は、式（２）～（４）によって、映像重視度をＸ倍（例えば、Ｘ＝１．１）した変化映像重視度を作成し、更に、変化映像重視度、変化音重視度、変化発話重視度、の合計値が１となるような変化音重視度、変化発話重視度、を作成する。 The change importance level creation unit 18 creates each change importance level (changed video importance, changed sound importance, changed speech importance) using, for example, formulas (2) to (4). Using formulas (2) to (4), the change importance level creation unit 18 creates a changed video importance that is X times the video importance (for example, X = 1.1), and further creates a changed sound importance and a changed speech importance such that the sum of the changed video importance, changed sound importance, and changed speech importance is 1.

変化重視度作成部１８は、式（２）～（４）と同様に、音重視度をＸ倍した変化音重視度、発話重視度をＸ倍した変化発話重視度、を作成することができる。Ｘは、整数であっても良く、整数でない分数であっても良い。さらに、変化重視度作成部１８は、Ｘをランダムに決定しても良く、定期的にＸを変化させても良い。 The change importance level creation unit 18 can create a changed sound importance level by multiplying the sound importance level by X, and a changed speech importance level by multiplying the speech importance level by X, in the same way as with equations (2) to (4). X may be an integer or a non-integer fraction. Furthermore, the change importance level creation unit 18 may determine X randomly, or may change X periodically.
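Formulas (2) to (4) are not reproduced in this text. The sketch below shows one way to satisfy the stated constraints: the chosen importance is multiplied by X and the other two are rescaled so that the three sum to 1. Rescaling the other two in proportion to their original values is an assumption, as are the function name and the input values.

```python
def change_importance(importance, target, x=1.1):
    """Multiply importance[target] by x and rescale the others so the total is 1.

    Proportional rescaling of the remaining importances is an assumed choice;
    the description only requires that the three changed values sum to 1.
    """
    boosted = importance[target] * x
    others = [key for key in importance if key != target]
    others_total = sum(importance[key] for key in others)
    if not 0.0 <= boosted <= 1.0 or others_total <= 0.0:
        raise ValueError("cannot renormalise these importance levels")
    scale = (1.0 - boosted) / others_total
    changed = {key: importance[key] * scale for key in others}
    changed[target] = boosted
    return changed


# Hypothetical importance levels for one user.
importance_u1 = {"video": 0.5, "sound": 0.3, "speech": 0.2}
changed_video = change_importance(importance_u1, "video", x=1.1)
```

With these inputs the changed levels are approximately 0.55, 0.27, and 0.18. Passing "sound" or "speech" as the target gives the other two variants.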

＜変化重視度に基づく変化重視度類似シーンの抽出＞ <Extraction of change importance level similar scenes based on the change importance level>
ステップＳ６０２において、嗜好スコア算出部１５は、嗜好シーン、映像類似度、音類似度、発話類似度、変化重視度、に基づいて、嗜好スコアを算出しても良い。類似シーン抽出部１６は、嗜好シーン、映像類似度、音類似度、発話類似度、変化重視度、に基づいて、分割シーンから変化重視度類似シーンを抽出する。 In step S602, the preference score calculation unit 15 may calculate a preference score based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and the change importance level. The similar scene extraction unit 16 extracts change importance level similar scenes from the split scenes based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and the change importance level.

＜変化重視度類似シーンの提示（レコメンド）＞ <Recommendation of change importance level similar scenes>
ステップＳ６０３において、提示部１７は、ある一定の割合の変化重視度類似シーンを提示する。提示部１７は、類似シーンと比べて低い割合の変化重視度類似シーンを提示する。 In step S603, the presentation unit 17 presents change importance level similar scenes at a certain proportion. The presentation unit 17 presents the change importance level similar scenes at a lower proportion than the similar scenes.

提示部１７は、例えば、Ｋ個（１０個等）のシーンを提示する場合、類似シーンのうち嗜好スコアが高い７割、映像重視度をＸ倍した変化重視度類似シーンのうち嗜好スコアが高い方から１割、音重視度をＸ倍した変化重視度類似シーンのうち嗜好スコアが高い方から１割、発話重視度をＸ倍した変化重視度類似シーンのうち嗜好スコアが高い方から１割、を提示する。 For example, when presenting K scenes (e.g., 10), the presentation unit 17 fills 70% of the K slots with the similar scenes having the highest preference scores, and fills 10% each with the highest-scoring change importance level similar scenes obtained by multiplying the video importance by X, the sound importance by X, and the speech importance by X, respectively.
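The 70/10/10/10 split for K = 10 can be sketched as follows. The function name, the scene IDs, and the rule of skipping a changed-importance scene that is already being presented are assumptions; the description does not say how duplicates are handled.

```python
def mix_presented_scenes(similar, changed_by_feature, k=10, per_changed=1):
    """Combine the top similar scenes with a few change importance level similar scenes.

    Every input list is assumed to be sorted by descending preference score.
    """
    main_count = k - per_changed * len(changed_by_feature)
    presented = list(similar[:main_count])
    for ranked in changed_by_feature.values():
        # Assumed rule: skip scenes that are already in the presented list.
        fresh = [scene_id for scene_id in ranked if scene_id not in presented]
        presented.extend(fresh[:per_changed])
    return presented


# Hypothetical ranked scene IDs.
similar = [f"S{i}" for i in range(1, 11)]
changed_by_feature = {
    "video": ["S1", "V1"],  # S1 is already among the similar scenes
    "sound": ["A1"],
    "speech": ["P1"],
}
presented = mix_presented_scenes(similar, changed_by_feature)
```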

このように、提示部１７が、提示するシーンの中に少量の変化重視度類似シーンを含めてシーンを提示することによって、ユーザが変化重視度類似シーンを好んで視聴する可能性が生じる。ユーザが、類似シーンよりも提示される数が少ない変化重視度類似シーンを好んで視聴する場合、現状のユーザＩＤに紐づく重視度よりも変化重視度の方がユーザの好みの重視度であると判定できる。 In this way, because the presentation unit 17 includes a small number of change importance level similar scenes among the scenes it presents, the user has the opportunity to favor those scenes. If the user prefers to watch the change importance level similar scenes even though fewer of them are presented than similar scenes, it can be determined that the change importance level matches the user's preferences better than the importance level currently associated with the user ID.

＜重視度の更新＞ <Updating the importance level>
ステップＳ６０４において、重視度更新部１９は、変化重視度類似シーンに関するユーザの視聴履歴に基づいて、変化重視度類似シーンに関連する変化重視度をユーザの重視度としてユーザＩＤに紐づけて更新する。重視度更新部１９は、ユーザが映像重視度をＸ倍した変化重視度類似シーンを好んで視聴していると判定した場合、そのユーザの重視度をその変化重視度（映像重視度をＸ倍した変化映像重視度、それに伴って変化した音重視度及び発話重視度）に更新する。 In step S604, the importance updating unit 19 updates the importance level associated with the user ID to the change importance level related to a change importance level similar scene, based on the user's viewing history for that scene. If the importance updating unit 19 determines that the user prefers to view the change importance level similar scenes obtained by multiplying the video importance by X, it updates the user's importance level to that change importance level (the changed video importance multiplied by X, together with the sound importance and speech importance changed accordingly).

例えば、ユーザが１日で最も長い時間視聴した分割シーンが映像重視度をＸ倍した変化重視度類似シーンである場合、重視度更新部１９は、ユーザが映像重視度をＸ倍した変化重視度類似シーンを好んで視聴していると判定し、その変化重視度をユーザの重視度としてユーザＩＤに紐づけて更新する。 For example, if the split scene that the user watched for the longest time in one day is a change importance level similar scene obtained by multiplying the video importance by X, the importance updating unit 19 determines that the user prefers such scenes and updates the importance level associated with the user ID to that change importance level.
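The daily update rule in this example can be sketched as follows. The function name, the `scene_origin` mapping that records which list each presented scene came from, and all values are assumptions made for illustration.

```python
def update_importance(current, changed_by_feature, daily_seconds, scene_origin):
    """Return the importance levels to store for the user after one day.

    scene_origin maps a scene ID to "similar" or to the feature whose importance
    was multiplied by X when that scene was selected (hypothetical bookkeeping).
    """
    if not daily_seconds:
        return current
    most_watched = max(daily_seconds, key=daily_seconds.get)
    origin = scene_origin.get(most_watched, "similar")
    # Fall back to the current levels when the most watched scene is an
    # ordinary similar scene.
    return changed_by_feature.get(origin, current)


current = {"video": 0.5, "sound": 0.3, "speech": 0.2}
changed_by_feature = {"video": {"video": 0.55, "sound": 0.27, "speech": 0.18}}
scene_origin = {"S1": "similar", "V1": "video"}

updated = update_importance(current, changed_by_feature, {"S1": 120, "V1": 400}, scene_origin)
unchanged = update_importance(current, changed_by_feature, {"S1": 500, "V1": 400}, scene_origin)
```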

このように、変化重視度を定期的に変化させ、更に、ユーザの視聴履歴に基づいて重視度を更新することによって、ユーザの好みの重視度に近づけることが可能となる。 In this way, by periodically changing the change importance level and further updating the importance level based on the user's viewing history, it is possible to bring the importance level closer to the user's preferences.

＜類似シーンからフレームの抽出＞ <Extracting frames from similar scenes>
図７は、本実施形態におけるダイジェスト動画作成処理のフローチャートである。上述の処理と同様の処理については、同様の符号を付してその説明を省略する。ステップＳ７０１において、フレーム抽出部１ａは、類似シーン抽出部１６が抽出した複数の類似シーンの最初のフレーム（静止画）と最後のフレーム（静止画）を抽出する。 FIG. 7 is a flowchart of the digest movie creation process in this embodiment. Processes similar to those described above are denoted by the same reference numerals, and their description is omitted. In step S701, the frame extraction unit 1a extracts the first frame (still image) and the last frame (still image) of each of the multiple similar scenes extracted by the similar scene extraction unit 16.

＜フレーム類似度の算出＞ <Calculation of frame similarity>
ステップＳ７０２において、フレーム類似度算出部１ｂは、フレーム抽出部１ａが抽出した最初のフレームと最後のフレームのフレーム類似度を算出する。フレーム類似度算出部１ｂは、フレーム抽出部１ａが抽出した類似シーンの最初のフレームと、その類似シーンを除くフレーム抽出部１ａが抽出した類似シーンの最後のフレームと、のフレーム類似度を算出する。 In step S702, the frame similarity calculation unit 1b calculates the frame similarity between the first frames and the last frames extracted by the frame extraction unit 1a. Specifically, it calculates the frame similarity between the first frame of each similar scene and the last frame of every other similar scene extracted by the frame extraction unit 1a.

フレーム類似度算出部１ｂは、類似シーン抽出部１６が抽出したすべての類似シーンの最初のフレームの特徴量と最後のフレームの特徴量を生成し、図４（ｉ）のような類似シーン情報を記憶部に格納する。フレーム類似度算出部１ｂは、特徴量生成部１２が特徴量を生成したのと同様に特徴量を生成することが考えられる。 The frame similarity calculation unit 1b generates features of the first frame and the last frame of all similar scenes extracted by the similar scene extraction unit 16, and stores similar scene information such as that shown in FIG. 4(i) in the storage unit. The frame similarity calculation unit 1b may generate these features in the same way as the feature generation unit 12.

フレーム類似度算出部１ｂは、更に、すべての類似シーンの最初のフレームの特徴量と、すべての類似シーンの最後のフレームの特徴量と、に基づいて、類似度を算出し、図４（ｊ）のような類似シーン類似度情報を記憶部に格納する。フレーム類似度算出部１ｂは、ベクトル同士の類似度を算出するコサイン類似度によって、特徴量の類似度を算出することが考えられる。本実施形態では図４（ｊ）のように、フレーム類似度算出部１ｂは、同一の類似シーンの最初のフレームと最後のフレームの類似度を算出しない。 The frame similarity calculation unit 1b further calculates similarities based on the features of the first frames of all similar scenes and the features of the last frames of all similar scenes, and stores similar scene similarity information such as that shown in FIG. 4(j) in the storage unit. The frame similarity calculation unit 1b may calculate the similarity of the features using cosine similarity, a measure of similarity between vectors. In this embodiment, as shown in FIG. 4(j), the frame similarity calculation unit 1b does not calculate the similarity between the first frame and the last frame of the same similar scene.

＜ダイジェスト動画の作成＞ <Creating digest videos>
ステップＳ７０３において、ダイジェスト動画作成部１ｃは、類似シーン抽出部１６が抽出した複数の類似シーンを用いて、ダイジェスト動画を作成する。ダイジェスト動画作成部１ｃは、例えば、抽出した類似シーンの中で嗜好スコアが高い順に類似シーンをつなぎ合わせてダイジェスト動画を作成しても良い。 In step S703, the digest movie creation unit 1c creates a digest movie using the plurality of similar scenes extracted by the similar scene extraction unit 16. The digest movie creation unit 1c may create a digest movie by connecting similar scenes from among the extracted similar scenes in descending order of preference score, for example.

ダイジェスト動画作成部１ｃは、抽出した類似シーンにおいて嗜好スコア算出部１５が算出した嗜好スコアが最も高いものをダイジェスト動画の先頭とし、先頭の類似シーンを除く類似シーンの最初のフレームの中で、先頭の類似シーンの最後のフレームとのフレーム類似度が最大である類似シーンを先頭の類似シーンの次のシーンとしてダイジェスト動画を作成しても良い。 The digest video creation unit 1c may create a digest video by setting the extracted similar scene with the highest preference score calculated by the preference score calculation unit 15 as the beginning of the digest video, and setting the similar scene with the highest frame similarity to the last frame of the beginning similar scene, among the first frames of similar scenes other than the beginning similar scene, as the next scene after the beginning similar scene.

ダイジェスト動画作成部１ｃは、更に、それまでのダイジェスト動画の作成に利用した類似シーンを除く類似シーンの最初のフレームの中で、Ｊ番目（Ｊ＝１、２、・・・）の最後のフレームとのフレーム類似度が最大である類似シーンをＪ＋１番目の類似シーンとしてダイジェスト動画を作成しても良い。 The digest video creation unit 1c may further create the digest video by selecting, as the (J+1)th similar scene, the similar scene whose first frame has the highest frame similarity with the last frame of the Jth (J = 1, 2, ...) similar scene, from among the similar scenes not yet used in the digest video.

例えば、類似シーン抽出部１６が、類似シーンとして、類似シーンＩＤ「Ｍ４＿Ｃ１」「Ｍ６＿Ｃ２」「Ｍ１０＿Ｃ１」「Ｍ１１＿Ｃ１」を抽出したとする。さらに、類似シーンＩＤ「Ｍ１０＿Ｃ１」の嗜好スコアが最大である場合、ダイジェスト動画作成部１ｃは、類似シーンＩＤ「Ｍ１０＿Ｃ１」を先頭のシーンとする。 For example, suppose that the similar scene extraction unit 16 extracts similar scene IDs "M4_C1," "M6_C2," "M10_C1," and "M11_C1" as similar scenes. Furthermore, if the preference score of the similar scene ID "M10_C1" is the highest, the digest movie creation unit 1c sets the similar scene ID "M10_C1" as the first scene.

図４（ｊ）によると、類似シーンＩＤ「Ｍ１０＿Ｃ１」の最後のフレームと類似シーンＩＤ「Ｍ４＿Ｃ１」の最初のフレームの類似度が最大であるため、ダイジェスト動画作成部１ｃは、類似シーンＩＤ「Ｍ４＿Ｃ１」を次のシーンとする。 As shown in FIG. 4(j), the similarity between the last frame of similar scene ID "M10_C1" and the first frame of similar scene ID "M4_C1" is the highest, so the digest movie creation unit 1c selects similar scene ID "M4_C1" as the next scene.

さらに、図４（ｊ）によると、類似シーンＩＤ「Ｍ４＿Ｃ１」の最後のフレームと類似シーンＩＤ「Ｍ１０＿Ｃ１」の最初のフレームの類似度が最大である一方、ダイジェスト動画の作成に既に類似シーンＩＤ「Ｍ１０＿Ｃ１」を利用している。よって、ダイジェスト動画作成部１ｃは、その類似シーンを除いて、類似シーンＩＤ「Ｍ４＿Ｃ１」の最後のフレームとのフレーム類似度が最大である類似シーンＩＤ「Ｍ１１＿Ｃ１」を次のシーンとする。ダイジェスト動画作成部１ｃは、抽出した類似シーンをすべて利用するまでこのような処理を繰り返してダイジェスト動画を作成する。 Furthermore, according to FIG. 4(j), the similarity between the last frame of similar scene ID "M4_C1" and the first frame of similar scene ID "M10_C1" is the highest, but similar scene ID "M10_C1" has already been used to create the digest movie. Therefore, the digest movie creation unit 1c excludes this similar scene and sets the next scene to similar scene ID "M11_C1", which has the highest frame similarity with the last frame of similar scene ID "M4_C1". The digest movie creation unit 1c repeats this process to create the digest movie until all extracted similar scenes have been used.
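The ordering procedure walked through above amounts to a greedy chain: start from the scene with the highest preference score, then repeatedly append the unused scene whose first frame best matches the last frame of the current scene. A minimal sketch follows. The scene IDs match the example, but the preference scores and frame similarities are invented stand-ins for FIG. 4(j), chosen only so that the narrative order is reproduced; the position of "M6_C2" is implied because it is the only scene left.

```python
def order_digest(preference_scores, frame_similarity):
    """Greedy ordering of similar scenes for a digest video.

    frame_similarity[(a, b)] is the similarity between the last frame of
    scene a and the first frame of scene b.
    """
    remaining = set(preference_scores)
    current = max(remaining, key=preference_scores.get)
    order = [current]
    remaining.remove(current)
    while remaining:
        # Scenes already used are excluded because they are no longer in remaining.
        current = max(remaining, key=lambda sid: frame_similarity[(current, sid)])
        order.append(current)
        remaining.remove(current)
    return order


# Hypothetical values; M10_C1 has the highest preference score.
preference_scores = {"M4_C1": 0.8, "M6_C2": 0.6, "M10_C1": 1.2, "M11_C1": 0.7}
frame_similarity = {
    ("M10_C1", "M4_C1"): 0.90, ("M10_C1", "M6_C2"): 0.30, ("M10_C1", "M11_C1"): 0.50,
    ("M4_C1", "M10_C1"): 0.95, ("M4_C1", "M6_C2"): 0.40, ("M4_C1", "M11_C1"): 0.70,
    ("M11_C1", "M10_C1"): 0.10, ("M11_C1", "M4_C1"): 0.20, ("M11_C1", "M6_C2"): 0.60,
    ("M6_C2", "M10_C1"): 0.10, ("M6_C2", "M4_C1"): 0.20, ("M6_C2", "M11_C1"): 0.30,
}
digest_order = order_digest(preference_scores, frame_similarity)
```

Although the last frame of "M4_C1" is most similar to the first frame of "M10_C1", that scene has already been used, so "M11_C1" is chosen next.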

以上のように、本発明の構成によれば、ユーザの嗜好に合うシーンを抽出する新たな技術を提供することができる。 As described above, the configuration of the present invention can provide a new technology for extracting scenes that match the user's preferences.

０シーン抽出システム
１シーン抽出装置
１１分割部
１２特徴量生成部
１３類似度算出部
１４嗜好シーン抽出部
１５嗜好スコア算出部
１６類似シーン抽出部
１７提示部
１８変化重視度作成部
１９重視度更新部
１ａフレーム抽出部
１ｂフレーム類似度算出部
１ｃダイジェスト動画作成部
２ユーザ端末
ＮＷネットワーク
0 Scene extraction system
1 Scene extraction device
11 Dividing unit
12 Feature quantity generating unit
13 Similarity calculation unit
14 Preference scene extraction unit
15 Preference score calculation unit
16 Similar scene extraction unit
17 Presentation unit
18 Change importance level creation unit
19 Importance level update unit
1a Frame extraction unit
1b Frame similarity calculation unit
1c Digest movie creation unit
2 User terminal
NW Network

Claims

A scene extraction system for extracting scenes from a video, comprising:
the scene extraction system includes a storage unit, a division unit, a similarity calculation unit, a preference scene extraction unit, and a similar scene extraction unit;
The division unit divides the video into scenes,
the storage unit stores, for each user, a video importance level for a video-related feature, a sound importance level for a sound-related feature, and an utterance importance level for an utterance-related feature in the divided split scenes, in association with a user ID;
the similarity calculation unit calculates a video similarity, a sound similarity, and a speech similarity between the split scenes based on a video feature, a sound feature, and a speech feature of the split scenes;
the preference scene extraction unit extracts preference scenes preferred by the user from a plurality of split scenes based on a viewing history of the user;
the similar scene extraction unit extracts similar scenes from the split scenes based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and an importance level associated with the user ID;
Scene extraction system.

The scene extraction system includes a change importance level creation unit,
the change importance level creation unit creates a change importance level by changing all or a part of the importance level based on the importance level and a predetermined condition;
the similar scene extraction unit extracts scenes having a similar change importance level from the split scenes based on the preference scene, the video similarity level, the sound similarity level, the speech similarity level, and the change importance level;
The scene extraction system according to claim 1 .

The scene extraction system includes an importance updating unit,
the importance updating unit updates the change importance level related to the change importance level similar scene as the importance level of the user by linking the change importance level related to the change importance level similar scene with a user ID based on a viewing history of the user regarding the change importance level similar scene;
The scene extraction system according to claim 2 .

The scene extraction system includes a presentation unit,
the presentation unit presents scenes with a similar change importance level that have a lower ratio than the similar scenes;
The scene extraction system according to claim 3 .

the scene extraction system includes a digest movie creation unit,
the digest movie creation unit creates a digest movie using a plurality of the similar scenes;
The scene extraction system according to claim 1 .

The scene extraction system includes a frame extraction unit and a frame similarity calculation unit,
The frame extraction unit extracts the first and last frames of the plurality of similar scenes;
the frame similarity calculation unit calculates frame similarity between the extracted first frame and last frame;
the digest movie creation unit creates the digest movie based on the frame similarity;
The scene extraction system according to claim 5 .

The scene extraction system includes a preference score calculation unit,
the preference score calculation unit calculates a preference score based on the preference scene, the video similarity, the sound similarity, the speech similarity, and the importance level;
the digest movie creation unit creates the digest movie by setting the similar scene with the highest preference score as the beginning of the digest movie, and by setting, among the first frames of similar scenes other than the beginning similar scene, a similar scene having the highest frame similarity with a last frame of the beginning similar scene as a scene next to the beginning similar scene.
The scene extraction system according to claim 6.

A scene extraction method executed by a scene extraction system for extracting scenes from a video, comprising:
the scene extraction system includes a storage unit, a division unit, a similarity calculation unit, a preference scene extraction unit, and a similar scene extraction unit;
A step of dividing the video into scenes by the dividing unit;
storing, by the storage unit, a video importance level for a video feature, a sound importance level for a sound feature, and an utterance importance level for an utterance feature, for each user, in the divided split scenes, in association with a user ID;
a step of calculating a video similarity, a sound similarity, and a speech similarity between the split scenes based on a video feature, a sound feature, and a speech feature of the split scenes by the similarity calculation unit;
a step of the preference scene extraction unit extracting preference scenes preferred by the user from a plurality of split scenes based on a viewing history of the user;
a step of the similar scene extraction unit extracting similar scenes from the split scenes based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and an importance level associated with the user ID;
Scene extraction method.

A scene extraction program for extracting scenes from a video, comprising:
A computer is caused to function as a storage unit, a division unit, a similarity calculation unit, a preferred scene extraction unit, and a similar scene extraction unit;
The division unit divides the video into scenes,
the storage unit stores, for each user, a video importance level for a video-related feature, a sound importance level for a sound-related feature, and an utterance importance level for an utterance-related feature in the divided split scenes, in association with a user ID;
the similarity calculation unit calculates a video similarity, a sound similarity, and a speech similarity between the split scenes based on a video feature, a sound feature, and a speech feature of the split scenes;
the preference scene extraction unit extracts preference scenes preferred by the user from a plurality of split scenes based on a viewing history of the user;
the similar scene extraction unit extracts similar scenes from the split scenes based on the preferred scene, the video similarity, the sound similarity, the speech similarity, and an importance level associated with the user ID;
Scene extraction program.