JP2024013024A - Moving image analyzer, moving image analysis system, edition device, learning device, moving image analysis method, and moving image analysis program

Info

Publication number: JP2024013024A
Application number: JP2022114920A
Authority: JP
Inventors: 将吾浜田; Shogo Hamada; 貴浩菅野; Takahiro Sugano; 和紀梅村; Kazuki Umemura; 大助内藤; Daisuke Naito; 早紀子鈴木; Sakiko Suzuki
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2022-07-19
Filing date: 2022-07-19
Publication date: 2024-01-31

Abstract

To improve the reliability of scene and subject analysis. SOLUTION: A video analysis device uses a first learning model to divide a video into scenes and to determine a scene name for each scene. The first learning model takes the video and genre information indicating the genre of the video, divides the video into scenes, and determines, for each scene, a scene name indicating the classification of the scene. For each video frame contained in a scene, the device sets, on the basis of the scene name, a label that is information about a subject captured in the video frame. For each video frame, the device outputs the scene name of the scene containing the video frame and the label. SELECTED DRAWING: Figure 1

Description

本発明は、動画解析装置等に関する。 The present invention relates to a video analysis device and the like.

新たな動画を作成する際に、過去に作成された動画が使用される場合がある。このとき、動画の作成者は、過去に作成された動画が蓄積されているアーカイブから、新たな動画の作成に使用したい動画を検索する。 When creating a new video, videos created in the past may be used. At this time, the creator of the video searches for a video that he/she wants to use for creating a new video from an archive that stores videos created in the past.

動画には、番組名や番組ジャンルなどの情報が、メタデータが付与されている場合がある。この場合、動画の作成者は、このメタデータに基づいて、使用したい動画を検索する。しかし、アーカイブに多くの動画が蓄積されている場合、メタデータに基づいて動画を検索しても、多くの動画が抽出される可能性がある。そして、作成者は、この多くの動画の中から使用したいシーンを目視で探すことになる。また、作成者は、使用したいシーンを人手で切り出し、切り出された動画を使用して、新たな動画を作成する。なお、シーンとは、時系列的に連続した一場面を指す。 Information such as the program name and program genre may be attached to a video as metadata. In this case, the creator searches for the video to use based on this metadata. However, if the archive stores many videos, a metadata-based search may still return many videos. The creator then has to look through these videos visually to find the desired scene. The creator also cuts out the desired scene manually and uses the extracted clip to create the new video. Note that a scene refers to a single, chronologically continuous segment of a video.

これに関連する方法として、特許文献１から特許文献２に記載された方法がある。これらの方法では、動画に撮影されているシーンや被写体が解析される。 Related methods are described in Patent Documents 1 and 2. These methods analyze the scenes and subjects captured in a video.

Patent Document 1: 特開２０１８－００５６３８号公報 (JP2018-005638A)
Patent Document 2: 特開２０２０－０７７３４３号公報 (JP2020-077343A)

しかし、たとえば、放送局で使用される動画の場合、シーンや被写体の解析結果に、高い信頼性が求められる。解析結果の信頼性が低い場合、作業者による確認や修正に時間を要してしまい、作業者の負担が大きくなる。 However, for example, in the case of videos used by broadcasting stations, the results of scene and subject analysis require high reliability. If the reliability of the analysis results is low, it will take time for the operator to check and make corrections, increasing the burden on the operator.

本発明の目的は、上記課題を鑑み、シーンや被写体の解析の信頼性をより向上することを可能にする動画解析装置等を提供することにある。 In view of the above problem, an object of the present invention is to provide a video analysis device and the like that can further improve the reliability of scene and subject analysis.

本発明の一態様において、動画解析装置は、動画のジャンルを示すジャンル情報に基づいて、前記動画をシーンに分割し、前記シーンの各々に対して、前記シーンの分類を示すシーン名を決定するシーン分割部と、前記シーンに含まれる映像フレームの各々に対して、前記映像フレームに撮像されている被写体に関する情報であるラベルを設定するラベル設定部と、前記映像フレームの各々について、前記映像フレームが含まれる前記シーンの前記シーン名と、前記ラベルとを出力する出力部とを備え、前記シーン分割部は、前記動画と当該動画の前記ジャンル情報とから、前記動画を前記シーンごとに分割し、前記シーンの各々に対する前記シーン名を決定する第一の学習モデルを使用して、前記シーンの分割と前記シーン名の決定とを行う。 In one aspect of the present invention, a video analysis device includes: a scene dividing unit that divides a video into scenes based on genre information indicating the genre of the video and determines, for each of the scenes, a scene name indicating the classification of the scene; a label setting unit that sets, for each video frame included in a scene, a label that is information about a subject captured in the video frame; and an output unit that outputs, for each video frame, the scene name of the scene containing the video frame and the label. The scene dividing unit performs the scene division and the scene name determination using a first learning model that, from the video and the genre information of the video, divides the video into scenes and determines the scene name for each of the scenes.

また、本発明の他の態様において、動画解析方法は、動画と当該動画のジャンルを示すジャンル情報とから、前記動画をシーンごとに分割し、前記シーンの各々に対して、シーンの分類を示すシーン名を決定する第一の学習モデルを使用して、前記動画を前記シーンに分割し、前記シーンの各々に対して前記シーン名を決定し、前記シーンに含まれる映像フレームの各々に対して、前記シーン名に基づいて、前記映像フレームに撮像されている被写体に関する情報であるラベルを設定し、前記映像フレームの各々について、前記映像フレームが含まれる前記シーンの前記シーン名と、前記ラベルとを出力する。 In another aspect of the present invention, a video analysis method includes: dividing a video into scenes and determining a scene name for each of the scenes using a first learning model that, from the video and genre information indicating the genre of the video, divides the video into scenes and determines, for each of the scenes, a scene name indicating the classification of the scene; setting, for each video frame included in a scene and based on the scene name, a label that is information about a subject captured in the video frame; and outputting, for each video frame, the scene name of the scene containing the video frame and the label.

また、本発明の他の態様において、動画解析プログラムは、コンピュータに、動画のジャンルを示すジャンル情報に基づいて、前記動画をシーンに分割し、前記シーンの各々に対して、前記シーンの分類を示すシーン名を決定するシーン分割機能と、前記シーンに含まれる映像フレームの各々に対して、前記シーン名に基づいて、前記映像フレームに撮像されている被写体に関する情報であるラベルを設定するラベル設定機能と、前記映像フレームの各々について、前記映像フレームが含まれる前記シーンの前記シーン名と、前記ラベルとを出力する出力機能とを実現させ、前記シーン分割機能は、前記動画と当該動画の前記ジャンル情報とから、前記動画を前記シーンごとに分割し、前記シーンの各々に対する前記シーン名を決定する第一の学習モデルを使用して、前記シーンの分割と前記シーン名の決定とを行う。 In another aspect of the present invention, a video analysis program causes a computer to implement: a scene division function that divides a video into scenes based on genre information indicating the genre of the video and determines, for each of the scenes, a scene name indicating the classification of the scene; a label setting function that sets, for each video frame included in a scene and based on the scene name, a label that is information about a subject captured in the video frame; and an output function that outputs, for each video frame, the scene name of the scene containing the video frame and the label. The scene division function performs the scene division and the scene name determination using a first learning model that, from the video and the genre information of the video, divides the video into scenes and determines the scene name for each of the scenes.

本発明によれば、シーンや被写体の解析の信頼性をより向上することが可能になる。 According to the present invention, it is possible to further improve the reliability of scene and subject analysis.

本発明の第一の実施形態の動画解析装置の構成例を示す図である。 FIG. 1 is a diagram showing a configuration example of a video analysis device according to a first embodiment of the present invention.
本発明の第一の実施形態の動画解析装置の動作フローの例を示す図である。 FIG. 2 is a diagram showing an example of the operation flow of the video analysis device according to the first embodiment of the present invention.
本発明の第二の実施形態の動画解析装置を含むシステムの構成例を示す図である。 FIG. 3 is a diagram showing a configuration example of a system including a video analysis device according to a second embodiment of the present invention.
本発明の第二の実施形態の学習装置の構成例を示す図である。 FIG. 4 is a diagram showing a configuration example of a learning device according to the second embodiment of the present invention.
本発明の第二の実施形態の動画解析装置の構成例を示す図である。 FIG. 5 is a diagram showing a configuration example of the video analysis device according to the second embodiment of the present invention.
本発明の第二の実施形態の編集装置の構成例を示す図である。 FIG. 6 is a diagram showing a configuration example of an editing device according to the second embodiment of the present invention.
本発明の第二の実施形態のシーン編集画像の例を示す図である。 FIG. 7 is a diagram showing an example of a scene editing image according to the second embodiment of the present invention.
本発明の第二の実施形態のラベル編集画像の例を示す図である。 FIG. 8 is a diagram showing an example of a label editing image according to the second embodiment of the present invention.
本発明の第二の実施形態の編集メニュー画像の例を示す図である。 FIG. 9 is a diagram showing an example of an editing menu image according to the second embodiment of the present invention.
本発明の第二の実施形態の学習装置の動作フローの例を示す図である。 FIG. 10 is a diagram showing an example of the operation flow of the learning device according to the second embodiment of the present invention.
本発明の第二の実施形態の動画解析装置の動作フローの例を示す図である。 FIG. 11 is a diagram showing an example of the operation flow of the video analysis device according to the second embodiment of the present invention.
本発明の第二の実施形態の編集装置の動作フローの例を示す図である。 FIG. 12 is a diagram showing an example of the operation flow of the editing device according to the second embodiment of the present invention.
本発明の各実施形態のハードウェア構成例を示す図である。 FIG. 13 is a diagram showing an example of the hardware configuration of each embodiment of the present invention.

［第一の実施形態］ [First embodiment]
本発明の第一の実施形態について説明する。第一の実施形態における動画解析装置１０の具体的な一例が、後述する第二の実施形態における動画解析装置２０である。 A first embodiment of the present invention will be described. A specific example of the video analysis device 10 in the first embodiment is the video analysis device 20 in the second embodiment described later.

図１に本実施形態の動画解析装置１０の構成例を示す。本実施形態の動画解析装置１０は、シーン分割部１１とラベル設定部１２と出力部１３とを含む。 FIG. 1 shows a configuration example of the video analysis device 10 of this embodiment. The video analysis device 10 of this embodiment includes a scene dividing unit 11, a label setting unit 12, and an output unit 13.

シーン分割部１１は、ジャンル情報に基づいて、動画をシーンごとに分割する。ジャンル情報は、動画のジャンルを示す。また、シーン分割部１１は、シーンの各々に対して、シーン名を決定する。シーン名は、シーンの分類を示す。 The scene dividing unit 11 divides the video into scenes based on genre information. The genre information indicates the genre of the video. Further, the scene dividing unit 11 determines a scene name for each scene. The scene name indicates the classification of the scene.

シーン分割部１１は、第一の学習モデルを使用して、上述の、シーンの分割とシーン名の決定とを行う。第一の学習モデルは、動画と当該動画のジャンル情報とから、動画をシーンごとに分割し、シーンの各々に対するシーン名を決定する。 The scene dividing unit 11 uses the first learning model to perform the above-described scene division and scene name determination. The first learning model divides the video into scenes based on the video and the genre information of the video, and determines a scene name for each scene.

ラベル設定部１２は、シーンに含まれる映像フレームの各々に対して、シーン名に基づいて、ラベルを設定する。ラベルは、映像フレームに撮像されている被写体に関する情報である。 The label setting unit 12 sets a label for each video frame included in a scene based on the scene name. The label is information regarding the subject imaged in the video frame.

出力部１３は、映像フレームの各々について、映像フレームが含まれるシーンのシーン名と、ラベルとを出力する。 The output unit 13 outputs, for each video frame, the scene name and label of the scene in which the video frame is included.

次に、図２に本実施形態の動画解析装置１０の動作フローの例を示す。 Next, FIG. 2 shows an example of the operation flow of the video analysis device 10 of this embodiment.

シーン分割部１１は、ジャンル情報に基づいて、動画をシーンに分割する。また、シーン分割部１１は、シーンの各々に対して、シーン名を決定する（ステップＳ１０１）。 The scene dividing unit 11 divides the video into scenes based on genre information. Furthermore, the scene dividing unit 11 determines a scene name for each scene (step S101).

ラベル設定部１２は、シーンに含まれる映像フレームの各々に対して、ラベルを設定する（ステップＳ１０２）。 The label setting unit 12 sets a label for each video frame included in the scene (step S102).

出力部１３は、映像フレームの各々について、映像フレームが含まれるシーンのシーン名と、ラベルとを出力する（ステップＳ１０３）。 The output unit 13 outputs, for each video frame, the scene name and label of the scene in which the video frame is included (step S103).
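
The flow of steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: the function `analyze_video`, the `FrameResult` record, and the two toy models are hypothetical stand-ins, and frames are represented as plain strings so that the example runs without any video library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameResult:
    """Per-frame output: scene name of the enclosing scene plus the frame's labels."""
    frame_index: int
    scene_name: str
    labels: List[str] = field(default_factory=list)


def analyze_video(frames, genre, scene_model, label_model):
    """Run the three steps on a list of frames (all names here are assumptions)."""
    # Step S101: the first model sees the whole video and its genre and
    # returns one scene name per frame, which implicitly divides the video.
    scene_names = scene_model(frames, genre)
    results = []
    for index, (frame, scene_name) in enumerate(zip(frames, scene_names)):
        # Step S102: labels are set per frame, based on the scene name.
        labels = label_model(frame, scene_name)
        # Step S103: output the scene name and labels for each frame.
        results.append(FrameResult(index, scene_name, labels))
    return results


def toy_scene_model(frames, genre):
    # Stand-in for the first learning model: a fixed rule instead of a trained model.
    if genre != "sports":
        return ["unknown"] * len(frames)
    return ["goal" if "net" in frame else "dribble" for frame in frames]


def toy_label_model(frame, scene_name):
    # Stand-in for label setting: a lookup keyed by scene name.
    table = {"dribble": ["soccer player", "soccer ball"], "goal": ["soccer player"]}
    return list(table.get(scene_name, []))


results = analyze_video(["player runs", "ball in net"], "sports",
                        toy_scene_model, toy_label_model)
for r in results:
    print(r.frame_index, r.scene_name, r.labels)
```

Passing the models in as callables keeps the three steps visible and lets trained models replace the toy ones without changing the flow.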

以上で説明したように、本発明の第一の実施形態では、動画解析装置１０は、シーン分割部１１とラベル設定部１２と出力部１３とを含む。シーン分割部１１は、ジャンル情報に基づいて、動画をシーンに分割する。また、シーン分割部１１は、シーンの各々に対して、シーン名を決定する。シーン分割部１１は、第一の学習モデルを使用して、シーンの分割とシーン名の決定とを行う。第一の学習モデルは、動画と当該動画のジャンル情報とから、動画をシーンごとに分割し、シーンの各々に対するシーン名を決定する。ラベル設定部１２は、シーンに含まれる映像フレームの各々に対して、シーン名に基づいて、ラベルを設定する。出力部１３は、映像フレームの各々について、映像フレームが含まれるシーンのシーン名と、ラベルとを出力する。 As described above, in the first embodiment of the present invention, the video analysis device 10 includes a scene dividing unit 11, a label setting unit 12, and an output unit 13. The scene dividing unit 11 divides the video into scenes based on genre information. Further, the scene dividing unit 11 determines a scene name for each scene. The scene dividing unit 11 uses the first learning model to divide scenes and determine scene names. The first learning model divides the video into scenes based on the video and the genre information of the video, and determines a scene name for each scene. The label setting unit 12 sets a label for each video frame included in a scene based on the scene name. The output unit 13 outputs, for each video frame, the scene name of the scene containing the video frame and the label.

このように、動画解析装置１０は、ジャンル情報に基づいて、シーンの分割とシーン名の決定とを行う。これにより、動画解析装置１０は、ジャンル情報によって示されるジャンルで頻出するシーンとしてシーンが解析される可能性を向上する。その結果、動画解析装置１０は、動画の特徴量が似ているシーンであるが、誤ったシーンとしてシーンが解析される可能性を低減することができる。したがって、シーン解析の信頼性を向上することができる。また、動画解析装置１０は、シーン名に基づいてラベルを設定するので、シーン解析の信頼性の向上によって、被写体の解析の信頼性も向上する。そのため、シーンや被写体の解析の信頼性をより向上することが可能になる。 In this way, the video analysis device 10 divides scenes and determines scene names based on genre information. This makes a scene more likely to be analyzed as one that appears frequently in the genre indicated by the genre information. As a result, the video analysis device 10 can reduce the possibility that a scene is misidentified as a different scene that merely has similar video features. The reliability of scene analysis can therefore be improved. Furthermore, because the video analysis device 10 sets labels based on scene names, more reliable scene analysis also makes subject analysis more reliable. The reliability of scene and subject analysis can thus be further improved.
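
The effect described above can be illustrated with a small sketch. The specification obtains it from a learning model that takes genre information as an input; the hand-written prior table below is a deliberately simpler substitute used only to show why genre helps separate scenes whose visual scores are close. The function `pick_scene`, the `floor` parameter, the score values, and the scene name "press conference" are all assumptions made for this example.

```python
def pick_scene(visual_scores, genre, genre_prior, floor=0.05):
    """Choose a scene name from visual scores, weighted by a per-genre prior.

    visual_scores: scene name -> score from visual features alone
    genre_prior:   genre -> {scene name -> how often that scene occurs in the genre}
    floor:         weight given to scene names that the genre's prior does not list
    """
    prior = genre_prior.get(genre)
    if prior is None:
        # Unknown genre: fall back to the visual scores alone.
        return max(visual_scores, key=visual_scores.get)
    weighted = {name: score * prior.get(name, floor)
                for name, score in visual_scores.items()}
    return max(weighted, key=weighted.get)


# Two scenes with similar visual scores (hypothetical values).
scores = {"goal": 0.48, "press conference": 0.52}
prior = {
    "sports": {"goal": 0.6, "dribble": 0.3},
    "news": {"press conference": 0.7},
}

print(pick_scene(scores, "sports", prior))  # the genre favors "goal"
print(pick_scene(scores, "news", prior))    # the genre favors "press conference"
```

Without genre information the slightly higher visual score would win in both cases; with it, the choice follows the scenes that are frequent in the given genre.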

［第二の実施形態］ [Second embodiment]
次に、本発明の第二の実施形態における動画解析装置２０について説明する。第一の実施形態における動画解析装置１０の具体的な一例が、第二の実施形態における動画解析装置２０である。 Next, a video analysis device 20 according to a second embodiment of the present invention will be described. A specific example of the video analysis device 10 in the first embodiment is the video analysis device 20 in the second embodiment.

まず、図３に、本実施形態の動画解析装置２０を含む動画解析システム８０の構成例を示す。動画解析システム８０は、学習装置６０と動画解析装置２０と編集装置３０とを含む。動画解析装置２０は、動画記憶装置４０と学習装置６０と編集装置３０とに接続される。また、学習装置６０は、動画記憶装置４０と動画解析装置２０とに接続される。また、編集装置３０は、動画記憶装置４０と利用者端末５０とに接続される。 First, FIG. 3 shows a configuration example of a video analysis system 80 including the video analysis device 20 of this embodiment. The video analysis system 80 includes a learning device 60, a video analysis device 20, and an editing device 30. The video analysis device 20 is connected to the video storage device 40, the learning device 60, and the editing device 30. Further, the learning device 60 is connected to the video storage device 40 and the video analysis device 20. Further, the editing device 30 is connected to the video storage device 40 and the user terminal 50.

動画記憶装置４０は、動画を記憶している。動画は、映像情報を含む。また、動画は、音声情報を含んでいてもよい。また、動画記憶装置４０は、メタデータを記憶している。メタデータは、動画に関するデータである。 The video storage device 40 stores videos. A video includes video information. Further, the video may include audio information. The video storage device 40 also stores metadata. Metadata is data related to videos.

メタデータは、ジャンル情報を含む。ジャンル情報は、動画のジャンルを示す情報である。ジャンルは、動画の種別である。ジャンルは、たとえば、スポーツ、ニュース・報道、バラエティなどである。 The metadata includes genre information. Genre information is information indicating the genre of the video. The genre is the type of video. Genres include, for example, sports, news/reporting, and variety.

また、メタデータは、動画に含まれる映像フレームの各々についてのシーン名を含むことができる。シーン名とは、シーン（場面）の分類を示す名称である。シーン名は、同一の場面（シーン）を示す映像フレームに対して決定される。たとえば、動画がサッカーの動画であれば、ドリブル、ゴール、カウンターなどがシーン名である。 Additionally, the metadata can include a scene name for each video frame included in the video. The scene name is a name indicating the classification of a scene. A scene name is determined for the video frames that show the same scene. For example, if the video is a soccer video, scene names include dribble, goal, and counterattack.

また、メタデータは、動画に含まれる映像フレームの各々についてのラベルを含むことができる。ラベルは、映像フレームに撮像されている被写体に関する情報である。被写体は、人物であってもよい。ラベルは、一つの映像フレームに対して、一または二以上の複数個設定されることができる。また、ラベルが設定されていない映像フレームがあってもよい。 The metadata can also include labels for each video frame included in the video. The label is information regarding the subject imaged in the video frame. The subject may be a person. One or more labels can be set for one video frame. Furthermore, there may be video frames for which no labels are set.

ラベルは、たとえば、被写体の名称であってもよい。この場合、ラベルは、たとえば、「サッカー選手」、「サッカーボール」などである。ラベルは、人物名であってもよい。また、ラベルは、被写体の動作を示す情報であってもよい。この場合、ラベルは、たとえば、「ボールを蹴る」、「ピッチング」、「バッティング」などである。また、一つの被写体に対して一または二以上のラベルが設定されてもよい。たとえば、「バッティング」というラベルが設定された映像フレームに対して、さらに「バッティング」より詳細な情報である「ノーステップ打法」や「一本足打法」といったラベルが設定されてもよい。 The label may be, for example, the name of the subject. In this case, the labels are, for example, "soccer player" or "soccer ball." The label may be a person's name. Further, the label may be information indicating the motion of the subject. In this case, the labels are, for example, "kicking the ball," "pitching," "batting," and the like. Furthermore, one or more labels may be set for one subject. For example, for a video frame to which the label "batting" has been set, labels such as "no-step batting method" or "one-leg batting method", which are more detailed information than "batting", may be further set.

また、メタデータは、ラベルの各々について、ラベルに相当する被写体の領域情報を含んでいてもよい。領域情報は、映像フレームのどの領域に被写体が撮像されているかを示す情報である。 Further, the metadata may include, for each label, area information of the subject corresponding to the label. The area information is information indicating in which area of the video frame the subject is imaged.

なお、動画記憶装置４０は、動画解析装置２０による解析の対象の動画と、学習装置６０による学習に使用される動画とを記憶することができる。また、解析の対象の動画には、まだ解析されていない動画と、解析された動画とがあり得る。また、解析された動画には、メタデータに含まれている解析結果が編集されているものと、編集されていないものとがあり得る。なお、解析結果とは、映像フレームの各々のシーン名やラベルを指す。また、学習、解析および編集については、後述する。まだ解析されていない動画のメタデータは、シーン名やラベルを含まない。解析された動画や学習に使用される動画のメタデータは、シーン名やラベルを含む。 Note that the video storage device 40 can store videos to be analyzed by the video analysis device 20 and videos used for learning by the learning device 60. The videos to be analyzed may include videos that have not yet been analyzed and videos that have already been analyzed. Among analyzed videos, the analysis results included in the metadata may or may not have been edited. Here, the analysis results are the scene name and labels of each video frame. Learning, analysis, and editing are described later. The metadata of a video that has not yet been analyzed does not include scene names or labels. The metadata of analyzed videos and of videos used for learning includes scene names and labels.
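
The metadata described above could be represented, for example, by the data structures below. This is only a sketch: the specification does not define a storage format, so the class names, the field names, and the `(x, y, width, height)` encoding of the area information are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Label:
    """Information about a subject in a frame, with optional area information."""
    text: str
    # Assumed encoding of the area information: (x, y, width, height) in pixels.
    region: Optional[Tuple[int, int, int, int]] = None


@dataclass
class FrameMetadata:
    frame_index: int
    scene_name: Optional[str] = None                   # None until analyzed
    labels: List[Label] = field(default_factory=list)  # a frame may have no labels


@dataclass
class VideoMetadata:
    genre: str                                         # e.g. "sports"
    frames: List[FrameMetadata] = field(default_factory=list)

    def is_analyzed(self) -> bool:
        # Metadata of a not-yet-analyzed video carries no scene names or labels.
        return any(f.scene_name is not None or f.labels for f in self.frames)


meta = VideoMetadata(genre="sports",
                     frames=[FrameMetadata(0), FrameMetadata(1)])
print(meta.is_analyzed())  # False: only genre information is present

meta.frames[0].scene_name = "goal"
meta.frames[0].labels.append(Label("soccer ball", region=(10, 20, 30, 30)))
print(meta.is_analyzed())  # True
```

Keeping the genre at video level and the scene name and labels at frame level mirrors the description: one genre per video, one scene name per frame, and zero or more labels per frame.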

また、動画記憶装置４０は、動画解析装置２０から要求された動画と当該動画のメタデータとを、動画解析装置２０へ送信する。また、動画記憶装置４０は、学習装置６０から要求された動画と当該動画のメタデータとを、学習装置６０へ送信する。また、動画記憶装置４０は、編集装置３０から要求された動画と当該動画のメタデータとを、編集装置３０へ送信する。 Further, the video storage device 40 transmits the video requested by the video analysis device 20 and the metadata of the video to the video analysis device 20. Further, the video storage device 40 transmits the video requested by the learning device 60 and the metadata of the video to the learning device 60. Further, the video storage device 40 transmits the video requested by the editing device 30 and the metadata of the video to the editing device 30.

利用者端末５０は、メタデータを編集する利用者によって使用される端末である。利用者端末５０は、たとえば、パーソナルコンピュータ等の情報処理装置である。利用者端末５０は、入力デバイスの一例であるキーボードや、出力デバイスとしてのディスプレイ等のマンマシンインタフェースを備える。利用者端末５０は、利用者によって入力デバイスに入力された操作に応じて、編集装置３０に対する指示を行う。また、利用者端末５０は、編集装置３０からの制御によって、利用者端末５０が備える表示手段に、画像を表示させる。 The user terminal 50 is a terminal used by a user who edits metadata. The user terminal 50 is, for example, an information processing device such as a personal computer. The user terminal 50 includes a man-machine interface such as a keyboard, which is an example of an input device, and a display, which is an output device. The user terminal 50 issues instructions to the editing device 30 in response to operations input by the user to the input device. Further, the user terminal 50 causes the display means provided in the user terminal 50 to display an image under the control of the editing device 30.

学習装置６０は、動画の解析のための学習モデルを生成する。学習装置６０の詳細については後述する。 The learning device 60 generates a learning model for video analysis. Details of the learning device 60 will be described later.

動画解析装置２０は、動画の解析を行う。動画解析装置２０の詳細については後述する。 The video analysis device 20 analyzes videos. Details of the video analysis device 20 will be described later.

編集装置３０は、メタデータの編集を行う。編集装置３０の詳細については後述する。 The editing device 30 edits metadata. Details of the editing device 30 will be described later.

次に、図４に、本実施形態の学習装置６０の構成例を示す。学習装置６０は、学習情報入力部６１、学習情報記憶部６２およびモデル生成部６３を含む。 Next, FIG. 4 shows a configuration example of the learning device 60 of this embodiment. The learning device 60 includes a learning information input unit 61, a learning information storage unit 62, and a model generation unit 63.

なお、本実施形態では、学習装置６０と動画解析装置２０とが異なる装置である場合について説明するが、動画解析装置２０が学習装置６０の機能を備えていても良い。 In this embodiment, a case will be described in which the learning device 60 and the video analysis device 20 are different devices, but the video analysis device 20 may have the functions of the learning device 60.

本実施形態の学習装置６０は、シーン分割用の第一の学習モデルと、ラベル設定用の第二の学習モデルとを生成する。シーン分割とラベル設定とについては、後述する。第一の学習モデルを生成する処理と、第二の学習モデルを生成する処理は、互いに異なる処理である。学習装置６０は、第一の学習モデルを生成する装置と、第二の学習モデルを生成する装置とに分かれていてもよい。 The learning device 60 of this embodiment generates a first learning model for scene segmentation and a second learning model for label setting. Scene division and label setting will be described later. The process of generating the first learning model and the process of generating the second learning model are mutually different processes. The learning device 60 may be divided into a device that generates a first learning model and a device that generates a second learning model.

学習情報入力部６１は、学習情報を受信し、学習情報記憶部６２に記憶させる。学習情報記憶部６２は、学習情報を記憶する。モデル生成部６３は、学習情報を用いて学習モデルを生成して出力する。 The learning information input unit 61 receives learning information and stores it in the learning information storage unit 62. The learning information storage unit 62 stores learning information. The model generation unit 63 generates and outputs a learning model using the learning information.

まず、学習装置６０が第一の学習モデルを生成する場合について説明する。第一の学習モデルは、シーン分割用の学習モデルである。第一の学習モデルへの入力は、動画と当該動画のジャンル情報である。また、第一の学習モデルの出力は、動画に含まれる映像フレームの各々に対するシーン名である。第一の学習モデルは、動画のジャンル情報に基づいて、動画をシーンごとに分割し、シーンの各々に対してシーン名を決定する。 First, a case where the learning device 60 generates a first learning model will be described. The first learning model is a learning model for scene segmentation. The input to the first learning model is a video and genre information of the video. Furthermore, the output of the first learning model is a scene name for each video frame included in the video. The first learning model divides a video into scenes based on video genre information and determines a scene name for each scene.

この場合、第一の学習モデルの生成のための学習情報は、シーン名、当該シーン名に該当する映像フレーム、および、当該映像フレームが含まれる動画のジャンル情報を含む。動画記憶装置４０には、学習用の動画と、当該動画のメタデータが記憶されている。学習用の動画のメタデータには、ジャンル情報と、映像フレームの各々についてのシーン名が含まれている。そのため、学習情報入力部６１は、学習用の動画とメタデータとを動画記憶装置４０から取得することによって、学習情報を得ることができる。 In this case, the learning information for generating the first learning model includes a scene name, a video frame corresponding to the scene name, and genre information of a video including the video frame. The video storage device 40 stores learning videos and metadata of the videos. The learning video metadata includes genre information and scene names for each video frame. Therefore, the learning information input unit 61 can obtain the learning information by acquiring the learning video and metadata from the video storage device 40.
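
As a rough illustration of how learning information for the first learning model might be assembled from a learning video and its metadata, consider the sketch below. The dictionary layout (`"genre"`, `"scene_names"`) and the function name are hypothetical; frames are again plain strings standing in for image data.

```python
def build_scene_training_samples(frames, metadata):
    """Pair each frame with its scene name and the genre of its video.

    frames:   list of frames of one learning video
    metadata: {"genre": str, "scene_names": [str or None, one per frame]}
              (an assumed layout, not a format defined in the specification)
    """
    genre = metadata["genre"]
    samples = []
    for frame, scene_name in zip(frames, metadata["scene_names"]):
        if scene_name is None:
            # Frames without a scene name cannot serve as supervised examples.
            continue
        samples.append({"frame": frame, "genre": genre, "scene_name": scene_name})
    return samples


frames = ["f0", "f1", "f2"]
metadata = {"genre": "sports", "scene_names": ["dribble", None, "goal"]}
for sample in build_scene_training_samples(frames, metadata):
    print(sample)
```

Each sample holds the three items the learning information is said to contain: a scene name, a video frame corresponding to that scene name, and the genre information of the video containing the frame.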

なお、解析がされた動画のうち、編集装置３０によってメタデータが編集された動画が、学習用の動画として使用されてもよい。このように、編集装置３０による編集結果を学習に使用することで、学習モデルの精度を向上することができる。 Note that among the analyzed videos, a video whose metadata has been edited by the editing device 30 may be used as a learning video. In this way, by using the editing results by the editing device 30 for learning, the accuracy of the learning model can be improved.

また、ジャンル情報は、他の装置（不図示）によって、動画に対してあらかじめ設定され、設定されたジャンル情報を含むメタデータが動画記憶装置４０に記憶されている。また、メタデータにジャンル情報が含まれていない場合には、動画解析装置２０または編集装置３０が、利用者端末５０からの操作入力に応じて、メタデータにジャンル情報を設定してもよい。また、後述のように、編集装置３０は、メタデータの編集の際に、ジャンル情報を変更することができる。 Further, the genre information is set in advance for the video by another device (not shown), and metadata including the set genre information is stored in the video storage device 40. Furthermore, if the metadata does not include genre information, the video analysis device 20 or the editing device 30 may set the genre information in the metadata in response to an operation input from the user terminal 50. Furthermore, as will be described later, the editing device 30 can change genre information when editing metadata.

モデル生成部６３は、学習情報に基づいて、第一の学習モデルを生成する。また、モデル生成部６３は、生成した第一の学習モデルを動画解析装置２０のシーン分割部２１へ送信する。 The model generation unit 63 generates a first learning model based on the learning information. Furthermore, the model generation unit 63 transmits the generated first learning model to the scene division unit 21 of the video analysis device 20.

なお、学習情報入力部６１は、所定のタイミングで、学習用の動画を動画記憶装置４０から取得することができる。このときの所定のタイミングは、たとえば、新たな学習用の動画が動画記憶装置４０に記憶された場合や、編集装置３０によってメタデータが編集された場合などである。また、モデル生成部６３は、所定のタイミングで、新たな第一の学習モデルを生成しても良い。このときの所定のタイミングは、たとえば、定時間おきや、新たな学習用の動画が学習情報記憶部６２に記憶された場合や、利用者端末５０から学習を指示する学習指示が入力された場合などである。 Note that the learning information input unit 61 can acquire a learning video from the video storage device 40 at a predetermined timing, for example when a new learning video is stored in the video storage device 40 or when metadata is edited by the editing device 30. The model generation unit 63 may also generate a new first learning model at a predetermined timing, for example at predetermined intervals, when a new learning video is stored in the learning information storage unit 62, or when a learning instruction is input from the user terminal 50.

次に、学習装置６０が第二の学習モデルを生成する場合について説明する。第二の学習モデルは、ラベル設定用の学習モデルである。第二の学習モデルは、シーン名が同一である映像フレームである同一シーンフレーム、当該同一シーンフレームのシーン名、および、当該同一シーンフレームを含む動画のジャンル情報から、当該同一シーンフレームに含まれる映像フレームの各々に対するラベル名を設定する。 Next, a case where the learning device 60 generates the second learning model will be described. The second learning model is a learning model for label setting. Its inputs are same-scene frames, which are video frames that share the same scene name, the scene name of those same-scene frames, and the genre information of the video containing them. From these inputs, the second learning model sets a label name for each video frame included in the same-scene frames.

この場合、第二の学習モデルの生成のための学習情報は、ラベル、当該ラベルが設定されている映像フレーム、当該ラベルの領域情報、当該ラベルが設定されている映像フレームのシーン名、および、当該ラベルが設定されている映像フレームを含む動画のジャンル情報を含む。動画記憶装置４０には、学習用の動画と、当該動画のメタデータが記憶されている。学習用の動画のメタデータには、ジャンル情報と、映像フレームの各々についてのシーン名およびラベルと、ラベルの各々についての領域情報が含まれている。そのため、学習情報入力部６１は、学習用の動画と当該動画のメタデータとを動画記憶装置４０から取得することによって、学習情報を得ることができる。なお、解析がされた動画のうち、編集装置３０によってメタデータが編集された動画が、学習用の動画として使用されてもよい。 In this case, the learning information for generating the second learning model includes a label, the video frame to which the label is set, the area information of the label, the scene name of the video frame to which the label is set, and the genre information of the video containing that video frame. The video storage device 40 stores learning videos and their metadata. The metadata of a learning video includes genre information, the scene name and labels for each video frame, and area information for each label. Therefore, the learning information input unit 61 can obtain the learning information by acquiring the learning video and its metadata from the video storage device 40. Note that among the analyzed videos, a video whose metadata has been edited by the editing device 30 may be used as a learning video.

モデル生成部６３は、学習情報に基づいて、第二の学習モデルを生成する。また、モデル生成部６３は、生成した第二の学習モデルを動画解析装置２０のラベル設定部２２へ送信する。 The model generation unit 63 generates a second learning model based on the learning information. Furthermore, the model generation unit 63 transmits the generated second learning model to the label setting unit 22 of the video analysis device 20.

なお、学習情報入力部６１は、新たな学習用の動画が動画記憶装置４０に記憶された場合や、編集装置３０によってメタデータが編集された場合などの所定のタイミングで、学習用の動画を動画記憶装置４０から取得することができる。また、モデル生成部６３は、所定のタイミングで、新たな第二の学習モデルを生成しても良い。所定のタイミングは、たとえば、所定時間おきや、新たな学習用の動画が学習情報記憶部６２に記憶された場合や、利用者端末５０から学習を指示する学習指示が入力された場合などである。 Note that the learning information input unit 61 can acquire a learning video from the video storage device 40 at a predetermined timing, such as when a new learning video is stored in the video storage device 40 or when metadata is edited by the editing device 30. The model generation unit 63 may also generate a new second learning model at a predetermined timing, for example at predetermined intervals, when a new learning video is stored in the learning information storage unit 62, or when a learning instruction is input from the user terminal 50.

次に、図５に、本実施形態の動画解析装置２０の構成例を示す。動画解析装置２０は、シーン分割部２１、ラベル設定部２２および出力部２３を含む。なお、動画解析装置２０によって動画に対して行われる一連の動作を、解析とよぶ。 Next, FIG. 5 shows a configuration example of the video analysis device 20 of this embodiment. The video analysis device 20 includes a scene dividing unit 21, a label setting unit 22, and an output unit 23. Note that the series of operations performed on a video by the video analysis device 20 is referred to as analysis.

シーン分割部２１は、解析の対象の動画と当該動画のメタデータとを、動画記憶装置４０から取得する。また、シーン分割部２１は、ジャンル情報に基づいて、解析対象の動画をシーンに分割する。また、シーン分割部２１は、シーンの各々に対して、シーン名を決定する。シーン分割部２１は、学習装置６０で生成された第一の学習モデルを使用して、シーン分割とシーン名の決定とを行う。シーン分割部２１は、第一の学習モデルに、解析対象の動画と当該動画のジャンル情報とを入力する。第一の学習モデルは、動画に含まれる映像フレームの各々に対するシーン名を出力する。 The scene dividing unit 21 acquires the video to be analyzed and the metadata of the video from the video storage device 40. The scene dividing unit 21 divides the video to be analyzed into scenes based on the genre information and determines a scene name for each scene. The scene dividing unit 21 uses the first learning model generated by the learning device 60 to perform the scene division and the scene name determination. The scene dividing unit 21 inputs the video to be analyzed and the genre information of the video to the first learning model. The first learning model outputs a scene name for each video frame included in the video.

ラベル設定部２２は、解析対象の動画に含まれる映像フレームの各々に対してラベルを設定する。ラベル設定部２２は、学習装置６０で生成された第二の学習モデルを使用して、ラベルの設定を行う。また、ラベル設定部２２は、第二の学習モデルを使用して、ラベルの各々についての領域情報を決定する。ラベル設定部２２は、第二の学習モデルに、同一シーンフレームと、当該同一シーンフレームのシーン名と、当該同一シーンフレームを含む動画のジャンル情報とを入力する。同一シーンフレームは、シーン名が同一である映像フレームである。第二の学習モデルは、当該同一シーンフレームに含まれる映像フレームの各々に対するラベルと、ラベルの領域情報とを出力する。 The label setting unit 22 sets a label for each video frame included in the video to be analyzed. The label setting unit 22 uses the second learning model generated by the learning device 60 to set the labels and to determine area information for each label. The label setting unit 22 inputs same-scene frames, the scene name of the same-scene frames, and the genre information of the video containing the same-scene frames to the second learning model. Same-scene frames are video frames that share the same scene name. The second learning model outputs a label for each video frame included in the same-scene frames, together with the area information of each label.
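
One way to collect same-scene frames from the per-frame scene names output by the first learning model is to group consecutive frames that share a scene name, as sketched below. Treating only consecutive frames as one group is an assumption made here (it follows the earlier definition of a scene as chronologically continuous); the specification only states that same-scene frames have the same scene name. The function name and the `(scene_name, start, end)` tuple layout are hypothetical.

```python
from itertools import groupby


def group_same_scene_frames(scene_names):
    """Group consecutive frames with the same scene name.

    Returns a list of (scene_name, start, end) tuples, where frames
    start .. end-1 (half-open range) share the scene name.
    """
    groups = []
    start = 0
    for name, run in groupby(scene_names):
        length = len(list(run))
        groups.append((name, start, start + length))
        start += length
    return groups


per_frame = ["dribble", "dribble", "goal", "dribble"]
for name, start, end in group_same_scene_frames(per_frame):
    # Each group, its scene name, and the video's genre would form one
    # input to the label-setting model.
    print(name, start, end)
```

With this grouping, the two separate "dribble" runs are handled as two scenes rather than being merged, so each call to the label-setting model sees one continuous stretch of video.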

出力部２３は、シーン分割部２１で決定されたシーン名と、ラベル設定部２２で設定されたラベルとを出力する。より具体的には、出力部２３は、映像フレームの各々についてのシーン名およびラベルと、ラベルの各々についての領域情報とをメタデータに含ませて、動画記憶装置４０に記憶させる。 The output unit 23 outputs the scene name determined by the scene dividing unit 21 and the label set by the label setting unit 22. More specifically, the output unit 23 causes the video storage device 40 to store the metadata including the scene name and label for each video frame and the area information for each label.
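The flow through the scene dividing unit 21, the label setting unit 22, and the output unit 23 can be sketched roughly as follows. This is only an illustrative sketch: `FrameMetadata`, `analyze`, the `(x, y, width, height)` box format, and the two stub models are hypothetical names and shapes chosen here, and the stubs merely stand in for the first and second learning models, whose actual interfaces are not specified in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical region format: (x, y, width, height) inside the video frame.
Box = Tuple[int, int, int, int]


@dataclass
class FrameMetadata:
    scene_name: str
    labels: Dict[str, Box] = field(default_factory=dict)  # label -> area information


def analyze(frames, genre, scene_model, label_model) -> List[FrameMetadata]:
    # Scene dividing unit 21: the first model returns one scene name per frame.
    scene_names = scene_model(frames, genre)
    metadata = [FrameMetadata(scene_name=name) for name in scene_names]

    # Collect the "same scene frames": frame indices sharing a scene name.
    groups: Dict[str, List[int]] = {}
    for index, name in enumerate(scene_names):
        groups.setdefault(name, []).append(index)

    # Label setting unit 22: the second model receives the same scene frames,
    # their scene name, and the genre, and returns labels with regions per frame.
    for name, indices in groups.items():
        outputs = label_model([frames[i] for i in indices], name, genre)
        for i, labels in zip(indices, outputs):
            metadata[i].labels = dict(labels)

    # Output unit 23 would store this per-frame metadata with the video.
    return metadata


def stub_scene_model(frames, genre):
    # Placeholder for the first learning model (assumption, not a trained model).
    return ["pitching" if i < 2 else "batting" for i in range(len(frames))]


def stub_label_model(frames, scene_name, genre):
    # Placeholder for the second learning model (assumption, not a trained model).
    return [{scene_name + "-object": (0, 0, 10, 10)} for _ in frames]
```

Passing the models in as plain callables keeps the sketch independent of any particular machine learning framework.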

次に、図６に本実施形態の編集装置３０の構成例を示す。本実施形態の編集装置３０は、編集部３４と記憶部３５とを含む。編集装置３０は、動画解析装置２０に含まれていてもよい。 Next, FIG. 6 shows a configuration example of the editing device 30 of this embodiment. The editing device 30 of this embodiment includes an editing section 34 and a storage section 35. The editing device 30 may be included in the video analysis device 20.

編集部３４は、利用者端末５０からの動画取得指示に応じて、動画記憶装置４０から、編集対象の動画と当該動画のメタデータとを取得する。また、編集部３４は、取得した動画およびメタデータを記憶部３５に記憶させる。動画取得指示は、編集対象の動画の取得を指示する指示である。動画取得指示は、編集対象の動画の識別情報、たとえばファイル名などを含む。 The editing unit 34 acquires a video to be edited and metadata of the video from the video storage device 40 in response to a video acquisition instruction from the user terminal 50. Furthermore, the editing unit 34 causes the storage unit 35 to store the acquired moving image and metadata. The video acquisition instruction is an instruction to acquire a video to be edited. The video acquisition instruction includes identification information of the video to be edited, such as a file name.

編集部３４は、シーン編集指示に応じて、編集対象の動画のメタデータを編集する。シーン編集指示は、映像フレームの各々についてのシーン名の編集に関する指示である。また、編集部３４は、ラベル編集指示に応じて、編集対象の動画のメタデータを編集する。ラベル編集指示は、映像フレームの各々についてのラベルの編集に関する指示である。シーン編集指示およびラベル編集指示は、利用者端末５０から編集部３４に入力される。また、編集部３４は、編集されたメタデータを動画記憶装置４０に記憶させる。 The editing unit 34 edits the metadata of the video to be edited in accordance with the scene editing instruction. The scene editing instruction is an instruction regarding editing the scene name for each video frame. Furthermore, the editing unit 34 edits the metadata of the video to be edited in accordance with the label editing instruction. The label editing instruction is an instruction regarding editing the label for each video frame. Scene editing instructions and label editing instructions are input from the user terminal 50 to the editing section 34. The editing unit 34 also stores the edited metadata in the video storage device 40.

また、編集部３４は、シーン編集画像表示指示に応じて、シーン編集画像１１０を利用者端末５０に表示させる。シーン編集画像表示指示は、シーン編集画像１１０の表示を指示する指示である。シーン編集画像１１０は、シーン名の編集のための画像である。また、編集部３４は、取得した動画とメタデータとに基づいて、利用者端末５０に表示させるシーン編集画像１１０を生成する。 Furthermore, the editing unit 34 displays the scene edited image 110 on the user terminal 50 in response to the scene edited image display instruction. The scene edited image display instruction is an instruction to display the scene edited image 110. The scene editing image 110 is an image for editing a scene name. Furthermore, the editing unit 34 generates a scene edited image 110 to be displayed on the user terminal 50 based on the acquired video and metadata.

図７に、シーン編集画像１１０の例を示す。 FIG. 7 shows an example of the scene edited image 110.

シーン編集画像１１０は、元動画表示画像１１１を含む。元動画表示画像１１１は、編集対象の動画を表示する。元動画表示画像１１１に表示された動画は、利用者端末５０からの操作入力によって、動画の再生や停止が可能である。編集部３４は、編集対象の動画を、元動画表示画像１１１に表示させる。また、編集部３４は、利用者端末５０からの操作入力に応じて、動画の再生の開始や停止を行う。 The scene edited image 110 includes an original moving image display image 111. The original video display image 111 displays the video to be edited. The video displayed in the original video display image 111 can be played back or stopped by operation input from the user terminal 50. The editing unit 34 displays the video to be edited on the original video display image 111. Further, the editing unit 34 starts or stops playing the video in response to an operation input from the user terminal 50.

また、シーン編集画像１１０は、シーン表示画像１１２を含む。シーン表示画像１１２は、編集対象の動画がどのようにシーン分割されているかを示す。シーン表示画像１１２において、横軸は、動画の始点からの再生時間を示す。シーン表示画像１１２では、同じシーン名が付与されている映像フレームが、同じ色で示されていてもよい。メタデータには、映像フレームの各々についてのシーン名が含まれている。編集部３４は、編集対象の動画のメタデータに基づいて、同じシーン名が付与されている映像フレームを同じシーンであるとすることができる。編集部３４は、互いに異なるシーン名が付与されている映像フレームの境界を、シーン表示画像１１２に表示させる。 The scene edited image 110 also includes a scene display image 112. The scene display image 112 shows how the video to be edited is divided into scenes. In the scene display image 112, the horizontal axis indicates the playback time from the starting point of the video. In the scene display image 112, video frames to which the same scene name is given may be shown in the same color. The metadata includes scene names for each video frame. The editing unit 34 can determine that video frames to which the same scene name is given are the same scene, based on the metadata of the video to be edited. The editing unit 34 causes the scene display image 112 to display boundaries between video frames to which mutually different scene names are given.
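One way the scene boundaries shown in the scene display image 112 could be derived from the per-frame scene names in the metadata is sketched below. The function name `split_into_scenes` and the `(scene_name, first_frame, last_frame)` tuple format are assumptions made for illustration. The sketch also assumes that a scene is a run of consecutive frames with the same scene name, so a scene name that reappears later yields a separate scene.

```python
from itertools import groupby


def split_into_scenes(scene_names):
    """Turn per-frame scene names into (scene_name, first_frame, last_frame) runs."""
    scenes = []
    start = 0
    for name, run in groupby(scene_names):
        length = len(list(run))
        scenes.append((name, start, start + length - 1))
        start += length  # the next run begins right after this boundary
    return scenes
```

For example, `split_into_scenes(["a", "a", "b", "a"])` gives three scenes, with boundaries after frames 1 and 2.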

また、シーン編集画像１１０は、切り出し画像１１３を含む。切り出し画像１１３には、シーン表示画像１１２に表示されているカーソルによって示される映像フレームの画像（静止画像）である。シーン表示画像１１２に表示されているカーソルは、利用者端末５０からの操作入力に応じて移動される。編集部３４は、利用者端末５０からの操作入力に応じて、シーン表示画像１１２に表示されているカーソルが示す映像フレームの画像を切り出し画像１１３として表示する。なお、切り出し画像１１３の上側に表示されているタイムコードは、カーソルによって示されている映像フレームが含まれるシーンの開始位置と終了位置とを表す。 Furthermore, the scene edited image 110 includes a cutout image 113. The cutout image 113 is an image (still image) of the video frame indicated by the cursor displayed on the scene display image 112. The cursor displayed on the scene display image 112 is moved according to an operation input from the user terminal 50. The editing unit 34 displays the image of the video frame indicated by the cursor displayed on the scene display image 112 as a cutout image 113 in response to an operation input from the user terminal 50. Note that the time code displayed above the cutout image 113 represents the start position and end position of the scene that includes the video frame indicated by the cursor.

また、シーン編集画像１１０は、ジャンル画像１１４を含む。ジャンル画像１１４には、編集対象の動画のジャンル情報が含まれる。編集部３４は、編集対象の動画のメタデータに含まれるジャンル情報を、ジャンル画像１１４に表示させる。 Furthermore, the scene edited image 110 includes a genre image 114. The genre image 114 includes genre information of the video to be edited. The editing unit 34 displays genre information included in the metadata of the video to be edited on the genre image 114.

また、シーン編集画像１１０は、シーン名編集画像１１５を含む。シーン名編集画像１１５は、シーン名の編集のための画像である。シーン名編集画像１１５には、シーン表示画像１１２に表示されているカーソルによって示されるシーンのシーン名が表示される。また、シーン名編集画像１１５では、利用者端末５０からの操作入力によって、「編集」が選択されると、シーン名の編集が可能になる。利用者端末５０からの操作入力によって、シーン名編集画像１１５に文字列が入力されると、編集部３４に、シーン編集指示が入力される。編集部３４は、入力されたシーン編集指示に応じて、記憶部３５に記憶されているメタデータを編集することによって、当該シーンに含まれる映像フレームのシーン名を変更し、編集されたメタデータを、利用者端末５０に記憶させる。 The scene edited image 110 also includes a scene name editing image 115. The scene name editing image 115 is an image for editing a scene name. The scene name editing image 115 displays the scene name of the scene indicated by the cursor displayed on the scene display image 112. Furthermore, in the scene name editing image 115, when "edit" is selected by operation input from the user terminal 50, the scene name can be edited. When a character string is input into the scene name editing image 115 through an operation input from the user terminal 50, a scene editing instruction is input into the editing unit 34. In accordance with the input scene editing instruction, the editing unit 34 edits the metadata stored in the storage unit 35, thereby changing the scene name of the video frames included in the scene, and stores the edited metadata in the user terminal 50.

また、シーン編集画像１１０は、シーン一覧画像１１６を含む。シーン一覧画像１１６は、シーンの一覧を表示する。シーン一覧画像１１６には、編集対象の動画に含まれるシーンのシーン名が含まれる。また、シーン一覧画像１１６には、シーン名の各々について、該当するシーンのサムネイルが表示される。また、シーン一覧画像１１６は、各々のシーンについてのタイムコードを含んでいてもよい。タイムコードは、動画の始点からの再生時間によって、各々のシーンが動画のどの位置に含まれるかを示す。 The scene editing image 110 also includes a scene list image 116. The scene list image 116 displays a list of scenes. The scene list image 116 includes scene names of scenes included in the video to be edited. Further, in the scene list image 116, a thumbnail of the corresponding scene is displayed for each scene name. Further, the scene list image 116 may include a time code for each scene. The time code indicates where each scene is included in the video, depending on the playback time from the start point of the video.
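The time codes in the scene list image 116 express each scene's position as playback time from the start of the video. A minimal sketch of that conversion follows. It assumes that scenes are held as frame index ranges, that the frame rate is constant, and that an `HH:MM:SS` format is used. None of these details are stated in the description, and `to_timecode` and `scene_timecodes` are hypothetical helper names.

```python
def to_timecode(frame_index, fps):
    """Convert a frame index to an HH:MM:SS playback time from the video start."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    total_seconds = int(frame_index / fps)  # whole seconds elapsed
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def scene_timecodes(scenes, fps):
    """Map (name, first_frame, last_frame) tuples to (name, start, end) time codes."""
    return [(name, to_timecode(first, fps), to_timecode(last, fps))
            for name, first, last in scenes]
```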

シーン一覧画像１１６は、シーン名入力領域１１７とシーン検索画像１１８を含んでいてもよい。シーン名入力領域１１７は、シーン名の入力のための領域である。また、シーン検索画像１１８は、シーン名入力領域１１７に入力されたシーン名のシーンを検索するための画像である。シーン検索画像１１８が選択されると、シーン名入力領域１１７に入力されているシーン名のシーンが、シーン一覧画像１１６に表示される。 The scene list image 116 may include a scene name input area 117 and a scene search image 118. The scene name input area 117 is an area for inputting a scene name. Further, the scene search image 118 is an image for searching for a scene with the scene name input in the scene name input area 117. When the scene search image 118 is selected, the scene with the scene name input in the scene name input area 117 is displayed on the scene list image 116.

また、シーン一覧画像１１６に表示されているサムネイルは、利用者端末５０からの操作入力によって、異なるシーン名のサムネイルの表示領域へ移動されることが可能である。サムネイルの移動は、シーンの移動に相当する。編集部３４は、シーンの移動を示すシーン編集指示が入力されると、記憶部３５に記憶されているメタデータを編集することによって、移動されたシーンに該当する映像フレームのシーン名を変更する。そして、編集部３４は、編集されたメタデータを動画記憶装置４０に記憶させる。 Furthermore, the thumbnails displayed in the scene list image 116 can be moved to the display area for thumbnails of a different scene name by operation input from the user terminal 50. Moving a thumbnail corresponds to moving the scene. When a scene editing instruction indicating movement of a scene is input, the editing unit 34 edits the metadata stored in the storage unit 35, thereby changing the scene name of the video frames corresponding to the moved scene. The editing unit 34 then stores the edited metadata in the video storage device 40.

また、シーン編集画像１１０は、時間幅編集画像１１９を含んでいてもよい。時間幅編集画像１１９は、シーンの開始位置と終了位置とを編集可能にするための画像である。時間幅編集画像１１９が選択されると、シーン表示画像１１２におけるシーンの開始位置と終了位置とが編集可能になる。編集部３４は、シーンの開始位置または終了位置の編集を示すシーン編集指示が入力されると、記憶部３５に記憶されているメタデータを編集することによって、シーン名が変更になった映像フレームのシーン名を変更する。そして、編集部３４は、編集されたメタデータを動画記憶装置４０に記憶させる。 Further, the scene edited image 110 may include a time width editing image 119. The time width editing image 119 is an image for making the start position and end position of a scene editable. When the time width editing image 119 is selected, the start position and end position of a scene in the scene display image 112 become editable. When a scene editing instruction indicating an edit of the start position or end position of a scene is input, the editing unit 34 edits the metadata stored in the storage unit 35, thereby changing the scene name of each video frame whose scene name is affected by the edit. The editing unit 34 then stores the edited metadata in the video storage device 40.
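Editing a scene's start or end position amounts to renaming the frames between the old and new boundary. The sketch below illustrates this under the assumption that the metadata holds one scene name per frame in a list. `move_boundary` is a hypothetical helper, and the rule that frames crossed by the boundary take the name of the scene that grows is an illustrative choice, not something the description prescribes.

```python
def move_boundary(scene_names, old_start, new_start):
    """Move the boundary located just before frame `old_start` to `new_start`.

    Frames crossed by the boundary are renamed to the scene that grows.
    Returns a new list and leaves the input unchanged.
    """
    count = len(scene_names)
    if not (0 < old_start < count and 0 < new_start < count):
        raise ValueError("boundary must lie strictly inside the video")
    updated = list(scene_names)
    earlier_scene = scene_names[old_start - 1]
    later_scene = scene_names[old_start]
    if new_start < old_start:
        # The later scene now starts sooner and absorbs the crossed frames.
        for i in range(new_start, old_start):
            updated[i] = later_scene
    else:
        # The earlier scene now ends later and absorbs the crossed frames.
        for i in range(old_start, new_start):
            updated[i] = earlier_scene
    return updated
```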

また、編集部３４は、ラベル編集画像表示指示に応じて、ラベル編集画像１２０を利用者端末５０に表示させる。ラベル編集画像表示指示は、ラベル編集画像１２０の表示を指示する指示である。ラベル編集画像１２０は、ラベルの編集のための画像である。また、編集部３４は、取得した動画とメタデータとに基づいて、利用者端末５０に表示させるラベル編集画像１２０を生成する。 Further, the editing unit 34 displays the label edited image 120 on the user terminal 50 in response to the label edited image display instruction. The label edited image display instruction is an instruction to display the label edited image 120. The label editing image 120 is an image for editing a label. Furthermore, the editing unit 34 generates a label edited image 120 to be displayed on the user terminal 50 based on the acquired video and metadata.

図８に、ラベル編集画像１２０の例を示す。 FIG. 8 shows an example of the label edited image 120.

ラベル編集画像１２０は、元動画表示画像１２１を含む。元動画表示画像１２１については、元動画表示画像１１１と同様のため、説明を省略する。 The label edited image 120 includes an original moving image display image 121. The original moving image display image 121 is the same as the original moving image display image 111, so a description thereof will be omitted.

また、ラベル編集画像１２０は、シーン表示画像１２２を含む。シーン表示画像１２２は、シーン表示画像１１２と同様のため、説明を省略する。 Furthermore, the label editing image 120 includes a scene display image 122. The scene display image 122 is similar to the scene display image 112, so the description thereof will be omitted.

また、ラベル編集画像１２０は、切り出し画像１２３を含む。切り出し画像１２３は、切り出し画像１１３と同様のため、説明を省略する。 The label edited image 120 also includes a cutout image 123. The cutout image 123 is similar to the cutout image 113, so the description thereof will be omitted.

また、ラベル編集画像１２０は、ジャンル画像１２４を含む。ジャンル画像１２４には、編集対象の動画のジャンル情報が示される。編集部３４は、編集対象の動画のメタデータに含まれるジャンル情報を、ジャンル画像１２４に表示させる。 The label edited image 120 also includes a genre image 124. The genre image 124 shows genre information of the video to be edited. The editing unit 34 displays genre information included in the metadata of the video to be edited on the genre image 124.

また、ジャンル画像１２４では、利用者端末５０からの操作入力に応じて、表示されるジャンル情報が変更される。ジャンル画像１２４の「登録」を選択する操作入力がされると、編集部３４は、記憶部３５に記憶されているメタデータを編集することによって、ジャンル情報を、ジャンル画像１２４に表示されているジャンル情報へ変更する。 Furthermore, in the genre image 124, the displayed genre information is changed in response to an operation input from the user terminal 50. When an operation input to select "Register" in the genre image 124 is made, the editing unit 34 edits the metadata stored in the storage unit 35, thereby changing the genre information to the genre information displayed on the genre image 124.

第一の学習モデルおよび第二の学習モデルは、ジャンル情報に基づく学習モデルである。そのため、ジャンル情報が変更された場合、再解析が行われると、映像フレームのシーン名やラベルが変更になる可能性がある。そのため、編集装置３０は、ジャンル情報が変更された場合に、動画解析装置２０に対して、編集対象の動画の再解析を指示してもよい。そして、編集部３４は、再解析後のメタデータを動画記憶装置４０から取得して、新たなメタデータに基づくシーン編集画像１１０やラベル編集画像１２０を利用者端末５０に表示させてもよい。 The first learning model and the second learning model are learning models based on genre information. Therefore, if the genre information is changed and reanalysis is performed, the scene name or label of the video frame may be changed. Therefore, when the genre information is changed, the editing device 30 may instruct the video analysis device 20 to re-analyze the video to be edited. Then, the editing unit 34 may acquire the metadata after the reanalysis from the video storage device 40 and display the scene edited image 110 and the label edited image 120 based on the new metadata on the user terminal 50.

また、ラベル編集画像１２０は、ラベル表示画像１２５を含む。ラベル表示画像１２５は、ラベルの編集のための画像である。ラベル表示画像１２５には、シーン表示画像１２２に表示されているカーソルによって選択されている映像フレームのラベルが表示される。編集部３４は、編集対象の動画のメタデータを参照して、当該映像フレームに設定されているラベルをラベル表示画像１２５に表示する。また、ラベル表示画像１２５に表示されているラベルは、利用者端末５０からの操作入力によって削除されることが可能である。 Furthermore, the label editing image 120 includes a label display image 125. The label display image 125 is an image for editing a label. The label display image 125 displays the label of the video frame selected by the cursor displayed on the scene display image 122. The editing unit 34 refers to the metadata of the video to be edited and displays the label set for the video frame in the label display image 125. Further, the label displayed on the label display image 125 can be deleted by operation input from the user terminal 50.

また、ラベル表示画像１２５には、類似ラベルが含まれる。類似ラベルは、追加されるラベルの候補である。 Further, the label display image 125 includes similar labels. Similar labels are candidates for labels to be added.

類似ラベルは、たとえば、設定されているラベルに類似した単語であってもよい。編集部３４は、たとえば、類似単語辞書を参照して、映像フレームに設定されているラベルに類似する単語を、類似ラベルとしてラベル表示画像１２５に表示させる。この場合、記憶部３５には、類似単語辞書が記憶されている。 The similar label may be, for example, a word similar to the set label. For example, the editing unit 34 refers to a similar word dictionary and displays words similar to the label set in the video frame on the label display image 125 as similar labels. In this case, the storage unit 35 stores a similar word dictionary.
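A lookup against a similar word dictionary could look like the sketch below. The dictionary format (a mapping from a word to a list of similar words) and the name `suggest_similar_labels` are assumptions for illustration. The description does not specify how the similar word dictionary in the storage unit 35 is organized.

```python
def suggest_similar_labels(current_labels, similar_words):
    """Return candidate labels similar to the labels already set on a frame."""
    suggestions = []
    for label in current_labels:
        for word in similar_words.get(label, []):
            # Skip words that are already set or already suggested.
            if word not in current_labels and word not in suggestions:
                suggestions.append(word)
    return suggestions
```

With a dictionary entry such as `{"batting": ["hitting", "swing"]}`, a frame labeled "batting" would be offered "hitting" and "swing" as candidates.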

また、類似ラベルは、たとえば、類似動画に設定されているラベルであってもよい。編集部３４は、たとえば、編集対象の動画とジャンル情報が同じ動画を、動画記憶装置４０に記憶されている動画の中から検索し、検索された動画のメタデータを取得する。そして、編集部３４は、当該メタデータに含まれているラベルを、類似ラベルとしてラベル表示画像１２５に表示させる。また、たとえば、編集部３４は、動画の特徴が類似する動画を、動画記憶装置４０に記憶されている動画の中から検索してもよい。この場合、動画解析装置２０の出力部２３は、解析の際に、動画の特徴量を算出して、動画記憶装置４０に記憶させておく。そして、編集部３４は、編集対象の動画の特徴量と近い特徴量を持つ動画を、動画記憶装置４０から検索する。 Further, the similar label may be, for example, a label set for a similar video. For example, the editing unit 34 searches the videos stored in the video storage device 40 for a video that has the same genre information as the video to be edited, and acquires the metadata of the video found. Then, the editing unit 34 displays the labels included in that metadata as similar labels on the label display image 125. Alternatively, for example, the editing unit 34 may search the videos stored in the video storage device 40 for videos with similar features. In this case, the output unit 23 of the video analysis device 20 calculates the feature amount of each video during analysis and stores it in the video storage device 40 in advance. Then, the editing unit 34 searches the video storage device 40 for a video whose feature amount is close to that of the video to be edited.
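The feature-based search for similar videos might be sketched as follows. The description does not say how feature amounts are computed or compared, so the use of cosine similarity, the list-of-floats feature format, and the names `cosine_similarity` and `similar_labels` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_labels(target_features, archive, top_k=1):
    """Collect labels from the top_k archived videos closest to the target video.

    `archive` is a list of dicts with hypothetical keys "features" and "labels".
    """
    ranked = sorted(
        archive,
        key=lambda entry: cosine_similarity(target_features, entry["features"]),
        reverse=True,
    )
    labels = []
    for entry in ranked[:top_k]:
        for label in entry["labels"]:
            if label not in labels:  # keep each candidate once, in rank order
                labels.append(label)
    return labels
```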

編集部３４は、類似ラベルの追加を指示するラベル編集指示が入力されると、メタデータを編集することによって、選択された類似ラベルを、映像フレームのラベルとして追加する。より具体的には、編集部３４は、ラベル追加画像１２７の「追加」を選択する操作入力がされると、類似ラベルの横のチェックボックスにチェックが入力されている類似ラベルを、映像フレームに設定されているラベルとして追加する。 When a label editing instruction to add a similar label is input, the editing unit 34 edits the metadata, thereby adding the selected similar label as a label of the video frame. More specifically, when an operation input to select "Add" in the label addition image 127 is made, the editing unit 34 adds each similar label whose adjacent checkbox is checked as a label set for the video frame.

また、ラベル編集画像１２０は、ラベルなしシーン画像１２６を含む。ラベルなしシーン画像１２６には、ラベルが設定されていないシーンのサムネイルが表示される。ラベルが設定されていないシーンとは、ラベルが設定されている映像フレームがないシーンである。編集部３４は、編集対象の動画のメタデータに基づいて、ラベルが設定されていないシーンのサムネイルをラベルなしシーン画像１２６に表示させる。また、ラベル追加画像１２７の「追加」を選択する操作入力がされ、ラベルなしシーンとして表示されていたシーンに対してラベルが設定されると、編集部３４は、ラベルなしシーン画像１２６から、当該シーンを削除する。なお、編集部３４は、保存用画像１２９または登録用画像１３０を選択する操作入力がされたタイミングで、ラベルが設定されたシーンを、ラベルなしシーン画像１２６から削除してもよい。 Additionally, the label edited image 120 includes an unlabeled scene image 126. The unlabeled scene image 126 displays thumbnails of scenes for which no labels have been set. A scene for which no label has been set is a scene that contains no video frame with a label. The editing unit 34 causes the unlabeled scene image 126 to display thumbnails of scenes for which no labels have been set, based on the metadata of the video to be edited. Further, when an operation input to select "Add" in the label addition image 127 is made and a label is set for a scene that was displayed as an unlabeled scene, the editing unit 34 deletes that scene from the unlabeled scene image 126. Note that the editing unit 34 may instead delete the newly labeled scene from the unlabeled scene image 126 at the timing when an operation input to select the storage image 129 or the registration image 130 is made.
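Because an unlabeled scene is defined as a scene in which no video frame has a label, the scenes to show in the unlabeled scene image 126 can be found with a single pass over the per-frame metadata. The sketch below assumes each frame is a dict with hypothetical keys `"scene_name"` and `"labels"`, and it identifies scenes by scene name for simplicity.

```python
def unlabeled_scenes(frames):
    """Return scene names (in first-appearance order) with no labeled frame."""
    has_label = {}
    for frame in frames:
        name = frame["scene_name"]
        # A scene counts as labeled as soon as any one of its frames has a label.
        has_label[name] = has_label.get(name, False) or bool(frame["labels"])
    return [name for name, labeled in has_label.items() if not labeled]
```

Recomputing this list after a label is added would drop the newly labeled scene from the display.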

また、ラベル編集画像１２０は、ラベル追加画像１２７を含む。ラベル追加画像１２７は、ラベルの追加のための画像である。編集部３４は、ラベル追加画像１２７を、ラベル表示画像１２５に表示されている「＋」を選択する操作入力がされた場合に表示する。また、編集部３４は、ラベル追加画像１２７を、ラベルなしシーン画像１２６に表示されているサムネイルを選択する操作入力がされた場合に表示する。この場合、編集部３４は、シーン表示画像１２２に表示されているカーソルを、選択されたサムネイルに該当するシーンの位置まで、右または左に移動させる。 The label edited image 120 also includes a label added image 127. The label addition image 127 is an image for adding a label. The editing unit 34 displays the label addition image 127 when an operation input to select "+" displayed on the label display image 125 is made. Further, the editing unit 34 displays the label added image 127 when an operation input for selecting a thumbnail displayed in the unlabeled scene image 126 is performed. In this case, the editing unit 34 moves the cursor displayed on the scene display image 122 to the right or left to the position of the scene corresponding to the selected thumbnail.

編集部３４は、操作入力によってラベル追加画像１２７に文字列が入力され、「追加」を選択する操作入力がされると、入力された文字列を、映像フレームに設定されているラベルとして、ラベル表示画像１２５に表示させる。このとき、チェックボックスにチェックが入力されている類似ラベルがあれば、当該類似ラベルについても、映像フレームに設定されているラベルとして、ラベル表示画像１２５に追加される。 When a character string is input to the label addition image 127 through an operation input and an operation input to select "Add" is made, the editing unit 34 displays the input character string on the label display image 125 as a label set for the video frame. At this time, if there is a similar label whose checkbox is checked, that similar label is also added to the label display image 125 as a label set for the video frame.

また、ラベル編集画像１２０は、ラベルボックス画像１２８を含む。ラベルボックス画像１２８は、ラベルについての領域情報を示すボックスを表示する。ボックスは、領域情報、すなわち、ラベルに相当する被写体が映像フレームのどの領域に撮像されているかを示す。ラベルボックス画像１２８は、シーン表示画像１２２に表示されているカーソルが示す映像フレームについて、当該映像フレームに設定されているラベルのボックスを表示する。このとき、ラベルボックス画像１２８は、シーン表示画像１２２に表示されているカーソルが示す映像フレームの画像に重畳して、ボックスを表示する。また、ラベルボックス画像１２８は、ボックスに対応するラベルも表示する。編集部３４は、メタデータに含まれている領域情報に基づいて、各々のラベルのボックスをラベルボックス画像１２８に表示する。また、ボックスは、利用者端末５０からの操作入力によって、移動され、また、変形される。 The label edited image 120 also includes a label box image 128. The label box image 128 displays boxes indicating the area information of the labels. A box indicates the area information, that is, in which area of the video frame the subject corresponding to the label is captured. For the video frame indicated by the cursor displayed on the scene display image 122, the label box image 128 displays the boxes of the labels set for that video frame. At this time, the label box image 128 displays the boxes superimposed on the image of the video frame indicated by the cursor displayed on the scene display image 122. The label box image 128 also displays the label corresponding to each box. The editing unit 34 displays the box of each label in the label box image 128 based on the area information included in the metadata. Further, a box can be moved and reshaped by operation input from the user terminal 50.

また、編集部３４は、映像フレームに設定されているラベルがラベル表示画像１２５に追加された場合には、ラベルボックス画像１２８に、追加されたラベルのボックスを表示する。なお、このときに表示されるボックスの初期位置および初期サイズは、任意である。また、編集部３４は、映像フレームに設定されているラベルがラベル表示画像１２５から削除された場合には、ラベルボックス画像１２８から、削除されたラベルのボックスを削除する。 Furthermore, when a label set for a video frame is added to the label display image 125, the editing unit 34 displays a box for the added label in the label box image 128. Note that the initial position and initial size of the box displayed at this time are arbitrary. Furthermore, when the label set for the video frame is deleted from the label display image 125, the editing unit 34 deletes the box of the deleted label from the label box image 128.

また、ラベル編集画像１２０は、保存用画像１２９を含む。保存用画像１２９は、ラベルの一時保存のための画像である。編集部３４は、保存用画像１２９を選択する操作入力がされると、記憶部３５に記憶されているメタデータを編集することによって、シーン表示画像１２２に表示されているカーソルが示す映像フレームのラベルを、ラベル表示画像１２５に表示されているラベルへ変更する。このとき、また、編集部３４は、記憶部３５に記憶されているメタデータを編集することによって、シーン表示画像１２２に表示されているカーソルが示す映像フレームのラベルの領域情報を、ラベルボックス画像１２８に表示されているラベルボックスの領域情報へ変更する。また、編集部３４は、このとき、当該映像フレームと同じシーン名を持ち、さらに、変更前のラベルが同じ映像フレームについて、同様のラベル変更を行ってもよい。 The label edited image 120 also includes a storage image 129. The storage image 129 is an image for temporarily saving labels. When an operation input to select the storage image 129 is made, the editing unit 34 edits the metadata stored in the storage unit 35, thereby changing the labels of the video frame indicated by the cursor displayed on the scene display image 122 to the labels displayed in the label display image 125. At this time, the editing unit 34 also edits the metadata stored in the storage unit 35, thereby changing the area information of the labels of the video frame indicated by the cursor displayed on the scene display image 122 to the area information of the label boxes displayed in the label box image 128. Further, at this time, the editing unit 34 may make the same label change to other video frames that have the same scene name as that video frame and had the same labels before the change.
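The optional propagation of a label change to other frames with the same scene name and the same labels before the change can be sketched as below. The frame format (dicts with hypothetical keys `"scene_name"` and `"labels"`), the `propagate` flag, and the name `save_labels` are assumptions, and area information is left out to keep the sketch short.

```python
def save_labels(frames, cursor, new_labels, propagate=False):
    """Replace the labels of frames[cursor]; optionally propagate the change.

    With propagate=True, frames that share the cursor frame's scene name and
    had the same labels before the change receive the same new labels.
    Returns the indices of the frames that were updated.
    """
    scene_name = frames[cursor]["scene_name"]
    old_labels = set(frames[cursor]["labels"])  # captured before any change
    changed = []
    for index, frame in enumerate(frames):
        matches = (
            propagate
            and frame["scene_name"] == scene_name
            and set(frame["labels"]) == old_labels
        )
        if index == cursor or matches:
            frame["labels"] = list(new_labels)
            changed.append(index)
    return changed
```

Comparing label sets rather than lists means the order in which labels were added does not affect which frames are treated as matching.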

また、ラベル編集画像１２０は、登録用画像１３０を含む。登録用画像１３０は、メタデータの登録用の画像である。編集部３４は、登録用画像１３０を選択する操作入力がされると、まず、保存用画像１２９を選択する操作入力がされた場合と同様に、記憶部３５に記憶されているメタデータを編集する。そして、編集部３４は、記憶部３５に記憶されているメタデータを、動画記憶装置４０に記憶させる。 Furthermore, the label edited image 120 includes a registration image 130. The registration image 130 is an image for registering metadata. When an operation input to select the registration image 130 is made, the editing unit 34 first edits the metadata stored in the storage unit 35 in the same way as when an operation input to select the storage image 129 is made. The editing unit 34 then stores the metadata stored in the storage unit 35 in the video storage device 40.

なお、シーン編集画像１１０およびラベル編集画像１２０は、たとえば、編集メニュー画像１４０に対する操作入力によって、利用者端末５０に表示される。図９に編集メニュー画像１４０の例を示す。 Note that the scene editing image 110 and the label editing image 120 are displayed on the user terminal 50 by, for example, an operation input to the editing menu image 140. FIG. 9 shows an example of the edit menu image 140.

編集メニュー画像１４０は、映像情報表示画像１４１を含む。映像情報表示画像１４１は、編集対象の動画の識別情報、たとえばファイル名を表示する。 Edit menu image 140 includes video information display image 141. The video information display image 141 displays identification information of the video to be edited, for example, the file name.

また、編集メニュー画像１４０は、動画選択用画像１４２を含む、動画選択用画像１４２は、編集対象の動画を選択するための画像である。編集部３４は、動画選択用画像１４２を選択する操作入力がされると、動画の一覧などを表示する。また、編集部３４は、編集対象の動画が選択されると、選択された動画の識別情報を映像情報表示画像１４１に表示させる。また、編集部３４は、選択された動画と当該動画のメタデータとを取得して、取得した動画とメタデータとを記憶部３５に記憶させる。 Further, the editing menu image 140 includes a moving image selection image 142. The moving image selection image 142 is an image for selecting a moving image to be edited. When an operation input to select the moving image selection image 142 is input, the editing unit 34 displays a list of moving images and the like. Further, when a moving image to be edited is selected, the editing unit 34 causes the identification information of the selected moving image to be displayed on the video information display image 141. The editing unit 34 also acquires the selected video and the metadata of the video, and stores the acquired video and metadata in the storage unit 35.

また、編集メニュー画像１４０は、シーン編集画像表示用画像１４３を含む。シーン編集画像表示用画像１４３は、シーン編集画像１１０の表示のための画像である。シーン編集画像表示用画像１４３が選択されると、シーン編集画像表示指示が編集部３４に入力される。編集部３４は、シーン編集画像表示指示が入力されると、シーン編集画像１１０を利用者端末５０に表示させる。 The edit menu image 140 also includes a scene edit image display image 143. The scene edited image display image 143 is an image for displaying the scene edited image 110. When the scene edited image display image 143 is selected, a scene edited image display instruction is input to the editing section 34. When the scene edited image display instruction is input, the editing unit 34 displays the scene edited image 110 on the user terminal 50.

また、編集メニュー画像１４０は、ラベル編集画像表示用画像１４４を含む。ラベル編集画像表示用画像１４４は、ラベル編集画像１２０の表示のための画像である。ラベル編集画像表示用画像１４４が選択されると、ラベル編集画像表示指示が編集部３４に入力される。編集部３４は、ラベル編集画像表示指示が入力されると、ラベル編集画像１２０を利用者端末５０に表示させる。 The edit menu image 140 also includes a label edit image display image 144. The label edited image display image 144 is an image for displaying the label edited image 120. When the label edited image display image 144 is selected, a label edited image display instruction is input to the editing unit 34. When the label edited image display instruction is input, the editing unit 34 displays the label edited image 120 on the user terminal 50.

このように、本実施形態の編集装置３０は、メタデータを編集して、シーン名やラベルを変更することができる。また、学習装置６０は、編集されたメタデータを用いて再学習を行うことができる。このように、編集されたメタデータに基づく学習情報を用いて再学習が行われることによって、利用者が期待する精度のシーン分割や、利用者が期待する粒度のラベル付けを実現することが可能になる。 In this way, the editing device 30 of this embodiment can edit metadata and change scene names and labels. Further, the learning device 60 can perform re-learning using the edited metadata. By performing re-learning with learning information based on the edited metadata in this way, it becomes possible to achieve scene division with the precision the user expects and labeling with the granularity the user expects.

たとえば、映像フレームに設定されているラベルとして、ラベル表示画像１２５に、「バッティング」が表示されていたとする。このとき、利用者は、切り出し画像１２３などを確認して、さらに詳細なラベル（たとえば、「ノーステップ打法」「一本足打法」）を追加することができる。また、学習装置６０は、追加されたラベルを含む学習情報を用いて再学習を行うことができる。追加されたラベルを含む学習情報を用いた再学習が行われて新たな第二の学習モデルが生成されると、第二の学習モデルによって、追加されたラベルが自動的に設定されるようになる。このように、編集されたラベルを含む学習情報を用いて再学習が行われることによって、利用者が期待する粒度のラベル付けを実現することが可能になる。 For example, assume that "batting" is displayed in the label display image 125 as a label set for a video frame. At this time, the user can check the cutout image 123 and the like and add more detailed labels (for example, "no-step batting method" and "one-leg batting method"). Further, the learning device 60 can perform re-learning using learning information that includes the added labels. Once re-learning with learning information that includes the added labels has been performed and a new second learning model has been generated, the second learning model comes to set the added labels automatically. By performing re-learning with learning information that includes the edited labels in this way, it becomes possible to achieve labeling with the granularity the user expects.

次に、図１０に、本実施形態の学習装置６０の動作フローの例を示す。学習装置６０は、第一の学習モデルの生成と第二の学習モデルの生成の各々について、図１０の動作を行う。また、学習装置６０は、所定時間おきや、利用者端末５０から学習が指示された場合などに、図１０の動作を行う。 Next, FIG. 10 shows an example of the operation flow of the learning device 60 of this embodiment. The learning device 60 performs the operations shown in FIG. 10 for each of generation of the first learning model and generation of the second learning model. Further, the learning device 60 performs the operation shown in FIG. 10 at predetermined intervals or when learning is instructed from the user terminal 50.

学習情報入力部６１は、学習情報を受信し、学習情報記憶部６２に記憶させる（ステップＳ２０１）。モデル生成部６３は、学習情報を用いて学習モデルを生成して出力する（ステップＳ２０２）。 The learning information input unit 61 receives the learning information and stores it in the learning information storage unit 62 (step S201). The model generation unit 63 generates and outputs a learning model using the learning information (step S202).

次に、図１１に、本実施形態の動画解析装置２０の動作フローの例を示す。動画解析装置２０は、所定時間おきや、利用者端末５０から解析が指示された場合などに、図１１の動作を行う。 Next, FIG. 11 shows an example of the operation flow of the video analysis device 20 of this embodiment. The video analysis device 20 performs the operation shown in FIG. 11 at predetermined intervals or when an analysis is instructed from the user terminal 50.

また、動画解析装置２０は、新たな解析対象の動画が動画記憶装置４０に追加された場合などに、図１１の動作を行ってもよい。たとえば、動画解析装置２０は、動画記憶装置４０から所定時間おきに動画一覧を取得して新旧の動画一覧を比較することにより、新たな動画が動画記憶装置４０に追加されたことを検知することができる。または、動画記憶装置４０に新たな動画を追加した装置から、動画解析装置２０に対して、動画を追加したことを示す通知が送信されてもよい。 Further, the video analysis device 20 may perform the operation shown in FIG. 11 when, for example, a new video to be analyzed is added to the video storage device 40. For example, the video analysis device 20 can detect that a new video has been added to the video storage device 40 by acquiring a video list from the video storage device 40 at predetermined time intervals and comparing the old and new video lists. Alternatively, the device that added the new video to the video storage device 40 may send the video analysis device 20 a notification indicating that a video has been added.
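Detecting newly added videos by comparing the old and new video lists reduces to a set difference, as in the sketch below. It assumes videos are identified by strings such as file names, and `detect_new_videos` is a hypothetical helper name. How the list is actually fetched from the video storage device 40 is outside the sketch.

```python
def detect_new_videos(previous_list, current_list):
    """Return identifiers present in the current list but not the previous one."""
    seen = set(previous_list)
    # Preserve the order of the current listing so analysis can follow it.
    return [video_id for video_id in current_list if video_id not in seen]
```

A polling loop would call this at each interval with the last two listings and start analysis for every identifier returned.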

シーン分割部２１は、解析の対象の動画と当該動画のメタデータとを、動画記憶装置４０から取得する（ステップＳ３０１）。また、シーン分割部２１は、ジャンル情報に基づいて、解析対象の動画をシーンに分割する。また、シーン分割部２１は、動画に含まれる映像フレームの各々に対して、シーン名を決定する（ステップＳ３０２）。シーン分割部２１は、学習装置６０で生成された第一の学習モデルを使用して、シーン分割とシーン名の決定とを行う。シーン分割部２１は、第一の学習モデルに、解析対象の動画と当該動画のジャンル情報とを入力する。第一の学習モデルは、動画に含まれる映像フレームの各々に対するシーン名を出力する。 The scene dividing unit 21 acquires a video to be analyzed and metadata of the video from the video storage device 40 (step S301). Furthermore, the scene dividing unit 21 divides the moving image to be analyzed into scenes based on the genre information. Furthermore, the scene dividing unit 21 determines a scene name for each video frame included in the video (step S302). The scene division unit 21 uses the first learning model generated by the learning device 60 to perform scene division and scene name determination. The scene dividing unit 21 inputs a video to be analyzed and genre information of the video to the first learning model. The first learning model outputs a scene name for each video frame included in the video.

ラベル設定部２２は、解析対象の動画に含まれる映像フレームの各々に対してラベルを設定する（ステップＳ３０３）。ラベル設定部２２は、学習装置６０で生成された第二の学習モデルを使用して、ラベルの設定を行う。また、ラベル設定部２２は、第二の学習モデルを使用して、ラベルの各々についての領域情報を決定する。ラベル設定部２２は、第二の学習モデルに、同一シーンフレーム、当該同一シーンフレームのシーン名、当該同一シーンフレームを含む動画のジャンル情報を入力する。第二の学習モデルは、当該同一シーンフレームに含まれる映像フレームの各々に対するラベルと、ラベルの領域情報とを出力する。 The label setting unit 22 sets a label for each video frame included in the moving image to be analyzed (step S303). The label setting unit 22 uses the second learning model generated by the learning device 60 to set a label. Furthermore, the label setting unit 22 uses the second learning model to determine area information for each label. The label setting unit 22 inputs the same scene frame, the scene name of the same scene frame, and the genre information of the video including the same scene frame to the second learning model. The second learning model outputs a label for each video frame included in the same scene frame and label area information.

出力部２３は、シーン分割部２１で決定されたシーン名と、ラベル設定部２２で設定されたラベルとを出力する。より具体的には、出力部２３は、映像フレームの各々についてのシーン名およびラベルと、ラベルの各々についての領域情報とをメタデータに含ませて、動画記憶装置４０に記憶させる（ステップＳ３０４）。 The output unit 23 outputs the scene name determined by the scene dividing unit 21 and the label set by the label setting unit 22. More specifically, the output unit 23 includes the scene name and label for each video frame, and the region information for each label, in the metadata, and stores the metadata in the video storage device 40 (step S304).
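
One possible shape for the stored metadata is sketched below. The specification does not define a concrete format, so the class names, field names, the `[x, y, width, height]` region layout, and the JSON serialization are all assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LabelEntry:
    label: str
    region: list  # assumed layout: [x, y, width, height] in pixels


@dataclass
class FrameMetadata:
    frame_index: int
    scene_name: str
    labels: list = field(default_factory=list)  # list of LabelEntry


def build_metadata(genre, frames):
    """Assemble a metadata dictionary holding per-frame scene names and labels."""
    return {"genre": genre, "frames": [asdict(frame) for frame in frames]}


# Hypothetical single-frame example.
frame = FrameMetadata(0, "pitching", [LabelEntry("pitcher", [10, 20, 100, 200])])
metadata = build_metadata("sports", [frame])
serialized = json.dumps(metadata, ensure_ascii=False)
```

Keeping the region next to its label means an editing screen can draw a box for each label without a second lookup.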

次に、図１２に、本実施形態の編集装置３０の動作フローの例を示す。 Next, FIG. 12 shows an example of the operation flow of the editing device 30 of this embodiment.

編集部３４は、利用者端末５０からの動画取得指示に応じて、動画記憶装置４０から、編集対象の動画と当該動画のメタデータとを取得する（ステップＳ４０１）。また、編集部３４は、取得した動画およびメタデータを記憶部３５に記憶させる。 The editing unit 34 acquires a video to be edited and metadata of the video from the video storage device 40 in response to a video acquisition instruction from the user terminal 50 (step S401). Furthermore, the editing unit 34 causes the storage unit 35 to store the acquired moving image and metadata.

編集部３４は、シーン編集画像表示指示に応じて、シーン編集画像１１０を利用者端末５０に表示させる。また、編集部３４は、ラベル編集画像表示指示に応じて、ラベル編集画像１２０を利用者端末５０に表示させる（ステップＳ４０２）。 The editing unit 34 displays the scene editing image 110 on the user terminal 50 in response to the scene editing image display instruction. The editing unit 34 also displays the label editing image 120 on the user terminal 50 in response to the label editing image display instruction (step S402).

そして、編集部３４は、シーン編集指示に応じて、シーン編集画像１１０を更新する。また、編集部３４は、シーン編集指示に応じて、記憶部３５に記憶されているメタデータを編集する。また、編集部３４は、ラベル編集指示に応じて、ラベル編集画像１２０を更新する。また、編集部３４は、ラベル編集指示に応じて、記憶部３５に記憶されているメタデータを編集する。また、編集部３４は、指示に応じて、記憶部３５に記憶されているメタデータを、動画記憶装置４０に記憶させる（ステップＳ４０３）。 The editing unit 34 then updates the scene editing image 110 in response to a scene editing instruction, and edits the metadata stored in the storage unit 35 accordingly. Likewise, the editing unit 34 updates the label editing image 120 in response to a label editing instruction, and edits the metadata stored in the storage unit 35 accordingly. In response to an instruction, the editing unit 34 also stores the metadata held in the storage unit 35 in the video storage device 40 (step S403).

以上で説明したように、本発明の第二の実施形態では、動画解析装置２０は、シーン分割部２１とラベル設定部２２と出力部２３とを含む。シーン分割部２１は、ジャンル情報に基づいて、動画をシーンに分割する。また、シーン分割部２１は、シーンの各々に対して、シーン名を決定する。シーン分割部２１は、第一の学習モデルを使用して、シーンの分割とシーン名の決定とを行う。第一の学習モデルは、動画と当該動画のジャンル情報とから、動画をシーンごとに分割し、シーンの各々に対するシーン名を決定する。ラベル設定部２２は、シーンに含まれる映像フレームの各々に対して、シーン名に基づいて、ラベルを設定する。出力部２３は、映像フレームの各々について、映像フレームが含まれるシーンのシーン名と、ラベルとを出力する。 As described above, in the second embodiment of the present invention, the video analysis device 20 includes a scene dividing unit 21, a label setting unit 22, and an output unit 23. The scene dividing unit 21 divides the video into scenes based on genre information, and determines a scene name for each scene. The scene dividing unit 21 uses the first learning model to divide the scenes and determine the scene names. The first learning model divides the video into scenes based on the video and the genre information of the video, and determines a scene name for each scene. The label setting unit 22 sets a label for each video frame included in a scene based on the scene name. The output unit 23 outputs, for each video frame, the scene name of the scene in which the video frame is included and the label.

このように、動画解析装置２０は、ジャンル情報に基づいて、シーンの分割とシーン名の決定とを行う。これにより、動画解析装置２０は、ジャンル情報によって示されるジャンルで頻出するシーンとしてシーンが解析される可能性を向上する。その結果、動画解析装置２０は、動画の特徴量が似ているシーンであるが、誤ったシーンとしてシーンが解析される可能性を低減することができる。したがって、シーン解析の信頼性を向上することができる。また、動画解析装置２０は、シーン名に基づいてラベルを設定するので、シーン解析の信頼性の向上によって、被写体の解析の信頼性も向上する。そのため、シーンや被写体の解析の信頼性をより向上することが可能になる。 In this way, the video analysis device 20 divides scenes and determines scene names based on genre information. This makes a scene more likely to be analyzed as one of the scenes that appear frequently in the genre indicated by the genre information. As a result, the video analysis device 20 can reduce the possibility that a scene is mistakenly analyzed as a different scene that merely has similar video features. The reliability of scene analysis can therefore be improved. Further, since the video analysis device 20 sets labels based on the scene name, the improved reliability of scene analysis also improves the reliability of subject analysis. It thus becomes possible to further improve the reliability of scene and subject analysis.

ラベル設定部２２は、第二の学習モデルを使用して、ラベルの設定を行う。第二の学習モデルは、同一シーンフレームと、当該同一シーンフレームのシーン名と、当該同一シーンフレームを含む動画のジャンル情報とから、当該同一シーンフレームに含まれる映像フレームの各々に対するラベルを設定する。同一シーンフレームは、シーン名が同一である映像フレームである。これにより、ラベルの設定についてもジャンル情報が使用されるので、被写体の解析の信頼性をより向上することが可能になる。また、ラベル設定に第二の学習モデルが使用されることによって、設定されるラベルのばらつきを、人手でラベルが設定される場合に比べて、低減することができる。 The label setting unit 22 uses the second learning model to set labels. The second learning model sets a label for each video frame included in the same scene frames based on the same scene frames, the scene name of the same scene frames, and the genre information of the video that includes the same scene frames. The same scene frames are video frames that have the same scene name. As a result, the genre information is also used for label setting, making it possible to further improve the reliability of subject analysis. Furthermore, by using the second learning model for label setting, variation in the labels that are set can be reduced compared to when labels are set manually.

また、第二の学習モデルは、さらに、ラベルの各々について、領域情報を出力する。領域情報は、映像フレームのどの領域に、ラベルに相当する被写体が撮像されているかを示す情報である。また、ラベル設定部２２は、第二の学習モデルを使用して、ラベルの各々についての領域情報を決定する。出力部２３は、さらに、領域情報を出力する。これによって、領域情報の可視化が可能になるので、利用者にとっての利便性が向上する。 Furthermore, the second learning model further outputs region information for each label. The region information is information indicating in which region of the video frame the subject corresponding to the label is imaged. The label setting unit 22 uses the second learning model to determine the region information for each label. The output unit 23 further outputs the region information. This makes it possible to visualize the region information, improving convenience for the user.

また、出力部２３は、シーン名を、メタデータに含ませて、動画記憶装置に記憶させる。メタデータは、動画に関する情報である。動画記憶装置は、動画とメタデータとを記憶する。これにより、利用者が必要とするタイミングで、利用者は、メタデータを利用することができる。 Furthermore, the output unit 23 includes the scene name in the metadata and stores it in the video storage device. Metadata is information about a video. The video storage device stores videos and metadata. This allows the user to use the metadata at the timing the user needs.

また、編集装置３０は、編集部３４を備える。編集部３４は、動画記憶装置４０から、編集対象の動画と当該動画に関するメタデータとを取得する。また、編集部３４は、シーン編集画像表示指示に応じて、取得した動画とメタデータとに基づいて、シーン編集画像を利用者端末５０に表示させる。シーン編集画像表示指示は、シーン編集画像の表示を指示する指示である。シーン編集画像は、シーン名の編集のための画像である。また、編集部３４は、シーン編集指示に応じて、メタデータを編集する。シーン編集指示は、シーン名の編集に関する指示である。また、編集部３４は、シーン編集指示に応じて、メタデータに含まれるシーン名を編集し、編集されたメタデータを、動画記憶装置４０に記憶させる。これにより、シーン名の利用者による編集が可能になる。 The editing device 30 also includes an editing unit 34. The editing unit 34 acquires a video to be edited and metadata regarding the video from the video storage device 40. In response to the scene editing image display instruction, the editing unit 34 causes the user terminal 50 to display the scene editing image based on the acquired video and metadata. The scene editing image display instruction is an instruction to display the scene editing image. The scene editing image is an image for editing scene names. The editing unit 34 also edits the metadata in response to a scene editing instruction. The scene editing instruction is an instruction regarding editing of a scene name. Specifically, the editing unit 34 edits the scene name included in the metadata in accordance with the scene editing instruction, and stores the edited metadata in the video storage device 40. This allows the user to edit scene names.

また、シーン編集画像は、編集対象の動画に含まれるシーンの各々についてのシーン名とサムネイルとを含む。これにより、シーン分割の結果を利用者が容易に確認することが可能になる。 Furthermore, the scene editing image includes a scene name and a thumbnail for each scene included in the video to be edited. This allows the user to easily check the result of scene division.

また、編集部３４は、異なるシーン名のサムネイルの表示領域へサムネイルが移動されることによって、シーンの移動を示すシーン編集指示が入力されると、メタデータを編集することによって、移動されたシーンに相当する映像フレームのシーン名を変更する。これにより、シーン名の変更を、容易に実現することが可能になる。 Further, when a scene editing instruction indicating movement of a scene is input by moving a thumbnail to the display area of thumbnails having a different scene name, the editing unit 34 changes the scene name of the video frames corresponding to the moved scene by editing the metadata. This makes it possible to change a scene name easily.
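
The metadata edit triggered by moving a thumbnail can be sketched as follows. The per-frame dictionary layout and the function name `move_scene` are assumptions for illustration; the specification only states that the scene name of the frames belonging to the moved scene is changed.

```python
def move_scene(frame_metadata, start, end, new_scene_name):
    """Rename the scene of every frame whose index lies in start..end (inclusive).

    frame_metadata is assumed to be a list of dictionaries with the keys
    "frame_index" and "scene_name". The list is edited in place and returned.
    """
    for frame in frame_metadata:
        if start <= frame["frame_index"] <= end:
            frame["scene_name"] = new_scene_name
    return frame_metadata


# Hypothetical metadata: frames 2 and 3 form a scene dragged to the "batting" area.
frames = [
    {"frame_index": 0, "scene_name": "pitching"},
    {"frame_index": 1, "scene_name": "pitching"},
    {"frame_index": 2, "scene_name": "fielding"},
    {"frame_index": 3, "scene_name": "fielding"},
]
move_scene(frames, 2, 3, "batting")
```

After the call, frames 2 and 3 carry the scene name "batting" while frames 0 and 1 are unchanged.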

また、編集部３４は、動画記憶装置４０から、編集対象の動画と当該動画に関するメタデータとを取得する。また、編集部３４は、ラベル編集画像表示指示に応じて、取得した動画とメタデータとに基づいて、ラベル編集画像を利用者端末５０に表示させる。ラベル編集画像表示指示は、ラベル編集画像の表示を指示する指示である。ラベル編集画像は、ラベルの編集のための画像である。また、編集部３４は、ラベル編集指示に応じて、メタデータを編集する。ラベル編集指示は、ラベルの編集に関する指示である。また、編集部３４は、ラベル編集指示に応じて、メタデータに含まれるラベルを編集し、編集されたメタデータを、動画記憶装置４０に記憶させる。これにより、ラベルの利用者による編集が可能になる。 The editing unit 34 also acquires a video to be edited and metadata regarding the video from the video storage device 40. In response to the label editing image display instruction, the editing unit 34 causes the user terminal 50 to display the label editing image based on the acquired video and metadata. The label editing image display instruction is an instruction to display the label editing image. The label editing image is an image for editing labels. The editing unit 34 also edits the metadata in response to a label editing instruction. The label editing instruction is an instruction regarding editing of a label. Specifically, the editing unit 34 edits the label included in the metadata in accordance with the label editing instruction, and stores the edited metadata in the video storage device 40. This allows the user to edit labels.

また、ラベル編集画像は、追加されるラベルの候補である類似ラベルを含む。編集部３４は、類似ラベルの追加を指示するラベル編集指示が入力されると、メタデータを編集することによって、選択された類似ラベルを、映像フレームのラベルとして追加する。これにより、ラベルの追加が容易になる。 The label editing image also includes similar labels, which are candidates for labels to be added. When a label editing instruction to add a similar label is input, the editing unit 34 adds the selected similar label as a label of the video frame by editing the metadata. This makes it easy to add labels.

また、類似ラベルは、映像フレームに設定されているラベルに類似した単語である。これにより、類似した単語をラベルに追加することが容易になる。 Further, the similar label is a word similar to the label set for the video frame. This makes it easy to add similar words to the label.

また、類似ラベルは、類似動画に設定されているラベルである。類似動画は、編集対象の動画とジャンル情報が同じ動画、または、特徴量が近い動画である。編集部３４は、動画記憶装置４０に記憶されている動画の中から類似動画を検索し、検索された類似動画のメタデータに含まれているラベルを、類似ラベルとする。これにより、類似動画に設定されているラベルを追加することが容易になる。 Further, the similar label is a label set to a similar video. A similar video is a video that has the same genre information as the video to be edited, or a video that has similar feature amounts. The editing unit 34 searches for similar videos from the videos stored in the video storage device 40, and sets a label included in the metadata of the searched similar video as a similar label. This makes it easy to add labels set to similar videos.
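
Collecting similar labels from similar videos can be sketched as follows. Only the "same genre information" criterion is implemented; the feature-based criterion is left out because the specification does not say how feature similarity is measured. The dictionary keys `id`, `genre`, and `labels` are assumptions for illustration.

```python
def similar_label_candidates(target, archive):
    """Collect labels from same-genre videos that the target video does not have yet.

    target and each archive entry are assumed to be dictionaries with the keys
    "id", "genre", and "labels" (a list of label strings).
    """
    existing = set(target["labels"])
    candidates = []
    for video in archive:
        # Skip the video being edited and videos of a different genre.
        if video["id"] == target["id"] or video["genre"] != target["genre"]:
            continue
        for label in video["labels"]:
            if label not in existing and label not in candidates:
                candidates.append(label)
    return candidates


# Hypothetical archive contents.
target_video = {"id": "v1", "genre": "sports", "labels": ["pitcher"]}
archive = [
    target_video,
    {"id": "v2", "genre": "sports", "labels": ["pitcher", "batter"]},
    {"id": "v3", "genre": "news", "labels": ["anchor"]},
    {"id": "v4", "genre": "sports", "labels": ["batter", "umpire"]},
]
candidates = similar_label_candidates(target_video, archive)
```

In this example the candidates are "batter" and "umpire"; "anchor" is excluded because its video belongs to a different genre.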

また、ラベル編集画像は、ラベルが設定されていないシーンのサムネイルを含む。これにより、利用者は、ラベルが設定されていないシーンの存在を容易に知ることが可能になる。 Furthermore, the label editing image includes thumbnails of scenes for which no label has been set. This allows the user to easily notice that such unlabeled scenes exist.
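
Finding the scenes whose thumbnails should be shown as unlabeled can be sketched as follows. The sketch assumes that a scene counts as unlabeled when none of its frames has any label; the specification does not define the criterion more precisely, and the data layout is hypothetical.

```python
def unlabeled_scenes(scenes, frame_labels):
    """Return the scenes in which no frame has a label.

    scenes is assumed to be a list of dictionaries with the keys "scene_name",
    "start", and "end" (inclusive frame indices). frame_labels is assumed to
    map a frame index to its list of labels; missing frames count as unlabeled.
    """
    result = []
    for scene in scenes:
        frame_range = range(scene["start"], scene["end"] + 1)
        if not any(frame_labels.get(index) for index in frame_range):
            result.append(scene)
    return result


# Hypothetical data: only the first scene has a labeled frame.
scene_list = [
    {"scene_name": "pitching", "start": 0, "end": 1},
    {"scene_name": "batting", "start": 2, "end": 3},
]
labels_by_frame = {0: ["pitcher"], 2: [], 3: []}
missing = unlabeled_scenes(scene_list, labels_by_frame)
```

Here `missing` contains only the "batting" scene, since the "pitching" scene has a label on frame 0.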

また、ラベル設定部２２は、ラベルの各々について、領域情報を決定する。領域情報は、映像フレームのどの領域に、ラベルに相当する被写体が撮像されているかを示す情報である。出力部２３は、さらに領域情報をメタデータに含めて動画記憶装置４０に記憶させる。ラベル編集画像は、ラベルについての領域情報を示すボックスを含む。これにより、利用者は、ラベルに相当する領域を容易に把握することが可能になる。 Furthermore, the label setting unit 22 determines region information for each label. The region information is information indicating in which region of the video frame the subject corresponding to the label is imaged. The output unit 23 further includes the region information in the metadata and stores it in the video storage device 40. The label editing image includes a box that indicates the region information for a label. This allows the user to easily grasp the region corresponding to the label.

また、学習装置６０は、編集装置３０によって編集されたメタデータを学習情報に用いて、第一の学習モデルを生成する。これにより、第一の学習モデルの信頼性をさらに向上することができる。 Further, the learning device 60 uses the metadata edited by the editing device 30 as learning information to generate a first learning model. Thereby, the reliability of the first learning model can be further improved.

ラベル設定部２２は、第二の学習モデルを使用して、ラベルの設定を行う。第二の学習モデルは、同一シーンフレームと、当該同一シーンフレームのシーン名と、当該同一シーンフレームを含む動画の前記ジャンル情報とから、当該同一シーンフレームに含まれる映像フレームの各々に対するラベルを設定する。同一シーンフレームは、シーン名が同一である映像フレームである。学習装置６０は、編集装置３０によって編集されたメタデータを学習情報に用いて、第二の学習モデルを生成する。これにより、利用者による編集の結果が学習に使用されるので、自動付与されるラベルを、利用者の希望の粒度に近づけることが可能になる。 The label setting unit 22 uses the second learning model to set labels. The second learning model sets a label for each video frame included in the same scene frames from the same scene frames, the scene name of the same scene frames, and the genre information of the video that includes the same scene frames. The same scene frames are video frames that have the same scene name. The learning device 60 uses the metadata edited by the editing device 30 as learning information to generate the second learning model. As a result, the results of editing by the user are used for learning, making it possible to bring the automatically assigned labels closer to the granularity desired by the user.

［ハードウェア構成例］ [Hardware configuration example]
上述した本発明の各実施形態における動画解析装置（１０、２０）、編集装置３０または学習装置６０（以降、動画解析装置等とよぶ）を、一つの情報処理装置（コンピュータ）を用いて実現するハードウェア資源の構成例について説明する。なお、動画解析装置等は、物理的または機能的に少なくとも二つの情報処理装置を用いて実現してもよい。また、動画解析装置等は、専用の装置として実現してもよい。また、動画解析装置等の一部の機能のみを情報処理装置を用いて実現してもよい。 A configuration example of hardware resources for realizing the video analysis device (10, 20), the editing device 30, or the learning device 60 in each embodiment of the present invention described above (hereinafter referred to as the video analysis device and the like) using a single information processing device (computer) will now be described. Note that the video analysis device and the like may be realized physically or functionally using at least two information processing devices. The video analysis device and the like may also be realized as a dedicated device. Further, only some functions of the video analysis device and the like may be realized using an information processing device.

図１３は、本発明の各実施形態の動画解析装置等を実現可能な情報処理装置のハードウェア構成例を概略的に示す図である。情報処理装置９０は、通信インタフェース９１、入出力インタフェース９２、演算装置９３、記憶装置９４、不揮発性記憶装置９５およびドライブ装置９６を含む。 FIG. 13 is a diagram schematically showing an example of the hardware configuration of an information processing device that can implement the video analysis device and the like of each embodiment of the present invention. Information processing device 90 includes a communication interface 91, an input/output interface 92, an arithmetic device 93, a storage device 94, a nonvolatile storage device 95, and a drive device 96.

たとえば、図１のシーン分割部１１およびラベル設定部１２は、演算装置９３で実現することが可能である。また、出力部１３は、通信インタフェース９１および演算装置９３で実現することが可能である。 For example, the scene dividing unit 11 and the label setting unit 12 in FIG. 1 can be realized by the arithmetic device 93. Further, the output unit 13 can be realized by the communication interface 91 and the arithmetic device 93.

通信インタフェース９１は、各実施形態の動画解析装置等が、有線あるいは／および無線で外部装置と通信するための通信手段である。なお、動画解析装置等を、少なくとも二つの情報処理装置を用いて実現する場合、それらの装置の間を通信インタフェース９１経由で相互に通信可能なように接続してもよい。 The communication interface 91 is a communication means through which the video analysis device and the like of each embodiment communicates with an external device by wire and/or wirelessly. Note that when a video analysis device or the like is implemented using at least two information processing devices, these devices may be connected to each other via the communication interface 91 so that they can communicate with each other.

入出力インタフェース９２は、入力デバイスの一例であるキーボードや、出力デバイスとしてのディスプレイ等のマンマシンインタフェースである。 The input/output interface 92 is a man-machine interface such as a keyboard as an example of an input device and a display as an output device.

演算装置９３は、汎用のＣＰＵ（Central Processing Unit）やマイクロプロセッサ等の演算処理装置や複数の電気回路によって実現される。演算装置９３は、たとえば、不揮発性記憶装置９５に記憶された各種プログラムを記憶装置９４に読み出し、読み出したプログラムに従って処理を実行することが可能である。 The arithmetic unit 93 is realized by an arithmetic processing unit such as a general-purpose CPU (Central Processing Unit) or a microprocessor, and a plurality of electric circuits. The arithmetic device 93 can, for example, read various programs stored in the nonvolatile storage device 95 into the storage device 94 and execute processing according to the read programs.

記憶装置９４は、演算装置９３から参照可能な、ＲＡＭ（Random Access Memory）等のメモリ装置であり、プログラムや各種データ等を記憶する。記憶装置９４は、揮発性のメモリ装置であってもよい。 The storage device 94 is a memory device such as a RAM (Random Access Memory) that can be referenced by the arithmetic device 93, and stores programs, various data, and the like. Storage device 94 may be a volatile memory device.

不揮発性記憶装置９５は、たとえば、ＲＯＭ（Read Only Memory）、フラッシュメモリ、等の、不揮発性の記憶装置であり、各種プログラムやデータ等を記憶することが可能である。 The nonvolatile storage device 95 is a nonvolatile storage device such as a ROM (Read Only Memory) or a flash memory, and is capable of storing various programs, data, and the like.

ドライブ装置９６は、たとえば、後述する記録媒体９７に対するデータの読み込みや書き込みを処理する装置である。 The drive device 96 is, for example, a device that reads and writes data to and from a recording medium 97, which will be described later.

記録媒体９７は、たとえば、光ディスク、光磁気ディスク、半導体フラッシュメモリ等、データを記録可能な任意の記録媒体である。 The recording medium 97 is any recording medium capable of recording data, such as an optical disk, a magneto-optical disk, or a semiconductor flash memory.

本発明の各実施形態は、たとえば、図１３に例示した情報処理装置９０により動画解析装置等を構成し、この動画解析装置等に対して、上記各実施形態において説明した機能を実現可能なプログラムを供給することにより実現してもよい。 Each embodiment of the present invention may be realized, for example, by configuring the video analysis device and the like with the information processing device 90 illustrated in FIG. 13 and supplying the video analysis device and the like with a program capable of realizing the functions described in the above embodiments.

この場合、動画解析装置等に対して供給したプログラムを、演算装置９３が実行することによって、実施形態を実現することが可能である。また、動画解析装置等のすべてではなく、一部の機能を情報処理装置９０で構成することも可能である。 In this case, the embodiment can be realized by the arithmetic device 93 executing a program supplied to the video analysis device or the like. Further, it is also possible to configure not all but some functions of the video analysis device and the like with the information processing device 90.

さらに、上記プログラムを記録媒体９７に記録しておくこともできる。そして、動画解析装置等の出荷段階、あるいは運用段階等において、適宜上記プログラムが不揮発性記憶装置９５に格納されるよう構成されてもよい。なお、この場合、上記プログラムの供給方法は、出荷前の製造段階、あるいは運用段階等において、適当な治具を利用して動画解析装置等内にインストールする方法を採用してもよい。また、上記プログラムの供給方法は、インターネット等の通信回線を介して外部からダウンロードする方法等の一般的な手順を採用してもよい。 Furthermore, the above program can also be recorded on the recording medium 97. The program may be configured to be stored in the non-volatile storage device 95 as appropriate during the shipping stage of the video analysis device or the like, or during the operational stage. In this case, the above-mentioned program may be supplied by installing it into the video analysis device or the like using an appropriate jig at the manufacturing stage or operation stage before shipping. Further, as the method for supplying the program, a general procedure such as a method of downloading the program from an external source via a communication line such as the Internet may be adopted.

上記の実施形態の一部または全部は、以下の付記のようにも記載されうるが、以下には限られない。 Part or all of the above embodiments may be described as in the following additional notes, but are not limited to the following.

（付記１）
動画のジャンルを示すジャンル情報に基づいて、前記動画をシーンごとに分割し、前記シーンの各々に対して、前記シーンの分類を示すシーン名を決定するシーン分割部と、
前記シーンに含まれる映像フレームの各々に対して、前記シーン名に基づいて、前記映像フレームに撮像されている被写体に関する情報であるラベルを設定するラベル設定部と、
前記映像フレームの各々について、前記映像フレームが含まれる前記シーンの前記シーン名と、前記ラベルとを出力する出力部と
を備え、
前記シーン分割部は、前記動画と当該動画の前記ジャンル情報とから、前記動画を前記シーンごとに分類し、前記シーンの各々に対する前記シーン名を決定する第一の学習モデルを使用して、前記シーンの分割と前記シーン名の決定とを行う、
動画解析装置。
(Appendix 1)
A video analysis device comprising:
a scene dividing unit that divides a video into scenes based on genre information indicating a genre of the video, and determines, for each of the scenes, a scene name indicating a classification of the scene;
a label setting unit that sets, for each video frame included in the scene, a label that is information about a subject imaged in the video frame, based on the scene name; and
an output unit that outputs, for each of the video frames, the scene name of the scene in which the video frame is included and the label,
wherein the scene dividing unit performs the division into scenes and the determination of the scene name using a first learning model that, from the video and the genre information of the video, classifies the video into the scenes and determines the scene name for each of the scenes.

（付記２）
前記ラベル設定部は、前記シーン名が同一である前記映像フレームである同一シーンフレームと、当該同一シーンフレームの前記シーン名と、当該同一シーンフレームを含む前記動画の前記ジャンル情報とから、当該同一シーンフレームに含まれる前記映像フレームの各々に対する前記ラベルを設定する第二の学習モデルを使用して、前記ラベルの設定を行う、
付記１に記載の動画解析装置。
(Appendix 2)
The video analysis device according to Appendix 1, wherein the label setting unit sets the label using a second learning model that sets the label for each of the video frames included in same scene frames, which are the video frames having the same scene name, from the same scene frames, the scene name of the same scene frames, and the genre information of the video including the same scene frames.

（付記３）
前記第二の学習モデルは、さらに、前記ラベルの各々について、前記映像フレームのどの領域に、前記ラベルに相当する前記被写体が撮像されているかを示す情報である領域情報を出力し、
前記ラベル設定部は、前記第二の学習モデルを使用して、前記ラベルの各々についての前記領域情報を決定し、
前記出力部は、さらに、前記領域情報を出力する、
付記２に記載の動画解析装置。
(Appendix 3)
The video analysis device according to Appendix 2, wherein
the second learning model further outputs, for each of the labels, region information that is information indicating in which region of the video frame the subject corresponding to the label is imaged,
the label setting unit determines the region information for each of the labels using the second learning model, and
the output unit further outputs the region information.

（付記４）
前記出力部は、前記シーン名を、前記動画に関する情報であるメタデータに含ませて、前記動画と前記メタデータとを記憶する動画記憶装置に記憶させる、
付記１に記載の動画解析装置。
(Appendix 4)
The video analysis device according to Appendix 1, wherein the output unit includes the scene name in metadata that is information regarding the video, and stores the metadata in a video storage device that stores the video and the metadata.

（付記５）
付記１に記載の動画解析装置と、
前記第一の学習モデルを生成する学習装置と
を備える動画解析システム。
(Appendix 5)
A video analysis system comprising:
the video analysis device according to Appendix 1; and
a learning device that generates the first learning model.

（付記６）
付記２または付記３に記載の動画解析装置と、
前記第一の学習モデルおよび前記第二の学習モデルを生成する学習装置と
を備える動画解析システム。
(Appendix 6)
A video analysis system comprising:
the video analysis device according to Appendix 2 or Appendix 3; and
a learning device that generates the first learning model and the second learning model.

（付記７）
付記４に記載の動画解析装置と、編集装置とを備え、
前記編集装置は、
前記シーン名の編集に関する指示であるシーン編集指示に応じて、前記メタデータを編集する編集部
を備え、
前記編集部は、
前記動画記憶装置から、編集対象の動画と当該動画に関するメタデータとを取得し、
前記シーン名の編集のための画像であるシーン編集画像の表示を指示するシーン編集画像表示指示に応じて、取得した前記動画と前記メタデータとに基づいて、前記シーン編集画像を利用者端末に表示させ、
前記シーン編集指示に応じて、前記メタデータに含まれる前記シーン名を編集し、編集された前記メタデータを、前記動画記憶装置に記憶させる、
動画解析システム。
(Appendix 7)
A video analysis system comprising the video analysis device according to Appendix 4 and an editing device, wherein
the editing device includes
an editing unit that edits the metadata in accordance with a scene editing instruction that is an instruction regarding editing of the scene name, and
the editing unit
acquires, from the video storage device, a video to be edited and the metadata regarding the video,
displays a scene editing image, which is an image for editing the scene name, on a user terminal based on the acquired video and the metadata, in response to a scene editing image display instruction that instructs display of the scene editing image, and
edits the scene name included in the metadata in response to the scene editing instruction, and stores the edited metadata in the video storage device.

（付記８）
前記シーン編集画像は、編集対象の前記動画に含まれる前記シーンの各々についての前記シーン名とサムネイルとを含む、
付記７に記載の動画解析システム。
(Appendix 8)
The video analysis system according to Appendix 7, wherein the scene editing image includes the scene name and a thumbnail for each of the scenes included in the video to be edited.

（付記９）
前記編集部は、異なる前記シーン名の前記サムネイルの表示領域へ前記サムネイルが移動されることによって、前記シーンの移動を示す前記シーン編集指示が入力されると、前記メタデータを編集することによって、移動された前記シーンに相当する前記映像フレームの前記シーン名を変更する、
付記８に記載の動画解析システム。
(Appendix 9)
The video analysis system according to Appendix 8, wherein, when the scene editing instruction indicating movement of the scene is input by moving the thumbnail to a display area of the thumbnails having a different scene name, the editing unit changes the scene name of the video frames corresponding to the moved scene by editing the metadata.

（付記１０）
付記４に記載の動画解析装置と、編集装置とを備え、
前記編集装置は、
前記動画解析装置は、前記ラベルの編集に関する指示であるラベル編集指示に応じて、前記メタデータを編集する編集部
を備え、
前記編集部は、
前記動画記憶装置から、編集対象の動画と当該動画に関する前記メタデータとを取得し、
前記ラベルの編集のための画像であるラベル編集画像の表示を指示するラベル編集画像表示指示に応じて、取得した前記動画と前記メタデータとに基づいて、前記ラベル編集画像を利用者端末に表示させ、
前記ラベル編集指示に応じて、前記メタデータに含まれる前記ラベルを編集し、編集された前記メタデータを、前記動画記憶装置に記憶させる、
動画解析システム。
(Appendix 10)
A video analysis system comprising the video analysis device according to Appendix 4 and an editing device, wherein
the editing device includes
an editing unit that edits the metadata in accordance with a label editing instruction that is an instruction regarding editing of the label, and
the editing unit
acquires, from the video storage device, a video to be edited and the metadata regarding the video,
displays a label editing image, which is an image for editing the label, on a user terminal based on the acquired video and the metadata, in response to a label editing image display instruction that instructs display of the label editing image, and
edits the label included in the metadata in response to the label editing instruction, and stores the edited metadata in the video storage device.

（付記１１）
前記ラベル編集画像は、追加されるラベルの候補である類似ラベルを含み、
前記編集部は、前記類似ラベルの追加を指示する前記ラベル編集指示が入力されると、前記メタデータを編集することによって、選択された前記類似ラベルを、前記映像フレームの前記ラベルとして追加する、
付記１０に記載の動画解析システム。
(Appendix 11)
The video analysis system according to Appendix 10, wherein
the label editing image includes similar labels that are candidates for labels to be added, and
when the label editing instruction to add the similar label is input, the editing unit adds the selected similar label as the label of the video frame by editing the metadata.

（付記１２）
前記類似ラベルは、前記映像フレームに設定されている前記ラベルに類似した単語である、
付記１１に記載の動画解析システム。
(Appendix 12)
The video analysis system according to Appendix 11, wherein the similar label is a word similar to the label set for the video frame.

（付記１３）
前記類似ラベルは、類似動画に設定されている前記ラベルであり、
前記類似動画は、編集対象の前記動画と前記ジャンル情報が同じ前記動画、または、特徴量が近い前記動画であり、
前記編集部は、前記動画記憶装置に記憶されている前記動画の中から前記類似動画を検索し、検索された前記類似動画の前記メタデータに含まれているラベルを、前記類似ラベルとする、
付記１１に記載の動画解析システム。
(Appendix 13)
The video analysis system according to Appendix 11, wherein
the similar label is the label set for a similar video,
the similar video is a video that has the same genre information as the video to be edited, or a video that has similar feature amounts, and
the editing unit searches the videos stored in the video storage device for the similar video, and uses a label included in the metadata of the retrieved similar video as the similar label.

（付記１４）
前記ラベル編集画像は、前記ラベルが設定されていない前記シーンのサムネイルを含む、
付記１０に記載の動画解析システム。
(Appendix 14)
The video analysis system according to Appendix 10, wherein the label editing image includes a thumbnail of the scene for which the label has not been set.

（付記１５）
前記ラベル設定部は、前記ラベルの各々について、前記映像フレームのどの領域に、前記ラベルに相当する前記被写体が撮像されているかを示す情報である領域情報を決定し、
前記出力部は、さらに前記領域情報を前記メタデータに含めて前記動画記憶装置に記憶させ、
前記ラベル編集画像は、前記ラベルについての前記領域情報を示すボックスを含む、
付記１０に記載の動画解析システム。
(Appendix 15)
The video analysis system according to Appendix 10, wherein
the label setting unit determines, for each of the labels, region information that is information indicating in which region of the video frame the subject corresponding to the label is imaged,
the output unit further includes the region information in the metadata and stores it in the video storage device, and
the label editing image includes a box indicating the region information for the label.

（付記１６）
さらに、前記第一の学習モデルを生成する学習装置を備え、
前記学習装置は、前記編集装置によって編集された前記メタデータを学習情報に用いて、前記第一の学習モデルを生成する、
付記１０から付記１５のいずれかに記載の動画解析システム。
(Appendix 16)
The video analysis system according to any one of Appendices 10 to 15, further comprising a learning device that generates the first learning model, wherein
the learning device generates the first learning model by using the metadata edited by the editing device as learning information.

（付記１７）
前記ラベル設定部は、前記シーン名が同一である前記映像フレームである同一シーンフレームと、当該同一シーンフレームの前記シーン名と、当該同一シーンフレームを含む前記動画の前記ジャンル情報とから、当該同一シーンフレームに含まれる前記映像フレームの各々に対する前記ラベルを設定する第二の学習モデルを使用して、前記ラベルの設定を行い、
さらに、前記編集装置によって編集された前記メタデータを学習情報に用いて、前記第二の学習モデルを生成する学習装置を備える、
付記１０から付記１５のいずれかに記載の動画解析システム。
(Appendix 17)
The video analysis system according to any one of Appendices 10 to 15, wherein
the label setting unit sets the label using a second learning model that sets the label for each of the video frames included in same scene frames, which are the video frames having the same scene name, from the same scene frames, the scene name of the same scene frames, and the genre information of the video including the same scene frames, and
the video analysis system further comprises a learning device that generates the second learning model by using the metadata edited by the editing device as learning information.

（付記１８）
付記７から付記１５のいずれかに記載の動画解析システムにおける編集装置。
(Appendix 18)
The editing device in the video analysis system according to any one of Appendices 7 to 15.

（付記１９）
付記５に記載の動画解析システムにおける学習装置。
(Appendix 19)
The learning device in the video analysis system according to Appendix 5.

（付記２０）
付記６に記載の動画解析システムにおける学習装置。
(Appendix 20)
The learning device in the video analysis system according to Appendix 6.

（付記２１）
付記１６に記載の動画解析システムにおける学習装置。
(Appendix 21)
The learning device in the video analysis system according to Appendix 16.

（付記２２）
付記１７に記載の動画解析システムにおける学習装置。
(Appendix 22)
The learning device in the video analysis system according to Appendix 17.

（付記２３）
動画と当該動画のジャンルを示すジャンル情報とから、前記動画をシーンごとに分割し、前記シーンの各々に対して、シーンの分類を示すシーン名を決定する第一の学習モデルを使用して、前記動画を前記シーンに分割し、前記シーンの各々に対して前記シーン名を決定し、
前記シーンに含まれる映像フレームの各々に対して、前記シーン名に基づいて、前記映像フレームに撮像されている被写体に関する情報であるラベルを設定し、
前記映像フレームの各々について、前記映像フレームが含まれる前記シーンの前記シーン名と、前記ラベルとを出力する、
動画解析方法。
(Appendix 23)
A video analysis method comprising:
dividing a video into scenes and determining a scene name for each of the scenes, using a first learning model that, from the video and genre information indicating a genre of the video, divides the video into scenes and determines, for each of the scenes, a scene name indicating a classification of the scene;
setting, for each video frame included in the scene, a label that is information about a subject imaged in the video frame, based on the scene name; and
outputting, for each of the video frames, the scene name of the scene in which the video frame is included and the label.

(Additional note 24)
A video analysis program that causes a computer to realize:
a scene dividing function that divides a video into scenes based on genre information indicating a genre of the video, and determines, for each of the scenes, a scene name indicating a classification of the scene;
a label setting function that sets, for each video frame included in the scenes, a label that is information about a subject captured in the video frame, based on the scene name; and
an output function that outputs, for each video frame, the scene name of the scene that includes the video frame, and the label,
wherein the scene dividing function divides the video into the scenes and determines the scene names by using a first learning model that, from the video and the genre information of the video, divides the video into scenes and determines the scene name for each of the scenes.
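
The flow recited in the method and program notes above (scene division with a first learning model, label setting per frame based on the scene name, and per-frame output of scene name and label) can be illustrated with a minimal sketch. This is not the implementation of the embodiments. The names `Scene`, `FrameResult`, `analyze_video`, `toy_scene_model`, and `toy_label_model` are hypothetical, frames are represented as plain strings instead of image data, and the two toy functions are simple rule-based stand-ins for trained models. Passing the genre to the label model follows the second learning model described elsewhere in this document and is an assumption here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = str  # hypothetical stand-in for decoded image data, e.g. "studio:anchor"


@dataclass
class Scene:
    name: str   # scene name (classification of the scene)
    start: int  # index of the first frame in the scene (inclusive)
    end: int    # index one past the last frame in the scene (exclusive)


@dataclass
class FrameResult:
    index: int
    scene_name: str
    label: str


# Stand-in for the first learning model: (frames, genre) -> scenes with names.
SceneModel = Callable[[Sequence[Frame], str], List[Scene]]
# Stand-in for label setting: (frames of one scene, scene name, genre) -> labels.
LabelModel = Callable[[Sequence[Frame], str, str], List[str]]


def analyze_video(frames: Sequence[Frame], genre: str,
                  scene_model: SceneModel,
                  label_model: LabelModel) -> List[FrameResult]:
    """Divide into scenes, set a label per frame, and emit one result per frame."""
    results: List[FrameResult] = []
    for scene in scene_model(frames, genre):
        scene_frames = frames[scene.start:scene.end]
        labels = label_model(scene_frames, scene.name, genre)
        for offset, label in enumerate(labels):
            results.append(FrameResult(scene.start + offset, scene.name, label))
    return results


def toy_scene_model(frames: Sequence[Frame], genre: str) -> List[Scene]:
    """Toy rule (not a trained model): a new scene starts when the frame prefix changes."""
    scenes: List[Scene] = []
    start = 0
    for i in range(1, len(frames) + 1):
        at_end = i == len(frames)
        if at_end or frames[i].split(":")[0] != frames[start].split(":")[0]:
            prefix = frames[start].split(":")[0]
            scenes.append(Scene(name=f"{genre}/{prefix}", start=start, end=i))
            start = i
    return scenes


def toy_label_model(scene_frames: Sequence[Frame], scene_name: str,
                    genre: str) -> List[str]:
    """Toy rule (not a trained model): the label is the text after the colon."""
    return [frame.split(":")[1] for frame in scene_frames]


if __name__ == "__main__":
    demo = ["studio:anchor", "studio:anchor", "field:reporter"]
    for result in analyze_video(demo, "news", toy_scene_model, toy_label_model):
        print(result)
```

Keeping the two models as injected callables mirrors the separation between scene division and label setting in the notes, so either stand-in could be swapped for a trained model without changing the per-frame output step.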

Although the present invention has been described above with reference to the embodiments, it is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

10, 20 Video analysis device
11, 21 Scene dividing unit
12, 22 Label setting unit
13, 23 Output unit
30 Editing device
34 Editing unit
35 Storage unit
40 Video storage device
50 User terminal
60 Learning device
61 Learning information input unit
62 Learning information storage unit
63 Model generation unit
80 Video analysis system
90 Information processing device
91 Communication interface
92 Input/output interface
93 Arithmetic device
94 Storage device
95 Nonvolatile storage device
96 Drive device
97 Recording medium

Claims

1. A video analysis device comprising:
a scene dividing unit that divides a video into scenes based on genre information indicating a genre of the video, and determines, for each of the scenes, a scene name indicating a classification of the scene;
a label setting unit that sets, for each video frame included in the scenes, a label that is information about a subject captured in the video frame, based on the scene name; and
an output unit that outputs, for each video frame, the scene name of the scene that includes the video frame, and the label,
wherein the scene dividing unit divides the video into the scenes and determines the scene names by using a first learning model that, from the video and the genre information of the video, divides the video into scenes and determines the scene name for each of the scenes.

2. The video analysis device according to claim 1,
wherein the label setting unit sets the labels by using a second learning model that, from same-scene frames, which are the video frames having the same scene name, the scene name of the same-scene frames, and the genre information of the video including the same-scene frames, sets the label for each of the video frames included in the same-scene frames.

3. The video analysis device according to claim 1,
wherein the output unit includes the scene name in metadata, which is information regarding the video, and stores the metadata including the scene name in a video storage device that stores the video and the metadata.

4. A video analysis system comprising:
the video analysis device according to claim 1; and
a learning device that generates the first learning model.

5. A video analysis system comprising:
the video analysis device according to claim 2; and
a learning device that generates the first learning model and the second learning model.

6. A video analysis system comprising the video analysis device according to claim 3 and an editing device,
wherein the editing device includes an editing unit that edits the metadata in accordance with a scene editing instruction, which is an instruction regarding editing of the scene name, and
the editing unit:
acquires a video to be edited and the metadata regarding the video from the video storage device;
in response to a scene edit image display instruction, which instructs display of a scene edit image that is an image for editing the scene name, displays the scene edit image on a user terminal based on the acquired video and metadata; and
edits the scene name included in the metadata in response to the scene editing instruction, and stores the edited metadata in the video storage device.

7. The editing device in the video analysis system according to claim 6.

8. The learning device in the video analysis system according to claim 4.

9. A video analysis method comprising:
dividing a video into scenes and determining a scene name for each of the scenes by using a first learning model that, from the video and genre information indicating a genre of the video, divides the video into scenes and determines, for each of the scenes, a scene name indicating a classification of the scene;
setting, for each video frame included in the scenes, a label that is information about a subject captured in the video frame, based on the scene name; and
outputting, for each video frame, the scene name of the scene that includes the video frame, and the label.

10. A video analysis program that causes a computer to realize:
a scene dividing function that divides a video into scenes based on genre information indicating a genre of the video, and determines, for each of the scenes, a scene name indicating a classification of the scene;
a label setting function that sets, for each video frame included in the scenes, a label that is information about a subject captured in the video frame, based on the scene name; and
an output function that outputs, for each video frame, the scene name of the scene that includes the video frame, and the label,
wherein the scene dividing function divides the video into the scenes and determines the scene names by using a first learning model that, from the video and the genre information of the video, divides the video into scenes and determines the scene name for each of the scenes.