JP2023154608A

JP2023154608A - Video analysis device, video analysis method, and video analysis program

Info

Publication number: JP2023154608A
Application number: JP2022064045A
Authority: JP
Inventors: 雄基太田; Takemoto Ota
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2022-04-07
Filing date: 2022-04-07
Publication date: 2023-10-20

Abstract

To provide a video analysis device capable of improving searchability for finding a target scene intended by a user. SOLUTION: A video analysis device includes an acquisition unit for acquiring video identification information and a video file; a camera work detection unit for detecting predetermined camera work included in the video file based on the movement of an image and extracting a start time and an end time of a scene including the camera work; a tagging unit for referring to a database that associates types of camera work with nuances corresponding to the camera work and adding, to each scene, a tag representing the nuance corresponding to the detected camera work; a video analysis data creation unit for creating video analysis data that includes the video identification information and associates, for each scene, the camera work, the start time and end time, and tag information; and an output unit for outputting the video analysis data. SELECTED DRAWING: Figure 1

Description

The present invention relates to a video analysis device, a video analysis method, and a video analysis program.

Video platform (PF) services that let users create and post their own videos have become widespread, making video production more accessible and raising its quality. As the volume of video content (hereinafter also referred to as "content") grows, development of PFs to manage it is accelerating. In particular, methods have been reported that use artificial intelligence (AI) analysis of video, images, audio, and language to tag (mark) content and thereby improve searchability (for example, Patent Document 1).

The tagging method described in Patent Document 1 converts a video file into text information by speech recognition, determines scene changes in the video file by image analysis, divides the video file into multiple scenes based on temporal breaks and content breaks in the text information and on the scene changes, extracts tags from the text information according to predetermined rules, and assigns each extracted tag to the corresponding scene among the multiple scenes.

However, analyzing only the audio contained in a video file makes it difficult to tag images that have no accompanying audio.

A method of detecting camera work from a video file is also known (for example, Non-Patent Document 1). However, although camera work is one of the important elements of content production, it is difficult for general users who are not professionals to associate the types of camera work (terms) with the nuances they convey. It has therefore been difficult to attach tags for searching what kinds of scenes a video contains, and for a user to find the video they want to watch from among multiple video files.

Patent Document 1: JP2020-79982A

Non-Patent Document 1: Atsuo Yoshitaka, Ryoji Matsui, So Hiramatsu, "Extraction of emotional information using camera work," Journal of the Information Processing Society of Japan, Vol. 47, No. 6, pp. 1696-1707

An object of the present invention is to provide a video analysis device that can improve searchability for finding a target scene intended by a user from among a plurality of video files.

A video analysis device according to an embodiment of the present disclosure includes: an acquisition unit that acquires video identification information and a video file; a camera work detection unit that detects predetermined camera work included in the video file based on the movement of the image and extracts the start time and end time of a scene including the camera work; a tagging unit that refers to a database associating types of camera work with the nuances corresponding to the camera work and adds, to each scene, a tag representing the nuance corresponding to the detected camera work; a video analysis data creation unit that creates video analysis data that includes the video identification information and associates, for each scene, the camera work, the start time and end time, and information about the tag; and an output unit that outputs the video analysis data.

The video analysis device according to an embodiment of the present disclosure may further include an object recognition unit that recognizes an object in an image included in the video file, and the tagging unit may add a tag based on the object to the video analysis data.

The video analysis device according to an embodiment of the present disclosure may further include a speech recognition unit that recognizes audio included in the video file, and the tagging unit may add a tag based on the audio to the video analysis data.

The video analysis device according to an embodiment of the present disclosure may further include a character recognition unit that recognizes characters in an image included in the video file, and the tagging unit may add a tag based on the characters to the video analysis data.

The video analysis device according to an embodiment of the present disclosure may further include a search information acquisition unit that acquires information on a search tag for searching for a desired video, and a video search unit that refers to the video analysis data and searches for a video corresponding to the search tag.

The video analysis device according to an embodiment of the present disclosure may further include a display unit that refers to the video analysis data and displays an image at the start time of a scene.

In the video analysis device according to an embodiment of the present disclosure, the camera work may include at least one of pan, tilt, track, zoom-in, character dolly, zoom-out, and pullback.

In a video analysis method according to an embodiment of the present disclosure, an acquisition unit acquires video identification information and a video file; a camera work detection unit detects predetermined camera work included in the video file based on the movement of the image and extracts the start time and end time of a scene including the camera work; a tagging unit refers to a database associating types of camera work with the nuances corresponding to the camera work and adds, to each scene, a tag representing the nuance corresponding to the detected camera work; a video analysis data creation unit creates video analysis data that includes the video identification information and associates, for each scene, the camera work, the start time and end time, and information about the tag; and an output unit outputs the video analysis data.

A video analysis program according to an embodiment of the present disclosure causes a processor to execute the steps of: acquiring video identification information and a video file; detecting predetermined camera work included in the video file based on the movement of the image and extracting the start time and end time of a scene including the camera work; referring to a database associating types of camera work with the nuances corresponding to the camera work and adding, to each scene, a tag representing the nuance corresponding to the detected camera work; creating video analysis data that includes the video identification information and associates, for each scene, the camera work, the start time and end time, and information about the tag; and outputting the video analysis data.

According to the video analysis device according to an embodiment of the present disclosure, it is possible to improve searchability for finding a target scene intended by a user from among a plurality of video files.

A configuration block diagram of a video analysis device according to an embodiment of the present disclosure.
A flowchart for explaining the procedure of a video analysis method according to an embodiment of the present disclosure.
An example of a terminal display screen when uploading a video file from a terminal to a video analysis device according to an embodiment of the present disclosure.
An example of a tagging result screen based on camera work detected using a video analysis device according to an embodiment of the present disclosure.
An example of video analysis data created using a video analysis device according to an embodiment of the present disclosure.
A flowchart for explaining the procedure of detecting camera work from a video file using a video analysis device according to an embodiment of the present disclosure.
An example of a database showing the correspondence between camera work and the nuances (impression tags) obtained from the camera work.
An example of a terminal display screen when searching for a video file from a terminal on a video analysis device according to an embodiment of the present disclosure.
A configuration block diagram of a video analysis device according to a modification of an embodiment of the present disclosure.
An example of an additional information extraction result screen showing additional information extracted from a scene including camera work detected using a video analysis device according to a modification of an embodiment of the present disclosure.
An example of additional information analysis data created using a video analysis device according to a modification of an embodiment of the present disclosure.
An example of video analysis data including additional information created using a video analysis device according to a modification of an embodiment of the present disclosure.

Hereinafter, a video analysis device, a video analysis method, and a video analysis program according to the present invention will be described with reference to the drawings. Note, however, that the technical scope of the present invention is not limited to these embodiments and extends to the invention described in the claims and equivalents thereof.

(Overview of video analysis system)
First, an overview of a video analysis system including a video analysis device according to an embodiment of the present disclosure will be described. FIG. 1 shows a configuration diagram of a video analysis system 101 according to an embodiment of the present disclosure. The video analysis system 101 includes a video analysis device 10 and a terminal 20, which are connected via a communication network 30 so that they can transmit and receive data by wire or wirelessly. Although FIG. 1 shows only one terminal 20, there may be a plurality of terminals 20.

A user uses the terminal 20 to upload a video file to the video analysis device 10. The video file may be one created by the user, or may be part or all of an existing video file.

The video analysis device 10 may be, for example, a server. The video analysis device 10 performs image analysis on the video file received from the terminal 20 and detects the camera work included in the video file. Here, "camera work" refers to how the camera is moved when shooting a video. Camera work includes, for example, "pan," an operation that rotates the camera horizontally; "tilt," an operation that rotates the camera vertically; and "track," an operation that moves the camera horizontally or vertically. However, camera work is not limited to these examples.

Changing the camera work changes the expression and psychological impression (nuance) of a scene. For example, if a video includes a scene whose camera work is "pan," that scene is considered to express a "moving" or "dynamic" quality. If a database that associates various types of camera work with the nuances they express is prepared in advance, the nuances contained in a video can be extracted by referring to the database based on the detected camera work.

Tagging the video file with the nuances extracted in this way makes it possible to objectively identify what nuances the scenes in the video file carry. If video analysis data in which video files are tagged with the extracted nuances is prepared, video files containing a desired nuance can be searched for based on the tags.

For example, consider a user who is about to create a video and, for reference, wants to find a video containing a desired scene from among multiple videos that have already been uploaded. Playing each video from beginning to end to find the desired scene takes a great deal of effort. If, for example, a user looking for videos with moving scenes could find out which video contains a moving scene and in which time range, the search would be efficient.

According to the video analysis system according to an embodiment of the present disclosure, searchability for a desired video can be improved by detecting the camera work included in a video file and tagging the video file with the nuances expressed by the detected camera work.

(Configuration of video analysis device)
Next, the video analysis device 10 according to an embodiment of the present disclosure will be described. As shown in FIG. 1, the video analysis device 10 includes a control unit 11, a transmitting/receiving unit 12, a storage unit 13, an output unit 14, a timer unit 15, and a display unit 16, which are connected by an internal bus 17.

The transmitting/receiving unit 12 transmits and receives data to and from the terminal 20 via the communication network 30. In particular, the transmitting/receiving unit 12 acquires, from the terminal 20, video files and information for searching for video files.

The storage unit 13 is a storage device such as a semiconductor memory or a hard disk. The storage unit 13 stores the video files acquired from the terminal 20 and the database. The storage unit 13 may also store a program for controlling the video analysis device 10.

The control unit 11 includes an acquisition unit 1, a camera work detection unit 2, a tagging unit 3, a video analysis data creation unit 4, a search information acquisition unit 5, and a video search unit 6. These units are realized by a processor such as a CPU provided in the video analysis device 10 executing a program stored in the storage unit 13.

The acquisition unit 1 acquires video identification information and a video file from the terminal 20 via the communication network 30. The video identification information is information for identifying a video file and may be, for example, numbers, letters, symbols, or a combination of these. A video file is a file in which video and audio are stored. The format of the video file is, for example, MP4, MOV, WMV, or AVI, but is not limited to these examples. The acquisition unit 1 outputs the acquired video file to the camera work detection unit 2.

The camera work detection unit 2 detects predetermined camera work included in the video file based on the movement of the image, and extracts the start time and end time of the scene including the camera work. The method for detecting camera work from a video file will be described later.

The tagging unit 3 refers to a database that associates types of camera work with the nuances corresponding to the camera work, and adds, to each scene, a tag representing the nuance corresponding to the detected camera work. The database may be stored in the storage unit 13 in advance. A single type of camera work may express one nuance or several.

Note that, instead of obtaining the association between types of camera work and their corresponding nuances by referring to a database, the relationship between camera work and nuances may be built into the program.
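As a rough illustration of building the association into the program, the following Python sketch holds it as an in-memory dictionary. The names `NUANCES_BY_CAMERA_WORK` and `impression_tags` are hypothetical, the English tag wording is one possible rendering, and only the three entries that appear in the worked example of this description are included.

```python
# Hypothetical in-program table. Only the three entries used in the worked
# example of this description are listed; other camera work types are omitted.
NUANCES_BY_CAMERA_WORK = {
    "pan": ["moving", "dynamic"],
    "character dolly": ["three-dimensional effect", "powerful"],
    "tilt": ["sadness", "loneliness", "worries"],
}


def impression_tags(camera_work: str) -> list[str]:
    """Return the nuance tags for a camera work type, or [] if it is unknown."""
    # Copy the list so callers cannot modify the table by accident.
    return list(NUANCES_BY_CAMERA_WORK.get(camera_work.lower(), []))


print(impression_tags("Pan"))
print(impression_tags("tilt"))
```

Returning an empty list for an unknown camera work type is a design choice of this sketch; the description does not say how undetected or unregistered types are handled.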

The video analysis data creation unit 4 creates video analysis data that includes the video identification information and associates, for each scene, the camera work, the start time and end time, and information about the tags.
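One possible shape for such video analysis data is sketched below in Python. The class names `Scene` and `VideoAnalysisData`, the field names, the JSON serialization, and the identifier `sample.mp4` are all assumptions made for illustration; the description only states which items are associated with each other.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Scene:
    camera_work: str  # detected camera work type, e.g. "pan"
    start: str        # elapsed time from the start of the video, "mm:ss"
    end: str
    tags: list[str] = field(default_factory=list)  # impression tags


@dataclass
class VideoAnalysisData:
    video_id: str  # video identification information, e.g. the file name
    scenes: list[Scene] = field(default_factory=list)


data = VideoAnalysisData(
    video_id="sample.mp4",  # hypothetical identifier
    scenes=[Scene("pan", "00:00", "00:10", ["moving", "dynamic"])],
)

# Serialize to JSON as one possible output format (an assumption).
print(json.dumps(asdict(data), indent=2))
```

Keeping the scenes nested under the video identifier means a single record answers both "which video" and "which time range" for a given tag.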

The output unit 14 outputs the video analysis data. For example, the output unit 14 may output the video analysis data to the terminal 20 via the communication network 30.

The timer unit 15 is a module that outputs the current time. The timer unit 15 may measure the start time and end time of each scene at which the camera work detected from the video file by the camera work detection unit 2 changes, expressed as elapsed time from the start of the video.
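Elapsed times in this description are written in an mm:ss form such as (00:10). A minimal Python sketch of that formatting follows; the function name `format_elapsed` and the choice to truncate fractional seconds are assumptions.

```python
def format_elapsed(seconds: float) -> str:
    """Format seconds elapsed since the start of the video as mm:ss."""
    total = int(seconds)  # drop any fractional part (an assumption)
    minutes, secs = divmod(total, 60)
    return f"{minutes:02d}:{secs:02d}"


print(format_elapsed(10))  # 00:10
```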

The display unit 16 is a display device such as a liquid crystal display or an organic EL display. The display unit 16 may display the video analysis data, and may refer to the video analysis data to display an image at the start time of a scene.

The search information acquisition unit 5 acquires information on a search tag for searching for a desired video. The search tag may be entered at the input unit 21 of the terminal 20 and acquired by the search information acquisition unit 5 via the communication network 30.

The video search unit 6 refers to the video analysis data and searches for a video corresponding to the search tag. Because the video analysis data includes the video identification information and associates, for each scene, the camera work, the start time and end time, and information about the tags, a video file containing the desired scene can be found based on the search tag.
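The search described here can be sketched in Python as a scan over per-scene tags. The dictionary layout, the function name `search_scenes`, and the sample identifiers `a.mp4` and `b.mp4` are hypothetical; exact string matching on tags is also an assumption.

```python
def search_scenes(analysis_data, search_tag):
    """Return (video_id, start, end) for every scene carrying search_tag."""
    hits = []
    for video in analysis_data:
        for scene in video["scenes"]:
            if search_tag in scene["tags"]:
                hits.append((video["video_id"], scene["start"], scene["end"]))
    return hits


# Hypothetical video analysis data for two uploaded videos.
analysis_data = [
    {"video_id": "a.mp4", "scenes": [
        {"camera_work": "pan", "start": "00:00", "end": "00:10",
         "tags": ["moving", "dynamic"]},
        {"camera_work": "tilt", "start": "00:20", "end": "00:30",
         "tags": ["sadness", "loneliness", "worries"]},
    ]},
    {"video_id": "b.mp4", "scenes": [
        {"camera_work": "pan", "start": "00:05", "end": "00:12",
         "tags": ["moving", "dynamic"]},
    ]},
]

print(search_scenes(analysis_data, "moving"))
```

Because each hit carries the start and end time, the user can jump to the matching scene instead of playing each video from the beginning.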

(Terminal configuration)
The terminal 20 is a terminal owned by the user, and may be a mobile terminal such as a mobile phone, smartphone, or tablet, or an information terminal such as a personal computer. The terminal 20 includes an input unit 21, a storage unit 22, a control unit 23, a transmitting/receiving unit 24, and a display unit 25.

An input device such as a keyboard or a mouse can be used as the input unit 21. The input unit 21 can be used to upload a video file to the video analysis device 10 and to enter a tag to search for.

The storage unit 22 is a storage device such as a semiconductor memory or a hard disk. The storage unit 22 stores video files to be uploaded to the video analysis device 10 and data such as URLs indicating where the video files to be uploaded are stored. The storage unit 22 may also store a program for controlling the terminal 20.

The control unit 23 controls the operation of the terminal 20 using a program stored in the storage unit 22. A processor such as a CPU can be used as the control unit 23.

The transmitting/receiving unit 24 is connected to the communication network 30 and can transmit and receive data to and from the video analysis device 10. The terminal 20 may be a virtual terminal as well as a physical terminal such as those described above. Using a virtual terminal allows applications and data to be consolidated in the video analysis device 10 so that no data remains on the terminal 20, which helps prevent information leakage.

A display device such as a liquid crystal display or an organic EL display can be used as the display unit 25.

(How to create video analysis data)
Next, a method of creating video analysis data using the video analysis device according to an embodiment of the present disclosure will be described. FIG. 2 shows a flowchart for explaining the procedure of the video analysis method according to an embodiment of the present disclosure.

First, in step S101, the user uploads a video file using the terminal 20, and the acquisition unit 1 of the video analysis device 10 acquires the uploaded video file.

FIG. 3 shows an example of a video upload screen 200, which is the terminal display screen used when uploading a video file from the terminal 20 to the video analysis device 10 according to an embodiment of the present disclosure. First, to select the video file to upload, the user enters the file name of the video file in the video file input field 201. Alternatively, the user may click the video file input field 201 and select a video file stored in the storage unit 22.

A URL indicating where the video file can be obtained may also be entered in the video file input field 201.

If only one video is to be uploaded, the user selects the video file to upload (video file (1)) and then presses the upload button 202 to upload the video file to the video analysis device 10.

To add another video to upload, the user presses the add button 203. A video file input field 204 for uploading the second video file, video file (2), is then displayed.

After clicking the video file input field 204 and selecting the video file to upload, the user can upload it to the video analysis device 10 by pressing the upload button 202. Three or more video files may be uploaded to the video analysis device 10 in the same way.

Note that the file name entered in the video file input field (201, 204) when uploading a video file serves as the video identification information. Therefore, if a video file with the same file name has already been uploaded and exists in the storage unit 13 of the video analysis device 10, the video analysis device 10 may prompt the terminal 20 to change the file name.
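The duplicate check on the file name can be sketched in Python as follows. The function name `check_upload_name`, the return values, and the sample file names are hypothetical; how the stored file names are actually looked up in the storage unit 13 is not specified in this description.

```python
def check_upload_name(file_name: str, stored_names: set[str]) -> str:
    """Return "ok" if the name can serve as a new video ID, else "rename"."""
    # The file name doubles as the video identification information,
    # so it must not collide with a file that is already stored.
    return "rename" if file_name in stored_names else "ok"


stored = {"trip.mp4"}  # hypothetical names already held by the device
print(check_upload_name("trip.mp4", stored))
print(check_upload_name("party.mp4", stored))
```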

Next, in step S102, the camera work detection unit 2 detects predetermined camera work included in the video file based on the movement of the image, and extracts the start time and end time of the scene including the camera work. The procedure by which the camera work detection unit 2 detects camera work from a video file will be described later.

Next, as shown in FIG. 2, in step S103, the tagging unit 3 refers to the database that associates types of camera work with the nuances corresponding to the camera work, and adds, to each scene, a tag (impression tag) representing the nuance corresponding to the detected camera work.

FIG. 4 shows an example of a tagging result screen 300 based on camera work detected using the video analysis device according to an embodiment of the present disclosure. In the example shown in FIG. 4, the upper part of the tagging result screen 300 displays a detection result 301 listing the detected camera work and the corresponding impression tags, and the lower part displays the scenes containing the detected camera work (CW1 to CW4) together with thumbnails (SN1 to SN3).

For example, assume that the video is played back, camera work CW1 is detected at time (00:00) [mm:ss], and camera work CW1 is "pan." In this case, the tagging unit 3 refers to the database described later and extracts the nuances "moving" and "dynamic," which correspond to "pan," as impression tags (1) and (2), respectively.

Assume that playback continues and the next camera work CW2 is detected at time (00:10). The portion of the video that starts at time (00:00) and ends at time (00:10) then constitutes one scene, which is defined as the first scene. As shown in FIG. 4, the tagging result screen 300 may display the thumbnail SN1 of the first scene along with the time axis. The tagging unit 3 then adds "moving" to the first scene as impression tag (1) and "dynamic" as impression tag (2).

Next, assume that the camera work CW2 detected at time (00:10) is a "character dolly". In this case, the tagging unit 3 refers to the database described later and extracts the nuances "three-dimensional effect" and "powerful" corresponding to the camera work "character dolly" as impression tags (1) and (2), respectively.

Assume that playback continues and the next camera work CW3 is detected at time (00:20). The portion of the video whose start time is (00:10) and whose end time is (00:20) then constitutes one scene, which is defined as the second scene. As shown in FIG. 4, the tagging result screen 300 may display the thumbnail SN2 of the second scene together with the time axis. The tagging unit 3 then assigns "three-dimensional effect" to the second scene as impression tag (1) and "powerful" as impression tag (2).

Next, assume that the camera work CW3 detected at time (00:20) is a "tilt". In this case, the tagging unit 3 refers to the database and extracts the nuances "sadness", "loneliness", and "worries" corresponding to the camera work "tilt" as impression tags (1) to (3), respectively.

Assume that playback continues and the next camera work CW4 is detected at time (00:30). The portion of the video whose start time is (00:20) and whose end time is (00:30) then constitutes one scene, which is defined as the third scene. As shown in FIG. 4, the tagging result screen 300 may display the thumbnail SN3 of the third scene together with the time axis. The tagging unit 3 then assigns "sadness" to the third scene as impression tag (1), "loneliness" as impression tag (2), and "worries" as impression tag (3).

In this way, the tagging result screen 300 may display, as the detection result 301 of the impression tags corresponding to the camera work, the impression tags (1) to (3) assigned to each of the first to third scenes, which are delimited by the detected camera work CW1 to CW4.
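The walk-through above (each detection opens a scene, the next detection closes it, and the scene receives the tags of the camera work that opened it) can be sketched as follows. This is only an illustrative sketch: the names `NUANCE_DB`, `Scene`, and `build_scenes` are hypothetical, the English tag strings are one possible rendering of the Japanese tags, and the type of CW4 is not stated in the text, so it is marked as unknown here.

```python
from dataclasses import dataclass, field

# Hypothetical excerpt of the camera-work/nuance database (cf. FIG. 7).
NUANCE_DB = {
    "pan": ["emotion", "dynamic"],
    "character dolly": ["three-dimensional effect", "powerful"],
    "tilt": ["sadness", "loneliness", "worries"],
}


@dataclass
class Scene:
    index: int            # 1-based scene number
    start: int            # start time in seconds
    end: int              # end time in seconds
    camera_work: str      # camera work detected at the start of the scene
    tags: list = field(default_factory=list)


def build_scenes(detections):
    """Turn time-sorted (time_sec, camera_work) detections into tagged scenes.

    Each detection starts a scene; the following detection ends it.
    The final detection only closes the last scene.
    """
    scenes = []
    pairs = zip(detections, detections[1:])
    for index, ((start, work), (end, _)) in enumerate(pairs, start=1):
        scenes.append(Scene(index, start, end, work, list(NUANCE_DB.get(work, []))))
    return scenes


# CW1..CW4 from the example; the type of CW4 is an assumption ("unknown").
detections = [(0, "pan"), (10, "character dolly"), (20, "tilt"), (30, "unknown")]
scenes = build_scenes(detections)
for scene in scenes:
    print(scene.index, scene.start, scene.end, scene.camera_work, scene.tags)
```

With four detections, the sketch yields three scenes, matching the first to third scenes described above.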

Next, in step S105, the output unit 14 outputs the video analysis data to the user's terminal 20. Specifically, the output unit 14 of the video analysis device 10 first outputs the video analysis data to the transmitting/receiving unit 12. The transmitting/receiving unit 12 of the video analysis device 10 then transmits the video analysis data to the transmitting/receiving unit 24 of the terminal 20 via the communication network 30. The transmitting/receiving unit 24 of the terminal 20 outputs the received video analysis data to the display unit 25.

FIG. 5 shows an example of video analysis data created using the video analysis device according to an embodiment of the present disclosure. The display unit 25 of the terminal 20 may display a video analysis data display screen 400 that includes video analysis data 401. The video analysis data 401 includes a video ID (B1), a scene B2, a scene start time B3, a scene end time B4, a scene duration B5, the detected camera work B6, impression tag (1) B7, impression tag (2) B8, and impression tag (3) B9. However, the data is not limited to this example.

動画ＩＤ（Ｂ１）は、動画識別情報である。例えば、ユーザが動画アップロード画面２００において入力した動画ファイル名を動画ＩＤとしてよい。 The video ID (B1) is video identification information. For example, the video file name input by the user on the video upload screen 200 may be used as the video ID.

シーンＢ２は、動画ファイルに含まれるシーンを識別するための情報である。例えば、第１シーン、第２シーン等のように連続番号を含む名称を付してよい。ただし、このような例には限られず、シーンの名称は任意の名称としてよく、例えば、印象タグに対応した名称としてよい。 Scene B2 is information for identifying a scene included in the video file. For example, names including consecutive numbers such as first scene, second scene, etc. may be given. However, the name of the scene is not limited to this example, and may be any name, for example, a name corresponding to an impression tag.

開始時間Ｂ３は、シーンにおいてカメラワークが検出された時間である。例えば、時刻（００：１０）にカメラワークが検出された場合は、時刻（００：１０）が第２シーンの開始時間となる。 Start time B3 is the time when camera work is detected in the scene. For example, if camera work is detected at time (00:10), time (00:10) becomes the start time of the second scene.

終了時間Ｂ４は、あるシーンにおいて次のカメラワークが検出された時間である。例えば、第２シーンの時刻（００：２０）に次のカメラワークが検出された場合は、時刻（００：２０）が第２シーンの終了時間となる。 End time B4 is the time when the next camera work is detected in a certain scene. For example, if the next camera work is detected at the time (00:20) of the second scene, the time (00:20) becomes the end time of the second scene.

継続時間Ｂ５は、あるシーンにおける開始時間から終了時間までの時間である。例えば、第２シーンの開始時間が（００：１０）であり、終了時間が（００：２０）である場合は、継続時間は１０［ｓｅｃ］となる。 The duration B5 is the time from the start time to the end time in a certain scene. For example, if the start time of the second scene is (00:10) and the end time is (00:20), the duration is 10 [sec].

カメラワークＢ６は、あるシーンの最初に検出されたカメラワークである。例えば、第２シーンの開始時間（００：１０）にカメラワーク「キャラクタードリー」が検出された場合は、第２シーンのカメラワークは「キャラクタードリー」となる。 Camera work B6 is camera work detected at the beginning of a certain scene. For example, if the camera work "Character Dolly" is detected at the start time (00:10) of the second scene, the camera work of the second scene becomes "Character Dolly".

印象タグ（１）～（３）（Ｂ７～Ｂ９）は、検出されたカメラシーンに対応するニュアンスを表すタグであって、後述するデータベースを参照して抽出したものである。例えば、動画の第３シーンでカメラワーク「チルト」が検出された場合は、「悲しみ」、「孤独感」、「悩み事」の印象タグが抽出される。 Impression tags (1) to (3) (B7 to B9) are tags representing nuances corresponding to the detected camera scene, and are extracted with reference to a database described later. For example, if camera work "tilt" is detected in the third scene of the video, impression tags of "sadness", "loneliness", and "worries" are extracted.
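As a rough illustration of items B1 to B9, one row of the video analysis data could be assembled as below. The field names, the `make_record` helper, and the use of a file name as the video ID are assumptions made for this sketch; the source only specifies which items the data contains.

```python
def parse_mmss(text):
    """Convert an 'mm:ss' string such as '00:10' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def make_record(video_id, scene, start, end, camera_work, tags, max_tags=3):
    """Build one row corresponding to items B1 to B9 (hypothetical field names)."""
    record = {
        "video_id": video_id,                                    # B1
        "scene": scene,                                          # B2
        "start": start,                                          # B3
        "end": end,                                              # B4
        "duration_sec": parse_mmss(end) - parse_mmss(start),     # B5
        "camera_work": camera_work,                              # B6
    }
    # B7 to B9: unused impression-tag slots stay empty (None).
    for n in range(max_tags):
        record[f"impression_tag_{n + 1}"] = tags[n] if n < len(tags) else None
    return record


row = make_record(
    "1001.mp4", "scene 2", "00:10", "00:20",
    "character dolly", ["three-dimensional effect", "powerful"],
)
print(row)
```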

(Method of detecting camera work)
Next, a method of detecting camera work from a video file using the video analysis device 10 according to an embodiment of the present disclosure will be described. FIG. 6 shows a flowchart of the procedure for detecting camera work from a video file using the video analysis device 10. Here, the method of analyzing spatio-temporal projection images described in Non-Patent Document 1 (Atsuo Yoshitaka, Ryoji Matsui, So Hiramatsu, "Extraction of emotional information using camera work," Transactions of the Information Processing Society of Japan, Vol. 47, No. 6, pp. 1696-1707) is used as an example. A spatio-temporal projection image is an image obtained by extracting, from each frame, the pixels on a line at a fixed position within the frame (a horizontal line, a vertical line, or a diagonal) and arranging them in the time direction.

まず、ステップＳ２０１において、カメラワーク検出部２が、取得部１から取得した動画ファイルから時空間投影画像を生成する。具体的には、ある動画において、あるフレーム内の直線Ｌ（定線）を定め、各フレームから直線Ｌを抽出し、それらを時間軸に沿って並べることにより、時空間投影画像を生成する。 First, in step S201, the camera work detection unit 2 generates a spatiotemporal projection image from the video file acquired from the acquisition unit 1. Specifically, in a certain moving image, a straight line L (fixed line) within a certain frame is determined, the straight lines L are extracted from each frame, and the straight lines L are arranged along the time axis to generate a spatiotemporal projection image.
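The construction in step S201 can be illustrated with a small pure-Python sketch. It assumes grayscale frames stored as nested lists and a horizontal fixed line; `spatiotemporal_projection` and `make_frame` are hypothetical helper names, and the synthetic video (a bright vertical bar shifting one pixel per frame, roughly what a pan over a static object produces) is invented for the illustration.

```python
def spatiotemporal_projection(frames, row):
    """Stack the pixels on the fixed horizontal line `row` of every frame.

    frames: list of frames, each a list of pixel rows (grayscale values).
    Returns one output row per frame, i.e. a (time x width) image.
    """
    return [list(frame[row]) for frame in frames]


def make_frame(height, width, bar_x):
    """Synthetic frame: black background with a bright vertical bar at bar_x."""
    return [[255 if x == bar_x else 0 for x in range(width)] for _ in range(height)]


# The bar moves one pixel to the right in each frame.
frames = [make_frame(6, 16, t + 2) for t in range(8)]
projection = spatiotemporal_projection(frames, row=3)

# Render the projection: the moving bar appears as a straight diagonal line.
for line in projection:
    print("".join("#" if value else "." for value in line))
```

The diagonal trace in the output is the kind of straight-line component that the subsequent edge detection step looks for.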

次に、ステップＳ２０２において、時空間投影画像からエッジを検出する。動画にカメラワークが施されている場合は、時空間投影画像において、定線上にある静止物体の輝度変化が急峻な部分に直線成分が現れる。そこで、二値化した時空間投影画像において直線となるエッジを検出する。 Next, in step S202, edges are detected from the spatiotemporal projection image. When camera work is applied to a video, a straight line component appears in a spatiotemporal projection image in a portion where a stationary object on a fixed line has a steep change in brightness. Therefore, straight edges are detected in the binarized spatiotemporal projection image.

Next, in step S203, the interval to be referenced for camera work detection is determined. Taking actual video production practice into account, two kinds of reference interval are used, for example: a narrow range and a wide range. The narrow range is a predetermined time (for example, 2 seconds) from the beginning of the shot. The wide range is the continuous interval from the minimum start time to 80% of the maximum end time among the group of straight edges included in the narrow-range determination interval.
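One possible reading of step S203 is sketched below. Several points are assumptions rather than statements from the source: edge times are measured in seconds from the start of the shot, an edge is treated as "included in" the narrow range if it overlaps it, and "80% of the maximum end time" is taken literally as 0.8 times that end time. The function name `reference_intervals` is hypothetical.

```python
def reference_intervals(edges, narrow_sec=2.0, ratio=0.8):
    """Return (narrow, wide) reference intervals for one shot.

    edges: list of (start, end) times of detected straight edges,
           in seconds measured from the beginning of the shot.
    wide is None when no edge overlaps the narrow range.
    """
    narrow = (0.0, narrow_sec)
    # Assumption: an edge belongs to the narrow range if it overlaps it.
    in_narrow = [(s, e) for s, e in edges if s < narrow_sec and e > 0.0]
    if not in_narrow:
        return narrow, None
    wide_start = min(s for s, _ in in_narrow)
    wide_end = ratio * max(e for _, e in in_narrow)
    return narrow, (wide_start, wide_end)


# Illustrative edge list; the third edge starts after the narrow range.
edges = [(0.5, 6.0), (1.0, 10.0), (3.0, 12.0)]
narrow, wide = reference_intervals(edges)
print(narrow, wide)
```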

Next, in step S204, the camera work detection unit 2 determines the type of camera work from the pattern of edges that the camera work produces in the generated spatio-temporal projection image. Seven types of camera work are detected: (1) pan, (2) tilt (including tilt up and tilt down), (3) track, (4) zoom in, (5) character dolly, (6) zoom out, and (7) pull back.

For example, when the camera work is a pan, tilt, track, or zoom, a device such as a tripod or a dolly is used for moving the camera, and the position or orientation of the camera, or the focal length of the lens, is changed smoothly. Therefore, for a pan, tilt, track, or zoom, the edge that appears in response to the camera movement has a length proportional to the duration of the camera operation and a nearly constant direction. The type of camera work is thus identified by determining what kind of straight-line component appears in which of the spatio-temporal projection images whose fixed lines are a horizontal line, a vertical line, and a diagonal. For the specific relationship between edge patterns and the detected camera work in the spatio-temporal projection image analysis method, refer to Non-Patent Document 1.

To associate the nuances expressed by the camera work detected as described above with the video as tags (impression tags), it is preferable to prepare in advance a database that records the relationship between camera work and nuances.

(Relationship between camera work and nuance)
Next, the relationship between camera work and the nuances it expresses will be described. FIG. 7 shows an example of a database representing the correspondence between camera work and the nuances (impression tags) obtained from that camera work. The database 500 may include camera work A1, camera/lens movement A2, and impression tags (1) to (5) (A3 to A7). Camera work A1 includes, but is not limited to, "pan", "tilt", "track", "zoom in", "character dolly", "zoom out", and "pull back".

The camera work "pan" is an operation that rotates the camera horizontally. If a video file contains a scene shot with a "pan", that scene is considered to express nuances such as "emotion" or "dynamic", and these nuances may be used as impression tags (1) and (2).

カメラワーク「チルト」とは、カメラを垂直方向に回転させる操作をいう。動画ファイルにカメラワーク「チルト」によって撮像されたシーンが含まれている場合、そのシーンは、例えば、「悲しみ」、「孤独感」、または「悩み事」といったニュアンスを表現していると考えられ、これらのニュアンスを印象タグ（１）～（３）としてよい。なお、カメラワーク「チルト」には、「チルトダウン」と「チルトアップ」が含まれる。図７に示した例では、このうち「チルトダウン」に対応するニュアンス（印象タグ）を示している。「チルト」が「チルトアップ」である場合は、対応するニュアンスは、例えば、「希望」、「憧れ」、「前進」、「期待」としてよい。 Camera work "Tilt" refers to the operation of rotating the camera in the vertical direction. If a video file contains a scene captured using "tilt" camera work, that scene is likely to express nuances such as "sadness," "loneliness," or "worries." , these nuances may be used as impression tags (1) to (3). Note that the camera work "tilt" includes "tilt down" and "tilt up." In the example shown in FIG. 7, nuances (impression tags) corresponding to "tilt down" are shown. When "tilt" is "tilt up," the corresponding nuance may be, for example, "hope," "admiration," "forward," or "expectation."

カメラワーク「トラック」とは、カメラを水平、または垂直方向に移動させる操作をいう。動画ファイルにカメラワーク「トラック」によって撮像されたシーンが含まれている場合、そのシーンは、例えば、「迫力」や「客観的」といったニュアンスを表現していると考えられ、これらのニュアンスを印象タグ（１）、（２）としてよい。 Camera work "track" refers to the operation of moving the camera horizontally or vertically. If a video file contains a scene captured by the camera work "track", that scene is considered to express nuances such as "powerful" or "objective", and these nuances are used to give an impression. It may be used as tags (1) and (2).

The camera work "zoom in" is a magnifying operation that changes the focal length of the camera. If a video file contains a scene shot with a "zoom in", that scene is considered to express a nuance such as "tension", and this nuance may be used as impression tag (1).

The camera work "character dolly" is a magnifying operation that moves the camera closer to the subject. If a video file contains a scene shot with a "character dolly", that scene is considered to express nuances such as "three-dimensional effect" or "powerful", and these nuances may be used as impression tags (1) and (2).

The camera work "zoom out" is a reducing operation that changes the focal length of the camera. If a video file contains a scene shot with a "zoom out", that scene is considered to express nuances such as "sadness", "wistfulness", "loneliness", "sense of freedom", or "relaxed", and these nuances may be used as impression tags (1) to (5).

カメラワーク「プルバック」とは、カメラを主体から遠ざける縮小操作をいう。動画ファイルにカメラワーク「プルバック」によって撮像されたシーンが含まれている場合、そのシーンは、例えば、「別れ」、「悲しみ」、「喪失感」、「孤独」、または「解放感」といったニュアンスを表現していると考えられ、これらのニュアンスを印象タグ（１）～（５）としてよい。 Camera work "pullback" refers to a reduction operation that moves the camera away from the subject. If the video file contains a scene captured by the camera work "pullback", the scene has nuances such as "breakup", "sadness", "feeling of loss", "loneliness", or "feeling of freedom". These nuances can be used as impression tags (1) to (5).

カメラワークは、図７に示したものに限られず、他の撮影方法を含んでよい。また、カメラワークとそのカメラワークに対応するニュアンスとを関連付ける場合に、人工知能（ＡＩ）による機械学習モデルを用いてよい。 The camera work is not limited to that shown in FIG. 7, and may include other photographing methods. Furthermore, when associating camera work with nuances corresponding to the camera work, a machine learning model based on artificial intelligence (AI) may be used.

また、データベース５００に含まれる種々のカメラワークに対応するニュアンスは、適宜変更してよい。データベース５００に含まれる印象タグの修正は、ユーザが行ってもよく、動画解析装置１０の管理者が行ってもよい。 Further, the nuances corresponding to various camera works included in the database 500 may be changed as appropriate. The impression tags included in the database 500 may be modified by the user or by the administrator of the video analysis device 10.

また、１つのカメラワークに複数のニュアンスを表す印象タグが含まれる場合は、それらの印象タグに優先順位を付けるようにしてよい。例えば、１つのカメラワークに対応する印象タグの優先順位を印象タグ（１）から（５）まで、降順としてよい。 Further, when one camera work includes impression tags representing a plurality of nuances, these impression tags may be prioritized. For example, the priority order of impression tags corresponding to one camera work may be set in descending order from impression tags (1) to (5).

It is also possible that multiple types of camera work correspond to a single nuance (impression tag). For example, as shown in FIG. 7, the camera work corresponding to the impression tag "powerful" includes "track" and "character dolly". In this case, the camera work "track", for which "powerful" is the highest-priority impression tag (1), may be selected as the camera work corresponding to "powerful". Alternatively, both may be selected with priority in the order "track", then "character dolly".
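The priority rule described here can be sketched as a reverse lookup over the database. The dictionary below is a hypothetical excerpt of FIG. 7 in which the list position stands for the impression-tag number, and `camera_works_for` is an illustrative name. Breaking ties alphabetically is an added assumption, since the text does not say how equal priorities are resolved.

```python
# Hypothetical excerpt of database 500; list order = impression tag (1), (2), ...
NUANCE_DB = {
    "pan": ["emotion", "dynamic"],
    "tilt": ["sadness", "loneliness", "worries"],
    "track": ["powerful", "objective"],
    "zoom in": ["tension"],
    "character dolly": ["three-dimensional effect", "powerful"],
}


def camera_works_for(tag, db=NUANCE_DB):
    """Return the camera works carrying `tag`, highest tag priority first.

    A camera work whose impression tag (1) equals `tag` ranks ahead of one
    where `tag` is only impression tag (2), and so on.
    """
    ranked = [(tags.index(tag), work) for work, tags in db.items() if tag in tags]
    return [work for _, work in sorted(ranked)]


print(camera_works_for("powerful"))
```

Taking only the first element of the returned list corresponds to selecting "track" alone, while using the whole list corresponds to the ordered selection.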

(Method of searching for video files)
Next, a method of searching for video files to which a target tag has been assigned, based on the created video analysis data, will be described. FIG. 8 shows an example of a video search screen 600, which is the screen displayed on the terminal when searching the video analysis device for video files according to an embodiment of the present disclosure.

First, on the video search screen 600, search tag (1) for searching for a video is entered in the search tag (1) input field 601. In the example shown in FIG. 8, the word "emotion" is entered as search tag (1).

なお、検索タグ（１）入力欄６０１に検索タグのワードを直接入力する代わりに、プルダウンメニューを表示させて、表示された検索タグの中から所望の検索タグを選択するようにしてよい。 Note that instead of directly inputting the search tag word in the search tag (1) input field 601, a pull-down menu may be displayed and a desired search tag may be selected from among the displayed search tags.

検索タグが１つのみである場合は、検索タグ（１）を入力した後、検索開始ボタン６０２を押下して、動画ファイルの検索を実行する。 If there is only one search tag, after inputting search tag (1), the search start button 602 is pressed to execute a search for video files.

さらに、検索タグを追加する場合は、タグ追加ボタン６０３を押下する。そうすると、２つ目の検索タグである検索タグ（２）を入力するための検索タグ（２）入力欄６０４が表示される。図８に示した例では、検索タグ（２）として「悲しみ」というワードが入力されている。 Furthermore, if a search tag is to be added, an add tag button 603 is pressed. Then, a search tag (2) input field 604 for inputting the second search tag, search tag (2), is displayed. In the example shown in FIG. 8, the word "sadness" is input as the search tag (2).

検索タグ（１）及び（２）を入力した後、検索開始ボタン６０２を押下することにより、検索タグ（１）及び（２）に紐づけられた動画ファイルの検索を実行することができる。以下、同様に、３個以上の検索タグを用いて動画ファイルの検索を実行するようにしてもよい。 After inputting the search tags (1) and (2), by pressing the search start button 602, a search for the video files associated with the search tags (1) and (2) can be executed. Hereinafter, similarly, a search for video files may be executed using three or more search tags.

検索開始ボタン６０２が押下されると、端末２０の入力部２１（図１）に、検索タグに関する情報が入力され、送受信部２４に出力される。検索タグに関する情報は、送受信部２４から通信ネットワーク３０を介して、動画解析装置１０の送受信部１２に送信される。 When the search start button 602 is pressed, information regarding the search tag is input to the input unit 21 (FIG. 1) of the terminal 20 and output to the transmitting/receiving unit 24. Information regarding the search tag is transmitted from the transmitter/receiver 24 to the transmitter/receiver 12 of the video analysis device 10 via the communication network 30 .

動画解析装置１０の送受信部１２は、受信した検索タグに関する情報を検索情報取得部５に出力する。検索情報取得部５は、所望の動画を検索するための検索タグに関する情報を取得し、動画検索部６に出力する。 The transmitting/receiving unit 12 of the video analysis device 10 outputs the received information regarding the search tag to the search information acquisition unit 5. The search information acquisition unit 5 acquires information regarding search tags for searching for a desired video, and outputs the information to the video search unit 6.

The video search unit 6 refers to the video analysis data and searches for videos corresponding to the search tags. Referring to the video analysis data 401 shown in FIG. 5, the video search unit 6 searches for scenes containing search tag (1) "emotion" and search tag (2) "sadness", and detects that they are contained in the first scene and the third scene, respectively.
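A minimal sketch of this lookup follows, assuming the video analysis data is held as a list of dictionaries. The field names and the `search_scenes` helper are hypothetical, and a scene is treated as a hit when it carries at least one of the search tags.

```python
# Hypothetical in-memory form of the video analysis data 401 (cf. FIG. 5).
records = [
    {"video_id": "1001.mp4", "scene": 1, "tags": ["emotion", "dynamic"]},
    {"video_id": "1001.mp4", "scene": 2,
     "tags": ["three-dimensional effect", "powerful"]},
    {"video_id": "1001.mp4", "scene": 3,
     "tags": ["sadness", "loneliness", "worries"]},
]


def search_scenes(records, search_tags):
    """Return (video_id, scene) for every scene carrying any of the search tags."""
    wanted = set(search_tags)
    return [(r["video_id"], r["scene"]) for r in records if wanted & set(r["tags"])]


hits = search_scenes(records, ["emotion", "sadness"])
print(hits)
```

Requiring all search tags within a single scene instead would only need the condition changed to `wanted <= set(r["tags"])`.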

動画検索部６は、検索結果を送受信部１２に出力し、送受信部１２は通信ネットワーク３０を介して検索結果に関する情報を端末２０に送信する。端末２０の送受信部２４は、検索結果に関する情報を受信し、表示部２５に出力する。 The video search section 6 outputs the search results to the transmitting/receiving section 12, and the transmitting/receiving section 12 transmits information regarding the search results to the terminal 20 via the communication network 30. The transmitting/receiving unit 24 of the terminal 20 receives information regarding the search results and outputs it to the display unit 25.

検索結果に関する情報は、表示部２５により、図８に示すように、動画検索画面６００の検索結果表示欄６０５に表示されてよい。検索結果表示欄６０５には、検索された動画ファイル名Ｃ１、検索したタグが付されたシーンＣ２、評価Ｃ３が表示されてよい。 Information regarding the search results may be displayed by the display unit 25 in a search result display field 605 of the video search screen 600, as shown in FIG. The search result display field 605 may display the searched video file name C1, the scene C2 with the searched tag, and the evaluation C3.

The retrieved video file name C1 is, for example, the file name of a video containing search tag (1) "emotion" and search tag (2) "sadness". The example shown in FIG. 8 indicates that three video files, "1001.mp4", "MP2022.wmv", and "DG7777.avi", were found.

The tagged scene C2 is the number of a scene tagged with search tag (1) "emotion" and search tag (2) "sadness". For example, for the video file "1001.mp4", it indicates that at least one of search tags (1) and (2) is assigned to the first scene, SN001, and to the third scene, SN003.

The search result display field 605 shown in FIG. 8 illustrates a case in which three video files are found, but the result is not limited to this example; the number of videos found may be fewer than three, or four or more.

また、検出された動画ファイルの数が複数である場合は、優先順位の高い順に表示してよい。優先順位付けは、検索タグが含まれるシーンの継続時間が長い順であってよい。あるいは、優先順位付けは、検索タグが含まれるシーンの検出回数が多い順であってよい。 Furthermore, if a plurality of video files are detected, they may be displayed in descending order of priority. The prioritization may be in order of the duration of the scenes that include the search tags. Alternatively, the prioritization may be in order of the number of times scenes containing the search tag have been detected.
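Both orderings mentioned above can be sketched in a few lines. The durations below are invented for illustration, `rank_videos` is a hypothetical name, and summing the durations of all matching scenes per video is an assumption, since the text could equally be read as ranking by the single longest matching scene.

```python
from collections import defaultdict


def rank_videos(hits, key="duration"):
    """Order video IDs by matching-scene duration or by number of matching scenes.

    hits: list of (video_id, scene_duration_sec), one entry per matching scene.
    key:  "duration" ranks by total matching duration, "count" by hit count.
    """
    total = defaultdict(int)
    count = defaultdict(int)
    for video_id, duration in hits:
        total[video_id] += duration
        count[video_id] += 1
    score = total if key == "duration" else count
    # sorted() is stable, so ties keep their first-seen order.
    return sorted(score, key=lambda video_id: score[video_id], reverse=True)


# Illustrative hits only; the durations are not taken from the source.
hits = [("1001.mp4", 10), ("1001.mp4", 10), ("MP2022.wmv", 25), ("DG7777.avi", 5)]
by_duration = rank_videos(hits, key="duration")
by_count = rank_videos(hits, key="count")
print(by_duration)
print(by_count)
```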

表示部１６は、動画解析データを参照して、検索したシーンの開始時間における画像を表示するようにしてよい。例えば、図８において検索結果表示欄６０５に示された第１シーンの番号を表示する代わりに、第１シーンの開始時間における画像をサムネイルとして表示してよい。 The display unit 16 may refer to the video analysis data and display an image at the start time of the searched scene. For example, instead of displaying the number of the first scene shown in the search result display field 605 in FIG. 8, the image at the start time of the first scene may be displayed as a thumbnail.

Furthermore, clicking a scene number (for example, "SN001") or its thumbnail displayed in the search result display field 605 may play back the video of that scene. For example, since the first scene starts at time (00:00), when "SN001", the number of the first scene, is clicked, the display unit 16 may refer to the video analysis data 401 (FIG. 5) and play back the video of the first scene.

シーンＣ２に表示された番号またはサムネイルを選択して、そのシーンの動画を再生した結果、目的としたシーンであったか否かを評価入力欄Ｃ３に入力してよい。評価入力欄Ｃ３に入力する評価は、例えば、５段階であってよい。図８に示した例では、動画ファイル「１００１．ｍｐ４」のシーンＳＮ００１及びＳＮ００３の検索結果に対する評価が５段階中の「４」であったことを示している。評価入力欄Ｃ３に入力する評価は５段階に限られず、任意の評価方法を用いてよい。 After selecting the number or thumbnail displayed in the scene C2 and playing back the video of that scene, the user may input in the evaluation input field C3 whether or not the scene was the intended scene. The evaluation input in the evaluation input field C3 may be, for example, in five stages. The example shown in FIG. 8 shows that the evaluation for the search results for scenes SN001 and SN003 of the video file "1001.mp4" was "4" out of five. The evaluation input in the evaluation input field C3 is not limited to five levels, and any evaluation method may be used.

評価結果は、動画解析データ作成部４に出力され、動画解析データ作成部４は、評価結果に基づいて、カメラワークとニュアンスを表す印象タグとの対応関係を記録したデータベースを修正してよい。例えば、あるシーンに対して低い評価がなされた場合に、そのシーンに付与されたタグと、対応するカメラワークとの関係を修正してよい。データベースの修正を行う場合に、機械学習モデルを用いて、修正を行ってよい。 The evaluation results are output to the video analysis data creation unit 4, and the video analysis data creation unit 4 may modify a database that records the correspondence between camera work and impression tags representing nuances based on the evaluation results. For example, if a certain scene is given a low evaluation, the relationship between the tag given to that scene and the corresponding camera work may be corrected. When modifying a database, a machine learning model may be used to perform the modification.

以上のようにして、動画ファイルからカメラワークを検出し、カメラワークによって表現されるニュアンスをタグ付けすることができる。これにより、ユーザはタグに基づいて目的のニュアンスを含む動画を探すことができ、ユーザが意図した対象の目的シーンを見つけ出すための検索性を向上させることができる。 In the manner described above, camera work can be detected from a video file and the nuances expressed by the camera work can be tagged. This allows the user to search for a video that includes the desired nuance based on the tag, and improves the search performance for finding the target scene that the user intended.

(Modified example)
Next, a video analysis system according to a modification of an embodiment of the present disclosure will be described. FIG. 9 shows a block diagram of the configuration of a video analysis system 102 according to the modification. The video analysis system 102 according to the modification differs from the video analysis system 101 shown in FIG. 1 in that it further includes an object recognition unit 7 that recognizes objects in the images contained in the video file, and in that the tagging unit 3 adds tags based on those objects to the video analysis data. The rest of the configuration of the video analysis system 102 is the same as that of the video analysis system 101 shown in FIG. 1, so a detailed description is omitted.

上述したように、動画に含まれるカメラワークに基づいて動画にタグ付けされた印象タグを用いて目的とするシーンを含む動画を検索することができる。一方、動画に含まれる物体を検索タグに加えることにより、目的とする動画の検索性をさらに向上させることができる。そこで、変形例に係る動画解析システム１０２においては、動画に含まれる物体に関する情報を付加情報として検索タグに加えて動画の検索を行う。 As described above, it is possible to search for a video containing a target scene using impression tags tagged to videos based on camera work included in the video. On the other hand, by adding objects included in the video to search tags, it is possible to further improve the searchability of the target video. Therefore, in the video analysis system 102 according to the modified example, the video is searched by adding information regarding the object included in the video as additional information to the search tag.

図１０に、本開示の一実施形態の変形例に係る動画解析装置を用いて検出したカメラワークを含むシーンから抽出した付加情報抽出結果画面７００の例を示す。物体認識部７は、画像認識により、動画ファイルに含まれる画像内の物体を認識する。例えば、第１シーンＳＮ１においては、物体Ｐ１～Ｐ４が検出され、画像認識により、それぞれ「窓」、「椅子」、「机」、「鞄」が認識される。同様に、第２シーンＳＮ２においては、物体Ｐ５及びＰ６が検出され、画像認識により、それぞれ「女性」、「犬」が認識される。 FIG. 10 shows an example of an additional information extraction result screen 700 extracted from a scene including camera work detected using a video analysis device according to a modification of an embodiment of the present disclosure. The object recognition unit 7 recognizes an object in an image included in a video file by image recognition. For example, in the first scene SN1, objects P1 to P4 are detected, and a "window", "chair", "desk", and "bag" are respectively recognized by image recognition. Similarly, in the second scene SN2, objects P5 and P6 are detected, and "woman" and "dog" are respectively recognized by image recognition.

動画解析データ作成部４は、物体認識部７が認識した物体に関する情報に基づいて、付加情報として図１１に示す付加情報解析データ８００を作成してよい。付加情報解析データ８００は、動画ＩＤ（Ｄ１）、シーンＤ２、検出時間Ｄ３、付加情報の種別Ｄ４、タグ（１）～（４）（Ｄ５～Ｄ８）を含んでよい。 The video analysis data creation unit 4 may create additional information analysis data 800 shown in FIG. 11 as additional information based on information regarding the object recognized by the object recognition unit 7. The additional information analysis data 800 may include a video ID (D1), a scene D2, a detection time D3, an additional information type D4, and tags (1) to (4) (D5 to D8).

検出時間Ｄ３は、物体が検出された時間である。例えば、図１１においては、第１シーンにおいて、時間（００：０５）に物体Ｐ１～Ｐ４が検出されたことを示している。 The detection time D3 is the time when the object was detected. For example, FIG. 11 shows that objects P1 to P4 were detected at time (00:05) in the first scene.

動画解析データ作成部４は、付加情報解析データ８００と、図５に示した動画解析データ４０１とを組み合わせて、付加情報を含む動画解析データを作成してよい。図１２に、本開示の一実施形態の変形例に係る動画解析装置を用いて作成した付加情報を含む動画解析データ９００の例を示す。 The video analysis data creation unit 4 may combine the additional information analysis data 800 and the video analysis data 401 shown in FIG. 5 to create video analysis data including additional information. FIG. 12 shows an example of video analysis data 900 including additional information created using a video analysis device according to a modification of an embodiment of the present disclosure.

付加情報を含む動画解析データ９００は、項目Ｅ１～Ｅ１３を含んでよい。Ｅ１～Ｅ９は、図５に示した動画解析データ４０１のＢ１～Ｂ９と同様である。付加情報を含む動画解析データ９００は、動画解析データ４０１に加えて、タグ（１）～（４）（Ｅ１０～Ｅ１３）を含んでよい。 The video analysis data 900 including additional information may include items E1 to E13. E1 to E9 are the same as B1 to B9 of the moving image analysis data 401 shown in FIG. The video analysis data 900 including additional information may include tags (1) to (4) (E10 to E13) in addition to the video analysis data 401.

Tags (1) to (4) (E10 to E13) are tags corresponding to the information about the objects recognized by the object recognition unit 7. For example, the object P1 detected in the first scene is recognized as a "window", and this is stored as tag (1) (E10) in the video analysis data 900 including additional information. Similarly, the objects P2 to P4 detected in the first scene are recognized as a "chair", a "desk", and a "bag", respectively, and are stored as tags (2) to (4) (E11 to E13) in the video analysis data 900 including additional information.
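The combination of the additional information analysis data 800 with the per-scene video analysis data 401 can be sketched as a join on the video ID and scene number. The following is a hedged example under stated assumptions: the function name, the dictionary keys, the scene times, the camera work values, and the placeholder impression tags are hypothetical and are not taken from FIG. 5 or FIG. 12. Only the object tags for the first and second scenes come from the description.

```python
from typing import Dict, List

MAX_TAGS = 4  # tags (1) to (4), i.e. items E10 to E13


def merge_additional_info(scene_rows: List[Dict], additional_rows: List[Dict]) -> List[Dict]:
    """Attach additional-information tags to the scene row with the same video ID and scene."""
    tags_by_scene: Dict[tuple, List[str]] = {}
    for row in additional_rows:
        key = (row["video_id"], row["scene"])
        tags_by_scene.setdefault(key, []).extend(row["tags"])

    merged = []
    for scene in scene_rows:
        key = (scene["video_id"], scene["scene"])
        # Copy the existing scene items and add up to four tags.
        merged.append({**scene, "tags": tags_by_scene.get(key, [])[:MAX_TAGS]})
    return merged


# Hypothetical scene rows standing in for video analysis data 401.
scene_rows = [
    {"video_id": "V001", "scene": 1, "start": "00:00", "end": "00:10",
     "camera_work": "pan", "impression_tags": ["placeholder-impression-a"]},
    {"video_id": "V001", "scene": 2, "start": "00:10", "end": "00:20",
     "camera_work": "zoom in", "impression_tags": ["placeholder-impression-b"]},
]

# Rows standing in for additional information analysis data 800.
additional_rows = [
    {"video_id": "V001", "scene": 1, "tags": ["window", "chair", "desk", "bag"]},
    {"video_id": "V001", "scene": 2, "tags": ["woman", "dog"]},
]

merged = merge_additional_info(scene_rows, additional_rows)
```

Each merged row keeps the original scene items unchanged and gains a `tags` list, which corresponds in spirit to appending items E10 to E13 to items E1 to E9.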

Searching with tags (1) to (4) in addition to the impression tags (1) to (3) shown in FIG. 12 makes it possible to find the desired video more accurately. For example, on the video search screen 600 shown in FIG. 8, a tag related to an object can be entered as search tag (1) or (2) to execute a search.
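A search over both impression tags and object tags might look like the sketch below. It assumes that a scene matches only when every entered search tag is present (AND semantics); the description does not state how search tags (1) and (2) are combined, so this is an assumption, as are the function name, dictionary keys, scene times, and placeholder impression tags.

```python
from typing import Dict, List, Tuple


def search_scenes(rows: List[Dict], search_tags: List[str]) -> List[Tuple[str, int, str]]:
    """Return (video ID, scene, start time) for scenes containing all search tags."""
    wanted = {tag for tag in search_tags if tag}  # ignore empty search fields
    hits = []
    for row in rows:
        available = set(row["impression_tags"]) | set(row["tags"])
        if wanted and wanted <= available:
            hits.append((row["video_id"], row["scene"], row["start"]))
    return hits


# Hypothetical rows standing in for video analysis data 900.
rows = [
    {"video_id": "V001", "scene": 1, "start": "00:00",
     "impression_tags": ["placeholder-impression-a"],
     "tags": ["window", "chair", "desk", "bag"]},
    {"video_id": "V001", "scene": 2, "start": "00:10",
     "impression_tags": ["placeholder-impression-b"],
     "tags": ["woman", "dog"]},
]

object_only = search_scenes(rows, ["dog", ""])
combined = search_scenes(rows, ["placeholder-impression-b", "dog"])
```

Returning the scene start time alongside the video ID reflects the idea that a hit can be used to jump directly to the matching scene.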

The example above described recognizing objects included in a video and tagging them as additional information. However, the additional information is not limited to information about objects and may be information about audio included in the video. Accordingly, as shown in FIG. 9, the video analysis device 10 according to the modification may further include a voice recognition unit 8 that recognizes audio included in the video file, and the tagging unit 3 may add audio-based tags to the video analysis data.

The audio recognized by the voice recognition unit 8 may include any sound contained in the video, such as human conversation and speech, animal cries, environmental sounds (waves, wind, thunder, and the like), and background music (BGM).

For example, as shown in FIG. 10, when a voice S1 saying "Do you want that so much?" is detected in the third scene, it is converted into text by voice recognition. The converted text is stored in the additional information analysis data 800 as tag (1) (D5) detected at time (00:25) in the third scene, and is also stored in the video analysis data 900 including additional information.

The examples above described recognizing objects and audio included in a video and tagging them as additional information. However, the additional information is not limited to information about objects and audio and may include information about characters (text) included in the video. Accordingly, as shown in FIG. 9, the video analysis device 10 according to the modification may further include a character recognition unit 9 that recognizes characters included in the video file, and the tagging unit 3 may add character-based tags to the video analysis data.

For example, as shown in FIG. 10, when characters P7 "KEEPANDGROW" and characters P8 "SOUP" are detected in the second scene, they are converted into text by character recognition. The converted text is stored in the additional information analysis data 800 as tag (1) (D5) and tag (2) (D6) detected at time (00:17) in the second scene, and is also stored in the video analysis data 900 including additional information.

Note that when text is detected from a video file, the device may recognize not only characters that appear on the background or on objects, but also characters added when the video was edited, such as on-screen captions (telops) and subtitles.
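The description associates each detection with both a time and a scene (for example, text at 00:17 in the second scene and voice at 00:25 in the third scene), but it does not state how that association is made. One possible approach, shown purely as an assumption, is to look up the scene whose start and end times contain the detection time. The scene boundaries and all names below are hypothetical.

```python
from typing import Dict, List, Optional


def to_seconds(mmss: str) -> int:
    """Convert a time written as "MM:SS" into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def find_scene(scenes: List[Dict], detection_time: str) -> Optional[int]:
    """Return the number of the scene whose [start, end) interval contains the time."""
    t = to_seconds(detection_time)
    for scene in scenes:
        if to_seconds(scene["start"]) <= t < to_seconds(scene["end"]):
            return scene["scene"]
    return None  # the time falls outside every detected scene


# Hypothetical scene boundaries; the actual values in the figures may differ.
scenes = [
    {"scene": 1, "start": "00:00", "end": "00:10"},
    {"scene": 2, "start": "00:10", "end": "00:20"},
    {"scene": 3, "start": "00:20", "end": "00:30"},
]

text_scene = find_scene(scenes, "00:17")   # text P7 and P8
voice_scene = find_scene(scenes, "00:25")  # voice S1
```

A half-open interval is used so that a detection exactly on a boundary belongs to a single scene.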

As described above, by tagging a video file not only with impression tags for the nuances derived from camera work but also with information about the objects, audio, text, and the like included in the video, the user can search for the desired video accurately.

1 Acquisition unit
2 Camera work detection unit
3 Tagging unit
4 Video analysis data creation unit
5 Search information acquisition unit
6 Video search unit
7 Object recognition unit
8 Voice recognition unit
9 Character recognition unit
10 Video analysis device
11 Control unit
12 Transmission/reception unit
13 Storage unit
14 Output unit
15 Time measurement unit
16 Display unit
17 Internal bus
20 Terminal
21 Input unit
22 Storage unit
23 Control unit
24 Transmission/reception unit
25 Display unit
30 Communication network
101, 102 Video analysis system

Claims

A video analysis device comprising:
an acquisition unit that acquires video identification information and a video file;
a camera work detection unit that detects predetermined camera work included in the video file based on image movement and extracts a start time and an end time of a scene including the camera work;
a tagging unit that refers to a database that associates types of camera work with nuances corresponding to the camera work, and adds a tag representing the nuance corresponding to the detected camera work to each scene;
a video analysis data creation unit that creates video analysis data that includes the video identification information and associates, for each scene, information regarding the camera work, the start time and the end time, and the tag; and
an output unit that outputs the video analysis data.

The video analysis device according to claim 1,
further comprising an object recognition unit that recognizes an object in an image included in the video file,
wherein the tagging unit adds an object-based tag to the video analysis data.

The video analysis device according to claim 1 or 2,
further comprising a voice recognition unit that recognizes audio included in the video file,
wherein the tagging unit adds an audio-based tag to the video analysis data.

The video analysis device according to claim 1 or 2,
further comprising a character recognition unit that recognizes characters in an image included in the video file,
wherein the tagging unit adds a character-based tag to the video analysis data.

The video analysis device according to claim 1 or 2, further comprising:
a search information acquisition unit that acquires information regarding a search tag for searching for a desired video; and
a video search unit that searches for a video corresponding to the search tag by referring to the video analysis data.

The video analysis device according to claim 1 or 2, further comprising a display unit that refers to the video analysis data and displays the image at the start time of a scene.

The video analysis device according to claim 1 or 2, wherein the camera work includes at least one of panning, tilting, tracking, zooming in, character dolly, zooming out, and pullback.

A video analysis method in which:
an acquisition unit acquires video identification information and a video file;
a camera work detection unit detects predetermined camera work included in the video file based on image movement, and extracts a start time and an end time of a scene including the camera work;
a tagging unit refers to a database that associates types of camera work with nuances corresponding to the camera work, and adds a tag representing the nuance corresponding to the detected camera work to each scene;
a video analysis data creation unit creates video analysis data that includes the video identification information and associates, for each scene, information regarding the camera work, the start time and the end time, and the tag; and
an output unit outputs the video analysis data.

A video analysis program that causes a processor to execute the steps of:
acquiring video identification information and a video file;
detecting predetermined camera work included in the video file based on image movement, and extracting a start time and an end time of a scene including the camera work;
referring to a database that associates types of camera work with nuances corresponding to the camera work, and adding a tag representing the nuance corresponding to the detected camera work to each scene;
creating video analysis data that includes the video identification information and associates, for each scene, information regarding the camera work, the start time and the end time, and the tag; and
outputting the video analysis data.