JP6688368B1

JP6688368B1 - Video content structuring device, video content structuring method, and computer program

Info

Publication number: JP6688368B1
Application number: JP2018212765A
Authority: JP
Inventors: 川嶋 喜美子; 安永 健治
Original assignee: Nippon Telegraph and Telephone West Corp
Current assignee: Nippon Telegraph and Telephone West Corp
Priority date: 2018-11-13
Filing date: 2018-11-13
Publication date: 2020-04-28
Anticipated expiration: 2038-11-13
Also published as: JP2020080469A

Abstract

PROBLEM TO BE SOLVED: To provide a video content structuring device, a video content structuring method, and a computer program capable of hierarchically structuring video content in consideration of its content. SOLUTION: A video content structuring device 1 includes a block dividing unit 10 that divides video content into blocks, a metadata adding unit 20 that adds metadata to each divided block, and a structuring unit 30 that hierarchically structures the video content based on the added metadata. [Selected drawing] FIG. 1

Description

The present invention relates to a video content structuring device, a video content structuring method, and a computer program for structuring video content.

In recent years, program production environments at broadcasting stations and elsewhere have increasingly become file-based, and efficient management of video content has become ever more important. It takes a video editor a great deal of effort to search a huge body of video content for content containing a specific keyword and then to find, within that content, video scenes that can be used for program production. To make video content easier to search, techniques for adding metadata to video content have been proposed (Non-Patent Document 1). However, because adding metadata by hand is time-consuming, techniques for adding it automatically are being studied (Non-Patent Document 2). In addition, to make video editing more efficient, techniques for dividing video content into scenes based on audio/video signals have been proposed (Non-Patent Document 3).

Non-Patent Document 1: "Trends in metadata production support" (「メタデータ制作支援に関する動向」), retrieved October 17, 2018, Internet <URL: https://www.nhk.or.jp/strl/publica/rd/rd163/pdf/P04-21.pdf>
Non-Patent Document 2: "On adding metadata to content" (「コンテンツのメタデータ付与について」), retrieved October 17, 2018, Internet <URL: http://www.soumu.go.jp/main_content/000225131.pdf>
Non-Patent Document 3: "viaPlatz", retrieved October 17, 2018, Internet <URL: http://www.viaplatz.com/spec/>

However, with the conventional techniques, a scene is divided even when its content is continuous. For example, even when the on-screen caption (telop) stays the same and the content is continuous, the scene is divided whenever the background video changes. A concrete case is a location program in which a caption such as "Osaka" is displayed while the background video switches every few seconds. Likewise, even when the same person keeps talking, the scene is divided whenever the camera cut changes, for instance from a frontal shot to a side shot.

Furthermore, because the conventional techniques divide the content in only one fixed way, they cannot accommodate the fact that the scene granularity to be checked differs from one video editor to another. For example, suppose that video content is divided into a plurality of blocks B1, B2, B3, B4, B5, ... as shown in FIG. 8(a), and that their metadata is managed as shown in FIG. 8(b). When a "location scene about food (blocks B1-B3)" consists of a "scene outside the shop (block B1)" and a "scene inside the shop (blocks B2-B3)", the desired granularity differs: one person wants to work on the outside and inside scenes together (treating blocks B1-B3 as one scene), while another wants to check only the scene inside the shop (treating blocks B2-B3 as one scene). Similarly, when a studio presenter, at a "segment change (blocks B4 and B5)", delivers a "wrap-up of the previous segment (block B4)" followed immediately by a "lead-in to the next segment (block B5)", one person wants to work on them together as a single presenter scene (treating blocks B4-B5 as one scene), while another wants to check the wrap-up scene and the lead-in scene separately (treating blocks B4 and B5 as individual scenes).

In view of the conventional techniques described above, an object of the present invention is to provide a video content structuring device, a video content structuring method, and a computer program capable of hierarchically structuring video content in consideration of its content.

To achieve the above object, the invention according to a first aspect is a video content structuring device comprising a block dividing unit that divides video content into blocks, a metadata adding unit that adds metadata to each divided block, and a structuring unit that hierarchically structures the video content based on the added metadata.

The invention according to a second aspect is the invention according to the first aspect, wherein the metadata adding unit derives metadata based on at least one of a speech recognition result, a character recognition result, and an image recognition result, weights the derived metadata, and integrates the weighted metadata for each block.

The invention according to a third aspect is the invention according to the second aspect, wherein the metadata adding unit increases the weight of a keyword derived from both the speech recognition result and the character recognition result.

The invention according to a fourth aspect is the invention according to the second aspect, wherein the metadata adding unit assigns a larger weight to keywords and objects that appear for a longer time, or assigns a larger weight to keywords and objects that appear more often.

The invention according to a fifth aspect is the invention according to any one of the second to fourth aspects, wherein the structuring unit reduces the weight for the word of the representative vector.

The invention according to a sixth aspect is the invention according to any one of the second to fourth aspects, wherein the structuring unit increases the weight for objects and decreases the weight for keywords as the hierarchy becomes deeper.

The invention according to a seventh aspect is a video content structuring method in which a computer executes a block dividing step of dividing video content into blocks, a metadata adding step of adding metadata to each divided block, and a structuring step of hierarchically structuring the video content based on the added metadata.

The invention according to an eighth aspect is a computer program for causing a computer to function as the video content structuring device according to any one of the first to sixth aspects.

According to the present invention, it is possible to provide a video content structuring device, a video content structuring method, and a computer program capable of hierarchically structuring video content in consideration of its content.

FIG. 1 is a configuration diagram of the video content structuring device in an embodiment of the present invention.
FIG. 2 is a configuration diagram of the metadata adding unit in the embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the keyword derivation unit in the embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the object derivation unit in the embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of the structuring unit in the embodiment of the present invention.
FIG. 6 is a graph showing the relationship between the weight update function and the number of layers in the embodiment of the present invention.
FIG. 7 is a diagram showing an example of a structuring result produced by the structuring unit in the embodiment of the present invention.
FIG. 8 is a diagram for explaining an example of scene granularity.

Embodiments of the present invention are described below with reference to the drawings. In the following description of the drawings, identical or similar parts are given identical or similar reference signs.

≪Overall Configuration≫
FIG. 1 is a configuration diagram of a video content structuring device 1 in an embodiment of the present invention. The video content structuring device 1 is a computer that structures video content and functionally includes a block dividing unit 10, a metadata adding unit 20, and a structuring unit 30.

The block dividing unit 10 divides video content into blocks. A conventional technique can be used to divide the video content finely; for example, viaPlatz or open-source software could be used. Video content is input to the block dividing unit 10, and a block division result is output from it.

The metadata adding unit 20 adds metadata (keywords and objects) to each block produced by the block dividing unit 10. Metadata is information that describes the video content. Some metadata is embedded in the video content itself, but here it is assumed to be managed separately from the video content. The block division result is input to the metadata adding unit 20, and metadata for each block is output from it.

The structuring unit 30 hierarchically structures the video content based on the metadata (keywords and objects) added by the metadata adding unit 20. The metadata for each block and the number of layers are input to the structuring unit 30, and the structuring result for the video is output from it. The number of layers is the number of levels into which the content is to be layered and is specified by a user such as a video editor.

As described above, the video content structuring device 1 in the embodiment of the present invention can extract metadata for each block, and can therefore structure the video content hierarchically in consideration of the metadata (content) of each block. As a result, sections whose content is continuous can be merged, and the granularity of the scenes to be checked can be varied from one video editor to another.
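The flow through the three units can be sketched as follows. This is only an illustration under stated assumptions: the names `Block`, `BlockMetadata`, and `structure_video`, and the shape of the result (one list of section labels per layer), are hypothetical and do not appear in the source, which leaves the concrete data formats open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """One segment produced by the block dividing unit 10 (times in seconds)."""
    index: int
    start: float
    end: float


@dataclass
class BlockMetadata:
    """Weighted metadata of one block: keyword -> a_key, object -> a_obj."""
    keywords: Dict[str, float] = field(default_factory=dict)
    objects: Dict[str, float] = field(default_factory=dict)


def structure_video(
    video: str,
    num_layers: int,
    divide: Callable[[str], List[Block]],
    annotate: Callable[[Block], BlockMetadata],
    structure: Callable[[List[BlockMetadata], int], List[List[str]]],
) -> List[List[str]]:
    """Chain the three units: divide, annotate each block, then structure."""
    blocks = divide(video)                      # block dividing unit 10
    metadata = [annotate(b) for b in blocks]    # metadata adding unit 20
    return structure(metadata, num_layers)      # structuring unit 30
```

Passing the three units in as callables mirrors the description, in which each unit may be backed by an existing tool or an external API.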

≪Metadata Adding Unit≫
FIG. 2 is a configuration diagram of the metadata adding unit 20. As shown in this figure, the metadata adding unit 20 includes recognition units 21 and a metadata integration unit 22.

The recognition units 21 are functional units that perform various kinds of recognition processing on the block division result, and comprise a speech recognition unit 21A, a character recognition unit 21B, and an image recognition unit 21C. The speech recognition unit 21A recognizes speech contained in the block division result. The character recognition unit 21B recognizes characters contained in the block division result. The image recognition unit 21C recognizes images contained in the block division result. External APIs such as those of NTT, Google, Azure, and Watson could be used for the recognition units 21.

The metadata integration unit 22 is a functional unit that derives metadata for each block from the recognition results of the recognition units 21, weights the derived metadata, and integrates the weighted metadata for each block. It includes a keyword derivation unit 22A and an object derivation unit 22B. The keyword derivation unit 22A weights each keyword based on the speech recognition result (keywords) from the speech recognition unit 21A and the character recognition result (keywords) from the character recognition unit 21B. The object derivation unit 22B weights each object based on the image recognition result (objects) from the image recognition unit 21C.

≪Keyword Derivation Unit≫
FIG. 3 is a flowchart showing the operation of the keyword derivation unit 22A. The function of the keyword derivation unit 22A is described below together with its operation, with reference to FIG. 3.

First, the keyword derivation unit 22A derives keywords from the speech recognition result of the speech recognition unit 21A and also derives keywords from the character recognition result of the character recognition unit 21B (steps S1 and S2). For example, the keywords could be derived using an external API such as the NTT corevo keyword extraction API or the Yahoo key phrase extraction API.

Next, the keyword derivation unit 22A derives a weight (a_key) for each keyword derived in keyword derivation steps S1 and S2 and weights each keyword (step S3). Here, the weight of a keyword derived in both steps S1 and S2 may be increased. Also, the results of steps S1 and S2 may be merged and the weights normalized to the range 0.1 to 1 based on how long each keyword appeared, with the keyword that appeared longest given a weight of 1 and the keyword that appeared shortest given a weight of 0.1. Similarly, the results of steps S1 and S2 may be merged and the weights normalized to the range 0.1 to 1 based on how many times each keyword appeared, with the keyword that appeared most often given a weight of 1 and the keyword that appeared least often given a weight of 0.1.

Finally, the keyword derivation unit 22A outputs the keywords weighted in keyword weighting step S3 (step S4).
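The weighting in step S3 can be illustrated with a minimal sketch. Several points are assumptions not stated in the source: appearance times from the two recognizers are merged by summation, the boost for keywords found in both results is a multiplicative factor (`both_boost`, a hypothetical parameter) capped at 1, and keywords that all score the same receive the maximum weight. The source presents the boost and the normalization as separate options; combining them here is also an assumption.

```python
from typing import Dict


def normalize_weights(scores: Dict[str, float],
                      low: float = 0.1, high: float = 1.0) -> Dict[str, float]:
    """Linearly rescale scores so the smallest maps to `low`, the largest to `high`."""
    if not scores:
        return {}
    smallest, largest = min(scores.values()), max(scores.values())
    if largest == smallest:
        # Assumption: when every term scored the same, give them all `high`.
        return {term: high for term in scores}
    scale = (high - low) / (largest - smallest)
    return {term: low + (value - smallest) * scale
            for term, value in scores.items()}


def keyword_weights(speech: Dict[str, float], text: Dict[str, float],
                    both_boost: float = 1.5) -> Dict[str, float]:
    """Merge per-keyword appearance times (seconds) from speech recognition
    and character recognition into weights a_key in the range 0.1 to 1."""
    merged = {k: speech.get(k, 0.0) + text.get(k, 0.0)
              for k in set(speech) | set(text)}
    weights = normalize_weights(merged)
    for k in set(speech) & set(text):
        # Keyword found in both results: raise its weight, capped at 1.
        weights[k] = min(1.0, weights[k] * both_boost)
    return weights
```

The same `normalize_weights` helper applies unchanged when the scores are appearance counts instead of appearance times.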

≪Object Derivation Unit≫
FIG. 4 is a flowchart showing the operation of the object derivation unit 22B. The function of the object derivation unit 22B is described below together with its operation, with reference to FIG. 4.

First, the object derivation unit 22B derives objects from the image recognition result of the image recognition unit 21C (step S11).

Next, the object derivation unit 22B derives a weight (a_obj) for each object derived in object derivation step S11 and weights each object (step S12). Here, the weights may be normalized to the range 0.1 to 1 based on how long each object appeared, with the object that appeared longest given a weight of 1 and the object that appeared shortest given a weight of 0.1. Similarly, the weights may be normalized to the range 0.1 to 1 based on how many times each object appeared, with the object that appeared most often given a weight of 1 and the object that appeared least often given a weight of 0.1.

Finally, the object derivation unit 22B outputs the objects weighted in object weighting step S12 (step S13).

≪Structuring Unit≫
FIG. 5 is a flowchart showing the operation of the structuring unit 30. The function of the structuring unit 30 is described below together with its operation, with reference to FIG. 5.

First, when the number of layers R is input, the structuring unit 30 sets rank to 1 (steps S21 → S22). The number of layers R is specified by the user. rank is a variable that represents the layer currently being processed.

Next, the structuring unit 30 performs clustering based on the metadata (keywords and objects) from the metadata adding unit 20 (step S23). Clustering is a technique for gathering similar items from a large amount of data and classifying them automatically.

Clustering step S23 includes a per-block metadata vectorization step and a block clustering step. The per-block metadata vectorization step takes as input, for each block, the keywords (key) with their weights (a_key) and the objects (obj) with their weights (a_obj), and uses a vectorization tool such as word2vec to derive a semantic vector S(b) for each block (b is the block number). The block clustering step takes the per-block semantic vectors S(b) as input and clusters them using a clustering tool such as the k-means method.
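The two sub-steps of S23 can be sketched in plain Python as follows. This is a hedged illustration, not the source's implementation: the source does not say how the weighted terms are combined into S(b), so the weighted mean of word embeddings used here is an assumption, as are the toy `embeddings` lookup table (standing in for a trained word2vec model) and the choice of the first k points as initial centroids.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def block_vector(terms: Dict[str, float], embeddings: Dict[str, Vector]) -> Vector:
    """Semantic vector S(b): weighted mean of the embeddings of a block's
    keywords and objects (terms maps each term to a_key or a_obj)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, weight in terms.items():
        vec = embeddings.get(term)
        if vec is None or weight <= 0.0:
            continue  # skip out-of-vocabulary and zero-weight terms
        total = [t + weight * v for t, v in zip(total, vec)]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


def kmeans(points: Sequence[Vector], k: int, iterations: int = 20) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

A caller would merge each block's keyword and object weights into one `terms` dictionary before calling `block_vector`, then pass the resulting vectors to `kmeans`.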

Next, the structuring unit 30 derives representative metadata (step S24). In representative metadata derivation step S24, the mean S(b,c) of the per-block semantic vectors S(b) of the blocks that make up each cluster is derived (c is the cluster number) and is used as the representative vector of each block. The representative vector S(b,c) of each block is then converted into a word (W) using a vectorization tool such as word2vec.
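Step S24 can be sketched as below. The source only says that a tool such as word2vec converts the representative vector into a word; choosing the vocabulary word with the highest cosine similarity to the cluster mean is an assumption, and `embeddings` is again a hypothetical lookup table standing in for a trained model.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of the semantic vectors S(b) in one cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def representative_word(cluster_vectors: Sequence[Vector],
                        embeddings: Dict[str, Vector]) -> str:
    """Word W whose embedding lies closest to the cluster's mean vector."""
    centre = mean_vector(cluster_vectors)
    return max(embeddings, key=lambda word: cosine(centre, embeddings[word]))
```

With a real word2vec model, a nearest-word lookup of this kind is typically provided by the library itself (for example, gensim's `similar_by_vector`), so the explicit loop would not be needed.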

Next, the structuring unit 30 updates the weights so that the content can be structured into layers (step S25). In weight update step S25, the following processing may be performed for each cluster.

First, the weight (a_W) for the word W of the representative vector may be reduced. For example, a_W = 0 is set in order to remove the influence of metadata that has already been extracted as a representative vector.

In addition, because objects are useful for dividing content finely, the weight for objects (a_obj) may be increased and the weight for keywords (a_key) decreased as the number of layers increases. For example, the weights are updated as follows.

a_obj(rank+1)=α×a_obj(rank)
a_key(rank+1)=(2-α)×a_key(rank)
α=β×exp(rank+γ)
As shown in FIG. 6, the relationship between the weight update function α and the layer number rank is formulated so that the weight for objects (a_obj) becomes larger as rank increases (as the hierarchy becomes deeper). An exponential function is used here, but other formulas are also conceivable.
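The update in step S25 can be written out as a short sketch. The source gives no values for β and γ, so any values passed in are placeholders. Because the factor (2 − α) becomes negative once α exceeds 2, the sketch clamps it at zero; that clamp, and the function and parameter names, are assumptions that do not appear in the source.

```python
import math
from typing import Dict, Optional, Tuple

Weights = Dict[str, float]


def alpha(rank: int, beta: float, gamma: float) -> float:
    """Weight update function: alpha = beta * exp(rank + gamma)."""
    return beta * math.exp(rank + gamma)


def update_weights(a_key: Weights, a_obj: Weights, rank: int,
                   beta: float, gamma: float,
                   representative_word: Optional[str] = None
                   ) -> Tuple[Weights, Weights]:
    """Return the keyword and object weights for layer rank + 1."""
    a = alpha(rank, beta, gamma)
    key_factor = max(0.0, 2.0 - a)  # assumption: clamp so weights never go negative
    new_key = {k: key_factor * w for k, w in a_key.items()}   # a_key(rank+1)
    new_obj = {o: a * w for o, w in a_obj.items()}            # a_obj(rank+1)
    if representative_word is not None:
        # a_W = 0: remove the influence of the word already used as representative.
        for table in (new_key, new_obj):
            if representative_word in table:
                table[representative_word] = 0.0
    return new_key, new_obj
```

Returning new dictionaries rather than modifying the inputs keeps the weights of each layer available for inspection after the update.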

Next, the structuring unit 30 adds 1 to the value of rank and repeats the same processing until the value of rank reaches the number of layers R (steps S26 → S27 → S23 → ...). When the value of rank reaches the number of layers R, the words (W) are output together with the structuring result (steps S27 → S28).
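The control flow of FIG. 5 (steps S22 to S28) can be sketched as a loop over layers. The clustering, representative-word, and weight-update steps are passed in as callables so the sketch stays self-contained. The function name, the signatures of those callables, and the choice of exactly one pass per layer (R passes in total) are assumptions; the source does not state precisely when the loop terminates.

```python
from typing import Callable, List, Tuple, TypeVar

M = TypeVar("M")  # per-block weighted metadata, in whatever form the caller uses


def run_structuring(
    metadata: M,
    num_layers: int,
    cluster: Callable[[M], List[int]],
    representatives: Callable[[M, List[int]], List[str]],
    update: Callable[[M, List[int], List[str], int], M],
) -> List[Tuple[List[int], List[str]]]:
    """Return, for each layer, the cluster label of every block and the
    representative words W of the clusters."""
    layers = []
    for rank in range(1, num_layers + 1):                  # S22, S26, S27
        labels = cluster(metadata)                         # S23: clustering
        words = representatives(metadata, labels)          # S24: representative metadata
        layers.append((labels, words))
        metadata = update(metadata, labels, words, rank)   # S25: weight update
    return layers                                          # S28: output
```

Each pass clusters the blocks with the weights produced by the previous pass, which is how deeper layers come to be driven more by objects than by keywords.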

≪Example of Structuring Result≫
FIG. 7 is a diagram showing an example of a structuring result produced by the structuring unit 30. It illustrates a UI image displayed to the user. For example, when the user specifies 3 as the number of layers, the representative metadata of each section in layers 1, 2, and 3 is displayed.

Specifically, suppose that a location program consists of a "scene talking about Kyoto", a "scene talking about Osaka", and a "scene talking about Kobe", and that the "scene talking about Kyoto" in turn consists of a "scene of talking in the studio", a "temple scene", and a "teahouse scene". In this case, "Kyoto" may be displayed as the representative metadata of section M11 in layer 1. "Studio", "temple", and "teahouse" may be displayed as the representative metadata of sections M21, M22, and M23 in layer 2. Further, "scene inside the temple", "scene outside the temple", and so on may be displayed as the representative metadata of sections M31, M32, M33, M34, and M35 in layer 3.

As described above, the video content structuring device 1 in the embodiment of the present invention builds its structure on top of the division produced by a conventional technique, so that portions the conventional technique has divided too finely can be merged and displayed as layers 1, 2, and 3. In addition, because the user can specify the number of layers, the granularity of the scenes to be checked can be varied from one video editor to another.

≪Modification≫
In the above embodiment, the number of layers R is specified by the user, but specifying it is not strictly necessary. For example, if the content can be structured into up to 10 layers, all of layers 1 to 10 could be output uniformly as the structuring result.

≪Other Application Examples≫
The above embodiment is described on the assumption that a video editor searches for video scenes, but general users could also use it, for example to search only for scenes in which a favorite celebrity appears. For example, when a favorite celebrity appears on a music program, the conventional techniques attach metadata at scattered points where that celebrity appears, such as the opening, the singing portion, and the closing. Structuring the content with the present invention makes it easier to find only the portion in which the celebrity is singing.

≪Summary≫
As described above, the video content structuring device 1 in the embodiment of the present invention includes the block dividing unit 10 that divides video content into blocks, the metadata adding unit 20 that adds metadata to each divided block, and the structuring unit 30 that hierarchically structures the video content based on the added metadata. Because metadata can thus be extracted for each block, the video content can be structured hierarchically in consideration of the metadata (content) of each block.

The metadata adding unit 20 may derive metadata based on at least one of a speech recognition result, a character recognition result, and an image recognition result, weight the derived metadata, and integrate the weighted metadata for each block. Because this captures features contained in the video and audio, representative features of each block can be extracted.

The metadata adding unit 20 may increase the weight of a keyword derived from both the speech recognition result and the character recognition result. This takes into account that a keyword derived from both the audio and the video (captions) strongly represents the characteristics of its block.

The metadata adding unit 20 may assign a larger weight to keywords and objects that appear for a longer time, or a larger weight to keywords and objects that appear more often. This takes into account that the longer a keyword or object appears, the more strongly it represents the characteristics of its block, and likewise that the more often a keyword or object appears, the more strongly it represents the characteristics of its block.

The structuring unit 30 may reduce the weight for the word of the representative vector. This avoids the problem of a word that has already become a representative vector being derived again in subsequent layers.

The structuring unit 30 may increase the weight for objects and decrease the weight for keywords as the hierarchy becomes deeper. As a result, the deeper the layer, the more readily objects rather than keywords are derived as representative metadata.

The present invention can be realized not only as the video content structuring device 1, but also as a video content structuring method whose steps correspond to the characteristic functional units of the video content structuring device 1, or as a computer program for causing a computer to function as the video content structuring device 1. Needless to say, such a computer program can be distributed via a recording medium such as a CD-ROM or a transmission medium such as the Internet.

≪Other Embodiments≫
Although embodiments of the present invention have been described above, the statements and drawings that form part of this disclosure are illustrative and should not be understood as limiting. Various alternative embodiments, examples, and operational techniques will be apparent to those skilled in the art from this disclosure. That is, the embodiments of the present invention include various embodiments not described here.

DESCRIPTION OF REFERENCE SIGNS
1 Video content structuring device
10 Block dividing unit
20 Metadata adding unit
21 Recognition units
21A Speech recognition unit
21B Character recognition unit
21C Image recognition unit
22 Metadata integration unit
22A Keyword derivation unit
22B Object derivation unit
30 Structuring unit

Claims

A video content structuring device comprising:
a block dividing unit that divides video content into blocks;
a metadata adding unit that adds metadata to each of the divided blocks; and
a structuring unit that hierarchically structures the video content based on the added metadata,
wherein the metadata adding unit derives metadata based on at least one of a speech recognition result, a character recognition result, and an image recognition result, weights the derived metadata, and integrates the weighted metadata for each block.

The video content structuring device according to claim 1, wherein the metadata adding unit increases the weight of a keyword derived from both the speech recognition result and the character recognition result.

The video content structuring device according to claim 1, wherein the metadata adding unit assigns a larger weight to keywords and objects that appear for a longer time, or assigns a larger weight to keywords and objects that appear more often.

The video content structuring device according to claim 1, wherein the structuring unit reduces the weight for the word of the representative vector.

The video content structuring device according to claim 1, wherein the structuring unit increases the weight for objects and decreases the weight for keywords as the hierarchy becomes deeper.

A video content structuring method in which a computer executes:
a block dividing step of dividing video content into blocks;
a metadata adding step of adding metadata to each of the divided blocks; and
a structuring step of hierarchically structuring the video content based on the added metadata,
wherein the metadata adding step derives metadata based on at least one of a speech recognition result, a character recognition result, and an image recognition result, weights the derived metadata, and integrates the weighted metadata for each block.

A computer program for causing a computer to function as the video content structuring device according to any one of claims 1 to 5.