JPWO2009113505A1 - Video segmentation apparatus, method and program

Info

Publication number: JPWO2009113505A1
Application number: JP2010502811A
Authority: JP
Inventors: 真寺尾; 孝文越仲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-03-14
Filing date: 2009-03-09
Publication date: 2011-07-21
Anticipated expiration: 2029-03-09
Also published as: WO2009113505A1; JP5278425B2

Abstract

Provided is a video segmentation apparatus capable of appropriately dividing video data into topics according to its semantic content. The video segmentation apparatus includes: a subject determination unit that refers to text associated with a video and annotated with playback positions in the video, and determines whether a word or word string contained in the text (hereinafter, "word or the like") represents a subject included in the video; a subject weighting unit that gives a word or the like determined to represent a subject a larger weight than it gives to other words or the like; and a video segmentation unit that divides the video by dividing the text on the basis of the weights (FIG. 1).

Description

[Reference to related application]
The present invention claims priority from Japanese Patent Application No. 2008-066221 (filed on March 14, 2008), the entire contents of which are incorporated herein by reference.

The present invention relates to a video segmentation apparatus, method, and program, and in particular to a video segmentation apparatus, method, and program that divide video data into semantically coherent units.

In recent years, large volumes of video data have come into circulation, and techniques for dividing video data into semantically coherent units (hereinafter, "topics") are becoming increasingly important as a way of making video data easier to browse and search.

A typical way to divide video data into topics is to apply a text segmentation technique to text representing the content of the utterances in the video data (hereinafter, "utterance text"). Text obtained by applying speech recognition to the utterances in the video data can be used as the utterance text. If the video data is a television program, subtitle information (closed captions) may also be available. Such utterance text carries playback position information for the video, such as the elapsed time from the beginning of the video data. Dividing the text therefore makes it possible to divide the video data.

In general, a text segmentation technique analyzes the words or word strings (hereinafter, "word or the like") that make up the input text in order to find the word boundaries at which the semantic content of the text changes. In this processing, rather than treating all words equally, giving larger weights to words that are strongly related to the semantic content of each topic in the input text allows the input text to be divided into topics more appropriately according to its semantic content.

IDF (Inverse Document Frequency) is a known method of weighting important words in text processing. The IDF of a word Wi is obtained, after a large number of documents have been collected in advance, as IDF(Wi) = log(total number of documents / number of documents containing the word Wi). That is, the fewer the documents in which a word appears, the larger its IDF. IDF is a word weighting method based on the assumption that, when a large number of documents are collected under some chosen document unit, a word that appears in only a small number of documents is an important word.
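As a rough illustration of the formula above, the following Python sketch computes IDF over a toy collection. The function name `idf`, the representation of each document as a set of words, and the sample data are assumptions made for this example. The source does not state the base of the logarithm, so the natural logarithm is used here.

```python
import math


def idf(word, documents):
    """IDF(Wi) = log(total documents / documents containing Wi)."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        # The formula is undefined for a word that never occurs.
        raise ValueError(f"{word!r} does not occur in the collection")
    return math.log(len(documents) / containing)


# Hypothetical toy collection; each document is a set of words.
docs = [
    {"phone", "maker", "competition"},
    {"phone", "camera", "function"},
    {"butterfly", "garden"},
    {"phone", "price"},
]

rare = idf("butterfly", docs)   # appears in 1 of 4 documents
common = idf("phone", docs)     # appears in 3 of 4 documents
```

In this toy collection the rarer word receives the larger IDF, which is the behavior the paragraph describes.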

An example of IDF-based word weighting in text segmentation is described in Non-Patent Document 1. In Non-Patent Document 1, analysis intervals of a fixed width are set over each part of the input text, and for each analysis interval a topic vector is obtained whose elements are the importance of each word in that interval. The importance of a word Wi in an analysis interval is TF(Wi) × IDF(Wi), where TF(Wi) is the frequency of Wi within that interval. In other words, the topic vector is the distribution of word frequencies in the analysis interval corrected by IDF. Once the topic vectors have been obtained, the sequence of cosine similarities between the topic vectors of adjacent analysis intervals is computed, and local minima of the similarity are detected as topic boundaries.
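The procedure described above can be sketched as follows. This is a simplified illustration, not the implementation of Non-Patent Document 1: it assumes non-overlapping windows of `width` words, a precomputed `idf` dictionary, and strict interior local minima as boundaries. All function names are hypothetical.

```python
import math
from collections import Counter


def topic_vector(window, idf):
    """TF(Wi) x IDF(Wi) for every word in one analysis interval."""
    tf = Counter(window)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def find_boundaries(words, idf, width):
    """Return (similarity series, word indices where a new topic starts)."""
    # Assumption: non-overlapping fixed-width analysis intervals.
    windows = [words[i:i + width] for i in range(0, len(words), width)]
    vectors = [topic_vector(w, idf) for w in windows]
    sims = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    cuts = []
    for i in range(1, len(sims) - 1):
        # sims[i] compares window i with window i + 1, so a local
        # minimum places the boundary at the start of window i + 1.
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]:
            cuts.append((i + 1) * width)
    return sims, cuts
```

Because only interior minima are tested, a change of topic between the first two or the last two windows is not reported by this sketch.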

Patent Document 1: International Publication No. 2004/095374 pamphlet

Non-Patent Document 1: Utsumi, Fujii, Tanaka, "Text Segmentation Method with Variable Analysis Interval Length," Proceedings of the 12th Annual Meeting of the Association for Natural Language Processing, pp. 117-120, 2006.

Non-Patent Document 2: K. Kimura, I. Yamada, H. Sumiyoshi, N. Yagi, "Automatic Generation of a Multimedia Encyclopedia from TV Programs by Using Closed Captions and Detecting Principal Video Objects," IEEE International Symposium on Multimedia '06, pp. 873-880, 2006.

The entire disclosures of the above patent document and non-patent documents are incorporated herein by reference. The following analysis is given by the present invention.

However, the conventional technique described above has the following problem.

Namely, when video data is divided by segmenting its utterance text, weighting words by IDF alone does not necessarily divide the video data into topics appropriately according to its semantic content. The reason is as follows.

The IDF of a word is only an index of the word's general importance and does not necessarily indicate how strongly the word is related to the semantic content of an individual topic. For example, the IDF of the word "cabbage white butterfly" has the same value in whatever scene the word appears. In practice, however, how strongly the cabbage white butterfly relates to the semantic content of the topic differs depending on the scene in which the word appears. To segment topics according to semantic content, it is therefore desirable to increase the weight of the word "cabbage white butterfly" when the butterfly actually is the subject matter of the topic, and to decrease it otherwise. IDF does not perform this kind of weighting.

The problem to be solved is therefore to provide a video segmentation apparatus, a video segmentation method, and a video segmentation program that can appropriately divide video data into topics according to its semantic content.

A video segmentation apparatus according to a first aspect of the present invention includes: a subject determination unit that refers to text associated with a video and annotated with playback positions in the video, and determines whether a word or word string contained in the text (hereinafter, "word or the like") represents a subject included in the video; a subject weighting unit that gives a word or the like determined to represent a subject a larger weight than it gives to other words or the like; and a video segmentation unit that divides the video by dividing the text on the basis of the weights.

A video segmentation method according to a second aspect of the present invention includes, as steps performed by a computer: a subject determination step of referring to text associated with a video and annotated with playback positions in the video, and determining whether a word or word string contained in the text (hereinafter, "word or the like") represents a subject included in the video; a subject weighting step of giving a word or the like determined to represent a subject a larger weight than is given to other words or the like; and a video segmentation step of dividing the video by dividing the text on the basis of the weights.

A program according to a third aspect of the present invention causes a computer to execute: a subject determination process of referring to text associated with a video and annotated with playback positions in the video, and determining whether a word or word string contained in the text (hereinafter, "word or the like") represents a subject included in the video; a subject weighting process of giving a word or the like determined to represent a subject a larger weight than is given to other words or the like; and a video segmentation process of dividing the video by dividing the text on the basis of the weights.

The video segmentation apparatus according to the present invention can appropriately divide video data into topics according to its semantic content. This is because the apparatus refers to text that is associated with a video and annotated with playback positions in that video, determines whether a word or the like contained in the text represents a subject shown in the video, increases the weight of a word or the like determined to represent a subject, and divides the video by dividing the text on the basis of those weights.

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the subject determination unit in the first embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the first embodiment of the present invention.
FIG. 4 is a diagram illustrating a specific example of the text storage unit in the first embodiment of the present invention.
FIG. 5 is a diagram illustrating a specific example of the subject recognition result storage unit in the first embodiment of the present invention.
FIG. 6 is a diagram illustrating a specific example of a determination result of the subject determination unit in the first embodiment of the present invention.
FIG. 7 is a diagram illustrating a specific example of the weighted text storage unit in the first embodiment of the present invention.
FIG. 8 is a diagram illustrating an example of the operation of the video segmentation unit in the first embodiment of the present invention.
FIG. 9 is a diagram illustrating a specific example of the operation of the first embodiment of the present invention.
FIG. 10 is a diagram illustrating a specific example of the operation of the first embodiment of the present invention.
FIG. 11 is a block diagram showing the configuration of the second embodiment of the present invention.

Explanation of symbols

11 Video data storage unit
12 Text storage unit
13 Subject determination unit
14 Subject weighting unit
15 Weighted text storage unit
16 Video segmentation unit
17 Segmentation result storage unit
18 Video viewing unit
31 Video segmentation program
32 Data processing device
33 Storage device
130 Subject recognition unit
131 Object recognition unit
132 Face image recognition unit
133 Character recognition unit
134 Subject extraction unit
135 Subject recognition result storage unit
136 Matching unit
331 Video data storage unit
332 Text storage unit
333 Weighted text storage unit
334 Segmentation result storage unit
335 Subject recognition result storage unit

In a video segmentation apparatus according to a first development form, the subject determination unit preferably determines whether the word or the like represents a subject shown in the portion of the video that lies within a predetermined range relative to the playback position.

In a video segmentation apparatus according to a second development form, when the subject determination unit determines that the word or the like represents a subject shown in the video, it preferably calculates a confidence that the word or the like represents the subject, and the subject weighting unit preferably gives the word or the like a larger weight the higher the confidence.

In a video segmentation apparatus according to a third development form, when the subject determination unit determines that the word or the like represents a subject shown in the video, it preferably determines the importance of the subject in the video, and the subject weighting unit preferably gives the word or the like a larger weight the higher the importance.

In a video segmentation apparatus according to a fourth development form, the subject determination unit preferably determines the importance of the subject according to the proportion of the video that the subject occupies.
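One way the second to fourth development forms could fit together is sketched below. The text states only that the weight grows with confidence and with importance, and that importance depends on the proportion of the video the subject occupies. The linear formula, the parameters `base` and `gain`, the bounding-box representation, and both function names are assumptions introduced for this illustration.

```python
def area_importance(subject_box, frame_size):
    """Fraction of the frame covered by the subject's bounding box.

    subject_box is (x0, y0, x1, y1) in pixels; frame_size is (width, height).
    """
    x0, y0, x1, y1 = subject_box
    width, height = frame_size
    return max(0, x1 - x0) * max(0, y1 - y0) / (width * height)


def subject_weight(is_subject, confidence=1.0, importance=1.0,
                   base=1.0, gain=4.0):
    """Weight for one word or the like.

    Non-subject words keep the base weight. Subject words get a weight
    that increases with both confidence and importance (assumed linear).
    """
    if not is_subject:
        return base
    return base + gain * confidence * importance


# A subject filling a quarter of a 640x480 frame, recognized with
# confidence 0.5, versus a word that names no subject.
importance = area_importance((0, 0, 320, 240), (640, 480))
w_subject = subject_weight(True, confidence=0.5, importance=importance)
w_other = subject_weight(False)
```

Any monotonically increasing function of confidence and importance would satisfy the description equally well; the linear form is chosen here only for readability.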

In a video segmentation apparatus according to a fifth development form, the subject determination unit preferably further includes a subject recognition unit that recognizes a subject shown in the video and outputs it as subject recognition result text, and a matching unit that matches the word or the like against the subject recognition result text to determine whether the word or the like represents a subject shown in the video.

In a video segmentation apparatus according to a sixth development form, the subject preferably includes an object, a face image, or characters.

In a video segmentation apparatus according to a seventh development form, the subject recognition unit preferably includes at least one of: an object recognition unit that recognizes objects; a face image recognition unit that recognizes face images; a character recognition unit that recognizes characters; and a subject extraction unit that extracts words or the like representing a subject from the words or the like contained in the text.

In a video segmentation apparatus according to an eighth development form, the playback position is preferably assigned per sentence or per word or the like contained in the text.

In a video segmentation apparatus according to a ninth development form, the matching unit preferably expands at least one of the word or the like and the subject recognition result text using a thesaurus, and then matches the word or the like against the subject recognition result text.
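Thesaurus-based matching of this kind might look like the following sketch. The thesaurus contents, the dictionary-of-sets representation, and the function names `expand` and `matches` are hypothetical. Here both sides are expanded, although the text requires expanding only at least one of them.

```python
# Hypothetical thesaurus: each entry maps a term to its synonyms.
THESAURUS = {
    "mobile phone": {"cell phone", "handset"},
    "cell phone": {"mobile phone", "handset"},
}


def expand(term, thesaurus):
    """Return the term together with its synonyms, if any are listed."""
    return {term} | thesaurus.get(term, set())


def matches(word, recognized_labels, thesaurus):
    """True if the word and any recognition result share a term
    after both have been expanded with the thesaurus."""
    word_forms = expand(word, thesaurus)
    for label in recognized_labels:
        if word_forms & expand(label, thesaurus):
            return True
    return False


# The recognizer reported "mobile phone"; the utterance said "handset".
hit = matches("handset", ["mobile phone"], THESAURUS)
miss = matches("butterfly", ["mobile phone"], THESAURUS)
```

Expansion lets a spoken synonym match a recognition result that an exact string comparison would miss.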

In a video segmentation apparatus according to a tenth development form, the text is preferably text representing the content of utterances included in the video.

In a video segmentation method according to an eleventh development form, the subject determination step preferably determines whether the word or the like represents a subject shown in the portion of the video that lies within a predetermined range relative to the playback position.

In a video segmentation method according to a twelfth development form, when the subject determination step determines that the word or the like represents a subject shown in the video, it preferably calculates a confidence that the word or the like represents the subject, and the subject weighting step preferably gives the word or the like a larger weight the higher the confidence.

In a video segmentation method according to a thirteenth development form, when the subject determination step determines that the word or the like represents a subject shown in the video, it preferably determines the importance of the subject in the video, and the subject weighting step preferably gives the word or the like a larger weight the higher the importance.

In a video segmentation method according to a fourteenth development form, the subject determination step preferably determines the importance of the subject according to the proportion of the video that the subject occupies.

In a video segmentation method according to a fifteenth development form, the subject determination step preferably includes a subject recognition step of recognizing a subject shown in the video and outputting it as subject recognition result text, and a matching step of matching the word or the like against the subject recognition result text to determine whether the word or the like represents a subject shown in the video.

In a video segmentation method according to a sixteenth development form, the subject preferably includes an object, a face image, or characters.

In a video segmentation method according to a seventeenth development form, the subject recognition step preferably includes at least one of: an object recognition step of recognizing objects; a face image recognition step of recognizing face images; a character recognition step of recognizing characters; and a subject extraction step of extracting words or the like representing a subject from the words or the like contained in the text.

In a video segmentation method according to an eighteenth development form, the matching step preferably expands at least one of the word or the like and the subject recognition result text using a thesaurus, and then matches the word or the like against the subject recognition result text.

In a video segmentation method according to a nineteenth development form, the text is preferably text representing the content of utterances included in the video.

In a program according to a twentieth development form, the subject determination process preferably determines whether the word or the like represents a subject shown in the portion of the video that lies within a predetermined range relative to the playback position.

In a program according to a twenty-first development form, when the subject determination process determines that the word or the like represents a subject shown in the video, it preferably calculates a confidence that the word or the like represents the subject, and the subject weighting process preferably gives the word or the like a larger weight the higher the confidence.

In a program according to a twenty-second development form, when the subject determination process determines that the word or the like represents a subject shown in the video, it preferably determines the importance of the subject in the video, and the subject weighting process preferably gives the word or the like a larger weight the higher the importance.

In a program according to a twenty-third development form, the subject determination process preferably determines the importance of the subject according to the proportion of the video that the subject occupies.

A program according to a twenty-fourth development form preferably causes the computer to execute, within the subject determination process, a subject recognition process of recognizing a subject shown in the video and outputting it as subject recognition result text, and a matching process of matching the word or the like against the subject recognition result text to determine whether the word or the like represents a subject shown in the video.

In a program according to a twenty-fifth development form, the subject preferably includes an object, a face image, or characters.

A program according to a twenty-sixth development form preferably causes the computer to execute, within the subject recognition process, at least one of: an object recognition process of recognizing objects; a face image recognition process of recognizing face images; a character recognition process of recognizing characters; and a subject extraction process of extracting words or the like representing a subject from the words or the like contained in the text.

In a program according to a twenty-seventh development form, the matching process preferably expands at least one of the word or the like and the subject recognition result text using a thesaurus, and then matches the word or the like against the subject recognition result text.

In a program according to a twenty-eighth development form, the text is preferably text representing the content of utterances included in the video.

Embodiments of the present invention will be described below with reference to the drawings.

Referring to FIG. 1, the video segmentation apparatus according to the present embodiment includes a subject determination unit 13, a subject weighting unit 14, and a video segmentation unit 16. The subject determination unit 13 refers to text associated with a video and annotated with playback positions in the video, and determines whether a word or word string contained in the text (hereinafter, "word or the like") represents a subject included in the video. The subject weighting unit 14 gives a word or the like determined to represent a subject a larger weight than it gives to other words or the like. The video segmentation unit 16 divides the video by dividing the text on the basis of the weights.

Next, a first embodiment of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the first embodiment. The first embodiment is a video segmentation apparatus that divides a video into a plurality of sections. Referring to FIG. 1, the video segmentation apparatus includes a video data storage unit 11, a text storage unit 12, a subject determination unit 13, a subject weighting unit 14, a weighted text storage unit 15, a video segmentation unit 16, a segmentation result storage unit 17, and a video viewing unit 18. These units operate as follows.

The video data storage unit 11 stores the video data to be divided. The text storage unit 12 stores text that relates to the video data stored in the video data storage unit 11 and that is annotated with playback position information for the video data.

First, for each word or the like contained in the text stored in the text storage unit 12, the subject determination unit 13 determines whether that word or the like represents a subject shown in the video data stored in the video data storage unit 11, and outputs the determination result to the subject weighting unit 14.

Next, the subject weighting unit 14 assigns a weight to each word or the like contained in the text stored in the text storage unit 12, such that words or the like determined by the subject determination unit 13 to represent a subject receive larger weights, and outputs the result to the weighted text storage unit 15.

The video segmentation unit 16 reads from the weighted text storage unit 15 the text in which each word or the like has been weighted, divides the video data into topics by dividing the text using the weights, and outputs the result to the segmentation result storage unit 17.

The video viewing unit 18 reads the segmentation result for the video data stored in the segmentation result storage unit 17, and searches and plays back the video data stored in the video data storage unit 11 in units of the resulting topics.
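Topic-level search and playback of the kind performed by the video viewing unit 18 could be sketched as below. The `Topic` record, its fields, and the functions `search_topics` and `playback_range` are hypothetical; the source does not describe how the segmentation result is stored or queried.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    start: float   # seconds from the beginning of the video
    end: float     # seconds from the beginning of the video
    words: list    # words or the like contained in the topic's text


def search_topics(topics, query):
    """Return the topics whose text contains the query word."""
    return [t for t in topics if query in t.words]


def playback_range(topic):
    """Time range to play back for one topic."""
    return (topic.start, topic.end)


# Hypothetical segmentation result with two topics.
result = [
    Topic(0.0, 95.0, ["weather", "rain", "forecast"]),
    Topic(95.0, 240.0, ["mobile phone", "manufacturers", "functions"]),
]

hits = search_topics(result, "mobile phone")
ranges = [playback_range(t) for t in hits]
```

Because each topic carries its own time range, a text query resolves directly to the section of video to play.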

Next, the overall operation of this embodiment will be described in detail with reference to the block diagrams of FIGS. 1 and 2, the flowchart of FIG. 3, and the explanatory diagrams of FIGS. 4 to 10.

The video data storage unit 11 stores the video data to be divided. The video data may be any of various kinds of video, such as television programs, lecture videos, and home videos.

The text storage unit 12 stores text that relates to the video data stored in the video data storage unit 11 and that is annotated with playback positions in the video data. One example of such text is text representing the content of the utterances included in the video data. Specifically, speech recognition result text obtained by applying speech recognition to the utterances, or subtitle information (closed captions) transmitted with a television program, can be used. The text storage unit 12 may also store comments, impressions, summary text, and the like that have been manually attached to individual scenes in the video.

The text stored in the text storage unit 12 in this embodiment is assumed to be separated into words. For a language such as Japanese, in which text is not written with spaces between words, the text is preferably divided into words in advance using a known morphological analysis technique.

To indicate which section of the video data each piece of text corresponds to, the text must be annotated with information representing a playback position in the video data. The elapsed time from the beginning of the video data or from a specific position within it, the number of image frames, and the like can be used as this information. The playback position information may be assigned per sentence or per word contained in the text. When the video segmentation unit 16 divides the text into topics, the video data can also be divided on the basis of the playback position information attached to the divided text.
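The step from a text split to a video split can be illustrated with a short sketch. It assumes playback positions are attached per sentence as start and end times in seconds, and that the text segmentation yields the indices of the sentences that open a new topic. The function name and data layout are hypothetical.

```python
def video_segments(sentences, cut_indices):
    """Map topic boundaries found in the text to time ranges in the video.

    sentences:   list of (start_sec, end_sec, text), in playback order.
    cut_indices: distinct indices (1 .. len(sentences) - 1) of the
                 sentences that begin a new topic.
    """
    starts = [0] + sorted(cut_indices)
    ends = starts[1:] + [len(sentences)]
    segments = []
    for first, stop in zip(starts, ends):
        # A topic runs from the start of its first sentence
        # to the end of its last sentence.
        segments.append((sentences[first][0], sentences[stop - 1][1]))
    return segments


# Hypothetical per-sentence playback positions.
sentences = [
    (0.0, 4.0, "sentence a"),
    (4.0, 9.5, "sentence b"),
    (9.5, 15.0, "sentence c"),
    (15.0, 21.0, "sentence d"),
]

segments = video_segments(sentences, [2])
```

With a boundary before the third sentence, the video is divided at 9.5 seconds, which is the playback position attached to that sentence.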

In the following, as an example, the text storage unit 12 is assumed to store text representing the content of utterances contained in the video data, together with elapsed-time information measured from the start of the video data. Of course, in the present invention the text stored in the text storage unit 12 is not limited to text representing the content of utterances.

FIG. 4 shows an example of the data stored in the text storage unit 12. Referring to FIG. 4, in the video data stored in the video data storage unit 11, the utterance "Competition among mobile phone manufacturers is intensifying" occurs between 102.0 and 105.0 seconds after the start of the video data, and the utterance "Various functions have come to be installed in mobile phones" occurs between 105.0 and 110.0 seconds after the start. Each text has been segmented into words by a morphological analysis technique. As described above, such text is obtained by applying speech recognition to the utterances or by using subtitle information.
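As a non-limiting illustration, the time-stamped, word-segmented text described for FIG. 4 could be held in a structure such as the following Python sketch. The names `TimedSentence`, `text_store`, and `sentences_at` are hypothetical, and the word lists are English glosses of the content words only (function words are omitted); the actual segmentation in FIG. 4 is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedSentence:
    start: float      # seconds elapsed from the start of the video data
    end: float        # seconds elapsed from the start of the video data
    words: List[str]  # the sentence, already segmented into words


# The two utterances described for FIG. 4 (content-word glosses only; assumption).
text_store = [
    TimedSentence(102.0, 105.0, ["mobile phone", "maker", "competition", "intensify"]),
    TimedSentence(105.0, 110.0, ["various", "function", "mobile phone", "install"]),
]


def sentences_at(store, t):
    """Return the sentences whose span [start, end) contains playback time t."""
    return [s for s in store if s.start <= t < s.end]
```

Because every sentence carries its own time span, any later division of the text can be mapped back to a playback position in the video data.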

When dividing the video data, the subject determination unit 13 first determines, for each word or the like included in the text stored in the text storage unit 12, whether that word or the like represents a subject appearing in the video data stored in the video data storage unit 11 (steps S21 to S23). The subject determination unit 13 outputs the determination result to the subject weighting unit 14. Here, a subject means any object, face image, character image, place, or the like that appears in the video data.

FIG. 2 is a block diagram showing the configuration of the subject determination unit 13. Referring to FIG. 2, the subject determination unit 13 includes a subject recognition unit 130, a subject recognition result storage unit 135, and a collation unit 136. The subject recognition unit 130 in turn includes an object recognition unit 131, a face image recognition unit 132, a character recognition unit 133, and a subject extraction unit 134.

Using the subject recognition unit 130, the subject determination unit 13 recognizes the subjects appearing in the video data stored in the video data storage unit 11 and outputs text representing those subjects to the subject recognition result storage unit 135 as subject recognition result text (step S21). The subjects appearing in the video data are recognized by the object recognition unit 131, the face image recognition unit 132, the character recognition unit 133, and the subject extraction unit 134.

The object recognition unit 131 reads the video data from the video data storage unit 11 and can be realized by applying to the video data, for example, the object recognition technique described in Patent Document 1 (the disclosure of which is incorporated herein by reference). Here, an object means a single coherent physical thing appearing in the video. According to Patent Document 1, to recognize an object, an image in the video data is first divided to extract partial videos (the partial video extraction means of Patent Document 1); visual features such as a color layout and an edge histogram are computed from each partial video (the visual feature setting means of Patent Document 1); and the similarity between the visual features of the partial video and those of a candidate object, that is, a candidate for the subject, is computed (the feature comparison means of Patent Document 1). If the similarity exceeds a threshold, the candidate object is determined to appear in the video data. By performing this determination for various candidate objects, the various objects appearing in the video data can be recognized as subjects. The object recognition unit 131 outputs the names of the objects determined to appear in the video to the subject recognition result storage unit 135 as subject recognition result text.
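The threshold test described above can be sketched as follows. This is a minimal illustration only, not the method of Patent Document 1: the similarity measure (cosine similarity), the toy feature vectors, the threshold value 0.7, and the names `cosine_similarity` and `recognize_objects` are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize_objects(region_features, candidates, threshold=0.7):
    """Compare every partial-video feature vector with every candidate object.

    Returns {candidate name: best similarity} for the candidates whose
    similarity to at least one region exceeds the threshold.
    """
    recognized = {}
    for features in region_features:
        for name, candidate_features in candidates.items():
            similarity = cosine_similarity(features, candidate_features)
            if similarity > threshold and similarity > recognized.get(name, 0.0):
                recognized[name] = similarity
    return recognized


# Hypothetical two-dimensional features for two candidate objects and one region.
candidates = {"mobile phone": [1.0, 0.0], "robot": [0.0, 1.0]}
result = recognize_objects([[0.9, 0.1]], candidates)
```

Keeping the best similarity per candidate is a design choice of this sketch; it makes the value directly reusable as a confidence score for the recognition result.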

This processing may be performed on every image frame in the video data. Alternatively, to reduce the amount of processing, object recognition may be performed once per shot (a continuous video section with no camera switch or the like). To do so, a known shot segmentation technique is applied to the video data in advance, and object recognition is performed on a representative image of each resulting shot. Various methods can be used to select the representative image; as one example, the first image frame of the shot may simply be used.

The technique described in Patent Document 1 has been explained here as one way of realizing the object recognition unit 131. Of course, the object recognition unit 131 of the present invention is not limited to that technique; any technique that recognizes objects appearing in video may be used.

The face image recognition unit 132 can be realized by reading the video data from the video data storage unit 11 and applying a known face image recognition technique to it. The face image recognition unit 132 outputs the name, job title, or the like of each person determined to appear in the video to the subject recognition result storage unit 135 as subject recognition result text.

The character recognition unit 133 can be realized by reading the video data from the video data storage unit 11 and applying a known character recognition technique to it. For example, the character recognition unit 133 outputs the character recognition result for a character image appearing in the video, as it is, to the subject recognition result storage unit 135 as subject recognition result text. Alternatively, the character recognition unit 133 may extract person names, place names, nouns in general, and the like from the character recognition result and output these to the subject recognition result storage unit 135 as subject recognition result text.

When the video data is a television program or the like, information on the content of the on-screen text (telops) displayed in the video may also be distributed. In that case, the character recognition unit 133 may, without performing character recognition on the character strings appearing in the video, treat the distributed telop information as the character recognition result and output subject recognition result text from it.

The subject extraction unit 134 reads the text from the text storage unit 12 and extracts, from the words or the like included in the text, those that represent a subject appearing in the video. This processing can be realized, for example, by using the technique described in Non-Patent Document 2 (the disclosure of which is incorporated herein by reference). That is, whether a given noun in the text appears in the video is determined by production rules that combine features such as "whether the sentence ends with that noun" and "what kind of function word is attached to the noun phrase containing that noun". According to the technique described in Non-Patent Document 2, a noun representing the main subject of each shot in the video can be extracted from the text. The subject extraction unit 134 outputs the words or the like determined to represent a subject appearing in the video to the subject recognition result storage unit 135 as subject recognition result text.

The technique described in Non-Patent Document 2 has been explained here as one way of realizing the subject extraction unit 134. Of course, the subject extraction unit 134 of the present invention is not limited to that technique; any technique that extracts words or the like representing a subject from text may be used.

In addition to the subject recognition result text, the object recognition unit 131, the face image recognition unit 132, the character recognition unit 133, and the subject extraction unit 134 described above may output the reliability of the subject recognition. Here, the reliability of subject recognition is a value representing how likely the obtained subject recognition result is to be correct. Because the subject recognition results output by the object recognition unit 131, the face image recognition unit 132, the character recognition unit 133, and the subject extraction unit 134 are produced by automatic processing, they may be wrong. The higher the reliability, however, the more likely it is that the subject named by the subject recognition result text actually appears in the video data.

For example, when the object recognition unit 131 is realized by the method described in Patent Document 1, the similarity between the visual features extracted from a partial video in the video data and the visual features of the candidate object can be used as the reliability of the subject recognition result. This is because the greater the similarity, the more likely it is that the recognized subject actually appears in the video data.

When the subject extraction unit 134 is realized by the method described in Non-Patent Document 2, the prediction accuracy of the production rule applied when the subject was extracted can be used as the reliability of the subject recognition result. This is because a subject recognition result obtained by a rule with higher prediction accuracy is more likely to actually appear in the video data.

The face image recognition unit 132 and the character recognition unit 133 can likewise use a reliability computed by methods well known in face image recognition and character recognition. These reliability values may be normalized to the range 0 to 1 by any suitable method.

In addition to the subject recognition result text, the object recognition unit 131, the face image recognition unit 132, and the character recognition unit 133 may output the importance of the subject in the video data. Here, the importance of a subject in the video data is a value representing how closely the subject is related to the semantic content of the video data.

The importance of a subject can be obtained, for example, as the proportion of the whole screen occupied by the region in which the object recognized by the object recognition unit 131, the face image recognized by the face image recognition unit 132, or the character image recognized by the character recognition unit 133 appears. This is because the larger the area in which a subject such as an object, face image, or characters appears in the video, the more likely that subject is to be the main subject of the video and to be closely related to its semantic content. Alternatively, the length of time for which a subject such as an object, face image, or characters appears in the video can be used as the importance.
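As one hedged illustration of the area-based importance described above, the following sketch computes the fraction of the frame covered by a subject's bounding box. The bounding-box representation `(x, y, width, height)` and the function name `area_importance` are assumptions of this example; the description does not specify how the region is represented.

```python
def area_importance(box, frame_width, frame_height):
    """Fraction of the frame area covered by a subject's bounding box.

    box is (x, y, width, height) in pixels. The result is clamped to [0, 1]
    so that a box extending past the frame cannot exceed full importance.
    """
    _, _, width, height = box
    frame_area = frame_width * frame_height
    if frame_area <= 0:
        raise ValueError("frame dimensions must be positive")
    return min(1.0, max(0.0, (width * height) / frame_area))


# A 960x432 region in a 1920x1080 frame covers one fifth of the screen.
importance = area_importance((100, 200, 960, 432), 1920, 1080)
```

A duration-based importance could be computed analogously as the time the subject is visible divided by the length of the shot or of the video.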

In this way, the object recognition unit 131, the face image recognition unit 132, the character recognition unit 133, and the subject extraction unit 134 recognize the subjects appearing in the video data stored in the video data storage unit 11, and subject recognition result text is obtained. Objects, face images, and character images appearing in the video are considered to be particularly closely related to the semantic content of the video data. Each subject recognition result text may be annotated with playback position information indicating the section of the video data in which the subject appears, for example as the elapsed time from the start of the video data. As described above, each subject recognition result text may also be annotated with the reliability of the subject recognition result, the importance of the subject in the video data, or the like. These subject recognition results are stored in the subject recognition result storage unit 135.

FIG. 5 shows an example of the data stored in the subject recognition result storage unit 135. Referring to FIG. 5, "mobile phone" appears as a subject between 105.0 and 120.0 seconds after the start of the video data; the reliability of that subject recognition result is 0.8, and the importance of the subject is 0.4. In addition, "PaPeRo" appears as a subject between 190.0 and 195.0 seconds after the start of the video data; the reliability of that subject recognition result is 0.6, and the importance of the subject is 0.2.

The subject recognition unit 130 need not include all of the object recognition unit 131, the face image recognition unit 132, the character recognition unit 133, and the subject extraction unit 134; it suffices that it includes at least one of them.

Next, using the collation unit 136, the subject determination unit 13 determines, for each word included in the text stored in the text storage unit 12, whether that word represents a subject appearing in the video data (steps S22 and S23). The subject determination unit 13 outputs the determination result to the subject weighting unit 14. The collation unit 136 makes this determination by collating each word included in the text against the subject recognition result text stored in the subject recognition result storage unit 135.

The operation of the collation unit 136 is now described. As an example, assume that the text storage unit 12 stores the text shown in FIG. 4 and that the subject recognition result storage unit 135 stores the subject recognition results shown in FIG. 5. The collation unit 136 determines whether the word "mobile phone" included in the text represents a subject (step S22). This determination is made by collating "mobile phone" against the subject recognition result text stored in the subject recognition result storage unit 135. Because "mobile phone" is present in the subject recognition result text, the word "mobile phone" is determined to represent a subject appearing in the video data.

Because not all words in the text stored in the text storage unit 12 have been judged yet (No in step S23), the collation unit 136 next determines whether "maker", the word following "mobile phone", represents a subject (step S22). Collating "maker" against the subject recognition result text shows that "maker" is not present there, so the word "maker" is determined not to represent a subject (step S22). This processing is repeated until all words in the text have been judged (Yes in step S23). In making these determinations, the collation unit 136 may collate word strings, rather than single words, against the subject recognition result text. For example, it may check whether the word string "mobile phone/maker", formed by combining the two words "mobile phone" and "maker", is present in the subject recognition results.
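The loop of steps S22 and S23 can be sketched as below. This is an illustrative simplification: exact string matching, the function name `judge_words`, the `max_len` and `sep` parameters, and the choice to mark every word of a matched word string as a subject are all assumptions, not details given in the description.

```python
def judge_words(words, recognized_subjects, max_len=1, sep=" "):
    """Judge each word of a segmented text against the recognized subjects.

    Word strings of up to max_len consecutive words are joined with sep and
    looked up as well. Returns a list of (word, represents_subject) pairs.
    """
    is_subject = [False] * len(words)
    for n in range(1, max_len + 1):                 # length of the word string
        for i in range(len(words) - n + 1):         # step S22 for each position
            if sep.join(words[i:i + n]) in recognized_subjects:
                for j in range(i, i + n):
                    is_subject[j] = True
    return list(zip(words, is_subject))             # all words judged (S23: Yes)


recognized = {"mobile phone", "PaPeRo"}
single = judge_words(["mobile phone", "maker"], recognized)
strings = judge_words(["mobile phone", "maker"], {"mobile phone maker"}, max_len=2)
```

With `max_len=1` the sketch reproduces the word-by-word walk-through above; a larger `max_len` corresponds to the word-string variant.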

When the collation unit 136 determines that a word represents a subject, it may include the reliability of that determination in the determination result output to the subject weighting unit 14. Here, the reliability of the determination is a value representing how likely the determination is to be correct.

When the collation unit 136 determines that a word represents a subject, it may also include the importance of that subject in the video data in the determination result output to the subject weighting unit 14. Here, the importance is a value representing how closely the subject is related to the semantic content of the video data.

To output these reliability and importance values, the values stored in the subject recognition result storage unit 135 can be used. For example, referring to FIG. 5, the subject recognition result for "mobile phone" has a reliability of 0.8 and an importance of 0.4, so when the word "mobile phone" in the text is determined to represent a subject, the reliability of the determination can be set to 0.8 and the importance of the subject to 0.4. In the subject recognition results, "PaPeRo" appears in two places; in such a case, for example, the larger value may be adopted for each of the reliability and the importance. As a result, when the word "PaPeRo" in the text is determined to represent a subject, the reliability of the determination can be set to 0.8 and the importance of the subject to 0.6. Alternatively, the values of the "PaPeRo" subject recognition result that is closest in time to the playback position associated with the word "PaPeRo" in the text may be used.
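The two ways of choosing values when a subject is recognized more than once can be sketched as follows. The `Recognition` record and the function names are hypothetical. The second "PaPeRo" entry uses the values implied by the maxima quoted above (reliability 0.8, importance 0.6), but its time span is a placeholder, since it is not stated in the description.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    name: str
    start: float        # seconds from the start of the video data
    end: float
    reliability: float
    importance: float


fig5 = [
    Recognition("mobile phone", 105.0, 120.0, 0.8, 0.4),
    Recognition("PaPeRo", 190.0, 195.0, 0.6, 0.2),
    Recognition("PaPeRo", 300.0, 310.0, 0.8, 0.6),  # time span is a placeholder
]
papero = [r for r in fig5 if r.name == "PaPeRo"]


def merge_max(entries):
    """Adopt the larger value for reliability and for importance independently."""
    return (max(e.reliability for e in entries),
            max(e.importance for e in entries))


def nearest_in_time(entries, t):
    """Use the values of the entry closest in time to playback position t."""
    def distance(e):
        if e.start <= t <= e.end:
            return 0.0
        return min(abs(t - e.start), abs(t - e.end))
    best = min(entries, key=distance)
    return best.reliability, best.importance
```

Taking the maximum favors the most confident sighting anywhere in the video, whereas the nearest-in-time rule keeps the values local to where the word was spoken.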

When collating a word against the subject recognition result text, the collation unit 136 may first expand at least one of the word and the subject recognition result text using a thesaurus and then collate the two. Here, a thesaurus is a dictionary from which synonyms, broader terms, narrower terms, related terms, and the like of a word can be obtained.

For example, using a thesaurus, "mobile phone" in the subject recognition result text shown in FIG. 5 may be expanded to the synonym "keitai" (ケータイ) or to the narrower term "N905i", and "PaPeRo" may be expanded to the broader term "robot", before collation is performed. With this processing, if the word "keitai" appears in the text stored in the text storage unit 12, for example, it can be determined that "keitai" represents a subject. Where a suitable dictionary is available, various other expansions are conceivable, such as expanding a building name into the name of the place where the building stands; for example, "Eiffel Tower" can be expanded to "Paris". In this way, the collation unit 136 can absorb differences in expression between the words included in the text stored in the text storage unit 12 and the subject recognition result text stored in the subject recognition result storage unit 135, and can determine subjects more appropriately.
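A minimal sketch of thesaurus expansion on the recognition-result side follows. The dictionary `THESAURUS` is a toy stand-in containing only the examples mentioned above, and the function names are hypothetical; a real system would draw on an actual thesaurus resource.

```python
# Toy thesaurus: recognized subject -> alternative expressions (assumption).
THESAURUS = {
    "mobile phone": {"keitai", "N905i"},   # synonym, narrower term
    "PaPeRo": {"robot"},                   # broader term
    "Eiffel Tower": {"Paris"},             # building name -> place name
}


def expand_subjects(subjects, thesaurus):
    """Map every surface form (original or expanded) to its recognized subject."""
    surface_to_subject = {}
    for subject in subjects:
        surface_to_subject[subject] = subject
        for alternative in thesaurus.get(subject, ()):
            # Keep the first mapping if two subjects share an alternative.
            surface_to_subject.setdefault(alternative, subject)
    return surface_to_subject


def match_word(word, surface_to_subject):
    """Return the recognized subject that the word matches, or None."""
    return surface_to_subject.get(word)


table = expand_subjects(["mobile phone", "PaPeRo"], THESAURUS)
```

Returning the original subject rather than a bare yes/no lets the caller look up that subject's reliability and importance after a match on an expanded form.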

As described above, the subject determination unit 13 determines, for each word or the like included in the text stored in the text storage unit 12, whether that word or the like represents a subject appearing in the video data, and outputs the determination result to the subject weighting unit 14. As noted earlier, when a word or the like is determined to represent a subject, the reliability of the determination or the importance of the subject in the video data may be included in the determination result.

FIG. 6 shows an example of the determination result that the subject determination unit 13 outputs to the subject weighting unit 14. For each word included in the text stored in the text storage unit 12, the determination result indicates whether the word represents a subject appearing in the video data. For a word determined to represent a subject, the determination result further includes the reliability of the determination and the importance of the subject in the video data.

In this embodiment, the object recognition unit 131, the face image recognition unit 132, and the character recognition unit 133 of the subject determination unit 13 have been described as recognizing subjects appearing in the video without any restriction. However, these units may instead read the text from the text storage unit 12 and restrict the subjects to be recognized to the words or the like included in the text. For example, when the text shown in FIG. 4 is stored in the text storage unit 12, the object recognition unit 131 may, rather than recognizing every object appearing in the video data, restrict the candidate objects to those named in the text, such as "mobile phone" and "PaPeRo", and then recognize objects. Similarly, the face image recognition unit 132 may restrict its candidates to face images corresponding to person names included in the text, and the character recognition unit 133 may restrict its candidates to words or the like included in the text, before performing recognition.

Furthermore, in this embodiment the subject recognition unit 130 has been described as recognizing the subjects appearing in the video data in advance, before the collation unit 136 operates. However, instead of recognizing subjects in advance, the subject recognition unit 130 may recognize subjects at the time the collation unit 136 collates a word or the like in the text against the subject recognition result text, restricting recognition to that word or the like.

Next, the subject weighting unit 14 assigns a weight to each word or the like included in the text stored in the text storage unit 12 such that words or the like determined by the subject determination unit 13 to represent a subject receive larger weights (step S24), and outputs the result to the weighted text storage unit 15.

As an example, a weight of Sa = 10 may be given to a word or the like determined to represent a subject, and a weight of Sb = 1 to a word or the like determined not to represent a subject. Of course, the specific values may differ, provided that the weight given when a word is determined to represent a subject is larger than the weight given when it is determined not to (that is, Sa > Sb). Giving at least a minimal weight to words determined not to represent a subject allows the video dividing unit 16 to take those words into account as well.

When the determination result for a word or the like determined to represent a subject includes the reliability of the determination, the weight of the word or the like may be made larger as the reliability is higher. In this way, the more likely the determination that a word represents a subject is to be correct, the larger the weight the word receives.

When the determination result for a word or the like determined to represent a subject includes the importance of the subject, the weight of the word or the like may be made larger as the importance is higher. In this way, the more closely the subject is considered to be related to the semantic content of the video data, the larger the weight the word receives.

To perform such weighting, a value proportional to the product of the reliability and the importance can be used as the weight, for example. That is, the weight is computed as weight = Sa × reliability × importance. As shown in FIG. 6, when the word "mobile phone" represents a subject with a reliability of 0.8 and an importance of 0.4, the weight 10 × 0.8 × 0.4 = 3.2 can be assigned. The proportionality constant Sa is set to 10 here, but other values may be used. Any other function may also be used to compute the weight, provided that the weight increases as the reliability or the importance increases. In doing so, it is preferable to ensure that the weight of a word or the like determined to represent a subject does not fall below the weight given to a word or the like determined not to represent a subject. For example, with a reliability of 0.3 and an importance of 0.2, the computation above yields 10 × 0.3 × 0.2 = 0.6; if the weight given to words determined not to represent a subject is Sb = 1, it is preferable to assign Sb = 1 rather than 0.6.
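The weighting rule and its lower bound can be written compactly as in the following sketch. The function name `word_weight` and its parameter names are hypothetical; the constants Sa = 10 and Sb = 1 are the example values from the description.

```python
def word_weight(is_subject, reliability=1.0, importance=1.0, sa=10.0, sb=1.0):
    """Weight of one word.

    Words judged not to represent a subject get the base weight sb.
    Words judged to represent a subject get sa * reliability * importance,
    but never less than sb.
    """
    if not is_subject:
        return sb
    return max(sb, sa * reliability * importance)


w_phone = word_weight(True, reliability=0.8, importance=0.4)   # 10 x 0.8 x 0.4
w_floor = word_weight(True, reliability=0.3, importance=0.2)   # 0.6 raised to Sb
w_other = word_weight(False)
```

Applying `max` with Sb implements the stated preference that a subject word should not end up weighted below a non-subject word.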

When the part of speech of the word to be weighted is neither a noun nor a verb, the weight may be set to a small value, for example 0. Likewise, when the word to be weighted is an attached word or a function word, the weight may be set to a small value. This reduces the weight of words that are considered to have little relevance to the semantic content of the video data.

FIG. 7 shows the data stored in the weighted text storage unit 15 when the text shown in FIG. 4 is stored in the text storage unit 12 and the text shown in FIG. 5 is stored in the subject recognition result storage unit 135. Here, the weight of a word determined to represent a subject is calculated as 10 × reliability × importance, the weight of a word determined not to represent a subject is set to 1, and the weight of any word other than a noun or a verb is set to 0.
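The three weighting rules used for FIG. 7 can be sketched as follows. The `Word` record, the part-of-speech tag names, and the function `assign_weight` are hypothetical names introduced only for illustration; the actual data layout of the weighted text storage unit 15 is the one shown in FIG. 7.

```python
from dataclasses import dataclass

# Hypothetical part-of-speech tags; a real morphological analyzer
# would supply its own tag set.
CONTENT_POS = {"noun", "verb"}


@dataclass
class Word:
    surface: str
    pos: str
    is_subject: bool = False
    reliability: float = 0.0
    importance: float = 0.0


def assign_weight(word, sa=10.0, sb=1.0):
    """Apply the three rules used for FIG. 7."""
    if word.pos not in CONTENT_POS:
        return 0.0  # neither noun nor verb
    if word.is_subject:
        return sa * word.reliability * word.importance
    return sb  # noun or verb that does not represent a subject


if __name__ == "__main__":
    sentence = [
        Word("mobile phone", "noun", True, 0.8, 0.4),
        Word("maker", "noun"),
        Word("of", "particle"),
    ]
    for w in sentence:
        print(w.surface, assign_weight(w))
```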

Finally, the video dividing unit 16 reads, from the weighted text storage unit 15, the text in which a weight has been assigned to each word or the like, divides the video data into topics by dividing the text using the weights, and outputs the result to the division result storage unit 17. As described above, the text carries playback position information indicating which section of the video data each piece of text corresponds to, so dividing the text makes it possible to obtain the playback positions in the video data that correspond to the division points.
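The mapping from text division points to playback positions can be sketched as follows. The sketch assumes that playback positions are attached per sentence as (start, end) pairs in seconds and that a division point is reported as the start time of the sentence that follows it; both conventions, the function name `split_times`, and all times other than 102.0 to 105.0 are hypothetical.

```python
def split_times(sentences, split_indices):
    """Convert text division points into playback positions.

    sentences: list of (start_sec, end_sec), one pair per sentence.
    split_indices: index i means "the text is divided just before sentence i".
    """
    return [sentences[i][0] for i in split_indices]


if __name__ == "__main__":
    sentences = [(102.0, 105.0), (105.0, 109.5), (109.5, 114.0)]
    print(split_times(sentences, [2]))  # [109.5]
```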

The following describes in detail how the video dividing unit 16 divides the text read from the weighted text storage unit 15 into topics, each representing a semantically coherent unit. FIG. 8 shows an example of the processing by which the video dividing unit 16 divides the text. First, analysis sections of a fixed width are set over each part of the text (FIG. 8(a)), and for each analysis section the distribution of the weights given to the words in that section is obtained (FIG. 8(b)). Specifically, the weight distribution may be a vector whose number of dimensions is the number of distinct words appearing in the entire text and whose elements are the sums of the weights given to each word within the analysis section. For example, when the text shown in FIG. 7 is stored in the weighted text storage unit 15 and the width of the analysis section is set to two sentences, the word weight distribution of the analysis section consisting of sentence ID 1 and sentence ID 2 is represented by a vector with the elements "mobile phone" → 6.4, "maker" → 1, "between" → 1, "competition" → 1, "intensify" → 1, "function" → 1, "install" → 1, and all other words → 0. Next, the similarity of the weight distributions between adjacent analysis sections is calculated, and the local minima of the similarity are obtained (FIG. 8(c)). A local minimum of the similarity is a point at which the word weight distribution changes, so it is taken as a division point of the text. In FIG. 8, X and Y are local minima, so these points are taken as the division points of the text. When obtaining the local minima, they may be limited to those whose valley depth is at least a certain value, or to those whose similarity is at or below a threshold.
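The procedure of FIG. 8 can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the similarity measure, so cosine similarity is assumed; the analysis sections compared at each candidate point are taken to be the `width` sentences before and after it; and the names `window_vector`, `cosine`, and `split_points` are hypothetical.

```python
import math
from collections import Counter


def window_vector(sentences):
    """Sum the weights of each word over the sentences of one analysis section."""
    vec = Counter()
    for sentence in sentences:
        for word, weight in sentence:
            vec[word] += weight
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse weight vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_points(sentences, width=2, max_similarity=None):
    """Return sentence indices g such that the text is divided before sentence g."""
    gaps = list(range(width, len(sentences) - width + 1))
    sims = [
        cosine(window_vector(sentences[g - width:g]),
               window_vector(sentences[g:g + width]))
        for g in gaps
    ]
    points = []
    for i in range(1, len(sims) - 1):
        is_minimum = sims[i] < sims[i - 1] and sims[i] < sims[i + 1]
        below = max_similarity is None or sims[i] <= max_similarity
        if is_minimum and below:
            points.append(gaps[i])
    return points


if __name__ == "__main__":
    topic_a = [[("mobile phone", 3.2), ("maker", 1.0)]] * 4
    topic_b = [[("robot", 3.0), ("talk", 1.0)]] * 4
    print(split_points(topic_a + topic_b))  # [4]
```

The optional `max_similarity` argument corresponds to the threshold restriction mentioned in the text; a valley-depth restriction could be added in the same place.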

In video data, the subjects shown in the video are considered to be particularly closely related to the semantic content of the individual topics included in the video data. This is because, when video data is created, the objects, persons, characters, places, and the like that relate to the content to be conveyed to the viewer are naturally what is shown on screen. It is very unlikely that a subject shown in the video is unrelated to what the video is intended to convey.

Therefore, by giving a large weight to words representing subjects shown in the video data and detecting the points at which the word weight distribution changes, the text can be divided in a way that appropriately reflects the semantic content of the topics included in the video data. In the examples of FIGS. 4 to 7, "mobile phone" and "PaPeRo" are shown in the video as subjects, and these are considered to be closely related to the subject matter of the topics. Giving these words a large weight when finding the change points of the word weight distribution therefore means dividing the text while making the subject matter of each topic stand out, which improves the accuracy of division into topics.

Conversely, even for the same word "mobile phone", if a mobile phone is not shown in the video as a subject, the weight for "mobile phone" takes a small value and its influence on the text division is small. Because the mobile phone is not a subject, it is likely that "mobile phone" is not the subject matter of the topic in that video. The accuracy of division into topics therefore improves in this case as well. Thus, according to the present invention, even identical words are weighted appropriately according to whether or not they represent a subject, which improves the accuracy of division into topics.

FIG. 9 illustrates the effect of the present invention more concretely. Here, the text is assumed to represent the content of the utterances included in the video data. For video data consisting of seven topics A to G (FIG. 9(a)), FIG. 9 compares the series of similarities between the distributions of adjacent analysis sections in two cases: when the word occurrence frequency distribution of each analysis section is simply computed (FIG. 9(b)), and when, according to the present invention, a large weight is given to words representing subjects and the word weight distribution of each analysis section is computed (FIG. 9(c)). In the video section of topic F, "activin" and "cell" are shown as subjects between times t1 and t2 (FIG. 9(d)); in the text, "activin" and "cell" appear between times t3 and t4 (FIG. 9(e)); and several words that are not subjects, such as "epidermis", appear in the text between times t5 and t6 (FIG. 9(f)). Topic F is a topic whose subject matter is the substance "activin". Accordingly, the word "activin" appears in the text in the section t3 to t4 of topic F.

When the occurrence frequency distribution of each analysis section is simply computed (FIG. 9(b)), the similarity at the boundary between topic E and topic F does not differ from its surroundings, so the video cannot be divided between topic E and topic F. This is because many words, such as "epidermis", occur across both topic E and topic F, and under their influence the similarity of the word distributions does not drop at the boundary between the two topics.

In contrast, when a large weight is given to words representing subjects according to the present invention and the word weight distribution of each analysis section is computed (FIG. 9(c)), a valley in the similarity appears at the boundary between topic E and topic F, and the video can be divided correctly between the two topics. This result is obtained because the weights of words not corresponding to subjects, such as "epidermis", are reduced, while the weights of words corresponding to subjects, such as "activin" and "cell", are increased. Words that do not represent a subject may have little relevance to the subject matter of the topic, and the present invention can remove the influence of such words. Words that represent a subject are highly relevant to the subject matter of the topic, and the present invention emphasizes such words.

Note that "activin" and "cell" are shown in the video from time t1 to t2, which differs from the section of topic F, so simply taking the section in which the subject is shown as the topic section does not yield an appropriate division into topics. This is because, even when "activin" and "cell" are the subject matter of the topic, those subjects are not shown on screen at all times. In the utterances included in the video, however, words such as "activin" and "cell", which are the subject matter of the topic, appear throughout the section of topic F (times t3 to t4), so the present invention can divide the video into topics appropriately. To further enhance the effect of the present invention, the text stored in the text storage unit 12 is therefore desirably text that contains, for each part of the video, words highly relevant to the semantic content of the video. For example, it is preferable to use text representing the content of the utterances included in the video as the text stored in the text storage unit 12.

When calculating the word weight distribution of each analysis section, the video dividing unit 16 may also multiply the weights by the IDF of each word. As described above, the IDF of a given word is the same in whatever scene the word appears; however, because IDF is an index of the general importance of a word, it can be used in combination with the present invention.
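Combining the subject-based weights with IDF can be sketched as follows. The text does not specify which IDF formula is used, so the common form log(N / df) is assumed here; the function names `idf_table` and `apply_idf`, the reference document collection, and the choice to leave unseen words unchanged (factor 1.0) are all hypothetical.

```python
import math


def idf_table(documents):
    """IDF of each word over a reference collection, assuming log(N / df)."""
    n = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n / count) for word, count in df.items()}


def apply_idf(weighted_words, idf, unseen=1.0):
    """Multiply each (word, weight) pair by the word's IDF."""
    return [(word, weight * idf.get(word, unseen)) for word, weight in weighted_words]


if __name__ == "__main__":
    docs = [["phone", "maker"], ["phone", "robot"], ["robot", "talk", "phone"]]
    idf = idf_table(docs)
    print(apply_idf([("phone", 3.2), ("robot", 2.0)], idf))
```

In this sketch a word that occurs in every reference document, such as "phone", receives an IDF of 0 and drops out of the distribution even if it represents a subject; whether that interaction is desirable depends on the collection, which is one reason the formula is left as an assumption.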

In this embodiment, the text is divided by obtaining the word weight distribution of each analysis section and then finding the local minima of the similarity between the weight distributions of adjacent analysis sections. However, the method of dividing the text in the present invention is not limited to this approach. For example, topic models for various topics may be prepared in advance from a text corpus that is already divided by topic, such as newspaper articles, and the text may be divided by matching each topic model against the word weight distribution of each analysis section. As a topic model, for example, a model that has learned a word distribution such as the occurrence frequencies of the words appearing in each topic may be used. By setting the likelihood of transitions between topics appropriately, the sequence of topic models that best matches the sequence of analysis sections can be obtained together with the positions of the topic division points. A text segmentation method using such topic models is described, for example, in J. P. Yamron, I. Carp, L. Gillick, S. Lowe, and P. van Mulbregt, "A Hidden Markov Model Approach to Text Segmentation and Event Tracking," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 333-336, 1998.
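The topic-model alternative can be illustrated with a much simplified Viterbi search. This sketch is not the method of the cited paper: the emission score is assumed to be a weight-scaled unigram log-probability, the transition likelihood is reduced to a single additive `switch_penalty`, and the topic models, the probability `floor` for unseen words, and the function name `viterbi_topics` are all hypothetical.

```python
import math


def viterbi_topics(windows, topic_models, switch_penalty=2.0, floor=1e-4):
    """Find the best topic per analysis section and the resulting boundaries.

    windows: list of {word: weight} dicts, one per analysis section.
    topic_models: {topic: {word: probability}} unigram models.
    """
    topics = list(topic_models)

    def emit(window, topic):
        model = topic_models[topic]
        return sum(wt * math.log(model.get(word, floor))
                   for word, wt in window.items())

    def cost(prev, cur):
        return 0.0 if prev == cur else switch_penalty

    best = {t: emit(windows[0], t) for t in topics}
    back = []
    for window in windows[1:]:
        new_best, pointers = {}, {}
        for t in topics:
            prev = max(topics, key=lambda p: best[p] - cost(p, t))
            new_best[t] = best[prev] - cost(prev, t) + emit(window, t)
            pointers[t] = prev
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the best final topic.
    path = [max(topics, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    boundaries = [i for i in range(1, len(path)) if path[i] != path[i - 1]]
    return path, boundaries


if __name__ == "__main__":
    models = {
        "tech": {"phone": 0.5, "maker": 0.3},
        "bio": {"cell": 0.5, "activin": 0.3},
    }
    windows = [
        {"phone": 3.2, "maker": 1.0},
        {"phone": 3.2},
        {"cell": 2.0, "activin": 3.0},
        {"cell": 1.0},
    ]
    print(viterbi_topics(windows, models))
```

Because the emission score is scaled by the word weights, words judged to represent subjects pull the decoded topic sequence more strongly than other words, which is how the subject weighting would carry over to this style of segmentation.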

The video is divided by the video dividing unit 16 as described above. The division result is then stored in the division result storage unit 17, for example as playback position information corresponding to the division points.

The video viewing unit 18 reads the division result of the video data stored in the division result storage unit 17 and, using the divided topics as units, searches or plays back the video data stored in the video data storage unit 11. For example, a news program can be searched or played back in units of news items, and a lecture video can be searched or played back in units of learning items. These functions are, of course, only examples of how video data divided according to the present invention can be used. The present invention can be applied to any application that makes use of video data in units of topics.

In this embodiment, the text stored in the text storage unit 12 is assumed to have been segmented into words in advance. However, the text stored in the text storage unit 12 need not be segmented into words. If the subject determination unit 13 and the subject weighting unit 14 divide the text into words using a known morphological analysis technique when they read the text from the text storage unit 12, the present invention can be applied even when the text stored in the text storage unit 12 has not been segmented into words in advance.

Also in this embodiment, when determining whether each word included in the text stored in the text storage unit 12 represents a subject, the collation unit 136 included in the subject determination unit 13 collates the word with all of the subject recognition result texts stored in the subject recognition result storage unit 135, and determines that the word represents a subject if it matches any of them. In making this determination, the collation unit 136 may instead determine whether the word represents a subject shown in a video section limited to a predetermined temporal range around the playback position corresponding to the word.

As an example, suppose that the text shown in FIG. 4 is stored in the text storage unit 12 and the data shown in FIG. 5 is stored in the subject recognition result storage unit 135. When determining whether the word "mobile phone" included in sentence ID 1 represents a subject, the determination may be limited to subjects shown in the video section within, for example, 10 seconds of the playback position 102.0 to 105.0 seconds corresponding to sentence ID 1, that is, the video section from 92.0 to 115.0 seconds. To do so, when collating "mobile phone" with the subject recognition result texts stored in the subject recognition result storage unit 135, the collation is limited to those subject recognition results whose display times overlap the range 92.0 to 115.0 seconds.
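The time-limited collation can be sketched as follows. The record layout of the recognition results, the function names, exact string matching, and the recognition times in the usage example are hypothetical; only the sentence position 102.0 to 105.0 seconds and the 10-second margin come from the text.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] overlap."""
    return a_start <= b_end and b_start <= a_end


def represents_subject(word, start, end, recognitions, margin=10.0):
    """Collate a word only against subjects shown near its playback position.

    recognitions: list of dicts with keys "text", "start", "end" (seconds).
    """
    lo, hi = start - margin, end + margin
    return any(
        r["text"] == word and overlaps(lo, hi, r["start"], r["end"])
        for r in recognitions
    )


if __name__ == "__main__":
    # Hypothetical recognition results; FIG. 5 holds the actual data.
    recognitions = [
        {"text": "mobile phone", "start": 100.0, "end": 110.0},
        {"text": "PaPeRo", "start": 300.0, "end": 310.0},
    ]
    print(represents_subject("mobile phone", 102.0, 105.0, recognitions))  # True
    print(represents_subject("PaPeRo", 102.0, 105.0, recognitions))  # False
```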

FIG. 10 illustrates this processing concretely. The video data shown in FIG. 10 includes topic H, whose subject matter is "flower"; topic I, whose subject matter is "honeybee"; and topic J, whose subject matter is "cabbage white butterfly" (FIG. 10(b)). Within the respective topics, "flower", "honeybee", and "cabbage white butterfly" are shown as subjects (FIG. 10(a)). The text storage unit 12 stores the text shown in FIG. 10(c).

If, in this case, whether a word included in the text represents a subject is determined without imposing the temporal constraint described above, words such as "honeybee" and "cabbage white butterfly" included in the text of topic H are also determined to represent subjects and are given large weights. However, the words "honeybee" and "cabbage white butterfly" that appear in the text of topic H do not refer to the honeybee shown in the video in topic I or to the cabbage white butterfly shown in the video in topic J. When the subject represented by a word merely happens to be shown in a part of the video that the word does not actually refer to, the weight of the word should not be increased.

When the temporal constraint described above is imposed, the words "honeybee" and "cabbage white butterfly" included in the text of topic H are at playback positions temporally distant from the video sections in which the honeybee and the cabbage white butterfly are shown, so they are not determined to represent subjects and their weights are not increased. As a result, "flower" is the only word given a large weight in topic H, and large weights are given only to words highly relevant to the subject matter of topic H.

The present invention can be applied to any video data. It is particularly effective for video data in which the subjects shown on screen are themselves often the thing being explained, such as news programs and educational videos, and for video data in which the content of the video is explained in detail by the utterances in the video.

Next, the effect of this embodiment is described. According to the present invention, video data can be divided appropriately into topics according to its semantic content. This is because the present invention determines the subjects, which are considered to be particularly strongly related to the semantic content of the individual topics included in the video data, and then divides the video data by dividing the text associated with the video data with the weights of the words or the like representing those subjects increased.

Next, a second embodiment of the present invention is described in detail with reference to the drawings.

The second embodiment of the present invention is realized as a computer operated by a program, in the case where the first embodiment is configured as that program.

Referring to FIG. 11, the second embodiment of the present invention comprises a data processing device 32 including a CPU and the like, a storage device 33 composed of a magnetic disk, a semiconductor memory, or the like, and a video segmentation program 31.

The storage device 33 is used as a video data storage unit 331, a text storage unit 332, a weighted text storage unit 333, a division result storage unit 334, a subject recognition result storage unit 335, and the like.

The video segmentation program 31 is read into the data processing device 32 and controls the operation of the data processing device 32, thereby realizing the functions of the first embodiment on the data processing device 32. That is, under the control of the video segmentation program 31, the data processing device 32 executes the processing of the subject determination unit 13, the subject weighting unit 14, the video dividing unit 16, and the video viewing unit 18 of FIG. 1, and of the object recognition unit 131, the face image recognition unit 132, the character recognition unit 133, the subject extraction unit 134, and the collation unit 136 of FIG. 2.

The present invention can be applied to uses such as an information browsing system for browsing video data organized by topic, and a program for implementing such an information browsing system on a computer. It can also be applied to uses such as an information retrieval system that searches a large amount of video data for video data on a specific topic. Furthermore, the present invention can be applied to any application that makes use of video data in units of topics.

Within the framework of the entire disclosure of the present invention (including the claims), and on the basis of its basic technical concept, the embodiments and examples may be changed and adjusted. Various combinations and selections of the disclosed elements are also possible within the scope of the claims of the present invention. That is, the present invention naturally includes the various variations and modifications that a person skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical concept.

Claims

A video segmentation device comprising:
a subject determination unit that refers to text associated with a video, the text being provided with playback positions in the video, and determines whether or not a word or a word string (hereinafter referred to as "word or the like") included in the text represents a subject included in the video;
a subject weighting unit that gives a word or the like determined to represent the subject a weight greater than the weight given to the other words or the like; and
a video dividing unit that divides the video by dividing the text based on the weighting.

The video segmentation device according to claim 1, wherein the subject determination unit determines whether or not the word or the like represents a subject shown within a predetermined range of the video relative to the playback position.

When the subject determination unit determines that the word or the like represents a subject reflected in the image, the subject determination unit calculates the reliability that the word or the like represents the subject,
The video dividing apparatus according to claim 1, wherein the subject weighting unit weights the word or the like more as the reliability is higher.

When the subject determination unit determines that the word or the like represents a subject reflected in the video, the subject determination unit determines the importance of the subject in the video;
The video dividing apparatus according to claim 1, wherein the subject weighting unit weights the word or the like more as the importance is higher.

5. The video segmentation device according to claim 4, wherein the subject determination unit determines the importance of the subject according to a ratio of the subject to the video.

The video segmentation device according to claim 1, wherein the subject determination unit comprises:
a subject recognition unit that recognizes a subject shown in the video and outputs it as subject recognition result text; and a collation unit that collates the word or the like with the subject recognition result text to determine whether or not the word or the like represents a subject shown in the video.

The video segmentation device according to claim 6, wherein the subject includes an object, a face image, or a character.

The video segmentation device according to claim 7, wherein the subject recognition unit includes at least one of an object recognition unit that recognizes the object, a face image recognition unit that recognizes the face image, a character recognition unit that recognizes the character, and a subject extraction unit that extracts a word or the like representing a subject from the words included in the text.

7. The collation unit according to claim 6, wherein the collation unit develops at least one of the word or the like and the subject recognition result text by a thesaurus, and collates the word or the like with the subject recognition result text. Video segmentation device.

The video segmentation apparatus according to claim 1, wherein the text is a text representing the content of an utterance included in the video.

11. The video dividing apparatus according to claim 1, wherein the reproduction position is given in units of sentences or words included in the text.

A video segmentation method comprising, by a computer:
a subject determination step of referring to text associated with a video, the text being provided with playback positions in the video, and determining whether or not a word or a word string (hereinafter referred to as "word or the like") included in the text represents a subject included in the video;
a subject weighting step of giving a word or the like determined to represent the subject a weight greater than the weight given to the other words or the like; and
a video dividing step of dividing the video by dividing the text based on the weighting.

13. The subject determination step determines whether or not the word or the like represents a subject in a predetermined range of the video as a reference with respect to the playback position. The video dividing method described in 1.

In the subject determination step, when it is determined that the word or the like represents a subject reflected in the video, the reliability that the word or the like represents the subject is calculated,
13. The video segmentation method according to claim 12, wherein, in the subject weighting step, the higher the reliability is, the greater weight is given to the word or the like.

In the subject determination step, when it is determined that the word or the like represents a subject reflected in the video, the importance of the subject in the video is determined,
13. The video segmentation method according to claim 12, wherein, in the subject weighting step, the higher the importance, the greater weight is given to the word or the like.

16. The video segmentation method according to claim 15, wherein, in the subject determination step, importance of the subject is determined according to a ratio of the subject to the video.

The subject determination step includes a subject recognition step of recognizing a subject reflected in the video and outputting the subject recognition result text;
The collating step of collating the said word etc. with the said subject recognition result text, and determining whether the said word etc. represents the to-be-photographed object reflected in the said image | video is included, The collating process characterized by the above-mentioned. Video segmentation method.

The video segmentation method according to claim 17, wherein the subject includes an object, a face image, or a character.

The subject recognition step includes an object recognition step for recognizing the object, a face image recognition step for recognizing the face image, a character recognition step for recognizing the characters, a word representing a subject from words included in the text, and the like The video segmentation method according to claim 18, further comprising at least one of a subject extraction step of extracting a subject.

18. The collation process according to claim 17, wherein in the collation step, at least one of the word or the like and the subject recognition result text is developed by a thesaurus, and the word or the like and the subject recognition result text are collated. Video segmentation method.

21. The video segmentation method according to claim 12, wherein the text is text representing the content of an utterance included in the video.

A program that causes a computer to execute:
a subject determination process of referring to text that is associated with a video and to which playback positions in the video are attached, and determining whether a word or a word string (hereinafter referred to as "the word or the like") included in the text represents a subject included in the video;
a subject weighting process of giving, among the words or the like, those determined to represent the subject a weight greater than the weight given to the others; and
a video division process of dividing the video by dividing the text based on the weighting.
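
The three processes of the program claim above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the set of subject words is taken as given (standing in for the subject determination process), the weights `subject_weight` and `other_weight` are arbitrary, and the text is divided by comparing weighted bag-of-words windows with cosine similarity, a technique chosen for this example that the claim does not name.

```python
import math
from collections import Counter


def weighted_vector(words, subject_words, subject_weight=3.0, other_weight=1.0):
    """Bag-of-words vector in which subject words count more than other words."""
    vec = Counter()
    for w in words:
        vec[w] += subject_weight if w in subject_words else other_weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_video(timed_words, subject_words, window=3, threshold=0.2):
    """Return the playback positions at which the video is divided.

    timed_words: list of (playback_position_seconds, word) in time order.
    A boundary is placed before word i when the weighted windows on either
    side of i have a cosine similarity below `threshold`.
    """
    words = [w for _, w in timed_words]
    boundaries = []
    for i in range(window, len(words) - window + 1):
        left = weighted_vector(words[i - window:i], subject_words)
        right = weighted_vector(words[i:i + window], subject_words)
        if cosine(left, right) < threshold:
            boundaries.append(timed_words[i][0])
    return boundaries
```

Because subject words dominate each window vector, a change of subject lowers the similarity more sharply than a change in incidental vocabulary. On longer input this simple threshold test can report several adjacent positions for one topic change, so a practical version would likely keep only local minima.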

23. The program, wherein, in the subject determination process, it is determined whether the word or the like represents a subject shown within a predetermined range of the playback position in the video.
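
The restriction to a predetermined range around the playback position can be sketched as a simple time-window test. The detection tuple format, the function name, and the default window of 5 seconds on each side are assumptions for this example.

```python
def subject_in_range(word_position, detections, label, before=5.0, after=5.0):
    """True if `label` was detected within the window around the word.

    word_position: playback position (seconds) attached to the word.
    detections: iterable of (playback_position_seconds, recognized_label).
    The window is [word_position - before, word_position + after].
    """
    start, end = word_position - before, word_position + after
    return any(start <= t <= end and found == label for t, found in detections)
```

A word spoken at 10 seconds is thus matched only against subjects recognized between 5 and 15 seconds, so a subject that appears much later in the video does not raise the weight of an earlier mention.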

24. The program according to claim 22, wherein:
in the subject determination process, when it is determined that the word or the like represents a subject shown in the video, a reliability with which the word or the like represents the subject is calculated; and
in the subject weighting process, the higher the reliability, the greater the weight given to the word or the like.
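
One way to picture reliability-based weighting is shown below. The sketch assumes that the recognizer reports a confidence in [0, 1] for each label and uses the highest matching confidence as the reliability; this choice, the function names, and the `min_boost` and `max_boost` parameters are all hypothetical.

```python
def reliability(word, detections):
    """Highest recognizer confidence among detections whose label equals the word.

    detections: iterable of (label, confidence) with confidence in [0, 1].
    Returns None when the word matches no detection (not judged a subject).
    """
    scores = [c for label, c in detections if label == word]
    return max(scores) if scores else None


def weight_for(word, detections, other_weight=1.0, min_boost=0.5, max_boost=2.0):
    """Weight of a word; subject words always exceed `other_weight`."""
    r = reliability(word, detections)
    if r is None:
        return other_weight
    # The boost grows linearly with reliability (an assumption of this sketch).
    return other_weight + min_boost + (max_boost - min_boost) * r
```

The nonzero `min_boost` keeps even a low-reliability subject word above the weight of non-subject words, which matches the requirement that subject words receive the greater weight.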

25. The program according to claim 22, wherein:
in the subject determination process, when it is determined that the word or the like represents a subject shown in the video, the importance of the subject in the video is determined; and
in the subject weighting process, the higher the importance, the greater the weight given to the word or the like.

26. The program according to claim 25, wherein, in the subject determination process, the importance of the subject is determined according to the proportion of the video that the subject occupies.

The program according to claim 22, wherein, in the subject determination process, the computer is caused to execute:
a subject recognition process of recognizing a subject shown in the video and outputting the result as subject recognition result text; and
a collation process of collating the word or the like with the subject recognition result text to determine whether the word or the like represents a subject shown in the video.

The program according to claim 27, wherein the subject includes an object, a face image, or a character.

29. The program according to claim 28, wherein, in the subject recognition process, the computer is caused to execute at least one of: an object recognition process of recognizing the object; a face image recognition process of recognizing the face image; a character recognition process of recognizing the characters; and a subject extraction process of extracting, from the words or the like included in the text, words or the like that represent a subject.

30. The program according to claim 27, wherein, in the collation process, at least one of the word or the like and the subject recognition result text is expanded using a thesaurus, and the word or the like and the subject recognition result text are then collated.

The program according to any one of claims 22 to 30, wherein the text is a text representing the content of an utterance included in the video.