JP2007243359A

JP2007243359A - Device and program for extracting video-image

Info

Publication number: JP2007243359A
Application number: JP2006060339A
Authority: JP
Inventors: Yoshihiko Kawai; 吉彦河合; Hideki Sumiyoshi; 英樹住吉
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-03-06
Filing date: 2006-03-06
Publication date: 2007-09-20
Anticipated expiration: 2026-03-06
Also published as: JP4456573B2

Abstract

PROBLEM TO BE SOLVED: To provide a video-image extraction device that is compatible with video-images in various fields and is capable of extracting a video image corresponding to information showing a part of video-image contents without utilizing knowledge or metadata in manual extraction.

SOLUTION: The video-image extraction device 5 is provided with a rarity data storage means (a morpheme probability information storage means) 56 which stores rarity showing an appearance probability of each morpheme included in voice text data of a plurality of other video images; a similarity calculation means 55 which calculates similarity between each sentence in an introductory text and a CC sentence on the basis of the rarity and an appearance frequency of a morpheme included in each sentence of the introductory text, in a plurality of the CC sentences constituting the voice text data (CC) of the inputted video images; a candidate section detection means (a section detection means) 57 for selecting the CC sentence on the basis of the similarity so as to detect a video-image section corresponding to the CC sentence; and a section video-image extractor (a section video-image extraction means) 58a for extracting a video image of the section from the inputted video images.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a video extraction device and a video extraction program that, based on information indicating part of the content of a video, extract the portion of the video corresponding to that information.

Conventionally, summary videos and spot videos for introducing programs have been generated by extracting portions of a video in a given field. When extracting a portion of a video, these techniques rely on the knowledge a person would use to extract it manually from video in that field. For baseball video, for example, they use visual knowledge such as the following: the footage just before a slow-motion replay is likely to show an important play such as a scoring scene, and a superimposed caption appearing at a particular position indicates that a run has been scored. For sports video, they also use the audio cue that sections with loud cheering are important scenes.

A technique for generating a summary video from a video has also been disclosed (see Patent Document 1). In that technique, the importance of each event is calculated from how many keywords related to the user's preferences appear in an event sequence, held in the video's metadata, that lists the things that happened in the video as events. The video of the sections corresponding to high-importance events is then extracted to form the summary video.
Patent Document 1: JP-A-2004-289513 (paragraphs 0022 to 0074)

However, when summary videos are generated using the knowledge applied in manual extraction, handling video from various fields requires knowledge of how to generate summaries for each of those fields. Moreover, such extraction is mostly done intuitively by hand, so the rules for extraction are not explicit. It is also unclear how to decide how much weight each piece of knowledge should carry, so when several extraction methods exist for video in a given field, there is no way to decide which one takes priority. In addition, the technique described in Patent Document 1 cannot be applied when no metadata corresponding to the video is available.

The present invention has been made to solve these problems of the prior art. Its object is to provide a video extraction device and a video extraction program that handle video from various fields and can extract, from information indicating part of the content of a video, the video corresponding to that information, without using manual-extraction knowledge or metadata.

To solve the above problem, the video extraction device according to claim 1 receives as input a video, audio text data that is the text data of the audio corresponding to the video, and extracted-video content information indicating the content of a portion of the video, and extracts the portion of the video corresponding to the extracted-video content information. The device comprises a morpheme probability information storage means, a similarity calculation means, a section detection means, and a section video extraction means.

With this configuration, the video extraction device stores, in the morpheme probability information storage means, morpheme probability information that associates each morpheme contained in other audio text data (the text data of the audio corresponding to a plurality of other videos) with appearance probability information indicating that morpheme's probability of appearing in the plurality of other audio text data.

Furthermore, using the similarity calculation means, the video extraction device calculates a similarity indicating how closely the extracted-video content information resembles each piece of audio segment data, where the audio segment data are obtained by dividing the audio text data into a plurality of segments. The similarity is based on the appearance probability information that the morpheme probability information holds for each morpheme in the extracted-video content information, and on the frequency with which that morpheme appears in each piece of audio segment data. A morpheme with a high appearance probability appears in the audio of many different videos and therefore does not characterize any particular video. A morpheme with a low appearance probability appears only in the audio of a specific video or a specific section, and is therefore likely to characterize that video. Consequently, the more low-probability morphemes the extracted-video content information and a piece of audio segment data share, the more similar the video content that each of them describes. The video extraction device can thus calculate the similarity between each piece of audio segment data and the extracted-video content information from the appearance probability information of each morpheme in the extracted-video content information and the frequency with which that morpheme appears in each piece of audio segment data.

Using the section detection means, the video extraction device also selects, on the basis of the similarity, the audio segment data corresponding to the extracted-video content information and detects the section of the video corresponding to that audio segment data. Using the section video extraction means, it then extracts the video of the detected section. The video extraction device can thereby extract, from the input video, the video whose content is indicated by the extracted-video content information.

The video extraction device according to claim 2 is the video extraction device according to claim 1, further comprising a cut division means that divides the video extracted by the section video extraction means into cuts.

With this configuration, when the picture changes substantially partway through the video of a section corresponding to audio segment data, as extracted by the section video extraction means, the video extraction device can divide that video into cuts, each cut being a continuously shot video section.

The video extraction program according to claim 3 causes a computer to function as a similarity calculation means, a section detection means, and a section video extraction means. The computer receives as input a video, audio text data that is the text data of the audio corresponding to the video, and extracted-video content information that consists of a plurality of morphemes and indicates the content of a portion of the video, and extracts the portion of the video corresponding to the extracted-video content information on the basis of morpheme probability information stored in a morpheme probability information storage device. The morpheme probability information associates each morpheme contained in other audio text data (the text data of the audio corresponding to a plurality of other videos) with appearance probability information indicating that morpheme's probability of appearing in the plurality of other audio text data.

With this configuration, the video extraction program uses the similarity calculation means to calculate a similarity indicating how closely the extracted-video content information resembles each piece of audio segment data, where the audio segment data are obtained by dividing the audio text data into a plurality of segments. The similarity is based on the appearance probability information that the morpheme probability information holds for each morpheme in the extracted-video content information, and on the frequency with which that morpheme appears in each piece of audio segment data. Using the section detection means, the program selects, on the basis of the similarity, the audio segment data corresponding to the extracted-video content information and detects the section of the video corresponding to that audio segment data. Using the section video extraction means, it then extracts the video of the detected section. The video whose content is indicated by the extracted-video content information can thereby be extracted from the input video.

The video extraction device and video extraction program according to the present invention provide the following advantageous effects. According to the inventions of claims 1 and 3, the video whose content is indicated by the extracted-video content information can be extracted from the input video regardless of the field of the video and without using manual-extraction knowledge or metadata. Consequently, simply by entering a natural-language sentence or a list of morphemes describing the content of the video to be extracted, summary videos, introduction videos, and the like can easily be generated from video in various fields, or candidate videos can be output, which reduces the workload of the producers who create such videos.

According to the invention of claim 2, the video extracted by the section video extraction means is the video of a section corresponding to audio segment data, so a break in the audio may not coincide with a break in the video, and the extracted video may switch partway through and include footage from a different scene. Dividing this video into cuts makes it possible to take out only the continuously shot portions.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the video extraction device]
First, the configuration of a spot video generation device 1 that includes the video extraction device 5 according to the present invention is described with reference to FIG. 1. FIG. 1 is a schematic diagram of the configuration of a spot video generation device that includes the video extraction device according to the present invention.

The spot video generation device 1 generates a spot video that corresponds to an introduction text input from outside and consists of portions of the input video. The spot video generation device 1 comprises a morpheme rarity calculation device 3 and the video extraction device 5.

The morpheme rarity calculation device 3 calculates the rarity (appearance probability information) of each morpheme contained in broadcast video CCs (other audio text data), which are the text data (closed captions; CC) of the audio corresponding to a plurality of videos. Rarity indicates how strongly a morpheme is concentrated in the broadcast video CCs of particular programs among the plurality of broadcast video CCs: the more heavily a morpheme is concentrated in the broadcast video CC of a particular program, the higher its rarity. Here the videos in question are those of past broadcast programs, but they may instead be, for example, videos input from a network such as the Internet or videos input via broadcast waves. Likewise, the audio text data here is the CC attached to each video, but it may instead be the result of speech recognition on the audio corresponding to the video. The morpheme rarity calculation device 3 comprises a broadcast video CC storage means 30, a morphological analysis means 31, a rarity calculation means 32, and a morpheme frequency storage means 33.

The broadcast video CC storage means 30 stores a plurality of broadcast video CCs and is a general-purpose storage means such as a hard disk. The stored broadcast video CCs are read and used by the morphological analysis means 31.

The structure of a CC is now described with reference to FIG. 2, an explanatory diagram showing an example of a CC. For convenience of explanation, FIG. 2 shows a line number to the left of each CC line. As shown in FIG. 2, each CC line consists of time code information D1 and utterance content information (hereinafter, CC sentence D2). A CC sentence D2 is the text data of the audio corresponding to the video. When a sentence in the CC is within a predetermined number of characters, that sentence forms one CC sentence D2. When a sentence exceeds the predetermined number of characters, it is split into pieces within that limit, each of which becomes a CC sentence D2 [for example, lines (4) and (5)]. The time code information D1 indicates the start point of the video corresponding to the CC sentence D2.
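The CC line structure described above (time code information D1 plus CC sentence D2) can be pictured as a small record type. The following is only an illustrative sketch: the tab-separated `HH:MM:SS` layout, the class name `CCSentence`, and the function name `parse_cc_line` are assumptions made for the example, since the exact layout of FIG. 2 is not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass
class CCSentence:
    start_sec: float  # time code information D1: start point of the video
    text: str         # CC sentence D2: text of the corresponding audio


def parse_cc_line(line: str) -> CCSentence:
    """Parse one CC line in a hypothetical 'HH:MM:SS<TAB>text' layout."""
    timecode, text = line.split("\t", 1)
    hours, minutes, seconds = (int(part) for part in timecode.split(":"))
    return CCSentence(
        start_sec=hours * 3600 + minutes * 60 + seconds,
        text=text.strip(),
    )
```

Because D1 marks only a start point, any end point for the corresponding video section would have to come from elsewhere, for instance the start of the next CC line; that choice is not specified in this passage.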

Returning to FIG. 1, the morphological analysis means 31 performs morphological analysis on each CC sentence of the broadcast video CCs stored in the broadcast video CC storage means 30. It selects one broadcast video CC from those stored and analyzes the CC sentences in it one by one. When all CC sentences of that broadcast video CC have been analyzed, it selects the next broadcast video CC and analyzes its CC sentences, and so on. The resulting morpheme information (the morpheme, its part of speech, the program name of the broadcast video CC containing it, and so on) is output to the rarity calculation means 32. The morphological analysis means 31 can be implemented with a general-purpose morphological analysis system.

The rarity calculation means 32 calculates the rarity of each morpheme analyzed by the morphological analysis means 31. The calculated rarity is stored in the rarity data storage means 56 of the video extraction device 5.

The method by which the rarity calculation means 32 calculates the rarity of a morpheme is described below. A broadcast video CC consists of a plurality of CC sentences, and each CC sentence consists of a set of morphemes. The rarity calculation means 32 receives the morphemes of each CC sentence of each broadcast video CC. It obtains the frequency with which each morpheme appears in the CC sentences and stores each morpheme in the morpheme frequency storage means 33 together with its part of speech, frequency, and program name. Once the frequencies of all morphemes in all CC sentences of all broadcast video CCs have been analyzed, it calculates an entropy for each morpheme. This entropy indicates how unevenly the morpheme is distributed across all the broadcast video CCs stored in the broadcast video CC storage means 30.
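The frequency-gathering step described above could be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `count_morpheme_frequencies` and the input layout (a program name mapped to a list of CC sentences that have already been split into morphemes) are assumptions, and part-of-speech information is left out for brevity.

```python
from collections import Counter, defaultdict


def count_morpheme_frequencies(programs):
    """Count how often each morpheme appears in each program's CC.

    programs: {program_name: [[morpheme, ...], ...]}, one inner list per
              CC sentence (hypothetical layout, already morphologically
              analyzed).
    Returns {program_name: Counter({morpheme: frequency})}.
    """
    freq = defaultdict(Counter)
    for program_name, cc_sentences in programs.items():
        for morphemes in cc_sentences:
            # Summing per-sentence counts gives the per-program frequency.
            freq[program_name].update(morphemes)
    return freq
```

Keeping the counts keyed by program is what later allows the distribution of a morpheme across programs to be examined.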

The rarity calculation means 32 calculates the probability P(t_j, g_i) that a morpheme t_j is contained in a broadcast video CC g_i by the following equation (1). Here, a broadcast video CC g_i is represented as a set of CC sentences, G denotes the set of broadcast video CCs stored in the broadcast video CC storage means 30, and tf(t, l) denotes the frequency with which morpheme t appears in CC sentence l.
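Equation (1) itself does not survive in this text. One form consistent with the definitions given above, offered only as an assumed reconstruction, is the share of all occurrences of t_j that fall within g_i:

```latex
P(t_j, g_i) = \frac{\sum_{l \in g_i} tf(t_j, l)}
                   {\sum_{g \in G} \sum_{l \in g} tf(t_j, l)}
```

Under this assumed form, the values P(t_j, g_i) for a fixed morpheme t_j sum to 1 over all g_i in G, which is what an entropy over programs requires.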

The rarity calculation means 32 then calculates the entropy H(t_j) of the morpheme t_j by the following equation (2). The smaller the value of H(t_j), the more the morpheme's occurrences are concentrated in particular programs.
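Equation (2) is not reproduced in this text. Assuming the standard Shannon entropy over the per-program probabilities P(t_j, g_i), it would read:

```latex
H(t_j) = - \sum_{g_i \in G} P(t_j, g_i) \log P(t_j, g_i)
```

with the usual convention that a term with P(t_j, g_i) = 0 contributes 0. This form matches the stated behavior: a morpheme that occurs in a single program gives H(t_j) = 0, and a morpheme spread evenly over all programs gives the maximum value, log |G|.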

To assign large values to morphemes whose occurrences are concentrated, the rarity calculation means 32 then calculates a rarity S(t_j), which inverts the direction of the entropy H(t_j), by the following equation (3). Here, |G| denotes the total number of broadcast video CCs stored in the broadcast video CC storage means 30.
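Equation (3) is likewise missing from this text. Since the description says only that S(t_j) reverses the direction of H(t_j) and involves |G|, one plausible form (an assumption, not a confirmed reconstruction) normalizes the entropy by its maximum possible value, log |G|:

```latex
S(t_j) = 1 - \frac{H(t_j)}{\log |G|}
```

Under this assumption S(t_j) ranges from 0, for a morpheme spread evenly over every program, to 1, for a morpheme confined to a single program.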

The rarity calculation means 32 obtains the rarity S(t_j) of each morpheme and stores each morpheme in the rarity data storage means 56 of the video extraction device 5 together with its part of speech and rarity S(t_j). The rarity S(t_j) is low for morphemes used widely across all kinds of past broadcast programs, and high for morphemes that are used only in particular programs or that have hardly been used in past broadcast programs. The morpheme rarity calculation device 3 needs to calculate the morpheme rarities only once, in advance, and store them in the rarity data storage means 56 of the video extraction device 5.
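The behavior described above (low rarity for widely used morphemes, high rarity for concentrated ones) can be checked with a short sketch. Equations (1) to (3) are not reproduced in this text, so the code assumes a per-program share for the probability, Shannon entropy, and the normalization 1 - H / log|G|. The function name `rarity` and the handling of degenerate inputs are also assumptions.

```python
import math


def rarity(freq_by_program, total_programs):
    """Rarity of one morpheme from its per-program frequencies.

    freq_by_program: {program_name: frequency of the morpheme}
    total_programs:  |G|, the number of stored broadcast video CCs
    Assumed form: S = 1 - H / log|G|, with H the Shannon entropy of
    the morpheme's distribution over programs.
    """
    total = sum(freq_by_program.values())
    if total == 0 or total_programs < 2:
        # Assumption: treat unseen morphemes (or a single-program
        # corpus) as maximally rare.
        return 1.0
    entropy = 0.0
    for count in freq_by_program.values():
        if count > 0:
            p = count / total  # share of occurrences in this program
            entropy -= p * math.log(p)
    return 1.0 - entropy / math.log(total_programs)
```

For example, with four stored programs a morpheme found in only one of them gets rarity 1.0, while one spread evenly over two of the four gets 0.5 under these assumptions.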

The morpheme frequency storage means 33 stores each morpheme contained in the plurality of broadcast video CCs together with its part of speech, its frequency in the CC sentences, and the program name. It is a general-purpose storage means such as a hard disk. The stored frequency data are read and used by the rarity calculation means 32 when it calculates rarity.

The video extraction device 5 receives as input a video, a CC (audio text data) that is the text data of the audio corresponding to the video, and an introduction text for the video, and generates a spot video, which is the video corresponding to the introduction text. The audio text data here is the CC attached to the video, but it may simply be the result of speech recognition on the audio corresponding to the video. The video extraction device 5 comprises an introduction text input means 50, a CC input means 51, a video input means 52, a morphological analysis means 53, a morphological analysis means 54, a similarity calculation means 55, a rarity data storage means 56, a candidate section detection means 57, a section video division means 58, and a spot video output means 59.

The introduction text input means 50 inputs from outside an introduction text that indicates part of the content of the video input through the video input means 52. The input introduction text is output to the morphological analysis means 53. Here the introduction text input means 50 inputs a program introduction text such as those listed in an electronic program guide. Because such a program introduction text usually consists of several sentences, the case where the introduction text consists of a plurality of sentences (extracted-video content information) is described here.

The CC input means 51 inputs from outside a CC, which is the text data of the audio corresponding to the video input through the video input means 52. The audio text data here is the CC attached to the video, but it may instead be the result of speech recognition on the audio corresponding to the video. The input CC is output to the morphological analysis means 54.

The video input means 52 inputs video from outside. Here it inputs the video of a broadcast program. However, the input video may be, for example, video from a network such as the Internet or video received via broadcast waves. The input video is output to the section video division means 58.

The morphological analysis means 53 performs morphological analysis on each sentence of the introduction text input through the introduction text input means 50, analyzing the sentences one by one. The resulting morpheme information (the morpheme, its part of speech, and so on) is output to the similarity calculation means 55. The morphological analysis means 53 can be implemented with a general-purpose morphological analysis system.

The morphological analysis means 54 performs morphological analysis on each CC sentence (audio segment data) of the CC input through the CC input means 51, analyzing the CC sentences one by one. The resulting morpheme information (the morpheme, its part of speech, and so on) is output to the similarity calculation means 55. The morphological analysis means 54 can be implemented with a general-purpose morphological analysis system.

The similarity calculation means 55 calculates the similarity between each sentence of the introduction text input through the introduction text input means 50 and each CC sentence of the CC input through the CC input means 51. The calculated similarity is output to the candidate section detection means 57.

The method by which the similarity calculation means 55 calculates similarity is described below. The similarity calculation means 55 can calculate the similarity Sim(q, d_i) between a sentence q of the introduction text and a CC sentence d_i of the CC by the following equation (4).

Here, Q is the set of morphemes contained in sentence q of the introduction text, D is the set of CC sentences d_i contained in the CC, and |D| is the total number of CC sentences d_i contained in the CC. The similarity calculation means 55 reads from the rarity data storage means 56 the rarity S(t) of each morpheme t contained in sentence q, as supplied by the morphological analysis means 53. It also obtains the frequency tf(t, d_i) with which each morpheme t appears in CC sentence d_i, using the morphemes supplied by the morphological analysis means 54, and then calculates the similarity Sim(q, d_i) by equation (4).
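Equation (4) is not reproduced in this text, so the following is only a simplified sketch of a rarity-weighted similarity built from the quantities named above: S(t), tf(t, d_i), and normalization by the sum of the rarities of the morphemes in Q. The source definitions also mention D and |D|, whose role cannot be recovered from the surviving text and is therefore left out. The function name `similarity` and the default rarity for morphemes missing from the table are assumptions.

```python
def similarity(intro_morphemes, cc_morphemes, rarity_table, default_rarity=1.0):
    """Rarity-weighted similarity between an introduction sentence q
    and a CC sentence d_i (simplified, assumed form).

    intro_morphemes: morphemes of sentence q
    cc_morphemes:    morphemes of CC sentence d_i
    rarity_table:    {morpheme: rarity S(t)}
    default_rarity:  assumed rarity for morphemes absent from the table
    """
    q = set(intro_morphemes)  # Q: the set of morphemes in q
    norm = sum(rarity_table.get(t, default_rarity) for t in q)
    if norm == 0:
        return 0.0
    # Each shared morpheme contributes its rarity times its frequency
    # tf(t, d_i) in the CC sentence.
    score = sum(
        rarity_table.get(t, default_rarity) * cc_morphemes.count(t) for t in q
    )
    return score / norm
```

In this sketch a function word with rarity near 0 contributes almost nothing however often it occurs in the CC sentence, whereas a single shared high-rarity morpheme dominates the score.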

The relationship between morphemes and rarity is now described with a concrete example, with reference to FIG. 3. FIG. 3 is an explanatory diagram showing a concrete example of the rarity of the morphemes contained in the introduction text of a particular program. The morphemes are listed in descending order of rarity, each together with its part of speech and rarity. As shown in FIG. 3, morphemes contained in only a few of the broadcast video CCs (for example, 「鉱害」 "mining pollution" and 「鉱毒」 "mine poisoning") are given high rarity, and morphemes contained in most of the broadcast video CCs (for example, the particle 「て」 and the full stop 「。」) are given low rarity. Morphemes contained in some of the broadcast video CCs (for example, 「探る」 "explore" and 「回復」 "recovery") are given intermediate values.

Thus an introduction text and a CC contain not only morphemes that characterize the video but also many that do not, such as particles, auxiliary verbs, and punctuation marks. Whether a sentence of the introduction text and a CC sentence are similar in content therefore cannot be judged merely from how likely their morphemes are to match. Taking the importance of each morpheme into account on the basis of its rarity, however, makes this judgment possible.

In other words, a morpheme with high rarity occurs disproportionately in the CC of particular programs and is likely to characterize the video. A morpheme with low rarity, by contrast, is likely to be one that has little to do with the content of the video and that is common in ordinary text, such as a particle, an auxiliary verb, or a punctuation mark. Consequently, when a CC sentence contains many of the high-rarity morphemes found in a sentence of the introductory text, the two share many morphemes that characterize the video, and their content is similar. The similarity calculation means 55 therefore uses rarity to assign a high similarity to CC sentences that contain many of the high-rarity morphemes of a sentence of the introductory text, so that CC sentences sharing characteristic morphemes with that sentence receive a high similarity.

Note that equation (4) normalizes the similarity by the sum of the rarities of the morphemes in the sentence of the introductory text. Alternatively, this normalization may be omitted and the similarity calculated so that longer sentences score higher, for example as in equation (5) below. This is effective when a long sentence containing many moderately important morphemes is considered more important than a short sentence containing a single highly important morpheme.
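The effect of normalization can be illustrated with a minimal Python sketch. Equations (4) and (5) are not reproduced in this passage, so the formulas here are assumptions: the score is taken to be the rarity-weighted count of shared morphemes, and the normalized variant divides it by the sum of the rarities of the morphemes in q. The function names and the sample rarity values are hypothetical.

```python
from collections import Counter

def weighted_overlap(q_morphemes, cc_morphemes, rarity):
    """Sum of S(t) * tf(t, d_i) over the distinct morphemes t of q (assumed form)."""
    tf = Counter(cc_morphemes)
    return sum(rarity.get(t, 0.0) * tf[t] for t in set(q_morphemes))

def normalized_similarity(q_morphemes, cc_morphemes, rarity):
    """The same score divided by the sum of the rarities of the morphemes in q."""
    total = sum(rarity.get(t, 0.0) for t in set(q_morphemes))
    if total == 0.0:
        return 0.0
    return weighted_overlap(q_morphemes, cc_morphemes, rarity) / total

# Hypothetical rarity values: a rare noun, a mid-rarity noun, a common particle.
rarity = {"鉱毒": 8.0, "回復": 4.0, "て": 0.5}
q = ["鉱毒", "て", "回復"]
d_rare = ["鉱毒", "て"]      # shares the rare morpheme with q
d_common = ["て", "て"]      # shares only the common particle

score_rare = normalized_similarity(q, d_rare, rarity)
score_common = normalized_similarity(q, d_common, rarity)
```

With these values the CC sentence that shares the rare morpheme scores far higher than the one that repeats only a particle, which is the behavior the text describes. The unnormalized score grows with the number of weighted matches, so longer sentences can score higher.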

The similarity calculation means 55 may also treat each sentence q of the introductory text and each CC sentence d_i as a vector over the morphemes t contained in the introductory text and the CC sentences, and calculate the similarity Sim(q, d_i) from the distance between the vectors by equation (6) below. Here, R denotes the set of morphemes contained in the introductory text and the CC sentences.

Furthermore, the similarity calculation means 55 may use the cosine instead of the distance and calculate the similarity Sim(q, d_i) by equation (7) below.
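A cosine-based similarity over morpheme vectors might look like the following Python sketch. Equation (7) is not reproduced in this passage, so the vector components used here (rarity multiplied by the morpheme count) are an assumption, and `cosine_similarity` is a hypothetical name.

```python
import math
from collections import Counter

def cosine_similarity(q_morphemes, cc_morphemes, rarity):
    """Cosine of the angle between rarity-weighted count vectors (assumed weighting)."""
    q_tf, d_tf = Counter(q_morphemes), Counter(cc_morphemes)
    vocab = set(q_tf) | set(d_tf)  # plays the role of the set R
    q_vec = {t: rarity.get(t, 0.0) * q_tf[t] for t in vocab}
    d_vec = {t: rarity.get(t, 0.0) * d_tf[t] for t in vocab}
    dot = sum(q_vec[t] * d_vec[t] for t in vocab)
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    d_norm = math.sqrt(sum(v * v for v in d_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # a sentence with no weighted morphemes matches nothing
    return dot / (q_norm * d_norm)
```

Unlike a distance, the cosine ignores vector length, so a long CC sentence is not penalized merely for containing more morphemes.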

The rarity data storage means (morpheme probability information storage means, morpheme probability information storage device) 56 stores data (morpheme probability information) that associates each morpheme contained in the plurality of broadcast video CCs with the rarity of that morpheme, calculated by the rarity calculation means 32 of the morpheme rarity calculation device 3, and with its part of speech. It is a general-purpose storage means such as a hard disk. The rarity data stored here is referenced and used by the similarity calculation means 55.

The candidate section detection means (section detection means) 57 selects CC sentences on the basis of the similarity calculated by the similarity calculation means 55 and detects the video sections corresponding to those CC sentences. Specifically, it detects the video sections corresponding to CC sentences with a high similarity as the sections that make up the spot video. Information on the detected sections is output to the section video dividing means 58. In this embodiment, for each sentence of the introductory text, the candidate section detection means 57 selects a plurality of CC sentences with a high similarity and detects the sections of the corresponding videos as candidates for the videos that make up the spot video. Alternatively, it may select only the single CC sentence with the highest similarity for each sentence of the introductory text and detect the video section corresponding to that CC sentence.
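Selecting the highest-scoring CC sentences for one sentence of the introductory text can be sketched in Python as follows. The function name and the parameter `k` are hypothetical; the text does not say how many candidates are kept.

```python
import heapq

def select_candidates(similarities, k=3):
    """Return the indices of the k CC sentences with the highest similarity, best first."""
    return heapq.nlargest(k, range(len(similarities)), key=similarities.__getitem__)

sims = [0.1, 0.9, 0.4, 0.7]          # hypothetical Sim(q, d_i) for i = 0..3
top_two = select_candidates(sims, k=2)
best = select_candidates(sims, k=1)
```

Setting `k=1` gives the variant in which only the single most similar CC sentence is selected.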

The candidate section detection means 57 can determine the start time of a section to be extracted from the time code information attached to the selected CC sentence. It can obtain the end time of the section by, for example, calculating the duration of the section from the number of characters in the CC sentence and the speech rate. The speech rate can be calculated as follows. Where a sentence in the CC exceeds a predetermined number of characters and has been split into pieces within that limit, as with the CC sentence D2 on line (4) of FIG. 2, the candidate section detection means 57 takes the time code information D1 of that CC sentence D2, the number of characters in that CC sentence D2, and the time code information D1 of the next line, and assumes that audio for that number of characters is output between the times indicated by the two pieces of time code information D1.
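The speech-rate estimate and the resulting end time can be worked through in a short Python sketch. It assumes time codes expressed in seconds and a constant speech rate. The function names and sample numbers are hypothetical.

```python
def speech_rate(start_s, next_start_s, n_chars):
    """Characters per second, assuming n_chars are spoken between two time codes."""
    duration = next_start_s - start_s
    if duration <= 0:
        raise ValueError("time codes must be strictly increasing")
    return n_chars / duration

def section_end(start_s, n_chars, chars_per_second):
    """End time of a section: start time plus the estimated speaking time."""
    return start_s + n_chars / chars_per_second

# A split CC line of 30 characters starts at 12 s; the next line starts at 16 s.
rate = speech_rate(start_s=12.0, next_start_s=16.0, n_chars=30)
# A selected CC sentence of 15 characters starts at 40 s.
end = section_end(start_s=40.0, n_chars=15, chars_per_second=rate)
```

Here the estimated rate is 7.5 characters per second, so the 15-character sentence is taken to last 2 seconds.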

The section video dividing means 58 extracts, from the video input from the video input means 52, the video of the sections detected by the candidate section detection means 57, and generates a spot video from it. Here, the section video dividing means 58 comprises a section video extraction unit 58a, a cut division unit 58b, a division unit 58c, and a video selection unit 58d.

The section video extraction unit (section video extraction means) 58a extracts, from the video input from the video input means 52, the video of the sections detected by the candidate section detection means 57. The extracted video is output to the cut division unit 58b.

The cut division unit (cut division means) 58b divides the video extracted by the section video extraction unit 58a into cuts. A cut is a video section shot continuously by a single camera. Because the picture changes greatly at the boundary between such sections, the cut division unit 58b can divide the extracted video into cuts by, for example, taking the color difference between consecutive frame images of the extracted video and splitting the video where the difference is large. The resulting cuts are output to the division unit 58c. The cut division unit 58b may instead take the difference in frequency features between frame images and split the video where that difference is large. It may also divide each frame image into a plurality of small regions, perform block matching to find where each small region has moved in the next frame image, and split the video when the number of small regions whose destination could not be identified exceeds a predetermined value.
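Cut detection from color differences can be sketched in Python as follows. The text does not specify how the color difference is measured. This sketch assumes one precomputed color histogram per frame and uses the sum of absolute bin differences, and the function names and threshold are hypothetical.

```python
def histogram_difference(h1, h2):
    """Sum of absolute bin differences between two color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_cuts(histograms, threshold):
    """Return the frame indices at which a new cut starts."""
    return [i for i in range(1, len(histograms))
            if histogram_difference(histograms[i - 1], histograms[i]) > threshold]

def split_into_cuts(n_frames, boundaries):
    """Turn boundary indices into (start, end) frame ranges, end exclusive."""
    edges = [0] + list(boundaries) + [n_frames]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]

# Four frames with 3-bin histograms; the picture changes between frames 1 and 2.
frames = [[10, 0, 0], [9, 1, 0], [0, 0, 10], [0, 1, 9]]
boundaries = detect_cuts(frames, threshold=5)
cuts = split_into_cuts(len(frames), boundaries)
```

The frequency-feature and block-matching variants would replace only `histogram_difference` with a different per-frame comparison.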

The division unit 58c further divides, on the basis of the motion in the video, those cuts input from the cut division unit 58b whose length is equal to or greater than a threshold. The divided video is output to the video selection unit 58d. The threshold may be a preset value (for example, the average length of the cuts used in spot videos produced so far), or a value calculated from the spot video length entered through an input means (not shown) and the number of sentences in the introductory text.

An example of how the division unit 58c further divides a cut is described with reference to FIG. 4. FIG. 4 is an explanatory diagram illustrating an example of the method by which the division unit further divides a cut. First, the division unit 58c determines whether the length of a cut input from the cut division unit 58b is equal to or greater than the threshold. If it is, the division unit 58c determines, from the video itself, the motion of the camera (not shown) that shot the video. Here, the division unit 58c determines the camera motion by analyzing motion vectors obtained by block matching.

As shown in FIG. 4(a), when a cut consists of a section Z1 shot with the camera stationary followed by a section Z2 shot with the camera moving, the division unit 58c splits the cut shortly before the camera starts to move, that is, a predetermined time before the boundary between section Z1 and section Z2. Deleting part of section Z1 in this way removes most of the portion in which the camera is stationary and the same picture continues for a long time, while leaving a little of it.

Likewise, as shown in FIG. 4(b), when a cut consists of a section Z2 shot with the camera moving followed by a section Z1 shot with the camera stationary, the division unit 58c splits the cut shortly before the camera comes to rest, that is, a predetermined time before the boundary between section Z2 and section Z1.

A video that switches to another cut immediately after the camera starts moving from rest, or immediately after a moving camera comes to rest, looks unnatural to the viewer. Splitting shortly before the camera starts to move from rest, or shortly before a moving camera comes to rest, yields video that looks natural to the viewer.

Furthermore, as shown in FIGS. 4(c), (d), and (e), when a cut was shot with the camera moving briefly and then stationary, with the camera stationary briefly and then moving, with the camera stationary throughout, or with the camera moving throughout, the division unit 58c splits the cut into pieces of a predetermined length. If the section left over at the end after splitting is shorter than the predetermined length, the division unit 58c discards it. If a video obtained by any of these methods is still equal to or longer than the threshold, the division unit 58c divides it again by the methods described above.
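The two splitting rules of FIG. 4 can be sketched in Python as follows. The sketch assumes times in whole seconds and a camera-motion change time that has already been found by motion-vector analysis. The function names and the `lead` parameter (the predetermined time before the boundary) are hypothetical.

```python
def split_before_motion_change(cut_start, cut_end, change_time, lead):
    """Split a cut a fixed lead time before the camera starts or stops moving."""
    split_at = max(cut_start, change_time - lead)
    return [(cut_start, split_at), (split_at, cut_end)]

def split_fixed_length(cut_start, cut_end, length):
    """Split a cut into pieces of `length`; drop a final piece shorter than that."""
    pieces = []
    t = cut_start
    while t + length <= cut_end:
        pieces.append((t, t + length))
        t += length
    return pieces

# FIG. 4(a)/(b) style: the camera motion changes at 6 s, split 1 s earlier.
motion_split = split_before_motion_change(0, 10, change_time=6, lead=1)
# FIG. 4(c)-(e) style: fixed 4 s pieces; the trailing 2 s remainder is discarded.
fixed_split = split_fixed_length(0, 10, length=4)
```

A caller would reapply these functions to any resulting piece that is still at least as long as the threshold.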

The video selection unit 58d selects, from the videos input from the division unit 58c, the videos that make up the spot video. The selected videos are joined together and output to the spot video output means 59 as the spot video.

The video selection unit 58d selects videos in descending order of similarity while ensuring that videos with the same motion vector direction do not appear consecutively and that videos with the same color tone do not appear consecutively. It then joins the selected videos to generate the spot video.
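One way to read this selection rule is as a greedy pick, sketched in Python below. The text does not describe the actual procedure, so the greedy strategy, the dictionary fields (`similarity`, `motion`, `tone`), and the function name are all assumptions made for illustration.

```python
def select_clips(clips, count):
    """Greedy pick by similarity, skipping any clip that repeats the previous
    clip's motion direction or color tone."""
    remaining = sorted(clips, key=lambda c: c["similarity"], reverse=True)
    chosen = []
    while remaining and len(chosen) < count:
        prev = chosen[-1] if chosen else None
        for i, clip in enumerate(remaining):
            if prev is None or (clip["motion"] != prev["motion"]
                                and clip["tone"] != prev["tone"]):
                chosen.append(remaining.pop(i))
                break
        else:
            break  # no remaining clip satisfies both constraints
    return chosen

clips = [
    {"id": "a", "similarity": 0.9, "motion": "left", "tone": "warm"},
    {"id": "b", "similarity": 0.8, "motion": "left", "tone": "cool"},
    {"id": "c", "similarity": 0.7, "motion": "right", "tone": "cool"},
    {"id": "d", "similarity": 0.6, "motion": "static", "tone": "warm"},
]
picked = [c["id"] for c in select_clips(clips, count=3)]
```

In this example clip "b" is skipped twice: first because it repeats the motion direction of "a", then because it repeats the color tone of "c".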

The spot video output means 59 outputs the spot video input from the video selection unit 58d of the section video dividing means 58 to the outside.

The introductory text input from the introductory text input means 50 of the video extraction device 5 of the present invention may consist of a single sentence, or may be a list of morphemes (extracted video content information). The extracted video content information recited in the claims need only indicate part of the content of the video; it may be, for example, a summary rather than an introductory text. Furthermore, although the video extraction device 5 here extracts a corresponding video for each sentence of the introductory text, it may instead extract a corresponding video for the introductory text as a whole. In that case, the extracted video content information recited in the claims corresponds to the introductory text. The video extraction device 5 may also split the introductory text into character strings on the basis of, for example, character count or duration, and extract a corresponding video for each character string. In that case, the extracted video content information recited in the claims corresponds to each such character string.

Although the morpheme rarity calculation device 3 here calculates rarity and stores it in the rarity data storage means 56 of the video extraction device 5, the morpheme rarity calculation device 3 may instead, for example, calculate the entropy of each morpheme and store that in the rarity data storage means 56. In that case, the appearance probability information recited in the claims corresponds to the entropy, and the similarity calculation means 55 of the video extraction device 5 calculates the similarity with the direction of the entropy reversed, so that morphemes whose occurrence is unevenly distributed receive large values.

Furthermore, the morpheme rarity calculation device 3 and the video extraction device 5 can each be realized by implementing their respective means as function programs on a computer, and these function programs can be combined to operate as a morpheme rarity calculation program and a video extraction program.

[Operation of the spot video generation device]
Next, the operation of the spot video generation device 1 is described with reference to FIGS. 5 and 6. FIG. 5 is a flowchart showing the operation by which the morpheme rarity calculation device calculates the rarity of the morphemes to be stored in the rarity data storage means of the video extraction device of the present invention. FIG. 6 is a flowchart showing the operation by which the video extraction device of the present invention calculates the similarity between each sentence of the introductory text and the CC sentences and generates a spot video corresponding to the introductory text.

(Operation of the morpheme rarity calculation device)
First, the operation of the morpheme rarity calculation device 3 (rarity calculation operation) is described with reference to FIG. 5 (and to FIG. 1 as appropriate).

The morpheme rarity calculation device 3 uses the morphological analysis means 31 to read one of the broadcast video CCs stored in the broadcast video CC storage means 30 (step S31). It then uses the morphological analysis means 31 to perform morphological analysis on one CC sentence of the broadcast video CC read in step S31 (step S32).

Next, the morpheme rarity calculation device 3 uses the rarity calculation means 32 to obtain, from the result of the morphological analysis in step S32, the frequency with which each morpheme in the CC sentence appears, and stores the morpheme, its part of speech, its frequency, and the program name in association with one another in the morpheme frequency storage means 33 (step S33). The morpheme rarity calculation device 3 then uses the morphological analysis means 31 to determine whether all CC sentences of the broadcast video CC read in step S31 have been processed (step S34). If not (No in step S34), the process returns to step S32, where the morphological analysis means 31 performs morphological analysis on the next CC sentence, and the subsequent steps are repeated.

If all CC sentences have been processed (Yes in step S34), the morpheme rarity calculation device 3 uses the morphological analysis means 31 to determine whether all broadcast video CCs have been read in step S31 (step S35). If not (No in step S35), the process returns to step S31, where the morphological analysis means 31 reads the next broadcast video CC, and the subsequent steps are repeated.

If all broadcast video CCs have been read (Yes in step S35), the morpheme rarity calculation device 3 uses the rarity calculation means 32 to calculate the rarity of each morpheme from the frequencies stored in the morpheme frequency storage means 33 in step S33, stores the morpheme, its part of speech, and its rarity in association with one another in the rarity data storage means 56 of the video extraction device 5 (step S36), and ends the operation.
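The loop structure of steps S31 to S36 can be sketched in Python as follows. The rarity formula itself is defined elsewhere in the specification and is not reproduced in this passage, so the formula used here (the negative logarithm of the share of programs whose CC contains the morpheme) is an assumption chosen only because it is high for morphemes found in few programs and low for morphemes found in most. Morphological analysis is assumed to have been done already, parts of speech are omitted, and the function name and sample data are hypothetical.

```python
import math
from collections import defaultdict

def compute_rarity(programs):
    """programs maps a program name to its list of CC sentences,
    each sentence being a list of already-analyzed morphemes."""
    programs_containing = defaultdict(set)
    for name, sentences in programs.items():       # outer loop: steps S31 and S35
        for sentence in sentences:                  # inner loop: steps S32 and S34
            for morpheme in sentence:               # step S33: record the occurrence
                programs_containing[morpheme].add(name)
    n = len(programs)
    # Step S36 (assumed formula): -log(share of programs containing the morpheme).
    return {t: -math.log(len(names) / n)
            for t, names in programs_containing.items()}

programs = {
    "p1": [["鉱毒", "て"]],
    "p2": [["回復", "て"]],
    "p3": [["回復", "て"]],
}
rarity = compute_rarity(programs)
```

With this sample data the morpheme found in one program gets the highest value, the one found in two programs an intermediate value, and the particle found in every program a value of zero.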

Through the above operation, the morpheme rarity calculation device 3 calculates the rarity of the morphemes contained in the plurality of broadcast video CCs stored in the broadcast video CC storage means 30 and stores it in the rarity data storage means 56 of the video extraction device 5.

(Operation of the video extraction device)
Next, the operation of the video extraction device 5 (video extraction operation) is described with reference to FIG. 6 (and to FIG. 1 as appropriate).

The video extraction device 5 receives from outside the introductory text through the introductory text input means 50, the CC through the CC input means 51, and the video through the video input means 52 (step S51). It then uses the morphological analysis means 53 to perform morphological analysis on one sentence of the introductory text input in step S51 (step S52).

Next, the video extraction device 5 uses the morphological analysis means 54 to perform morphological analysis on one CC sentence of the CC input in step S51 (step S53). It then uses the similarity calculation means 55 to calculate the similarity between the sentence of the introductory text analyzed in step S52 and the CC sentence analyzed in step S53 (step S54).

The video extraction device 5 then uses the morphological analysis means 54 to determine whether all CC sentences input in step S51 have been processed (step S55). If not (No in step S55), the process returns to step S53, where the morphological analysis means 54 performs morphological analysis on the next CC sentence, and the subsequent steps are repeated.

If all CC sentences have been processed (Yes in step S55), the video extraction device 5 uses the candidate section detection means 57 to select the CC sentences with a high similarity as calculated in step S54 (step S56) and to detect the video sections corresponding to those CC sentences (step S57). It then uses the morphological analysis means 53 to determine whether all sentences of the introductory text input in step S51 have been processed (step S58). If not (No in step S58), the process returns to step S52, where the morphological analysis means 53 performs morphological analysis on the next sentence of the introductory text, and the subsequent steps are repeated.

If all sentences have been processed (Yes in step S58), the video extraction device 5 uses the section video extraction unit 58a of the section video dividing means 58 to extract the video of the sections detected in step S57 from the video input in step S51 (step S59). It then uses the cut division unit 58b of the section video dividing means 58 to divide the video extracted in step S59 into cuts, and uses the division unit 58c to further divide, on the basis of the motion in the video, those cuts whose length is equal to or greater than the threshold, thereby adjusting the video sections (step S60).

Next, the video extraction device 5 uses the video selection unit 58d of the section video dividing means 58 to select videos, on the basis of motion, color tone, and similarity, from the videos whose sections were adjusted in step S60, and joins them to generate a spot video (step S61). Finally, it uses the spot video output means 59 to output the spot video generated in step S61 (step S62).

Through the above operation, the video extraction device 5 can calculate the similarity between each sentence of the externally input introductory text and the CC sentences on the basis of the rarity of each morpheme calculated by the morpheme rarity calculation device 3. It can then extract the video of the sections corresponding to CC sentences with a high similarity and generate a spot video.

[Application examples of the video extraction device]
The video extraction device 5 may have a video display means (not shown) that outputs the spot video, together with information such as the sections, the introductory text, and the similarities, to a display means (not shown) for display on a display screen. The video extraction device 5 may also have a spot video editing means (not shown) that edits the spot video in accordance with operator commands entered through a command input means (not shown).

An application example of the video extraction device 5 is described with reference to FIG. 7. FIG. 7 is a schematic diagram showing an example of an editing screen for a spot video generated by the video extraction device. When the operator inputs a video, the CC of the video, and an introductory text into the video extraction device 5, the video extraction device 5 extracts the video corresponding to each sentence of the introductory text. Then, as shown in FIG. 7, the video extraction device 5 uses the video display means (not shown) to display on the display screen: the spot video Va edited by the operator; the video Vb that the operator has selected from the candidate videos as material for editing; the information Vc on the sections of the videos that make up the spot video; the sentences Vd1, Vd2, Vd3, ... that make up the introductory text; and the videos Ve, Ve, ... corresponding to the sentences Vd1, Vd2, Vd3, ..., arranged in descending order of similarity. Using the spot video editing means (not shown), the operator can then refine the spot video, for example by adjusting the length of a cut on the timeline or by swapping in another candidate video Ve, and so can easily produce a spot video corresponding to the introductory text.

Furthermore, the video extraction device 5 may be applied to searching for video, for example on the Internet. When applied to Internet video search, the video extraction device 5 is connected to the Internet and, on the basis of text data input from the introductory text input means 50 and the CC of Internet videos input from the CC input means 51, selects from the Internet videos input from the video input means 52 the video that shows the content of the text data. Here, as shown in FIG. 8(a), the operator enters a plurality of morphemes t1 and t2 through the introductory text input means 50 as the text data, in place of an introductory text. The video extraction device 5 calculates, on the basis of rarity, the similarity between the morphemes t1 and t2 and the CC sentences of the CC input from the CC input means 51, and selects the video of the sections with a high similarity.

Then, as shown in FIG. 8(b), the video display means displays on the display screen, as the search results, the videos Ve, Ve, ... of the sections with a high similarity together with the similarity information Vf, Vf, ... for those videos. FIG. 8 is a schematic diagram showing an example of the display screens when the video extraction device of the present invention is applied to Internet video search: (a) shows an example of an input screen for entering morphemes that indicate the content of the video to be searched for, and (b) shows an example of a screen that displays the retrieved videos.

FIG. 1 is a schematic diagram showing the configuration of a spot video generation device that includes the video extraction device of the present invention.
FIG. 2 is an explanatory diagram showing an example of the CC used by the spot video generation device that includes the video extraction device of the present invention.
FIG. 3 is an explanatory diagram showing a specific example of the rarity of the morphemes contained in the introductory text of a particular program, as calculated by the spot video generation device that includes the video extraction device of the present invention.
FIG. 4 is an explanatory diagram illustrating an example of the method by which the division unit of the video extraction device of the present invention further divides a cut.
FIG. 5 is a flowchart showing the operation by which the morpheme rarity calculation device of the spot video generation device that includes the video extraction device of the present invention calculates the rarity of the morphemes to be stored in the rarity data storage means of the video extraction device.
FIG. 6 is a flowchart showing the operation by which the video extraction device of the present invention calculates the similarity between each sentence of the introductory text and the CC sentences and generates a spot video corresponding to the introductory text.
FIG. 7 is a schematic diagram showing an example of an editing screen for a spot video generated by the video extraction device of the present invention.
FIG. 8 is a schematic diagram showing an example of the display screens when the video extraction device of the present invention is applied to Internet video search: (a) shows an example of an input screen for entering morphemes that indicate the content of the video to be searched for, and (b) shows an example of a screen that displays the retrieved videos.

Explanation of symbols

5 Video extraction device
55 Similarity calculation means
56 Rarity data storage means (morpheme probability information storage means)
57 Candidate section detection means (section detection means)
58a Section video extraction unit (section video extraction means)
58b Cut division unit (cut division means)

Claims

A video extraction device that receives as input a video, audio text data corresponding to the video, and extracted video content information that is composed of a plurality of morphemes and indicates the content of a part of the video, and that extracts the part of the video corresponding to the extracted video content information, the device comprising:
morpheme probability information storage means for storing morpheme probability information in which appearance probability information, indicating the appearance probability of each morpheme included in other audio text data (audio text data corresponding to a plurality of other videos), is associated with that morpheme;
similarity calculation means for calculating a similarity indicating the degree of similarity between the extracted video content information and audio segment data, obtained by dividing the audio text data into a plurality of segments, based on the appearance probability information that corresponds, in the morpheme probability information stored in the morpheme probability information storage means, to each morpheme included in the extracted video content information, and on the frequency of appearance of that morpheme in each piece of audio segment data;
section detection means for selecting, based on the similarity calculated by the similarity calculation means, the audio segment data corresponding to the extracted video content information, and detecting the section of the video corresponding to that audio segment data; and
section video extraction means for extracting the video of the section detected by the section detection means.
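The claim does not fix a formula for the similarity. As a minimal sketch, assuming (hypothetically) that each appearance probability p is converted into a weight of -log p and that the similarity is the sum of weight times frequency over the distinct morphemes of the extracted video content information, the similarity calculation and section detection could look as follows in Python. The names `similarity`, `detect_section`, and `DEFAULT_PROB`, and the `(start, end, morphemes)` segment layout, are illustrative assumptions and are not taken from the specification.

```python
import math
from collections import Counter

# Assumed fallback probability for morphemes absent from the stored
# morpheme probability information (not specified in the claim).
DEFAULT_PROB = 1e-6


def similarity(query_morphemes, segment_morphemes, morpheme_prob):
    """Assumed formula: sum over distinct query morphemes of (-log p) * frequency."""
    freq = Counter(segment_morphemes)  # appearance frequency within the segment
    score = 0.0
    for m in set(query_morphemes):
        p = morpheme_prob.get(m, DEFAULT_PROB)
        score += -math.log(p) * freq[m]  # rare morphemes contribute more
    return score


def detect_section(query_morphemes, segments, morpheme_prob):
    """Select the segment with the highest similarity and return its time span.

    segments: list of (start_sec, end_sec, morphemes) tuples.
    """
    best = max(
        segments,
        key=lambda s: similarity(query_morphemes, s[2], morpheme_prob),
    )
    return best[0], best[1]


# Toy stand-ins for the stored probabilities and the segmented audio text data.
morpheme_prob = {"the": 0.5, "volcano": 0.001, "erupts": 0.002}
segments = [
    (0.0, 4.0, ["the", "weather", "the"]),
    (4.0, 9.5, ["the", "volcano", "erupts"]),
]

print(detect_section(["the", "volcano"], segments, morpheme_prob))  # (4.0, 9.5)
```

In this toy data the common morpheme "the" appears twice in the first segment but carries little weight, so the single occurrence of the rare morpheme "volcano" decides the result. The returned time span is what a section video extraction step would then cut from the input video.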

The video extraction device according to claim 1, further comprising cut division means for dividing the video extracted by the section video extraction means into cuts.

A video extraction program that receives as input a video, audio text data corresponding to the video, and extracted video content information that is composed of a plurality of morphemes and indicates the content of a part of the video, and that, in order to extract the part of the video corresponding to the extracted video content information based on morpheme probability information stored in a morpheme probability information storage device, in which appearance probability information indicating the appearance probability of each morpheme included in other audio text data (audio text data corresponding to a plurality of other videos) is associated with that morpheme, causes a computer to function as:
similarity calculation means for calculating a similarity indicating the degree of similarity between the extracted video content information and audio segment data, obtained by dividing the audio text data into a plurality of segments, based on the appearance probability information that corresponds, in the morpheme probability information stored in the morpheme probability information storage device, to each morpheme included in the extracted video content information, and on the frequency of appearance of that morpheme in each piece of audio segment data;
section detection means for selecting, based on the similarity calculated by the similarity calculation means, the audio segment data corresponding to the extracted video content information, and detecting the section of the video corresponding to that audio segment data; and
section video extraction means for extracting the video of the section detected by the section detection means.