JP6446987B2

JP6446987B2 - Video selection device, video selection method, video selection program, feature amount generation device, feature amount generation method, and feature amount generation program

Info

Publication number: JP6446987B2
Application number: JP2014211413A
Authority: JP
Inventors: 祐一森谷; 善雄石澤; 康高山本; 綾子星野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-10-16
Filing date: 2014-10-16
Publication date: 2019-01-09
Anticipated expiration: 2034-10-16
Also published as: JP2016081265A

Description

The present invention relates to an information processing technique for selecting a video using a computer or the like.

When a video is reproduced together with audio, a viewer feels a sense of incongruity if the content of the audio and the content of the video are far apart. However, manually selecting a video that is related to the content of the audio is a very laborious task. Examples of the reproduced audio include narration, recitation, announcements, and the lyrics of a piece of music.

An example of a technique for displaying images in accordance with music is disclosed in Patent Document 1. The slide show creation server disclosed in Patent Document 1 stores, for a plurality of original slide shows, data of each original slide show, including the lyrics of each line of a piece of music, a series of images displayed for each line, and an overall impression word estimated using the lyrics. The slide show creation server extracts image feature amounts from a plurality of images designated by the user. The slide show creation server assigns tags to the image data using the extracted image feature amounts. The slide show creation server assigns an overall impression label to the designated images as a whole using the assigned tags. The slide show creation server selects an original slide show that matches the assigned overall impression label. The slide show creation server creates a new slide show by replacing the image data of the selected original slide show with the images designated by the user.

Patent Document 2 describes a content search device that calculates a distance indicating the similarity between a plurality of contents, each of which is classified into one or more of a plurality of types that are not necessarily the same. The content search device derives the feature amounts of those content features (one or more of an image feature, an acoustic feature, and a semantic feature) that can be derived. Based on correlations between different features calculated in advance, the content search device uses the derived feature amounts to estimate the feature amounts of the unknown features of the content. For example, the content search device estimates the acoustic feature amount of image content based on the image feature amount and the semantic feature amount derived from the image content and its metadata. The content search device calculates the similarity based on all of the derived and estimated feature amounts.

Patent Document 3 describes a music classification device that classifies pieces of music into categories. Based on the lyrics data of pieces of music, the music classification device generates, by learning, a category classifier that classifies a piece of music into a classification destination identified by a category name. The music classification device further classifies the pieces of music classified into a category into subcategories by clustering.

Patent Document 1: JP 2014-115729 A
Patent Document 2: International Publication No. 2010/053160
Patent Document 3: JP 2013-214326 A

The content of the reproduced audio can be represented by text.

The slide show creation server of Patent Document 1 creates a new slide show by replacing the image data of an original slide show created in advance with images designated by the user. Therefore, the slide show creation server cannot select a video that matches, for example, the text represented by a piece of music.

In addition, an original slide show may have been created from a piece of music and image data that does not match the text represented by the lyrics of that music. In that case, the new slide show created by the slide show creation server is composed of the music and image data that does not match the text represented by its lyrics. That is, the slide show creation server does not necessarily create a slide show composed of a piece of music and image data that matches the text represented by the music.

The content search device of Patent Document 2 estimates the similarity between contents by calculating the distance between the contents based on calculated or estimated feature amounts. When the content search device searches for content at a small distance from, for example, a designated text, the result is not necessarily a video. In addition, since the distance between contents also depends on the acoustic feature amount, even if the semantic feature amounts and the image feature amounts of a text and a video that match each other are close, the distance between the text and the video does not become small if the difference between their calculated or estimated acoustic feature amounts is large. Therefore, the content search device cannot select a video that matches a text.

The technique of Patent Document 3 is a technique for classifying pieces of music. Therefore, the technique of Patent Document 3 cannot select a video that matches a text.

An object of the present invention is to provide a video selection device and the like that can reduce the burden of matching text with video.

A video selection system according to one aspect of the present invention includes: video feature generation means for generating, for each of a plurality of videos, a video feature amount, which is a feature amount of the video, by executing text mining processing on the text associated with the video; target feature generation means for generating a target feature amount, which is a feature amount of a target text, by executing the text mining processing on the target text; similarity derivation means for deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and video selection means for selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.

A feature amount generation device according to one aspect of the present invention includes: feature extraction means for extracting from the texts, based on an attribute that is a word or phrase related to at least one of a plurality of videos each associated with a text, a feature that is a word or phrase modifying the attribute, and storing the extracted feature in feature storage means; and video feature generation means for detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as the video feature amount of the video.

A video selection method according to one aspect of the present invention includes: generating, for each of a plurality of videos, a video feature amount, which is a feature amount of the video, by executing text mining processing on the text associated with the video; generating a target feature amount, which is a feature amount of a target text, by executing the text mining processing on the target text; deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.
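The four steps of this method can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: counting occurrences of phrases from a hypothetical `feature_list` stands in for the text mining processing, and cosine similarity is assumed as the similarity measure, which this passage does not specify. All function names are illustrative.

```python
from math import sqrt


def text_features(text, feature_list):
    """Stand-in for text mining: count occurrences of each feature phrase."""
    return [text.count(phrase) for phrase in feature_list]


def cosine(u, v):
    """Cosine similarity (an assumed measure); 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_video(target_text, video_texts, feature_list):
    """video_texts maps a video ID to the text associated with that video."""
    # Step 1: video feature amount for each video, from its associated text.
    video_features = {
        vid: text_features(text, feature_list)
        for vid, text in video_texts.items()
    }
    # Step 2: target feature amount from the target text.
    target = text_features(target_text, feature_list)
    # Step 3: similarity of each video feature amount to the target.
    scores = {vid: cosine(vec, target) for vid, vec in video_features.items()}
    # Step 4: select the video whose feature amount is most similar.
    best = max(scores, key=scores.get)
    return best, scores
```

Because both texts pass through the same feature extraction, the comparison happens entirely in text space, and no image analysis of the videos is needed at selection time.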

A feature amount generation method according to one aspect of the present invention includes: extracting from the texts, based on an attribute that is a word or phrase related to at least one of a plurality of videos each associated with a text, a feature that is a word or phrase modifying the attribute; storing the extracted feature in feature storage means; detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video; and generating a feature amount representing the detected features as the video feature amount of the video.

A video selection program according to one aspect of the present invention causes a computer to operate as: video feature generation means for generating, for each of a plurality of videos, a video feature amount, which is a feature amount of the video, by executing text mining processing on the text associated with the video; target feature generation means for generating a target feature amount, which is a feature amount of a target text, by executing the text mining processing on the target text; similarity derivation means for deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and video selection means for selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.

A feature amount generation program according to one aspect of the present invention causes a computer to operate as: feature extraction means for extracting from the texts, based on an attribute that is a word or phrase related to at least one of a plurality of videos each associated with a text, a feature that is a word or phrase modifying the attribute, and storing the extracted feature in feature storage means; and video feature generation means for detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as the video feature amount of the video.

The present invention has the effect of reducing the burden of matching text with video.

FIG. 1 is a block diagram showing the configuration of a video selection system 1 according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of accompanying information according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing an example of an operation in which the feature generation device 100 extracts features based on a plurality of texts and accompanying information.
FIG. 4 is a diagram schematically showing an example of features to be extracted.
FIG. 5 is a diagram schematically showing an example of extracted features.
FIG. 6 is a diagram schematically showing an example of a feature list.
FIG. 7 is a flowchart showing an example of an operation in which the feature generation device 100 according to the first embodiment of the present invention generates video features.
FIG. 8 is a diagram schematically showing an example of the text associated with each video.
FIG. 9 is a diagram schematically showing an example of video feature vectors.
FIG. 10 is a flowchart showing an example of an operation in which the video selection device 110 according to the first embodiment of the present invention receives video feature vectors.
FIG. 11 is a flowchart showing an example of an operation in which the video selection device 110 according to the first embodiment of the present invention selects a video in response to receiving a target text.
FIG. 12 is a diagram schematically showing an example of a target feature vector.
FIG. 13 is a diagram schematically showing an example of the target feature amount and the video feature amounts for which the similarity derivation unit 113 derives similarities.
FIG. 14 is a diagram schematically showing an example of similarities.
FIG. 15 is a block diagram showing an example of the configuration of a video selection system 1A according to a first modification of the first embodiment of the present invention.
FIG. 16 is a block diagram showing an example of the configuration of a video selection system 1B according to a second modification of the first embodiment of the present invention.
FIG. 17 is a block diagram showing an example of the configuration of a video selection system 1C according to a third modification of the first embodiment of the present invention.
FIG. 18 is a block diagram showing an example of the configuration of a video selection system 1D according to the second embodiment of the present invention.
FIG. 19 is a block diagram showing an example of the configuration of a computer that can be used to implement the video selection device and the feature generation device according to each embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<First Embodiment>
First, a first embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a video selection system 1 according to the first embodiment of the present invention.

Referring to FIG. 1, the video selection system 1 of the present embodiment includes a feature generation device 100 and a video selection device 110. The feature generation device 100 and the video selection device 110 are communicably connected. In the example shown in FIG. 1, the feature generation device 100 and the video selection device 110 are implemented as separate devices. However, the video selection device 110 may include the feature generation device 100, or the video selection device 110 may operate as the feature generation device 100.

The feature generation device 100 includes an accompanying information receiving unit 101, an accompanying information storage unit 102, a teacher data receiving unit 103, a teacher data storage unit 104, an attribute extraction unit 105, a feature extraction unit 106, a feature storage unit 107, and a video feature generation unit 108. The video selection system 1 may include a user terminal (not shown) that a user of the video selection system 1 uses to input instructions and the like.

The video selection device 110 includes a target receiving unit 111, a target feature generation unit 112, a similarity derivation unit 113, a video selection unit 114, an output unit 115, a video feature receiving unit 116, and a video feature storage unit 117. The video selection device 110 may further include a video receiving unit 118 and a video storage unit 119.

The teacher data receiving unit 103 receives a plurality of texts. In each embodiment of the present invention, data representing a text is also simply referred to as a text. For example, receiving a text means receiving data representing that text. The plurality of texts may be input to the teacher data receiving unit 103 by, for example, the video selection device 110, a video server (not shown) that stores the plurality of videos described later, or a user terminal (not shown).

Each of the texts received by the teacher data receiving unit 103 is associated with at least one of a plurality of videos, namely a video related to the content of that text. Each of the plurality of videos only needs to be associated with at least one text. For example, a user of the video selection system 1 may associate, in advance, each of the plurality of texts with the videos judged to be related to the content of that text. A video judged to be related to the content of a text is, for example, a video judged to match the content of the text visually, experientially, or both. Conversely, for each of the plurality of videos, the user of the video selection system 1 may associate with the video one or more texts judged to match the content of the video visually, experientially, or both. The text associated with a video may include the planning intention of the video. As described later, the planning intention is, for example, a phrase or sentence expressing the concept of the video or the purpose for which the video was produced.

The teacher data receiving unit 103 may receive a plurality of sets, each consisting of a text and at least one video ID (identifier), which is an identifier of a video. The teacher data receiving unit 103 stores the received texts, each associated with at least one of the plurality of videos, in the teacher data storage unit 104.
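The sets of a text and one or more video IDs can be pictured as simple records. The sketch below is only one possible in-memory layout; the class and method names (`TeacherRecord`, `TeacherDataStore`, `receive`, `texts_for`) are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherRecord:
    """One set: a text plus the IDs of the videos it is associated with."""
    text: str
    video_ids: List[str]


class TeacherDataStore:
    """Plays the role of the teacher data storage unit 104 (illustrative)."""

    def __init__(self):
        self._records: List[TeacherRecord] = []

    def receive(self, text: str, video_ids: List[str]) -> None:
        # Every text must be associated with at least one video.
        if not video_ids:
            raise ValueError("a text needs at least one video ID")
        self._records.append(TeacherRecord(text, list(video_ids)))

    def texts_for(self, video_id: str) -> List[str]:
        """All texts associated with the given video, in arrival order."""
        return [r.text for r in self._records if video_id in r.video_ids]
```

Keying the lookup on the video ID reflects how the texts are used later, when features are generated per video.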

The text is, for example, lyrics. The text may also be text representing the content of a narration, a recitation, or an announcement. The following mainly describes the video selection system 1 for the case where the text is lyrics.

The teacher data storage unit 104 stores a plurality of texts, each of which is associated with at least one of the plurality of videos.

The accompanying information receiving unit 101 receives the accompanying information of each of the plurality of videos described above. The accompanying information may be input to the accompanying information receiving unit 101 by, for example, the video selection device 110, the video server described above (not shown), or a user terminal (not shown). The accompanying information receiving unit 101 stores the received accompanying information in the accompanying information storage unit 102.

The accompanying information storage unit 102 stores the accompanying information of each of the plurality of videos.

The accompanying information is information that includes words representing the content of a video. The accompanying information may be expressed, for example, by words for each item, where an item represents a viewpoint from which videos are classified (that is, a type of video). The accompanying information may instead be expressed by a sentence or phrase for each item, or by either or both of words and a sentence or phrase for each item.

FIG. 2 is a diagram showing an example of accompanying information. The accompanying information includes, for example, a plurality of types of information representing the content of a video. These types of information represent, for example, the viewpoints and results of classification when videos are classified into a plurality of categories from a plurality of viewpoints. In that case, a viewpoint of classification is referred to as an item, the name identifying the viewpoint is referred to as an item name, and the result of classification (that is, the category into which a video is classified) is referred to as the value of the item. The accompanying information need not consist of classification viewpoints and classification results. In FIG. 2, "item name" is the name of a classification viewpoint for the video whose content the accompanying information represents, and "example (content)" gives a specific example or the content of the information represented by each item name. Words not enclosed in parentheses are specific examples of the value of each item, and phrases enclosed in parentheses describe the content of the value of each item. The type of the video whose content is represented by the accompanying information is expressed by the values of the items. Referring to FIG. 2, the accompanying information includes, for example, the values of the items identified by item names such as planning intention, model, clothes, place, season, weather, time of day, and event. In the following description, the value of the item whose item name is "planning intention", for example, is written as "the value of the planning intention".

In the example shown in FIG. 2, the value of the planning intention is expressed by, for example, a phrase or sentence. The values of the items other than the planning intention are expressed by words. The phrase or sentence that is the value of the planning intention expresses, for example, the concept of the video or the purpose for which the video was produced. The value of the model is, for example, the gender of the persons appearing in the video, such as man and woman, man, or woman. The value of the model may also be something other than a person, such as an animal, a plant, or an object. The value of the clothes is, for example, the clothing of a person appearing in the video, such as Western clothes or Japanese clothes. The value of the clothes may simply be "clothes". The value of the place is, for example, a word representing the place where the video was shot or the place where the subject of the shooting exists. The value of the place may be a word representing a category of place, such as city or sea, or may be a specific region name. The value of the season is a word representing the season in which the video was shot. The value of the weather is a word representing the weather when the video was shot. The value of the time of day is a word representing the time period in which the video was shot. The value of the event is a word representing the event taking place in the scene captured in the video. For an item expressed by words, the accompanying information may include two or more words per item. There may also be items that contain no word (that is, items that have no value). The accompanying information is not limited to the example shown in FIG. 2. The accompanying information need not include the items shown in FIG. 2, and may include information on items other than those shown in FIG. 2.
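As an illustration, the accompanying information of one video could be held as a mapping from item name to value. The item names below follow FIG. 2 as described in the text, while the concrete values, the variable name `accompanying_info`, and the helper `word_values` are hypothetical.

```python
# Accompanying information for one hypothetical video.
# The planning intention is a phrase or sentence; every other item holds
# zero or more words.
accompanying_info = {
    "planning intention": "a quiet walk along the beach at dusk",
    "model": ["woman"],
    "clothes": ["Western clothes"],
    "place": ["sea"],
    "season": ["summer"],
    "weather": ["clear"],
    "time of day": ["evening"],
    "event": [],  # an item may have no value
}


def word_values(info):
    """Collect the words of all word-valued items, in item order."""
    return [
        word
        for value in info.values()
        if isinstance(value, list)  # skip the phrase-valued item
        for word in value
    ]
```

Storing word-valued items as lists accommodates both items with two or more words and items with no value.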

The attribute extraction unit 105 extracts attributes from the accompanying information, where an attribute is a word representing at least part of the content of a video. When the accompanying information includes a word as a value, the attribute extraction unit 105 may extract that word. When the accompanying information includes a sentence or phrase as a value, the attribute extraction unit 105 may extract, for each video, words characterizing that video from the sentence or phrase by, for example, the TF-IDF (Term Frequency-Inverse Document Frequency) method.
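A minimal sketch of TF-IDF keyword extraction over phrase-valued accompanying information follows. It assumes the phrases have already been split into tokens (for Japanese, this would require morphological analysis, which is not shown), uses the unsmoothed weighting tf × log(N / df), and picks the `top_k` highest-scoring tokens per video. The function name, the `top_k` parameter, and the weighting variant are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """docs maps a video ID to the token list of its phrase or sentence.

    Returns, per video, the top_k tokens ranked by TF-IDF score
    (ties broken alphabetically so the result is deterministic).
    """
    n_docs = len(docs)
    # Document frequency: in how many videos' phrases each token appears.
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    result = {}
    for vid, toks in docs.items():
        tf = Counter(toks)
        scores = {
            tok: (count / len(toks)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        result[vid] = ranked[:top_k]
    return result
```

A token that appears in every video's phrase receives a score of zero, so words that are common to all videos are not extracted as attributes of any particular video.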

For each of the extracted attributes, the feature extraction unit 106 extracts, as features, the words or phrases that relate to (that is, modify) the attribute from the texts stored in the teacher data storage unit 104. For each extracted attribute, the feature extraction unit 106 may extract these words or phrases from, for example, the texts associated with the videos having that attribute. The attributes that a video has are the attributes included in the accompanying information of that video. The feature extraction unit 106 stores the extracted features in the feature storage unit 107. The feature extraction unit 106 may generate a feature list, which is a list of all the extracted features, and may store the generated feature list in the feature storage unit 107.
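The per-attribute extraction can be sketched as below. The sketch assumes that some parser has already produced, for each text, pairs of a modifier phrase and the word it modifies; the parser itself, the function name `extract_feature_list`, and the three input mappings are hypothetical.

```python
def extract_feature_list(video_attributes, video_texts, parsed_modifiers):
    """Build a de-duplicated feature list.

    video_attributes: {video_id: set of attribute words of that video}
    video_texts:      {video_id: [text_id, ...]} texts associated with it
    parsed_modifiers: {text_id: [(modifier_phrase, modified_word), ...]},
                      assumed to come from a dependency parser (not shown)
    """
    features = []
    for video_id, attributes in video_attributes.items():
        # Only texts associated with a video having the attribute are used.
        for text_id in video_texts.get(video_id, []):
            for phrase, modified in parsed_modifiers.get(text_id, []):
                if modified in attributes and phrase not in features:
                    features.append(phrase)
    return features
```

Restricting the search to texts of videos that have the attribute keeps phrases that modify an unrelated word, or an attribute of a different video, out of the list.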

The feature extraction unit 106 may, for example, detect the words or phrases relating to an attribute by applying basic text processing such as morphological analysis and syntactic analysis to the text, and may estimate the part of speech and the characteristics of each detected word or phrase.

素性抽出部１０６が検出する語句の長さは、限定されていればよい。その場合、語句の長さは、語句を構成する単位の数であればよい。語句を構成する単位は、例えば、その語句を構成する、形容詞、形容動詞、及び、名詞と助詞との組み合わせなどであればよい。例えば、属性が「人」であり、「人」に係る語句として、「背が高い」が検出された場合、素性抽出部１０６は、語句「背が高い」を構成する単位として、例えば、「背が」と「高い」を特定すればよい。そして、素性抽出部１０６は、語句「背が高い」の長さが２であると判定すればよい。素性抽出部１０６は、あらかじめ決められた長さ（例えば２個）以下の長さの語句を検出すればよい。語句の特性は、例えば、属性を修飾する可能性があるか否かを表す特性である。 The length of the phrases detected by the feature extraction unit 106 may be limited. In that case, the length of a phrase may be the number of units constituting the phrase. A unit constituting a phrase may be, for example, an adjective, an adjectival verb, or a combination of a noun and a particle. For example, when the attribute is “人” (person) and “背が高い” (tall) is detected as a phrase depending on “人”, the feature extraction unit 106 may identify, for example, “背が” and “高い” as the units constituting the phrase “背が高い”. The feature extraction unit 106 may then determine that the length of the phrase “背が高い” is 2. The feature extraction unit 106 may detect phrases whose length is equal to or less than a predetermined length (for example, two units). The characteristic of a phrase is, for example, a characteristic indicating whether or not the phrase can modify an attribute.

素性抽出部１０６は、テキスト処理の結果を使用して、検出された、属性に係る語句が、その属性を修飾しうる語句か否かを判定すればよい。素性抽出部１０６は、検出された、属性に係る語句が、その属性を修飾しうると判定した場合、その語句を素性として抽出すればよい。 The feature extraction unit 106 may use the text processing result to determine whether the detected word / phrase relating to the attribute is a word / phrase that can modify the attribute. If the feature extraction unit 106 determines that the detected phrase relating to the attribute can modify the attribute, the feature extraction unit 106 may extract the phrase as a feature.

例えば、素性抽出部１０６は、検出された語句が、形容詞、形容動詞、又は、名詞と助詞との組み合わせなどの、他の単語を修飾できる語句である場合、その語句が属性を修飾しうる語句であると判定すればよい。その場合、素性抽出部１０６は、検出された語句を、素性として抽出してもよい。 For example, when the detected phrase is a phrase that can modify other words, such as an adjective, an adjectival verb, or a combination of a noun and a particle, the feature extraction unit 106 may determine that the phrase can modify the attribute. In that case, the feature extraction unit 106 may extract the detected phrase as a feature.
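The length limit and the part-of-speech check described above can be sketched as below. The sketch assumes that morphological and syntactic analysis has already produced, for each attribute, candidate phrases split into units with a part-of-speech label; the `Candidate` class, the label names in `MODIFYING_POS`, and every sample phrase other than 「背が高い」 and 「暖かい」 are hypothetical.

```python
from dataclasses import dataclass

# Unit types treated as able to modify another word (label names are illustrative).
MODIFYING_POS = {"adjective", "adjectival_verb", "noun+particle"}


@dataclass
class Candidate:
    attribute: str  # the attribute the phrase depends on
    units: list     # list of (surface, part_of_speech) pairs


def select_features(candidates, max_units=2):
    """Keep phrases of at most max_units units whose units can all modify a word."""
    features = []
    for candidate in candidates:
        if not candidate.units or len(candidate.units) > max_units:
            continue  # empty or too long
        if all(pos in MODIFYING_POS for _, pos in candidate.units):
            features.append("".join(surface for surface, _ in candidate.units))
    return features


candidates = [
    Candidate("人", [("背が", "noun+particle"), ("高い", "adjective")]),
    Candidate("春", [("暖かい", "adjective")]),
    Candidate("人", [("走る", "verb")]),  # rejected: a verb is not in MODIFYING_POS
    Candidate("髪", [("とても", "adverb"), ("長くて", "adjective"),
                    ("きれいな", "adjectival_verb")]),  # rejected: three units
]
features = select_features(candidates)
```

Requiring every unit to be one of the listed unit types is one possible reading of the description; a dictionary-based check could be applied to the surviving phrases as a further filter.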

素性抽出部１０６は、さらに、例えば、単語の種類と、その種類の単語を修飾するのに使用される語句との組み合わせを含む辞書を使用して、属性に係る語句として検出された語句が、その属性を修飾する語句として使用されるか否かを判定してもよい。素性抽出部１０６は、検出された語句が他の単語を修飾できる語句であり、さらに、検出された語句が、その語句が係る属性を修飾するのに使用される場合、その語句を素性として抽出してもよい。 The feature extraction unit 106 may further determine whether a phrase detected as depending on an attribute is actually used to modify that attribute, for example by using a dictionary containing combinations of a word type and the phrases used to modify words of that type. The feature extraction unit 106 may extract a detected phrase as a feature when the phrase can modify other words and, in addition, is used to modify the attribute on which it depends.

素性は、素性抽出部１０６によって、映像に関連付けられているテキストから抽出される素性に限られない。素性は、あらかじめ選択された、例えば、イベントの名称を表す語句や、場所を表す語句を含んでいてもよい。素性抽出部１０６によって映像に関連付けられているテキストから抽出される素性以外の素性は、例えば映像選択システム１の管理者によって、あらかじめ素性記憶部１０７に格納されていてもよい。イベントや場所を表す語句が集められた辞書が、例えば映像選択システム１の管理者によって、あらかじめ作成され、例えば素性辞書記憶部（図示されない）や素性記憶部１０７などの、素性抽出部１０６がアクセスできる記憶部に格納されていてもよい。そして、素性抽出部１０６は、そのような辞書から語句を読み出してもよい。素性抽出部１０６は、上述の辞書から読み出した語句を、素性として、素性記憶部１０７に格納してもよい。 The features are not limited to those extracted by the feature extraction unit 106 from the text associated with the videos. The features may include phrases selected in advance, for example phrases representing the name of an event or phrases representing a place. Features other than those extracted from the text associated with the videos may be stored in the feature storage unit 107 in advance, for example by the administrator of the video selection system 1. A dictionary collecting phrases that represent events and places may be created in advance, for example by the administrator of the video selection system 1, and stored in a storage unit that the feature extraction unit 106 can access, such as a feature dictionary storage unit (not shown) or the feature storage unit 107. The feature extraction unit 106 may then read phrases from such a dictionary and store them in the feature storage unit 107 as features.

上述のように、映像には、その映像に内容が関連すると判定された、例えば歌詞などのテキストが関連付けられている。映像の内容を表す情報（例えば上述の付随情報）から抽出された単語（例えば上述の属性）は、その映像に関連付けられている、例えば歌詞などのテキストにも現れることが多い。そして、映像に関連付けられているテキストにおいて現れる、その映像の内容を表す情報から抽出された単語に係る語句は、「視覚的」、「体感的」な語句であることが、経験的に知られている。従って、素性抽出部１０６が素性として抽出する語句は、「視覚的」、「体感的」な語句である。言い換えると、素性抽出部１０６は、「視覚的」、「体感的」な語句を、素性として抽出することができる。 As described above, each video is associated with text, such as lyrics, whose content has been determined to be related to that video. A word (for example, the attribute described above) extracted from information representing the content of a video (for example, the accompanying information described above) often also appears in the text, such as lyrics, associated with that video. It is empirically known that the phrases in the associated text that depend on such a word are “visual” and “experiential” phrases. The phrases that the feature extraction unit 106 extracts as features are therefore “visual” and “experiential” phrases. In other words, the feature extraction unit 106 can extract “visual” and “experiential” phrases as features.

映像特徴生成部１０８は、映像毎に、映像に関連付けられているテキスト対してテキストマイニング処理を行うことによって、その映像の特徴量である映像特徴量を生成する。より具体的には、映像特徴生成部１０８は、映像毎に、映像に関連付けられているテキストにおいて、抽出された素性の各々を検出する。そして、映像特徴生成部１０８は、映像毎に、映像に関連付けられているテキストにおいて出現する素性を表す、映像特徴量を生成する。上述のテキストマイニング処理は、例えば、上述のように素性を抽出し、抽出された素性の各々を検出することを表す。映像特徴生成部１０８は、前述の複数の映像の全てが選択されるまで、順次映像を選択しながら、選択された映像の映像特徴量の生成を繰り返せばよい。具体的には、映像特徴生成部１０８は、例えば、映像を選択し、選択した映像に関連付けられているテキストにおいて、素性として抽出された語句（すなわち、素性）を検出すればよい。 For each video, the video feature generation unit 108 performs a text mining process on the text associated with the video, thereby generating a video feature amount that is a feature amount of the video. More specifically, the video feature generation unit 108 detects, for each video, each extracted feature in text associated with the video. Then, the video feature generation unit 108 generates a video feature amount representing the feature appearing in the text associated with the video for each video. The text mining process described above represents, for example, extracting features as described above and detecting each of the extracted features. The video feature generation unit 108 may repeat generation of the video feature amount of the selected video while sequentially selecting videos until all of the plurality of videos are selected. Specifically, for example, the video feature generation unit 108 may select a video and detect a phrase (that is, a feature) extracted as a feature in text associated with the selected video.

映像特徴生成部１０８は、素性を検出した結果に基づいて、検出された素性を表す特徴量を、映像特徴量として生成する。映像特徴量は、例えば、検出された素性が要素である集合であってもよい。映像特徴量は、例えば、抽出された全ての素性がいずれかの要素に関連付けられているベクトルによって表現されていてもよい。その場合、以下の説明では、映像特徴量を、映像特徴ベクトルとも表記する。映像特徴生成部１０８は、例えば、素性とベクトルの要素とが、１対１に関連付けられるように、素性とベクトルの要素とを関連付ければよい。映像特徴生成部１０８は、例えば、素性リストにおける素性の順で、素性と、映像特徴ベクトルの要素とを関連付ければよい。映像特徴量は、例えば、要素の値が、その要素に関連付けられている素性が出現したことを表す値（例えば１）又はその素性が出現しなかったことを表す値（例えば０）である、映像特徴ベクトルであってもよい。 Based on the result of detecting the features, the video feature generation unit 108 generates a feature amount representing the detected features as the video feature amount. The video feature amount may be, for example, a set whose elements are the detected features. The video feature amount may also be expressed, for example, by a vector in which every extracted feature is associated with one of the elements. In that case, the video feature amount is also referred to as a video feature vector in the following description. The video feature generation unit 108 may associate features with vector elements so that they correspond one-to-one, for example in the order in which the features appear in the feature list. The video feature amount may be, for example, a video feature vector in which the value of each element is either a value indicating that the feature associated with the element appeared (for example, 1) or a value indicating that the feature did not appear (for example, 0).
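A presence/absence video feature vector of this kind can be sketched as follows. The sketch assumes that detecting a feature amounts to a plain substring match, which is a simplification; the function name `presence_vector`, the sample text, and the feature 「青い」 are made up for illustration.

```python
def presence_vector(text, feature_list):
    """Return one element per feature: 1 if the feature occurs in the text, else 0.

    Elements follow the order of feature_list, so features and elements
    correspond one-to-one.
    """
    return [1 if feature in text else 0 for feature in feature_list]


feature_list = ["暖かい", "背の高い", "青い"]
text_a = "暖かい春の日に、背の高い男が歩いていた。"  # made-up text for video 1
video_vector = presence_vector(text_a, feature_list)
```

Because every vector is built against the same feature list in the same order, vectors of different videos, and of a target text, can be compared element by element.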

映像特徴生成部１０８は、選択した映像に関連付けられているテキストにおいて、素性毎に、素性の出現頻度を検出してもよい。その場合、映像特徴生成部１０８は、素性毎に検出された素性の出現頻度を表す特徴量を、映像特徴量として生成すればよい。映像特徴量は、要素が、検出された素性とその素性の出現頻度との組み合わせである、集合であってもよい。映像特徴量は、要素の値が、その要素に関連付けられている素性の出現頻度である、映像特徴ベクトルであってもよい。その場合、映像特徴生成部１０８は、複数の映像の映像特徴ベクトルの大きさが一定になるように、各映像特徴ベクトルを正規化すればよい。 The video feature generation unit 108 may detect the appearance frequency of the feature for each feature in the text associated with the selected video. In that case, the video feature generation unit 108 may generate a feature quantity representing the appearance frequency of the feature detected for each feature as the video feature quantity. The video feature amount may be a set in which elements are combinations of detected features and appearance frequencies of the features. The video feature quantity may be a video feature vector whose element value is the appearance frequency of the feature associated with the element. In that case, the video feature generation unit 108 may normalize each video feature vector so that the size of the video feature vectors of a plurality of videos is constant.

映像特徴ベクトルの大きさは、例えば、長さ（すなわち、各要素の値の２乗の和の平方根）である。映像特徴生成部１０８は、各映像の特徴ベクトルの大きさが１になるように、各映像特徴ベクトルを正規化してもよい。 The size of the video feature vector is, for example, the length (that is, the square root of the sum of the squares of the values of the respective elements). The video feature generation unit 108 may normalize each video feature vector so that the size of the feature vector of each video is 1.
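The frequency-based vector and its normalization to length 1 can be sketched as below. The sketch counts non-overlapping substring occurrences, which is an assumption about how appearance frequency is measured, and it leaves an all-zero vector unchanged, a case the description does not address; the function names and the sample text are hypothetical.

```python
import math


def frequency_vector(text, feature_list):
    """Count how often each feature occurs in the text (non-overlapping matches)."""
    return [text.count(feature) for feature in feature_list]


def l2_normalize(vector):
    """Scale the vector so that the square root of the sum of squares is 1."""
    length = math.sqrt(sum(value * value for value in vector))
    if length == 0:
        return list(vector)  # assumption: an all-zero vector is returned unchanged
    return [value / length for value in vector]


raw = frequency_vector("暖かい風、暖かい光、そして背の高い木。", ["暖かい", "背の高い", "青い"])
unit = l2_normalize(raw)
```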

映像特徴生成部１０８は、生成した、各映像の映像特徴量を、映像選択装置１１０に送信する。映像特徴生成部１０８は、複数の映像の各々について、映像ＩＤと映像特徴量とを関連付け、互いに関連付けられた映像ＩＤと映像特徴量とを、映像選択装置１１０に送信すればよい。映像特徴生成部１０８は、さらに、抽出された素性を、映像選択装置１１０に送信する。映像特徴生成部１０８は、素性記憶部１０７から素性リストを読み出し、読み出した素性リストを、映像選択装置１１０に送信すればよい。 The video feature generation unit 108 transmits the generated video feature amount of each video to the video selection device 110. For each of the plurality of videos, the video feature generation unit 108 may associate the video ID with the video feature amount and transmit the associated video ID and video feature amount to the video selection device 110. The video feature generation unit 108 further transmits the extracted features to the video selection device 110. The video feature generation unit 108 may read the feature list from the feature storage unit 107 and transmit the read feature list to the video selection device 110.

映像選択装置１１０の映像特徴受信部１１６は、映像特徴生成部１０８から、各映像の映像特徴量を受信する。映像特徴受信部１１６は、複数の映像の各々について、互いに関連付けられた映像ＩＤと映像特徴量とを、映像特徴生成部１０８から受信すればよい。映像特徴受信部１１６は、受信した、複数の映像の各々の、互いに関連付けられた映像ＩＤと映像特徴量とを、映像特徴記憶部１１７に格納する。映像特徴受信部１１６は、さらに、例えば映像特徴生成部１０８から、例えば素性リストとして、抽出された素性の集合を受信し、受信した素性の集合（例えば素性リスト）を、映像特徴記憶部１１７に格納する。 The video feature receiving unit 116 of the video selection device 110 receives the video feature amount of each video from the video feature generation unit 108. The video feature receiving unit 116 may receive, for each of the plurality of videos, the mutually associated video ID and video feature amount from the video feature generation unit 108. The video feature receiving unit 116 stores the received video IDs and video feature amounts, associated with each other, in the video feature storage unit 117. The video feature receiving unit 116 further receives the set of extracted features, for example as a feature list, from the video feature generation unit 108, and stores the received set of features (for example, the feature list) in the video feature storage unit 117.

映像特徴記憶部１１７は、複数の映像の各々の、互いに関連付けられた映像ＩＤと映像特徴量とを記憶する。映像特徴記憶部１１７は、さらに、素性の集合（例えば素性リスト）を記憶する。 The video feature storage unit 117 stores a video ID and a video feature amount associated with each other of each of the plurality of videos. The video feature storage unit 117 further stores a set of features (for example, a feature list).

対象受信部１１１は、例えば、ユーザによって指定されたテキストを、そのユーザが使用するユーザ端末（図示されない）から受信する。指定されたテキストは、例えば、ユーザが、映像選択装置１１０に、そのテキストに応じた映像を選択させるテキストである。以下の説明では、指定されたテキストを、対象テキストと表記する。対象テキストは、例えば、歌詞である。対象テキストは、例えば、ナレーション、朗読、あるいは、アナウンスなどの内容を表すテキストであってもよい。 For example, the target receiving unit 111 receives text specified by a user from a user terminal (not shown) used by the user. The designated text is, for example, text that causes the video selection device 110 to select a video corresponding to the text. In the following description, the specified text is expressed as target text. The target text is, for example, lyrics. The target text may be, for example, text representing content such as narration, reading, or announcement.

対象特徴生成部１１２は、対象テキストに対してテキストマイニング処理を行うことによって、その対象テキストの特徴量である対象特徴量を生成する。より具体的には、対象特徴生成部１１２は、対象テキストにおいて、例えば映像特徴記憶部１１７に格納されている素性の集合（例えば素性リスト）に含まれる素性を検出する。そして対象特徴生成部１１２は、素性を検出した結果に基づいて、対象テキストにおいて出現する素性を表す特徴量である、対象特徴量を生成する。 The target feature generation unit 112 generates a target feature amount that is a feature amount of the target text by performing a text mining process on the target text. More specifically, the target feature generation unit 112 detects features included in a set of features (for example, a feature list) stored in the video feature storage unit 117, for example, in the target text. Then, the target feature generation unit 112 generates a target feature amount, which is a feature amount representing a feature that appears in the target text, based on the result of detecting the feature.

映像特徴生成部１０８が生成する映像特徴量が、素性毎に、映像に関連付けられているテキストにおいて素性が出現するか否かを表す特徴量である場合、対象特徴生成部１１２は、対象テキストにおいて、各素性が出現するか否かを検出すればよい。そして、対象特徴生成部１１２は、素性が出現するか否かを、素性毎に表す対象特徴量を生成すればよい。対象特徴量は、検出された素性の集合であってもよい。対象特徴量は、各要素の値が、その要素に関連付けられている素性が出現したことを表す値（例えば１）、又は、その要素に関連付けられている素性が出現しなかったことを表す値（例えば０）であるベクトル（対象特徴ベクトル）であってもよい。 When the video feature amount generated by the video feature generation unit 108 indicates, for each feature, whether or not the feature appears in the text associated with the video, the target feature generation unit 112 may detect whether or not each feature appears in the target text. The target feature generation unit 112 may then generate a target feature amount that indicates, for each feature, whether or not the feature appears. The target feature amount may be a set of the detected features. The target feature amount may also be a vector (target feature vector) in which the value of each element is either a value indicating that the feature associated with the element appeared (for example, 1) or a value indicating that the feature associated with the element did not appear (for example, 0).

映像特徴量が素性毎の素性の出現頻度を表す場合、対象特徴生成部１１２は、対象テキストにおいて、素性毎の素性の出現頻度を検出すればよい。そして、対象特徴生成部１１２は、各素性の出現頻度を表す対象特徴量を生成すればよい。対象特徴量は、例えば、要素が、素性とその素性の出現頻度との組み合わせである、集合であってもよい。対象特徴量は、各要素の値が、その要素に関連付けられている素性の出現頻度を表すベクトル（対象特徴ベクトル）であってもよい。対象特徴生成部１１２は、対象特徴ベクトルを正規化してもよい。対象特徴生成部１１２は、対象特徴ベクトルを正規化しなくてもよい。 When the video feature amount represents the appearance frequency of the feature for each feature, the target feature generation unit 112 may detect the appearance frequency of the feature for each feature in the target text. Then, the target feature generation unit 112 may generate a target feature amount representing the appearance frequency of each feature. The target feature amount may be, for example, a set in which an element is a combination of a feature and an appearance frequency of the feature. The target feature amount may be a vector (target feature vector) in which the value of each element represents the appearance frequency of a feature associated with the element. The target feature generation unit 112 may normalize the target feature vector. The target feature generation unit 112 may not normalize the target feature vector.

類似度導出部１１３は、複数の映像の各々について、映像の映像特徴量に対する、対象特徴量の類似の程度を示す指標である、類似度を導出する。すなわち、類似度導出部１１３が導出する類似度は、映像特徴量と対象特徴量とが、どの程度類似しているかを示す指標である。以下の説明では、２つの特徴量が類似していることを、類似性が高いと表記する。２つの特徴量が類似していないことを、類似性が低いと表記する。２つの特徴量が類似する程度を、類似性の高さと表記する。類似度は、類似性が高いほど大きくてもよい。類似度は、類似性が高いほど小さくてもよい。 The degree-of-similarity deriving unit 113 derives the degree of similarity, which is an index indicating the degree of similarity of the target feature amount with respect to the image feature amount of the image, for each of the plurality of images. That is, the similarity derived by the similarity deriving unit 113 is an index indicating how similar the video feature quantity and the target feature quantity are. In the following description, the fact that two feature quantities are similar is expressed as high similarity. The fact that the two feature quantities are not similar is expressed as low similarity. The degree to which two feature quantities are similar is expressed as high similarity. The similarity may be larger as the similarity is higher. The similarity may be smaller as the similarity is higher.

類似度は、例えば、式１によって表される、コサイン類似度である。類似度がコサイン類似度である場合、類似度の値が大きいほど、類似性が高い。式１及び以下で示す式において、「×」は掛け算を表す識別子である。ベクトルｑは、対象特徴ベクトルであり、ｑ_ｉは対象特徴ベクトルのｉ番目の要素である。ベクトルｄ^ｘは、ｘ番目の映像の映像特徴ベクトルであり、ｄ^ｘ _ｉは、ベクトルｄ^ｘのｉ番目の要素である。また、映像の数はＮ（Ｎは自然数）である。 The similarity is, for example, the cosine similarity expressed by Equation 1. When the similarity is the cosine similarity, a larger similarity value indicates higher similarity. In Equation 1 and the equations below, “×” is the symbol for multiplication. The vector q is the target feature vector, and q_i is the i-th element of the target feature vector. The vector d^x is the video feature vector of the x-th video, and d^x_i is the i-th element of the vector d^x. The number of videos is N (N is a natural number).
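Equation 1 itself is not reproduced in this text. Written with the symbols defined above, the standard definition of cosine similarity would read as follows; the symbol M for the number of elements in each feature vector is introduced here for illustration and does not appear in the source.

```latex
\mathrm{sim}(q, d^{x}) =
  \frac{\sum_{i=1}^{M} q_{i} \times d^{x}_{i}}
       {\sqrt{\sum_{i=1}^{M} q_{i}^{2}} \times \sqrt{\sum_{i=1}^{M} \left(d^{x}_{i}\right)^{2}}},
  \qquad x = 1, \dots, N
```

When both vectors have already been normalized to length 1, the denominator equals 1 and the similarity reduces to the inner product of the two vectors.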

類似度は、例えば、式２によって表される、ユークリッド距離であってもよい。類似度がユークリッド距離である場合、類似度の値が小さいほど、類似性が高い。類似度は、ユークリッド距離の逆数であってもよい。その場合、ユークリッド距離が０である場合、類似度導出部１１３は、例えば、類似度導出部１１３が処理できる最大の数値を、類似度として設定すればよい。類似度がユークリッド距離の逆数である場合、類似度の値が大きいほど、類似性が高い。式２において、「ｔ」はベクトルの転置を表し、「＊」はベクトルの積（内積）を表す。式２において、各ベクトルは行ベクトルである。 The similarity may be, for example, a Euclidean distance expressed by Equation 2. When the similarity is the Euclidean distance, the similarity is higher as the similarity value is smaller. The similarity may be a reciprocal of the Euclidean distance. In this case, when the Euclidean distance is 0, the similarity deriving unit 113 may set, for example, the maximum numerical value that can be processed by the similarity deriving unit 113 as the similarity. When the similarity is the reciprocal of the Euclidean distance, the similarity is higher as the similarity value is larger. In Equation 2, “t” represents transposition of a vector, and “*” represents a product (inner product) of vectors. In Equation 2, each vector is a row vector.
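Equation 2 is likewise not reproduced in this text. In the notation just described it would presumably be the square root of the inner product of (q − d^x) with its own transpose, which is the usual Euclidean distance. The sketch below computes that distance and the reciprocal variant, using the largest finite float as a stand-in for "the maximum numerical value that can be processed"; that choice and the function names are assumptions.

```python
import math
import sys


def euclidean_distance(q, d):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


def reciprocal_similarity(q, d):
    """Reciprocal of the Euclidean distance; larger means more similar."""
    distance = euclidean_distance(q, d)
    if distance == 0:
        # Identical vectors: use the largest representable float (assumption).
        return sys.float_info.max
    return 1.0 / distance
```

With the plain distance a smaller value means higher similarity, whereas with the reciprocal a larger value does, so the selection step has to know which convention is in use.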

類似度は、式３によって表される、ジャッカード係数であってもよい。ジャッカード係数は、２つの特徴ベクトルの０ではない共通の要素の数を、それらの特徴ベクトルの少なくとも一方の要素が０ではない要素の数で割ることによって得られる値である。類似度がジャッカード係数である場合、類似度の値が大きいほど、類似性が高い。式３において、|Q∩D^x|は、ベクトルｑとベクトルｄ^ｘの、値が０でない共通の要素の数を表す。|Q∪D^x|は、ベクトルｑとベクトルｄ^ｘの少なくとも一方の要素の値が０ではない要素の数である。Qは、例えば、ベクトルｑの、値が０でない要素の番号の集合である。D^xは、例えば、ベクトルｄ^ｘの、値が０でない要素の番号の集合である。「∩」は、積集合を表す。「∪」は和集合を表す。 The similarity may be the Jaccard coefficient expressed by Equation 3. The Jaccard coefficient is the value obtained by dividing the number of elements that are non-zero in both feature vectors by the number of elements that are non-zero in at least one of the two feature vectors. When the similarity is the Jaccard coefficient, a larger similarity value indicates higher similarity. In Equation 3, |Q∩D^x| is the number of elements whose values are non-zero in both the vector q and the vector d^x, and |Q∪D^x| is the number of elements whose values are non-zero in at least one of the vector q and the vector d^x. Q is, for example, the set of indices of the non-zero elements of the vector q. D^x is, for example, the set of indices of the non-zero elements of the vector d^x. “∩” denotes the intersection and “∪” denotes the union.

類似度=|Q∩D^x|/|Q∪D^x| ・・・（式３） Similarity = |Q∩D^x| / |Q∪D^x| (Equation 3)

映像選択部１１４は、導出された類似度が、類似性が高いことを表す映像を選択する。映像選択部１１４は、複数の映像の各々について算出された類似度から、類似性が最も高いことを表す類似度を選択すればよい。そして、映像選択部１１４は、選択された類似度の導出に使用された映像特徴量を持つ（すなわち映像特徴量に関連付けられている）映像を選択すればよい。 The video selection unit 114 selects a video whose derived similarity indicates high similarity. The video selection unit 114 may select, from the similarities calculated for the plurality of videos, the similarity that indicates the highest similarity. The video selection unit 114 may then select the video that has (that is, is associated with) the video feature amount used to derive the selected similarity.
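The Jaccard coefficient of Equation 3 and the selection of the most similar video can be sketched together as follows. The function names, the `larger_is_more_similar` flag, the value returned when both vectors are all zero, and the sample vectors are assumptions made for illustration.

```python
def jaccard(q, d):
    """|Q ∩ D| / |Q ∪ D| over the indices of the non-zero elements."""
    q_idx = {i for i, value in enumerate(q) if value != 0}
    d_idx = {i for i, value in enumerate(d) if value != 0}
    union = q_idx | d_idx
    if not union:
        return 0.0  # assumption: two all-zero vectors are treated as dissimilar
    return len(q_idx & d_idx) / len(union)


def select_video(target_vector, video_vectors, similarity=jaccard,
                 larger_is_more_similar=True):
    """Return the video ID whose similarity indicates the highest similarity."""
    pick = max if larger_is_more_similar else min
    return pick(video_vectors,
                key=lambda video_id: similarity(target_vector, video_vectors[video_id]))


video_vectors = {
    "video1": [1, 1, 0, 0],
    "video2": [0, 1, 1, 0],
    "video3": [0, 0, 0, 1],
}
best = select_video([0, 1, 1, 1], video_vectors)
```

For a distance-type similarity such as the Euclidean distance, `larger_is_more_similar=False` makes the same function pick the smallest value instead.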

対象受信部１１１は、対象テキストに加えて、排除キーワードを受信してもよい。排除キーワードは、１つ以上の単語である。その場合、映像選択部１１４は、付随情報に排除キーワードのいずれかが含まれる映像を、導出される類似度の値にかかわらず、選択しない。また、その場合、例えば映像記憶部１１９が、付随情報記憶部１０２が記憶する付随情報と同じ付随情報を記憶していればよい。例えば、映像受信部１１８が、複数の映像を記憶する映像サーバ（図示されない）、ユーザ端末（図示されない）、又は特徴生成装置１００などから、その付随情報を受信すればよい。そして映像受信部１１８が、受信した付随情報を映像記憶部１１９に格納すればよい。 The target receiving unit 111 may receive an exclusion keyword in addition to the target text. An exclusion keyword is one or more words. In that case, the video selection unit 114 does not select a video including any of the exclusion keywords in the accompanying information regardless of the derived similarity value. In that case, for example, the video storage unit 119 may store the same accompanying information as the accompanying information stored in the accompanying information storage unit 102. For example, the video reception unit 118 may receive the accompanying information from a video server (not shown) that stores a plurality of videos, a user terminal (not shown), the feature generation device 100, or the like. The video receiving unit 118 may store the received accompanying information in the video storage unit 119.
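The exclusion-keyword rule can be sketched as a filter applied before the selection. The sketch assumes that the accompanying information of each video is available as a collection of words keyed by video ID and that a larger similarity value means higher similarity; the function name and all sample data are hypothetical.

```python
def select_with_exclusions(similarities, accompanying_info, exclusion_keywords):
    """Pick the most similar video among those whose accompanying information
    contains none of the exclusion keywords; return None if none remains."""
    excluded = set(exclusion_keywords)
    candidates = {
        video_id: score
        for video_id, score in similarities.items()
        if not excluded & set(accompanying_info.get(video_id, []))
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


similarities = {"video1": 0.9, "video2": 0.7, "video3": 0.4}
accompanying_info = {
    "video1": ["夏", "海"],
    "video2": ["春", "男"],
    "video3": ["冬"],
}
chosen = select_with_exclusions(similarities, accompanying_info, ["海"])
```

Here video1 has the highest similarity but is skipped because its accompanying information contains the exclusion keyword, so video2 is chosen.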

出力部１１５は、映像選択部１１４による選択の結果を表すデータを出力する。選択の結果を表すデータは、例えば、選択された映像の映像ＩＤである。 The output unit 115 outputs data representing the result of selection by the video selection unit 114. The data representing the selection result is, for example, the video ID of the selected video.

前述のように、映像選択装置１１０は、映像受信部１１８と映像記憶部１１９とを含んでいてもよい。 As described above, the video selection device 110 may include the video reception unit 118 and the video storage unit 119.

映像受信部１１８は、前述の複数の映像を記憶する映像サーバ（図示されない）から、それらの複数の映像の映像データを受信する。そして、映像受信部１１８は、受信した映像データを、映像記憶部１１９に格納する。映像記憶部１１９は、映像受信部１１８によって格納された、複数の映像の映像データを記憶する。 The video reception unit 118 receives video data of the plurality of videos from a video server (not shown) that stores the plurality of videos. Then, the video receiving unit 118 stores the received video data in the video storage unit 119. The video storage unit 119 stores video data of a plurality of videos stored by the video receiving unit 118.

映像記憶部１１９が複数の映像データを記憶している場合、出力部１１５は、映像選択部１１４によって選択された映像の映像データを出力してもよい。 When the video storage unit 119 stores a plurality of video data, the output unit 115 may output the video data of the video selected by the video selection unit 114.

次に、本実施形態の映像選択システム１の動作について、図面を参照して詳細に説明する。 Next, the operation of the video selection system 1 of the present embodiment will be described in detail with reference to the drawings.

図３は、特徴生成装置１００の、複数のテキストと付随情報とに基づいて素性を抽出する動作の例を表すフローチャートである。 FIG. 3 is a flowchart showing an example of an operation of extracting features based on a plurality of texts and accompanying information in the feature generation device 100.

まず、教師データ受信部１０３が、映像に関連付けられた１つ以上のテキストを、複数の映像の各々について受信する（ステップＳ１０１）。映像に関連付けられた１つ以上のテキストは、教師データとも表記される。教師データ受信部１０３は、受信した教師データを、教師データ記憶部１０４に格納する（ステップＳ１０２）。 First, the teacher data receiving unit 103 receives one or more texts associated with a video for each of a plurality of videos (step S101). One or more texts associated with the video are also described as teacher data. The teacher data receiving unit 103 stores the received teacher data in the teacher data storage unit 104 (step S102).

次に、付随情報受信部１０１が、付随情報を受信する（ステップＳ１０３）。図２は、付随情報受信部１０１が受信する、付随情報の例を表す。付随情報受信部１０１は、複数の映像の各々について、図２に例示する付随情報を受信する。付随情報受信部１０１は、受信した付随情報を、付随情報記憶部１０２に格納する（ステップＳ１０４）。 Next, the accompanying information receiving unit 101 receives the accompanying information (step S103). FIG. 2 shows an example of the accompanying information received by the accompanying information receiving unit 101. The accompanying information receiving unit 101 receives the accompanying information illustrated in FIG. 2 for each of the plurality of videos. The accompanying information receiving unit 101 stores the received accompanying information in the accompanying information storage unit 102 (step S104).

ステップＳ１０３及びステップＳ１０４の動作は、ステップＳ１０１及びステップＳ１０２の動作より前に行われてもよい。ステップＳ１０３及びステップＳ１０４の動作は、ステップＳ１０１及びステップＳ１０２の動作と並列に行われてもよい。 The operations in steps S103 and S104 may be performed before the operations in steps S101 and S102. The operations in step S103 and step S104 may be performed in parallel with the operations in step S101 and step S102.

次に、属性抽出部１０５は、付随情報記憶部１０２に格納されている付随情報から、属性を抽出する（ステップＳ１０５）。 Next, the attribute extraction unit 105 extracts attributes from the accompanying information stored in the accompanying information storage unit 102 (step S105).

次に、素性抽出部１０６が、属性抽出部１０５が抽出した属性を修飾する語句を、素性として、学習データから抽出する（ステップＳ１０６）。 Next, the feature extraction unit 106 extracts, as features, phrases that modify the attributes extracted by the attribute extraction unit 105 from the learning data, that is, the teacher data stored in the teacher data storage unit 104 (step S106).

図４は、抽出される素性の例を模式的に表す図である。図４に示す例では、映像１とテキストＡが関連付けられている。映像２とテキストＢが関連付けられている。映像３とテキストＣが関連付けられている。右側のブロック内の語句は、それらのテキストの一部を模式的に表す。これらのブロック内で、丸括弧に囲まれている単語が、属性抽出部１０５によって抽出された属性のうち、テキストに含まれる属性である。下線が付されている語句が、属性に係る語句として抽出された素性である。例えば、テキストＡは、属性として、「春」と「男」とを含んでいる。素性抽出部１０６は、「春」に係る素性として、「暖かい」を抽出する。素性抽出部１０６は、さらに、「男」に係る素性として「背の高い」を抽出する。 FIG. 4 is a diagram schematically illustrating an example of extracted features. In the example shown in FIG. 4, video 1 is associated with text A, video 2 with text B, and video 3 with text C. The phrases in the blocks on the right schematically represent part of each text. In these blocks, the words enclosed in parentheses are those attributes extracted by the attribute extraction unit 105 that appear in the text. The underlined phrases are the features extracted as phrases depending on an attribute. For example, text A includes “春” (spring) and “男” (man) as attributes. The feature extraction unit 106 extracts “暖かい” (warm) as the feature depending on “春”, and further extracts “背の高い” (tall) as the feature depending on “男”.

図５は、抽出された素性の例を模式的に表す図である。図５において、「属性」は、いずれかのテキストにおいて検出された属性である。図５において、「素性」は、検出された属性に係る語句として抽出された素性である。 FIG. 5 is a diagram schematically illustrating an example of extracted features. In FIG. 5, “attribute” is an attribute detected in any text. In FIG. 5, “feature” is a feature extracted as a phrase related to a detected attribute.

次に、素性抽出部１０６は、抽出された素性のリストである素性リストを生成する（ステップＳ１０７）。 Next, the feature extraction unit 106 generates a feature list that is a list of extracted features (step S107).

図６は、素性リストの例を模式的に表す図である。図６において、「属性」は、いずれかのテキストにおいて検出された属性である。図６において、「素性」は、検出された属性に係る語句として抽出された素性である。図６において、太い線によって囲まれている部分が素性リストを表す。 FIG. 6 is a diagram schematically illustrating an example of a feature list. In FIG. 6, “attribute” is an attribute detected in any text. In FIG. 6, “feature” is a feature extracted as a phrase related to a detected attribute. In FIG. 6, a part surrounded by a thick line represents a feature list.

次に、素性抽出部１０６は、生成した素性リストを、素性記憶部１０７に格納する（ステップＳ１０８）。 Next, the feature extraction unit 106 stores the generated feature list in the feature storage unit 107 (step S108).
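Steps S107 and S108 can be pictured as flattening the attribute-to-feature table of FIG. 5 into one ordered list. The sketch below removes duplicates so that each feature can later map to exactly one vector element; the deduplication, the function name, and the 「海」 entry are assumptions, while the 「春」 and 「男」 entries follow the example of FIG. 4.

```python
def build_feature_list(features_by_attribute):
    """Flatten the features extracted per attribute into one ordered list
    without duplicates, keeping the order of first appearance."""
    ordered = []
    for features in features_by_attribute.values():
        ordered.extend(features)
    # dict.fromkeys keeps the first occurrence of each feature and preserves order.
    return list(dict.fromkeys(ordered))


features_by_attribute = {
    "春": ["暖かい"],
    "男": ["背の高い"],
    "海": ["青い", "暖かい"],  # made-up entry to show deduplication
}
feature_list = build_feature_list(features_by_attribute)
```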

次に、本実施形態の特徴生成装置１００の、映像特徴を生成する動作について、図面を参照して詳細に説明する。 Next, an operation for generating video features of the feature generation device 100 according to the present embodiment will be described in detail with reference to the drawings.

図７は、本実施形態の特徴生成装置１００の、映像特徴を生成する動作の例を表すフローチャートである。 FIG. 7 is a flowchart illustrating an example of an operation of generating a video feature of the feature generation device 100 according to the present embodiment.

図７を参照すると、映像特徴生成部１０８は、素性抽出部１０６によって抽出された素性のリストである素性リストを、素性記憶部１０７から読み出す（ステップＳ１１１）。 Referring to FIG. 7, the video feature generation unit 108 reads the feature list, which is a list of the features extracted by the feature extraction unit 106, from the feature storage unit 107 (step S111).

次に、映像特徴生成部１０８は、映像に関連付けられているテキストを、映像毎に特定する（ステップＳ１１２）。 Next, the video feature generation unit 108 identifies text associated with the video for each video (step S112).

図８は、映像毎の、映像に関連付けられているテキストの例を模式的に表す図である。上述のように映像ＩＤは映像を特定する識別子である。図８において、「テキスト」は、映像ＩＤによって特定されるそれぞれの映像に関連付けられている、１つ以上のテキストの識別子を表す。 FIG. 8 is a diagram schematically illustrating an example of text associated with a video for each video. As described above, the video ID is an identifier for specifying a video. In FIG. 8, “text” represents an identifier of one or more texts associated with each video specified by the video ID.

映像特徴生成部１０８は、映像毎に、映像に関連付けられているテキストにおいて、素性リストに含まれる素性を検出する（ステップＳ１１３）。映像特徴生成部１０８は、素性を検出した結果に基づいて、映像に関連付けられているテキストに出現する素性を表す映像特徴量（例えば映像特徴ベクトル）を、映像毎に生成する（ステップＳ１１４）。 For each video, the video feature generation unit 108 detects a feature included in the feature list in the text associated with the video (step S113). The video feature generation unit 108 generates, for each video, a video feature amount (for example, a video feature vector) representing a feature appearing in the text associated with the video based on the result of detecting the feature (step S114).

図９は、映像特徴ベクトル例を模式的に表す図である。図９において、例えば、太い線によって描かれている四角形によって囲まれている部分が、映像１の映像特徴ベクトルを表す。映像１の特徴ベクトルの下の、２つの段に示す数値の列が、映像２及び映像３の映像特徴ベクトルを表す。図９に示す映像特徴ベクトルの各要素の値は、映像に関連付けられているテキストにおける、素性の出現頻度を表す。図９に示す例では、各映像特徴ベクトルは正規化されていない。 FIG. 9 is a diagram schematically illustrating an example of a video feature vector. In FIG. 9, for example, a portion surrounded by a rectangle drawn by a thick line represents a video feature vector of video 1. Numerical value columns shown in two stages below the feature vector of the video 1 represent the video feature vectors of the video 2 and the video 3. The value of each element of the video feature vector shown in FIG. 9 represents the appearance frequency of the feature in the text associated with the video. In the example shown in FIG. 9, each video feature vector is not normalized.

映像特徴生成部１０８は、さらに、各映像の映像特徴ベクトルの大きさが同じになるように、各映像特徴ベクトルを正規化すればよい。映像特徴生成部１０８ではなく、例えば、映像選択装置１１０の類似度導出部１１３が、各映像特徴ベクトルを正規化してもよい。 The video feature generation unit 108 may further normalize each video feature vector so that the video feature vectors of each video have the same size. For example, the similarity deriving unit 113 of the video selection device 110 may normalize each video feature vector instead of the video feature generation unit 108.

映像特徴生成部１０８は、映像毎に生成した映像特徴ベクトルと、素性リストとを、映像選択装置１１０に送信する（ステップＳ１１５）。 The video feature generation unit 108 transmits the video feature vector generated for each video and the feature list to the video selection device 110 (step S115).

次に、本実施形態の映像選択装置１１０の動作について、図面を参照して詳細に説明する。まず、本実施形態の映像選択装置１１０の、映像特徴ベクトルを受信する動作について説明する。 Next, the operation of the video selection device 110 of this embodiment will be described in detail with reference to the drawings. First, an operation of receiving a video feature vector of the video selection device 110 according to the present embodiment will be described.

図１０は、本実施形態の映像選択装置１１０の、映像特徴ベクトルを受信する動作の例を表すフローチャートである。 FIG. 10 is a flowchart illustrating an example of an operation of receiving a video feature vector of the video selection device 110 according to the present embodiment.

図１０を参照すると、映像特徴受信部１１６が、特徴生成装置１００の映像特徴生成部１０８から、映像特徴量（例えば映像特徴ベクトル）と、素性リストとを受信する（ステップＳ２０１）。映像特徴受信部１１６は、受信した映像特徴量と素性リストとを、映像特徴記憶部１１７に格納する（ステップＳ２０２）。 Referring to FIG. 10, the video feature receiving unit 116 receives a video feature amount (for example, a video feature vector) and a feature list from the video feature generating unit 108 of the feature generating device 100 (step S201). The video feature receiving unit 116 stores the received video feature quantity and feature list in the video feature storage unit 117 (step S202).

次に、本実施形態の映像選択装置１１０の、対象テキストを受信するのに応じて映像を選択する動作について説明する。 Next, the operation in which the video selection device 110 of the present embodiment selects a video in response to receiving a target text will be described.

図１１は、本実施形態の映像選択装置１１０の、対象テキストを受信するのに応じて映像を選択する動作の例を表すフローチャートである。 FIG. 11 is a flowchart showing an example of an operation of selecting a video in response to receiving the target text in the video selection device 110 of the present embodiment.

図１１を参照すると、まず、対象受信部１１１が、対象テキストを受信する（ステップＳ２１１）。対象受信部１１１は、例えば、コンテンツ配信サーバから、例えば、音声コンテンツの内容を表す対象テキストを受信してもよい。 Referring to FIG. 11, first, the target receiving unit 111 receives a target text (step S211). The target receiving unit 111 may receive, for example from a content distribution server, a target text that represents, for example, the content of audio content.

次に、対象特徴生成部１１２は、対象受信部１１１が受信した対象テキストにおいて、素性抽出部１０６によって抽出された素性のリストである素性リストに含まれる素性を抽出する（ステップＳ２１２）。前述のように、素性リストは、例えば、映像特徴受信部１１６によって映像特徴記憶部１１７に格納されている。 Next, the target feature generation unit 112 extracts features included in the feature list, which is a list of features extracted by the feature extraction unit 106, from the target text received by the target reception unit 111 (step S212). As described above, the feature list is stored in the video feature storage unit 117 by the video feature receiving unit 116, for example.

対象特徴生成部１１２は、素性を抽出した結果に基づいて、対象テキストに出現する素性を表す、対象特徴量（例えば対象特徴ベクトル）を生成する（ステップＳ２１３）。 The target feature generation unit 112 generates a target feature amount (for example, a target feature vector) that represents a feature that appears in the target text based on the result of extracting the features (step S213).

図１２は、対象特徴ベクトルの例を模式的に表す図である。図１２に示す数値列が、対処特徴ベクトルを表す。図１２に示す対象特徴ベクトルは、対象テキストにおける、素性リストに含まれる各素性の出現頻度を表す。 FIG. 12 is a diagram schematically illustrating an example of a target feature vector. The sequence of numbers shown in FIG. 12 represents the target feature vector. The target feature vector shown in FIG. 12 represents the appearance frequency, in the target text, of each feature included in the feature list.
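Steps S212 and S213 can be sketched as below. This is only an illustration: it counts features by plain substring matching on English text, whereas a real system for Japanese text would likely need morphological analysis. The function name `feature_vector` and the sample data are hypothetical.

```python
def feature_vector(text, feature_list):
    """Return the appearance frequency of each listed feature in the text.

    Counting uses naive, non-overlapping substring matching (an assumption
    made for brevity; it would also count "cold" inside "scold").
    """
    return [text.count(feature) for feature in feature_list]


features = ["blue", "sunny", "cold"]                # hypothetical feature list
target = "a blue sky, a blue sea, and a sunny day"  # hypothetical target text
print(feature_vector(target, features))             # [2, 1, 0]
```

The order of the elements follows the order of the feature list, so vectors built from different texts line up element by element.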

次に、類似度導出部１１３が、映像特徴量の各々に対する、対象特徴量の類似の程度を表す類似度を算出する（ステップＳ２１４）。 Next, the similarity degree deriving unit 113 calculates a degree of similarity representing the degree of similarity of the target feature amount with respect to each of the video feature amounts (step S214).

図１３は、類似度導出部１１３が類似度を導出する、対象特徴ベクトル、及び、映像特徴ベクトルの例を模式的に表す図である。図１３に示す例では、対象特徴量は、対象特徴ベクトルである。映像特徴量は、映像特徴ベクトルである。また、図１３に示す例では、対象特徴ベクトル、及び、各映像特徴ベクトルは、正規化されている。前述のように対象特徴ベクトルは、正規化されていなくてもよい。類似度導出部１１３は、映像特徴ベクトルの各々について、対象特徴ベクトルと映像特徴ベクトルとの間の類似性の高さを表す類似度を導出する。 FIG. 13 is a diagram schematically illustrating examples of the target feature vector and the video feature vectors for which the similarity deriving unit 113 derives similarities. In the example shown in FIG. 13, the target feature amount is a target feature vector, and each video feature amount is a video feature vector. Also in this example, the target feature vector and each video feature vector are normalized. As described above, the target feature vector need not be normalized. For each video feature vector, the similarity deriving unit 113 derives a similarity representing how similar the target feature vector and that video feature vector are.

図１４は、類似度の例を模式的に表す図である。図１４は、図１３に示す各映像の映像特徴ベクトルと、対象特徴ベクトルとの間の類似性の高さを表す類似度である。図１４に示す例では、類似度はコサイン類似度である。従って、類似度の値が大きいほど、類似性が高い。 FIG. 14 is a diagram schematically illustrating an example of similarity. FIG. 14 shows the similarity indicating the level of similarity between the video feature vector of each video shown in FIG. 13 and the target feature vector. In the example shown in FIG. 14, the similarity is a cosine similarity. Therefore, the greater the similarity value, the higher the similarity.
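The cosine similarity named above can be computed as in the following sketch. The function name and the handling of all-zero vectors (returning 0.0) are assumptions; the description does not say how an empty vector should be treated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a vector with no features matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

For non-negative frequency vectors the result lies between 0 and 1. Because the norms appear in the denominator, the measure gives the same value whether or not the vectors were normalized beforehand.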

映像選択部１１４は、算出した類似度を使用して、対象特徴量に類似する映像特徴量を選択する（ステップＳ２１５）。映像選択部１１４は、類似度が、類似性が最も高いことを表す映像特徴量を選択すればよい。図１４に示す例では、類似度の値が最も大きい映像特徴量が、対象特徴量に最も良く類似する映像特徴量である。そして、映像３の映像特徴量が、対象特徴量に最も良く類似する。すなわち、図１４に示す例では、映像３の映像特徴量と対象特徴量との類似性が最も高い。 The video selection unit 114 uses the calculated similarities to select a video feature amount that is similar to the target feature amount (step S215). The video selection unit 114 may select the video feature amount whose similarity indicates the highest degree of similarity. In the example shown in FIG. 14, the video feature amount with the largest similarity value is the one most similar to the target feature amount, and that is the video feature amount of video 3. That is, in the example shown in FIG. 14, the similarity between the video feature amount of video 3 and the target feature amount is the highest.

映像選択部１１４は、選択された映像特徴量に関連する映像を選択する（ステップＳ２１６）。図１４に示す例では、選択された、映像３の映像特徴量に関連する映像は、映像３である。映像選択部１１４は、映像３を選択する。 The video selection unit 114 selects a video related to the selected video feature amount (step S216). In the example illustrated in FIG. 14, the selected video related to the video feature amount of the video 3 is the video 3. The video selection unit 114 selects the video 3.
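Steps S215 and S216 amount to picking the video whose similarity is largest. A minimal sketch follows; the similarity values are made up for illustration and are not the values in FIG. 14, and the function name is hypothetical.

```python
def select_video(similarities):
    """Return the video ID whose similarity value is the largest.

    `similarities` maps a video ID to its similarity with the target text.
    On a tie, the first ID in dictionary order is returned (an assumption).
    """
    return max(similarities, key=similarities.get)


# Hypothetical similarity values for three videos.
similarities = {"video1": 0.12, "video2": 0.45, "video3": 0.83}
print(select_video(similarities))  # video3
```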

出力部１１５は、映像選択部１１４による選択の結果を出力する（ステップＳ２１７）。出力部１１５は、映像選択部１１４による選択の結果として、例えば、選択された映像の識別子（すなわち映像ＩＤ）を出力すればよい。図１４に示す例では、出力部１１５は、選択された映像である映像３の映像ＩＤを出力すればよい。出力部１１５は、例えば、対象受信部１１１に対象テキストを送信した装置に、選択された映像の映像ＩＤを出力すればよい。対象受信部１１１が、コンテンツ配信サーバ（図示されない）から対象テキストを受信した場合、出力部１１５は、そのコンテンツ配信サーバに、選択された映像の映像ＩＤを送信すればよい。コンテンツ配信サーバは、出力部１１５から映像ＩＤを受信する。 The output unit 115 outputs the result of selection by the video selection unit 114 (step S217). The output unit 115 may output, for example, an identifier of the selected video (that is, a video ID) as a result of selection by the video selection unit 114. In the example illustrated in FIG. 14, the output unit 115 may output the video ID of the video 3 that is the selected video. For example, the output unit 115 may output the video ID of the selected video to the device that has transmitted the target text to the target receiving unit 111. When the target receiving unit 111 receives the target text from a content distribution server (not shown), the output unit 115 may transmit the video ID of the selected video to the content distribution server. The content distribution server receives the video ID from the output unit 115.

例えば、コンテンツ配信サーバが、カラオケの楽曲と映像とを配信するカラオケサーバである場合、コンテンツ配信サーバは、楽曲の配信の要求を受信するのに応じて、その楽曲の歌詞である対象テキストを、映像選択装置１１０に送信すればよい。映像選択装置１１０は、送信された対象テキストに対して選択した映像ＩＤをコンテンツ配信サーバに送信する。コンテンツ配信サーバは、受信した映像ＩＤが表す映像を特定する。そして、コンテンツ配信サーバは、対象テキストが歌詞である楽曲と、受信した映像ＩＤによって表される映像とを、例えば、その楽曲の配信を要求した端末に配信すればよい。コンテンツ配信サーバは、あらかじめ、配信することができる複数の楽曲について、歌詞を対象テキストとして映像選択装置１１０に送信しておいてもよい。そして、コンテンツ配信サーバは、あらかじめ、選択された映像の映像ＩＤを受信しておいてもよい。コンテンツ配信サーバは、あらかじめ、楽曲の識別子である楽曲ＩＤと、その楽曲の歌詞が対象テキストである場合に選択された映像の映像ＩＤとを、記憶領域（図示されない）に記憶しておいてもよい。そして、コンテンツ配信サーバは、楽曲の配信を要求されるのに応じて、配信を要求された楽曲の歌詞に対して選択された映像の映像ＩＤを読み出せばよい。そして、コンテンツ配信サーバは、配信を要求された楽曲と、読み出した映像ＩＤが表す映像とを、楽曲の配信を要求した端末に配信すればよい。 For example, when the content distribution server is a karaoke server that distributes karaoke songs and videos, the content distribution server may, in response to receiving a request for distribution of a song, transmit the target text, which is the lyrics of that song, to the video selection device 110. The video selection device 110 transmits the video ID selected for the transmitted target text to the content distribution server. The content distribution server identifies the video represented by the received video ID. The content distribution server may then distribute the song whose lyrics are the target text and the video represented by the received video ID to, for example, the terminal that requested distribution of the song. The content distribution server may transmit, in advance, the lyrics of a plurality of distributable songs to the video selection device 110 as target texts, and may receive the video IDs of the selected videos in advance. The content distribution server may store in advance, in a storage area (not shown), a song ID, which is the identifier of a song, together with the video ID of the video selected when the lyrics of that song were the target text. Then, in response to a request for distribution of a song, the content distribution server may read the video ID of the video selected for the lyrics of the requested song, and may distribute the requested song and the video represented by the read video ID to the terminal that requested distribution of the song.
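The two server-side options described above (asking for a video ID at request time, or looking up a video ID stored in advance) can be sketched as a small cache. Everything here is hypothetical: the class name `VideoIdCache`, its methods, and the `select_video_id` callable, which stands in for a round trip to the video selection device 110.

```python
class VideoIdCache:
    """Maps song IDs to video IDs chosen for the songs' lyrics."""

    def __init__(self, select_video_id):
        # select_video_id: callable taking lyrics and returning a video ID.
        self._select_video_id = select_video_id
        self._video_id_by_song = {}

    def precompute(self, lyrics_by_song):
        """Store video IDs in advance for every distributable song."""
        for song_id, lyrics in lyrics_by_song.items():
            self._video_id_by_song[song_id] = self._select_video_id(lyrics)

    def video_for(self, song_id, lyrics):
        """Return the stored video ID, selecting on demand if none is stored."""
        if song_id not in self._video_id_by_song:
            self._video_id_by_song[song_id] = self._select_video_id(lyrics)
        return self._video_id_by_song[song_id]
```

Storing the IDs in advance keeps video selection off the request path, at the cost of having to refresh the stored IDs if the set of videos changes.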

コンテンツ配信サーバが、ナレーション、朗読、又は、アナウンスなど音声コンテンツの要求に応じて、音声コンテンツと映像とを配信するコンテンツ配信サーバであってもよい。その場合、コンテンツ配信サーバは、例えば、配信可能な音声コンテンツの内容を表すテキストデータを、対象テキストとして、映像選択装置１１０に送信すればよい。コンテンツ配信サーバは、映像選択装置１１０から映像ＩＤを受信し、対象テキストによって内容が表される音声コンテンツと、受信した映像ＩＤによって表される映像とを、音声コンテンツの配信を要求した端末に送信すればよい。コンテンツ配信サーバは、コンテンツの配信の要求を受信するのに応じて、対象テキストの送信、映像ＩＤの受信、及び、音声コンテンツと映像との配信を行ってもよい。コンテンツ配信サーバは、あらかじめ、対象テキストの送信と、映像ＩＤの受信とを行い、対象テキストによって内容が表されるコンテンツのコンテンツＩＤと、その対象テキストに対して選択された映像ＩＤとを記憶していてもよい。音声コンテンツの配信の要求を受信するのに応じて、その音声コンテンツと、記憶している映像ＩＤに基づいて特定した、その音声コンテンツの内容を表す対象テキストに対して選択された映像とを、音声コンテンツの配信を要求した端末に送信してもよい。 The content distribution server may be a server that distributes audio content and video in response to a request for audio content such as narration, reading aloud, or an announcement. In that case, the content distribution server may transmit, for example, text data representing the content of distributable audio content to the video selection device 110 as the target text. The content distribution server may receive a video ID from the video selection device 110 and transmit the audio content whose content is represented by the target text, together with the video represented by the received video ID, to the terminal that requested distribution of the audio content. The content distribution server may transmit the target text, receive the video ID, and distribute the audio content and the video in response to receiving a request for distribution of content. Alternatively, the content distribution server may transmit the target text and receive the video ID in advance, and store the content ID of the content whose content is represented by the target text together with the video ID selected for that target text. In response to receiving a request for distribution of audio content, the content distribution server may then transmit, to the terminal that requested distribution, the audio content and the video that was selected for the target text representing the content of that audio content, the video being identified on the basis of the stored video ID.

以上で説明した本実施形態には、テキストと映像とのマッチングを行う負荷を軽減することができるという第１の効果がある。 The present embodiment described above has a first effect that the load for matching text and video can be reduced.

その理由は、映像の特徴を表す映像特徴量とテキストの特徴を表す対象特徴量との、類似性の高さの程度に基づいて、映像選択部１１４がテキストと映像とのマッチングを行うからである。映像特徴量は、映像にあらかじめ関連付けられているテキストとその映像の付随情報とを使用して、映像特徴生成部１０８によって生成される。対象特徴量は、映像特徴量と同じ種類の特徴量である。そして、対象特徴量は、マッチングの対象であるテキスト（上述の対処テキスト）を使用して、対象特徴生成部１１２によって生成される。 The reason is that the video selection unit 114 matches text to video on the basis of the degree of similarity between a video feature amount representing the features of a video and a target feature amount representing the features of a text. The video feature amount is generated by the video feature generation unit 108 using the text associated in advance with the video and the accompanying information of the video. The target feature amount is the same kind of feature amount as the video feature amount, and is generated by the target feature generation unit 112 using the text to be matched (the target text described above).

本実施形態には、テキストの内容と映像の内容とが精度よく一致するように、テキストに対して映像を選択することができるという効果がある。 The present embodiment also has the effect that a video can be selected for a text so that the content of the text and the content of the video match accurately.

その理由は、映像特徴生成部１０８と対象特徴生成部１１２とが、素性抽出部１０６が抽出する素性を使用して、特徴量を生成するからである。前述のように、映像の内容を端的に表す単語（上述の属性）を修飾する語句である素性は、視覚的な、体感的な、又は、視覚的で体感的な語句であることが、経験的に判明している。映像にあらかじめ関連付けられているテキストは、その映像に、視覚的に、体感的に、または、視覚的で体感的にマッチすると、例えばユーザによって判定されたテキストである。従って、映像に関連付けられているテキストにおいて出現する上述の素性と、対象テキストにおいて出現する素性とが類似している場合、その映像と対象テキストとは、視覚的に、体感的に、または、視覚的で体感的にマッチする可能性が高い。映像特徴生成部１０８は、映像に関連付けられているテキストにおいて出現する素性を表す映像特徴量を生成する。対象特徴生成部１１２は、対象テキストにおいて出現する素性を表す対象特徴量を生成する。類似度導出部１１３は、そのような、映像特徴量と対象特徴量とが類似する程度を表す類似度を導出する。映像選択部１１４は、そのような類似度を使用して、映像特徴量と対象特徴量とが類似するように、対象テキストに対する映像を選択する。従って、映像選択部１１４は、対象テキストに対して、視覚的に、体感的に、または、視覚的で体感的にマッチする映像を、精度よく選択することができる。 The reason is that the video feature generation unit 108 and the target feature generation unit 112 generate feature amounts using the features extracted by the feature extraction unit 106. As described above, it has been found empirically that a feature, that is, a phrase modifying a word that succinctly expresses the content of a video (the attribute described above), is a visual phrase, an experiential phrase, or a phrase that is both visual and experiential. The text associated in advance with a video is text that has been judged, for example by a user, to match the video visually, experientially, or both. Therefore, when the features appearing in the text associated with a video are similar to the features appearing in the target text, the video and the target text are likely to match visually, experientially, or both. The video feature generation unit 108 generates a video feature amount representing the features that appear in the text associated with a video. The target feature generation unit 112 generates a target feature amount representing the features that appear in the target text. The similarity deriving unit 113 derives a similarity representing the degree to which such a video feature amount and target feature amount are similar. Using that similarity, the video selection unit 114 selects a video for the target text such that the video feature amount and the target feature amount are similar. Accordingly, the video selection unit 114 can accurately select a video that matches the target text visually, experientially, or both.

以上で説明した第２の効果について、さらに具体的に詳しく説明する。 The second effect described above will be described in more detail.

素性抽出部１０６は、上述のように、付随情報から抽出された単語である属性を修飾する語句を抽出することによって、視覚的、体感的に表現されている語句を、素性として抽出する。対象特徴生成部１１２は、例えば歌詞などの対象テキストにおける、素性の出現頻度をもとに、対象特徴量（例えば対象特徴ベクトル）を生成する。映像特徴生成部１０８は、映像に関連付けられている、歌詞などのテキストや、その映像の付随情報（特に企画意図等）における、素性の出現頻度をもとに、映像特徴ベクトルを生成する。類似度導出部１１３は、対象特徴ベクトルと映像特徴ベクトルとが類似する程度である類似性の高さを表す、例えばコサイン類似度などの類似度を算出する。映像選択部１１４は、コサイン類似度などの類似度を使用して、対象特徴ベクトルと映像特徴ベクトルとを比較することによって、対象テキストと映像との関連の深さを表す関連性を判定する。上述の対象特徴ベクトルと映像特徴ベクトルとの間の類似度による判定は、視覚的、体感的な特性を利用した、対象テキストと映像との間の関連の判定である。従って、そのような類似度を使用して、対象テキストに対して、対象特徴ベクトルと映像特徴ベクトルとが類似する映像を選択することによって、視覚的、体感的に、対象テキストに類似した映像が選ばれることが期待できる。 As described above, the feature extraction unit 106 extracts, as features, phrases that express something visually or experientially by extracting phrases that modify attributes, which are words extracted from the accompanying information. The target feature generation unit 112 generates a target feature amount (for example, a target feature vector) based on the appearance frequencies of the features in a target text such as lyrics. The video feature generation unit 108 generates a video feature vector based on the appearance frequencies of the features in text such as lyrics associated with the video and in the accompanying information of the video (in particular, the planning intent and the like). The similarity deriving unit 113 calculates a similarity, such as the cosine similarity, that represents how similar the target feature vector and the video feature vector are. The video selection unit 114 compares the target feature vector with the video feature vector using a similarity such as the cosine similarity, and thereby determines the relevance, that is, how closely the target text and the video are related. This determination based on the similarity between the target feature vector and the video feature vector is a determination of the relation between the target text and the video that exploits visual and experiential characteristics. Therefore, by using such a similarity to select, for the target text, a video whose video feature vector is similar to the target feature vector, a video that is visually and experientially similar to the target text can be expected to be selected.

例えば、素性の抽出において、属性である「空」に係る語句を抽出することによって、「青い」や、「晴れた」などの、視覚的、体感的な語句が素性として抽出されることが期待できる。映像に関連付けられたテキストにおける、そのような素性の出現頻度を導出した場合、例えば「青い空」の特徴を持つ映像に関連付けられたテキストの中に、「青い」や「晴れた」などの、視覚的に同じ特性を備える語句の出現数が高いことが期待できる。さらに、対象テキストの中に、「青い空」または「晴れた空」という表現がある場合、「青い」という語句及び「晴れた」という語句の少なくともいずれかの出現頻度が高い映像と、対象テキストとの類似度が高くなる。 For example, in feature extraction, extracting phrases that modify the attribute "sky" can be expected to yield visual and experiential phrases such as "blue" and "sunny" as features. When the appearance frequencies of such features in the text associated with videos are derived, phrases with the same visual characteristics, such as "blue" and "sunny", can be expected to appear frequently in the text associated with, for example, a video characterized by a "blue sky". Furthermore, when the target text contains the expression "blue sky" or "sunny sky", the similarity between the target text and a video in which at least one of the phrases "blue" and "sunny" appears frequently becomes high.
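The idea of extracting phrases that modify an attribute can be illustrated with a deliberately simplified sketch. It assumes English text and treats the single word immediately before an attribute word as its modifier; the description instead relies on modification relations in Japanese text, which would normally require a dependency parser. The function name and sample text are hypothetical.

```python
import re


def extract_features(text, attributes):
    """Collect the word immediately preceding each attribute word.

    Adjacency stands in for a real modifier relation (a simplifying
    assumption); results are lower-cased, de-duplicated, and sorted.
    """
    features = set()
    for attribute in attributes:
        pattern = r"\b(\w+)\s+" + re.escape(attribute) + r"\b"
        for match in re.finditer(pattern, text.lower()):
            features.add(match.group(1))
    return sorted(features)


text = "A blue sky in spring. The sunny sky over the sea."
print(extract_features(text, ["sky"]))  # ['blue', 'sunny']
```

A production extractor would also need to filter out non-descriptive neighbors such as articles, which this sketch does not attempt.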

以上のように、本実施形態の映像選択システム１は、例えば対象テキストに、「青い空」や「厳しい冬」のような、視覚的な、体感的な表現がある場合、その対象テキストを、「晴れた空」や、「寒い季節」などの特徴を備える映像にマッチさせることができる。このように、本実施形態の映像選択システム１は、同義語を使ったマッチングによって実現することができないマッチングを行うことができる。 As described above, when a target text contains a visual or experiential expression such as "blue sky" or "harsh winter", the video selection system 1 of the present embodiment can match the target text to a video having characteristics such as "sunny sky" or "cold season". In this way, the video selection system 1 of the present embodiment can perform matching that cannot be achieved by synonym-based matching.

すなわち、本実施形態の映像選択システム１は、テキストに対して、視覚的に違和感のない適切な映像を選択することができる。 That is, the video selection system 1 of the present embodiment can select, for a text, an appropriate video that causes no visual sense of incongruity.

＜第１の実施形態の第１の変形例＞ <First Modification of First Embodiment>
次に、本発明の第１の実施形態の第１の変形例について、図面を参照して詳細に説明する。 Next, a first modification of the first embodiment of the present invention will be described in detail with reference to the drawings.

図１５は、本変形例の映像選択システム１Ａの構成の例を表すブロック図である。図１５と図１を比較すると、本変形例の映像選択システム１Ａは、映像選択装置１１０の代わりに、映像選択装置１１０Ａを含む。映像選択装置１１０Ａは、第１の実施形態の映像選択装置１１０の各構成要素に加えて、教師データ受信部１２１と、教師データ記憶部１２２と、教師データ送信部１２３と、付随情報受信部１２４と、付随情報記憶部１２５と、付随情報送信部１２６とを含む。映像選択システム１Ａは、映像選択システム１Ａのユーザが指示などを入力するのに使用する、ユーザ端末（図示されない）を含んでいてもよい。 FIG. 15 is a block diagram illustrating an example of the configuration of the video selection system 1A according to the present modification. Comparing FIG. 15 with FIG. 1, the video selection system 1A of the present modification includes a video selection device 110A instead of the video selection device 110. The video selection device 110A includes a teacher data reception unit 121, a teacher data storage unit 122, a teacher data transmission unit 123, an accompanying information reception unit 124, an accompanying information storage unit 125, and an accompanying information transmission unit 126, in addition to the components of the video selection device 110 of the first embodiment. The video selection system 1A may include a user terminal (not shown) that is used by the user of the video selection system 1A to input instructions and the like.

教師データ受信部１２１は、例えばコンテンツ配信サーバなどの他の装置から、上述の教師データ（それぞれ映像に関連付けられている、複数のテキスト）を受信する。教師データ受信部１２１は、受信した教師データを、教師データ記憶部１２２に格納する。教師データ記憶部１２２は、教師データを記憶する。教師データ送信部１２３は、教師データ記憶部１２２に格納されている教師データを、教師データ受信部１０３に送信する。教師データ受信部１０３は、教師データ送信部１２３から、教師データを受信する。 The teacher data receiving unit 121 receives the above-described teacher data (a plurality of texts associated with each video) from another device such as a content distribution server. The teacher data receiving unit 121 stores the received teacher data in the teacher data storage unit 122. The teacher data storage unit 122 stores teacher data. The teacher data transmission unit 123 transmits the teacher data stored in the teacher data storage unit 122 to the teacher data reception unit 103. The teacher data receiving unit 103 receives teacher data from the teacher data transmitting unit 123.

付随情報受信部１２４は、例えばコンテンツ配信サーバなどの他の装置から、上述の付随情報を受信する。付随情報受信部１２４は、受信した付随情報を、付随情報記憶部１２５に格納する。付随情報記憶部１２５は、付随情報を記憶する。付随情報送信部１２６は、付随情報記憶部１２５に格納されている付随情報を、付随情報受信部１０１に送信する。付随情報受信部１０１は、付随情報送信部１２６から、付随情報を受信する。 The accompanying information receiving unit 124 receives the accompanying information described above from another device such as a content distribution server. The accompanying information receiving unit 124 stores the received accompanying information in the accompanying information storage unit 125. The accompanying information storage unit 125 stores accompanying information. The accompanying information transmitting unit 126 transmits the accompanying information stored in the accompanying information storage unit 125 to the accompanying information receiving unit 101. The accompanying information receiving unit 101 receives accompanying information from the accompanying information transmitting unit 126.

以上の相違を除き、本変形例の映像選択システム１Ａは、第１の実施形態の映像選択システム１と同じである。 Except for the above differences, the video selection system 1A of the present modification is the same as the video selection system 1 of the first embodiment.

＜第１の実施形態の第２の変形例＞ <Second Modification of First Embodiment>
次に、本発明の第１の実施形態の第２の変形例について、図面を参照して詳細に説明する。 Next, a second modification of the first embodiment of the present invention will be described in detail with reference to the drawings.

図１６は、本変形例の映像選択システム１Ｂの構成の例を表すブロック図である。図１６を参照すると、映像選択システム１Ｂは、映像選択装置１１０Ｂを含む。映像選択システム１Ｂは、特徴生成装置１００を含む。映像選択システム１Ｂは、映像選択システム１Ｂのユーザが指示などを入力するのに使用する、ユーザ端末（図示されない）を含んでいてもよい。 FIG. 16 is a block diagram illustrating an example of a configuration of a video selection system 1B according to the present modification. Referring to FIG. 16, the video selection system 1B includes a video selection device 110B. The video selection system 1B includes a feature generation device 100. The video selection system 1B may include a user terminal (not shown) that is used by the user of the video selection system 1B to input instructions and the like.

以上の相違を除き、本変形例の映像選択システム１Ｂは、第１の実施形態の映像選択システム１と同じである。 Except for the above differences, the video selection system 1B of the present modification is the same as the video selection system 1 of the first embodiment.

＜第１の実施形態の第３の変形例＞ <Third Modification of First Embodiment>
次に、本発明の第１の実施形態の第３の変形例について、図面を参照して詳細に説明する。 Next, a third modification of the first embodiment of the present invention will be described in detail with reference to the drawings.

図１７は、本変形例の映像選択システム１Ｃの構成の例を表すブロック図である。本変形例の映像選択システム１Ｃは、映像選択装置１１０Ｃを含む。映像選択装置１１０Ｃは、第１の実施形態の特徴生成装置１００の各構成要素を含む。映像選択装置１１０Ｃは、映像特徴受信部１１６を含んでいなくてよい。そして、映像選択装置１１０Ｃは、第１の実施形態の特徴生成装置１００として動作する。映像選択システム１Ｃは、映像選択システム１Ｃのユーザが指示などを入力するのに使用する、ユーザ端末（図示されない）を含んでいてもよい。 FIG. 17 is a block diagram illustrating an example of the configuration of the video selection system 1C according to the present modification. The video selection system 1C of the present modification includes a video selection device 110C. The video selection device 110C includes each component of the feature generation device 100 of the first embodiment. The video selection device 110C need not include the video feature receiving unit 116. The video selection device 110C also operates as the feature generation device 100 of the first embodiment. The video selection system 1C may include a user terminal (not shown) that is used by the user of the video selection system 1C to input instructions and the like.

本実施形態の映像特徴生成部１０８は、生成した映像特徴量を、映像特徴記憶部１１７に格納する。本実施形態の映像特徴生成部１０８は、上述の素性リストを、映像特徴記憶部１１７に格納してもよい。本実施形態の素性抽出部１０６は、素性リストを、対象特徴生成部１１２に送信してもよい。本実施形態の対象特徴生成部１１２は、素性記憶部１０７から素性リストを読み出してもよい。 The video feature generation unit 108 of the present embodiment stores the generated video feature amount in the video feature storage unit 117. The video feature generation unit 108 of the present embodiment may store the above-described feature list in the video feature storage unit 117. The feature extraction unit 106 according to the present embodiment may transmit the feature list to the target feature generation unit 112. The target feature generation unit 112 of this embodiment may read the feature list from the feature storage unit 107.

以上の相違を除き、本変形例の映像選択システム１Ｃは、第１の実施形態の映像選択システム１と同じである。 Except for the above differences, the video selection system 1C of the present modification is the same as the video selection system 1 of the first embodiment.

＜第２の実施形態＞ <Second Embodiment>
次に、本発明の第２の実施形態について、図面を参照して詳細に説明する。本実施形態は、本発明の各実施形態を概念的に表す実施形態である。 Next, a second embodiment of the present invention will be described in detail with reference to the drawings. This embodiment conceptually represents the embodiments of the present invention.

図１８は、本実施形態の映像選択システム１Ｄの構成の例を表すブロック図である。 FIG. 18 is a block diagram illustrating an example of the configuration of the video selection system 1D of the present embodiment.

図１８を参照すると、本実施形態の映像選択システム１１０Ｄは、映像に関連付けられているテキストに対してテキストマイニング処理を実行することによって、前記映像の特徴量である映像特徴量を、複数の前記映像の各々について生成する映像特徴生成部１０８と、対象テキストに対して前記テキストマイニング処理を実行することによって、前記対象テキストの特徴量である対象特徴量を生成する対象特徴生成部１１２と、前記映像特徴量の各々について、当該映像特徴量の、前記対象特徴量に対する類似の程度を表す類似度を導出する類似度導出部１１３と、導出された前記類似度に基づいて、前記対象特徴量に対する類似の程度が高い前記映像特徴量を選択し、選択された前記映像特徴量が導出された前記テキストに関連付けられている前記映像を選択する映像選択部１１４と、を備える。 Referring to FIG. 18, the video selection system 110D of the present embodiment includes: a video feature generation unit 108 that generates, for each of a plurality of videos, a video feature amount, which is a feature amount of the video, by executing text mining processing on text associated with the video; a target feature generation unit 112 that generates a target feature amount, which is a feature amount of a target text, by executing the text mining processing on the target text; a similarity deriving unit 113 that derives, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and a video selection unit 114 that selects, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selects the video associated with the text from which the selected video feature amount was derived.
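The four units listed above can be strung together in a compact end-to-end sketch. It is only an illustration under stated assumptions: "text mining processing" is reduced to counting substrings from a fixed feature list, the similarity is assumed to be cosine similarity, and all names and sample texts are hypothetical.

```python
import math


def vectorize(text, feature_list):
    """Stand-in for the text mining step: count each feature in the text."""
    return [text.count(feature) for feature in feature_list]


def cosine(a, b):
    """Cosine similarity, with 0.0 for all-zero vectors (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def select_video(target_text, text_by_video, feature_list):
    """Return the best-matching video ID and all similarity scores."""
    target = vectorize(target_text, feature_list)
    scores = {
        video_id: cosine(vectorize(text, feature_list), target)
        for video_id, text in text_by_video.items()
    }
    return max(scores, key=scores.get), scores


feature_list = ["blue", "sunny", "cold"]  # hypothetical features
text_by_video = {                         # hypothetical associated texts
    "video1": "cold wind on a cold night",
    "video2": "sunny day under a blue sky",
}
best, scores = select_video("a blue sky above a blue sea", text_by_video, feature_list)
print(best)  # video2
```

Here the target text shares the feature "blue" with the text of video2 and shares nothing with video1, so video2 receives the higher score.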

以上で説明した本実施形態には、第１の実施形態の第１の効果と同じ効果がある。その理由は、第１の実施形態の第１の効果が生じる理由と同じである。 The present embodiment described above has the same effect as the first effect of the first embodiment. The reason is the same as the reason why the first effect of the first embodiment occurs.

＜他の実施形態＞ <Other Embodiments>
映像選択装置１１０、映像選択装置１１０Ａ、映像選択装置１１０Ｂ、映像選択装置１１０Ｃ、映像選択装置１１０Ｄ、特徴生成装置１００、及び、特徴生成装置１００Ｄは、それぞれ、コンピュータ及びコンピュータを制御するプログラム、専用のハードウェア、又は、コンピュータ及びコンピュータを制御するプログラムと専用のハードウェアの組合せにより実現することができる。 The video selection device 110, the video selection device 110A, the video selection device 110B, the video selection device 110C, the video selection device 110D, the feature generation device 100, and the feature generation device 100D can each be realized by a computer and a program that controls the computer, by dedicated hardware, or by a combination of a computer and a program that controls the computer with dedicated hardware.

図１９は、映像選択装置１１０、映像選択装置１１０Ａ、映像選択装置１１０Ｂ、映像選択装置１１０Ｃ、映像選択装置１１０Ｄ、特徴生成装置１００、及び、特徴生成装置１００Ｄを実現することができる、コンピュータ１０００のハードウェア構成の一例を表す図である。図１９を参照すると、コンピュータ１０００は、プロセッサ１００１と、メモリ１００２と、記憶装置１００３と、Ｉ／Ｏ（Ｉｎｐｕｔ／Ｏｕｔｐｕｔ）インタフェース１００４とを含む。また、コンピュータ１０００は、記録媒体１００５にアクセスすることができる。メモリ１００２と記憶装置１００３は、例えば、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）、ハードディスクなどの記憶装置である。記録媒体１００５は、例えば、ＲＡＭ、ハードディスクなどの記憶装置、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、可搬記録媒体である。記憶装置１００３が記録媒体１００５であってもよい。プロセッサ１００１は、メモリ１００２と、記憶装置１００３に対して、データやプログラムの読み出しと書き込みを行うことができる。プロセッサ１００１は、Ｉ／Ｏインタフェース１００４を介して、例えば、他の装置にアクセスすることができる。プロセッサ１００１は、記録媒体１００５にアクセスすることができる。記録媒体１００５には、コンピュータ１０００を、映像選択装置１１０、映像選択装置１１０Ａ、映像選択装置１１０Ｂ、映像選択装置１１０Ｃ、映像選択装置１１０Ｄ、特徴生成装置１００、又は、特徴生成装置１００Ｄとして動作させるプログラムが格納されている。 FIG. 19 is a diagram illustrating an example of the hardware configuration of a computer 1000 that can realize the video selection device 110, the video selection device 110A, the video selection device 110B, the video selection device 110C, the video selection device 110D, the feature generation device 100, and the feature generation device 100D. Referring to FIG. 19, the computer 1000 includes a processor 1001, a memory 1002, a storage device 1003, and an I/O (Input/Output) interface 1004. The computer 1000 can also access a recording medium 1005. The memory 1002 and the storage device 1003 are storage devices such as a RAM (Random Access Memory) or a hard disk. The recording medium 1005 is, for example, a storage device such as a RAM or a hard disk, a ROM (Read Only Memory), or a portable recording medium. The storage device 1003 may serve as the recording medium 1005. The processor 1001 can read data and programs from, and write them to, the memory 1002 and the storage device 1003. The processor 1001 can access, for example, other devices via the I/O interface 1004. The processor 1001 can access the recording medium 1005. The recording medium 1005 stores a program that causes the computer 1000 to operate as the video selection device 110, the video selection device 110A, the video selection device 110B, the video selection device 110C, the video selection device 110D, the feature generation device 100, or the feature generation device 100D.

プロセッサ１００１は、記録媒体１００５に格納されている、コンピュータ１０００を、映像選択装置１１０、映像選択装置１１０Ａ、映像選択装置１１０Ｂ、映像選択装置１１０Ｃ、映像選択装置１１０Ｄ、特徴生成装置１００、又は、特徴生成装置１００Ｄとして動作させるプログラムを、メモリ１００２にロードする。そして、プロセッサ１００１が、メモリ１００２にロードされたプログラムを実行することにより、コンピュータ１０００は、映像選択装置１１０、映像選択装置１１０Ａ、映像選択装置１１０Ｂ、映像選択装置１１０Ｃ、映像選択装置１１０Ｄ、特徴生成装置１００、又は、特徴生成装置１００Ｄとして動作する。 The processor 1001 loads into the memory 1002 the program, stored in the recording medium 1005, that causes the computer 1000 to operate as the video selection device 110, the video selection device 110A, the video selection device 110B, the video selection device 110C, the video selection device 110D, the feature generation device 100, or the feature generation device 100D. When the processor 1001 then executes the program loaded into the memory 1002, the computer 1000 operates as the video selection device 110, the video selection device 110A, the video selection device 110B, the video selection device 110C, the video selection device 110D, the feature generation device 100, or the feature generation device 100D.

付随情報受信部１０１、教師データ受信部１０３、属性抽出部１０５、素性抽出部１０６、映像特徴生成部１０８、対象受信部１１１、対象特徴生成部１１２、類似度導出部１１３、映像選択部１１４、出力部１１５、映像特徴受信部１１６、映像受信部１１８、教師データ受信部１２１、教師データ送信部１２３、付随情報受信部１２４、及び、付随情報送信部１２６は、例えば、プログラムを記憶する記録媒体１００５からメモリ１００２に読み込まれた、各部の機能を実現することができる専用のプログラムと、そのプログラムを実行するプロセッサ１００１により実現することができる。また、付随情報記憶部１０２、教師データ記憶部１０４、素性記憶部１０７、映像特徴記憶部１１７、映像記憶部１１９、教師データ記憶部１２２、及び、付随情報記憶部１２５は、コンピュータ１０００が含むメモリ１００２やハードディスク装置等の記憶装置１００３により実現することができる。あるいは、付随情報受信部１０１、付随情報記憶部１０２、教師データ受信部１０３、教師データ記憶部１０４、属性抽出部１０５、素性抽出部１０６、素性記憶部１０７、映像特徴生成部１０８、対象受信部１１１、対象特徴生成部１１２、類似度導出部１１３、映像選択部１１４、出力部１１５、映像特徴受信部１１６映像特徴記憶部１１７、映像受信部１１８、映像記憶部１１９、教師データ受信部１２１、教師データ記憶部１２２、教師データ送信部１２３、付随情報受信部１２４、付随情報記憶部１２５、及び、付随情報送信部１２６の一部又は全部を、各部の機能を実現する専用の回路によって実現することもできる。 The accompanying information receiving unit 101, the teacher data receiving unit 103, the attribute extraction unit 105, the feature extraction unit 106, the video feature generation unit 108, the target receiving unit 111, the target feature generation unit 112, the similarity deriving unit 113, the video selection unit 114, the output unit 115, the video feature receiving unit 116, the video receiving unit 118, the teacher data receiving unit 121, the teacher data transmission unit 123, the accompanying information receiving unit 124, and the accompanying information transmission unit 126 can be realized by, for example, a dedicated program that can realize the functions of the respective units, read into the memory 1002 from the recording medium 1005 storing the program, and the processor 1001 that executes the program. The accompanying information storage unit 102, the teacher data storage unit 104, the feature storage unit 107, the video feature storage unit 117, the video storage unit 119, the teacher data storage unit 122, and the accompanying information storage unit 125 can be realized by the memory 1002 included in the computer 1000 or by the storage device 1003 such as a hard disk device. Alternatively, some or all of the accompanying information receiving unit 101, the accompanying information storage unit 102, the teacher data receiving unit 103, the teacher data storage unit 104, the attribute extraction unit 105, the feature extraction unit 106, the feature storage unit 107, the video feature generation unit 108, the target receiving unit 111, the target feature generation unit 112, the similarity deriving unit 113, the video selection unit 114, the output unit 115, the video feature receiving unit 116, the video feature storage unit 117, the video receiving unit 118, the video storage unit 119, the teacher data receiving unit 121, the teacher data storage unit 122, the teacher data transmission unit 123, the accompanying information receiving unit 124, the accompanying information storage unit 125, and the accompanying information transmission unit 126 can be realized by dedicated circuits that realize the functions of the respective units.

Some or all of the above embodiments can also be described as in the following appendices, but are not limited to them.

(Appendix 1)
A video selection system comprising:
video feature generation means for generating, for each of a plurality of videos, a video feature amount that is a feature amount of the video by executing a text mining process on text associated with the video;
target feature generation means for generating a target feature amount that is a feature amount of a target text by executing the text mining process on the target text;
similarity derivation means for deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and
video selection means for selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.
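The flow described in Appendix 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the appendix does not say which text mining process or similarity measure is used, so the sketch substitutes word counts over a fixed vocabulary and cosine similarity, and the names `text_features`, `cosine_similarity`, and `select_videos` are hypothetical.

```python
from collections import Counter
from math import sqrt


def text_features(text, vocabulary):
    """Stand-in for the text mining process: count each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Assumed similarity measure; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_videos(video_texts, target_text, vocabulary, top_k=1):
    """Return the ids of the top_k videos whose text is most similar to the target text."""
    target = text_features(target_text, vocabulary)
    scored = [
        (cosine_similarity(text_features(text, vocabulary), target), video_id)
        for video_id, text in video_texts.items()
    ]
    # Highest similarity first; sorted() is stable, so ties keep insertion order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]
```

Called with a dictionary that maps video ids to their associated text, a target text such as a narration line, and a vocabulary, `select_videos` returns the ids of the videos whose text features are closest to those of the target text.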

(Appendix 2)
The video selection system according to Appendix 1, further comprising a feature extraction unit that extracts, from the text, a feature that is a phrase modifying an attribute of the video, based on the attribute, which is a phrase related to at least one of the plurality of videos, wherein
the video feature generation means detects, for each of the plurality of videos, each of the extracted features in the text associated with the video, and generates a feature amount representing the detected features as the video feature amount of the video, and
the target feature generation means detects each of the extracted features in the target text, and generates a feature amount representing the detected features as the target feature amount.

(Appendix 3)
The video selection system according to Appendix 2, further comprising attribute extraction means for extracting the attribute from accompanying information that is associated with each of the videos and characterizes that video.

(Appendix 4)
A feature amount generation device comprising:
feature extraction means for extracting, based on an attribute that is a phrase related to at least one of a plurality of videos each associated with text, a feature that is a phrase modifying the attribute from the text, and storing the extracted feature in feature storage means; and
video feature generation means for detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount of the video.
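The two steps of Appendix 4 can be sketched in Python as follows. This is only an illustration under loud assumptions: "a phrase that modifies the attribute" is approximated here by the single word immediately preceding an attribute word, whereas a real system would more likely rely on morphological or dependency analysis, and the function names `extract_features` and `video_feature` are hypothetical.

```python
def extract_features(texts, attributes):
    """Collect, in first-seen order, words that directly precede an attribute word.

    Assumption: "modifies the attribute" is approximated by adjacency.
    The returned list plays the role of the feature storage means.
    """
    features = []
    for text in texts:
        words = text.lower().split()
        for previous, word in zip(words, words[1:]):
            if word in attributes and previous not in features:
                features.append(previous)
    return features


def video_feature(text, features):
    """Binary feature amount: 1 where a stored feature occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if feature in words else 0 for feature in features]
```

Here `extract_features` is run once over the texts of all videos, and `video_feature` is then run per video against the stored list, so every video receives a vector of the same length.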

(Appendix 5)
The feature amount generation device according to Appendix 4, further comprising attribute extraction means for extracting the attribute from accompanying information that is associated with each of the videos and characterizes that video.

(Appendix 6)
A video selection method comprising:
generating, for each of a plurality of videos, a video feature amount that is a feature amount of the video by executing a text mining process on text associated with the video;
generating a target feature amount that is a feature amount of a target text by executing the text mining process on the target text;
deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and
selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.

(Appendix 7)
The video selection method according to Appendix 6, further comprising:
extracting, from the text, a feature that is a phrase modifying an attribute of the video, based on the attribute, which is a phrase related to at least one of the plurality of videos;
detecting, for each of the plurality of videos, each of the extracted features in the text associated with the video, and generating a feature amount representing the detected features as the video feature amount of the video; and
detecting each of the extracted features in the target text, and generating a feature amount representing the detected features as the target feature amount.

(Appendix 8)
The video selection method according to Appendix 7, wherein the attribute is extracted from accompanying information that is associated with each of the videos and characterizes that video.

(Appendix 9)
A feature amount generation method comprising:
extracting, based on an attribute that is a phrase related to at least one of a plurality of videos each associated with text, a feature that is a phrase modifying the attribute from the text, and storing the extracted feature in feature storage means; and
detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount of the video.

(Appendix 10)
The feature amount generation method according to Appendix 9, wherein the attribute is extracted from accompanying information that is associated with each of the videos and characterizes that video.

(Appendix 11)
A video selection program that causes a computer to operate as:
video feature generation means for generating, for each of a plurality of videos, a video feature amount that is a feature amount of the video by executing a text mining process on text associated with the video;
target feature generation means for generating a target feature amount that is a feature amount of a target text by executing the text mining process on the target text;
similarity derivation means for deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and
video selection means for selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.

(Appendix 12)
The video selection program according to Appendix 11, which causes the computer to operate as:
a feature extraction unit that extracts, from the text, a feature that is a phrase modifying an attribute of the video, based on the attribute, which is a phrase related to at least one of the plurality of videos;
the video feature generation means, which detects, for each of the plurality of videos, each of the extracted features in the text associated with the video, and generates a feature amount representing the detected features as the video feature amount of the video; and
the target feature generation means, which detects each of the extracted features in the target text, and generates a feature amount representing the detected features as the target feature amount.

(Appendix 13)
The video selection program according to Appendix 12, which causes the computer to operate as attribute extraction means for extracting the attribute from accompanying information that is associated with each of the videos and characterizes that video.

(Appendix 14)
A feature amount generation program that causes a computer to operate as:
feature extraction means for extracting, based on an attribute that is a phrase related to at least one of a plurality of videos each associated with text, a feature that is a phrase modifying the attribute from the text, and storing the extracted feature in feature storage means; and
video feature generation means for detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount of the video.

(Appendix 15)
The feature amount generation program according to Appendix 14, which causes the computer to operate as attribute extraction means for extracting the attribute from accompanying information that is associated with each of the videos and characterizes that video.

The present invention has been described above with reference to the embodiments, but the present invention is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

DESCRIPTION OF SYMBOLS
1 Video selection system
1A Video selection system
1B Video selection system
1C Video selection system
1D Video selection system
100 Feature generation device
100D Feature generation device
101 Accompanying information reception unit
102 Accompanying information storage unit
103 Teacher data reception unit
104 Teacher data storage unit
105 Attribute extraction unit
106 Feature extraction unit
107 Feature storage unit
108 Video feature generation unit
110 Video selection device
110A Video selection device
110B Video selection device
110C Video selection device
110D Video selection device
111 Target reception unit
112 Target feature generation unit
113 Similarity derivation unit
114 Video selection unit
115 Output unit
116 Video feature reception unit
117 Video feature storage unit
118 Video reception unit
119 Video storage unit
121 Teacher data reception unit
122 Teacher data storage unit
123 Teacher data transmission unit
124 Accompanying information reception unit
125 Accompanying information storage unit
126 Accompanying information transmission unit
1000 Computer
1001 Processor
1002 Memory
1003 Storage device
1004 I/O interface
1005 Recording medium

Claims

A video selection system comprising:
feature extraction means for extracting, based on an attribute that is a phrase related to at least one of a plurality of videos, a feature that is a phrase modifying the attribute of the video from text associated with the video having the attribute;
video feature generation means for detecting each of the extracted features in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount that is a feature amount of the video;
target feature generation means for detecting each of the extracted features in a target text, and generating a feature amount representing the detected features as a target feature amount that is a feature amount of the target text;
similarity derivation means for deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and
video selection means for selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.

The video selection system according to claim 1, wherein the video feature generation means generates the video feature amount for each of the plurality of videos.

The video selection system according to claim 2, further comprising attribute extraction means for extracting the attribute from accompanying information that is associated with each of the videos and is information that characterizes the video.

A feature amount generation device comprising:
feature extraction means for extracting, based on an attribute that is a phrase related to at least one of a plurality of videos each associated with text, a feature that is a phrase modifying the attribute from the text, and storing the extracted feature in feature storage means; and
video feature generation means for detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount of the video.

The feature quantity generation device according to claim 4, further comprising attribute extraction means that extracts the attribute from accompanying information that is associated with each of the videos and is information that characterizes the videos.

A video selection method in which a computer:
extracts, based on an attribute that is a phrase related to at least one of a plurality of videos, a feature that is a phrase modifying the attribute of the video from text associated with the video having the attribute;
detects each of the extracted features in the text associated with the video, and generates a feature amount representing the detected features as a video feature amount that is a feature amount of the video;
detects each of the extracted features in a target text, and generates a feature amount representing the detected features as a target feature amount that is a feature amount of the target text;
derives, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and
selects, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selects the video associated with the text from which the selected video feature amount was derived.

The video selection method according to claim 6, wherein the computer generates the video feature amount for each of the plurality of videos.

The video selection method according to claim 7, wherein the computer extracts the attribute from accompanying information that is associated with each of the videos and characterizes that video.

A feature amount generation method in which a computer:
extracts, based on an attribute that is a phrase related to at least one of a plurality of videos each associated with text, a feature that is a phrase modifying the attribute from the text, and stores the extracted feature in feature storage means; and
detects, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generates a feature amount representing the detected features as a video feature amount of the video.

The feature amount generation method according to claim 9, wherein the computer extracts the attribute from accompanying information that is associated with each of the videos and characterizes that video.

A video selection program that causes a computer to execute:
a feature extraction process of extracting, based on an attribute that is a phrase related to at least one of a plurality of videos, a feature that is a phrase modifying the attribute of the video from text associated with the video having the attribute;
a video feature generation process of detecting each of the extracted features in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount that is a feature amount of the video;
a target feature generation process of detecting each of the extracted features in a target text, and generating a feature amount representing the detected features as a target feature amount that is a feature amount of the target text;
a similarity derivation process of deriving, for each of the video feature amounts, a similarity representing the degree to which the video feature amount is similar to the target feature amount; and
a video selection process of selecting, based on the derived similarities, a video feature amount having a high degree of similarity to the target feature amount, and selecting the video associated with the text from which the selected video feature amount was derived.

The video selection program according to claim 11, wherein the video feature generation process generates the video feature amount for each of the plurality of videos.

The video selection program according to claim 12, which causes the computer to further execute an attribute extraction process of extracting the attribute from accompanying information that is associated with each of the videos and characterizes that video.

A feature amount generation program that causes a computer to execute:
a feature extraction process of extracting, based on an attribute that is a phrase related to at least one of a plurality of videos each associated with text, a feature that is a phrase modifying the attribute from the text, and storing the extracted feature in feature storage means; and
a video feature generation process of detecting, for each of the plurality of videos, each of the features stored in the feature storage means in the text associated with the video, and generating a feature amount representing the detected features as a video feature amount of the video.

The feature amount generation program according to claim 14, which causes the computer to further execute an attribute extraction process of extracting the attribute from accompanying information that is associated with each of the videos and characterizes that video.